04 Jul 2026
Planet Python
Bob Belderbos: One Core, Two Interfaces, No Rewrites
When building applications, I always build the core first, then the interfaces. It was no different with Ask the Canon: a uv run main.py ask "..." CLI for quick iteration and validation, then the web app for MVP. Search, ranking, citations, all using the same engine.
Ask the Canon's core is a handful of pure functions in one module. Both interfaces are thin wrappers. This is the second post in a series on how it's built. The first one was about the retrieval engine. This one is about the wider architecture.
Functional core, two interfaces
I just have one module with pure functions, clear contracts, and no hidden state:
def embed(texts: list[str]) -> np.ndarray: ...
def load_library(book_ids=None) -> tuple[list[Passage], np.ndarray]: ...
def search_passages(query, passages, vectors, ...) -> list[tuple[int, float]]: ...
def reflow(text: str) -> str: ...
load_library reads the cached .npy files off disk and hands back a list of Passage tuples plus the stacked matrix.
search_passages takes those two and a query and returns ranked (index, score) pairs.
The web layer consumes the core functions, no re-implementation:
from main import (
embed,
Passage,
humanize_author,
load_library,
reflow,
search_passages,
)
The CLI's ask() and the web app's /api/ask share the same spine: load the library, call search_passages, walk the ranked (index, score) pairs. From there each does its own thing. The CLI prints rich panels and offers an interactive deep-read; the web app serializes to Match JSON and logs a bit of analytics on the way out.
The ranking decision, what comes back and in what order, is shared. Everything downstream is presentation, which is exactly where a CLI and a web app should differ.
We do the same in our agentic AI program: one core engine, three interfaces (CLI, Telegram, API / web dashboard).
I needed caching
load_library is not cheap. It walks books/, reads a JSON file and an .npy file per book, and stacks 80k vectors into one matrix with np.vstack. You don't want to pay that overhead on every HTTP request!
In the CLI that's a non-issue: the process loads once and exits. On the web side, it's one decorator away:
from functools import cache
@cache
def library() -> tuple[list[Passage], np.ndarray]:
return load_library()
@cache turns the first call into the real load and every call after into a dictionary lookup, much faster.
@app.get("/api/ask")
def ask(q: str, k: int = 5, per_book: int = 2, floor: float = 0.6) -> list[Match]:
passages, vectors = library() # cached
...
Pre-warm on startup, not on the first visitor
There's a subtlety @cache doesn't solve on its own. If the first request is what triggers library() ("wakes up PyTorch"), then the first real visitor pays that tax. App restarts are rare, but making the first visitor wait still isn't acceptable.
FastAPI's lifespan offers a nice fix for this: do it as soon as the app starts, before the first request:
@asynccontextmanager
async def lifespan(app: FastAPI):
init_db()
logger.info("Pre-warming vector library and loading models into RAM...")
_ = library() # fills the @cache with the stacked matrix
_ = embed(["warmup"]) # forces PyTorch to wake up and allocate
logger.info("Ready for traffic.")
yield
app = FastAPI(title="classics", lifespan=lifespan)
- I left a log line to watch the startup time. I also added some comments for possible collaborators and my future self.
- I use
_as a throwaway variable to make it clear the return value is ignored. - You can put shutdown logic after
yield, similar to how pytest fixtures work. Clean.
By the time the first request lands, both are warm.
Lazy loading
I am a proponent of imports at the top, but lazy loading is a serious performance consideration. It's coming in 3.15:
Lazy imports defer the loading and execution of a module until the first time the imported name is used, in contrast to 'normal' imports, which eagerly load and execute a module at the point of the import statement. - PEP 810 - Explicit lazy imports
That's the automatic version, landing in 3.15. Here I do it by hand: defer the model import into the function that needs it:
@cache
def _model():
import sentence_transformers as st # lazy, so the offline env vars take effect first
return st.SentenceTransformer(EMBED_MODEL)
So the model loads once, and only if something actually calls _model(). @cache hands back the same instance every time after.
The "offline env vars" part refers to the second reason I need the import here. At the top of the module I have:
os.environ.setdefault("HF_HUB_OFFLINE", "1")
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
os.environ.setdefault("TQDM_DISABLE", "1")
Hugging Face reads HF_HUB_OFFLINE at import time. Import sentence-transformers before those are set and it will try to reach out to the internet, which is not what I want because I have the data and model cached locally. Set them first and the model stays fully offline, no surprise network calls.
Functions vs classes
None of this needs a class. The core is functions over plain data (Passage and Chunk are NamedTuples), the only state is a memoized function, and the two interfaces are thin adapters that share common behavior.
That's the payoff. When I want a third interface tomorrow (e.g. a scheduled job or a different API), it imports the same functions and gets the same behavior for free.
Next up in part 3: the small post-processing tricks that make the results actually good, no bigger model required.
04 Jul 2026 12:00am GMT
Armin Ronacher: Better Models: Worse Tools
A very strange Pi issue sent me down a rabbit hole over the last two days. The short version is that newer Claude models sometimes call Pi's edit tool with extra, invented fields in the nested edits[] array. And not Haiku or some small model: Opus 4.8. The edit itself is usually correct but the arguments do not match the schema as the model invents made-up keys and Pi thus rejects the tool call and asks to try again.
That alone is not too surprising as models emit malformed tool calls sometimes. Particularly small ones. What surprised me is that this is getting worse with newer Anthropic models as both Opus 4.8 and Sonnet 5 show it but none of the older models. In other words, the SOTA models of the family are worse at this specific tool schema than their older siblings.
In case you are curious about Fable: I intentionally did not test it because I was not sure if the classifiers they are running might downgrade me to Opus silently.
Tool Calls Are Text
If you have not spent too much time looking at LLM tool calling internals, the important thing to understand is that tool calls are not magic and use some rather crude in-band signalling. The model receives a transcript, a system prompt and a list of available tools. The server munches that into a large prompt with special marker tokens. Because the model was trained and reinforced on examples of that format, at some point during generation it emits something that the API or client interprets as "call this tool with these arguments".
For a file edit tool, the intended invocation payload might say something like this:
{
"path": "some/file.py",
"edits": [
{
"oldText": "text to replace",
"newText": "replacement text"
}
]
}
A harness then validates the arguments, performs the edit, and feeds the result back into the model. If validation fails, the model sees an error and usually tries again.
How exactly that formatting happens is not known for the Anthropic models, but some people have gotten out "ANTML" markers and they at times do leak also into public communications. To the best of my knowledge, the call above would come out serialized like this from the model:
<antml:function_calls>
<antml:invoke name="edit">
<antml:parameter name="path">some/file.py</antml:parameter>
<antml:parameter name="edits">
[
{
"oldText": "text to replace",
"newText": "replacement text"
}
]
</antml:parameter>
</antml:invoke>
</antml:function_calls>
An important thing to note here is that this thing, while looking like XML, is not really XML. It's just a thing they found convenient to tokenize and train on. The other thing to note is that a basic top-level string parameter appears in-line whereas an array of objects is implemented via JSON serialization. While I'm not entirely sure that this is how it works, there are some indications that this is not too far off. This will become relevant later.
There are two very different ways to make the model produce a structure like this:
- You can ask the model to produce valid JSON matching a schema and then validate it afterwards.
- You can constrain the sampler so that invalid JSON, or even invalid schema shapes, cannot be sampled in the first place.
The second approach is what people usually refer to as grammar-aware or constrained decoding. The sampler masks out tokens that would violate the grammar. If the model is currently inside a JSON object and the schema says only oldText and newText are allowed, the sampler can prevent it from emitting "in_file" or "type". Grammar-aware decoding can be used both to constrain something to be syntactically valid JSON and also to enforce specific enum values or keys.
Without any form of constraints the model is merely following a learned convention.
The Failure
Pi's edit tool supports multiple exact string replacements in one call. That is why the arguments contain an edits array. In the failing cases the model produces entries like this:
{
"oldText": "...",
"newText": "...",
"requireUnique": true
}
or this:
{
"oldText": "...",
"newText": "...",
"oldText2": "",
"newText2": ""
}
Across repeated trials I saw a whole zoo of invented trailing keys: type, id, kind, unique, requireUnique, matchCase, in_file, forceMatchCount, children, notes, cost, oldText2, newText2, oldText_2, newText_2, and even an event.0.additionalProperties key inside the edit object itself.
The most annoying part is that the actual oldText and newText payloads were byte-correct in the invalid calls I inspected. The model had in fact produced the right invocation but then added nonsense at the end of the object.
The failure is also heavily context-dependent. A fresh single-turn prompt like "edit this file" did not reproduce it at all for me. An agentic history where the model had read files, diagnosed a problem and then composed a multi-line edit could reproduce it. And more annoyingly, not all transcripts will show that behavior. In fact, I needed Petr Baudis's transcripts to reproduce this for me at all! In that user's session continuing the session caused Opus 4.8 to fail around 20% of the time. Stripping thinking blocks from history reduced the failure rate by half. Turning on strict tool invocation eliminated it in my runs.
Why It's Getting Worse
My strongest hypothesis is that this is not random deterioration but a training artifact.
When older Anthropic models were trained, they were trained on some tools (some of which were documented). But that training did not yet have a user-shipped harness like Claude Code as the obvious target. Modern Anthropic models are most likely different because their post-training includes Claude Code or a harness that looks very similar. The model learns what a successful tool call looks like in that environment. It also learns what mistakes are tolerated by that environment.
Claude Code's own tools are comparatively flat. The ordinary edit tool is not Pi's nested edits[] shape; it is closer to file_path, old_string, new_string, and an optional flag (replace_all). Looking at Claude Code's client is very instructive: it contains retry paths for malformed tool use, parameter aliases, type coercions, Unicode repairs and filtering of unknown keys. In other words, Anthropic's own client appears to expect and accept a fair amount of slop and repairs it, mostly silently.
If reinforcement learning happens in a harness like that, or a simulation of one, then slightly malformed tool calls can still complete the task and receive reward. The harness fully absorbs the error and there is little gradient against inventing an alias, adding a stray field or using a nearby parameter name.
Worse, the model may become very strongly adapted to the canonical Claude Code edit tool shape. A different harness can present a tool with the same semantic intent but a different schema. Such a tool can increasingly be off-distribution. The better-trained model might actually fight you harder because its prior is stronger.
This is not too surprising, but it is a change from how this was a few months ago. When Opus 4.5 launched, it adapted to other edit tools exceptionally well. In fact, I was pretty convinced that we're on a good path where the models are more likely to adapt to any sort of tool shape that comes around for as long as the instructions are good.
Now I'm somewhat worried about the track we're on here. Alternative tool schemas might not just be unfamiliar. They might be implicitly punished by post-training that optimizes for one particular, forgiving tool ecology. And that ecology is not documented. While there is a text editor tool that is documented, you will see that this format is in fact not followed by Claude Code. What Claude Code does internally (which is a closed-source harness) is hidden from you.
The Slop Harness
Claude Code is obviously closed-source but we can look at the minified code and get some idea of what it does. And honestly, it's very forgiving of incoming data.
For a start, Claude Code checks the model's visible text for leaked <invoke markup. It also emits some telemetry when that happens and then it has its own state machine to retry such bad calls by pushing back to the model.
It has explicit Unicode escape repair which fixes broken \uXXXX sequences and lone surrogates in string values. It also has per-tool aliases for parameters. For instance, Edit accepts old_str (presumably from the times when the models were trained on the officially documented text editor tool), the newer old_string from the schema, new_str/new_string, path as an alias for file_path, and some more.
It also silently filters out unexpected keys and it does not use strict mode either. The issue with strict mode is that Anthropic applies complexity limits to the tool definitions that cause API requests to fail, so presumably that's why Claude Code does not attempt to use it.
Strictness
Will this problem be with us in other harnesses too? One huge issue with Anthropic is that the models are completely closed, and so is the harness. Codex models are also closed, but at least the harness is not. We also have gpt-oss which is at least a bit interesting. The models are explicitly trained to use OpenAI's harmony response format and there is a lot of documentation that at least tells us how OpenAI people think about this.
Harmony makes channels and tool-call content types part of the prompt format. A function call can look like this:
<|start|>assistant<|channel|>commentary to=functions.get_weather
<|constrain|>json<|message|>{"location":"San Francisco"}<|call|>
The important bit is <|constrain|>json. The model can express in-band that this message body is JSON, and an inference stack can use that boundary to switch into JSON-constrained sampling for the body of the tool call. Presumably a bit of this also happens in Anthropic's models, at least in strict mode I would imagine.
The marker in harmony helps the sampler to detect when it needs to sample with a specific grammar, and because it is part of the transcript, it makes that rather easy to do. For hosted GPT models, there is also an option to provide a LARK grammar for custom tools that need to adhere to something like this.
Anthropic appears different from that, though maybe not entirely. If an array of objects is represented as JSON, as it appears to be, then the model has to write JSON inside the tool parameter. There is probably basic grammar-constrained sampling going on, and that may partly explain the extra keys. For a nested array parameter, that JSON includes escaped multi-line file content inside string literals, inside one tag. The unexpected, made-up keys appear exactly at the highest-entropy point of that task: after closing a several-hundred-token escaped newText string, where the model must decide } vs , "...".
Opus 4.8 and Sonnet 5 seem to have much stronger priors about what an edit tool call should look like and that prior appears to be Claude Code's edit schema: a flat old/new string pair, plus the optional replace_all flag. My guess is that Opus has learned that an edit operation may have one extra optional field, but under Pi's nested oldText/newText shape it has no trained name for that field. So it samples a plausible name fresh each time, which is why the failures produce dozens of random keys rather than one stable alias.
As strict mode in Anthropic appears to fix this, I presume that on the server side they are refusing to sample a key that is not permitted by the JSON schema structure. That would also explain why they have limits to the complexity of the tool definitions when strict mode is enabled.
So far, the Codex models I tested did not show this type of regression. I tested all available ones except 5.6, which I do not have access to yet.
What This Means For Harnesses
The uncomfortable lesson is that tool schemas are not neutral, at least not on Anthropic models. We like to pretend that a schema is an abstract contract and the model is a general reasoner that will follow it, but that might no longer be the case for some of the tools.
Tool schemas are somewhere in the distribution and some shapes are close to what the model saw during post-training and some are far away. Some are easy for the provider's hidden encoding (e.g. top-level attributes in ANTML), whereas some require the model to write large escaped JSON objects inside nested arrays after long multiline strings. The model may be smart enough to understand the schema and still be bad at sampling the exact shape under pressure.
If this type of model behavior continues, I wonder what the implications for harnesses are. Obviously one could turn on strict sampling in Anthropic and the problem should go away. On the other hand, that the model has this behavior shows the impact that reinforcement learning has on them. Fighting that prior is probably futile if you want to get the best model performance.
Right now the reality is that Claude Code is not open source and we cannot really know what they are doing in their RL environments either. We cannot assume Claude-Code-trained behavior will transfer cleanly to your tools unless they are a close match. The more post-training happens inside one dominant harness, the more every other harness will have to inherit its quirks.
I used to be more skeptical of strict grammar-constrained tool invocation because constrained decoding can have quality tradeoffs. I still think that can be true in general, but this bug moved my priors significantly. If the newest models get better at solving the task while getting worse at faithfully emitting an alternative tool schema, then the harness needs stronger guarantees somewhere.
If you want to find out more, or you want to discuss this, consider reading the issue on the Pi tracker.
04 Jul 2026 12:00am GMT
03 Jul 2026
Django community aggregator: Community blog posts
Issue 344: Happy Birthday Djangonaut Space!
03 Jul 2026 3:00pm GMT
Planet Python
Mycli: Release v2.0.0
mycli is a command line interface for MySQL which includes auto-completion and syntax highlighting.
Read the install instructions to find out how to get the latest version.
Mycli v2.0.0 has breaking changes!
Major features added in recent months include
03 Jul 2026 7:00am GMT
02 Jul 2026
Django community aggregator: Community blog posts
Python Leiden (NL) meetup summaries
Two summaries of the July 2 2026 Python meetup in Leiden. I've omitted one, "Python with Karel" by EiEi Tun, as I've made a summary of that talk in Utrecht a month ago, already :-)
Building modern internal team CLIs with incremental automation - Farid Nouri Neshat
Obligatory xkcd cartoons: https://xkcd.com/974 and https://xkcd.com/1319 and https://xkcd.com/1205
Toil: manual, repetitive, automatable, distracting you from your real work, no enduring value. Yes, he likes to automate things :-) Some examples of repetitive manual tasks:
- Creating dev containers.
- Gathering data for troubleshooting.
- Something that needs to be set manually in a database.
- Setting up a new AWS account.
- Creating a new dev environment on the new colleague's laptop.
How to automate? Do it iteratively! Your boss might not like you to spend a day automating the task. But if you do it small steps at a time...
-
Do it manually the very first time.
-
Then start with documenting the steps.
-
Then turn it into a do-nothing scaffold script:
def step1(): print("Open the AWS page manually") input("Press enter to continue") -
Everytime you do the task, automate a small bit and flesh out the script over time.
-
After many iterations, you'll have automated it fully!
"I don't have time to automate it", you might say? Well, why don't you have time? Is it perhaps because you haven't automated things?
A good motivator: if you hate the task... Hate driven development :-)
After a while, you'll have lots of random scripts. Stuff them in a repository. Slowly document them. Try to get them to use the same conventions. Perhaps you can re-use functionality in a library.
Something you need quicky is some CLI, a command line interface. He likes typer to make his CLIs: much nicer than Python's own "argparse":
import typer
app = typer.Typer()
@app.command()
def hello(name: str):
print(f"Hello {name}")
if __name__ == "__main__":
app()
AI comment: AI agents can use your CLI. Use the docstring and help functions to help orient the AI to your custom CLI. You can, for instance, use a CLI to give the agent access to your database's content without giving it direct access to the database.
AI agents can be dangerous. A solution might be to use "feature flags". You can disable production access until you enable some setting or flag that AI doesn't know about.
He also mentioned the rich library for formatting and colorizing your textual output.
What I've learned maintaining the MCP Python SDK - Marcelo Trylesinski
He's one of the three maintainers of the MCP Python SDK. SDK = software development kit. MCP: model context protocol, so a way for AI agents to connect to some other piece of software.
MCP is basically "OpenAPI for your agents". It exposes three things from the server side:
- tools
- resources
- prompts (though tools are mostly the only thing that is used)
The client provides:
- sampling
- elicitation (="producing a reaction", so mostly it means that the AI server asks you questions)
- roots
- logging
The MCP spec kept growing. But clients never caught up, so it was mostly only the "tools" part that got used.
A big problem is that servers cannot scale. The AI server might have lots of machines with a loadbalancer in front of it, but as a user you need to stay connected to the one machine that has your context.
There's a new version of the spec (final version this month) that actually removed stuff, instead of growing. The "client provides" list mentioned above? Sampling, roots and logging are gone as they were hardly used.
MCP is now a small core, with optional extensions. Examples: tasks, MCP apps, enterprise auth.
The MCP Python SDK supports the new version, too. He demonstrated a small Python script that had a function that said you could have three bananas. He connected it via MCP to Claude and could ask Claude for the number of available bananas. It got back, via the Python tool, with the correct answer.
02 Jul 2026 4:00am GMT
01 Jul 2026
Django community aggregator: Community blog posts
Weeknotes (2026 week 27)
Weeknotes (2026 week 27)
The last entry in this series was published 10 weeks ago so it really is time for another review of the releases I did during this time.
Releases
feincms3-forms
The feincms3-forms forms builder has gained a documentation page on the wonderful Read the Docs service. The 0.6.1 release doesn't contain any code changes, just pyproject.toml updates and the mentioned documentation rework.
django-imagefield
django-imagefield 0.23 is still in alpha. The handling of image fields when using libvips is optimized to use less memory hopefully. We'll see. I also added some tests to verify that .mpo files are handled properly.
feincms3
The Vimeo embed now always sets the dnt=1 parameter on the <iframe>, which asks Vimeo to not track the user.
django-mptt
I wrote about the somewhat annoying maintenance again. The library is still officially unmaintained, but I did a lot of work either just closing issues or also fixing them. The docs also contain many clarifications. I only released 0.19rc1 for now.
feincms3-sites and feincms3-language-sites
Last time I mentioned that default HTTP/S ports are now stripped so that the host matching can determine the correct site. Now a new case appeared where trailing dots weren't stripped. The normalization of hosts has been extended. I'm sure we're still missing some exotic cases where we should do more normalization, but we'll cross that bridge when we get there.
django-prose-editor and django-js-asset
Various upgrades to the editor and especially the importmaps rework in both packages - the importmap infrastructure should now be CSP-compatible! I wrote more about that in the last post The 2026 way of using importmaps in Django.
django-content-editor
Minor bugfixes and a major version bump because of the rework of the JavaScript code into multiple ES modules. The content editor now uses importmaps as well.
django-fhadmin
Small bugfix so that links aren't underlined in the app groups list when they shouldn't be, matching how the Django admin itself behaves.
django-cabinet
The cabinet / prose editor integration for the file (or image) picker is final and released as a stable version.
django-json-schema-editor
This small release only contains more correct German translations of strings.
Honorable mention: django-debug-toolbar
I didn't actually create this release, but I contributed various changes to it. The changelog for 7.0 is here.
01 Jul 2026 5:00pm GMT
23 Jun 2026
Planet Twisted
Glyph Lefkowitz: Adversarial Communication
As I have discussed in previous posts, "AIs" can make mistakes. In fact, they do make mistakes, and their mistake-making patterns are such that where and how they will make mistakes is both uncertain and constantly changing.
Thus, in any scenario where you want to attempt to make "productive" use of "AI", you must have a system in place for checking every result. Not checking some results; checking every result. If each result might have a consequence for you (and if it didn't have a consequence, why bother automating it?) and you cannot predict in advance which kinds of results will need verification, then verification is always required.
The verification often ends up being just as expensive as doing the work in the first place, which means that if you want your usage of "AI" to be personally profitable, you have to find someone else to externalize the cost of verification onto. This person becomes your adversary, and, if you are successful, your "AI's" victim.
The Ladder-Climber And Their Reverse-Centaur Rungs
One way that this constellation of facts can straightforwardly assemble themselves into a dystopian nightmare is the phenomenon, described by Cory Doctorow, of the reverse centaur. This is when your employer non-consensually turns you into the verification system. The "AI" does the fun part of initially performing the work, and then you do the boring part where you check if the robot is right and clean up its messes, even if everyone already knows that it would, in aggregate, be cheaper for you to do the work in the first place.
Reverse centaurs can be made from any automation, not only "AI" automation. I think that there is a reason that this term happens to have emerged in the "age of AI", though, and not with earlier automation technologies (even those which were considerably more viscerally horrific). That reason is: the wrongness of "AI" output is not merely a technical feature that must be compensated for, it is a generalized externality.
As I mentioned above, if you are responsible for the entirety of the work, both extruding the "AI" output and checking it, it's usually cheaper to have humans do the entirety of the work to begin with. When humans do the writing directly, we can check as we go, and thus verification doesn't need to be as comprehensive.
When "AI" coding advocates say "code review is the bottleneck", what they are observing is that the LLM is still rolling the dice for each PR, and a human is still necessary to verify that each of those rolls is a winner. But calling this process "code review" is a bit of a misnomer; it's not really "code review" in the traditional sense, it's human understanding.
Before the advent of "AI", the human understanding was implicit in the process of writing the code in the first place1, and the code review was a way of diffusing and extending that understanding. Now that the code can be authored with no initial understanding taking place, that cost has not gone away, it has moved.
Human understanding was always the bottleneck.
However, this is taking a collaborative view of a software project, where satisfying the needs and solving the problems of your customers are the goals. We can see that "AI" is a bad tool to satisfy those goals, because all it's doing is converting the first half of the work, that of understanding the code as you write it, to understanding the agent's output as you read it.
What if, instead, we were to take the view that every software company is a Hobbesian nightmare, red in tooth and claw? In this view, the only goal of a software project is for the individual developers to make their promo cycles and get their bonuses. Given that there is only a certain amount of money to go around, this is a zero-sum game where each programmer wants to look more productive than their colleagues.
Pretty much every organization finds it easy to reward "productivity" as expressed by lines of code emitted, but the benefits of doing thorough and thoughtful design, analysis, and code review very difficult to reward. In this world, an LLM is an invaluable tool for the sociopathic ladder-climber, particularly if your legacy organization is still structuring their workflows as if the person prompting the bot is "writing" the code, and then they get to foist off the act of "reviewing" the code onto someone else.
Here, the prompter effectively externalizes the cost of the LLM's failures but internalizes any benefits. The prompter will vibe-code a big feature, so large that the assigned reviewer can't possibly comprehend it all effectively. When this happens, the reviewer will, eventually, be pressured to approve it, even if they can try to spot a few problems along the way. The reviewer has their own work to get back to, after all, the obligation to review the prompter's (read: the bot's) code is a drain on their time that they are not going to get rewarded for.
If this feature is a big success, the prompter gets a promotion. If it causes a big issue, well, the reviewer must not have been careful enough.
This is why LLMs are "good for coding", and also why their biggest promoters keep having outages.
The Generative Gish Galloper
Coding is the biggest "success story" of this type of adversarial communication, but it is by far not the only instance of such a thing. LLMs create a new form of leverage that can turn Brandolini's law from a linear advantage into an exponential one. If you are engaged in a political debate where you want to overwhelm the other side in nonsense, an LLM can generate bullshit faster than it is physically possible for a human being to type, let alone respond thoughtfully. There is an asymmetry to the utility of this weapon as well: only one side of the political spectrum wants to flood the zone and destroy trust in institutions and the concept of truth. There's a good reason that the fascists love it.
Straightforward Spam and Fraud
This is kind of obvious, but LLMs can generate lightly-customized, plausible-looking text much more quickly than any human being. This facilitates their use in fraud, spam, and scams. In a spamming or fraudulent interaction, once again, the costs are externalized onto the victim: the recipient of a spam message has to do all the work of "checking" the LLM's output. Spammers already expect very low hit rates from boilerplate, and if the LLM can increase those percentages from 1% to 5% the technology will pay for itself; they don't need anything like reliable accuracy.
Customer "Support"
If you have any kind of commercial relationship with a company, I probably don't even need to mention this: customer "support" bots are a misery. Everybody knows it at this point. But customer support is usually conceptualized by businesses as an adversarial interaction, because it is a cost center. They maintain internal metrics on time-to-resolution and try to optimize them. Implicitly, this creates a dynamic where the goal of the customer service agent's job is not to solve your problem, but to emit noise that will cause you to think your problem is resolved, or to give up, as fast as possible. Unsurprisingly, LLMs can emit this noise faster than humans can, getting those customers off the phone. But those customers will remember those interactions, and the story outside the TTR metrics is horrible.
Similarly to the situation in software development, LLMs can look very good on paper for customer support, but mostly what they are doing is illuminating the problems with the industry's existing metrics, by turning "winning the metrics battle against the customer" into a more obvious and immediate defeat for the company's long term reputation.
"Education"
In 2026 it is sadly a fact of life that students cheat all the time using "AI", and that this cheating is very successful, in that the teachers find it very hard to detect.
LLMs are great for cheating on schoolwork because the student is externalizing the work of the checking onto the teachers, who are often starting at a disadvantage to begin with, at least in the US.
My view is that this is happening because of a divergence in the way that students vs. teachers (or, more accurately, "the broader educational system") view grading.
When a student is asked to write an essay, the teachers see the effort as both intrinsically worthwhile for the student, as well as useful as a pedagogical tool to evaluate and react to the student's progress. The student, by contrast, sees a stumbling block designed to knock them off the path to success and into a permanent underclass. It is no wonder that the student sees "AI" as useful to their own goals and has no compunction about deploying it.
There is a bitter irony that the ability to understand the inherent value of actually writing the essay on their own is the sort of thing that students can really only learn by writing a bunch of essays. There's no way that I can think of which makes the benefit legible as long as a shortcut is available.
The net effect here is a downward spiral, where the already-wobbling educational system is sustaining an attack that it doesn't have the resources to recover from. The individual students' attacks against their teachers and their schools' grading systems might appear to momentarily succeed, but they will win the battle and lose the war.
Spamming "For Good"?
Usually when we talk about someone unilaterally choosing to enter into an adversarial relationship, that's an "attack" and for good reasons we have a negative impression of the attacker. However, I would be remiss if I did not point out that there are some cases where the relationship was already adversarial; just because you're the attacker doesn't mean that you are evil.
For example, we might imagine use-cases like automatically filing appeals for prior authorizations against health insurance. It's relatively well-known at this point that the main way for-profit insurers maintain their margins is by denying claims right up to the line of the policies themselves being fraud, so using a spamming tool to fight them might be entirely justifiable[2] in that case.
Similarly, using an LLM could be justified in a fight against a company refusing to honor a warranty. One could imagine using an LLM to immediately generate replies and escalations.
However, even in imagined cases like these, the underlying problem is that the insurers and the vendors already have a tremendous amount of structural power, so it is more likely that they will have the advantage in deploying a communications weapon like an LLM, as well as enacting policies to simply ignore any LLM-based communication that you might submit. Worse, if these strategies were to become widespread, they might provide an excuse to reject any communications by feeding them into an unreliable "LLM detector" and issuing an automated "computer says no" even to hand-written correspondence.
It is also worth stressing that these cases are imagined, as compared to the very real coworker-abuse, spam, scam, fraud, and disinformation campaigns being waged in real life today.
Therefore, while legitimate uses might exist, it's hard to imagine that there's anywhere they would be genuinely valuable and sustainable. In the best case, "AI" will provide a temporary advantage for underdogs that will provoke an arms race which the resource-advantaged adversaries will win in the long run; in the worst case, the arms race itself will cement permanent structural change that will make things worse.
"Search" By Stealing
Most of the adversarial utility of "AI" is on the "write" side, since write-amplification is more obviously aggressive than reading. But the "read" side of LLMs - summarization and question-answering - can be a form of attack as well.
To begin with, the act of reading itself is currently enormously destructive, but that's arguably not a fundamental aspect of this technology. They could set reasonable rate-limits and respect things like robots.txt, as search engines have for decades now. They could also refrain from committing criminal levels of copyright infringement. But, today, using "AI" tools does suborn this sort of out-of-control crawling.
More insidiously, consider the scenario described in this YouTube video. The LTT Bros decided to try Linux again, and in the course of so doing, they had problems. When trying to solve these problems, they were faced with a choice: they could consult Reddit, or they could ask an LLM. Asking an LLM would "gaslight the heck out of" them, but they still found it preferable, because they would at least get an answer without getting yelled at.
Initially this sounds great. But it also means that you want to extract knowledge from a community, while mechanically eliding any values or norms that the community may want to impart as part of offering that knowledge. As someone who spent many years in a community tech support role, this is worrying. Many requests for support are people asking how to do things that will momentarily solve a superficial problem but create a long-term reliability problem or even an immediate security risk, that the question-asker doesn't want to hear about. Consider the question "I'm tired of entering my password so much, how do I make it so my laptop unlocks automatically". An obsequious chatbot will helpfully tell you how to do this without pushback.
But, this is also a sort of ethically murky area. The Linux community is somewhat famously, for many years now, a toxic cesspool of general hostility, misogyny, etc. It is certainly a good thing that people can get access to this knowledge without subjecting themselves to abuse. But it also means that the people with the power and the privilege to change the community for the better can just quietly withdraw, rather than fixing the problems. It also means that the positive elements of culture cannot be transmitted, and people will have no opportunity to learn about unknown unknowns.
In this case, the "adversarial" communication is with society. The thing that using an LLM for search lets you do is withdraw from society and avoid forming any personal connections. There are some personal connections which are painful and annoying, and so that can feel like a momentary balm. But the need to make connections in general is, like, the concept of society itself.
Who Am I Hurting?
LLMs are good at adversarial communication. They are so good at it, relative to their other benefits, that they will tend to make communications adversarial if you do not remain vigilant about the possibility that they might do so. My request to you, dear reader, if you are going to use such tools, is to always ask yourself, "who might I be hurting, if I use an LLM for this?"
If you're using an "AI", who is its adversary? If you haven't given it one yet, who might the "AI" turn into an adversary? Who might you overwhelm with an asymmetric amount of output, or, if you're receiving information and not sending it, who are you taking that information from without consulting?
Figure out the answers to these questions and conduct yourself accordingly; the answer might be "yourself".
Acknowledgments
Thank you to my patrons who are supporting my writing on this blog. If you like what you've read here and you'd like to read more of it, or you'd like to support my various open-source endeavors, you can support my work as a sponsor!
-
One of the reasons that software developers tend to prefer greenfield development is that when you are given a blank page, you can project your own specific understanding onto it. You can structure the codebase in a way that works for your brain, down to the variable naming conventions and the module layouts. LLM-assisted development makes everything into instant brownfield work, which makes developers instantly miserable; even those who are excited about the technology will frequently complain about how it feels like their agency has been stolen and their joy in the work has been diminished. But I digress. ↩
-
Modulo the massive amount of other externalities involved in using LLMs, of course, but I don't have the time or energy to get into those here. ↩
23 Jun 2026 8:06pm GMT
09 Jun 2026
Planet Twisted
Hynek Schlawack: How to Ditch Codecov for Python Projects
Codecov's unreliability breaking CI on my open source projects has been a constant source of frustration for me for years. I have found a way to enforce coverage over a whole GitHub Actions build matrix that doesn't rely on third-party services.
09 Jun 2026 12:00am GMT
22 May 2026
Planet Twisted
Glyph Lefkowitz: Opaque Types in Python
Let's say you're writing a Python library.
In this library, you have some collection of state that represents "options" or "configuration" for a bunch of operations. Such a set of options is a bundle of potentially ever-increasing complexity. Thus, you will want it to have an extremely minimal compatibility surface, with a very carefully chosen public interface, that is either small, or perhaps nothing at all. Such an object conveys state and might have some private behavior, but all you want consumers to be able to do is build it in very constrained, specific ways, and then pass it along as a parameter to your own APIs.
By way of example, imagine that you're wrapping a library that handles shipping physical packages.
There are a zillion ways to ship a package. There are different carriers who can ship it for you. There's air freight, and ground freight, and sea freight. There's overnight shipping. There's the option to require a signature. There's package tracking and certified mail. Suffice it to say, lots of stuff.
If you are starting out to implement such a library, you might need an object called something like ShippingOptions that encapsulates some of this. At the core of your library you might have a function like this:
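A minimal sketch of such a function. Only the names shipPackage and ShippingOptions come from the text; Address, ShippingStatus, the exact signature, and the stub body are hypothetical stand-ins chosen for illustration:

```python
from dataclasses import dataclass


@dataclass
class ShippingOptions:
    """Stand-in only; how to define this type well is the subject of the post."""


@dataclass
class Address:  # hypothetical type, not from the post
    street: str
    city: str


@dataclass
class ShippingStatus:  # hypothetical type, not from the post
    accepted: bool


def shipPackage(how: ShippingOptions, where: Address) -> ShippingStatus:
    """Ship one package to `where`, configured by `how`."""
    # A real implementation would talk to a carrier; this stub always accepts.
    return ShippingStatus(accepted=True)
```

The point of the sketch is only that ShippingOptions appears in a public signature, so whatever shape it takes becomes part of the library's compatibility surface.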
If you are starting out implementing such a library, you know that you're going to get the initial implementation of ShippingOptions wrong; or, at the very least, if not "wrong", then "incomplete". You should not want to commit to an expansive public API with a ton of different attributes until you really understand the problem domain pretty well.
Yet, ShippingOptions is absolutely vital to the rest of your library. You'll need to construct it and pass it to various methods like estimateShippingCost and shipPackage. So you're not going to want a ton of complexity and churn as you evolve it to be more complex.
Worse yet, this object has to hold a ton of state. It's got attributes, maybe even quite complex internal attributes that relate to different shipping services.
Right now, today, you need to add something so you can have "no rush", "standard" and "expedited" options. You can't just put off implementing that indefinitely until you can come up with the perfect shape. What to do?
The tool you want here is the opaque data type design pattern. C is lousy with such things (FILE, pthread_*_t, fd_set, etc). A typedef in a header file can easily achieve this.
But in Python, if you expose a dataclass - or any class, really - even if you keep all your fields private, the constructor is still, inherently, public. You can make it raise an exception or something, but your type checker still won't help your users; it'll still look like it's a normal class.
Luckily, Python typing provides a tool for this: typing.NewType.
Let's review our requirements:
- We need a type that our client code can use in its type annotations; it needs to be public.
- They need to be able to construct it somehow, even if they shouldn't be able to see its attributes or its internal constructor arguments.
- They need to be able to express high-level things (like "ship fast") that should stay supported as we add more nuanced and complex configurations in the future (like "ship with the fastest possible option provided by the lowest-cost carrier that supports signature verification").
In order to solve these problems respectively, we will use:
- a public NewType, which gives us our public name...
- which wraps a private class with entirely private attributes, to give us an actual data structure, while not exposing the constructor,
- a set of public constructor functions, which returns our NewType.
When we put that all together, it looks like this:
As a snapshot in time, this is not all that interesting; we could have just exposed _RealShipOpts as a public class and saved ourselves some time. The fact that this exposes a constructor that takes a string is not a big deal for the present moment. For an initial quick and dirty implementation, we can just do checks like if options._speed == "fast" in our shipping and estimation code.
However, the main thing we are doing here is preserving our flexibility to evolve the related APIs into the future, so let's see how we might do that. For example, let's allow the shipping options to contain a concrete and specific carrier and freight method:
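One way that evolution might look. The Carrier and Freight enums, their members, and the shipWith constructor are hypothetical names invented for this sketch; only _RealShipOpts, ShippingOptions, and the three existing constructors come from the text:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import NewType


class Carrier(Enum):  # hypothetical carriers, for illustration only
    FAST_CO = auto()
    CHEAP_CO = auto()


class Freight(Enum):  # hypothetical freight methods
    AIR = auto()
    GROUND = auto()
    SEA = auto()


@dataclass
class _RealShipOpts:
    # The old private _speed string is gone, replaced by richer private state.
    _carrier: Carrier
    _freight: Freight


ShippingOptions = NewType("ShippingOptions", _RealShipOpts)


# The existing public constructors keep their names and signatures; only
# their private implementation changes.
def shipFast() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts(Carrier.FAST_CO, Freight.AIR))


def shipNormal() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts(Carrier.CHEAP_CO, Freight.GROUND))


def shipSlow() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts(Carrier.CHEAP_CO, Freight.SEA))


# A new, more specific public constructor added alongside the old ones.
def shipWith(carrier: Carrier, freight: Freight) -> ShippingOptions:
    return ShippingOptions(_RealShipOpts(carrier, freight))
```

Nothing a caller could legitimately have written against the earlier version needs to change, since callers never saw the private fields.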
As a NewType, our public ShippingOptions type doesn't have a constructor. Since _RealShipOpts is private, and all its attributes are private, we can completely remove the old versions.
Anything within our shipping library can still access the private variables on ShippingOptions; as a NewType, it's the same type as its base at runtime, so it presents minimal[1] overhead.
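The runtime identity can be checked directly. This is a small sketch that reuses a simplified _RealShipOpts with a single string field; shipFastNoCall is a hypothetical name illustrating the type: ignore variant mentioned in the footnote:

```python
from dataclasses import dataclass
from typing import NewType


@dataclass
class _RealShipOpts:
    _speed: str


ShippingOptions = NewType("ShippingOptions", _RealShipOpts)

raw = _RealShipOpts("fast")
wrapped = ShippingOptions(raw)

# Calling the NewType hands back its argument unchanged; no wrapper exists.
same_object = wrapped is raw
runtime_type = type(wrapped).__name__


def shipFastNoCall() -> ShippingOptions:
    # Skips the small function-call cost by returning the private object
    # directly and silencing the type checker's complaint.
    return _RealShipOpts("fast")  # type: ignore[return-value]
```

Here `same_object` is True and `runtime_type` is "_RealShipOpts", which is why library-internal code can read `_speed` off a ShippingOptions value without any unwrapping step.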
Clients outside our shipping library can still call all of our public constructors: shipFast, shipNormal, and shipSlow all still work with the same (as far as calling code knows) signature and behavior.
If you need to build and convey some state within your public API, while avoiding breakages associated with compatibility churn, hopefully this technique can help you do that!
Acknowledgments
Thanks for reading, and thank you to my patrons who are supporting my writing on this blog. If you like what you've read here and you'd like to read more of it, or you'd like to support my various open-source endeavors, you can support my work as a sponsor.
-
The overhead is minimal, but it is not completely zero. The suggested idiom for converting to a NewType is to call it like a function, as I've done in these examples, but if you want to use this pattern inside of a hot loop, you can use # type: ignore[return-value] comments to avoid that small cost. ↩
22 May 2026 12:33am GMT