23 Aug 2025
Django community aggregator: Community blog posts
Menu improvements in django-prose-editor
I have repeatedly mentioned the django-prose-editor project in my weeknotes but I haven't written a proper post about it since rebuilding it on top of Tiptap at the end of 2024.
Much has happened in the meantime. A lot of work went into the menu system (as alluded to in the title of this post), but by no means does that cover all the work. As always, the CHANGELOG is the authoritative source.
0.11 introduced HTML sanitization which only allows HTML tags and attributes that can be added through the editor interface. Previously, we used nh3 to clean up HTML and protect against XSS, but now we can be much more strict and use a restrictive allowlist.
We also switched to using ES modules and importmaps in the browser.
Last but not least 0.11 also introduced end-to-end testing using Playwright.
The main feature in 0.12 was the switch to Tiptap 3.0 which fixed problems with shared extension storage when using several prose editors on the same page.
In 0.13 we switched from esbuild to rslib. Esbuild's configuration is nicer to look at, but rslib is built on the very powerful rspack which I'm using everywhere.
In 0.14, 0.15 and 0.16 the Menu extension was made more reusable and the way extensions can register their own menu items was reworked.
The upcoming 0.17 release (alpha releases are available and I'm using them in production right now!) is a larger release again and introduces a completely reworked menu system. The menu now not only supports button groups and dialogs but also dropdowns directly in the toolbar. This allows for example showing a dropdown for block types:
The styles are the same as those used in the editor interface.
The same interface can not only be used for HTML elements, but also for HTML classes. Tiptap has a TextStyle extension which allows using inline styles; I'd rather have a more restricted way of styling spans, and the prose editor TextClass extension does just that: it allows applying a list of predefined CSS classes to <span> elements. Of course the dropdown also shows the resulting presentation if you provide the necessary CSS to the admin interface.
23 Aug 2025 5:00pm GMT
22 Aug 2025
Planet Python
Sebastian Pölsterl: scikit-survival 0.25.0 with improved documentation released
I am pleased to announce that scikit-survival 0.25.0 has been released.
This release adds support for scikit-learn 1.7, in addition to version 1.6. However, the most significant changes in this release affect the documentation. The API documentation has been completely overhauled to improve clarity and consistency. I hope this marks a significant improvement for users new to scikit-survival.
One of the biggest pain points for users seems to be understanding which metric can be used to evaluate the performance of a given estimator. The user guide now summarizes the different options.
Which Performance Metrics Exist?
The performance metrics for evaluating survival models can be broadly divided into three groups:
- Concordance Index (C-index): Measures the rank correlation between predicted risk scores and observed event times. Two implementations are available in scikit-survival:
  - concordance_index_censored(): This implements Harrell's estimator, which can be optimistic with high censoring.
  - concordance_index_ipcw(): An inverse probability of censoring weighted (IPCW) alternative that provides a less biased estimate, especially with high censoring. It is the preferred estimator of the C-index.
- Cumulative/Dynamic Area Under the ROC Curve (AUC): Extends the AUC to survival data, quantifying how well a model distinguishes subjects who experience an event by a given time from those who do not. It can handle time-dependent risk scores and is implemented in cumulative_dynamic_auc().
- Brier Score: An extension of the mean squared error to right-censored data. The Brier score assesses both discrimination and calibration based on a model's estimated survival functions. You can either compute the Brier score at specific time point(s) using brier_score() or compute an overall measure by integrating the Brier score over a range of time points via integrated_brier_score().
What Do Survival Models Predict?
Survival models can predict several quantities, depending on the model being used. First of all, every estimator has a predict() method, which either returns a unit-less risk score or the predicted time of an event.
- If predictions are risk scores, higher values indicate an increased risk of experiencing an event. The scores have no unit and are only meaningful for ranking samples by their risk of experiencing an event. This is for example the case for CoxPHSurvivalAnalysis.

from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored
from sksurv.preprocessing import OneHotEncoder

# Load data
X, y = load_veterans_lung_cancer()
Xt = OneHotEncoder().fit_transform(X)

# Fit model
estimator = CoxPHSurvivalAnalysis().fit(Xt, y)

# Predict risk score
predicted_risk = estimator.predict(Xt)

# Evaluate risk scores
cindex = concordance_index_censored(
    y["Status"], y["Survival_in_days"], predicted_risk
)
- If predictions directly relate to the time point of an event, lower scores indicate shorter survival, while higher scores indicate longer survival. See for example IPCRidge.

from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import IPCRidge
from sksurv.metrics import concordance_index_censored
from sksurv.preprocessing import OneHotEncoder

# Load the data
X, y = load_veterans_lung_cancer()
Xt = OneHotEncoder().fit_transform(X)

# Fit the model
estimator = IPCRidge().fit(Xt, y)

# Predict time of an event
predicted_time = estimator.predict(Xt)

# Flip sign of predictions to obtain a risk score
cindex = concordance_index_censored(
    y["Status"], y["Survival_in_days"], -1 * predicted_time
)
Both types of predictions can also be evaluated with cumulative_dynamic_auc(), but not with the Brier score.
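As a rough sketch, evaluating the Cox model's risk scores with cumulative_dynamic_auc() could look like this (the grid of evaluation times is an arbitrary example):

import numpy as np
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import cumulative_dynamic_auc
from sksurv.preprocessing import OneHotEncoder

# Load data and fit a Cox model as above
X, y = load_veterans_lung_cancer()
Xt = OneHotEncoder().fit_transform(X)
estimator = CoxPHSurvivalAnalysis().fit(Xt, y)
predicted_risk = estimator.predict(Xt)

# Evaluate discrimination over a grid of time points
times = np.arange(30, 365, 30)
auc, mean_auc = cumulative_dynamic_auc(y, y, predicted_risk, times)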
While the concordance index is easy to interpret, it is not a useful measure of performance if a specific time range is of primary interest (e.g. predicting death within 2 years). This is particularly relevant for survival models that can make time-dependent predictions.
For instance, RandomSurvivalForest can also predict survival functions (via predict_survival_function()) or cumulative hazard functions (via predict_cumulative_hazard_function()). These functions return lists of StepFunction instances. Each instance can be evaluated at a set of time points to obtain predicted survival probabilities (or cumulative hazards). The Brier score and cumulative_dynamic_auc() are capable of evaluating time-dependent predictions, but not the C-index.
import numpy as np
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import integrated_brier_score
from sksurv.preprocessing import OneHotEncoder
# Load the data
X, y = load_veterans_lung_cancer()
Xt = OneHotEncoder().fit_transform(X)
# Fit the model
estimator = RandomSurvivalForest().fit(Xt, y)
# predict survival functions
surv_funcs = estimator.predict_survival_function(Xt)
# select time points to evaluate performance at
times = np.arange(7, 365)
# create predictions at selected time points
preds = np.asarray(
    [[sfn(t) for t in times] for sfn in surv_funcs]
)
# compute integral
score = integrated_brier_score(y, y, preds, times)
For more details on evaluating survival models, please have a look at the user guide and the API documentation.
22 Aug 2025 9:55pm GMT
Rodrigo Girão Serrão: functools.Placeholder
Learn how to use functools.Placeholder, new in Python 3.14, with real-life examples.
By reading this article you will understand what functools.Placeholder is for and how to use it effectively.
Partial function application
The new Placeholder, added in Python 3.14, only makes sense in the context of functools.partial, so in order to understand Placeholder you will need to understand how functools.partial works and how to use it.
In a nutshell, partial allows you to perform partial function application, by "freezing" arguments to functions.
How to pass arguments to functools.partial
Up until Python 3.13, you could use partial to freeze arguments in two ways:
- you could pass positional arguments to partial, which would be passed in the same order to the function being used with partial; or
- you could pass keyword arguments to partial, which would be passed with the same name to the function being used with partial.
Using keyword arguments to skip the first argument
Method 2 is especially useful if you're trying to freeze an argument that is not the first one. For example, if you use the built-in help on the built-in int, you can see this signature:
int(x, base=10) -> integer
If you want to convert a binary string to an integer, you can set base=2:
print(int("101", 2)) # 5
Now, suppose you want to create a function from_binary by "freezing" the argument 2 in the built-in int. Writing
from_binary = partial(int, 2)
won't work, since in partial(int, 2), the value 2 is seen as the argument x from the signature above. However, you can pass the base as a keyword argument, skipping the first argument x from the signature of the built-in int:
from functools import partial
from_binary = partial(int, base=2)
print(from_binary("101")) # 5
But this doesn't always work.
When keyword arguments don't work
Consider the following function that uses the string methods maketrans and translate to strip punctuation from a string:
import string
_table = str.maketrans("", "", string.punctuation)
def remove_punctuation(string):
    return string.translate(_table)
print(remove_punctuation("Hello, world!")) # Hello world
The function remove_punctuation is a thin wrapper around the string method str.translate, which is the function doing all the work. In fact, if you look at str.translate as a function, you always pass _table as the second argument; what changes is the first argument:
print(str.translate("Hello, world!", _table)) # Hello world
print(str.translate("What?!", _table)) # What
This may lead you to want to use partial to freeze the value _table on the function str.translate, so you use the built-in help to check the signature of str.translate:
translate(self, table, /) unbound builtins.str method
You can see that the first argument is self, the string you are trying to translate, and then table is the translation table (that str.maketrans built magically for you). But you can also see the forward slash /, which means that self and table are positional-only arguments that cannot be passed in as keyword arguments!
As such,...
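This is exactly the gap that functools.Placeholder fills. As a minimal sketch (assuming Python 3.14 and the _table defined above), a placeholder reserves the first positional slot, the string to translate, so that _table can be frozen in the second position:

import string
from functools import Placeholder, partial

_table = str.maketrans("", "", string.punctuation)

# Placeholder keeps the first positional slot open for the string to translate,
# so _table is frozen as the second positional argument.
remove_punctuation = partial(str.translate, Placeholder, _table)

print(remove_punctuation("Hello, world!"))  # Hello world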
22 Aug 2025 7:21pm GMT
Rodrigo Girão Serrão: TIL #130 – Format Python code directly with uv
Today I learned you can format your Python code directly with uv.
In uv version 0.8.13, released one or two days ago, uv added the command format that allows you to format your Python code directly through the uv CLI.
Update your uv
First and foremost, make sure you're rocking uv 0.8.13 or greater by running uv self update.
Format your code with uv
To format your code with uv you can simply run uv format, which will use Ruff to format the code in your current directory:
$ uv format
The idea is not to have uv replace Ruff; it's just so that you don't have to think about a separate tool if you don't want to.
uv format arguments
uv format accepts the same arguments and options that ruff format accepts, so you'll want to check the Ruff docs to learn more. My favourite option is --diff, which shows the formatting diff without actually making any changes.
As of now, the feature is marked as being experimental, which means it might change in the future!
22 Aug 2025 4:34pm GMT
Django community aggregator: Community blog posts
Django News - State of Python 2025 Results - Aug 22nd 2025
News
State of Python 2025 Is Out!
Explore the key trends and actionable ideas from the latest Python Developers Survey, which was conducted jointly by the Python Software Foundation and JetBrains PyCharm and includes insights from over 30,000 developers.
PyPI now serves project status markers in API responses
PyPI now exposes standardized project status markers through its HTML and JSON index APIs, enabling package installers to programmatically signal dependency status and manage installations.
Preventing Domain Resurrection Attacks
PyPI now checks for expired domains to prevent domain resurrection attacks, a type of supply-chain attack where someone buys an expired domain and uses it to take over an account through password resets.
Updates to Django
Today "Updates to Django" is presented by Velda Kiara from Django Events Foundation North America (DEFNA)! 🚀
Last week we had 15 pull requests merged into Django by 10 different contributors - including a first-time contributor! Congratulations to Rohit for having their first commits merged into Django - welcome on board!
Django Core Updates ✨
- Fix to Subquery.resolve_expression() output field handling, which corrects how Django determines the output_field in subqueries. This adjustment restores consistent and predictable query behavior.
- Template partials arrive in DTL adds two new tags, partial and partialdef, that let developers define and reuse named chunks of templates. This brings cleaner organization and modularity to template design.
Community Updates 🦄
Want to celebrate Django's birthday with fellow Djangonauts? Head over to the Django20 website to attend one of the birthday celebrations to a city near you.
That's all for this week in Django development! 🐍🦄
Django Newsletter
Wagtail CMS
Wagtail Space 2025 is a go!
Wagtail Space 2025 is a free three-day Zoom event featuring lightning talks and networking to shape future Wagtail improvements.
Sponsored Link 1
AI-Powered Django Development & Consulting
REVSYS specializes in integrating powerful AI technologies, including GPT-5, directly into your Django applications. We help bring modern, intelligent features to your project that boost user engagement and streamline content workflows.
Articles
Sometimes LFU > LRU
Stop letting bot traffic evict your customers' sessions. A simple Redis configuration switch from LRU to LFU solved our crawler problem, with a Django configuration example.
Python Namespace Packages are a pain
Ensuring __init__.py is present in every directory prevents ambiguous namespace packages, streamlines module imports, and mitigates cryptic errors in Python packaging.
DjangoCon Africa x UbuCon 2025 Reflections: Stay For The Community
DjangoCon Africa x UbuCon 2025 underlines robust community collaboration, open-source initiative growth and challenges in sustaining African Django development through engaging sprints and talks.
Customize your IPython shell in Docker
Customize your IPython shell in Docker with tailored profiles and startup scripts that streamline Django shell_plus debugging, imports, and UUID extraction workflows.
Best Python Books (2025)
An up-to-date list of the best books for learning Python.
Events
Friends of PyCon Africa Livestream
Join the August 30th livestream celebrating the vibrant Python community across Africa! This isn't your typical webinar - it's a dynamic, fun-filled conversation where Python community members will drop in and out throughout the event, sharing their stories, projects, and passion for Python.
Guests include Carlton Gibson, Dawn Wages, Michael Kennedy, Sarah Abderemane, and more.
Be Part of Something Amazing: Volunteer at DjangoCon US 2025 in Chicago!
Join DjangoCon US 2025 as a volunteer in Chicago to gain insider event management experience, expand your network, and strengthen the Django community.
Videos
Talk Python Live Stream: Celebrating Django's 20th Birthday with its Creators
A discussion of Django's past, present, and future featuring Adrian Holovaty, Simon Willison, Thibaud Colas, Jeff Triplett, Will Vincent, and Michael Kennedy.
"How a Solo Hobbyist Learned to Love Testing" - Carl James
PyOhio talk on slowly integrating testing into Django apps and, by proxy, learning more about the underlying libraries along the way.
DjangoCon Videos
Logs, shells, caches and other strange words we use daily
This insightful talk reveals the unexpected origins of computing terms, linking historical context to modern software engineering practices relevant to Django experts.
Sponsored Link 2
Sponsor this newsletter!
Podcasts
Django Brew #6: Celebrating 20 Years of Django
A podcast episode celebrates Django's 20th anniversary using trivia, reflections, and community updates to engage developers with historical highlights and events.
Test & Code | 238: So Long, and Thanks for All the Fish
Brian Okken reflects on a decade of Python testing and podcasting, sharing lessons learned and inviting continued engagement via his Python platforms. A farewell to a fun 10 years.
Django News Jobs
Senior Python Developer at Basalt Health 🆕
Senior Full Stack Engineer at Lyst 🆕
Backend Python Software Engineer (Hybrid) at NVIDIA 🆕
Senior Python Developer at Brightwater
Senior Backend Engineer at Prowler
Django Newsletter
Projects
joshuadavidthomas/mcp-django-shell
MCP server providing a stateful Django shell for AI assistants.
edelvalle/djhtmx
Interactive UI components for Django using htmx.org.
This RSS feed is published on https://django-news.com/. You can also subscribe via email.
22 Aug 2025 3:00pm GMT
18 Aug 2025
Django community aggregator: Community blog posts
Configurable UI in Software
Another short one today, about a pattern I have noticed in a couple of pieces of software I use, notably Todoist, Slack, and Vivaldi. All three of these allow a user to configure the menu options to some degree. Slack has the option to customise the navigation options shown within a particular workspace to optimise the experience for a user. Todoist takes this a step further in the mobile app by allowing a user to sort the menu items.
Browsers have always had a great experience of customisation, but Vivaldi takes this to an awesome extreme by allowing a user to customise each and every possible context menu, giving true flexibility to their users.
Personally I have never considered the power of this, and I wonder if there are any efficient implementations of this for Django without creating a huge amount of complexity. The naive default solution would likely involve a model and a context processor and/or a middleware; it might be something I add in my next project, if we feel it would be beneficial to our users.
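As a very rough sketch of that naive approach (all names here are hypothetical, nothing from an existing package): a small model holding per-user menu entries, plus a context processor that exposes them to every template.

from django.conf import settings
from django.db import models


class MenuItem(models.Model):
    # One row per menu entry, per user, with an ordering field.
    user = models.ForeignKey(
        settings.AUTH_USER_MODEL, on_delete=models.CASCADE, related_name="menu_items"
    )
    label = models.CharField(max_length=100)
    url_name = models.CharField(max_length=200)  # resolved with {% url %} in templates
    position = models.PositiveIntegerField(default=0)
    is_visible = models.BooleanField(default=True)

    class Meta:
        ordering = ["position"]


def user_menu(request):
    # Context processor: add to TEMPLATES["OPTIONS"]["context_processors"].
    if not request.user.is_authenticated:
        return {"user_menu": []}
    return {"user_menu": request.user.menu_items.filter(is_visible=True)}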
18 Aug 2025 5:00am GMT
15 Aug 2025
Planet Twisted
Glyph Lefkowitz: The Futzing Fraction
The most optimistic vision of generative AI1 is that it will relieve us of the tedious, repetitive elements of knowledge work so that we can get to work on the really interesting problems that such tedium stands in the way of. Even if you fully believe in this vision, it's hard to deny that today, some tedium is associated with the process of using generative AI itself.
Generative AI also isn't free, and so, as responsible consumers, we need to ask: is it worth it? What's the ROI of genAI, and how can we tell? In this post, I'd like to explore a logical framework for evaluating genAI expenditures, to determine if your organization is getting its money's worth.
Perpetually Proffering Permuted Prompts
I think most LLM users would agree with me that a typical workflow with an LLM rarely involves prompting it only one time and getting a perfectly useful answer that solves the whole problem.
Generative AI best practices, even from the most optimistic vendors, all suggest that you should continuously evaluate everything. ChatGPT, which is really the only genAI product with significantly scaled adoption, still says at the bottom of every interaction:
ChatGPT can make mistakes. Check important info.
If we have to "check important info" on every interaction, it stands to reason that even if we think it's useful, some of those checks will find an error. Again, if we think it's useful, presumably the next thing to do is to perturb our prompt somehow, and issue it again, in the hopes that the next invocation will, by dint of either:
- better luck this time with the stochastic aspect of the inference process,
- enhanced application of our skill to engineer a better prompt based on the deficiencies of the current inference, or
- better performance of the model by populating additional context in subsequent chained prompts.
Unfortunately, given the relative lack of reliable methods to re-generate the prompt and receive a better answer2, checking the output and re-prompting the model can feel like just kinda futzing around with it. You try, you get a wrong answer, you try a few more times, eventually you get the right answer that you wanted in the first place. It's a somewhat unsatisfying process, but if you get the right answer eventually, it does feel like progress, and you didn't need to use up another human's time.
In fact, the hottest buzzword of the last hype cycle is "agentic". While I have my own feelings about this particular word3, its current practical definition is "a generative AI system which automates the process of re-prompting itself, by having a deterministic program evaluate its outputs for correctness".
A better term for an "agentic" system would be a "self-futzing system".
However, the ability to automate some level of checking and re-prompting does not mean that you can fully delegate tasks to an agentic tool, either. It is, plainly put, not safe. If you leave the AI on its own, you will get terrible results that will at best make for a funny story45 and at worst might end up causing serious damage67.
Taken together, this all means that for any consequential task that you want to accomplish with genAI, you need an expert human in the loop. The human must be capable of independently doing the job that the genAI system is being asked to accomplish.
When the genAI guesses correctly and produces usable output, some of the human's time will be saved. When the genAI guesses wrong and produces hallucinatory gibberish or even "correct" output that nevertheless fails to account for some unstated but necessary property such as security or scale, some of the human's time will be wasted evaluating it and re-trying it.
Income from Investment in Inference
Let's evaluate an abstract, hypothetical genAI system that can automate some work for our organization. To avoid implicating any specific vendor, let's call the system "Mallory".
Is Mallory worth the money? How can we know?
Logically, there are only two outcomes that might result from using Mallory to do our work.
- We prompt Mallory to do some work; we check its work, it is correct, and some time is saved.
- We prompt Mallory to do some work; we check its work, it fails, and we futz around with the result; this time is wasted.
As a logical framework, this makes sense, but ROI is an arithmetical concept, not a logical one. So let's translate this into some terms.
In order to evaluate Mallory, let's define the Futzing Fraction, FF, in terms of the following variables:
- H: the average amount of time a Human worker would take to do a task, unaided by Mallory
- I: the amount of time that Mallory takes to run one Inference8
- C: the amount of time that a human has to spend Checking Mallory's output for each inference
- P: the Probability that Mallory will produce a correct inference for each prompt
- W: the average amount of time that it takes for a human to Write one prompt for Mallory
- E: since we are normalizing everything to time, rather than money, we do also have to account for the dollar cost of Mallory as a product, so we will include the Equivalent amount of human time we could purchase for the marginal cost of one9 inference
As in last week's example of simple ROI arithmetic, we will put our costs in the numerator, and our benefits in the denominator:

FF = (W + I + C + E) / (P × H)
The idea here is that for each prompt, the minimum amount of time-equivalent cost possible is W+I+C+E. The user must, at least once, write a prompt, wait for inference to run, then check the output; and, of course, pay any costs to Mallory's vendor.
If the probability of a correct answer is P = 1/3, then they will do this entire process 3 times10, so we put P in the denominator. Finally, we divide everything by H, because we are trying to determine if we are actually saving any time or money, versus just letting our existing human, who has to be driving this process anyway, do the whole thing.
If the Futzing Fraction evaluates to a number greater than 1, as previously discussed, you are a bozo; you're spending more time futzing with Mallory than getting value out of it.
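As a minimal sketch, the same fraction expressed as a tiny Python helper (all values normalized to the same unit of time, e.g. minutes):

def futzing_fraction(H, I, C, P, W, E):
    # Time-equivalent cost of getting one task done with Mallory,
    # relative to the human just doing it unaided.
    return (W + I + C + E) / (P * H)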
Figuring out the Fraction is Frustrating
In order to even evaluate the value of the Futzing Fraction though, you have to have a sound method to even get a vague sense of all the terms.
If you are a business leader, a lot of this is relatively easy to measure. You vaguely know what H is, because you know what your payroll costs, and similarly, you can figure out E with some pretty trivial arithmetic based on Mallory's pricing table. There are endless YouTube channels, spec sheets and benchmarks to give you I. W is probably going to be so small compared to H that it hardly merits consideration11.
But, are you measuring C? If your employees are not checking the outputs of the AI, you're on a path to catastrophe that no ROI calculation can capture, so it had better be greater than zero.
Are you measuring P? How often does the AI get it right on the first try?
Challenges to Computing Checking Costs
In the fraction defined above, the term C is going to be large. Larger than you think.
Measuring P and C with a high degree of precision is probably going to be very hard; possibly unreasonably so, or too expensive12 to bother with in practice. So you will undoubtedly need to work with estimates and proxy metrics. But you have to be aware that this is a problem domain where your normal method of estimating is going to be extremely vulnerable to inherent cognitive bias, and find ways to measure.
Margins, Money, and Metacognition
First let's discuss cognitive and metacognitive bias.
My favorite cognitive bias is the availability heuristic and a close second is its cousin salience bias. Humans are empirically predisposed towards noticing and remembering things that are more striking, and to overestimate their frequency.
If you are estimating the variables above based on the vibe that you're getting from the experience of using an LLM, you may be overestimating its utility.
Consider a slot machine.
If you put a dollar into a slot machine, and you lose that dollar, this is an unremarkable event. Expected, even. It doesn't seem interesting. You can repeat this over and over again, a thousand times, and each time it will seem equally unremarkable. If you do it a thousand times, you will probably get gradually more anxious as your sense of your dwindling bank account becomes slowly more salient, but losing one more dollar still seems unremarkable.
If you put a dollar in a slot machine and it gives you a thousand dollars, that will probably seem pretty cool. Interesting. Memorable. You might tell a story about this happening, but you definitely wouldn't really remember any particular time you lost one dollar.
Luckily, when you arrive at a casino with slot machines, you probably know well enough to set a hard budget in the form of some amount of physical currency you will have available to you. The odds are against you, you'll probably lose it all, but any responsible gambler will have an immediate, physical representation of their balance in front of them, so when they have lost it all, they can see that their hands are empty, and can try to resist the "just one more pull" temptation, after hitting that limit.
Now, consider Mallory.
If you put ten minutes into writing a prompt, and Mallory gives a completely off-the-rails, useless answer, and you lose ten minutes, well, that's just what using a computer is like sometimes. Mallory malfunctioned, or hallucinated, but it does that sometimes, everybody knows that. You only wasted ten minutes. It's fine. Not a big deal. Let's try it a few more times. Just ten more minutes. It'll probably work this time.
If you put ten minutes into writing a prompt, and it completes a task that would have otherwise taken you 4 hours, that feels amazing. Like the computer is magic! An absolute endorphin rush.
Very memorable. When it happens, it feels like P=1.
But... did you have a time budget before you started? Did you have a specified N such that "I will give up on Mallory as soon as I have spent N minutes attempting to solve this problem with it"? When the jackpot finally pays out that 4 hours, did you notice that you put 6 hours' worth of 10-minute prompt coins into it?
If you are attempting to use the same sort of heuristic intuition that probably works pretty well for other business leadership decisions, Mallory's slot-machine chat-prompt user interface is practically designed to subvert those sensibilities. Most business activities do not have nearly such an emotionally variable, intermittent reward schedule. They're not going to trick you with this sort of cognitive illusion.
Thus far we have been talking about cognitive bias, but there is a metacognitive bias at play too: while Dunning-Kruger, everybody's favorite metacognitive bias, does have some problems with it, the main underlying metacognitive bias is that we tend to believe our own thoughts and perceptions, and it requires active effort to distance ourselves from them, even if we know they might be wrong.
This means you must assume any intuitive estimate of C is going to be biased low; similarly P is going to be biased high. You will forget the time you spent checking, and you will underestimate the number of times you had to re-check.
To avoid this, you will need to decide on a Ulysses pact to provide some inputs to a calculation for these factors that you will not be able to fudge if they seem wrong to you.
Problematically Plausible Presentation
Another nasty little cognitive-bias landmine for you to watch out for is the authority bias, for two reasons:
- People will tend to see Mallory as an unbiased, external authority, and thereby see it as more of an authority than a similarly-situated human13.
- Being an LLM, Mallory will be overconfident in its answers14.
The nature of LLM training is also such that commonly co-occurring tokens in the training corpus produce higher likelihood of co-occurring in the output; they're just going to be closer together in the vector-space of the weights; that's, like, what training a model is, establishing those relationships.
If you've ever used an heuristic to informally evaluate someone's credibility by listening for industry-specific shibboleths or ways of describing a particular issue, that skill is now useless. Having ingested every industry's expert literature, commonly-occurring phrases will always be present in Mallory's output. Mallory will usually sound like an expert, but then make mistakes at random15.
While you might intuitively estimate C by thinking "well, if I asked a person, how could I check that they were correct, and how long would that take?" that estimate will be extremely optimistic, because the heuristic techniques you would use to quickly evaluate incorrect information from other humans will fail with Mallory. You need to go all the way back to primary sources and actually fully verify the output every time, or you will likely fall into one of these traps.
Mallory Mangling Mentorship
So far, I've been describing the effect Mallory will have in the context of an individual attempting to get some work done. If we are considering organization-wide adoption of Mallory, however, we must also consider the impact on team dynamics. There are a number of potential side effects that one might consider, but here I will focus on just one that I have observed.
I have a cohort of friends in the software industry, most of whom are individual contributors. I'm a programmer who likes programming, so are most of my friends, and we are also (sigh), charitably, pretty solidly middle-aged at this point, so we tend to have a lot of experience.
As such, we are often the folks that the team - or, in my case, the community - goes to when less-experienced folks need answers.
On its own, this is actually pretty great. Answering questions from more junior folks is one of the best parts of a software development job. It's an opportunity to be helpful, mostly just by knowing a thing we already knew. And it's an opportunity to help someone else improve their own agency by giving them knowledge that they can use in the future.
However, generative AI throws a bit of a wrench into the mix.
Let's imagine a scenario where we have 2 developers: Alice, a staff engineer who has a good understanding of the system being built, and Bob, a relatively junior engineer who is still onboarding.
The traditional interaction between Alice and Bob, when Bob has a question, goes like this:
- Bob gets confused about something in the system being developed, because Bob's understanding of the system is incorrect.
- Bob formulates a question based on this confusion.
- Bob asks Alice that question.
- Alice knows the system, so she gives an answer which accurately reflects the state of the system to Bob.
- Bob's understanding of the system improves, and thus he will have fewer and better-informed questions going forward.
You can imagine how repeating this simple 5-step process will eventually transform Bob into a senior developer, and then he can start answering questions on his own. Making sufficient time for regularly iterating this loop is the heart of any good mentorship process.
Now, though, with Mallory in the mix, the process now has a new decision point, changing it from a linear sequence to a flow chart.
We begin the same way, with steps 1 and 2. Bob's confused, Bob formulates a question, but then:
- Bob asks Mallory that question.
Here, our path then diverges into a "happy" path, a "meh" path, and a "sad" path.
The "happy" path proceeds like so:
- Mallory happens to formulate a correct answer.
- Bob's understanding of the system improves, and thus he will have fewer and better-informed questions going forward.
Great. Problem solved. We just saved some of Alice's time. But as we learned earlier,
Mallory can make mistakes. When that happens, we will need to check important info. So let's get checking:
- Mallory happens to formulate an incorrect answer.
- Bob investigates this answer.
- Bob realizes that this answer is incorrect because it is inconsistent with some of his prior, correct knowledge of the system, or his investigation.
- Bob asks Alice the same question; GOTO traditional interaction step 4.
On this path, Bob spent a while futzing around with Mallory, to no particular benefit. This wastes some of Bob's time, but then again, Bob could have ended up on the happy path, so perhaps it was worth the risk; at least Bob wasn't wasting any of Alice's much more valuable time in the process.16
Notice that beginning at the start of step 4, we must begin allocating all of Bob's time to C, so C already starts getting a bit bigger than if it were just Bob checking Mallory's output specifically on tasks that Bob is doing.
That brings us to the "sad" path.
- Mallory happens to formulate an incorrect answer.
- Bob investigates this answer.
- Bob does not realize that this answer is incorrect because he is unable to recognize any inconsistencies with his existing, incomplete knowledge of the system.
- Bob integrates Mallory's incorrect information of the system into his mental model.
- Bob proceeds to make a larger and larger mess of his work, based on an incorrect mental model.
- Eventually, Bob asks Alice a new, worse question, based on this incorrect understanding.
- Sadly we cannot return to the happy path at this point, because now Alice must unravel the complex series of confusing misunderstandings that Mallory has unfortunately conveyed to Bob at this point. In the really sad case, Bob actually doesn't believe Alice for a while, because Mallory seems unbiased17, and Alice has to waste even more time convincing Bob before she can simply explain to him.
Now, we have wasted some of Bob's time, and some of Alice's time. Everything from step 5-10 is C, and as soon as Alice gets involved, we are now adding to C at double real-time. If more team members are pulled in to the investigation, you are now multiplying C by the number of investigators, potentially running at triple or quadruple real time.
But That's Not All
Here I've presented a brief selection reasons why C will be both large, and larger than you expect. To review:
- Gambling-style mechanics of the user interface will interfere with your own self-monitoring and developing a good estimate.
- You can't use human heuristics for quickly spotting bad answers.
- Wrong answers given to junior people who can't evaluate them will waste more time from your more senior employees.
But this is a small selection of ways that Mallory's output can cost you money and time. It's harder to simplistically model second-order effects like this, but there's also a broad range of possibilities for ways that, rather than simply checking and catching errors, an error slips through and starts doing damage. Or ways in which the output isn't exactly wrong, but still sub-optimal in ways which can be difficult to notice in the short term.
For example, you might successfully vibe-code your way to launch a series of applications, successfully "checking" the output along the way, but then discover that the resulting code is unmaintainable garbage that prevents future feature delivery, and needs to be re-written18. But this kind of intellectual debt isn't even specific to technical debt while coding; it can even affect such apparently genAI-amenable fields as LinkedIn content marketing19.
Problems with the Prediction of P
C isn't the only challenging term though. P, is just as, if not more important, and just as hard to measure.
LLM marketing materials love to phrase their accuracy in terms of a percentage. Accuracy claims for LLMs in general tend to hover around 70%20. But these scores vary per field, and when you aggregate them across multiple topic areas, they start to trend down. This is exactly why "agentic" approaches for more immediately-verifiable LLM outputs (with checks like "did the code work") got popular in the first place: you need to try more than once.
Independently measured claims about accuracy tend to be quite a bit lower21. The field of AI benchmarks is exploding, but it probably goes without saying that LLM vendors game those benchmarks22, because of course every incentive would encourage them to do that. Regardless of what their arbitrary scoring on some benchmark might say, all that matters to your business is whether it is accurate for the problems you are solving, for the way that you use it. Which is not necessarily going to correspond to any benchmark. You will need to measure it for yourself.
With that goal in mind, our formulation of P must be a somewhat harsher standard than "accuracy". It's not merely "was the factual information contained in any generated output accurate", but, "is the output good enough that some given real knowledge-work task is done and the human does not need to issue another prompt"?
Surprisingly Small Space for Slip-Ups
The problem with reporting these things as percentages at all, however, is that our actual definition for P is 1/attempts, where attempts for any given task must be an integer greater than or equal to 1.
Taken in aggregate, if we succeed on the first prompt more often than not, we could end up with a P > 1/2, but combined with the previous observation that you almost always have to prompt it more than once, the practical reality is that P will start at 50% and go down from there.
If we plug in some numbers, trying to be as extremely optimistic as we can, and say that we have a uniform stream of tasks, every one of which can be addressed by Mallory, every one of which:
- we can measure perfectly, with no overhead
- would take a human 45 minutes
- takes Mallory only a single minute to generate a response
- Mallory will require only 1 re-prompt, so "good enough" half the time
- takes a human only 5 minutes to write a prompt for
- takes a human only 5 minutes to check the result of
- has a per-prompt cost of the equivalent of a single second of a human's time
Thought experiments are a dicey basis for reasoning in the face of disagreements, so I have tried to formulate something here that is absolutely, comically, over-the-top stacked in favor of the AI optimist.
Would that be profitable? It sure seems like it, given that we are trading off 45 minutes of human time for 1 minute of Mallory-time and 10 minutes of human time. If we ask Python:
H = 45              # minutes for a human to do the task unaided
W, I, C = 5, 1, 5   # minutes to write a prompt, run one inference, check the result
E = 1 / 60          # one second of equivalent human time per prompt
P = 1 / 2           # "good enough" on the second prompt, half the time
print((W + I + C + E) / (P * H))  # 0.4896...
We get a futzing fraction of about 0.4896. Not bad! Sounds like, at least under these conditions, it would indeed be cost-effective to deploy Mallory. But… realistically, do you reliably get useful, done-with-the-task quality output on the second prompt? Let's bump up the denominator on P just a little bit there, and see how we fare:
P = 1 / 3           # three prompts per task instead of two
print((W + I + C + E) / (P * H))  # 0.734...
Oof. Still cost-effective at 0.734, but not quite as good. Where do we cap out, exactly?
for attempts in range(2, 6):
    P = 1 / attempts
    print(attempts, (W + I + C + E) / (P * H))
With this little test, we can see that at our next iteration we are already at 0.9792, and by 5 tries per prompt, even in this absolute fever-dream of an over-optimistic scenario, with a futzing fraction of 1.2240, Mallory is now a net detriment to our bottom line.
Harm to the Humans
We are treating H as functionally constant so far, an average around some hypothetical Gaussian distribution, but the distribution itself can also change over time.
Formally speaking, an increase to H would be good for our fraction. Maybe it would even be a good thing; it could mean we're taking on harder and harder tasks due to the superpowers that Mallory has given us.
But an observed increase to H would probably not be good. An increase could also mean your humans are getting worse at solving problems, because using Mallory has atrophied their skills23 and sabotaged learning opportunities2425. It could also go up because your senior, experienced people now hate their jobs26.
For some more vulnerable folks, Mallory might just take a shortcut to all these complex interactions and drive them completely insane27 directly. Employees experiencing an intense psychotic episode are famously less productive than those who are not.
This could all be very bad, if our futzing fraction eventually does head north of 1 and you need to reconsider introducing human-only workflows, without Mallory.
Abridging the Artificial Arithmetic (Alliteratively)
To reiterate, I have proposed this fraction:

FF = (W + I + C + E) / (P × H)

which shows us positive ROI when FF is less than 1, and negative ROI when it is more than 1.
This model is heavily simplified. A comprehensive measurement program that tests the efficacy of any technology, let alone one as complex and rapidly changing as LLMs, is more complex than could be captured in a single blog post.
Real-world work might be insufficiently uniform to fit into a closed-form solution like this. Perhaps an iterated simulation with variables based on the range of values seen from your team's metrics would give better results.
However, in this post, I want to illustrate that if you are going to try to evaluate an LLM-based tool, you need to at least include some representation of each of these terms somewhere. They are all fundamental to the way the technology works, and if you're not measuring them somehow, then you are flying blind into the genAI storm.
I also hope to show that a lot of existing assumptions about how benefits might be demonstrated, for example with user surveys about general impressions, or by evaluating artificial benchmark scores, are deeply flawed.
Even making what I consider to be wildly, unrealistically optimistic assumptions about these measurements, I hope I've shown:
- in the numerator, C might be a lot higher than you expect,
- in the denominator, P might be a lot lower than you expect,
- repeated use of an LLM might make H go up, but despite the fact that it's in the denominator, that will ultimately be quite bad for your business.
Personally, I don't have all that many concerns about E and I. E is still seeing significant loss-leader pricing, and I might not be coming down as fast as vendors would like us to believe, but if the other numbers work out, I don't think they make a huge difference. However, there might still be surprises lurking in there, and if you want to rationally evaluate the effectiveness of a model, you need to be able to measure them and incorporate them as well.
In particular, I really want to stress the importance of the influence of LLMs on your team dynamic, as that can cause massive, hidden increases to C. LLMs present opportunities for junior employees to generate an endless stream of chaff that will simultaneously:
- wreck your performance review process by making them look much more productive than they are,
- increase stress and load on senior employees who need to clean up unforeseen messes created by their LLM output,
- and ruin their own opportunities for career development by skipping over learning opportunities.
If you've already deployed LLM tooling without measuring these things and without updating your performance management processes to account for the strange distortions that these tools make possible, your Futzing Fraction may be much, much greater than 1, creating hidden costs and technical debt that your organization will not notice until a lot of damage has already been done.
If you got all the way here, particularly if you're someone who is enthusiastic about these technologies, thank you for reading. I appreciate your attention and I am hopeful that if we can start paying attention to these details, perhaps we can all stop futzing around so much with this stuff and get back to doing real work.
Acknowledgments
Thank you to my patrons who are supporting my writing on this blog. If you like what you've read here and you'd like to read more of it, or you'd like to support my various open-source endeavors, you can support my work as a sponsor!
- I do not share this optimism, but I want to try very hard in this particular piece to take it as a given that genAI is in fact helpful. ↩
- If we could have a better prompt on demand via some repeatable and automatable process, surely we would have used a prompt that got the answer we wanted in the first place. ↩
- The software idea of a "user agent" straightforwardly comes from the legal principle of an agent, which has deep roots in common law, jurisprudence, philosophy, and math. When we think of an agent (some software) acting on behalf of a principal (a human user), this historical baggage imputes some important ethical obligations to the developer of the agent software. genAI vendors have been as eager as any software vendor to dodge responsibility for faithfully representing the user's interests even as there are some indications that at least some courts are not persuaded by this dodge, at least by the consumers of genAI attempting to pass on the responsibility all the way to end users. Perhaps it goes without saying, but I'll say it anyway: I don't like this newer interpretation of "agent". ↩
- "Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents", Axel Backlund, Lukas Petersson, Feb 20, 2025 ↩
- "random thing are happening, maxed out usage on api keys", @leojr94 on Twitter, Mar 17, 2025 ↩
- "New study sheds light on ChatGPT's alarming interactions with teens" ↩
- "Lawyers submitted bogus case law created by ChatGPT. A judge fined them $5,000", by Larry Neumeister for the Associated Press, June 22, 2023 ↩
- During which a human will be busy-waiting on an answer. ↩
- Given the fluctuating pricing of these products, and fixed subscription overhead, this will obviously need to be amortized; including all the additional terms to actually convert this from your inputs is left as an exercise for the reader. ↩
- I feel like I should emphasize explicitly here that everything is an average over repeated interactions. For example, you might observe that a particular LLM has a low probability of outputting acceptable work on the first prompt, but higher probability on subsequent prompts in the same context, such that it usually takes 4 prompts. For the purposes of this extremely simple closed-form model, we'd still consider that a P of 25%, even though a more sophisticated model, or a monte carlo simulation that sets progressive bounds on the probability, might produce more accurate values. ↩
- No it isn't, actually, but for the sake of argument let's grant that it is. ↩
- It's worth noting that all this expensive measuring itself must be included in C until you have a solid grounding for all your metrics, but let's optimistically leave all of that out for the sake of simplicity. ↩
- "AI Company Poll Finds 45% of Workers Trust the Tech More Than Their Peers", by Suzanne Blake for Newsweek, Aug 13, 2025 ↩
- AI Chatbots Remain Overconfident - Even When They're Wrong by Jason Bittel for the Dietrich College of Humanities and Social Sciences at Carnegie Mellon University, July 22, 2025 ↩
- AI Mistakes Are Very Different From Human Mistakes by Bruce Schneier and Nathan E. Sanders for IEEE Spectrum, Jan 13, 2025 ↩
- Foreshadowing is a narrative device in which a storyteller gives an advance hint of an upcoming event later in the story. ↩
- "People are worried about the misuse of AI, but they trust it more than humans" ↩
- "Why I stopped using AI (as a Senior Software Engineer)", theSeniorDev YouTube channel, Jun 17, 2025 ↩
- "I was an AI evangelist. Now I'm an AI vegan. Here's why.", Joe McKay for the greatchatlinkedin YouTube channel, Aug 8, 2025 ↩
- "Study Finds That 52 Percent Of ChatGPT Answers to Programming Questions are Wrong", by Sharon Adarlo for Futurism, May 23, 2024 ↩
- "Off the Mark: The Pitfalls of Metrics Gaming in AI Progress Races", by Tabrez Syed on BoxCars AI, Dec 14, 2023 ↩
- "I tried coding with AI, I became lazy and stupid", by Thomasorus, Aug 8, 2025 ↩
- "How AI Changes Student Thinking: The Hidden Cognitive Risks" by Timothy Cook for Psychology Today, May 10, 2025 ↩
- "Increased AI use linked to eroding critical thinking skills" by Justin Jackson for Phys.org, Jan 13, 2025 ↩
- "AI could end my job - Just not the way I expected" by Manuel Artero Anguita on dev.to, Jan 27, 2025 ↩
- "The Emerging Problem of "AI Psychosis"" by Gary Drevitch for Psychology Today, July 21, 2025. ↩
15 Aug 2025 7:51am GMT
09 Aug 2025
Planet Twisted
Glyph Lefkowitz: R0ML’s Ratio
My father, also known as "R0ML" once described a methodology for evaluating volume purchases that I think needs to be more popular.
If you are a hardcore fan, you might know that he has already described this concept publicly in a talk at OSCON in 2005, among other places, but it has never found its way to the public Internet, so I'm giving it a home here, and in the process, appropriating some of his words.1
Let's say you're running a circus. The circus has many clowns. Ten thousand clowns, to be precise. They require bright red clown noses. Therefore, you must acquire a significant volume of clown noses. An enterprise licensing agreement for clown noses, if you will.
If the nose plays, it can really make the act. In order to make sure you're getting quality noses, you go with a quality vendor. You select a vendor who can supply noses for $100 each, at retail.
Do you want to buy retail? Ten thousand clowns, ten thousand noses, one hundred dollars: that's a million bucks worth of noses, so it's worth your while to get a good deal.
As a conscientious executive, you go to the golf course with your favorite clown accessories vendor and negotiate yourself a 50% discount, with a commitment to buy all ten thousand noses.
Is this a good deal? Should you take it?
To determine this, we will use an analytical tool called R0ML's Ratio (RR).
The ratio has 2 terms:
- the Full Undiscounted Retail List Price of Units Used (FURLPoUU), which can of course be computed by the individual retail list price of a single unit (in our case, $100) multiplied by the number of units used
- the Total Price of the Entire Enterprise Volume Licensing Agreement (TPotEEVLA), which in our case is $500,000.
It is expressed as:
RR = TPotEEVLA / FURLPoUU
Crucially, you must be able to compute the number of units used in order to complete this ratio. If, as expected, every single clown wears their nose at least once during the period of the license agreement, then our Units Used is 10,000, our FURLPoUU is $1,000,000 and our TPotEEVLA is $500,000, which makes our RR 0.5.
Congratulations. If R0ML's Ratio is less than 1, it's a good deal. Proceed.
But… maybe the nose doesn't play. Not every clown's costume is an exact clone of the traditional, stereotypical image of a clown. Many are avant-garde. Perhaps this plentiful proboscis pledge was premature. Here, I must quote the originator of this theoretical framework directly:
What if the wheeze doesn't please?
What if the schnozz gives some pause?
In other words: what if some clowns don't wear their noses?
If we were to do this deal, and then ask around afterwards to find out that only 200 of our 10,000 clowns were to use their noses, then FURLPoUU comes out to 200 * $100, for a total of $20,000. In that scenario, RR is 25, which you may observe is substantially greater than 1.
If you do a deal where R0ML's ratio is greater than 1, then you are the bozo.
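A quick sketch of that arithmetic in Python, using the clown-nose numbers from above:

# R0ML's Ratio: total deal price divided by the full retail price of the units actually used.
def r0mls_ratio(deal_total, unit_retail_price, units_used):
    return deal_total / (unit_retail_price * units_used)

# Every one of the 10,000 clowns wears their nose: a good deal.
print(r0mls_ratio(500_000, 100, 10_000))  # 0.5

# Only 200 clowns wear their noses: you are the bozo.
print(r0mls_ratio(500_000, 100, 200))     # 25.0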
I apologize if I have belabored this point. As R0ML expressed in the email we exchanged about this many years ago,
I do not mind if you blog about it - and I don't mind getting the credit - although one would think it would be obvious.
And yeah, one would think this would be obvious? But I have belabored it because many discounted enterprise volume purchasing agreements still fail the R0ML's Ratio Bozo Test.2
In the case of clown noses, if you pay the discounted price, at least you get to keep the nose; maybe lightly-used clown noses have some resale value. But in software licensing or SaaS deals, once you've purchased the "discounted" software or service, once you have provisioned the "seats", the money is gone, and if your employees don't use it, then no value for your organization will ever result.
Measuring number of units used is very important. Without this number, you have no idea if you are a bozo or not.
It is often better to give your individual employees a corporate card and allow them to make arbitrary individual purchases of software licenses and SaaS tools, with minimal expense-reporting overhead; this will always keep R0ML's Ratio at 1.0, and thus, you will never be a bozo.
It is always better to do that the first time you are purchasing a new software tool, because the first time making such a purchase you (almost by definition) have no information about "units used" yet. You have no idea - you cannot have any idea - if you are a bozo or not.
If you don't know who the bozo is, it's probably you.
Acknowledgments
Thank you for reading, and especially thank you to my patrons who are supporting my writing on this blog. Of course, extra thanks to dad for, like, having this idea and doing most of the work here beyond my transcription. If you like my dad's ideas and you'd like to post more of them, or you'd like to support my various open-source endeavors, you can support my work as a sponsor!
09 Aug 2025 4:41am GMT
08 Aug 2025
Planet Twisted
Glyph Lefkowitz: The Best Line Length
What's a good maximum line length for your coding standard?
This is, of course, a trick question. By posing it as a question, I have created the misleading impression that it is a question, but Black has selected the correct number for you; it's 88, which is obviously very lucky.
Thanks for reading my blog.
OK, OK. Clearly, there's more to it than that. This is an age-old debate on the level of "tabs versus spaces". So contentious, in fact, that even the famously opinionated Black does in fact let you change it.
Ancient History
One argument that certain silly people1 like to make is "why are we wrapping at 80 characters like we are using 80 character teletypes, it's the 2020s! I have an ultrawide monitor!". The implication here is that the width of 80-character terminals is an antiquated relic, based entirely around the hardware limitations of a bygone era, and modern displays can put tons of stuff on one line, so why not use that capability?
This feels intuitively true, given the huge disparity between ancient times and now: on my own display, I can comfortably fit about 350 characters on a line. What a shame, to have so much room for so many characters in each line, and to waste it all on blank space!
But... is that true?
I stretched out my editor window all the way to measure that '350' number, but I did not continue editing at that window width. For a more comfortable editing experience, I switched back into writeroom mode, which emulates a considerably more writerly application and limits each line to 92 characters, regardless of frame width.
You've probably noticed this too. Almost all sites that display prose of any kind limit their width, even on very wide screens.
As silly as that tiny little ribbon of text running down the middle of your monitor might look on a full-screened, stereotypical news site or blog, full-screen a site that doesn't set that width limit and, even though you can now use all that space, it will look extremely, almost unreadably, bad.
Blogging software does not set a column width limit on your text because of some 80-character-wide accident of history in the form of a hardware terminal.
Similarly, if you really try to use that screen real estate to its fullest for coding, and start editing 200-300 character lines, you'll quickly notice it starts to feel just a bit weird and confusing. It gets surprisingly easy to lose your place. Rhetorically the "80 characters is just because of dinosaur technology! Use all those ultrawide pixels!" talking point is quite popular, but practically people usually just want a few more characters worth of breathing room, maxing out at 100 characters, far narrower than even the most svelte widescreen.
So maybe those 80 character terminals are holding us back a little bit, but... wait a second. Why were the terminals 80 characters wide in the first place?
Ancienter History
As this lovely Software Engineering Stack Exchange post summarizes, terminals were probably 80 characters because teletypes were 80 characters, and teletypes were probably 80 characters because punch cards were 80 characters, and punch cards were probably 80 characters because that's just about how many typewritten characters fit onto one line of a US-Letter piece of paper.
Even before typewriters, consider the average newspaper: why do we call a regularly-occurring featured article in a newspaper a "column"? Because broadsheet papers were too wide to have only a single column; they would always be broken into multiple! Far more aggressive than 80 characters, columns in newspapers typically have 30 characters per line.
The first newspaper printing machines were custom designed and could have used whatever width they wanted, so why standardize on something so narrow?3
Science!
There has been a surprising amount of scientific research around this issue, but in brief, there's a reason here rooted in human physiology: when you read a block of text, you are not consciously moving your eyes from word to word like you're dragging a mouse cursor, repositioning continuously. Human eyes reading text move in quick bursts of rotation called "saccades". In order to quickly and accurately move from one line of text to another, the start of the next line needs to be clearly visible in the reader's peripheral vision in order for them to accurately target it. This limits the angle of rotation that the reader can perform in a single saccade, and, thus, the length of a line that they can comfortably read without hunting around for the start of the next line every time they get to the end.
So, 80 (or 88) characters isn't too unreasonable for a limit. It's longer than 30 characters, that's for sure!
But, surely that's not all, or this wouldn't be so contentious in the first place?
Caveats
The screen is wide, though.
The ultrawide aficionados do have a point, even if it's not really the simple one about "old terminals" they originally thought. Our modern wide-screen displays are criminally underutilized, particularly for text. Even adding in the big chunky file, class, and method tree browser over on the left and the source code preview on the right, a brief survey of a Google Image search for "vs code" shows a lot of editors open with huge, blank areas on the right side of the window.
Big screens are super useful as they allow us to leverage our spatial memories to keep more relevant code around and simply glance around as we think, rather than navigate interactively. But it only works if you remember to do it.
Newspapers allowed us to read a ton of information in one sitting with minimum shuffling by packing in as many as six columns of text. You could read a column to the bottom of the page, back to the top, and down again, several times.
Similarly, books fill both of their opposed pages with text at the same time, doubling the amount of stuff you can read at once before needing to turn the page.
You may notice that reading text in a book, even in an ebook app, is more comfortable than reading a ton of text by scrolling around in a web browser. That's because our eyes are built for saccades, and repeatedly tracking the continuous smooth motion of the page as it scrolls to a stop, then re-targeting the new fixed location to start saccading around from, is literally more physically strenuous on your eye muscles!
There's a reason that the codex was a big technological innovation over the scroll. This is a regression!
Today, the right thing to do here is to make use of horizontally split panes in your text editor or IDE, and just make a bit of conscious effort to set up the appropriate code on screen for the problem you're working on. However, this is a potential area for different IDEs to really differentiate themselves, and build multi-column continuous-code-reading layouts that allow for buffers to wrap and be navigable newspaper-style.
Similarly, modern CSS has shockingly good support for multi-column layouts, and it's a shame that true multi-column, page-turning layouts are so rare. If I ever figure out a way to deploy this here that isn't horribly clunky and fighting modern platform conventions like "scrolling horizontally is substantially more annoying and inconsistent than scrolling vertically", maybe I will experiment with such a layout on this blog one day. Until then… just make the browser window narrower so other useful stuff can be in the other parts of the screen, I guess.
Code Isn't Prose
But, I digress. While I think that columnar layouts for reading prose are an interesting thing more people should experiment with, code isn't prose.
The metric used for ideal line width, which you may have noticed if you clicked through some of those Wikipedia links earlier, is not "character cells in your editor window", it is characters per line, or "CPL".
With an optimal CPL somewhere between 45 and 95, a code-line-width of somewhere around 90 might actually be the best idea, because whitespace uses up your line-width budget. In a typical object-oriented Python program2, most of your code ends up indented by at least 8 spaces: 4 for the class scope, 4 for the method scope. Most likely a lot of it is 12, because any interesting code will have at least one conditional or loop. So, by the time you're done wasting all that horizontal space, a max line length of 90 actually looks more like a maximum of 78... right about that sweet spot from the US-Letter page in the typewriter that we started with.
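To make that indentation arithmetic concrete, here is a deliberately plain sketch with hypothetical names; the "typical" line sits 12 columns in, so an 88- or 90-character limit leaves it only roughly 76-78 usable columns:

```python
class OrderReport:                                  # class scope: +4 columns
    def total_for_region(self, orders, region):     # method scope: +4 more
        total = 0                                   # plain method body: 8 in
        for order in orders:                        # one loop: +4 again
            total += order.amount_for(region)       # typical line: 12 columns in
        return total
```

Add one more conditional inside that loop and you are at 16 columns before the code even starts.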
What about soft-wrap?
In principle, source code is structured information, whose presentation could be fully decoupled from its serialized representation. Everyone could configure their preferred line width appropriate to their custom preferences and the specific physiological characteristics of their eyes, and the code could be formatted according to the language it was expressed in, and "hard wrapping" could be a silly antiquated thing.
The problem with this argument is the same as the argument against "but tabs are semantic indentation", to wit: nope, no it isn't. What "in principle" means in the previous paragraph is actually "in a fantasy world which we do not inhabit". I'd love it if editors treated code this way and we had a rich history and tradition of structured manipulations rather than typing in strings of symbols to construct source code textually. But that is not the world we live in. Hard wrapping is unfortunately necessary to integrate with diff tools.
So what's the optimal line width?
The exact, specific number here is still ultimately a matter of personal preference.
Hopefully, understanding the long history, science, and underlying physical constraints can lead you to select a contextually appropriate value for your own purposes that will balance ease of reading, integration with the relevant tools in your ecosystem, diff size, presentation in the editors and IDEs that your contributors tend to use, reasonable display in web contexts, on presentation slides, and so on.
But - and this is important - counterpoint:
No it isn't, you don't need to select an optimal width, because it's already been selected for you. It is 88.
Acknowledgments
Thank you for reading, and especially thank you to my patrons who are supporting my writing on this blog. If you like what you've read here and you'd like to read more of it, or you'd like to support my various open-source endeavors, you can support my work as a sponsor!
1. I love the fact that this message is, itself, hard-wrapped to 77 characters. ↩
2. Let's be honest; we're all object-oriented python programmers here, aren't we? ↩
3. Unsurprisingly, there are also financial reasons. More, narrower columns meant it was easier to fix typesetting errors and to insert more advertisements as necessary. But readability really did have a lot to do with it, too; scientists were looking at ease of reading as far back as the 1800s. ↩
08 Aug 2025 5:37am GMT
29 Nov 2024
Planet Plone - Where Developers And Integrators Write
Maurits van Rees: Lightning talks Friday
Bonnie Tyler Sprint
On 12 August 2026 there is a total solar eclipse that can be seen from Valencia, Spain. So we are organising a sprint there.
This conference
We had 291 participants, 234 in person and 57 online, from 13 Brazilian states (that is all of them) and 14 countries.
24.5 percent were women, up from 13% in 2013, so that has gone up, but we are not there yet. Thank you to PyLadies and Django Girls for making this happen.
We had more than 80 presenters, about 30 lightning talks, lots of talk in the hallways.
Thanks also to the team!
Ramiro Luz: Yoga time
Yoga exercise.
Rikupekka: University case student portal
We have a student portal at the university. But mostly:
Welcome to the University of Jyväskylä in Finland for the Plone Conference 2025, October 13-19!
Jakob: Beethovensprint
26-30 May 2025 in Bonn, Germany.
Afterwards, on May 30 and June 1 there will be FedCon in Bonn, a SciFi convention.
Piero/Victor: BYOUI
Add-ons first development with @plone/registry. See https://plone-registry.readthedocs.io/
It allows for development that is framework agnostic, so it is not only for Plone. It is around configuration that can be extended and injected, which is tricky in most javascript frameworks.
Imagine it.
Ana Dulce: 3D printing
For a difficult model I had to trust the process; it took a week, but it worked.
Renan & Iza: Python Brasil
We organised the Python Brasil conference from 16 to 23 October this year in Rio de Janeiro.
Next year 21-27 October in São Paulo.
Erico: Python Cerrado
31 July to 2 August 2025 is the next Python Cerrado conference.
29 Nov 2024 10:25pm GMT
Maurits van Rees: Paul Roeland: The value of longevity
Link to talk information on Plone conference website.
I work for the Clean Clothes Campaign: https://cleanclothes.org/
After three large disasters in factories in 2012 and 2013 with over 1000 deaths, it took three years to get an agreement with clothes manufacturers for 30 million dollars in compensation. It does not bring lives back, but it helps the survivors.
See Open Supply Hub for open data that we collected, for checking which brands are produced in which factories.
Documenting history matters. Stories must be told.
The global clothing industry is worth around 1.8 trillion dollars; if it were a country, that would put it in 12th place in the world. 75 million workers.
Our strongest weapon: backlinks. We have links from the OECD, the UN, Wikipedia, school curricula, and books. Those last two in particular never change, so you should never change URLs.
Plone: enable the sitemap, please, why not by default? Create a good robots.txt. I check Google Search Console weekly, looking for broken links. Tag early, tag often; a great tool, even if you have an AI do it.
Our website: started in 1998, written in Notepad; 2004 Dreamweaver; 2006 Bluefish; 2010 Joomla; 2013 Plone 4; 2020 Castle CMS (an opinionated distribution of Plone, but it does not really exist anymore); 2024 Plone 6 with Volto Light Theme (work in progress). Thank you kitconcept for all the help, especially Jonas.
Migrations are painful. Over the years we used wget to CSV to SQL to CSV, a Python script, a "Franken-mogrifier", and collective.exportimport.
Lessons learned: stable urls are awesome, migrations are painful. Please don't try to salvage CSS from your old site, just start fresh in your new system. Do not try to migrate composite pages or listings.
What if your website does not provide an export? Use wget; it still works and is better than httrack. sed/awk/regex are your friends. archivebox (WARC).
Document your steps for your own sanity.
To manage JSON, jq or jello can be used. sq is a Swiss army knife for JSON/SQL/CSV. emuto is a hybrid between jq and GraphQL.
Normalize import/export. We have `plone.exportimport` in core now.
In the future I would like a Plone exporter script that accepts a regex and exports only matching pages. Switch backends: ZODB, relstorage, nick, quantum-db. Sitewide search/replace/sed. Sneakernet is useful in difficult countries where you cannot send data over the internet: so export to a USB stick.
A backup is only a backup if it regularly gets restored so you know that it works.
- Keeping content and URL stability is a superpower.
- Assuming that export/import/backup/restore/migration are rare occurrences, is wrong.
- Quick export/import is very useful.
Do small migrations, treat them as maintenance. Don't be too far behind. A large migration once every five years will be costly. Do a small migration every year. Do your part. Clients should also do their part, by budgeting for this yearly. That is how budgeting works. Use every iteration to review custom code.
Make your sites live long and prosper.
29 Nov 2024 8:58pm GMT
Maurits van Rees: Fred van Dijk: Run Plone in containers on your own cluster with coolify.io
Link to talk information on Plone conference website.
Sorry, I ran out of time trying to set up https://coolify.io
So let's talk about another problem. Running applications (stacks) in containers is the future. Well: abstraction and isolation are the future, and containers are the current phase.
I am on the Plone A/I team, with Paul, Kim, and Erico. All senior sysadmins, so we kept things running. In 2022 we worked on containerisation. Kubernetes was the kool kid then, but Docker Swarm was easier. Check out Erico's training with the new cookieplone templates.
Doing devops well is hard. You have a high workload, but still need to keep learning new stuff to keep up with what is changing.
I want to plug Coolify, which is a fully open source product. "Self-hosting with super powers." The main developer, Andras Bacsal, believes in open source and 'hates' pay-by-usage cloud providers with a vengeance.
Coolify is still Docker Swarm based. We also want Kubernetes support. But we still need sysadmins. Someone will still need to install Coolify and keep it updated.
I would like to run an online DevOps course somewhere in January-March 2025: 4-6 meetings of 2 hours, maybe Friday afternoon. Talk through devops and sysadmin concepts, show Docker Swarm, try Coolify, etc.
29 Nov 2024 7:58pm GMT