30 Jul 2025

Bits from Debian: New Debian Developers and Maintainers (May and June 2025)

The following contributors got their Debian Developer accounts in the last two months:

The following contributors were added as Debian Maintainers in the last two months:

Congratulations!

30 Jul 2025 12:00pm GMT

Steinar H. Gunderson: Superimposed codes, take three

After I wrote last week that OEIS A286874 would stop at a(12) and that computing (verifying) a(13) would take about 4,000-5,000 CPU years, the changes have finally been approved, and… the sequence includes a(13) = 26. What happened?

Well, first of all, I am indeed not a mathematical genius (the last post even forgot the "not"); I had a stupid conversion error in the estimate, which accounted for a factor of 25 or so. But the rest came from actual speedups.

First of all, I improved one of the existing symmetry detectors a bit (the one described last in the previous post was not fully rejecting the possible symmetries when multiple new bits were introduced in one value). But I also made a more universal symmetry detector: if switching the order of certain neighboring bits and re-sorting the sequence makes it lexicographically smaller, we can abort the search. This is pretty expensive and only rejects ~5% of candidates, so it's only worth it at higher levels, but it's much cheaper than checking all n! arbitrary permutations and catches maybe 90% of a full rejection. (Also, if you can reject 5% at multiple levels, those percentages compound. We're down from hundreds of thousands of duplicate solutions to only a bit over 100, so the amount of speedup still available from reducing symmetries is rapidly dwindling.)
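
Schematically, the check looks something like the following. This is a minimal C++ sketch under my own naming, assuming (for illustration only) that each value fits in a 64-bit mask and that the sequence is kept sorted:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Swap bit columns b and b+1 in every value, re-sort, and compare.
    // If the transformed sequence sorts lexicographically smaller, the
    // current branch is a relabeling of one the search will visit
    // elsewhere, so it can be aborted.
    bool swapped_is_smaller(const std::vector<uint64_t>& seq, int b) {
        std::vector<uint64_t> alt(seq.size());
        for (std::size_t i = 0; i < seq.size(); ++i) {
            uint64_t v = seq[i];
            uint64_t lo = (v >> b) & 1;
            uint64_t hi = (v >> (b + 1)) & 1;
            // Clear both bits, then write them back swapped.
            v &= ~((uint64_t{1} << b) | (uint64_t{1} << (b + 1)));
            v |= (hi << b) | (lo << (b + 1));
            alt[i] = v;
        }
        std::sort(alt.begin(), alt.end());
        return alt < seq;  // lexicographic comparison of sorted sequences
    }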

Also, surprisingly to me, before going on to run the next level, doing a population count to check if there were too few bits left to ever reach a solution was seemingly a large win (e.g., we have three values so far, but only 21 bits left; we can never generate a sequence longer than 24 even if all the stars align, so we can abort immediately). You would think that this counting, which takes very real CPU time even with vectorization, wouldn't be worth it compared to just running through the base layers of the recursion very quickly, but evidently, it is by a large margin. I guess it's a very common case to have many more than 1 bit left but fewer than the 26-n needed, and it also means you can just stop iterating a bit before you get to the end.
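
In code, the pruning test is cheap to express; here is a hedged sketch (the names are mine, and the real implementation will be vectorized differently):

    #include <bit>
    #include <cstdint>
    #include <vector>

    // candidates: the big bit vector of values that can still extend the
    // sequence. If even taking every remaining candidate cannot reach the
    // target length, there is no point in recursing further.
    bool can_still_reach(const std::vector<uint64_t>& candidates,
                         int chosen_so_far, int target_len) {
        int remaining = 0;
        for (uint64_t word : candidates) {
            remaining += std::popcount(word);
            // Early out once the bound is provably met.
            if (chosen_so_far + remaining >= target_len) return true;
        }
        return false;
    }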

But perhaps the most impactful optimization was a microoptimization. Recall that we spent most of our time ANDing 8192-bit vectors (which would be 16384-bit vectors for a(13)) with each other. A look at the performance metrics suggested that RAM bandwidth was effectively maxed out, with ~80% of the theoretical bandwidth in use; only faster RAM or more memory channels would have made a reasonable dent in the performance of this kind of architecture.
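
For context, the hot loop itself is nothing more exotic than this (a sketch; the real code presumably works with explicit SIMD over the full-width sets):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Intersect the current candidate set with the pregenerated
    // compatibility vector of the value we just picked. For 8192-bit sets
    // this is 128 words per call; the ALU work is trivial, so the loop
    // runs only as fast as RAM can feed it.
    void intersect(std::vector<uint64_t>& candidates,
                   const std::vector<uint64_t>& compat) {
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            candidates[i] &= compat[i];
        }
    }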

But pretty early, most of those bits will be zero. If you've already decided on the first five values in a sequence, you will not have 8187 options left; in most cases, you'll have more like 300-400. And since the bit sets only ever shrink, we can simply compress away all those known zeros. For most of our purposes, it doesn't really matter what each bit signifies (an important exception is the point where we have a valid solution and need to print it out, but it's not hard to store the mapping), as we mostly use the values for looking up pregenerated vectors to AND together. This means that when we start a new sub-job, we can find which future values are possible, and then map those into new numbers 0 through 511 (or whatever). That lets us use 512-bit vectors instead of 8192-bit vectors, with all the obvious advantages: less ALU work, less memory traffic, and better cache locality. (It's interesting that we started out extremely ALU-bound, then moved to being very RAM-bound, and then ended up in fairly normal territory.)
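
Building the mapping is a simple scan over the set bits; roughly like this (a sketch with hypothetical names):

    #include <bit>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Collect the indices of the bits still set in the candidate set, so
    // that the k-th surviving value gets compact index k. The returned
    // mapping is kept around so a finished solution can be translated
    // back to the original numbering for printing.
    std::vector<uint32_t> build_mapping(const std::vector<uint64_t>& candidates) {
        std::vector<uint32_t> new_to_old;
        for (std::size_t w = 0; w < candidates.size(); ++w) {
            uint64_t word = candidates[w];
            while (word != 0) {
                int bit = std::countr_zero(word);
                new_to_old.push_back(uint32_t(w * 64 + bit));
                word &= word - 1;  // clear the lowest set bit
            }
        }
        return new_to_old;  // if size() <= 512, 512-bit vectors suffice
    }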

Of course, there are situations where you could have more than 512 valid values. In that case, you can either recompile with larger bit sets (typically a multiple of 128, to get good use of SIMD), or you can split into smaller sub-jobs; find all valid ways of extending the sequence by one element (trivial; we already did that to make the bit sets), and then make one job for each. This splitting is also good for variance; no longer do you have some sub-jobs that finish in milliseconds and some that require days.
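
The splitting itself is trivial once the compact mapping exists; something like this (again a sketch with my own names):

    #include <cstdint>
    #include <vector>

    struct Job {
        std::vector<uint32_t> prefix;  // chosen values, original numbering
    };

    // One sub-job per valid one-element extension of the current prefix.
    // new_to_old is the compact mapping from the previous sketch, i.e.
    // exactly the values that can still extend the sequence.
    std::vector<Job> split(const std::vector<uint32_t>& prefix,
                           const std::vector<uint32_t>& new_to_old) {
        std::vector<Job> jobs;
        jobs.reserve(new_to_old.size());
        for (uint32_t value : new_to_old) {
            jobs.push_back(Job{prefix});
            jobs.back().prefix.push_back(value);
        }
        return jobs;
    }

Every job then has roughly the same shape of remaining search space, which is where the variance improvement comes from.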

There are some downsides too, of course. In particular, we can no longer pregenerate one universal 8192*8192*8192-bit LUT (well, 8192*8191/2*8192); every sub-job needs to make its own set of LUTs before starting. But since this is O(n³) and we just cut n from 8192 to 512, it's not really a blocker (although of course far from free); and importantly, it cuts our total RAM usage. For n=8192, we already needed a bit over 32 GB (though shareable between all jobs), and each next element in the sequence (a(13), a(14), etc.) costs a factor of 8 extra, so it starts becoming a real problem fast. On the flip side, I think this extra precalc makes the algorithm much less amenable to a theoretical GPU implementation (~8 MB of private data per instance, as opposed to one large shared static pool of constants and then just 1 kB of state per instance), which would otherwise be nontrivial but probably possible (the problem itself is so parallel). Interestingly enough, it's possible to use bitslicing to speed up this precalc; I can't remember the last time I used that technique.
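
For the curious, the naive (non-bitsliced) version of that per-sub-job precalc would look roughly like the following; compatible(i, j, k) stands in for the actual superimposed-code condition on a triple of values, which I'm not spelling out here:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Per-sub-job LUT precalc in the compressed space: for every pair
    // (i, j) with i < j, a bit vector of the values k that remain legal
    // together with both. n is now at most 512 rather than 8192, so the
    // O(n^3) cost shrinks by a factor of ~4096. (Only the i < j half of
    // the table is filled; the full square is allocated for simplicity.)
    std::vector<std::vector<uint64_t>> build_lut(
            int n, bool (*compatible)(int, int, int)) {
        const std::size_t words = (std::size_t(n) + 63) / 64;
        std::vector<std::vector<uint64_t>> lut(
            std::size_t(n) * n, std::vector<uint64_t>(words, 0));
        for (int i = 0; i < n; ++i) {
            for (int j = i + 1; j < n; ++j) {
                auto& vec = lut[std::size_t(i) * n + j];
                for (int k = 0; k < n; ++k) {
                    if (compatible(i, j, k)) {
                        vec[k / 64] |= uint64_t{1} << (k % 64);
                    }
                }
            }
        }
        return lut;
    }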

All in all, it took only about 114 CPU-days (or, well, thread-days, as hyperthreading now makes sense again) to calculate a(13), which was eminently possible; and many of the optimizations came late in the process, so a rerun would be faster than that. So, could we get to a(14)? Well, maybe. I'm less convinced that it would be impossible than I was with a(13) earlier. :-) But I started looking at it, and it turns out there are literally trillions (possibly more!) of sub-jobs if you want to split deeply enough to get each down into the 512-bit range. And even at ~8 ms per core per job (ignoring the cost of splitting and just looking at the cost of processing the jobs themselves), it just becomes too unwieldy for me, especially since Postgres isn't really that great at storing billions of rows efficiently. But impossible? Definitely not.

30 Jul 2025 7:45am GMT

29 Jul 2025

Ravi Dwivedi: How to paste your password on your bank's website

If your bank is like mine, its website doesn't let you paste your password with a simple Ctrl+V. I tried the Don't Fuck With Paste extension in Firefox, which let me paste my bank account's profile password but not the login password.

Therefore, I asked for suggestions on Mastodon a couple of days ago and got some responses. The solution that worked for me was pressing Shift+Insert to paste the password. It worked in both LibreWolf and Firefox, and that's all I needed.

Furthermore, this paste-blocking behavior by bank websites pushes users toward memorable but insecure passwords. With this trick, you can choose a strong password for your bank account instead.

I prefer random, strong passwords generated with the password manager pass. It is freedom-respecting software, unlike the popular proprietary password managers promoted by YouTubers. Feel free to check out its webpage. The reason I use pass is that it stores all passwords locally (and optionally in a remote Git repository) in encrypted form, which can only be decrypted with your private GPG key.

29 Jul 2025 7:38am GMT