21 Feb 2026
Planet Debian
Thomas Goirand: Seamlessly upgrading a production OpenStack cluster in 4 hours, with a 2K-line shell script

tl;dr:
To the question "what does it take to upgrade OpenStack?", my personal answer is: less than 2K lines of dash script. Here I'll describe its internals, and why I believe it is the correct solution.
Why write this blog post
During FOSDEM 2024, I was asked "how do you handle upgrades?". I answered with a big smile and a short "with a very small shell script", as I couldn't explain in 2 minutes how it was done. But just saying "it is great this way" doesn't give readers enough hints to trust it. Why and how did I do it the right way? This blog post is an attempt to answer that question more thoroughly.
Upgrading OpenStack in production
I wrote this script maybe 2 or 3 years ago, though I'm only blogging about it today, because… I did such an upgrade on a public cloud in production last Tuesday evening (ie: the first region of the Infomaniak public cloud). I'd say the cluster is moderately large (as of today: about 8K+ VMs running, 83 compute nodes, 12 network nodes, … for a total of 10880 physical CPU cores and 125 TB of RAM if I only count the compute servers). It took "only" 4 hours to do the upgrade (though I already wrote some more code to speed this up for next time…). It went super smoothly, without a glitch. I mostly just sat there, reading the script output… and went to bed once it finished running. The next day, all my colleagues at Infomaniak were nicely congratulating me on how smooth it went (a big thanks to all of you who did). I couldn't have dreamed of a smoother upgrade! :)
Still not impressed? Boring read? Yeah… let's dive into more technical details.
Intention behind the implementation
My script isn't perfect. I won't ever pretend it is. But at least, it minimizes the downtime of every OpenStack service. It also is a "by the book" implementation of what's written in the OpenStack docs, following every piece of upstream advice. As a result, it is fully seamless for some OpenStack services, and as HA as OpenStack can be for others. The upgrade process is of course idempotent and can be re-run in case of failure. Here's why.
General idea
My upgrade script does things in a certain order, respecting what the OpenStack documentation says about upgrades. It basically does:
- Upgrade all dependencies
- Upgrade all services one by one, across the whole cluster
Installing dependencies
The first thing the upgrade script does is:
- disable puppet on all nodes of the cluster
- switch the APT repository
- apt-get update on all nodes
- install library dependencies on all nodes
For this last step, a static list of all needed dependency upgrades is maintained for each OpenStack release, and for each type of node. Then, for every package in this list, the script checks with dpkg-query that the package is really installed, and with apt-cache policy that it really is going to be upgraded (maybe there's an easier way to do this?). This way, no package is marked as manually installed by mistake during the upgrade process. This ensures that "apt-get --purge autoremove" really does what it should, and that the script is really idempotent.
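For the curious, here is a minimal sketch of what such a check can look like in POSIX shell. The variable names are mine, not the ones from the actual script, and the real implementation may well differ:

# Hypothetical sketch: only upgrade packages from the static list that are
# really installed and really have a newer candidate version available.
TO_UPGRADE=""
for pkg in $DEPENDENCY_LIST ; do
    dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null | grep -q "ok installed" || continue
    installed=$(apt-cache policy "$pkg" | awk '/Installed:/ {print $2}')
    candidate=$(apt-cache policy "$pkg" | awk '/Candidate:/ {print $2}')
    [ "$installed" != "$candidate" ] && TO_UPGRADE="$TO_UPGRADE $pkg"
done
[ -n "$TO_UPGRADE" ] && apt-get install --yes $TO_UPGRADE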
The idea then, is that once all dependencies are installed, upgrading and restarting the leaf packages (ie: OpenStack services like Nova, Glance, Cinder, etc.) is very fast, because the apt-get command doesn't need to install all the dependencies. So at this point, doing "apt-get install python3-cinder" for example (which will also, thanks to dependencies, upgrade cinder-api and cinder-scheduler if it's on a controller node) only takes a few seconds. This principle applies to all nodes (controller nodes, network nodes, compute nodes, etc.), which helps a lot in speeding up the upgrade and reducing unavailability.
hapc
At its core, the oci-cluster-upgrade-openstack-release script uses haproxy-cmd (ie: /usr/bin/hapc) to drain each to-be-upgraded API server out of haproxy. Hapc is a simple Python wrapper around the haproxy admin socket: it sends commands to it through an easy-to-understand CLI. So it is possible to reliably upgrade an API service only after it has been drained away. Draining means waiting for the last query to finish and the client to disconnect before giving the backend server any more queries. If you do not know hapc / haproxy-cmd, I recommend trying it: it's going to be hard for you to stop using it once you've tested it. Its bash-completion script makes it VERY easy to use, and it is helpful in production. But not only that: it is also nice to have when writing this type of upgrade script. Let's dive into haproxy-cmd.
Example on how to use haproxy-cmd
Let me show you. First, ssh into one of the 3 controllers and find where the virtual IP (VIP) is located, with "crm resource locate openstack-api-vip" or with a (simpler) "crm status". Let's ssh to the server that holds the VIP, and now, let's drain it away from haproxy.
$ hapc list-backends
$ hapc drain-server --backend glancebe --server cl1-controller-1.infomaniak.ch --verbose --wait --timeout 50
$ apt-get install glance-api
$ hapc enable-server --backend glancebe --server cl1-controller-1.infomaniak.ch
Upgrading the control plane
My upgrade script leverages hapc just like above. For each OpenStack project, it's done in this order, first on the node holding the VIP:
- "hapc drain-server" of the API, so haproxy gracefully stops querying it
- stop all services on that node (including non-API services): stop, disable and mask with systemd.
- upgrade that service's Python code. For example: "apt-get install python3-nova", which will also pull nova-api, nova-conductor, nova-novncproxy, etc., but the services won't start automatically as they've been stopped + disabled + masked in the previous bullet point.
- perform the db_sync so that the db is up-to-date [1]
- start all services (unmask, enable and start with systemd)
- re-enable the API backend with "hapc enable-server"
Starting at [1], the risk is that the other nodes see the new version of the database schema while still running an old version of the code that isn't compatible with it. But this state doesn't last long, because the next step is to take care of the other (usually 2) nodes of the OpenStack control plane:
- "hapc drain-server" of the API of the other 2 controllers
- stop of all services on these 2 controllers [2]
- upgrade of the packages
- start of all services
So while there's technically zero downtime, some issues may still happen between [1] and [2] above, because the new DB schema and the old code (both for the API and the other services) are up and running at the same time. These cases are however supposed to be rare (some OpenStack projects don't even have DB changes between releases, and the old code often continues to work for most queries against the upgraded DB), and the cluster only stays in that state for a very short time, so that's fine, and better than a full API downtime.
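To make this more concrete, here is a rough sketch of that sequence for a single service on the node holding the VIP, using Glance as in the earlier example. This is an illustration, not the real script: the package names, backend name and db-sync command all vary from one OpenStack project to another.

# 1. Gracefully drain the API backend out of haproxy.
hapc drain-server --backend glancebe --server cl1-controller-1.infomaniak.ch --wait --timeout 50
# 2. Stop, disable and mask the service units on this node.
systemctl stop glance-api
systemctl disable glance-api
systemctl mask glance-api
# 3. Upgrade the Python code (dependencies are already in place, so this is fast).
apt-get install --yes python3-glance glance-api
# 4. Sync the database schema to the new release.
glance-manage db_sync
# 5. Unmask, enable and start the services again.
systemctl unmask glance-api
systemctl enable glance-api
systemctl start glance-api
# 6. Put the backend back into the haproxy pool.
hapc enable-server --backend glancebe --server cl1-controller-1.infomaniak.ch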
Satellite services
Then there are the satellite services that need to be upgraded, for Neutron, Nova, Cinder and friends. Nova is the least offender, as it has all the code to rewrite JSON object schemas on the fly so that it continues to work during an upgrade. Though it's a known issue that Cinder doesn't have that feature (last time I checked), and it's probably the same for Neutron (maybe recent-ish versions of OpenStack do use oslo.versionedobjects?). Anyway, upgrades on these nodes are done right after the control plane for each service.
Parallelism and upgrade timings
As we're dealing with potentially hundreds of nodes per cluster, a lot of operations are performed in parallel. I chose to simply use the shell's & together with some "wait" calls so that not too many jobs run in parallel. For example, when disabling puppet over SSH on all nodes, this is done 24 nodes at a time, which is fine. The batch size depends on the type of thing that's being done: while it's perfectly OK to disable puppet on 24 nodes at the same time, it is not OK to do that with the Neutron services. In fact, each time a Neutron agent is restarted, the script explicitly waits for 30 seconds. This conveniently avoids a hailstorm of messages in RabbitMQ, and keeps neutron-rpc-server from becoming too busy. All of this waiting is necessary, and it is one of the reasons why it can sometimes take that long to upgrade a (moderately big) cluster.
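For reference, the & / wait pattern is really just plain POSIX shell. Here is a minimal sketch of what batching 24 nodes at a time can look like; the node list, batch size and the puppet command are illustrative, not copied from the real script:

# Hypothetical sketch: run a command on all nodes, at most 24 at a time.
BATCH_SIZE=24
count=0
for node in $ALL_NODES ; do
    ssh "$node" "puppet agent --disable 'OpenStack upgrade in progress'" &
    count=$((count + 1))
    if [ "$count" -ge "$BATCH_SIZE" ] ; then
        wait
        count=0
    fi
done
wait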
Not using config management tooling
Some of my colleagues would have preferred that I used something like Ansible. Whatever: there's no reason to use such a tool if the idea is just to run some shell commands on every server. It is way more efficient (in terms of programming) to just use bash / dash to do the work. And if you want my point of view about Ansible: using YAML for this kind of programming would be crazy. YAML is simply not suited to a job where if, case, and loops are needed. I am well aware that Ansible has workarounds and that it could be done, but it wasn't my choice.
21 Feb 2026 12:44am GMT
20 Feb 2026
Planet Debian
Bits from Debian: Proxmox Platinum Sponsor of DebConf26

We are pleased to announce that Proxmox has committed to sponsor DebConf26 as a Platinum Sponsor.
Proxmox develops powerful, yet easy-to-use open-source server solutions. The comprehensive open-source ecosystem is designed to manage diverse IT landscapes, from single servers to large-scale distributed data centers. Our unified platform integrates server virtualization, easy backup, and rock-solid email security, ensuring seamless interoperability across the entire portfolio. With the Proxmox Datacenter Manager, the ecosystem also offers a "single pane of glass" for centralized management across different locations.
Since 2005, all Proxmox solutions have been built on the rock-solid Debian platform. We are proud to return to DebConf26 as a sponsor because the Debian community provides the foundation that makes our work possible. We believe in keeping IT simple, open, and under your control.
Thank you very much, Proxmox, for your support of DebConf26!
Become a sponsor too!
DebConf26 will take place from July 20th to 25th 2026 in Santa Fe, Argentina, and will be preceded by DebCamp, from July 13th to 19th 2026.
DebConf26 is accepting sponsors! Interested companies and organizations may contact the DebConf team through sponsors@debconf.org, and visit the DebConf26 website at https://debconf26.debconf.org/sponsors/become-a-sponsor/.
20 Feb 2026 5:26pm GMT
Reproducible Builds (diffoscope): diffoscope 313 released
The diffoscope maintainers are pleased to announce the release of diffoscope version 313. This version includes the following changes:
[ Chris Lamb ]
* Don't fail the entire pipeline if deploying to PyPI automatically fails.
[ Vagrant Cascadian ]
* Update external tool reference for 7z on guix.
You can find out more by visiting the project homepage.
20 Feb 2026 12:00am GMT
