23 May 2017

Planet Openmoko

Harald "LaF0rge" Welte: Power-cycling a USB port should be simple, right?

Every so often I happen to be involved in designing electronics equipment that's supposed to run reliably, unattended, in inaccessible locations, without any ability for "remote hands" to perform things like power-cycling or the like. I'm talking about really remote locations, possibly with no or only limited back-haul, and a very high cost of ever sending somebody there for maintenance.

Given that a lot of computer peripherals (chips, modules, ...) use USB these days, this is often some kind of embedded ARM (rarely x86) SoM or SBC, which is hooked up to a custom board that contains a USB hub chip as well as a number of peripherals.

One of the most important lessons I've learned from experience is: never trust reset signals/lines, always include power-switching capability. There are many chips and electronics modules on the market that either have no RESET at all, or claim to have a hardware RESET line which you later (painfully) discover to be just a GPIO polled by software, which can get stuck, leaving no way to really hard-reset the given component.

In the case of a USB-attached device (even if the USB only exists on a circuit board between two ICs), this is typically rather easy: a USB hub is generally capable of switching the power of its downstream ports. Many cheap USB hubs don't implement this at all, or implement only ganged switching, but if you carefully select your USB hub (or, on a custom PCB, your USB hub chip), you can make sure that it supports individual port power switching.
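
At the protocol level this is a standard hub class request (USB 2.0 specification, chapter 11): ClearPortFeature/SetPortFeature with the PORT_POWER feature selector, which is also what tools like uhubctl or the classic hub-ctrl.c issue. As a minimal sketch, assuming the github.com/google/gousb bindings and placeholder VID/PID/port values, it could look like this:

package main

import (
	"log"
	"time"

	"github.com/google/gousb"
)

const (
	reqClearFeature = 0x01 // bRequest CLEAR_FEATURE
	reqSetFeature   = 0x03 // bRequest SET_FEATURE
	featPortPower   = 8    // PORT_POWER feature selector (hub class)
	port            = 2    // downstream port to power-cycle (placeholder)
)

func main() {
	ctx := gousb.NewContext()
	defer ctx.Close()

	// Placeholder VID/PID of the hub on the custom board.
	hub, err := ctx.OpenDeviceWithVIDPID(0x05e3, 0x0608)
	if err != nil || hub == nil {
		log.Fatalf("opening hub: %v", err)
	}
	defer hub.Close()

	// bmRequestType 0x23: host-to-device, class request, recipient
	// "other", i.e. the downstream port named in wIndex.
	const rt = 0x23

	if _, err := hub.Control(rt, reqClearFeature, featPortPower, port, nil); err != nil {
		log.Fatalf("port power off: %v", err)
	}
	time.Sleep(2 * time.Second) // give the peripheral time to fully discharge
	if _, err := hub.Control(rt, reqSetFeature, featPortPower, port, nil); err != nil {
		log.Fatalf("port power on: %v", err)
	}
}

Whether this actually cuts VBUS depends on the hub implementing per-port switching in hardware; hubs that only do ganged (or no) switching may accept the request without any effect, which is exactly why the hub selection above matters.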

Now the next step is how to actually use this from your (embedded) Linux system. It turns out to be harder than expected. After all, we're talking about a standard feature that has been present in the USB specifications since USB 1.x in the late 1990s. So the expectation is that it should be straightforward to do with any decent operating system.

I don't know how it is on other operating systems, but on Linux I couldn't really find a proper way to do this cleanly. For more details, please read my post to the linux-usb mailing list.

Why am I running into this now? Is it such a strange idea? I mean, power-cycling a device should be the simplest and most straightforward thing to do in order to recover from any kind of "stuck state" or other related issue. Logical enabling/disabling of the port, resetting the USB device via the USB protocol, etc. are all just "soft" forms of a reset which at best help with USB-related issues, but not with any other part of a USB device.
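
For contrast, the typical "soft" reset on Linux is the USBDEVFS_RESET ioctl on the device node, which re-enumerates the device but never touches its power rail - a wedged micro-controller behind the port will happily stay wedged. A small sketch (the bus/device path is a placeholder):

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// USBDEVFS_RESET = _IO('U', 20) from <linux/usbdevice_fs.h>
const usbdevfsReset = 0x5514

func main() {
	// Placeholder bus/device numbers; see lsusb for the real ones.
	f, err := os.OpenFile("/dev/bus/usb/001/004", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Re-enumerates the device: a "soft" reset only, power stays on.
	if err := unix.IoctlSetInt(int(f.Fd()), usbdevfsReset, 0); err != nil {
		log.Fatalf("USBDEVFS_RESET: %v", err)
	}
}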

And in the case of e.g. a USB-attached cellular modem, we're actually talking about a multi-processor system with multiple built-in micro-controllers, at least one DSP, and an ARM core that might run another Linux itself (to implement the USB gadget) - certainly complex enough software that you would want to be able to power-cycle it...

I'm curious what the response of the Linux USB gurus is.

23 May 2017 10:00pm GMT

17 May 2017

Planet Openmoko

Holger "zecke" Freyther: CAMEL and protocol design

Today I want to share the pain of running a production 3GPP TCAP/MAP/CAP system, and of network protocol design in general. The excellent Free Software ASN1/TCAP/MAP/CAP stack (made possible by the Pharo live programming environment) that I helped create is in heavy production use (powering standard off-the-shelf components like an SGSN or an AuC, as well as non-standard components to enable new business cases) and sees roaming traffic from a lot of networks. From time to time something odd comes up.

In TCAP/MAP/CAP the messages, but also the Requests/Responses and the possible Errors, are defined using ASN1. Over the last decades ETSI and 3GPP have made various major versions and minor releases (e.g. adding new optional attributes to requests/responses/errors). The biggest new standard is CAMEL, and it is so big and complicated that it was specified in four phases (each phase with its own versions of the ApplicationContext; think of it as a versioned entry point into the definitions of all messages and RPC calls).

One issue in supporting a specific module version (application-context-name) is to find the right minor release of 3GPP (either the newest or the oldest for that ACN). Then it is a matter of copying and pasting the ASN1 definitions from either a PDF or a Word document into individual files... and after that is done, one can fix the broken imports (or modify the ASN1 parser to do a global look-up) and the typos in elements.

This artificial barrier creates two issues for people implementing components that use MAP/CAP. Some use inferior ASN1 tools or can't be bothered to create the input files and decide to hardcode the message content (after all, BER/DER is more or less just nested TLV entries). The second issue is related to time/effort as well: when creating the CAMEL ASN1 files I didn't want to do the work four times (once for each phase) and searched for shortcuts too.
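
To illustrate why the hardcoding shortcut looks so tempting: a decoder for the nested TLV structure fits on one page. Here is a simplified sketch (single-byte tags and definite lengths only; real BER also allows multi-byte tags and indefinite lengths, which is where such shortcuts start to fall apart):

package main

import (
	"errors"
	"fmt"
)

type TLV struct {
	Tag      byte
	Value    []byte
	Children []TLV
}

func parse(buf []byte) ([]TLV, error) {
	var out []TLV
	for len(buf) > 0 {
		if len(buf) < 2 {
			return nil, errors.New("truncated TLV header")
		}
		tag := buf[0]
		length := int(buf[1])
		buf = buf[2:]
		if length >= 0x80 { // long form: next N bytes hold the length
			n := length & 0x7f
			if n == 0 || n > 3 || n > len(buf) {
				return nil, errors.New("unsupported length form")
			}
			length = 0
			for _, b := range buf[:n] {
				length = length<<8 | int(b)
			}
			buf = buf[n:]
		}
		if length > len(buf) {
			return nil, errors.New("truncated TLV value")
		}
		t := TLV{Tag: tag, Value: buf[:length]}
		if tag&0x20 != 0 { // constructed: value is itself a TLV sequence
			kids, err := parse(t.Value)
			if err != nil {
				return nil, err
			}
			t.Children = kids
		}
		out = append(out, t)
		buf = buf[length:]
	}
	return out, nil
}

func main() {
	// [0] (context-specific, constructed) wrapping one OCTET STRING "hi"
	msg := []byte{0xa0, 0x04, 0x04, 0x02, 'h', 'i'}
	tlvs, err := parse(msg)
	fmt.Println(tlvs, err)
}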

The first issue materialized itself as equipment sending completely broken messages or not sending mandatory(!) elements. So what happens when a big telco sends you a message the stack can't decode, and you look up the oldest and newest releases defining this ACN only to see that the element being parsed was always mandatory? Right, one adds an OPTIONAL modifier to be able to move forward…

The second issue is on me, though. I started with a set of CAMEL phase3 files and assumed that only the operations (and their arguments/responses) would differ across CAMEL phases, but that the supporting structs they use would stay the same. My assumption (and this brings us to protocol design) was that, besides the versioning of the module, they would be conservative and extend supporting types in a forward-compatible way, so I integrated phase2 and phase1 into the same set of files.

And then reality set in, and the logs of the system showed a message that caused an exception during parsing (normally that only happens for the first kind of issue). An extension to the Request structure was changed in a way that is not forward compatible. Let's have a look:

InitialDPArgExtension ::= SEQUENCE {
-  naCarrierInformation           [0] NACarrierInformation OPTIONAL,
-  gmscAddress                    [1] ISDN-AddressString OPTIONAL,
-  ...
+  gmscAddress                    [0] ISDN-AddressString OPTIONAL,
+  *more new optional elements*
+  ...,
+  enhancedDialledServicesAllowed [11] NULL OPTIONAL,
+  *more elements after the extension marker*
}

So one element (naCarrierInformation) was removed, every following element was renumbered, and the extension marker was moved further down. In theory the InitialDPArgExtension name binding exists once in the phase2 definition and once in phase3, and 3GPP had every right to define a new binding with a different layout. The engineering question is whether this was a good decision.

A change in application-context allows removing some old cruft to make room for new. The tag space might be considered a scarce resource, and making room is saving a resource. On the other hand, in the history of GSM no other struct has ever run out of tags, and there are various other approaches to the problem. The above is already an extension to an extension, and the step to an extension of an extension of an extension doesn't seem so absurd anymore.
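
To make the consequence concrete: the tag is all a decoder sees on the wire, so the same encoded element resolves to different fields depending on which phase's definition is loaded. A contrived sketch (the tag tables below just abbreviate the two definitions above):

package main

import "fmt"

// Context tag -> field name, abbreviated from the two definitions above.
var phase2 = map[int]string{
	0: "naCarrierInformation", // NACarrierInformation
	1: "gmscAddress",          // ISDN-AddressString
}

var phase3 = map[int]string{
	0: "gmscAddress", // same tag [0], different field and type
}

func main() {
	const wireTag = 0 // a [0] element as seen in an incoming message
	fmt.Println("phase2 decoder sees:", phase2[wireTag])
	fmt.Println("phase3 decoder sees:", phase3[wireTag])
}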

So please think of forward compatibility when designing protocols, think of the implementor and make the definitions machine-readable, and please get the imports right so one doesn't need to resort to a global symbol search. If you are having interesting core network issues related to TCAP, MAP and CAP, consider contacting me.

17 May 2017 2:23am GMT

06 May 2017

Planet Openmoko

Holger "zecke" Freyther: MariaDB Galera and custom health probe for Azure LoadBalancer

My Galera set-up on Kubernetes and the Azure LoadBalancer in front of it seem to work nicely, but one big TODO is to implement proper health checks. If a node is down, in maintenance, or split from the network, it should not be part of the LoadBalancer. The Azure LoadBalancer has support for custom HTTP probes, and I wanted to write something very simple that handles the HTTP GET, opens a MySQL connection to the destination, and checks if it is connected to a primary. As this is about health checks, the code should be small and reliable.

To improve my Go(-lang) skills I decided to write my healthcheck in Go. And it seemed like a good idea: Go has a powerful HTTP package, a SQL API package and two MySQL implementations. The entire prototype is just about 72 lines (with comments and empty lines), and I think that qualifies as small. Prototyping the MySQL code took some iterations, but in general it went quite quickly. But how reliable is it? Go introduced the nice concept of a context.Context: any operation should be associated with a context, which is passed as an argument from one method to another. One can create a child context and associate it with a deadline (absolute time) or a timeout (relative), and there is a way to cancel it.

I grabbed the Context from the HTTP Request, added a timeout and called a function to do the MySQL check. Wow, that was easy. Some polish to parse the parameters from the CLI and I was ready to deploy it! But let's see how reliable it is.
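
The shape of the prototype was roughly the following - a condensed re-creation rather than the original 72 lines, assuming the github.com/go-sql-driver/mysql driver and Galera's wsrep_cluster_status variable to decide whether the node is part of a primary component (DSN and listen address are placeholders):

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

const dsn = "probe:secret@tcp(10.0.0.5:3306)/?timeout=2s"

func checkPrimary(ctx context.Context) error {
	// Note: sql.Open itself takes no Context; the ctx below is only
	// honored by drivers that actually implement context support.
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	var name, value string
	err = db.QueryRowContext(ctx,
		"SHOW STATUS LIKE 'wsrep_cluster_status'").Scan(&name, &value)
	if err != nil {
		return err
	}
	if value != "Primary" {
		return fmt.Errorf("node state is %q, not Primary", value)
	}
	return nil
}

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := checkPrimary(ctx); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "OK")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}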

I imagined the following error conditions:

  1. The destination IP is reachable but nothing is listening on the port. The TCP connection will fail quickly (SYN -> RST,ACK)
  2. The destination IP ends in a blackhole (no RST,ACK received). One would run into a long connect timeout
  3. The Galera node (or the machine hosting it) is overloaded. While the connect succeeds, the authentication or a query might stall
  4. The Galera node is split and not a master

The first and fourth error conditions are easy to test/simulate and trivial to implement properly. I then moved to the third one. My first choice was to simulate an infinitely slow Galera node, which I did by using nc -l 3006 to accept a TCP connection and then send nothing. I made a health probe request and waited… and waited… no timeout. Not after 2s as programmed in the context, not after 2 min, and not after… (okay, I gave up after 30 min). Pretty discouraging!

After some reading and browsing I saw an open PR to add context.Context support to the MySQL backend. I modified my import, ran go get to fetch it, go build, and retested. Okay, that didn't work either. So let's try the other MySQL implementation: again change the package imports, go get, go build, and retest. I first picked the wrong package name, but even after picking the right one this driver failed to parse the database URL. At that point I decided to go back to the first implementation and have a deeper look.

So while many of the SQL API methods take a Context as an argument, Open does not. Open says it might or might not connect to the database, and in the case of MySQL it does connect to it. Is there a workaround? I could spawn a Go routine and do a selective receive on either the result or a timeout. While this would make it possible to respond to the HTTP request, it creates two issues: first, one can't cancel Go routines, so I would leak memory; worse, I might run into a connection limit of the Galera node. What about other workarounds? It seems I can play with the custom readTimeout and writeTimeout parameters and at least limit the timeout per I/O operation. I guess it takes a bit of tuning to find good values for a busy system, and let's hope that context.Context will be used in more places in the future.
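
For the record, with the go-sql-driver/mysql driver those knobs live in the DSN, so the per-I/O bound looks something like this (the 2s values are guesses that, as said, would need tuning for a busy system):

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// timeout bounds the TCP connect; readTimeout/writeTimeout bound each
	// individual I/O operation, so a blackholed or stalled node surfaces
	// as an error after a few seconds instead of hanging forever.
	dsn := "probe:secret@tcp(10.0.0.5:3306)/?timeout=2s&readTimeout=2s&writeTimeout=2s"
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connected within the configured bounds")
}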

06 May 2017 1:45pm GMT