17 Sep 2014

Bastien Nocera: A follow up to yesterday's Videos new for 3.14

The more astute (or Wayland testing) amongst you will recognise mutter running a nested Wayland compositor. Yes, it means that Videos will work natively under Wayland.

Got to love indie films

It's not perfect, as I'm still seeing hangs within the Intel driver for a number of operations, but basic playback works, and the playback is actually within the same window and correctly hidden when in the overview ;)

17 Sep 2014 6:24pm GMT

16 Sep 2014

Bastien Nocera: Videos 3.14 features

We've added a few, but nonetheless interesting, features to Videos in GNOME 3.14.

Auto-rotation of videos

If you capture videos in portrait orientation on your phone, we are now able to rotate them automatically in the movie player, as well as in the thumbnails.

Better streaming

You can now seek anywhere inside streamed videos, even if we didn't download all the way to that point. That's particularly useful for long videos, or slow servers (or a combination of both).

Thumbnails generation

Finally, videos without thumbnails in your videos directory will have thumbnails automatically generated, without having to browse them in Files. This makes the first experience of videos more pleasing to the eye.

What's next?

We'll work on integrating Victor Toso's work on grilo plugins, to show information about the films or TV series on your computer, such as grouping episodes of a series together, and showing genres, covers and synopses for films.

With a bit of luck, we should also be able to provide you with more video content as well, through partners.

16 Sep 2014 2:39pm GMT

15 Sep 2014

Keith Packard: xserver-pending-fixes

A Forest of X Server Changes

We've got about another month left in the X server merge window for 1.17 and I've written a small set of fixes which haven't been reviewed yet for merging. I thought I'd advertise them a bit and see if I couldn't encourage a few of you to take a look and see if they're useful, correct and complete.

All of these are in my personal X server repository:

git://people.freedesktop.org/~keithp/xserver.git

Cleaning up the X Registry

Branch: registry-fixes

I'll bet most of you don't even know about this code. It serves as a database mapping various X enumerations to strings to aid in diagnostics. For the security extensions, SECURITY and XSELinux, it holds names for all of the requests, events and errors in the core protocol and all registered extensions. For X-Resource, it has the names of the registered resource types.

The X registry gets the request, event and error data from a file, "protocol.txt", which is installed in /usr/lib/xorg/protocol.txt on my machine. It gets the resource names as a part of resource type allocation.

So, what's wrong with this? Three basic things:

  1. A simple bug - protocol.txt is left open while the server runs. This consumes a file descriptor for no good reason.

  2. protocol.txt is read and parsed even if the security extensions aren't available. This wastes time and memory.

  3. The resource names are kept even if X-Resource isn't in use.

The fixes remove the configure options for including the registry code; these functions are only used by the above extensions, so we can tell whether to include the code based solely on whether the extensions are being built.

Getting rid of the TCP listener by default

Branch: listen-fixes

We've had the '-nolisten' option for a while now to disable inbound TCP connections. It's useful for security reasons, but we've never enabled this by default. This patch sequence provides configure options for each of the listen sockets (tcp, unix and local), leaves unix and local enabled by default and disables tcp by default.

A new option, '-listen', is added which allows the user to override the -nolisten defaults in case they actually want to use TCP connections to X.

Glamor bug fixes

branch: glamor-fixes

This branch fixes two bugs:

  1. Scale a large pixmap down to a small pixmap. This happens when you display enormous images in a web page. Iceweasel sends the whole huge image to X and uses Render to scale it to the screen. If the image is larger than a single texture, the X server splits it up into tiles, but the code which tries to perform the merged scale is just broken. Five patches fix this.

  2. Shader-based trapezoids. This code uses area coverage to compute trapezoids. That violates the Render spec, which requires point sampling. Further, the performance of these trapezoids is lower than software (by a lot). This one patch removes the code.

Present bug fixes

branch: present-fixes

A selection of small bug fixes:

  1. Clear pending flips at CloseScreen. This removes a reference to any pending flip pixmap, allowing it to be freed. Otherwise, we'll leak memory across server reset.

  2. Add support for PresentOptionCopy. This has been in the protocol spec for a while, and was completely trivial to implement. However, it never got done. One tiny little patch.

  3. Expose the Present API to drivers via sdksyms.sh. Until now, the present extension APIs have only been available inside the X server. This exposes them to drivers. This took a few cleanup patches first.

Use Present for Glamor XV

branch: glamor-present-xv

Painting XV to the screen should be done at vblank time to avoid tearing. Present offers vblank synchronized operations. Hooking those two together required a few new present APIs to expose the vblank functionality outside of the present code, then a bit of glamor code to hook up that new API to the XV bits.

Switching Glamor to a GL core profile context

branch: glamor-core-profile

This patch set is still in progress, but demonstrates how close we are. We'll be requiring OpenGL 3.3 for this so that we get texture swizzling, which is required for our single channel objects.

The changes present on the branch are:

  1. Switch single channel surfaces from GL_ALPHA to GL_RED.

  2. Use vertex array objects.

  3. Switch ephyr over to using a core 3.3 profile.

Still left to do is

  1. Switch Render code to VBOs

The core code uses VBOs everywhere, but the Render code doesn't. This means that all Render drawing fails, which makes the resulting server not very useful.

My main objective for getting this done is to reduce memory usage by about 16MB, which is the space allocated for software rendering in Mesa in case someone does something which the hardware doesn't handle, and that can only happen with some legacy OpenGL APIs.

Please help out!

All of these friendly little patches are looking for a bit of review so that they can get merged before the 1.17 window closes.

15 Sep 2014 5:14pm GMT

Julien Danjou: Python bad practice, a concrete case

A lot of people read up on good Python practice, and there's plenty of information about that on the Internet. Many tips are included in the book I wrote this year, The Hacker's Guide to Python. Today I'd like to show a concrete case of code that I don't consider to be state of the art.

In my last article where I talked about my new project Gnocchi, I wrote about how I tested, hacked on and then ditched whisper. Here I'm going to explain part of my thought process and a few things that raised my eyebrows when hacking on this code.

Before I start, please don't get the spirit of this article wrong. It's in no way a personal attack on the authors and contributors (whom I don't know). Furthermore, whisper is a piece of code that has been in production in thousands of installations, storing metrics for years. While I can argue that the code does not follow best practices, it definitely works well enough and is valuable to a lot of people.

Tests

The first thing that I noticed when trying to hack on whisper is the lack of tests. There's only one file containing tests, named test_whisper.py, and the coverage it provides is pretty low. One can check that using the coverage tool.

$ coverage run test_whisper.py
...........
----------------------------------------------------------------------
Ran 11 tests in 0.014s

OK
$ coverage report
Name           Stmts   Miss  Cover
----------------------------------
test_whisper     134      4    97%
whisper          584    227    61%
----------------------------------
TOTAL            718    231    67%


While one would think that 61% is "not so bad", taking a quick peek at the actual test code shows that the tests are incomplete. What I mean by incomplete is that they, for example, use the library to store values into a database, but they never check if the results can be fetched and if the fetched results are accurate. Here's a good reason one should never blindly trust the test coverage percentage as a quality metric.

When I tried to modify whisper, since the tests do not check the entire cycle of the values fed into the database, I ended up making incorrect changes while the tests still passed.

No PEP 8, no Python 3

The code doesn't respect PEP 8. A run of flake8 + hacking shows 732 errors… While this does not impact how the code runs, it makes it more painful to hack on than most Python projects.

The hacking tool also shows that the code is not Python 3 ready as there is usage of Python 2 only syntax.

A good way to fix that would be to set up tox and add a few targets for PEP 8 checks and Python 3 tests. Even if the test suite is not complete, starting by having flake8 run without errors and the few existing unit tests working with Python 3 would put the project in a better light.
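
As a rough illustration, a minimal tox.ini for that could look like the following - the environment list and commands here are my assumptions, not something the whisper project ships:

[tox]
envlist = py27,py34,pep8

[testenv]
commands = python test_whisper.py

[testenv:pep8]
deps = flake8
       hacking
commands = flake8 whisper.py test_whisper.py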

Not using idiomatic Python

A lot of the code could be simplified by using idiomatic Python. Let's take a simple example:

def fetch(path,fromTime,untilTime=None,now=None):
  fh = None
  try:
    fh = open(path,'rb')
    return file_fetch(fh, fromTime, untilTime, now)
  finally:
    if fh:
      fh.close()


That piece of code could be easily rewritten as:

def fetch(path,fromTime,untilTime=None,now=None):
    with open(path, 'rb') as fh:
        return file_fetch(fh, fromTime, untilTime, now)


Written this way, the function actually looks so simple that one can even wonder why it should exist at all - but why not.

Usage of loops could also be made more Pythonic:

for i,archive in enumerate(archiveList):
  if i == len(archiveList) - 1:
    break


could be actually:

for i, archive in enumerate(itertools.islice(archiveList, len(archiveList) - 1)):


That reduces the code size and makes the code easier to read through.

Wrong abstraction level

Also, one thing that I noticed in whisper is that it abstracts its features at the wrong level.

Take the create() function - what it does is pretty obvious:

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  # Set default params
  if xFilesFactor is None:
    xFilesFactor = 0.5
  if aggregationMethod is None:
    aggregationMethod = 'average'

  #Validate archive configurations...
  validateArchiveList(archiveList)

  #Looks good, now we create the file and write the header
  if os.path.exists(path):
    raise InvalidConfiguration("File %s already exists!" % path)
  fh = None
  try:
    fh = open(path,'wb')
    if LOCK:
      fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

    aggregationType = struct.pack( longFormat, aggregationMethodToType.get(aggregationMethod, 1) )
    oldest = max([secondsPerPoint * points for secondsPerPoint,points in archiveList])
    maxRetention = struct.pack( longFormat, oldest )
    xFilesFactor = struct.pack( floatFormat, float(xFilesFactor) )
    archiveCount = struct.pack(longFormat, len(archiveList))
    packedMetadata = aggregationType + maxRetention + xFilesFactor + archiveCount
    fh.write(packedMetadata)
    headerSize = metadataSize + (archiveInfoSize * len(archiveList))
    archiveOffsetPointer = headerSize

    for secondsPerPoint,points in archiveList:
      archiveInfo = struct.pack(archiveInfoFormat, archiveOffsetPointer, secondsPerPoint, points)
      fh.write(archiveInfo)
      archiveOffsetPointer += (points * pointSize)

    #If configured to use fallocate and capable of fallocate use that, else
    #attempt sparse if configure or zero pre-allocate if sparse isn't configured.
    if CAN_FALLOCATE and useFallocate:
      remaining = archiveOffsetPointer - headerSize
      fallocate(fh, headerSize, remaining)
    elif sparse:
      fh.seek(archiveOffsetPointer - 1)
      fh.write('\x00')
    else:
      remaining = archiveOffsetPointer - headerSize
      chunksize = 16384
      zeroes = '\x00' * chunksize
      while remaining > chunksize:
        fh.write(zeroes)
        remaining -= chunksize
      fh.write(zeroes[:remaining])

    if AUTOFLUSH:
      fh.flush()
      os.fsync(fh.fileno())
  finally:
    if fh:
      fh.close()


The function is doing everything: checking if the file doesn't exist already, opening it, building the structured data, writing this, building more structure, then writing that, etc.

That means that the caller has to give a file path, even if it just wants a whisper data structure to store itself elsewhere. StringIO() could be used to fake a file handle, but it will fail if the call to fcntl.flock() is not disabled - and it is inefficient anyway.

There are a lot of other functions in the code, such as setAggregationMethod(), that mix file handling - even doing things like os.fsync() - with the manipulation of structured data. This is definitely not a good design, especially for a library, as it makes reusing the functions in a different context nearly impossible.
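
To illustrate the point, here's a minimal sketch (my own, not actual whisper code) of how create() could be split: a pure function that only builds the packed metadata, reusing the module-level constants quoted above, and a thin wrapper that keeps all the file handling. Anything wanting to store the structure somewhere else could then call the first function directly.

import struct

from whisper import longFormat, floatFormat, aggregationMethodToType

def pack_metadata(archiveList, xFilesFactor=0.5, aggregationMethod='average'):
    """Build the packed whisper header, without touching any file."""
    aggregationType = struct.pack(longFormat,
                                  aggregationMethodToType.get(aggregationMethod, 1))
    oldest = max(secondsPerPoint * points
                 for secondsPerPoint, points in archiveList)
    maxRetention = struct.pack(longFormat, oldest)
    xff = struct.pack(floatFormat, float(xFilesFactor))
    archiveCount = struct.pack(longFormat, len(archiveList))
    return aggregationType + maxRetention + xff + archiveCount

def create(path, archiveList, xFilesFactor=0.5, aggregationMethod='average'):
    """Thin I/O wrapper: everything file related stays here."""
    with open(path, 'wb') as fh:
        fh.write(pack_metadata(archiveList, xFilesFactor, aggregationMethod))
        # ... write the archive info blocks and pre-allocate the file as before ...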

Race conditions

There are race conditions, for example in create() (see added comment):

if os.path.exists(path):
  raise InvalidConfiguration("File %s already exists!" % path)
fh = None
try:
  # TOO LATE I ALREADY CREATED THE FILE IN ANOTHER PROCESS YOU ARE GOING TO
  # FAIL WITHOUT GIVING ANY USEFUL INFORMATION TO THE CALLER :-(
  fh = open(path,'wb')


That code should be:

try:
    fh = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL), 'wb')
except OSError as e:
    if e.errno == errno.EEXIST:
        raise InvalidConfiguration("File %s already exists!" % path)


to avoid any race condition.

Unwanted optimization

We saw earlier the fetch() function, which is barely useful, so let's take a look at the file_fetch() function that it calls.

def file_fetch(fh, fromTime, untilTime, now = None):
  header = __readHeader(fh)
  [...]


The first thing the function does is to read the header from the file handle. Let's take a look at that function:

def __readHeader(fh):
  info = __headerCache.get(fh.name)
  if info:
    return info

  originalOffset = fh.tell()
  fh.seek(0)
  packedMetadata = fh.read(metadataSize)

  try:
    (aggregationType,maxRetention,xff,archiveCount) = struct.unpack(metadataFormat,packedMetadata)
  except:
    raise CorruptWhisperFile("Unable to read header", fh.name)
  [...]


The first thing the function does is to look into a cache. Why is there a cache?

It actually caches the header, with an index based on the file path (fh.name). Except that if one decides, for example, not to use a file and to cheat using StringIO, then the object does not have any name attribute, so this code path will raise an AttributeError.

One has to set a fake name manually on the StringIO instance, and it must be unique so nobody messes with the cache:

import StringIO

packedMetadata = <some source>
fh = StringIO.StringIO(packedMetadata)
fh.name = "myfakename"
header = __readHeader(fh)


The cache may actually be useful when accessing files, but it's definitely useless when not using files. And it's not even clear that the complexity (even if small) that the cache adds is worth it. I doubt most whisper-based tools are long-running processes, so the cache that really matters when accessing the files is the one handled by the operating system kernel, which is going to be much more efficient anyway, and shared between processes. There's also no expiry of that cache, which could end up wasting a lot of memory.
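
If the cache were kept at all, bounding it would at least address the expiry concern. A minimal sketch of that (again, my own idea, not something whisper provides):

import collections

# A bounded replacement for __headerCache: keep at most CACHE_SIZE parsed
# headers and evict the oldest entry once the limit is reached.
CACHE_SIZE = 256
__headerCache = collections.OrderedDict()

def _cache_header(name, info):
    __headerCache[name] = info
    if len(__headerCache) > CACHE_SIZE:
        __headerCache.popitem(last=False)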

Docstrings

None of the docstrings are written in a parsable syntax like Sphinx's. This means you cannot generate documentation in a nice format that a developer using the library could read easily.

The documentation is also not up to date:

def fetch(path,fromTime,untilTime=None,now=None):
  """fetch(path,fromTime,untilTime=None)
  [...]
  """

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  """create(path,archiveList,xFilesFactor=0.5,aggregationMethod='average')
  [...]
  """


This is something that could be avoided if a proper format were picked to write the docstrings. A tool could then be used to notice when there's a divergence between the actual function signature and the documented one, such as a missing argument.
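
For instance, fetch() could document its parameters using Sphinx's info field lists - the descriptions below are mine, written as an example, not taken from whisper:

def fetch(path, fromTime, untilTime=None, now=None):
    """Fetch data points from a whisper file.

    :param path: path to the whisper database file
    :param fromTime: epoch timestamp to start fetching from
    :param untilTime: epoch timestamp to stop at, defaults to the current time
    :param now: epoch timestamp to consider as the current time
    :returns: whatever file_fetch() returns for that range
    """
    with open(path, 'rb') as fh:
        return file_fetch(fh, fromTime, untilTime, now)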

Duplicated code

Last but not least, there's a lot of code duplicated across the scripts provided by whisper in its bin directory. These scripts should be very lightweight and use the console_scripts facility of setuptools, but they actually contain a lot of (untested) code. Furthermore, some of that code is partially duplicated from the whisper.py library, which goes against DRY.
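
As a sketch of what that could look like, each script in bin would shrink to an entry point declared in setup.py, with the actual logic living in the library where it can be tested. The module and function names below are hypothetical:

from setuptools import setup

setup(
    name='whisper',
    # ...
    entry_points={
        'console_scripts': [
            # hypothetical module and function names
            'whisper-create = whisper_cli:create_main',
            'whisper-fetch = whisper_cli:fetch_main',
        ],
    },
)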

Conclusion

There are a few more things that made me stop considering whisper, but those are about its feature set, not necessarily code quality. One can also point out that the code is very condensed and hard to read, which is a more general problem with how it is organized and abstracted.

A lot of these defects are actually points that made me start writing The Hacker's Guide to Python a year ago. Running into this kind of code makes me think it was a really good idea to write a book of advice on writing better Python code!

The Hacker's Guide to Python

A book I wrote about designing Python applications: the state of the art, advice to apply when building your applications, various Python tips, and more. Interested? Check it out.

15 Sep 2014 11:09am GMT

14 Sep 2014

Alberto Ruiz: Back to the kingdom

And not the kingdom of Spain unfortunately (unfortunately because I miss it and because it's still a kingdom). In a few months (not sure about specific dates yet, probably in early 2015) I will be moving back to the United Kingdom, this time to the larger metropolis, London. Don't panic, I will still be with Red Hat, there won't be a lot of changes on that front. In the meantime I will settle back in Gran Canaria and will be flying back and forth on a monthly basis.

I must note that when I made the decision to move to the Czech Republic my plan was: "I do not have a plan" - just enjoying it and trying to make the best of it without thinking about deadlines for moving back to Spain. Red Hat has been a very welcoming company in which I feel right at home, Brno has been a very welcoming city, and this is definitely a part of Europe that is worth experiencing. I've met terrific people during this period both inside and outside Red Hat.

There was, however, a little problem.

Something altered the mid-term plans: a few months before I moved, when the decision was already made, I met someone very special with whom I now want to share my life. After 16 months of a long-distance relationship it was about time to find a place where we could be together, and after months of planning and considering options, London presented itself as the spot to make the move, as she found a pretty good job there.

While I am going to miss sharing the office on a daily basis with awesome people, I am looking forward to this new chapter in my life.

[Photo: Canary Wharf at Night | London, England, Niko Trinkhaus, (CC by-nc)]

I want to note that I am deeply thankful to Christian Schaller for his tremendous amount of support during my stay in Brno and for working with me on figuring out ways to balance my professional and personal life. I also wish him the best of luck with his new life in Westford, I'm certainly going to miss him.

On the other hand I guess this means I'll show up at the GNOME Beers in London more often :-)

14 Sep 2014 6:15pm GMT

11 Sep 2014

Matthias Klumpp: Listaller: Back to the future!

It is time for another report on Listaller, the cross-distro 3rd-party package installer, which has now been in development for - depending on how you count - 5-6 years. This will become a longer post, so you might grab some coffee or tea ;-)

The original idea

The Listaller project was initially started with the goal to make application deployment on Linux distributions as simple as possible, by providing a unified package installation format and tools which make building apps for multiple distributions easier and deployment of updates simple. The key ideas were:

The current state

The current release of Listaller handles all of this with a plugin for PackageKit, the cross-distro package-management abstraction layer. It hooks into PackageKit and reads information passing through to the native distributor backend, and if it encounters Listaller software, it handles it appropriately. It can also inject update information. This results in all Listaller software being shown in any PackageKit frontends, and people can work with it just like if the packages were native packages. Listaller package installations are controlled by a machine policy, so the administrator can decide that e.g. only packages from a trusted source (= GPG signature in trusted database) can be installed. Dependencies can be pulled from the distributor's repositories, or optionally from external sources, like the PyPI.

This sounds good on paper, but the current implementation has various problems.

The issues

The current Listaller approach has some problems. The biggest one lies in the future: Soon, there will be no PackageKit plugins anymore! PackageKit 1.0 will remove support for them, because they appear to be a major source of crashes; even the in-tree plugins cause problems. Also, the PackageKit service itself is currently being trimmed of unneeded features and less-used code. These changes in PackageKit are great and needed for the project (and I support these efforts), but they cause a pretty huge problem for Listaller: The project relies on the PackageKit plugin - if used without it, you lose the system-integration part, which is one of the key concepts of Listaller, and a primary goal.

But this issue is not the only one. There are more. One huge problem for Listaller is dependency-solving: It needs to know where to get software from in case it isn't installed already. And that has to be done in a cross-distributional way. This is an incredibly complex task, and Listaller contains lots of workarounds for various quirks. It contains so many hacks for distro-specific stuff that it became really hard to understand. The Listaller dependency model also became very complex, because it tried to handle many corner-cases. This is bad, of course. But the workarounds weren't added for fun; they were added because it was assumed to be easier than fixing the root cause, which would have required collaboration between distributors and some changes on the stack, which seemed unlikely to happen at the time the code was written.

The systemd effort

Another thing which affects Listaller is the latest push from the systemd team to allow cross-distro 3rd-party installations to happen. I definitely recommend reading the linked blogpost from Lennart, if you have some spare time! The identified problems are the same as for Listaller, but the solution they propose is completely different, and about three orders of magnitude more invasive than whatever the Listaller project had in mind (I'm making these numbers up, so don't ask!). There are also a few issues I see with Lennart's approach; I will probably go into detail about that in another blogpost (e.g. it requires multiple copies of a library lying around, where one version might have a security vulnerability and another one doesn't - it's hard to ensure everything is up to date and secure that way, even if you have a top-notch sandbox). I have great respect for the systemd crew and especially Lennart, and I hope they succeed with their efforts. However, I also think Listaller can achieve similar things with a less-invasive solution, at least for 3rd-party app installations (Listaller is one of the partial-fix solutions with strict focus, so not a direct competitor to the holistic systemd approach. Both solutions could happily live together.)

A step into the future

Some might have guessed it already: There are some bigger changes coming to Listaller! The most important one is that there will be no Listaller anymore, at least not in its old form.

Since the current code relies heavily on the PackageKit plugin, and contains some ugly workarounds, it doesn't make much sense to continue working on it.

Instead, I started the Listaller.NEXT project, which is a rewrite of Listaller in C. There are some goals for the rewrite:

I made a last release of the 0.5.x series of Listaller, to work with PackageKit 0.9.x - the future lies in the C port.

If you are using Listaller (and I know of people who do, for example some deploy statically-linked stuff on internal test-setups with it), stay tuned. The packaging format will stay mostly compatible with the current version, so you will not see many changes there (the plan is to freeze it very soon, so no backwards-incompatible changes are made anymore). The 0.5.x series will receive critical bugfixes if necessary.

Help needed!

As always, there is help needed! Writing C is not that difficult ;-) But user feedback is welcome as well, in case you have an idea. The new code will be hosted on Github in the new listaller-next branch (currently not that much to find there). Long-term, we will completely migrate away from Launchpad.

You can expect more blogposts about the Listaller concepts and progress in the next months (as soon as I am done with some AppStream-related things, which take priority).

11 Sep 2014 8:14am GMT

03 Sep 2014

Alan Coopersmith: R_AMD64_PC32 error? There, I Fixed It!

[Image: thereifixedit.com - Euro Ipod Charger]

I try to fairly regularly build recent git checkouts of all the upstream modules from X.Org (at least all those listed in the current build.sh) on Solaris. Normally I do this in 32-bit mode on x86 machines using the Sun compilers on the latest Solaris 11 internal development build, but I also occasionally do it in 64-bit mode, or with gcc compilers, or on a SPARC machine. This helps me catch issues that would break our builds when we integrate the new releases before those releases happen. (Ideally I'd set up a Solaris client of the X.Org tinderbox, but I've not gotten around to that.)

Anyways, recently I finally decided to track down an error that only shows up in the 64-bit builds of the xscope protocol monitor/decoder for X11 on Solaris. The builds run fine up until the final link stage, which fails with:

ld: fatal: relocation error: R_AMD64_PC32: file audio.o: symbol littleEndian: value 0x8086c355 does not fit
ld: fatal: relocation error: R_AMD64_PC32: file audio.o: symbol ServerHostName: value 0x8086b4fe does not fit
ld: fatal: relocation error: R_AMD64_PC32: file decode11.o: symbol LBXEvent: value 0x808664c3 does not fit
(and over 150 more symbols that didn't fit)

A google search turned up some forum posts, a blog post, and an article on the AMD64 ABI support in the Sun Studio compilers. And indeed, the solutions they offered did work - building with -Kpic did allow the program to link.

But is that really the best answer? xscope is a simple program, and shouldn't be overflowing the normal memory model. Once it linked, looking at the resulting binary was a bit shocking:

% /usr/gnu/bin/size  xscope
   text    data     bss     dec     hex filename
 416753    5256 2155921980      2156343989      808732b5        xscope

% /usr/bin/size -f xscope

23(.interp) + 32(.SUNW_cap) + 5860(.eh_frame_hdr) + 27200(.eh_frame)
 + 2964(.SUNW_syminfo) + 5944(.hash) + 4224(.SUNW_ldynsym)
 + 17784(.dynsym) + 14703(.dynstr) + 192(.SUNW_version)
 + 1482(.SUNW_versym) + 3168(.SUNW_dynsymsort) + 96(.SUNW_reloc)
 + 1944(.rela.plt) + 1312(.plt) + 291018(.text) + 33(.init) + 33(.fini)
 + 280(.rodata) + 38461(.rodata1) + 1376(.got) + 784(.dynamic)
 + 1952(.data) + 0(.bssf) + 1144(.picdata) + 0(.tdata) + 0(.tbss)
 + 2155921980(.bss) = 2156343989

% pmap -x `pgrep xscope`
26151:  ./xscope
         Address     Kbytes        RSS       Anon     Locked Mode   Mapped File
0000000000400000        408        408          -          - r-x--  xscope
0000000000476000          8          8          8          - rw---  xscope
0000000000478000    2105388       1064       1064          - rw---  xscope
0000000080C83000         52         52         52          - rw---    [ heap ]
[....]
FFFFFD7FFFDF8000         32         32         32          - rw---    [ stack ]
---------------- ---------- ---------- ---------- ----------
        total Kb    2108668       3204       1300          -

Two gigabytes of .bss space allocated!?!?! That can't be right. Looking through the output of the elfdump and nm programs a single symbol stood out:

Symbol Table Section:  .SUNW_ldynsym
     index    value              size              type bind oth ver shndx          name
[...]
      [89]  0x00000000009ff280 0x0000000080280000  OBJT GLOB  D    1 .bss           FDinfo

[Index]   Value                Size                Type  Bind  Other Shndx   Name
[...]
[528]   |            10482304|          2150105088|OBJT |GLOB |0    |28     |FDinfo

Unfortunately, that wasn't one of the ones listed in the linker errors, since its starting address fit inside the normal memory model, but everything that came after it was out of range.

So what is this giant static allocation for? It's defined in scope.h:

#define BUFFER_SIZE (1024 * 32)

struct fdinfo
{
  Boolean Server;
  long    ClientNumber;
  FD      pair;
  unsigned char   buffer[BUFFER_SIZE];
  int     bufcount;
  int     bufstart;
  int     buflimit;     /* limited writes */
  int     bufdelivered; /* total bytes delivered */
  Boolean writeblocked;
};

extern struct fdinfo   FDinfo[StaticMaxFD];

So it allocates a 32k buffer for up to StaticMaxFD file descriptors. How many is that? For that we need to look in xscope's fd.h:

/* need to change the MaxFD to allow larger number of fd's */
#define StaticMaxFD FD_SETSIZE

and from there to the Solaris system headers, which define FD_SETSIZE in <sys/select.h>:

/*
 * Select uses bit masks of file descriptors in longs.
 * These macros manipulate such bit fields.
 * FD_SETSIZE may be defined by the user, but the default here
 * should be >= NOFILE (param.h).
 */
#ifndef FD_SETSIZE
#ifdef _LP64
#define FD_SETSIZE      65536
#else
#define FD_SETSIZE      1024
#endif  /* _LP64 */

So this makes the buffer fields alone in FDinfo become 65536 * 32 * 1024 bytes, aka 2 gigabytes.
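That's 65,536 entries times 32,768 bytes of buffer each, or 2,147,483,648 bytes for the buffers alone; adding the other struct members gives the 2,150,105,088-byte FDinfo symbol seen in the nm output above.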

Thus in this case, while compiler flags like -Kpic allow the code to link, using -DFD_SETSIZE=256 instead, builds code that's a little bit saner, fits in the normal memory model, and is less likely to fail with out of memory errors when you need it most:

% /usr/gnu/bin/size -f xscope
   text    data     bss     dec     hex filename
 409388    3352 8449804 8862544  873b50 xscope

% pmap -x `pgrep xscope`
         Address     Kbytes        RSS       Anon     Locked Mode   Mapped File
0000000000400000        404        404          -          - r-x--  xscope
0000000000475000          4          4          4          - rw---  xscope
0000000000476000       8248         20         20          - rw---  xscope
0000000000C84000         52         52         52          - rw---    [ heap ]
[...]
FFFFFD7FFFDFD000         12         12         12          - rw---    [ stack ]
---------------- ---------- ---------- ---------- ----------
        total Kb      11500       2136        232          -

Of course that assumes that xscope is not going to be monitoring more than about 120 clients at a time (since it opens two file descriptors for each client, one connected to the client and one to the real X server), and still wastes many page mappings if you're only monitoring one client. The real fix being worked on for the next upstream release is to make the buffer allocation be dynamic, and allocate just enough for the number of clients we actually are monitoring.

The moral of this story? Just because you can make it build doesn't mean you've fixed it well, and sometimes it's useful to understand why the linker is giving you a hard time.

03 Sep 2014 1:11am GMT

02 Sep 2014

Eric Anholt: helping out with VC4

I've had a couple of questions about whether there's a way for others to contribute to the VC4 driver project. There is! I haven't posted about it before because things aren't as ready as I'd like for others to do development (it has a tendency to lock up, and the X implementation isn't really ready yet so you don't get to see your results), but that shouldn't actually stop anyone.

To get your environment set up, build the kernel (https://github.com/anholt/linux.git vc4 branch), Mesa (git://anongit.freedesktop.org/mesa/mesa) with --with-gallium-drivers=vc4, and piglit (git://anongit.freedesktop.org/git/piglit). For working on the Pi, I highly recommend having a serial cable and doing NFS root so that you don't have to write things to slow, unreliable SD cards.

You can run an existing piglit test that should work, to check your environment: env PIGLIT_PLATFORM=gbm VC4_DEBUG=qir ./bin/shader_runner tests/shaders/glsl-algebraic-add-add-1.shader_test -auto -fbo -- you should see a dump of the IR for this shader, and a pass report. The kernel will make some noise about how it's rendered a frame.

Now the actual work: I've left some of the TGSI opcodes unfinished (SCS, DST, DPH, and XPD, for example), so the driver just aborts when a shader tries to use them. How they work is described in src/gallium/docs/source/tgsi.rst. The TGSI-to-QIR code is in vc4_program.c (where you'll find all the opcodes that are implemented currently), and vc4_qir.h has all the opcodes that are available to you and helpers for generating them. Once it's in QIR (which I think should have all the opcodes you need for this work), vc4_qpu_emit.c will turn the QIR into actual QPU code like you find described in the chip specs.

You can dump the shaders being generated by the driver using VC4_DEBUG=tgsi,qir,qpu in the environment (that gets you 3/4 stages of code dumped -- at times you might want some subset of that just to quiet things down).

Since we've still got a lot of GPU hangs, and I don't have reset working, you can't even complete a piglit run to find all the problems or to test your changes to see if your changes are good. What I can offer currently is that you could run PIGLIT_PLATFORM=gbm VC4_DEBUG=norast ./piglit-run.py tests/quick.py results/vc4-norast; piglit-summary-html.py --overwrite summary/mysum results/vc4-norast will get you a list of all the tests (which mostly failed, since we didn't render anything), some of which will have failed with assertions. Now that you know which tests hit assertion failures for the opcode you worked on, you can run them manually, like PIGLIT_PLATFORM=gbm /home/anholt/src/piglit/bin/shader_runner /home/anholt/src/piglit/generated_tests/spec/glsl-1.10/execution/built-in-functions/vs-asin-vec4.shader_test -auto (copy-and-pasted from the results) or PIGLIT_PLATFORM=gbm PIGLIT_TEST="XPD test 2 (same src and dst arg)" ./bin/glean -o -v -v -v -t +vertProg1 --quick (also copy and pasted from the results, but note that you need the other env var for glean to pick out the subtest to run).

Other things you might want eventually: I do my development using cross-builds instead of on the Pi, install to a prefix in my homedir, then rsync that into my NFS root and use LD_LIBRARY_PATH/LIBGL_DRIVERS_PATH on the Pi to point my tests at the driver in the homedir prefix. Cross-builds were a *huge* pain to set up (debian's multiarch doesn't ship the .so symlink with the library, and the -dev packages that do install them don't install simultaneously for multiple arches), but it's worth it in the end. If you look into cross-building, what I'm using is rpi-tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64/bin/arm-linux-gnueabihf-gcc and you'll want --enable-malloc0returnsnull if you cross-build a bunch of X-related packages.

02 Sep 2014 12:19am GMT

01 Sep 2014

Christian Schaller: Leaving Brno

So two years ago my family and I moved to Brno in the Czech Republic due to me starting a new job at Red Hat. It has been two roller coaster years with a lot of changes happening both inside Red Hat and in the world that the Linux desktop operates in. During those years my wife and I have gotten to love Brno, which both of us find a bit surprising as we were both quite skeptical about the city at the outset.

I think, having grown up in Western Europe during the Cold War, I had some preconceptions about what life was like in the former Eastern Europe, and Brno specifically struggles a bit with being the second city in the Czech Republic after Prague, due to Prague so often being hailed internationally as a beautiful and exciting city.

But I think during these two years Brno has proven itself to us as a great place to live, especially if you have a little child. Brno has a lot of beautiful outdoor areas which are great for hiking or relaxing, it is packed full of children's cafes where you can take your kid to play while you sit down and have a coffee or a tea, and it offers a vibrant expat community, affordable housing, a good range of restaurants, and short distances to major cities like Vienna, Prague and Budapest. There are also lots of old castles and towns to explore in the vicinity; I think Telc has to be one of our topmost favorites in that regard. And it has very little crime: my wife has been telling her friends how Brno is the first city she has ever lived in where she feels that as a woman she can walk through the city in the evening or at night and feel safe.

But that said the time has come for us to move on. Due to one of these changes inside Red Hat I mentioned I am getting moved to our US Engineering office in Westford, Massachusetts. For those not familiar with Westford it is close to a city you probably do know, Boston.

So tomorrow the moving company will arrive at our flat here in Brno and pack up everything for the transport to the US. The furniture will take some time to arrive there, so while our stuff is sailing across the ocean we will live with my family in Norway, while I take advantage of the Red Hat office in downtown Oslo. So by mid-October I expect us to be fully set up in the Boston area, although we are heading over there next week for a final house hunting trip so that the furniture has a place to arrive to :)

So goodbye to Brno for now, and looking forward to seeing new and old friends in Boston!

01 Sep 2014 12:53pm GMT

31 Aug 2014

Lennart Poettering: Revisiting How We Put Together Linux Systems

In a previous blog story I discussed Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems. I now want to take the opportunity to explain a bit where we want to take this with systemd in the longer run, and what we want to build out of it. This is going to be a longer story, so better grab a cold bottle of Club Mate before you start reading.

Traditional Linux distributions are built around packaging systems like RPM or dpkg, and an organization model where upstream developers and downstream packagers are relatively clearly separated: an upstream developer writes code, and puts it somewhere online, in a tarball. A packager then grabs it and turns it into RPMs/DEBs. The user then grabs these RPMs/DEBs and installs them locally on the system. For a variety of uses this is a fantastic scheme: users have a large selection of readily packaged software available, in mostly uniform packaging, from a single source they can trust. In this scheme the distribution vets all software it packages, and as long as the user trusts the distribution all should be good. The distribution takes the responsibility of ensuring the software is not malicious, of timely fixing security problems and helping the user if something is wrong.

Upstream Projects

However, this scheme also has a number of problems, and doesn't fit many use-cases of our software particularly well. Let's have a look at the problems of this scheme for many upstreams:

This all together makes it really hard for many upstreams to work nicely with the current way how Linux works. Often they try to improve the situation for them, for example by bundling libraries, to make their test and build matrices smaller.

System Vendors

The toolbox approach of classic Linux distributions is fantastic for people who want to put together their individual system, nicely adjusted to exactly what they need. However, this is not really how many of today's Linux systems are built, installed or updated. If you build any kind of embedded device, a server system, or even user systems, you frequently do your work based on complete system images, that are linearly versioned. You build these images somewhere, and then you replicate them atomically to a larger number of systems. On these systems, you don't install or remove packages, you get a defined set of files, and besides installing or updating the system there are no ways how to change the set of tools you get.

The current Linux distributions are not particularly good at providing for this major use-case of Linux. Their strict focus on individual packages as well as package managers as end-user install and update tool is incompatible with what many system vendors want.

Users

The classic Linux distribution scheme is frequently not what end users want, either. Many users are used to app markets like the ones Android, Windows or iOS/Mac have. Markets are a platform that doesn't package, build or maintain software like distributions do, but simply allows users to quickly find and download the software they need, with the app vendor responsible for keeping the app updated, secured, and all that on the vendor's release cycle. Users tend to be impatient. They want their software quickly, and the fine distinction between trusting a single distribution or a myriad of app developers individually is usually not important for them. The companies behind the marketplaces usually try to improve this trust problem by providing sand-boxing technologies: as a replacement for the distribution that audits, vets, builds and packages the software and thus allows users to trust it to a certain level, these vendors try to find technical solutions to ensure that the software they offer for download can't be malicious.

Existing Approaches To Fix These Problems

Now, all the issues pointed out above are not new, and there are sometimes quite successful attempts to do something about it. Ubuntu Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of this problem set, usually with a strict focus on one facet of Linux systems. For example, Ubuntu Apps focus strictly on end user (desktop) applications, and don't care about how we built/update/install the OS itself, or containers. Docker OTOH focuses on containers only, and doesn't care about end-user apps. Software Collections tries to focus on the development environments. ChromeOS focuses on the OS itself, but only for end-user devices. CoreOS also focuses on the OS, but only for server systems.

The approaches they find are usually good at specific things, and use a variety of different technologies, on different layers. However, none of these projects tried to fix these problems in a generic way, for all uses, right in the core components of the OS itself.

Linux has come to tremendous success because its kernel is so generic: you can build supercomputers and tiny embedded devices out of it. It's time we come up with a basic, reusable scheme for solving the problem set described above that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom Gundersen, David Herrmann, and yours truly) recently met in Berlin about all these things, and tried to come up with a scheme that is somewhat simple, but tries to solve the issues generically, for all use-cases, as part of the systemd project. All that in a way that is somewhat compatible with the current scheme of distributions, to allow a slow, gradual adoption. Also, and that's something one cannot stress enough: the toolbox scheme of classic Linux distributions is actually a good one, and for many cases the right one. However, we need to make sure we make distributions relevant again for all use-cases, not just those of highly individualized systems.

Anyway, so let's summarize what we are trying to do:

What We Propose

So much about the set of problems, and what we are trying to do. So, now, let's discuss the technical bits we came up with:

The scheme we propose is built around a variety of concepts from btrfs and Linux file system name-spacing. btrfs at this point already has a large number of features that fit neatly in our concept, and the maintainers are busy working on a couple of others we want to eventually make use of.

As the first part of our proposal we make heavy use of btrfs sub-volumes and introduce a clear naming scheme for them. We name snapshots like this:

btrfs partitions that adhere to this naming scheme should be clearly identifiable. It is our intention to introduce a new GPT partition type ID for this.

How To Use It

After we introduced this naming scheme let's see what we can build of this:

Instantiating a new system or OS container (which is exactly the same in this scheme) just consists of creating a new appropriately named root sub-volume. Completely naturally you can share one vendor OS copy in one specific version with a multitude of container instances.

Everything is double-buffered (or actually, n-fold-buffered), because usr, runtime, framework, app sub-volumes can exist in multiple versions. Of course, by default the execution logic should always pick the newest release of each sub-volume, but it is up to the user to keep multiple versions around, and possibly execute older versions, if he desires to do so. In fact, like on ChromeOS this could even be handled automatically: if a system fails to boot with a newer snapshot, the boot loader can automatically revert back to an older version of the OS.

An Example

Note that as a result this allows installing not only multiple end-user applications into the same btrfs volume, but also multiple operating systems, multiple system instances, multiple runtimes, multiple frameworks. Or to spell this out in an example:

Let's say Fedora, Mageia and ArchLinux all implement this scheme, and provide ready-made end-user images. Also, the GNOME, KDE, SDL projects all define a runtime+framework to develop against. Finally, both LibreOffice and Firefox provide their stuff according to this scheme. You can now trivially install all of these into the same btrfs volume:

In the example above, we have three vendor operating systems installed. All of them in three versions, and one even in a beta version. We have four system instances around. Two of them of Fedora, maybe one of them we usually boot from, the other we run for very specific purposes in an OS container. We also have the runtimes for two GNOME releases in multiple versions, plus one for KDE. Then, we have the development trees for one version of KDE and GNOME around, as well as two apps, that make use of two releases of the GNOME runtime. Finally, we have the home directories of two users.

Now, with the name-spacing concepts we introduced above, we can actually relatively freely mix and match apps and OSes, or develop against specific frameworks in specific versions on any operating system. It doesn't matter if you booted your ArchLinux instance, or your Fedora one, you can execute both LibreOffice and Firefox just fine, because at execution time they get matched up with the right runtime, and all of them are available from all the operating systems you installed. You get the precise runtime that the upstream vendor of Firefox/LibreOffice did their testing with. It doesn't matter anymore which distribution you run, and which distribution the vendor prefers.

Also, given that the user database is actually encoded in the sub-volume list, it doesn't matter which system you boot; the distribution should be able to find your local users automatically, without any configuration in /etc/passwd.

Building Blocks

With this naming scheme plus the way how we can combine them on execution we already came quite far, but how do we actually get these sub-volumes onto the final machines, and how do we update them? Well, btrfs has a feature they call "send-and-receive". It basically allows you to "diff" two file system versions, and generate a binary delta. You can generate these deltas on a developer's machine and then push them into the user's system, and he'll get the exact same sub-volume too. This is how we envision installation and updating of operating systems, applications, runtimes, frameworks. At installation time, we simply deserialize an initial send-and-receive delta into our btrfs volume, and later, when a new version is released we just add in the few bits that are new, by dropping in another send-and-receive delta under a new sub-volume name. And we do it exactly the same for the OS itself, for a runtime, a framework or an app. There's no technical distinction anymore. The underlying operation for installing apps, runtime, frameworks, vendor OSes, as well as the operation for updating them is done the exact same way for all.

Of course, keeping multiple full /usr trees around sounds like an awful lot of waste, after all they will contain a lot of very similar data, since a lot of resources are shared between distributions, frameworks and runtimes. However, thankfully btrfs actually is able to de-duplicate this for us. If we add in a new app snapshot, this simply adds in the new files that changed. Moreover different runtimes and operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user, desktop side of things, the concept is also extremely powerful in server scenarios. For example, it is easy to build your own usr trees and deliver them to your hosts using this scheme. The usr sub-volumes are supposed to be something that administrators can put together. After deserializing them into a couple of hosts, you can trivially instantiate them as OS containers there, simply by adding a new root sub-volume for each instance, referencing the usr tree you just put together. Instantiating OS containers hence becomes as easy as creating a new btrfs sub-volume. And you can still update the images nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies great to embedded use-cases. Regardless if you build a TV, an IVI system or a phone: you can put together your OS versions as usr trees, and then use btrfs-send-and-receive facilities to deliver them to the systems, and update them there.

Many people when they hear the word "btrfs" instantly reply with "is it ready yet?". Thankfully, most of the functionality we really need here is strictly read-only. With the exception of the home sub-volumes (see below) all snapshots are strictly read-only, and are delivered as immutable vendor trees onto the devices. They never are changed. Even if btrfs might still be immature, for this kind of read-only logic it should be more than good enough.

Note that this scheme also enables doing fat systems: for example, an installer image could include a Fedora version compiled for x86-64, one for i386, one for ARM, all in the same btrfs volume. Due to btrfs' de-duplication they will share as much as possible, and when the image is booted up the right sub-volume is automatically picked. Something similar of course applies to the apps too!

This also allows us to implement something that we like to call Operating-System-As-A-Virus. Installing a new system is little more than:

Now, since the only real vendor data you need is the usr sub-volume, you can trivially duplicate this onto any block device you want. Let's say you are a happy Fedora user, and you want to provide a friend with his own installation of this awesome system, all on a USB stick. All you have to do for this is do the steps above, using your installed usr tree as source to copy. And there you go! And you don't have to be afraid that any of your personal data is copied too, as the usr sub-volume is the exact version your vendor provided you with. Or in other words: there's no distinction anymore between installer images and installed systems. It's all the same. Installation becomes replication, not more. Live-CDs and installed systems can be fully identical.

Note that in this design apps are actually developed against a single, very specific runtime, that contains all libraries it can link against (including a specific glibc version!). Any library that is not included in the runtime the developer picked must be included in the app itself. This is similar to how apps on Android declare one very specific Android version they are developed against. This greatly simplifies application installation, as there's no dependency hell: each app pulls in one runtime, and the app is actually free to pick which one, as you can have multiple installed, though only one is used by each app.

Also note that operating systems built this way will never see "half-updated" systems, as is common when a system is updated using RPM/dpkg. When updating the system the code will either run the old or the new version, but it will never see part of the old files and part of the new files. This is the same for apps, runtimes, and frameworks, too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for this. This scheme relies on the ability to monopolize the vendor OS resources in /usr, which is the key of what I described in Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems a few weeks back. Then, of course, for the full desktop app concept we need a strong sandbox, that does more than just hiding files from the file system view. After all with an app concept like the above the primary interfacing between the executed desktop apps and the rest of the system is via IPC (which is why we work on kdbus and teach it all kinds of sand-boxing features), and the kernel itself. Harald Hoyer has started working on generating the btrfs send-and-receive images based on Fedora.

Getting to the full scheme will take a while. Currently we have many of the building blocks ready, but some major items are missing. For example, we push quite a few problems into btrfs, that other solutions try to solve in user space. One of them is actually signing/verification of images. The btrfs maintainers are working on adding this to the code base, but currently nothing exists. This functionality is essential though to come to a fully verified system where a trust chain exists all the way from the firmware to the apps. Also, to make the home sub-volume scheme fully workable we actually need encrypted sub-volumes, so that the sub-volume's pass-phrase can be used for authenticating users in PAM. This doesn't exist either.

Working towards this scheme is a gradual process. Many of the steps we require for this are useful outside of the grand scheme though, which means we can slowly work towards the goal, and our users can already take benefit of what we are working on as we go.

Also, and most importantly, this is not really a departure from traditional operating systems:

Each app and each OS sees a traditional Unix hierarchy with /usr, /home, /opt, /var, /etc. It executes in an environment that is pretty much identical to how it would be run on traditional systems.

There's no need to fully move to a system that uses only btrfs and follows strictly this sub-volume scheme. For example, we intend to provide implicit support for systems that are installed on ext4 or xfs, or that are put together with traditional packaging tools such as RPM or dpkg: if the user tries to install a runtime/app/framework/os image on a system that doesn't use btrfs so far, it can just create a loop-back btrfs image in /var, and push the data into that. Even us developers will run our stuff like this for a while, after all this new scheme is not particularly useful for highly individualized systems, and we developers usually tend to run systems like that.

Also note that this is in no way a departure from packaging systems like RPM or DEB. Even if the new scheme we propose is used for installing and updating a specific system, it is RPM/DEB that is used to put together the vendor OS tree initially. Hence, even in this scheme RPM/DEB are highly relevant, though not strictly as an end-user tool anymore, but as a build tool.

So Let's Summarize Again What We Propose

Final Words

I'll be talking about this at LinuxCon Europe in October. I originally intended to discuss this at the Linux Plumbers Conference (which I assumed was the right forum for this kind of major plumbing level improvement), and at linux.conf.au, but there was no interest in my session submissions there...

Of course this is all work in progress. These are the ideas we are currently working towards. As we make progress we will likely change a number of things. For example, the precise naming of the sub-volumes might look very different in the end.

Of course, we are developers of the systemd project. Implementing this scheme is not just a job for the systemd developers, though. This is a reinvention of how distributions work, and hence needs great support from the distributions. We really hope we can trigger some interest by publishing this proposal now and get the distributions on board. This, after all, is explicitly not supposed to be a solution for one specific project and one specific vendor product; we care about making this open and solving it for the generic case, without cutting corners.

If you have any questions about this, you know how you can reach us (IRC, mail, G+, ...).

The future is going to be awesome!

31 Aug 2014 10:00pm GMT

29 Aug 2014

feedplanet.freedesktop.org

Daniel Vetter: Review Training Slides

We currently have a large influx of new people contributing to i915 - for the curious, just check the git logs. As part of ramping them up I've done a few trainings about upstream review, and a bunch of people I've talked with at KS in Chicago were interested in that, too. So I've cleaned up the slides a bit and dropped the very few references to Intel-internal resources. There are no speaker notes or video recording, but I think this is useful all by itself. And of course if you have comments or see big gaps - feedback is very much welcome:

Upstream Review Training Slides

29 Aug 2014 4:14pm GMT

21 Aug 2014

feedplanet.freedesktop.org

Eric Anholt: X with glamor on vc4

Today I finally got X up on my vc4 driver using glamor. As you can see, there are a bunch of visual issues, and what you can't see is that after a few frames of those gears the hardware locked up and didn't come back. It's still major progress.

[Photo: 2014-08-21 16.16.37]

The code can be found in my vc4 branch of mesa and linux-2.6, and the glamor branch of my xf86-video-modesetting. I think the driver's at the point now that someone else could potentially participate. I've intentionally left a bunch of easy problems -- things like supporting the SCS, DST, DPH, and XPD opcodes, for which we have piglit tests (in glean) and are just a matter of translating the math from TGSI's vec4 instruction set (documented in tgsi.rst) to the scalar QIR opcodes.

21 Aug 2014 11:58pm GMT

19 Aug 2014

feedplanet.freedesktop.org

Julien Danjou: Tracking OpenStack contributions in GitHub

I've switched my Git repositories to GitHub recently, and started to watch my contributions statistics, which were very low considering I spend my days hacking on open source software, especially OpenStack.

OpenStack hosts its Git repositories on its own infrastructure at git.openstack.org, but also mirrors them on GitHub. Logically, I was expecting GitHub to track my commits there too, as I'm using the same email address everywhere.

It turns out that this was not the case, and the GitHub help page on the subject describes the rules in place for computing statistics. Indeed, according to GitHub, I had no relationship to the OpenStack repositories, as I never forked them nor opened a pull request against them (OpenStack uses Gerrit).

Starring a repository is enough to build a relationship between a user and a repository, so this was the only thing needed to inform GitHub that I have contributed to those repositories. Considering OpenStack has hundreds of repositories, I decided to star them all using a small Python script based on pygithub.
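
A minimal sketch of such a script (not necessarily the exact one I used) could look like this; the access token is a placeholder and the organization list is an assumption to adjust to your needs, since OpenStack also uses several other GitHub organizations:

    from github import Github

    # Authenticate with a personal access token (placeholder value).
    gh = Github("your-github-token")
    me = gh.get_user()

    # Star every repository of the organization(s) you contribute to.
    for org_name in ["openstack"]:
        org = gh.get_organization(org_name)
        for repo in org.get_repos():
            me.add_to_starred(repo)
            print("Starred %s" % repo.full_name)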

And voilà, my statistics are now including all my contributions to OpenStack!

19 Aug 2014 5:00pm GMT

18 Aug 2014

feedplanet.freedesktop.org

Julien Danjou: OpenStack Ceilometer and the Gnocchi experiment

A little more than 2 years ago, the Ceilometer project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms for functionalities such as billing, alarming or capacity planning.

In this article, I would like to relate what I've been doing with other Ceilometer developers over the last 5 months. I've lowered my direct involvement in Ceilometer itself to concentrate on solving one of its biggest issues at the source, and I think it's high time to take a break and talk about it.

Ceilometer early design

Over the last few years, Ceilometer's core architecture didn't change. Without diving too much into all its parts, one of the early design decisions was to build the metering around a data structure we called samples. A sample is generated each time Ceilometer measures something. It is composed of a few fields, such as the resource id that is metered, the user and project id owning that resource, the meter name, the measured value, a timestamp and some free-form metadata. Each time Ceilometer measures something, one of its components (an agent, a pollster…) constructs and emits a sample headed for the storage component that we call the collector.
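
To give a rough idea of the shape of this data structure, here is a purely illustrative sketch in Python; the field names follow the description above rather than Ceilometer's exact internal classes:

    from datetime import datetime, timezone

    # One sample: one measurement of one meter on one resource.
    sample = {
        "resource_id": "instance-uuid",        # the resource being metered
        "user_id": "user-uuid",                # owner of the resource
        "project_id": "project-uuid",
        "meter_name": "cpu_util",              # what is being measured
        "value": 42.0,                         # the measured value
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metadata": {"flavor": "m1.small"},    # free-form, non-indexed key/values
    }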

This collector is responsible for storing the samples into a database. The Ceilometer collector uses a pluggable storage system, meaning that you can pick any database system you prefer. The original implementation was based on MongoDB, but we then added a SQL driver, and people contributed support for things such as HBase or DB2.

The REST API exposed by Ceilometer allows executing various read requests on this data store. It can return the list of resources that have been measured for a particular project, or compute statistics on metrics. Allowing such a large panel of possibilities with such a flexible data structure makes it possible to do a lot of different things with Ceilometer, as you can query the data in almost any way you want.

The scalability issue

We soon started to encounter scalability issues in many of the read requests made via the REST API. A lot of the requests require the data store to do full scans of all the stored samples. Indeed, the fact that the API allows you to filter on any field and also on the free-form metadata (meaning non-indexed key/value tuples) has a terrible cost in terms of performance (as pointed out before, the metadata is attached to each sample generated by Ceilometer and stored as is). That basically means that in most drivers the sample data structure is stored in just one table or collection, in order to be able to scan it all at once, and there's no good "perfect" sharding solution, making data storage scalability painful.

It turns out that the Ceilometer REST API is unable to handle most of the requests in a timely manner, as most operations are O(n) where n is the number of samples recorded (see big O notation if you're unfamiliar with it). That number of samples can grow very rapidly in an environment of thousands of metered nodes with a data retention of several weeks. There are fortunately a few optimizations that make things smoother in the general case, but as soon as you run specific queries, the API gets barely usable.

During this last year, as the Ceilometer PTL, I discovered these issues first hand, since a lot of people were giving me this kind of feedback. We started several blueprints to improve the situation, but it was soon clear to me that this was not going to be enough anyway.

Thinking outside the box

Unfortunately, the PTL job doesn't leave enough time to work on the actual code, nor to play with anything new. I was coping with most of the project's bureaucracy and wasn't able to work on any good solution to tackle the issue at its root. Still, I had a few ideas that I wanted to try, and as soon as I stepped down from the PTL role, I stopped working on Ceilometer itself to try something new and think a bit outside the box.

When one takes a look at what has been brought into Ceilometer recently, one can see that Ceilometer actually needs to handle two types of data: events and metrics.

Events are data generated when something happens: an instance starts, a volume is attached, or an HTTP request is sent to a REST API server. These are events that Ceilometer needs to collect and store. Most OpenStack components are able to send such events using the notification system built into oslo.messaging.

Metrics are what Ceilometer needs to store but that are not necessarily tied to an event. Think about an instance's CPU usage, a router's network bandwidth usage, the number of images that Glance is storing for you, etc… These are not events, since nothing is happening; these are facts, states we need to meter.

Computing statistics for billing or capacity planning requires both of these data sources, but they should be distinct. Based on that assumption, and the fact that Ceilometer was getting support for storing events, I started to focus on getting the metric part right.

I had been a system administrator for a decade before jumping into OpenStack development, so I know a thing or two about how monitoring is done in this area, and what kind of technology operators rely on. I also know that there's still no silver bullet - this made it a good challenge.

The first thing that came to my mind was to use some kind of time-series database, and expose access to it via a REST API - as we do in all OpenStack services. This should cover the metric storage pretty well.

Cooking Gnocchi

A cloud of gnocchis!

At the end of April 2014, this led me to start a new project code-named Gnocchi. For the record, the name was picked after I had confused the OpenStack Marconi project so many times, reading OpenStack Macaroni instead. At least one OpenStack project should have a "pasta" name, right?

The point of starting a new project rather than sending patches to Ceilometer was that, first, I had no clue if it was going to produce something any better, and second, I wanted to be able to iterate more rapidly without being strongly coupled to the release process.

The first prototype started around the following idea: what you want is to meter things. That means storing a list of (timestamp, value) tuples for them. I've named these things "entities", as no assumptions are made about what they are. An entity can represent the temperature in a room or the CPU usage of an instance. The service shouldn't care and should be agnostic in this regard.

One feature that we had discussed over several OpenStack summits in the Ceilometer sessions was the idea of doing aggregation: aggregating samples over a period of time so that only a smaller number of them needs to be stored. This is something time-series formats such as RRDtool have been doing on the fly for a long time, and I decided it was a good trail to follow.

I assumed that this was going to be a requirement when storing metrics into Gnocchi. The user would need to specify what kind of archiving is needed: 1-second precision over a day, 1-hour precision over a year, or even both.
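
As a purely illustrative sketch of these two concepts - entities and their archive policies - here is roughly what the data boils down to; the names below are assumptions for illustration, not Gnocchi's actual API:

    # An archive policy: how precisely and for how long to keep aggregates.
    archive_policy = [
        {"granularity": 1, "points": 86400},    # 1-second precision over a day
        {"granularity": 3600, "points": 8760},  # 1-hour precision over a year
    ]

    # An entity: a named thing whose only data is (timestamp, value) pairs.
    entity = {
        "name": "some-instance-cpu-util",
        "archive_policy": archive_policy,
        "measures": [
            ("2014-08-18T15:00:00Z", 12.5),
            ("2014-08-18T15:00:01Z", 13.1),
        ],
    }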

The first driver written to achieve that and store those metrics inside Gnocchi was based on whisper. Whisper is the file format used to store metrics for the Graphite project. For the actual storage, the driver uses Swift, which has the advantages of being part of OpenStack and scalable.

Storing the metrics for each entity in a separate whisper file and putting them in Swift turned out to have a fantastic algorithmic complexity: it was O(1). Indeed, the complexity needed to store and retrieve metrics doesn't depend on the number of metrics you have, nor on the number of things you are metering. That is already a huge win compared to the current Ceilometer collector design.

However, it turned out that whisper has a few limitations that I was unable to circumvent in any manner. I needed to patch it to remove a lot of its assumptions about manipulating files, or about everything being relative to now (time.time()). I started to hack on that in my own fork, but… then everything broke. The whisper project code base is, well, not the state of the art, and has 0 unit tests. I was staring at a huge effort to transform whisper into the time-series format I wanted, without being sure I wasn't going to break everything (remember, no test coverage).

I decided to take a break and look into alternatives, and stumbled upon Pandas, a data manipulation and statistics library for Python. It turns out that Pandas supports time series natively, and that it could do a lot of the smart computation needed in Gnocchi. I built a new file format leveraging Pandas for computing the time series and named it carbonara (a wink to both the Carbon project and pasta, how clever!). The code is quite small (a third of whisper's, 200 SLOC vs 600 SLOC), does not have many of whisper's limitations and… it has test coverage. These Carbonara files are then, in the same fashion, stored into Swift containers.
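
The kind of on-the-fly aggregation Carbonara delegates to Pandas can be sketched in a few lines; this is just an illustration of the underlying idea, not the Carbonara file format itself:

    import pandas as pd

    # Raw measures as a Pandas time series.
    timestamps = pd.to_datetime([
        "2014-08-18 15:00:00", "2014-08-18 15:00:20",
        "2014-08-18 15:00:40", "2014-08-18 15:01:10",
    ])
    series = pd.Series([12.5, 13.1, 11.8, 14.2], index=timestamps)

    # Downsample to one-minute means, as a 60-second granularity archive
    # definition would require.
    aggregated = series.resample("60s").mean()
    print(aggregated)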

Anyway, Gnocchi's storage driver system is designed in the same spirit as the rest of the OpenStack and Ceilometer storage driver systems. It's a plug-in system with an API, so anyone can write their own driver. Eoghan Glynn has already started to write an InfluxDB driver, working closely with the upstream developers of that database. Dina Belova started to write an OpenTSDB driver. This helps to make sure the API is designed the right way from the start.

Handling resources

Measuring individual entities is great and needed, but you also need to link them with resources. When measuring the temperature and the number of people in a room, it is useful to link these two separate entities to a resource - in that case the room - and to give a name to these relations, so one is able to identify what attribute of the resource is actually being measured. It is also important to be able to store attributes on these resources, such as their owners, the time they started and ended their existence, etc.

Relationship of entities and resources

Once this list of resources is collected, the next step is to list and filter them, based on any criteria. One might want to retrieve the list of resources created last week, or the list of instances hosted on a particular node right now.

Resources also need to be specialized. Some resources have attributes that must be stored in order for filtering to be useful. Think about an instance name or a router network.

All of these requirements led to the design of what's called the indexer. The indexer is responsible for indexing entities and resources, and for linking them together. The initial implementation is based on SQLAlchemy and should be pretty efficient. It's easy enough to index the most requested attributes (columns), and they are also correctly typed.

We plan to establish a model for all known OpenStack resources (instances, volumes, networks, …) to store and index them into the Gnocchi indexer in order to request them in an efficient way from one place. The generic resource class can be used to handle generic resources that are not tied to OpenStack. It'd be up to the users to store extra attributes.
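A very rough sketch of what such an SQLAlchemy-based indexer schema could look like follows; the table and column names here are illustrative assumptions, not the actual Gnocchi model:

    from sqlalchemy import Column, DateTime, ForeignKey, String, create_engine
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class Resource(Base):
        __tablename__ = "resource"
        id = Column(String(36), primary_key=True)
        type = Column(String(32), index=True)        # "instance", "generic", ...
        user_id = Column(String(36), index=True)
        project_id = Column(String(36), index=True)
        started_at = Column(DateTime, index=True)
        ended_at = Column(DateTime, index=True)
        entities = relationship("Entity", back_populates="resource")

    class Entity(Base):
        __tablename__ = "entity"
        id = Column(String(36), primary_key=True)
        name = Column(String(64))                    # e.g. "cpu_util", "temperature"
        resource_id = Column(String(36), ForeignKey("resource.id"))
        resource = relationship("Resource", back_populates="entities")

    # Typed, indexed columns keep filtering efficient, unlike free-form metadata.
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)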

Dropping the free form metadata we used to have in Ceilometer makes sure that querying the indexer is going to be efficient and scalable.

The indexer classes and their relations

REST API

All of this is exposed via a REST API that was partially designed and documented in the Gnocchi specification in the Ceilometer repository, though the spec is not up to date yet. We plan to auto-generate the documentation from the code, as we are currently doing in Ceilometer.

The REST API is pretty easy to use, and you can use it to manipulate entities and resources, and request the information back.
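
To give a feel for it, here is a hypothetical interaction using the Python requests library; the endpoint paths, port and payloads are invented for illustration and may not match the actual Gnocchi API:

    import requests

    GNOCCHI = "http://localhost:8041/v1"    # hypothetical endpoint

    # Create an entity with an archive policy.
    entity = requests.post(GNOCCHI + "/entity", json={
        "archive_policy": [{"granularity": 60, "points": 1440}],
    }).json()

    # Push a few measures to it.
    requests.post(GNOCCHI + "/entity/%s/measures" % entity["id"], json=[
        {"timestamp": "2014-08-18T15:00:00", "value": 12.5},
        {"timestamp": "2014-08-18T15:01:00", "value": 13.1},
    ])

    # Read the aggregated measures back.
    print(requests.get(GNOCCHI + "/entity/%s/measures" % entity["id"]).json())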

Macroscopic view of the Gnocchi architecture

Roadmap & Ceilometer integration

This whole plan was presented to and discussed with the Ceilometer team during the last OpenStack summit in Atlanta in May 2014, for the Juno release. I led a session about this entire concept, and convinced the team that using Gnocchi for our metric storage would be a good approach to solving the Ceilometer collector's scalability issue.

It was decided to conduct this experiment in parallel with the current Ceilometer collector for the time being, and see where that would lead the project.

Early benchmarks

Some engineers from Mirantis did a few benchmarks around Ceilometer and also against an early version of Gnocchi, and Dina Belova presented them to us during the mid-cycle sprint we organized in Paris in early July.

The following graph sums up the current Ceilometer performance issue pretty well: the more metrics you feed it, the slower it becomes.

For Gnocchi, while the numbers themselves are not fantastic, what is interesting is that all the graphs below show that performance is stable, with no correlation to the number of resources, entities or measures. This shows that, indeed, most of the code is built around a complexity of O(1), and not O(n) anymore.

Next steps

Clément drawing the logo

While the Juno cycle is being wrapped up for most projects, including Ceilometer, Gnocchi development is still ongoing. Fortunately, the composite architecture of Ceilometer allows a lot of its features to be replaced by other code dynamically. That, for example, enables Gnocchi to provide a Ceilometer dispatcher plugin for its collector, without having to ship the actual code in Ceilometer itself. That should keep the development of Gnocchi from being slowed down by the release process for now.

The Ceilometer team aims to provide Gnocchi as a sort of technology preview with the Juno release, allowing it to be deployed alongside and plugged into Ceilometer. We'll probably discuss how to integrate it into the project in a more permanent and stronger manner during the OpenStack Summit for Kilo that will take place next November in Paris.

18 Aug 2014 3:00pm GMT

Christian Schaller: Want to join the Red Hat Graphics team?

We have an opening in our Graphics Team to work on improving the state of open source GPU drivers. Your tasks would include working on various types of hardware, making sure they work great under Linux, and improving the general state of the Linux graphics stack. Since the work includes working on some specific pieces of hardware, the candidate would have to relocate to our Westford office, just north of Boston.

We are open to candidates with a range of backgrounds, but of course previous history with the Linux kernel codebase, the X.org codebase or Wayland is an advantage.

Please contact me at cschalle-at-redhat-com if you are interested.

18 Aug 2014 11:26am GMT

16 Aug 2014

feedplanet.freedesktop.org

Matthias Klumpp: AppStream/DEP-11 Debian progress

There hasn't been a progress report on DEP-11 for some time, but that doesn't mean no work has been going on.

DEP-11 is Debian's implementation of AppStream, as well as an effort to enhance the metadata available about software in Debian. While AppStream was initially only about applications, DEP-11 was designed with a larger scope, to collect data about libraries, binaries and things like Python modules. Since AppStream 0.6, DEP-11 and AppStream have essentially the same scope, with the difference that DEP-11 metadata is described in YAML, while official AppStream data is XML. That was due to a request by our ftpmasters team, which doesn't like XML (which, as opposed to YAML, is also not used anywhere else in Debian). But this doesn't mean that people will have to deal with the YAML file format: the libappstream library will just take DEP-11 data as another data source for its Xapian database, allowing anything using libappstream to access that data just like the XML data. Richard's libappstream-glib will also receive support for the DEP-11 format soon, filling its in-memory data cache and enabling the use of GNOME Software on Debian.

So, what has been done so far? Over the past months, my Google Summer of Code student, Abhishek Bhattacharjee, was working hard to integrate DEP-11 support into dak, the Debian Archive Kit, which maintains the whole Debian archive. The result will be an additional metadata table in our internal PostgreSQL database, storing detailed information about the software available in a Debian package, as well as "Components-<arch>.yml.gz" files in the Debian repositories. Dak will also produce an application icon cache and a screenshots repository. During the SoC, Abhishek focused mainly on the applications part of things, and less on the other components (like extracting data about Python modules or libraries) - these things can easily be implemented later.
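
To give a rough idea of what a component entry in such a file might look like, here is a tiny, hand-written sketch read with Python; the field names are assumptions loosely based on the AppStream spec, not necessarily the final DEP-11 format produced by dak:

    import yaml

    # Illustrative DEP-11-style component entry (field names are assumptions).
    components_yml = """
    Type: desktop-app
    ID: gnome-software.desktop
    Package: gnome-software
    Name:
      C: Software
    Summary:
      C: Application manager for GNOME
    """

    component = yaml.safe_load(components_yml)
    print(component["Package"], "-", component["Summary"]["C"])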

The remaining steps will be to polish the code and make it merge-ready for Debian's dak (as soon as it has received enough testing, we will likely give it a try on the Tanglu Debian derivative). Following that, Apt will be extended to fetch the DEP-11 data on demand on systems where it is useful (currently mostly desktop systems) - if you want to save a little bit of space, you will be able to disable downloading this extra metadata in Apt. From there, libappstream will take the data for its Xapian db. This will lead to the removal of the much-hated (from the ftpmasters' and maintainers' side) app-install-data package, which has not been updated for two years and only contains a small fraction of the metadata provided by DEP-11.

What Debian will ultimately gain from this effort is support for software centers like GNOME Software, and improved support for tools like Apper and Muon in displaying applications. Long-term, with more metadata being available, it would be cool to add support for it to "specialized package managers", like Python's pip, npm or gem, to make them fetch information about available distribution software and install that instead of their own copies from 3rd-party repositories, where possible. This should ultimately lead to less code duplication across distributions and will likely result in fewer security issues, since the officially maintained and integrated distribution packages can easily be used where possible. This is no attempt to make tools like pip obsolete, but an attempt to have the different tools installing software on your machine communicate better, instead of creating parallel worlds in terms of software management. Another nice side effect of more metadata will be the option to search the software repos for tools handling a given mimetype (in case you can't open a file), smart software centers installing missing firmware, and automatic suggestions for developers on which software they need to install in order to build a specific software package. Also, the data allows us to match software across distributions; for that, I will have some news soon (not sure how soon though, as I am currently in thesis-writing mode, and therefore don't have that much spare time). Since the goal is to have these features available on all distributions supporting AppStream, it will take longer to realize - but we are on a good way.

So, if you want some more information about my student's awesome work, you can read his blogpost about it. He will also be at Debconf'14 (Portland). (I can't make it this time, but I surely won't miss the next Debconf)

Sadly, I only see a very small chance to have the basic DEP-11 stuff land in time for Jessie (lots of review work needs to be done, and some more code needs to be written), but we will definitely have it in Jessie+1.

A small example of how this data will look can be found here - a larger, real file is available here. Any questions and feedback are highly appreciated.

16 Aug 2014 2:50pm GMT