26 Aug 2014

feedPlanet Debian

Simon Josefsson: The Case for Short OpenPGP Key Validity Periods

After I moved to a new OpenPGP key (see key transition statement) I have received comments about the short life length of my new key. When I created the key (see my GnuPG setup) I set it to expire after 100 days. Some people assumed that I would have to create a new key then, and therefore wondered what value there is to sign a key that will expire in two months. It doesn't work like that, and below I will explain how OpenPGP key expiration works; how to extend the expiration time of your key; and argue why having a relatively short validity period can be a good thing.

The OpenPGP message format has a sub-packet called the Key Expiration Time, quoting the RFC:

5.2.3.6. Key Expiration Time

   (4-octet time field)

   The validity period of the key.  This is the number of seconds after
   the key creation time that the key expires.  If this is not present
   or has a value of zero, the key never expires.  This is found only on
   a self-signature.

You can print the sub-packets in your OpenPGP key with gpg --list-packets. See below an output for my key, and notice the "created 1403464490″ (which is Unix time for 2014-06-22 21:14:50) and the "subpkt 9 len 4 (key expires after 100d0h0m)" which adds up to an expiration on 2014-09-26. Don't confuse the creation time of the key ("created 1403464321″) with when the signature was created ("created 1403464490″).

jas@latte:~$ gpg --export 54265e8c | gpg --list-packets |head -20
:public key packet:
        version 4, algo 1, created 1403464321, expires 0
        pkey[0]: [3744 bits]
        pkey[1]: [17 bits]
:user ID packet: "Simon Josefsson "
:signature packet: algo 1, keyid 0664A76954265E8C
        version 4, created 1403464490, md5len 0, sigclass 0x13
        digest algo 10, begin of digest be 8e
        hashed subpkt 27 len 1 (key flags: 03)
        hashed subpkt 9 len 4 (key expires after 100d0h0m)
        hashed subpkt 11 len 7 (pref-sym-algos: 9 8 7 13 12 11 10)
        hashed subpkt 21 len 4 (pref-hash-algos: 10 9 8 11)
        hashed subpkt 30 len 1 (features: 01)
        hashed subpkt 23 len 1 (key server preferences: 80)
        hashed subpkt 2 len 4 (sig created 2014-06-22)
        hashed subpkt 25 len 1 (primary user ID)
        subpkt 16 len 8 (issuer key ID 0664A76954265E8C)
        data: [3743 bits]
:signature packet: algo 1, keyid EDA21E94B565716F
        version 4, created 1403466403, md5len 0, sigclass 0x10
jas@latte:~$ 

So the key will simply stop being valid after that time? No. It is possible to update the key expiration time value, re-sign the key, and distribute the key to people you communicate with directly or indirectly to OpenPGP keyservers. Since that date is a couple of weeks away, now felt like the perfect opportunity to go through the exercise of taking out my offline master key and boot from a Debian LiveCD and extend its expiry time. See my earlier writeup for LiveCD and USB stick conventions.

user@debian:~$ export GNUPGHOME=/media/FA21-AE97/gnupghome
user@debian:~$ gpg --edit-key 54265e8c
gpg (GnuPG) 1.4.12; Copyright (C) 2012 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Secret key is available.

pub  3744R/54265E8C  created: 2014-06-22  expires: 2014-09-30  usage: SC  
                     trust: ultimate      validity: ultimate
sub  2048R/32F8119D  created: 2014-06-22  expires: 2014-09-30  usage: S   
sub  2048R/78ECD86B  created: 2014-06-22  expires: 2014-09-30  usage: E   
sub  2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> expire
Changing expiration time for the primary key.
Please specify how long the key should be valid.
         0 = key does not expire
        = key expires in n days
      w = key expires in n weeks
      m = key expires in n months
      y = key expires in n years
Key is valid for? (0) 150
Key expires at Fri 23 Jan 2015 02:47:48 PM UTC
Is this correct? (y/N) y

You need a passphrase to unlock the secret key for
user: "Simon Josefsson "
3744-bit RSA key, ID 54265E8C, created 2014-06-22


pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub  2048R/32F8119D  created: 2014-06-22  expires: 2014-09-30  usage: S   
sub  2048R/78ECD86B  created: 2014-06-22  expires: 2014-09-30  usage: E   
sub  2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> key 1

pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub* 2048R/32F8119D  created: 2014-06-22  expires: 2014-09-30  usage: S   
sub  2048R/78ECD86B  created: 2014-06-22  expires: 2014-09-30  usage: E   
sub  2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> expire
Changing expiration time for a subkey.
Please specify how long the key should be valid.
         0 = key does not expire
        = key expires in n days
      w = key expires in n weeks
      m = key expires in n months
      y = key expires in n years
Key is valid for? (0) 150
Key expires at Fri 23 Jan 2015 02:48:05 PM UTC
Is this correct? (y/N) y

You need a passphrase to unlock the secret key for
user: "Simon Josefsson "
3744-bit RSA key, ID 54265E8C, created 2014-06-22


pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub* 2048R/32F8119D  created: 2014-06-22  expires: 2015-01-23  usage: S   
sub  2048R/78ECD86B  created: 2014-06-22  expires: 2014-09-30  usage: E   
sub  2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> key 2

pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub* 2048R/32F8119D  created: 2014-06-22  expires: 2015-01-23  usage: S   
sub* 2048R/78ECD86B  created: 2014-06-22  expires: 2014-09-30  usage: E   
sub  2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> key 1

pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub  2048R/32F8119D  created: 2014-06-22  expires: 2015-01-23  usage: S   
sub* 2048R/78ECD86B  created: 2014-06-22  expires: 2014-09-30  usage: E   
sub  2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> expire
Changing expiration time for a subkey.
Please specify how long the key should be valid.
         0 = key does not expire
        = key expires in n days
      w = key expires in n weeks
      m = key expires in n months
      y = key expires in n years
Key is valid for? (0) 150
Key expires at Fri 23 Jan 2015 02:48:14 PM UTC
Is this correct? (y/N) y

You need a passphrase to unlock the secret key for
user: "Simon Josefsson "
3744-bit RSA key, ID 54265E8C, created 2014-06-22


pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub  2048R/32F8119D  created: 2014-06-22  expires: 2015-01-23  usage: S   
sub* 2048R/78ECD86B  created: 2014-06-22  expires: 2015-01-23  usage: E   
sub  2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> key 3

pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub  2048R/32F8119D  created: 2014-06-22  expires: 2015-01-23  usage: S   
sub* 2048R/78ECD86B  created: 2014-06-22  expires: 2015-01-23  usage: E   
sub* 2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> key 2

pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub  2048R/32F8119D  created: 2014-06-22  expires: 2015-01-23  usage: S   
sub  2048R/78ECD86B  created: 2014-06-22  expires: 2015-01-23  usage: E   
sub* 2048R/36BA8F9B  created: 2014-06-22  expires: 2014-09-30  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> expire
Changing expiration time for a subkey.
Please specify how long the key should be valid.
         0 = key does not expire
        = key expires in n days
      w = key expires in n weeks
      m = key expires in n months
      y = key expires in n years
Key is valid for? (0) 150
Key expires at Fri 23 Jan 2015 02:48:23 PM UTC
Is this correct? (y/N) y

You need a passphrase to unlock the secret key for
user: "Simon Josefsson "
3744-bit RSA key, ID 54265E8C, created 2014-06-22


pub  3744R/54265E8C  created: 2014-06-22  expires: 2015-01-23  usage: SC  
                     trust: ultimate      validity: ultimate
sub  2048R/32F8119D  created: 2014-06-22  expires: 2015-01-23  usage: S   
sub  2048R/78ECD86B  created: 2014-06-22  expires: 2015-01-23  usage: E   
sub* 2048R/36BA8F9B  created: 2014-06-22  expires: 2015-01-23  usage: A   
[ultimate] (1). Simon Josefsson 
[ultimate] (2)  Simon Josefsson 

gpg> save
user@debian:~$ gpg -a --export 54265e8c > /media/KINGSTON/updated-key.txt
user@debian:~$ 

I remove the "transport" USB stick from the "offline" computer, and back on my laptop I can inspect the new updated key. Let's use the same command as before. The key creation time is the same ("created 1403464321″), of course, but the signature packet has a new time ("created 1409064478″) since it was signed now. Notice "created 1409064478″ and "subpkt 9 len 4 (key expires after 214d19h35m)". The expiration time is computed based on when the key was generated, not when the signature packet was generated. You may want to double-check the pref-sym-algos, pref-hash-algos and other sub-packets so that you don't accidentally change anything else. (Btw, re-signing your key is also how you would modify those preferences over time.)

jas@latte:~$ cat /media/KINGSTON/updated-key.txt |gpg --list-packets | head -20
:public key packet:
        version 4, algo 1, created 1403464321, expires 0
        pkey[0]: [3744 bits]
        pkey[1]: [17 bits]
:user ID packet: "Simon Josefsson "
:signature packet: algo 1, keyid 0664A76954265E8C
        version 4, created 1409064478, md5len 0, sigclass 0x13
        digest algo 10, begin of digest 5c b2
        hashed subpkt 27 len 1 (key flags: 03)
        hashed subpkt 11 len 7 (pref-sym-algos: 9 8 7 13 12 11 10)
        hashed subpkt 21 len 4 (pref-hash-algos: 10 9 8 11)
        hashed subpkt 30 len 1 (features: 01)
        hashed subpkt 23 len 1 (key server preferences: 80)
        hashed subpkt 25 len 1 (primary user ID)
        hashed subpkt 2 len 4 (sig created 2014-08-26)
        hashed subpkt 9 len 4 (key expires after 214d19h35m)
        subpkt 16 len 8 (issuer key ID 0664A76954265E8C)
        data: [3744 bits]
:user ID packet: "Simon Josefsson "
:signature packet: algo 1, keyid 0664A76954265E8C
jas@latte:~$ 

Being happy with the new key, I import it and send it to key servers out there.

jas@latte:~$ gpg --import /media/KINGSTON/updated-key.txt 
gpg: key 54265E8C: "Simon Josefsson " 5 new signatures
gpg: Total number processed: 1
gpg:         new signatures: 5
jas@latte:~$ gpg --send-keys 54265e8c
gpg: sending key 54265E8C to hkp server keys.gnupg.net
jas@latte:~$ gpg --keyserver keyring.debian.org  --send-keys 54265e8c
gpg: sending key 54265E8C to hkp server keyring.debian.org
jas@latte:~$ 

Finally: why go through this hassle, rather than set the key to expire in 50 years? Some reasons for this are:

  1. I don't trust myselt to keep track of a private key (or revocation cert) for 50 years.
  2. I want people to notice my revocation certificate as quickly as possible.
  3. I want people to notice other changes to my key (e.g., cipher preferences) as quickly as possible.

Let's look into the first reason a bit more. What would happen if I lose both the master key and the revocation cert, for a key that's valid 50 years? I would start from scratch and create a new key that I upload to keyservers. Then there would be two keys out there that are valid and identify me, and both will have a set of signatures on it. None of them will be revoked. If I happen to lose the new key again, there will be three valid keys out there with signatures on it. You may argue that this shouldn't be a problem, and that nobody should use any other key than the latest one I want to be used, but that's a technical argument - and at this point we have moved into usability, and that's a trickier area. Having users select which out of a couple of apparently all valid keys that exist for me is simply not going to work well.

The second is more subtle, but considerably more important. If people retrieve my key from keyservers today, and it expires in 50 years, there will be no need to refresh it from key servers. If for some reason I have to publish my revocation certificate, there will be people that won't see it. If instead I set a short validity period, people will have to refresh my key once in a while, and will then either get an updated expiration time, or will get the revocation certificate. This amounts to a CRL/OCSP-like model.

The third is similar to the second, but deserves to be mentioned on its own. Because the cipher preferences are expressed (and signed) in my key, and that ciphers come and go, I would expect that I will modify those during the life-time of my long-term key. If I have a long validity period of my key, people would not refresh it from key servers, and would encrypt messages to me with ciphers I may no longer want to be used.

The downside of having a short validity period is that I have to do slightly more work to get out the offline master key once in a while (which I have to once in a while anyway because I'm signing other peoples keys) and that others need to refresh my key from the key servers. Can anyone identify other disadvantages? Also, having to explain why I'm using a short validity period used to be a downside, but with this writeup posted that won't be the case any more. :-)

flattr this!

26 Aug 2014 6:11pm GMT

Daniel Pocock: GSoC talks at DebConf 14 today

This year I mentored two students doing work in support of Debian and free software (as well as those I mentored for Ganglia).

Both of them are presenting details about their work at DebConf 14 today.

While Juliana's work has been widely publicised already, mainly due to the fact it is accessible to every individual DD, Andrew's work is also quite significant and creates many possibilities to advance awareness of free software.

The Java project that is not just about Java

Andrew's project is about recursively building Java dependencies from third party repositories such as the Maven Central Repository. It matches up well with the wonderful new maven-debian-helper tool in Debian and will help us to fill out /usr/share/maven-repo on every Debian system.

Firstly, this is not just about Java. On a practical level, some aspects of the project are useful for many other purposes. One of those is the aim of scanning a repository for non-free artifacts, making a Git mirror or clone containing a dfsg branch for generating repackaged upstream source and then testing to see if it still builds.

Then there is the principle of software freedom. The Maven Central repository now requires that people publish a sources JAR and license metadata with each binary artifact they upload. They do not, however, demand that the sources JAR be complete or that the binary can be built by somebody else using the published sources. The license data must be specified but it does not appeared to be verified in the same way as packages inspected by Debian's legendary FTP masters.

Thanks to the transitive dependency magic of Maven, it is quite possible that many Java applications that are officially promoted as free software can't trace the source code of every dependency or build plugin.

Many organizations are starting to become more alarmed about the risk that they are dependent upon some rogue dependency. Maybe they will be hit with a lawsuit from a vendor stating that his plugin was only free for the first 3 months. Maybe some binary dependency JAR contains a nasty trojan for harvesting data about their corporate network.

People familiar with the principles of software freedom are in the perfect position to address these concerns and Andrew's work helps us build a cleaner alternative. It obviously can't rebuild every JAR for the very reason that some of them are not really free - however, it does give the opportunity to build a heat-map of trouble spots and also create a fast track to packaging for those heirarchies of JARs that are truly free.

Making WebRTC accessible to more people

Juliana set out to update rtc.debian.org and this involved working on JSCommunicator, the HTML5/JavaScript softphone based on WebRTC.

People attending the session today or participating remotely are advised to set up your RTC / VoIP password at db.debian.org well in advance so the server will allow you to log in and try it during the session. It can take 30 minutes or so for the passwords to be replicated to the SIP proxy and TURN server.

Please also check my previous comments about what works and what doesn't and in particular, please be aware that Iceweasel / Firefox 24 on wheezy is not suitable unless you are on the same LAN as the person you are calling.

26 Aug 2014 4:33pm GMT

Christian Perrier: [life] Follow bubulle running adventures....

Just in case some of my free software friends would care and try understanding why I'm currently not attending my first DebConf since 2004...

Starting tomorrow 07:00am EST (so, 22:00 PST for Debconfers), I'll be running the "TDS" race of the Ultra-Trail du Mont-Blanc races.

Ultra-Trail du Mont-Blanc (UTMB) is one of the world famous long distance moutain trail races. It takes places in Chamonix, just below the Mont-Blanc, France's and Europe's highest moutain. The race is indeed simple : "go around the Mont-Blanc in a big circle, 160km long, with 10,000 meters positive climb cumulated on the climb of about 10 high passes between 2000 and 2700 meters altitude".

"My" race is a shortened version of UTMB that does half of the full loop, from Courmayeur in Italy (just "the other side" of Mont-Blanc, from Chamonix) and goes back to Chamonix. It is "only" 120 kilometers long with 7200 meters of positive climb. Some of these are however know as more difficult than UTMB itself.

Many firsts for me in this race : first "over 100km", first "over 24 hours running". Still, I trained hard for this, achieved a very though race in early July (60km, 5000m climb) with a very good result, and I expect to make it well.

Top runners complete this in 17 hours.....last arrivals are expected after 33 hours "running" (often fast walking, indeed). I plan to achieve the race in 28 hours but, indeed, I have no idea..:-)

So, in case you're boring in a night hacklab, or just want to draw your attention out of IRC, or don't have any package to polish...or just want to have a thought for an old friend, you can try to use the following link and follow all this live : http://utmb.livetrail.net/coureur.php?rech=6384⟨=en

Race start : 7am EST, Wednesday Aug 27th. bubulle arrival: Thursday Aug. 28th, between 10am and 4pm (best projection is 11am).

And there will be cheese at pit stops....

26 Aug 2014 6:09am GMT

Neil Williams: vmdebootstrap images for ARMMP on BBB

After patches from Petter to add foreign architecture support and picking up some scripting from freedombox, I've just built a Debian unstable image using the ARMMP kernel on Beaglebone-black.

A few changes to vmdebootstrap will need to go into the next version (0.3), including an example customise script to setup the u-boot support. With the changes, the command would be:

sudo ./vmdebootstrap --owner `whoami` --verbose --size 2G --mirror http://mirror.bytemark.co.uk/debian --log beaglebone-black.log --log-level debug --arch armhf --foreign /usr/bin/qemu-arm-static --no-extlinux --no-kernel --package u-boot --package linux-image-armmp --distribution sid --enable-dhcp --configure-apt --serial-console-command '/sbin/getty -L ttyO0 115200 vt100' --customize examples/beagleboneblack-customise.sh --bootsize 50m --boottype vfat --image bbb.img

Some of those commands are new but there are a few important elements:

With this in place, a simple dd to an SD card and the BBB boots directly into Debian ARMMP.

The examples are now in my branch and include an initial cubieboard script which is unfinished.

The current image is available for download. (222Mb).

I hope to upload the new vmdebootstrap soon - let me know if you do try the version in the branch.

26 Aug 2014 2:50am GMT

25 Aug 2014

feedPlanet Debian

Petter Reinholdtsen: Do you need an agreement with MPEG-LA to publish and broadcast H.264 video in Norway?

Two years later, I am still not sure if it is legal here in Norway to use or publish a video in H.264 or MPEG4 format edited by the commercially licensed video editors, without limiting the use to create "personal" or "non-commercial" videos or get a license agreement with MPEG LA. If one want to publish and broadcast video in a non-personal or commercial setting, it might be that those tools can not be used, or that video format can not be used, without breaking their copyright license. I am not sure. Back then, I found that the copyright license terms for Adobe Premiere and Apple Final Cut Pro both specified that one could not use the program to produce anything else without a patent license from MPEG LA. The issue is not limited to those two products, though. Other much used products like those from Avid and Sorenson Media have terms of use are similar to those from Adobe and Apple. The complicating factor making me unsure if those terms have effect in Norway or not is that the patents in question are not valid in Norway, but copyright licenses are.

These are the terms for Avid Artist Suite, according to their published end user license text (converted to lower case text for easier reading):

18.2. MPEG-4. MPEG-4 technology may be included with the software. MPEG LA, L.L.C. requires this notice:

This product is licensed under the MPEG-4 visual patent portfolio license for the personal and non-commercial use of a consumer for (i) encoding video in compliance with the MPEG-4 visual standard ("MPEG-4 video") and/or (ii) decoding MPEG-4 video that was encoded by a consumer engaged in a personal and non-commercial activity and/or was obtained from a video provider licensed by MPEG LA to provide MPEG-4 video. No license is granted or shall be implied for any other use. Additional information including that relating to promotional, internal and commercial uses and licensing may be obtained from MPEG LA, LLC. See http://www.mpegla.com. This product is licensed under the MPEG-4 systems patent portfolio license for encoding in compliance with the MPEG-4 systems standard, except that an additional license and payment of royalties are necessary for encoding in connection with (i) data stored or replicated in physical media which is paid for on a title by title basis and/or (ii) data which is paid for on a title by title basis and is transmitted to an end user for permanent storage and/or use, such additional license may be obtained from MPEG LA, LLC. See http://www.mpegla.com for additional details.

18.3. H.264/AVC. H.264/AVC technology may be included with the software. MPEG LA, L.L.C. requires this notice:

This product is licensed under the AVC patent portfolio license for the personal use of a consumer or other uses in which it does not receive remuneration to (i) encode video in compliance with the AVC standard ("AVC video") and/or (ii) decode AVC video that was encoded by a consumer engaged in a personal activity and/or was obtained from a video provider licensed to provide AVC video. No license is granted or shall be implied for any other use. Additional information may be obtained from MPEG LA, L.L.C. See http://www.mpegla.com.

Note the requirement that the videos created can only be used for personal or non-commercial purposes.

The Sorenson Media software have similar terms:

With respect to a license from Sorenson pertaining to MPEG-4 Video Decoders and/or Encoders: Any such product is licensed under the MPEG-4 visual patent portfolio license for the personal and non-commercial use of a consumer for (i) encoding video in compliance with the MPEG-4 visual standard ("MPEG-4 video") and/or (ii) decoding MPEG-4 video that was encoded by a consumer engaged in a personal and non-commercial activity and/or was obtained from a video provider licensed by MPEG LA to provide MPEG-4 video. No license is granted or shall be implied for any other use. Additional information including that relating to promotional, internal and commercial uses and licensing may be obtained from MPEG LA, LLC. See http://www.mpegla.com.

With respect to a license from Sorenson pertaining to MPEG-4 Consumer Recorded Data Encoder, MPEG-4 Systems Internet Data Encoder, MPEG-4 Mobile Data Encoder, and/or MPEG-4 Unique Use Encoder: Any such product is licensed under the MPEG-4 systems patent portfolio license for encoding in compliance with the MPEG-4 systems standard, except that an additional license and payment of royalties are necessary for encoding in connection with (i) data stored or replicated in physical media which is paid for on a title by title basis and/or (ii) data which is paid for on a title by title basis and is transmitted to an end user for permanent storage and/or use. Such additional license may be obtained from MPEG LA, LLC. See http://www.mpegla.com for additional details.

Some free software like Handbrake and FFMPEG uses GPL/LGPL licenses and do not have any such terms included, so for those, there is no requirement to limit the use to personal and non-commercial.

25 Aug 2014 8:10pm GMT

Erich Schubert: Analyzing Twitter - beware of spam

This year I started to widen up my research; and one data source of interest was text because of the lack of structure in it, that makes it often challenging. One of the data sources that everybody seems to use is Twitter: it has a nice API, and few restrictions on using it (except on resharing data). By default, you can get a 1% random sample from all tweets, which is more than enough for many use cases.
We've had some exciting results which a colleague of mine will be presenting tomorrow (Tuesday, Research 22: Topic Modeling) at the KDD 2014 conference:

SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
Erich Schubert, Michael Weiler, Hans-Peter Kriegel
20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

You can also explore some (static!) results online at signi-trend.appspot.com

In our experiments, the "news" data set was more interesting. But after some work, we were able to get reasonable results out of Twitter as well. As you can see from the online demo, most of these fall into pop culture: celebrity deaths, sports, hip-hop. Not much that would change our live; and even less that wasn't before captured by traditional media.
The focus of this post is on the preprocessing needed for getting good results from Twitter. Because it is much easier to get bad results!
The first thing you need to realize about Twitter is that due to the media attention/hype it gets, it is full of spam. I'm pretty sure the engineers at Twitter already try to reduce spam; block hosts and fraud apps. But a lot of the trending topics we discovered were nothing but spam.
Retweets - the "like" of Twitter - are an easy source to see what is popular, but are not very interesting if you want to analyze text. They just reiterate the exact same text (except for a "RT " prefix) than earlier tweets. We found results to be more interesting if we removed retweets. Our theory is that retweeting requires much less effort than writing a real tweet; and things that are trending "with effort" are more interesting than those that were just liked.
Teenie spam. If you ever searched for a teenie idol on Twitter, say this guy I hadn't heard of before, but who has 3.88 million followers on Twitter, and search for Tweets addressed to him, you will get millions over millions of results. Many of these tweets look like this:

@5SOS @Luke5SOS @Ashton5SOS @Michael5SOS @Calum5SOS ♥ CAN YOU FOLLOW MЕ? ♥ It would mean the world to me! ♥ Make my dream come true!🙏 x804

- ♕ Chantal ♕ (@Chantal_San22) 25. August 2014
Now if you look at this tweet, there is this odd "x804" at the end. This is to defeat a simple spam filter by Twitter. Because this user did not tweet this just once: instead it is common amongst teenie to spam their idols with follow requests by the dozen. Probably using some JavaScript hack, or third party Twitter client. Occassionally, you see hundreds of such tweets, each sent within a few seconds of the previous one. If you get a 1% sample of these, you still get a few then...
Even worse (for data analysis) than teenie spammers are commercial spammers and wannabe "hackers" that exercise their "sk1llz" by spamming Twitter. To get a sample of such spam, just search for weight loss on Twitter. There is plenty of fresh spam there, usually consisting of some text pretending to be news, and an anonymized link (there is no need to use an URL shortener such as bit.ly on Twitter, since Twitter has its own URL shortener t.co; and you'll end up with double-shortened URLs). And the hacker spam is even worse (e.g. #alvianbencifa) as he seems to have trojaned hundreds of users, and his advertisement seems to be a nonexistant hash tag, which he tries to get into Twitters "trending topics".
And then there are the bots. Plenty of bots spam Twitter with their analysis of trending topics, reinforcing the trending topics. In my opinion, bots such as trending topics indonesia are useless. No wonder there are only 280 followers. And of the trending topics reported, most of them seem to be spam topics...

Bottom line: if you plan on analyzing Twitter data, spend considerable time on preprocessing to filter out spam of various kind. For example, we remove singletons and digits, then feed the data through a duplicate detector. We end up discarding 20%-25% of Tweets. But we still get some of the spam, such as that hackers spam.
All in all, real data is just messy. People posting there have an agenda that might be opposite to yours. And if someone (or some company) promises you wonders from "Big data" and "Twitter", you better have them demonstrate their use first, before buying their services. Don't trust visions of what could be possible, because the first rule of data analysis is: Garbage in, garbage out.

25 Aug 2014 11:30am GMT

Steve Kemp: Updates on git-hosting and load-balancing

To round up the discussion of the Debian Administration site yesterday I flipped the switch on the load-balancing. Rather than this:

  https -> pound \
                  \
  http  -------------> varnish  --> apache

We now have the simpler route for all requests:

http  -> haproxy -> apache
https -> haproxy -> apache

This means we have one less HTTP-request for all incoming secure connections, and these days secure connections are preferred since a Strict-Transport-Security header is set.

In other news I've been juggling git repositories; I've setup an installation of GitBucket on my git-host. My personal git repository used to contain some private repositories and some mirrors.

Now it contains mirrors of most things on github, as well as many more private repositories.

The main reason for the switch was to get a prettier interface and bug-tracker support.

A side-benefit is that I can use "groups" to organize repositories, so for example:

Most of those are mirrors of the github repositories, but some are new. When signed in I see more sources, for example the source to http://steve.org.uk.

I've been pleased with the setup and performance, though I had to add some caching and some other magic at the nginx level to provide /robots.txt, etc, which are not otherwise present.

I'm not abandoning github, but I will no longer be using it for private repositories (I was gifted a free subscription a year or three ago), and nor will I post things there exclusively.

If a single canonical source location is required for a repository it will be one that I control, maintain, and host.

I don't expect I'll give people commit access on this mirror, but it is certainly possible. In the past I've certainly given people access to private repositories for collaboration, etc.

25 Aug 2014 8:16am GMT

Hideki Yamane: Could you try to consider speaking more slowly and clearly at sessions, please?


Some people (including me :) are not native English speaking person, and also not use English for usual conversation. So, it's a bit tough for them to hear what you said if you speak as usual speed. We want to listen your presentation to understand and discuss about it (of course!), but sometimes machine gun speaking would prevent it.

Calm down, take a deep breath and do your presentation - then it'll be a fantastic, my cat will be pleased with it as below (meow!).



Thank you for your reading. See you in cheese & wine party.

25 Aug 2014 4:08am GMT

24 Aug 2014

feedPlanet Debian

DebConf team: Full video coverage for DebConf14 talks (Posted by Tiago Bortoletto Vaz)

We are happy to announce that live video streams will be available for talks and discussion meetings in DebConf14. Recordings will be posted soon after the events. You can also interact with other local and remote attendees by joining the IRC channels which are listed at the streams page.

For people who want to view the streams outside a webbrowser, the page for each room lists direct links to the streams.

More information on the streams and the various possibilities offered is available at DebConf Videostreams.

The schedule of talks is available at DebConf 14 Schedule.

Thanks to our amazing video volunteers for making it possible. If you like the video coverage, please add a thank you note to VideoTeam Thanks

24 Aug 2014 8:25pm GMT

Noah Meyerhans: Debconf by train

Today is the first time I've taken an interstate train trip in something like 15 years. A few things about the trip were pleasantly surprising. Most of these will come as no surprise:

  1. Less time wasted in security theater at the station prior to departure.
  2. On-time departure
  3. More comfortable seats than a plane or bus.
  4. Quiet.
  5. Permissive free wifi

Wifi was the biggest surprise. Not that it existed, since we're living in the future and wifi is expected everywhere. It's IPv4 only and stuck behind a NAT, which isn't a big surprise, but it is reasonably open. There isn't any port filtering of non-web TCP ports, and even non-TCP protocols are allowed out. Even my aiccu IPv6 tunnel worked fine from the train, although I did experience some weird behavior with it.

I haven't used aiccu much in quite a while, since I have a native IPv6 connection at home, but it can be convenient while travelling. I'm still trying to figure out happened today, though. The first symptoms were that, although I could ping IPv6 hosts, I could not actually log in via IMAP or ssh. Tcpdump showed all the standard symptoms of a PMTU blackhole. Small packets flow fine, large ones are dropped. The interface MTU is set to 1280, which is the minimum MTU for IPv6 and any path on the internet is expected to handle packets of at least that size. Experimentation via ping6 reveals that the largest payload size I can successfully exchange with a peer is 820 bytes. Add 8 bytes for the ICMPv6 header for 828 bytes of payload, plus 40 bytes for the IPv6 header gives an 868 byte packet, which is well under what should be the MTU for this path.

I've worked around this problem with an ip6tables rule to rewrite the MSS on outgoing SYN packets to 760 bytes, which should leave 40 for the IPv6 header and 20 for any extension headers:

sudo ip6tables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 760

It is working well and will allow me to publish this from the train, which I'd otherwise have been unable to do. But... weird.

24 Aug 2014 8:19pm GMT

Vincent Sanders: Without craftsmanship, inspiration is a mere reed shaken in the wind.

While I imagine Johannes Brahms was referring to music I think the sentiment applies to other endeavours just as well. The trap of believing an idea is worth something without an implementation occurs all too often, however this is not such an unhappy tale.

Lars original design idea

Lars Wirzenius, Steve McIntyre and myself were chatting a few weeks ago about several of the ongoing Debian discussions. As is often the case these discussions had devolved into somewhat unproductive noise and yet amongst all this was a voice of reason in Russ Allbery.

Lars decided that would take the opportunity of the upcoming opportunity of Debconf 14 to say thank you to Russ for his work. It was decided that a plaque would be a nice gift and I volunteered to do the physical manufacture. Lars came up with the idea of a DEBCON scale similar to the DEFCON scale and got some text together with an initial design idea.

CAD drawing of cut paths in clear acrylic

I took the initial design and as is often the case what is practically possible forced several changes. The prototype was a steep learning curve on using the Cambridge makespace laser cutter to create all the separate pieces.

The construction is pretty simple and consisted of three layers of transparent acrylic plastic. The base layer is a single piece of plastic with the correct outline. The next layer has the DEBCON title, the Debian swirl and level numbers. The top layer has the text engraved in its back surface giving the impression the text floats above the layer behind it.

Failed prototype DEBCON plaqueFor the prototype I attempted to glue the pieces together. This was a complete disaster and required discarding the entire piece and starting again with new materials.

The final version with stand ready to be presented

For the second version I used four small nylon bolts to hold the sandwich of layers together which worked very well.

Presentation of plaque photo by Aigars Mahinovs

Yesterday at the Debconf 14 opening Steve McIntyre presented it to Russ and I think he was pleased, certainly he was surprised (photo from Aigars Mahinovs).

The design files are available from my design git repo, though why anyone would want to reproduce it I have no idea ;-)

24 Aug 2014 7:47pm GMT

Lucas Nussbaum: on the Dark Ages of Free Software: a “Free Service Definition”?

Stefano Zacchiroli opened DebConf'14 with an insightful talk titled Debian in the Dark Ages of Free Software (slides available, video available soon).

He makes the point (quoting slide 16) that the Free Software community is winning a war that is becoming increasingly pointless: yes, users have 100% Free Software thin client at their fingertips [or are really a few steps from there]. But all their relevant computations happen elsewhere, on remote systems they do not control, in the Cloud.

That give-up on control of computing is a huge and important problem, and probably the largest challenge for everybody caring about freedom, free speech, or privacy today. Stefano rightfully points out that we must do something about it. The big question is: how can we, as a community, address it?

Towards a Free Service Definition?

I believe that we all feel a bit lost with this issue because we are trying to attack it with our current tools & weapons. However, they are largely irrelevant here: the Free Software Definition is about software, and software is even to be understood strictly in it, as software programs. Applying it to services, or to computing in general, doesn't lead anywhere. In order to increase the general awareness about this issue, we should define more precisely what levels of control can be provided, to understand what services are not providing to users, and to make an informed decision about waiving a particular level of control when choosing to use a particular service.

Benjamin Mako Hill pointed out yesterday during the post-talk chat that services are not black or white: there aren't impure and pure services. Instead, there's a graduation of possible levels of control for the computing we do. The Free Software Definition lists four freedoms - how many freedoms, or types of control, should there be in a Free Service Definition, or a Controlled-Computing Definition? Again, this is not only about software: the platform on which a particular piece of software is executed has a huge impact on the available level of control: running your own instance of WordPress, or using an instance on wordpress.com, provides very different control (even if as Asheesh Laroia pointed out yesterday, WordPress does a pretty good job at providing export and import features to limit data lock-in).

The creation of such a definition is an iterative process. I actually just realized today that (according to Wikipedia) the very first occurrence of an attempt at a Free Software Definition was published in 1986 (GNU's bulletin Vol 1 No.1, page 8) - I thought it happened a couple of years earlier. Are there existing attempts at defining such freedoms or levels of controls, and at benchmarking such criteria against existing services? Such criteria would not only include control over software modifications and (re)distribution, but also likely include mentions of interoperability and open standards, both to enable the user to move to a compatible service, and to avoid forcing the user to use a particular implementation of a service. A better understanding of network effects is also needed: how much and what type of service lock-in is acceptable on social networks in exchange of functionality?

I think that we should inspire from what was achieved during the last 30 years on Free Software. The tools that were produced are probably irrelevant to address this issue, but there's a lot to learn from the way they were designed. I really look forward to the day when we will have:

Exciting times!

24 Aug 2014 3:39pm GMT

Gregor Herrmann: Debian Perl Group Micro-Sprint

DebConf 14 has started earlier today with the first two talks in sunny portland, oregon.

this year's edition of DebConf didn't feature a preceding DebCamp, & the attempts to organize a proper pkg-perl sprint were not very successful.

nevertheless, two other members of the Debian Perl Group & me met here in PDX on wednesday for our informal unofficial pkg-perl µ-sprint, & as intended, we've used the last days to work on some pkg-perl QA stuff:

as usual, having someone to poke besides you, & the opportunity to get a second pair of eyes quickly was very beneficial. - & of course, spending time with my nice team mates is always a pleasure for me!

24 Aug 2014 12:25am GMT

23 Aug 2014

feedPlanet Debian

Antti-Juhani Kaijanaho: A milestone toward a doctorate

Yesterday I received my official diploma for the degree of Licentiate of Philosophy. The degree lies between a Master's degree and a doctorate, and is not required; it consists of the coursework required for a doctorate, and a Licentiate Thesis, "in which the student demonstrates good conversance with the field of research and the capability of independently and critically applying scientific research methods" (official translation of the Government decree on university degrees 794/2004, Section 23 Paragraph 2).

The title and abstract of my Licentiate Thesis follow:

Kaijanaho, Antti-Juhani
The extent of empirical evidence that could inform evidence-based design of programming languages. A systematic mapping study.
Jyväskylä: University of Jyväskylä, 2014, 243 p.
(Jyväskylä Licentiate Theses in Computing,
ISSN 1795-9713; 18)
ISBN 978-951-39-5790-2 (nid.)
ISBN 978-951-39-5791-9 (PDF)
Finnish summary

Background: Programming language design is not usually informed by empirical studies. In other fields similar problems have inspired an evidence-based paradigm of practice. Central to it are secondary studies summarizing and consolidating the research literature. Aims: This systematic mapping study looks for empirical research that could inform evidence-based design of programming languages. Method: Manual and keyword-based searches were performed, as was a single round of snowballing. There were 2056 potentially relevant publications, of which 180 were selected for inclusion, because they reported empirical evidence on the efficacy of potential design decisions and were published on or before 2012. A thematic synthesis was created. Results: Included studies span four decades, but activity has been sparse until the last five years or so. The form of conditional statements and loops, as well as the choice between static and dynamic typing have all been studied empirically for efficacy in at least five studies each. Error proneness, programming comprehension, and human effort are the most common forms of efficacy studied. Experimenting with programmer participants is the most popular method. Conclusions: There clearly are language design decisions for which empirical evidence regarding efficacy exists; they may be of some use to language designers, and several of them may be ripe for systematic reviewing. There is concern that the lack of interest generated by studies in this topic area until the recent surge of activity may indicate serious issues in their research approach.

Keywords: programming languages, programming language design, evidence-based paradigm, efficacy, research methods, systematic mapping study, thematic synthesis

A Licentiate Thesis is assessed by two examiners, usually drawn from outside of the home university; they write (either jointly or separately) a substantiated statement about the thesis, in which they suggest a grade. The final grade is almost always the one suggested by the examiners. I was very fortunate to have such prominent scientists as Dr. Stefan Hanenberg and Prof. Stein Krogdahl as the examiners of my thesis. They recommended, and I received, the grade "very good" (4 on a scale of 1-5).

The thesis has been accepted for publication in our faculty's licentiate thesis series and will in due course appear in our university's electronic database (along with a very small number of printed copies). In the mean time, if anyone wants an electronic preprint, send me email at antti-juhani.kaijanaho@jyu.fi.

Figure 1 of the thesis: an overview of the mapping processFigure 1 of the thesis: an overview of the mapping process

As you can imagine, the last couple of months in the spring were very stressful for me, as I pressed on to submit this thesis. After submission, it took me nearly two months to recover (which certain people who emailed me on Planet Haskell business during that period certainly noticed). It represents the fruit of almost four years of work (way more than normally is taken to complete a Licentiate Thesis, but never mind that), as I designed this study in Fall 2010.

Figure 8 of the thesis: Core studies per publication yearFigure 8 of the thesis: Core studies per publication year

Recently, I have been writing in my blog a series of posts in which I have been trying to clear my head about certain foundational issues that irritated me during the writing of the thesis. The thesis contains some of that, but that part of it is not very strong, as my examiners put it, for various reasons. The posts have been a deliberately non-academic attempt to shape the thoughts into words, to see what they look like fixed into a tangible form. (If you go read them, be warned: many of them are deliberately provocative, and many of them are intended as tentative in fact if not in phrasing; the series also is very incomplete at this time.)

I closed my previous post, the latest post in that series, as follows:

In fact, the whole of 20th Century philosophy of science is a big pile of failed attempts to explain science; not one explanation is fully satisfactory. [...] Most scientists enjoy not pondering it, for it's a bit like being a cartoon character: so long as you don't look down, you can walk on air.

I wrote my Master's Thesis (PDF) in 2002. It was about the formal method called "B"; but I took a lot of time and pages to examine the history and content of formal logic. My supervisor was, understandably, exasperated, but I did receive the highest possible grade for it (which I never have fully accepted I deserved). The main reason for that digression: I looked down, and I just had to go poke the bridge I was standing on to make sure I was not, in fact, walking on air. In the many years since, I've taken a lot of time to study foundations, first of mathematics, and more recently of science. It is one reason it took me about eight years to come up with a doable doctoral project (and I am still amazed that my department kept employing me; but I suppose they like my teaching, as do I). The other reason was, it took me that long to realize how to study the design of programming languages without going where everyone has gone before.

Debian people, if any are still reading, may find it interesting that I found significant use for the dctrl-tools toolset I have been writing for Debian for about fifteen years: I stored my data collection as a big pile of dctrl-format files. I ended up making some changes to the existing tools (I should upload the new version soon, I suppose), and I wrote another toolset (unfortunately one that is not general purpose, like the dctrl-tools are) in the process.

For the Haskell people, I mainly have an apology for not attending to Planet Haskell duties in the summer; but I am back in business now. I also note, somewhat to my regret, that I found very few studies dealing with Haskell. I just checked; I mention Haskell several times in the background chapter, but it is not mentioned in the results chapter (because there were not studies worthy of special notice).

I am already working on extending this work into a doctoral thesis. I expect, and hope, to complete that one faster.

23 Aug 2014 5:44pm GMT

Joachim Breitner: This blog goes static

After a bit more than 9 years, I am replacing Serendipity, which as been hosting my blog, by a self-made static solution. This means that when you are reading this, my server no longer has to execute some rather large body of untyped code to produce the bytes sent to you. Instead, that happens once in a while on my laptop, and they are stored as static files on the server.

I hope to get a little performance boost from this, so that my site can more easily hold up to being mentioned on hackernews. I also do not want to worry about security issues in Serendipity - static files are not hacked.

Of course there are down-sides to having a static blog. The editing is a bit more annoying: I need to use my laptop (previously I could post from anywhere) and I edit text files instead of using a JavaScript-based WYSIWYG editor (but I was slightly annoyed by that as well). But most importantly your readers cannot comment on static pages. There are cloud-based solutions that integrate commenting via JavaScript on your static pages, but I decided to go for something even more low-level: You can comment by writing an e-mail to me, and I'll put your comment on the page. This has the nice benefit of solving the blog comment spam problem.

The actual implementation of the blog is rather masochistic, as my web page runs on one of these weird obfuscated languages (XSLT). Previously, it contained of XSLT stylesheets producing makefiles calling XSLT sheets. Now it is a bit more-self-contained, with one XSLT stylesheet writing out all the various html and rss files.

I managed to import all my old posts and comments thanks to this script by Michael Hamann (I had played around with this some months ago and just spend what seemed to be an hour to me to find this script again) and a small Haskell script. Old URLs are rewritten (using mod_rewrite) to the new paths, but feed readers might still be confused by this.

This opens the door to a long due re-design of my webpage. But not today...

23 Aug 2014 3:54pm GMT

Daniel Pocock: Want to be selected for Google Summer of Code 2015?

I've mentored a number of students in 2013 and 2014 for Debian and Ganglia and most of the companies I've worked with have run internships and graduate programs from time to time. GSoC 2014 has just finished and with all the excitement, many students are already asking what they can do to prepare and be selected in 2015.

My own observation is that the more time the organization has to get to know the student, the more confident they can be selecting that student. Furthermore, the more time that the student has spent getting to know the free software community, the more easily they can complete GSoC.

Here I present a list of things that students can do to maximize their chance of selection and career opportunities at the same time. These tips are useful for people applying for GSoC itself and related programs such as GNOME's Outreach Program for Women or graduate placements in companies.

Disclaimers

There is no guarantee that Google will run the program again in 2015 or any future year.

There is no guarantee that any organization or mentor (including myself) will be involved until the official list of organizations is published by Google.

Do not follow the advice of web sites that invite you to send pizza or anything else of value to prospective mentors.

Following the steps in this page doesn't guarantee selection. That said, people who do follow these steps are much more likely to be considered and interviewed than somebody who hasn't done any of the things in this list.

Understand what free software really is

You may hear terms like free software and open source software used interchangeably.

They don't mean exactly the same thing and many people use the term free software for the wrong things. Not all open source projects meet the definition of free software. Those that don't, usually as a result of deficiencies in their licenses, are fundamentally incompatible with the majority of software that does use genuinely free licenses.

Google Summer of Code is about both writing and publishing your code and it is also about community. It is fundamental that you know the basics of licensing and how to choose a free license that empowers the community to collaborate on your code well after GSoC has finished.

Please review the definition of free software early on and come back and review it from time to time. The The GNU Project / Free Software Foundation have excellent resources to help you understand what a free software license is and how it works to maximize community collaboration.

Don't look for shortcuts

There is no shortcut to GSoC selection and there is no shortcut to GSoC completion.

The student stipend (USD $5,500 in 2014) is not paid to students unless they complete a minimum amount of valid code. This means that even if a student did find some shortcut to selection, it is unlikely they would be paid without completing meaningful work.

If you are the right candidate for GSoC, you will not need a shortcut anyway. Are you the sort of person who can't leave a coding problem until you really feel it is fixed, even if you keep going all night? Have you ever woken up in the night with a dream about writing code still in your head? Do you become irritated by tedious or repetitive tasks and often think of ways to write code to eliminate such tasks? Does your family get cross with you because you take your laptop to Christmas dinner or some other significant occasion and start coding? If some of these statements summarize the way you think or feel you are probably a natural fit for GSoC.

An opportunity money can't buy

The GSoC stipend will not make you rich. It is intended to make sure you have enough money to survive through the summer and focus on your project. Professional developers make this much money in a week in leading business centers like New York, London and Singapore. When you get to that stage in 3-5 years, you will not even be thinking about exactly how much you made during internships.

GSoC gives you an edge over other internships because it involves publicly promoting your work. Many companies still try to hide the potential of their best recruits for fear they will be poached or that they will be able to demand higher salaries. Everything you complete in GSoC is intended to be published and you get full credit for it. Imagine a young musician getting the opportunity to perform on the main stage at a rock festival. This is how the free software community works. It is a meritocracy and there is nobody to hold you back.

Having a portfolio of free software that you have created or collaborated on and a wide network of professional contacts that you develop before, during and after GSoC will continue to pay you back for years to come. While other graduates are being screened through group interviews and testing days run by employers, people with a track record in a free software project often find they go straight to the final interview round.

Register your domain name and make a permanent email address

Free software is all about community and collaboration. Register your own domain name as this will become a focal point for your work and for people to get to know you as you become part of the community.

This is sound advice for anybody working in IT, not just programmers. It gives the impression that you are confident and have a long term interest in a technology career.

Choosing the provider: as a minimum, you want a provider that offers DNS management, static web site hosting, email forwarding and XMPP services all linked to your domain. You do not need to choose the provider that is linked to your internet connection at home and that is often not the best choice anyway. The XMPP foundation maintains a list of providers known to support XMPP.

Create an email address within your domain name. The most basic domain hosting providers will let you forward the email address to a webmail or university email account of your choice. Configure your webmail to send replies using your personalized email address in the From header.

Update your ~/.gitconfig file to use your personalized email address in your Git commits.

Create a web site and blog

Start writing a blog. Host it using your domain name.

Some people blog every day, other people just blog once every two or three months.

Create links from your web site to your other profiles, such as a Github profile page. This helps reinforce the pages/profiles that are genuinely related to you and avoid confusion with the pages of other developers.

Many mentors are keen to see their students writing a weekly report on a blog during GSoC so starting a blog now gives you a head start. Mentors look at blogs during the selection process to try and gain insight into which topics a student is most suitable for.

Create a profile on Github

Github is one of the most widely used software development web sites. Github makes it quick and easy for you to publish your work and collaborate on the work of other people. Create an account today and get in the habbit of forking other projects, improving them, committing your changes and pushing the work back into your Github account.

Github will quickly build a profile of your commits and this allows mentors to see and understand your interests and your strengths.

In your Github profile, add a link to your web site/blog and make sure the email address you are using for Git commits (in the ~/.gitconfig file) is based on your personal domain.

Start using PGP

Pretty Good Privacy (PGP) is the industry standard in protecting your identity online. All serious free software projects use PGP to sign tags in Git, to sign official emails and to sign official release files.

The most common way to start using PGP is with the GnuPG (GNU Privacy Guard) utility. It is installed by the package manager on most Linux systems.

When you create your own PGP key, use the email address involving your domain name. This is the most permanent and stable solution.

Print your key fingerprint using the gpg-key2ps command, it is in the signing-party package on most Linux systems. Keep copies of the fingerprint slips with you.

This is what my own PGP fingerprint slip looks like. You can also print the key fingerprint on a business card for a more professional look.

Using PGP, it is recommend that you sign any important messages you send but you do not have to encrypt the messages you send, especially if some of the people you send messages to (like family and friends) do not yet have the PGP software to decrypt them.

If using the Thunderbird (Icedove) email client from Mozilla, you can easily send signed messages and validate the messages you receive using the Enigmail plugin.

Get your PGP key signed

Once you have a PGP key, you will need to find other developers to sign it. For people I mentor personally in GSoC, I'm keen to see that you try and find another Debian Developer in your area to sign your key as early as possible.

Free software events

Try and find all the free software events in your area in the months between now and the end of the next Google Summer of Code season. Aim to attend at least two of them before GSoC.

Look closely at the schedules and find out about the individual speakers, the companies and the free software projects that are participating. For events that span more than one day, find out about the dinners, pub nights and other social parts of the event.

Try and identify people who will attend the event who have been GSoC mentors or who intend to be. Contact them before the event, if you are keen to work on something in their domain they may be able to make time to discuss it with you in person.

Take your PGP fingerprint slips. Even if you don't participate in a formal key-signing party at the event, you will still find some developers to sign your PGP key individually. You must take a photo ID document (such as your passport) for the other developer to check the name on your fingerprint but you do not give them a copy of the ID document.

Events come in all shapes and sizes. FOSDEM is an example of one of the bigger events in Europe, linux.conf.au is a similarly large event in Australia. There are many, many more local events such as the Debian France mini-DebConf in Lyon, 2015. Many events are either free or free for students but please check carefully if there is a requirement to register before attending.

On your blog, discuss which events you are attending and which sessions interest you. Write a blog during or after the event too, including photos.

Quantcast generously hosted the Ganglia community meeting in San Francisco, October 2013. We had a wild time in their offices with mini-scooters, burgers, beers and the Ganglia book. That's me on the pink mini-scooter and Bernard Li, one of the other Ganglia GSoC 2014 admins is on the right.

Install Linux

GSoC is fundamentally about free software. Linux is to free software what a tree is to the forest. Using Linux every day on your personal computer dramatically increases your ability to interact with the free software community and increases the number of potential GSoC projects that you can participate in.

This is not to say that people using Mac OS or Windows are unwelcome. I have worked with some great developers who were not Linux users. Linux gives you an edge though and the best time to gain that edge is now, while you are a student and well before you apply for GSoC.

If you must run Windows for some applications used in your course, it will run just fine in a virtual machine using Virtual Box, a free software solution for desktop virtualization. Use Linux as the primary operating system.

Here are links to download ISO DVD (and CD) images for some of the main Linux distributions:

If you are nervous about getting started with Linux, install it on a spare PC or in a virtual machine before you install it on your main PC or laptop. Linux is much less demanding on the hardware than Windows so you can easily run it on a machine that is 5-10 years old. Having just 4GB of RAM and 20GB of hard disk is usually more than enough for a basic graphical desktop environment although having better hardware makes it faster.

Your experiences installing and running Linux, especially if it requires some special effort to make it work with some of your hardware, make interesting topics for your blog.

Decide which technologies you know best

Personally, I have mentored students working with C, C++, Java, Python and JavaScript/HTML5.

In a GSoC program, you will typically do most of your work in just one of these languages.

From the outset, decide which language you will focus on and do everything you can to improve your competence with that language. For example, if you have already used Java in most of your course, plan on using Java in GSoC and make sure you read Effective Java (2nd Edition) by Joshua Bloch.

Decide which themes appeal to you

Find a topic that has long-term appeal for you. Maybe the topic relates to your course or maybe you already know what type of company you would like to work in.

Here is a list of some topics and some of the relevant software projects:

  • System administration, servers and networking: consider projects involving monitoring, automation, packaging. Ganglia is a great community to get involved with and you will encounter the Ganglia software in many large companies and academic/research networks. Contributing to a Linux distribution like Debian or Fedora packaging is another great way to get into system administration.
  • Desktop and user interface: consider projects involving window managers and desktop tools or adding to the user interface of just about any other software.
  • Big data and data science: this can apply to just about any other theme. For example, data science techniques are frequently used now to improve system administration.
  • Business and accounting: consider accounting, CRM and ERP software.
  • Finance and trading: consider projects like R, market data software like OpenMAMA and connectivity software (Apache Camel)
  • Real-time communication (RTC), VoIP, webcam and chat: look at the JSCommunicator or the Jitsi project
  • Web (JavaScript, HTML5): look at the JSCommunicator

Before the GSoC application process begins, you should aim to learn as much as possible about the theme you prefer and also gain practical experience using the software relating to that theme. For example, if you are attracted to the business and accounting theme, install the PostBooks suite and get to know it. Maybe you know somebody who runs a small business: help them to upgrade to PostBooks and use it to prepare some reports.

Make something

Make some small project, less than two week's work, to demonstrate your skills. It is important to make something that somebody will use for a practical purpose, this will help you gain experience communicating with other users through Github.

For an example, see the servlet Juliana Louback created for fixing phone numbers in December 2013. It has since been used as part of the Lumicall web site and Juliana was selected for a GSoC 2014 project with Debian.

There is no better way to demonstrate to a prospective mentor that you are ready for GSoC than by completing and publishing some small project like this yourself. If you don't have any immediate project ideas, many developers will also be able to give you tips on small projects like this that you can attempt, just come and ask us on one of the mailing lists.

Ideally, the project will be something that you would use anyway even if you do not end up participating in GSoC. Such projects are the most motivating and rewarding and usually end up becoming an example of your best work. To continue the example of somebody with a preference for business and accounting software, a small project you might create is a plugin or extension for PostBooks.

Getting to know prospective mentors

Many web sites provide useful information about the developers who contribute to free software projects. Some of these developers may be willing to be a GSoC mentor.

For example, look through some of the following:

Getting on the mentor's shortlist

Once you have identified projects that are interesting to you and developers who work on those projects, it is important to get yourself on the developer's shortlist.

Basically, the shortlist is a list of all students who the developer believes can complete the project. If I feel that a student is unlikely to complete a project or if I don't have enough information to judge a student's probability of success, that student will not be on my shortlist.

If I don't have any student on my shortlist, then a project will not go ahead at all. If there are multiple students on the shortlist, then I will be looking more closely at each of them to try and work out who is the best match.

One way to get a developer's attention is to look at bug reports they have created. Github makes it easy to see complaints or bug reports they have made about their own projects or other projects they depend on. Another way to do this is to search through their code for strings like FIXME and TODO. Projects with standalone bug trackers like the Debian bug tracker also provide an easy way to search for bug reports that a specific person has created or commented on.

Once you find some relevant bug reports, email the developer. Ask if anybody else is working on those issues. Try and start with an issue that is particularly easy and where the solution is interesting for you. This will help you learn to compile and test the program before you try to fix any more complicated bugs. It may even be something you can work on as part of your academic program.

Find successful projects from the previous year

Contact organizations and ask them which GSoC projects were most successful. In many organizations, you can find the past students' project plans and their final reports published on the web. Read through the plans submitted by the students who were chosen. Then read through the final reports by the same students and see how they compare to the original plans.

Start building your project proposal now

Don't wait for the application period to begin. Start writing a project proposal now.

When writing a proposal, it is important to include several things:

  • Think big: what is the goal at the end of the project? Does your work help the greater good in some way, such as increasing the market share of Linux on the desktop?
  • Details: what are specific challenges? What tools will you use?
  • Time management: what will you do each week? Are there weeks where you will not work on GSoC due to vacation or other events? These things are permitted but they must be in your plan if you know them in advance. If an accident or death in the family cut a week out of your GSoC project, which work would you skip and would your project still be useful without that? Having two weeks of flexible time in your plan makes it more resilient against interruptions.
  • Communication: are you on mailing lists, IRC and XMPP chat? Will you make a weekly report on your blog?
  • Users: who will benefit from your work?
  • Testing: who will test and validate your work throughout the project? Ideally, this should involve more than just the mentor.

If your project plan is good enough, could you put it on Kickstarter or another crowdfunding site? This is a good test of whether or not a project is going to be supported by a GSoC mentor.

Learn about packaging and distributing software

Packaging is a vital part of the free software lifecycle. It is very easy to upload a project to Github but it takes more effort to have it become an official package in systems like Debian, Fedora and Ubuntu.

Packaging and the communities around Linux distributions help you reach out to users of your software and get valuable feedback and new contributors. This boosts the impact of your work.

To start with, you may want to help the maintainer of an existing package. Debian packaging teams are existing communities that work in a team and welcome new contributors. The Debian Mentors initiative is another great starting place. In the Fedora world, the place to start may be in one of the Special Interest Groups (SIGs).

Think from the mentor's perspective

After the application deadline, mentors have just 2 or 3 weeks to choose the students. This is actually not a lot of time to be certain if a particular student is capable of completing a project. If the student has a published history of free software activity, the mentor feels a lot more confident about choosing the student.

Some mentors have more than one good student while other mentors receive no applications from capable students. In this situation, it is very common for mentors to send each other details of students who may be suitable. Once again, if a student has a good Github profile and a blog, it is much easier for mentors to try and match that student with another project.

GSoC logo generic

Conclusion

Getting into the world of software engineering is much like joining any other profession or even joining a new hobby or sporting activity. If you run, you probably have various types of shoe and a running watch and you may even spend a couple of nights at the track each week. If you enjoy playing a musical instrument, you probably have a collection of sheet music, accessories for your instrument and you may even aspire to build a recording studio in your garage (or you probably know somebody else who already did that).

The things listed on this page will not just help you walk the walk and talk the talk of a software developer, they will put you on a track to being one of the leaders. If you look over the profiles of other software developers on the Internet, you will find they are doing most of the things on this page already. Even if you are not selected for GSoC at all or decide not to apply, working through the steps on this page will help you clarify your own ideas about your career and help you make new friends in the software engineering community.

23 Aug 2014 11:37am GMT