10 Nov 2011

Planet Python

Terry Jones: Emacs buffer mode histogram

Tonight I noticed that I had over 200 buffers open in emacs. I've been programming a lot in Python recently, so many of them are in Python mode. I wondered how many Python files I had open, and I counted them by hand. About 90. I then wondered how many were in JavaScript mode, in RST mode, etc. I wondered what a histogram would look like, for me and for others, at times when I'm programming versus working on documentation, etc.

Because it's emacs, it wasn't hard to write a function to display a buffer mode histogram. Here's mine:

235 buffers open, in 23 distinct modes

91               python +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
47          fundamental +++++++++++++++++++++++++++++++++++++++++++++++
24                  js2 ++++++++++++++++++++++++
21                dired +++++++++++++++++++++
16                 html ++++++++++++++++
 7                 text +++++++
 4                 help ++++
 4           emacs-lisp ++++
 3                   sh +++
 3       makefile-gmake +++
 2          compilation ++
 2                  css ++
 1          Buffer-menu +
 1                 mail +
 1                 grep +
 1      completion-list +
 1                   vm +
 1                  org +
 1               comint +
 1              apropos +
 1                 Info +
 1           vm-summary +
 1      vm-presentation +

Tempting as it is, I'm not going to go on about the heady delights of having a fully programmable editor. You either already know, or you can just drool in slack-jawed wonder.

Unfortunately I'm a terrible emacs lisp programmer. I can barely remember a thing each time I use it. But the interpreter is of course just emacs itself and the elisp documentation is in emacs, so it's a really fun environment to develop in. And because emacs lisp has a ton of support for doing things to itself, code that acts on emacs and your own editing session or buffers is often very succinct. See for example the save-excursion and with-output-to-temp-buffer functions below.

(defun buffer-mode-histogram ()
  "Display a histogram of emacs buffer modes."
  (interactive)
  (let* ((totals '())
         (buffers (buffer-list))
         (total-buffers (length buffers))
         (ht (make-hash-table :test 'equal)))
    (save-excursion
      (dolist (buffer buffers)
        (set-buffer buffer)
        (let ((mode-name (symbol-name major-mode)))
          (puthash mode-name (1+ (gethash mode-name ht 0)) ht))))
    (maphash (lambda (key value)
               (setq totals (cons (list key value) totals)))
             ht)
    (setq totals (sort totals (lambda (x y) (> (cadr x) (cadr y)))))
    (with-output-to-temp-buffer "Buffer mode histogram"
      (princ (format "%d buffers open, in %d distinct modes\n\n"
                     total-buffers (length totals)))
      (dolist (item totals)
        (let ((key (car item))
              (count (cadr item)))
          (if (equal (substring key -5) "-mode")
              (setq key (substring key 0 -5)))
          (princ (format "%2d %20s %s\n" count key
                         (make-string count ?+))))))))

Various things about the formatting could be improved: e.g., not using fixed-width fields for the count and the mode names, and making each + sign represent more than one buffer when the counts are large.

10 Nov 2011 2:42pm GMT

Mike Driscoll: wxPython: ANN: Namespace Diff Tool

Last night, Andrea Gavana released his new Namespace Diff Tool (NDT) to the world. I got his permission to reprint his announcement here for all those people who don't follow the wxPython mailing list. I think it sounds like a really cool tool. You should check it out and see what you think. Here is the announcement:

Description
===========

The `Namespace Diff Tool` (NDT) is a graphical user interface that can
be used to discover differences between different versions of a library,
or even between different iterations/sub-versions of the same library.

The tool can be used to identify what is missing and still needs to be
implemented, or what is new in a new release, which items do not have
docstrings and so on.

Full description of the original idea by Robin Dunn:

http://svn.wxwidgets.org/viewvc/wx/wxPython/Phoenix/trunk/TODO.txt?vi

:warning: As most of the widgets in the GUI are owner drawn or custom,
it is highly probable that the interface itself will look messy on other
platforms (Mac, I am talking to you). Please do try and create a patch to
fix any possible issue in this sense.

:note: Please refer to the TODOs section for a list of things that still need to be implemented.

Requirements
============

In order to run NDT, these packages need to be installed:

- Python 2.X (where 5 <= X <= 7);
- wxPython >= 2.8.10;
- SQLAlchemy >= 0.6.4.

More detailed instructions on how to use it, TODO items, list of
libraries/packages I tested NDT against, screenshots and download links can
be found here:

http://xoomer.virgilio.it/infinity77/main/NDT.html

If you stumble upon a bug (which is highly probable), please do let me
know. But most importantly, please do try and make an effort to create a
patch for the bug.

According to the thread, some bugs were already found and fixed.

10 Nov 2011 1:15pm GMT

Python Software Foundation | GSoC'11 Students

Benedict Stein: King Williams Town train station

Yesterday morning I had to go to the station in KWT to pick up our reserved bus tickets for the Christmas holidays in Cape Town. The station itself has had no train service since December for cost reasons - but Translux and co., the long-distance bus companies, have their offices there.


(Embedded map: larger map view)




© benste CC NC SA

10 Nov 2011 10:57am GMT

Planet Python

Andy Todd: Extracting a discrete set of values

Today's I love Python moment is brought to you by set types.

I have a file, XML naturally, that contains a series of transactions. Each transaction has a reference number, but the reference number may be repeated. I want to pull the distinct set of reference numbers from this file. The way I learnt to build up a discrete set of items (many years ago) was to use a dict and setdefault.

>>> ref_nos = {}
>>> for record in records:
...     ref_nos.setdefault(record.key, 1)
...
>>> ref_nos.keys()

But Python has had a sets module since 2.3 and the built-in set data type since 2.4, so my knowledge is woefully out of date. The latest way to get the unique values from a sequence looks something like this:

>>> ref_nos = set([record.key for record in records])

I think I should get bonus points for using a list comprehension as well.
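
Since then Python has also grown set comprehensions (2.7 and 3.0), which drop the intermediate list entirely. A sketch, assuming the same records as above:

>>> ref_nos = {record.key for record in records}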

10 Nov 2011 5:42am GMT

09 Nov 2011

Planet Python

Menno's Musings: IMAPClient 0.8 released

Version 0.8 of IMAPClient is out! Although I didn't get everything into this release that I had hoped to, there's still plenty there. Thanks to Johannes Heckel and Andrew Scheller for their contributions to this release.

Highlights for 0.8:

The NEWS file and the main documentation have more details on all of the above.

As always, IMAPClient can be installed from PyPI (pip install imapclient) or downloaded from the IMAPClient site.
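
Here's a minimal usage sketch (not from the announcement; the host and credentials are placeholders):

from imapclient import IMAPClient

# host and credentials are placeholders
server = IMAPClient('imap.example.com', ssl=True)
server.login('user', 'password')
server.select_folder('INBOX')
print len(server.search(['UNSEEN'])), 'unseen messages'
server.logout()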

09 Nov 2011 10:10pm GMT

Python Software Foundation | GSoC'11 Students

Benedict Stein

Nobody is worried about something like this - you just drive through by car, and in the city - near Gnobie - "nah, that only gets dangerous once the fire brigade is there" - 30 minutes later, on the way back, the fire brigade was there.




© benste CC NC SA

09 Nov 2011 8:25pm GMT

Planet Python

Stephen Ferg: How to post source code on WordPress

This post is for folks who blog about Python (or any programming language for that matter) on WordPress.
Updated 2011-11-09 to make it easier to copy-and-paste the [sourcecode] template.

My topic today is How to post source code on WordPress.

The trick is to use the WordPress [sourcecode] shortcut tag, as documented at http://en.support.wordpress.com/code/posting-source-code/.

Note that when the WordPress docs tell you to enclose the [sourcecode] shortcut tag in square - not pointy - brackets, they mean it. When you view your post as HTML, what you should see is square brackets around the shortcut tags, not pointy brackets.

Here is the tag I like to use for snippets of Python code.


[sourcecode language="python" wraplines="false" collapse="false"]
your source code goes here
[/sourcecode]

The default for wraplines is true, which causes long lines to be wrapped. That isn't appropriate for Python, so I specify wraplines="false".

The default for collapse is false, which is what I normally want. But I code it explicitly, as a reminder that if I ever want to collapse a long code snippet, I can.


Here are some examples.

(1)

First, a normal chunk of relatively short lines of Python code.

indentCount = 0
textChars = []
suffixChars = []

# convert the line into a list of characters
# and feed the list to the ReadAhead generator
chars = ReadAhead(list(line))

c = chars.next() # get first

while c and c == INDENT_CHAR:
    # process indent characters
    indentCount += 1
    c = chars.next()

while c and c != SYMBOL:
    # process text characters
    textChars.append(c)
    c = chars.next()

if c and c == SYMBOL:
    c = chars.next() # read past the SYMBOL
    while c:
        # process suffix characters
        suffixChars.append(c)
        c = chars.next()

(2)

Here is a different code snippet. This one has a line containing a very long comment. Note that the long line is NOT wrapped, and a horizontal scrollbar is available so that you can scroll as far to the right as you need to. That is because we have specified wraplines="false".

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="false", so lines are NOT wrapped, but extend indefinitely, and a horizontal scrollbar is available so that you can scroll as far to the right as you need to.

(3)

This is what a similar code snippet would look like if we had specified wraplines=true. Note that line 2 wraps around and there is no horizontal scrollbar.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="true", so lines ARE wrapped.  They do NOT extend indefinitely, and a horizontal scrollbar is NOT available, so you cannot scroll further to the right.

(4)

Finally, the same code snippet with collapse=true, so the code snippet initially displays as collapsed. Clicking on the collapsed code snippet will cause it to expand.

somePythonVariable = 1
# This is a long, single-line, comment.  I put it here to illustrate the effect of the wraplines argument.  In this code snippet, wraplines="true", so lines ARE wrapped.  They do NOT extend indefinitely, and a horizontal scrollbar is NOT available, so you cannot scroll further to the right.

As far as I can tell, once a reader has expanded a snippet that was initially collapsed, there is no way for him to re-collapse it. That would be a nice enhancement for WordPress - to allow a reader to collapse and expand a code snippet.


Here is a final thought about wraplines. If you specify wraplines="false", and a reader prints a paper copy of your post, the printed output will not show the scrollbar, and it will show only the portion of long lines that were visible on the screen. In short, the printed output might cut off the right-hand part of long lines.

In most cases, I think, this should not be a problem. The pop-up tools allow a reader to view or print the entire source code snippet if he wants to. Still, I can imagine cases in which I might choose to specify wraplines="true", even for a whitespace-sensitive language such as Python. And I can understand that someone else, simply as a matter of personal taste, might prefer to specify wraplines="true" all of the time.

Now that I think of it, another nice enhancement for WordPress would be to allow a reader to toggle wraplines on and off.


Keep on bloggin'!


09 Nov 2011 7:55pm GMT

Matt Harrison: November Utah Python Meeting

There will be a Utah Python meeting this Thursday. The plan is to discuss the Salt Framework. As always, all are invited. Cheers.

09 Nov 2011 2:11pm GMT

Andrew Dalke: f2pypy

There's a bit of discussion going on about the role of PyPy in scientific computing with Python. I spent a few days last week adding more ... shall I say "fuel"? ... to the discussion. I wrote a new back-end to f2py, called f2pypy, which generates a ctypes-based Python module for a shared library. The module works (somewhat) with CPython, and does not work with PyPy because there's no way yet to pass a pointer to the array data to a ctypes function.

What it shows is a real mechanism to get PyPy to support existing Fortran libraries already supported by f2py definition files.

NumPy isn't used in all scientific software

There is definitely a place for PyPy in scientific computing even now. There are entire branches of science which have little overlap with the strengths of SciPy. I've been a full-time software developer for computational chemistry for 16 years, and have only used NumPy a few times.

One time I needed to compute the generalized inverse matrix. It was in a command-line program called by another process, of all things, and to my annoyance the "import numpy" on the cluster file system was noticeably long. I forget what the numbers were then, but the current numpy import adds 145 (yes, 145!) modules to sys.modules, and 107 of them start with "numpy". Our Lustre configuration did poorly with file metadata, and I think it was over a second to do the import.
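
Here's a quick sketch (not from the original post) of how to check those numbers yourself; the counts and timing will vary with your numpy version and file system:

import sys, time

before = set(sys.modules)
t0 = time.time()
import numpy
elapsed = time.time() - t0

# count what the import dragged in
added = set(sys.modules) - before
print "import numpy took %.3f seconds" % elapsed
print "%d modules added, %d starting with 'numpy'" % (
    len(added), len([m for m in added if m.startswith("numpy")]))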

I brought this up on the numpy list. While they made some changes, it was pointed out that I am not their target user. The "import numpy" also does an "import numpy.testing", an "import numpy.ctypeslib", and the other imports so that people can use numpy.submodule without an extra explicit import line, and because most people working with numpy are working in a long-lived interactive session or job, the startup performance isn't a problem.

I happen to disagree with their choice. "Explicit is better than implicit" and all that. But my point is not to argue for them to change but to give a specific example of how the goals of the NumPy developers can be different than the goals of other scientific programmers.

How do I use Python in science research?

A lot of what I do involves communicating with command-line executables. These are often written by scientists, and most are designed to be run directly by people, not by other software. Most of the time is spent in the executable, so it doesn't matter if I'm using CPython or PyPy.

There are several cheminformatics libraries for Python. OpenBabel and OEChem use SWIG bindings, RDKit uses Boost, and I don't know what Indigo and Canvas use. Migrating these to PyPy will be hard. I hope that someone is working on SWIG bindings, but it looks like the PyPy developers don't want to commit to a C ABI. (See below.)

There's also code where there are no libraries, and for those I write the code in Python, and sometimes my own C extension. For some of these cases the 3x and higher performance of PyPy would be great. I also know a lot of ways for my CPython-based code to talk to PyPy-based code.
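
One such way, sketched here ("worker.py" is a hypothetical PyPy-side script that reads one JSON request per line and writes one JSON reply per line):

import json
import subprocess

# run the hot code under pypy and exchange JSON lines over pipes
proc = subprocess.Popen(["pypy", "worker.py"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
proc.stdin.write(json.dumps({"values": [1.0, 2.0, 3.0]}) + "\n")
proc.stdin.flush()
print json.loads(proc.stdout.readline())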

I used to develop software for bioinformatics and structural biology, and my observations still hold for those fields. One of the Biopython developers, Peter Cock, writes:

"Regarding Biopython using NumPy, we're already trying it out under PyPy. Large chunks of Biopython do not use NumPy at all, although there are a few problems on PyPy 1.6 (one due to a missing XML library, bug filed), most of that seems to work." [*]

He continues with a list of some of what doesn't work.

Support for existing libraries

That said, I know that a lot of people depend on Python bindings to existing libraries. These use the C API directly, or through auto-generated interfaces from f2py, Cython, Boost, SWIG, and many more. There have been 10+ years to develop these tools for CPython, and still very little time to adapt them to PyPy.

Relatively few extensions use the ctypes module, which is Python's "other" mechanism for calling external functions. Unlike the C API, this one is also portable across Jython, IronPython, and PyPy. Obviously, if everyone used ctypes then there wouldn't be a problem. Why don't they?

One reason is performance. Calling math.cos() is 8 times faster than doing a LoadLibrary() of libm and calling cos() that way. This is of course the worst case. But that's a CPython limitation. PyPy's ctypes call interface is faster than CPython calling a C extension:

% cat x.py
import ctypes
m = ctypes.cdll.LoadLibrary("/usr/lib/libm.dylib")
cos = m.cos
cos.argtypes = [ctypes.c_double]
cos.restype = ctypes.c_double

% python -mtimeit -s "from x import cos" "cos(0)"
1000000 loops, best of 3: 0.676 usec per loop
% python -mtimeit -s "from math import cos" "cos(0)"
10000000 loops, best of 3: 0.0811 usec per loop
% pypy -mtimeit -s "from x import cos" "cos(0)"
10000000 loops, best of 3: 0.0332 usec per loop
% pypy -mtimeit -s "from math import cos" "cos(0)"
100000000 loops, best of 3: 0.0047 usec per loop

although you can see it's still slower than using a built-in function.

Another reason to not use ctypes is that C/C++ library authors do interesting things with the API. One library I used has public API functions like "dt_charge(atom)" to get the formal charge of an atom, but used a number of #define statements to change those names to the internal name. That example became "dt_e_charge". It also defined certain constants only in the header files. This information isn't in the shared library.

I know at least one vendor which only ships a static library, and not a shared library. Apparently bad LD_LIBRARY_PATHs were such a support headache that they decided it wasn't worth it. (I think they are right.) There's no way to get ctypes to interface to a static library.

A fourth problem is lack of support for C++ templates. That clearly needs a compiler, which ctypes doesn't do.

PyPy needs a (semi-)stable C ABI; can you help?

Based on the above, there will clearly always be a need for compiler-based Python extensions, including PyPy extensions. That means there needs to be some sort of ABI that those extensions can program against.

I don't know what that would look like, and I think the PyPy developers think it's still too early to stabilize on it. It may well be; but I think it's because there's no one in the group who wants to work on the task.

They were more than happy last year to show a proof-of-concept interface from PyPy to C++ using the run-time type information added by the Reflex system. (Yeah, I had never heard of it either.) So they have nothing against working with an existing ABI. Do you want to offer one?

I wrote "semi-" in the title because it wasn't until Python 3.2 that CPython got a stable ABI. PyPy notably does have emulation support for some of the CPython 2.x ABI but there are problems. Some modules use the ABI incorrectly, and it works for implementation-specific reasons. (For example, bad reference counts.)

If you are going to work on this, I think it would make sense to target the 3.2 ABI and to include instrumentation to help identify these problems.

The best for me would be if you develop some SWIG/ABI interface. This might just be to produce a bunch of stub functions and a ctypes definition for them. (Hmm, wasn't there a C++ to C SWIG interface?)

f2pypy: Experimental Fortran bindings

The above is talk and hand-waving. Code's also good. There was a PyPy sprint this week and I decided to join in for a few days and prototype an idea I've been thinking about: f2pypy. It's a variation of f2py which generates Python ctypes bindings which PyPy could use to talk with shared libraries implemented in Fortran.

At the end of several days of work, I got f2pypy to generate a Python module based on the "fblas.pyf" code from SciPy. I could import that library in CPython and (for the few functions I tested) get answers which matched the fblas module in SciPy. I could also use pypy to call some of the functions, but PyPy's "numpy" implementation is not mature enough. Its array objects don't support the ctypes interface, so I was unable to call out to the shared library. I could only call the scalar-based functions.

The code is definitely incomplete. Even my CPython-based tests fail some of the "test_blas.py" tests from SciPy (I don't implement "cblas", and I think one of the tests depends on Fortran order instead of C order). It's a proof-of-concept which shows that this approach is definitely viable, and it shows some of the difficulties in the approach.

My point though is that it opens new possibilities which aren't available in NumPy. For example, suppose you want to use one of the BLAS functions in your code. Every Mac includes a copy of BLAS as a built-in library. Instead of making people install SciPy, what about shipping the ctypes module description instead, and using that interface? You can ship pure Python code and still take advantage of platform-optimized libraries!
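
Here's a minimal sketch of that idea; the framework path and the standard CBLAS cblas_ddot signature are assumptions, so adjust them for your platform:

import ctypes

# load the BLAS that ships with the Mac (the path is an assumption)
blas = ctypes.cdll.LoadLibrary(
    "/System/Library/Frameworks/Accelerate.framework/Accelerate")

# double cblas_ddot(int n, const double *x, int incx, const double *y, int incy)
ddot = blas.cblas_ddot
ddot.restype = ctypes.c_double
ddot.argtypes = [ctypes.c_int,
                 ctypes.POINTER(ctypes.c_double), ctypes.c_int,
                 ctypes.POINTER(ctypes.c_double), ctypes.c_int]

Vec3 = ctypes.c_double * 3
x = Vec3(1.0, 2.0, 3.0)
y = Vec3(4.0, 5.0, 6.0)
print ddot(3, x, 1, y, 1)   # 1*4 + 2*5 + 3*6 = 32.0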

I earlier highlighted the performance problems in CPython's ctypes interface. But this is PyPy. They already have cross-module optimizations for Python calling Python. There's no reason why those can't apply to ctypes-based functions. (Or perhaps it's already there? I've not tested that.)

How does it work?

Fortran bindings are nice because they don't have the same preprocessor tricks that I mentioned earlier. Pearu Peterson wrote the excellent f2py package starting some 10+ years ago. It has several ways to work with Fortran code. The one I used was to start with a "pyf" definition file and generate Python code using a new back-end.

I figured out how to get SciPy to generate the pyf file for the BLAS library. (The SciPy source uses a template language during the build process to generate the actual code.) I used f2py's "crackfortran" module to parse the pyf file and get the AST. It's a small tree so perhaps I should call it an abstract syntax bush.

The f2py code generates the Python/C extension code based on the AST. My f2pypy code is basically another back-end, which generates ctypes-based code in Python.

The trickiest part was support for C code. Some of the pyf definition lines contain embedded C code. Here I've gathered three examples:

integer optional, intent(in),check(incx>0||incx<0) :: incx = 1
integer optional,intent(in),depend(x,incx,offx,y,incy,offy) :: n = (len(x)-offx)/abs(incx)
callstatement (*f2py_func)((trans?(trans==2?"C":"T"):"N"),&m,&n,&alpha,a,&m,x+offx,&incx,&beta,y+offy,&incy)

I used Fredrik Lundh's wonderful essay on Simple Top-Down Parsing in Python to build a simple C expression parser, which builds another AST. With a bit of AST manipulation, and symbol table knowledge (I need to know which inputs are scalars and which are vectors), I could generate output strings like:

def srot(..., incx = None, ...):
  ...
  if incx is None:
    incx = _ct.c_int(1)
  else:
    incx = _ct.c_int(incx)
  if not ((((incx.value) > 0) or ((incx.value) < 0))):
    raise ValueError('(incx>0||incx<0) failed for argument incx: incx=%s' % incx.value)

and the more complicated:

_api_cgemv((("c") if (((trans.value) == 2)) else ("t")) if ((trans.value)) else
("n"), (m), (n), (alpha), a.ctypes.data_as(_ct.POINTER(_complex_float)), (m),
(x if ((offx.value)) == 0 else x[(offx.value):]).ctypes.data_as(_ct.POINTER(_complex_float)),
(incx), (beta),
(y if ((offy.value)) == 0 else y[(offy.value):]).ctypes.data_as(_ct.POINTER(_complex_float)),
(incy))

I definitely do not generate optimized code. I decided to work completely in terms of ctypes scalars and numpy arrays, even for the check() statements. PyPy doesn't optimize that yet, and I think someone else could do a better job by only doing the conversion as part of the call to the Fortran code.

Usage

To generate the new module on a Mac (I don't know the shared library name for other OS installations):

  $PYTHON -m f2pypy tests/fblas.pyf -l vecLib --skip cdotu,zdotu,cdotc,zdotc

This generates "fblas.py". I have some test code for that module:

% python test_fblas.py
...........F...
======================================================================
FAIL: test_srot_overwrite (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_fblas.py", line 116, in test_srot_overwrite
    assert x is x2
AssertionError

----------------------------------------------------------------------
Ran 15 tests in 0.006s

FAILED (failures=1)

This says that "numpy.array(.. copy=False)" makes a new reference, while the internal code f2py uses passes back the same object, so a real implementation will need to handle that detail.

Here's the same output from pypy:

% pypy test_fblas.py
.EE.EEEFEEEE.E.
======================================================================
ERROR: test_dnrm2 (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_fblas.py", line 166, in test_dnrm2
    E(m.dnrm2(x), float((numpy.array([1+1+16+81], "d")**0.5)))
  File "/Users/dalke/cvses/f2pypy/fblas.py", line 1192, in dnrm2
    return _api_dnrm2((n), (x if ((offx.value)) == 0 else x[(offx.value):]).ctypes.data_as(_ct.POINTER(_ct.c_double)), (incx))
AttributeError: 'numarray' object has no attribute 'ctypes'

======================================================================
ERROR: test_drot (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_fblas.py", line 153, in test_drot
    E(m.drot(1,2,3,4), (numpy.array(11.0, dtype="d"), numpy.array(2.0, dtype="d")))
  File "/Users/dalke/cvses/f2pypy/fblas.py", line 181, in drot
    y = _np.array(y, 'd', copy=overwrite_y)
TypeError: __new__() got an unexpected keyword argument 'copy'
 ....

See the dots up there with the "E"rrors? Four of the scalar-based tests pass. What fails is the vector-based code: the ".ctypes" attribute gets the pointer to the numpy array data, and PyPy doesn't support the "copy" parameter of numpy.array.

Still, it does pass some tests!

Future

I don't use Fortran modules. I don't use f2py. I don't use numarray. I will not be involved in this project in the future. (I do a lot of integration work, and I do a lot of parsing and AST transformations, so that part of this effort was a pretty good fit!)

I did this because I wanted to show that PyPy can support traditional numeric software libraries and that there is a relatively doable path for migration from existing numpy code to "numpypy" code.

I will not be maintaining the project in the future. If you want to take it on, feel free. I've contributed it to the PyPy project, and it has its own repository. Feel free also to leave a comment or ask me questions.

09 Nov 2011 12:00pm GMT

Grig Gheorghiu: Troubleshooting memory allocation errors in Elastic MapReduce

Yesterday we ran into an issue with some Hive scripts running within an Amazon Elastic MapReduce cluster. Here's the error we got:


Caused by: java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:877)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:474)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:289)
... 11 more
Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:176)
at org.apache.hadoop.util.Shell.run(Shell.java:161)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1238)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:703)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1190)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)




Googling around for java.io.IOException: java.io.IOException: error=12, Cannot allocate memory, it seems it's a common problem. See this AWS Developer Forums thread, this Hadoop core-user mailing list thread, and this explanation by Ken Krugler from Bixo Labs.

Basically, it boils down to the fact that when Java tries to fork a new process (in this case a bash shell), Linux will try to allocate as much memory as the current Java process, even though not all that memory will be required. There are several workarounds (read in particular the AWS Forum thread), but a solution that worked for us was to simply add swap space to the Elastic MapReduce slave nodes.
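
The same failure mode can be sketched in a few lines of Python (whether the fork actually fails depends on your overcommit settings, available swap, and RAM; shrink the allocation on small machines):

import subprocess

big = bytearray(2 * 1024 ** 3)   # stand-in for a large JVM heap (2 GB)
subprocess.call(["true"])        # fork+exec; can raise OSError: [Errno 12] Cannot allocate memory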

You can ssh into a slave node from the EMR master node by using the same private key you used when launching the EMR cluster, and by targeting the internal IP address of the slave node. In our case, the slaves are m1.xlarge instances, and they have 4 local disks (/dev/sdb through /dev/sde) mounted as /mnt, /mnt1, /mnt2 and /mnt3, with 414 GB available on each file system. I ran this simple script via sudo on each slave to add 4 swap files of 1 GB each, one on each of the 4 local disks.


$ cat make_swap.sh
#!/bin/bash


SWAPFILES='
/mnt/swapfile1
/mnt1/swapfile1
/mnt2/swapfile1
/mnt3/swapfile1
'
for SWAPFILE in $SWAPFILES; do
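# write a 1 GB file of zeros (1048576 blocks of 1024 bytes), format it
# as swap, enable it, and persist it across reboots via /etc/fstab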
dd if=/dev/zero of=$SWAPFILE bs=1024 count=1048576
mkswap $SWAPFILE
swapon $SWAPFILE
echo "$SWAPFILE swap swap defaults 0 0" >> /etc/fstab
done


This solved our issue. No more failed Map tasks, no more failed Reduce tasks. Maybe this will be of use to some other frantic admins out there (like I was yesterday) who are not sure how to troubleshoot the intimidating Hadoop errors they're facing.



09 Nov 2011 10:57am GMT

Fabio Zadrozny: Git on the command line

Yes, sometimes I still use the command line for git (even with EGit getting better and an integration on Aptana Studio 3).

This is an update to a previous post about git (with msysgit).

[Update] October 31st, 2011: Improved gitdd command a bit

1. Configure running shell

Usually I use TCC/LE as a shell, so there's an alias to start the git bash shell from it (TCC allows editing a 4START.BAT file which runs when it starts, so the configuration may be added there):

alias gitshell=path_to_git/git-1.7.X/bin/sh.exe --login -i

Other aliases:

alias git1=git log --graph --oneline --decorate

Shows the log with a nice graph.

2. Fix history behavior

I find it annoying that the default history completion behavior (when up/down is used) always gets the previous command instead of completing based on what's already typed. To fix that, create a .inputrc file in your user home directory with the following contents:

## By default up/down are bound to previous-history
## and next-history respectively. The following does the
## same but gives the extra functionality where if you
## type any text (or more accurately, if there is any text
## between the start of the line and the cursor),
## the subset of the history starting with that text
## is searched (like 4dos for e.g.).
## Note to get rid of a line just Ctrl-C
"e[B": history-search-forward
"e[A": history-search-backward

$if Bash
## F10 toggles mc on and off
## Note Ctrl-o toggles panes on and off in mc
"e[21~": "mcC-M"

##do history expansion when space entered
Space: magic-space
$endif

## Include system wide settings which are ignored
## by default if one has their own .inputrc
$include /etc/inputrc


3. Diff changes

For diffing the current changes with WinMerge, there's a file called gitdd (added to the git bin dir) with the contents below:

#!/bin/sh

# usage: gitdd
# Compares the current differences in winmerge with links to original files.
# Note that it must be executed at the root of the git repository.

SUBDIRECTORY_OK=1

O=".git-winmerge-tmp-$$"
V=HEAD
list="$O/list"
list_exist="$O/list_exist"
# Delete everything created here on exit
trap "rm -rf $O" 0
mkdir $O
mkdir $O/WORKINGCOPY
# Dump
git diff $V --name-only -z $1 > $list
# Create links to changed files inside temp folder
# (changes made to these links will be made to originals)
for i in `cat $list | xargs -0`; do
    PPATH=`dirname $i`
    mkdir -p $O/WORKINGCOPY/$PPATH
    mkdir -p $O/HEAD/$PPATH
    ln $(pwd)/$i $(pwd)/$O/WORKINGCOPY/$i
    git show HEAD:$i > $O/HEAD/$i
done
# Copy HEAD versions of changed files to temp folder
# cat $list | xargs -0 git archive --prefix=HEAD/ $V | tar xf - -C $O
# Execute winmerge which must be on the system path
WinMergeU.exe //r //u //wr //dl WORKINGCOPY //dr HEAD $O/WORKINGCOPY $O/HEAD

3a. WinMerge tree view

I also always like to see things as a tree in WinMerge (Menu View > Mode: Alt+V M) and also expanded (Menu View > Expand All Subfolders: Alt+V X). It'll store the configuration to show as a tree, but unfortunately it needs to be expanded every time you do a compare (i.e.: no auto-expand).

4. Show things in a compact way

git config format.pretty "%h %ct %ad %Cgreen%aN%Creset %s"
git config log.date short

5. The commands used to get the contents of a pull request (create local branch, merge, get dev, merge with dev):

git checkout -b PullFromBranch-dev development
git pull https://github.com/pull_from_user/Pydev.git PullFromBranch
git checkout development
git merge PullFromBranch-dev --no-commit --no-ff

Then, to accept the merge do a commit or to reject it do:
git merge --abort
or
git reset --merge


6. When creating a feature in a branch:

git checkout -b FeatureBranch-dev development
git checkout development
git merge FeatureBranch-dev --no-ff


And the commands I use most:

git status
git commit -a -m "Message"
git push origin master
git checkout
git log -n 6
git log -n 3 --format=full
git commit --amend (change last commit message)
git show (see what happened on a commit)

And some of the commands I had to discover how to use in the msysgit bash are:

Alt+Space+E+K: mark contents for copy
Insert: paste contents
Alt+Backspace: erase until whitespace


09 Nov 2011 9:12am GMT

Mitchell Garnaat: Comprehensive List of AWS Endpoints

Note: AWS has now started their own list of API endpoints here. You may want to begin using that list as the definitive reference.



Another Note: I am now collecting and publishing this information as JSON data. I am generating the HTML below from this JSON data.


Guy Rosen (@guyro on Twitter) recently asked about a comprehensive list of AWS service endpoints. This information is notoriously difficult to find and seems to be spread across many different documents, release notes, etc. Fortunately, I had most of this information already gathered together in the boto source code so I pulled that together and hunted down the stragglers and put this list together.

If you have any more information to provide or have corrections, etc. please comment below. I'll try to keep this up to date over time.
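
Since this data started life in boto, boto can also give you endpoints programmatically. A small sketch (assuming a reasonably recent boto is installed) that prints each EC2 region with its endpoint hostname:

import boto.ec2

# regions() returns RegionInfo objects; each one carries the endpoint
# hostname for that region of the EC2 service.
for region in boto.ec2.regions():
    print("%s: %s" % (region.name, region.endpoint))

The services covered are listed below.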

Auto Scaling

CloudFormation

CloudFront

CloudWatch

DevPay

ElastiCache

Elastic Beanstalk

Elastic Compute Cloud

Elastic Load Balancing

Elastic Map Reduce

Flexible Payment Service

Identity & Access Management

Import/Export

Mechanical Turk

Relational Data Service

Route 53

Security Token Service

Simple Email Service

Simple Notification Service

Simple Queue Service

Simple Storage Service

SimpleDB

Virtual Private Cloud

09 Nov 2011 7:48am GMT


08 Nov 2011

feedPlanet Python

Davy Wybiral: Planetary states API

I needed a way to deal with planetary positions and velocities and found NASA's HORIZONS and the ephemerides. But I wanted a simpler interface than telnet or lugging around the massive ephemeris files with my applications. So instead, I wrote a simple JSON api for dealing with ephemeris files.

Suppose one wanted to get the Chebyshev coefficients for computing Mercury's state for today's date (November 5th); the URL query would look like this:

http://www.astro-phys.com/api/coeffs?date=2011-11-5&bodies=mercury


Which would return a JSON object whose structure looks like this:

{
  "date": 2455870.5,
  "results": {
    "mercury": {
      "coeffs": ...,
      "start": 2455856.5,
      "end": 2455872.5
    }
  }
}



Where "coeffs" contains the chebyshev coefficients for evaluating the state of mercury between the julian dates 2455856.5 and 2455872.5

To simplify it even further, you can grab the state of mercury at 9:30am on November 5th 2011 by using this url:

http://www.astro-phys.com/api/states?date=2011-11-5+9:30am&bodies=mercury



Which would return:

{
  "date": 2455870.89583,
  "results": {
    "mercury": [
      [30007449.557, -50119248.882, -29922524.4351],
      [2879610.10503, 2030853.04543, 786401.74378]
    ]
  }
}



Where the first array in "mercury" is the position vector (x, y, z) and the second array is the velocity vector (vx, vy, vz)

Multiple planets can be entered, comma-separated.
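
For Python users, here is a minimal client for the states query, a sketch using only the standard library and the URL and response shape shown above:

import json
import urllib
import urllib2

# Build the same query as above; date is optional (defaults to now).
params = urllib.urlencode({'date': '2011-11-5 9:30am', 'bodies': 'mercury,venus'})
data = json.load(urllib2.urlopen('http://www.astro-phys.com/api/states?' + params))

# Each body maps to [position, velocity] vectors in the "results" object.
for body, (position, velocity) in data['results'].items():
    print('%s position=%s velocity=%s' % (body, position, velocity))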

Applications requiring entire ephemeris records can use this:

http://www.astro-phys.com/api/records?date=2011-11-5+9:30am



This will give you...

{
  "date": 2455870.89583,
  "start": 2455824.5,
  "end": 2455888.5,
  "results": {
    "mercury": [
      [...]
    ],
    ...
  }
}



Where date is the date asked for, start is the beginning of the record and end is the end of the record. "results" contains every ephemeris body mapped to a list of its coefficient chunks for that record.

The ephemeris being used is DE406 for the time being, though I may add others later (and a backwards compatible means of specifying).

It doesn't include the full date range of DE406 yet; it only contains dates between 2000 and 2200 (I'll be adding more as needed - if you request a range increase it will probably be granted).

To see a web application using it in action, visit http://www.astro-phys.com and click "start" (it's streaming the records to evaluate positions for the planets; the interface can be dragged and zoomed with the mouse).

Lastly, the constants section of the ephemeris is also available from the url

http://www.astro-phys.com/api/constants



When querying for coefficients, you can't ask for earth or moon directly. You have to use "earthmoon" (the earthmoon barycenter) and "geomoon" (the geocentric moon) and compute their states from those. However, when querying for "states", astro-phys does this for you.

PS: All API queries can also take a '&callback=somefunction' parameter to be treated as JSONP. This works great with jQuery's getJSON.

Here's an example using jQuery.getJSON (note: when the date is missing, the current time is assumed):

var url = 'http://www.astro-phys.com/api/states?callback=?';
$.getJSON(url, {bodies: 'mercury'}, function(data) {
    var p = data.results.mercury[0];
    var v = data.results.mercury[1];
    alert('Position:\nx=' + p[0] + '\ny=' + p[1] + '\nz=' + p[2]);
    alert('Velocity:\nx=' + v[0] + '\ny=' + v[1] + '\nz=' + v[2]);
});

08 Nov 2011 9:19pm GMT


Pete Hunt: PyMySQL 0.5 released

I just released PyMySQL 0.5. This version should be much more stable than previous versions as I've fixed a lot of the unicode handling.

As always, we support an extremely broad range of Python versions, including the 2.x and 3.x series. The 2.x version is in PyPI as PyMySQL; the 3.x version is there as PyMySQL3.
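
Usage mirrors the MySQLdb API, so switching over is mostly an import change. A quick sketch (connection parameters are hypothetical):

import pymysql

# Pure-Python driver: no C extension to compile.
conn = pymysql.connect(host='localhost', user='user', passwd='secret', db='test')
cur = conn.cursor()
cur.execute('SELECT VERSION()')
print(cur.fetchone())
cur.close()
conn.close()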

Check it out at http://www.pymysql.org/

08 Nov 2011 6:37pm GMT

Michael Bayer: Alembic Documentation Ready for Review

Many of you are aware that for quite some time I've had available a new migrations tool called Alembic. I wrote the majority of this code last year and basically haven't had much time to work on it, save for a little bit of integration at my day job.

Alembic has several areas that improve upon the current crop of migration tools, including:

  • Full control over how migrations run, including multiple database support, transactional DDL, etc.
  • A super-minimal style of writing migrations, not requiring full table definitions for simple operations (see the sketch after this list).
  • No monkeypatching or repurposing of core SQLAlchemy objects.
  • Ability to generate migrations as SQL scripts, critical for working in large organizations on restricted-access production systems.
  • A non-linear versioning model that allows, somewhat rudimentarily, for branching and merging of multiple migration file streams.
  • By popular request, designs for some degree of automation will be added, where "obvious" migrations of tables or columns being added or dropped, as well as simple column attribute changes, can be detected and rendered into migration scripts automatically.
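
As an illustration of that minimal style, a migration script might look roughly like this (a sketch with made-up revision ids and table/column names, using Alembic's op interface):

from alembic import op
import sqlalchemy as sa

# Revision identifiers used by Alembic (values here are made up).
revision = '1975ea83b712'
down_revision = None

def upgrade():
    # No full Table definition needed, just the operation itself.
    op.add_column('account', sa.Column('last_name', sa.String(50)))

def downgrade():
    op.drop_column('account', 'last_name')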

I get asked about migration tools quite often, and I've always mentioned that I have such a tool available in an early form, it just lacks documentation, hoping that one out of so many eager users would be able to chip in a couple of days to help get it going. It turns out it's simply not possible to get someone to document your project for you; so here at the PloneConf sprints I've taken the time to create initial documentation. Originally I was going to do some work on Dogpile, but I've literally been asked about schema migrations, either in person or via reddit or twitter, about nine times in the past three days.

The documentation for Alembic will provide an overview of the tool, including theory of operation as well as philosophy, and some indicators of features not yet implemented. I would like feedback on the current approach. Alembic is not yet released though can be checked out from Bitbucket for testing. It's a pretty simple tool, 1163 lines of code at the moment, with plenty of room to support a lot more database features in a straightforward way.

I also have an open question about the notion of "automatic" migrations - how does one detect whether a table or column has had a name change, versus a new table/column being added and an old one dropped? Just a curiosity as I attempt to avoid reading South's source code.

08 Nov 2011 5:01pm GMT


feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Brai Party

Braai = a barbecue evening or the like.

She would love to have technicians help patch her SpeakOn / jack plug splitter cables...

The ladies ("mamas") of the settlement during the official opening speech

Even though fewer people showed up than expected: loud music and lots of people ...

And of course a fire with real wood for the barbecue.

© benste CC NC SA

08 Nov 2011 2:30pm GMT

feedPlanet Python

Go Deh: Should you worry about a 2x speedup?

Let's take as context an implementation of a task for Rosetta Code - a site set up to compare how different programming languages are used to implement the same task for over five hundred tasks and over four hundred languages.

My short answer would be: it depends! There are several things you need to check.

First, ensure you know what the task is asking for, and then verify that both solutions solve the task as stated. It can be very easy to misinterpret what the task is asking for; for example, if a task asks for a particular algorithm, do both of the examples you are comparing use that algorithm?

As well as comparing two implementations for speed, you should also compare for readability. How well the code reads can have a large impact on how easy the code is to maintain. It has been known for task descriptions to be modified; someone tracking that modification may need to work out if and how code needs to be updated. If an example is overly complex and/or unidiomatic then it could cause problems.

Time complexity. If one version of the code works better when given 'bigger' data then you need to know more about when that happens - it could be that the cross-over point in terms of speed of execution is never likely to be met. Maybe the size of data needed to reach cross-over is unreasonable to expect, or other mechanisms come into play that mask the predicted gains (in other words, you might need to verify using that actual bigger data set, to account for things like swapping or caching at the OS and hardware level).
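
One concrete way to look for that cross-over point is to time both versions across growing input sizes. A sketch (the two functions are stand-ins for whichever pair of implementations you are comparing):

import timeit

def concat(n):
    # Naive repeated string concatenation.
    s = ''
    for _ in range(n):
        s += 'x'
    return s

def join(n):
    # Build the string in one pass.
    return ''.join('x' for _ in range(n))

for n in (1000, 100000, 1000000):
    for fn in (concat, join):
        t = timeit.timeit(lambda: fn(n), number=3)
        print('%-8s n=%-8d %.3fs' % (fn.__name__, n, t))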

How fast does it need to be? Rosetta code doesn't usually mention absolute speed of execution, but if one example takes ten hours and the other takes five then you might want to take that into account. If one example took 0.2 seconds and the other only 0.1 seconds then I guess there is an unwritten expectation that examples "don't take long to run" where long is related to the expectation and patience of the user.

You need to look at the context. In the case of Rosetta code, it may be best to give a solution using a similar algorithm to other examples, or a solution that shows accepted use of the language.

When you make your considered choice, you might want to squirrel away the losing code with notes on why it wasn't used - on Rosetta Code we sometimes add more than one solution to a task, with comments contrasting the two, if they both have merit.

It seems to me that talk about optimising for speed, and speed comparisons, tends to dominate on the web over other optimisations, usually with no extra information on the accuracy of the result. (Actually, there might be more cases where a revised result showed that not even the first digit of the original answer was right, yet more than two digits of precision were shown in the answers!)

08 Nov 2011 5:04am GMT


07 Nov 2011

feedPlanet Python

Mikko Ohtamaa: sauna.reload – the most awesomely named Python package ever

This blog post is about sauna.reload which is a Python package adding an automatic reload feature to Plone CMS.

The short history of reloading

The web developer community takes reload-code-on-development for granted. You edit your source file, hit refresh and poof: your changes are there. It is a good way - the only real web development way - and one of the sane things the PHP community has taught us. Things could be different: for example, if you are developing embedded code on mobile platforms you might need to build a ROM image for hours before you see your changed source code line in action.

Side note: to check your code changes it is often faster to hit enter in the browser address bar, as this does not reload CSS and JS files for the HTTP request.

For PHP the reloading is easy as the software stack parses every PHP file again on every request. There is no reload - only load. In fact there is a whole "optimizer" open source business model spawned around PHP just to make it faster. PHP processes are stateless.

For Python the things are not so simple. Python web frameworks have separate process lifespan and HTTP request handling lifespans. Often a server process is started, it spawns N threads and each thread handles one HTTP request at a time. Python processes are stateful.

As being stateful, Python processes may start slowly. You need to parse all source code and create initial memory objects, etc. Frameworks do not optimize for fast start-up because it really doesn't matter on the production service as you spawn Python processes only once, when the server is booted.

Plone CMS, coming in at a choking 250 MB of source code (the Linux kernel has 400-500 MB), is the extreme case of this slow initialization. Plone CMS is the biggest open source Python project out there (does anyone dare to challenge my claim?). When Plone starts it loads most of that source code into memory; parsing and initializing Python itself, plus various XML files, is quite an achievement.

What Tornado, Django, Paster and other Python frameworks do when they reload is simply zap the process dead and reinitialize a 100% virgin process. This is ok as these low-level frameworks have little overhead. Though it becomes painfully slow too when your Django service grows to many applications and the code starts piling up…

For Plone, a 100% start-up is a no-go. Plone people care about the start-up time, but there is not much they can do about it… even the smallest sites have start-up times of dozens of seconds.

Python and module reload

Plone used to have (has) a helper package called plone.reload. It works by reloading Python modules: the Python interpreter allows you to reload a module again in memory. The code is actually based on xreload.py, written by Guido van Rossum himself.

However, there is a catch. Naive code reload does not work. As the xreload.py comments state, there are issues. Global stateful objects, other initialization-specific code paths, etc. are ignored. So with plone.reload you ended up with a broken Python process as many times as you ended up with a successful reload. It's so frustrating to use that you don't actually want to use it.
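
A tiny example of the kind of breakage involved (a sketch; mymodule is hypothetical): objects created before the reload keep referencing the old class objects, so stale and fresh worlds coexist in one process.

import mymodule                      # hypothetical module defining class Thing

obj = mymodule.Thing()
reload(mymodule)                     # Python 2 builtin; re-executes the module

# The module attribute 'Thing' was rebound to a new class object, but obj
# still points at the old one, so this now prints False:
print(isinstance(obj, mymodule.Thing))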

From Finland with Love

sauna.reload was initially created at the Sauna Sprint 2011 event, arranged by EESTEC and the Plone community. The event name comes from the abundance of sauna in Finland, and the package name comes from the fact that a lot of sauna was put into the effort of creating it.

Going trident

A fork. (The original image)

A spoon. (The original image)

sauna.reload takes a different strategy to reloading.

The idea behind sauna.reload goes back to the days when I was working with Python for Nokia Series 60 phones. Symbian seriously sucks, peace on its memory (though not sure if it's dead yet or just a walking corpse). If you did any Symbian development you saw how it could never fly - such a monster it was.

One of the issues was the start-up time of the apps, especially the Python start-up time. Symbian IO was slow, and simply running the application to the point where all imports had been run took several seconds. For all of this time you could just show some happy loading screen to the user. It was very difficult to deliver a Python application with an acceptable user experience on Symbian.

Far back in the day I chatted with Jukka Laurila who gave me some tips. Emacs used to have unexec() command which simply dumped the application state to the disk and resumed later. This works nicely if your application is an isolated binary and does not have any references to DLLs, so it doesn't work so nicely on any contemporary platform.

On the other hand, Maemo folks had something called PyLauncher. PyLauncher loaded a daemon into memory, and this daemon had the Python interpreter and the pygtk library loaded. When a Python script had to be run, the PyLauncher daemon simply fork()'ed a new child, and this new child inherited the already-loaded Python interpreter and pygtk for free. fork() is a copy-on-write operation on modern operating systems.

This gave me a cunning idea.

Taking a shortcut

When you develop a web application and you want to reload changes, you usually don't want to reload the whole stack. In fact, your custom code tends to sit on top of the framework stack, and unless you are a framework core developer yourself, you never poke into those files.

What if we run the Plone initialization process to the point where it is just about to load your custom code, freeze the process, and then always go back to this point when we want to load our (changed) code?

A Plone process is in fact a series of Python egg loads. Your custom eggs are loaded last.

With software, only your imagination is the limit

So I presented the idea to Asko Soukka and Esa-Matti Suuronen from the University of Jyväskylä (one of the largest Plone users in Finland) at Sauna Sprint 2011. If I recall correctly, their first reaction was "wow dude that definitely is not going to work".

But at the end of the tunnel was the freedom of Plone web developers - being released from the chains of Zope start-up times forever. Even if it had turned out sauna.reload was not possible, we would have been stranded in the middle of the forest with nothing to do anyway. So the guys decided to give it a crack. All the time they spent developing sauna.reload would be paid back later with a gracious multiplier.

sauna.reload uses the awesome cross-platform Watchdog Python package to monitor the source code files. Plone is frozen in the start-up process before any custom source code eggs from the src/ folder are loaded. When there is a file-system change event, sauna.reload kills the forked child process and re-forks a new child from the frozen parent process, effectively continuing the loading from the clean table where none of your source code files were present.
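
The core trick can be sketched in a few lines of plain Python (an illustration of the idea, not sauna.reload's actual code): pay for the expensive imports once in the parent, then serve each generation of custom code from a disposable forked child.

import os
import sys
import time

def expensive_framework_setup():
    time.sleep(5)            # stand-in for loading the whole framework stack

def load_custom_code_and_serve():
    print('child %d: loading custom code' % os.getpid())
    time.sleep(60)           # stand-in for serving requests

expensive_framework_setup()  # paid exactly once, in the parent
while True:
    pid = os.fork()          # copy-on-write: the child inherits the warm state
    if pid == 0:
        load_custom_code_and_serve()
        sys.exit(0)
    os.waitpid(pid, 0)       # a real reloader would instead kill the child
                             # when a filesystem change event arrives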

Going uphill

San Francisco, where we are hosting Plone Conference 2011 and the sprints, is a city of hills. They feel quite steep for a person who comes from a place where maps don't have contours.

With Zope 2, the technology the ancients left to us, things were not that straightforward.

ZODB, the database used in Plone, runs in-process for development instances (single process mode). When you go back to the original frozen copy of the Plone process, you'll also travel back in append-only database time.

The sauna.reload devs managed to fix this. However this code is prone to ZODB internal changes as happened with Plone 4.1.

Other funky issues were how to monkey-patch the Zope loading process early enough, and how to reorder egg loading so that loading of the custom code is held until the very end.

For The Win

sauna.reload works on all Python packages which declare z3c.autoinclude loading in setup.py. Just go and use it. This, of course, is based on the assumption that you use an operating system which supports the fork() operation, and unfortunately Windows is not so privileged.

Note: OSX users might need to use trunk version of Watchdog until a bug fix is released

Your Plone restart time goes down to 1 second from 40 seconds. Also, the restart happens automatically every time you save your files. It works with Grok, Zope component architecture, everything… ZCML is reloaded too. You should feel an insurmountable boost in your productivity.

We still have some quirks in sauna.reload. If you try hard enough, you might rarely be able to corrupt the database. We don't know how, so keep trying and report your findings back to the GitHub issue tracker. Also, as sauna.reload is meant for development only, you shouldn't really worry about breaking things. We also have an old issue tracker which was not migrated to collective - how does one migrate an issue tracker on GitHub?

The Future

Who knows? Maybe someone gets inspiration of this work and ports it to other Python frameworks.

Our plan is to release a Plone developer distribution which comes with sauna.reload and other goodies by default. This way new Plone developers could instantly start hacking their code by just copy-pasting examples and keep hitting refresh.

Subscribe to this blog in a reader Follow me on Twitter

07 Nov 2011 10:53pm GMT


Kay Hayen: Nuitka Release 0.3.14

This is to inform you about the new stable release of Nuitka. This time it contains mostly organisational improvements, some bug fixes, improved compatibility and cleanups.

Please see the page "What is Nuitka?" for clarification of what it is now and what it wants to be.

This release is again the result of working towards compilation of a real program (Mercurial). This time, I have added support for proper handling of compiled types by the "inspect" module.

Bug fixes

- Fix for "Missing checks in parameter parsing with star list, star dict and positional arguments". There as whole in the checks for argument counts, now the correct error is given. Fixed in 0.3.13a already.

- The simple slice operations with 2 values, not extended with 3 values, were not applying the correct order of evaluation (a small demonstration follows this list). Fixed in 0.3.13a already.

- The simple slice operations couldn't handle "None" as the value for lower or upper index. Fixed in 0.3.11a already.

- The inplace simple slice operations evaluated the slice index expressions twice, which could cause problems if they had side effects. Fixed in 0.3.11a already.
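
For reference, the evaluation order CPython mandates for a simple two-value slice - lower bound first, then upper bound - is easy to demonstrate:

def lower():
    print('lower')
    return 0

def upper():
    print('upper')
    return 2

# CPython evaluates the lower bound before the upper bound,
# so this prints 'lower', then 'upper', then [1, 2].
print([1, 2, 3][lower():upper()])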

New Features

- Run time patching the "inspect" module so it accepts compiled functions, compiled methods, and compiled generator objects. The "test_inspect" test of CPython is nearly working unchanged with this.

- The generator functions didn't have "CO_GENERATOR" set in their code object; this was made compatible with CPython in this regard too. The inspect module will therefore return the correct value for "inspect.isgeneratorfunction()" as well.

Optimizations

- Slice indexes that are None are now constant propagated as well.

- Slightly more efficient code generation for dual star arg functions, removing useless checks.

Cleanups

- Moved the scons, static C++ files, and asm files to new package "nuitka.build" where also now "SconsInterface" module lives.

- Moved the Qt dialog files to "nuitka.gui"

- Moved the unfreezer code to its own static C++ file.

- Some pylint cleanups.

New Tests

- New test "Recursion" to cover recursive functions.

- New test "Inspection" to cover the patching of "inspect" module.

- Cover "execfile" on the class level as well in "ExecEval" test.

- Cover evaluation order of simple slices in "OrderCheck" too.

Organizational

- There is a new issue tracker available under http://bugs.nuitka.net

Please register and report issues you encounter with Nuitka. I have put all the known issues there and started to use it recently. It's Roundup based like bugs.python.org is, so people will find it familiar.

- The "setup.py" is apparently functional. The source releases for download are made it with, and it appears the binary distributions work too. We may now build a windows installer. It's currently in testing, we will make it available when finished.

Numbers

There are no new numbers. Nuitka should be as fast as it was, 258% speedup:

python 2.6:

Pystone(1.1) time for 50000 passes = 0.48
This machine benchmarks at 104167 pystones/second

Nuitka 0.3.11 (driven by python 2.6):

Pystone(1.1) time for 50000 passes = 0.19
This machine benchmarks at 263158 pystones/second

Summary

The new source organisation makes packaging Nuitka really easy now. From here, we can likely provide "binary" package of Nuitka soon. A windows installer will be nice.

The patching of "inspect" works wonders for compatibility with those programs that insist on checking types instead of doing duck typing. The function call problem was an issue found by the Mercurial test suite.

For "hg.exe" to pass all of its test suite, more work may be needed; this is the overall goal I am currently striving for. Once real-world programs like "hg" work, we can use these as more meaningful benchmarks and resume work on optimization.

As always you will find its latest version here.

Yours,
Kay Hayen

07 Nov 2011 10:52pm GMT


feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux server back to her room. While doing that, we saw her new decoration.

Björn and Simphiwe carried the PC to Katja's school


© benste CC NC SA

07 Nov 2011 2:00pm GMT

06 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's guest family for a special party. First of all we visited some friends of Nelisa - yeah, the one I'm working with in Quigney, Katja's guest father's sister - who gave her a haircut.

African women usually get their hair done by adding extensions, not, like Europeans, by just cutting some hair.

In between she looked like this...

And then she was done - it looks amazing considering the amount of hair she had last week, doesn't it?

© benste CC NC SA

06 Nov 2011 7:45pm GMT

05 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: My Saturday

Somehow I noticed today that I need to restructure my blog posts a bit - if I only ever reported on new places, I would have to be on a round trip. So here are a few things from my everyday life today.

First of all, Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm - Katja and Björn are now at their placements, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza.
Robin, Rommel's wife, has been in Woodie Cape since Thursday to take care of a few things there.
Anyway, this morning we first treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist - Vodacom and Ethienne (the estate agent) - and on the way back we also wanted to bring the missing things to NeedsCamp.

Just after setting off on the dirt road, we realised that we had not packed the things for NeedsCamp and Ethienne, but did have the pump for the water supply in the car.

So in East London we first drove to Farmerama - no, not the online game Farmville, but a shop with all sorts of things for a farm - in Berea, a northern suburb.

At Farmerama we got advice on a quick-release coupling that should make life with the pump easier, and we also brought a lighter pump in for repair, so that it isn't always such a big effort whenever the water has run out again.

Fego Caffé is in the Hemmingways Mall; there we had to get the PIN and PUK of one of our data SIM cards, because two digits had unfortunately been swapped when entering the PIN. In any case, shops in South Africa store data as sensitive as a PUK - which in principle gives access to a locked phone.

In the cafe Rommel then carried out a few online transactions with the 3G modem, which was working again - and which, by the way, now works perfectly in Ubuntu, my Linux system.

On the side I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltop's centres. The picture shows the UMTS coverage in NeedsCamp, Katja's place. 8ta is a new phone provider from Telkom; after Vodafone bought Telkom's shares in Vodacom, they have to build up a network completely from scratch.
We decided to organise a free prepaid card to test, because who knows how accurate the coverage map above is... Before signing even the cheapest 24-month deal, you should know whether it works.

After that we went to Checkers in Vincent, looking for two hotplates for WoodyCape - R 129.00 each, i.e. about 12€ for a two-part hotplate.
As you can see in the background, there is already Christmas decoration - at the beginning of November, and that in South Africa at a sunny and warm 25°C or more.

For lunch we treated ourselves to a Pakistani curry takeaway - highly recommended!
Well, and after we got back an hour or so ago, I cleaned the fridge, which I had simply put outside this morning to defrost. Now it is clean again and free of its 3m thick layer of ice...

Tomorrow... well, I will report on that separately... but probably not until Monday, because then I will be back in Quigney (East London) and have free internet.

© benste CC NC SA

05 Nov 2011 4:33pm GMT

31 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltop's Computer Centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

Pupils in the big classroom


The Trainer


School in Countryside


Adult Class in the Afternoon


"Town"


© benste CC NC SA

31 Oct 2011 4:58pm GMT

Benedict Stein: Technical Issues

What do you do in an internet cafe if your ADSL and fax line have been discontinued before month's end? Well, my idea was to sit outside and eat some ice cream.
At least it's sunny and not as rainy as on the weekend.


© benste CC NC SA

31 Oct 2011 3:11pm GMT

30 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nellis Restaurant

For those who are traveling through Zastron - there is a very nice restaurant which serves delicious food at reasonable prices.
In addition, they sell home-made juices, jams and honey.




interior


home-made specialities - the shop in the shop


the Bar


© benste CC NC SA

30 Oct 2011 4:47pm GMT

29 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: The way back from J'burg

On the 10-12h trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadsides

Plain Street


Orange River in its beginnings (near Lesotho)


Zastron Anglican Church


The bridge between the "Free State" and the Eastern Cape, next to Zastron


my new Background ;)


If you listen to GoogleMaps you'll end up traveling 50km of gravel road - as it was just renewed we didn't have that many problems, and we saved 1h compared to going the official way with all its construction sites




Freeway


getting dark


© benste CC NC SA

29 Oct 2011 4:23pm GMT

28 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: How does a construction site actually work?

Sure, some things may be different, but a lot is the same - a road construction site is an everyday sight in Germany, but how does that actually work in South Africa?

First of all - NO, there are no natives digging with their hands - even though more manpower is used here, they are busily working with machinery.

A completely normal "Bundesstraße" (national road)


and how it is being widened


looooads of trucks


because here one side is completely closed over a long stretch, resulting in a temporary traffic light with, in this case, a 45-minute waiting time


But at least they seem to be having fun ;) - as did we, since luckily we never had to wait longer than 10 min.

© benste CC NC SA

28 Oct 2011 4:20pm GMT