28 Aug 2015

Planet Python

Kun Xi: Thumb Up or Down: take survey with your hand gesture

In the latest SurveyMonkey hackathon, held on August 13-14, 2015, I stayed up the night to build Thumb Up or Down, a computer vision prototype that lets you take a survey with a hand gesture. It was really fun and I'd like to share my experience with you.

A video is worth a million words:

Sorry, your browser doesn't support embedded videos, but don't worry, you can download it and watch it with your favorite video player!

Conceptually, the hack consists of four components: motion detection, image recognition, a socket.io server, and an HTML5 front end.

Of course, I simplified the design and made lots of trade-offs to meet the 24-hour deadline.

Motion Detection

Computer vision is hard; but standing on the shoulders of the community's accumulated effort, OpenCV, some problems are solvable, even on a hackathon timeline.

The idea behind the motion detection is to shoot a background image as the baseline, diff each frame against that baseline, and trigger an event if the largest contour area of the difference exceeds a preset threshold.

import cv2
import imutils

# Grab a picture from the camera
camera = cv2.VideoCapture(0)
grabbed, frame = camera.read()

# Resize and normalize
frame = imutils.resize(frame, width=400, height=300)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (21, 21), 0)

# Diff against the grayscale baseline shot at startup; this snippet lives
# inside Camera.motion_detect, hence the reference to self
frame_delta = cv2.absdiff(self.first_frame, gray)
thresh = cv2.threshold(frame_delta, 25, 255, cv2.THRESH_BINARY)[1]

# Dilate the thresholded image to fill in holes, then find contours
# on the thresholded image (OpenCV 2.x signature)
thresh = cv2.dilate(thresh, None, iterations=2)
cnts, _ = cv2.findContours(thresh.copy(),
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE)

# We only pay attention to the largest contour area
largest_area = max(cv2.contourArea(c) for c in cnts) if cnts else 0
if largest_area > 5000:
    return True

Clearly, I assumed that the backend and the browser run on the same physical machine, so I could avoid implementing video-stream upload from the frontend. This assumption is reasonably legitimate if you consider that the app might be packaged as an appliance running on a Raspberry Pi.

Image Recognition

Image recognition is a much harder problem, though. Theoretically, the solution is related to hand detection, which might be tackled with skin detection or convex hull detection plus some trial and error. Within the scope of the hackathon, the problem was simplified to deciding whether the input looks more like the thumb-up or the thumb-down pattern.

First, I tried the matchTemplate method. It performed poorly: it basically slides the template image (the thumb-up or thumb-down image) across the input image and compares them with 2D convolution, which takes neither scale nor rotation into account.

Then I tried feature detection. It extracts key points from the template and the input image and describes the distance between corresponding key points. I ran feature detection of the input image against both template images and determined the gesture based on which template yielded more good matches. This method performed reasonably well, at least well enough to pull off the hackathon demo¹.
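
To make the matching step concrete, here is a minimal sketch of the idea, assuming ORB key points, a brute-force Hamming matcher with Lowe's ratio test, and hypothetical grayscale template images for the two gestures; the detector and thresholds used in the actual hack may well differ:

import cv2

def count_good_matches(template_gray, input_gray, ratio=0.75):
    # ORB key points and binary descriptors (cv2.ORB_create is the
    # OpenCV 3+ spelling; the 2015-era binding used cv2.ORB())
    orb = cv2.ORB_create()
    _, des_t = orb.detectAndCompute(template_gray, None)
    _, des_i = orb.detectAndCompute(input_gray, None)
    if des_t is None or des_i is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = 0
    for pair in matcher.knnMatch(des_t, des_i, k=2):
        # Lowe's ratio test: keep matches clearly better than the runner-up
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good += 1
    return good

def classify_gesture(input_gray, thumb_up_gray, thumb_down_gray):
    up = count_good_matches(thumb_up_gray, input_gray)
    down = count_good_matches(thumb_down_gray, input_gray)
    return 'up' if up >= down else 'down'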

Socketio Server

The current architecture demands a bi-directional communication channel between the client and the server, which is exactly what socket.io is designed for. gevent-socketio implements the socket.io protocol on top of gevent, and the Pyramid integration example gave me a quick start².

In __init__.py, the socket.io upgrade path is bound to the route named socketio:

config.add_route('socketio', 'socket.io/*remaining')

Then in views.py, we initialize the socket.io namespace, /thumbup:

from pyramid.response import Response
from pyramid.view import view_config
from socketio import socketio_manage

@view_config(route_name='socketio')
def socket_io(request):
    socketio_manage(request.environ,
            {'/thumbup': CameraNamespace},
            request=request)
    return Response('')

On the client side, the socket.io client must connect to the /thumbup namespace to establish the two-way communication:

$(document).ready(function() {
    var socket = io.connect('/thumbup');
    socket.on('action', function() {...});
});

The CameraNamespace then receives event notifications from the client and can send packets and emit events back to the client over this socket. In my example, CameraNamespace spawns the motion_detect method with the socket.io handle, self, in the closure, so motion_detect can drive the state machine on the client side.

import gevent
from socketio.namespace import BaseNamespace

class CameraNamespace(BaseNamespace):
    def initialize(self):
        # Camera wraps the OpenCV capture and the motion detection shown above
        camera = Camera()
        gevent.spawn(camera.motion_detect, self)
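
For completeness, here is a rough sketch of what the spawned motion_detect loop could look like, assuming the contour check from the earlier snippet is factored into a hypothetical self._motion_detected() helper; the namespace handle is only used to emit the event that drives the client-side state machine:

import gevent

class Camera(object):
    def motion_detect(self, namespace):
        """Poll the camera and notify the client when motion is seen."""
        while True:
            # _motion_detected() is a hypothetical helper wrapping the
            # frame-diff and contour-area check shown above
            if self._motion_detected():
                namespace.emit('action', 'motion')
            gevent.sleep(0.5)  # yield to the gevent loop between polls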

HTML5 Front end

The front end must provide some feedback once the survey taker is detected. Luckily, with WebRTC this is pretty straightforward:

navigator.getUserMedia({ video: true, audio: false }, function(stream) {
    var vendorURL = window.URL || window.webkitURL;
    var video = document.querySelector("video");
    video.src = vendorURL.createObjectURL(stream);
    video.play();
});

See Mozilla's example for more details about the vendor extension detection.

You may also take a picture, and send it back to the server side:

var canvas = document.querySelector('canvas');
canvas.width = $(video).width();
canvas.height = $(video).height();
var context = canvas.getContext('2d');
context.drawImage(video, 0, 0, canvas.width, canvas.height);
var data = canvas.toDataURL('image/png');

The data is Base64-encoded PNG data prefixed with the data:image/png;base64, header; we can easily decode it into an OpenCV image on the server:

import base64

import cv2
import numpy as np
from pyramid.view import view_config

@view_config(route_name='detect')
def detect(request):
    # Strip the 22-character 'data:image/png;base64,' prefix, then decode
    img_str = base64.b64decode(request.body[22:])
    nparr = np.fromstring(img_str, np.uint8)
    # CV_LOAD_IMAGE_GRAYSCALE is the OpenCV 2.x name (IMREAD_GRAYSCALE in 3+)
    img = cv2.imdecode(nparr, cv2.CV_LOAD_IMAGE_GRAYSCALE)

Closing thoughts

OpenCV is a versatile and powerful Swiss Army knife for image processing and computer vision. Especially with the cv2 Python binding, it empowers us to hack up a meaningful prototype in a sprint.

WebRTC and other HTML5 technologies make it possible to build a web app instead of a native app for sophisticated applications like this.

One more thing: SurveyMonkey is hiring! We are looking for talented Python developers to help the world make better decisions.


  1. Further inspection shows that this methodology is flawed. Since the thumb-up and thumb-down templates are relatively similar, we also need to evaluate the perspective transform matrix to determine the gesture: concretely, if the gesture is identified as a thumb up rotated by 180 degrees, it is actually a thumb down, and vice versa for a rotated thumb down.

  2. If you use Python 3, you may want to try aiopyramid.

28 Aug 2015 7:00pm GMT

PyCharm: PyCharm 4.5.4 RC2 is available

Having announced the PyCharm 4.5.4 Release Candidate build a week ago, today we've published the PyCharm 4.5.4 RC2 build 141.2569, which is already available for download and evaluation from the EAP page.

This build introduces just a few new fixes which can be found in the release notes. They are:

Download the PyCharm 4.5.4 RC2 build for your platform from the project EAP page and please report any bugs and feature requests to our Issue Tracker. It will also be available shortly as a patch update from within the IDE (from 4.5.x builds only) for those who selected the Beta Releases channel in the update settings.

Stay tuned for the PyCharm 4.5.4 release announcement, follow us on twitter, and develop with pleasure!

-PyCharm Team

28 Aug 2015 4:08pm GMT

Ian Ozsvald: EuroSciPy 2015 and Data Cleaning on Text for ML (talk)

I'm at EuroSciPy 2015; we have two days of Pythonistic science in Cambridge. I spoke in the morning on Data Cleaning on Text to Prepare for Data Analysis and Machine Learning (which is a terribly verbose title, sorry!). I covered 10 years of lessons learned working with NLP on (often crappy) text data, and ways to clean it up to make it easy to work with. Topics covered:

Here are the slides:


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

28 Aug 2015 2:50pm GMT

Kay Hayen: Nuitka Release 0.5.14

This is to inform you about the new stable release of Nuitka. It is the extremely compatible Python compiler. Please see the page "What is Nuitka?" for an overview.

This release is an intermediate step towards value propagation, which is not considered ready for a stable release yet. The major point is the elimination of try/finally expressions, as they are problematic for SSA. The try/finally statement change is delayed.

There are also a lot of bug fixes and enhancements to code generation, as well as major cleanups of the code base.

Bug Fixes

  • Python3: Added support for assignments with a starred target.

    *a, b = 1, 2
    

    This raised ValueError before.

  • Python3: Properly detect illegal double star assignments.

    *a, *b = c
    
  • Python3: Properly detect the syntax error to star assign from non-tuple/list.

    *a = 1
    
  • Python3.4: Fixed a crash of the binary when copying dictionaries with split tables received as star arguments.

  • Python3: Fixed reference loss, when using raise a from b where b was an exception instance. Fixed in 0.5.13.8 already.

  • Windows: Fix, the flag --disable-windows-console was not properly handled for the MinGW32 runtime, resulting in a crash.

  • Python2.7.10: This was not recognized as a 2.7.x variant, and therefore minor-version compatibility levels were not applied properly.

  • Fix, when choosing to have frozen source references, code objects did not use the same value as __file__ for their filename.

  • Fix, when re-executing itself to drop the site module, make sure we find the same file again, and not according to the PYTHONPATH changes coming from it. Issue#223. Fixed in 0.5.13.4 already.

  • Enhanced code generation for del variable statements, where it's clear that the value must be assigned.

  • When pressing CTRL-C, the stack traces from both Nuitka and Scons were given, we now avoid the one from Scons.

  • Fix, the dump from --xml no longer contains functions that have become unused during analysis.

  • Standalone: Creating or running programs from inside unicode paths was not working on Windows. Issue#231 and Issue#229. Fixed in 0.5.13.7 already.

  • Namespace package support was not yet complete, importing the parent of a package was still failing. Issue#230. Fixed in 0.5.13.7 already.

  • Python2.6: Compatibility for exception check messages enhanced with newest minor releases.

  • Compatibility: The NameError in classes needs to say global name and not just name too.

  • Python3: Fixed creation of XML representation, now done without lxml as it doesn't support needed features on that version. Fixed in 0.5.13.5 already.

  • Python2: Fix, when creating code for the largest negative constant to still fit into int, that was only working in the main module. Issue#228. Fixed in 0.5.13.5 already.

New Features

  • Added support for Windows 10.
  • Followed changes for Python 3.5 beta 2. Still only usable as a Python 3.4 replacement, no new features.
  • Using a self compiled Python running from the source tree is now supported.
  • Added support for the Anaconda Python distribution. As it doesn't install the Python DLL, we copy it along for acceleration mode.
  • Added support for Visual Studio 2015. Issue#222. Fixed in 0.5.13.3 already.
  • Added support for self compiled Python versions running from build tree, this is intended to help debug things on Windows.

Optimization

  • Function inlining is now present in the code, but still disabled, because it needs more changes in other areas, before we can generally do it.

  • Trivial outlines, the result of re-formulations or function inlining, are now inlined if they just return an expression.

  • The re-formulation for or and and has been given up, eliminating the use of a try/finally expression, at the cost of dedicated boolean nodes and code generation for these.

    This saves around 8% of compile time memory for Nuitka, and allows for faster and more complete optimization, and gets rid of a complicated structure for analysis.

  • When a frame is used in an exception, its locals are detached. This was done more often than necessary and even for frames that are not necessarily our own. This will speed up some exception cases.

  • When the default arguments, the keyword default arguments (Python3), or the annotations (Python3) were raising an exception, the function definition is now replaced with the exception, saving code generation. This happens frequently with Python2/Python3 compatible code guarded by version checks.

  • The SSA analysis for loops now properly traces "break" statement situations and merges the post-loop situation from all of them. This allows for significantly improved optimization of code following the loop.

  • The SSA analysis of try/finally statements has been greatly enhanced. The handler for finally is now optimized for exception raise and no exception raise individually, as well as for break, continue and return in the tried code. The SSA analysis for after the statement is now the result of merging these different cases, should they not abort.

  • The code generation for del statements now takes advantage of definite knowledge of the previous value, should it exist. This speeds them up slightly.

  • The SSA analysis of del statements now properly decides if the statement can raise or not, allowing for more optimization.

  • For list contractions, the re-formulation was enhanced using the new outline construct instead of a pseudo function, leading to better analysis and code generation.

  • Comparison chains are now re-formulated into outlines too, allowing for better analysis of them.

  • Exceptions raised in function creations, e.g. in default values, are now propagated, eliminating the function's code. This happens most often with Python2/Python3 in branches. On the other hand, function creations that cannot raise are also annotated now.

  • Closure variables that become unreferenced outside of the function become normal variables leading to better tracing and code generation for them.

  • Function creations cannot raise, unless their defaults, keyword defaults, or annotations do.

Organizational

  • Removed the Gitorious mirror of the git repository; Gitorious shut down.
  • Make it more clear in the documentation that Python2 is needed at compile time to create Python3 executables.

Cleanups

  • Moved more parts of code generation to their own modules, and used registry for code generation for more expression kinds.

  • Unified try/except and try/finally into a single construct that handles both through try/except/break/continue/return semantics. Finally is now solved by duplicating the handler into the cases where necessary.

    No longer are nodes annotated with information if they need to publish the exception or not, this is now all done with the dedicated nodes.

  • The try/finally expressions have been replaced with outline function bodies, which, instead of statements executed for side effects, are more like functions with return values, allowing for easier analysis and dedicated code generation of much lower complexity.

  • No more "tolerant" flag for release nodes, we now decide this fully based on SSA information.

  • Added helper for assertions that code flow does not reach certain positions, e.g. a function must return or raise, aborting statements do not continue and so on.

  • To keep cloning of code parts as simple as possible, the limited use of makeCloneAt has been changed to a new makeClone which produces identical copies, which is what we always do. And a generic cloning based on "details" has been added, requiring to make constructor arguments and details complete and consistent.

  • The re-formulation code helpers have been improved to be more convenient at creating nodes.

  • The old nuitka.codegen module Generator was still used for many things. These now all got moved to appropriate code generation modules, and their users got updated, also moving some code generator functions in the process.

  • The module nuitka.codegen.CodeTemplates got replaced with direct uses of the proper topic module from nuitka.codegen.templates, with some more added, and their names harmonized to be more easily recognizable.

  • Added more assertions to the generated code, to aid bug finding.

  • The autoformat now sorts pylint markups for increased consistency.

  • Releases no longer have a tolerant flag, this was not needed anymore as we use SSA.

  • Handle CTRL-C in the scons code, preventing per-job messages that are not helpful, and avoid tracebacks from scons; also remove more unused tools like rpm from our inline copy.

Tests

  • Added the CPython3.4 test suite.

  • The CPython3.2, CPython3.3, and CPython3.4 test suites now run with Python2 giving the same errors. Previously there were a few specific errors, some with line numbers, some with a different SyntaxError being raised, due to a different order of checks.

    This increases the coverage of the exception raising tests somewhat.

  • Also the CPython3.x test suites now all pass with debug Python, as does the CPython 2.6 test suite with 2.6 now.

  • Added tests to cover all forms of unpacking assignments supported in Python3, to be sure there are no other errors unknown to us.

  • Started to document the reference count tests, and to make it more robust against SSA optimization. This will take some time and is work in progress.

  • Made the compile library test robust against modules that raise a syntax error, checking that Nuitka does the same.

  • Refined more tests to be directly executable with Python3; this is an ongoing effort.

Summary

This release is clearly major. It represents a huge step forward for Nuitka as it improves nearly every aspect of code generation and analysis. Removing the try/finally expression nodes proved to be necessary in order to even have the correct SSA in their cases. Very important optimization was blocked by it.

Going forward, the try/finally statements will be removed and dead variable elimination will happen, which then will give function inlining. This is expected to happen in one of the next releases.

This release is a consolidation of 8 hotfix releases, and many refactorings needed towards the next big step, which might also break things, and for that reason is going to get its own release cycle.

28 Aug 2015 5:03am GMT

Mike C. Fletcher: Raspberry Pi PyOpenGL

So since Soni got me to set up Raspbian on the old Raspberry Pi, I got PyOpenGL + GLES2 working on it today. There is a bug in Raspbian's EGL library (it depends on GLES2 without linking to it), but with a work-around for that, bzr head of PyOpenGL can run the bcmwindow example/raw.py demo. I *don't* have a spare HDMI cable, however, so I didn't actually get to *see* the demo running. Oh well, next time. bcmwindow is now up on PyPI should people be interested.

28 Aug 2015 4:35am GMT

Montreal Python User Group: Call for Speakers - Montréal-Python 54: Virtualized Utopia

It's back-to-school time, so at Montréal-Python we are preparing for the first event of the season!

We are back every second Monday of the month, so our next meeting will take place on Monday, September the 14th at 6:30pm at UQÀM.

For the occasion, we are looking for speakers to give talks of 5, 10, 20, or even 45 minutes.

Come tell us about your latest discoveries, your latest module, or your latest professional or personal achievements. It is your chance to meet the local Python community.

Send us your propositions at mtlpyteam@googlegroups.com

When:

Monday, September 14th 2015

Schedule:

We'd like to thank our sponsors for their continued support:

28 Aug 2015 4:00am GMT

PyTexas: 2015 Schedule Released and Tons of Ways to Stay in the Loop

The 2015 schedule has been released and it is jam-packed with awesome talks and tutorials. Check it out today.

2015 schedule »

We've also added a few extra ways to stay in the loop with PyTexas news.

First, we added push notifications to the site, so if you see a little pop-up asking for authorization and hit "Allow", you will receive up-to-date information on the conference. These notifications even arrive when you're disconnected from the site if you use Chrome (all versions) or Safari (desktop). All other browsers require you to be on the site to get the notifications.

Lastly, we also created a Gitter.im Chat Room for anyone that wants to talk about the conference or has questions. We'll be monitoring that chat room, so feel free to drop in anytime.

Gitter.im Chat Room »

Don't like any of those options? Then check us out on Twitter too: @pytexas.

28 Aug 2015 2:10am GMT

Ben Rousch: Kivy – Interactive Applications and Games in Python, 2nd Edition Review

I was recently asked by the author to review the second edition of "Kivy - Interactive Applications in Python" from Packt Publishing. I had difficulty recommending the first edition mostly due to the atrocious editing - or lack thereof - that it had suffered. It really reflected badly on Packt, and since it was the only Kivy book available, I did not want that same inattention to quality to reflect on Kivy. Packt gave me a free ebook copy of this book in exchange for agreeing to do this review.

At any rate, the second edition is much improved over the first. Although a couple of glaring issues remain, it looks like it has been visited by at least one native English speaking editor. The Kivy content is good, and I can now recommend it for folks who know Python and want to get started with Kivy. The following is the review I posted to Amazon:

-

This second edition of "Kivy - Interactive Applications and Games in Python" is much improved from the first edition. The atrocious grammar throughout the first edition has mostly been fixed, although it's still worse than what I expect from a professionally edited book. The new chapters showcase current Kivy features while reiterating how to build a basic Kivy app, and the book covers an impressive amount of material in its nearly 185 pages. I think this is due largely to the efficiency and power of coding in Python and Kivy, but also to the carefully chosen projects the author selected for his readers to create. Despite several indentation issues in the example code and the many grammar issues typical of Packt's books, I can now recommend this book for intermediate to experienced Python programmers who are looking to get started with Kivy.

Chapter one is a good, quick introduction to a minimal Kivy app, layouts, widgets, and their properties.

Chapter two is an excellent introduction and exploration of basic canvas features and usage. This is often a difficult concept for beginners to understand, and this chapter handles it well.

Chapter three covers events and binding of events, but is much denser and difficult to grok than chapter two. It will likely require multiple reads of the chapter to get a good understanding of the topic, but if you're persistent, everything you need is there.

Chapter four contains a hodge-podge of Kivy user interface features. Screens and scatters are covered well, but gestures still feel like magic. I have yet to find a good in-depth explanation of gestures in Kivy, so this does not come as a surprise. Behaviors is a new feature in Kivy and a new section in this second edition of the book. Changing default styles is also covered in this chapter. The author does not talk about providing a custom atlas for styling, but presents an alternative method for theming involving Factories.

In chapter six the author does a good job of covering animations, and introduces sounds, the clock, and atlases. He brings these pieces together to build a version of Space Invaders, in about 500 lines of Python and KV. It ends up a bit code-dense, but the result is a fun game and a concise code base to play around with.

In chapter seven the author builds a TED video player including subtitles and an Android actionbar. There is perhaps too much attention paid to the VideoPlayer widget, but the resulting application is a useful base for creating other video applications.

-

28 Aug 2015 1:16am GMT

Matthew Rocklin: Efficient Tabular Storage

tl;dr: We discuss efficient techniques for on-disk storage of tabular data, notably the following: binary stores, column stores, categorical encodings, compression, and indexed/partitioned storage.

We use the NYC Taxi dataset for examples, and introduce a small project, Castra.

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

Larger than Memory Data and Disk I/O

We analyze large datasets (10-100GB) on our laptop by extending memory with disk. Tools like dask.array and dask.dataframe make this easier for array and tabular data.

Interaction times can improve significantly (from minutes to seconds) if we choose to store our data on disk efficiently. This is particularly important for large data because we can no longer separately "load in our data" while we get a coffee and then iterate rapidly on our dataset once it's comfortably in memory.

Larger-than-memory datasets force our interactive workflows to include the hard drive.

CSV is convenient but slow

CSV is great. It's human readable, accessible by every tool (even Excel!), and pretty simple.

CSV is also slow. The pandas.read_csv parser maxes out at 100MB/s on simple data. This doesn't include any keyword arguments like datetime parsing that might slow it down further. Consider the time to parse a 24GB dataset:

24GB / (100MB/s) == 4 minutes

A four minute delay is too long for interactivity. We need to operate in seconds rather than minutes otherwise people leave to work on something else. This improvement from a few minutes to a few seconds is entirely possible if we choose better formats.

Example with CSVs

As an example let's play with the NYC Taxi dataset using dask.dataframe, a library that copies the Pandas API but operates in chunks off of disk.

>>> import dask.dataframe as dd

>>> df = dd.read_csv('csv/trip_data_*.csv',
...                  skipinitialspace=True,
...                  parse_dates=['pickup_datetime', 'dropoff_datetime'])

>>> df.head()
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
0 89D227B655E5C82AECF13C3F540D4CF4 BA96DE419E711691B9445D6A6307C170 CMT 1 N 2013-01-01 15:11:48 2013-01-01 15:18:10 4 382 1.0 -73.978165 40.757977 -73.989838 40.751171
1 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-06 00:18:35 2013-01-06 00:22:54 1 259 1.5 -74.006683 40.731781 -73.994499 40.750660
2 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-05 18:49:41 2013-01-05 18:54:23 1 282 1.1 -74.004707 40.737770 -74.009834 40.726002
3 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:54:15 2013-01-07 23:58:20 2 244 0.7 -73.974602 40.759945 -73.984734 40.759388
4 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:25:03 2013-01-07 23:34:24 1 560 2.1 -73.976250 40.748528 -74.002586 40.747868

Time Costs

It takes a second to load the first few lines but 11 to 12 minutes to roll through the entire dataset. We make a zoomable picture below of a random sample of the taxi pickup locations in New York City. This example is taken from a full example notebook here.

df2 = df[(df.pickup_latitude > 40) &
         (df.pickup_latitude < 42) &
         (df.pickup_longitude > -75) &
         (df.pickup_longitude < -72)]

sample = df2.sample(frac=0.0001)
pickup = sample[['pickup_latitude', 'pickup_longitude']]

result = pickup.compute()

from bokeh.plotting import figure, show, output_notebook
p = figure(title="Pickup Locations")
p.scatter(result.pickup_longitude, result.pickup_latitude, size=3, alpha=0.2)

Eleven minutes is a long time

This result takes eleven minutes to compute, almost all of which is parsing CSV files. While this may be acceptable for a single computation we invariably make mistakes and start over or find new avenues in our data to explore. Each step in our thought process now takes eleven minutes, ouch.

Interactive exploration of larger-than-memory datasets requires us to evolve beyond CSV files.

Principles to store tabular data

What efficient techniques exist for tabular data?

A good solution may have the following attributes:

  1. Binary
  2. Columnar
  3. Categorical support
  4. Compressed
  5. Indexed/Partitioned

We discuss each of these below.

Binary

Consider the text '1.23' as it is stored in a CSV file and how it is stored as a Python/C float in memory:

These look very different. When we load 1.23 from a CSV textfile we need to translate it to 0x3f9d70a4; this takes time.

A binary format stores our data on disk exactly how it will look in memory; we store the bytes 0x3f9d70a4 directly on disk so that when we load data from disk to memory no extra translation is necessary. Our file is no longer human readable but it's much faster.

This gets more intense when we consider datetimes:

Every time we parse a datetime we need to compute how many microseconds it has been since the epoch. This calculation needs to take into account things like how many days in each month, and all of the intervening leap years. This is slow. A binary representation would record the integer directly on disk (as 0x51e278694a680) so that we can load our datetimes directly into memory without calculation.
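
A quick way to see the difference for yourself with numpy (a sketch; the exact bytes depend on dtype, endianness, and the chosen epoch resolution):

import numpy as np

# The text '1.23' is four ASCII bytes; the float32 is four binary bytes
text = b'1.23'
binary = np.float32(1.23).tobytes()
print(text.hex())      # 312e3233
print(binary.hex())    # a4709d3f  (0x3f9d70a4, little-endian)

# A datetime64 is just a 64-bit integer count since the epoch
ts = np.datetime64('2013-01-01T15:11:48', 'us')
print(ts.astype('int64'))  # microseconds since 1970-01-01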

Columnar

Many analytic computations only require a few columns at a time, often only one, e.g.

>>> df.passenger_count.value_counts().compute().sort_index()
0           3755
1      119605039
2       23097153
3        7187354
4        3519779
5        9852539
6        6628287
7             30
8             23
9             24
129            1
255            1
Name: passenger_count, dtype: int64

Of our 24 GB we may only need 2GB. Columnar storage means storing each column separately from the others so that we can read relevant columns without passing through irrelevant columns.

Our CSV example fails at this. While we only want two columns, pickup_datetime and pickup_longitude, we pass through all of our data to collect the relevant fields. The pickup location data is mixed with all the rest.
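
As a toy illustration of the idea (this is not how any particular columnar format is implemented, just the principle), imagine writing each numeric column to its own binary file and reading back only the columns a query touches:

import numpy as np
import pandas as pd

def write_columns(df, directory):
    # One binary file per (numeric) column; real column stores add
    # metadata, chunking, and compression on top of this idea
    for name in df.columns:
        np.save('%s/%s.npy' % (directory, name), df[name].values)

def read_columns(directory, names):
    # Only the files for the requested columns are touched
    return pd.DataFrame({name: np.load('%s/%s.npy' % (directory, name))
                         for name in names})

# e.g. read_columns('taxi', ['pickup_datetime', 'pickup_longitude'])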

Categoricals

Categoricals encode repetitive text columns (normally very expensive) as integers (very very cheap) in a way that is invisible to the user.

Consider the following (mostly text) columns of our NYC taxi dataset:

>>> df[['medallion', 'vendor_id', 'rate_code', 'store_and_fwd_flag']].head()
medallion vendor_id rate_code store_and_fwd_flag
0 89D227B655E5C82AECF13C3F540D4CF4 CMT 1 N
1 0BD7C8F5BA12B88E0B67BED28BEA73D8 CMT 1 N
2 0BD7C8F5BA12B88E0B67BED28BEA73D8 CMT 1 N
3 DFD2202EE08F7A8DC9A57B02ACB81FE2 CMT 1 N
4 DFD2202EE08F7A8DC9A57B02ACB81FE2 CMT 1 N

Each of these columns represents elements of a small set:

And yet we store these elements in large and cumbersome dtypes:

In [4]: df[['medallion', 'vendor_id', 'rate_code', 'store_and_fwd_flag']].dtypes
Out[4]:
medallion             object
vendor_id             object
rate_code              int64
store_and_fwd_flag    object
dtype: object

We use int64 for the rate code, which could easily have fit into an int8: an opportunity for an 8x improvement in memory use. The object dtype used for strings in Pandas and Python takes up a lot of memory and is quite slow:

In [1]: import sys
In [2]: sys.getsizeof('CMT')  # bytes
Out[2]: 40

Categoricals replace the original column with a column of integers (of the appropriate size, often int8) along with a small index mapping those integers to the original values. I've written about categoricals before so I won't go into too much depth here. Categoricals increase both storage and computational efficiency by about 10x if you have text data that describes elements in a category.
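
A small illustration with pandas (the column values here are invented for the sketch; real savings depend on your data):

import pandas as pd

s = pd.Series(['CMT', 'VTS', 'CMT', 'CMT', 'VTS'] * 1000000)
c = s.astype('category')  # int8 codes plus a tiny index of the two labels

print(s.memory_usage(deep=True))  # tens of bytes per row for Python strings
print(c.memory_usage(deep=True))  # roughly one byte per row for the codes
print(c.cat.categories)           # Index(['CMT', 'VTS'], dtype='object')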

Compression

After we've encoded everything well and separated our columns we find ourselves limited by disk I/O read speeds. Disk read bandwidths range from 100MB/s (laptop spinning disk hard drive) to 2GB/s (RAID of SSDs). This read speed strongly depends on how large our reads are. The bandwidths given above reflect large sequential reads such as you might find when reading all of a 100MB file in one go. Performance degrades for smaller reads. Fortunately, for analytic queries we're often in the large sequential read case (hooray!)

We reduce disk read times through compression. Consider the datetimes of the NYC taxi dataset. These values are repetitive and slowly changing; a perfect match for modern compression techniques.

>>> ind = df.index.compute()  # this is on presorted index data (see castra section below)
>>> ind
DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               '2013-01-01 00:00:00', '2013-01-01 00:00:00',
               ...
               '2013-12-31 23:59:42', '2013-12-31 23:59:47',
               '2013-12-31 23:59:48', '2013-12-31 23:59:49',
               '2013-12-31 23:59:50', '2013-12-31 23:59:51',
               '2013-12-31 23:59:54', '2013-12-31 23:59:55',
               '2013-12-31 23:59:57', '2013-12-31 23:59:57'],
               dtype='datetime64[ns]', name=u'pickup_datetime', length=169893985, freq=None, tz=None)

Benchmark datetime compression

We can use a modern compression library, like fastlz or blosc, to compress this data at high speeds.

In [36]: import blosc

In [37]: %time compressed = blosc.compress_ptr(address=ind.values.ctypes.data,
    ...:                                       items=len(ind),
    ...:                                       typesize=ind.values.dtype.alignment,
    ...:                                       clevel=5)
CPU times: user 3.22 s, sys: 332 ms, total: 3.55 s
Wall time: 512 ms

In [40]: len(compressed) / ind.nbytes  # compression ratio
Out[40]: 0.14296813539337488

In [41]: ind.nbytes / 0.512 / 1e9      # Compression bandwidth (GB/s)
Out[41]: 2.654593515625

In [42]: %time _ = blosc.decompress(compressed)
CPU times: user 1.3 s, sys: 438 ms, total: 1.74 s
Wall time: 406 ms

In [43]: ind.nbytes / 0.406 / 1e9      # Decompression bandwidth (GB/s)
Out[43]: 3.3476647290640393

We store 7x fewer bytes on disk (thus septupling our effective disk I/O) at the cost of a decompression step that runs at roughly 3 GB/s. If we're on a really nice MacBook Pro drive (~600MB/s) then this is a clear and substantial win. The worse the hard drive, the better this trade becomes.

But sometimes compression isn't as nice

Some data is more compressible than other data. The following column of floating point data does not compress as nicely.

In [44]: x = df.pickup_latitude.compute().values
In [45]: %time compressed = blosc.compress_ptr(x.ctypes.data, len(x), x.dtype.alignment, clevel=5)
CPU times: user 5.87 s, sys: 0 ns, total: 5.87 s
Wall time: 925 ms

In [46]: len(compressed) / x.nbytes
Out[46]: 0.7518617315969132

This compresses more slowly and only provides marginal benefit. Compression may still be worth it on slow disk but this isn't a huge win.

The pickup_latitude column isn't compressible because most of the information isn't repetitive. The numbers to the far right of the decimal point are more or less random.

40.747868

Other floating point columns may compress well, particularly when they are rounded to small and meaningful decimal values.
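For example (a minimal sketch on synthetic data; actual ratios depend on the data and codec):

import numpy as np
import blosc

x = np.random.uniform(40.5, 41.0, size=1000000)   # noisy coordinates, few repeated bytes
r = np.round(x, 3)                                 # keep three meaningful decimals

raw = len(blosc.compress_ptr(x.ctypes.data, len(x), x.dtype.itemsize, clevel=5)) / x.nbytes
rnd = len(blosc.compress_ptr(r.ctypes.data, len(r), r.dtype.itemsize, clevel=5)) / r.nbytes
# rnd is typically much smaller than raw, because rounding leaves far fewer distinct bit patterns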

Compression rules of thumb

Optimal compression requires thought. General rules of thumb include the following:

Avoid gzip and bz2

Finally, avoid gzip and bz2. These are both very common and very slow. If dealing with text data, consider snappy (also available via blosc).

Indexing/Partitioning

One column usually dominates our queries. In time-series data this is time. For personal data this is the user ID.

Just as column stores let us avoid irrelevant columns, partitioning our data along a preferred index column lets us avoid irrelevant rows. We may need the data for the last month and don't need several years' worth. We may need the information for Alice and don't need the information for Bob.

Traditional relational databases provide indexes on any number of columns or sets of columns. This is excellent if you are using a traditional relational database. Unfortunately the data structures to provide arbitrary indexes don't mix well with some of the attributes discussed above and we're limited to a single index that partitions our data into sorted blocks.
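A small pandas-only sketch of why a sorted index helps (synthetic data; real stores do the same pruning at the level of on-disk partitions):

import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', '2013-12-31 23:59', freq='min')   # sorted time index
df = pd.DataFrame({'trip_distance': np.random.rand(len(idx))}, index=idx)

january = df.loc['2013-01']   # a binary search finds the block of rows; no full scan needed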

Some projects that implement these principles

Many modern distributed database storage systems designed for analytic queries implement these principles well. Notable players include Redshift and Parquet.

Additionally newer single-machine data stores like Dato's SFrame and BColz follow many of these principles. Finally many people have been doing this for a long time with custom use of libraries like HDF5.

It turns out that these principles are actually quite easy to implement with the right tools (thank you #PyData). The rest of this post will talk about a tiny 500-line project, Castra, that implements these principles and gets good speedups on biggish Pandas data.

Castra

With these goals in mind we built Castra, a binary partitioned compressed columnstore with builtin support for categoricals and integration with both Pandas and dask.dataframe.

Load data from CSV files, sort on index, save to Castra

Here we load in our data from CSV files, sort on the pickup datetime column, and store to a castra file. This takes about an hour (as compared to eleven minutes for a single read). Again, you can view the full notebook here.

>>> import dask.dataframe as dd
>>> df = dd.read_csv('csv/trip_data_*.csv',
...                  skipinitialspace=True,
...                  parse_dates=['pickup_datetime', 'dropoff_datetime'])

>>> (df.set_index('pickup_datetime', compute=False)
...    .to_castra('trip.castra', categories=True))

Profit

Now we can take advantage of columnstores, compression, and binary representation to perform analytic queries quickly. Here is code to create a histogram of trip distance. The plot of the results follows below.
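A minimal sketch of such a query, assuming the from_castra constructor that dask.dataframe shipped at the time (treat the exact names as assumptions):

import dask.dataframe as dd

df = dd.from_castra('trip.castra')                 # assumed 2015-era dask API

# Only the trip_distance column is read from disk; value_counts runs
# across partitions in parallel with the GIL released.
counts = df.trip_distance.astype(int).value_counts().compute()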

Note that this is especially fast because Pandas now releases the GIL on value_counts operations (all groupby operations really). This takes around 20 seconds on my machine on the last release of Pandas vs 5 seconds on the development branch. Moving from CSV files to Castra moved the bottleneck of our computation from disk I/O to processing speed, allowing improvements like multi-core processing to really shine.

We plot the result of the above computation with Bokeh below. Note the spike around 20km. This is around the distance from Midtown Manhattan to LaGuardia airport.

I've shown Castra used above with dask.dataframe but it works fine with straight Pandas too.

Credit

Castra was started by myself and Valentin Haenel (current maintainer of bloscpack and bcolz) during an evening sprint following PyData Berlin. Several bugfixes and refactors followed, contributed by Phil Cloud and Jim Crist.

Castra is roughly 500 lines long. It's a tiny project which is both good and bad. It's being used experimentally and there are some heavy disclaimers in the README. This post is not intended as a sales pitch for Castra, but rather to provide a vocabulary to talk about efficient tabular storage.

28 Aug 2015 12:00am GMT

27 Aug 2015

feedPlanet Python

Evennia: A wagon-load of post-summer updates

Summer vacations are over and work resumes in Evennia land! Here's a wagon-load of small updates on what's going on.

Ainneve

The Ainneve project, the creation of an official, open-source Evennia demo game, has gotten going. The lead devs of the project are keen to make this a collaborative effort and there is a lot of good discussion and code being written. After some slowdown at the end of summer it's bound to pick up again.

Ainneve's a rare chance to see a full MUD getting developed from scratch out in the open. The current issue list includes tags for difficulty and also allows newbie Python coders to jump in. Not to mention you have a chance to get valuable feedback on your work from seasoned coders!

So if you are at all interested in making a MUD, try out Python/Evennia or just get involved in a semi-big programming project, this is a great chance to do so.

Imaginary Realities

A few weeks ago, a new issue of Imaginary Realities (vol 7, issue 3) came out. As usual I have an article in it. This venerable e-zine was revitalized to include articles on MU*s as well as roguelikes, interactive fiction and others. Not only is this issue the most content-rich since the reboot, but they have also spruced up their interface to make past issues easier to navigate.

Evennia Web client

In the pipeline I have some updates to Evennia's websocket/JSON MUD-web client component. These are changes that are intended to make the webclient easier to customize and hook into Evennia output using only HTML/CSS. More details on this will be forthcoming when I have more solid stuff to show.

______
Image: The troll here a-cometh by Griatch

27 Aug 2015 2:36pm GMT

eGenix.com: eGenix mx Base Distribution 3.2.9 GA

Introduction

The eGenix.com mx Base Distribution for Python is a collection of professional quality software tools which enhance Python's usability in many important areas such as fast text searching, date/time processing and high speed data types.

The tools have a proven track record of being portable across many Unix and Windows platforms. You can write applications which use the tools on Windows and then run them on Unix platforms without change due to the consistent platform independent interfaces.

The distribution contains these open-source Python extensions, all grouped under the top-level mx Python package:

The package also includes the mxSetup module, which implements our distutils based package tool chain (including the tooling for our Python web installer technology), as well as a number of helpful smaller modules in the mx.Misc subpackage, such as mx.Misc.ConfigFile for config file parsing or mx.Misc.CommandLine to quickly write command line applications in Python.

All available packages have proven their stability and usefulness in many mission critical applications and various commercial settings all around the world.

News

The 3.2.9 release of the eGenix mx Base Distribution is the latest release of our open-source Python extensions. It includes these fixes and enhancements:

Fixes for all Python Builds

Fixes for Python Debug Builds

Installation Enhancements and Fixes (via included mxSetup)

Most of these enhancements and fixes are part of the Python web installer support we added to mxSetup a while ago. If you want to learn more about this web installer technology, please see this talk on the topic.

eGenix mx Base Distribution 3.2.0 was released on 2012-08-28. Please see the announcement for new features in the 3.2 major release compared to earlier releases.

For a complete list of changes, please see eGenix mx Base change log page and the change logs of the included Python packages.

Upgrading

We encourage all users to upgrade to this latest eGenix mx Base Distribution release.

If you are upgrading from eGenix mx Base 3.1.x, please see the eGenix mx Base Distribution 3.2.0 release notes for details on what has changed since the 3.1 major release.

License

The eGenix mx Base Distribution is distributed under the terms of our eGenix.com Public License 1.1.0, which is an open source license similar to the Python license. You can use the packages in both commercial and non-commercial settings without fee or charge.

This open source distribution package comes with full source code.

Downloads

Please visit the eGenix mx Base Distribution product page for downloads, instructions on installation and documentation of the packages.

If you want to try the package, please jump straight to the download instructions or simply run pip install egenix-mx-base.

As always, we are providing prebuilt binaries for all supported platforms: Windows 32/64-bit, Linux 32/64-bit, FreeBSD 32/64-bit, Mac OS X 32/64-bit. Source code archives are available for installation on all other Python platforms, such as Solaris, AIX, HP-UX, etc.

To simplify installation in Zope/Plone and other egg-based systems, we have also precompiled egg distributions for all platforms. These are available on our own PyPI-style index server for easy and automatic download. Please see the download instructions for details.

Whether you are using a prebuilt package or the source distribution, installation is a simple "python setup.py install" command in all cases. The only difference is that the prebuilt packages do not require a compiler or the Python development packages to be installed.

Support

Commercial support contracts for this product are available from eGenix.com.

Please see the support section of our website for details.

More Information

For more information on the eGenix mx Base Distribution, licensing and download instructions, please write to sales@egenix.com.

Enjoy !

Marc-Andre Lemburg, eGenix.com

27 Aug 2015 10:00am GMT

Codementor: Adding Flow Control to Apache Pig using Python

(image source)

Introduction

So you like Pig but it's cramping your style? Are you not sure what Pig is about? Are you keen to write some code to write code for you? If yes, then this is for you.

This tutorial ties together a whole lot of different techniques and technologies. The aim here is to show you a trick to get Pig to behave in a way that's just a little bit more loopy. It's a trick I've used before quite a lot and I've written a couple of utility functions to make it easy. I'll go over the bits and pieces here. This tutorial, on a more general note, is about writing code that writes code. The general technique and concerns outlined here can be applied to other code generating problems.

What Does Pig Do?

Pig is a high-level scripting toolset used for defining and executing complex map-reduce workflows. Let's take a closer look at that sentence…

Pig is a top-level Apache project. It is open source and really quite nifty. Learn more about it here. PigLatin is Pig's language. Pig executes PigLatin scripts. Within a PigLatin script you write a bunch of statements that get converted into a bunch of map-reduce jobs that can get executed in sequence on your Hadoop cluster. It's usually nice to abstract away from writing plain old map-reduce jobs because they can be a total pain in the neck.

If you haven't used Pig before and aren't sure if it's for you, it might be a good idea to check out Hive. Hive and Pig have a lot of overlap in terms of functionality, but have different philosophies. They aren't total competitors because they are often used in conjunction with one another. Hive resembles SQL, while PigLatin resembles… PigLatin. So if you are familiar with SQL then Hive might be an easier learn, but IMHO Pig is a bit more sensible than Hive in how it describes data flow.

What Doesn't Pig Do?

Pig doesn't make any decisions about the flow of program execution, it only allows you to specify the flow of data. In other words, it allows you to say stuff like this:

-----------------------------------------------
-- define some data format goodies
-----------------------------------------------

define CSV_READER org.apache.pig.piggybank.storage.CSVExcelStorage(
                                                            ',',
                                                            'YES_MULTILINE',
                                                            'UNIX'
                                                            );


define CSV_WRITER org.apache.pig.piggybank.storage.CSVExcelStorage(
                                                            ',',
                                                            'YES_MULTILINE',
                                                            'UNIX',
                                                            'SKIP_OUTPUT_HEADER'
                                                            );

-----------------------------------------------
-- load some data
-----------------------------------------------

r_one = LOAD 'one.csv' using CSV_READER
AS (a:chararray,b:chararray,c:chararray);

r_two = LOAD 'two.csv' using CSV_READER
AS (a:chararray,d:chararray,e:chararray);

-----------------------------------------------
-- do some processing
-----------------------------------------------

r_joined = JOIN r_one by a, r_two by a;

r_final = FOREACH r_joined GENERATE 
    r_one::a, b, e;

-----------------------------------------------
-- store the result
-----------------------------------------------

store r_final into 'r_three.csv' using CSV_WRITER;

The script above says where the data should flow. Every statement you see there will get executed exactly once no matter what (unless there is some kind of error).

You can run the script from the command line like so:

pig path/to/my_script.oink

Ok, what if we have a bunch of files and each of them needs to have the same stuff happen to it? Does that mean we would need to copy-paste our PigLatin script and edit each one to have the right paths?

Well, no. Pig allows some really basic substitutions. You can do stuff like this:

r_one = LOAD '$DIR/one.csv' using CSV_READER
AS (a:chararray,b:chararray,c:chararray);

r_two = LOAD '$DIR/two.csv' using CSV_READER
AS (a:chararray,d:chararray,e:chararray);

-----------------------------------------------
-- do some processing
-----------------------------------------------

r_joined = JOIN r_one by a, r_two by a;

r_final = FOREACH r_joined GENERATE 
    r_one::a, b, e;

-----------------------------------------------
-- store the result
-----------------------------------------------

store r_final into '$DIR/r_three.csv' using CSV_WRITER;

Then you can run the script as many times as you like with different values for DIR. Something like:

pig path/to/my_script.oink -p DIR=jan_2015
pig path/to/my_script.oink -p DIR=feb_2015
pig path/to/my_script.oink -p DIR=march_2015

So Pig allows variable substitution, and that is a pretty powerful thing on its own. But it doesn't allow loops or if statements, and that can be somewhat limiting. What if we had to iterate over 60 different values for DIR? This is something Pig doesn't cater for.

Luckily for us, Python can loop just fine. So we could do something like:

def run_pig_script(sFilePath,dPigArgs=None):
    """
    run piggy run
    """
    import subprocess
    lCmd = ["pig",sFilePath,]  
    for sArg in ['{0}={1}'.format(*t) for t in (dPigArgs or {}).items()]:
        lCmd.append('-p')
        lCmd.append(sArg)
    print lCmd
    p = subprocess.Popen(lCmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
    stdout, stderr = p.communicate()
    return stdout,stderr

for sDir in lManyDirectories:
    run_pig_script(sFilePath="path/to/my_script.oink",dPigArgs={'DIR':sDir})

The run_pig_script function makes use of the subprocess module to create a Pig process through use of the Popen function. Popen takes a list of token strings as its first argument and makes a system call from there. So first we create the command list lCmd then start a process. The output of the process (the stuff that would usually get printed to the console window) gets redirected to the stderr and stdout objects.

In order to populate lCmd we use a short-hand for loop notation known as list comprehension. It's very cool and useful but beyond the scope of this text. Try calling run_pig_script with a few different arguments and see what it prints and you should easily get a feel for what Popen expects.
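For example (with a hypothetical script path), the call below builds the command list and prints it before launching Pig:

run_pig_script(sFilePath="path/to/my_script.oink", dPigArgs={'DIR': 'jan_2015'})
# prints: ['pig', 'path/to/my_script.oink', '-p', 'DIR=jan_2015']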

But what if you really need a loop inside your pig script?

So we have covered executing a PigLatin script many times with different values, what if we want to make use of many variables within the PigLatin script? For example, what happens if we want to loop over some variable number of directories within a single script? For example something like this…

r_jan_1 = LOAD 'jan_1/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_2 = LOAD 'jan_2/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_3 = LOAD 'jan_3/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_4 = LOAD 'jan_4/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
... more stuff
r_jan_16 = LOAD 'jan_16/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);


r_all = UNION r_jan_1, r_jan_2, r_jan_3, r_jan_4, ... r_jan_16;

Writing all that down could become tedious. Especially if we are working with an arbitrary number of files each time. Maybe we want a union of all the sales of the month so far, then we would need to come up with a new script for every day. That sounds pretty horrible and would require a lot of copy-paste and copy-paste smells bad.

So Here is What We are Going to Do Instead

Have some pythonish pseudo-code:

lStrs = complicated_operation_getting_list_of_strings() #1
sPigPath = generate_pig_script(lStrs)                   #2
run_pig_script(sFilePath = sPigPath)                    #3

So we have 3 steps in the code above: Step 1 is getting the data we need that the pig script is going to rely on. Then, in step 2, we need to take that data and turn it into something Pig will be able to understand. Step 3 then needs to make it run.

Step 1 of the process very much depends on what you are trying to do. Following from the previous example we would likely want complicated_operation_getting_list_of_strings to look like:

def complicated_operation_getting_list_of_strings():
    import datetime
    oNow = datetime.datetime.now()
    sMonth = oNow.strftime('%b').lower()
    return ["{0}_{1}".format(sMonth,i+1) for i in range(oNow.day)]
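For instance, if today were the 16th of January, the function above would return:

complicated_operation_getting_list_of_strings()
# ['jan_1', 'jan_2', 'jan_3', ..., 'jan_16']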

The rest of this tutorial will be dealing with steps 2 and 3.

Template Systems

Writing code to write code for us! That's pretty futuristic stuff!

Not really…

Ever written a web app? Did you use some kind of framework for this? Did the framework specify (or allow you to specify) some special way of writing HTML so that you could do clever things in your HTML files? Clever things like loops and ifs and variable substitutions? If you answered yes to these questions, you wrote code that wrote HTML code for you, at least. And if you answered no, then the take-away message here is: writing code that writes code is something that has been done for ages; there are many systems, libraries and packages that support this kind of thing in many languages. These kinds of tools are generally referred to as template systems.

The template system we'll be using for this is Mako. This is not a mako tutorial, to learn about mako, check this out.

An important thing in choosing a template system is to make sure that it doesn't clash with the language you are using it to write. And if it does clash then you need to find ways to compensate. What I mean by this is: if I am using a template language then that language has a few well-defined control sequences for doing things like loops and variable substitution. An example from mako is:

${xSomeVariable}

When you render that line of code then the value of xSomeVariable will get turned into a string. But what if ${stuff} meant something in the language you are trying to generate? Then there is a good chance that mako will find things in your template files that it thinks it needs to deal with and it will either output garbage or raise exceptions.

Mako and PigLatin don't have this problem. So that's pretty convenient.

Using Python to generate PigLatin

Remember this: sPigPath = generate_pig_script(lStrs)?

Good coders don't mix languages in the same file if they can help it (which is pretty much always). So while it is possible to define your entire PigLatin mako template in the form of a big giant string inside your Python script, we aren't going to do that.

Also, it would be nice if the code we are writing works for more than one template. So instead of:

sPigPath = generate_pig_script(lStrs)   #2

We'll do this:

sPigPath = generate_pig_script(sFilePath,dContext)   #2

We want to pass in the path to our template file, along with a dictionary containing the context variables we'd use to render it this time. For example we could have:

dContext = {
    'lStrs' : complicated_operation_getting_list_of_strings()
}

Ok, so let's write some real code then…

def generate_pig_script(sFilePath,dContext):
    """
    render the template at sFilePath using the context in dContext,
    save the output in a temporary file
    return the path to the generated file
    """
    from mako.template import Template
    import datetime

    #1. fetch the template from the file
    oTemplate = Template(filename=sFilePath)
    
    #2. render it using the context dictionary. This gives us a string
    sOutputScript = oTemplate.render(**dContext)

    #3. put the output into some file...
    sOutPath = "{0}_{1}".format(sFilePath,datetime.datetime.now().isoformat())
    with open(sOutPath,'w') as f:
        f.write(sOutputScript)

    return sOutPath

The comments in the code should be enough to understand its general functioning.

Just to complete the picture, let's make an actual template…

Remember this?

r_jan_1 = LOAD 'jan_1/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_2 = LOAD 'jan_2/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_3 = LOAD 'jan_3/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_4 = LOAD 'jan_4/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
... more stuff
r_jan_16 = LOAD 'jan_16/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);


r_all = UNION r_jan_1, r_jan_2, r_jan_3, r_jan_4, ... r_jan_16;

Here it is in the form of a mako template:

%for sD in lStrs:

r_${sD} = LOAD '${sD}/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);

%endfor

r_all = UNION ${','.join(['r_{0}'.format(sD) for sD in lStrs])};
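To sanity-check the template from Python (assuming it is saved as my_script.oink.mako, a hypothetical file name), we can render it with a small context and inspect the result:

from mako.template import Template

print Template(filename='my_script.oink.mako').render(lStrs=['jan_1', 'jan_2'])
# emits two LOAD statements followed by:
# r_all = UNION r_jan_1,r_jan_2;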

The full picture

So now we have used Python to generate a PigLatin script and store it in a known location. And we already know how to get Python to launch Pig. So that's it. Pretty straight forward, eh? This tutorial made use of a few different technologies and techniques and it's impossible not to jump around a little bit so I've included a little summary here of how to use this technique:

#1 given a working PigLatin script that has a lot of repetition or a variable number of inputs, create a mako template

#2 write a function that creates the context for the mako template. eg:
dContext = {
    'lStrs' : complicated_operation_getting_list_of_strings()
}

#3 render the template
sPigFilePath = generate_pig_script(sMakoFilePath,dContext)

#and finally run the thing...
run_pig_script(sPigFilePath,dPigArgs=None)

Conclusions

We've covered some of the basics of code generation and used Python and the mako templating system to make Pig more loopy. I've touched on a lot of different technologies and techniques. Pig itself is quite a big deal, and the kinds of problems it is applied to can fill books. The mako templating engine is a powerful thing in itself and has many use cases other than Pig (I mostly use it in conjunction with Pyramid, for example). Python loops and list comprehensions are worth looking into if any of the weird for-loop stuff didn't make sense; and finally there is the subprocess module, which constitutes quite a rabbit hole on its own.

27 Aug 2015 9:42am GMT

Codementor: Adding Flow Control to Apache Pig using Python

(image source)

Introduction

So you like Pig but its cramping your style? Are you not sure what Pig is about? Are you keen to write some code to write code for you? If yes, then this is for you.

This tutorial ties together a whole lot of different techniques and technologies. The aim here is to show you a trick to get Pig to behave in a way that's just a little bit more loopy. It's a trick I've used before quite a lot and I've written a couple of utility functions to make it easy. I'll go over the bits and pieces here. This tutorial, on a more general note, is about writing code that writes code. The general technique and concerns outlined here can be applied to other code generating problems.

What Does Pig Do?

Pig is a high-level scripting toolset used for defining and executing complex map-reduce workflows. Let's take a closer look at that sentence…

Pig, is a top-level Apache project. It is open source and really quite nifty. Learn more about it here. PigLatin is Pig's language. Pig executes PigLatin scripts. Within a PigLatin script you write a bunch of statements that get converted into a bunch of map-reduce jobs that can get executed in sequence on your Hadoop cluster. It's usually nice to abstract away from writing plain old map-reduce jobs because they can be a total pain in the neck.

If you haven't used Pig before and aren't sure if it's for you, it might be a good idea to check out Hive. Hive and Pig have a lot of overlap in terms of functionality, but have different philosophies. They aren't total competitors because they are often used in conjunction with one another. Hive resembles SQL, while PigLatin resembles… PigLatin. So if you are familiar with SQL then Hive might be an easier learn, but IMHO Pig is a bit more sensible than Hive in how it describes data flow.

What Doesn't Pig Do?

Pig doesn't make any decisions about the flow of program execution, it only allows you to specify the flow of data. In other words, it allows you to say stuff like this:

-----------------------------------------------
-- define some data format goodies
-----------------------------------------------

define CSV_READER org.apache.pig.piggybank.storage.CSVExcelStorage(
                                                            ',',
                                                            'YES_MULTILINE',
                                                            'UNIX'
                                                            );


define CSV_WRITER org.apache.pig.piggybank.storage.CSVExcelStorage(
                                                            ',',
                                                            'YES_MULTILINE',
                                                            'UNIX',
                                                            'SKIP_OUTPUT_HEADER'
                                                            );

-----------------------------------------------
-- load some data
-----------------------------------------------

r_one = LOAD 'one.csv' using CSV_READER
AS (a:chararray,b:chararray,c:chararray);

r_two = LOAD 'two.csv' using CSV_READER
AS (a:chararray,d:chararray,e:chararray);

-----------------------------------------------
-- do some processing
-----------------------------------------------

r_joined = JOIN r_one by a, t_two by a;

r_final = FOREACH r_joined GENERATE 
    r_one::a, b, e;

-----------------------------------------------
-- store the result
-----------------------------------------------

store r_final into 'r_three.csv' using CSV_WRITER;

The script above says where the data should flow. Every statement you see there will get executed exactly once no matter what (unless there is some kind of error).

You can run the script from the command line like so:

pig path/to/my_script.oink

Ok, what if we have a bunch of files and each of them needs to have the same stuff happen to it? Does that mean we would need to copy-paste our PigLatin script and edit each one to have the right paths?

Well, no. Pig allows some really basic substitutions. You can do stuff like this:`

r_one = LOAD '$DIR/one.csv' using CSV_READER
AS (a:chararray,b:chararray,c:chararray);

r_two = LOAD '$DIR/two.csv' using CSV_READER
AS (a:chararray,d:chararray,e:chararray);

-----------------------------------------------
-- do some processing
-----------------------------------------------

r_joined = JOIN r_one by a, t_two by a;

r_final = FOREACH r_joined GENERATE 
    r_one::a, b, e;

-----------------------------------------------
-- store the result
-----------------------------------------------

store r_final into '$DIR/r_three.csv' using CSV_WRITER;

Then you can run the script as many times as you like with different values for DIR. Something like:

pig path/to/my_script.oink -p DIR=jan_2015
pig path/to/my_script.oink -p DIR=feb_2015
pig path/to/my_script.oink -p DIR=march_2015

So pig allows variable substitution and that is a pretty powerful thing on its own. But it doesn't allow loops or if statements and that can be somewhat limiting. What if we had to iterate over 60 different values for DIR? This is something Pig doesn't cater for.

Luckily for us, Python can loop just fine. So we could do something like:

def run_pig_script(sFilePath,dPigArgs=None):
    """
    run piggy run
    """
    import subprocess
    lCmd = ["pig",sFilePath,]  
    for sArg in ['{0}={1}'.format(*t) for t in (dPigArgs or {}).items()]:
        lCmd.append('-p')
        lCmd.append(sArg)
    print lCmd
    p = subprocess.Popen(lCmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
    stdout, stderr = p.communicate()
    return stdout,stderr

for sDir in lManyDirectories:
    run_pig_script(sFilePath="path/to/my_script.oink",dPigArgs={'DIR':sDir})

The run_pig_script function makes use of the subprocess module to create a Pig process through use of the Popen function. Popen takes a list of token strings as its first argument and makes a system call from there. So first we create the command list lCmd then start a process. The output of the process (the stuff that would usually get printed to the console window) gets redirected to the stderr and stdout objects.

In order to populate lCmd we use a short-hand for loop notation known as list comprehension. It's very cool and useful but beyond the scope of this text. Try calling run_pig_script with a few different arguments and see what it prints and you should easily get a feel for what Popen expects.

But what if you really need a loop inside your pig script?

So we have covered executing a PigLatin script many times with different values, what if we want to make use of many variables within the PigLatin script? For example, what happens if we want to loop over some variable number of directories within a single script? For example something like this…

r_jan_1 = LOAD 'jan_1/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_2 = LOAD 'jan_2/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_3 = LOAD 'jan_3/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_4 = LOAD 'jan_4/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
... more stuff
r_jan_16 = LOAD 'jan_16/prices.csv' USING CSV_READER AS (a,b,c,d,e,f,g);


r_all = UNION r_jan_1, r_jan_2, r_jan_3, r_jan_4, ... r_jan_16;

Writing all that down could become tedious. Especially if we are working with an arbitrary number of files each time. Maybe we want a union of all the sales of the month so far, then we would need to come up with a new script for every day. That sounds pretty horrible and would require a lot of copy-paste and copy-paste smells bad.

So Here is What We are Going to Do Instead

Have some pythonish pseudo-code:

lStrs = complicated_operation_getting_list_of_strings() #1
sPigPath = generate_pig_script(lStrs)                   #2
run_pig_script(sFilePath = sPigPath)                    #3

So we have 3 steps in the code above: Step 1 is getting the data we need that the pig script is going to rely on. Then, in step 2, we need to take that data and turn it into something Pig will be able to understand. Step 3 then needs to make it run.

Step 1 of the process very much depends on what you are trying to do. Following from the previous example we would likely want complicated_operation_getting_list_of_strings to look like:

def complicated_operation_getting_list_of_strings():
    import datetime
    oNow = datetime.datetime.now()
    sMonth = oNow.strftime('%b').lower()
    return ["{0}_{1}".format(sMonth,i+1) for i in range(oNow.day)]

The rest of this tutorial wil be dealing with steps 2 and 3.

Template Systems

Writing code to write code for us! That's pretty futuristic stuff!

Not really…

Ever written a web app? Did you use some kind of framework for this? Did the framework specify (or allow you to specify) some special way of writing HTML so that you could do clever things in your HTML files? Clever things like loops and ifs and variable substitutions? If you answered yes to these questions, then you wrote code that wrote HTML code for you, at the very least. And if you answered no, then the take-away message here is: writing code that writes code is something that has been done for ages; there are many systems, libraries and packages that support this kind of thing in many languages. These kinds of tools are generally referred to as template systems.

The template system we'll be using for this is mako. This is not a mako tutorial; to learn about mako, check this out.

An important thing in choosing a template system is to make sure that it doesn't clash with the language you are using it to write, and if it does clash then you need to find ways to compensate. What I mean by this is: if I am using a template language, then that language has a few well-defined control sequences for doing things like loops and variable substitution. An example from mako is:

${xSomeVariable}

When you render that line, the value of xSomeVariable gets turned into a string and substituted in. But what if ${stuff} already meant something in the language you are trying to generate? Then there is a good chance that mako would find things in your template files that it thinks it needs to deal with, and it would either output garbage or raise exceptions.

Mako and PigLatin don't have this problem. So that's pretty convenient.
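
As a quick illustration of that substitution syntax, here is a minimal sketch using an inline template string rather than the file-based templates we'll use below:

from mako.template import Template

# ${sName} gets replaced by the keyword argument passed to render()
print Template("Hello ${sName}!").render(sName="Pig")
# Hello Pig!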

Using Python to generate PigLatin

Remember this: sPigPath = generate_pig_script(lStrs)?

Good coders don't mix languages in the same file if they can help it (which is pretty much always). So while it is possible to define your entire PigLatin mako template in the form of a big giant string inside your Python script, we aren't going to do that.

Also, it would be nice if the code we are writing works for more than one template. So instead of:

sPigPath = generate_pig_script(lStrs)   #2

We'll do this:

sPigPath = generate_pig_script(sFilePath,dContext)   #2

We want to pass in the path to our template file, along with a dictionary containing the context variables we'd use to render it this time. For example we could have:

dContext = {
    'lStrs' : complicated_operation_getting_list_of_strings()
}

Ok, so let's write some real code then…

def generate_pig_script(sFilePath,dContext):
    """
    render the template at sFilePath using the context in dContext,
    save the output in a temporary file
    return the path to the generated file
    """
    from mako.template import Template
    import datetime

    #1. fetch the template from the file
    oTemplate = Template(filename=sFilePath)
    
    #2. render it using the context dictionary. This gives us a string
    sOutputScript = oTemplate.render(**dContext)

    #3. put the output into some file...
    sOutPath = "{0}_{1}".format(sFilePath,datetime.datetime.now().isoformat())
    with open(sOutPath,'w') as f:
        f.write(sOutputScript)

    return sOutPath

The comments in the code should be enough to understand its general functioning.

Just to complete the picture, let's make an actual template…

Remember this?

r_jan_1 = LOAD 'jan_1/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_2 = LOAD 'jan_2/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_3 = LOAD 'jan_3/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_4 = LOAD 'jan_4/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
... more stuff
r_jan_16 = LOAD 'jan_16/prices.csv' USING CSV_READER AS (a,b,c,d,e,f,g);


r_all = UNION r_jan_1, r_jan_2, r_jan_3, r_jan_4, ... r_jan_16;

Here it is in the form of a mako template:

%for sD in lStrs:

r_${sD} = LOAD '${sD}/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);

%endfor

r_all = UNION ${','.join(['r_{0}'.format(sD) for sD in lStrs])};
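
Rendered with, say, lStrs = ['jan_1', 'jan_2'], that template produces PigLatin along these lines (give or take some blank lines):

r_jan_1 = LOAD 'jan_1/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_2 = LOAD 'jan_2/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);

r_all = UNION r_jan_1,r_jan_2;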

The full picture

So now we have used Python to generate a PigLatin script and store it in a known location, and we already know how to get Python to launch Pig. So that's it. Pretty straightforward, eh? This tutorial made use of a few different technologies and techniques, and it's impossible not to jump around a little bit, so I've included a little summary here of how to use this technique:

#1 given a working PigLatin script that has a lot of repetition or a variable number of inputs, create a mako template

#2 write a function that creates the context for the mako template. eg:
dContext = {
    'lStrs' : complicated_operation_getting_list_of_strings()
}

#3 render the template
sPigFilePath = generate_pig_script(sMakoFilePath,dContext)

#and finally run the thing...
run_pig_script(sPigFilePath,dPigArgs=None)

Conclusions

We've covered some of the basics of code generation and used Python and the mako templating system to make Pig more loopy. I've touched on a lot of different technologies and techniques. Pig itself is quite a big deal, and the kinds of problems it is applied to can fill books. The mako templating engine is a powerful thing in itself and has many use cases other than Pig (I mostly use it in conjunction with Pyramid, for example). Python loops and list comprehensions are worth looking into if any of the weird for-loop stuff didn't make sense; and finally the subprocess module - it constitutes quite a rabbit hole on its own.

27 Aug 2015 9:42am GMT

26 Aug 2015

feedPlanet Python

Daily Tech Video (Python): [Video 275] Adam Forsyth: Python Not Recommended

Programmers love to advocate for their favorite languages. "I love language X, so why doesn't everyone else use language X?" As a growing number of businesses use multiple languages, you can end up with pockets of different advocates, many of whom are then adamant that their language be used for more things. In this talk, Adam Forsyth describes life as a Python developer in a mostly-Ruby shop - and then goes on to describe when, much as he likes Python, it isn't an appropriate tool to use.

The post [Video 275] Adam Forsyth: Python Not Recommended appeared first on Daily Tech Video.

26 Aug 2015 9:52pm GMT


Carl Chenet: Retweet 0.2 : bump to Python 3

Follow me on Identi.ca or Twitter or Diaspora* Don't know Retweet? My last post about it introduced this small Twitter bot which (for now) just retweets every tweet from one Twitter account to another. Retweet 0.2 on Github - The Retweet project on Github - The official documentation of the Retweet project. Retweet was created in order…

26 Aug 2015 9:01pm GMT


Python Anywhere: New release - Web app charts, MySQL upgrade and bug fixes

Hit charts for web apps

Screenshot of hit charts

The main change for this release is that we now report hits and errors for your web apps on the web app page. If you're a paying user, you get pretty charts over a range of time periods. If you're not, you'll get a text report.

Web app error reporting

We've greatly improved the errors that are reported when you reload a web app.

Batteries included

As much as is possible, we have tried to bring the packages that we install for Python 3 to parity with Python 2. That means that the number of packages that come preinstalled for Python 3 has increased dramatically.

Database upgrade

All of your databases have been upgraded to MySQL 5.5.

Other stuff

We've also applied a number of small bug fixes, user interface improvements and stability fixes.

26 Aug 2015 8:02am GMT


Nigel Babu: Mailman Subscription Spam

In the last few weeks, a few of us running Mailman have been noticing attacks using our servers. Most often we end up being used as relays to send subscription spam to the servers. They pick one address and use multiple aliases of that address to send spam to. I won't get into the details of the attack, but here's a script that I came up with, which has now been modified to be friendlier thanks to the OpenStack Infra Team.

Create the file /usr/lib/mailman/bin/ban.py with this content:

def ban(m, address):
    # lock the list, append the address to its ban_list if it isn't there yet,
    # save, and always release the lock
    try:
        m.Lock()
        if address not in m.ban_list:
            m.ban_list.append(address)
        m.Save()
    finally:
        m.Unlock()

Now run the script like this:

sudo /usr/lib/mailman/bin/withlist -a -r ban "<address to ban>"

The ban address can be a regular expression, so to ban an address and all of its suffixes, use ^address.*@example.com as the address to ban.

26 Aug 2015 5:40am GMT


10 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: King Williams Town Station

Yesterday morning I had to go to the station in KWT to pick up our reserved bus tickets for the Christmas holidays in Capetown. The station itself has had no train service since December for cost reasons - but Translux and co, the long-distance bus companies, have their offices there.






© benste CC NC SA

10 Nov 2011 10:57am GMT

09 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein

Nobody is worried about something like that - you just drive through by car, and in the city - near Gnobie - "nah, it only gets dangerous once the fire brigade is there" - 30 minutes later, on the way back, the fire brigade was there.




© benste CC NC SA

09 Nov 2011 8:25pm GMT

08 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Brai Party

Brai = a barbecue evening or something like that.

They'd love to have technicians mending their SpeakOn / jack plug splitters...

The ladies - the "Mamas" of the settlement - during the official opening speech

Even though fewer people came than expected: loud music and lots of people ...

And of course a fire with real wood for the braai.

© benste CC NC SA

08 Nov 2011 2:30pm GMT

07 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux Server back to her room. While doing that we saw her new decoration.

Björn and Simphiwe carried the PC to Katja's school


© benste CC NC SA

07 Nov 2011 2:00pm GMT

06 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's guest family for a special party. First of all we visited some friends of Nelisa - yeah, the one I'm working with in Quigney - Katja's guest father's sister - who gave her a haircut.

African women usually get their hair done by arranging extensions, not, like Europeans, by just cutting some hair.

In between she looked like this...

And then she was done - looks amazing considering the amount of hair she had last week - doesn't it ?

© benste CC NC SA

06 Nov 2011 7:45pm GMT

05 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: My Saturday

Somehow it occurred to me today that I need to restructure my blog posts a bit - if I only ever report on new places, I would have to be on a permanent round trip. So here are a few things from my everyday life today.

First of all: Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm - Katja and Björn are at their placement sites by now, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza.
Robin, Rommel's wife, has been in Woodie Cape since Thursday to take care of a few things there.
Anyway, this morning we first treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist: Vodacom and Ethienne (the estate agent), plus bringing the missing items to NeedsCamp on the way back.

Just after setting off on the dirt road we realised that we hadn't packed the things for NeedsCamp and Ethienne, but did have the pump for the water supply in the car.

So in East London we first went to Farmerama - no, not the online game Farmville, but a shop with all sorts of things for a farm - in Berea, a northern part of town.

At Farmerama we got advice on a quick-release coupling that should make life with the pump easier, and we also dropped off a lighter pump for repair, so that it isn't such a big effort every time the water runs out again.

Fego Caffé is in the Hemmingways Mall; there we had to get the PIN and PUK for one of our data SIM cards, because unfortunately some digits got transposed when the PIN was entered. Well, in any case the shops in South Africa store data as sensitive as a PUK - which in principle gives access to a locked phone.

In the café Rommel then carried out a few online transactions with the 3G modem, which was working again - and which, by the way, now works perfectly in Ubuntu, my Linux system.

On the side I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltop's centres. The picture shows the UMTS coverage in NeedsCamp, Katja's place. 8ta is a new telephone provider from Telkom; after Vodafone bought Telkom's shares in Vodacom, they have to build up a network completely from scratch.
We decided to organise a free prepaid card to test, because who knows how accurate the coverage map above really is ... Before signing even the cheapest 24-month deal, you should know whether it actually works.

After that we went to Checkers in Vincent, looking for two hotplates for WoodyCape - R 129.00 each, so about 12€ for a two-ring hotplate.
As you can see in the background, the Christmas decorations are already out - at the beginning of November, and that in South Africa at a sunny, warm 25°C minimum.

For lunch we treated ourselves to a Pakistani curry takeaway - highly recommended!
Well, and after we got back an hour or so ago, I cleaned the fridge, which I had simply put outside this morning to defrost. Now it's clean again and free of its 3m-thick layer of ice...

Tomorrow ... I'll report on that separately ... but probably not until Monday, because then I'll be back in Quigney (East London) and have free internet.

© benste CC NC SA

05 Nov 2011 4:33pm GMT

31 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltop's computer centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

Pupils in the big classroom


The Trainer


School in Countryside


Adult Class in the Afternoon


"Town"


© benste CC NC SA

31 Oct 2011 4:58pm GMT

Benedict Stein: Technical Issues

What do you do in an internet cafe when your ADSL and fax line have been cut off before month's end? Well, my idea was sitting outside and eating some ice cream.
At least it's sunny and not as rainy as on the weekend.


© benste CC NC SA

31 Oct 2011 3:11pm GMT

30 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nellis Restaurant

For those who are traveling through Zastron - there is a very nice restaurant serving delicious food at reasonable prices.
In addition they sell home-made juices, jams and honey.




interior


home made specialities - the shop in the shop


the Bar


© benste CC NC SA

30 Oct 2011 4:47pm GMT

29 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: The way back from J'burg

On the 10-12h trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadsides

Plain Street


Orange River in its beginnings (near Lesotho)


Zastron Anglican Church


The Bridge in Between "Free State" and Eastern Cape next to Zastron


my new Background ;)


If you listen to GoogleMaps you'll end up traveling 50km of gravel road - as it was just renewed we didn't have that many problems and saved 1h compared to going the official way with all its construction sites




Freeway


getting dark


© benste CC NC SA

29 Oct 2011 4:23pm GMT

28 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: How does a road construction site actually work?

Sure, some things may be different and a lot may be the same - but a road construction site is an everyday sight in Germany - how does that actually work in South Africa?

First of all, up front - NO, no natives digging with their hands - even though more manpower is used here, they are busily working with technology.

A perfectly normal "federal road"


and how it is being widened


looots of trucks


because here one side is completely closed over a long stretch, which results in a traffic-light arrangement with a waiting time of 45 minutes here


But at least they seem to be having fun ;) - as did we, since luckily we never had to wait longer than 10 min.

© benste CC NC SA

28 Oct 2011 4:20pm GMT