07 Dec 2016

Planet Python

PyCharm: PyCharm 2016.3.1 RC Available

We're happy to announce the availability of PyCharm 2016.3.1 RC. We've worked hard to fix some issues some of you are facing:

Get it now from the EAP page!

Although we pay careful attention to make sure our software works well, if you experience any issues please report them on our issue tracker. If you have any ideas about how we can make PyCharm better, please let us know on our issue tracker as well!

The Drive to Develop

-PyCharm Team

07 Dec 2016 3:25pm GMT

Codementor: 15 Essential Python Interview Questions

##Introduction

Looking for a Python job? Chances are you will need to prove that you know how to work with Python. Here are a couple of questions that cover a wide base of skills associated with Python. Focus is placed on the language itself, and not any particular package or framework. Each question will be linked to a suitable tutorial if there is one. Some questions will wrap up multiple topics.

I haven't actually been given an interview test quite as hard as this one; if you can get to the answers comfortably, then go get yourself a job.

##What this tutorial is not

This tutorial does not aim to cover every available workplace culture - different employers will ask you different questions in different ways; they will follow different conventions; they will value different things. They will test you in different ways. Some employers will sit you down in front of a computer and ask you to solve simple problems; some will stand you up in front of a whiteboard and do similar; some will give you a take-home test to solve; some will just have a conversation with you.

The best test for a programmer is actually programming, and that is a difficult thing to test with a simple tutorial. So, for bonus points, make sure that you can actually use the functionality demonstrated in the questions. If you understand how to get to the answers well enough that you can make use of the demonstrated concepts, then you are winning.

Similarly, the best test for a software engineer is actually engineering. This tutorial is about Python as a language. Being able to design efficient, effective, maintainable class hierarchies for solving niche problems is great and wonderful and a skill set worth pursuing but well beyond the scope of this text.

Another thing this tutorial is not is PEP8 compliant. This is intentional because, as mentioned before, different employers follow different conventions. You will need to adapt to fit the culture of the workplace. Because practicality beats purity.

Another thing this tutorial isn't is concise. I don't want to just throw questions and answers at you and hope something sticks. I want you to get it, or at least get it well enough that you are in a position to look for further explanations yourself for any problem topics.




Question 1

What is Python really? You can (and are encouraged to) make comparisons to other technologies in your answer.

###Answer

Here are a few key points:
- Python is an interpreted language. That means that, unlike languages like C and its variants, Python does not need to be compiled before it is run. Other interpreted languages include PHP and Ruby.

Why this matters:

If you are applying for a Python position, you should know what it is and why it is so gosh-darn cool. And why it isn't o.O

Question 2

Fill in the missing code:

def print_directory_contents(sPath):
    """
    This function takes the name of a directory 
    and prints out the paths files within that 
    directory as well as any files contained in 
    contained directories. 

    This function is similar to os.walk. Please don't
    use os.walk in your answer. We are interested in your 
    ability to work with nested structures. 
    """
    fill_this_in

Answer

def print_directory_contents(sPath):
    import os                                       
    for sChild in os.listdir(sPath):                
        sChildPath = os.path.join(sPath,sChild)
        if os.path.isdir(sChildPath):
            print_directory_contents(sChildPath)
        else:
            print(sChildPath)


Pay special attention

Why this matters:

Question 3

Looking at the below code, write down the final values of A0, A1, …An.

A0 = dict(zip(('a','b','c','d','e'),(1,2,3,4,5)))
A1 = range(10)
A2 = sorted([i for i in A1 if i in A0])
A3 = sorted([A0[s] for s in A0])
A4 = [i for i in A1 if i in A3]
A5 = {i:i*i for i in A1}
A6 = [[i,i*i] for i in A1]

If you don't know what zip is, don't stress out. No sane employer will expect you to memorize the standard library. Here is the output of help(zip).

zip(...)
    zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
    
    Return a list of tuples, where each tuple contains the i-th element
    from each of the argument sequences.  The returned list is truncated
    in length to the length of the shortest argument sequence.

If that doesn't make sense then take a few minutes to figure it out however you choose to.
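
If a concrete example helps, here is a minimal one of my own (note that in Python 3, zip returns an iterator, so we wrap it in list() to see the pairs):

```python
# A quick look at what zip produces; list() is needed in Python 3
# because zip returns a lazy iterator there.
pairs = list(zip(('a', 'b', 'c'), (1, 2, 3)))
print(pairs)        # [('a', 1), ('b', 2), ('c', 3)]
print(dict(pairs))  # {'a': 1, 'b': 2, 'c': 3}
```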

Answer

A0 = {'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4}  # the order may vary
A1 = range(0, 10) # or [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] in python 2
A2 = []
A3 = [1, 2, 3, 4, 5]
A4 = [1, 2, 3, 4, 5]
A5 = {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
A6 = [[0, 0], [1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36], [7, 49], [8, 64], [9, 81]]

Why this is important

  1. List comprehensions are a wonderful time saver and a big stumbling block for a lot of people.
  2. If you can read them, you can probably write them down.
  3. Some of this code was made to be deliberately weird. You may need to work with some weird people.


Question 4

Python and multi-threading. Is it a good idea? List some ways to get some Python code to run in a parallel way.

Answer

Python doesn't allow multi-threading in the truest sense of the word. It has a multi-threading package but if you want to multi-thread to speed your code up, then it's usually not a good idea to use it. Python has a construct called the Global Interpreter Lock (GIL). The GIL makes sure that only one of your 'threads' can execute at any one time. A thread acquires the GIL, does a little work, then passes the GIL onto the next thread. This happens very quickly so to the human eye it may seem like your threads are executing in parallel, but they are really just taking turns using the same CPU core. All this GIL passing adds overhead to execution. This means that if you want to make your code run faster then using the threading package often isn't a good idea.

There are reasons to use Python's threading package. If you want to run some things simultaneously, and efficiency is not a concern, then it's totally fine and convenient. Or if you are running code that needs to wait for something (like some IO) then it could make a lot of sense. But the threading library won't let you use extra CPU cores.

Multi-threading can be outsourced to the operating system (by doing multi-processing), to some external application that calls your Python code (e.g. Spark or Hadoop), or to some code that your Python code calls (e.g. you could have your Python code call a C function that does the expensive multi-threaded work).
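
If the multi-processing route sounds abstract, here is a minimal sketch of my own (not from the original text) using the standard library's multiprocessing module; expensive_function is just a stand-in for whatever CPU-bound work you have:

```python
# Minimal sketch: spreading CPU-bound work across cores with multiprocessing.
# expensive_function is a placeholder for your own CPU-heavy code.
from multiprocessing import Pool

def expensive_function(n):
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with Pool(processes=4) as pool:                      # 4 worker processes
        results = pool.map(expensive_function, [10**6] * 8)
    print(results)
```

Each worker is a separate process with its own interpreter and its own GIL, which is why this can actually use more than one core.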

###Why this is important
Because the GIL is an A-hole. Lots of people spend a lot of time trying to find bottlenecks in their fancy Python multi-threaded code before they learn what the GIL is.

Question 5

How do you keep track of different versions of your code?

Answer:

Version control! At this point, you should act excited and tell them how you even use Git (or whatever is your favorite) to keep track of correspondence with Granny. Git is my preferred version control system, but there are others, for example, Subversion.

Why this is important:

Because code without version control is like coffee without a cup. Sometimes we need to write once-off throwaway scripts, and that's OK, but if you are dealing with any significant amount of code, a version control system will be a benefit. Version control helps with keeping track of who made what change to the code base; finding out when bugs were introduced to the code; keeping track of versions and releases of your software; distributing the source code amongst team members; and deployment and certain automations. It allows you to roll your code back to before you broke it, which is great on its own. Lots of stuff. It's just great.

Question 6

What does this code output:

def f(x,l=[]):
    for i in range(x):
        l.append(i*i)
    print(l) 

f(2)
f(3,[3,2,1])
f(3)


###Answer

[0, 1]
[3, 2, 1, 0, 1, 4]
[0, 1, 0, 1, 4]


###Huh?
The first function call should be fairly obvious: the loop appends 0 and then 1 to the empty list, l. l is a name for a variable that points to a list stored in memory. The catch is that the default value [] is created once, when the function is defined, not once per call.
The second call starts off by creating a new list in a new block of memory. l then refers to this new list. It then appends 0, 1 and 4 to this new list. So that's great.
The third function call is the weird one. It uses the original list stored in the original memory block - the same default list the first call already appended to. That is why it starts off with 0 and 1.

Try this out if you don't understand:
```python
l_mem = []

l = l_mem           # the first call
for i in range(2):
    l.append(i*i)

print(l)            # [0, 1]

l = [3,2,1]         # the second call
for i in range(3):
    l.append(i*i)

print(l)            # [3, 2, 1, 0, 1, 4]

l = l_mem           # the third call
for i in range(3):
    l.append(i*i)

print(l)            # [0, 1, 0, 1, 4]
```

Question 7

What is monkey patching and is it ever a good idea?

Answer

Monkey patching is changing the behaviour of a function or object after it has already been defined. For example:

import datetime
datetime.datetime.now = lambda: datetime.datetime(2012, 12, 12)

Most of the time it's a pretty terrible idea - it is usually best if things act in a well-defined way. One reason to monkey patch would be in testing. The mock package is very useful to this end.
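
As a rough sketch of the testing use case (my example, not the article's), the standard library's unittest.mock can patch a real function for the duration of a block and restore it automatically afterwards, which is safer than monkey patching it globally:

```python
# Patch time.time inside the with-block only; the real function is
# restored automatically when the block exits.
import time
from unittest import mock

with mock.patch('time.time', return_value=1355270400.0):
    print(time.time())   # 1355270400.0 (the fixed value we injected)

print(time.time())       # back to the real clock
```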

###Why does this matter?
It shows that you understand a bit about methodologies in unit testing. Your mention of monkey avoidance will show that you aren't one of those coders who favor fancy code over maintainable code (they are out there, and they suck to work with). Remember the principle of KISS? And it shows that you know a little bit about how Python works on a lower level, how functions are actually stored and called and suchlike.

PS: it's really worth reading a little bit about mock if you haven't yet. It's pretty useful.

Question 8

What does this stuff mean: *args, **kwargs? And why would we use it?

Answer

Use *args when we aren't sure how many arguments are going to be passed to a function, or if we want to pass a stored list or tuple of arguments to a function. **kwargs is used when we don't know how many keyword arguments will be passed to a function, or it can be used to pass the values of a dictionary as keyword arguments. The identifiers args and kwargs are a convention; you could also use *bob and **billy, but that would not be wise.

Here is a little illustration:


def f(*args,**kwargs): print(args, kwargs)

l = [1,2,3]
t = (4,5,6)
d = {'a':7,'b':8,'c':9}

f()                         # () {}
f(1,2,3)                    # (1, 2, 3) {}
f(1,2,3,"groovy")           # (1, 2, 3, 'groovy') {}
f(a=1,b=2,c=3)              # () {'a': 1, 'c': 3, 'b': 2}
f(a=1,b=2,c=3,zzz="hi")     # () {'a': 1, 'c': 3, 'b': 2, 'zzz': 'hi'}
f(1,2,3,a=1,b=2,c=3)        # (1, 2, 3) {'a': 1, 'c': 3, 'b': 2}

f(*l,**d)                   # (1, 2, 3) {'a': 7, 'c': 9, 'b': 8}
f(*t,**d)                   # (4, 5, 6) {'a': 7, 'c': 9, 'b': 8}
f(1,2,*t)                   # (1, 2, 4, 5, 6) {}
f(q="winning",**d)          # () {'a': 7, 'q': 'winning', 'c': 9, 'b': 8}
f(1,2,*t,q="winning",**d)   # (1, 2, 4, 5, 6) {'a': 7, 'q': 'winning', 'c': 9, 'b': 8}

def f2(arg1,arg2,*args,**kwargs): print(arg1,arg2, args, kwargs)

f2(1,2,3)                       # 1 2 (3,) {}
f2(1,2,3,"groovy")              # 1 2 (3, 'groovy') {}
f2(arg1=1,arg2=2,c=3)           # 1 2 () {'c': 3}
f2(arg1=1,arg2=2,c=3,zzz="hi")  # 1 2 () {'c': 3, 'zzz': 'hi'}
f2(1,2,3,a=1,b=2,c=3)           # 1 2 (3,) {'a': 1, 'c': 3, 'b': 2}

f2(*l,**d)                   # 1 2 (3,) {'a': 7, 'c': 9, 'b': 8}
f2(*t,**d)                   # 4 5 (6,) {'a': 7, 'c': 9, 'b': 8}
f2(1,2,*t)                   # 1 2 (4, 5, 6) {}
f2(1,1,q="winning",**d)      # 1 1 () {'a': 7, 'q': 'winning', 'c': 9, 'b': 8}
f2(1,2,*t,q="winning",**d)   # 1 2 (4, 5, 6) {'a': 7, 'q': 'winning', 'c': 9, 'b': 8} 


Why do we care?

Sometimes we will need to pass an unknown number of arguments or keyword arguments into a function. Sometimes we will want to store arguments or keyword arguments for later use. Sometimes it's just a time saver.
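
One hedged illustration of the "store for later use" case (my example, using the standard library's functools.partial, which stores positional and keyword arguments and applies them when the new callable is invoked):

```python
# Storing keyword arguments for later use with functools.partial.
from functools import partial

def report(event, level='info', **details):
    print(event, level, details)

# log_error remembers level='error' and source='worker-3' for every call.
log_error = partial(report, level='error', source='worker-3')
log_error('disk full', retry=False)
# disk full error {'source': 'worker-3', 'retry': False}
```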

Question 9

What do these mean to you: @classmethod, @staticmethod, @property?

###Answer

Background knowledge:

These are decorators. A decorator is a special kind of function that either takes a function and returns a function, or takes a class and returns a class. The @ symbol is just syntactic sugar that allows you to decorate something in a way that's easy to read.

@my_decorator
def my_func(stuff):
    do_things


Is equivalent to
```python
def my_func(stuff):
    do_things

my_func = my_decorator(my_func)
```

You can find a tutorial on how decorators in general work here.
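
To make that equivalence concrete, here is a small decorator of my own (not from the original text) that logs each call to the function it wraps:

```python
# A tiny concrete decorator: wraps a function and prints each call.
import functools

def log_calls(func):
    @functools.wraps(func)                 # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print("calling {0}{1}".format(func.__name__, args))
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(x, y):
    return x + y

print(add(2, 3))
# calling add(2, 3)
# 5
```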

###Actual Answer
The decorators @classmethod, @staticmethod and @property are used on functions defined within classes. Here is how they behave:

class MyClass(object):
    def __init__(self):
        self._some_property = "properties are nice"
        self._some_other_property = "VERY nice"
    def normal_method(*args,**kwargs):
        print("calling normal_method({0},{1})".format(args,kwargs))
    @classmethod
    def class_method(*args,**kwargs):
        print("calling class_method({0},{1})".format(args,kwargs))
    @staticmethod
    def static_method(*args,**kwargs):
        print("calling static_method({0},{1})".format(args,kwargs))
    @property
    def some_property(self,*args,**kwargs):
        print("calling some_property getter({0},{1},{2})".format(self,args,kwargs))
        return self._some_property
    @some_property.setter
    def some_property(self,*args,**kwargs):
        print("calling some_property setter({0},{1},{2})".format(self,args,kwargs))
        self._some_property = args[0]
    @property
    def some_other_property(self,*args,**kwargs):
        print("calling some_other_property getter({0},{1},{2})".format(self,args,kwargs))
        return self._some_other_property

o = MyClass()

# undecorated methods work like normal, they get the current instance (self) as the first argument

o.normal_method 

# <bound method MyClass.normal_method of <__main__.MyClass instance at 0x7fdd2537ea28>>

o.normal_method() 

# normal_method((<__main__.MyClass instance at 0x7fdd2537ea28>,),{})

o.normal_method(1,2,x=3,y=4) 

# normal_method((<__main__.MyClass instance at 0x7fdd2537ea28>, 1, 2),{'y': 4, 'x': 3})


# class methods always get the class as the first argument

o.class_method

# <bound method classobj.class_method of <class __main__.MyClass at 0x7fdd2536a390>>

o.class_method()

# class_method((<class __main__.MyClass at 0x7fdd2536a390>,),{})

o.class_method(1,2,x=3,y=4)

# class_method((<class __main__.MyClass at 0x7fdd2536a390>, 1, 2),{'y': 4, 'x': 3})


# static methods have no arguments except the ones you pass in when you call them

o.static_method

# <function static_method at 0x7fdd25375848>

o.static_method()

# static_method((),{})

o.static_method(1,2,x=3,y=4)

# static_method((1, 2),{'y': 4, 'x': 3})


# properties are a way of implementing getters and setters. It's an error to explicitly call them

# "read only" attributes can be specified by creating a getter without a setter (as in some_other_property)

o.some_property

# calling some_property getter(<__main__.MyClass instance at 0x7fb2b70877e8>,(),{})

# 'properties are nice'

o.some_property()

# calling some_property getter(<__main__.MyClass instance at 0x7fb2b70877e8>,(),{})

# Traceback (most recent call last):

#   File "<stdin>", line 1, in <module>

# TypeError: 'str' object is not callable

o.some_other_property

# calling some_other_property getter(<__main__.MyClass instance at 0x7fb2b70877e8>,(),{})

# 'VERY nice'


o.some_other_property()

# calling some_other_property getter(<__main__.MyClass instance at 0x7fb2b70877e8>,(),{})

# Traceback (most recent call last):

#   File "<stdin>", line 1, in <module>

# TypeError: 'str' object is not callable

o.some_property = "groovy"

# calling some_property setter(<__main__.MyClass object at 0x7fb2b7077890>,('groovy',),{})

o.some_property

# calling some_property getter(<__main__.MyClass object at 0x7fb2b7077890>,(),{})

# 'groovy'

o.some_other_property = "very groovy"

# Traceback (most recent call last):

#   File "<stdin>", line 1, in <module>

# AttributeError: can't set attribute

o.some_other_property

# calling some_other_property getter(<__main__.MyClass object at 0x7fb2b7077890>,(),{})

# 'VERY nice'

Question 10

Consider the following code, what will it output?

class A(object):
    def go(self):
        print("go A go!")
    def stop(self):
        print("stop A stop!")
    def pause(self):
        raise Exception("Not Implemented")

class B(A):
    def go(self):
        super(B, self).go()
        print("go B go!")

class C(A):
    def go(self):
        super(C, self).go()
        print("go C go!")
    def stop(self):
        super(C, self).stop()
        print("stop C stop!")

class D(B,C):
    def go(self):
        super(D, self).go()
        print("go D go!")
    def stop(self):
        super(D, self).stop()
        print("stop D stop!")
    def pause(self):
        print("wait D wait!")

class E(B,C): pass

a = A()
b = B()
c = C()
d = D()
e = E()


# specify output from here onwards

a.go()
b.go()
c.go()
d.go()
e.go()

a.stop()
b.stop()
c.stop()
d.stop()
e.stop()

a.pause()
b.pause()
c.pause()
d.pause()
e.pause()


Answer

The output is specified in the comments in the segment below:

a.go()

# go A go!

b.go()

# go A go!

# go B go!

c.go()

# go A go!

# go C go!
 
d.go()

# go A go!

# go C go!

# go B go!

# go D go!

e.go()

# go A go!

# go C go!

# go B go!

a.stop()

# stop A stop!

b.stop()

# stop A stop!

c.stop()

# stop A stop!

# stop C stop!

d.stop()

# stop A stop!

# stop C stop!

# stop D stop!

e.stop()

# stop A stop!
 
a.pause()

# ... Exception: Not Implemented

b.pause()

# ... Exception: Not Implemented

c.pause()

# ... Exception: Not Implemented

d.pause()

# wait D wait!

e.pause()

# ...Exception: Not Implemented


Why do we care?

Because OO programming is really, really important. Really. Answering this question shows your understanding of inheritance and the use of Python's super function. Most of the time the order of resolution (the method resolution order, or MRO) doesn't matter. Sometimes it does, and that depends on your application.
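
If you want to see the order of resolution for yourself, here is a quick check (assuming the classes from the question above are already defined):

```python
# The method resolution order that super() follows for class D above.
print([cls.__name__ for cls in D.__mro__])
# ['D', 'B', 'C', 'A', 'object']
```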

Question 11

Consider the following code, what will it output?


class Node(object):
    def __init__(self,sName):
        self._lChildren = []
        self.sName = sName
    def __repr__(self):
        return "<Node '{}'>".format(self.sName)
    def append(self,*args,**kwargs):
        self._lChildren.append(*args,**kwargs)
    def print_all_1(self):
        print(self)
        for oChild in self._lChildren:
            oChild.print_all_1()
    def print_all_2(self):
        def gen(o):
            lAll = [o,]
            while lAll:
                oNext = lAll.pop(0)
                lAll.extend(oNext._lChildren)
                yield oNext
        for oNode in gen(self):
            print(oNode)

oRoot = Node("root")
oChild1 = Node("child1")
oChild2 = Node("child2")
oChild3 = Node("child3")
oChild4 = Node("child4")
oChild5 = Node("child5")
oChild6 = Node("child6")
oChild7 = Node("child7")
oChild8 = Node("child8")
oChild9 = Node("child9")
oChild10 = Node("child10")

oRoot.append(oChild1)
oRoot.append(oChild2)
oRoot.append(oChild3)
oChild1.append(oChild4)
oChild1.append(oChild5)
oChild2.append(oChild6)
oChild4.append(oChild7)
oChild3.append(oChild8)
oChild3.append(oChild9)
oChild6.append(oChild10)


# specify output from here onwards

oRoot.print_all_1()
oRoot.print_all_2()


Answer

oRoot.print_all_1() prints:

<Node 'root'>
<Node 'child1'>
<Node 'child4'>
<Node 'child7'>
<Node 'child5'>
<Node 'child2'>
<Node 'child6'>
<Node 'child10'>
<Node 'child3'>
<Node 'child8'>
<Node 'child9'>

oRoot.print_all_2() prints:

<Node 'root'>
<Node 'child1'>
<Node 'child2'>
<Node 'child3'>
<Node 'child4'>
<Node 'child5'>
<Node 'child6'>
<Node 'child8'>
<Node 'child9'>
<Node 'child7'>
<Node 'child10'>

Why do we care?

Because composition and object construction are what objects are all about. Objects are composed of stuff and they need to be initialised somehow. This also ties up some stuff about recursion and use of generators.

Generators are great. You could have achieved similar functionality to print_all_2 by just constructing a big long list and then printing its contents. One of the nice things about generators is that they don't need to take up much space in memory.

It is also worth pointing out that print_all_1 traverses the tree in a depth-first manner, while print_all_2 is breadth-first. Make sure you understand those terms. Sometimes one kind of traversal is more appropriate than the other, but that depends very much on your application.

Question 12

Describe Python's garbage collection mechanism in brief.

Answer

A lot can be said here. There are a few main points that you should mention:
- CPython keeps a count of the references to each object in memory; when that count drops to zero, the object's memory can be reclaimed immediately.
- Reference counting alone cannot free reference cycles (objects that refer to each other), so CPython also runs a cycle-detecting collector, exposed through the gc module.
- That collector is generational: objects that survive a collection are moved into older generations, which are scanned less frequently.

This explanation is CPython specific.
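
As a small, hedged sketch you can run to poke at both mechanisms (exact numbers vary between interpreter versions and runs):

```python
# Poking at CPython's garbage collection: reference counts and the cycle detector.
import gc
import sys

x = []
print(sys.getrefcount(x))   # at least 2: the name 'x' plus the temporary reference made by the call

# Build a reference cycle that plain reference counting can never free...
a = {}
b = {'other': a}
a['other'] = b
del a, b

# ...and let the cycle-detecting collector find it.
print(gc.collect())          # number of unreachable objects found (will vary)
```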

Question 13

Place the following functions below in order of their efficiency. They all take in a list of numbers between 0 and 1. The list can be quite long. An example input list would be [random.random() for i in range(100000)]. How would you prove that your answer is correct?


def f1(lIn):
    l1 = sorted(lIn)
    l2 = [i for i in l1 if i<0.5]
    return [i*i for i in l2]

def f2(lIn):
    l1 = [i for i in lIn if i<0.5]
    l2 = sorted(l1)
    return [i*i for i in l2]

def f3(lIn):
    l1 = [i*i for i in lIn]
    l2 = sorted(l1)
    return [i for i in l1 if i<(0.5*0.5)]


Answer

Most to least efficient: f2, f1, f3. To prove that this is the case, you would want to profile your code. Python has a lovely profiling package that should do the trick.

import cProfile
import random

lIn = [random.random() for i in range(100000)]
cProfile.run('f1(lIn)')
cProfile.run('f2(lIn)')
cProfile.run('f3(lIn)')

For completeness' sake, here is what the above profile outputs:

>>> cProfile.run('f1(lIn)')
         4 function calls in 0.045 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.009    0.009    0.044    0.044 <stdin>:1(f1)
        1    0.001    0.001    0.045    0.045 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.035    0.035    0.035    0.035 {sorted}


>>> cProfile.run('f2(lIn)')
         4 function calls in 0.024 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.008    0.008    0.023    0.023 <stdin>:1(f2)
        1    0.001    0.001    0.024    0.024 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.016    0.016    0.016    0.016 {sorted}


>>> cProfile.run('f3(lIn)')
         4 function calls in 0.055 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.016    0.016    0.054    0.054 <stdin>:1(f3)
        1    0.001    0.001    0.055    0.055 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.038    0.038    0.038    0.038 {sorted}


Why do we care?

Locating and avoiding bottlenecks is often pretty worthwhile. A lot of coding for efficiency comes down to common sense - in the example above it's obviously quicker to sort a list if it's a smaller list, so if you have the choice of filtering before a sort it's often a good idea. The less obvious stuff can still be located through use of the proper tools. It's good to know about these tools.
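
If you prefer a plain wall-clock comparison to a full profile, the standard library's timeit module is another option. A minimal sketch, assuming f1, f2, f3 and an input list like the one above are already defined:

```python
# Wall-clock comparison of f1, f2 and f3 with timeit
# (assumes f1, f2, f3 from the question are defined in this session).
import random
import timeit

lIn = [random.random() for i in range(100000)]
for func in (f1, f2, f3):
    t = timeit.timeit(lambda: func(lIn), number=10)
    print(func.__name__, round(t, 3))
```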

Question 14

Something you failed at?

Wrong answer

I never fail!

Why this is important:

Shows that you are capable of admitting errors, taking responsibility for your mistakes, and learning from your mistakes. All of these things are pretty darn important if you are going to be useful. If you are actually perfect then too bad, you might need to get creative here.

Question 15

Do you have any personal projects?

Really?

This shows that you are willing to do more than the bare minimum in terms of keeping your skillset up to date. If you work on personal projects and code outside of the workplace then employers are more likely to see you as an asset that will grow. Even if they don't ask this question I find it's useful to broach the subject.

Conclusion

These questions intentionally touched on many topics. And the answers were intentionally verbose. In a programming interview, you will need to demonstrate your understanding and if you can do it in a concise way then by all means do that. I tried to give enough information in the answers that you could glean some meaning from them even if you had never heard of some of these topics before. I hope you find this useful in your job hunt.

Go get 'em tiger.

07 Dec 2016 2:49pm GMT

tryexceptpass: I understand what you’re saying and agree that it’s a safer option than traditional threading, but…

There are quite a few thread-safe methods available throughout the library, some that were not discussed here, created specifically for…


07 Dec 2016 1:55pm GMT

Marcos Dione: ayrton-0.9.1

Last night I realized the first point. Checking today I found the latter. Early, often, go!

Get it on github or pypi!


python ayrton

07 Dec 2016 1:14pm GMT

Codementor: Adding Flow Control to Apache Pig using Python


##Introduction
So you like Pig but it's cramping your style? Are you not sure what Pig is about? Are you keen to write some code to write code for you? If yes, then this is for you.

This tutorial ties together a whole lot of different techniques and technologies. The aim here is to show you a trick to get Pig to behave in a way that's just a little bit more loopy. It's a trick I've used before quite a lot and I've written a couple of utility functions to make it easy. I'll go over the bits and pieces here. This tutorial, on a more general note, is about writing code that writes code. The general technique and concerns outlined here can be applied to other code generating problems.

##What Does Pig Do?
Pig is a high-level scripting toolset used for defining and executing complex map-reduce workflows. Let's take a closer look at that sentence…

Pig is a top-level Apache project. It is open source and really quite nifty. Learn more about it here. PigLatin is Pig's language. Pig executes PigLatin scripts. Within a PigLatin script you write a bunch of statements that get converted into a bunch of map-reduce jobs that can be executed in sequence on your Hadoop cluster. It's usually nice to abstract away from writing plain old map-reduce jobs because they can be a total pain in the neck.

If you haven't used Pig before and aren't sure if it's for you, it might be a good idea to check out Hive. Hive and Pig have a lot of overlap in terms of functionality, but have different philosophies. They aren't total competitors because they are often used in conjunction with one another. Hive resembles SQL, while PigLatin resembles… PigLatin. So if you are familiar with SQL then Hive might be an easier learn, but IMHO Pig is a bit more sensible than Hive in how it describes data flow.

##What Doesn't Pig Do?

Pig doesn't make any decisions about the flow of program execution; it only allows you to specify the flow of data. In other words, it allows you to say stuff like this:

-----------------------------------------------
-- define some data format goodies
-----------------------------------------------

define CSV_READER org.apache.pig.piggybank.storage.CSVExcelStorage(
                                                            ',',
                                                            'YES_MULTILINE',
                                                            'UNIX'
                                                            );


define CSV_WRITER org.apache.pig.piggybank.storage.CSVExcelStorage(
                                                            ',',
                                                            'YES_MULTILINE',
                                                            'UNIX',
                                                            'SKIP_OUTPUT_HEADER'
                                                            );

-----------------------------------------------
-- load some data
-----------------------------------------------

r_one = LOAD 'one.csv' using CSV_READER
AS (a:chararray,b:chararray,c:chararray);

r_two = LOAD 'two.csv' using CSV_READER
AS (a:chararray,d:chararray,e:chararray);

-----------------------------------------------
-- do some processing
-----------------------------------------------

r_joined = JOIN r_one by a, r_two by a;

r_final = FOREACH r_joined GENERATE 
    r_one::a, b, e;

-----------------------------------------------
-- store the result
-----------------------------------------------

store r_final into 'r_three.csv' using CSV_WRITER;

The script above says where the data should flow. Every statement you see there will get executed exactly once no matter what (unless there is some kind of error).

You can run the script from the command line like so:

pig path/to/my_script.oink

Ok, what if we have a bunch of files and each of them needs to have the same stuff happen to it? Does that mean we would need to copy-paste our PigLatin script and edit each one to have the right paths?

Well, no. Pig allows some really basic substitutions. You can do stuff like this:

r_one = LOAD '$DIR/one.csv' using CSV_READER
AS (a:chararray,b:chararray,c:chararray);

r_two = LOAD '$DIR/two.csv' using CSV_READER
AS (a:chararray,d:chararray,e:chararray);

-----------------------------------------------
-- do some processing
-----------------------------------------------

r_joined = JOIN r_one by a, r_two by a;

r_final = FOREACH r_joined GENERATE 
    r_one::a, b, e;

-----------------------------------------------
-- store the result
-----------------------------------------------

store r_final into '$DIR/r_three.csv' using CSV_WRITER;

Then you can run the script as many times as you like with different values for DIR. Something like:

pig path/to/my_script.oink -p DIR=jan_2015
pig path/to/my_script.oink -p DIR=feb_2015
pig path/to/my_script.oink -p DIR=march_2015

So Pig allows variable substitution and that is a pretty powerful thing on its own. But it doesn't allow loops or if statements, and that can be somewhat limiting. What if we had to iterate over 60 different values for DIR? This is something Pig doesn't cater for.

Luckily for us, Python can loop just fine. So we could do something like:

def run_pig_script(sFilePath,dPigArgs=None):
    """
    run piggy run
    """
    import subprocess
    lCmd = ["pig",sFilePath,]  
    for sArg in ['{0}={1}'.format(*t) for t in (dPigArgs or {}).items()]:
        lCmd.append('-p')
        lCmd.append(sArg)
    print lCmd
    p = subprocess.Popen(lCmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
    stdout, stderr = p.communicate()
    return stdout,stderr

for sDir in lManyDirectories:
    run_pig_script(sFilePath="path/to/my_script.oink",dPigArgs={'DIR':sDir})

The run_pig_script function makes use of the subprocess module to create a Pig process through use of the Popen function. Popen takes a list of token strings as its first argument and makes a system call from there. So first we create the command list lCmd then start a process. The output of the process (the stuff that would usually get printed to the console window) gets redirected to the stderr and stdout objects.

In order to populate lCmd we use a short-hand for loop notation known as list comprehension. It's very cool and useful but beyond the scope of this text. Try calling run_pig_script with a few different arguments and see what it prints and you should easily get a feel for what Popen expects.
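If you'd rather see what gets built without firing up Pig at all, here is a minimal sketch of what that comprehension and loop produce (the LIMIT parameter is made up purely for illustration):

dPigArgs = {'DIR': 'jan_2015', 'LIMIT': '100'}

lArgs = ['{0}={1}'.format(*t) for t in (dPigArgs or {}).items()]
# lArgs now holds something like ['DIR=jan_2015', 'LIMIT=100'] (dict order isn't guaranteed)

lCmd = ["pig", "path/to/my_script.oink"]
for sArg in lArgs:
    lCmd.append('-p')
    lCmd.append(sArg)
# lCmd ends up as (the order of the -p pairs may vary):
# ['pig', 'path/to/my_script.oink', '-p', 'DIR=jan_2015', '-p', 'LIMIT=100']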

But what if you really need a loop inside your pig script?

So we have covered executing a PigLatin script many times with different values; what if we want to make use of many variables within the PigLatin script? For example, what happens if we want to loop over some variable number of directories within a single script? Something like this…

r_jan_1 = LOAD 'jan_1/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_2 = LOAD 'jan_2/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_3 = LOAD 'jan_3/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_4 = LOAD 'jan_4/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
... more stuff
r_jan_16 = LOAD 'jan_16/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);

r_all = UNION r_jan_1, r_jan_2, r_jan_3, r_jan_4, ... r_jan_16;

Writing all that down could become tedious, especially if we are working with an arbitrary number of files each time. Maybe we want a union of all the sales of the month so far, in which case we would need to come up with a new script for every day. That sounds pretty horrible and would require a lot of copy-paste, and copy-paste smells bad.

So Here is What We are Going to Do Instead

Have some pythonish pseudo-code:

lStrs = complicated_operation_getting_list_of_strings() #1
sPigPath = generate_pig_script(lStrs)                   #2
run_pig_script(sFilePath = sPigPath)                    #3

So we have 3 steps in the code above: Step 1 is getting the data that the Pig script is going to rely on. Then, in step 2, we need to take that data and turn it into something Pig will be able to understand. Step 3 then needs to make it run.

Step 1 of the process very much depends on what you are trying to do. Following from the previous example we would likely want complicated_operation_getting_list_of_strings to look like:

def complicated_operation_getting_list_of_strings():
    import datetime
    oNow = datetime.datetime.now()
    sMonth = oNow.strftime('%b').lower()
    return ["{0}_{1}".format(sMonth,i+1) for i in range(oNow.day)]

The rest of this tutorial will be dealing with steps 2 and 3.

Template Systems

Writing code to write code for us! That's pretty futuristic stuff!

Not really…

Ever written a web app? Did you use some kind of framework for this? Did the framework specify (or allow you to specify) some special way of writing HTML so that you could do clever things in your HTML files? Clever things like loops and ifs and variable substitutions? If you answered yes to these questions, then you wrote code that wrote HTML code for you, at least. And if you answered no, then the take-away message here is: writing code that writes code is something that has been done for ages; there are many systems, libraries and packages that support this kind of thing in many languages. These kinds of tools are generally referred to as template systems.

The template system we'll be using for this is Mako. This is not a mako tutorial; to learn about mako, check this out.

An important thing in choosing a template system is to make sure that it doesn't clash with the language you are using it to write. And if it does clash then you need to find ways to compensate. What I mean by this is: if I am using a template language then that language has a few well-defined control sequences for doing things like loops and variable substitution. An example from mako is:

${xSomeVariable}

When you render that line of code, the value of xSomeVariable will get turned into a string. But what if ${stuff} meant something in the language you are trying to generate? Then there is a good chance that mako will find things in your template files that it thinks it needs to deal with, and it will either output garbage or raise exceptions.

Mako and PigLatin don't have this problem. So that's pretty convenient.
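If you want to convince yourself, here is a tiny, self-contained render (the LIMIT statement is an arbitrary bit of PigLatin picked for the example):

from mako.template import Template

# ${...} belongs to mako; the rendered PigLatin never needs that syntax, so nothing clashes
print Template("r_out = LIMIT r_in ${xSomeVariable};").render(xSomeVariable=10)
# r_out = LIMIT r_in 10;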

Using Python to generate PigLatin

Remember this: sPigPath = generate_pig_script(lStrs)?

Good coders don't mix languages in the same file if they can help it (which is pretty much always). So while it is possible to define your entire PigLatin mako template in the form of a big giant string inside your Python script, we aren't going to do that.

Also, it would be nice if the code we are writing works for more than one template. So instead of:

sPigPath = generate_pig_script(lStrs)   #2

We'll do this:

sPigPath = generate_pig_script(sFilePath,dContext)   #2

We want to pass in the path to our template file, along with a dictionary containing the context variables we'd use to render it this time. For example we could have:

dContext = {
    'lStrs' : complicated_operation_getting_list_of_strings()
}

Ok, so let's write some real code then…

def generate_pig_script(sFilePath,dContext):
    """
    render the template at sFilePath using the context in dContext,
    save the output in a temporary file
    return the path to the generated file
    """
    from mako.template import Template
    import datetime

    #1. fetch the template from the file
    oTemplate = Template(filename=sFilePath)
    
    #2. render it using the context dictionary. This gives us a string
    sOutputScript = oTemplate.render(**dContext)

    #3. put the output into some file...
    sOutPath = "{0}_{1}".format(sFilePath,datetime.datetime.now().isoformat())
    with open(sOutPath,'w') as f:
        f.write(sOutputScript)

    return sOutPath

The comments in the code should be enough to understand its general functioning.

Just to complete the picture, let's make an actual template…

Remember this?

r_jan_1 = LOAD 'jan_1/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_2 = LOAD 'jan_2/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_3 = LOAD 'jan_3/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
r_jan_4 = LOAD 'jan_4/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);
... more stuff
r_jan_16 = LOAD 'jan_16/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);

r_all = UNION r_jan_1, r_jan_2, r_jan_3, r_jan_4, ... r_jan_16;

Here it is in the form of a mako template:

%for sD in lStrs:

r_${sD} = LOAD '${sD}/sales.csv' USING CSV_READER AS (a,b,c,d,e,f,g);

%endfor

r_all = UNION ${','.join(['r_{0}'.format(sD) for sD in lStrs])};
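And just to convince ourselves that the template does what we want, here is a sketch of rendering it straight from a Python string (with the column list trimmed down so the output fits on screen):

from mako.template import Template

sTemplate = """\
%for sD in lStrs:
r_${sD} = LOAD '${sD}/sales.csv' USING CSV_READER AS (a,b,c);
%endfor
r_all = UNION ${','.join(['r_{0}'.format(sD) for sD in lStrs])};
"""

print Template(sTemplate).render(lStrs=['jan_1', 'jan_2'])
# r_jan_1 = LOAD 'jan_1/sales.csv' USING CSV_READER AS (a,b,c);
# r_jan_2 = LOAD 'jan_2/sales.csv' USING CSV_READER AS (a,b,c);
# r_all = UNION r_jan_1,r_jan_2;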

The full picture

So now we have used Python to generate a PigLatin script and store it in a known location. And we already know how to get Python to launch Pig. So that's it. Pretty straightforward, eh? This tutorial made use of a few different technologies and techniques and it's impossible not to jump around a little bit, so I've included a little summary here of how to use this technique:


#1 given a working PigLatin script that has a lot of repetition or a variable number of inputs, create a mako template

#2 write a function that creates the context for the mako template. eg:
dContext = {
    'lStrs' : complicated_operation_getting_list_of_strings()
}

#3 render the template
sPigFilePath = generate_pig_script(sMakoFilePath,dContext)

#and finally run the thing...
run_pig_script(sPigFilePath,dPigArgs=None)

Conclusion

We've covered some of the basics of code generation and used Python and the mako templating system to make Pig more loopy. I've touched on a lot of different technologies and techniques. Pig itself is quite a big deal, and the kinds of problems it is applied to can fill books. The mako templating engine is a powerful thing in itself and has many use cases other than Pig (I mostly use it in conjunction with Pyramid, for example). Python loops and list comprehensions are worth looking into if any of the weird for-loop stuff didn't make sense; and finally the subprocess module constitutes quite a rabbit hole on its own.

07 Dec 2016 9:38am GMT

Codementor: Extending Apache Pig with Python UDFs


Introduction

Apache Pig is a popular system for executing complex Hadoop map-reduce based data-flows. It adds a layer of abstraction on top of Hadoop's map-reduce mechanisms in order to allow developers to take a high-level view of the data and operations on that data. Pig allows you to do things more explicitly. For example, you can join two or more data sources (much like an SQL join). Writing a join as a map and reduce function is a bit of a drag and it's usually worth avoiding. So Pig is great because it simplifies complex tasks - it provides a high-level scripting language that allows users to take more of a big-picture view of their data flow.

Pig is especially great because it is extensible. This tutorial will focus on its extensibility. By the end of this tutorial, you will be able to write PigLatin scripts that execute Python code as a part of a larger map-reduce workflow. Pig can be extended with other languages too, but for now we'll stick to Python.

Before we continue

This tutorial relies on a bunch of knowledge. It'll be very useful if you know a little Python and PigLatin. It'll also be useful to know a bit about how map-reduce works in the context of Hadoop.

User Defined Functions (UDFs)

A Pig UDF is a function that is accessible to Pig, but written in a language that isn't PigLatin. Pig allows you to register UDFs for use within a PigLatin script. A UDF needs to fit a specific prototype - you can't just write your function however you want because then Pig won't know how to call your function, it won't know what kinds of arguments it needs, and it won't know what kind of return value to expect. There are a couple of basic UDF types:

Eval UDFs

This is the most common type of UDF. It's used in FOREACH type statements. Here's an example of an eval function in action:

users = LOAD 'user_data' AS (name: chararray);
upper_users = FOREACH users GENERATE my_udfs.to_upper_case(name);

This code is fairly simple - Pig doesn't really do string processing so we introduce a UDF that does. There are some missing pieces that I'll get to later, specifically how Pig knows what my_udfs means and suchlike.
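For completeness, the Python half of that example might look something like this (to_upper_case and the my_udfs module are the hypothetical names the PigLatin above assumes):

from pig_util import outputSchema

@outputSchema('name:chararray')
def to_upper_case(sName):
    # a chararray arrives as an ordinary (Jython) string
    if sName is None:
        return None
    return sName.upper()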

Aggregation UDFs

These are just a special case of an eval UDF. An Aggregate function is usually applied to grouped data. For example:

user_sales = LOAD 'user_sales' AS (name: chararray, price: float);
grouped_sales = GROUP user_sales BY name;
number_of_sales = FOREACH grouped_sales GENERATE group, COUNT(user_sales);

In other words, an aggregate UDF is a UDF that is used to combine multiple pieces of information. Here we are aggregating sales data to show how many purchases were made by each user.
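COUNT is a built-in, but a Python aggregate works the same way: it receives the grouped bag and boils it down to a single value. Here is a sketch (the total_sales name and its schema are made up for illustration); you would call it as my_udfs.total_sales(user_sales) in the FOREACH above:

from pig_util import outputSchema

@outputSchema('total:double')
def total_sales(lBag):
    # the grouped bag arrives as a sequence of (name, price) rows
    fTotal = 0.0
    for t in lBag:
        fTotal += t[1]
    return fTotal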

Filter UDFs

A filter UDF returns a boolean value. If you have a data source that has a bunch of rows and only a portion of those rows are useful for the current analysis, then a filter function of some kind would be useful. An example of a filter function in action follows:

user_messages = LOAD 'user_twits' AS (name:chararray, message:chararray);
rude_messages = FILTER user_messages by my_udfs.contains_naughty_words(message);

Enough talk, let's code

In this section we'll be writing a couple of Python UDFs and making them accessible within PigLatin scripts.

Here's about the simplest Python UDF you can write:

from pig_util import outputSchema

@outputSchema('word:chararray')
def hi_world():
    return "hello world"

The data output from a function has a specific form. Pig likes it if you specify the schema of the data because then it knows what it can do with that data. That's what the outputSchema decorator is for. There are a bunch of different ways to specify a schema, we'll get to that in a little bit.

Now if that were saved in a file called "myudf.py" you would be able to make use of it in a PigLatin script like so:

-- first register it to make it available
REGISTER 'myudf.py' using jython as my_special_udfs;

users = LOAD 'user_data' AS (name: chararray);
hello_users = FOREACH users GENERATE name, my_special_udfs.hi_world();

Specifying the UDF output schema

Now a UDF has input and output. This little section is all about the outputs. Here we'll go over the different ways you can specify the output format of a Python UDF through use of the outputSchema decorator. We have a few options, here they are:


# our original udf
# it returns a single chararray (that's PigLatin for String)
@outputSchema('word:chararray')
def hi_world():
    return "hello world"


# this one returns a Python tuple. Pig recognises the first element
# of the tuple as a chararray like before, and the next one as a
# long (a kind of integer)
@outputSchema("word:chararray,number:long")
def hi_everyone():
    return "hi there", 15


# we can use outputSchema to define nested schemas too, here is a bag of tuples
@outputSchema('some_bag:bag{t:(field_1:chararray, field_2:int)}')
def bag_udf():
    return [
        ('hi',1000),
        ('there',2000),
        ('bill',0)
    ]


# and here is a map
@outputSchema('something_nice:map[]')
def my_map_maker():
    return {"a":"b", "c":"d", "e":"f"}

So outputSchema can be used to imply that a function outputs one or a combination of basic types. Those types include chararray (a string), bytearray (a blob of bytes), int and long (integers), float and double (floating point numbers) and boolean, along with the complex types tuple, bag and map.

If no schema is specified then Pig assumes that the UDF outputs a bytearray.

UDF arguments

Not only does a UDF have outputs but inputs as well! This sentence should be filed under 'duh'. I reserved it for a separate section so as not to clutter the discussion on output schemas. This part is fairly straight-forward so I'm just going to breeze through it…

First some UDFs:

def deal_with_a_string(s1):
    return s1 + " for the win!"

def deal_with_two_strings(s1,s2):
    return s1 + " " + s2
    
def square_a_number(i):
    return i*i
    
def now_for_a_bag(lBag):
    lOut = []
    for i,l in enumerate(lBag):
        lNew = [i,] + list(l)   # the bag's rows may arrive as tuples, so convert before concatenating
        lOut.append(lNew)
    return lOut


And here we make use of those UDFs in a PigLatin script:

REGISTER 'myudf.py' using jython as myudfs;

users = LOAD 'user_data' AS (firstname: chararray, lastname:chararray,some_integer:int);

winning_users    = FOREACH users GENERATE myudfs.deal_with_a_string(firstname);
full_names       = FOREACH users GENERATE myudfs.deal_with_two_strings(firstname,lastname);
squared_integers = FOREACH users GENERATE myudfs.square_a_number(some_integer);

users_by_number = GROUP users by some_integer;
indexed_users_by_number = FOREACH users_by_number GENERATE group,myudfs.now_for_a_bag(users);

Beyond Standard Python UDFs

There are a couple of gotchas to using Python in the form of a UDF. Firstly, even though we are writing our UDFs in Python, Pig executes them in Jython. Jython is an implementation of Python that runs on the Java Virtual Machine (JVM). Most of the time this is not an issue as Jython strives to implement all of the same features as CPython, but there are some libraries that it doesn't support. For example you can't use numpy from Jython.

Besides that, Pig doesn't really allow for Python Filter UDFs. You can only do stuff like this:

user_messages = LOAD 'user_twits' AS (name:chararray, message:chararray);
--add a field that says whether it is naughty (1) or not (0)
messages_with_rudeness = FOREACH user_messages GENERATE name,message,contains_naughty_words(message) as naughty;     
--then filter by the naughty field
filtered_messages = FILTER messages_with_rudeness by (naughty==1);    
-- and finally strip away the naughty field                  
rude_messages = FOREACH filtered_messages GENERATE name,message;  
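The Python side of that workaround returns an int rather than a boolean so the PigLatin above has something to compare against. A sketch (the word list is obviously made up for illustration):

from pig_util import outputSchema

lNaughtyWords = ['heck', 'darn', 'golly']   # a purely hypothetical list

@outputSchema('naughty:int')
def contains_naughty_words(sMessage):
    # 1 means naughty, 0 means fine; the FILTER then tests naughty==1
    if sMessage is None:
        return 0
    sLower = sMessage.lower()
    for sWord in lNaughtyWords:
        if sWord in sLower:
            return 1
    return 0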

Python Streaming UDFs

Pig allows you to hook into the Hadoop Streaming API, which lets us get around the Jython issue when we need to. If you haven't heard of Hadoop Streaming before, here is the low down: Hadoop allows you to write mappers and reducers in any language that gives you access to stdin and stdout. So that's pretty much any language you want. Like Python 3 or even Cow. Since this is a Python tutorial the examples that follow will all be in Python, but you can plug in whatever you want.

Here's a simple Python streaming script, let's call it simple_stream.py:


#! /usr/bin/env python

import sys

for line in sys.stdin:
    if not line.strip(): continue    # skip blank lines
    l = line.split()                 # split the line by whitespace
    for i,s in enumerate(l):
        # print already appends a newline, so no extra "\n" is needed here
        print "{key}\t{value}".format(key=i,value=s)    # emit a key/value pair for each word in the line

The aim is to get Hadoop to run the script on each node. That means that the hash bang line (#!) needs to be valid on every node; all the import statements must be valid on every node (any packages imported must be installed on each node); and any other system-level files or resources accessed within the Python script must be accessible in the same way on every node.
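A nice side effect of streaming scripts being plain stdin/stdout programs is that you can sanity-check them locally, with no Hadoop or Pig in sight, by piping a bit of text through them. Something like:

echo "hello there world" | python simple_stream.py

That should print each word on its own line, tab-separated from its position in the line (0 for hello, 1 for there, and so on).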

Ok, onto the Pig stuff…

To make the streaming UDF accessible to Pig we make use of the define statement. You can read all about it here.

Here is how we can use it with our simple_stream script:

DEFINE stream_alias 'simple_stream.py' SHIP('simple_stream.py');
user_messages = LOAD 'user_twits' AS (name:chararray, message:chararray);
just_messages = FOREACH user_messages generate message;
streamed = STREAM just_messages THROUGH stream_alias;
DUMP streamed;

Let's look at that DEFINE statement a little closer. The general format we are using is:

DEFINE alias 'command' SHIP('files');

The alias is the name we use to access our streaming function from within our PigLatin script. The command is the system command Pig will call when it needs to use our streaming function. And finally SHIP tells Pig which files and dependencies Pig needs to distribute to the Hadoop nodes for the command to be able to work.

Then once we have the data we want to pass through our streaming function, we just use the STREAM command as above.

And that's it

Well, sort of. PigLatin is quite a big thing; this tutorial just barely scraped the surface of its capabilities. If all the LOADing and FOREACHing and suchlike didn't make sense to you, then I would suggest checking out a more introductory PigLatin tutorial before coming back here. This tutorial should be enough to get you started in using Python from within Pig jobs.

Python is also quite a big thing. Understanding the Python import system is really worthwhile if you want to use Python on a Hadoop cluster. It's also worthwhile understanding some little details like how Python decorators work.

There are also some more technical ways of calling Python from Pig; this tutorial aimed to be an introduction to UDFs, not a definitive guide. For more examples and more in-depth discussions of the different decorators and suchlike that Pig makes available to Jython based UDFs I would suggest taking a look at Pig's official documentation.

Another topic only touched on briefly was Hadoop Streaming. This in itself is a powerful technology but actually pretty easy to use once you get started. I've made use of the Streaming API many times without needing anything as complicated as PigLatin - it's worthwhile being able to use that API as a standalone thing.

07 Dec 2016 9:31am GMT

Python Insider: Python 3.6.0 release candidate is now available

Python 3.6.0rc1 is the release candidate for Python 3.6, the next major
release of Python.

Code for 3.6.0 is now frozen. Assuming no release critical problems are
found prior to the 3.6.0 final release date, currently 2016-12-16, the
3.6.0 final release will be the same code base as this 3.6.0rc1.
Maintenance releases for the 3.6 series will follow at regular
intervals starting in the first quarter of 2017.


Among the major new features in Python 3.6 are:

* PEP 468 - Preserving the order of **kwargs in a function
* PEP 487 - Simpler customization of class creation
* PEP 495 - Local Time Disambiguation
* PEP 498 - Literal String Formatting
* PEP 506 - Adding A Secrets Module To The Standard Library
* PEP 509 - Add a private version to dict
* PEP 515 - Underscores in Numeric Literals
* PEP 519 - Adding a file system path protocol
* PEP 520 - Preserving Class Attribute Definition Order
* PEP 523 - Adding a frame evaluation API to CPython
* PEP 524 - Make os.urandom() blocking on Linux (during system startup)
* PEP 525 - Asynchronous Generators (provisional)
* PEP 526 - Syntax for Variable Annotations (provisional)
* PEP 528 - Change Windows console encoding to UTF-8
* PEP 529 - Change Windows filesystem encoding to UTF-8
* PEP 530 - Asynchronous Comprehensions

Please see "What's New In Python 3.6" for more information:

https://docs.python.org/3.6/whatsnew/3.6.html

You can find Python 3.6.0rc1 here:

https://www.python.org/downloads/release/python-360rc1/

Note that 3.6.0rc1 is still a preview release and thus its use is not recommended for
production environments.

More information about the release schedule can be found here:

https://www.python.org/dev/peps/pep-0494/

07 Dec 2016 2:34am GMT

06 Dec 2016

feedPlanet Python

Catalin George Festila: The python-nmap python module fail.

You can read about this python module here.

First let's install this python module.
C:\Python27>cd Scripts

C:\Python27\Scripts>pip install python-nmap
Collecting python-nmap
Downloading python-nmap-0.6.1.tar.gz (41kB)
100% |################################| 51kB 240kB/s
Installing collected packages: python-nmap
Running setup.py install for python-nmap ... done
Successfully installed python-nmap-0.6.1

About this python-nmap version you can read here.
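For reference, the example below assumes a scanner object created in the usual python-nmap way:

>>> import nmap
>>> nm = nmap.PortScanner()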
I tried to run the example source code, but none of the examples worked.
For example I got this:
>>> nm.scan('127.0.0.1', '22-443')
{'nmap': {'scanstats': {'uphosts': '1', 'timestr': 'Wed Dec 07 08:13:01 2016',
                        'downhosts': '-1', 'totalhosts': '0', 'elapsed': '10.74'},
          'scaninfo': {'tcp': {'services': '22-443', 'method': 'syn'},
                       'error': [u'dnet: Failed to open device lo0\r\nQUITTING!\r\n',
                                 u'dnet: Failed to open device lo0\r\nQUITTING!\r\n']},
          'command_line': 'nmap -oX - -p 22-443 -sV 127.0.0.1'},
 'scan': {}}

06 Dec 2016 10:15pm GMT

Brett Cannon: Why I took October off from OSS volunteering

06 Dec 2016 7:13pm GMT

Brett Cannon: What to look for in a new TV

I'm kind of an A/V nerd. Now I'm not hardcore enough to have a vinyl collection or have an amp for my TV, but all my headphones cost over $100 and I have a Sonos Playbar so I don't have to put up with crappy TV speakers. What I'm trying to say is that I care about the A/V equipment I use, but not to the extent that money is no object when it comes to my enjoyment of a movie (I'm not that rich and my wife would kill me if I spent that kind of money on electronics). That means I tend to research extensively before making a major A/V purchase since I don't do it very often and I want quality within reason which does not lend itself to impulse buying.

Prior to September 1, 2016, I had a 2011 Vizio television. It was 47", did 1080p, and had passive 3D. When I purchased the TV I was fresh out of UBC having just finished my Ph.D. so it wasn't top-of-the-line, but it was considered very good for the price. I was happy with the picture, but admittedly it wasn't amazing; the screen had almost a matte finish which led to horrible glare. I also rarely used the 3D in the television as 3D Blu-Ray discs always cost extra and so few movies took the time to actually film in 3D to begin with, instead choosing to do it in post-production (basically animated films and TRON: Legacy were all that we ever watched in 3D). And to top it all off, the TV took a while to turn on. I don't know what kind of LED bulbs were in it, but they took forever to warm up and just annoyed me (yes, very much a first-world problem).

So when UHD came into existence I started to keep an eye on the technology and what television manufacturers were doing to incorporate the technology to entice people like me to upgrade. After two years of watching this space and one of the TVs I was considering having a one-day sale that knocked 23% off the price, I ended up buying a 55" Samsung KS8000 yesterday. Since I spent so much time considering this purchase I figured I would try and distill what knowledge I have picked up over the years into a blog post so that when you decide to upgrade to UHD you don't have to start from zero knowledge like I did.

What to care about

First, you don't care about the resolution of the TV. All UHD televisions are 4K, so that's just taken care of for you. It also doesn't generally make a difference in the picture because most people sit too far away from their TV to make the higher resolution matter.

No, the one thing you're going to care about is HDR and everything that comes with it. And of course it can't be a simple thing to measure like size or resolution. Oh no, HDR has a bunch of parts to it that go into the quality of the picture: brightness, colour gamut, and format (yes, there's a format war; HD-DVD/Blu-Ray didn't teach the TV manufacturers a big enough lesson).

Brightness

A key part of HDR is the range of brightness to show what you frequently hear referred to as "inky blacks" and "bright whites". The way you get deep blacks and bright whites is by supporting a huge range of brightness. What you will hear about TVs is what their maximum nit is. Basically you're aiming for 1000 nits or higher for a maximum and as close to 0 as possible for a minimum.

Now of course this isn't as simple as it sounds as there's different technology being used to try and solve this problem.

LCD

Thanks to our computers I'm sure everyone reading this is familiar with LCD displays. But what you might not realize is how they exactly work. In a nutshell there are LED lightbulbs behind your screen that provide white light, and then the LCD pixels turn on and off the red/green/blue parts of themselves to filter out certain colours. So yeah, there are lightbulbs in your screen and how strong they are dictates how bright your TV screen will be.

Now the thing that comes into play here for brightness is how those LED bulbs are oriented in order to get towards that 0 nits for inky blacks. Typical screens are edge-lit, which means there is basically a strip of LEDs on the edges of the TV that shine light towards the middle of the screen. This is fine and it's what screens have been working with for a while, but it does mean there's always some light behind the pixels so it's kind of hard to keep it from leaking out a little bit.

This is where local dimming comes in. Some manufacturers are now laying out the LED bulbs in an array/grid behind the screen instead of at the edges. What this allows is for the TV to dim an LED bulb if it isn't needed at full strength to illuminate a certain quadrant of the screen (potentially even switching it off entirely). Obviously the denser the array, the more local dimming zones and thus the greater chance a picture with some black in it will be able to switch off an LED to truly get a dark black for that part of the screen. How often what you're watching will actually line a dark area up within a zone is going to vary, so whether local dimming makes a difference to you is a personal call.

OLED

If I didn't have a budget and wanted the ultimate solution for getting the best blacks in a picture, I would probably have an OLED TV from LG. What makes these TVs so great is the fact that OLEDs are essentially pixels that provide their own light. What that means is if you want an OLED pixel to be black, you simply switch it off. Or to compare it to local dimming, it's as if every pixel was its own local dimming zone. So if you want truly dark blacks, OLEDs are the way to go. It also leads to better colours, since the intensity of each pixel is consistent, compared to an LCD where the brightness is affected by how far the pixel is from the LED bulb that's providing its light.

But the drawback is that OLED TVs only get so bright. Since each pixel has to generate its own light, they can't really reach four-digit nit levels like the LCD TVs can. It's still much brighter than any HD TV, but OLED TVs don't match the maximum brightness of the higher-end LCD TVs.

So currently it's a race to see if LCDs can get their blacks down or if OLEDs can get their brightness up. But from what I have read, in 2016 your best bet is OLED if you can justify the cost to yourself (they are very expensive televisions).

Colour gamut

While having inky blacks and bright whites is nice, not everyone is waiting for Mad Max: Fury Road in black and white. That means you actually care about the rest of the rainbow, which means you care about the colour gamut of the TV for a specific colour space. TVs are currently trying to cover as much of the DCI-P3 colour space as possible. Maybe in a few years TVs will fully cover that colour space, at which point they will start worrying about Rec. 2020 (also called BT.2020), but there's still room to go in covering DCI-P3 before that's something to care about.

In the end colour gamut is probably not going to be something you explicitly shop for, but more of something to be aware of that you will possibly gain by going up in price on your television.

Formats

So you have your brightness and you have your colours, now you have to care about what format all of this information is stored in. Yes my friends, there's a new format war and it's HDR10 versus Dolby Vision. Now if you buy a TV from Vizio or LG then you don't have to care because they are supporting both formats. But if you consider any other manufacturer you need to decide on whether you care about Dolby Vision because everyone supports HDR10 these days but no one supports Dolby Vision at the moment except those two manufacturers.

There is one key reason that HDR10 is supported by all television makers: it's an open specification. Being free, it doesn't cut into the profit margin on a TV, which obviously every manufacturer likes, and that is probably why HDR10 is the required HDR standard for Ultra Blu-Ray discs (Dolby Vision is supported on Ultra Blu-Ray, but not required). Dolby Vision, on the other hand, requires licensing fees paid to Dolby. Articles also consistently suggest that Dolby Vision requires new hardware, which would also drive up the cost of supporting it (the best I can come up with is that since Dolby Vision is 12-bit and HDR10 is 10-bit, TVs typically use a separate chip for Dolby Vision processing).

Dolby Vision does currently have two things going for it over HDR10. One is that Dolby Vision is dynamic per frame while HDR10 is static. This is most likely a temporary perk, though, because HDR10 is gaining dynamic support sometime in the future.

Two is that Dolby Vision is part of an end-to-end solution from image capture to projection in the theatres. By making Dolby Vision then also work at home it allows for directors and editors to get the results they want for the cinema and then just pass those results along to your TV without extra work.

All of this is to say that Dolby Vision seems to be the better technology, but the overhead/cost of adding it to a TV, along with demand, will ultimately dictate whether it catches on. Luckily all TV manufacturers have agreed on the minimum standard of HDR10, so you won't be completely left out if you buy a TV from someone other than LG or Vizio.

Where to go for advice

When it comes time to buy a TV, I recommend Rtings.com for advice. They have a very nice battery of tests they put each TV through and give you a nice level of detail on how they reached their scores for each test. They even provide the settings they used for their tests so you can replicate them at home.

You can also read what the Wirecutter is currently recommending. For me, though, I prefer Rtings.com and use the Wirecutter as a confirmation check if their latest TV round-up isn't too out-of-date.

Ultra HD Premium

If you want a very simple way to help choose a television, you can simply consider ones that are listed as Ultra HD Premium. That way you know the TV roughly meets a minimum set of specifications that are reasonable to want if you're spending a lot of money on a TV. The certification is new in 2016 and so there are not a ton of TVs yet that have the certification, but since TV manufacturers like having stamps on their televisions I suspect it will start to become a thing.

One thing to be aware of is that Vizio doesn't like the certification. Basically they have complained that the lack of standards around how to actually measure what the certification requires makes it somewhat of a moot point. That's a totally reasonable criticism, and it's why using the certification as a filter for which TVs to consider is good, but you shouldn't blindly buy a TV just because it has the Ultra HD Premium stamp of approval.

Why I chose my TV

Much like when I bought a soundbar, I had some restrictions placed upon me when considering what television I wanted. One, the TV couldn't be any larger than 55" (to prevent the TV from taking over the living room even though we should have a 65" based on the minimum distance people might sit from the TV). This immediately put certain limits on me as some model lines don't start until 65" like the Vizio Reference series. I also wasn't willing to spend CAD 4,000 on an LG, so that eliminated OLED from consideration. I also wanted HDR, so that eliminated an OLED that was only HD.

In the end it was between the 55" Samsung KS8000, 55" Vizio P-series, and the 50" Vizio P-series. The reason for the same Vizio model at different sizes is the fact that they use different display technology; the 50" has a VA display while the 55" has an IPS display. The former will have better colours but the latter has better viewing angles. Unfortunately I couldn't find either model on display here in Vancouver to see what kind of difference it made.

One other knock against the Vizio -- at least at 55" -- was that it wasn't very good in a bright room. That's a problem for us as our living room is north facing with a big window and the TV is perpendicular to those windows, so we have plenty of glare on the screen as the sun goes down. The Samsung, on the other hand, was rated to do better in a glare-heavy room. And thanks to a one-day sale it brought the price of the Samsung to within striking distance of the Vizio. So in the end with the price difference no longer a factor I decided to go with the TV that worked best with glare and maximized the size I could go with.

My only worry with my purchase is if Dolby Vision ends up taking hold and I get left in the cold somehow. But thanks to HDR10 support being what Ultra Blu-Ray mandates, I'm not terribly worried about being shut out entirely from HDR content. There's also hope that I might be able to upgrade my television in the future thanks to it using a Mini One Connect which breaks out the connections from the television. In other TVs the box is much bigger as it contains all of the smarts of the television, allowing future upgrades. There's a chance I will be able to upgrade the box to get Dolby Vision in the future, but that's just a guess at this point that it's even possible, let alone whether Samsung would choose to add Dolby Vision support.

It's been 48 hours with the TV and both Andrea and I are happy with the purchase; me because the picture is great, Andrea because I will now shut up about television technology in regards to a new TV purchase.

06 Dec 2016 7:13pm GMT

Brett Cannon: Introducing Which Film

What I'm announcing

Today I'm happy to announce the public unveiling of Which Film! I'll discuss how the site came about and what drives it, but I thought I would first explain what it does: it's a website to help you choose what movie you and your family/friends should watch together. What you do is go to the site, enter the Trakt.tv usernames of everyone who wants to watch a film together (so you need at least two people, each of whom has kept data like their watchlist and ratings on Trakt), and then Which Film cross-references everyone's watchlists and ratings to create a list of movies that people may want to watch together.

The list of movies is ranked based on a simple point scale. If a movie is on someone's watchlist it gets 4 points, movies rated 10 ⭐ get 3 points, 9 ⭐ get 2 points, and 8 ⭐ get 1 point. Everyone who participates contributes points and the movies are sorted from highest score to lowest. The reason for the point values is the assumption that watching a movie most people have not seen is best, followed by movies people rated very highly. In the case of ties, the movie seen longest ago (if ever) by anyone in the group is ranked higher than movies seen more recently by someone. That way there's a bigger chance someone will be willing to watch a movie again when everyone else wants to see it for the first time.
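
To make the scheme concrete, here is a minimal sketch of that scoring in Python. It is not the actual Which Film code (the site is written in Dart, as described below), the per-person data layout is assumed just for the example, and the tie-breaking by how long ago a movie was last seen is left out.

from collections import defaultdict

# Hypothetical per-person data: a set of movie titles on their watchlist
# and a dict mapping movie title -> rating (1-10).
def rank_movies(people):
    scores = defaultdict(int)
    for person in people:
        for movie in person["watchlist"]:
            scores[movie] += 4                      # unseen and wanted: 4 points
        for movie, rating in person["ratings"].items():
            if rating == 10:
                scores[movie] += 3
            elif rating == 9:
                scores[movie] += 2
            elif rating == 8:
                scores[movie] += 1
    # Highest combined score first; ties would be broken by how long ago
    # anyone in the group last saw the movie (omitted here).
    return sorted(scores, key=scores.get, reverse=True)

group = [
    {"watchlist": {"Arrival"}, "ratings": {"Mad Max: Fury Road": 10}},
    {"watchlist": {"Arrival", "Mad Max: Fury Road"}, "ratings": {}},
]
print(rank_movies(group))  # ['Arrival', 'Mad Max: Fury Road']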

None of this is very fancy or revolutionary, but it's useful any time you get together with a group of friends to watch a film and end up having a hard time choosing what to watch. It can even help between spouses, as it will identify movies both people want to watch, removing that particular point of contention.

The story behind Which Film

Now normally launching a new website wouldn't call for any backstory, but this project has been under development for about six years, so there's a bit of history to it.

One fateful night ...

The inspiration for Which Film stemmed from one night when my co-creator Karl, his wife, my wife, and I got together and decided we wanted to watch a movie. This turned out to be quite an ordeal due to disparate tastes among all four of us. Karl and I thought that there had to be a better way to figure out a film we could all happily watch together. It didn't need to necessarily be something none of us had seen (although that was preferred), but it did need to be something that had a chance of making all of us happy if we chose to watch it.

This is when I realized that, at least for me, I had all of the relevant data to make such a decision on IMDb. I had been keeping my watchlist and ratings up-to-date on the site for years, to the point of amassing a watchlist of over 400 movies. Karl and I realized that had all four of us done that, we could have cross-referenced the data and easily found a film we all liked. Yes, it would require convincing everyone involved to keep track of what movies they wanted to see and to rate movies they had seen, but we figured that wasn't an insurmountable problem. And so we decided we should code up a solution since we're both software developers.

You need an API, IMDb

But there was trouble with this project from the beginning. It turns out that while IMDb is happy for you to store your data on their servers, they don't exactly make it easy to get the data out. For instance, when I started looking into this they had two ways of getting to your data in some programmatic way: RSS and CSV files. The problem with RSS was that it was capped at (I believe) 200 entries, so I couldn't use it to access my entire data set. The issue with CSV was that you had to be logged in to download it. And the issue with both approaches was that they were constantly broken for different things simultaneously; when I last looked into this, RSS was busted for one kind of list while CSV was broken for another. To top it all off the brokenness wasn't temporary, but lasted for lengths of time measured in months. That obviously doesn't work if you want to rely on the data, and there's no official API (and IMDb at least used to aggressively go after anyone who used their name in a project).

Luckily I found Trakt. It has an API, it was accessible on a phone, and it wasn't ugly. The trick, though, was getting my data from IMDb to Trakt. Luckily there was a magical point when CSV exporting on IMDb worked for all of my lists, and so I downloaded the data and hacked together csv2trakt to migrate my data over (there is TraktRater for importing into Trakt as well, but at the time I had issues getting it to run on macOS).

What platform?

With my data moved over, we then had to choose what platform to build Which Film on. We toyed with the idea of doing a mobile app, but I'm an Android user and Karl is on iOS (and the same split for our wives), so that would have meant two apps. That didn't really appeal to either of us so we decided to do a website. We also consciously chose to do a single-page app to avoid maintaining a backend where we would have to worry about uptime, potential server costs, etc. It also helps that there's a local company in Vancouver called Surge that does really nice static page hosting with a very reasonable free tier (when they get Let's Encrypt support I'll probably bump up to their paid tier if people actually end up using Which Film).

Choosing a programming language is never easy for me

Since we had found a website we were willing to ask people to use to store their data, I had solved my data import problem, and we had decided on doing a website, the next question was what technology stack to use. The simple answer would have been Python, but for me that's somewhat boring since I obviously know Python. To make sure we both maximized our learning from this project we endeavoured to find a programming language neither of us had extensive experience in.

Eventually we settled on Dart. At the time we made this decision I worked at Google which is where Dart comes from, so I knew if I got really stuck with something I had internal resources to lean on. Karl liked the idea of using Dart because his game developer background appreciated the fact that Dart was looking into things like SIMD for performance. I also knew that Dart had been chosen by the ads product division at Google which meant it wasn't going anywhere. That also meant choosing Angular 2 was a simple decision since Google was using Dart with Angular 2 for products and so it would have solid Dart support.

But why six years?!?

As I have said, the site isn't complicated, as you can tell from its source code, so you may be wondering why it took us six years to finish it. Well, since coming up with this idea I, at least, finished my Ph.D., moved five times between two countries, and worked for two different employers (if you don't count my Ph.D.). Karl had a similarly busy life over the same timespan. And having me spend a majority of those six years in a different timezone didn't help facilitate discussions. At least we had plenty of time to think through various UX and design problems. ☺

If you give Which Film a try do let Karl and/or me know on Twitter (if you just want to see how the website works and you don't have a Trakt account you can use our usernames: brettcannon and kschmidt).

06 Dec 2016 7:13pm GMT

Marcos Dione: ayrton-0.9

Another release, but this time not (only) a bugfix one. After playing with bool semantics I converted the file tests from the _X format, which, let's face it, was not pretty, into the more usual -X format. This alone merits a change in the minor version number. Also, _in, _out and _err now accept a tuple (path, flags), so you can specify things like os.O_APPEND.

In other news, I had to drop support for Python-3.3, because otherwise I would have had to complexify the import system a lot.

But in the end, yes, this also is a bugfix release. Lots of fd leaks were plugged, so I suggest you upgrade if you can. Just remember the s/_X/-X/ change. I found all the leaks thanks to unittest's warnings, even if sometimes they were a little misleading:

testRemoteCommandStdout (tests.test_remote.RealRemoteTests) ... ayrton/parser/pyparser/parser.py:175: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, raddr=/tmp/ssh-XZxnYoIQxZX9/agent.7248>
  self.stack[-1] = (dfa, next_state, node)

The file and line cited in the warning have nothing to do with the warning itself (it was not the code that raised it) or the leaked fd, so it took me a while to find where those leaks were coming from. I hope I have some time to find out why this is so. The most frustrating thing was that unittest closes the leaking fd, which is nice, but in one of the test cases it was closing it seemingly before the test finished, and the test failed because the socket was closed:

======================================================================
ERROR: testLocalVarToRemoteToLocal (tests.test_remote.RealRemoteTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/mdione/src/projects/ayrton_clean/ayrton/tests/test_remote.py", line 225, in wrapper
    test (self)
File "/home/mdione/src/projects/ayrton_clean/ayrton/tests/test_remote.py", line 235, in testLocalVarToRemoteToLocal
    self.runner.run_file ('ayrton/tests/scripts/testLocalVarToRealRemoteToLocal.ay')
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 304, in run_file
    return self.run_script (script, file_name, argv, params)
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 323, in run_script
    return self.run_tree (tree, file_name, argv, params)
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 336, in run_tree
    return self.run_code (code, file_name, argv)
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 421, in run_code
    raise error
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 402, in run_code
    exec (code, self.globals, self.locals)
File "ayrton/tests/scripts/testLocalVarToRealRemoteToLocal.ay", line 6, in <module>
    with remote ('127.0.0.1', _test=True):
File "/home/mdione/src/projects/ayrton_clean/ayrton/remote.py", line 362, in __enter__
    i, o, e= self.prepare_connections (backchannel_port, command)
File "/home/mdione/src/projects/ayrton_clean/ayrton/remote.py", line 270, in prepare_connections
    self.client.connect (self.hostname, *self.args, **self.kwargs)
File "/usr/lib/python3/dist-packages/paramiko/client.py", line 338, in connect
    t.start_client()
File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 493, in start_client
    raise e
File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 1757, in run
    self.kex_engine.parse_next(ptype, m)
File "/usr/lib/python3/dist-packages/paramiko/kex_group1.py", line 75, in parse_next
    return self._parse_kexdh_reply(m)
File "/usr/lib/python3/dist-packages/paramiko/kex_group1.py", line 112, in _parse_kexdh_reply
    self.transport._activate_outbound()
File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 2079, in _activate_outbound
    self._send_message(m)
File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 1566, in _send_message
    self.packetizer.send_message(data)
File "/usr/lib/python3/dist-packages/paramiko/packet.py", line 364, in send_message
    self.write_all(out)
File "/usr/lib/python3/dist-packages/paramiko/packet.py", line 314, in write_all
    raise EOFError()
EOFError

This probably has something to do with the fact that the test (a functional test, really) is using threads and real sockets. Again, I'll try to investigate this.
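
As an aside, one general technique that can help pinpoint where a leaked fd was actually created (not necessarily what was used for this release) is to enable tracemalloc while the tests run; on Python 3.6 or later the ResourceWarning then also prints the traceback of where the leaked object was allocated, instead of just the unrelated line that happened to trigger it:

import tracemalloc
import warnings

# Trace allocations (keeping up to 25 frames per traceback) and always show
# ResourceWarning; with tracemalloc active, Python 3.6+ appends an
# "Object allocated at ..." traceback to unclosed-resource warnings.
tracemalloc.start(25)
warnings.simplefilter('always', ResourceWarning)

The same effect can be had from the command line with PYTHONTRACEMALLOC=25 and -W always::ResourceWarning.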

All in all, the release is an interesting one. I'll keep adding small features and releasing, let's see how it goes. Meanwhile, here's the changelog:

Get it on github or pypi!


python ayrton

06 Dec 2016 6:46pm GMT

tryexceptpass: Threaded Asynchronous Magic and How to Wield It

Photo Credit: Daniel Schwen via Wikipedia

A dive into Python's asyncio tasks and event loops

Ok let's face it. Clock speeds no longer govern the pace at which computer processors improve. Instead we see increased transistor density and higher core counts. Translating to software terms, this means that code won't run faster, but more of it can run in parallel.

Although making good use of our new-found silicon real estate requires improvements in software, a lot of programming languages have already started down this path by adding features that help with parallel execution. In fact, they've been there for years waiting for us to take advantage.

So why don't we? A good engineer always has an ear to the ground, listening for the latest trends in his industry, so let's take a look at what Python is building for us.

What do we have so far?

Python enables parallelism through both the threading and the multiprocessing libraries. Yet it wasn't until the 3.4 branch that it gave us the asyncio library to help with single-threaded concurrency. This addition was key in providing a more convincing final push to start swapping over from version 2.

The asyncio package allows us to define coroutines. These are code blocks that have the ability to yield execution to other blocks. They run inside an event loop which iterates through the scheduled tasks and executes them one by one. A task switch occurs when a task reaches an await statement or when the current task completes.

Task execution itself happens the same as in a single-threaded system. Meaning, this is not an implementation of parallelism, it's actually closer to multithreading. We can perceive the concurrency in situations where a block of code depends on external actions.

This illusion is possible because the block can yield execution while it waits, making anything that depends on external IO, like network or disk storage, a great candidate. When the IO completes, the coroutine receives an interrupt and can proceed with execution. In the meantime, other tasks execute.

The asyncio event loop can also serve as a task scheduler. Both asynchronous and blocking functions can queue up their execution as needed.

Tasks

A Task represents a callable block of code designed for asynchronous execution within an event loop. Tasks execute single-threaded, but can run in parallel through loops on different threads.

Prefixing a function definition with the async keyword turns it into an asynchronous coroutine, though the task itself will not exist until it's added to a loop. This is usually implicit when calling most loop methods, but asyncio.ensure_future(your_coroutine) is the more direct mechanism.

To denote an operation or instruction that can yield execution, we use the await keyword, although it's only available within a coroutine block and causes a syntax error if used anywhere else.

Please note that the async keyword was not implemented until Python version 3.5. So when working with older versions, use the @asyncio.coroutine decorator and yield from keywords instead.
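
For reference, the pre-3.5 spelling of a coroutine looks like this (a minimal sketch of the equivalent syntax, mirroring the do_some_work example used below):

import asyncio

# Pre-3.5 equivalent of "async def" / "await": decorate the function and
# use "yield from" at the points where it can yield execution.
@asyncio.coroutine
def do_some_work(x):
    print("Waiting " + str(x))
    yield from asyncio.sleep(x)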

Scheduling

In order to execute a task, we need a reference to the event loop in which to run it. Using loop = asyncio.get_event_loop() gives us the current loop in our execution thread. Now it's a matter of calling loop.run_until_complete(your_coroutine) or loop.run_forever() to have it do some work.

Let's look at a short example to illustrate a few points. I strongly encourage you to open an interpreter and follow along:

import time
import asyncio

async def do_some_work(x):
    print("Waiting " + str(x))
    await asyncio.sleep(x)

loop = asyncio.get_event_loop()
loop.run_until_complete(do_some_work(5))

Here we defined do_some_work() as a coroutine that waits on the results of external workload. The workload is simulated through asyncio.sleep.

Running the code may be surprising. Did you expect run_until_complete to be a blocking call? Remember that we're using the event loop from the current thread to execute the task. We'll discuss alternatives in more detail later. So for now, the important part is to understand that while execution blocks, the await keyword still enables concurrency.

For a better picture, let's change our test code a bit and look at executing tasks in batches:

tasks = [asyncio.ensure_future(do_some_work(2)),
         asyncio.ensure_future(do_some_work(5))]
loop.run_until_complete(asyncio.gather(*tasks))

Introducing the asyncio.gather() function enables results aggregation. It waits for several tasks in the same thread to complete and puts the results in a list.

The main observation here is that both function calls did not execute in sequence. It did not wait 2 seconds, then 5, for a total of 7 seconds. Instead it started to wait 2s, then moved on to the next item which started to wait 5s, returning when the longer task completed, for a total of 5s. Feel free to add more print statements to the base function if it helps visualize.

This means that we can put long running tasks with awaitable code in an execution batch, then ask Python to run them in parallel and wait until they all complete. If you plan it right, this will be faster than running in sequence.

Think of it as an alternative to the threading package where after spinning up a number of Threads, we wait for them to complete with .join(). The major difference is that there's less overhead incurred than creating a new thread for each function.
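
For comparison, the same batch written against the threading package would look roughly like this (a sketch only; it spins up one OS thread per call instead of multiplexing everything on a single loop):

from threading import Thread
import time

def do_some_blocking_work(x):
    print("Waiting " + str(x))
    time.sleep(x)

threads = [Thread(target=do_some_blocking_work, args=(n,)) for n in (2, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()    # wait for all threads to complete, about 5 seconds total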

Of course, it's always good to point out that your mileage may vary based on the task at hand. If you're doing compute-heavy work with little or no time waiting, then the only benefit you get is the grouping of code into logical batches.

Running a loop in a different thread

What if, instead of doing everything in the current thread, we spawn a separate Thread to do the work for us?

from threading import Thread
import asyncio

def start_loop(loop):
    asyncio.set_event_loop(loop)
    loop.run_forever()

new_loop = asyncio.new_event_loop()
t = Thread(target=start_loop, args=(new_loop,))
t.start()

Notice that this time we created a new event loop through asyncio.new_event_loop(). The idea is to spawn a new thread, pass it that new loop and then call thread-safe functions (discussed later) to schedule work.

The advantage of this method is that work executed by the other event loop will not block execution in the current thread. Thereby allowing the main thread to manage the work, and enabling a new category of execution mechanisms.

Queuing work in a different thread

Using the thread and event loop from the previous code block, we can easily get work done with the call_soon(), call_later() or call_at() methods. They are able to run regular function code blocks (those not defined as coroutines) in an event loop.

However, it's best to use their _threadsafe alternatives. Let's see how that looks:

import time

def more_work(x):
    print("More work %s" % x)
    time.sleep(x)
    print("Finished more work %s" % x)

new_loop.call_soon_threadsafe(more_work, 6)
new_loop.call_soon_threadsafe(more_work, 3)

Now we're talking! Executing this code does not block the main interpreter, allowing us to give it more work. Since the work executes in order, we now essentially have a task queue.

We just went to multi-threaded execution of single-threaded code, but isn't concurrency part of what we get with asyncio? Sure it is! That loop on the worker thread is still async, so let's enable parallelism by giving it awaitable coroutines.

Doing so is a matter of using asyncio.run_coroutine_threadsafe(), as seen below:

new_loop.call_soon_threadsafe(more_work, 20)
asyncio.run_coroutine_threadsafe(do_some_work(5), new_loop)
asyncio.run_coroutine_threadsafe(do_some_work(10), new_loop)

These instructions illustrate how Python goes about the execution. The first call to more_work blocks the worker loop for 20 seconds, while the two calls to do_some_work execute concurrently as soon as more_work finishes.

Real World Example #1 - Sending Notifications

A common situation these days is to send notifications as a result of a task or event. This is usually simple, but talking to an email server to submit a new message can take time, and so can crafting the email itself.

There are many scenarios where we don't have the luxury of waiting around for tasks to complete, and where doing so provides no benefit to the end user. A prime example is a request for a password reset, or a webhook event that triggers repository builds and emails the results.

The recommended practice so far has been to use a task queuing system like celery, on top of a message queue server like rabbitmq, to schedule the work. I'm here to tell you that for small things that can easily execute from another thread of your main application, it's not a bad idea to just use asyncio. The pattern is fairly similar to the code examples we've seen so far:

import asyncio
import smtplib
from threading import Thread

def send_notification(email):
    """Generate and send the notification email"""
    # Do some work to get email body
    message = ...

    # Connect to the server (username, password and from_addr are assumed
    # to be defined elsewhere in the application)
    server = smtplib.SMTP("smtp.gmail.com:587")
    server.ehlo()
    server.starttls()
    server.login(username, password)

    # Send the email
    server.sendmail(from_addr, email, message)
    server.quit()

def start_email_worker(loop):
    """Switch to new event loop and run forever"""
    asyncio.set_event_loop(loop)
    loop.run_forever()

# Create the new loop and worker thread
worker_loop = asyncio.new_event_loop()
worker = Thread(target=start_email_worker, args=(worker_loop,))

# Start the thread
worker.start()

# Assume a Flask restful interface endpoint
@app.route("/notify")
def notify(email):
    """Request notification email"""
    worker_loop.call_soon_threadsafe(send_notification, email)

Here we assume a Flask web API with an endpoint mounted at /notify in which to request a notification email of some sort.

Notice that send_notification is not a coroutine, so each email will be a blocking call. The worker thread's event loop will serve as the queue in which to track the outgoing emails.

Why are the SMTP calls synchronous you wonder? Well, while this is a good example of what should be awaitable IO, I'm not aware of an asynchronous SMTP library at the moment. Feel free to substitute with an async def, await and run_coroutine_threadsafe, if you do find one.

Real World Example #2 - Parallel Web Requests

Here's an example of batching HTTP requests that run concurrently to several servers, while waiting for responses before processing. I expect it to be useful for those of you that do a lot of scraping, as well as a quick intro to the aiohttp module.

import asyncio
import aiohttp

async def fetch(url):
    """Perform an HTTP GET to the URL and print the response"""
    response = await aiohttp.request('GET', url)
    return await response.text()

# Get a reference to the event loop
loop = asyncio.get_event_loop()

# Create the batch of requests we wish to execute
requests = [asyncio.ensure_future(fetch("https://github.com")),
            asyncio.ensure_future(fetch("https://google.com"))]

# Run the batch
responses = loop.run_until_complete(asyncio.gather(*requests))

# Examine responses
for resp in responses:
    print(resp)

Fairly straightforward, it's a matter of grouping the work in a list of tasks and using run_until_complete to get the responses back. This can easily change to use a separate thread in which to make requests, where it would be simple to add all the URLs through the thread-safe methods described previously.

I want to note that the requests library has asynchronous support through gevent, but I haven't done the work to figure out how that can tie into asyncio. In contrast, I'm not aware of asyncio plans for the popular scraping framework scrapy, but I assume they're working on it.

Stopping the loop

If at any point you find yourself wanting to stop an infinite event loop, or want to cancel tasks that haven't completed, I tend to use a KeyboardInterrupt exception clause to trigger cancellation as shown below. Although the same can be accomplished by using the signal module and registering a handler for signal.SIGINT.

try:
loop.run_forever()
except KeyboardInterrupt:
# Canceling pending tasks and stopping the loop
asyncio.gather(*asyncio.Task.all_tasks()).cancel()
    # Stopping the loop
loop.stop()
    # Received Ctrl+C
loop.close()

This time we're introducing the use of Task.all_tasks() to generate a list of all currently running or scheduled tasks. When coupled with gather() we can send the cancel() command to each one and have them all stop executing or remove them from the queue.

Please note that due to signaling deficiencies in Windows, if the loop is empty, the keyboard interrupt is never triggered. A workaround for this situation is to queue a task that sleeps for several seconds. This guarantees that if the interrupt arrives while the task sleeps, the loop will notice when it wakes.

Asynchronous programming can be very confusing. I must confess that I started with some base assumptions that turned out to be wrong. It wasn't until I dove deeper into it that I realized what's really going on.

I hope this served as a good introduction to asyncio event loops and tasks, as well as their possible uses. I know there are plenty of other articles out there, but I wanted to make something that tied things to some real world examples. If you have any questions or comments, feel free to drop them below and I'll help as best I can.


Threaded Asynchronous Magic and How to Wield It was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.

06 Dec 2016 6:06pm GMT

tryexceptpass: Threaded Asynchronous Magic and How to Wield It

Photo Credit: Daniel Schwen via Wikipedia

A dive into Python's asyncio tasks and event loops

Ok let's face it. Clock speeds no longer govern the pace at which computer processors improve. Instead we see increased transistor density and higher core counts. Translating to software terms, this means that code won't run faster, but more of it can run in parallel.

Although making good use of our new-found silicon real estate requires improvements in software, a lot of programming languages have already started down this path by adding features that help with parallel execution. In fact, they've been there for years waiting for us to take advantage.

So why don't we? A good engineer always has an ear to the ground, listening for the latest trends in his industry, so let's take a look at what Python is building for us.

What do we have so far?

Python enables parallelism through both the threading and the multiprocessing libraries. Yet it wasn't until the 3.4 branch that it gave us the asyncio library to help with single-threaded concurrency. This addition was key in providing a more convincing final push to start swapping over from version 2.

The asyncio package allows us to define coroutines. These are blocks of code that can yield execution to other blocks. They run inside an event loop which iterates through the scheduled tasks and executes them one by one. A task switch occurs when the running task reaches an await statement or completes.

Task execution itself happens the same way as in a single-threaded system. Meaning, this is not an implementation of parallelism; it's cooperative concurrency that feels closer to multithreading. The concurrency becomes apparent in situations where a block of code depends on external actions.

This illusion is possible because the block can yield execution while it waits, making anything that depends on external IO, like network or disk storage, a great candidate. When the IO completes, the coroutine receives an interrupt and can proceed with execution. In the meantime, other tasks execute.

The asyncio event loop can also serve as a task scheduler. Both asynchronous and blocking functions can queue up their execution as needed.

Tasks

A Task represents a callable block of code designed for asynchronous execution within an event loop. Tasks execute single-threaded, but several of them can run in parallel by using loops on different threads.

Prefixing a function definition with the async keyword turns it into an asynchronous coroutine. The task itself, however, will not exist until the coroutine is added to a loop. This usually happens implicitly when calling most loop methods, but asyncio.ensure_future(your_coroutine) is the more direct mechanism.

To denote an operation or instruction that can yield execution, we use the await keyword. It's only available within a coroutine block and causes a syntax error if used anywhere else.

Please note that the async keyword was not implemented until Python version 3.5. So when working with older versions, use the @asyncio.coroutine decorator and yield from keywords instead.
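For instance, here is a minimal sketch contrasting the two syntaxes; the coroutine names and the sleep duration are just placeholders, not part of the original examples:

import asyncio

# Python 3.5+ syntax
async def fetch_data():
    await asyncio.sleep(1)
    return 42

# Equivalent pre-3.5 syntax
@asyncio.coroutine
def fetch_data_legacy():
    yield from asyncio.sleep(1)
    return 42

# Wrap the coroutine in a Task on the current loop and run it
loop = asyncio.get_event_loop()
task = asyncio.ensure_future(fetch_data())
print(loop.run_until_complete(task))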

Scheduling

In order to execute a task, we need a reference to the event loop in which to run it. Using loop = asyncio.get_event_loop() gives us the current loop in our execution thread. Now it's a matter of calling loop.run_until_complete(your_coroutine) or loop.run_forever() to have it do some work.

Let's look at a short example to illustrate a few points. I strongly encourage you to open an interpreter and follow along:

import time
import asyncio

async def do_some_work(x):
    print("Waiting " + str(x))
    await asyncio.sleep(x)

loop = asyncio.get_event_loop()
loop.run_until_complete(do_some_work(5))

Here we define do_some_work() as a coroutine that waits on the result of an external workload, which is simulated with asyncio.sleep.

Running the code may be surprising. Did you expect run_until_complete to be a blocking call? Remember that we're using the event loop from the current thread to execute the task. We'll discuss alternatives in more detail later. So for now, the important part is to understand that while execution blocks, the await keyword still enables concurrency.

For a better picture, let's change our test code a bit and look at executing tasks in batches:

tasks = [asyncio.ensure_future(do_some_work(2)),
         asyncio.ensure_future(do_some_work(5))]
loop.run_until_complete(asyncio.gather(*tasks))

Introducing the asyncio.gather() function enables results aggregation. It waits for several tasks in the same thread to complete and puts the results in a list.

The main observation here is that the two calls did not execute in sequence. We did not wait 2 seconds, then 5, for a total of 7 seconds. Instead the loop started waiting 2s, moved on to the next item which started waiting 5s, and returned when the longer task completed, for a total of roughly 5s. Feel free to add more print statements to the base function if it helps you visualize this.

This means that we can put long running tasks with awaitable code in an execution batch, then ask Python to run them in parallel and wait until they all complete. If you plan it right, this will be faster than running in sequence.

Think of it as an alternative to the threading package where after spinning up a number of Threads, we wait for them to complete with .join(). The major difference is that there's less overhead incurred than creating a new thread for each function.
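For comparison, a rough equivalent with the threading package might look like the sketch below, where wait_a_bit is just a stand-in for a blocking function; every Thread here costs an OS thread, while the asyncio version shares one:

import threading
import time

def wait_a_bit(x):
    time.sleep(x)  # blocking wait, the counterpart of await asyncio.sleep(x)

threads = [threading.Thread(target=wait_a_bit, args=(n,)) for n in (2, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all of them, much like asyncio.gather above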

Of course, it's always good to point out that your mileage may vary based on the task at hand. If you're doing compute-heavy work, with little or no time waiting, then the only benefit you get is the grouping of code into logical batches.

Running a loop in a different thread

What if, instead of doing everything in the current thread, we spawn a separate Thread to do the work for us?

from threading import Thread
import asyncio

def start_loop(loop):
    asyncio.set_event_loop(loop)
    loop.run_forever()

new_loop = asyncio.new_event_loop()
t = Thread(target=start_loop, args=(new_loop,))
t.start()

Notice that this time we created a new event loop through asyncio.new_event_loop(). The idea is to spawn a new thread, pass it that new loop and then call thread-safe functions (discussed later) to schedule work.

The advantage of this method is that work executed by the other event loop will not block execution in the current thread, thereby allowing the main thread to manage the work and enabling a new category of execution mechanisms.

Queuing work in a different thread

Using the thread and event loop from the previous code block, we can easily get work done with the call_soon(), call_later() or call_at() methods. They are able to run regular function code blocks (those not defined as coroutines) in an event loop.

However, it's best to use their _threadsafe alternatives. Let's see how that looks:

import time

def more_work(x):
    print("More work %s" % x)
    time.sleep(x)
    print("Finished more work %s" % x)

new_loop.call_soon_threadsafe(more_work, 6)
new_loop.call_soon_threadsafe(more_work, 3)

Now we're talking! Executing this code does not block the main interpreter, allowing us to give it more work. Since the work executes in order, we now essentially have a task queue.

We just moved to multi-threaded execution of single-threaded code, but isn't concurrency part of what we get with asyncio? Sure it is! The loop on the worker thread is still async, so let's enable concurrency by giving it awaitable coroutines.

Doing so is a matter of using asyncio.run_coroutine_threadsafe(), as seen below:

new_loop.call_soon_threadsafe(more_work, 20)
asyncio.run_coroutine_threadsafe(do_some_work(5), new_loop)
asyncio.run_coroutine_threadsafe(do_some_work(10), new_loop)

These instructions illustrate how Python goes about execution: the first call to more_work blocks the worker loop for 20 seconds, while the two do_some_work coroutines execute concurrently as soon as more_work finishes.

Real World Example #1 - Sending Notifications

A common situation these days is to send notifications as a result of a task or event. This is usually simple, but talking to an email server to submit a new message can take time, and so can crafting the email itself.

There are many scenarios where we don't have the luxury of waiting around for tasks to complete, and where doing so provides no benefit to the end user. Prime examples are a request for a password reset, or a webhook event that triggers repository builds and emails the results.

The recommended practice so far has been to use a task queuing system like Celery, on top of a message queue server like RabbitMQ, to schedule the work. I'm here to tell you that for small things that can easily execute from another thread of your main application, it's not a bad idea to just use asyncio. The pattern is fairly similar to the code examples we've seen so far:

import asyncio
import smtplib
from threading import Thread

def send_notification(email):
    """Generate and send the notification email"""

    # Do some work to get the email body
    message = ...

    # Connect to the server
    server = smtplib.SMTP("smtp.gmail.com:587")
    server.ehlo()
    server.starttls()
    server.login(username, password)

    # Send the email
    server.sendmail(from_addr, email, message)
    server.quit()

def start_email_worker(loop):
    """Switch to the new event loop and run forever"""
    asyncio.set_event_loop(loop)
    loop.run_forever()

# Create the new loop and worker thread
worker_loop = asyncio.new_event_loop()
worker = Thread(target=start_email_worker, args=(worker_loop,))

# Start the thread
worker.start()

# Assume a Flask restful interface endpoint
@app.route("/notify")
def notify(email):
    """Request notification email"""
    worker_loop.call_soon_threadsafe(send_notification, email)

Here we assume a Flask web API with an endpoint mounted at /notify in which to request a notification email of some sort.

Notice that send_notification is not a coroutine, so each email will be a blocking call. The worker thread's event loop will serve as the queue in which to track the outgoing emails.

Why are the SMTP calls synchronous, you wonder? Well, while this is a good example of what should be awaitable IO, I'm not aware of an asynchronous SMTP library at the moment. If you do find one, feel free to substitute with an async def, await and run_coroutine_threadsafe.
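As a rough sketch of what that substitution might look like, assuming a hypothetical async_send_email() coroutine from such a library (it is not a real API), the endpoint would schedule a coroutine instead of a plain function:

import asyncio

async def send_notification_async(email):
    """Generate and send the notification email without blocking the worker loop"""
    # Do some work to get the email body
    message = ...

    # async_send_email is a placeholder for whatever async SMTP client you find
    await async_send_email(from_addr, email, message)

@app.route("/notify")
def notify(email):
    """Request notification email"""
    # Schedule the coroutine on the worker loop from the Flask thread
    asyncio.run_coroutine_threadsafe(send_notification_async(email), worker_loop)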

Real World Example #2 - Parallel Web Requests

Here's an example that batches HTTP requests to several servers, runs them concurrently, and waits for all the responses before processing them. I expect it to be useful for those of you who do a lot of scraping, as well as serving as a quick intro to the aiohttp module.

import asyncio
import aiohttp

async def fetch(url):
    """Perform an HTTP GET to the URL and return the response body"""
    response = await aiohttp.request('GET', url)
    return await response.text()

# Get a reference to the event loop
loop = asyncio.get_event_loop()

# Create the batch of requests we wish to execute
requests = [asyncio.ensure_future(fetch("https://github.com")),
            asyncio.ensure_future(fetch("https://google.com"))]

# Run the batch
responses = loop.run_until_complete(asyncio.gather(*requests))

# Examine responses
for resp in responses:
    print(resp)

Fairly straightforward: it's a matter of grouping the work in a list of tasks and using run_until_complete to get the responses back. This can easily change to use a separate thread in which to make requests, where it would be simple to add all the URLs through the thread-safe methods described previously.
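As a minimal sketch of that variation, assuming a loop such as new_loop from the earlier section is already running in a worker thread, each fetch() can be submitted with run_coroutine_threadsafe and the results collected from the returned concurrent.futures.Future objects:

import asyncio

urls = ["https://github.com", "https://google.com"]

# Schedule each fetch on the loop running in the worker thread
futures = [asyncio.run_coroutine_threadsafe(fetch(url), new_loop)
           for url in urls]

# Block in the main thread until each result is ready
for future in futures:
    print(future.result())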

I want to note that the requests library has asynchronous support through gevent, but I haven't done the work to figure out how that can tie into asyncio. In contrast, I'm not aware of asyncio plans for the popular scraping framework scrapy, but I assume they're working on it.

Stopping the loop

If at any point you find yourself wanting to stop an infinite event loop, or wanting to cancel tasks that haven't completed, I tend to use a KeyboardInterrupt exception clause to trigger cancellation, as shown below. The same can be accomplished by using the signal module and registering a handler for signal.SIGINT.

try:
    loop.run_forever()
except KeyboardInterrupt:
    # Received Ctrl+C: cancel pending tasks
    asyncio.gather(*asyncio.Task.all_tasks()).cancel()

    # Stop the loop
    loop.stop()

# Clean up once the loop has stopped
loop.close()

This time we're introducing the use of Task.all_tasks() to generate a list of all currently running or scheduled tasks. When coupled with gather() we can send the cancel() command to each one and have them all stop executing or remove them from the queue.

Please note that due to signaling deficiencies in Windows, if the loop is empty, the keyboard interrupt is never triggered. A workaround for this situation is to queue a task that sleeps for several seconds. This guarantees that if the interrupt arrives while the task sleeps, the loop will notice when it wakes.
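A minimal sketch of that workaround, using a placeholder coroutine name, simply keeps the loop waking up at a regular interval:

import asyncio

async def stay_awake():
    """Wake the loop periodically so Ctrl+C is noticed on Windows"""
    while True:
        await asyncio.sleep(3)

asyncio.ensure_future(stay_awake())
loop.run_forever()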

Asynchronous programming can be very confusing. I must confess that I started with some base assumptions that turned out to be wrong. It wasn't until I dove deeper into it that I realized what's really going on.

I hope this served as a good introduction to asyncio event loops and tasks, as well as their possible uses. I know there are plenty of other articles out there, but I wanted to make something that tied things to some real world examples. If you have any questions or comments, feel free to drop them below and I'll help as best I can.


Threaded Asynchronous Magic and How to Wield It was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.

06 Dec 2016 6:06pm GMT

Continuum Analytics News: Introducing: fastparquet

Developer Blog
Tuesday, December 6, 2016
Martin Durant
Continuum Analytics

A compliant, flexible and speedy interface to Parquet format files for Python, fastparquet provides seamless translation between in-memory pandas DataFrames and on-disc storage.

In this post, we will introduce the two functions that will most commonly be used within fastparquet, followed by a discussion of the current Big Data landscape, Python's place within it and details of how fastparquet fills one of the gaps on the way to building out a full end-to-end Big Data pipeline in Python.

fastparquet Teaser

New users of fastparquet will mainly use the functions write and ParquetFile.to_pandas. Both functions offer good performance with default values, and both have a number of options to improve performance further.

import fastparquet

# write data
fastparquet.write('out.parq', df, compression='SNAPPY')

# load data
pfile = fastparquet.ParquetFile('out.parq')
df2 = pfile.to_pandas()  # all columns
df3 = pfile.to_pandas(columns=['floats', 'times'])  # pick some columns

Introduction: Python and Big Data

Python was named as a favourite tool for data science by 45% of data scientists in 2016. Many reasons can be presented for this, and near the top will be:

  • Python is very commonly taught at college and university level

  • Python and associated numerical libraries are free and open source

  • The code tends to be concise, quick to write, and expressive

  • An extremely rich ecosystem of libraries exists, covering not only numerical processing but also the other important links in the pipeline, from data ingest to visualization and distribution of results

Big Data, however, has typically been based on traditional databases and, in recent years, the Hadoop ecosystem. Hadoop provides a distributed file-system, cluster resource management (YARN, Mesos) and a set of frameworks for processing data (map-reduce, Pig, Kafka, and many more). In the past few years, Spark has rapidly increased in usage, becoming a major force, and 62% of its users run Spark jobs from Python (via PySpark).

The Hadoop ecosystem and its tools, including Spark, are heavily based around the Java Virtual Machine (JVM), which creates a gap between the familiar, rich Python data ecosystem and clustered Big Data with Hadoop. One such missing piece is a data format that can efficiently store large amounts of tabular data, in a columnar layout, and split it into blocks on a distributed file-system.

Parquet has become the de-facto standard file format for tabular data in Spark, Impala and other clustered frameworks. Parquet provides several advantages relevant to Big Data processing:

  • Columnar storage, only read the data of interest

  • Efficient binary packing

  • Choice of compression algorithms and encoding

  • Splits data into files, allowing for parallel processing

  • Range of logical types

  • Statistics stored in metadata to allow for skipping unneeded chunks

  • Data partitioning using the directory structure

fastparquet bridges the gap to provide native Python read/write access without the need to use Java.

Until now, Spark's Python interface provided the only way to write Parquet files from Python, and much of that time is spent deserializing the data in the Java-Python bridge. Also, note that the times column returned is then just integers, rather than the correct datetime type. Not only does fastparquet provide native access to Parquet files, it in fact makes the transfer of data to Spark much faster.

# to make and save a large-ish DataFrame
import pandas as pd 
import numpy as np 
N = 10000000

df = pd.DataFrame({'ints': np.random.randint(0, 1000, size=N),
                   'floats': np.random.randn(N),
                   'times': pd.DatetimeIndex(start='1980', freq='s', periods=N)})
import pyspark
sc = pyspark.SparkContext()
sql = pyspark.SQLContext(sc) 

The default Spark single-machine configuration cannot handle the above DataFrame (out-of-memory error), so we'll perform timing for 1/10 of the data:

# sending data to spark via pySpark serialization, 1/10 of the data
%time o = sql.createDataFrame(df[::10]).count()
CPU times: user 3.45 s, sys: 96.6 ms, total: 3.55 s
Wall time: 4.14 s

%%time
# sending data to spark via a file made with fastparquet, all the data 
fastparquet.write('out.parq', df, compression='SNAPPY')
df4 = sql.read.parquet('out.parq').count()
CPU times: user 2.75 s, sys: 285 ms, total: 3.04 s
Wall time: 3.27 s

The fastparquet Library

fastparquet is an open source library providing a Python interface to the Parquet file format. It uses Numba and NumPy to provide speed, and writes data to and from pandas DataFrames, the most typical starting point for Python data science operations.

fastparquet can be installed using conda:

conda install -c conda-forge fastparquet

(currently only available for Python 3)

  • The code is hosted on GitHub
  • The primary documentation is on RTD

Bleeding edge installation directly from the GitHub repo is also supported, as long as Numba, pandas, pytest and ThriftPy are installed.

Reading Parquet files into pandas is simple and, again, much faster than via PySpark serialization.

import fastparquet 
pfile = fastparquet.ParquetFile('out.parq')
%time df2 = pfile.to_pandas()
CPU times: user 812 ms, sys: 291 ms, total: 1.1 s
Wall time: 1.1 s

The Parquet format is more compact and faster to load than the ubiquitous CSV format.

df.to_csv('out.csv')
!du -sh out.csv out.parq
490M    out.csv
162M    out.parq

In this case, the data is 229MB in memory, which translates to 162MB on-disc as Parquet or 490MB as CSV. Loading from CSV takes substantially longer than from Parquet.

%time df2 = pd.read_csv('out.csv', parse_dates=True)
CPU times: user 9.85 s, sys: 1 s, total: 10.9 s
Wall time: 10.9 s

The biggest advantage, however, is the ability to pick only some columns of interest. In CSV, this still means scanning through the whole file (if not parsing all the values), but the columnar nature of Parquet means only reading the data you need.

%time df3 = pd.read_csv('out.csv', usecols=['floats'])
%time df3 = pfile.to_pandas(columns=['floats'])
CPU times: user 4.04 s, sys: 176 ms, total: 4.22 s
Wall time: 4.22 s
CPU times: user 40 ms, sys: 96.9 ms, total: 137 ms
Wall time: 137 ms

Example

We have taken the airlines dataset and converted it into Parquet format using fastparquet. The original data was in CSV format, one file per year, 1987-2004. The total data size is 11GB as CSV, uncompressed, which becomes about double that in memory as a pandas DataFrame for typical dtypes. This is approaching, if not Big Data, then at least Sizable Data, because it cannot fit into my machine's memory.

The Parquet data is stored as a multi-file dataset. The total size is 2.5GB, with Snappy compression throughout.

ls airlines-parq/
_common_metadata  part.12.parquet   part.18.parquet   part.4.parquet
_metadata         part.13.parquet   part.19.parquet   part.5.parquet
part.0.parquet    part.14.parquet   part.2.parquet    part.6.parquet
part.1.parquet    part.15.parquet   part.20.parquet   part.7.parquet
part.10.parquet   part.16.parquet   part.21.parquet   part.8.parquet
part.11.parquet   part.17.parquet   part.3.parquet    part.9.parquet

To load the metadata:

import fastparquet
pf = fastparquet.ParquetFile('airlines-parq')

The ParquetFile instance provides various information about the data set in attributes:

pf.info
pf.schema
pf.dtypes
pf.count

Furthermore, we have information available about the "row-groups" (logical chunks) and the 29 column fragments contained within each. In this case, we have one row-group for each of the original CSV files; that is, one per year.

fastparquet will not generally be as fast as a direct memory dump, such as numpy.save or Feather, nor will it be as fast or compact as custom tuned formats like bcolz. However, it provides good trade-offs and options which can be tuned to the nature of the data. For example, the column/row-group chunking of the data allows pre-selection of only some portions of the total, which enables not having to scan through the other parts of the disc at all. The load speed will depend on the data type of the column, the efficiency of compression, and whether there are any NULLs.

There is, in general, a trade-off between compression and processing speed; uncompressed will tend to be faster, but larger on disc, and gzip compression will be the most compact, but slowest. Snappy compression, in this example, provides moderate space efficiency, without too much processing cost.
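As a rough illustration of those options (the file names are placeholders, and the exact set of accepted codec names may vary with your installation), the compression keyword of fastparquet.write selects the codec:

import fastparquet

# No compression: fastest to process, largest on disc (the default)
fastparquet.write('data_plain.parq', df)

# GZIP: smallest on disc, slowest to read and write
fastparquet.write('data_gzip.parq', df, compression='GZIP')

# SNAPPY: a moderate middle ground between size and speed
fastparquet.write('data_snappy.parq', df, compression='SNAPPY')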

fastparquet has no problem loading a very large number of rows or columns (memory allowing):

%%time
# 124M bool values
d = pf.to_pandas(columns=['Cancelled'])
CPU times: user 436 ms, sys: 167 ms, total: 603 ms
Wall time: 620 ms

%%time
d = pf.to_pandas(columns=['Distance'])
CPU times: user 964 ms, sys: 466 ms, total: 1.43 s
Wall time: 1.47 s

%%time
# just the first portion of the data, 1.3M rows, 29 columns 
d = pf.to_pandas(filters=(('Year', '==', 1987), ))
CPU times: user 1.37 s, sys: 212 ms, total: 1.58 s
Wall time: 1.59 s

The following factors are known to reduce performance:

  • The existence of NULLs in the data. It is faster to use special values, such as NaN for data types that allow it, or other known sentinel values, such as an empty byte-string.

  • Variable-length string encoding is slow on both write and read, and fixed-length will be faster, although this is not compatible with all Parquet frameworks (particularly Spark). Converting to categories will be a good option if the cardinality is low.

  • Some data types require conversion in order to be stored in Parquet's few primitive types. Conversion may take some time.

The Python Big Data Ecosystem

fastparquet provides one of the necessary links for Python to be a first-class citizen within Big Data processing. Although useful alone, it is intended to work seamlessly with the following libraries:

  • Dask, a pure-Python, flexible parallel execution engine, and its distributed scheduler. Each row-group is independent of the others, and Dask can take advantage of this to process parts of a Parquet data-set in parallel. The Dask DataFrame closely mirrors pandas, and methods on it (a subset of all those in pandas) actually call pandas methods on the underlying shards of the logical DataFrame. The Dask Parquet interface is experimental, as it lags slightly behind development in fastparquet.

  • hdfs3, s3fs and adlfs provide native Pythonic interfaces to massive file systems. If the whole purpose of Parquet is to store Big Data, we need somewhere to keep it. fastparquet accepts a function that opens a file-like object, given a path, and so can use any of these back-ends for reading and writing, which also makes it easy to support any new file-system back-end in the future. Choosing the back-end is automatic when using Dask and a URL like s3://mybucket/mydata.parq (see the sketch after this list).
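As a minimal sketch of how that might look with s3fs, the bucket path below is a placeholder and open_with is assumed to be the keyword fastparquet uses for the file-opening function:

import fastparquet
import s3fs

# Any back-end works, as long as it can return a file-like object for a path
s3 = s3fs.S3FileSystem()

# Hand fastparquet the function it should use to open files on that back-end
pf = fastparquet.ParquetFile('mybucket/mydata.parq', open_with=s3.open)
df = pf.to_pandas()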

With the blossoming of interactive visualization technologies for Python, the prospect of end-to-end Big Data processing projects is now fully realizable.

fastparquet Status and Plans

As of the publication of this article, the fastparquet library can be considered beta: useful to the general public and able to cope with many situations, but with some caveats (see below). Please try your own use case and report issues and comments on the GitHub tracker. The code will continue to develop (contributions welcome), and we will endeavour to keep the documentation in sync and provide regular updates.

A number of nice-to-haves are planned, and work to improve the performance should be completed around the new year, 2017.

Further Helpful Information

We don't have the space to talk about it here, but documentation at RTD gives further details on:

  • How to iterate through Parquet-stored data, rather than load the whole data set into memory at once

  • Using Parquet with Dask-DataFrames for parallelism and on a distributed cluster

  • Getting the most out of performance

  • Reading and writing partitioned data

  • Data types understood by Parquet and fastparquet

fastparquet Caveats

Aside from the performance pointers above, some specific things do not work in fastparquet, and for some of these, fixes are not planned unless there is substantial community interest.

  • Some encodings are not supported, such as delta encoding, since we have no test data to develop against.

  • Nested schemas are not supported at all, and are not currently planned, since they don't fit in well with pandas' tabular layout. If a column contains Python objects, they can be JSON-encoded and written to Parquet as strings.

  • Some output Parquet files will not be compatible with some other Parquet frameworks. For instance, Spark cannot read fixed-length byte arrays.

This work is fully open source (Apache-2.0), and contributions are welcome.

Development of the library has been supported by Continuum Analytics.

06 Dec 2016 5:48pm GMT

Albert Hopkins: Writing autofill plugins for TeamPlayer

Background

TeamPlayer is a Django-based streaming radio app with a twist. A while back it gained a feature called "shake things up" where, instead of dead silence, "DJ Ango" would play tracks from the TeamPlayer Library when no players had any queued songs. Initially this was implemented by creating a queue for DJ Ango and then filling it with random tracks. This worked, but after a while I became annoyed by the "randomness" and so went about writing a few other implementations, which I call "autofill strategies". These were function definitions, and the autofill logic used an if/else clause to select which function to call based on what was set in the Django settings.

Recently I got rid of the if/else's and instead use setuptools entry points. This also allows for third parties to write "autofill plugins" for TeamPlayer. Here's how to do it.

As I said every autofill strategy is a Python function with the following signature:

def my_autofill_strategy(*, queryset, entries_needed, station):  

This function should return a list of teamplayer.models.LibraryItem. The list should ideally have a length of entries_needed but no longer, and the returned list should contain entries from the queryset. The "should"s are emphasized because sometimes a particular strategy can't find enough entries from the queryset so it can either return a smaller list or return entries not in the queryset or both. The station argument is the teamplayer.models.Station instance for which songs are being selected. This is (almost) always Station.main_station().
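For instance, a minimal strategy that just picks random entries from the given queryset might look like this (the function name is only an example):

def random_autofill(*, queryset, entries_needed, station):
    """Pick up to entries_needed random LibraryItems from the queryset"""
    return list(queryset.order_by('?')[:entries_needed])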

Idea

Regular terrestrial radio stations often play the same set of songs in rotation over and over again. This is one reason why I rarely listen to them. However I thought this would be an interesting (and easy) autofill strategy to write.

Implementation

Here's the idea: keep a (play)list of songs from the TeamPlayer Library for rotation, store it in a database table, and then write the autofill function to simply pick from that list. Here is the Django database model:

from django.db import models  
from teamplayer.models import LibraryItem

class Song(models.Model):  
    song = models.OneToOneField(LibraryItem)

This table's rows just point to a LibraryItem. We can use the Django admin site to maintain the list. So again the autofill function just points to entries from the list:

from .models import Song

def rotation_autofill(*, queryset, entries_needed, station):  
    songs = Song.objects.order_by('?')[:entries_needed]
    songs = [i.song for i in songs]

    return songs

Now all that we need is some logic to run the commercial breaks and station identification. Just kidding. Now all that is needed is to "package" our plugin.

Packaging

As I've said TeamPlayer now uses setuptools entry points to get autofill strategies. The entry point group name for autofill plugins is aptly called 'teamplayer.autofill_strategy'. So in our setup.py we register our function as such:

# setup.py
from setuptools import setup

setup(  
    name='mypackage',
    ...
    entry_points={
        'teamplayer.autofill_strategy': [
            'rotation = mypackage.autofill:rotation_autofill',
        ]
    }
)

Here the entry_points argument to setup defines the entry points. For this we declare the group teamplayer.autofill_strategy and in that group we have a single entry point called rotation. rotation points to the rotation_autofill function in the module mypackage.autofill (using dots for the module and a colon for the member).
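For illustration, here is roughly how a consumer of that entry point group could discover and load the registered strategies with setuptools; TeamPlayer's own loading code may differ:

import pkg_resources

strategies = {
    entry_point.name: entry_point.load()
    for entry_point in pkg_resources.iter_entry_points('teamplayer.autofill_strategy')
}

# strategies['rotation'] now refers to the rotation_autofill function
print(sorted(strategies))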

From there all you would need is to pip install your app, add it to INSTALLED_APPS (after TeamPlayer) and change the following setting:

TEAMPLAYER = {  
    'SHAKE_THINGS_UP': 10,
    'AUTOFILL_STRATEGY': 'rotation',
}

The 'SHAKE_THINGS_UP' setting tells TeamPlayer the (maximum) number of Library items to add to DJ Ango's queue at a time (0 to disable) and the AUTOFILL_STRATEGY tells which autofill strategy plugin to load.

A (more) complete implementation of this example is here.

06 Dec 2016 5:39pm GMT

Obey the Testing Goat: Second Edition update: Virtualenvs, Django 1.10, REST APIs, cleaner FTs...

A brief update on my progress for the second edition.

screenshot of book project plan, almost done Getting there!

Virtualenvs all the way down.

In the first edition, I made the judgement call that telling people to use virtualenvs at the very beginning of the book would be too confusing for beginners. I've decided to revisit that decision, since virtualenvs are more and more de rigueur these days. I mean, if the djangogirls tutorial is recommending one, given that it's the most beginner-friendly tutorial on Earth, then it really must be a good idea. So there are new instructions in the pre-requisite installations chapter. Let me know if you think they could be clearer.

Django 1.10

Django 1.10 doesn't introduce that many new features over 1.8, but upgrading was still pretty fiddly. Thank goodness for my extensive tests (tests for the tests in the book about testing, yes. because of course.) The main change you're likely to notice is in Chapter 4, where I introduce the Django Test Client much earlier than I used to (which, through a long chain of causes, is actually because of a change to the way CSRF tokens are generated). Other than that, Django 1.10 was pretty much a drop-in replacement. The main thing I'm preparing for, really, is the upgrade to 1.11 LTS early next year.

REST APIs

I was thinking of having a couple of in-line chapters on building a REST API, but for now I've decided to have them as appendices. It starts with how to roll your own, including an example of how to test client-side ajax javascript with sinon, and then there's a second appendix on Django Rest Framework. These are both very much just skeleton outlines at the moment, but, still, feedback and suggestions appreciated.

A cleaner flow for Chapter 6

Chapter 6 is all about rewriting an app that almost works, to be one that actually works, but trying to work incrementally all along, and using the FTs to tell us when we make progress, and warn us if we introduce regressions. I used to have just the one FT, and track progress/regressions by "what line number is the FT failing at? is it higher or lower than before?". Instead I've split out one FT that tests that the existing behaviour still works, and one FT for the new behaviour, and that's much neater I think.

Next: geckodriver and Selenium 3 (uh-oh!)

There are plenty more little tweaks and nice-to-have additions I can think of (React? Docker? Oh yeah, I got your trendy topics covered), but the main task that's really outstanding is upgrading to Selenium 3 and geckodriver. And the reason that's scary is because the current status of implicit waits is up for debate, and I rely on implicit waits a lot. Introducing explicit waits earlier might be a good thing (they're currently only mentioned in Chapter 20), but it would definitely add to the learning curve in the early chapters (I think they'd have to go in chapter 4 or 5, which feels very early indeed). So I'm kinda in denial about this at the moment, hoping that maybe Mozilla will reintroduce the old behaviour, or maybe I'll build some magical wrapper around selenium that just does implicit waits for you (maybe using my stale element check trick) (in my copious spare time), or maybe switch to chromedriver, or I don't know I don't want to think about it. Suggestions, words of encouragement, moral support all welcome here.
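For readers wondering what an explicit wait looks like in practice, here is a minimal sketch of the kind of retry helper this approach leads to (names and timings are illustrative, not the book's exact code):

import time
from selenium.common.exceptions import WebDriverException

MAX_WAIT = 10

def wait_for(fn):
    """Retry fn until it stops raising, or until MAX_WAIT seconds have passed"""
    start_time = time.time()
    while True:
        try:
            return fn()
        except (AssertionError, WebDriverException) as e:
            if time.time() - start_time > MAX_WAIT:
                raise e
            time.sleep(0.5)

# Usage: wait_for(lambda: browser.find_element_by_id('id_list_table'))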

In the meantime, I hope you enjoy the new stuff. Keep in touch!

06 Dec 2016 5:12pm GMT

10 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: King William's Town Station

Yesterday morning I had to go to the station in KWT to pick up the bus tickets we had reserved for the Christmas holidays in Cape Town. The station itself has had no train service since December for cost reasons - but Translux and co., the long-distance bus companies, have their offices there.






© benste CC NC SA

10 Nov 2011 10:57am GMT

09 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein

Nobody is worried about this kind of thing - you just drive straight through by car, and in the city, near Gnobie: "nah, it only gets dangerous once the fire brigade is there" - 30 minutes later, on the way back, the fire brigade was there.




© benste CC NC SA

09 Nov 2011 8:25pm GMT

08 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Braai Party

Braai = a barbecue evening or something along those lines.

The would-be technicians, patching up their SpeakOn / jack plug splitters...

The ladies ("mamas") of the settlement during the official opening speech

Even though fewer people came than expected: loud music and lots of people ...

And of course a fire with real wood for the braai.

© benste CC NC SA

08 Nov 2011 2:30pm GMT

07 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux Server back to her room. While doing that we saw her new decoration.

Björn and Simphiwe carried the PC to Katja's school


© benste CC NC SA

07 Nov 2011 2:00pm GMT

06 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's host family for a special party. First of all we visited some friends of Nelisa - yes, the one I'm working with in Quigney, Katja's host father's sister - who had done her hair.

African women usually get their hair done by having extensions put in, rather than just having some hair cut off like Europeans do.

In between she looked like this...

And then she was done - it looks amazing considering the amount of hair she had last week, doesn't it?

© benste CC NC SA

06 Nov 2011 7:45pm GMT

05 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: My Saturday

Somehow it struck me today that I need to restructure my blog posts a bit - if I only ever report on new places, I would have to be on a permanent round trip. So here are a few things from my everyday life today.

First of all, Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm - Katja and Björn are now at their placements, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza.
Robin, Rommel's wife, has been in Woody Cape since Thursday to take care of a few things there.
Anyway, this morning we treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist: Vodacom and Ethienne (the estate agent), plus dropping off the missing items at Needs Camp on the way back.

Just after we had set off down the dirt road, we realised that we hadn't packed the things for Needs Camp and Ethienne, but we did have the pump for the water supply in the car.

So in East London we first drove to Farmerama - no, not the online game Farmville, but a shop with all sorts of things for a farm - in Berea, a northern suburb.

At Farmerama we got some advice on a quick-release coupling that should make life with the pump easier, and we also dropped off a lighter pump for repair, so that it isn't such a big effort every time the water runs out again.

Fego Caffé is in the Hemmingways Mall; there we had to get the PIN and PUK for one of our data SIM cards, because we had unfortunately mixed up some digits when entering the PIN. Anyway, shops in South Africa apparently store data as sensitive as a PUK - which in principle gives you access to a locked phone.

In the café Rommel then carried out a few online transactions using the 3G modem, which was working again - and which, by the way, now works perfectly in Ubuntu, my Linux system.

On the side I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltop's centres. The picture shows the UMTS coverage in Needs Camp, Katja's village. 8ta is a new phone provider from Telkom; after Vodafone bought Telkom's stake in Vodacom, they have to build their network up completely from scratch.
We decided to organise a free prepaid card to test, because who knows how accurate the coverage map above really is ... Before signing even the cheapest 24-month deal, you should know whether it actually works.

After that we went to Checkers in Vincent, looking for two hotplates for Woody Cape - R 129.00 each, so about 12 € for a two-ring hotplate.
As you can see in the background, the Christmas decorations are already up - at the beginning of November, in South Africa, in sunny, warm temperatures of at least 25°C.

We treated ourselves to lunch at a Pakistani curry takeaway - highly recommended!
Well, and after we got back an hour or so ago, I cleaned the fridge that I had simply put outside to defrost this morning. Now it's clean again, without its 3 m thick layer of ice...

Tomorrow ... yes, I'll report on that separately ... but probably not until Monday, because then I'll be back in Quigney (East London) and will have free internet.

© benste CC NC SA

05 Nov 2011 4:33pm GMT

31 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltop's computer centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

Pupils in the big classroom


The Trainer


School in Countryside


Adult Class in the Afternoon


"Town"


© benste CC NC SA

31 Oct 2011 4:58pm GMT

Benedict Stein: Technical Issues

What do you do in an internet café when your ADSL and fax line have been cut off before the end of the month? Well, my idea was to sit outside and eat some ice cream.
At least it's sunny and not as rainy as it was at the weekend.


© benste CC NC SA

31 Oct 2011 3:11pm GMT

30 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nellis Restaurant

For those travelling through Zastron - there is a very nice restaurant serving delicious food at reasonable prices.
They also sell home-made juices, jams and honey.




interior


home made specialities - the shop in the shop


the Bar


© benste CC NC SA

30 Oct 2011 4:47pm GMT

29 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: The way back from J'burg

During the 10-12 hour trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadside views.

Plain Street


Orange River in its beginnings (near Lesotho)


Zastron Anglican Church


The bridge between the "Free State" and the Eastern Cape, next to Zastron


my new Background ;)


If you listen to Google Maps you'll end up travelling 50 km of gravel road - as it had just been renewed we didn't have that many problems, and we saved an hour compared to going the official way with all its construction sites.




Freeway


getting dark


© benste CC NC SA

29 Oct 2011 4:23pm GMT

28 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: How does a road construction site actually work?

Sure, some things may be different and a lot the same - but a road construction site is an everyday sight in Germany - so how does it actually work in South Africa?

First of all - NO, no natives digging with their hands - even though more manpower is used here, they are busy working with machinery.

A perfectly normal "national road"


and how it is being widened


loooots of trucks


because here one side is completely closed over a long stretch, resulting in a traffic-light arrangement with, in this case, a 45-minute wait


But at least they seem to be having fun ;) - as did we, because luckily we never had to wait longer than 10 minutes.

© benste CC NC SA

28 Oct 2011 4:20pm GMT