24 Nov 2017

Planet Python

Codementor: Workflow with airflow

Airflow is an open source project started at Airbnb. It is a tool for dynamically orchestrating the desired workflow of your application, and it scales readily thanks to its modular design.
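
As a rough illustration, a minimal Airflow DAG might look like the sketch below; the DAG id, task names, schedule and owner are all made up:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'data-team',  # hypothetical owner
    'start_date': datetime(2017, 11, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# A DAG (directed acyclic graph) groups tasks and defines when they run.
dag = DAG('example_etl', default_args=default_args, schedule_interval='@daily')

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# `load` runs only after `extract` has succeeded.
extract >> load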

24 Nov 2017 2:34am GMT

23 Nov 2017

Planet Python

Obey the Testing Goat: Speeding up Django unit tests with SQLite, keepdb and /dev/shm

Here's the tldr version:

DATABASES = {
    'default': {
        'ENGINE': 'your db settings as normal',
        [...]
        'TEST': {
          # this gets you in-memory sqlite for tests, which is fast
          'ENGINE': 'django.db.backends.sqlite3',
        }
    }
}

# (this assumes `import sys` near the top of your settings file)
if 'test' in sys.argv and '--keepdb' in sys.argv:
    # and this allows you to use --keepdb to skip re-creating the db,
    # even faster!
    DATABASES['default']['TEST']['NAME'] = '/dev/shm/myproject.test.db.sqlite3'

More context

For day-to-day development, running your tests needs to be as fast as possible to keep you in a good workflow. Although you're unlikely to be using SQLite as your production database, using it for tests in dev is often a nice shortcut, particularly since Django will use an in-memory sqlite database for tests, which is even faster than one on disk.

But, especially if you have a large and complicated database, re-creating it with each test run can take quite a bit of time. That's where the --keepdb flag comes in. Of course, normally if you're using an in-memory database, keepdb doesn't make any sense because memory disappears between runs. That's where the sneaky trick of using /dev/shm comes in. On Linux, /dev/shm is actually a filesystem backed by your machine's RAM, and it persists between processes, until you reboot your machine.

So you get all the speed of an in-memory SQLite database, with the extra boost of not having to re-create the database.

What if the database changes,

... I hear you ask? Django is smart enough to apply any new migrations to the keepdb database if it notices them. Docs here. This works pretty well in my experience, although I have had to blow away the test db in /dev/shm manually once or twice...

Don't do this in CI

But I'm only advocating this for use in development! Ultimately, Postgres or whichever database you're using will behave differently from SQLite. Django does a good job of abstracting away 90% of those differences, but that still leaves plenty of strange edge case behaviours to do with default values, ordering and transactions that can easily trip you up.

Make sure you always run your test suite in CI against the real database.

You could, for example, use an environment variable to make sure of that:

if os.environ.get('CI'):
    del DATABASES['default']['TEST']

Some numbers:

Here's a subset of the PythonAnywhere tests running, first without keepdb:

»»»» time ./manage.py test console
Creating test database for alias 'default'...
................................................................................
................................................................................
...........................................................
----------------------------------------------------------------------
Ran 219 tests in 5.261s

OK
Destroying test database for alias 'default'...
10.86user 0.25system 0:11.12elapsed 99%CPU 

And now with keepdb:

»»»» time ./manage.py test console --keepdb
Using existing test database for alias 'default'...
................................................................................
................................................................................
...........................................................
----------------------------------------------------------------------
Ran 219 tests in 5.557s

OK
Preserving test database for alias 'default'...
6.28user 0.36system 0:06.66elapsed 99%CPU 

Notice that the time Django reports is almost the same in both cases, but the actual elapsed time is quite different -- that's because Django isn't counting the time spent re-creating the database at the beginning of the test run.

(Also, be aware that if you're running your own tests here, you will only see an improvement the second time you run with --keepdb; on the first run it still has to create the database.)

Your mileage may vary

On a different project with simpler models I see very different results:

»»»» time ./manage.py test opera.tests.test_pages
[...]
Ran 65 tests in 7.620s
15.44user 0.08system 0:15.53elapsed 99%CPU

»»»» time ./manage.py test opera.tests.test_pages --keepdb
[...]
Ran 65 tests in 7.535s
14.91user 0.10system 0:15.02elapsed 99%CPU

Really not that much in it! Although that's on my laptop with a nice fast processor and SSD. Differences are more pronounced (on both projects) on a system with a slower CPU and filesystem:

$ time ./manage.py test opera.tests.test_pages
Creating test database for alias 'default'...
[...]
Ran 65 tests in 13.620s
real    0m25.720s

$ time ./manage.py test opera.tests.test_pages --keepdb
Using existing test database for alias 'default'...
[...]
Ran 65 tests in 10.648s
real    0m20.632s

Linux only!

/dev/shm only exists on Linux and some other Unix-like operating systems. A bit of googling might help you find alternatives on Windows and macOS though -- I can't vouch for these, but here are the first two links I found while duckduckgoing:

More tips

Let me know if these help you!

23 Nov 2017 6:32pm GMT

PyCharm: PyCharm 2017.3 RC

PyCharm 2017.3's EAP phase has come to an end, and we're happy to announce the Release Candidate for PyCharm 2017.3.

Get PyCharm 2017.3 RC

Improvements in This Version

If these features sound interesting to you, try them yourself:

Get PyCharm 2017.3 RC

If you are using a recent version of Ubuntu (16.04 and later) you can also install PyCharm EAP versions using snap:

sudo snap install [pycharm-professional | pycharm-community] --classic --candidate

If you already used snap for the previous version, you can update using:

sudo snap refresh [pycharm-professional | pycharm-community] --classic --candidate

The Release Candidate is not an EAP build. You will need a license for PyCharm 2017.3 RC Professional Edition; if you don't have one, you will get a 30-day trial when you start it.

If you run into any issues with this version, or another version of PyCharm, please let us know on our YouTrack. If you have other suggestions or remarks, you can reach us on Twitter, or by commenting on the blog.

23 Nov 2017 4:14pm GMT

PyCharm: Results of the Django/PyCharm Fundraising Effort 2017

We're happy to report that our second iteration of the Django/PyCharm fundraising campaign - which we ran this summer - was a huge success. This year we helped raise a total of $66,094 USD for the Django Software Foundation! Last year (2016) we ran a similar campaign which resulted in a collective contribution of $50,000 USD to the cause. We're happy we could raise even more money this year for the Django community!

If you missed the campaign, here's the essence of the promotion: for 3 weeks this summer, Django developers could effectively donate to the Django Software Foundation by purchasing a new individual PyCharm Professional annual subscription at 30% off, with all proceeds from the sales going to the Django Software Foundation. Read more details here.

All the money raised goes toward Django outreach and diversity programs: supporting DSF, the Django Fellowship program, Django Girls workshops, sponsoring official Django conferences, and other equally incredible projects.

We want to say huge thanks to the DSF for their active collaboration and making this fundraiser happen. We hope that in 2018 we'll be able to make this yearly event even more successful!

The DSF general fundraising campaign is still ongoing, and we encourage everyone to contribute to the success of Django by donating to the DSF directly.

If you have any questions, get in touch with us at fundraising@djangoproject.com or JetBrains at pycharm-support@jetbrains.com.

23 Nov 2017 3:02pm GMT

Tryton News: Live streaming for the Tryton Unconference 2017

TUL2017

The Tryton Unconference 2017 is only two weeks away. Registration is still open. But for those who cannot make it, we will broadcast a live stream on our YouTube channel.

The first day, 7th December 2017, is dedicated to business oriented talks.

The second day, 8th December 2017, is focused on developer talks.

Information and registration are available at https://tul2017.tryton.org/. If you have questions about the organisation, please contact the foundation at foundation@tryton.org.

And don't forget to spread the word! #TUL2017

23 Nov 2017 12:00pm GMT

"Menno's Musings": IMAPClient 1.1.0

IMAPClient 1.1.0 has just been released! Many thanks to the recent contributors and the project's other maintainers, Nicolas Le Manchet and Maxime Lorant. This release is full of great stuff because of them. Here's some highlights:

  • search now supports nested criteria so that more complex criteria can be expressed. IMAPClient will add parentheses in the right place.
  • PLAIN authentication support
  • UNSELECT support
  • ENABLE support
  • UID EXPUNGE support
  • IMAP modified UTF-7 encoding/decoding now works correctly in all cases
  • the mock package is no longer installed by default (it is now just a test dependency)
  • many, many bug fixes

See the release notes for more details.
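
As a rough sketch of the new nested-criteria search (the server address, credentials and sender names below are all made up):

from datetime import date

from imapclient import IMAPClient

# Hypothetical server and credentials.
server = IMAPClient('imap.example.com', use_uid=True, ssl=True)
server.login('user@example.com', 'secret')
server.select_folder('INBOX')

# Nested lists become parenthesised IMAP search criteria, i.e. this sends:
#   SINCE 1-Jan-2017 (OR FROM "alice" FROM "bob")
messages = server.search(
    ['SINCE', date(2017, 1, 1), ['OR', 'FROM', 'alice', 'FROM', 'bob']]
)
print('%d matching messages' % len(messages))

server.logout()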

Much work has already gone into the 2.0 release and it won't be too far behind 1.1.0. The headline change there is reworking IMAPClient's handling of TLS.

23 Nov 2017 11:11am GMT

Reinout van Rees: Large scale search for social sciences - Wouter van Atteveldt

(Summary of a talk at a Dutch Python meetup). Full title: large scale search and text analysis with Python, Elastic, Celery and a bit of R.

Wouter teaches political sciences at the university and uses a lot of text analysis. He showed an example of text analysis of the Gaza-Israel conflict, comparing USA media talking about Israel and Chinese media talking about Israel. You saw big differences! The USA media talks more about rocket attacks by Hamas on Israel. Chinese media talks more about the results on the ground: invasion, deaths, etc.

Text analysis is very important for social sciences. There's a flood of digital information, online and archived. Major newspapers in the Netherlands have been digitized by the royal library, for instance. Lots of text to work with. You see the same with Facebook and the like: they can extract lots of info from the texts people type in!

Facebook once did an experiment on positive/negative tweaks to timelines. Totally bullshit from a scientific viewpoint. So we cannot leave social research to the likes of Facebook. So.... there is a need for a good tool, especially for education. They built it in Python.

Why Python? Open source, platform independent, relatively easy to learn. Large community. Lots of tools like Django and NumPy, each with a large community around it.

He also uses R. Also open source. Mostly a question of "go where the users are" as most social scientists are used to statistical languages like R.

What they built is https://amcat.nl, the Amsterdam Content Analysis Toolkit. He demoed it with textual searches in old newspaper articles. Nice graphs, for instance, with occurrences of words in the various years.

AmCAT is written in Python and Django. Fully open source. Postgres, Celery. The search happens with Elasticsearch. The articles are all in Elasticsearch, but for safety's sake they're also kept in Postgres. Perhaps not needed, but...

They use Django management commands for a reasonably friendly command line interface, both for maintenance commands and for queries.
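
For illustration, a management command in that style might look roughly like this; the app, file and command names are hypothetical:

# myapp/management/commands/query_articles.py  (hypothetical path)
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run a query against the article index"

    def add_arguments(self, parser):
        parser.add_argument('query', help='search terms')

    def handle(self, *args, **options):
        # A real command would hand the query to the Elasticsearch backend here.
        self.stdout.write("Searching for: %s" % options['query'])

That makes the command callable as ./manage.py query_articles "some terms".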

Queries can be quite elaborate and take quite some time -- longer than the regular web timeout. For that, they use Celery to offload long-running tasks.
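
A minimal Celery sketch of that pattern (the broker URL and task body are invented):

# tasks.py -- illustrative only
from celery import Celery

app = Celery('amcat_tasks', broker='redis://localhost:6379/0')


@app.task
def run_long_query(query):
    # Imagine this hits Elasticsearch and takes minutes rather than seconds.
    return {'query': query, 'hits': 0}


# The web view only enqueues the work and returns immediately:
#   result = run_long_query.delay('rocket attacks')
# and a later request can poll result.ready() or call result.get().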

Text analysis: for that you need NLP, Natural Language Processing. There are many different good tools like CoreNLP for English, Alpino for Dutch, ParZU for German. Only.... they're hard to use for social scientists: hard to install and operate.

What they built themselves is NLPipe, a simple (Flask) website for interacting with the NLP libraries, working as a job manager. You get the results out as CSV or JSON.


Unrelated advertisement: fancy programming Python with us in the heart of Utrecht? Drop me an email :-)

23 Nov 2017 10:06am GMT

Reinout van Rees: Building robust commandline tools with click and flask - Wojtek Burakiewicz

(Summary of a talk at a Dutch Python meetup).

Wojtek likes simple tools. His presentation is about a recent tool they built: an internal tool that deploys Python applications to servers with virtualenvs. It also configures nginx.

They based the architecture around microservices. So you basically write a simple script that talks to a couple of APIs that in turn talk to other APIs.

They started out with Stash, Artifactory, Ansible, Jenkins, Supervisord and JIRA. All except JIRA have a nice API. Their deploy tool talked to those APIs. One problem was authentication. One tool needs user/pass, the other an auth token, the next an SSH key... Another problem was network segmentation. You don't want every laptop to talk to your production environment...

The solution was to use one intermediary API. So the command line tool talks to the intermediary API, which in turn talks to the APIs mentioned above.

Another advantage of an intermediary API is that you can unify the concepts. You can just talk about "host" and "application", even though in Jenkins/Ansible/etc it might be a "job" or a "machine".

You can also exchange components! You can switch from Stash to Bitbucket without the user of the deploy tool noticing.

They used Flask. If you compare it to Django:

  • When you install Django, you get a house.
  • When you install Flask, you get a single brick.

There are all sorts of libraries you can use to get all the functionality you would get with Django -- only you get to install and compose it yourself.

For the command line, he used click, a nice alternative to argparse (he dislikes argparse). From the homepage: Click is a Python package for creating beautiful command line interfaces in a composable way with as little code as necessary.
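
A tiny sketch of what a click-based deploy command could look like; the command, options and messages are invented:

# deploy_cli.py -- illustrative only
import click


@click.command()
@click.argument('application')
@click.option('--host', default='staging', help='Target host to deploy to.')
@click.option('--version', default='latest', help='Version to deploy.')
def deploy(application, host, version):
    """Deploy APPLICATION via the intermediary API."""
    click.echo('Deploying %s %s to %s' % (application, version, host))


if __name__ == '__main__':
    deploy()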

When you distribute command line tools, people will have older versions. So you need to get the users to upgrade in some way.

They did it in three ways:

  • By returning a special header from the API with the desired command line version. The command line tool can then ask the user to upgrade when needed (see the sketch after this list).
  • The command line tool also passes its version to the API. If the API needs a newer version, it can return an error code.
  • If an update was needed, it would also print out the necessary pip command as an extra service.
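
As a sketch of the first two approaches (the URL, header names, versions and package name are all invented):

# version_check.py -- illustrative only
import requests

CLIENT_VERSION = '1.4.0'

# The client announces its own version with every request...
resp = requests.get('https://deploy-api.example.com/api/v1/ping',
                    headers={'X-Client-Version': CLIENT_VERSION})

# ...and the API replies with the version it would like clients to run.
wanted = resp.headers.get('X-Desired-Client-Version')
if wanted and wanted != CLIENT_VERSION:
    print('Please upgrade: pip install --upgrade deploy-tool==%s' % wanted)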

Unrelated advertisement: fancy programming Python with us in the heart of Utrecht? Drop me an email :-)

23 Nov 2017 10:06am GMT

eGenix.com: PyDDF Python Herbst Sprint 2017

The following text is in German, since we're announcing a Python sprint in Düsseldorf, Germany.

Ankündigung

PyDDF Python Herbst Sprint 2017 in
Düsseldorf


Samstag, 25.11.2017, 10:00-18:00 Uhr
Sonntag, 26.11.2017, 10:00-18:00 Uhr

trivago GmbH, Karl-Arnold-Platz 1A, 40474 Düsseldorf

Informationen

Das Python Meeting Düsseldorf (PyDDF) veranstaltet mit freundlicher Unterstützung der trivago GmbH ein Python Sprint Wochenende im November.

Der Sprint findet am Wochenende 25./26.11.2017 in der trivago Niederlassung am Karl-Arnold-Platz 1A statt (nicht am Bennigsen-Platz 1).

Folgende Themengebiete haben wir als Anregung angedacht:

Natürlich kann jeder Teilnehmer weitere Themen vorschlagen.

Anmeldung und weitere Infos

Alles weitere und die Anmeldung findet Ihr auf der Sprint Seite:

Teilnehmer sollten sich zudem auf der PyDDF Liste anmelden, da wir uns dort koordinieren:

Über das Python Meeting Düsseldorf

Das Python Meeting Düsseldorf ist eine regelmäßige Veranstaltung in Düsseldorf, die sich an Python Begeisterte aus der Region wendet.

Einen guten Überblick über die Vorträge bietet unser PyDDF YouTube-Kanal, auf dem wir Videos der Vorträge nach den Meetings veröffentlichen.

Veranstaltet wird das Meeting von der eGenix.com GmbH, Langenfeld, in Zusammenarbeit mit Clark Consulting & Research, Düsseldorf.

Viel Spaß !

Marc-Andre Lemburg, eGenix.com

23 Nov 2017 9:00am GMT

22 Nov 2017

Planet Python

Talk Python to Me: #139 Paths into a data science career

Data science is one of the fastest growing segments of software development. It takes a slightly different set of skills than your average full-stack development job. This means there's a big opportunity to get into data science. But how do you get into the industry?

22 Nov 2017 8:00am GMT

Python Bytes: #53 Getting started with devpi and Git Virtual FS

22 Nov 2017 8:00am GMT

Dataquest: Setting Up the PyData Stack on Windows

The speed of modern electronic devices allows us to crunch large amounts of data at home. However, these devices require the right software in order to reach peak performance. Luckily, it's now easier than ever to set up your own data science environment.

One of the most popular stacks for data science is PyData, a collection of software packages within Python. Python is one of the most common languages in data science, largely thanks to its wide selection of user-made packages.

In this tutorial, we'll show you how to set up a fully functional PyData stack on your local Windows machine. This will give you full control over your installed environment and give you a first taste of what you'll need to know when setting up more advanced configurations in the cloud. To install the stack, we'll be making use of Anaconda, a popular Python distribution released by Continuum Analytics. It contains all the packages and tools to get started with data science, including Python packages and editors.
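
For example, creating a dedicated environment from the Anaconda Prompt might look something like this; the environment name and package list are arbitrary:

conda create --name pydata python=3.6 numpy scipy pandas matplotlib jupyter
activate pydata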

By default, Anaconda...

22 Nov 2017 7:00am GMT

21 Nov 2017

Planet Python

Stack Abuse: Introduction to Regular Expressions in Python

In this tutorial we are going to learn about using regular expressions in Python, including their syntax, and how to construct them using built-in Python modules. To do this we'll cover the different operations in Python's re module, and how to use it in your Python applications.

What are Regular Expressions?

Regular expressions are basically just a sequence of characters that can be used to define a search pattern for finding text. This "search engine" is embedded within the Python programming language (and many other languages as well) and made available through the re module.

To use regular expressions (or "regex" for short) you usually specify the rules for the set of possible strings that you want to match and then ask yourself questions such as "Does this string match the pattern?", or "Is there a match for the pattern anywhere in this string?".

You can also use regexes to modify a string or to split it apart in various ways. These "higher order" operations all start by first matching text with the regex string, and then the string can be manipulated (like being split) once the match is found. All this is made possible by the re module available in Python, which we'll look at further in some later sections.

Regular Expression Syntax

A regular expression specifies a pattern that aims to match the input string. In this section we'll show some of the special characters and patterns you can use to match strings.

Matching Characters

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. There are also special characters which don't simply match themselves, i.e. ^, $, *, +, ?, {, }, [, ], \, |, (, and ). This is because they are used for higher-order matching functionality, which is described further in this table:

Metacharacter Description
* Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. (ab)* matches "", "ab", "abab", "ababab", and so on.
+ Matches the preceding element one or more times. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
? Matches the preceding element zero or one time. For example, ab?c matches only "ac" or "abc".
| The choice (also known as alternation or set union) operator matches either the expression before or the expression after this operator. For example, abc|def can match either "abc" or "def".
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".
^ Matches the starting position in the string, like the startsWith() function. In line-based tools, it matches the starting position of any line.
$ Matches the ending position of the string or the position just before a string-ending newline, like the endsWith() function. In line-based tools, it matches the ending position of any line.

Credit to Wikipedia for some of the regex descriptions.

Regular Expressions Methods in Python

There are several methods available to use regular expressions. Here we are going to discuss some of the most commonly used methods and also give a few examples of how they are used. These methods include:

  1. re.match()
  2. re.search()
  3. re.findall()
  4. re.split()
  5. re.sub()
  6. re.compile()

re.match(pattern, string, flags=0)

This expression is used to match a character or set of characters at the beginning of a string. It's also important to note that this expression will only match at the beginning of the string and not at the beginning of each line if the given string has multiple lines.

The expression below will return None because Python does not appear at the beginning of the string.

# match.py

import re  
result = re.match(r'Python', 'It\'s  easy to learn Python. Python also has elegant syntax')

print(result)  

$ python match.py
None  

re.search(pattern, string)

This method checks for a match anywhere in the given string and returns the result if one is found, or None if not.

In the following code we are simply trying to find if the word "puppy" appears in the string "Daisy found a puppy".

# search.py

import re

if re.search("puppy", "Daisy found a puppy."):  
    print("Puppy found")
else:  
    print("No puppy")

Here we first import the re module and use it to search for the occurrence of the substring "puppy" in the string "Daisy found a puppy". If it does exist in the string, a match object is returned, which is considered "truthy" when evaluated in an if-statement.

$ python search.py 
Puppy found  

re.compile(pattern, flags=0)

This method is used to compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, which we have discussed above. This can also save time since parsing/handling regex strings can be computationally expensive to run.

# compile.py

import re

pattern = re.compile('Python')  
result = pattern.findall('Pythonistas are programmers that use Python, which is an easy-to-learn and powerful language.')

print(result)

find = pattern.findall('Python is easy to learn')

print(find)  

$ python compile.py 
['Python', 'Python']
['Python']

Notice that only the matched string is returned, as opposed to the entire word in the case of "Pythonistas". This is more useful when using a regex string that has special match characters in it.

re.sub(pattern, repl, string)

As the name suggests, this method searches for the pattern and substitutes the replacement string wherever the pattern occurs.

# sub.py

import re  
result = re.sub(r'python', 'ruby', 'python is a very easy language')

print(result)  

$ python sub.py 
ruby is a very easy language  

re.findall(pattern, string)

As you've seen prior to this section, this method finds and retrieves a list of all occurrences in the given string. It combines both the functions and properties of re.search() and re.match(). The following example will retrieve all the occurrences of "Python" from the string.

# findall.py

import re

result = re.findall(r'Python', 'Python is an easy to learn, powerful programming language. Python also has elegant syntax')  
print(result)  

$ python findall.py 
['Python', 'Python']

Again, using an exact match string like this ("Python") is really only useful for finding if the regex string occurs in the given string, or how many times it occurs.

re.split(pattern, string, maxsplit=0, flags=0)

This expression will split a string at the locations in which the specified pattern occurs. It will also return the text of all groups in the pattern if an advanced feature like capturing parentheses is used in the pattern.

# split.py

import re

result =  re.split(r"y", "Daisy found a puppy")

if result:  
    print(result)
else:  
   print("No puppy")

As you can see from the output below, the character "y" occurs twice in the string, and re.split() has split the string at each occurrence.

$ python split.py 
['Dais', ' found a pupp', '']

Practical uses of Regular Expressions

Whether you know it or not, we use regular expressions almost daily in our applications. Since regular expressions are available in just about every programming language, it's not easy to escape their usage. Let's look at some of the ways regular expressions can be used in your applications.

Constructing URLs

Every web page has a URL. Now imagine you have a Django website with an address like "http://www.example.com/products/27/", where 27 is the ID of a product. It would be very cumbersome to write a separate URL pattern for every single product.

However, with regular expressions, we can create a pattern that will match the URL and extract the ID for us:

An expression that will match and extract any numerical ID could be ^products/(\d+)/$.
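
For instance, in a (pre-2.0, regex-based) Django URLconf this could look roughly like the following; the view and route names are hypothetical:

# urls.py -- illustrative only
from django.conf.urls import url

from . import views

urlpatterns = [
    # The captured group (\d+) is passed to the view as the product ID.
    url(r'^products/(\d+)/$', views.product_detail, name='product-detail'),
]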

Validating Email Addresses

Every authentication system requires users to sign up and log in before they can be allowed access to the system. We can use a regular expression to check whether a supplied email address is in a valid format.

# validate_email.py

import re

email = "example@gmail.com"

if not re.match(re.compile(r'^.+@[^.].*\.[a-z]{2,10}$', flags=re.IGNORECASE), email):  
    print("Enter a valid email address")
else:  
    print("Email address is valid")

As you can see, this is a pretty complicated regex string. Let's break it down a bit using the example email address in the code above. It basically means the following:

  • ^.+ -- one or more of any character at the start of the string (the "example" part)
  • @ -- a literal "@" symbol
  • [^.] -- the first character after the "@" must not be a dot (the "g")
  • .* -- followed by any number of characters ("mail")
  • \. -- a literal dot separating the domain from the top-level domain
  • [a-z]{2,10}$ -- the string must end with 2 to 10 letters ("com"), matched case-insensitively thanks to re.IGNORECASE

So, as you'd expect, the code matches our example address:

$ python validate_email.py 
Email address is valid  

Validating Phone Numbers

The following example is used to validate a list of prefixed Canadian numbers:

# validate_numbers.py

import re

numbers = ["+18009592809", "=18009592809"]

for number in numbers:  
    if not re.match(re.compile(r"^(\+1?[-. ]?(\d+))$"), number):
        print("Number is not valid")
    else:
        print("Number is valid")

$ python validate_numbers.py 
Number is valid  
Number is not valid  

As you can see, because the second number uses a "=" character instead of "+", it is deemed invalid.

Filtering Unwanted Content

Regular expressions can also be used to filter certain words out of post comments, which is particularly useful in blog posts and social media. The following example shows how you can detect pre-selected words that users should not use in their comments.

# filter.py

import re

curse_words = ["foo", "bar", "baz"]  
comment = "This string contains a foo word."  
curse_count = 0

for word in curse_words:  
    if re.search(word, comment):
        curse_count += 1

print("Comment has " + str(curse_count) + " curse word(s).")  

$ python filter.py 
Comment has 1 curse word(s).  

Conclusion

This tutorial has covered what is needed to be able to use regular expressions in any application. Feel free to consult the documentation for the re module, which has a ton of resources to help you accomplish your application's goals.

21 Nov 2017 5:32pm GMT

NumFOCUS: My Favourite Tool: Jupyter Notebook

re-posted with permission from Software Carpentry My favourite tool is … the Jupyter Notebook. One of my favourite tools is the Jupyter notebook. I use it for teaching my students scientific computing with Python. Why I like it: Using Jupyter with the plugin RISE, I can create presentations including code cells that I can edit and execute live during […]

21 Nov 2017 2:22pm GMT

PyCon: Tutorials Due Friday, Ticket Sales Moving Quickly

Tutorial Deadline Approaching!

We're only a few days away from the deadline for PyCon tutorials (November 24), so be sure to get your submissions in by the end of the day Friday! We're looking for all sorts of tutorials to help our community learn and level up, so be sure to check out our Call for Proposals for more details and enter your proposal in your dashboard soon!

Talk, Poster, and Education Summit proposals are due January 3. See https://us.pycon.org/2018/speaking for all of your proposal needs.

Registration

We opened our registration a few weeks ago and are 40% of the way through our Early Bird pricing, which gives discounted rates to the first 800 tickets sold. For corporate tickets, you'll save over 20% by buying early, and individuals save over 12%. The regular $125 student tickets are dropped to $100 during early sales. Click here for more details and to register!

If our Financial Aid program can help you attend PyCon, we encourage your application. We'll be accepting applications through February 15, 2018, with more details available at https://us.pycon.org/2018/financial-assistance.

Sponsorship

We're looking for sponsors to join us for PyCon 2018! Check out our prospectus for the details and contact pycon-sponsors@python.org!

21 Nov 2017 11:28am GMT

Matthew Rocklin: Dask Release 0.16.0

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I'm pleased to announce the release of Dask version 0.16.0. This is a major release with new features, breaking changes, and stability improvements. This blogpost outlines notable changes since the 0.15.3 release on September 24th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Conda packages are available on both conda-forge and default channels.

Full changelogs are available here:

Some notable changes follow.

Breaking Changes

Dask collection interface

It is now easier to implement custom collections using the Dask collection interface.

Dask collections (arrays, dataframes, bags, delayed) interact with Dask schedulers (single-machine, distributed) with a few internal methods. We formalized this interface into protocols like .__dask_graph__() and .__dask_keys__() and have published that interface. Any object that implements the methods described in that document will interact with all Dask scheduler features as a first-class Dask object.

class MyDaskCollection(object):
    def __dask_graph__(self):
        ...

    def __dask_keys__(self):
        ...

    def __dask_optimize__(self, ...):
        ...

    ...

This interface has already been implemented within the XArray project for labeled and indexed arrays. Now all XArray classes (DataSet, DataArray, Variable) are fully understood by all Dask schedulers. They are as first-class as dask.arrays or dask.dataframes.

import xarray as xa
from dask.distributed import Client

client = Client()

ds = xa.open_mfdataset('*.nc', ...)

ds = client.persist(ds)  # XArray objects integrate seamlessly with Dask schedulers

Work on Dask's collection interfaces was primarily done by Jim Crist.

Bandwidth and Tornado 5 compatibility

Dask is built on the Tornado library for concurrent network programming. In an effort to improve inter-worker bandwidth on exotic hardware (Infiniband), Dask developers are proposing changes to Tornado's network infrastructure.

However, in order to use these changes Dask itself needs to run on the next version of Tornado in development, Tornado 5.0.0, which breaks a number of interfaces on which Dask has relied. Dask developers have been resolving these and we encourage other PyData developers to do the same. For example, neither Bokeh nor Jupyter work on Tornado 5.0.0-dev.

Dask inter-worker bandwidth is peaking at around 1.5-2GB/s on a network theoretically capable of 3GB/s. GitHub issue: pangeo #6

Dask worker bandwidth

Network performance and Tornado compatibility are primarily being handled by Antoine Pitrou.

Parquet Compatibility

Dask.dataframe can use either of the two common Parquet libraries in Python, Apache Arrow and Fastparquet. Each has its own strengths and its own base of users who prefer it. We've significantly extended Dask's parquet test suite to cover each library, extending roundtrip compatibility. Notably, you can now both read and write with PyArrow.

df.to_parquet('...', engine='fastparquet')
df = dd.read_parquet('...', engine='pyarrow')

There is still work to be done here. The variety of parquet reader/writers and conventions out there makes completely solving this problem difficult. It's nice seeing the various projects slowly converge on common functionality.

This work was jointly done by Uwe Korn, Jim Crist, and Martin Durant.

Retrying Tasks

One of the most requested features for the Dask.distributed scheduler is the ability to retry failed tasks. This is particularly useful to people using Dask as a task queue, rather than as a big dataframe or array.

future = client.submit(func, *args, retries=5)

Task retries were primarily built by Antoine Pitrou.

Transactional Work Stealing

The Dask.distributed task scheduler performs load balancing through work stealing. Previously this would sometimes result in the same task running simultaneously in two locations. Now stealing is transactional, meaning that it will avoid accidentally running the same task twice. This behavior is especially important for people using Dask tasks for side effects.

It is still possible for the same task to run twice, but now this only happens in more extreme situations, such as when a worker dies or a TCP connection is severed, neither of which are common on standard hardware.

Transactional work stealing was primarily implemented by Matthew Rocklin.

New Diagnostic Pages

There is a new set of diagnostic web pages available in the Info tab of the dashboard. These pages provide more in-depth information about each worker and task, but are not dynamic in any way. They use Tornado templates rather than Bokeh plots, which means that they are less responsive but are much easier to build. This is an easy and cheap way to expose more scheduler state.

Task page of Dask's scheduler info dashboard

Nested compute calls

Calling .compute() within a task now invokes the same distributed scheduler. This enables writing more complex workloads with less thought to starting worker clients.

import dask
from dask.distributed import Client
client = Client()  # only works for the newer scheduler

@dask.delayed
def f(x):
    ...
    return dask.compute(...)  # can call dask.compute within delayed task

dask.compute([f(i) for ...])

Nested compute calls were primarily developed by Matthew Rocklin and Olivier Grisel.

More aggressive Garbage Collection

The workers now explicitly call gc.collect() at various times when under memory pressure and when releasing data. This helps to avoid some memory leaks, especially when using Pandas dataframes. Doing this carefully proved to require a surprising degree of detail.

Improved garbage collection was primarily implemented and tested by Fabian Keller and Olivier Grisel, with recommendations by Antoine Pitrou.

Related projects

Dask-ML

A variety of Dask Machine Learning projects are now being assembled under one unified repository, dask-ml. We encourage users and researchers alike to read through that project. We believe there are many useful and interesting approaches contained within.

The work to assemble and curate these algorithms is primarily being handled by Tom Augspurger.

XArray

The XArray project for indexed and labeled arrays is also releasing their major 0.10.0 release this week, which includes many performance improvements, particularly for using Dask on larger datasets.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.3 release on September 24th:

The following people contributed to the dask/distributed repository since the 1.19.1 release on September 24th:

The following people contributed to the dask/dask-ml repository

In addition, we are proud to announce that Olivier Grisel has accepted commit rights to the Dask projects. Olivier has been particularly active on the distributed scheduler, and on related projects like Joblib, SKLearn, and Cloudpickle.

21 Nov 2017 12:00am GMT

10 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: King Willams Town Station

Yesterday morning I had to go to the station in KWT to pick up the bus tickets we had reserved for the Christmas holidays in Cape Town. The station itself has had no train service since December for cost reasons, but Translux and co, the long-distance bus operators, have their offices there.






© benste CC NC SA

10 Nov 2011 10:57am GMT

09 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein

Nobody is worried about something like this - you simply drive through by car, and in the city, near Gnobie: "no, that only gets dangerous once the fire brigade is there" - 30 minutes later, on the way back, the fire brigade was there.




© benste CC NC SA

09 Nov 2011 8:25pm GMT

08 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Brai Party

Brai = a braai, i.e. a barbecue evening or the like.

Would-be technicians, patching up their SpeakOn / jack plug splitters...

The ladies, the "Mamas" of the settlement, during the official opening speech

Even though fewer people turned up than expected: loud music and lots of people ...

And of course a fire with real wood for the braai.

© benste CC NC SA

08 Nov 2011 2:30pm GMT

07 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux Server back to her room. While doing that we saw her new decoration.

Björn, Simphiwe carried the PC to Katja's school


© benste CC NC SA

07 Nov 2011 2:00pm GMT

06 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's guest family for a special party. First of all we visited some friends of Nelisa - yes, the one I'm working with in Quigney, Katja's guest father's sister - who gave her a haircut.

African women usually get their hair done by arranging extensions, rather than just having it cut like Europeans.

In between she looked like this...

And then she was done - it looks amazing considering the amount of hair she had last week, doesn't it?

© benste CC NC SA

06 Nov 2011 7:45pm GMT

05 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: My Saturday

Somehow it occurred to me today that I need to restructure my blog posts a bit - if I only ever report on new places, I would have to be on a permanent round trip. So here are a few things from my everyday life today.

First of all: Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm - Katja and Björn are now at their placements, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza.
Robin, Rommel's wife, has been in Woodie Cape since Thursday to take care of a few things there.
Anyway, this morning we treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist: Vodacom and Ethienne (the estate agent), plus dropping off the missing items at NeedsCamp on the way back.

Just after setting off on the dirt road we realised that we had not packed the things for Needscamp and Ethienne, but we did have the pump for the water supply in the car.

So in East London we first drove to Farmerama - no, not the online game Farmville, but a shop with all kinds of things for a farm - in Berea, a northern part of town.

At Farmerama we got advice on a quick coupling that should make life with the pump easier, and we also dropped off a lighter pump for repair, so that it is not such a big effort every time the water runs out.

Fego Caffé is in the Hemmingways Mall; there we had to get the PIN and PUK of one of our data SIM cards, because unfortunately some digits got swapped when entering the PIN. In any case, the shops in South Africa store data as sensitive as a PUK, which essentially gives access to a locked phone.

In the café Rommel then did a few online transactions with the 3G modem, which was working again - and which, by the way, now works perfectly in Ubuntu, my Linux system.

On the side I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltop's centres. The picture shows the UMTS coverage in NeedsCamp, Katja's village. 8ta is a new phone provider from Telkom; after Vodafone bought Telkom's stake in Vodacom, they have to build everything up from scratch.
We decided to organise a free prepaid card to test, because who knows how accurate the coverage map above really is ... Before signing even the cheapest 24-month deal, you should know whether it works.

After that we went to Checkers in Vincent, looking for two hotplates for WoodyCape - R 129.00 each, so about 12 € for a two-plate cooker.
As you can see in the background, there are already Christmas decorations - at the beginning of November, and that in South Africa at a sunny, warm 25 °C and above.

For lunch we treated ourselves to a Pakistani curry takeaway - highly recommended!
Well, and after we got back an hour or so ago, I cleaned the fridge that I had simply put outside to defrost this morning. Now it is clean again and free of its 3 m thick layer of ice...

Tomorrow ... I will report on that separately ... but probably not until Monday, because then I will be back in Quigney (East London) and have free internet.

© benste CC NC SA

05 Nov 2011 4:33pm GMT

31 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltop's computer centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

Pupils in the big classroom


The Trainer


School in Countryside


Adult Class in the Afternoon


"Town"


© benste CC NC SA

31 Oct 2011 4:58pm GMT

Benedict Stein: Technical Issues

What do you do in an internet cafe when your ADSL and fax line have been cut off before month's end? Well, my idea was to sit outside and eat some ice cream.
At least it's sunny and not as rainy as on the weekend.


© benste CC NC SA

31 Oct 2011 3:11pm GMT

30 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nellis Restaurant

For those traveling through Zastron - there is a very nice restaurant which serves delicious food at reasonable prices.
In addition they sell home-made juices, jams and honey.




interior


home made specialities - the shop in the shop


the Bar


© benste CC NC SA

30 Oct 2011 4:47pm GMT

29 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: The way back from J'burg

On the 10-12 hour trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadsides.

Plain Street


Orange River in its beginnings (near Lesotho)


Zastron Anglican Church


The Bridge in Between "Free State" and Eastern Cape next to Zastron


my new Background ;)


If you listen to GoogleMaps you'll end up traveling 50 km of gravel road - as it had just been renewed we didn't have that many problems, and we saved an hour compared to going the official way with all its construction sites.




Freeway


getting dark


© benste CC NC SA

29 Oct 2011 4:23pm GMT

28 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: How does a road construction site actually work?

Sure, some things may be different and a lot is the same - but how does the road construction site you see every day in Germany actually work in South Africa?

First of all: NO, there are no locals digging with their hands - even if more manpower is used here, they are busy working with machinery.

A perfectly normal "national road"


and how it is being widened


looooots of trucks


because here one side is closed completely over a long stretch, so you end up with temporary traffic lights and, in this case, a 45-minute wait


But at least they seem to be having fun ;) - as did we, since luckily we never had to wait longer than 10 minutes.

© benste CC NC SA

28 Oct 2011 4:20pm GMT