Davide Moro: Test automation framework thoughts and examples with Python, pytest and Jenkins

In this article I'll share some personal thoughts about Test Automation Frameworks; you can take inspiration from them if you are going to evaluate different test automation platforms or assess your current test automation solution (or solutions).

Although this is a generic article about test automation, you'll find many examples explaining how to address some common needs using the Python based test framework pytest and the Jenkins automation server: use the information contained here as a point of comparison and feel free to comment, sharing alternative methods or ideas coming from different worlds.

It also contains references to some well known (or less known) pytest plugins and testing libraries.

Before talking about automation and test automation framework features and characteristics let me introduce the most important test automation goal you should always keep in mind.

Test automation goals: ROI

You invest in automation for a future return on investment (ROI).
Simpler approaches let you start more quickly, but in the long term they don't perform well in terms of ROI, and vice versa. In addition, the initial complexity of a higher level of abstraction may produce better results in the medium or long term: better ROI and some benefits for non-technical testers too. Have a look at the test automation engineer ISTQB certification syllabus for more information:


So what I mean is that test automation is not easy: it is not just recording some actions or writing some automated test procedures, because how you decide to automate things affects the ROI. Your test automation strategy should consider your testers' technical skills now and how they will evolve, how to improve your system's testability (is your software testable?), good test design and architecture/system/domain knowledge. In other words, beware of vendors selling "silver bullet" solutions promising smooth test automation for everyone, especially rec&play solutions: there are no silver bullets.

Test automation solution features and characteristics

A test automation solution should be generic and flexible enough, otherwise there is the risk of having to adopt different and maybe incompatible tools for different kinds of tests. Try to imagine the mess of the following situation: one tool or commercial service for browser based tests only, based on rec&play; one tool for API testing only; performance test frameworks that don't let you reuse existing scenarios; one tool for BDD-only scenarios; different Jenkins jobs with different settings for each tool; no test management tool integration; etc. A single solution, if possible, would be better: something that lets you choose the level of abstraction instead of forcing one on you, something that lets you start simple and then follows your future needs and the skill evolution of your testers.
That's one of the reasons why I prefer pytest over a hyper-specialized solution like behave, for example: if you combine pytest+pytest-bdd you can write BDD scenarios too, and you are not forced to use a BDD-only test framework (giving up pytest's flexibility and its tons of additional plugins). A minimal sketch of how this combination can look is shown below.
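
Here is a minimal sketch of what such a pytest-bdd binding might look like; the login.feature file, the scenario name, the URL, the form fields and the browser fixture (from pytest-splinter) are assumptions used only for illustration:

# test_login_bdd.py -- hypothetical feature file, scenario and locators
from pytest_bdd import scenario, given, when, then

@scenario("login.feature", "Successful login")
def test_successful_login():
    pass  # the steps below are collected from the feature file

@given("I am on the login page")
def on_login_page(browser):
    browser.visit("https://example.com/login")  # hypothetical URL

@when("I submit valid credentials")
def submit_credentials(browser):
    browser.fill("username", "user")    # splinter-style API, hypothetical fields
    browser.fill("password", "secret")
    browser.find_by_css("input[type=submit]").first.click()

@then("I am logged in")
def logged_in(browser):
    assert browser.is_element_present_by_css(".dashboard")  # hypothetical locator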

And now, after this preamble, an unordered list of features or characteristics that you may consider for your test automation solution software selection:

Typically a test automation engineer will drive automated test runs using the framework's command line interface (CLI) during test development, but you'll find out very soon that you need an automation server for long running tests, scheduled builds and CI, and here comes Jenkins. Jenkins can also be used by non-technical testers for launching test runs or initializing an environment with some test data.

Jenkins

What is Jenkins? From the Jenkins website:

Continuous Integration and Continuous Delivery. As an extensible automation server, Jenkins can be used as a simple CI server or turned into the continuous delivery hub for any project.

So thanks to Jenkins everyone can launch a parametrized automated test session using just a browser: no command line and nothing installed on your personal computer. More power to non-technical users, thanks to Jenkins!

With Jenkins you can easily schedule recurring automated test runs, trigger parametrized test runs remotely from external software, implement CI and many other things. In addition, as we will see, Jenkins is quite easy to configure and manage thanks to its through-the-web configuration and/or Jenkins pipelines.

Basically Jenkins is very good at starting builds and jobs in general. In this case Jenkins will be in charge of launching our parametrized automated test runs.
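
To give an idea of the "trigger remotely" point, here is a minimal sketch of how an external script could queue a parametrized build through Jenkins' standard buildWithParameters endpoint; the job name, parameter names and credentials are hypothetical:

# trigger_test_run.py -- hypothetical Jenkins job, parameters and credentials
import requests

JENKINS_URL = "https://jenkins.example.com"  # assumption: your Jenkins instance
JOB = "nightly-system-tests"                 # assumption: a parametrized job

response = requests.post(
    f"{JENKINS_URL}/job/{JOB}/buildWithParameters",
    auth=("api-user", "api-token"),          # Jenkins user and API token
    params={"ENVIRONMENT": "ALPHA", "MARKERS": "integration"},
    timeout=30,
)
response.raise_for_status()  # Jenkins answers 201 Created when the build is queued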

And now let's talk a little bit about Python and the pytest test framework.

Python for testing

I don't know if there are any articles on the net with statistics about the correlation between Test Automation Engineer job offers and the Python programming language, compared with other programming languages. If you find such a resource, please share it with me!

My personal feeling, after observing many Test Automation Engineer job offers (or similar QA jobs with some automation flavor) for a while, is that the word Python is very common. Most of the time it is one of the nice-to-have requirements, and other times it is mandatory.

Let's see why the programming language of choice for many QA departments is Python, even for companies that are not using Python for building their product or solutions.

Why Python for testing

Why is Python becoming so popular for test automation? Probably because it is more approachable for people with little or no programming knowledge compared to other languages. In addition the Python community is very supportive and friendly, especially with newcomers, so if you are planning to attend any Python conference be prepared to fall in love with this fantastic community and make new friends (friends, not only connections!). For example, at the time of writing you are still in time to attend PyCon Nove 2018 in beautiful Florence (even better if you like history, good wine, good food and meeting great people):

Just compare the most classic hello world, for example in Java:

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}

and compare it with the Python version now:

print("Hello, World!")

Do you see any difference? If you are trying to explain to a non-programmer how to print a line in the terminal with Java, you'll have to introduce public, static, void, class, System, installing a runtime environment (choosing from different versions), installing an IDE, running javac, etc., and only at the end will you be able to see something printed on the screen. With Python, which comes preinstalled in many distributions, you just focus on what you need to do. Requirements: a text editor and Python installed. If you are not experienced you can start with a simple approach and progressively learn more advanced testing approaches later.

And what about test assertions? Compare for example a JavaScript based assertion:

expect(b).not.toEqual(c);

with the Python version:

assert b != c

So no expect(a).not.toBeLessThan(b), expect(c >= d).toBeTruthy() or expect(e).toBeLessThan(f): with Python you just write assert a >= b, so there is nothing to remember for assertions!
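
For instance, the JavaScript matchers above all collapse to plain asserts (a minimal sketch with arbitrary example values):

# plain Python asserts covering the matchers above
a, b, c, d, e, f = 5, 3, 10, 2, 1, 4  # arbitrary example values
assert a >= b  # expect(a).not.toBeLessThan(b)
assert c >= d  # expect(c >= d).toBeTruthy()
assert e < f   # expect(e).toBeLessThan(f)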

Python is a big fat and very powerful programming language but it follows a "pay only for what you eat" approach.

Why pytest

If Python is your language of choice you should consider the pytest framework and its high quality community plugins; I think it is a good starting point for building your own test automation solution.

The pytest framework (https://docs.pytest.org/en/latest/) makes it easy to write small tests, yet scales to support complex functional testing for applications and libraries.

Most important pytest features:

I strongly suggest having a look at the pytest documentation, but I'd like to show some examples about fixtures, code reuse, test parametrization and improved maintainability of your tests. If you are not a technical reader you can skip this section.

I'm trying to explain fixtures with practical examples based on answers and questions:

Here you can see an example of fixture parametrization (the test_smtp will be executed twice because you have 2 different fixture configurations):

import pytest
import smtplib

@pytest.fixture(scope="module",
                params=["smtp1.com", "smtp2.org"])
def smtp(request):
    smtp = smtplib.SMTP(request.param, 587, timeout=5)
    yield smtp
    print("finalizing %s" % smtp)
    smtp.close()

def test_smtp(smtp):
    # use smtp fixture (e.g., smtp.sendmail(...))
    # and make some assertions.
    # The same test will be executed twice (2 different params)
    ...

And now an example of test parametrization:

import pytest

@pytest.mark.parametrize("test_input,expected", [
    ("3+5", 8),
    ("2+4", 6),
    ("6*9", 42),
])
def test_eval(test_input, expected):
    assert eval(test_input) == expected

For more info see:

This is just core pytest; as we will see, there are many pytest plugins that extend its features.

Pytest plugins

There are hundreds of pytest plugins; the ones I use most frequently are:

Python libraries for testing:

Scaffolding tools:

Pytest + Jenkins together

We've discussed Python, pytest and Jenkins, the main ingredients for our cocktail recipe (shaken, not stirred). Optional ingredients: integration with external test management tools and Selenium grid providers.

Thanks to pytest and its plugins you have a rich command line interface (CLI); with Jenkins you can schedule automated builds, set up CI, and let non-technical users or other stakeholders execute parametrized test runs or build always-fresh test data on the fly for manual testing, etc. You just need a browser, with nothing installed on your computer.

Here you can see what our recipe looks like:


Now let's go through all the features provided by the Jenkins "build with parameters" graphical interface, explaining option by option when and why they are useful.

Target environment (ENVIRONMENT)

In this article we are not talking about regular unit tests, the basis for your testing pyramid. Instead we are talking about system, functional, API, integration, performance tests to be launched against a particular instance of an integrated system (e.g., dev, alpha or beta environments).

You know, unit tests are good but they are not sufficient: it is important to verify whether the integrated system (sometimes different complex systems developed by different teams in the same or third party organizations) works as it is supposed to. It is important because 100% unit tested systems may not play well together after integration, for many different reasons. So with unit tests you take care of your code quality, while with higher test levels you take care of your product quality. Thanks to these tests you can confirm expected product behavior or criticize your product.

So thanks to the ENVIRONMENT option you will be able to choose one of the target environments. It is important to be able to reuse all your tests and launch them against different environments without having to change your testware code. Under the hood the pytest launcher switches between environments thanks to pytest-variables parametrization, using the --variables command line option: each available option in the ENVIRONMENT select element is bound to a variables file (e.g., DEV.yml, ALPHA.yml, etc.) containing what the testware needs to know about the target environment.

Generally speaking you should be able to reuse your tests without any modification thanks to a parametrization mechanism. If your test framework doesn't let you change the target environment and forces you to modify your code, change framework. A sketch of how a test can consume these variables follows.
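
Here is a minimal sketch of how a test can consume those values through the variables fixture provided by pytest-variables; the base_url key and the reachability check are assumptions, not part of the original setup:

# test_environment.py -- assumes a base_url key inside DEV.yml / ALPHA.yml
import requests

def test_environment_is_reachable(variables):
    # 'variables' holds the dict loaded from the file passed via --variables
    base_url = variables["base_url"]  # hypothetical key
    response = requests.get(base_url, timeout=10)
    assert response.status_code == 200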

Browser settings (BROWSER)

This option makes sense only if you are going to launch browser based tests; it will be ignored for other types of tests (e.g., API or integration tests).

You should be able to select a particular browser version (latest or a specific one) if any of your tests require a real browser (not needed for API tests, to give just one example), and preferably you should be able to integrate with a cloud system that lets you use any combination of real browsers and operating systems (not only a minimal subset of versions and only Firefox and Chrome, like several online test platforms do). Thanks to the BROWSER option you can choose which browser and version to use for your browser based tests. Under the hood the pytest launcher uses the --variables command line option provided by the pytest-variables plugin, where each option is bound to a file containing the browser type, version and capabilities (e.g., FIREFOX.yml, FIREFOX-xy.yml, etc.). Thanks to pytest, or any other code based testing framework, you will be able to combine browser interactions with non-browser actions or assertions.

A big fat warning about rec&play online platforms for browser testing, and about implementing your testing strategy using only (or too many) browser based tests. You shouldn't evaluate them only on whether they provide a wide range of operating systems, versions and the most common browsers. They should also let you perform non-browser actions and assertions (interaction with queues, database interaction, HTTP POST/PUT/etc. calls, and so on). What I mean is that sometimes a browser alone is not sufficient for testing your system: it might be good for a CMS, but if you are testing an IoT platform you don't have enough control and you will write useless or low value tests (e.g., pure UI checks instead of testing reactive side effects depending on external triggers, reports, device activity simulations causing effects on the web platform under test, etc.).

In addition, be aware that some browser based online testing platforms don't use Selenium as their browser automation engine under the hood. For example, during a software selection I found an online platform that used JavaScript injection to implement user interactions inside the browser, and this can be very dangerous. Consider a login page whose input elements take a while to become ready to accept user input, only once some conditions are met. If a bug never unlocks the disabled login form behind a spinner icon, your users won't be able to log in to that platform. Using Selenium you'll get a failing result due to a timeout error (the test waits for elements that will never be ready to interact with and after a few seconds it raises an exception), and that is absolutely correct. Using that platform the test was green, because under the hood the input element interaction was implemented with direct DOM actions, with the final result of having all your users stuck: how can you trust such a platform? The sketch below shows the kind of explicit wait that makes Selenium fail correctly in this situation.
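
Here is a minimal sketch of that Selenium behavior using explicit waits; the element ids are hypothetical:

# login_page.py -- hypothetical element ids, plain Selenium explicit waits
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def login(driver, username, password):
    wait = WebDriverWait(driver, 10)
    # If the form never becomes interactable (e.g., stuck behind a spinner),
    # wait.until raises TimeoutException and the test fails, as it should.
    wait.until(EC.element_to_be_clickable((By.ID, "username"))).send_keys(username)
    wait.until(EC.element_to_be_clickable((By.ID, "password"))).send_keys(password)
    wait.until(EC.element_to_be_clickable((By.ID, "login-submit"))).click()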

OS settings (OS)

This option is useful for browser based tests too. Many Selenium grid vendors provide real browsers on real operating systems, and you can choose the desired combination of versions.

Resolution settings (RESOLUTION)

As with the options above, many vendor solutions let you choose the desired screen resolution for automated browser based testing sessions.

Select tests by names expressions (KEYWORDS)

Pytest lets you select the tests you are going to launch by specifying an expression that matches test and module names.

For example, I find it very useful to add the test management tool reference in test names; this way you will be able to launch exactly that test:

c93466

Or, for example, all test names containing the word login but not c92411:

login and not c92411

Or if you organize your tests in different modules you can just specify the folder name and you'll select all the tests that live under that module:

api

Under the hood the pytest command will be launched with -k "EXPRESSION", for example

-k "c93466"

It is used in combination with markers, a sort of test tags.

Select tests to be executed by tag expressions (MARKERS)

Markers can be used alone or in conjunction with keyword expressions. They are a sort of tag expression that lets you select just the minimum set of tests for your test run.

Under the hood the pytest launcher uses the command line syntax -m "EXPRESSION".

For example you can see a marker expression that selects all tests marked with the edit tag excluding the ones marked with CANBusProfileEdit:

edit and not CANBusProfileEdit

Or execute only edit negative tests:

edit and negative

Or all integration tests

integration

It's up to you to create granular markers for features and anything else you need for selecting your tests (e.g., functional, integration, fast, negative, ci, etc.). A minimal sketch of marker registration and usage follows.
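
Here is a minimal sketch of how such markers can be registered and applied; the marker descriptions and the test body are hypothetical, while the marker names are the ones used in the expressions above:

# pytest.ini -- registering the markers keeps pytest from warning about unknown marks
# [pytest]
# markers =
#     edit: profile editing tests
#     negative: negative tests
#     integration: integration tests

# test_profiles.py
import pytest

@pytest.mark.edit
@pytest.mark.negative
def test_edit_profile_with_invalid_name():
    ...

# selected by: pytest -m "edit and negative"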

Test management tool integration (TESTRAIL_ENABLE)

All my tests are decorated with the test case identifier provided by the test management tool; in my company we are using TestRail.

If this option is enabled the test results of executed tests will be reported in the test management tool.

Implemented using the pytest-testrail plugin.
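
As a sketch, the decoration with the pytest-testrail plugin can look like the following; the case id is hypothetical and it is worth double checking the decorator import against the plugin version you install:

# test_login.py -- hypothetical TestRail case id
from pytest_testrail.plugin import pytestrail

@pytestrail.case("C93466")
def test_login_with_valid_credentials():
    ...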

Enable debug mode (DEBUG)

The debug mode enables verbose logging.

In addition, for browser based tests it opens Selenium grid sessions with debug capabilities activated (https://www.browserstack.com/automate/capabilities): for example verbose browser console logs, video recordings, screenshots for each step, etc. In my company we are using a local installation of Zalenium and BrowserStack Automate.
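
As an illustration, the extra desired capabilities enabled by DEBUG might look like the following BrowserStack-style dictionary; the exact capability names depend on the provider and should be treated as assumptions:

# debug_capabilities.py -- assumed BrowserStack-style debug capabilities
DEBUG_CAPABILITIES = {
    "browserstack.debug": "true",       # step-by-step screenshots / visual logs
    "browserstack.console": "verbose",  # browser console logs
    "browserstack.video": "true",       # video recording of the session
}

def with_debug(capabilities, debug_enabled):
    # merge the debug capabilities into the session capabilities when DEBUG is on
    return {**capabilities, **DEBUG_CAPABILITIES} if debug_enabled else capabilities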

Block on first failure (BLOCK_FIRST_FAILURE)

This option is very useful for the following needs:

The first usage lets you gain confidence with a new build when you want to stop at the very first failure and analyze what happened.

The second usage is very helpful for:

As you can imagine you may combine this option with COUNT, PARALLEL_SESSIONS, RANDOM_ENABLE and DEBUG depending on your needs. You can test your tests' robustness too.

Under the hood it is implemented using pytest's -x option.

Parallel test executions (PARALLEL_SESSIONS)

Under the hood it is implemented with pytest-xdist's -n NUM command line option, which lets you execute your tests with the desired parallelism level.

pytest-xdist is very powerful and provides more advanced options and network distributed executions. See https://github.com/pytest-dev/pytest-xdist for further options.

Switch from different selenium grid providers (SELENIUM_GRID_URL)

For browser based testing, by default your tests will be launched against a remote grid URL. If you don't touch this option the default grid will be used (a local Zalenium or any other provider), but if needed you can easily switch provider without having to change anything in your testware.

If you want, you can save money by maintaining and using a local Zalenium as the default option; Zalenium can be configured as a Selenium grid router that dispatches to an external provider the capabilities it is not able to satisfy. This way you save money and increase the parallelism level a little without having to change plan.

Repeat test execution for a given amount of times (COUNT)

Already discussed above; it is often used in conjunction with BLOCK_FIRST_FAILURE (pytest's core -x option).

If you are trying to diagnose an intermittent failure, it can be useful to run the same test or group of tests over and over again until you get a failure. You can use py.test's -x option in conjunction with pytest-repeat to force the test runner to stop at the first failure.

Based on pytest-repeat's --count=COUNT command line option.
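
For example, a hypothetical command line for hunting an intermittent login failure could be (the keyword expression is an assumption):

$ pytest --count=50 -x -k "login"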

Enable random test ordering execution (RANDOM_ENABLE)

This option enables random test execution order.

At the moment I'm using the pytest-randomly plugin but there are 3 or 4 similar alternatives I have to try out.

By randomly ordering the tests, the risk of surprising inter-test dependencies is reduced.

Specify a random seed (RANDOM_SEED)

If you get a failure executing a randomized test run, it should be possible to reproduce it systematically by rerunning the tests in the same order with the same test data.

Again from the pytest-randomly readme:

By resetting the random seed to a repeatable number for each test, tests can create data based on random numbers and yet remain repeatable, for example factory boy's fuzzy values. This is good for ensuring that tests specify the data they need and that the tested system is not affected by any data that is filled in randomly due to not being specified.
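
In practice the RANDOM_SEED value can be passed straight to the plugin; assuming a previous failing run printed seed 123456, the reproduction command might look like this (check the flag name against your installed pytest-randomly version):

$ pytest --randomly-seed=123456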

Play option (PLAY)

This option will be discussed in a dedicated blog post I am going to write.

Basically you are able to paste a JSON serialization of actions and assertions and the pytest runner will be able to execute your test procedure.

You just need a computer with a browser to run any test (API, integration, system, UI, etc.). You can paste the steps to reproduce a bug into a JIRA issue, and everyone will be able to paste them into the Jenkins build with parameters form.

See pytest-play for further information.

If you are going to attend the next PyCon in Florence, don't miss the pytest-play talk presented by Serena Martinetti.


How to create a pytest project

If you are a little bit curious about how to install pytest or create a pytest runner with Jenkins you can have a look at the following scaffolding tool:

It provides a hello world example that lets you start with the test technique most suitable for you: plain Selenium scripts, BDD or pytest-play JSON test procedures. If you want you can also install a page objects library. So you can create a QA project in minutes.

Your QA project will be shipped with a Jenkinsfile that requires a tox-py36 Docker executor providing a Python 3.6 environment with tox already installed; unfortunately tox-py36 is not yet public, so at the moment you have to implement it on your own.
Once you provide a tox-py36 Docker executor, the Jenkinsfile will automatically create the build with parameters Jenkins form on the very first Jenkins build of your project.

Conclusions

I hope you'll find some useful information in this article: nice-to-have features for test frameworks or platforms, a little bit of curiosity about the Python world, or a new pytest plugin you never heard about.

Feedback and contributions are always welcome.

Tweets about test automation and new articles happen here:

22 Feb 2019 11:21pm GMT

Davide Moro: Hello pytest-play!

pytest-play is a rec&play (rec not yet available) pytest plugin that lets you execute a set of actions and assertions using commands serialized in JSON format. It tries to make test automation more affordable for non-programmers or non-Python programmers for browser, functional, API, integration or system testing, thanks to its pluggable architecture and third party plugins that let you interact with the most common databases and systems.

In addition it provides some conveniences for writing browser UI actions (e.g., implicit waits before interacting with an input element; the Cypress framework was a great source of inspiration for me) and asynchronous checks (e.g., wait until a certain condition is true).

You can use pytest-play programmatically (e.g., use the pytest-play engine as a library for standalone scenarios, or use the pytest-play API to implement BDD steps).

Starting from pytest-play>1.4.x a new experimental feature was introduced that lets you use pytest-play as a framework, creating Python-free automated tests based on a JSON serialization format for actions and assertions (in the near future the more user friendly YAML format will be supported).

So now depending on your needs and skills you can choose to use pytest-play as a library or as a framework.

In this article I'm going to show how to implement a Plone CMS login test using the Python-free approach, without having to write a single line of Python code.

What is pytest-play and why it exists

In this section I'm going to add more information about the pytest-play approach and other considerations: if you want to see right away how to implement our Python-free automated login test, jump to the next section!

Hyper specialized tool problems


There are many commercial products or tools that offer solutions for API testing only or browser testing only. Sometimes hyper-specialized tools might fit your needs (e.g., a content management system based web application), but sometimes they are not helpful for other distributed applications.

For example an API-only platform is not effective for testing a CQRS based application: it is not enough to test the HTTP 200 OK response, you should also verify that all the expected commands are generated on the event store (e.g., Cassandra) and the other side effects.

Another example: IoT applications and UI/browser-only testing platforms. You cannot test reactive web apps with a browser alone; you should also control simulated device activities (e.g., MQTT, queues, APIs for messages/alarms/reports) or any other web based interactions performed by other users (e.g., HTTP calls), and you might need to asynchronously check the expected results on web sockets when some actions are performed, instead of using a real browser.

What is pytest-play

In other words pytest-play is an open source testing solution based on the pytest framework that lets you:

using a serialization format (JSON at the time of writing, YAML in the near future) that should be more approachable for non-technical testers, non-programmers or programmers with no Python knowledge.

Potentially you can share and execute a new scenario not yet included in your test library by copying and pasting a pytest-play JSON into a Jenkins build with parameters form like the following one (see the PLAY textarea):

From http://davidemoro.blogspot.it/2018/03/test-automation-python-pytest-jenkins.html


In addition, if you are a technical user you can extend it by writing your own plugins, provide integration with external tools (e.g., test management tools, software metrics engines, etc.), and decide the test abstraction depending on deadlines/skills/strategy (e.g., plain JSON files, a programmatic approach based on JSON scenarios, or BDD steps based on pytest-play).

What pytest-play is not

For example pytest-play doesn't provide a test scenario recorder; it requires users to understand what they are doing.

It requires very little programming knowledge for writing some assertions using simple code expressions, but with a little training it is still approachable by non-programmers (you don't have to learn a programming language, just some basic assertions).

It is not feature complete but it is free software.

If you want to know more, in this previous article I talked about:

A pytest-play example: parametrized login (featuring Plone CMS)

In this example we'll see how to write and execute pure JSON pytest-play scenarios with test data decoupled from the test implementation, plus test parametrization. I'm using the Plone 5 demo site available online, kindly hosted by Andreas Jung (www.zopyx.com).

The project is available here:

The tests can be launched this way, as a normal pytest project, once you have installed pytest and the dependencies (there is a requirements.txt file, see the above link):

$ pytest --variables env-ALPHA.yml --splinter-webdriver firefox --splinter-screenshot-dir /tmp -x

You can have multiple environment/variables files, e.g., env-ALPHA.yml containing the alpha base URL and any other variables:

pytest-play:
  base_url: https://plone-demo.info

Our login scenario test_login.json contains the following (as you can see there are NO asynchronous waits, because they are not needed for basic examples: thanks to implicit waits you can focus on actions and assertions):

{
  "steps": [
    {
      "comment": "visit base url",
      "type": "get",
      "url": "$base_url"
    },
    {
      "comment": "click on login link",
      "locator": {
        "type": "id",
        "value": "personaltools-login"
      },
      "type": "clickElement"
    },
    {
      "comment": "provide a username",
      "locator": {
        "type": "id",
        "value": "__ac_name"
      },
      "text": "$username",
      "type": "setElementText"
    },
    {
      "comment": "provide a password",
      "locator": {
        "type": "id",
        "value": "__ac_password"
      },
      "text": "$password",
      "type": "setElementText"
    },
    {
      "comment": "click on login submit button",
      "locator": {
        "type": "css",
        "value": ".pattern-modal-buttons > input[name=submit]"
      },
      "type": "clickElement"
    },
    {
      "comment": "wait for page loaded",
      "locator": {
        "type": "css",
        "value": ".icon-user"
      },
      "type": "waitForElementVisible"
    }
  ]
}

Plus an optional test scenario metadata file test_login.ini that contains pytest markers and decoupled test data:

[pytest]
markers =
    login
test_data =
    {"username": "siteadmin", "password": "siteadmin"}
    {"username": "editor", "password": "editor"}
    {"username": "reader", "password": "reader"}

Thanks to the metadata file you have just one scenario and it will be executed 3 times (as many times as test data rows)!

Et voilà, let's see our scenario in action, without having written a single line of Python code:


There is only a warning I have to remove, but it worked and we got exactly 3 different test runs for our login scenario, as expected!

pytest-play status

pytest-play should still be considered experimental software and many features need to be implemented or refactored:

PyCon Nove @ Florence

If you are going to attend the next PyCon Nove in Florence, don't miss the pytest-play talk presented by Serena Martinetti.

Do you like pytest-play?

Tweets about pytest-play happen on @davidemoro.
Positive or negative feedback is always appreciated. If you find the concepts behind pytest-play interesting, let me know with a tweet, add a new pytest-play adapter and/or add a GitHub star if you liked it.



22 Feb 2019 11:15pm GMT

Davide Moro: Hello pytest-play!

pytest-play is a rec&play (rec not yet available) pytest plugin that let you execute a set of actions and assertions using commands serialized in JSON format. It tries to make test automation more affordable for non programmers or non Python programmers for browser, functional, API, integration or system testing thanks to its pluggable architecture and third party plugins that let you interact with the most common databases and systems.

In addition it provides also some facilitations for writing browser UI actions (e.g., implicit waits before interacting with an input element. The Cypress framework for me was a great source of inspiration) and asynchronous checks (e.g., wait until a certain condition is true).

You can use pytest-play programmatically (e.g., use the pytest-play engine as a library for pytest-play standalone scenarios or using the pytest-play API implementing BDD steps).

Starting from pytest-play>1.4.x it was introduced a new experimental feature that let you use pytest-play as a framework creating Python-free automated tests based on a JSON based serialization format for actions and assertions (in the next future the more user friendly YAML format will be supported).

So now depending on your needs and skills you can choose to use pytest-play as a library or as a framework.

In this article I'm going to show how to implement a Plone CMS based login test using the python-free approach without having to write any line of Python code.

What is pytest-play and why it exists

In this section I'm going to add more information about the pytest-play approach and other considerations: if you want to see now how to implement our Python-free automated login test jump to the next section!

Hyper specialized tool problems


There are many commercial products or tools that offer solutions for API testing only, browser testing only. Sometimes hyper specialized tools might fit your needs (e.g., a content management system based web application) but sometimes they are not helpful for other distributed applications.

For example an API-only platform is not effective for testing a CQRS based application. It is not useful testing only HTTP 200 OK response, you should test that all the expected commands are generated on the event store (e.g., Cassandra) or other side effects.

Another example for an IoT applications and UI/browser only testing platforms. You cannot test reactive web apps only with a browser, you should control also simulated device activities (e.g., MQTT, queues, API) for messages/alarms/reports) or any other web based interactions performed by other users (e.g., HTTP calls); you might need to check asynchronously the expected results on web sockets instead of using a real browser implementing when some actions are performed.

What is pytest-play

In other words pytest-play is an open source testing solution based on the pytest framework that let you:

using a serialization format (JSON at this time of writing, YAML in the next future) that should be more affordable for non technical testers, non programmers or programmers with no Python knowledge.

Potentially you will be able to share and execute a new scenario not yet included in your test library copying and pasting a pytest-play JSON to a Jenkins build with parameters form like the following one (see the PLAY textarea):

From http://davidemoro.blogspot.it/2018/03/test-automation-python-pytest-jenkins.html


In addition if you are a technical user you can extend it writing your own plugins, you can provide the integration with external tools (e.g., test management tools, software metrics engines, etc), you can decide the test abstraction depending on deadlines/skills/strategy (e.g., use plain json files, a programmatic approach based on json scenarios or BDD steps based on pytest-play).

What pytest-play is not

For example pytest-play doesn't provide a test scenario recorder but it enforces users to understand what they are doing.

It requires a very very little programming knowledge for writing some assertions using simple code expressions but with a little training activity it is still affordable by non programmers (you don't have to learn a programming language, just some basic assertions).

It is not feature complete but it is free software.

If you want to know more, in this previous article I've talked about:

A pytest-play example: parametrized login (featuring Plone CMS)

In this example we'll see how to write and execute pure JSON pytest-play scenarios, with test data decoupled from the test implementation, and test parametrization. I'm using the Plone 5 demo site available online, kindly hosted by Andreas Jung (www.zopyx.com).

The project is available here:

The tests can be launched as a normal pytest project, once you have installed pytest and the dependencies (there is a requirements.txt file, see the above link), like this:

$ pytest --variables env-ALPHA.yml --splinter-webdriver firefox --splinter-screenshot-dir /tmp -x

where you can have multiple environment/variable files, e.g., env-ALPHA.yml containing the ALPHA base url and any other variables:

pytest-play:
  base_url: https://plone-demo.info

Our login scenario test_login.json contains the following (as you can see there are NO explicit asynchronous waits because, thanks to implicit waits, they are not needed for basic examples, so you can focus on actions and assertions):

{
    "steps": [
        {
            "comment": "visit base url",
            "type": "get",
            "url": "$base_url"
        },
        {
            "comment": "click on login link",
            "locator": {
                "type": "id",
                "value": "personaltools-login"
            },
            "type": "clickElement"
        },
        {
            "comment": "provide a username",
            "locator": {
                "type": "id",
                "value": "__ac_name"
            },
            "text": "$username",
            "type": "setElementText"
        },
        {
            "comment": "provide a password",
            "locator": {
                "type": "id",
                "value": "__ac_password"
            },
            "text": "$password",
            "type": "setElementText"
        },
        {
            "comment": "click on login submit button",
            "locator": {
                "type": "css",
                "value": ".pattern-modal-buttons > input[name=submit]"
            },
            "type": "clickElement"
        },
        {
            "comment": "wait for page loaded",
            "locator": {
                "type": "css",
                "value": ".icon-user"
            },
            "type": "waitForElementVisible"
        }
    ]
}

Plus an optional test scenario metadata file test_login.ini that contains pytest markers and decoupled test data:

[pytest]
markers =
    login
test_data =
    {"username": "siteadmin", "password": "siteadmin"}
    {"username": "editor", "password": "editor"}
    {"username": "reader", "password": "reader"}

Thanks to the metadata file you have just one scenario and it will be executed 3 times (as many times as test data rows)!
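For readers more used to plain pytest, the effect is conceptually similar to @pytest.mark.parametrize; the snippet below is only an analogy written in ordinary Python, not how pytest-play consumes the .ini file:

import pytest

# Conceptual analogy only: pytest-play reads the test_data rows from the
# .ini metadata file, while plain pytest would express the same idea as:
@pytest.mark.parametrize("username,password", [
    ("siteadmin", "siteadmin"),
    ("editor", "editor"),
    ("reader", "reader"),
])
def test_login(username, password):
    ...  # one run per credentials row, three runs in total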

Et voilà, let's see our scenario in action without having to write a single line of Python code:


There is only a warning I have to remove, but it worked and we got exactly 3 different test runs for our login scenario, as expected!

pytest-play status

pytest-play should still be considered experimental software, and many features need to be implemented or refactored:

PyCon Nove @ Florence

If you are going to attend the next PyCon Nove in Florence, don't miss the following pytest-play talk presented by Serena Martinetti:

Do you like pytest-play?

Tweets about pytest-play happen on @davidemoro.
Positive or negative feedback is always appreciated. If you find the concepts behind pytest-play interesting, let me know with a tweet, add a new pytest-play adapter, and/or add a GitHub star if you like it:

Star

Updates

22 Feb 2019 11:15pm GMT

Nikola: Nikola v8.0.2 is out!

On behalf of the Nikola team, I am pleased to announce the immediate availability of Nikola v8.0.2. This is a quality-of-life release with a handful of bug fixes, two new translations and a few extra features.

What is Nikola?

Nikola is a static site and blog generator, written in Python. It can use Mako and Jinja2 templates, and input in many popular markup formats, such as reStructuredText and Markdown - and can even turn Jupyter Notebooks into blog posts! It also supports image galleries, and is multilingual. Nikola is flexible, and page builds are extremely fast, courtesy of doit (which is rebuilding only what has been changed).

Find out more at the website: https://getnikola.com/

Downloads

Install using pip install Nikola. (Python 3-only since v8.0.0.)

Changes

  • Make ARCHIVE_PATH, ARCHIVE_FILENAME translatable (Issue #3234)
  • Support configuring Isso via GLOBAL_CONTEXT['isso_config'] (Issue #3225)
  • Handle fragments in doc role (Issue #3212)
  • Slugify references in doc role.
  • Add Interlingua translation by Alberto Mardegan
  • Add Afrikaans translation by Friedel Wolff
  • Support for docutils.conf (Issue #3188)

Bugfixes

  • Avoid random rebuilds with sites whose locales are not fully supported, and random rebuilds on multilingual sites using Python 3.4/3.5 (Issue #3216)
  • Apply modifications to default_metadata before copying it to other languages
  • Make Commento comments work (Issue #3198)
  • Correctly handle separators in the relative path given to "ignored_assets" key in theme meta files (Issue #3210)
  • Fix error when nikola new_post receives directory name as path (Issue #3207)
  • Add slashes to paths with query strings in nikola serve only if there isn't one before ?
  • Read listings files as UTF-8
  • Set one-file status basing on default language only (Issue #3191)
  • Don't warn if post status is set to published explicitly (Issue #3181)
  • Remove mention of Twitter cards requiring an opt-in. This is not true anymore - anyone can use them.
  • fancydates now work with listdate items (e.g. archives)
  • bootstrap4 and bootblog4 themes no longer load moment.js when fancydates are off. (Issue #3231)

22 Feb 2019 7:34pm GMT

Python Sweetness: Reasons Mitogen sucks

I have a particular dislike for nonspecific negativity, where nothing can be done to address its source because the reasons underlying it are never explicitly described. In the context of Mitogen, there has been a consistent stream of this sort originating from an important camp in public spaces, and despite efforts to bring specifics out into the open, still it continues to persist.

For that reason I'd like to try a new strategy: justify the negativity and give it a face by providing all the fuel it needs to burn. Therefore in this post, in the interests of encouraging honesty, I will critique my own work.

Mitogen is threaded

Mitogen is a prehistoric design, with the earliest code dating all the way back to 2007, a time when threading seemed cool and the 'obvious' solution to the problem the library intended to solve. Through mountains of words I have since then justified the use of threading, as if the use of threads were a mandatory feature. In some scenarios this is partially true, but just as often entirely incorrect.

This code does everything your mother told you never to do when using threads, and suffers endless asynchrony issues as a result. Historically it failed to lock mutable data structures, and some remnants of that past remain to the present day.

If you're going to throw tomatoes, threading is a great place to start. If there is any legitimate reason to have a problem with integrating this library, look no further.

The documentation sucks

If this guy gets hit by a bus, what happens to the library? Valid concern! Documentation is currently inconsistent at best, out of date and nonexistent at worst. I care a lot about good documentation, but not quite as much as keeping users happy. As a result, a description of the internals detailed enough to allow someone to take over the work, or even fix some kinds of bugs, does not as yet exist.

It's not that Mitogen is exactly rocket science to maintain, but through poor and unclear documentation, coupled with a mysterious name, a simple solution is cloaked in an aura of inscrutability. That incites fear in many who attempt to grasp it.

Mitogen isn't tested properly

Due to the endless asynchrony issues created by threading and running across multiple processes, Mitogen is a nightmare to test. In the common case where some important test passes, there is little to no guarantee that test will continue to pass, say, if the moon turned blue or a flock of migratory pink flamingos happened to pass overhead.

Addressing testing in Mitogen is more than a full-time job, it's an impossible nightmare literally exceeding the boundaries of computer science. Due to the library's design, and the underlying use of threading, a combinatorial explosion of possible program states is produced for which no good testing strategy can or ever will exist.

Many of these aspects can be tested only partially by throwing the library at the wall in a loop and ensuring it sticks, and due to the problem it tries to solve, this will likely remain true in perpetuity. The sound bite you're looking for to justify your loathing is "Mitogen is spaghetti code that's impossible to test".

Mitogen does things it shouldn't do.

If you look closely enough you'll find some modules importing ctypes and doing all kinds of crazy things to work around deficiencies in Python's interfaces. Who the hell writes cowboy code like this? What happens if I run this on some ancient 286 running Xenix? While such atrocities are carefully sandboxed, the reality is that testing in these scenarios has never been done. The computer might catch fire, or the moon might come crashing into the earth.

Mitogen is riddled with internationalization issues

As a side effect of being written by a monolingual developer, Mitogen sometimes doesn't have a clue about sudo or su when it says things like المصادقة غير صحيحة, therefore you can never really truly 100% rely on it to function when charging handsomely to peddle someone else's work at customer sites, when some of those might be large international concerns, say, in the oil or finance sector.

Because of the hacky methods used to integrate external tools, there may always be some operating system, tool and language combination for which the library will fail to notice something is wrong. In the worst case it might hang, or confusingly claim a connection timeout occurred when really it just didn't understand the tool.

Mitogen uses pickle on the network

The crown jewel: it is absolutely true that Mitogen will pass untrusted pickles through an unpickler, with mitigations in place to prevent arbitrary code execution. But is it enough? Nobody knows for sure! It's impossible to combat this variety of FUD, but just in case a replacement serialization has existed in a local branch for over 6 months, waiting for just the day some material vulnerability is finally revealed.

Be certain to mention pickle insecurity at every opportunity when generating negativity about the library, it's almost impossible to fight against it.

The Ansible extension monkey-patches half the application

Half may be a slight exaggeration, but indeed it's true, the Ansible extension is littered with monkey-patches. It doesn't matter that these are tidy and conservative patches that uninstall themselves when not in use or when incompatible situations are detected, monkey patching! is! wrong! and must! be prevented!

Who could trust any code that subclasses types dynamically, or assigns a method to a class instance? No code should ever do that, or heaven forbid, nor should any test suite. Get this printed on a t-shirt so the message is abundantly clear.

The Ansible extension reimplements the whole module executor

It's true, it didn't even cut-and-paste the existing code, it just blanket reimplements it! All spread out in a rat's nest of 10-line methods across a giant inscrutable class hierarchy. The nerve!

Despite replacing a giant chunk of Ansible's lower tiers, the plain truth is that this code is a weird derivative of existing Ansible functionality bolted on the side. Who in their right mind wants to run a setup like that?

The Ansible extension prefers performance over safety

Here's a great one: to avoid super important network round-trips when copying small files, rather than raise an error when the file copy fails, it instead raises the error on the follow-up module execution. How confusing! How can this mess possibly be debugged?

The Ansible extension makes it easy for buggy modules to break things

The dirty secret behind some of that performance? We're reusing state, the horror! And because we're reusing state, if some of that state is crapped over by a broken module, all hell could break loose. The sky might fall, hard disks could be wiped, fortune 500s might cease to exist!

In the absence of collaboration with the authors of such evil code, Mitogen is forced between a rock and a hard place: maintaining a blacklist of known-broken modules and forking a threaded process (fork! with threads! the horror!) to continue the illusion of good performance in spite of absolutely no help from those who could provide it.

Forking support is entirely disabled when targeting Python 2.4, and there is a high chance Mitogen will remove forking support entirely in the future, but in the meantime, use no less than 72 point bold font when discussing one of the library's greatest sins of all.

The Ansible extension rewrites the user's playbook

It's true! Why does a strategy module get to pick which connection plugins are used? It's as if the author believed in eliminating the need for unproductive busywork at all costs for the users of his project, and dared put user before code with a monkey-patch to ensure manual edits across a huge Git repo full of playbooks weren't necessary to allow the code to be tested, you know, by busy people who were already so stressed out about not having enough time to complete the jobs they've been assigned, due to a fundamentally slow and inefficient tool, that they'd risk experimenting with such a low quality project to begin with. The horror!

Summary

Mitogen has in excess of 4,000 interactive downloads (12,000 including CI) over the past year, of course not all of them users, but I have long since lost count of the number of people who swear by it and rely on it daily. In contrast to these numbers, GitHub currently sports a grand total of 23 open user-reported bugs.

If you need more ammunition in your vague and futile battle with the future, please file a bug and I will be only too happy to supply it. But please, please, please avoid suggesting things "don't work" or are broken somehow without actually specifying how, especially in the context of a tool that continues to waste thousands of man hours every year. It is deeply insulting and a damning indictment on your ability to communicate and openly collaborate in what is otherwise allegedly the context of a free software project.

Hopefully this post should arm such people with legitimate complaints for the foreseeable future. Meanwhile, I will work to improve the library, including as a result of bug reports filed in GitHub.

22 Feb 2019 7:30pm GMT

Peter Bengtsson: Django ORM optimization story on selecting the least possible

This is an optimization story that should not surprise anyone using the Django ORM. But I thought I'd share because I have numbers now! The origin of this came from a real requirement: for a given parent model, I'd like to extract the value of the name column of all its child models, and then turn all these name strings into one MD5 checksum string.

Variants

The first attempt looked like this:

artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist):
    names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

The SQL used to generate this is as follows:

SELECT "main_song"."id", "main_song"."artist_id", "main_song"."name", 
"main_song"."text", "main_song"."language", "main_song"."key_phrases", 
"main_song"."popularity", "main_song"."text_length", "main_song"."metadata", 
"main_song"."created", "main_song"."modified", 
"main_song"."has_lastfm_listeners", "main_song"."has_spotify_popularity" 
FROM "main_song" WHERE "main_song"."artist_id" = 22729;

Clearly, I don't need anything but the name column; version 2:

artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist).only("name"):
    names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

Now, the SQL used is:

SELECT "main_song"."id", "main_song"."name" 
FROM "main_song" WHERE "main_song"."artist_id" = 22729;

But still, since I don't really need instances of model class Song I can use the .values() method which gives back a list of dictionaries. This is version 3:

names = []
for song in Song.objects.filter(artist=artist).values("name"):
    names.append(song["name"])
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

This time Django figures it doesn't even need the primary key value so it looks like this:

SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;

Last but not least, there is an even faster one: values_list(). This time it doesn't even bother to map the column name to the value in a dictionary. And since I only need one column's value, I can set flat=True. Version 4 looks like this:

names = []
for name in Song.objects.filter(artist=artist).values_list("name", flat=True):
    names.append(name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

Same SQL gets used this time as in version 3.

The benchmark

Hopefully this little benchmark script speaks for itself:

from songsearch.main.models import *

import hashlib


def f1(a):
    names = []
    for song in Song.objects.filter(artist=a):
        names.append(song.name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f2(a):
    names = []
    for song in Song.objects.filter(artist=a).only("name"):
        names.append(song.name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f3(a):
    names = []
    for song in Song.objects.filter(artist=a).values("name"):
        names.append(song["name"])
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


def f4(a):
    names = []
    for name in Song.objects.filter(artist=a).values_list("name", flat=True):
        names.append(name)
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()


artist = Artist.objects.get(name="Bad Religion")
print(Song.objects.filter(artist=artist).count())

print(f1(artist) == f2(artist))
print(f2(artist) == f3(artist))
print(f3(artist) == f4(artist))

# Reporting
import time
import random
import statistics

functions = f1, f2, f3, f4
times = {f.__name__: [] for f in functions}

for i in range(500):
    func = random.choice(functions)
    t0 = time.time()
    func(artist)
    t1 = time.time()
    times[func.__name__].append((t1 - t0) * 1000)

for name in sorted(times):
    numbers = times[name]
    print("FUNCTION:", name, "Used", len(numbers), "times")
    print("\tBEST", min(numbers))
    print("\tMEDIAN", statistics.median(numbers))
    print("\tMEAN  ", statistics.mean(numbers))
    print("\tSTDEV ", statistics.stdev(numbers))

I ran this against PostgreSQL 11.1 on my MacBook Pro with Django 2.1.7, so the database is on localhost.

The results

276
True
True
True
FUNCTION: f1 Used 135 times
    BEST 6.309986114501953
    MEDIAN 7.531881332397461
    MEAN   7.834429211086697
    STDEV  2.03779968066591
FUNCTION: f2 Used 135 times
    BEST 3.039121627807617
    MEDIAN 3.7298202514648438
    MEAN   4.012803678159361
    STDEV  1.8498943539073027
FUNCTION: f3 Used 110 times
    BEST 0.9920597076416016
    MEDIAN 1.4405250549316406
    MEAN   1.5053835782137783
    STDEV  0.3523240470133114
FUNCTION: f4 Used 120 times
    BEST 0.9369850158691406
    MEDIAN 1.3251304626464844
    MEAN   1.4017681280771892
    STDEV  0.3391019435930447

Bar chart

Discussion

I guess the hashlib.md5("".join(names).encode("utf-8")).hexdigest() stuff is a bit "off-topic" but I checked and it's roughly 300 times faster than building up the names list.

It's clearly better to ask less of Python and PostgreSQL to get a better total time. No surprise there. What was interesting was the proportion of these differences. Memorize that and you'll be better equipped to judge whether it's worth the hassle of not using the Django ORM in its most basic form.

Also, do take note that this is only relevant when dealing with many records. The slowest variant (f1) takes, on average, 7 milliseconds.

Summarizing the difference with percentages compared to the fastest variant:
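A quick way to obtain that comparison from the mean timings reported above (a small helper sketch, not part of the original benchmark script):

# Relative cost of each variant, using the mean timings (ms) printed above.
means = {"f1": 7.834, "f2": 4.013, "f3": 1.505, "f4": 1.402}

fastest = min(means.values())
for name in sorted(means):
    print(f"{name}: {means[name] / fastest:.0%} of the fastest variant's mean time")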

22 Feb 2019 6:49pm GMT

Python Engineering at Microsoft: What’s New with the AI and Machine Learning Tools for Python: February 2019 Update

Across Visual Studio Code and Azure Notebooks, January brought numerous exciting updates to the AI and Machine Learning tooling for Python! This roll-up blog post recaps the latest product updates as well as the upcoming events for AI and Machine Learning:

What's new?

Python Interactive in VS Code

The Python extension for VS Code first introduced an interactive data science experience in the last Oct update. With this release, we brought the power of Jupyter Notebooks into VS Code. Many feature additions have been released since, including remote Jupyter support, ability to export Python code to Jupyter Notebooks, etc.

The most noticeable enhancement in the Jan 2019 update allows code to be typed and executed directly in the Python Interactive window. Now, the window effectively turns into an IPython console that can be used standalone as well as in conjunction with the code editor. As well, the Jan 2019 update brings general Python goodness, such as improved diagnostics for failed tests with pytest and much faster outline view with the Python Language Server. See blog Python in Visual Studio Code - January 2019 Release for the full list.

Work with Azure Machine Learning service in VS Code and Azure Notebooks

To add some simplicity to the complex world of ML pipelines, Microsoft announced the general availability of the Azure Machine Learning service in December. This service eases and accelerates the building, training, and deployment of your machine learning models from the cloud to the edge.

The Azure Machine Learning extension for VS Code provides an easy way to manage the Azure Machine Learning service. This includes controls over experiments, pipelines, compute, models, and services, all from within VS Code. Interested? Check out how to get started with the AML extension.

Additionally, Azure Notebooks offers a seamless way to take advantage of these new APIs within our hosted Jupyter notebook experience. You can join thousands of others who have tried out our Getting Started with the Azure Machine Learning service sample project. As well, you can connect directly with Azure Notebooks from your AML Workspace.

Azure Notebooks Connect() News

For those who missed our exciting Connect() Announcements, Azure Notebooks released new integrations with the Azure ecosystem as well as a fresh UI. With these improvements, we hope to help data scientists achieve greater productivity on our platform.

For workloads that need a bit more power beyond our free compute pool, Notebooks now allows you to connect to any SKU of Data Science Virtual Machine. Through this, users can take advantage of the full suite of Azure compute capabilities, right from Azure Notebooks. Read more in our documentation.

We encourage users to try out these new compute options as well as the exciting Azure Machine Learning service integration mentioned above to further productivity on the Azure Notebooks platform.

New resources

Beyond feature releases, there are some great new demo resources from the Ignite Tour:

Upcoming Events

Our Python tooling team as well as many of the Microsoft Cloud Developer Advocates have a few exciting events coming up:

Tell us what you think

As always, we look forward to hearing your feedback! Please leave comments below or find us on Github (VS Code Python extension, Azure Notebooks) and Twitter (@pythonvscode, @AzureNotebooks).

The post What's New with the AI and Machine Learning Tools for Python: February 2019 Update appeared first on Python.

22 Feb 2019 6:00pm GMT

Stack Abuse: Sorting and Merging Single Linked List

In the last article, we started our discussion about the linked list. We saw what the linked list is, along with its advantages and disadvantages. We also studied some of the most commonly used linked list methods, such as traversal, insertion, deletion, searching, and counting elements. Finally, we saw how to reverse a linked list.

In this article, we will continue from where we left in the last article and will see how to sort a linked list using bubble and merge sort, and how to merge two sorted linked lists.

Before we continue, it is imperative to mention that you should have the Node and LinkedList classes that we created in the last article in place.
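If you don't have them handy, here is a minimal sketch of what those two classes could look like; the attribute names (item, ref, start_node) match the code used throughout this article, while make_new_list() and traverse_list() are simplified stand-ins for the versions built in the previous article:

class Node:
    def __init__(self, item=None):
        self.item = item   # the value stored in this node
        self.ref = None    # reference to the next node


class LinkedList:
    def __init__(self):
        self.start_node = None  # first node of the list

    def make_new_list(self):
        # Simplified stand-in: read values interactively and append each
        # one at the end of the list.
        count = int(input("How many nodes do you want to create: "))
        for _ in range(count):
            value = int(input("Enter the value for the node:"))
            new_node = Node(value)
            if self.start_node is None:
                self.start_node = new_node
            else:
                last = self.start_node
                while last.ref is not None:
                    last = last.ref
                last.ref = new_node

    def traverse_list(self):
        current = self.start_node
        while current is not None:
            print(current.item)
            current = current.ref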

Sorting a Linked List using Bubble Sort

There are two ways to sort a linked list using bubble sort:

  1. Exchanging data between nodes
  2. Modifying the links between nodes

In this section, we will see how both these approaches work. We will use the bubble sort algorithm to first sort the linked list by changing the data, and then we will see how we can use bubble sort to change the links in order to sort the linked list.

Sorting Linked List by Exchanging Data

To sort a linked list by exchanging data, we need to declare three variables p, q, and end.

The variable p will be initialized with the start node, while end will be set to None.

It is important to remember that to sort the list with n elements using bubble sort, you need n-1 iterations.

To implement bubble sort, we need two while loops. The outer while loop keeps executing until the variable end becomes equal to self.start_node.

The inner while loop executes until p reaches the end variable. At the start of each outer iteration, p is set to self.start_node, which is the first node. Inside the inner while loop, q is set to p.ref, which is the node next to p. Then the values of p and q are compared; if p is greater than q, the values of the two nodes are swapped, and then p moves on to p.ref, the next node. Finally, end is assigned the value of p. This process continues until the linked list is sorted.

Let's understand this process with the help of an example. Suppose we have the following list:

8,7,1,6,9  

Let's implement our algorithm to sort the list. We'll see what will happen during each iteration. The purpose of the bubble sort is that during each iteration, the largest value should be pushed to the end, hence at the end of all iterations, the list will automatically be sorted.

Before the loop executes, the value of end is set to None.

In the first iteration, p will be set to 8, and q will be set to 7. Since p is greater than q, the values will be swapped and p will become p.ref. At this point of time the linked list will look like this:

7,8,1,6,9  

Since at this point of time, p is not equal to end, the loop will continue and now p will become 8 and q will become 1. Since again p is greater than q, the values will be swapped again and p will again become p.ref. The list will look like this:

7,1,8,6,9  

Here again, p is not equal to end, the loop will continue and now p will become 8 and q will become 6. Since again p is greater than q, the values will be swapped again and p will again become p.ref. The list will look like this:

7,1,6,8,9  

Again p is not equal to end, the loop will continue and now p will become 8 and q will become 9. Here since p is not greater than q, the values will not be swapped and p will become p.ref. At this point of time, the reference of p will point to None, and end also points to None. Hence the inner while loop will break and end will be set to p.

In the next set of iterations, the loop will execute until 8, since 9 is already at the end. The process continues until the list is completely sorted.

The Python code for sorting the linked list using bubble sort by exchanging the data is as follows:

    def bub_sort_datachange(self):
        end = None
        while end != self.start_node:
            p = self.start_node
            while p.ref != end:
                q = p.ref
                if p.item > q.item:
                    p.item, q.item = q.item, p.item
                p = p.ref
            end = p

Add the bub_sort_datachange() method to the LinkedList class that you created in the last article.

Once you add the method to the linked list, create a set of nodes using the make_new_list() function and then call bub_sort_datachange() to sort the list. You should see the sorted list when you execute the traverse_list() function.
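For instance, assuming you enter the example values 8, 7, 1, 6, 9 from earlier, a quick interactive session could look roughly like this:

lst = LinkedList()
lst.make_new_list()          # enter 8, 7, 1, 6, 9 when prompted
lst.bub_sort_datachange()    # sorts in place by exchanging node data
lst.traverse_list()          # prints 1, 6, 7, 8, 9, one value per line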

Sorting Linked List by Modifying Links

Bubble sort can also be used to sort a linked list by modifying the links instead of changing data. The process remains quite similar to sorting the list by exchanging data; however, in this case, we have an additional variable r that always refers to the node just before p.

Let's take a simple example of how we will swap two nodes by modifying links. Suppose we have a linked list with the following items:

10,45,65,35,1  

And we want to swap 65 and 35. At this point in time, p corresponds to node 65 and q corresponds to node 35. The variable r will correspond to node 45 (the node previous to p). Now, if node p is greater than node q, which is the case here, p.ref will be set to q.ref and q.ref will be set to p. Similarly, r.ref will be set to q. This swaps nodes 65 and 35.

The following method implements the bubble sorting for the linked list by modifying links:

    def bub_sort_linkchange(self):
        end = None
        while end != self.start_node:
            r = p = self.start_node
            while p.ref != end:
                q = p.ref
                if p.item > q.item:
                    p.ref = q.ref
                    q.ref = p
                    if p != self.start_node:
                        r.ref = q
                    else:
                        self.start_node = q
                    p,q = q,p
                r = p
                p = p.ref
            end = p

Add the bub_sort_linkchange() method to the LinkedList class that you created in the last article.

Once you add the method to the linked list, create any set of nodes using the make_new_list() function and then use the bub_sort_linkchange() to sort the list. You should see the sorted list when you execute the traverse_list() function.

Merging Sorted Linked List

In this section we will see how we can merge two sorted linked lists in such a way that the resulting linked list is also sorted. There are two approaches to achieve this: we can create a new linked list that contains the merged result, or we can simply change the links of the two linked lists to join them. In the second case, we do not have to create a new linked list.

Let's first see how we can merge two linked lists by creating a new list.

Merging Sorted Linked Lists by Creating a New List

Let's first dry run the algorithm to see how we can merge two sorted linked lists with the help of a new list.

Suppose we have the following two sorted linked lists:

list1:

10,45,65,  

list2:

5,15,35,68  

These are the two lists we want to merge. The algorithm is straightforward. All we need is three variables, p, q, and em, and an empty list newlist.

At the beginning of the algorithm, p will point to the first element of the list1 whereas q will point to the first element of the list2. The variable em will be empty. At the start of the algorithm, we will have the following values:

p = 10  
q = 5  
em = none  
newlist = none  

Next, we will compare the first element of the list1 with the first element of list2, in other words, we will compare the values of p and q and the smaller value will be stored in the variable em which will become the first node of the new list. The value of em will be added to the end of the newlist.

After the first comparison we will have the following values:

p = 10  
q = 15  
em = 5  
newlist = 5  

Since q was less than p, we stored the value of q in em and moved q one node to the right. In the second pass, we will have the following values:

p = 45  
q = 15  
em = 10  
newlist = 5, 10  

Here, since p was smaller, we added the value of p to newlist, set em to p, and then moved p one node to the right. In the next iteration we have:

p = 45  
q = 35  
em = 15  
newlist = 5, 10, 15  

Similarly, in the next iteration:

p = 45  
q = 68  
em = 35  
newlist = 5, 10, 15, 35  

And in the next iteration, p will again be smaller than q, hence:

p = 65  
q = 68  
em = 45  
newlist = 5, 10, 15, 35, 45  

Finally,

p = None  
q = 68  
em = 65  
newlist = 5, 10, 15, 35, 45, 65  

When one of the lists becomes None, all the remaining elements of the other list are added at the end of the new list. Therefore, the final list will be:

p = None  
q = None  
em = 68  
newlist = 5, 10, 15, 35, 45, 65, 68  

The Python script for merging two sorted lists is as follows:

    def merge_helper(self, list2):
        merged_list = LinkedList()
        merged_list.start_node = self.merge_by_newlist(self.start_node, list2.start_node)
        return merged_list

    def merge_by_newlist(self, p, q):
        if p.item <= q.item:
            startNode = Node(p.item)
            p = p.ref
        else:
            startNode = Node(q.item)
            q = q.ref

        em = startNode

        while p is not None and q is not None:
            if p.item <= q.item:
                em.ref = Node(p.item)
                p = p.ref
            else:
                em.ref = Node(q.item)
                q = q.ref
            em = em.ref

        while p is not None:
            em.ref = Node(p.item)
            p = p.ref
            em = em.ref

        while q is not None:
            em.ref = Node(q.item)
            q = q.ref
            em = em.ref

        return startNode

In the script above we have two methods: merge_helper() and merge_by_newlist(). The first method, merge_helper(), takes a linked list as a parameter and then passes self (the linked list the method is called on), together with the linked list it received, on to the merge_by_newlist() method.

The merge_by_newlist() method merges the two linked lists by creating a new linked list and returns the start node of that new list. Add these two methods to the LinkedList class. Create two new linked lists, sort them using the bub_sort_datachange() or bub_sort_linkchange() methods that you created in the last section, and then use merge_helper() to verify that the two sorted linked lists are merged correctly.
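As a quick sanity check, assuming the sample values used earlier in this section, a session could look roughly like this:

list_a = LinkedList()
list_a.make_new_list()                 # e.g. 10, 45, 65
list_b = LinkedList()
list_b.make_new_list()                 # e.g. 5, 15, 35, 68
list_a.bub_sort_datachange()           # both inputs must already be sorted
list_b.bub_sort_datachange()
merged = list_a.merge_helper(list_b)   # returns a brand new LinkedList
merged.traverse_list()                 # prints 5, 10, 15, 35, 45, 65, 68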

Merging Sorted Linked Lists by Rearranging Links

In this approach, a new linked list is not used to store the merger of two sorted linked lists. Rather, the links of the two linked lists are modified in such a way that two linked lists are merged in a sorted manner.

Let's see a simple example of how we can do this. Suppose we have the same two lists list1 and list2:

list1:

10,45,65,  

list2:

5,15,35,68  

We want to merge them in a sorted manner by rearranging the links. To do so we need variables p, q and em. Initially, they will have the following values:

p = 10  
q = 5  
em = none  
newlist = none  

Next, we will compare the first element of list1 with the first element of list2; in other words, we will compare the values of p and q, and the smaller value will be stored in the variable em, which will become the first node of the merged list.

After the first comparison we will have the following values:

p = 10  
q = 15  
start = 5  
em = start  

After the first iteration, since q is less than p, the start node will point towards q and q will become q.ref. The em will be equal to start. The em will always refer to the newly inserted node in the merged list.

p = 45  
q = 15  
em = 10  

Here, since p was smaller than q, the variable em now points towards the original p node and p becomes p.ref.

p = 45  
q = 35  
em = 15  

Here since q was smaller than p, em points towards q and q becomes q.ref.

p = 45  
q = 68  
em = 35  

Similarly em here points towards q.

p = 65  
q = 68  
em = 45  
newlist = 5, 10, 15, 35, 45  

And here em points towards p, and p becomes p.ref.

p = None  
q = 68  
em = 65  
newlist = 5, 10, 15, 35, 45, 65  

When one of the lists becomes None, the remaining nodes of the other list are simply linked at the end.

p = None  
q = None  
em = 68  
newlist = 5, 10, 15, 35, 45, 65, 68  

The script that contains functions for merging two lists without creating a new list is as follows:

    def merge_helper2(self, list2):
        merged_list = LinkedList()
        merged_list.start_node = self.merge_by_linkChange(self.start_node, list2.start_node)
        return merged_list

    def merge_by_linkChange(self, p, q):
        # Both lists are assumed to be non-empty and already sorted.
        # Pick the smaller of the two start nodes as the start of the merged list;
        # no new nodes are created, we only rearrange the existing links.
        if p.item <= q.item:
            startNode = p
            p = p.ref
        else:
            startNode = q
            q = q.ref

        # em always refers to the last node of the merged list built so far
        em = startNode

        while p is not None and q is not None:
            if p.item <= q.item:
                em.ref = p
                em = em.ref
                p = p.ref
            else:
                em.ref = q
                em = em.ref
                q = q.ref

        # Link whatever remains of the list that has not been exhausted yet
        if p is None:
            em.ref = q
        else:
            em.ref = p

        return startNode

In the script above we have two methods: merge_helper2() and merge_by_linkChange(). The first method, merge_helper2(), takes a linked list as a parameter and then passes two start nodes - that of self, which is a linked list itself, and that of the linked list received as a parameter - to merge_by_linkChange(), which merges the two linked lists by modifying the links and returns the start node of the merged list. Add these two methods to the LinkedList class. Create two new linked lists, sort them using the bub_sort_datachange() or the bub_sort_linkchange() methods that you created in the last section, and then use the merge_helper2() method to see if you can merge two sorted linked lists or not. Let's see this process in action.

Create a new linked list using the following script:

new_linked_list1 = LinkedList()  
new_linked_list1.make_new_list()  

The script will ask you for the number of nodes to enter. Enter as many nodes as you like and then add values for each node as shown below:

How many nodes do you want to create: 4  
Enter the value for the node:12  
Enter the value for the node:45  
Enter the value for the node:32  
Enter the value for the node:61  

Next, create another linked list repeating the above process:

new_linked_list2 = LinkedList()  
new_linked_list2.make_new_list()  

Next, enter the number of nodes and a value for each node when prompted:

How many nodes do you want to create: 4  
Enter the value for the node:36  
Enter the value for the node:41  
Enter the value for the node:25  
Enter the value for the node:9  

The next step is to sort both the lists. Execute the following script:

new_linked_list1.bub_sort_datachange()  
new_linked_list2.bub_sort_datachange()  

Finally, the following script merges the two linked lists:

list3 = new_linked_list1.merge_helper2(new_linked_list2)  

To see if the lists have actually been merged, execute the following script:

list3.traverse_list()  

The output looks like this:

9  
12  
25  
32  
36  
41  
45  
61  

Conclusion

In this article, we continued from where we left off in the previous article. We saw how we can sort linked lists by exchanging data and then by modifying links. Finally, we also studied different ways of merging two sorted linked lists.

22 Feb 2019 1:52pm GMT

Python Software Foundation: The North Star of PyCascades, core Python developer Mariatta Wijaya, receives the 2018 Q3 Community Service Award

We in the Python community have a deep appreciation for the volunteers who organize, promote, and write the language. After all, a phrase that has become a cornerstone of our community, 'Come for the language, stay for the community' (derived from Python core developer Brett Cannon's opening remarks at PyCon 2014), reflects the passion of our community and, even more so, of the countless volunteers building it.

One volunteer who has been steadfast in actively building the Python community - from her contributions to CPython to her work as an organizer and co-chair of PyCascades 2018 and more - is Mariatta Wijaya. We at the Python Software Foundation are pleased to name Mariatta Wijaya as a 2018 Q3 Community Service Award recipient:

RESOLVED, that the Python Software Foundation award the Q3 2018 Community Service Award to Mariatta Wijaya for her contributions to CPython, diversity efforts for the Python Core Contributor team, and her work on PyCascades.


Come talk! The path to becoming a Core Python Developer


At Montreal PyCon 2015, Guido van Rossum delivered the closing keynote, during which he issued a public ask: "I want at least two female Python core developers in the next year ... and I will try to train them myself if that's what it takes. So come talk to me." Mariatta did just that: she reached out to Guido after PyCon 2016 to learn more about getting started in Python core development. Mariatta recalls, "I hadn't contributed to open source [yet] and I wanted to know how to start". Guido recommended some ways for Mariatta to start, including reviewing the dev guide, looking at open issues, and joining and introducing herself on the Python dev mailing list.

Following Guido's advice, Mariatta "read the issues [to] see if there is anything I can help with, anything that interests me ... [when I learned that] Brett was starting migration [of Python] to GitHub". As an engineer at Zapier, Mariatta has a background in web development so the migration provided an initial issue she could begin to explore. Mariatta has since contributed to several bots that improve the workflows for Python contributors and core developers, reviewed and merged 700+ PRs to Python, and is Co-Chair of the Language Summit in 2019 and 2020. Some examples of bots she has written include cherry-picker, a "tool used to backport CPython changes from master into one or more of the maintenance branches". Additionally, Mariatta is the author of PEP-581: Using GitHub Issues for CPython. Her motivations behind this PEP again come back to improving the core Python development processes, "I think it will be more beneficial to use an out of the box issue tracker like GitHub as it will allow core developers to focus on developing and contributing".

A role model for us all: Increasing the diversity of the Core Python Development team


The recent departure of Guido as BDFL and the subsequent discussion about the future governance of core Python led to several suggestions, with the Steering Council ultimately becoming the chosen model. Core Python developer Victor Stinner, along with several others, nominated Mariatta to the Steering Council. In Victor's nomination he explains, "Mariatta became the first woman core developer in Python [in 2017]. She is actively sharing her experience to encourage people from underrepresented groups to contribute to Python." The work required to become a core developer is laborious, yet Mariatta has continuously gone the extra mile to lead by example and be active in public outreach. "Mariatta is my role model for mentoring and diversity which is helping a lot to get more people involved in Python," Victor adds. Python core developer and Steering Council member Carol Willing echoed this sentiment, sharing, "Mariatta works to share Python and its possible uses with others. Her blend of hard work, enthusiasm, and caring have welcomed many into the Python Community."

Mariatta's PyCon 2018 talk, "What is a Core Python Developer", is an ideal example of her dedication to building a more diverse core Python team. Beginning with a question, "do you use f-strings?" (Mariatta is a known avid fan of f-strings; she even has stickers for them), Mariatta dives into what the pathway is for core (and contributing) developers, ultimately commenting on the very real, stark gender imbalance within the core team: "We have 848 contributors to Python, less than 10 are women. We have 89 core developers, only 2 are women ... This is real but this is also wrong. This is not the right representation of our community". While this number is starting to change as more women are promoted to core development (Cheryl Sabella's promotion this week brings the number of women core developers up to 5 out of 97), Mariatta has continued to be a champion and advocate for diversity and inclusion in the core development team. Even the captionist at Mariatta's PyCon 2018 talk (seen in the tweets below) captured their appreciation for Mariatta's dedication.

Heard a story today that brought me to tears. A talk I did at #PyCon in 2014 inspired a woman to become involved in the community, speak, and become a core contributor. Diversity is important.@mariatta Thank you for sharing. The python was in you all along 🐍💙
- 𝙽𝚒𝚗𝚊 𝚉𝚊𝚔𝚑𝚊𝚛𝚎𝚗𝚔𝚘 🐍 (@nnja) May 12, 2018


Yes you are amazing @mariatta thank you for your work, your contributions, for sharing your story. Our #python community is better because of people like you. #pycon2018 pic.twitter.com/84FvYr4dSq
- Loooorena Mesa @ The Cosmos 🌝🌝🌝 (@loooorenanicole) May 12, 2018


Awesome talk on becoming a python core developer by the amazing @mariatta. More women need to contribute to CPython and they can look to @mariatta for advice. Please spread the word. @pycon #pycon #pycon2018.
- saptarshikar (@saptarshikar) May 12, 2018


The North Star of PyCascades


Outside of her contributions to CPython, Mariatta has been an active organizer with PyCascades - a regional Python conference now about to kick off its second edition this week. The inaugural 2018 conference, held in Mariatta's hometown of Vancouver, Canada, introduced a single track format with 30-minute talks, no question and answer sessions, and lightning talks. Inspired by the single track format of Write the Docs and DjangoCon Europe, this format was not only an easier way to get a new conference off the ground but, as Mariatta observed, "is able to give [speakers] the large audience they deserve". This format also makes it easier for attendees to navigate.

Co-chairing the conference with Mariatta in 2018, Seb Vetter remarked, "Mariatta has been THE driving force behind PyCascades in the inaugural year". As a co-chair, Mariatta helped respond to many last minute issues, such as when, the day before the conference at 10:00am local time, Guido informed the organizers he was unable to obtain a visa to travel to speak at PyCascades. Within a few hours, the team had set up for Guido to speak remotely and had sent him a badge, when they learned Guido would be able to attend after all! "When we found out he's coming, we printed one more badge for him. That's why he has multiple badges," Mariatta explained. Juggling many changing priorities is the life of an organizer. Yet with each decision made, "she ensured … considered the potential impact on the diversity of the conference," Seb remembered, adding, "[Mariatta] seems to have an endless stream of enthusiasm and energy and was our North Star for doing everything we could to make it as inclusive for attendees as possible". The idea of Mariatta acting as a North Star was echoed by PyCascades organizer Don Sheu, who added, "she gives voice to folks that aren't sufficiently represented in tech … [as a part of] PyCascades founding team, Mariatta's influence is creating a safe environment".

With PyCascades 2019 happening in Seattle this upcoming weekend (February 23 - 24), Mariatta is again contributing as an organizer.

What do f-string stickers and food have in common? Mariatta's love of them!


Outside of Python, when asked what else Mariatta likes to do she simply responded, "I love food!". And her favorite food? Asian cuisine.

#IceCreamSelfie at North Bay Python 2018.
Source: https://mariatta.ca/img/ics-northbaypython-2018.jpg.

If you happen to see Mariatta at an event, say hi. Maybe she'll have an f-string sticker for you!

I've just ordered more f-string stickers in preparation for #PyCascades2019 😃
- Mariatta 🤦 (@mariatta) January 15, 2019

22 Feb 2019 9:00am GMT

Python Bytes: #118 Better Python executable management with pipx

22 Feb 2019 8:00am GMT

21 Feb 2019

feedPlanet Python

Andrea Grandi: Skipping tests depending on the Python version

Sometimes we want to run certain tests only on a specific version of Python.

Suppose you are migrating a large project from Python 2 to Python 3 and you know in advance that certain tests won't run under Python 3.

Chances are that during the migration you are already using the six library. The six library has two boolean properties which are initialised to True depending on the Python version being used: PY2 when running under Python 2 and PY3 when running under Python 3.

These properties, combined with the skipIf decorator from the unittest library, can be used to easily skip tests when running under Python 3:

import six
import unittest


class MyTestCase(unittest.TestCase):


    @unittest.skipIf(six.PY3, "not compatible with Python 3")
    def test_example(self):
        # This test won't run under Python 3
        pass
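
As a side note, if six is not already a dependency in your project, the same check can be expressed with the standard library only, using sys.version_info (a minimal sketch, equivalent to the example above):

import sys
import unittest


class MyOtherTestCase(unittest.TestCase):

    @unittest.skipIf(sys.version_info[0] == 3, "not compatible with Python 3")
    def test_example(self):
        # This test won't run under Python 3
        pass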

Credits

Thanks to my colleague Nicola for giving me the inspiration to write this post.

21 Feb 2019 8:00pm GMT

PyCharm: Webinar Recording: “Demystifying Python’s async and await Keywords” with Michael Kennedy

Yesterday we hosted a webinar with Michael Kennedy from the Talk Python To Me podcast and training site, presenting Demystifying Python's async and await Keywords. It turned out to be the highest-rated webinar in 7 years of JetBrains webinars. Thanks Michael! The webinar recording is now available, as well as a repository with the Python code he showed and the slides he used.

During the webinar, Michael laid out the basics of async programming in Python, detailing CPU parallelism versus I/O parallelism and showing how each form of parallelism affects rendering times.

In the code, he started with a basic, naive function that ran horribly slow, then gradually sped it up with different Python techniques (generators, async/await, etc.). He also covered companion libraries that have emerged in the Python ecosystem.
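
As a flavour of the topic (this snippet is not from the webinar, just a minimal illustration of I/O parallelism with async/await, requiring Python 3.7+):

import asyncio


async def fetch(delay):
    # Stand-in for a slow I/O-bound call, e.g. an HTTP request
    await asyncio.sleep(delay)
    return delay


async def main():
    # The three "requests" run concurrently, so this takes about 1 second, not 3
    results = await asyncio.gather(fetch(1), fetch(1), fetch(1))
    print(results)


asyncio.run(main())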

For those who want a deep-dive on this topic, Michael has a 4-hour course, Async Techniques and Examples in Python. He also has a thorough Mastering PyCharm course.

Thanks so much to Michael for this well-prepared, well-presented webinar and staying late to handle the record-number of questions.

-PyCharm Team-
The Drive to Develop

21 Feb 2019 5:41pm GMT

Stack Abuse: Converting Python Scripts to Executable Files

Introduction

In this tutorial, we will explore the conversion of Python scripts to Windows executable files in four simple steps. Although there are many ways to do it, we'll be covering what is, according to popular opinion, the simplest one.

This tutorial has been designed after reviewing many common errors that people face while performing this task, and hence contains detailed information on installing and setting up all the dependencies as well. Feel free to skip any step if you already have those dependencies installed. Without any further ado, let's start.

Step 1: Install cURL

cURL provides a library and command line tool for transferring data using various protocols. We need it to download the pip package manager in the next step. Many of you would already have it set up, which you can check by running the following command:

$ curl --version

If the command above returns a curl version, you can skip the next instructions in this step. As for the rest of you, you can install curl by following these three steps:

  1. Go to https://curl.haxx.se/dlwiz/?type=bin&os=Win64&flav=-&ver=*&cpu=x86_64
  2. Download the curl package which matches your system's specifications (32-bit/64-bit)
  3. Unzip the file and go to the bin folder, you can find the curl.exe file there

However, this means that you can only use the curl command in that particular folder. In order to be able to use the curl command from anywhere on your machine, right-click on curl.exe, click on "Properties" and copy the "Location" value. After that, right-click on "My PC" and click on "Properties". In the option panel on the left, select the option "Advanced System Settings". It has been highlighted in the screenshot below.

In the window that appears, click "Environment Variables" near the bottom right. It has been highlighted in the screenshot below.

In the next window, find and double click on the user variable named "Path", then click on "New". A new text box will be created in that window; paste the "Location" value of the "curl.exe" file that you copied earlier, and then click on 'OK'.

cURL should now be accessible from anywhere in your system. Confirm your installation by running the command below:

$ curl --version

Let's go to the next step.

Step 2: Install pip

In this step, we will install pip, which is basically a package manager for Python packages. We need it in the next step to install the pyinstaller library. Most of you will already have it set up; to check, run the following command:

$ pip --version

If the command above returned a pip version, you can skip the next instructions in this step.

As for the rest, you can install pip by running the following two commands in the command prompt:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

$ python get-pip.py

That's it. Pip has now been installed to your local machine! You can run the following command for confirmation:

$ pip --version

Before moving to the next step, you need to repeat what we did for curl.exe so that you can access the pip command from anywhere in your machine, but this time we'll be doing so for "pip.exe".

Hit the Windows key and search for "pip.exe", then right-click on the first search result and click on "Open File Location"; it will take you to the folder in which that file is located. Right-click on the "pip.exe" file and then select "Properties". After that, copy the "Location" value and paste it into the Path variable just like we did in Step 1.

Step 3: Install PyInstaller

In this step, we'll install pyinstaller using pip. We need pyinstaller to convert our Python scripts into executable (.exe) files. You just need to copy paste the command below into your command prompt and run it:

$ pip install pyinstaller

Again, to confirm your installation, run the following command:

$ pyinstaller --version

Note: If you have Anaconda installed on your system, then you're probably using the conda package manager instead. In that case, run the following commands, in sequence:

$ conda install -c conda-forge pyinstaller
$ conda install -c anaconda pywin32

This step marks the end of all installations. In the next step, we'll be converting our Python files to an executable with just a single command.

Step 4: Convert Python Files to Executables

This is the last step. We'll use pyinstaller to convert our .py files to .exe with a single command. So, let's do it!

Open up the command prompt and navigate to the directory that your Python file/script is located in. Alternatively, you can open that directory using File Explorer, Shift + right-click inside it and then select "Open Command Prompt in this folder". Before converting your file, you should check that your file works as expected. For that purpose, I have written a basic Python script which prints the number 10 when executed.

Let's run the script and see if it works fine before converting it to an executable file. Run the following command on your command prompt:

$ python name_of_your_file.py

In my case, the filename was 'sum.py'.
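
The script itself isn't shown in this post; a minimal hypothetical stand-in that matches the described behaviour (printing the number 10) could be:

# sum.py - hypothetical example, any script that prints 10 will do
print(5 + 5)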

To create a standalone executable file in the same directory as your Python file, run the following command:

$ pyinstaller --onefile <file_name>.py

This instruction might take some time to complete. Upon completion, it will generate three folders. You can find the executable file in the 'dist' folder. Please note that the "onefile" argument tells pyinstaller to create a single executable file only.

Let's now run our executable file to see if the procedure worked!

Ta-da! It worked just as expected.

A little tip: if your executable file closes too fast for you to notice the output, you can add an input() line at the end of your Python file, which keeps the prompt open while waiting for user input. That is how I was able to take a screenshot of my output as well.
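
Assuming the hypothetical sum.py sketched above, the file would then end like this:

# sum.py
print(5 + 5)
input("Press Enter to exit...")  # keeps the console window open until Enter is pressed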

Also note that if your executable depends on any other executable files, like phantomjs, you need to keep them in the same directory as your Python file so that pyinstaller can include them in the executable.

Conclusion

In this tutorial, we discussed in detail the conversion of Python scripts to executable files using Python's pyinstaller library in four steps. We started by installing cURL, followed by pip and pyinstaller. Lastly, we converted a sample Python file to executable to ensure that the procedure works on Windows.

21 Feb 2019 4:05pm GMT

Dataquest: Tutorial: Find Dominant Colors in an Image through Clustering

Analyzing images with code can be difficult. How do you make your code "understand" the context of an image?

In general, the first step of analyzing images with AI is finding the dominant colors. In this tutorial, we're going to find dominant colors in images using matplotlib's image class. Finding dominant colors is also something you can do with third-party APIs, but we're going to build our own system for doing this so that we have total control over the process.

We will first look at converting an image into its component colors in the form of a matrix, and then perform k-means clustering on it to find the dominant colors.

Prerequisites

This tutorial assumes that you know basics of Python, but you don't need to have worked with images in Python before.

This tutorial is based on the following:

  • Python version 3.6.5
  • matplotlib version 2.2.3: to decode images and visualize dominant colors
  • scipy version 1.1.0: to perform clustering that determines dominant colors

The packages matplotlib and scipy can be installed through a package manager - pip. You may want to install specific versions of packages in a virtual environment to make sure there are no clashes with dependencies of other projects you are working on.

pip install matplotlib==2.2.3
pip install scipy==1.1.0

Further, we analyze JPG images in this tutorial, the support for which is available only when you install an additional package, pillow.

pip install pillow==5.2.0

Alternatively, you can just use a Jupyter notebook. The code for this tutorial was run on a Jupyter notebook in Anaconda version 1.8.7. The packages above come pre-installed with Anaconda.

import matplotlib
matplotlib.__version__

'3.0.2'

import PIL
PIL.__version__

'5.3.0'

import scipy
scipy.__version__

'1.1.0'

Decoding images

Images may have various extensions - JPG, PNG, TIFF are common. This post focuses on JPG images only, but the process for other image formats should not be very different. The first step in the process is to read the image.

An image with a JPG extension is stored in memory as a list of dots, known as pixels. A pixel, or a picture element, represents a single dot in an image. The color of the dot is determined by a combination of three values - its three component colors (Red, Blue and Green). The color of the pixel is essentially a combination of these three component colors.

[Image illustrating how the color of a pixel is built from its red, green and blue components. Image source: Datagenetics]

Let us use Dataquest's logo for the purpose of finding dominant colors in the image. You can download the image here.

To read an image in Python, you need to import the image class of matplotlib (documentation). The imread() method of the image class decodes an image into its RGB values. The output of the imread() method is an array with the dimensions M x N x 3, where M and N are the dimensions of the image.

from matplotlib import image as img

image = img.imread('./dataquest.jpg')
image.shape

(200, 200, 3)

You can use the imshow() method of matplotlib's pyplot class to display an image, which is in the form of a matrix of RGB values.

%matplotlib inline

from matplotlib import pyplot as plt

plt.imshow(image)
plt.show()

[Image: the Dataquest logo displayed with plt.imshow()]

The matrix we get from the imread() method depends on the type of the image being read. For instance, PNG images would also include an element measuring a pixel's level of transparency. This post will only cover JPG images.

Before moving on to clustering the images, we need to perform an additional step. In the process of finding out the dominant colors of an image, we are not concerned about the position of the pixel. Hence, we need to convert the M x N x 3 matrix to three individual lists, which contain the respective red, blue and green values. The following snippet converts the matrix stored in image into three individual lists, each of length 40,000 (200 x 200).

r = []
g = []
b = []

for line in image:
    for pixel in line:
        temp_r, temp_g, temp_b = pixel
        r.append(temp_r)
        g.append(temp_g)
        b.append(temp_b)

The snippet above creates three empty lists, and then loops through each pixel of our image, appending the RGB values to our r, g, and b lists, respectively. If done correctly, each list will have a length of 40,000 (200 x 200).

Clustering Basics

Now that we have stored all of the component colors of our image, it is time to find the dominant colors. Let us now take a moment to understand the basics of clustering and how it is going to help us find the dominant colors in an image.

Clustering is a technique that helps in grouping similar items together based on particular attributes. We are going to apply k-means clustering to the three lists of color values that we have just created above.

The colors at each cluster center will reflect the average of the attributes of all members of a cluster, and that will help us determine the dominant colors in the image.

How Many Dominant Colors?

Before we perform k-means clustering on the pixel data points, it might be good for us to figure out how many clusters are ideal for a given image, since not all images will have the same number of dominant colors.

Since we are dealing with three variables for clustering - the Red, Blue and Green values of pixels - we can visualize these variables on three dimensions to understand how many dominant colors may exist.

To make a 3D plot in matplotlib, we will use the Axes3D() class (documentation). After initializing the axes using the Axes3D() class, we call the scatter() method with the three lists of color values as arguments.

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(r, g, b)
plt.show()

[Image: 3D scatter plot of the red, green and blue values]

In the resultant plot, we can see that the distribution of points forms two elongated clusters. This matches what we see when we look at the image itself, which is primarily composed of two colors. Therefore, we will focus on creating two clusters in the next section.

First, though, note that a 3D plot may not yield distinct clusters for some images. Additionally, if we were using a PNG image, we would have a fourth variable (each pixel's transparency value), which would make plotting in three dimensions impossible. In such cases, you may need to use the elbow method to determine the ideal number of clusters; a rough sketch of that approach follows.
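Here is a minimal sketch of the elbow method (variable names are mine; it reuses the r, g and b lists built above and applies the same whiten() standardization that we use in the next section): we run kmeans() for several candidate values of k, record the distortion each time, and look for the "elbow" where the curve flattens out.

import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.vq import whiten, kmeans

# Stack the three color lists into a (num_pixels, 3) array and standardize it
pixels = whiten(np.array([r, g, b]).T.astype(float))

distortions = []
num_clusters = range(1, 7)
for k in num_clusters:
    # kmeans() returns the cluster centers and the total distortion for this k
    centers, distortion = kmeans(pixels, k)
    distortions.append(distortion)

# The "elbow" in this curve suggests a reasonable number of clusters
plt.plot(list(num_clusters), distortions, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Distortion')
plt.show()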

Perform Clustering in SciPy

In the previous step, we've determined that we'd like two clusters, and now we are ready to perform k-means clustering on the data. Let's create a Pandas data frame to easily manage the variables.

import pandas as pd

df = pd.DataFrame({'red': r,
                   'blue': b,
                   'green': g})

There are essentially three steps involved in the process of k-means clustering with SciPy:

  1. Standardize the variables by dividing each data point by its standard deviation. We will use the whiten() function from scipy's vq module.
  2. Generate cluster centers using the kmeans() function.
  3. Generate cluster labels for each data point using the vq() function from the same module.

The first step above ensures that the variations in each variable affect the clusters equally. Imagine two variables with largely different scales. If we ignore the first step above, the variable with the larger scale and variation would have a larger impact on the formation of clusters, thus making the process biased. We therefore standardize the variables using the whiten() function. The whiten() function takes one argument, a list or array of the values of a variable, and returns the standardized values. After standardization, we print a sample of the data frame. Notice that the variation in the standardized columns is considerably smaller than in the original columns.

from scipy.cluster.vq import whiten

df['scaled_red'] = whiten(df['red'])
df['scaled_blue'] = whiten(df['blue'])
df['scaled_green'] = whiten(df['green'])
df.sample(n = 10)

       red  blue  green  scaled_red  scaled_blue  scaled_green
24888  255   255    255    3.068012     3.590282      3.170015
38583  255   255    255    3.068012     3.590282      3.170015
2659    67    91     72    0.806105     1.281238      0.895063
36954  255   255    255    3.068012     3.590282      3.170015
39967  255   255    255    3.068012     3.590282      3.170015
13851   75   101     82    0.902357     1.422033      1.019377
14354   75   101     82    0.902357     1.422033      1.019377
18613   75   100     82    0.902357     1.407954      1.019377
6719    75   100     82    0.902357     1.407954      1.019377
14299   71   104     82    0.854231     1.464272      1.019377

The next step is to perform k-means clustering with the standardized columns. We will use the kmeans() function to perform the clustering. The kmeans() function (documentation) has two required arguments - the observations and the number of clusters. It returns two values - the cluster centers and the distortion. Distortion is the sum of squared distances between each point and its nearest cluster center. We will not be using distortion in this tutorial.

from scipy.cluster.vq import kmeans

cluster_centers, distortion = kmeans(df[['scaled_red', 'scaled_green', 'scaled_blue']], 2)

The final step in k-means clustering is generating cluster labels. However, in this exercise, we won't need to do that. We are only looking for the dominant colors, which are represented by the cluster centers.
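For completeness, if you did need per-pixel labels, a sketch of how they could be generated with vq() (not required for this tutorial) looks like this:

from scipy.cluster.vq import vq

# Assign each pixel to its nearest cluster center; vq() returns the labels
# and the distance of each observation from its assigned center.
labels, distances = vq(df[['scaled_red', 'scaled_green', 'scaled_blue']], cluster_centers)
print(labels[:10])  # cluster index (0 or 1) for the first ten pixels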

Display Dominant Colors

We have performed k-means clustering and generated our cluster centers, so let's see what values they contain.

print(cluster_centers)

[[2.94579782 3.1243935  3.52525635]
 [0.91860119 1.05099931 1.4465091 ]]

As you can see, the results we get are standardized versions of RGB values. To get the original color values we need to multiply them by their respective standard deviations.

We will display the colors in the form of a palette using the imshow() method of matplotlib's pyplot class. However, to display colors, imshow() needs the values of RGB in the range of 0 to 1, where 1 signifies 255 in our original scale of RGB values. We therefore must divide each RGB component of our cluster centers by 255 in order to get a value between 0 and 1, and display them through the imshow() method.

Finally, we have one more consideration before we plot the colors using the imshow() function (documentation). The dimensions of the cluster centers are N x 3, where N is the number of clusters. imshow() is originally intended to display an A x B matrix of colors, so it expects a 3D array of dimensions A x B x 3 (three color elements for each block in the palette). Hence, we need to convert our N x 3 matrix to a 1 x N x 3 matrix by passing the colors of the cluster centers as a list with a single element. For instance, if we stored our colors in colors we need to pass [colors] as an argument to imshow().

Let's explore the dominant colors in our image.

colors = []

r_std, g_std, b_std = df[['red', 'green', 'blue']].std()

for cluster_center in cluster_centers:
    scaled_r, scaled_g, scaled_b = cluster_center
    colors.append((
        scaled_r * r_std / 255,
        scaled_g * g_std / 255,
        scaled_b * b_std / 255
    ))
plt.imshow([colors])
plt.show()

[Image: palette showing the two dominant colors]

As expected, the colors seen in the palette are quite similar to the prominent colors in the image that we started with. That said, you've probably noticed that the light blue color above doesn't actually appear in our source image. Remember, each cluster center is the mean of the RGB values of all the pixels in that cluster. So, the resultant cluster center may not actually be a color in the original image; it is just the RGB value that sits at the center of the cluster of all similar-looking pixels from our image.

Conclusion

In this post, we looked at a step by step implementation for finding the dominant colors of an image in Python using matplotlib and scipy. We started with a JPG image and converted it to its RGB values using the imread() method of the image class in matplotlib. We then performed k-means clustering with scipy to find the dominant colors. Finally, we displayed the dominant colors using the imshow() method of the pyplot class in matplotlib.

Want to learn more Python skills? Dataquest offers a full sequence of courses that can take you from zero to data scientist in Python. Sign up and start learning for free!

21 Feb 2019 3:00pm GMT

Stack Abuse: Python Performance Optimization

Introduction

Resources are never sufficient to meet growing needs in most industries, and now especially in technology as it carves its way deeper into our lives. Technology makes life easier and more convenient and it is able to evolve and become better over time.

This increased reliance on technology has come at the expense of the computing resources available. As a result, more powerful computers are being developed and the optimization of code has never been more crucial.

Application performance requirements are rising more than our hardware can keep up with. To combat this, people have come up with many strategies to utilize resources more efficiently - Containerizing, Reactive (Asynchronous) Applications, etc.

The first step we should take, though, and by far the easiest one, is code optimization. We need to write code that performs better and uses fewer computing resources.

In this article, we will optimize common patterns and procedures in Python programming in an effort to boost the performance and enhance the utilization of the available computing resources.

Problem with Performance

As software solutions scale, performance becomes more crucial and issues become more pronounced and visible. When we are writing code on our localhost, it is easy to miss some performance issues since usage is not intense. Once the same software is deployed for thousands and hundreds of thousands of concurrent end-users, the issues become more serious.

Slowness is one of the main issues to creep up when software is scaled. This is characterized by increased response time. For instance, a web server may take longer to serve web pages or send responses back to clients when the requests become too many. Nobody likes a slow system especially since technology is meant to make certain operations faster, and usability will decline if the system is slow.

When software is not optimized to utilize available resources well, it will end up requiring more resources to ensure it runs smoothly. For instance, if memory management is not handled well, the program will end up requiring more memory, hence resulting in upgrading costs or frequent crashes.

Inconsistency and erroneous output is another result of poorly optimized programs. These points highlight the need for optimization of programs.

Why and When to Optimize

When building for large scale use, optimization is a crucial aspect of software to consider. Optimized software is able to handle a large number of concurrent users or requests while easily maintaining its level of performance in terms of speed.

This leads to overall customer satisfaction since usage is unaffected. This also leads to fewer headaches when an application crashes in the middle of the night and your angry manager calls you to fix it instantly.

Computing resources are expensive and optimization can come in handy in reducing operational costs in terms of storage, memory, or computing power.

But when do we optimize?

It is important to note that optimization may negatively affect the readability and maintainability of the codebase by making it more complex. Therefore, it is important to consider the result of the optimization against the technical debt it will raise.

If we are building large systems which expect a lot of interaction by the end users, then we need our system working at its best, and this calls for optimization. Also, if we have limited resources in terms of computing power or memory, optimization will go a long way in ensuring that we can make do with the resources available to us.

Profiling

Before we can optimize our code, it has to be working. This way we can tell how it performs and utilizes resources. And this brings us to the first rule of optimization - Don't.

As Donald Knuth - a mathematician, computer scientist, and professor at Stanford University put it:

"Premature optimization is the root of all evil."

The solution has to work for it to be optimized.

Profiling entails scrutiny of our code and analyzing its performance in order to identify how our code performs in various situations and areas of improvement if needed. It will enable us to identify the amount of time that our program takes or the amount of memory it uses in its operations. This information is vital in the optimization process since it helps us decide whether to optimize our code or not.

Profiling can be a challenging undertaking and take a lot of time, and if done manually some issues that affect performance may be missed. Tools such as the Timeit module and the cProfile profiler, both used later in this article, can help us profile code faster and more efficiently.

Profiling will help us identify areas to optimize in our code. Let us discuss how choosing the right data structure or control flow can help our Python code perform better.

Choosing Data Structures and Control Flow

The choice of data structure in our code or algorithm implemented can affect the performance of our Python code. If we make the right choices with our data structures, our code will perform well.

Profiling can be of great help to identify the best data structure to use at different points in our Python code. Are we doing a lot of inserts? Are we deleting frequently? Are we constantly searching for items? Such questions can help guide us in choosing the correct data structure for the need and consequently result in optimized Python code.

Time and memory usage will be greatly affected by our choice of data structure. It is also important to note that some data structures are implemented differently in different programming languages.

For Loop vs List Comprehensions

Loops are common when developing in Python, and soon enough you will come across list comprehensions, a concise way to create new lists that also supports conditions.

For instance, if we want to get a list of the squares of all even numbers in a certain range using the for loop:

new_list = []  
for n in range(0, 10):  
    if n % 2 == 0:
        new_list.append(n**2)

A List Comprehension version of the loop would simply be:

new_list = [ n**2 for n in range(0,10) if n%2 == 0]  

The list comprehension is shorter and more concise, but that is not its only trick: list comprehensions are also notably faster in execution time than for loops. We will use the Timeit module, which provides a way to time small bits of Python code.

Let us put the list comprehension against the equivalent for loop and see how long each takes to achieve the same result:

import timeit

def for_square(n):
    new_list = []
    for i in range(0, n):
        if i % 2 == 0:
            # square i (not n) so this matches the list comprehension version below
            new_list.append(i**2)
    return new_list

def list_comp_square(n):  
    return [i**2 for i in range(0, n) if i % 2 == 0]

print("Time taken by For Loop: {}".format(timeit.timeit('for_square(10)', 'from __main__ import for_square')))

print("Time taken by List Comprehension: {}".format(timeit.timeit('list_comp_square(10)', 'from __main__ import list_comp_square')))  

After running the script 5 times using Python 2:

$ python for-vs-lc.py 
Time taken by For Loop: 2.56907987595  
Time taken by List Comprehension: 2.01556396484  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 2.37083697319  
Time taken by List Comprehension: 1.94110512733  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 2.52163410187  
Time taken by List Comprehension: 1.96427607536  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 2.44279003143  
Time taken by List Comprehension: 2.16282701492  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 2.63641500473  
Time taken by List Comprehension: 1.90950393677  

While the difference is not constant, the list comprehension is taking less time than the for loop. In small-scale code, this may not make that much of a difference, but at large-scale execution, it may be all the difference needed to save some time.

If we increase the range of squares from 10 to 100, the difference becomes more apparent:

$ python for-vs-lc.py 
Time taken by For Loop: 16.0991549492  
Time taken by List Comprehension: 13.9700510502  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 16.6425571442  
Time taken by List Comprehension: 13.4352738857  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 16.2476081848  
Time taken by List Comprehension: 13.2488780022  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 15.9152050018  
Time taken by List Comprehension: 13.3579590321  

cProfile is a profiler that comes with Python and if we use it to profile our code:

[Image: cProfile analysis of for-vs-lc.py]
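The screenshot of the report isn't reproduced here, but you can generate an equivalent one yourself, either by running python -m cProfile -s cumulative for-vs-lc.py from the command line, or by profiling the two functions directly at the bottom of the script (a sketch, using the function names defined above):

import cProfile

# Profile each implementation and sort the report by cumulative time.
# Assumes for_square() and list_comp_square() are defined in this script.
cProfile.run('for_square(100)', sort='cumulative')
cProfile.run('list_comp_square(100)', sort='cumulative')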

Upon further scrutiny, we can still see that the cProfile tool reports that our List Comprehension takes less execution time than our For Loop implementation, as we had established earlier. cProfile displays all the functions called, the number of times they have been called and the amount of time taken by each.

If our intention is to reduce the time taken by our code to execute, then the List Comprehension would be a better choice over using the For Loop. The effect of such a decision to optimize our code will be much clearer at a larger scale and shows just how important, but also easy, optimizing code can be.

But what if we are concerned about our memory usage? Removing items from a list via a list comprehension requires more memory than doing it with a normal loop: a list comprehension always creates a new list in memory upon completion, so deleting items this way means building a second list. With a normal for loop, on the other hand, we can use list.remove() or list.pop() to modify the original list in place instead of creating a new one in memory, as illustrated below.
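As a rough illustration (the values here are arbitrary), here are the two approaches side by side:

# Filtering with a list comprehension builds a brand new list in memory
numbers = [1, 2, 3, 4, 5, 6]
evens = [n for n in numbers if n % 2 == 0]   # a second list now exists

# Removing items in place with a plain loop keeps just the original list;
# iterating by index from the end avoids skipping elements while popping
numbers = [1, 2, 3, 4, 5, 6]
for i in range(len(numbers) - 1, -1, -1):
    if numbers[i] % 2 != 0:
        numbers.pop(i)
print(numbers)  # [2, 4, 6]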

Again, in small-scale scripts it might not make much difference, but optimization pays off at a larger scale, and in that situation such memory savings free up extra memory for other operations.

Linked Lists

Another data structure that can come in handy to achieve memory saving is the Linked List. It differs from a normal array in that each item or node has a link or pointer to the next node in the list and it does not require contiguous memory allocation.

An array requires that the memory for it and its items be allocated upfront, and this can be quite expensive or wasteful when the size of the array is not known in advance.

A linked list will allow you to allocate memory as needed. This is possible because the nodes in the linked list can be stored in different places in memory but come together in the linked list through pointers. This makes linked lists a lot more flexible compared to arrays.

The caveat with a linked list is that the lookup time is slower than an array's due to the placement of the items in memory. Proper profiling will help you identify whether you need better memory or time management in order to decide whether to use a Linked List or an Array as your choice of the data structure when optimizing your code.
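Python's standard library does not ship a classic singly linked list, so here is a minimal illustrative sketch of the idea (class names are mine):

class Node:
    """A single element holding a value and a pointer to the next node."""
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedList:
    """A minimal singly linked list: append at the tail, iterate from the head."""
    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, value):
        node = Node(value)          # memory for each node is allocated only when needed
        if self.head is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node

    def __iter__(self):
        current = self.head
        while current is not None:  # lookup is sequential, hence slower than indexing an array
            yield current.value
            current = current.next

numbers = LinkedList()
for n in (1, 2, 3):
    numbers.append(n)
print(list(numbers))  # [1, 2, 3]

In practice, collections.deque from the standard library often covers the same need when what you really want is cheap appends and pops at either end.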

Range vs XRange

When dealing with loops in Python, sometimes we will need to generate a list of integers to assist us in executing for-loops. The functions range and xrange are used to this effect.

Their functionality is the same, but they differ in that range returns a list object while xrange returns an xrange object.

What does this mean? An xrange object is lazy, much like a generator: it is not the final list. It gives us the ability to generate the values in the expected final list as required during runtime, rather than materializing them all at once.

The fact that the xrange function does not return the final list makes it the more memory efficient choice for generating huge lists of integers for looping purposes.

If we need to generate a large number of integers for use, xrange should be our go-to option for this purpose since it uses less memory. If we use the range function instead, the entire list of integers will need to be created and this will get memory intensive.

Let us explore this difference in memory consumption between the two functions:

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)  
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.  
>>> import sys
>>> 
>>> r = range(1000000)
>>> x = xrange(1000000)
>>> 
>>> print(sys.getsizeof(r))
8000072  
>>> 
>>> print(sys.getsizeof(x))
40  
>>> 
>>> print(type(r))
<type 'list'>  
>>> print(type(x))
<type 'xrange'>  

We create a range of 1,000,000 integers using range and xrange. The type of object created by the range function is a List that consumes 8000072 bytes of memory while the xrange object consumes only 40 bytes of memory.

The xrange function saves us memory, loads of it, but what about item lookup time? Let's time the lookup time of an integer in the generated list of integers using Timeit:

import timeit

r = range(1000000)  
x = xrange(1000000)

def lookup_range():  
    return r[999999]

def lookup_xrange():  
    return x[999999]

print("Look up time in Range: {}".format(timeit.timeit('lookup_range()', 'from __main__ import lookup_range')))

print("Look up time in Xrange: {}".format(timeit.timeit('lookup_xrange()', 'from __main__ import lookup_xrange')))  

The result:

$ python range-vs-xrange.py 
Look up time in Range: 0.0959858894348  
Look up time in Xrange: 0.140854120255  
$ 
$ python range-vs-xrange.py 
Look up time in Range: 0.111716985703  
Look up time in Xrange: 0.130584001541  
$ 
$ python range-vs-xrange.py 
Look up time in Range: 0.110965013504  
Look up time in Xrange: 0.133008003235  
$ 
$ python range-vs-xrange.py 
Look up time in Range: 0.102388143539  
Look up time in Xrange: 0.133061170578  

xrange may consume less memory but takes more time to find an item in it. Given the situation and the available resources, we can choose either range or xrange depending on the aspect we are going for. This reiterates the importance of profiling in the optimization of our Python code.

Note: xrange was removed in Python 3, where the range function now provides the same lazy behaviour. Generators are still available in Python 3 and can help us save memory in other ways, for instance through generator expressions, as shown below.
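A quick sanity check on Python 3 (exact byte counts vary slightly between versions and platforms) shows that range is already lazy and that a generator expression behaves similarly:

import sys

r = range(1000000)                      # lazy range object, not a materialized list
g = (n ** 2 for n in range(1000000))    # generator expression, also lazy

print(sys.getsizeof(r))        # a few dozen bytes, independent of the range size
print(sys.getsizeof(g))        # likewise small and constant
print(sys.getsizeof(list(r)))  # materializing the list costs several megabytes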

Sets

When working with Lists in Python we need to keep in mind that they allow duplicate entries. What if it matters whether our data contains duplicates or not?

This is where Python Sets come in. They are like Lists but they do not allow any duplicates to be stored in them. Sets are also used to efficiently remove duplicates from Lists and are faster than creating a new list and populating it from the one with duplicates.

In this operation, you can think of them as a funnel or filter that holds back duplicates and only lets unique values pass.

Let us compare the two operations:

import timeit

# here we create a new list and add the elements one by one
# while checking for duplicates
def manual_remove_duplicates(list_of_duplicates):  
    new_list = []
    [new_list.append(n) for n in list_of_duplicates if n not in new_list]
    return new_list

# using a set is as simple as
def set_remove_duplicates(list_of_duplicates):  
    return list(set(list_of_duplicates))

list_of_duplicates = [10, 54, 76, 10, 54, 100, 1991, 6782, 1991, 1991, 64, 10]

print("Manually removing duplicates takes {}s".format(timeit.timeit('manual_remove_duplicates(list_of_duplicates)', 'from __main__ import manual_remove_duplicates, list_of_duplicates')))

print("Using Set to remove duplicates takes {}s".format(timeit.timeit('set_remove_duplicates(list_of_duplicates)', 'from __main__ import set_remove_duplicates, list_of_duplicates')))  

After running the script five times:

$ python sets-vs-lists.py 
Manually removing duplicates takes 2.64614701271s  
Using Set to remove duplicates takes 2.23225092888s  
$ 
$ python sets-vs-lists.py 
Manually removing duplicates takes 2.65356898308s  
Using Set to remove duplicates takes 1.1165189743s  
$ 
$ python sets-vs-lists.py 
Manually removing duplicates takes 2.53129696846s  
Using Set to remove duplicates takes 1.15646100044s  
$ 
$ python sets-vs-lists.py 
Manually removing duplicates takes 2.57102680206s  
Using Set to remove duplicates takes 1.13189387321s  
$ 
$ python sets-vs-lists.py 
Manually removing duplicates takes 2.48338890076s  
Using Set to remove duplicates takes 1.20611810684s  

Using a Set to remove duplicates is consistently faster than manually creating a list and adding items while checking for presence.

This could be useful when filtering entries for a giveaway contest, where we should filter out duplicate entries. Bear in mind that the timings above come from timeit repeating a 12-item example a million times; now imagine filtering out 10 000 entries, where the manual approach has to re-scan the growing list for every single item. On such a scale, the vastly increased performance that comes with Sets is significant.

This might not occur commonly, but it can make a huge difference when called upon. Proper profiling can help us identify such situations, and can make all the difference in the performance of our code.

String Concatenation

Strings are immutable in Python, and as a result String concatenation can be quite slow. There are several ways of concatenating strings that apply to various situations.

We can use the + (plus) operator to join strings. This is ideal for a few String objects, but not at scale. If you use the + operator to concatenate multiple strings, each concatenation will create a new object since Strings are immutable. This results in the creation of many new String objects in memory and hence poor utilization of memory.

We can also use the concatenation operator += to join strings, but it appends only one string at a time, unlike the + operator, which can chain more than two strings in a single expression.

If we have an iterable such as a List that holds multiple Strings, the ideal way to concatenate them is by using the .join() method.

Let us create a list of a thousand words and compare how the .join() and the += operator compare:

import timeit

# create a list of 1000 words
list_of_words = ["foo "] * 1000

def using_join(list_of_words):  
    return "".join(list_of_words)

def using_concat_operator(list_of_words):  
    final_string = ""
    for i in list_of_words:
        final_string += i
    return final_string

print("Using join() takes {} s".format(timeit.timeit('using_join(list_of_words)', 'from __main__ import using_join, list_of_words')))

print("Using += takes {} s".format(timeit.timeit('using_concat_operator(list_of_words)', 'from __main__ import using_concat_operator, list_of_words')))  

After two tries:

$ python join-vs-concat.py 
Using join() takes 14.0949640274 s  
Using += takes 79.5631570816 s  
$ 
$ python join-vs-concat.py 
Using join() takes 13.3542580605 s  
Using += takes 76.3233859539 s  

It is evident that the .join() method is not only neater and more readable, but it is also significantly faster than the concatenation operator when joining Strings in an iterator.

If you're performing a lot of String concatenation operations, enjoying the benefits of an approach that's almost 6 times faster is wonderful.

Conclusion

We have established that the optimization of code is crucial in Python, and we also saw the difference it makes as the code scales. Through the Timeit module and cProfile profiler, we have been able to tell which implementation takes less time to execute and backed it up with the figures. The data structures and control flow structures we use can greatly affect the performance of our code, so we should choose them carefully.

Profiling is also a crucial step in code optimization since it guides the optimization process and makes it more accurate. We need to be sure that our code works and is correct before optimizing it to avoid premature optimization which might end up being more expensive to maintain or will make the code hard to understand.

21 Feb 2019 2:20pm GMT

Stack Abuse: Python Performance Optimization

Introduction

Resources are never sufficient to meet growing needs in most industries, and now especially in technology as it carves its way deeper into our lives. Technology makes life easier and more convenient and it is able to evolve and become better over time.

This increased reliance on technology has come at the expense of the computing resources available. As a result, more powerful computers are being developed and the optimization of code has never been more crucial.

Application performance requirements are rising more than our hardware can keep up with. To combat this, people have come up with many strategies to utilize resources more efficiently - Containerizing, Reactive (Asynchronous) Applications, etc.

Though, the first step we should take, and by far the easiest one to take into consideration, is code optimization. We need to write code that performs better and utilizes less computing resources.

In this article, we will optimize common patterns and procedures in Python programming in an effort to boost the performance and enhance the utilization of the available computing resources.

Problem with Performance

As software solutions scale, performance becomes more crucial and issues become more grand and visible. When we are writing code on our localhost, it is easy to miss some performance issues since usage is not intense. Once the same software is deployed for thousands and hundreds of thousands of concurrent end-users, the issues become more elaborate.

Slowness is one of the main issues to creep up when software is scaled. This is characterized by increased response time. For instance, a web server may take longer to serve web pages or send responses back to clients when the requests become too many. Nobody likes a slow system especially since technology is meant to make certain operations faster, and usability will decline if the system is slow.

When software is not optimized to utilize available resources well, it will end up requiring more resources to ensure it runs smoothly. For instance, if memory management is not handled well, the program will end up requiring more memory, hence resulting in upgrading costs or frequent crashes.

Inconsistency and erroneous output is another result of poorly optimized programs. These points highlight the need for optimization of programs.

Why and When to Optimize

When building for large scale use, optimization is a crucial aspect of software to consider. Optimized software is able to handle a large number of concurrent users or requests while maintaining the level of performance in terms of speed easily.

This leads to overall customer satisfaction since usage is unaffected. This also leads to fewer headaches when an application crashes in the middle of the night and your angry manager calls you to fix it instantly.

Computing resources are expensive and optimization can come in handy in reducing operational costs in terms of storage, memory, or computing power.

But when do we optimize?

It is important to note that optimization may negatively affect the readability and maintainability of the codebase by making it more complex. Therefore, it is important to consider the result of the optimization against the technical debt it will raise.

If we are building large systems which expect a lot of interaction by the end users, then we need our system working at the best state and this calls for optimization. Also, if we have limited resources in terms of computing power or memory, optimization will go a long way in ensuring that we can make do with the resources available to us.

Profiling

Before we can optimize our code, it has to be working. This way we can be able to tell how it performs and utilizes resources. And this brings us to the first rule of optimization - Don't.

As Donald Knuth - a mathematician, computer scientist, and professor at Stanford University put it:

"Premature optimization is the root of all evil."

The solution has to work for it to be optimized.

Profiling entails scrutiny of our code and analyzing its performance in order to identify how our code performs in various situations and areas of improvement if needed. It will enable us to identify the amount of time that our program takes or the amount of memory it uses in its operations. This information is vital in the optimization process since it helps us decide whether to optimize our code or not.

Profiling can be a challenging undertaking and take a lot of time and if done manually some issues that affect performance may be missed. To this effect, the various tools that can help profile code faster and more efficiently include:

Profiling will help us identify areas to optimize in our code. Let us discuss how choosing the right data structure or control flow can help our Python code perform better.

Choosing Data Structures and Control Flow

The choice of data structure in our code or algorithm implemented can affect the performance of our Python code. If we make the right choices with our data structures, our code will perform well.

Profiling can be of great help to identify the best data structure to use at different points in our Python code. Are we doing a lot of inserts? Are we deleting frequently? Are we constantly searching for items? Such questions can help guide us choose the correct data structure for the need and consequently result in optimized Python code.

Time and memory usage will be greatly affected by our choice of data structure. It is also important to note that some data structures are implemented differently in different programming languages.

For Loop vs List Comprehensions

Loops are common when developing in Python and soon enough you will come across list comprehensions, which are a concise way to create new lists which also support conditions.

For instance, if we want to get a list of the squares of all even numbers in a certain range using the for loop:

new_list = []  
for n in range(0, 10):  
    if n % 2 == 0:
        new_list.append(n**2)

A List Comprehension version of the loop would simply be:

new_list = [ n**2 for n in range(0,10) if n%2 == 0]  

The list comprehension is shorter and more concise, but that is not the only trick up its sleeve. They are also notably faster in execution time than for loops. We will use the Timeit module which provides a way to time small bits of Python code.

Let us put the list comprehension against the equivalent for loop and see how long each takes to achieve the same result:

import timeit

def for_square(n):  
    new_list = []
    for i in range(0, n):
        if i % 2 == 0:
            new_list.append(n**2)
    return new_list

def list_comp_square(n):  
    return [i**2 for i in range(0, n) if i % 2 == 0]

print("Time taken by For Loop: {}".format(timeit.timeit('for_square(10)', 'from __main__ import for_square')))

print("Time taken by List Comprehension: {}".format(timeit.timeit('list_comp_square(10)', 'from __main__ import list_comp_square')))  

After running the script 5 times using Python 2:

$ python for-vs-lc.py 
Time taken by For Loop: 2.56907987595  
Time taken by List Comprehension: 2.01556396484  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 2.37083697319  
Time taken by List Comprehension: 1.94110512733  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 2.52163410187  
Time taken by List Comprehension: 1.96427607536  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 2.44279003143  
Time taken by List Comprehension: 2.16282701492  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 2.63641500473  
Time taken by List Comprehension: 1.90950393677  

While the difference is not constant, the list comprehension is taking less time than the for loop. In small-scale code, this may not make that much of a difference, but at large-scale execution, it may be all the difference needed to save some time.

If we increase the range of squares from 10 to 100, the difference becomes more apparent:

$ python for-vs-lc.py 
Time taken by For Loop: 16.0991549492  
Time taken by List Comprehension: 13.9700510502  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 16.6425571442  
Time taken by List Comprehension: 13.4352738857  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 16.2476081848  
Time taken by List Comprehension: 13.2488780022  
$ 
$ python for-vs-lc.py 
Time taken by For Loop: 15.9152050018  
Time taken by List Comprehension: 13.3579590321  

cProfile is a profiler that comes with Python and if we use it to profile our code:

cprofile analysis

Upon further scrutiny, we can still see that the cProfile tool reports that our List Comprehension takes less execution time than our For Loop implementation, as we had established earlier. cProfile displays all the functions called, the number of times they have been called and the amount of time taken by each.

If our intention is to reduce the time taken by our code to execute, then the List Comprehension would be a better choice over using the For Loop. The effect of such a decision to optimize our code will be much clearer at a larger scale and shows just how important, but also easy, optimizing code can be.

But what if we are concerned about our memory usage? A list comprehension would require more memory to remove items in a list than a normal loop. A list comprehension always creates a new list in memory upon completion, so for deletion of items off a list, a new list would be created. Whereas, for a normal for loop, we can use the list.remove() or list.pop() to modify the original list instead of creating a new one in memory.

Again, in small-scale scripts, it might not make much difference, but optimization comes good at a larger scale, and in that situation, such memory saving will come good and allow us to use the extra memory saved for other operations.

Linked Lists

Another data structure that can come in handy to achieve memory saving is the Linked List. It differs from a normal array in that each item or node has a link or pointer to the next node in the list and it does not require contiguous memory allocation.

An array requires that memory required to store it and its items be allocated upfront and this can be quite expensive or wasteful when the size of the array is not known in advance.

A linked list will allow you to allocate memory as needed. This is possible because the nodes in the linked list can be stored in different places in memory but come together in the linked list through pointers. This makes linked lists a lot more flexible compared to arrays.

The caveat with a linked list is that the lookup time is slower than an array's due to the placement of the items in memory. Proper profiling will help you identify whether you need better memory or time management in order to decide whether to use a Linked List or an Array as your choice of the data structure when optimizing your code.

Range vs XRange

When dealing with loops in Python, sometimes we will need to generate a list of integers to assist us in executing for-loops. The functions range and xrange are used to this effect.

Their functionality is the same but they are different in that the range returns a list object but the xrange returns an xrange object.

What does this mean? An xrange object is a generator in that it's not the final list. It gives us the ability to generate the values in the expected final list as required during runtime through a technique known as "yielding".

The fact that the xrange function does not return the final list makes it the more memory efficient choice for generating huge lists of integers for looping purposes.

If we need to generate a large number of integers for use, xrange should be our go-to option for this purpose since it uses less memory. If we use the range function instead, the entire list of integers will need to be created and this will get memory intensive.

Let us explore this difference in memory consumption between the two functions:

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)  
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.  
>>> import sys
>>> 
>>> r = range(1000000)
>>> x = xrange(1000000)
>>> 
>>> print(sys.getsizeof(r))
8000072  
>>> 
>>> print(sys.getsizeof(x))
40  
>>> 
>>> print(type(r))
<type 'list'>  
>>> print(type(x))
<type 'xrange'>  

We create a range of 1,000,000 integers using range and xrange. The type of object created by the range function is a List that consumes 8000072 bytes of memory while the xrange object consumes only 40 bytes of memory.

The xrange function saves us memory, loads of it, but what about item lookup time? Let's time the lookup time of an integer in the generated list of integers using Timeit:

import timeit

r = range(1000000)  
x = xrange(1000000)

def lookup_range():  
    return r[999999]

def lookup_xrange():  
    return x[999999]

print("Look up time in Range: {}".format(timeit.timeit('lookup_range()', 'from __main__ import lookup_range')))

print("Look up time in Xrange: {}".format(timeit.timeit('lookup_xrange()', 'from __main__ import lookup_xrange')))  

The result:

$ python range-vs-xrange.py 
Look up time in Range: 0.0959858894348  
Look up time in Xrange: 0.140854120255  
$ 
$ python range-vs-xrange.py 
Look up time in Range: 0.111716985703  
Look up time in Xrange: 0.130584001541  
$ 
$ python range-vs-xrange.py 
Look up time in Range: 0.110965013504  
Look up time in Xrange: 0.133008003235  
$ 
$ python range-vs-xrange.py 
Look up time in Range: 0.102388143539  
Look up time in Xrange: 0.133061170578  

xrange may consume less memory but takes more time to find an item in it. Given the situation and the available resources, we can choose either of range or xrange depending on the aspect we are going for. This reiterates the importance of profiling in the optimization of our Python code.

Note: xrange is deprecated in Python 3 and the range function can now serve the same functionality. Generators are still available on Python 3 and can help us save memory in other ways such as Generator Comprehensions or Expressions.

Sets

When working with Lists in Python we need to keep in mind that they allow duplicate entries. What if it matters whether our data contains duplicates or not?

This is where Python Sets come in. They are like Lists but they do not allow any duplicates to be stored in them. Sets are also used to efficiently remove duplicates from Lists and are faster than creating a new list and populating it from the one with duplicates.

In this operation, you can think of them as a funnel or filter that holds back duplicates and only lets unique values pass.

Let us compare the two operations:

import timeit

# here we create a new list and add the elements one by one
# while checking for duplicates
def manual_remove_duplicates(list_of_duplicates):  
    new_list = []
    [new_list.append(n) for n in list_of_duplicates if n not in new_list]
    return new_list

# using a set is as simple as
def set_remove_duplicates(list_of_duplicates):  
    return list(set(list_of_duplicates))

list_of_duplicates = [10, 54, 76, 10, 54, 100, 1991, 6782, 1991, 1991, 64, 10]

print("Manually removing duplicates takes {}s".format(timeit.timeit('manual_remove_duplicates(list_of_duplicates)', 'from __main__ import manual_remove_duplicates, list_of_duplicates')))

print("Using Set to remove duplicates takes {}s".format(timeit.timeit('set_remove_duplicates(list_of_duplicates)', 'from __main__ import set_remove_duplicates, list_of_duplicates')))  

After running the script five times:

$ python sets-vs-lists.py 
Manually removing duplicates takes 2.64614701271s  
Using Set to remove duplicates takes 2.23225092888s  
$ 
$ python sets-vs-lists.py 
Manually removing duplicates takes 2.65356898308s  
Using Set to remove duplicates takes 1.1165189743s  
$ 
$ python sets-vs-lists.py 
Manually removing duplicates takes 2.53129696846s  
Using Set to remove duplicates takes 1.15646100044s  
$ 
$ python sets-vs-lists.py 
Manually removing duplicates takes 2.57102680206s  
Using Set to remove duplicates takes 1.13189387321s  
$ 
$ python sets-vs-lists.py 
Manually removing duplicates takes 2.48338890076s  
Using Set to remove duplicates takes 1.20611810684s  

Using a Set to remove duplicates is consistently faster than manually creating a list and adding items while checking for presence.

This could be useful when filtering entries for a giveaway contest, where we should filter out duplicate entries. Bear in mind that the timings above come from timeit repeating the operation a million times on a 12-element list; as the number of entries grows towards 10,000 and beyond, the gap widens further, because the manual approach rescans the new list for every element while a Set membership check is effectively constant-time. On such a scale, the increased performance that comes with Sets is significant.

This might not occur commonly, but it can make a huge difference when called upon. Proper profiling can help us identify such situations, and can make all the difference in the performance of our code.

String Concatenation

Strings are immutable in Python and, consequently, String concatenation can be quite slow. There are several ways of concatenating strings that apply to various situations.

We can use the + (plus) operator to join strings. This is fine for a few String objects but does not scale. Since Strings are immutable, every + concatenation creates a new object, so chaining many concatenations creates many intermediate String objects in memory and wastes it.
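
As a small illustration of that immutability (a minimal sketch; the variable name is only for this example), concatenating with + leaves the original string untouched and binds the name to a brand-new object:

greeting = "Hello"
print(id(greeting))
greeting = greeting + ", " + "world"
print(id(greeting))  # a different id: a brand-new String object was created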

We can also use the += operator to concatenate strings, but it only appends one string to the accumulated result at a time, whereas the + operator can chain several strings in a single expression.

If we have an iterable such as a List that contains multiple Strings, the ideal way to concatenate them is with the .join() method.

Let us create a list of a thousand words and compare how .join() and the += operator perform:

import timeit

# create a list of 1000 words
list_of_words = ["foo "] * 1000

def using_join(list_of_words):  
    return "".join(list_of_words)

def using_concat_operator(list_of_words):  
    final_string = ""
    for i in list_of_words:
        final_string += i
    return final_string

print("Using join() takes {} s".format(timeit.timeit('using_join(list_of_words)', 'from __main__ import using_join, list_of_words')))

print("Using += takes {} s".format(timeit.timeit('using_concat_operator(list_of_words)', 'from __main__ import using_concat_operator, list_of_words')))  

After two tries:

$ python join-vs-concat.py 
Using join() takes 14.0949640274 s  
Using += takes 79.5631570816 s  
$ 
$ python join-vs-concat.py 
Using join() takes 13.3542580605 s  
Using += takes 76.3233859539 s  

It is evident that the .join() method is not only neater and more readable, but also significantly faster than the concatenation operator when joining Strings from an iterable.

If you're performing a lot of String concatenation operations, enjoying the benefits of an approach that is five to six times faster is wonderful.

Conclusion

We have established that the optimization of code is crucial in Python, and we saw the difference it makes as the workload scales. Through the timeit module and the cProfile profiler we have been able to tell which implementation takes less time to execute, and we backed it up with figures. The data structures and control-flow constructs we use can greatly affect the performance of our code, so we should choose them carefully.

Profiling is also a crucial step in code optimization since it guides the optimization process and makes it more accurate. We need to be sure that our code works and is correct before optimizing it, to avoid premature optimization, which might end up being more expensive to maintain or might make the code hard to understand.

21 Feb 2019 2:20pm GMT

PyCharm: PyCharm 2019.1 EAP 5

PyCharm's Early Access Program (EAP) continues with its fifth version. Get it now from our website.

New in This Version

All-new Jupyter Notebooks


You may have read in our Python Developer Survey that over half of Python developers now use Python for data science. To better meet the needs of professional data scientists, we've been working hard on improving the data science experience in PyCharm. A lot of data science starts with Jupyter Notebooks, and we're happy to present our all-new support for working with these in PyCharm.

Why did we rebuild them from the ground up?

Our previous support had several technical limitations that prevented us from offering a truly great Jupyter experience, and also kept us from fixing many of the bugs that were reported with them.

The all new support presents Jupyter notebooks as a side-by-side view of the code and its output, highlighting the matching cells as you navigate through the file. PyCharm can now offer you the full code intelligence you expect from your professional IDE.

Another new feature is debugging of Jupyter cells: you can place a breakpoint, and step through what is happening to explore your analysis in detail.

A Professional Feature

We want to dedicate a lot of our efforts and resources to improving scientific tooling. For us to be able to do this, we're moving Jupyter notebooks into PyCharm Professional Edition. We've seen that the Jupyter notebook experience is essential to scientific Python users, and the group of scientific Python users is growing rapidly. We've made this decision to be able to meet the needs of data scientists better, and quicker.

We want your feedback!

Please try out this feature, and let us know how it fits your workflow. If you have any suggestions, please reach out to us by commenting on this post, or by going straight to our issue tracker.

What happens to the old Jupyter support?

As we are focusing our development efforts on making the new Jupyter notebooks experience as smooth as possible, we will no longer bundle the legacy support. The legacy code is available on GitHub, and Apache 2.0 licensed. We'd encourage anyone interested to fork this repo, and extend it as desired.

Further Improvements

Interested in Trying the New EAP?

Download this EAP from our website. Alternatively, you can use the JetBrains Toolbox App to stay up to date throughout the entire EAP.

With PyCharm 2019.1 we're moving to a new runtime environment: this EAP build already bundles the brand new JetBrains Runtime Environment (a customized version of JRE 11). Unfortunately, since this build uses the brand-new platform, the patch-update from previous versions is not available this time. Please use the full installation method instead.

If you tried 2019.1 EAP 3 or an earlier EAP of 2019.1: you may get an error about "MaxJavaStackTraceDepth=-1" when you start the IDE. If you get it, please remove that line from the custom JVM options. This is an incompatibility between the old JRE and the new one, and we apologize for any inconvenience.

If you're on Ubuntu 16.04 or later, you can use snap to get PyCharm EAP, and stay up to date. You can find the installation instructions on our website.

PyCharm 2019.1 is in development during the EAP phase, therefore not all new features are already available. More features will be added in the coming weeks. As PyCharm 2019.1 is pre-release software, it is not as stable as the release versions. Furthermore, we may decide to change and/or drop certain features as the EAP progresses.

All EAP versions will ship with a built-in EAP license, which means that these versions are free to use for 30 days after the day that they are built. As EAPs are released weekly, you'll be able to use PyCharm Professional Edition EAP for free for the duration of the EAP program, as long as you upgrade at least once every 30 days.

21 Feb 2019 12:24pm GMT

PyBites: Code Challenge 61 - Build a URL Shortener

There is an immense amount to be learned simply by tinkering with things. - Henry Ford

Hey Pythonistas,

Changing the PCC game a bit

Let's be honest here: we slacked off a bit on our blog code challenges! Apart from the overall increase in workload, our synchronous approach of solving a challenge before launching the next one is holding us back.

So we are going to change our approach a bit. We keep launching PyBites Code Challenges (PCCs) on our blog, because most importantly this is what gets YOU to write Python code!

However we are dropping hard deadlines and review posts. Solving them is an ongoing effort, and you can see merged solutions on the community branch (each challenge # has a dedicated folder).

You can collaborate with each other on our Slack (= dedicated #codechallenge channel). And of course keep PR'ing your code via our platform.

Want to code review / become a mentor?

We are still getting a manageable number of PRs, so we are able to merge them all in ourselves.

However we think it would be really cool to give each PR a bit more of a code check. Hence we also want to make this a community effort.

So if you want to help out merging PRs into our challenges branch, become a moderator. You can volunteer on Slack.

It's a great chance to read other people's code, honing your code review skills, and (last but not least!) to become a mentor, building up great relationships with other Pythonistas. Sound fair enough?


Back to business ... our new challenge:

In this challenge we're asking you to spice up your life with your very own URL Shortener!

We've all seen sites like bit.ly that allow you to shorten a URL into something... well... shorter! It's time for you to make your own.

There are roughly four parts to this challenge:

  1. Make a small Django/Flask/Bottle app that takes in a URL.

  2. Using the supplied URL, generate a unique shortened URL with the base pybit.es, keeping uniqueness in mind.

  3. Return the shortened URL.

  4. Bonus: track the visits in a second DB table for stats.

It sounds more complex than it is; breaking it down into these steps should help you tackle the problem more effectively (see the minimal sketch below).
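
To give you a feel for steps 1-3, here is a minimal sketch using Flask, with an in-memory dict standing in for the database; the /shorten endpoint, the pybit.es base URL and the 6-character codes are illustrative choices only, not part of the challenge spec:

import string
import random

from flask import Flask, request, redirect, jsonify

app = Flask(__name__)
url_store = {}  # short code -> original URL (swap for a real DB table)

def generate_code(length=6):
    # keep drawing random codes until we find one that is not taken yet
    chars = string.ascii_letters + string.digits
    while True:
        code = "".join(random.choice(chars) for _ in range(length))
        if code not in url_store:
            return code

@app.route("/shorten", methods=["POST"])
def shorten():
    original_url = request.form["url"]
    code = generate_code()
    url_store[code] = original_url
    return jsonify(short_url="https://pybit.es/{}".format(code))

@app.route("/<code>")
def follow(code):
    # step 4 (bonus) would record the visit here before redirecting
    return redirect(url_store[code])

if __name__ == "__main__":
    app.run()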

Good luck and have fun!

Ideas and feedback

If you have ideas for a future challenge or find any issues, open a GH Issue or reach out via Twitter, Slack or Email.

Last but not least: there is no best solution, only learning more and better Python. Good luck!

Become a Python Ninja

At PyBites you get to master Python through Code Challenges:


>>> from pybites import Bob, Julian

Keep Calm and Code in Python!

21 Feb 2019 11:00am GMT

10 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: King Williams Town Station

Yesterday morning I had to go to the station in KWT to pick up our reserved bus tickets for the Christmas holidays in Cape Town. The station itself has been without train service since December for cost reasons - but Translux and co., the long-distance bus companies, have their offices there.






© benste CC NC SA

10 Nov 2011 10:57am GMT

09 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein

Nobody is worried about something like this - you just drive through by car, and in the city - near Gnobie - "no, that only gets dangerous once the fire brigade is there" - 30 minutes later, on the way back, the fire brigade was there.




© benste CC NC SA

09 Nov 2011 8:25pm GMT

08 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Brai Party

Braai = barbecue evening or the like.

These would like to be technicians, patching their SpeakOn / jack plug splitters...

The ladies, the "mamas" of the settlement, at the official opening speech

Even though fewer people showed up than expected: loud music and lots of people ...

And of course a fire with real wood for grilling.

© benste CC NC SA

08 Nov 2011 2:30pm GMT

07 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux Server back to her room. While doing that we saw her new decoration.

Björn and Simphiwe carried the PC to Katja's school


© benste CC NC SA

07 Nov 2011 2:00pm GMT

06 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's guest family for a special party. First of all we visited some friends of Nelisa - yeah, the one I'm working with in Quigney, Katja's guest father's sister - who gave her a haircut.

African women usually get their hair done by arranging extensions, not by just cutting some hair like Europeans do.

In between she looked like this...

And then she was done - looks amazing considering the amount of hair she had last week - doesn't it ?

© benste CC NC SA

06 Nov 2011 7:45pm GMT

05 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: My Saturday

Somehow it occurred to me today that I need to restructure my blog posts a bit - if I only ever report on new places, I would basically have to be on a round trip. So here are a few things from my everyday life today.

First of all, Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm - Katja and Björn are now at their placements, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza. Robin, Rommel's wife, has been in Woodie Cape since Thursday to take care of a few things there. Anyway, this morning we treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist - Vodacom and Ethienne (the estate agent) - plus bringing the missing items to NeedsCamp on the way back.

Just after setting off on the dirt road we realized that we hadn't packed the things for NeedsCamp and Ethienne, but did have the pump for the water supply in the car.

So in East London we first drove to Farmerama - no, not the online game Farmville, but a shop with all kinds of things for a farm - in Berea, a northern part of town.

At Farmerama we got advice on a quick coupling that should make life with the pump easier, and we also brought a lighter pump in for repair, so that it isn't such a big effort every time the water runs out again.

Fego Caffé is in the Hemingways Mall; there we had to get the PIN and PUK of one of our data SIM cards, since unfortunately some digits got mixed up when entering the PIN. In any case, the shops in South Africa store data as sensitive as a PUK - which in principle gives access to a locked phone.

In the cafe Rommel then carried out a few online transactions with the 3G modem, which was working again - and which, by the way, now works perfectly in Ubuntu, my Linux system.

On the side I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltop's centres. The picture shows the UMTS coverage in NeedsCamp, Katja's place. 8ta is a new phone provider from Telkom; after Vodafone bought Telkom's shares in Vodacom, they have to build everything up from scratch. We decided to organize a free prepaid card to test, because who knows how accurate the coverage map above really is ... Before signing even the cheapest 24-month deal, you should know whether it works.

After that we went to Checkers in Vincent, looking for two hotplates for Woody Cape - R 129.00 each, so about 12€ for a two-plate cooker. As you can see in the background, there are already Christmas decorations - at the beginning of November, in South Africa, at a sunny, warm 25°C or more.

For lunch we treated ourselves to a Pakistani curry takeaway - highly recommended! Well, after we got back an hour or so ago, I also cleaned the fridge, which I had simply put outside this morning to defrost. Now it's clean again and free of its 3 m thick layer of ice...

Tomorrow ... I'll report on that separately ... but probably not until Monday, because then I'll be back in Quigney (East London) and have free internet.

© benste CC NC SA

05 Nov 2011 4:33pm GMT

31 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltop's Computer Centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

Pupils in the big classroom


The Trainer


School in Countryside


Adult Class in the Afternoon


"Town"


© benste CC NC SA

31 Oct 2011 4:58pm GMT

Benedict Stein: Technical Issues

What do you do in an internet cafe when your ADSL and fax line have been cut off before month's end? Well, my idea was sitting outside and eating some ice cream.
At least it's sunny and not as rainy as on the weekend.


© benste CC NC SA

31 Oct 2011 3:11pm GMT

30 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nellis Restaurant

For those who are traveling through Zastron: there is a very nice restaurant serving delicious food at reasonable prices.
In addition they're selling home-made juices, jams and honey.




interior


home made specialities - the shop in the shop


the Bar


© benste CC NC SA

30 Oct 2011 4:47pm GMT

29 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: The way back from J'burg

During the 10-12h trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadsides.

Plain Street


Orange River in its beginnings (near Lesotho)


Zastron Anglican Church


The Bridge in Between "Free State" and Eastern Cape next to Zastron


my new Background ;)


If you listen to Google Maps you'll end up traveling 50km of gravel road - as it had just been renewed we didn't have that many problems and saved 1h compared to going the official way with all its construction sites




Freeway


getting dark


© benste CC NC SA

29 Oct 2011 4:23pm GMT

28 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: How does a construction site actually work?

Sure, some things may be different and many the same - but a road construction site is an everyday sight in Germany, so how does that actually work in South Africa?

First of all - NO, no indigenous people digging with their hands - even though more manpower is used here, they are busy working with technology.

A perfectly normal "federal road"


and how it is being widened


looots of trucks


because here one side is completely closed over a long stretch, which results in traffic lights with a waiting time of 45 minutes


But at least they seem to be having fun ;) - as did we, since luckily we never had to wait longer than 10 minutes.

© benste CC NC SA

28 Oct 2011 4:20pm GMT