25 Aug 2016

Planet Python

Matthew Rocklin: Supporting Users in Open Source

What are the social expectations of open source developers to help users understand their projects? What are the social expectations of users when asking for help?

As part of developing Dask, an open source library with growing adoption, I directly interact with users over GitHub issues for bug reports, StackOverflow for usage questions, and a mailing list and live Gitter chat for community conversation. Dask is blessed with awesome users. These are researchers doing very cool work of high impact and with novel use cases. They report bugs and usage questions with such skill that it's clear that they are Veteran Users of open source projects.

Veteran Users are Heroes

It's not easy being a veteran user. It takes a lot of time to distill a bug down to a reproducible example, or a question into an MCVE, or to read all of the documentation to make sure that a conceptual question definitely isn't answered in the docs. And yet this effort really shines through, and it's incredibly valuable in making open source software better. These distilled reports are arguably more important than fixing the actual bug or writing the actual documentation.

Bugs occur in the wild, in code that is half related to the developer's library (like Pandas or Dask) and half related to the user's application. The veteran user works hard to pull away all of their code and data, creating a gem of an example that is trivial to understand and run anywhere that still shows off the problem.

This way the veteran user can show up with their problem to the development team and say "here is something that you will quickly understand to be a problem." On the developer side this is incredibly valuable. They learn of a relevant bug and immediately understand what's going on, without having to download someone else's data or understand their domain. This switches from merely convenient to strictly necessary when the developers deal with 10+ such reports a day.

Novice Users need help too

However there are a lot of novice users out there. We were all novice users once, and even if we are veterans today we are probably still novices at something else. Knowing what to do and how to ask for help is hard. Having the guts to walk into a chat room where people will quickly see that you're a novice is even harder. It's like using public transit in a deeply foreign language. Respect is warranted here.

I categorize novice users into two groups:

  1. Experienced technical novices, who are very experienced in their field and in technical things generally, but who don't yet have a thorough understanding of open source culture and how to ask questions smoothly. They're entirely capable of behaving like a veteran user if pointed in the right directions.
  2. Novice technical novices, who don't yet have the ability to distill their problems into the digestible nuggets that open source developers expect.

In the first case of technically experienced novices, I've found that being direct works surprisingly well. I used to be apologetic when asking people to submit MCVEs. Today I'm more blunt, and I find that this group doesn't seem to mind. I suspect that this group is accustomed to operating in situations where other people's time is very costly.

The second case, novice novice users, is more challenging for individual developers to handle one-by-one, both because novices are more common and because solving their problems often requires more time commitment. Instead open source communities often depend on broadcast and crowd-sourced solutions, like documentation, StackOverflow, or meetups and user groups. For example in Dask we strongly point people towards StackOverflow in order to build up a knowledge base of question-answer pairs. Pandas has done this well; almost every Pandas question you Google leads to a StackOverflow post, handling 90% of the traffic and improving the lives of thousands. Many projects simply don't have the human capital to hand-hold individuals through using the library.

In a few projects there are enough generous and experienced users that they're able to field questions from individual users. SymPy is a good example here. I learned open source programming within SymPy. Their community was broad enough that they were able to hold my hand as I learned Git, testing, communication practices and all of the other soft skills that we need to be effective in writing great software. The support structure of SymPy is something that I've never experienced anywhere else.

My Apologies

I've found myself becoming increasingly impolite when people ask me for certain kinds of extended help with their code. I've been trying to track down why this is and I think that it comes from a mismatch of social contracts.

Large parts of technical society have an (entirely reasonable) belief that open source developers are available to answer questions about how to use their projects. This was probably true at one point, when the stereotypical image of an open source developer was someone working out of their basement long into the night on things that relatively few enthusiasts bothered with. They were happy to engage and had the free time in which to do it.

In some ways things have changed a lot. We now have paid professionals building software that is used by thousands or millions of users. These professionals easily charge consulting fees of hundreds of dollars per hour for exactly the kind of assistance that people show up expecting for free under the previous model. These developers have to answer for how they spend their time when they're at work, and when they're not at work they now have families and kids that deserve just as much attention as their open source users.

Both of these cultures, the creative do-it-yourself basement culture and the more corporate culture, are important to the wonderful surge we've seen in open source software. How do we balance them? Should developers, like doctors or lawyers, perform pro bono work as part of their profession? Should grants specifically include paid time for community engagement and outreach? Should users, as part of receiving help, feel an obligation to improve documentation or stick around and help others?

Solutions?

I'm not sure what to do here. I feel an obligation to remain connected with users from a broad set of applications, even those that companies or grants haven't decided to fund. However at the same time I don't know how to say "I'm sorry, I simply don't have the time to help you with your problem" in a way that feels at all compassionate.

I think that people should still ask questions. I think that we need to foster an environment in which developers can say "Sorry. Busy." more easily. I think that we as a community need better resources to teach novice users to become veteran users.

25 Aug 2016 12:00am GMT

24 Aug 2016

Planet Python

Nikola: Automating Nikola rebuilds with Travis CI

In this guide, we'll set up Travis CI to rebuild a Nikola website and host it on GitHub Pages.

Why?

By using Travis CI to build your site, you can easily blog from anywhere you can edit text files. That means you can blog with only a web browser, using GitHub.com or a service like Prose.io. You also won't need to install Nikola and Python to write - or even a real computer: a mobile phone could probably access one of those services and let you write something.

Caveats

  • The build might take a couple minutes to finish (1:30 for the demo site; YMMV)
  • When you commit and push to GitHub, the site will be published unconditionally. If you don't have a copy of Nikola for local use, there is no way to preview your site.

What you need

  • A computer for the initial setup that can run Nikola and the Travis CI command-line tool (written in Ruby) - you need a Unix-like system (Linux, OS X, *BSD, etc.); Windows users should try Bash on Ubuntu on Windows (available in Windows 10 starting with Anniversary Update) or a Linux virtual machine.
  • A GitHub account (free)
  • A Travis CI account linked to your GitHub account (free)

Setting up Nikola

Start by creating a new Nikola site and customizing it to your liking. Follow the Getting Started guide. You might also want to add support for other input formats, namely Markdown, but this is not a requirement (unless you want to use Prose.io).

After you're done, you must configure deploying to GitHub in Nikola. Make your first deployment from your local computer and make sure your site works right. Don't forget to set up .gitignore. Moreover, you must set GITHUB_COMMIT_SOURCE = False - otherwise, Travis CI will go into an infinite loop.
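For reference, the relevant conf.py options look roughly like this. Apart from GITHUB_COMMIT_SOURCE, which this guide requires, the option names and values below follow Nikola's defaults and may need adjusting for your Nikola version and repository layout:

GITHUB_SOURCE_BRANCH = 'src'      # branch that holds the Nikola sources
GITHUB_DEPLOY_BRANCH = 'master'   # branch served by GitHub Pages ('gh-pages' for project sites)
GITHUB_REMOTE_NAME = 'origin'
GITHUB_COMMIT_SOURCE = False      # required here, otherwise Travis CI goes into an infinite loop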

If everything works, you can make some change to your site (so you see that rebuilding works), but don't commit it just yet.

Setting up Travis CI

Next, we need to set up Travis CI. To do that, make sure you have the ruby and gem tools installed on your system. If you don't have them, install them from your OS package manager.

First, download/copy the .travis.yml file (note the dot in the beginning; the downloaded file doesn't have it!) and adjust the real name, e-mail (used for commits; line 12/13), and the username/repo name on line 21. If you want to render your site in another language besides English, add the appropriate Ubuntu language pack to the list in this file.

travis.yml

# Travis CI config for automated Nikola blog deployments
language: python
cache: apt
sudo: false
addons:
  apt:
    packages:
    - language-pack-en-base
python:
- 3.5
before_install:
- git config --global user.name 'Travis CI'
- git config --global user.email 'travis@invalid'
- git config --global push.default 'simple'
- pip install --upgrade pip wheel
- echo -e 'Host github.com\n    StrictHostKeyChecking no' >> ~/.ssh/config
- eval "$(ssh-agent -s)"
- chmod 600 id_rsa
- ssh-add id_rsa
- git remote rm origin
- git remote add origin git@github.com:USERNAME/REPO.git
- git fetch origin master
- git branch master FETCH_HEAD
install:
- pip install 'Nikola[extras]'
script:
- nikola build && nikola github_deploy -m 'Nikola auto deploy [ci skip]'
notifications:
    email:
        on_success: change
        on_failure: always

Next, we need to generate an SSH key for Travis CI.

echo id_rsa >> .gitignore
echo id_rsa.pub >> .gitignore
ssh-keygen -C TravisCI -f id_rsa -N ''

Open the id_rsa.pub file and copy its contents. Go to GitHub → your page repository → Settings → Deploy keys and add it there. Make sure Allow write access is checked.

And now, time for our venture into the Ruby world. Install the travis gem:

gem install --user-install travis

You can then use the travis command if you have configured your $PATH for RubyGems; if you haven't, the tool will output a path to use (e.g. ~/.gem/ruby/2.0.0/bin/travis).

We'll use the Travis CI command-line client to log in (using your GitHub password), enable the repository and encrypt our SSH key. Run the following three commands, one at a time (they are interactive):

travis login
travis enable
travis encrypt-file id_rsa --add

Commit everything to GitHub:

git add .
git commit -am "Automate builds with Travis CI"

Hopefully, Travis CI will build your site and deploy. Check the Travis CI website or your e-mail for a notification. If there are any errors, make sure you followed this guide to the letter.

24 Aug 2016 6:05pm GMT

Python Piedmont Triad User Group: PYPTUG Monthly meeting August 30 2016 (flask-restplus, openstreetmap)

Come join PYPTUG at our next monthly meeting (August 30th 2016) to learn more about the Python programming language, modules and tools. Python is the perfect language to learn if you've never programmed before, and at the other end, it is also the perfect tool that no expert would do without. Monthly meetings are in addition to our project nights.




What

Meeting will start at 6:00pm.

We will open with an intro to PYPTUG and how to get started with Python, then cover PYPTUG activities and members' projects (in particular some updates on the Quadcopter project), and then move on to News from the community.

Then on to the main talk.


Main Talk: Building a RESTful API with Flask-Restplus and Swagger
by Manikandan Ramakrishnan
Bio:
Manikandan Ramakrishnan is a Data Engineer with Inmar Inc.

Abstract:
Building an API and documenting it properly is like having a cake and eating it too. Flask-RESTPlus is a great Flask extension that makes it really easy to build robust REST APIs quickly with minimal setup. With its built in Swagger integration, it is extremely simple to document the endpoints and to enforce request/response models.

Lightning talks!


We will have some time for extemporaneous "lightning talks" of 5-10 minute duration. If you'd like to give one, some suggested talk topics were provided here, in case you are looking for inspiration. Or talk about a project you are working on.
One lightning talk will cover OpenStreetMap

When

Tuesday, August 30th 2016
Meeting starts at 6:00PM

Where

Wake Forest University, close to Polo Rd and University Parkway:

Manchester Hall
room: Manchester 241
Wake Forest University, Winston-Salem, NC 27109

Map this

See also this campus map (PDF) and also the Parking Map (PDF) (Manchester hall is #20A on the parking map)

And speaking of parking: parking after 5pm is on a first-come, first-served basis. The official parking policy is:
"Visitors can park in any general parking lot on campus. Visitors should avoid reserved spaces, faculty/staff lots, fire lanes or other restricted area on campus. Frequent visitors should contact Parking and Transportation to register for a parking permit."

Mailing List


Don't forget to sign up to our user group mailing list:

https://groups.google.com/d/forum/pyptug?hl=en

It is the only step required to become a PYPTUG member.

RSVP on meetup:
https://www.meetup.com/PYthon-Piedmont-Triad-User-Group-PYPTUG/events/233095834/

24 Aug 2016 3:28pm GMT

Machinalis: Searching for aliens

First contact

Have you ever seen through a plane's window, or in Google Maps, some precisely defined circles on the Earth? Typically many of them, close to each other? Something like this:

Do you know what they are? If you are thinking of irrigation circles, you are wrong. Do not believe the lies of the conspirators. Those are, undoubtedly, proof of extraterrestrial visitors on Earth.

As I want to be ready for the first contact I need to know where these guys are working. It should be easy with so many satellite images at hand.

So I asked the machine learning experts around here to lend me a hand. Surprisingly, they refused. Mumbling I don't know what about irrigation circles. Very suspicious. But something else they mentioned is that a better initial approach would be to use some computer-vision detection technique.

So, there you go. Those damn conspirators gave me the key.

Aliens

Circles detection

So now, in the Python ecosystem computer vision means OpenCV. And as it happens, this library has the HoughCircles function, which finds circles in an image. Not surprising: OpenCV has a bazillion useful functions like that.

Let's make it happen.

First of all, I'm going to use Landsat 8 data. I'll choose scene 229/82 for two reasons:

  • I know it includes circles, and
  • it includes my house (I want to meet the extraterrestrials living close by, not those in Area 51)
Crop of the Landsat 8 scene 229/82

The first issue I have to solve is that the HoughCircles function

finds circles in a grayscale image using a modification of the Hough transform

Well, grayscale does not exactly match multi-band Landsat 8 data, but each one of the bands can be treated as a single grayscale image. Now, a circle can express itself differently in different bands, because each band has its own way to sense the earth. So, the detector can define slightly different center coordinates for the same circle. For that reason, if two centers are too close then I'm going to keep only one of them (and discard the other as repeated).

Next, I need to determine the maximum and minimum circle radius. Typically, those circles' sizes vary from 400 m up to 800 m. That is between 13 and 26 Landsat pixels (30 m each). That's a starting point. For the rest of the parameters I'll just play around and try different values (not very scientific, I'm sorry).
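Here is a minimal sketch of what such a detection step could look like with OpenCV. It is not the notebook's code: it assumes band is a single Landsat band already scaled to an 8-bit 2-D NumPy array, uses the 13-26 pixel radius bounds estimated above, and the remaining parameters are just plausible starting values to tune; detect_circles and merge_close_centers are illustrative names.

import cv2
import numpy as np

def detect_circles(band):
    # Smooth speckle a bit so the Hough gradient method sees cleaner edges.
    blurred = cv2.medianBlur(band, 5)
    circles = cv2.HoughCircles(blurred,
                               cv2.HOUGH_GRADIENT,  # cv2.cv.CV_HOUGH_GRADIENT on OpenCV 2.x
                               dp=1,          # accumulator at full image resolution
                               minDist=13,    # centers closer than this are merged by OpenCV
                               param1=100,    # upper Canny edge threshold
                               param2=30,     # accumulator threshold: lower -> more circles
                               minRadius=13,
                               maxRadius=26)
    if circles is None:
        return np.empty((0, 3))
    return circles[0]  # rows of (x, y, radius), in pixels

def merge_close_centers(circles, min_separation=10):
    # Discard "repeated" detections of the same circle coming from different bands.
    kept = []
    for c in circles:
        if all(np.hypot(c[0] - k[0], c[1] - k[1]) >= min_separation for k in kept):
            kept.append(c)
    return np.array(kept)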

So I run my script (which you can see in this Jupyter notebook) and without too much effort I can see that the circles are detected:

Crop of the Landsat 8 scene 229/82, with the detected circles (the colors of the circles correspond to the size).

By changing the parameters I get to detect more circles (getting more false positives) or fewer circles (missing some real ones). As usual, there's a trade-off there.

Filter-out false positives

These circles only make sense in farming areas. If I configure the program not to miss real circles, then I get a lot of false positives. There are too many detected circles in cities, clouds, mountains, around rivers, etc.

Crop of the Landsat 8 scene 229/82, with detected circles and labels

That's a whole new problem that I will need to solve. I can use vegetation indices, texture computation, machine learning. There's a whole battery of possibilities to explore. Intuition, experience, domain knowledge, good data-science practices and ufology will help me out here. Unluckily, all of that is out of the scope of this post.

So, my search for aliens will continue.

Mental disorder disclaimer

I hope it's clear that the whole search-for-aliens story is fictional. Just an amusing way to present the subject.

That said, the technical aspects of the post are still valid.

To help our friends at Kilimo, we developed a prototype irrigation-circle detector. As hinted before, instead of approaching the problem with machine learning we attacked it using computer vision techniques.

Please, feel free to comment, contact us or whatever. I'm @py_litox on Twitter.

24 Aug 2016 2:19pm GMT

Hynek Schlawack: Hardening Your Web Server’s SSL Ciphers

There are many wordy articles on configuring your web server's TLS ciphers. This is not one of them. Instead I will share a configuration which is both compatible enough for today's needs and scores a straight "A" on Qualys's SSL Server Test.

24 Aug 2016 1:49pm GMT

pythonwise: Generate Relation Diagram from GAE ndb Model

Working with GAE, we wanted to create a relation diagram from our ndb model. By deferring the rendering to dot and using Python's reflection this became an easy task. Some links are still missing since we're using ancestor queries, but this can be handled by some class docstring syntax or just by manually editing the resulting dot file.

24 Aug 2016 6:51am GMT

Codementor: Asynchronous Tasks using Celery with Django

celery with django

Prerequisites

Introduction

Celery is a task queue based on distributed message passing. It is used to handle long-running asynchronous tasks. RabbitMQ, on the other hand, is a message broker that Celery uses to send and receive messages. Celery is perfectly suited for tasks that take some time to execute, where we don't want our requests to be blocked while those tasks are processed. Cases in point are sending emails or SMSes, making remote API calls, etc.

Target

Using the Local Django Application

Local File Directory Structure

We are going to use a Django application called mycelery. Our directory is structured in this way. The root of our django application is 'mycelery'.

Add the following lines to the settings.py file to tell Celery that we will use RabbitMQ as our message broker and accept data in JSON format.

BROKER_URL = 'amqp://guest:guest@localhost//'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'

CELERY_ACCEPT_CONTENT is the list of content types the worker is allowed to receive.
CELERY_TASK_SERIALIZER is a string identifying the default serialization method for task messages.
CELERY_RESULT_SERIALIZER is the serialization format used for task results.

After adding the message broker, add the following lines to a new file, celery.py, which tells Celery to use the settings defined in settings.py above.

from __future__ import absolute_import
import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mycelery.settings')

from django.conf import settings
from celery import Celery

app = Celery('mycelery',
             backend='amqp',
             broker='amqp://guest@localhost//')

# This reads, e.g., CELERY_ACCEPT_CONTENT = ['json'] from settings.py:
app.config_from_object('django.conf:settings')

# For autodiscover_tasks to work, you must define your tasks in a file called 'tasks.py'.
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)

@app.task(bind=True)
def debug_task(self):
    print("Request: {0!r}".format(self.request))

Create tasks in Celery

Create an app named myceleryapp and make a tasks.py file in this app's folder. All the tasks will be defined in this file.

In tasks.py, we are just playing with numbers so that, for now, it behaves like a long-running task.

from celery import shared_task, current_task
from numpy import random
from scipy.fftpack import fft

@shared_task
def fft_random(n):
    for i in range(n):
        x = random.normal(0, 0.1, 2000)
        y = fft(x)
        if i % 30 == 0:
            process_percent = int(100 * float(i) / float(n))
            current_task.update_state(state='PROGRESS',
                                      meta={'process_percent': process_percent})
    return random.random()

Using the current_task.update_state() method, we pass the percentage of the task completed to the message broker every 30 iterations.

Calling tasks in Django

To call the above task, the following lines of code are required. You can put these lines in any file from which you want to call the task.

from .tasks import fft_random
job = fft_random.delay(int(n))

Import the method and make a call. That's it! Now your operation is running in the background.
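For example, the object returned by delay() is an AsyncResult; its id is what the status view below will expect from the JavaScript, and its state can be inspected directly:

print(job.id)     # unique task id to hand to the front end for polling
print(job.state)  # e.g. 'PENDING', 'PROGRESS' or 'SUCCESS'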

Get the status of the task

To get the status of the task above, define the following method in views.py

import json

from celery.result import AsyncResult
from django.http import HttpResponse

# Create your views here.
def task_state(request):
    data = 'Fail'
    if request.is_ajax():
        if 'task_id' in request.POST.keys() and request.POST['task_id']:
            task_id = request.POST['task_id']
            task = AsyncResult(task_id)
            data = task.result or task.state
        else:
            data = 'No task_id in the request'
    else:
        data = 'This is not an ajax request'

    json_data = json.dumps(data)
    return HttpResponse(json_data, content_type='application/json')

The task_id is sent to this method from the JavaScript. The method checks the status of the task with that id and returns it in JSON format to the JavaScript. We can then call this method from our JavaScript and show a corresponding progress bar.

Conclusion

We can use Celery to run different types of tasks from sending emails to scraping a website. In the case of long running tasks, we'd like to show the status of the task to our user, and we can use a simple JavaScript bar which calls the task status url and sets the time spent on the task. With the help of Celery, a user's experience on Django websites can be improved dramatically.

Other tutorials you might also be interested in:

24 Aug 2016 6:36am GMT

Reuven Lerner: Fun with floats

I'm in Shanghai, and before I left to teach this morning, I decided to check the weather. I knew that it would be hot, but I wanted to double-check that it wasn't going to rain - a rarity during Israeli summers, but not too unusual in Shanghai.

I entered "shanghai weather" into DuckDuckGo, and got the following:

Never mind that it gave me a weather report for the wrong Chinese city. Take a look at the humidity reading! What's going on there? Am I supposed to worry that it's ever-so-slightly more humid than 55%?

The answer, of course, is that many programming languages have problems with floating-point numbers. Just as there's no terminating decimal number to represent 1/3, lots of numbers are non-terminating when you use binary, which computers do.

As a result, floats are inaccurate. Just add 0.1 + 0.2 in many programming languages, and prepare to be astonished. Wait, you don't want to fire up a lot of languages? Here, someone has done it for you: http://0.30000000000000004.com/ (I really love this site.)
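For instance, here is the same experiment in a plain Python session (a quick illustration added for reference; the output is simply how Python prints the nearest binary double):

>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False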

If you're working with numbers that are particularly sensitive, then you shouldn't be using floats. Rather, you should use integers, or use something like Python's decimal.Decimal, which guarantees accuracy at the expense of time and space. For example:

>>> from decimal import Decimal
>>> x = Decimal('0.1')
>>> y = Decimal('0.2')
>>> x + y
Decimal('0.3')
>>> float(x+y)
0.3

Of course, you should be careful not to create your decimals with floats:

>>> x = Decimal(0.1)
>>> y = Decimal(0.2)
>>> x + y
Decimal('0.3000000000000000166533453694')

Why is this the case? Let's take a look:

>>> x
Decimal('0.1000000000000000055511151231257827021181583404541015625')

>>> y
Decimal('0.200000000000000011102230246251565404236316680908203125')

So, if you're dealing with sensitive numbers, be sure not to use floats! And if you're going outside in Shanghai today, it might be ever-so-slightly less humid than your weather forecast reports.

The post Fun with floats appeared first on Lerner Consulting Blog.

24 Aug 2016 3:16am GMT

23 Aug 2016

Planet Python

Yann Larrivée: ConFoo Montreal 2017 Calling for Papers

ConFoo Montreal: March 8th-10th 2017

Want to get your web development ideas in front of a live audience? The call for papers for the ConFoo Montreal 2017 web developer conference is open! If you have a burning desire to hold forth about PHP, Java, Ruby, Python, or any other web development topics, we want to see your proposals. The window is open only from August 21 to September 20, 2016, so hurry. An added benefit: If your proposal is selected and you live outside of the Montreal area, we will cover your travel and hotel.

You'll have 45 minutes to wow the crowd, with 35 minutes for your topic and 10 minutes for Q&A. We can't wait to see your proposals. Knock us out!

ConFoo Montreal will be held on March 8-10, 2017. For those of you who already know about our conference, be aware that this annual tradition will still be running in addition to ConFoo Vancouver. Visit our site to learn more about both events.

23 Aug 2016 4:30pm GMT

Enthought: Webinar: Introducing the NEW Python Integration Toolkit for LabVIEW

Python Integration Toolkit for LabVIEW from Enthought - Webinar
Dates: Thursday, August 25, 2016, 1:00-1:45 PM CT or Wednesday, August 31, 2016, 9:00-9:45 AM CT / 3:00-3:45 PM BT

Register now (if you can't attend, we'll send you a recording)

LabVIEW is a software platform made by National Instruments, used widely in industries such as semiconductors, telecommunications, aerospace, manufacturing, electronics, and automotive for test and measurement applications. Earlier this month, Enthought released the Python Integration Toolkit for LabVIEW, which is a "bridge" between the LabVIEW and Python environments.

In this webinar, we'll demonstrate:

  1. How the new Python Integration Toolkit for LabVIEW from Enthought seamlessly brings the power of the Python ecosystem of scientific and engineering tools to LabVIEW
  2. Examples of how you can extend LabVIEW with Python, including using Python for signal and image processing, cloud computing, web dashboards, machine learning, and more
Python Integration Toolkit for LabVIEW

Quickly and efficiently access scientific and engineering tools for signal processing, machine learning, image and array processing, web and cloud connectivity, and much more. With only minimal coding on the Python side, this extraordinarily simple interface provides access to all of Python's capabilities.

Try it with your data, free for 30 days

Download a free 30 day trial of the Python Integration Toolkit for LabVIEW from the National Instruments LabVIEW Tools Network.

How LabVIEW users can benefit from Python:

Register

23 Aug 2016 2:44pm GMT

Mike Driscoll: Python 201 Releasing in 2 Weeks!

My second book, Python 201: Intermediate Python is releasing two weeks from today on September 6th, 2016. I just wanted to remind anyone who is interested that you can pre-order a signed paperback copy of the book here right up until release day. You will also receive a copy of the book in the following digital formats: PDF, EPUB and MOBI (Kindle format).

My book is also available for early release digitally on Gumroad and Leanpub.

Check out either of those links for more information!

23 Aug 2016 1:55pm GMT

Chris Moffitt: Lessons Learned from Analyze This! Challenge

Introduction

I recently had the pleasure of participating in a crowd-sourced data science competition in the Twin Cities called Analyze This! I wanted to share some of my thoughts and experiences on the process - especially how this challenge helped me learn more about how to apply data science theory and open source tools to real world problems.

I also hope this article can encourage others in the Twin Cities to participate in future events. For those of you not in the Minneapolis-St. Paul metro area, then maybe this can help motivate you to start up a similar event in your area. I thoroughly enjoyed the experience and got a lot out of the process. Read on for more details.

Background

Analyze This! is a crowd-sourced data science competition. Think of it as a mashup of an in-person Kaggle competition, a data science user group, and a little bit of Toastmasters. The result is a really cool series of events that accomplishes two things. First, it helps individuals build their data science skills on a real-world problem. Second, it helps an organization get insight into its data challenges.

The process starts when the Analyze This organizers partner with a host organization to identify a real-world problem that could be solved with data analytics. Once the problem is defined and the data gathered, it is turned over to a group of eager volunteers who spend a couple of months analyzing the data and developing insights and actionable next steps for solving the defined problem. Along the way, there are periodic group meetings where experts share their knowledge on a specific data science topic. The process culminates in a friendly competition where the teams present their results to the group. The host organization and event organizers judge the results based on a pre-defined rubric. The winning team typically receives a modest financial reward (more than enough for a dinner but not enough to pay the rent for the month).

In this specific case, Analyze This! partnered with the Science Museum of Minnesota to gather and de-identify data related to membership activity. The goal of the project was to develop a model to predict whether or not a member would renew their membership and use this information to increase membership renewal rates for the museum.

Observations

As I mentioned earlier, the entire process was really interesting, challenging and even fun. Here are a few of the observations I took away from the events that I can apply to future challenges and real-life data science projects:

The Best Way to Learn is By Doing

I came into the event with a good familiarity with python but not as much real-world experience with machine learning algorithms. I have spent time learning about various ML tools and have played with some models but at some point, you can only look at Titanic or Iris data sets for so long!

The best analogy I can think of is that it is like taking a math class and looking at the solution in the answer key. You may think you understand how to get to the solution but "thinking you can" is never the same as spending time wrestling with the problem on your own and "knowing you can."

Because the data set was brand new to us all, it forced us to dig in and struggle with understanding the data and divining insights. There was no "right answer" that we could look at in advance. The only way to gain insight was to wrestle with the data and figure it out with your team. This meant researching the problem and developing working code examples.

Descriptive Analytics Still Matter

Many people have seen some variation of the chart that looks like this:

Descriptive to Prescriptive Analytics

source

Because I wanted to learn about ML, I tended to jump ahead in this chart and go straight for the predictive model without spending time on the descriptive analytics. After sitting through the presentations from each group, I realized that I should have spent more time looking at the data from a standard stats perspective and used some of those basic insights to help inform the eventual model. I also realized that the descriptive analytics were really useful in helping to tell the story around the final recommendations. In other words, it's not all about a fancy predictive model.

Speaking of Models

In this specific case, all the teams developed models to predict a member's likely renewal based on various traits. Across the group, the teams tried pretty much every model available in the Python or R ecosystems. Despite how fancy everyone tried to get, a simple logistic regression model won out. I think the moral of the story is that sometimes a relatively simple model with good results beats a complex model with marginally better results.
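
To make that concrete, here is a rough sketch of that kind of model in scikit-learn. The data is a synthetic stand-in generated with make_classification (I obviously can't share the actual museum data), so treat it as an illustration of the approach rather than the real analysis.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the membership data: each row is a "member",
# each column a feature, and y is 1 if the member "renewed".
X, y = make_classification(n_samples=1000, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))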

Python Served Me Well

My team (and several others) used Python for much of the analysis. In addition to pandas and scikit-learn, I leveraged Jupyter notebooks for a lot of exploratory data analysis. Of course, I used conda to set up a Python 3 environment for this project, which made it really nice to play around with various tools without messing up other Python environments.

I experimented with folium to visualize geographic data. I found it fairly simple to build interesting, data-rich maps with this tool. If there is interest, I may write about it more in the future.
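
If you want to see roughly what that looks like, a bare-bones folium map is only a few lines. The coordinates and marker below are just an example location in downtown Minneapolis, not anything from the challenge data.

import folium

# Bare-bones folium example: center a map on downtown Minneapolis and
# drop a single marker, then write the result to an HTML file.
m = folium.Map(location=[44.9778, -93.2650], zoom_start=12)
folium.Marker([44.9778, -93.2650], popup="Example marker").add_to(m)
m.save("example_map.html")  # open in a browser to view the map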

I also took TPOT for a spin. It worked well and I think it generated some useful models. We eventually used a different model but I plan to continue learning more about TPOT and look forward to seeing how it continues to improve.
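
For reference, a minimal TPOT run looks something like the sketch below, again on synthetic stand-in data rather than the actual membership data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Synthetic stand-in data again. TPOT searches over scikit-learn pipelines
# and can export the best pipeline it finds as a Python script.
X, y = make_classification(n_samples=1000, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print("holdout score:", tpot.score(X_test, y_test))
tpot.export("tpot_membership_pipeline.py")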

Presenting Results is a Skill

One of the key aspects of the Analyze This challenge that I enjoyed is that each team had to present its solution in a 10-minute presentation. Because we had all spent time with the same data set, we were all starting from a similar baseline. It was extremely interesting to see how the teams presented their results and used various visualizations to explain their process and provide actionable insight. We all tended to identify several common features that drove renewal rates, but it was interesting to see how different teams attacked a similar problem from different angles.

Several of the groups scored results that were very close to each other. The scoring rubric placed more weight on the presentation than on the actual model results, which I think is a wise move and separates this challenge from something like a Kaggle competition.

The other interesting/challenging part of presenting the results was the wide range of knowledge in the room. On one end of the spectrum there were PhDs, data scientists and very experienced statisticians. On the other end were people just learning some of these concepts, with little or no training in data science or statistics. This wide spread of knowledge meant that each group had to think carefully about how to present their information in a way that would appeal to the entire audience.

Community is important

One of the goals of the Analyze This organizers is to foster a community for data science learning. I felt like they did a really nice job of making everyone feel welcome. Even though this was a competition, the more experienced members were supportive of the less knowledgeable individuals. There was a lot of formal and informal knowledge sharing.

I have seen several variations of this Venn diagram used to describe data scientists.

Data Science Venn Diagram

During the competition, I noticed that the pool of participants fit into many of these categories. We had everything from people that do data science as a full time job to web developers to people just interested in learning more. The really great thing was that it was a supportive group and people were willing to share knowledge and help others.

My experience with this cross-section of people reinforced my belief that the "perfect data scientist" does lie at the intersection of these multiple functions.

I hope the Analyze This! group can continue building on the success of this competition and encourage even more people to participate in the process.

Networking

I am really excited about the people I met through this process. I ended up working with a great group of guys on my team. I also got to learn a little more about how others are doing Data Science in the Twin Cities. Of course, I used this as an opportunity to expand my network.

Conclusion

I am sure you can tell that I'm a big supporter of Analyze This!, its mission and the people that are leading the program. Pedro, Kevin, Jake, Mitchell, Daniel and Justin did a tremendous amount of work to make this happen. I am very impressed with their knowledge and dedication. They are doing this to help others and build up the community. They receive no pay for the countless hours of work they put into it.

The process was a great way to learn more about data science and hone my skills in a real-world test. I got to meet some smart people and help a worthy organization (hopefully) improve its membership renewal rates. I highly encourage those of you who might be at FARCON 2016 to stop by and listen to the group presentations. I also encourage you to look for the next challenge and find some time to participate. I am confident you will find it time well spent.

23 Aug 2016 12:25pm GMT

Andrew Dalke: Fragment by copy and trim

This is part of a series of essays on how to fragment a molecular graph using RDKit. These are meant to describe the low-level steps that go into fragmentation, and the ways to think about testing. To fragment a molecule in RDKit, use FragmentOnBonds(). The essays in this series are:

Why another fragmentation method?

How do you tell if an algorithm is correct? Sometimes you can inspect the code. More often you have test cases where you know the correct answer. For complex algorithms, especially when using real-world data, testing is even harder. It's not always clear where the edge cases are, and the real world is often more complicated than you think.

Another validation solution is to write two different algorithms which accomplish the same thing, and compare them with each other. I did that in my essay about different ways to evaluate the parity of a permutation order. That cross-validation testing was good enough, because it's easy to compute all possible input orders.

In the previous essay in this series, I cross-validated that fragment_chiral() and fragment_on_bonds() give the same results. They did. That isn't surprising because they implement essentially the same algorithm. Not only that, but I looked at the FragmentOnBonds() implementation when I found out my first fragment_chiral() didn't give the same answers. (I didn't know I needed to ClearComputedProps() after I edit the structure.)

Cross-validation works better when the two algorithms are less similar. The way you think about a problem and implement a solution are connected. Two similar solutions may be the result of thinking about things the same way, and come with the same blind spots. (This could also be a design goal, if you want the new implementation to be "bug compatible" with the old.)

I've made enough hints that you likely know what I'm leading up to. RDKit's FragmentOnBonds() code currently (in mid-2016) converts directional bonds (which express the E-Z stereochemistry around double bonds) into non-directional single bonds:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("F/C=C/F")
>>> fragmented_mol = Chem.FragmentOnBonds(mol, [0], dummyLabels=[(0,0)])
>>> Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
'[*]C=CF.[*]F'

when I expected the last line to be

'[*]/C=C/F.[*]F'

This may or may not be what you want. Just in case, I've submitted a patch to discuss changing it in RDKit, so your experience in the future might be different than mine in the past.

I bring it up now to show how cross-validation of similar algorithms isn't always enough to figure out if the algorithms do as you expect.

Validate through re-assembly

As a reminder, in an earlier essay I did a much stronger validation test. I took the fragments, reassembled them via SMILES syntax manipulation, and compared the canonicalized result to the canonicalized input structure. These should always match, to within limits of the underlying chemistry support.

They didn't, because my original code didn't support directional bonds. Instead, in that essay I pre-processed the input SMILES string to replace the "/" and "\" characters with a "-". That gives the equivalent chemistry without directional bonds.
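
For reference, that pre-processing was just a string substitution on the SMILES, something like:

smiles = "F/C=C/F"
# Replace the directional bond symbols "/" and "\" with an explicit
# single bond "-". The connectivity is the same, but the E-Z bond
# direction information is gone.
plain_smiles = smiles.replace("/", "-").replace("\\", "-")
print(plain_smiles)   # F-C=C-F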

I could use that validation technique again, but I want to explore more of what it means to cross-validate by using a different fragmentation implementation.

Fragment by 'copy and trim'

Dave Cosgrove, on the RDKit mailing list, described the solution he uses to cut a bond and preserve chirality in toolkits which don't provide a high-level equivalent to FragmentOnBonds(). This method only works on non-ring bonds, which isn't a problem for me as I want to fragment R-groups. (As I discovered while doing the final testing, my implementation of the method also assumes there is only one molecule in the structure.)

To make a fragment from a molecule, trim away all the atoms which aren't part of the fragment, except for the atom connected to the fragment. Convert that atom into the wildcard atom "*" by setting its atomic number, charge, and number of hydrogens to 0, and setting the chirality and aromaticity flags correctly. The image below shows the steps applied to cutting the ester off of aspirin:

steps to trim a fragment of aspirin

This method never touches the bonds connected to a chiral atom of a fragment, so chirality or other bond properties (like bond direction!) aren't accidentally removed by breaking the bond and making a new one in its place.

A downside is that I need to copy and trim twice in order to get both fragments after cutting a chain bond. I can predict now the performance won't be as fast as FragmentOnBonds().

Let's try it out.

Trim using the interactive shell

I'll use aspirin as my reference structure and cut the bond to the ester (the R-OCOCH3), which is the bond between atoms 2 and 3.

aspirin with atoms labeled by index

I want the result to be the same as using FragmentOnBonds() to cut that bond:

>>> from rdkit import Chem
>>> aspirin = Chem.MolFromSmiles("O=C(Oc1ccccc1C(=O)O)C")
>>> bond_idx = aspirin.GetBondBetweenAtoms(2, 3).GetIdx()
>>> bond_idx
2
>>> fragmented_mol = Chem.FragmentOnBonds(aspirin, [bond_idx], dummyLabels=[(0,0)])
>>> Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'

I'll need a way to find all of the atoms which are on the ester side of the bond. I don't think there's a toolkit function to get all the atoms on one side of a bond, so I'll write one myself.

Atom 2 is the connection point on the ester to the rest of the aspirin structure. I'll start with that atom, show that it's an oxygen, and get information about its bonds:

>>> atom = aspirin.GetAtomWithIdx(2)
>>> atom.GetSymbol()
'O'
>>> [bond.GetBondType() for bond in atom.GetBonds()]
[rdkit.Chem.rdchem.BondType.SINGLE, rdkit.Chem.rdchem.BondType.SINGLE]

I'll use the bond object to get to the atom on the other side of the bond from atom 2, and report more detailed information about the atom at the other end of each bond:

>>> for bond in atom.GetBonds():
...   other_atom = bond.GetOtherAtom(atom)
...   print(other_atom.GetIdx(), bond.GetBondType(), other_atom.GetSymbol(), other_atom.GetIsAromatic())
... 
1 SINGLE C False
3 SINGLE C True

This says that atom 1 is an aliphatic carbon, and atom 3 is an aromatic carbon. This matches the structure diagram I used earlier.

I know that atom 3 is the other end of the bond which was cut, so I can stop there. What about atom 1? What is it connected to?

>>> next_atom = aspirin.GetAtomWithIdx(1)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[0, 2, 12]

I've already processed atom 2, so only atoms 0 and 12 are new.

What are those atoms connected to?

>>> next_atom = aspirin.GetAtomWithIdx(0)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[1]
>>> next_atom = aspirin.GetAtomWithIdx(12)
>>> [b.GetOtherAtom(next_atom).GetIdx() for b in next_atom.GetBonds()]
[1]

Only atom 1, which I've already seen, so I've found all of the atoms on the ester side of the bond.

Semi-automated depth-first search

There are more atoms on the other side of the bond, so I'll have to automate it somewhat. This sort of graph search is often implemented as a depth-first search (DFS) or a breadth-first search (BFS). Python lists easily work as stacks, which makes DFS slightly more natural to implement.
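
In case the stack terminology is new to you, a Python list gives stack behavior through append() and pop(); the most recently added item is the first one removed:

>>> stack = ["first"]
>>> stack.append("second")   # push onto the top of the stack
>>> stack.append("third")
>>> stack.pop()              # pop() returns the most recently pushed item
'third'
>>> stack.pop()
'second'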

To give you an idea of what I'm about to explain, here's an animation of the aspirin depth-first search:

depth-first search of an apsirin fragment

If there is the chance of a ring (called a "cycle" in graph theory), then there are two ways to get to the same atom. To prevent processing the same atom multiple times, I'll set up a set named "seen_ids", which contains the atom indices I've seen before. Since this is the start, I've only seen the two atoms which are at the ends of the bond to cut.

>>> seen_ids = set([2, 3])

I also store a list of the atoms which are on this side of the bond (at this point only atom 3), and a stack (a Python list) of the atoms I need to process further:

>>> atom_ids = [3]
>>> stack = [aspirin.GetAtomWithIdx(3)]

I'll start by getting the top of the stack (the last element of the list) and looking at its neighbors:

>>> atom = stack.pop()
>>> neighbor_ids = [b.GetOtherAtom(atom).GetIdx() for b in atom.GetBonds()]
>>> neighbor_ids
[2, 4, 8]

I'll need to filter out the neighbors I've already seen. I'll write a helper function for that. I'll need both the atom objects and the atom ids, so this will return two lists, one for each:

def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

and use it:

>>> atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
>>> atom_ids_to_visit
[4, 8]
>>> [a.GetSymbol() for a in atoms_to_visit]
['C', 'C']

These atom ids are connected to the original atom, so add them all to atom_ids.

>>> atom_ids.extend(atom_ids_to_visit) 
>>> atom_ids
[3, 4, 8]

These haven't been seen before, so add the atoms (not the atom ids) to the stack of items to process:

>>> stack.extend(atoms_to_visit)
>>> stack
[<rdkit.Chem.rdchem.Atom object at 0x111bf37b0>, <rdkit.Chem.rdchem.Atom object at 0x111bf3760>]
>>> [a.GetIdx() for a in stack]
[4, 8]

Now that they've been seen, I don't need to process them again, so add the new ids to the set of seen ids:

>>> seen_ids.update(atom_ids_to_visit)
>>> seen_ids
{8, 2, 3, 4}

That's the basic loop. The stack isn't empty, so pop the top of the stack to get a new atom object, and repeat the earlier steps:

>>> atom = stack.pop()
>>> atom.GetIdx()
8
>>> atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
>>> atom_ids_to_visit
[7, 9]
>>> atom_ids.extend(atom_ids_to_visit) 
>>> atom_ids 
[3, 4, 8, 7, 9]
>>> stack.extend(atoms_to_visit)
>>> [a.GetIdx() for a in stack]
[4, 7, 9]
>>> seen_ids.update(atom_ids_to_visit)
>>> seen_ids
{2, 3, 4, 7, 8, 9}

Then another loop. I'll stick it in a while loop to process the stack until it's empty, and I'll only have it print the new atom ids added to the list:

>>> while stack:
...     atom = stack.pop()
...     atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
...     atom_ids.extend(atom_ids_to_visit)
...     stack.extend(atoms_to_visit)
...     seen_ids.update(atom_ids_to_visit)
...     print("added atoms", atom_ids_to_visit)
... 
added atoms [10, 11]
added atoms []
added atoms []
added atoms [6]
added atoms [5]
added atoms []
added atoms []

The final set of atoms is:

>>> atom_ids
[3, 4, 8, 7, 9, 10, 11, 6, 5]

Fully automated - fragment_trim()

In this section I'll put all the pieces together to make fragment_trim(), a function which implements the trim algorithm.

Find atoms to delete

The trim algorithm needs to know which atoms to delete. That will be all of the atoms on one side of the bond, except for the atom which is at the end of the bond. (I'll turn that atom into a "*" atom.) I'll use the above code to get the atom ids to delete:

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {start_atom_id, ignore_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]
    
    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids

and try it out:

>>> from rdkit import Chem
>>> aspirin = Chem.MolFromSmiles("O=C(Oc1ccccc1C(=O)O)C")
>>> find_atom_ids_to_delete(aspirin, 3, 2)
[1, 0, 12]
>>> find_atom_ids_to_delete(aspirin, 2, 3)
[4, 8, 7, 9, 10, 11, 6, 5]

That looks right. (I left out the other testing I did to get this right.)

Trim atoms

The trim function has two parts. The first is to turn the attachment point into the wildcard atom "*", which I'll do by removing any charges, hydrogens, chirality, or other atom properties. This atom will not be deleted. The second is to remove the other atoms on that side of the bond.

Removing atoms from a molecule is slightly tricky. The atom indices are reset after each atom is removed. (While I often use a variable name like "atom_id", the atom id is not a permanent id, but simply the current atom index.) When atom "i" is deleted, all of the atoms with a larger id "j" get the new id "j-1". That is, the larger ids are all shifted down one to fill in the gap.

This becomes a problem because I have a list of ids to delete, but as I delete them the real ids change. For example, if I delete atoms 0 and 1 from a two atom molecule, in that order, I will get an exception:

>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("C#N"))
>>> rwmol.RemoveAtom(0)
>>> rwmol.RemoveAtom(1)
[13:54:02] 

****
Range Error
idx
Violation occurred on line 155 in file /Users/dalke/ftps/rdkit-Release_2016_03_1/Code/GraphMol/ROMol.cpp
Failed Expression: 1 <= 0
****

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Range Error
        idx
        Violation occurred on line 155 in file Code/GraphMol/ROMol.cpp
        Failed Expression: 1 <= 0
        RDKIT: 2016.03.1
        BOOST: 1_60

This is because after RemoveAtom(0) the old atom with id 1 gets the id 0. As another example, if I remove atoms 0 and 1 from a three atom molecule, I'll end up with what was originally the second atom:

>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("CON"))
>>> rwmol.RemoveAtom(0)
>>> rwmol.RemoveAtom(1)
>>> Chem.MolToSmiles(rwmol)
'O'

Originally the ids were [C=0, O=1, N=2]. The RemoveAtom(0) removed the C and renumbered the remaining atoms, giving [O=0, N=1]. The RemoveAtom(1) then removed the N, leaving [O=0] as the remaining atom.

While you could pay attention to the renumbering and adjust the index of the atom to delete, the far easier solution is to sort the atom ids, and delete starting from the largest id.

Here is the function to turn the attachment atom into a wildcard and to delete the other atoms:

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()

(If this were a more general-purpose function I would need to call ClearComputedProps(), which is needed after any molecular editing in RDKit. I don't need to do this here because it will be done during CombineMols(), which is coming up.)

How did I figure out which properties needed to be changed to make a wildcard atom? I tracked them down during cross-validation tests of an earlier version of this algorithm.

I'll try out the new code:

>>> fragment1 = trim_atoms(aspirin, 2, [1, 0, 12])
>>> Chem.MolToSmiles(fragment1, isomericSmiles=True)
'[*]c1ccccc1C(=O)O'
>>> fragment2 = trim_atoms(aspirin, 3, [4, 8, 7, 9, 10, 11, 6, 5])
>>> Chem.MolToSmiles(fragment2, isomericSmiles=True)
'[*]OC(C)=O'

fragment_trim()

The high-level fragment code is straightforward, and I don't think it requires explanation:

def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)
    
    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

I'll check that it does what I think it should do:

>>> new_mol = fragment_trim(aspirin, 2, 3)
>>> Chem.MolToSmiles(new_mol)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'

This matches the FragmentOnBonds() output at the start of this essay.

Testing

I used the same cross-validation method I did in my previous essay. I parsed structures from ChEMBL and checked that fragment_trim() produces the same results as FragmentOnBonds(). It doesn't. The first failure mode surprised me:

fragment_trim() only works with a single connected structure

Here's one of the failure reports:

Mismatch: record: 61 id: CHEMBL1203109 smiles: Cl.CNC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O cut: 1 2
   smiles_trim: Cl.Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O
smiles_on_bond: Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O

Somehow the "Cl" appears twice. After thinking about it I realized that my implementation of the copy and trim algorithm assumes that the input structure is strongly connected, that is that it only contains a single molecule. Each copy contains a chlorine atom, so when I CombineMols() I end up with two chlorine atoms.

I could fix my implementation to handle this, but as I'm only interested in cross-validation, the easier solution is to never process a structure with multiple molecules. I decided that if the input SMILES contains multiple molecules (that is, if the SMILES contains the dot disconnection symbol ".") then I would choose the component which has the most characters, using the following:

if "." in smiles:
    # The fragment_trim() method only works on a connected molecule.
    # Arbitrarily pick the component with the longest SMILES string.
    smiles = max(smiles.split("."), key=len)

Directional bonds

After 1000 records and 31450 tests there were 362 mismatches. All of them appeared to be related to directional bonds, like this mismatch report:

Mismatch: record: 124 id: CHEMBL445177 smiles: CCCCCC/C=C\CCCCCCCc1cccc(O)c1C(=O)O cut: 7 8
   smiles_trim: [*]/C=C\CCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O
smiles_on_bond: [*]C=CCCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O

I knew this would happen coming into this essay, but it's still nice to see. It demonstrates that fragment_trim() is a better way to cross-validate FragmentOnBonds() than my earlier fragment_chiral() algorithm.

It's possible that other mismatches are hidden in the hundreds of reports, so I removed directional bonds from the SMILES and processed again:

if "/" in smiles:
    smiles = smiles.replace("/", "")
if "\\" in smiles:
    smiles = smiles.replace("\\", "")

That testing showed I had forgotten to include "atom.SetIsotope(0)" when I converted the attachment atom into a wildcard atom. Without it, the algorithm turned a "[18F]" into a "[18*]". It surprised me because I copied it from working code I made for an earlier version of the algorithm. Looking into it, I realized I left that line out during the copy&paste. This is why regression tests using diverse real-world data are important. I stopped the testing after 100,000 records. Here is the final status line:

Processed 100000 records, 100000 molecules, 1120618 tests, 0 mismatches.  T_trim: 2846.79 T_on_bond 258.18

While the trim code is slower than using the built-in function, the point of this exercise isn't to figure out the fastest implementation but to have a better way to cross-validate the original code, which it has done.

The final code

Here is the final code:

from __future__ import print_function

import datetime
import time

from rdkit import Chem

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {ignore_atom_id, start_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]

    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()
    
def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)

    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

####### Cross-validation code

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])


def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles
            
def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_trim = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches."
              "  T_trim: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_trim, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 100 == 0:
            print_status()

        if "." in smiles:
            # The fragment_trim() method only works on a connected molecule.
            # Arbitrarily pick the component with the longest SMILES string.
            smiles = max(smiles.split("."), key=len)
            
        ## Remove directional bonds to see if there are mismatches
        ## which are not caused by directional bonds.
        #if "/" in smiles:
        #    smiles = smiles.replace("/", "")
        #if "\\" in smiles:
        #    smiles = smiles.replace("\\", "")
        
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_trim() ...
            t1 = time.time()
            mol_trim = fragment_trim(mol, a1, a2)
            t2 = time.time()
            time_trim += t2-t1
            smiles_trim = Chem.MolToSmiles(mol_trim, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_trim != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print("   smiles_trim:", smiles_trim)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # don't display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)

if __name__ == "__main__":
    cross_validate()

23 Aug 2016 12:00pm GMT


While you could pay attention to the renumbering and adjust the index of the atom to delete, the far easier solution is to sort the atom ids, and delete starting from the largest id.
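
For example, here is the same three-atom molecule again, this time deleting atoms 0 and 1 in descending order. This is a quick sketch of the idiom used in the function below:

>>> rwmol = Chem.RWMol(Chem.MolFromSmiles("CON"))
>>> for atom_id in sorted([0, 1], reverse=True):
...     rwmol.RemoveAtom(atom_id)
... 
>>> rwmol.GetNumAtoms()
1
>>> rwmol.GetAtomWithIdx(0).GetSymbol()
'N'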

Here is the function to turn the attachment atom into a wildcard and to delete the other atoms:

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()

(If this were a more general-purpose function I would need to call ClearComputedProps(), which is needed after any molecular editing in RDKit. I don't need to do this here because it will be done during CombineMols(), which is coming up.)

How did I figure out which properties needed to be changed to make a wildcard atom? I tracked them down during cross-validation tests of an earlier version of this algorithm.

I'll try out the new code:

>>> fragment1 = trim_atoms(aspirin, 2, [1, 0, 12])
>>> Chem.MolToSmiles(fragment1, isomericSmiles=True)
'[*]c1ccccc1C(=O)O'
>>> fragment2 = trim_atoms(aspirin, 3, [4, 8, 7, 9, 10, 11, 6, 5])
>>> Chem.MolToSmiles(fragment2, isomericSmiles=True)
'[*]OC(C)=O'

fragment_trim()

The high-level fragment code is straightforward, and I don't think it requires explanation:

def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)
    
    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

I'll check that it does what I think it should do:

>>> new_mol = fragment_trim(aspirin, 2, 3)
>>> Chem.MolToSmiles(new_mol)
'[*]OC(C)=O.[*]c1ccccc1C(=O)O'

This matches the FragmentOnBonds() output at the start of this essay.
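
As a spot check, I can compare it directly against FragmentOnBonds(), called the same way the cross-validation code below calls it; the two canonical SMILES strings should be identical:

>>> bond = aspirin.GetBondBetweenAtoms(2, 3)
>>> mol_on_bond = Chem.FragmentOnBonds(aspirin, [bond.GetIdx()], dummyLabels=[(0, 0)])
>>> Chem.MolToSmiles(mol_on_bond, isomericSmiles=True) == Chem.MolToSmiles(new_mol, isomericSmiles=True)
True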

Testing

I used the same cross-validation method I did in my previous essay. I parsed structures from ChEMBL and checked that fragment_trim() produces the same results as FragmentOnBonds(). It doesn't. The first failure mode surprised me:

fragment_trim() only works with a single connected structure

Here's one of the failure reports:

Mismatch: record: 61 id: CHEMBL1203109 smiles: Cl.CNC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O cut: 1 2
   smiles_trim: Cl.Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O
smiles_on_bond: Cl.[*]C.[*]NC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O

Somehow the "Cl" appears twice. After thinking about it I realized that my implementation of the copy and trim algorithm assumes that the input structure is strongly connected, that is that it only contains a single molecule. Each copy contains a chlorine atom, so when I CombineMols() I end up with two chlorine atoms.

I could fix my implementation to handle this, but as I'm only interested in cross-validation, the easier solution is to never process a structure with multiple molecules. I decided that if the input SMILES contains multiple molecules (that is, if the SMILES contains the dot disconnection symbol ".") then I would choose the component which has the most characters, using the following:

if "." in smiles:
    # The fragment_trim() method only works on a connected molecule.
    # Arbitrarily pick the component with the longest SMILES string.
    smiles = max(smiles.split("."), key=len)
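
For example, applied to the CHEMBL1203109 record above, this keeps the long component and drops the "Cl":

>>> smiles = "Cl.CNC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O"
>>> max(smiles.split("."), key=len)
'CNC(=O)c1cc(C(O)CNC(C)CCc2ccc3c(c2)OCO3)ccc1O'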

Directional bonds

After 1000 records and 31450 tests there were 362 mismatches. All of them appeared to be related to directional bonds, like this mismatch report:

Mismatch: record: 124 id: CHEMBL445177 smiles: CCCCCC/C=C\CCCCCCCc1cccc(O)c1C(=O)O cut: 7 8
   smiles_trim: [*]/C=C\CCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O
smiles_on_bond: [*]C=CCCCCCC.[*]CCCCCCCc1cccc(O)c1C(=O)O

I knew this would happen coming into this essay, but it's still nice to see. It demonstrates that fragment_trim() is a better way to cross-validate FragmentOnBonds() than my earlier fragment_chiral() algorithm.

It's possible that other mismatches are hidden in the hundreds of reports, so I removed directional bonds from the SMILES and processed again:

if "/" in smiles:
    smiles = smiles.replace("/", "")
if "\\" in smiles:
    smiles = smiles.replace("\\", "")

That testing showed I had forgotten to include "atom.SetIsotope(0)" when I converted the attachment atom into a wildcard atom. Without it, the algorithm turned a "[18F]" into a "[18*]". It surprised me because I had copied the code from a working earlier version of the algorithm; looking into it, I realized I had left that line out during the copy&paste. This is why regression tests using diverse real-world data are important. I stopped the testing after 100,000 records. Here is the final status line:

Processed 100000 records, 100000 molecules, 1120618 tests, 0 mismatches.  T_trim: 2846.79 T_on_bond 258.18

While the trim code is slower than the built-in function, the point of this exercise isn't to find the fastest implementation but to have a better way to cross-validate the original code, which it has done.

The final code

Here is the final code:

from __future__ import print_function

import datetime
import time

from rdkit import Chem

# Look for neighbor atoms of 'atom' which haven't been seen before  
def get_atoms_to_visit(atom, seen_ids):
    neighbor_ids = []
    neighbor_atoms = []
    for bond in atom.GetBonds():
        neighbor_atom = bond.GetOtherAtom(atom)
        neighbor_id = neighbor_atom.GetIdx()
        if neighbor_id not in seen_ids:
            neighbor_ids.append(neighbor_id) 
            neighbor_atoms.append(neighbor_atom)
    
    return neighbor_ids, neighbor_atoms

# Find all of the atoms connected to 'start_atom_id', except for 'ignore_atom_id'
def find_atom_ids_to_delete(mol, start_atom_id, ignore_atom_id):
    seen_ids = {ignore_atom_id, start_atom_id}
    atom_ids = [] # Will only delete the newly found atoms, not the start atom
    stack = [mol.GetAtomWithIdx(start_atom_id)]

    # Use a depth-first search to find the connected atoms
    while stack:
        atom = stack.pop()
        atom_ids_to_visit, atoms_to_visit = get_atoms_to_visit(atom, seen_ids)
        atom_ids.extend(atom_ids_to_visit)
        stack.extend(atoms_to_visit)
        seen_ids.update(atom_ids_to_visit)
    
    return atom_ids

def trim_atoms(mol, wildcard_atom_id, atom_ids_to_delete):
    rwmol = Chem.RWMol(mol)
    # Change the attachment atom into a wildcard ("*") atom
    atom = rwmol.GetAtomWithIdx(wildcard_atom_id)
    atom.SetAtomicNum(0)
    atom.SetIsotope(0)
    atom.SetFormalCharge(0)
    atom.SetIsAromatic(False)
    atom.SetNumExplicitHs(0)
    
    # Remove the other atoms.
    # RDKit will renumber after every RemoveAtom() so
    # remove from highest to lowest atom index.
    # If not sorted, may get a "RuntimeError: Range Error"
    for atom_id in sorted(atom_ids_to_delete, reverse=True):
        rwmol.RemoveAtom(atom_id)
    
    # Don't need to ClearComputedProps() because that will
    # be done by CombineMols()
    return rwmol.GetMol()
    
def fragment_trim(mol, atom1, atom2):
    # Find the fragments on either side of a bond
    delete_atom_ids1 = find_atom_ids_to_delete(mol, atom1, atom2)
    fragment1 = trim_atoms(mol, atom1, delete_atom_ids1)
    
    delete_atom_ids2 = find_atom_ids_to_delete(mol, atom2, atom1)
    fragment2 = trim_atoms(mol, atom2, delete_atom_ids2)

    # Merge the two fragments.
    # (CombineMols() does the required ClearComputedProps())
    return Chem.CombineMols(fragment1, fragment2)

####### Cross-validation code

def fragment_on_bond(mol, atom1, atom2):
    bond = mol.GetBondBetweenAtoms(atom1, atom2)
    return Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])


def read_records(filename):
    with open(filename) as infile:
        for recno, line in enumerate(infile):
            smiles, id = line.split()
            if "*" in smiles:
                continue
            yield recno, id, smiles
            
def cross_validate():
    # Match a single bond not in a ring
    BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
    single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
    
    num_mols = num_tests = num_mismatches = 0
    time_trim = time_on_bond = 0.0

    # Helper function to print status information. Since
    # the function is defined inside of another function,
    # it has access to variables in the outer function.
    def print_status():
        print("Processed {} records, {} molecules, {} tests, {} mismatches."
              "  T_trim: {:.2f} T_on_bond {:.2f}"
              .format(recno, num_mols, num_tests, num_mismatches,
                      time_trim, time_on_bond))
    
    filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
    start_time = datetime.datetime.now()
    for recno, id, smiles in read_records(filename):
        if recno % 100 == 0:
            print_status()

        if "." in smiles:
            # The fragment_trim() method only works on a connected molecule.
            # Arbitrarily pick the component with the longest SMILES string.
            smiles = max(smiles.split("."), key=len)
            
        ## Remove directional bonds to see if there are mismatches
        ## which are not caused by directional bonds.
        #if "/" in smiles:
        #    smiles = smiles.replace("/", "")
        #if "\\" in smiles:
        #    smiles = smiles.replace("\\", "")
        
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        num_mols += 1

        # Find all of the single bonds
        matches = mol.GetSubstructMatches(single_bond_pat)

        for a1, a2 in matches:
            # For each pair of atom indices, split on that bond
            # and determine the canonicalized fragmented molecule.
            # Do it for both fragment_trim() ...
            t1 = time.time()
            mol_trim = fragment_trim(mol, a1, a2)
            t2 = time.time()
            time_trim += t2-t1
            smiles_trim = Chem.MolToSmiles(mol_trim, isomericSmiles=True)

            # ... and for fragment_on_bond()
            t1 = time.time()
            mol_on_bond = fragment_on_bond(mol, a1, a2)
            t2 = time.time()
            time_on_bond += t2-t1
            smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)

            # Report any mismatches and keep on going.
            num_tests += 1
            if smiles_trim != smiles_on_bond:
                print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
                      recno, id, smiles, a1, a2))
                print("   smiles_trim:", smiles_trim)
                print("smiles_on_bond:", smiles_on_bond)
                num_mismatches += 1
    
    end_time = datetime.datetime.now()
    elapsed_time = str(end_time - start_time)
    elapsed_time = elapsed_time.partition(".")[0]  # don't display subseconds
    
    print("Done.")
    print_status()
    print("elapsed time:", elapsed_time)

if __name__ == "__main__":
    cross_validate()

23 Aug 2016 12:00pm GMT

Python Software Foundation: The Python Software Foundation is seeking a blogger!

Interview prominent Pythonistas, connect with the community, expand your circle of friends and learn about events in the Python world!

The Python Software Foundation (PSF) is seeking a blogger to contribute to the PSF blog located at http://pyfound.blogspot.com/. As a PSF blogger you will work with the PSF Communication Officers to brainstorm blog content, communicate activities, and provide updates on content progression. Examples of content include PSF community service awardee profiles, details about global Python events and PSF grants, or recent goings-on within the PSF itself. One goal of the 2016 - 2017 PSF Board of Directors is to increase transparency around PSF activities by curating more frequent blog content.

The Python Software Foundation is a 501(c)(3) non-profit corporation that holds the intellectual property rights behind the Python programming language. We also run the North American PyCon conference annually, support other Python conferences/workshops around the world, and fund Python related development with our grants program. To see more info on our grants program, please read: https://www.python.org/psf/grants/.

Job Description
----------------------------

Needed Experience
----------------------------

Bloggers for the Python Software Foundation receive a fixed fee per post they produce.

To apply please email two to three examples of recent articles (e.g. personal blog, contribution to professional publication) as well as a brief account of your writing experience to psf-blog@python.org. If you have questions, direct them to psf-blog@python.org as well. UPDATE: The Python Software Foundation will be accepting applications until 11:59:59pm Pacific Standard Time Thursday, August 25th, 2016.

23 Aug 2016 10:26am GMT


S. Lott: On Generator Functions, Yield and Return

Here's the question, lightly edited to remove the garbage. (Sometimes I'm charitable and call it "rambling". Today, I'm not feeling charitable about the garbage writing style filled with strange assumptions instead of questions.)

someone asked if you could have both a yield and a return in the same ... function/iterator. There was debate and the senior people said, let's actually write code. They wrote code and proved that couldn't have both a yield and a return in the same ... function/iterator. ....

The meeting moved on w/out anyone asking the why question. Why doesn't it make sense to have both a yield and a return. ...


The impact of the yield statement can be confusing. Writing code to mess around with it was somehow unhelpful. And the shocking "proved that couldn't have both a yield and a return in the same ... function" is a serious problem.

(Or a seriously incorrect summary of the conversation; a very real possibility considering the garbage-encrusted email. Or a sign that Python 3 isn't widely-enough used and the email omitted this essential fact. And yes, I'm being overly sensitive to the garbage. But there's a better way to come to grips with reality and it involves asking questions and parsing details instead of repeating assumptions and writing garbage.)

An example


>>> def silly(n, stop=None):
...     for i in range(n):
...         if i == stop: return
...         yield i


>>> list(silly(5))
[0, 1, 2, 3, 4]
>>> list(silly(5, stop=3))
[0, 1, 2]

This works in both Python 3.5.1 and 2.7.10.

Some discussion

A function with no yield is a conventional function: the parameters from some domain are mapped to a return value in some range. Each mapping is a single evaluation of the function with concrete argument values.

A function with a yield statement becomes an iterable generator of (potentially) multiple values. The return statement changes its behavior slightly. It no longer defines the one (and only) return value. In a generator function (one that has a yield) the return statement can be thought of as if it raised the StopIteration exception as a way to exit from the generator.

As can be seen in the example above, both statements are in one function. They both work to provide expected semantics.
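
Here's a minimal sketch that makes that StopIteration behavior visible:

>>> def gen():
...     yield 1
...     return
...     yield 2
... 
>>> g = gen()
>>> next(g)
1
>>> next(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration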

The code which gets an error is this:

>>> def silly(n, stop=3):
...     for i in range(n):
...         if i == stop: return "boom!"
...         yield i


The "why?" question is should -- perhaps -- be obvious at this point. The return raises an exception; it doesn't provide a value.

The topic, however, remains troubling. The phrase "have both a yield and a return" is bothersome because it fails to recognize that the yield statement has a special role. The yield statement transforms the semantics of the function to make it into a different object with similar syntax.

It's not a matter of having them "both". It's a matter of having a return in a generator. This is an entirely separate and trivial-to-answer question.

A Long Useless Rant

The email seems to contain an implicit assumption. It's the notion that programming language semantics are subtle and slippery things. And even "senior people" can't get it right. Because all programming languages (other than the email sender's personal favorite) are inherently confusing. The confusion cannot be avoided.

There are times when programming language semantics are confusing. For example, the ++ operator in C is confusing. Nothing can be done about that. The original definition was tied to the PDP-11 machine instructions. Since then... Well.... Aspects of the generated code are formally undefined. Many languages have one or more places where the semantics are "undefined" or only defined by example.

This is not one of those times.

Here's the real problem I have with the garbage aspect of the email.

If you bring personal baggage to the conversation -- i.e., assumptions based on a comparison between some other language and Python -- confusion will erupt all over the place. Languages are different. Concepts don't map from language to language very well. Yes, there are simple abstract principles which have different concrete realizations in different languages. But among the various concrete realizations, there may not be a simple mapping.

It's essential to discard all knowledge of all previous favorite programming languages when learning a new language.

I'll repeat that for the author of the email.

Don't Go To The Well With A Full Bucket.

You won't get anything.

In this specific case, the notion of "function" in Python is expanded to include two superficially similar things. The syntax is nearly identical. But the behaviors are remarkably different. It's essential to grasp the idea that the two things are different, and can't be casually lumped together as "function/iterator".

The crux of the email appears to be a failure to get the Python language rules in a profound way.

23 Aug 2016 8:00am GMT


22 Aug 2016

feedPlanet Python

PyCharm: Next Batch of In-Depth Screencasts: VCS

In January we recorded a series of screencasts that introduced the major features of PyCharm - an overview, installation, editing, running, debugging, etc. In April we did our first "in-depth" screencast, focusing on testing.

We're happy to announce the next in-depth screencasts, and note the plural: this is a 3-part series on using version control systems (VCS) in PyCharm. The JetBrains IDEs, including PyCharm, have worked very hard over the years to put a productive, easy UI atop version control. These videos spend more time showing the features that are available.

In the first video we cover Getting Started with VCS, going over: Versioning without a VCS using Local History, Setting Up a Local Git Repository, Uploading to GitHub, and Getting a Checkout from GitHub.

Next, we go over Core VCS: Color Coding, Adding Files, Committing Changes, Using Refactor, then Diff, History, Revert, and Update.

The last video concentrates on Branching, Merging, and Pushing.

We're very happy to release these in-depth screencasts, which we've been working on for some time and were highly requested. And again, if you have any topics that you'd like to see get expanded screencast attention, let us know.

22 Aug 2016 3:45pm GMT


10 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: King Willams Town train station

Yesterday morning I had to go to the station in KWT to pick up the bus tickets we had reserved for the Christmas holidays in Capetown. The station itself has had no train service since December, for cost reasons - but Translux and co., the long-distance bus companies, have their offices there.






© benste CC NC SA

10 Nov 2011 10:57am GMT

09 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein

Nobody is worried about something like this - by car you just drive straight through, and in the city - near Gnobie - "nah, that only gets dangerous once the fire brigade is there" - 30 minutes later, on the way back, the fire brigade was there.




© benste CC NC SA

09 Nov 2011 8:25pm GMT

08 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Brai Party

Brai = a barbecue evening or something similar.

The would-be technicians patching up their SpeakOn / jack plug splitters...

The ladies - the "mamas" of the settlement - during the official opening speech

Even though fewer people came than expected - loud music and lots of people ...

And of course a fire with real wood for the barbecue.

© benste CC NC SA

08 Nov 2011 2:30pm GMT

07 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux Server back to her room. While doing that we saw her new decoration.

Björn and Simphiwe carried the PC to Katja's school


© benste CC NC SA

07 Nov 2011 2:00pm GMT

06 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's guest family for a special party. First of all we visited some friends of Nelisa - yeah, the one I'm working with in Quigney - Katja's guest father's sister - who gave her a haircut.

African women usually get their hair done by adding extensions, not, like Europeans, by just cutting some hair.

In between she looked like this...

And then she was done - looks amazing considering the amount of hair she had last week - doesn't it ?

© benste CC NC SA

06 Nov 2011 7:45pm GMT

05 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: My Saturday

Somehow it struck me today that I need to restructure my blog posts a bit - if I only ever report on new places, I would basically have to be on a permanent round trip. So here are a few things from my everyday life today.

First of all, Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm - Katja and Björn are at their placements by now, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza.
Robin, Rommel's wife, has been in Woodie Cape since Thursday to take care of a few things there.
Anyway, this morning we first treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist - Vodacom and Ethienne (the estate agent) - plus dropping off the missing items at NeedsCamp on the way back.

Just after setting off on the dirt road we realized that we hadn't packed the things for NeedsCamp and Ethienne, but did have the pump for the water supply in the car.

So in East London we first drove to Farmerama - no, not the online game Farmville, but a shop with all kinds of things for a farm - in Berea, a northern part of town.

At Farmerama we got advice on a quick-release coupling that should make life with the pump easier, and we also dropped off a lighter pump for repair, so that it isn't such a big effort every time the water runs out again.

Fego Caffé is in the Hemmingways Mall; there we had to get the PIN and PUK of one of our data SIM cards, because some digits unfortunately got swapped when entering the PIN. Anyway, shops in South Africa apparently store data as sensitive as a PUK - which in principle gives access to a locked phone.

In the cafe Rommel then did a few online transactions with the 3G modem, which was working again - and which, by the way, now works perfectly in Ubuntu, my Linux system.

On the side I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltops' centres. The picture shows the UMTS coverage in NeedsCamp, Katja's village. 8ta is a new phone provider from Telkom; after Vodafone bought Telkom's share of Vodacom, they have to build everything up from scratch.
We decided to organise a free prepaid card to test, because who knows how accurate the coverage map above is ... Before signing even the cheapest 24-month deal you should know whether it works.

After that it was off to Checkers in Vincent, looking for two hotplates for WoodyCape - R 129.00 each, so about 12€ for a two-ring hotplate.
As you can see in the background, there's already Christmas decoration - at the beginning of November, and that in South Africa at a sunny, warm 25°C or more.

For lunch we treated ourselves to a Pakistani curry takeaway - highly recommended!
Well, and after we got back an hour or so ago, I cleaned the fridge, which I had simply put outside this morning to defrost. Now it's clean again, without its 3 m thick layer of ice...

Tomorrow ... well, I'll report on that separately ... but probably not until Monday, because then I'll be back in Quigney (East London) again and will have free internet.

© benste CC NC SA

05 Nov 2011 4:33pm GMT

31 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltops Computer Centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

Pupils in the big classroom


The Trainer


School in Countryside


Adult Class in the Afternoon


"Town"


© benste CC NC SA

31 Oct 2011 4:58pm GMT

Benedict Stein: Technical Issues

What do you do in an internet cafe when your ADSL and fax line have been discontinued before month's end? Well, my idea was sitting outside and eating some ice cream.
At least it's sunny and not as rainy as on the weekend.


© benste CC NC SA

31 Oct 2011 3:11pm GMT

30 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nellis Restaurant

For those who are traveling through Zastron - there is a very nice restaurant which serves delicious food at reasonable prices.
In addition they sell home-made juices, jams and honey.




interior


home made specialities - the shop in the shop


the Bar


© benste CC NC SA

30 Oct 2011 4:47pm GMT

29 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: The way back from J'burg

On the 10 - 12h trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadsides.

Plain Street


Orange River in its beginnings (near Lesotho)


Zastron Anglican Church


The Bridge in Between "Free State" and Eastern Cape next to Zastron


my new Background ;)


If you listen to GoogleMaps you'll end up traveling 50km of gravel road - as it was just renewed we didn't have that many problems and saved 1h compared to going the official way with all its construction sites




Freeway


getting dark


© benste CC NC SA

29 Oct 2011 4:23pm GMT

28 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: How does a road construction site actually work?

Sure, some things may be different and a lot the same - but a sight that is an everyday one in Germany, a road construction site - how does that actually work in South Africa?

First of all - NO, no natives digging with their hands - even if more manpower is used here, they are busily working with technology.

A perfectly normal "main road"


and how it is being widened


a whooole lot of trucks


because here one side is completely closed over a long stretch, so you end up with temporary traffic lights and, in this case, a 45-minute wait


But at least they seem to be having fun ;) - as did we, since luckily we never had to wait longer than 10 min.

© benste CC NC SA

28 Oct 2011 4:20pm GMT