26 Mar 2017

Planet Python

Import Python: Import Python Weekly Issue 117 - Memoization and Decorators, Apache Airflow, Machine learning & more

Worthy Read

With memoization, we can "memoize" (remember, store) the result of problems that we've dealt with before, and return a stored result instead of repeating calculations.
core-python
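
As a quick illustration of the idea (my own sketch, not code from the linked article), a memoized Fibonacci in Python can be written with a small decorator; functools.lru_cache does the same job in one line:

from functools import wraps

def memoize(func):
    """Cache results keyed by the positional arguments."""
    cache = {}

    @wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))  # fast, because intermediate results are stored and reused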

9 DBs to choose from, 5 min setup, auto-scaling, Cloud hosted. Free for 30 Days.
Sponsor

Apache Airflow is an open-source tool for orchestrating complex computational workflows and data processing pipelines. If you find yourself running cron tasks which execute ever-longer scripts, or keeping a calendar of big data processing batch jobs, then Airflow can probably help you. This article provides an introductory tutorial for people who want to get started writing pipelines with Airflow.
airflow
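
To give a flavour of what an Airflow pipeline looks like, here is a minimal, hypothetical two-task DAG sketch using the Airflow 1.x-era BashOperator (the DAG name and commands are made up, not taken from the tutorial):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2017, 3, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# One DAG with two shell tasks: "extract" must finish before "load" starts.
dag = DAG("example_pipeline", default_args=default_args, schedule_interval="@daily")

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

extract.set_downstream(load)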

In this post, we'll take a step back to cover essential statistics that every data scientist should know.
statistics

PyMedium is an unofficial Medium API written in Python with Flask. It lets developers access user, post list, and post detail information from the Medium website. It is a read-only API for public Medium information; you can customize it to fit your requirements and deploy it on your own server.

This article discusses some basic methods and functions in TensorFlow used to visualize and monitor the training process. Visualization deserves high priority in research, because deep learning itself is a "black box"; good visualization can help us analyze why a final result succeeded or failed.
tensorflow
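
As a rough sketch of the kind of thing such an article covers (TensorFlow 1.x-style summaries written for TensorBoard; the names and values here are illustrative, not taken from the article):

import tensorflow as tf

# Log a scalar "loss" value at each step so it can be inspected in TensorBoard.
loss_value = tf.placeholder(tf.float32, name="loss_value")
tf.summary.scalar("loss", loss_value)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter("./logs", sess.graph)
    for step in range(100):
        summary = sess.run(merged, feed_dict={loss_value: 1.0 / (step + 1)})
        writer.add_summary(summary, step)
    writer.close()

# Inspect with: tensorboard --logdir ./logs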

pandas

pandas

machine learning

virtual memory

kafka

flask

In this step-by-step guide we'll show you how to transcribe an audio file using the IBM Watson speech-to-text API and a little bit of Python.
watson

I'll show you how you can turn an article into a one-sentence summary in Python with the Keras machine learning library. We'll go over word embeddings, encoder-decoder architecture, and the role of attention in learning theory.
machine learning, video

CI

django

core-python

Google's answer to AWS Lambda.
google cloud, cloud functions

If you have maintained a website with a huge amount of data, you probably already know that the Django Admin can become very slow when a database table gets very large. If you log the SQL queries (either using Django logging or the Django Debug Toolbar), you will notice a very expensive SQL query, something like this.
admin
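
The article's exact query and fix are not reproduced here, but the expensive statement is typically the full-table COUNT(*) that the admin changelist runs to report the total number of results. One common mitigation, available since Django 1.8, is to disable that count on the ModelAdmin; a hypothetical sketch (model name invented for illustration):

# admin.py
from django.contrib import admin
from myapp.models import Event


@admin.register(Event)
class EventAdmin(admin.ModelAdmin):
    list_per_page = 50
    # Skip the exact "SELECT COUNT(*) FROM ..." over the whole table that the
    # changelist otherwise issues just to display the total result count.
    show_full_result_count = False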


Projects

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment) http://www.readbeyond.it/aeneas/

Snaky - 28 Stars, 2 Forks
A snake game with three versions of AI included, implemented in Python with Pygame.

facebook-messenger-bot-tutorial, built with Python and Django.

Ticky - 9 Stars, 3 Forks
A Tic Tac Toe game, implemented in Python with Pygame. It includes an unbeatable computer AI. Have a try : )

26 Mar 2017 8:57am GMT

PyBites: Twitter digest 2017 week 12

Every weekend we share a curated list of 15 cool things (mostly Python) that we found / tweeted throughout the week.

26 Mar 2017 7:40am GMT

Weekly Python StackOverflow Report: (lxvi) stackoverflow python report

These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2017-03-26 06:01:11 GMT


  1. Python's Passing by References - [27/3]
  2. Why does __self__ of built-in functions return the builtin module it belongs to? - [9/1]
  3. Why is the output of print in python2 and python3 different with the same string? - [6/2]
  4. Why does random sampling scale with the dataset not the sample size? (pandas .sample() example) - [6/2]
  5. What's the difference between './' and '../' when using os.path.isdir()? - [6/1]
  6. Python: how to remove a value from a dict if it exactly matches the key? - [5/5]
  7. Setting value to a copy of a slice of a DataFrame - [5/3]
  8. Store most informative features from NLTK NaiveBayesClassifier in a list - [5/2]
  9. To Kill A Mocking Object: A Python Story - [5/1]
  10. Making sure a message published on a topic exchange is received by at least one consumer - [5/1]

26 Mar 2017 6:01am GMT

25 Mar 2017

Planet Python

PyBites: Code Challenge 11 - Generators for Fun and Profit - Review

It's end of the week again so we review the code challenge of this week. It's never late to sign up, just fork our challenges repo and start coding.

25 Mar 2017 8:10am GMT

24 Mar 2017

Planet Python

PyCharm: PyCharm 2017.1 Out Now: Faster debugger, new test runners, and more

PyCharm 2017.1 is out now! Get it now for a much faster debugger, improved Python and JavaScript unit testing, and support for the six library.

Get PyCharm 2017.1 now from our website

Please let us know what you think about PyCharm! You can reach us on Twitter, Facebook, and by leaving a comment on the blog.

PyCharm Team
-The Drive to Develop

24 Mar 2017 4:46pm GMT

Andrew Dalke: ChEMBL bioactivity data

I almost only use ChEMBL structure files. I download the .sdf files and process them. ChEMBL also supplies bioactivity data, which I've never worked with. Iain Watson suggested I look to it as a source of compound set data, and provided some example SQL queries. This blog post is primarily a set of notes for myself as I experiment with the queries and learn more about what is in the data file.

There is one bit of general advice. If you're going to use the SQLite dump from ChEMBL, make sure that you did "ANALYZE" at least on the tables of interest. This may take a few hours. I'm downloading ChEMBL-22-1 to see if it comes pre-analyzed. If it doesn't, I'll ask them to do so as part of their releases.

For those playing along from home (or the office, or wherever fine SQL database engines may be found), I downloaded the SQLite dump for ChEMBL 21, which is a lovely 2542883 KB (about 2.4 GB) compressed, and 12 GB uncompressed. That link also includes dumps for MySQL, Oracle, and Postgres, as well as schema documentation.

Unpack it the usual way (it takes a while to unpack 12GB), cd into the directory, and open the database using sqlite console:

% tar xf chembl_21_sqlite.tar.gz
% cd chembl_21_sqlite
% sqlite3 chembl_21.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite>

compound_structures

The 'compound_structures' table looks interesting. How many structures are there?

sqlite> select count(*) from compound_structures;
1583897

Wow. Just .. wow. That took several minutes to execute. This is a problem I've had before with large databases. SQLite doesn't store the total table size, so the initial count(*) ends up doing a full table scan. This brings in every B-tree node from disk, which requires a lot of random seeks for my poor hard disk made of spinning rust. (Hmm, Crucible says I can get a replacement 500GB SSD for only EUR 168. Hmmm.)

The second time and onwards is just fine, thanks to the power of caching.

What do the structures look like? I decided to show only a few of the smallest structures to keep the results from overflowing the screen:

sqlite>
...>
select molregno, standard_inchi, standard_inchi_key, canonical_smiles
from compound_structures where length(canonical_smiles) < 10 limit 4;

1813|InChI=1S/C4H11NO/c1-2-3-4-6-5/h2-5H2,1H3|WCVVIGQKJZLJDB-UHFFFAOYSA-N|CCCCON
3838|InChI=1S/C2H4INO/c3-1-2(4)5/h1H2,(H2,4,5)|PGLTVOMIXTUURA-UHFFFAOYSA-N|NC(=O)CI
4092|InChI=1S/C4H6N2/c5-4-6-2-1-3-6/h1-3H2|VEYKJLZUWWNWAL-UHFFFAOYSA-N|N#CN1CCC1
4730|InChI=1S/CH4N2O2/c2-1(4)3-5/h5H,(H3,2,3,4)|VSNHCAURESNICA-UHFFFAOYSA-N|NC(=O)NO

For fun, are there canonical SMILES which are listed multiple times? There are a few, so I decided to narrow it down to those with more than 2 instances. (None occur more than 3 times.)

sqlite> select canonical_smiles, count(*) from compound_structures group by canonical_smiles having count(*) > 2;
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+]([O-])C(C)C|3
CC(C)[C@@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[C@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[S+]([O-])c1nc(c2ccc(F)cc2)c([nH]1)c3ccnc(NC4CCCCC4)c3|3
  ...

Here are more details about the first output where the same SMILES is used multiple times:

sqlite>
...>
select molregno, standard_inchi from compound_structures
where canonical_smiles = "CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]";

1144470|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)
1144471|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m1/s1
1144472|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m0/s1

The differences are in the "/t(isotopic:stereo:sp3)", "/m(fixed_:stereo:sp3:inverted)", and "/s(fixed_H:stereo_type=abs)" layers. Got that?

I don't. I used the techniques of the next section to get the molfiles for each structure. The differences are in the bonds between atoms 23/24 (the sulfoxide, represented in charge-separated form) and atoms 23/25 (the methyl on the sulfur). The molfile for the first record has no assigned bond stereochemistry, the second has a down flag for the sulfoxide, and the third has a down flag for the methyl.

molfile column in compound_structures

There's a "molfile" entry. Does it really include the structure as a raw MDL molfile? Yes, yes it does:

sqlite> select molfile from compound_structures where molregno = 805;

          11280714442D 1   1.00000     0.00000     0

  8  8  0     0  0            999 V2000
    6.0750   -2.5667    0.0000 C   0  0  0  0  0  0           0  0  0
    5.3625   -2.9792    0.0000 N   0  0  3  0  0  0           0  0  0
    6.7917   -2.9792    0.0000 N   0  0  0  0  0  0           0  0  0
    5.3625   -3.8042    0.0000 C   0  0  0  0  0  0           0  0  0
    4.6542   -2.5667    0.0000 C   0  0  0  0  0  0           0  0  0
    6.0750   -1.7417    0.0000 C   0  0  0  0  0  0           0  0  0
    4.6542   -1.7417    0.0000 C   0  0  0  0  0  0           0  0  0
    5.3625   -1.3292    0.0000 C   0  0  0  0  0  0           0  0  0
  2  1  1  0     0  0
  3  1  2  0     0  0
  4  2  1  0     0  0
  5  2  1  0     0  0
  6  1  1  0     0  0
  7  8  1  0     0  0
  8  6  1  0     0  0
  7  5  1  0     0  0
M  END

Why did I choose molregno = 805? I looked for a structure with 8 atoms and 8 bonds by searching for the substring "  8  8  0", which is in the counts line. (It's not a perfect solution, but rather a good-enough one.)

sqlite> select molregno from compound_structures where molfile LIKE "%  8  8  0%" limit 1;
805

I bet with a bit of effort I could count the number of rings by using the molfile to get the bond counts and use the number of "."s in the canonical_smiles to get the number of fragments.
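
That calculation is the usual circuit rank: rings = bonds - atoms + fragments. A minimal Python sketch of the idea (my own illustration, assuming a V2000 molfile as returned above):

def ring_count(molfile, canonical_smiles):
    """Circuit rank: rings = bonds - atoms + fragments."""
    # Line 4 of a V2000 molfile is the counts line; the first two
    # 3-character fields are the atom count and the bond count.
    counts_line = molfile.splitlines()[3]
    n_atoms = int(counts_line[0:3])
    n_bonds = int(counts_line[3:6])
    # Each "." in the SMILES separates fragments.
    n_fragments = canonical_smiles.count(".") + 1
    return n_bonds - n_atoms + n_fragments

For molregno 805 above, that gives 8 - 8 + 1 = 1 ring.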

compound_properties and molecule_dictionary tables

The compound_properties table stores some molecular properties. I'll get the number of heavy atoms, the number of aromatic rings, and the full molecular weight for structure 805.

sqlite> select heavy_atoms, aromatic_rings, full_mwt from compound_properties where molregno = 805;
8|0|112.17

I've been using "805", which is an internal identifier. What's its public ChEMBL id?

sqlite> select chembl_id from molecule_dictionary where molregno = 805;
CHEMBL266980

What are some of the records with only 1 or 2 atoms?

sqlite>
...>
...>
select chembl_id, heavy_atoms from molecule_dictionary, compound_properties
where molecule_dictionary.molregno = compound_properties.molregno
and heavy_atoms < 3 limit 5;

CHEMBL1098659|1
CHEMBL115849|2
CHEMBL1160819|1
CHEMBL116336|2
CHEMBL116838|2

InChI and heavy atom count for large structures

I showed that some of the SMILES were used for two or three records. What about the InChI string? I started with:

sqlite>
...>
select molregno, standard_inchi, count(*) from compound_structures
group by standard_inchi having count(*) > 1;

1378059||9

After 10 minutes with no other output, I gave up. Those 9 occurrences have a NULL value, that is:

sqlite> select count(*) from compound_structures where standard_inchi is NULL;
9

I was confused at first because there are SMILES strings (I'll show only the first 40 characters), so there is structure information. The heavy atom count is also NULL:

sqlite>
...>
...>
select compound_structures.molregno, heavy_atoms, substr(canonical_smiles, 1, 40)
from compound_structures, compound_properties
where standard_inchi is NULL and compound_structures.molregno = compound_properties.molregno;

615447||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615448||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615449||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615450||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
615451||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C
1053861||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3
1053864||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3
1053865||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)N2C=CC
1378059||CC[C@H](C)[C@H](NC(=O)[C@H](CCCNC(=N)N)N

Then I realized it's because the schema specifies the "heavy_atoms" field as "NUMBER(3,0)". While SQLite ignores that limit, it looks like ChEMBL doesn't try to store a count above 999.

What I'll do instead is get the molecular formula, which shows that there are over 600 heavy atoms in those structures:

sqlite>
...>
...>
...>
...>
select chembl_id, full_molformula
from compound_structures, compound_properties, molecule_dictionary
where standard_inchi is NULL
and compound_structures.molregno = compound_properties.molregno
and compound_structures.molregno = molecule_dictionary.molregno;

CHEMBL1077162|C318H381N118O208P29
CHEMBL1077163|C319H383N118O209P29
CHEMBL1077164|C318H382N119O208P29
CHEMBL1077165|C325H387N118O209P29
CHEMBL1631334|C361H574N194O98P24S
CHEMBL1631337|C367H606N172O113P24
CHEMBL1631338|C362H600N180O106P24
CHEMBL2105789|C380H614N112O113S9

Those are some large structures! The reason there are no InChIs for them is that InChI didn't support large molecules until version 1.05, which came out in early 2017. Before then, InChI only supported 1024 atoms. Which is normally fine as most compounds are small (hence "small molecule chemistry"). In fact, there aren't any records with more than 79 heavy atoms:

sqlite>
...>
select heavy_atoms, count(*) from compound_properties
where heavy_atoms > 70 group by heavy_atoms;

71|364
72|207
73|46
74|29
75|3
76|7
78|2
79|2

How in the world do these large structures have 600+ atoms? Are they peptides? Mmm, no, not all. The first 8 contain a lot of phosphorus. I'm guessing some sort of nucleic acid. The last might be a protein. Perhaps I can get a clue from the chemical name, which is in the compound_records table. Here's an example using the molregno 805 from earlier:

sqlite> select * from compound_records where molregno = 805;
1063|805|14385|14|1-Methyl-piperidin-(2Z)-ylideneamine|1|

Some of the names of the 600+ atom molecules are too long, so I'll limit the output to the first 50 characters of the name:

sqlite>
...>
...>
...>
...>
...>
select chembl_id, full_molformula, substr(compound_name, 1, 50)
from compound_structures, molecule_dictionary, compound_properties, compound_records
where standard_inchi is NULL
and compound_structures.molregno = molecule_dictionary.molregno
and compound_structures.molregno = compound_properties.molregno
and compound_structures.molregno = compound_records.molregno;

CHEMBL1077161|C307H368N116O200P28|{[(2R,3S,4R,5R)-5-(4-amino-2-oxo-1,2-dihydropyrimi
CHEMBL1077162|C318H381N118O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077163|C319H383N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077164|C318H382N119O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1077165|C325H387N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo
CHEMBL1631334|C361H574N194O98P24S|HRV-EnteroX
CHEMBL1631337|C367H606N172O113P24|PV-5'term
CHEMBL1631338|C362H600N180O106P24|PV-L4
CHEMBL2105789|C380H614N112O113S9|Mirostipen

That didn't help much, but I could at least do a web search for some of the names. For example, HRV-EnteroX is a PPMO (peptide-conjugated phosphorodiamidate morpholino oligomer), which is where those phosphorus atoms come from.

The names weren't really much help, and the images at ChEMBL were too small to make sense of the structures, so I looked at them over at PubChem. HRV-EnteroX looks like a 12-mer peptide conjugated to about 25 morpholino oligomers. Mirostipen looks like a peptide. CHEMBL1077161 looks like a nucleic acid strand.

I don't think there's anything interesting to explore in this direction so I'll move on.

Assay data

I'll take a look at assay data, which I deal with a lot less often than I do structure data. How many assays are there?

sqlite> select count(*) from assays;
1212831

Okay, and how many of them are human assays? For that I need the NCBI taxonomy id. Iain's example code uses 9606, which the NCBI web site tells me is for Homo sapiens. I don't think there's a table in the SQLite data dump with all of the taxonomy ids. The organism_class table says only:

sqlite> select * from organism_class where tax_id = 9606;
7|9606|Eukaryotes|Mammalia|Primates

The assay table "assay_organism" column stores the "[n]ame of the organism for the assay system", with the caution "[m]ay differ from the target organism (e.g., for a human protein expressed in non-human cells, or pathogen-infected human cells)." I'll throw caution to the wind and check that field:

sqlite> select count(*) from assays where assay_organism = "Homo sapiens";
291143
sqlite>
...>
select assay_organism, count(*) from assays
where assay_tax_id = 9606 group by assay_organism;

|17
Homo sapiens|291135

sqlite> select count(*) from assays where assay_tax_id = 9606 and assay_organism is NULL;
17

It looks like 9606 is indeed for humans.

Assay activities

What sort of assay activities are there?

sqlite> select distinct published_type from activities;

 ED50
 Transactivation
%
% Cell Death
 ...
AUC
AUC (0-24h)
AUC (0-4h)
AUC (0-infinity)
 ...
Change
Change HDL -C
Change MAP
Change TC
 ...

Okay, quite a few. There appear to be some typos as well:

sqlite>
...>
...>
select published_type, count(*) from activities where published_type in ("Activity", "A ctivity",
"Ac tivity", "Act ivity", "Acti vity", "Activ ity", "Activi ty", "Activit y", "Activty")
group by published_type;

A ctivity|1
Activ ity|2
Activit y|1
Activity|700337
Activty|1

After another 20 minutes of data exploration, I realized that there are two different types. The "published_type" is what the assayer published, while there's also a "standard_type", which looks to be a value normalized by ChEMBL:

sqlite>
...>
select published_type, standard_type from activities
where published_type in ("A ctivity", "Activ ity", "Activit y", "Activty");

A ctivity|Activity
Activ ity|Activity
Activ ity|Activity
Activit y|Activity
Activty|Activity

There are many ways to publish a report with IC50 data. I'll show only those that end with "IC50".

sqlite> select distinct published_type from activities where published_type like "%IC50";
-Log IC50
-Log IC50/IC50
-logIC50
Average IC50
CC50/IC50
CCIC50
CIC IC50
CIC50
Change in IC50
Cytotoxicity IC50
Decrease in IC50
EIC50
FIC50
Fold IC50
I/IC50
IC50
IC50/IC50
Increase in IC50
Log 1/IC50
Log IC50
MBIC50
MIC50
Mean IC50
RIC50
Ratio CC50/IC50
Ratio CIC95/IC50
Ratio ED50/MIC50
Ratio IC50
Ratio LC50/IC50
Ratio LD50/IC50
Ratio pIC50
Ratio plasma concentration/IC50
Relative ET-A IC50
Relative IC50
TBPS IC50
TC50/IC50
Time above IC50
fIC50
log1/IC50
logIC50
pIC50
pMIC50
rIC50

The "p" prefix, as in "pIC50", is shorthand for "-log", so "-Log IC50", "Log 1/IC50", and "pIC50" are almost certainly the same units. Let's see:

sqlite>
...>
select distinct published_type, standard_type from activities
where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");

-Log IC50|IC50
-Log IC50|pIC50
-Log IC50|-Log IC50
Log 1/IC50|IC50
Log 1/IC50|Log 1/IC50
pIC50|IC50
pIC50|pIC50
pIC50|Log IC50
pIC50|-Log IC50

Well color me confused. Oh! There's a "standard_flag", which "[s]hows whether the standardised columns have been curated/set (1) or just default to the published data (0)." Perhaps that will help enlighten me.

sqlite>
...>
select distinct published_type, standard_flag, standard_type from activities
where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");

-Log IC50|1|IC50
-Log IC50|1|pIC50
-Log IC50|0|-Log IC50
Log 1/IC50|1|IC50
Log 1/IC50|0|Log 1/IC50
pIC50|1|IC50
pIC50|1|pIC50
pIC50|0|Log IC50
pIC50|1|-Log IC50
pIC50|0|pIC50

Nope, I still don't understand what's going on. I'll assume it's all tied to the complexities of data curation. For now, I'll assume that the data set is nice and clean.
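
Setting the curation puzzle aside, the "p" conversion itself is simple arithmetic: pIC50 = -log10(IC50 expressed in molar). A quick Python check (my own illustration, not from the dump):

import math

def pic50_from_nM(ic50_nM):
    """pIC50 = -log10(IC50 in molar); 1 nM is 1e-9 M."""
    return -math.log10(ic50_nM * 1e-9)

print(pic50_from_nM(1000.0))  # 1 uM  -> 6.0
print(pic50_from_nM(50.0))    # 50 nM -> about 7.3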

IC50 types

Let's look at the "IC50" values only. How do the "published_type" and "standard_type" columns compare to each other?

sqlite>
...>
select published_type, standard_type, count(*) from activities
where published_type = "IC50" group by standard_type;

IC50|% Max Response|21
IC50|Change|2
IC50|Control|1
IC50|Electrophysiological activity|6
IC50|Fold Inc IC50|1
IC50|Fold change IC50|12
IC50|IC50|1526202
IC50|IC50 ratio|1
IC50|Inhibition|12
IC50|Log IC50|20
IC50|Ratio IC50|40
IC50|SI|4
IC50|T/C|1

sqlite>
...>
select published_type, standard_type, count(*) from activities
where standard_type = "IC50" group by published_type;

-Log IC50|IC50|1736
-Log IC50(M)|IC50|28
-Log IC50(nM)|IC50|39
-logIC50|IC50|84
3.3|IC50|1
Absolute IC50 (CHOP)|IC50|940
Absolute IC50 (XBP1)|IC50|940
Average IC50|IC50|34
CIC50|IC50|6
I 50|IC50|202
I-50|IC50|25
I50|IC50|6059
IC50|IC50|1526202
IC50 |IC50|52
IC50 app|IC50|39
IC50 max|IC50|90
IC50 ratio|IC50|2
IC50(app)|IC50|457
IC50_Mean|IC50|12272
IC50_uM|IC50|20
ID50|IC50|3
Log 1/I50|IC50|280
Log 1/IC50|IC50|589
Log 1/IC50(nM)|IC50|88
Log IC50|IC50|7013
Log IC50(M)|IC50|3599
Log IC50(nM)|IC50|77
Log IC50(uM)|IC50|28
Mean IC50|IC50|1
NA|IC50|5
NT|IC50|20
log(1/IC50)|IC50|1016
pI50|IC50|2386
pIC50|IC50|43031
pIC50(mM)|IC50|71
pIC50(nM)|IC50|107
pIC50(uM)|IC50|419

Yeah, I'm going to throw my hands up here, declare "I'm a programmer, Jim, not a bioactivity specialist", and simply use the published_type of IC50.

IC50 activity values

How are the IC50 values measured? Here too I need to choose between "published_units" and "standard_units". A quick look at the two shows that the standard_units are less diverse.

sqlite>
...>
select standard_units, count(*) from activities where published_type = "IC50"
group by standard_units;

|167556
%|148
% conc|70
10'-11uM|1
10'-4umol/L|1
M ml-1|15
equiv|64
fg ml-1|1
g/ha|40
kJ m-2|20
liposomes ml-1|5
mMequiv|38
mg kg-1|248
mg.min/m3|4
mg/kg/day|1
milliequivalent|22
min|9
ml|3
mmol/Kg|10
mol|6
molar ratio|198
nA|6
nM|1296169
nM g-1|1
nM kg-1|4
nM unit-1|7
nmol/Kg|1
nmol/mg|5
nmol/min|1
ppm|208
ppm g dm^-3|7
uL|7
uM hr|1
uM tube-1|9
uM well-1|52
uM-1|25
uM-1 s-1|1
ucm|6
ucm s-1|2
ug|168
ug cm-2|1
ug g-1|2
ug well-1|12
ug.mL-1|61139
ug/g|16
umol kg-1|3
umol.kg-1|8
umol/dm3|2

"Less diverse", but still diverse. By far the most common is "nM" for "nanomolar", which is the only unit I expected. How many IC50s have an activity better than 1 micromolar, which is 1000 nM?

sqlite>
...>
select count(*) from activities where published_type = "IC50"
and standard_value < 1000 and standard_units = "nM";

483041

That's fully 483041/1212831 = 40% of the assays in the data dump.

How many of the IC50s are in humans? For that I need a join with the assays table using the assay_id:

sqlite>
...>
...>
...>
...>
select count(*) from activities, assays
where published_type = "IC50"
and standard_value < 1000 and standard_units = "nM"
and activities.assay_id = assays.assay_id
and assay_tax_id = 9606;

240916

About 1/2 of them are in humans.

Assay target type from target_dictionary

What are the possible assay targets in humans? That information is stored in the target_dictionary:

sqlite> select distinct target_type from target_dictionary where tax_id = 9606;
SINGLE PROTEIN
ORGANISM
CELL-LINE
UNKNOWN
PROTEIN COMPLEX
SUBCELLULAR
TISSUE
NUCLEIC-ACID
PROTEIN FAMILY
PROTEIN-PROTEIN INTERACTION
PROTEIN COMPLEX GROUP
SELECTIVITY GROUP
CHIMERIC PROTEIN
MACROMOLECULE
SMALL MOLECULE
OLIGOSACCHARIDE

Remember earlier when I threw caution to the wind? How many of the assays are actually against human targets? I can join on the target id "tid" to compare the taxon id in the target vs. the taxon id in the assay:

sqlite>
...>
...>
select count(*) from assays, target_dictionary
where assays.tid = target_dictionary.tid
and target_dictionary.tax_id = 9606;

301810

sqlite>
...>
...>
select count(*) from assays, target_dictionary
where assays.tid = target_dictionary.tid
and assays.assay_tax_id = 9606;

291152

Compare assay organisms with target organism

What are some of the non-human assay organisms where the target is humans?

sqlite>
...>
...>
...>
...>
select distinct assay_organism from assays, target_dictionary
where assays.tid = target_dictionary.tid
and assays.assay_tax_id != 9606
and target_dictionary.tax_id = 9606
limit 10;

rice
Saccharomyces cerevisiae
Oryza sativa
Rattus norvegicus
Sus scrofa
Cavia porcellus
Oryctolagus cuniculus
Canis lupus familiaris
Proteus vulgaris
Salmonella enterica subsp. enterica serovar Typhi

Compounds tested against a target name

I'm interested in the "SINGLE PROTEIN" target names in humans. The target name is a manually curated field.

sqlite> select distinct pref_name from target_dictionary where tax_id = 9606 limit 5;
Maltase-glucoamylase
Sulfonylurea receptor 2
Phosphodiesterase 5A
Voltage-gated T-type calcium channel alpha-1H subunit
Dihydrofolate reductase

What structures are used in "Dihydrofolate reductase" assays? This requires three table joins: one on 'tid' to go from target_dictionary to assays, another on 'assay_id' to get to the activity, and another on 'molregno' to go from the activity to molecule_dictionary so I can get the compound's chembl_id. (To make it more interesting, three of the tables have a chembl_id column.)

sqlite>
...>
...>
...>
...>
...>
...>
select distinct molecule_dictionary.chembl_id
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = "Dihydrofolate reductase"
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.molregno = molecule_dictionary.molregno
limit 10;

CHEMBL1679
CHEMBL429694
CHEMBL106699
CHEMBL422095
CHEMBL1161155
CHEMBL350033
CHEMBL34259
CHEMBL56282
CHEMBL173175
CHEMBL173901
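
The same join is easy to run from Python with the standard library's sqlite3 module, with the target name passed as a parameter; a minimal sketch of my own (assuming chembl_21.db is in the current directory):

import sqlite3

QUERY = """
select distinct molecule_dictionary.chembl_id
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = ?
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.molregno = molecule_dictionary.molregno
limit 10
"""

conn = sqlite3.connect("chembl_21.db")
for (chembl_id,) in conn.execute(QUERY, ("Dihydrofolate reductase",)):
    print(chembl_id)
conn.close()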

sqlite>
...>
...>
...>
...>
...>
select count(distinct molecule_dictionary.chembl_id)
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = "Dihydrofolate reductase"
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.molregno = molecule_dictionary.molregno;

3466

There are 3466 of these, including non-human assays. I'll limit it to human ones only:

sqlite>
...>
...>
...>
...>
...>
...>
select count(distinct molecule_dictionary.chembl_id)
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = "Dihydrofolate reductase"
and target_dictionary.tax_id = 9606
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.molregno = molecule_dictionary.molregno;

1386

I'll further limit it to those with an IC50 of under 1 micromolar:

sqlite>
sqlite>
...>
...>
...>
...>
...>
...>
...>
...>
...>
.timer on
select count(distinct molecule_dictionary.chembl_id)
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = "Dihydrofolate reductase"
and target_dictionary.tax_id = 9606
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.published_type = "IC50"
and activities.standard_units = "nM"
and activities.standard_value < 1000
and activities.molregno = molecule_dictionary.molregno;

255
Run Time: real 174.561 user 18.073715 sys 23.285346

I turned on the timer to show that the query took about 3 minutes! I repeated it to ensure that it wasn't a simple cache issue. Still about 3 minutes.
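
When a query slows down like this, one way to see what SQLite decided to do is to prefix the query with EXPLAIN QUERY PLAN. A small sketch from Python (my own addition, not from the original session):

import sqlite3

query = """
select count(distinct molecule_dictionary.chembl_id)
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = 'Dihydrofolate reductase'
and target_dictionary.tax_id = 9606
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.published_type = 'IC50'
and activities.standard_units = 'nM'
and activities.standard_value < 1000
and activities.molregno = molecule_dictionary.molregno
"""

conn = sqlite3.connect("chembl_21.db")
# Each row describes one step of the chosen plan (table scans, index lookups, ...).
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
conn.close()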

ANALYZE the tables

The earlier query, without the activity filter, took 5.7 seconds when the data wasn't cached, and 0.017 seconds when cached. It found 1386 matches. The new query takes almost 3 minutes more to filter those 1386 matches down to 255. That should not happen.

This is a strong indication that the query planner used the wrong plan. I've had this happen before. My solution then was to "ANALYZE" the tables, which "gathers statistics about tables and indices and stores the collected information in internal tables of the database where the query optimizer can access the information and use it to help make better query planning choices."

It can take a while, so I limited it to the tables of interest.

sqlite> analyze target_dictionary;
Run Time: real 0.212 user 0.024173 sys 0.016268
sqlite> analyze assays;
Run Time: real 248.184 user 5.890109 sys 4.793236
sqlite> analyze activities;
Run Time: real 6742.390 user 97.862790 sys 129.854073
sqlite> analyze molecule_dictionary;
Run Time: real 33.879 user 2.195662 sys 2.043848

Yes, it took almost 2 hours to analyze the activities table. But it was worth it from a pure performance point of view. I ran the above query twice, with this pattern:

% sudo purge  # clear the filesystem cache
% sqlite3 chembl_21.db  # start SQLite
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> .... previous query, with filter for IC50 < 1uM ...
255
Run Time: real 8.595 user 0.038847 sys 0.141945
sqlite> .... repeat query using a warm cache
255
Run Time: real 0.009 user 0.005255 sys 0.003653

Nice! Now I only need to do about 60 such queries to justify the overall analysis time.

24 Mar 2017 12:00pm GMT

Andrew Dalke: ChEMBL bioactivity data

I almost only use ChEMBL structure files. I download the .sdf files and process them. ChEMBL also supplies bioactivity data, which I've never worked with. Iain Watson suggested I look to it as a source of compound set data, and provided some example SQL queries. This blog post is primarily a set of notes for myself as I experiment with the queries and learn more about what is in the data file.

There is one bit of general advice. If you're going to use the SQLite dump from ChEMBL, make sure that you did "ANALYZE" at least on the tables of interest. This may take a few hours. I'm downloading ChEMBL-22-1 to see if it comes pre-analyzed. If it doesn't, I'll ask them to do so as part of their releases.

For those playing along from home (or the office, or whereever fine SQL database engines may be found), I downloaded the SQLite dump for ChEMBL 21, which is a lovely 2542883 KB (or 2.4) compressed, and 12 GB uncompressed. That link also includes dumps for MySQL, Oracle, and Postgres, as well as schema documentation.

Unpack it the usual way (it takes a while to unpack 12GB), cd into the directory, and open the database using sqlite console:

% tar xf chembl_21_sqlite.tar.gz
% cd chembl_21_sqlite
% sqlite3 chembl_21.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite>

compound_structures

The 'compound_structures' table looks interesting. How many structures are there?

sqlite> select count(*) from compound_structures;
1583897

Wow. Just .. wow. That took a several minutes to execute. This is a problem I've had before with large databases. SQLite doesn't store the total table size, so the initial count(*) ends up doing a full table scan. This brings in every B-tree node from disk, which requires a lot of random seeks for my poor hard disk made of spinning rust. (Hmm, Crucible says I can get a replacement 500GB SSD for only EUR 168. Hmmm.)

The second time and onwards is just fine, thanks to the power of caching.

What does the structures look like? I'll decided to show only a few of the smallest structures to keep the results from overflowing the screen:

sqlite>
...>
select molregno, standard_inchi, standard_inchi_key, canonical_smiles
from compound_structures where length(canonical_smiles) 10 limit 4;

1813|InChI=1S/C4H11NO/c1-2-3-4-6-5/h2-5H2,1H3|WCVVIGQKJZLJDB-UHFFFAOYSA-N|CCCCON 3838|InChI=1S/C2H4INO/c3-1-2(4)5/h1H2,(H2,4,5)|PGLTVOMIXTUURA-UHFFFAOYSA-N|NC(=O)CI 4092|InChI=1S/C4H6N2/c5-4-6-2-1-3-6/h1-3H2|VEYKJLZUWWNWAL-UHFFFAOYSA-N|N#CN1CCC1 4730|InChI=1S/CH4N2O2/c2-1(4)3-5/h5H,(H3,2,3,4)|VSNHCAURESNICA-UHFFFAOYSA-N|NC(=O)NO

For fun, are there canonical SMILES which are listed multiple times? There are a few, so I decided to narrow it down to those with more than 2 instances. (None occur more than 3 times.)

sqlite> select canonical_smiles, count(*) from compound_structures group by canonical_smiles having count(*) > 2;
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+]([O-])C(C)C|3
CC(C)[C@@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[C@H](C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]|3
CC(C)[S+]([O-])c1nc(c2ccc(F)cc2)c([nH]1)c3ccnc(NC4CCCCC4)c3|3
  ...

Here are more details about the first output where the same SMILES is used multiple times:

sqlite>
...>
select molregno, standard_inchi from compound_structures
where canonical_smiles = "CC(C)Nc1cc(ccn1)c2[nH]c(nc2c3ccc(F)cc3)[S+](C)[O-]";

1144470|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23) 1144471|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m1/s1 1144472|InChI=1S/C18H19FN4OS/c1-11(2)21-15-10-13(8-9-20-15)17-16(22-18(23-17)25(3)24)12-4-6-14(19)7-5-12/h4-11H,1-3H3,(H,20,21)(H,22,23)/t25-/m0/s1 The differences are in the "/t(isotopic:stereo:sp3)", "/m(fixed_:stereo:sp3:inverted)", and "/s(fixed_H:stereo_type=abs)" layers. Got that?

I don't. I used the techniques of the next section to get the molfiles for each structure. The differences are in the bonds between atoms 23/24 (the sulfoxide, represented in charge-separated form) and atoms 23/25 (the methyl on the sulfur). The molfile for the first record has no asigned bond stereochemistry, the second has a down flag for the sulfoxide, and the third has a down flag for the methyl.

molfile column in compound_structures

There's a "molfile" entry. Does it really include the structure as a raw MDL molfile? Yes, yes it does:

sqlite> select molfile from compound_structures where molregno = 805;

          11280714442D 1   1.00000     0.00000     0

  8  8  0     0  0            999 V2000
    6.0750   -2.5667    0.0000 C   0  0  0  0  0  0           0  0  0
    5.3625   -2.9792    0.0000 N   0  0  3  0  0  0           0  0  0
    6.7917   -2.9792    0.0000 N   0  0  0  0  0  0           0  0  0
    5.3625   -3.8042    0.0000 C   0  0  0  0  0  0           0  0  0
    4.6542   -2.5667    0.0000 C   0  0  0  0  0  0           0  0  0
    6.0750   -1.7417    0.0000 C   0  0  0  0  0  0           0  0  0
    4.6542   -1.7417    0.0000 C   0  0  0  0  0  0           0  0  0
    5.3625   -1.3292    0.0000 C   0  0  0  0  0  0           0  0  0
  2  1  1  0     0  0
  3  1  2  0     0  0
  4  2  1  0     0  0
  5  2  1  0     0  0
  6  1  1  0     0  0
  7  8  1  0     0  0
  8  6  1  0     0  0
  7  5  1  0     0  0
M  END

Why did I choose molregno = 805? I looked for a structure with 8 atoms and 8 bond by searching for the substring " 8 8 0", which is in the counts line. (It's not a perfect solution, but rather a good-enough one.

sqlite> select molregno from compound_structures where molfile LIKE "%  8  8  0%" limit 1;
805

I bet with a bit of effort I could count the number of rings by using the molfile to get the bond counts and use the number of "."s in the canonical_smiles to get the number of fragments.

compound_properties and molecule_dictionary tables

The compound_properties table stores some molecular properties. I'll get the number of heavy atoms, the number of aromatic rings, and the full molecular weight for structure 805.

sqlite> select heavy_atoms, aromatic_rings, full_mwt from compound_properties where molregno = 805;
8|0|112.17

I've been using "805", which is an internal identifier. What's its public ChEMBL id?

sqlite> select chembl_id from molecule_dictionary where molregno = 805;
CHEMBL266980

What are some of the records with only 1 or 2 atoms?

sqlite>
...>
...>
select chembl_id, heavy_atoms from molecule_dictionary, compound_properties
where molecule_dictionary.molregno = compound_properties.molregno
and heavy_atoms < 3 limit 5;

CHEMBL1098659|1 CHEMBL115849|2 CHEMBL1160819|1 CHEMBL116336|2 CHEMBL116838|2

InChI and heavy atom count for large structures

I showed that some of the SMILES were used for two or three records. What about the InChI string? I started with:

sqlite>
...>
select molregno, standard_inchi, count(*) from compound_structures
group by standard_inchi having count(*) > 1;

1378059||9 After 10 minutes with no other output, I gave up. Those 9 occurrences have a NULL value, that is:

sqlite> select count(*) from compound_structures where standard_inchi is NULL;
9

I was confused at first because there are SMILES string (I'll show only the first 40 characters), so there is structure information. The heavy atom count is also NULL:

sqlite>
...>
...>
select compound_structures.molregno, heavy_atoms, substr(canonical_smiles, 1, 40)
from compound_structures, compound_properties
where standard_inchi is NULL and compound_structures.molregno = compound_properties.molregno;

615447||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 615448||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 615449||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 615450||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 615451||CC1=CN([C@H]2C[C@H](OP(=O)(O)OC[C@H]3O[C 1053861||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3 1053864||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)n2cnc3 1053865||CN(C)P(=O)(OC[C@@H]1CN(C[C@@H](O1)N2C=CC 1378059||CC[C@H](C)[C@H](NC(=O)[C@H](CCCNC(=N)N)N Then I realized it's because the schema specifies the "heavy_atoms" field as "NUMBER(3,0)". While SQLite ignores that limit, it looks like ChEMBL doesn't try to store a count above 999.

What I'll do instead is get the molecular formula, which shows that there are over 600 heavy atoms in those structures:

sqlite>
...>
...>
...>
...>
select chembl_id, full_molformula
from compound_structures, compound_properties, molecule_dictionary
where standard_inchi is NULL
and compound_structures.molregno = compound_properties.molregno
and compound_structures.molregno = molecule_dictionary.molregno;

CHEMBL1077162|C318H381N118O208P29 CHEMBL1077163|C319H383N118O209P29 CHEMBL1077164|C318H382N119O208P29 CHEMBL1077165|C325H387N118O209P29 CHEMBL1631334|C361H574N194O98P24S CHEMBL1631337|C367H606N172O113P24 CHEMBL1631338|C362H600N180O106P24 CHEMBL2105789|C380H614N112O113S9 Those are some large structures! The reason there are no InChIs for them is that InChI didn't support large molecules until version 1.05, which came out in early 2017. Before then, InChI only supported 1024 atoms. Which is normally fine as most compounds are small (hence "small molecule chemistry"). In fact, there aren't any records with more than 79 heavy atoms:

sqlite>
...>
select heavy_atoms, count(*) from compound_properties
where heavy_atoms > 70 group by heavy_atoms;

71|364 72|207 73|46 74|29 75|3 76|7 78|2 79|2 How in the world do these large structures have 600+ atoms? Are they peptides? Mmm, no, not all. The first 8 contain a lot of phosphorouses. I'm guessing some sort of nucleic acid. The last might be a protein. Perhaps I can get a clue from the chemical name, which is in the compound_records table. Here's an example using the molregno 805 from earlier:

sqlite> select * from compound_records where molregno = 805;
1063|805|14385|14|1-Methyl-piperidin-(2Z)-ylideneamine|1|

Some of the names of the 600+ atom molecules are too long, so I'll limit the output to the first 50 characters of the name:

sqlite>
...>
...>
...>
...>
...>
select chembl_id, full_molformula, substr(compound_name, 1, 50)
from compound_structures, molecule_dictionary, compound_properties, compound_records
where standard_inchi is NULL
and compound_structures.molregno = molecule_dictionary.molregno
and compound_structures.molregno = compound_properties.molregno
and compound_structures.molregno = compound_records.molregno;

CHEMBL1077161|C307H368N116O200P28|{[(2R,3S,4R,5R)-5-(4-amino-2-oxo-1,2-dihydropyrimi CHEMBL1077162|C318H381N118O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo CHEMBL1077163|C319H383N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo CHEMBL1077164|C318H382N119O208P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo CHEMBL1077165|C325H387N118O209P29|{[(2R,3S,5R)-2-{[({[(2R,3S,4R,5R)-5-(4-amino-2-oxo CHEMBL1631334|C361H574N194O98P24S|HRV-EnteroX CHEMBL1631337|C367H606N172O113P24|PV-5'term CHEMBL1631338|C362H600N180O106P24|PV-L4 CHEMBL2105789|C380H614N112O113S9|Mirostipen That didn't help much, but I could at least do a web search for some of the names. For example, HRV-EnteroX is a PPMO (peptide-conjugated phosphorodiamidate morpholino oligomers), which is where those phosphorous atoms come from.

The names weren't really help, and the images at ChEMBL were too small to make sense of the structures, so I looked at them over at PubChem. HRV-EnteroX looks like a 12-mer peptide conjugated to about 25 morpholino oligomers. Mirostipen looks like a peptide. CHEMBL1077161 looks like a nucleic acid strand.

I don't think there's anything interesting to explore in this direction so I'll move on.

Assay data

I'll take a look at assay data, which I deal with a lot less often than I do structure data. How many assays are there?

sqlite> select count(*) from assays;
1212831

Okay, and how many of them are human assays? For that I need the NCBI taxonomy id. Iain's example code uses 9606, which the NCBI web site tells me is for Homo sapiens. I don't think there's a table in the SQLite data dump with all of the taxonomy ids. The organism_class table says only:

sqlite> select * from organism_class where tax_id = 9606;
7|9606|Eukaryotes|Mammalia|Primates

The assay table "assay_organism" column stores the "[n]ame of the organism for the assay system", with the caution "[m]ay differ from the target organism (e.g., for a human protein expressed in non-human cells, or pathogen-infected human cells)." I'll throw caution to the wind and check that field:

sqlite> select count(*) from assays where assay_organism = "Homo sapiens";
291143
sqlite>
...>
select assay_organism, count(*) from assays
where assay_tax_id = 9606 group by assay_organism;

|17 Homo sapiens|291135 sqlite> select count(*) from assays where assay_tax_id = 9606 and assay_organism is NULL; 17 It looks like 9606 is indeed for humans.

Assay activities

What sort of assay activities are there?

sqlite> select distinct published_type from activities;

 ED50
 Transactivation
%
% Cell Death
 ...
AUC
AUC (0-24h)
AUC (0-4h)
AUC (0-infinity)
 ...
Change
Change HDL -C
Change MAP
Change TC
 ...

Okay, quite a few. There appear to be some typos as well:

sqlite>
...>
...>
select published_type, count(*) from activities where published_type in ("Activity", "A ctivity",
"Ac tivity", "Act ivity", "Acti vity", "Activ ity", "Activi ty", "Activit y", "Activty")
group by published_type;

A ctivity|1 Activ ity|2 Activit y|1 Activity|700337 Activty|1 After another 20 minutes of data exploration, I realized that there are two different types. The "published_type" is what the assayer published, while there's also a "standard_type", which looks to be a normalized value by ChEMBL:

sqlite>
...>
select published_type, standard_type from activities
where published_type in ("A ctivity", "Activ ity", "Activit y", "Activty");

A ctivity|Activity Activ ity|Activity Activ ity|Activity Activit y|Activity Activty|Activity There are a many ways to publish a report with IC50 data. I'll show only those that end with "IC50".

sqlite> select distinct published_type from activities where published_type like "%IC50";
-Log IC50
-Log IC50/IC50
-logIC50
Average IC50
CC50/IC50
CCIC50
CIC IC50
CIC50
Change in IC50
Cytotoxicity IC50
Decrease in IC50
EIC50
FIC50
Fold IC50
I/IC50
IC50
IC50/IC50
Increase in IC50
Log 1/IC50
Log IC50
MBIC50
MIC50
Mean IC50
RIC50
Ratio CC50/IC50
Ratio CIC95/IC50
Ratio ED50/MIC50
Ratio IC50
Ratio LC50/IC50
Ratio LD50/IC50
Ratio pIC50
Ratio plasma concentration/IC50
Relative ET-A IC50
Relative IC50
TBPS IC50
TC50/IC50
Time above IC50
fIC50
log1/IC50
logIC50
pIC50
pMIC50
rIC50

The "p" prefix, as in "pIC50", is shorthand for "-log", so "-Log IC50", "Log 1/IC50", and "pIC50" are almost certainly the same units. Let's see:

sqlite>
...>
select distinct published_type, standard_type from activities
where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");

-Log IC50|IC50 -Log IC50|pIC50 -Log IC50|-Log IC50 Log 1/IC50|IC50 Log 1/IC50|Log 1/IC50 pIC50|IC50 pIC50|pIC50 pIC50|Log IC50 pIC50|-Log IC50 Well color me confused. Oh! There's a "standard_flag", which "[s]hows whether the standardised columns have been curated/set (1) or just default to the published data (0)." Perhaps that will help enlighten me.

sqlite>
...>
select distinct published_type, standard_flag, standard_type from activities
where published_type in ("-Log IC50", "Log 1/IC50", "pIC50");

-Log IC50|1|IC50 -Log IC50|1|pIC50 -Log IC50|0|-Log IC50 Log 1/IC50|1|IC50 Log 1/IC50|0|Log 1/IC50 pIC50|1|IC50 pIC50|1|pIC50 pIC50|0|Log IC50 pIC50|1|-Log IC50 pIC50|0|pIC50 Nope, I still don't understand what's going on. I'll assume it's all tied to the complexities of data curation. For now, I'll assume that the data set is nice and clean.

IC50 types

Let's look at the "IC50" values only. How do the "published_type" and "standard_type" columns compare to each other?

sqlite>
...>
select published_type, standard_type, count(*) from activities
where published_type = "IC50" group by standard_type;

IC50|% Max Response|21 IC50|Change|2 IC50|Control|1 IC50|Electrophysiological activity|6 IC50|Fold Inc IC50|1 IC50|Fold change IC50|12 IC50|IC50|1526202 IC50|IC50 ratio|1 IC50|Inhibition|12 IC50|Log IC50|20 IC50|Ratio IC50|40 IC50|SI|4 IC50|T/C|1

sqlite>
...>
select published_type, standard_type, count(*) from activities
where standard_type = "IC50" group by published_type;

-Log IC50|IC50|1736 -Log IC50(M)|IC50|28 -Log IC50(nM)|IC50|39 -logIC50|IC50|84 3.3|IC50|1 Absolute IC50 (CHOP)|IC50|940 Absolute IC50 (XBP1)|IC50|940 Average IC50|IC50|34 CIC50|IC50|6 I 50|IC50|202 I-50|IC50|25 I50|IC50|6059 IC50|IC50|1526202 IC50 |IC50|52 IC50 app|IC50|39 IC50 max|IC50|90 IC50 ratio|IC50|2 IC50(app)|IC50|457 IC50_Mean|IC50|12272 IC50_uM|IC50|20 ID50|IC50|3 Log 1/I50|IC50|280 Log 1/IC50|IC50|589 Log 1/IC50(nM)|IC50|88 Log IC50|IC50|7013 Log IC50(M)|IC50|3599 Log IC50(nM)|IC50|77 Log IC50(uM)|IC50|28 Mean IC50|IC50|1 NA|IC50|5 NT|IC50|20 log(1/IC50)|IC50|1016 pI50|IC50|2386 pIC50|IC50|43031 pIC50(mM)|IC50|71 pIC50(nM)|IC50|107 pIC50(uM)|IC50|419 Yeah, I'm going to throw my hands up here, declare "I'm a programmer, Jim, not a bioactivity specialist", and simply use the published_type of IC50.

IC50 activity values

How are the IC50 values measured? Here too I need to choose between "published_units" and "standard_units". A quick look at the two shows that the standard_units are less diverse.

sqlite>
...>
select standard_units, count(*) from activities where published_type = "IC50"
group by standard_units;

|167556 %|148 % conc|70 10'-11uM|1 10'-4umol/L|1 M ml-1|15 equiv|64 fg ml-1|1 g/ha|40 kJ m-2|20 liposomes ml-1|5 mMequiv|38 mg kg-1|248 mg.min/m3|4 mg/kg/day|1 milliequivalent|22 min|9 ml|3 mmol/Kg|10 mol|6 molar ratio|198 nA|6 nM|1296169 nM g-1|1 nM kg-1|4 nM unit-1|7 nmol/Kg|1 nmol/mg|5 nmol/min|1 ppm|208 ppm g dm^-3|7 uL|7 uM hr|1 uM tube-1|9 uM well-1|52 uM-1|25 uM-1 s-1|1 ucm|6 ucm s-1|2 ug|168 ug cm-2|1 ug g-1|2 ug well-1|12 ug.mL-1|61139 ug/g|16 umol kg-1|3 umol.kg-1|8 umol/dm3|2 "Less diverse", but still diverse. By far the most common is "nM for "nanomolar", which is the only unit I expected. How many IC50s have an activities better than 1 micromolar, which is 1000 nM?

sqlite>
...>
select count(*) from activities where published_type = "IC50"
and standard_value < 1000 and standard_units = "nM";

483041 That's fully 483041/1212831 = 40% of the assays in the data dump.

How many of the IC50s are in humans? For that I need a join with the assays table using the assay_id:

sqlite>
...>
...>
...>
...>
select count(*) from activities, assays
where published_type = "IC50"
and standard_value < 1000 and standard_units = "nM"
and activities.assay_id = assays.assay_id
and assay_tax_id = 9606;

240916 About 1/2 of them are in humans.

Assay target type from target_dictionary

What are the possible assay targets in humans? That information is stored in the target_dictionary:

sqlite> select distinct target_type from target_dictionary where tax_id = 9606;
SINGLE PROTEIN
ORGANISM
CELL-LINE
UNKNOWN
PROTEIN COMPLEX
SUBCELLULAR
TISSUE
NUCLEIC-ACID
PROTEIN FAMILY
PROTEIN-PROTEIN INTERACTION
PROTEIN COMPLEX GROUP
SELECTIVITY GROUP
CHIMERIC PROTEIN
MACROMOLECULE
SMALL MOLECULE
OLIGOSACCHARIDE

Remember earlier when I threw caution to the wind? How many of the assays are actually against human targets? I can join on the target id "tid" to compare the taxon id in the target vs. the taxon id in the assay:

sqlite>
...>
...>
select count(*) from assays, target_dictionary
where assays.tid = target_dictionary.tid
and target_dictionary.tax_id = 9606;

301810

sqlite>
...>
...>
select count(*) from assays, target_dictionary
where assays.tid = target_dictionary.tid
and assays.assay_tax_id = 9606;

291152

Compare assay organisms with target organism

What are some of the non-human assay organisms where the target is humans?

sqlite>
...>
...>
...>
...>
select distinct assay_organism from assays, target_dictionary
where assays.tid = target_dictionary.tid
and assays.assay_tax_id != 9606
and target_dictionary.tax_id = 9606
limit 10;

rice Saccharomyces cerevisiae Oryza sativa Rattus norvegicus Sus scrofa Cavia porcellus Oryctolagus cuniculus Canis lupus familiaris Proteus vulgaris Salmonella enterica subsp. enterica serovar Typhi

Compounds tested against a target name

I'm interested in the "SINGLE PROTEIN" target names in humans. The target name is a manually curated field.

sqlite> select distinct pref_name from target_dictionary where tax_id = 9606 limit 5;
Maltase-glucoamylase
Sulfonylurea receptor 2
Phosphodiesterase 5A
Voltage-gated T-type calcium channel alpha-1H subunit
Dihydrofolate reductase

What are structures used in "Dihydrofolate reductase" assays? This requires three table joins, one on 'tid' to go from target_dictionary to assays, another on 'assay_id' to get to the activity, and another on 'molregno' to go from assay to molecule_dictionary so I can get the compound's chembl_id. (To make it more interesting, three of the tables have a chembl_id column.)

sqlite>
...>
...>
...>
...>
...>
...>
select distinct molecule_dictionary.chembl_id
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = "Dihydrofolate reductase"
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.molregno = molecule_dictionary.molregno
limit 10;

CHEMBL1679 CHEMBL429694 CHEMBL106699 CHEMBL422095 CHEMBL1161155 CHEMBL350033 CHEMBL34259 CHEMBL56282 CHEMBL173175 CHEMBL173901

sqlite> select count(distinct molecule_dictionary.chembl_id)
   ...> from target_dictionary, assays, activities, molecule_dictionary
   ...> where target_dictionary.pref_name = "Dihydrofolate reductase"
   ...> and target_dictionary.tid = assays.tid
   ...> and assays.assay_id = activities.assay_id
   ...> and activities.molregno = molecule_dictionary.molregno;
3466

There are 3466 of these, including non-human assays. I'll limit it to human ones only:

sqlite> select count(distinct molecule_dictionary.chembl_id)
   ...> from target_dictionary, assays, activities, molecule_dictionary
   ...> where target_dictionary.pref_name = "Dihydrofolate reductase"
   ...> and target_dictionary.tax_id = 9606
   ...> and target_dictionary.tid = assays.tid
   ...> and assays.assay_id = activities.assay_id
   ...> and activities.molregno = molecule_dictionary.molregno;
1386

I'll further limit it to those with an IC50 of under 1 micromolar:

sqlite> .timer on
sqlite> select count(distinct molecule_dictionary.chembl_id)
   ...> from target_dictionary, assays, activities, molecule_dictionary
   ...> where target_dictionary.pref_name = "Dihydrofolate reductase"
   ...> and target_dictionary.tax_id = 9606
   ...> and target_dictionary.tid = assays.tid
   ...> and assays.assay_id = activities.assay_id
   ...> and activities.published_type = "IC50"
   ...> and activities.standard_units = "nM"
   ...> and activities.standard_value < 1000
   ...> and activities.molregno = molecule_dictionary.molregno;
255
Run Time: real 174.561 user 18.073715 sys 23.285346

I turned on the timer to show that the query took about 3 minutes! I repeated it to ensure that it wasn't a simple cache issue. Still about 3 minutes.

ANALYZE the tables

The earlier query, without the activity filter, took 5.7 seconds when the data wasn't cached, and 0.017 seconds when cached. It found 1386 matches. The new query takes almost 3 minutes more to filter those 1386 matches down to 255. That should not happen.

This is a strong indication that the query planner used the wrong plan. I've had this happen before. My solution then was to "ANALYZE" the tables, which "gathers statistics about tables and indices and stores the collected information in internal tables of the database where the query optimizer can access the information and use it to help make better query planning choices."
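
One way to see what SQLite is up to is to ask it for its plan with EXPLAIN QUERY PLAN. Here is a rough sketch (not from the post) using Python's built-in sqlite3 module, assuming the same chembl_21.db file as above; each output row describes one step of the plan, including which table is scanned and whether an index is used:

# A rough sketch: ask SQLite how it plans to run the slow query.
# Assumes the chembl_21.db file used throughout this post.
import sqlite3

SLOW_QUERY = """
select count(distinct molecule_dictionary.chembl_id)
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = 'Dihydrofolate reductase'
and target_dictionary.tax_id = 9606
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.published_type = 'IC50'
and activities.standard_units = 'nM'
and activities.standard_value < 1000
and activities.molregno = molecule_dictionary.molregno
"""

conn = sqlite3.connect("chembl_21.db")
for row in conn.execute("EXPLAIN QUERY PLAN " + SLOW_QUERY):
    print(row)  # one row per plan step: which table is scanned, which index is used
conn.close()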

ANALYZE can take a while, so I limited it to the tables of interest.

sqlite> analyze target_dictionary;
Run Time: real 0.212 user 0.024173 sys 0.016268
sqlite> analyze assays;
Run Time: real 248.184 user 5.890109 sys 4.793236
sqlite> analyze activities;
Run Time: real 6742.390 user 97.862790 sys 129.854073
sqlite> analyze molecule_dictionary;
Run Time: real 33.879 user 2.195662 sys 2.043848

Yes, it took almost 2 hours to analyze the activities table. But it was worth it from a pure performance point of view. I ran the above query twice, with this pattern:

% sudo purge  # clear the filesystem cache
% sqlite3 chembl_21.db  # start SQLite
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> .... previous query, with filter for IC50 < 1uM ...
255
Run Time: real 8.595 user 0.038847 sys 0.141945
sqlite> .... repeat query using a warm cache
255
Run Time: real 0.009 user 0.005255 sys 0.003653

Nice! Now I only need to do about 60 such queries to justify the overall analysis time.
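
Since the point is to run many variations of this query, it may be easier to drive them from Python than to retype them in the shell. A minimal sketch, again assuming the same chembl_21.db file; the target name and the activity cutoff are passed as query parameters so the same statement can be reused for other targets:

# A rough sketch: run the IC50 count from Python with parameters.
# Assumes the chembl_21.db file used throughout this post.
import sqlite3
import time

COUNT_QUERY = """
select count(distinct molecule_dictionary.chembl_id)
from target_dictionary, assays, activities, molecule_dictionary
where target_dictionary.pref_name = ?
and target_dictionary.tax_id = 9606
and target_dictionary.tid = assays.tid
and assays.assay_id = activities.assay_id
and activities.published_type = 'IC50'
and activities.standard_units = 'nM'
and activities.standard_value < ?
and activities.molregno = molecule_dictionary.molregno
"""

conn = sqlite3.connect("chembl_21.db")
t0 = time.time()
(count,) = conn.execute(COUNT_QUERY, ("Dihydrofolate reductase", 1000)).fetchone()
print(count, "compounds in", round(time.time() - t0, 2), "seconds")
conn.close()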

24 Mar 2017 12:00pm GMT

EuroPython: EuroPython 2017: Get ready for EuroPython Call for Proposals

Thinking of giving your contribution to EuroPython? Starting from March 27th you can submit a proposal on every aspect of Python: programming from novice to advanced levels, applications and frameworks, or how you have been involved in introducing Python into your organization.

We offer a variety of different contribution formats that you can present at EuroPython: from regular talks to panel discussions, from trainings to posters; if you have ideas to promote real-time human-to-human-interaction or want to run yourself a helpdesk to answer other people's python questions, this is your chance.

Read our different opportunities on our website https://ep2017.europython.eu/en/speakers/call-for-proposals/ and start drafting your ideas. Call for Proposals opens in just 3 days!

image

Enjoy,

EuroPython 2017 Team

https://ep2017.europython.eu/

EuroPython Society

24 Mar 2017 9:38am GMT

Gocept Weblog: Sprinting to push Zope to the Python 3 wonderland

Earlier this year there was a sprint in Innsbruck, Austria. We made progress in porting Zope to Python 3 by working on RestrictedPython. After this sprint RestrictedPython no longer seems to be a blocker to port the parts of Zope which rely on RestrictedPython to Python 3.

See the full sprint report on the plone.org website.

We will work further on pushing Zope towards the Python 3 wonderland on the Zope 2 Resurrection Sprint in Halle/Saale, Germany at gocept in the first week of May 2017. You are welcome to join us on site or remote.

Photo copyright: Christine Baumgartner


24 Mar 2017 9:37am GMT

Catalin George Festila: Take weather data with pyowm from openweathermap .

This tutorial shows you how to download and install the pyowm python module.
One of the great things about this python module is that it lets you fetch data from the openweathermap website (you need to have an account).
PyOWM runs on Python 2.7 and Python 3.2+, and integrates with Django 1.10+ models.
All documentation can be found here.

The install is simple with pip, python 2.7 and Fedora 25.

 
[root@localhost mythcat]# pip install pyowm
Collecting pyowm
Downloading pyowm-2.6.1.tar.gz (3.6MB)
100% |████████████████████████████████| 3.7MB 388kB/s
Building wheels for collected packages: pyowm
Running setup.py bdist_wheel for pyowm ... done
Stored in directory: /root/.cache/pip/wheels/9a/91/17/bb120c765f08df77645cf70a16aa372d5a297f4ae2be749e81
Successfully built pyowm
Installing collected packages: pyowm
Successfully installed pyowm-2.6.1

The source code is very simple: just connect with the API key and print the data.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pyowm

print " Have a account to openweathermap.org and use with api key free or pro"
print " owm = pyowm.OWM(API_key='your-API-key', subscription_type='pro')"

owm = pyowm.OWM("327407589df060c7f825b63ec1d9a096")
forecast = owm.daily_forecast("Falticeni,ro")
tomorrow = pyowm.timeutils.tomorrow()
forecast.will_be_sunny_at(tomorrow)

observation = owm.weather_at_place('Falticeni,ro')
w = observation.get_weather()
print (w)
print " Weather details"
print " =============== "

print " Get cloud coverage"
print w.get_clouds()
print " ----------------"
print " Get rain volume"
print w.get_rain()
print " ----------------"
print " Get snow volume"
print w.get_snow()

print " Get wind degree and speed"
print w.get_wind()
print " ----------------"
print " Get humidity percentage"
print w.get_humidity()
print " ----------------"
print " Get atmospheric pressure"
print w.get_pressure()
print " ----------------"
print " Get temperature in Kelvin degs"
print w.get_temperature()
print " ----------------"
print " Get temperature in Celsius degs"
print w.get_temperature(unit='celsius')
print " ----------------"
print " Get temperature in Fahrenheit degs"
print w.get_temperature('fahrenheit')
print " ----------------"
print " Get weather short status"
print w.get_status()
print " ----------------"
print " Get detailed weather status"
print w.get_detailed_status()
print " ----------------"
print " Get OWM weather condition code"
print w.get_weather_code()
print " ----------------"
print " Get weather-related icon name"
print w.get_weather_icon_name()
print " ----------------"
print " Sunrise time (ISO 8601)"
print w.get_sunrise_time('iso')
print " Sunrise time (GMT UNIXtime)"
print w.get_sunrise_time()
print " ----------------"
print " Sunset time (ISO 8601)"
print w.get_sunset_time('iso')
print " Sunset time (GMT UNIXtime)"
print w.get_sunset_time()
print " ----------------"
print " Search current weather observations in the surroundings of"
print " Latitude and longitude coordinates for Fălticeni, Romania:"
observation_list = owm.weather_around_coords(47.46, 26.30)


Let's see the result of running the python script for one location:

 
[root@localhost mythcat]# python openweather.py 
Have a account to openweathermap.org and use with api key free or pro
owm = pyowm.OWM(API_key='your-API-key', subscription_type='pro')

Weather details
===============
Get cloud coverage
20
----------------
Get rain volume
{}
----------------
Get snow volume
{}
Get wind degree and speed
{u'speed': 5.7, u'deg': 340}
----------------
Get humidity percentage
82
----------------
Get atmospheric pressure
{'press': 1021, 'sea_level': None}
----------------
Get temperature in Kelvin degs
{'temp_max': 287.15, 'temp_kf': None, 'temp': 287.15, 'temp_min': 287.15}
----------------
Get temperature in Celsius degs
{'temp_max': 14.0, 'temp_kf': None, 'temp': 14.0, 'temp_min': 14.0}
----------------
Get temperature in Fahrenheit degs
{'temp_max': 57.2, 'temp_kf': None, 'temp': 57.2, 'temp_min': 57.2}
----------------
Get weather short status
Clouds
----------------
Get detailed weather status
few clouds
----------------
Get OWM weather condition code
801
----------------
Get weather-related icon name
02d
----------------
Sunrise time (ISO 8601)
2017-03-24 04:08:33+00
Sunrise time (GMT UNIXtime)
1490328513
----------------
Sunset time (ISO 8601)
2017-03-24 16:33:59+00
Sunset time (GMT UNIXtime)
1490373239
----------------
Search current weather observations in the surroundings of
Latitude and longitude coordinates for Fălticeni, Romania:

24 Mar 2017 6:53am GMT

Vasudev Ram: Analysing that Python code snippet

By Vasudev Ram

Hi readers,

Some days ago I had written this post:

Analyse this Python code snippet

in which I had shown a snippet of Python code (run in the Python shell), and said:

"Analyse the snippet of Python code below. See what you make of it. I will discuss it in my next post."

I am a few days late in discussing it; sorry about that.

Here is the analysis:

First, here's the snippet again, for reference:

>>> a = 1
>>> lis = [a, 2 ]
>>> lis
[1, 2]
>>> lis = [a, 2 ,
... "abc", False ]
>>>
>>> lis
[1, 2, 'abc', False]
>>> a
1
>>> b = 3
>>> lis
[1, 2, 'abc', False]
>>> a = b
>>> a
3
>>> lis
[1, 2, 'abc', False]
>>> lis = [a, 2 ]
>>> lis
[3, 2]
>>>


The potential for confusion (at least, as I said, for newbie Pythonistas) lies in these apparent points:

The variable a is set to 1.
Then it is put into the list lis, along with the constant 2.
Then lis is changed to be [a, 2, "abc", False].
One might now think that the variable a is stored in the list lis.
The next line prints its value, which shows it is 1.
All fine so far.
Then b is set to 3.
Then a is set to b, i.e. to the value of b.
So now a is 3.
But when we print lis again, it still shows 1 for the first item, not 3, as some might expect (since a is now set to 3).
Only when we run the next line:
lis = [a, 2]
and then print lis again, do we see that the first item in lis is now 3.

This has to do with the concept of naming and binding in Python.

When a Python statement like:

a = 1

is run, naming and binding happens. The name on the left is first created, and then bound to the (value of the) object on the right of the equals sign (the assignment operator). The value can be any expression, which, when evaluated, results in a value (a Python object [1]) of some kind. In this case it is the int object with value 1.

[1] Almost everything in Python is an object, like almost everything in Unix is a file. [Conditions apply :)]

When that name, a, is used in an expression, Python looks up the value of the object that the name is bound to, and uses that value in the expression, in place of the name.

So when the name a was used inside any of the lists that were bound to the name lis, it was actually the value bound to the name a that was used instead. So, the first time it was 1, so the first item of the list became 1, and stayed as 1 until another binding of some other (list) object to the name lis was done.

But by this time, the name a had been rebound to another object, the int 3, the same one that name b had been earlier bound to just before. So the next time that the name lis was bound to a list, that list now included the value of the current object that name a was now bound to, which was 3.

This is the reason why the code snippet works as it does.
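
To make the binding behaviour concrete, here is a small illustrative snippet (not from the original post) that checks object identity with id():

a = 1
lis = [a, 2]
print(id(lis[0]) == id(a))  # True: the list holds the very object the name a is bound to
a = 3                       # rebinding the name a does not touch the object inside lis
print(lis)                  # [1, 2]
lis = [a, 2]                # only a fresh binding of lis picks up the new value of a
print(lis)                  # [3, 2]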

On a related note (also about Python language features, syntax and semantics), I was playing around with the pprint module (Python's pretty-printer) and the Python is operator, and came up with this other snippet:

>>> import pprint
>>> lis = []
>>> for i in range(10):
...     lis.append(lis)
...
>>> print lis
[[...], [...], [...], [...], [...], [...], [...], [...], [...], [...]]

>>> pprint.pprint(lis)
[<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>,
<recursion on list with id=32809968>]

>>> len(lis)
10

>>> lis is lis[0]
True

>>> lis is lis[0] is lis[0][0]
True

>>> lis is lis[0] is lis[0][0] is lis[0][0][0]
True

in which I created a list, appended it to itself, and then used pprint.pprint on it. Also used the Python is operator between the list and its 0th item, recursively, and was interested to see that the is operator can be used in a chain. I need to look that up (pun intended).
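
For what it's worth, is chains because it is a comparison operator, and Python evaluates a chained comparison x op y op z as (x op y) and (y op z), with y evaluated only once. A small illustrative sketch of the equivalence:

lis = []
lis.append(lis)
# The chained form "lis is lis[0] is lis[0][0]" evaluates as
# "(lis is lis[0]) and (lis[0] is lis[0][0])", with lis[0] evaluated only once.
print(lis is lis[0] is lis[0][0])                 # True
print((lis is lis[0]) and (lis[0] is lis[0][0]))  # True - the expanded form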

Enjoy.

- Vasudev Ram - Online Python training and consulting. Get updates (via Gumroad) on my forthcoming apps and content. Jump to posts: Python * DLang * xtopdf. Subscribe to my blog by email. My ActiveState Code recipes. Follow me on: LinkedIn * Twitter. Are you a blogger with some traffic? Get Convertkit: Email marketing for professional bloggers




Vasudev Ram

24 Mar 2017 12:49am GMT

Thomas Guest: From bytes to strings in Python and back again

Low level languages like C have little opinion about what goes in a string, which is simply a null-terminated sequence of bytes. Those bytes could be ASCII or UTF-8 encoded text, or they could be raw data - object code, for example. It's quite possible and legal to have a C string with mixed content.

char const * mixed =
    "EURO SIGN "          // ASCII
    "UTF-8 \xE2\x82\xAC " // UTF-8 encoded EURO SIGN
    "Latin-9 \xA4";       // Latin-9 encoded EURO SIGN

This might seem indisciplined and risky but it can be useful. Environment variables are notionally text but actually C strings, for example, meaning they can hold whatever data you want. Similarly filenames and command line parameters are only loosely text.

A higher level language like Python makes a strict distinction between bytes and strings. Bytes objects contain raw data - a sequence of octets - whereas strings are Unicode sequences. Conversion between the two types is explicit: you encode a string to get bytes, specifying an encoding (which defaults to UTF-8); and you decode bytes to get a string. Clients of these functions should be aware that such conversions may fail, and should consider how failures are handled.

Simply put, a string in Python is a valid Unicode sequence. Real world text data may not be. Programmers need to take charge of reconciling any discrepancies.

We faced such problems recently at work. We're in the business of extracting meaning from clinical narratives - text data stored on medical records systems in hospitals, for example. These documents may well have passed through a variety of systems. They may be unclear about their text encoding. They may not be encoded as they claim. So what? They can and do contain abbreviations, mispellings, jargon and colloquialisms. Refining the signal from such noise is our core business: if we can correctly interpret positional and temporal aspects of a sentence such as:

Previous fracture of left neck of femur

then we can surely deal with text which claims to be UTF-8 encoded but isn't really.

Our application stack is server-based: a REST API to a Python application handles document ingest; lower down, a C++ engine does the actual document processing. The problem we faced was supporting a modern API capable of handling real world data.

It's both undesirable and unnecessary to require clients to clean their text before submitting it. We want to make the ingest direct and idiomatic. Also, we shouldn't penalise clients whose data is clean. Thus document upload is an HTTP POST request, and the document content is a JSON string - rather than, say, base64 encoded binary data. Our server, however, will be permissive about the contents of this string.

So far so good. Postel's prescription advises:

Be liberal in what you accept, and conservative in what you send.

This would suggest accepting messy text data but presenting it in a cleaned up form. In our case, we do normalise the input data - a process which includes detecting and standardising date/time information, expanding abbreviations, fixing typos and so on - but this normalised form links back to a faithful copy of the original data. What gets presented to the user is their own text annotated with our findings. That is, we subscribe to a more primitive prescription than Postel's:

Garbage in, garbage out

with the caveat that the garbage shouldn't be damaged in transit.

Happily, there is a simple way to pass dodgy strings through Python. It's used in the standard library to handle text data which isn't guaranteed to be clean - those environment variables, command line parameters, and filenames for example.

The surrogateescape error handler smuggles non-decodable bytes into the (Unicode) Python string in such a way that the original bytes can be recovered on encode, as described in PEP 383:

On POSIX systems, Python currently applies the locale's encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF.

This workaround is possible because Unicode surrogates are intended for use in pairs. Quoting the Unicode specification, they "have no interpretation on their own". The lone trailing surrogate code - the half-a-pair - can only be the result of a surrogateescape error handler being invoked, and the original bytes can be recovered by using the same error handler on encode.
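
As a concrete illustration (not from the original article), the convenience wrappers os.fsencode and os.fsdecode apply this same error handler to filenames on POSIX systems, which is how the standard library round-trips dodgy filenames:

import os

# Round-trip a filename containing a byte that is not valid UTF-8.
# On POSIX with a UTF-8 locale, os.fsdecode applies the surrogateescape handler.
raw = b"report-\xa4.txt"
name = os.fsdecode(raw)
print(repr(name))                # e.g. 'report-\udca4.txt'
print(os.fsencode(name) == raw)  # True: the original bytes come back unchanged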

In conclusion, text data is handled differently in C++ and Python, posing a problem for layered applications. The surrogateescape error handler provides a standard and robust way of closing the gap.

Unicode Surrogate Pairs

Surrogates

Code Listing

>>> mixed = b"EURO SIGN \xE2\x82\xAC \xA4"
>>> mixed
b'EURO SIGN \xe2\x82\xac \xa4'
>>> mixed.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 14:
  invalid start byte
>>> help(mixed.decode)
Help on built-in function decode:

decode(encoding='utf-8', errors='strict') method of builtins.bytes instance
    Decode the bytes using the codec registered for encoding.

encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.

>>> mixed.decode(errors='surrogateescape')
'EURO SIGN € \udca4'
>>> s = mixed.decode(errors='surrogateescape')
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udca4' in position 12:
  surrogates not allowed
>>> s.encode(errors='surrogateescape')
b'EURO SIGN \xe2\x82\xac \xa4'

24 Mar 2017 12:00am GMT

23 Mar 2017

Planet Python

Carl Chenet: Feed2tweet 1.0, tool to post RSS feeds to Twitter, released

Feed2tweet 1.0, a self-hosted Python app to automatically post RSS feeds to the Twitter social network, was released on March 23rd, 2017.

The main new feature of this release lets you create filters for each RSS feed; previously you could only define global filters. Feed2tweet is also able to use syslog starting from this release, a feature contributed by Antoine Beaupré.

What's the purpose of Feed2tweet?

Some online services offer to convert your RSS entries into Twitter posts. These services are usually unreliable, slow, and don't respect your privacy. Feed2tweet is a self-hosted Python app; the source code is easy to read, and the official documentation is available online with lots of examples.

Twitter Out Of The Browser

Have a look at my Github account for my other Twitter automation tools:

What about you? Do you use tools to automate the management of your Twitter account? Feel free to give me feedback in the comments below.

… and finally

You can help Feed2tweet by donating anything through Liberapay (also possible with cryptocurrencies). That's a big motivation factor 😉

23 Mar 2017 11:00pm GMT

NumFOCUS: PyData Atlanta Meetup Celebrates 1 Year and over 1,000 members

Picture
PyData Atlanta holds a meetup at MailChimp, where Jim Crozier spoke about analyzing NFL data with PySpark.

Atlanta tells a new story about data

by Rob Clewley

In late 2015, the three of us (Tony Fast, Neel Shivdasani, and myself) had been regularly nerding out about data over beers and becoming fast friends. We were eager to see Atlanta's data community shift towards being more welcoming and encouraging to beginners, self-starters, and generalists. We were about to find out that we were not alone.

We had met at local data science-related events earlier in the year and had discovered that we had lots of opinions-and weren't afraid to advocate for them. But we also found that we listened to reason (data-driven learning!), appreciated the art in doing good science, and cared about people and the community. Open science, open data, free-and-open-source software, and creative forms of technical communication and learning were all recurring themes in our conversations. We also all agreed that Python is a great language for working with data.

Invitations were extended to like-minded friends, and the informal hangout was soon known as "Data Beers". The consistent good buzz that Data Beers generated helped us realize an opportunity to contribute more widely to the Atlanta community. At the time, Atlanta was beginning its emergence as a new hub in the tech world and startup culture.

Some of the existing data-oriented meetups around Atlanta have a more formal business atmosphere, or are highly focused on specific tools or tech opinions. Such environments seem to intimidate newcomers and those less formally educated in math or computer science. This inspired us to take a new perspective through an informal and eclectic approach. So, in January 2016, with the support of not-for-profit organization NumFOCUS, we set up the Atlanta chapter of PyData.

The mission of NumFOCUS is to promote sustainable high-level programming languages, open code development, and reproducible scientific research. NumFOCUS sponsors PyData conferences and local meetups internationally. The PyData community gathers to discuss how best to apply tools using Python, R, Stan, and Julia to meet evolving challenges in data management, processing, analytics, and visualization. In all, PyData has over 28,000 members across 52 international meetups. The Python language and the data-focused ecosystem that has grown around it have been remarkably successful in attracting an inclusive mindset centered around free and open-source software and science. Our Atlanta chapter aims to be even more neutral about specific technologies so long as the underlying spirit resonates with our mission.

The three of us, with the help of friend and colleague Lizzy Rolando, began sourcing great speakers who have a distinctive approach to using data that resonated with the local tech culture. We hosted our first meetup in early April. From the beginning, we encouraged a do-it-yourself, interactive vibe to our meetings, supporting shorter-format 30 minute presentations with 20 minute question and answer sessions.

Regardless of the technical focus, we try to bring in speakers who are applying their data-driven work to something of general interest. Our programming balances technical and more qualitative talks. Our meetings have covered a diverse range of applications, addressing computer literacy and education, human rights, neuroscience, journalism, and civics.

A crowd favorite is the inclusion of 3-4 audience-submitted lightning talks at the end of the main Q&A. The strictly five-minute talks add more energy to the mix and give a wider platform to the local community. They're an opportunity to practice presentation skills for students, generate conversations around projects needing collaborators, discussions about new tools, or just have fun looking at interesting data sets.

Students, career changers, and professionals have come together as members of PyData to learn and share. Our network has generated new friends, collaborators, and even new jobs. Local organizations that share our community spirit provide generous sponsorship and refreshments for our meetings.

We believe we were in the right place at the right time to meet a need. It's evident in the positive response and rapid growth we've seen, having acquired over 1,000 members in one year and hosted over 120 attendees at our last event. It has been a whirlwind experience, and we are delighted that our community has shared our spirit and become involved with us so strongly. Here's to healthy, productive, data-driven outcomes for all of us in 2017!

23 Mar 2017 9:23pm GMT

Reinout van Rees: Fossgis: open source for emergencies - Marco Lechner

(One of my summaries of a talk at the 2017 fossgis conference).

He works for the Bundesamt fuer Strahlenschutz, basically the government agency that was started after Chernobyl to protect against and to measure radioactivity. The software system they use/build is called IMIS.

IMIS consists of three parts:

  • Measurements (automatic + mobile measurements + laboratory results).
  • Prediction system. Including documentation (managed in Plone, a python CMS system).
  • Decision support. Help support the government layers that have to make the decisions.

They have a simple map at odlinfo.bfs.de.

The current core of the system is proprietary. They are dependent on one single firm. The system is heavily customized for their usage.

They need a new system because geographical analysis keeps getting more important and because there are new requirements coming out of the government. The current program cannot handle that.

What they want is a new system that is as simple as possible; that uses standards for geographical exchange; they don't want to be dependent on a single firm anymore. So:

  • Use open standards, so OGC. But also a specific world-wide nuclear info protocol.
  • Use existing open source software. OSGEO.
  • If we need something special, can we change/extend existing open source software?
  • If not, then it is OK to create their own software, under an open source license.

They use open source companies to help them, including training their employees and helping those employees get used to modern software development (jenkins, docker, etc.)

If you use an open source strategy, what do you need to do to make it fair?

  • Your own developments should also be open source!
  • You need your own test and build infrastructure. (For instance Jenkins)
  • You need to make it easy to start working with what you made: documentation, docker, buildout (!), etc.

(Personal note: I didn't expect to hear 'buildout' at this open source GIS conference. I've helped quite a bit with that particular piece of python software :-) )

23 Mar 2017 2:28pm GMT

PyBites: Module of the Week - ipaddress

While playing around with code for our post on generators we discovered the ipaddress module, part of the Standard Library. Such a handy little module!
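
For anyone who hasn't tried it yet, here is a tiny illustrative sketch of what the module offers; note that hosts() returns its results lazily, which ties in nicely with the generators post mentioned above:

import ipaddress

net = ipaddress.ip_network("192.168.0.0/29")
print(net.num_addresses)                           # 8 addresses in the block
print(list(net.hosts()))                           # the 6 usable host addresses
print(ipaddress.ip_address("192.168.0.5") in net)  # True: membership tests just work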

23 Mar 2017 10:30am GMT

Reinout van Rees: Fossgis: sewer cadastre with qgis - jörg Höttges

(One of my summaries of a talk at the 2017 fossgis conference).

With engineer firms from the Aachen region they created qkan. Qkan is:

  • A data structure.
  • Plugins for Qgis.
  • Direct access. Not a specific application with restricted access, but unrestricted access from within Qgis. (He noticed lots of interest among the engineers to learn qgis during the project!)

It has been designed for the needs of the engineers that have to work with the data. You first import the data from the local sewer database. Qkan converts the data to what it needs. Then you can do simulations in a separate package. The results of the simulation will be visualized by Qkan in qgis. Afterwards you probably have to make some corrections to the data and give corrections back to the original database. Often you have to go look at the actual sewers to make sure the database is correct. Output is often a map with the sewer system.

Some functionality: import sewer data (in various formats). Simulate water levels. Draw graphs of the water levels in a sewer. Support database-level check ("an end node cannot occur halfway a sewer").

They took care to make the database schema simple. The source sewer database is always very complex because it has to hold lots of metadata. The engineer that has to work with it needs a much simpler schema in order to be productive. Qkan does this.

They used qgis, spatialite, postgis, python and qt (for forms). An important note: they used as much postgis functionality as possible instead of the geographical functions from qgis, the reason being that postgis (and even spatialite) is often much quicker.

With qgis, python and the "qt designer", you can make lots of handy forms. But you can always go back to the database that's underneath it.

The code is at https://github.com/hoettges

23 Mar 2017 10:24am GMT

10 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: King William's Town Station

Yesterday morning I had to go to the station in KWT to pick up the bus tickets we had reserved for the Christmas holidays in Cape Town. The station itself has had no train service since December for cost reasons - but Translux and co, the long-distance bus companies, have their offices there.


© benste CC NC SA

10 Nov 2011 10:57am GMT

09 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein

Nobody is worried about something like this - you simply drive through by car, and in the city, near Gnobie: "no, it only gets dangerous once the fire brigade is there" - 30 minutes later, on the way back, the fire brigade was there.




© benste CC NC SA

09 Nov 2011 8:25pm GMT

08 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Brai Party

Brai = braai, a barbecue evening or something similar.

They would like to be technicians, patching up their SpeakOn / jack plug splitters...

The ladies ("mamas") of the settlement during the official opening speech

Even though fewer people showed up than expected: loud music and lots of people ...

And of course a fire with real wood for the barbecue.

© benste CC NC SA

08 Nov 2011 2:30pm GMT

07 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux Server back to her room. While doing that we saw her new decoration.

Björn and Simphiwe carried the PC to Katja's school


© benste CC NC SA

07 Nov 2011 2:00pm GMT

06 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's guest family for a special party. First of all we visited some friends of Nelisa - yes, the one I'm working with in Quigney, Katja's guest father's sister - who gave her a haircut.

African women usually get their hair done by arranging extensions, not by just cutting some hair like Europeans do.

In between she looked like this...

And then she was done - it looks amazing considering the amount of hair she had last week, doesn't it?

© benste CC NC SA

06 Nov 2011 7:45pm GMT

05 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: My Saturday

Somehow it occurred to me today that I need to restructure my blog posts a bit - if I only ever report on new places, I would have to be on a permanent round trip. So here are a few things from my everyday life today.

First of all: Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm - Katja and Björn are now at their placements, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza.
Robin, Rommel's wife, has been in Woodie Cape since Thursday to take care of a few things there.
Anyway, this morning we first treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist: Vodacom and Ethienne (the estate agent), plus dropping off the missing items at NeedsCamp on the way back.

Just after setting off on the dirt road, we realized that we had not packed the things for NeedsCamp and Ethienne, but did have the pump for the water supply in the car.

So in East London we first drove to Farmerama - no, not the online game Farmville, but a shop with all kinds of things for a farm - in Berea, a northern district.

At Farmerama we got advice on a quick-release coupling that should make life with the pump easier, and we also dropped off a lighter pump for repair, so that it is not such a big effort every time the water runs out again.

Fego Caffé is in the Hemingways Mall; there we had to get the PIN and PUK for one of our data SIM cards, because unfortunately two digits got swapped when entering the PIN. Anyway, the shops in South Africa apparently store data as sensitive as a PUK - which in principle gives access to a locked phone.

In the café, Rommel then carried out a few online transactions with the 3G modem, which was working again - and which, by the way, now works perfectly in Ubuntu, my Linux system.

On the side I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltop's centres. The picture shows the UMTS coverage in NeedsCamp, Katja's place. 8ta is a new phone provider from Telkom; after Vodafone bought Telkom's shares in Vodacom, they have to build everything up from scratch.
We decided to organize a free prepaid card to test, because who knows how accurate the coverage map above really is... Before you sign even the cheapest 24-month deal, you should know whether it works.

After that we went to Checkers in Vincent, looking for two hotplates for Woody Cape - R 129.00 each, i.e. about 12 € for a two-plate hotplate.
As you can see in the background, the Christmas decorations are already up - at the beginning of November, and that in South Africa in sunny, warm weather of at least 25 °C.

For lunch we treated ourselves to a Pakistani curry takeaway - highly recommended!
Well, and after we got back an hour or so ago, I cleaned the fridge, which I had simply put outside this morning to defrost. Now it is clean again and free of its 3 m thick layer of ice...

Tomorrow... I will report on that separately... but probably not until Monday, because then I will be back in Quigney (East London) and have free internet.

© benste CC NC SA

05 Nov 2011 4:33pm GMT

31 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltop's computer centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

Pupils in the big classroom


The Trainer


School in Countryside


Adult Class in the Afternoon


"Town"


© benste CC NC SA

31 Oct 2011 4:58pm GMT

Benedict Stein: Technical Issues

What do you do in an internet café if your ADSL and fax line have been cut off before the end of the month? Well, my idea was to sit outside and eat some ice cream.
At least it's sunny and not as rainy as on the weekend.


© benste CC NC SA

31 Oct 2011 3:11pm GMT

30 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nellis Restaurant

For those traveling through Zastron - there is a very nice restaurant which serves delicious food at reasonable prices.
In addition they sell homemade juices, jams and honey.




interior


home made specialities - the shop in the shop


the Bar


© benste CC NC SA

30 Oct 2011 4:47pm GMT

29 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: The way back from J'burg

On the 10-12 hour trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadsides.

Plain Street


The Orange River in its beginnings (near Lesotho)


Zastron Anglican Church


The Bridge in Between "Free State" and Eastern Cape next to Zastron


my new Background ;)


If you listen to Google Maps you'll end up traveling 50 km of gravel road - as it had just been renewed we didn't have that many problems and saved an hour compared to going the official way with all its construction sites.




Freeway


getting dark


© benste CC NC SA

29 Oct 2011 4:23pm GMT

28 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: How does a construction site actually work?

Sure, some things may be different and many the same - but a road construction site, an everyday sight in Germany - how does that actually work in South Africa?

First of all - NO, there are no natives digging with their hands - even though more manpower is used here, they are busy working with technology.

A perfectly normal main road


and how it is being widened


looots of trucks


because here one side is completely closed over a long stretch, resulting in a traffic light arrangement with, in this case, a 45-minute wait


But at least they seem to have fun ;) - and so did we, since luckily we never had to wait longer than 10 minutes.

© benste CC NC SA

28 Oct 2011 4:20pm GMT