## 17 Oct 2018

### Planet Python

#### The No Title® Tech Blog: Haiku R1/beta1 review - revisiting BeOS, 18 years after its latest official release

Having experimented with and used BeOS R5 Pro back in the early 2000s, when the company that created it was going under, I have followed the development of Haiku with some interest over the years. While one can argue that both the old BeOS and Haiku lack some features expected of a modern OS, a lightweight operating system can still be an excellent way to bring new life to old, or new but less powerful, hardware.

17 Oct 2018 7:30pm GMT

### Introduction

Python comes with a variety of built-in data structures, capable of storing different types of data. A Python dictionary is one such data structure that can store data in the form of key-value pairs. The values in a Python dictionary can be accessed using the keys. In this article, we will be discussing the Python dictionary in detail.

### Creating a Dictionary

To create a Python dictionary, we place a comma-separated sequence of items inside curly braces {}. Each item has a key and a value expressed as a "key:value" pair.

The values can be of any data type and may repeat, but the keys must be unique and hashable (for example strings, numbers, or tuples).

The following examples demonstrate how to create Python dictionaries:

Creating an empty dictionary:

dict_sample = {}



Creating a dictionary with integer keys:

dict_sample = {1: 'mango', 2: 'pawpaw'}



Creating a dictionary with mixed keys:

dict_sample = {'fruit': 'mango', 1: [4, 6, 8]}



We can also create a dictionary by explicitly calling Python's dict() constructor:

dict_sample = dict({1:'mango', 2:'pawpaw'})



A dictionary can also be created from a sequence of key-value pairs, as shown below:

dict_sample = dict([(1,'mango'), (2,'pawpaw')])



Dictionaries can also be nested, which means that we can create a dictionary inside another dictionary. For example:

dict_sample = {1: {'student1': 'Nicholas', 'student2': 'John', 'student3': 'Mercy'},
               2: {'course1': 'Computer Science', 'course2': 'Mathematics', 'course3': 'Accounting'}}



To print the dictionary contents, we can use Python's print() function and pass the dictionary as the argument. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
print(dict_sample)



Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2012}



### Accessing Elements

To access dictionary items, pass the key inside square brackets []. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
x = dict_sample["model"]
print(x)



Output:

Premio



We created a dictionary named dict_sample. A variable named x was then created and assigned the value stored under the key "model".

Here is another example:

# Avoid naming a variable "dict": it would shadow the built-in dict type.
student = {'Name': 'Mercy', 'Age': 23, 'Course': 'Accounting'}
print("Student Name:", student['Name'])
print("Course:", student['Course'])
print("Age:", student['Age'])



Output:

Student Name: Mercy
Course: Accounting
Age: 23
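
The same square-bracket syntax extends to nested dictionaries such as the one created earlier: each level of nesting takes one more pair of brackets. A minimal sketch, where the variable names first_student and second_course are made up for illustration:

```python
# Nested dictionary from the creation example earlier in the article.
dict_sample = {
    1: {'student1': 'Nicholas', 'student2': 'John', 'student3': 'Mercy'},
    2: {'course1': 'Computer Science', 'course2': 'Mathematics', 'course3': 'Accounting'},
}

# Chain one pair of square brackets per level of nesting.
first_student = dict_sample[1]['student1']
second_course = dict_sample[2]['course2']

print(first_student)   # Nicholas
print(second_course)   # Mathematics
```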



The dictionary object also provides the get() method, which can be used to access dictionary elements as well. We call it on the dictionary with the dot operator and pass the key as the argument. Unlike square brackets, which raise a KeyError for a missing key, get() returns None (or a default you supply) when the key is absent. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
x = dict_sample.get("model")
print(x)



Output:

Premio
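
The difference between the two access styles only shows up when a key is missing. The sketch below is one way to see it; the "color" key and the variable names missing, fallback, and raised are assumptions chosen for illustration:

```python
dict_sample = {"Company": "Toyota", "model": "Premio", "year": 2012}

# get() returns None for a missing key instead of raising an error.
missing = dict_sample.get("color")

# An optional second argument supplies the fallback value.
fallback = dict_sample.get("color", "unknown")

# Square brackets raise KeyError when the key is absent.
try:
    dict_sample["color"]
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)   # None unknown True
```

Which style to prefer probably depends on whether a missing key is an expected situation or a bug you want to hear about immediately.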



Now we know how to access dictionary elements using a few different methods. In the next section we'll discuss how to add new elements to an already existing dictionary.

### Adding Elements

There are several ways to add new elements to a dictionary. The simplest is to assign a value to a new key. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
dict_sample["Capacity"] = "1800CC"
print(dict_sample)



Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2012, 'Capacity': '1800CC'}



The new element has "Capacity" as the key and "1800CC" as its corresponding value. Since Python 3.7, dictionaries preserve insertion order, so the new element appears as the last element of the dictionary.

Here is another example. First, let's create an empty dictionary:

MyDictionary = {}
print("An Empty Dictionary: ")
print(MyDictionary)



Output:

An Empty Dictionary:
{}



An empty dictionary prints as a pair of empty braces, since nothing is stored in it yet. Let us add some elements to it, one at a time:

MyDictionary[0] = 'Apples'
MyDictionary[2] = 'Mangoes'
MyDictionary[3] = 20
print("\n3 elements have been added: ")
print(MyDictionary)



Output:

3 elements have been added:
{0: 'Apples', 2: 'Mangoes', 3: 20}



To add the elements, we specified keys as well as the corresponding values. For example:

MyDictionary[0] = 'Apples'



In the above example, 0 is the key while "Apples" is the value.

It is even possible to store several values under one key. Continuing with the same dictionary:

MyDictionary['Values'] = 1, "Pairs", 4
print("\nA group of values has been added: ")
print(MyDictionary)



Output:

A group of values has been added:
{0: 'Apples', 2: 'Mangoes', 3: 20, 'Values': (1, 'Pairs', 4)}



In the above example, the key is "Values", and the comma-separated values after the = sign are packed into a single tuple, which becomes the value for that key.

Other than adding new elements to a dictionary, dictionary elements can also be updated/changed, which we'll go over in the next section.

### Updating Elements

After adding a value to a dictionary, we can modify it later. We use the key of the element to change the corresponding value. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}

dict_sample["year"] = 2014

print(dict_sample)



Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2014}



In this example you can see that we have updated the value for the key "year" from the old value of 2012 to a new value of 2014.
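
When several values change at once, the update() method is an alternative to assigning keys one at a time. A minimal sketch, reusing dict_sample from above; the "color" key is made up for illustration:

```python
dict_sample = {"Company": "Toyota", "model": "Premio", "year": 2012}

# update() overwrites existing keys and adds any that are missing.
dict_sample.update({"year": 2014, "color": "Gray"})

print(dict_sample)
# {'Company': 'Toyota', 'model': 'Premio', 'year': 2014, 'color': 'Gray'}
```

The existing key "year" keeps its position, while the new key "color" is appended at the end.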

### Removing Elements

The removal of an element from a dictionary can be done in several ways, which we'll discuss one-by-one in this section:

The del keyword can be used to remove the element with the specified key. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
del dict_sample["year"]
print(dict_sample)



Output:

{'Company': 'Toyota', 'model': 'Premio'}



We used the del statement followed by the dictionary name. Inside the square brackets that follow the dictionary name, we passed the key of the element we need to delete from the dictionary, which in this example was "year". The entry for "year" in the dictionary was then deleted.

Another way to delete a key-value pair is to use the pop() method and pass the key of the entry to be deleted as the argument. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
dict_sample.pop("year")
print(dict_sample)



Output:

{'Company': 'Toyota', 'model': 'Premio'}



We called pop() on the dictionary using the dot operator. Again, the entry for "year" is deleted. The pop() method also returns the removed value, which we ignored here.

The popitem() method removes the last item inserted into the dictionary, without needing to specify the key. (Before Python 3.7, it removed an arbitrary item.) Take a look at the following example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
dict_sample.popitem()
print(dict_sample)



Output:

{'Company': 'Toyota', 'model': 'Premio'}



The last entry into the dictionary was "year". It has been removed after calling the popitem() function.
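
Both pop() and popitem() hand back what they removed, which the examples above discard. The sketch below captures those return values; the variable names and the "color" key are assumptions for illustration:

```python
dict_sample = {"Company": "Toyota", "model": "Premio", "year": 2012}

# pop() returns the value that was stored under the removed key.
removed_year = dict_sample.pop("year")

# A second argument is returned instead of raising KeyError
# when the key is not present.
removed_color = dict_sample.pop("color", None)

# popitem() returns the removed entry as a (key, value) tuple.
last_item = dict_sample.popitem()

print(removed_year, removed_color, last_item)   # 2012 None ('model', 'Premio')
print(dict_sample)                              # {'Company': 'Toyota'}
```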

But what if you want to delete the entire dictionary? It would be difficult and cumbersome to use one of these methods on every single key. Instead, you can use the del keyword to delete the entire dictionary. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
del dict_sample
print(dict_sample)



Output:

NameError: name 'dict_sample' is not defined



The code raises a NameError. The del statement removed the name dict_sample, so the print() call refers to a variable that no longer exists.

However, your use case may require you to just remove all dictionary elements and be left with an empty dictionary. This can be achieved by calling the clear() method on the dictionary:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
dict_sample.clear()
print(dict_sample)



Output:

{}



The code prints an empty dictionary since all the dictionary elements have been removed.
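
One detail worth knowing is that clear() empties the dictionary object in place, whereas assigning {} only points the name at a new, empty dictionary. The sketch below illustrates the difference; the names alias and alias_after_clear are made up for this example:

```python
dict_sample = {"Company": "Toyota", "model": "Premio"}
alias = dict_sample               # second name for the same dictionary object

dict_sample.clear()               # empties the object in place
alias_after_clear = dict(alias)   # snapshot: alias sees the change too

dict_sample["year"] = 2012        # still the shared object
dict_sample = {}                  # rebinds the name to a brand-new dictionary

print(alias_after_clear)          # {}
print(alias)                      # {'year': 2012}
```

If other parts of a program hold a reference to the dictionary, clear() is likely the one you want.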

### Other Common Methods

The len() Function

With this built-in function, you can count the number of elements in a dictionary. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
print(len(dict_sample))



Output:

3



There are three entries in the dictionary, hence the function returned 3.

The copy() Method

This method returns a copy of the existing dictionary. For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}
x = dict_sample.copy()

print(x)



Output:

{'Company': 'Toyota', 'model': 'Premio', 'year': 2012}



We created a copy of the dictionary named dict_sample and assigned it to the variable x. Printing x shows that it contains the same elements as dict_sample.

Note that this is useful because adding, removing, or reassigning keys in the copy won't affect the original. The copy is shallow, however: mutable values such as lists are shared between the two dictionaries.
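
How far a copy is independent of the original depends on the values it holds. The following sketch compares copy() with copy.deepcopy() from the standard library; the "colors" key and the variable names are assumptions for illustration:

```python
import copy

original = {"model": "Premio", "colors": ["Gray", "White"]}

shallow = original.copy()
deep = copy.deepcopy(original)

# Rebinding a key in the copy leaves the original alone.
shallow["model"] = "Allion"

# Mutating a shared list is visible through the original as well.
shallow["colors"].append("Black")

# The deep copy has its own list, so this change stays local to it.
deep["colors"].append("Red")

print(original)
# {'model': 'Premio', 'colors': ['Gray', 'White', 'Black']}
```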

The items() Method

When called, this method returns a view object containing the dictionary's key-value pairs as tuples. This method is primarily used when you want to iterate through a dictionary.

The method is simply called on the dictionary object name as shown below:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}

for k, v in dict_sample.items():
    print(k, v)



Output:

Company Toyota
model Premio
year 2012



The object returned by items() also reflects changes made to the dictionary afterwards. This is demonstrated below:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}

x = dict_sample.items()

print(x)

dict_sample["model"] = "Mark X"

print(x)



Output:

dict_items([('Company', 'Toyota'), ('model', 'Premio'), ('year', 2012)])
dict_items([('Company', 'Toyota'), ('model', 'Mark X'), ('year', 2012)])



The output shows that when you change a value in the dictionary, the items object is also updated to reflect this change.

The fromkeys() Method

This class method returns a new dictionary with the specified keys, all mapped to the same value. It takes the syntax given below:

dict.fromkeys(keys, value)



The required keys parameter is an iterable that specifies the keys for the new dictionary. The optional value parameter specifies the value assigned to every key; it defaults to None.

Suppose we need to create a dictionary of three keys all with the same value. We can do so as follows:

name = ('John', 'Nicholas', 'Mercy')
age = 25

dict_sample = dict.fromkeys(name, age)

print(dict_sample)



Output:

{'John': 25, 'Nicholas': 25, 'Mercy': 25}



In the script above, we specified the keys and one value. The fromkeys() method was able to pick the keys and combine them with this value to create a populated dictionary.

The keys parameter is mandatory. The following example demonstrates what happens when the value parameter is not specified:

name = ('John', 'Nicholas', 'Mercy')

dict_sample = dict.fromkeys(name)

print(dict_sample)



Output:

{'John': None, 'Nicholas': None, 'Mercy': None}



The default value, which is None, was used.
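
Because fromkeys() assigns one and the same value object to every key, a mutable value such as a list ends up shared. The sketch below shows the effect and one common alternative, a dict comprehension; the names shared and separate are made up for illustration:

```python
names = ('John', 'Nicholas', 'Mercy')

# Every key refers to the *same* list object.
shared = dict.fromkeys(names, [])
shared['John'].append('Maths')

# A dict comprehension builds a fresh list for each key.
separate = {name: [] for name in names}
separate['John'].append('Maths')

print(shared)
# {'John': ['Maths'], 'Nicholas': ['Maths'], 'Mercy': ['Maths']}
print(separate)
# {'John': ['Maths'], 'Nicholas': [], 'Mercy': []}
```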

The setdefault() Method

This method returns the value of the element with the specified key. If the key is not found, it is inserted into the dictionary with the specified value.

The method takes the following syntax:

dictionary.setdefault(keyname, value)



In this method, the keyname parameter is required; it is the key of the item whose value you want returned. The value parameter is optional and defaults to None. If the dictionary already has the key, value has no effect. If the key doesn't exist, value becomes the value stored under that key.

For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}

x = dict_sample.setdefault("color", "Gray")

print(x)



Output:

Gray



The dictionary doesn't have a "color" key. The setdefault() method inserted this key with the specified value, "Gray", and returned that value.

The following example demonstrates how the method behaves if the key already exists:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}

x = dict_sample.setdefault("model", "Allion")

print(x)



Output:

Premio



The value "Allion" has no effect on the dictionary since we already have a value for the key.
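
A frequent use of setdefault() is grouping items under a key without first checking whether the key exists. The sketch below is one way to do that; the students list and the by_course name are hypothetical sample data, not part of the examples above:

```python
students = [
    ("Accounting", "Mercy"),
    ("Mathematics", "John"),
    ("Accounting", "Nicholas"),
]

by_course = {}
for course, name in students:
    # Create the list the first time a course is seen, then append to it.
    by_course.setdefault(course, []).append(name)

print(by_course)
# {'Accounting': ['Mercy', 'Nicholas'], 'Mathematics': ['John']}
```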

The keys() Method

This method also returns a view object, containing all the keys in the dictionary. Just like the object returned by items(), it reflects changes made to the dictionary.

To use this method, we only call it on the name of the dictionary, as shown below:

dictionary.keys()



For example:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}

x = dict_sample.keys()

print(x)



Output:

dict_keys(['Company', 'model', 'year'])



Often, this method is used to iterate through each key in your dictionary, like so:

dict_sample = {
"Company": "Toyota",
"model": "Premio",
"year": 2012
}

for k in dict_sample.keys():
    print(k)



Output:

Company
model
year
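
Beyond iteration, key views behave much like sets, which makes it easy to compare the keys of two dictionaries. A minimal sketch, where the second dictionary other and the result names are assumptions for illustration:

```python
dict_sample = {"Company": "Toyota", "model": "Premio", "year": 2012}
other = {"model": "Allion", "color": "Gray"}

# Membership tests look at keys, not values.
has_model = "model" in dict_sample

# Key views support set operations such as intersection and difference.
common = dict_sample.keys() & other.keys()
only_in_sample = dict_sample.keys() - other.keys()

print(has_model, common, only_in_sample)
```

Both set operations return ordinary set objects, so the order in which their elements print is not guaranteed.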



### Conclusion

This marks the end of this tutorial on Python dictionaries. These dictionaries store data in "key:value" pairs. The "key" acts as the identifier for the item while "value" is the value of the item. The Python dictionary comes with a variety of methods that can be applied for retrieval or manipulation of data. In this article, we saw how a Python dictionary can be created, modified, and deleted, along with some of the most commonly used dictionary methods.

17 Oct 2018 2:15pm GMT

#### Real Python: Python, Boto3, and AWS S3: Demystified

Amazon Web Services (AWS) has become a leader in cloud computing. One of its core components is S3, the object storage service offered by AWS. With its impressive availability and durability, it has become the standard way to store videos, images, and data. You can combine S3 with other services to build infinitely scalable applications.

Boto3 is the name of the Python SDK for AWS. It allows you to directly create, update, and delete AWS resources from your Python scripts.

If you've had some AWS exposure before, have your own AWS account, and want to take your skills to the next level by starting to use AWS services from within your Python code, then keep reading.

By the end of this tutorial, you'll:

• Be confident working with buckets and objects directly from your Python scripts
• Know how to avoid common pitfalls when using Boto3 and S3
• Understand how to set up your data from the start to avoid performance issues later
• Learn how to configure your objects to take advantage of S3's best features

Before exploring Boto3's characteristics, you will first see how to configure the SDK on your machine. This step will set you up for the rest of the tutorial.

## Installation

To install Boto3 on your computer, go to your terminal and run the following:

$ pip install boto3

You've got the SDK. But, you won't be able to use it right now, because it doesn't know which AWS account it should connect to.

To make it run against your AWS account, you'll need to provide some valid credentials. If you already have an IAM user that has full permissions to S3, you can use that user's credentials (their access key and their secret access key) without needing to create a new user. Otherwise, the easiest way to do this is to create a new AWS user and then store the new credentials.

To create a new user, go to your AWS account, then go to Services and select IAM. Then choose Users and click on Add user.

Give the user a name (for example, boto3user). Enable programmatic access. This will ensure that this user will be able to work with any AWS supported SDK or make separate API calls.

To keep things simple, choose the preconfigured AmazonS3FullAccess policy. With this policy, the new user will be able to have full control over S3. Click on Next: Review, then select Create user.

A new screen will show you the user's generated credentials. Click on the Download .csv button to make a copy of the credentials. You will need them to complete your setup.

Now that you have your new user, create a new file, ~/.aws/credentials:

$ touch ~/.aws/credentials


Open the file and paste the structure below. Fill in the placeholders with the new user credentials you have downloaded:

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY


Save the file.

Now that you have set up these credentials, you have a default profile, which will be used by Boto3 to interact with your AWS account.
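
The credentials file uses an INI-style layout in which each profile is a section. As a rough illustration of that structure only, the sketch below parses a stand-in string with Python's configparser; this is an assumption-laden sanity check, not how Boto3 itself loads the file, and the values are the placeholders from above:

```python
import configparser

# Stand-in for the contents of ~/.aws/credentials; the values are placeholders.
sample = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""

parser = configparser.ConfigParser()
parser.read_string(sample)

# Each profile is a section; "default" is the one used when none is named.
profiles = parser.sections()
expected_keys = {"aws_access_key_id", "aws_secret_access_key"}
has_both_keys = expected_keys <= set(parser["default"])

print(profiles, has_both_keys)   # ['default'] True
```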

There is one more configuration to set up: the default region that Boto3 should interact with. You can check out the complete table of the supported AWS regions. Choose the region that is closest to you. Copy your preferred region from the Region column. In my case, I am using eu-west-1 (Ireland).

Create a new file, ~/.aws/config:

$ touch ~/.aws/config

Add the following and replace the placeholder with the region you have copied:

[default]
region = YOUR_PREFERRED_REGION

Save your file.

You are now officially set up for the rest of the tutorial.

Next, you will see the different options Boto3 gives you to connect to S3 and other AWS services.

## Client Versus Resource

At its core, all that Boto3 does is call AWS APIs on your behalf. For the majority of the AWS services, Boto3 offers two distinct ways of accessing these abstracted APIs:

• Client: low-level service access
• Resource: higher-level object-oriented service access

You can use either to interact with S3.

To connect to the low-level client interface, you must use Boto3's client(). You then pass in the name of the service you want to connect to, in this case, s3:

import boto3
s3_client = boto3.client('s3')

To connect to the high-level interface, you'll follow a similar approach, but use resource():

import boto3
s3_resource = boto3.resource('s3')

You've successfully connected to both versions, but now you might be wondering, "Which one should I use?"

With clients, there is more programmatic work to be done. The majority of the client operations give you a dictionary response. To get the exact information that you need, you'll have to parse that dictionary yourself. With resource methods, the SDK does that work for you.

With the client, you might see some slight performance improvements. The disadvantage is that your code becomes less readable than it would be if you were using the resource. Resources offer a better abstraction, and your code will be easier to comprehend.

Understanding how the client and the resource are generated is also important when you're considering which one to choose:

• Boto3 generates the client from a JSON service definition file. The client's methods support every single type of interaction with the target AWS service.
• Resources, on the other hand, are generated from JSON resource definition files.
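
To make "parse that dictionary yourself" concrete, here is a minimal sketch that runs without an AWS account. The response variable is a hand-written stand-in shaped like what a client call such as s3_client.list_buckets() returns; the bucket names, owner, and dates are made up:

```python
from datetime import datetime, timezone

# Hand-written stand-in for a client response dictionary (assumed sample data).
created = datetime(2018, 10, 17, tzinfo=timezone.utc)
response = {
    "Buckets": [
        {"Name": "firstpythonbucket-example", "CreationDate": created},
        {"Name": "secondpythonbucket-example", "CreationDate": created},
    ],
    "Owner": {"DisplayName": "boto3user", "ID": "example-id"},
}

# With the client, pulling out the bucket names is your job.
bucket_names = [bucket["Name"] for bucket in response["Buckets"]]

print(bucket_names)
# ['firstpythonbucket-example', 'secondpythonbucket-example']
```

With the resource interface, the comparable step would be iterating over s3_resource.buckets.all() and reading each bucket's name attribute, with no dictionary handling on your side.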
Boto3 generates the client and the resource from different definitions. As a result, you may find cases in which an operation supported by the client isn't offered by the resource. Here's the interesting part: you don't need to change your code to use the client everywhere. For that operation, you can access the client directly via the resource like so: s3_resource.meta.client.

One such client operation is .generate_presigned_url(), which enables you to give your users access to an object within your bucket for a set period of time, without requiring them to have AWS credentials.

## Common Operations

Now that you know about the differences between clients and resources, let's start using them to build some new S3 components.

### Creating a Bucket

To start off, you need an S3 bucket. To create one programmatically, you must first choose a name for your bucket. Remember that this name must be unique throughout the whole AWS platform, as bucket names are DNS compliant. If you try to create a bucket, but another user has already claimed your desired bucket name, your code will fail. Instead of success, you will see the following error: botocore.errorfactory.BucketAlreadyExists.

You can increase your chance of success when creating your bucket by picking a random name. You can write your own function that does that for you. In this implementation, you'll see how using the uuid module will help you achieve that. A UUID4's string representation is 36 characters long (including hyphens), and you can add a prefix to specify what each bucket is for.

Here's a way you can achieve that:

import uuid

def create_bucket_name(bucket_prefix):
    # The generated bucket name must be between 3 and 63 chars long
    return ''.join([bucket_prefix, str(uuid.uuid4())])

You've got your bucket name, but now there's one more thing you need to be aware of: unless your region is us-east-1 (US East, N. Virginia), you'll need to define the region explicitly when you are creating the bucket.
Otherwise you will get an IllegalLocationConstraintException.

To exemplify what this means when you're creating your S3 bucket in a non-US region, take a look at the code below:

s3_resource.create_bucket(Bucket=YOUR_BUCKET_NAME,
                          CreateBucketConfiguration={
                              'LocationConstraint': 'eu-west-1'})

You need to provide both a bucket name and a bucket configuration where you must specify the region, which in my case is eu-west-1.

This isn't ideal. Imagine that you want to take your code and deploy it to the cloud. Your task will become increasingly difficult because you've now hardcoded the region. You could refactor the region and transform it into an environment variable, but then you'd have one more thing to manage.

Luckily, there is a better way to get the region programmatically, by taking advantage of a session object. Boto3 will create the session from your credentials. You just need to take the region and pass it to create_bucket() as its LocationConstraint configuration. Here's how to do that:

def create_bucket(bucket_prefix, s3_connection):
    session = boto3.session.Session()
    current_region = session.region_name
    bucket_name = create_bucket_name(bucket_prefix)
    bucket_response = s3_connection.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={
            'LocationConstraint': current_region})
    print(bucket_name, current_region)
    return bucket_name, bucket_response

The nice part is that this code works no matter where you want to deploy it: locally, on EC2, or on Lambda. Moreover, you don't need to hardcode your region.

As both the client and the resource create buckets in the same way, you can pass either one as the s3_connection parameter.

You'll now create two buckets. First create one using the client, which gives you back the bucket_response as a dictionary:

>>> first_bucket_name, first_response = create_bucket(
...     bucket_prefix='firstpythonbucket',
...     s3_connection=s3_resource.meta.client)
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304 eu-west-1

>>> first_response
{'ResponseMetadata': {'RequestId': 'E1DCFE71EDE7C1EC', 'HostId': 'r3AP32NQk9dvbHSEPIbyYADT769VQEN/+xT2BPM6HCnuCb3Z/GhR2SBP+GM7IjcxbBN7SQ+k+9B=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'r3AP32NQk9dvbHSEPIbyYADT769VQEN/+xT2BPM6HCnuCb3Z/GhR2SBP+GM7IjcxbBN7SQ+k+9B=', 'x-amz-request-id': 'E1DCFE71EDE7C1EC', 'date': 'Fri, 05 Oct 2018 15:00:00 GMT', 'location': 'http://firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304.s3.amazonaws.com/', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'Location': 'http://firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304.s3.amazonaws.com/'}

Then create a second bucket using the resource, which gives you back a Bucket instance as the bucket_response:

>>> second_bucket_name, second_response = create_bucket(
...     bucket_prefix='secondpythonbucket', s3_connection=s3_resource)
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644 eu-west-1

>>> second_response
s3.Bucket(name='secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644')

You've got your buckets. Next, you'll want to start adding some files to them.

### Naming Your Files

You can name your objects by using standard file naming conventions. You can use any valid name. In this article, you'll look at a more specific case that helps you understand how S3 works under the hood.

If you're planning on hosting a large number of files in your S3 bucket, there's something you should keep in mind. If all your file names have a deterministic prefix that gets repeated for every file, such as a timestamp format like "YYYY-MM-DDThh:mm:ss", then you will soon find that you're running into performance issues when you're trying to interact with your bucket.

This will happen because S3 takes the prefix of the file and maps it onto a partition.
The more files you add, the more will be assigned to the same partition, and that partition will be very heavy and less responsive.

What can you do to keep that from happening? The easiest solution is to randomize the file name. You can imagine many different implementations, but in this case, you'll use the trusted uuid module to help with that. To make the file names easier to read for this tutorial, you'll be taking the first six characters of the generated number's hex representation and concatenating them with your base file name.

The helper function below allows you to pass in the number of bytes you want the file to have, the file name, and a sample content for the file to be repeated to make up the desired file size:

def create_temp_file(size, file_name, file_content):
    random_file_name = ''.join([str(uuid.uuid4().hex[:6]), file_name])
    with open(random_file_name, 'w') as f:
        f.write(str(file_content) * size)
    return random_file_name

Create your first file, which you'll be using shortly:

first_file_name = create_temp_file(300, 'firstfile.txt', 'f')

By adding randomness to your file names, you can efficiently distribute your data within your S3 bucket.

### Creating Bucket and Object Instances

The next step after creating your file is to see how to integrate it into your S3 workflow.

This is where the resource's classes play an important role, as these abstractions make it easy to work with S3.

By using the resource, you have access to the high-level classes (Bucket and Object). This is how you can create one of each:

first_bucket = s3_resource.Bucket(name=first_bucket_name)
first_object = s3_resource.Object(
    bucket_name=first_bucket_name, key=first_file_name)

The reason you have not seen any errors with creating the first_object variable is that Boto3 doesn't make calls to AWS to create the reference. The bucket_name and the key are called identifiers, and they are the necessary parameters to create an Object.
Any other attribute of an Object, such as its size, is lazily loaded. This means that for Boto3 to get the requested attributes, it has to make calls to AWS.

### Understanding Sub-resources

Bucket and Object are sub-resources of one another. Sub-resources are methods that create a new instance of a child resource. The parent's identifiers get passed to the child resource.

If you have a Bucket variable, you can create an Object directly:

first_object_again = first_bucket.Object(first_file_name)

Or if you have an Object variable, then you can get the Bucket:

first_bucket_again = first_object.Bucket()

Great, you now understand how to generate a Bucket and an Object. Next, you'll get to upload your newly generated file to S3 using these constructs.

### Uploading a File

There are three ways you can upload a file:

• From an Object instance
• From a Bucket instance
• From the client

In each case, you have to provide the Filename, which is the path of the file you want to upload. You'll now explore the three alternatives. Feel free to pick whichever you like most to upload the first_file_name to S3.

Object Instance Version

You can upload using an Object instance:

s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    Filename=first_file_name)

Or you can use the first_object instance:

first_object.upload_file(first_file_name)

Bucket Instance Version

Here's how you can upload using a Bucket instance:

s3_resource.Bucket(first_bucket_name).upload_file(
    Filename=first_file_name, Key=first_file_name)

Client Version

You can also upload using the client:

s3_resource.meta.client.upload_file(
    Filename=first_file_name, Bucket=first_bucket_name,
    Key=first_file_name)

You have successfully uploaded your file to S3 using one of the three available methods. In the upcoming sections, you'll mainly work with the Object class, as the operations are very similar between the client and the Bucket versions.

### Downloading a File

To download a file from S3 locally, you'll follow similar steps as you did when uploading. But in this case, the Filename parameter will map to your desired local path. This time, it will download the file to the tmp directory:

s3_resource.Object(first_bucket_name, first_file_name).download_file(
    f'/tmp/{first_file_name}')  # Python 3.6+

You've successfully downloaded your file from S3. Next, you'll see how to copy the same file between your S3 buckets using a single API call.

### Copying an Object Between Buckets

If you need to copy files from one bucket to another, Boto3 offers you that possibility. In this example, you'll copy the file from the first bucket to the second, using .copy():

def copy_to_bucket(bucket_from_name, bucket_to_name, file_name):
    copy_source = {
        'Bucket': bucket_from_name,
        'Key': file_name
    }
    s3_resource.Object(bucket_to_name, file_name).copy(copy_source)

copy_to_bucket(first_bucket_name, second_bucket_name, first_file_name)

Note: If you're aiming to replicate your S3 objects to a bucket in a different region, have a look at Cross Region Replication.

### Deleting an Object

Let's delete the new file from the second bucket by calling .delete() on the equivalent Object instance:

s3_resource.Object(second_bucket_name, first_file_name).delete()

You've now seen how to use S3's core operations. You're ready to take your knowledge to the next level with more complex characteristics in the upcoming sections.

## Advanced Configurations

In this section, you're going to explore more elaborate S3 features. You'll see examples of how to use them and the benefits they can bring to your applications.

### ACL (Access Control Lists)

Access Control Lists (ACLs) help you manage access to your buckets and the objects within them. They are considered the legacy way of administering permissions to S3. Why should you know about them? If you have to manage access to individual objects, then you would use an Object ACL.
By default, when you upload an object to S3, that object is private. If you want to make this object available to someone else, you can set the object's ACL to be public at creation time. Here's how you upload a new file to the bucket and make it accessible to everyone:

second_file_name = create_temp_file(400, 'secondfile.txt', 's')
second_object = s3_resource.Object(first_bucket.name, second_file_name)
second_object.upload_file(second_file_name, ExtraArgs={
                          'ACL': 'public-read'})

You can get the ObjectAcl instance from the Object, as it is one of its sub-resource classes:

second_object_acl = second_object.Acl()

To see who has access to your object, use the grants attribute:

>>> second_object_acl.grants
[{'Grantee': {'DisplayName': 'name', 'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}, {'Grantee': {'Type': 'Group', 'URI': 'http://acs.amazonaws.com/groups/global/AllUsers'}, 'Permission': 'READ'}]

You can make your object private again, without needing to re-upload it:

>>> response = second_object_acl.put(ACL='private')
>>> second_object_acl.grants
[{'Grantee': {'DisplayName': 'name', 'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}]

You have seen how you can use ACLs to manage access to individual objects. Next, you'll see how you can add an extra layer of security to your objects by using encryption.

Note: If you're looking to split your data into multiple categories, have a look at tags. You can grant access to the objects based on their tags.

### Encryption

With S3, you can protect your data using encryption. You'll explore server-side encryption using the AES-256 algorithm where AWS manages both the encryption and the keys.
Create a new file and upload it using ServerSideEncryption:

third_file_name = create_temp_file(300, 'thirdfile.txt', 't')
third_object = s3_resource.Object(first_bucket_name, third_file_name)
third_object.upload_file(third_file_name, ExtraArgs={
                         'ServerSideEncryption': 'AES256'})

You can check the algorithm that was used to encrypt the file, in this case AES256:

>>> third_object.server_side_encryption
'AES256'

You now understand how to add an extra layer of protection to your objects using the AES-256 server-side encryption algorithm offered by AWS.

### Storage

Every object that you add to your S3 bucket is associated with a storage class. All the available storage classes offer high durability. You choose how you want to store your objects based on your application's performance and access requirements.

At present, you can use the following storage classes with S3:

• STANDARD: default for frequently accessed data
• STANDARD_IA: for infrequently used data that needs to be retrieved rapidly when requested
• ONEZONE_IA: for the same use case as STANDARD_IA, but stores the data in one Availability Zone instead of three
• REDUCED_REDUNDANCY: for frequently used noncritical data that is easily reproducible

If you want to change the storage class of an existing object, you need to recreate the object.

For example, reupload the third_object and set its storage class to STANDARD_IA:

third_object.upload_file(third_file_name, ExtraArgs={
                         'ServerSideEncryption': 'AES256',
                         'StorageClass': 'STANDARD_IA'})

Note: If you make changes to your object, you might find that your local instance doesn't show them. What you need to do at that point is call .reload() to fetch the newest version of your object.

Reload the object, and you can see its new storage class:

>>> third_object.reload()
>>> third_object.storage_class
'STANDARD_IA'

Note: Use LifeCycle Configurations to transition objects through the different classes as you find the need for them.
They will automatically transition these objects for you.

### Versioning

You should use versioning to keep a complete record of your objects over time. It also acts as a protection mechanism against accidental deletion of your objects. When you request a versioned object, Boto3 will retrieve the latest version.

When you add a new version of an object, the storage that object takes in total is the sum of the size of its versions. So if you're storing an object of 1 GB, and you create 10 versions, then you have to pay for 10 GB of storage.

Enable versioning for the first bucket. To do this, you need to use the BucketVersioning class:

def enable_bucket_versioning(bucket_name):
    bkt_versioning = s3_resource.BucketVersioning(bucket_name)
    bkt_versioning.enable()
    print(bkt_versioning.status)

>>> enable_bucket_versioning(first_bucket_name)
Enabled

Then create two new versions for the first file Object, one with the contents of the original file and one with the contents of the third file:

s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    first_file_name)
s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    third_file_name)

Now reupload the second file, which will create a new version:

s3_resource.Object(first_bucket_name, second_file_name).upload_file(
    second_file_name)

You can retrieve the latest available version of your objects like so:

>>> s3_resource.Object(first_bucket_name, first_file_name).version_id
'eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv'

In this section, you've seen how to work with some of the most important S3 attributes and add them to your objects. Next, you'll see how to easily traverse your buckets and objects.

## Traversals

If you need to retrieve information from or apply an operation to all your S3 resources, Boto3 gives you several ways to iteratively traverse your buckets and your objects. You'll start by traversing all your created buckets.

### Bucket Traversal

To traverse all the buckets in your account, you can use the resource's buckets attribute alongside .all(), which gives you the complete list of Bucket instances:

>>> for bucket in s3_resource.buckets.all():
...     print(bucket.name)
...
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644

You can use the client to retrieve the bucket information as well, but the code is more complex, as you need to extract it from the dictionary that the client returns:

>>> for bucket_dict in s3_resource.meta.client.list_buckets().get('Buckets'):
...     print(bucket_dict['Name'])
...
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644

You have seen how to iterate through the buckets you have in your account. In the upcoming section, you'll pick one of your buckets and iteratively view the objects it contains.

### Object Traversal

If you want to list all the objects from a bucket, the following code will generate an iterator for you:

>>> for obj in first_bucket.objects.all():
...     print(obj.key)
...
127367firstfile.txt
616abesecondfile.txt
fb937cthirdfile.txt

The obj variable is an ObjectSummary. This is a lightweight representation of an Object. The summary version doesn't support all of the attributes that the Object has. If you need to access them, use the Object() sub-resource to create a new reference to the underlying stored key. Then you'll be able to extract the missing attributes:

>>> for obj in first_bucket.objects.all():
...     subsrc = obj.Object()
...     print(obj.key, obj.storage_class, obj.last_modified,
...           subsrc.version_id, subsrc.metadata)
...
127367firstfile.txt STANDARD 2018-10-05 15:09:46+00:00 eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv {}
616abesecondfile.txt STANDARD 2018-10-05 15:09:47+00:00 WIaExRLmoksJzLhN7jU5YzoJxYSu6Ey6 {}
fb937cthirdfile.txt STANDARD_IA 2018-10-05 15:09:05+00:00 null {}

You can now iteratively perform operations on your buckets and objects. You're almost done. There's one more thing you should know at this stage: how to delete all the resources you've created in this tutorial.

## Deleting Buckets and Objects

To remove all the buckets and objects you have created, you must first make sure that your buckets have no objects within them.

### Deleting a Non-empty Bucket

To be able to delete a bucket, you must first delete every single object within the bucket, or else the BucketNotEmpty exception will be raised. When you have a versioned bucket, you need to delete every object and all its versions.

If you find that a LifeCycle rule that will do this automatically for you isn't suitable to your needs, here's how you can programmatically delete the objects:

def delete_all_objects(bucket_name):
    res = []
    bucket = s3_resource.Bucket(bucket_name)
    for obj_version in bucket.object_versions.all():
        res.append({'Key': obj_version.object_key,
                    'VersionId': obj_version.id})
    print(res)
    bucket.delete_objects(Delete={'Objects': res})

The above code works whether or not you have enabled versioning on your bucket. If you haven't, the version of the objects will be null. You can batch up to 1000 deletions in one API call, using .delete_objects() on your Bucket instance, which is more cost-effective than individually deleting each object.
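
If a bucket holds more than 1000 object versions, the list has to be split before it is sent. Here is a minimal sketch of that batching, using a hypothetical helper named chunked and made-up keys in the same shape that delete_all_objects() builds:

```python
def chunked(items, batch_size=1000):
    """Yield successive slices of items with at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up entries, shaped like the dictionaries delete_all_objects() collects.
versions = [{'Key': f'file{i}.txt', 'VersionId': 'null'} for i in range(2500)]

batch_sizes = [len(batch) for batch in chunked(versions)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each batch could then be passed to bucket.delete_objects(Delete={'Objects': batch}) in turn.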
Run the new function against the first bucket to remove all the versioned objects:

>>> delete_all_objects(first_bucket_name)
[{'Key': '127367firstfile.txt', 'VersionId': 'eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv'}, {'Key': '127367firstfile.txt', 'VersionId': 'UnQTaps14o3c1xdzh09Cyqg_hq4SjB53'}, {'Key': '127367firstfile.txt', 'VersionId': 'null'}, {'Key': '616abesecondfile.txt', 'VersionId': 'WIaExRLmoksJzLhN7jU5YzoJxYSu6Ey6'}, {'Key': '616abesecondfile.txt', 'VersionId': 'null'}, {'Key': 'fb937cthirdfile.txt', 'VersionId': 'null'}]

As a final test, you can upload a file to the second bucket. This bucket doesn't have versioning enabled, and thus the version will be null. Apply the same function to remove the contents:

>>> s3_resource.Object(second_bucket_name, first_file_name).upload_file(
...     first_file_name)
>>> delete_all_objects(second_bucket_name)
[{'Key': '9c8b44firstfile.txt', 'VersionId': 'null'}]

You've successfully removed all the objects from both your buckets. You're now ready to delete the buckets.

### Deleting Buckets

To finish off, you'll use .delete() on your Bucket instance to remove the first bucket:

s3_resource.Bucket(first_bucket_name).delete()

If you want, you can use the client version to remove the second bucket:

s3_resource.meta.client.delete_bucket(Bucket=second_bucket_name)

Both the operations were successful because you emptied each bucket before attempting to delete it.

You've now run some of the most important operations that you can perform with S3 and Boto3. Congratulations on making it this far! As a bonus, let's explore some of the advantages of managing S3 resources with Infrastructure as Code.

## Python Code or Infrastructure as Code (IaC)?

As you've seen, most of the interactions you've had with S3 in this tutorial had to do with objects.
You didn't see many bucket-related operations, such as adding policies to the bucket, adding a LifeCycle rule to transition your objects through the storage classes, archive them to Glacier, or delete them altogether, or enforcing that all objects be encrypted by configuring Bucket Encryption.

Manually managing the state of your buckets via Boto3's clients or resources becomes increasingly difficult as your application starts adding other services and grows more complex. To monitor your infrastructure in concert with Boto3, consider using an Infrastructure as Code (IaC) tool such as CloudFormation or Terraform to manage your application's infrastructure. Either one of these tools will maintain the state of your infrastructure and inform you of the changes that you've applied.

If you decide to go down this route, keep the following in mind:

• Any bucket-related operation that modifies the bucket in any way should be done via IaC.
• If you want all your objects to act in the same way (all encrypted, or all public, for example), usually there is a way to do this directly using IaC, by adding a Bucket Policy or a specific Bucket property.
• Bucket read operations, such as iterating through the contents of a bucket, should be done using Boto3.
• Object-related operations at an individual object level should be done using Boto3.

## Conclusion

Congratulations on making it to the end of this tutorial!

You're now equipped to start working programmatically with S3. You now know how to create objects, upload them to S3, download their contents and change their attributes directly from your script, all while avoiding common pitfalls with Boto3.

May this tutorial be a stepping stone in your journey to building something great using AWS!

17 Oct 2018 2:00pm GMT

#### Real Python: Python, Boto3, and AWS S3: Demystified

Amazon Web Services (AWS) has become a leader in cloud computing. One of its core components is S3, the object storage service offered by AWS. With its impressive availability and durability, it has become the standard way to store videos, images, and data. You can combine S3 with other services to build infinitely scalable applications.

Boto3 is the name of the Python SDK for AWS. It allows you to directly create, update, and delete AWS resources from your Python scripts.

If you've had some AWS exposure before, have your own AWS account, and want to take your skills to the next level by starting to use AWS services from within your Python code, then keep reading.

By the end of this tutorial, you'll:

• Be confident working with buckets and objects directly from your Python scripts
• Know how to avoid common pitfalls when using Boto3 and S3
• Understand how to set up your data from the start to avoid performance issues later
• Learn how to configure your objects to take advantage of S3's best features

Before exploring Boto3's characteristics, you will first see how to configure the SDK on your machine. This step will set you up for the rest of the tutorial.

## Installation

To install Boto3 on your computer, go to your terminal and run the following:

$ pip install boto3


You've got the SDK. But, you won't be able to use it right now, because it doesn't know which AWS account it should connect to.

To make it run against your AWS account, you'll need to provide some valid credentials. If you already have an IAM user that has full permissions to S3, you can use that user's credentials (their access key and their secret access key) without needing to create a new user. Otherwise, the easiest way to do this is to create a new AWS user and then store the new credentials.

To create a new user, go to your AWS account, then go to Services and select IAM. Then choose Users and click on Add user.

Give the user a name (for example, boto3user). Enable programmatic access. This will ensure that this user will be able to work with any AWS-supported SDK or make separate API calls.

To keep things simple, choose the preconfigured AmazonS3FullAccess policy. With this policy, the new user will be able to have full control over S3. Click on Next: Review.

Select Create user.

A new screen will show you the user's generated credentials. Click on the Download .csv button to make a copy of the credentials. You will need them to complete your setup.

Now that you have your new user, create a new file, ~/.aws/credentials:

$ touch ~/.aws/credentials

Open the file and paste the structure below. Fill in the placeholders with the new user credentials you have downloaded:

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

Save the file.

Now that you have set up these credentials, you have a default profile, which will be used by Boto3 to interact with your AWS account.

There is one more configuration to set up: the default region that Boto3 should interact with. You can check out the complete table of the supported AWS regions. Choose the region that is closest to you. Copy your preferred region from the Region column. In my case, I am using eu-west-1 (Ireland).

Create a new file, ~/.aws/config:

$ touch ~/.aws/config


Add the following and replace the placeholder with the region you have copied:

[default]
region = YOUR_PREFERRED_REGION


You are now officially set up for the rest of the tutorial.
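
Because the file uses an INI-style layout, one way to double-check what it contains is Python's standard configparser module. The sketch below parses a sample string rather than the real ~/.aws/config so that it runs anywhere. The helper name read_default_region is made up for this illustration, and eu-west-1 is simply the region used in this tutorial:

```python
import configparser

# Sample text in the same layout as ~/.aws/config (illustrative value).
SAMPLE_CONFIG = """
[default]
region = eu-west-1
"""


def read_default_region(config_text):
    """Return the region listed under [default], or None if it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.get('default', 'region', fallback=None)


print(read_default_region(SAMPLE_CONFIG))
```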

Next, you will see the different options Boto3 gives you to connect to S3 and other AWS services.

## Client Versus Resource

At its core, all that Boto3 does is call AWS APIs on your behalf. For the majority of the AWS services, Boto3 offers two distinct ways of accessing these abstracted APIs:

• Client: low-level service access
• Resource: higher-level object-oriented service access

You can use either to interact with S3.

To connect to the low-level client interface, you must use Boto3's client(). You then pass in the name of the service you want to connect to, in this case, s3:

import boto3
s3_client = boto3.client('s3')


To connect to the high-level interface, you'll follow a similar approach, but use resource():

import boto3
s3_resource = boto3.resource('s3')


You've successfully connected to both versions, but now you might be wondering, "Which one should I use?"

With clients, there is more programmatic work to be done. The majority of the client operations give you a dictionary response. To get the exact information that you need, you'll have to parse that dictionary yourself. With resource methods, the SDK does that work for you.
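
To make the difference concrete, here is a minimal sketch of the parsing that a client response requires. The dictionary is a hand-written stand-in shaped like the output of the client's list_buckets() call (a 'Buckets' list whose entries carry a 'Name'), and the bucket names are made up:

```python
# Hand-written stand-in for a client response. Real responses carry
# more fields, such as 'ResponseMetadata'.
sample_response = {
    'Buckets': [
        {'Name': 'example-bucket-one'},
        {'Name': 'example-bucket-two'},
    ],
}


def bucket_names(response):
    """Pull the bucket names out of a list_buckets-style dictionary."""
    return [bucket['Name'] for bucket in response.get('Buckets', [])]


print(bucket_names(sample_response))
```

With the resource, the same information is exposed as an attribute, such as bucket.name on each Bucket instance.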

With the client, you might see some slight performance improvements. The disadvantage is that your code becomes less readable than it would be if you were using the resource. Resources offer a better abstraction, and your code will be easier to comprehend.

Understanding how the client and the resource are generated is also important when you're considering which one to choose:

• Boto3 generates the client from a JSON service definition file. The client's methods support every single type of interaction with the target AWS service.
• Resources, on the other hand, are generated from JSON resource definition files.

Boto3 generates the client and the resource from different definitions. As a result, you may find cases in which an operation supported by the client isn't offered by the resource. Here's the interesting part: you don't need to change your code to use the client everywhere. For that operation, you can access the client directly via the resource like so: s3_resource.meta.client.

One such client operation is .generate_presigned_url(), which enables you to give your users access to an object within your bucket for a set period of time, without requiring them to have AWS credentials.
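
As a sketch of how that call might look, the helper below asks a client for a time-limited download link. The function name presigned_get_url and the one-hour default are illustrative choices rather than part of Boto3. The client is passed in as a parameter so that you can hand it s3_resource.meta.client:

```python
def presigned_get_url(s3_client, bucket_name, key, expires_in=3600):
    """Return a URL that allows downloading one object for a limited time.

    s3_client is expected to be a Boto3 S3 client, for example
    s3_resource.meta.client. expires_in is the lifetime in seconds.
    """
    return s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket_name, 'Key': key},
        ExpiresIn=expires_in)
```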

## Common Operations

Now that you know about the differences between clients and resources, let's start using them to build some new S3 components.

### Creating a Bucket

To start off, you need an S3 bucket. To create one programmatically, you must first choose a name for your bucket. Remember that this name must be unique throughout the whole AWS platform, as bucket names are DNS compliant. If you try to create a bucket, but another user has already claimed your desired bucket name, your code will fail. Instead of success, you will see the following error: botocore.errorfactory.BucketAlreadyExists.

You can increase your chance of success when creating your bucket by picking a random name. You can generate your own function that does that for you. In this implementation, you'll see how using the uuid module will help you achieve that. A UUID4's string representation is 36 characters long (including hyphens), and you can add a prefix to specify what each bucket is for.

Here's a way you can achieve that:

import uuid

def create_bucket_name(bucket_prefix):
    # The generated bucket name must be between 3 and 63 chars long
    return ''.join([bucket_prefix, str(uuid.uuid4())])
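
Since the UUID takes up 36 of the 63 available characters, the prefix can be at most 27 characters long. A slightly more defensive variant, sketched here under the hypothetical name create_checked_bucket_name, makes that limit explicit:

```python
import uuid

MAX_BUCKET_NAME_LENGTH = 63
UUID4_STRING_LENGTH = 36  # 32 hex digits plus 4 hyphens


def create_checked_bucket_name(bucket_prefix):
    """Append a UUID4 to the prefix, refusing prefixes that are too long."""
    max_prefix_length = MAX_BUCKET_NAME_LENGTH - UUID4_STRING_LENGTH  # 27
    if len(bucket_prefix) > max_prefix_length:
        raise ValueError(
            f'prefix must be at most {max_prefix_length} characters long')
    return ''.join([bucket_prefix, str(uuid.uuid4())])


# The 17-character prefix used in this tutorial gives a 53-character name.
print(len(create_checked_bucket_name('firstpythonbucket')))
```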


You've got your bucket name, but now there's one more thing you need to be aware of: unless your region is us-east-1 (US East, N. Virginia), you'll need to define the region explicitly when you are creating the bucket. Otherwise you will get an IllegalLocationConstraintException.

To exemplify what this means when you're creating your S3 bucket in a non-US region, take a look at the code below:

s3_resource.create_bucket(Bucket=YOUR_BUCKET_NAME,
                          CreateBucketConfiguration={
                              'LocationConstraint': 'eu-west-1'})


You need to provide both a bucket name and a bucket configuration where you must specify the region, which in my case is eu-west-1.

This isn't ideal. Imagine that you want to take your code and deploy it to the cloud. Your task will become increasingly more difficult because you've now hardcoded the region. You could refactor the region and transform it into an environment variable, but then you'd have one more thing to manage.

Luckily, there is a better way to get the region programmatically, by taking advantage of a session object. Boto3 will create the session from your credentials. You just need to take the region and pass it to create_bucket() as its LocationConstraint configuration. Here's how to do that:

def create_bucket(bucket_prefix, s3_connection):
    session = boto3.session.Session()
    current_region = session.region_name
    bucket_name = create_bucket_name(bucket_prefix)
    bucket_response = s3_connection.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={
            'LocationConstraint': current_region})
    print(bucket_name, current_region)
    return bucket_name, bucket_response


The nice part is that this code works no matter where you want to deploy it: locally/EC2/Lambda. Moreover, you don't need to hardcode your region.
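One caveat is worth guarding against: S3 expects buckets in us-east-1 to be created without a LocationConstraint. The sketch below shows one way to build the arguments for create_bucket() with that in mind. The helper name build_create_bucket_kwargs is hypothetical, and the special-casing of us-east-1 is an assumption you should verify against your own account and region:

```python
def build_create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for create_bucket().

    Assumption: for us-east-1 the CreateBucketConfiguration is
    left out, while every other region is passed explicitly as
    the LocationConstraint.
    """
    kwargs = {'Bucket': bucket_name}
    if region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {
            'LocationConstraint': region}
    return kwargs
```

You could then call s3_connection.create_bucket(**build_create_bucket_kwargs(bucket_name, current_region)) inside create_bucket().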

As both the client and the resource create buckets in the same way, you can pass either one as the s3_connection parameter.

You'll now create two buckets. First create one using the client, which gives you back the bucket_response as a dictionary:

>>> first_bucket_name, first_response = create_bucket(
...     bucket_prefix='firstpythonbucket',
...     s3_connection=s3_resource.meta.client)
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304 eu-west-1

>>> first_response
{'ResponseMetadata': {'RequestId': 'E1DCFE71EDE7C1EC', 'HostId': 'r3AP32NQk9dvbHSEPIbyYADT769VQEN/+xT2BPM6HCnuCb3Z/GhR2SBP+GM7IjcxbBN7SQ+k+9B=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'r3AP32NQk9dvbHSEPIbyYADT769VQEN/+xT2BPM6HCnuCb3Z/GhR2SBP+GM7IjcxbBN7SQ+k+9B=', 'x-amz-request-id': 'E1DCFE71EDE7C1EC', 'date': 'Fri, 05 Oct 2018 15:00:00 GMT', 'location': 'http://firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304.s3.amazonaws.com/', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'Location': 'http://firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304.s3.amazonaws.com/'}


Then create a second bucket using the resource, which gives you back a Bucket instance as the bucket_response:

>>> second_bucket_name, second_response = create_bucket(
...     bucket_prefix='secondpythonbucket', s3_connection=s3_resource)
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644 eu-west-1

>>> second_response
s3.Bucket(name='secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644')


You've got your buckets. Next, you'll want to start adding some files to them.

You can name your objects by using standard file naming conventions. You can use any valid name. In this article, you'll look at a more specific case that helps you understand how S3 works under the hood.

If you're planning on hosting a large number of files in your S3 bucket, there's something you should keep in mind. If all your file names have a deterministic prefix that gets repeated for every file, such as a timestamp format like "YYYY-MM-DDThh:mm:ss", then you will soon find that you're running into performance issues when you're trying to interact with your bucket.

This will happen because S3 takes the prefix of the file and maps it onto a partition. The more files you add, the more will be assigned to the same partition, and that partition will be very heavy and less responsive.

What can you do to keep that from happening?

The easiest solution is to randomize the file name. You can imagine many different implementations, but in this case, you'll use the trusted uuid module to help with that. To make the file names easier to read for this tutorial, you'll take the first six characters of the generated number's hex representation and concatenate them with your base file name.
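To get a feel for why a random prefix helps, the sketch below compares how keys group by their leading characters. It is a simplified illustration that only counts distinct two-character prefixes. It does not model how S3 actually partitions keys, and all helper names are hypothetical:

```python
import uuid
from datetime import datetime, timedelta

def timestamp_keys(count):
    """Keys that all start with a 'YYYY-MM-DDThh:mm:ss' timestamp."""
    start = datetime(2018, 10, 5, 15, 0, 0)
    return [
        (start + timedelta(seconds=i)).strftime('%Y-%m-%dT%H:%M:%S') + '.txt'
        for i in range(count)]

def randomized_keys(count):
    """Keys that start with six random hex characters."""
    return [uuid.uuid4().hex[:6] + 'file.txt' for _ in range(count)]

def distinct_prefixes(keys, length=2):
    """Number of different leading substrings of the given length."""
    return len({key[:length] for key in keys})

# Every timestamp key starts with '20', so there is a single prefix.
print(distinct_prefixes(timestamp_keys(1000)))
# Random hex prefixes spread over many values (at most 256 here).
print(distinct_prefixes(randomized_keys(1000)))
```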

The helper function below allows you to pass in the number of bytes you want the file to have, the file name, and a sample content for the file to be repeated to make up the desired file size:

def create_temp_file(size, file_name, file_content):
    random_file_name = ''.join([str(uuid.uuid4().hex[:6]), file_name])
    with open(random_file_name, 'w') as f:
        f.write(str(file_content) * size)
    return random_file_name


Create your first file, which you'll be using shortly:

first_file_name = create_temp_file(300, 'firstfile.txt', 'f')


### Creating Bucket and Object Instances

The next step after creating your file is to see how to integrate it into your S3 workflow.

This is where the resource's classes play an important role, as these abstractions make it easy to work with S3.

By using the resource, you have access to the high-level classes (Bucket and Object). This is how you can create one of each:

first_bucket = s3_resource.Bucket(name=first_bucket_name)
first_object = s3_resource.Object(
    bucket_name=first_bucket_name, key=first_file_name)


The reason you have not seen any errors with creating the first_object variable is that Boto3 doesn't make calls to AWS to create the reference. The bucket_name and the key are called identifiers, and they are the necessary parameters to create an Object. Any other attribute of an Object, such as its size, is lazily loaded. This means that for Boto3 to get the requested attributes, it has to make calls to AWS.

### Understanding Sub-resources

Bucket and Object are sub-resources of one another. Sub-resources are methods that create a new instance of a child resource. The parent's identifiers get passed to the child resource.

If you have a Bucket variable, you can create an Object directly:

first_object_again = first_bucket.Object(first_file_name)


Or if you have an Object variable, then you can get the Bucket:

first_bucket_again = first_object.Bucket()


Great, you now understand how to generate a Bucket and an Object. Next, you'll get to upload your newly generated file to S3 using these constructs.

There are three ways you can upload a file:

• From an Object instance
• From a Bucket instance
• From the client

In each case, you have to provide the Filename, which is the path of the file you want to upload. You'll now explore the three alternatives. Feel free to pick whichever you like most to upload the first_file_name to S3.

Object Instance Version

You can upload using an Object instance:

s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    Filename=first_file_name)


Or you can use the first_object instance:

first_object.upload_file(first_file_name)


Bucket Instance Version

Here's how you can upload using a Bucket instance:

s3_resource.Bucket(first_bucket_name).upload_file(
    Filename=first_file_name, Key=first_file_name)


Client Version

You can also upload using the client:

s3_resource.meta.client.upload_file(
    Filename=first_file_name, Bucket=first_bucket_name,
    Key=first_file_name)


You have successfully uploaded your file to S3 using one of the three available methods. In the upcoming sections, you'll mainly work with the Object class, as the operations are very similar between the client and the Bucket versions.

To download a file from S3 locally, you'll follow steps similar to the ones you used when uploading. But in this case, the Filename parameter will map to your desired local path. This time, it will download the file to the /tmp directory:

s3_resource.Object(first_bucket_name, first_file_name).download_file(
    f'/tmp/{first_file_name}')  # Python 3.6+
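The hardcoded /tmp path only exists on Unix-like systems. As a hedged alternative, the sketch below builds the local path with the standard tempfile module. The helper names are hypothetical, and s3_object is assumed to be a Boto3 Object instance such as s3_resource.Object(first_bucket_name, first_file_name):

```python
import os
import tempfile

def temp_download_path(file_name):
    """Build a path for file_name inside the platform's temp directory."""
    return os.path.join(tempfile.gettempdir(), file_name)

def download_to_temp(s3_object, file_name):
    """Download an object and return the local path it was saved to.

    `s3_object` is assumed to be a Boto3 Object instance.
    """
    local_path = temp_download_path(file_name)
    s3_object.download_file(local_path)
    return local_path
```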


You've successfully downloaded your file from S3. Next, you'll see how to copy the same file between your S3 buckets using a single API call.

### Copying an Object Between Buckets

If you need to copy files from one bucket to another, Boto3 offers you that possibility. In this example, you'll copy the file from the first bucket to the second, using .copy():

def copy_to_bucket(bucket_from_name, bucket_to_name, file_name):
    copy_source = {
        'Bucket': bucket_from_name,
        'Key': file_name
    }
    s3_resource.Object(bucket_to_name, file_name).copy(copy_source)

copy_to_bucket(first_bucket_name, second_bucket_name, first_file_name)


Note: If you're aiming to replicate your S3 objects to a bucket in a different region, have a look at Cross Region Replication.

### Deleting an Object

Let's delete the new file from the second bucket by calling .delete() on the equivalent Object instance:

s3_resource.Object(second_bucket_name, first_file_name).delete()


You've now seen how to use S3's core operations. You're ready to take your knowledge to the next level with more complex characteristics in the upcoming sections.

In this section, you're going to explore more elaborate S3 features. You'll see examples of how to use them and the benefits they can bring to your applications.

### ACL (Access Control Lists)

By default, when you upload an object to S3, that object is private. If you want to make this object available to someone else, you can set the object's ACL to be public at creation time. Here's how you upload a new file to the bucket and make it accessible to everyone:

second_file_name = create_temp_file(400, 'secondfile.txt', 's')
second_object = s3_resource.Object(first_bucket.name, second_file_name)
second_object.upload_file(second_file_name, ExtraArgs={
    'ACL': 'public-read'})


You can get the ObjectAcl instance from the Object, as it is one of its sub-resource classes:

second_object_acl = second_object.Acl()


To see who has access to your object, use the grants attribute:

>>> second_object_acl.grants
[{'Grantee': {'DisplayName': 'name', 'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}, {'Grantee': {'Type': 'Group', 'URI': 'http://acs.amazonaws.com/groups/global/AllUsers'}, 'Permission': 'READ'}]


You can make your object private again, without needing to re-upload it:

>>> response = second_object_acl.put(ACL='private')
>>> second_object_acl.grants
[{'Grantee': {'DisplayName': 'name', 'ID': '24aafdc2053d49629733ff0141fc9fede3bf77c7669e4fa2a4a861dd5678f4b5', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}]
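If you want to check for public access in code rather than by eye, a small helper can scan the grants list. This is a minimal sketch: the function name is_publicly_readable is hypothetical, and it only looks for the AllUsers group URI shown in the output above:

```python
ALL_USERS_URI = 'http://acs.amazonaws.com/groups/global/AllUsers'

def is_publicly_readable(grants):
    """Return True if any grant lets the AllUsers group read the object."""
    for grant in grants:
        grantee = grant.get('Grantee', {})
        if (grantee.get('URI') == ALL_USERS_URI
                and grant.get('Permission') in ('READ', 'FULL_CONTROL')):
            return True
    return False

# Grants shaped like the two outputs above (owner ID shortened).
public_grants = [
    {'Grantee': {'DisplayName': 'name', 'ID': 'owner-id',
                 'Type': 'CanonicalUser'},
     'Permission': 'FULL_CONTROL'},
    {'Grantee': {'Type': 'Group', 'URI': ALL_USERS_URI},
     'Permission': 'READ'},
]
private_grants = public_grants[:1]

print(is_publicly_readable(public_grants))
print(is_publicly_readable(private_grants))
```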


You have seen how you can use ACLs to manage access to individual objects. Next, you'll see how you can add an extra layer of security to your objects by using encryption.

Note: If you're looking to split your data into multiple categories, have a look at tags. You can grant access to the objects based on their tags.

### Encryption

With S3, you can protect your data using encryption. You'll explore server-side encryption using the AES-256 algorithm where AWS manages both the encryption and the keys.

Create a new file and upload it using ServerSideEncryption:

third_file_name = create_temp_file(300, 'thirdfile.txt', 't')
third_object = s3_resource.Object(first_bucket_name, third_file_name)
third_object.upload_file(third_file_name, ExtraArgs={
    'ServerSideEncryption': 'AES256'})


You can check the algorithm that was used to encrypt the file, in this case AES256:

>>> third_object.server_side_encryption
'AES256'


You now understand how to add an extra layer of protection to your objects using the AES-256 server-side encryption algorithm offered by AWS.

### Storage

Every object that you add to your S3 bucket is associated with a storage class. All the available storage classes offer high durability. You choose how you want to store your objects based on your application's performance access requirements.

At present, you can use the following storage classes with S3:

• STANDARD: default for frequently accessed data
• STANDARD_IA: for infrequently used data that needs to be retrieved rapidly when requested
• ONEZONE_IA: for the same use case as STANDARD_IA, but stores the data in one Availability Zone instead of three
• REDUCED_REDUNDANCY: for frequently used noncritical data that is easily reproducible

If you want to change the storage class of an existing object, you need to recreate the object.

For example, reupload the third_object and set its storage class to STANDARD_IA:

third_object.upload_file(third_file_name, ExtraArgs={
    'ServerSideEncryption': 'AES256',
    'StorageClass': 'STANDARD_IA'})


Note: If you make changes to your object, you might find that your local instance doesn't show them. What you need to do at that point is call .reload() to fetch the newest version of your object.

Reload the object, and you can see its new storage class:

>>> third_object.reload()
>>> third_object.storage_class
'STANDARD_IA'


Note: Use LifeCycle Configurations to transition objects through the different classes as you find the need for them. They will automatically transition these objects for you.

### Versioning

You should use versioning to keep a complete record of your objects over time. It also acts as a protection mechanism against accidental deletion of your objects. When you request a versioned object, Boto3 will retrieve the latest version.

When you add a new version of an object, the total storage that object takes up is the sum of the sizes of its versions. So if you're storing an object of 1 GB, and you create 10 versions, then you have to pay for 10 GB of storage.

Enable versioning for the first bucket. To do this, you need to use the BucketVersioning class:

def enable_bucket_versioning(bucket_name):
    bkt_versioning = s3_resource.BucketVersioning(bucket_name)
    bkt_versioning.enable()
    print(bkt_versioning.status)

>>> enable_bucket_versioning(first_bucket_name)
Enabled


Then create two new versions for the first file Object, one with the contents of the original file and one with the contents of the third file:

s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    first_file_name)
s3_resource.Object(first_bucket_name, first_file_name).upload_file(
    third_file_name)


Now reupload the second file, which will create a new version:

s3_resource.Object(first_bucket_name, second_file_name).upload_file(
second_file_name)


You can retrieve the latest available version of your objects like so:

>>> s3_resource.Object(first_bucket_name, first_file_name).version_id


In this section, you've seen how to work with some of the most important S3 attributes and add them to your objects. Next, you'll see how to easily traverse your buckets and objects.

## Traversals

If you need to retrieve information from or apply an operation to all your S3 resources, Boto3 gives you several ways to iteratively traverse your buckets and your objects. You'll start by traversing all your created buckets.

### Bucket Traversal

To traverse all the buckets in your account, you can use the resource's buckets attribute alongside .all(), which gives you the complete list of Bucket instances:

>>> for bucket in s3_resource.buckets.all():
...     print(bucket.name)
...
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644


You can use the client to retrieve the bucket information as well, but the code is more complex, as you need to extract it from the dictionary that the client returns:

>>> for bucket_dict in s3_resource.meta.client.list_buckets().get('Buckets'):
...     print(bucket_dict['Name'])
...
firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304
secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644
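The dictionary handling can be isolated in a small helper, which also makes it easy to try out without calling AWS. In this sketch, bucket_names is a hypothetical helper and sample_response is a trimmed-down stand-in for what list_buckets() returns (a real response carries more fields per bucket):

```python
def bucket_names(list_buckets_response):
    """Extract bucket names from a list_buckets() response dictionary."""
    return [bucket['Name']
            for bucket in list_buckets_response.get('Buckets', [])]

# A trimmed-down response with the two buckets from this tutorial.
sample_response = {
    'Buckets': [
        {'Name': 'firstpythonbucket7250e773-c4b1-422a-b51f-c45a52af9304'},
        {'Name': 'secondpythonbucket2d5d99c5-ab96-4c30-b7f7-443a95f72644'},
    ]
}

for name in bucket_names(sample_response):
    print(name)
```

Passing a default of [] to .get() keeps the helper from failing if the 'Buckets' key is missing.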


You have seen how to iterate through the buckets you have in your account. In the upcoming section, you'll pick one of your buckets and iteratively view the objects it contains.

### Object Traversal

If you want to list all the objects from a bucket, the following code will generate an iterator for you:

>>> for obj in first_bucket.objects.all():
...     print(obj.key)
...
127367firstfile.txt
616abesecondfile.txt
fb937cthirdfile.txt


The obj variable is an ObjectSummary. This is a lightweight representation of an Object. The summary version doesn't support all of the attributes that the Object has. If you need to access them, use the Object() sub-resource to create a new reference to the underlying stored key. Then you'll be able to extract the missing attributes:

>>> for obj in first_bucket.objects.all():
...     subsrc = obj.Object()
...     print(obj.key, obj.storage_class, obj.last_modified,
...           subsrc.version_id, subsrc.metadata)
...
127367firstfile.txt STANDARD 2018-10-05 15:09:46+00:00 eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv {}
616abesecondfile.txt STANDARD 2018-10-05 15:09:47+00:00 WIaExRLmoksJzLhN7jU5YzoJxYSu6Ey6 {}
fb937cthirdfile.txt STANDARD_IA 2018-10-05 15:09:05+00:00 null {}


You can now iteratively perform operations on your buckets and objects. You're almost done. There's one more thing you should know at this stage: how to delete all the resources you've created in this tutorial.

## Deleting Buckets and Objects

To remove all the buckets and objects you have created, you must first make sure that your buckets have no objects within them.

### Deleting a Non-empty Bucket

To be able to delete a bucket, you must first delete every single object within the bucket, or else the BucketNotEmpty exception will be raised. When you have a versioned bucket, you need to delete every object and all its versions.

If you find that a LifeCycle rule that will do this automatically for you isn't suitable to your needs, here's how you can programmatically delete the objects:

def delete_all_objects(bucket_name):
    res = []
    bucket = s3_resource.Bucket(bucket_name)
    for obj_version in bucket.object_versions.all():
        res.append({'Key': obj_version.object_key,
                    'VersionId': obj_version.id})
    print(res)
    bucket.delete_objects(Delete={'Objects': res})


The above code works whether or not you have enabled versioning on your bucket. If you haven't, the version of the objects will be null. You can batch up to 1000 deletions in one API call, using .delete_objects() on your Bucket instance, which is more cost-effective than individually deleting each object.
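Because of that 1000-item limit, a bucket with more object versions than that needs several calls. The sketch below shows one way to split the work into batches. Both helper names are hypothetical, and bucket is assumed to be a Boto3 Bucket instance:

```python
def chunked(items, size=1000):
    """Yield consecutive slices of items, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def delete_in_batches(bucket, object_identifiers, batch_size=1000):
    """Call bucket.delete_objects() once per batch.

    `bucket` is assumed to be a Boto3 Bucket instance and
    `object_identifiers` a list of {'Key': ..., 'VersionId': ...} dicts.
    """
    for batch in chunked(object_identifiers, batch_size):
        bucket.delete_objects(Delete={'Objects': batch})
```

With 2500 identifiers, for example, this would issue three calls holding 1000, 1000, and 500 items.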

Run the new function against the first bucket to remove all the versioned objects:

>>> delete_all_objects(first_bucket_name)
[{'Key': '127367firstfile.txt', 'VersionId': 'eQgH6IC1VGcn7eXZ_.ayqm6NdjjhOADv'}, {'Key': '127367firstfile.txt', 'VersionId': 'UnQTaps14o3c1xdzh09Cyqg_hq4SjB53'}, {'Key': '127367firstfile.txt', 'VersionId': 'null'}, {'Key': '616abesecondfile.txt', 'VersionId': 'WIaExRLmoksJzLhN7jU5YzoJxYSu6Ey6'}, {'Key': '616abesecondfile.txt', 'VersionId': 'null'}, {'Key': 'fb937cthirdfile.txt', 'VersionId': 'null'}]


As a final test, you can upload a file to the second bucket. This bucket doesn't have versioning enabled, and thus the version will be null. Apply the same function to remove the contents:

>>> s3_resource.Object(second_bucket_name, first_file_name).upload_file(
...     first_file_name)
>>> delete_all_objects(second_bucket_name)
[{'Key': '9c8b44firstfile.txt', 'VersionId': 'null'}]


You've successfully removed all the objects from both your buckets. You're now ready to delete the buckets.

### Deleting Buckets

To finish off, you'll use .delete() on your Bucket instance to remove the first bucket:

s3_resource.Bucket(first_bucket_name).delete()


If you want, you can use the client version to remove the second bucket:

s3_resource.meta.client.delete_bucket(Bucket=second_bucket_name)


Both the operations were successful because you emptied each bucket before attempting to delete it.

You've now run some of the most important operations that you can perform with S3 and Boto3. Congratulations on making it this far! As a bonus, let's explore some of the advantages of managing S3 resources with Infrastructure as Code.

## Python Code or Infrastructure as Code (IaC)?

As you've seen, most of the interactions you've had with S3 in this tutorial had to do with objects. You didn't see many bucket-related operations, such as adding policies to the bucket, adding a LifeCycle rule to transition your objects through the storage classes, archive them to Glacier, or delete them altogether, or enforcing that all objects be encrypted by configuring Bucket Encryption.

Manually managing the state of your buckets via Boto3's clients or resources becomes increasingly difficult as your application starts adding other services and grows more complex. To monitor your infrastructure in concert with Boto3, consider using an Infrastructure as Code (IaC) tool such as CloudFormation or Terraform to manage your application's infrastructure. Either one of these tools will maintain the state of your infrastructure and inform you of the changes that you've applied.

If you decide to go down this route, keep the following in mind:

• Any bucket-related operation that modifies the bucket in any way should be done via IaC.
• If you want all your objects to act in the same way (all encrypted, or all public, for example), usually there is a way to do this directly using IaC, by adding a Bucket Policy or a specific Bucket property.
• Bucket read operations, such as iterating through the contents of a bucket, should be done using Boto3.
• Object-related operations at an individual object level should be done using Boto3.

## Conclusion

Congratulations on making it to the end of this tutorial!

You're now equipped to start working programmatically with S3. You now know how to create objects, upload them to S3, download their contents and change their attributes directly from your script, all while avoiding common pitfalls with Boto3.

May this tutorial be a stepping stone in your journey to building something great using AWS!


17 Oct 2018 2:00pm GMT

#### Stack Abuse: Creating a Neural Network from Scratch in Python: Multi-class Classification

This is the third article in the series of articles on "Creating a Neural Network From Scratch in Python".

If you have no prior experience with neural networks, I would suggest you first read Part 1 and Part 2 of the series (linked above). Once you feel comfortable with the concepts explained in those articles, you can come back and continue this article.

### Introduction

In the previous article, we saw how we can create a neural network from scratch, which is capable of solving binary classification problems, in Python. A binary classification problem has only two outputs. However, real-world problems are far more complex.

Consider the digit recognition problem, where we use the image of a digit as input and the classifier predicts the corresponding digit. A digit can be any number between 0 and 9. This is a classic example of a multi-class classification problem, where the input may belong to any of the 10 possible outputs.

In this article, we will see how we can create a simple neural network from scratch in Python, which is capable of solving multi-class classification problems.

### Dataset

Let's first briefly take a look at our dataset. Our dataset will have two input features and one of three possible outputs. We will manually create a dataset for this article.

To do so, execute the following script:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

cat_images = np.random.randn(700, 2) + np.array([0, -3])
mouse_images = np.random.randn(700, 2) + np.array([3, 3])
dog_images = np.random.randn(700, 2) + np.array([-3, 3])



In the script above, we start by importing our libraries and then create three two-dimensional arrays of size 700 x 2. You can think of each row of an array as an image of a particular animal. Each array corresponds to one of the three output classes.

An important point to note here is that if we plot the elements of the cat_images array on a two-dimensional plane, they will be centered around x=0 and y=-3. Similarly, the elements of the mouse_images array will be centered around x=3 and y=3, and finally, the elements of the dog_images array will be centered around x=-3 and y=3. You will see this once we plot our dataset.
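We can also confirm these centers numerically before plotting. The following sketch recreates the three arrays with the same seed and prints the column-wise mean of each one. The loop and the rounding are additions for illustration only:

```python
import numpy as np

np.random.seed(42)

cat_images = np.random.randn(700, 2) + np.array([0, -3])
mouse_images = np.random.randn(700, 2) + np.array([3, 3])
dog_images = np.random.randn(700, 2) + np.array([-3, 3])

# The column-wise mean of each array should be close to its offset.
for name, images in [('cat', cat_images),
                     ('mouse', mouse_images),
                     ('dog', dog_images)]:
    print(name, images.mean(axis=0).round(2))
```

Each printed pair should land close to the offset that was added to the corresponding array, since np.random.randn draws values with a mean of zero.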

Next, we need to vertically join these arrays to create our final dataset. Execute the following script to do so:

feature_set = np.vstack([cat_images, mouse_images, dog_images])



We created our feature set, and now we need to define corresponding labels for each record in our feature set. The following script does that:

labels = np.array([0]*700 + [1]*700 + [2]*700)



The above script creates a one-dimensional array of 2100 elements. The first 700 elements have been labeled as 0, the next 700 elements have been labeled as 1 while the last 700 elements have been labeled as 2. This is just our shortcut way of quickly creating the labels for our corresponding data.

For multi-class classification problems, we need to define the output label as a one-hot encoded vector, since our output layer will have three nodes and each node will correspond to one output class. When an output is predicted, we want the value of the corresponding node to be 1 while the remaining nodes have a value of 0. For that, we need three values for the output label of each record. This is why we convert our output vector into a one-hot encoded vector.

Execute the following script to create the one-hot encoded vector array for our dataset:

one_hot_labels = np.zeros((2100, 3))

for i in range(2100):
    one_hot_labels[i, labels[i]] = 1



In the above script, we create the one_hot_labels array of size 2100 x 3, where each row contains a one-hot encoded vector for the corresponding record in the feature set. We then insert 1 in the corresponding column.

If you execute the above script, you will see that the one_hot_labels array has 1 at index 0 for the first 700 records, 1 at index 1 for the next 700 records, and 1 at index 2 for the last 700 records.
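As an aside, the same array can be built without an explicit loop. The sketch below is an alternative to the script above, not a replacement for it. It relies on the fact that row i of an identity matrix is the one-hot vector for class i:

```python
import numpy as np

labels = np.array([0]*700 + [1]*700 + [2]*700)

# Row i of the 3 x 3 identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels builds all rows at once.
one_hot_labels = np.eye(3)[labels]

print(one_hot_labels.shape)
print(one_hot_labels[0], one_hot_labels[700], one_hot_labels[1400])
```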

Now let's plot the dataset that we just created. Execute the following script:

plt.scatter(feature_set[:,0], feature_set[:,1], c=labels, cmap='plasma', s=100, alpha=0.5)
plt.show()



Once you execute the above script, you should see the following figure:

You can clearly see that we have elements belonging to three different classes. Our task will be to develop a neural network capable of classifying data into the aforementioned classes.

### Neural Network with Multiple Output Classes

The neural network that we are going to design has the following architecture:

You can see that our neural network is pretty similar to the one we developed in Part 2 of the series. It has an input layer with 2 input features and a hidden layer with 4 nodes. However, in the output layer, we can see that we have three nodes. This means that our neural network is capable of solving the multi-class classification problem where the number of possible outputs is 3.

#### Softmax and Cross-Entropy Functions

Before we move on to the code section, let us briefly review the softmax and cross entropy functions, which are respectively the most commonly used activation and loss functions for creating a neural network for multi-class classification.

##### Softmax Function

From the architecture of our neural network, we can see that we have three nodes in the output layer. We have several options for the activation function at the output layer. One option is to use the sigmoid function, as we did in the previous articles.

However, there is a more convenient activation function in the form of softmax, which takes a vector as input and produces another vector of the same length as output. Since our output layer contains three nodes, we can consider the output from each node as one element of the input vector. The output will be a vector of the same length in which the values of all the elements sum to 1. Mathematically, for an input vector with K elements, the softmax function can be represented as:

$$y_i(z_i) = \frac{e^{z_i}}{ \sum\nolimits_{k=1}^{K}{e^{z_k}} }$$

The softmax function simply divides the exponential of each input element by the sum of the exponentials of all the input elements. Let's take a look at a simple example of this:

def softmax(A):
    expA = np.exp(A)
    return expA / expA.sum()

nums = np.array([4, 5, 6])
print(softmax(nums))



In the script above we create a softmax function that takes a single vector as input, takes exponents of all the elements in the vector and then divides the resulting numbers individually by the sum of exponents of all the numbers in the input vector.

You can see that the input vector contains elements 4, 5 and 6. In the output, you will see three numbers squashed between 0 and 1 where the sum of the numbers will be equal to 1. The output looks like this:

[0.09003057 0.24472847 0.66524096]



The softmax activation function has two major advantages over other activation functions, particularly for multi-class classification problems. The first is that it takes a vector as input; the second is that it produces outputs between 0 and 1. Remember, in our dataset we have one-hot encoded output labels, which means that our target values are either 0 or 1. However, the raw output of the feedforward process can be greater than 1, so the softmax function is a natural choice at the output layer since it squashes the output between 0 and 1.
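One practical detail is worth a short sketch: np.exp overflows for large inputs. A common remedy, shown below as an assumption-labeled variant rather than the version used in this article, is to subtract the largest element before exponentiating, which leaves the result unchanged. The name stable_softmax is hypothetical:

```python
import numpy as np

def stable_softmax(A):
    """Softmax over the last axis, shifted by the max for stability."""
    A = np.asarray(A, dtype=float)
    shifted = A - A.max(axis=-1, keepdims=True)
    expA = np.exp(shifted)
    return expA / expA.sum(axis=-1, keepdims=True)

# Same probabilities as the softmax example with inputs 4, 5 and 6.
print(stable_softmax([4, 5, 6]))
# Shifting every input by the same amount does not change the output,
# and np.exp(1000) on its own would overflow.
print(stable_softmax([1000, 1001, 1002]))
```

Because it works along the last axis, the same function also handles a two-dimensional batch, producing one probability vector per row.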

##### Cross-Entropy Function

With the softmax activation function at the output layer, the mean squared error cost function can be used for optimizing the cost, as we did in the previous articles. However, for the softmax function, a more convenient cost function exists, which is called cross-entropy.

Mathematically, the cross-entropy function looks like this:

$$H(y,\hat{y}) = -\sum_i y_i \log \hat{y_i}$$

The cross-entropy is simply the sum of the products of all the actual probabilities with the negative log of the predicted probabilities. For multi-class classification problems, the cross-entropy function is known to outperform the mean squared error function.
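To make the formula concrete, here is a minimal sketch that evaluates it for a single record. The function name cross_entropy is an illustrative choice, and the predicted probabilities are the softmax output for the inputs 4, 5 and 6 shown earlier:

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.sum(y_true * np.log(y_pred))

y_pred = np.array([0.09003057, 0.24472847, 0.66524096])

# The true class received the highest predicted probability.
print(cross_entropy([0, 0, 1], y_pred))
# The true class received the lowest predicted probability.
print(cross_entropy([1, 0, 0], y_pred))
```

With a one-hot target, only the term for the true class survives, so the cost is the negative log of the probability assigned to that class: roughly 0.41 in the first case and 2.41 in the second. The worse the prediction, the higher the cost.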

Now we have sufficient knowledge to create a neural network that solves multi-class classification problems. Let's see how our neural network will work.

As always, a neural network executes in two steps: Feed-forward and back-propagation.

#### Feed Forward

The feedforward phase will remain more or less similar to what we saw in the previous article. The only difference is that now we will use the softmax activation function at the output layer rather than sigmoid function.

Remember, for the hidden layer output we will still use the sigmoid function as we did previously. The softmax function will be used only for the output layer activations.

##### Phase 1

Since we are using two different activation functions for the hidden layer and the output layer, I have divided the feed-forward phase into two sub-phases.

In the first phase, we will see how to calculate output from the hidden layer. For each input record, we have two features "x1" and "x2". To calculate the output values for each node in the hidden layer, we have to multiply the input with the corresponding weights of the hidden layer node for which we are calculating the value. Notice, we are also adding a bias term here. We then pass the dot product through sigmoid activation function to get the final value.

For instance to calculate the final value for the first node in the hidden layer, which is denoted by "ah1", you need to perform the following calculation:

$$zh1 = x1w1 + x2w2 + b$$

$$ah1 = \frac{\mathrm{1} }{\mathrm{1} + e^{-zh1} }$$

This is the resulting value for the top-most node in the hidden layer. In the same way, you can calculate the values for the 2nd, 3rd, and 4th nodes of the hidden layer.
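As a small worked example of these two equations, the sketch below plugs in made-up numbers. The input, weight, and bias values are hypothetical and chosen only so the arithmetic is easy to follow:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical inputs, weights, and bias for the first hidden node.
x1, x2 = 1.0, 2.0
w1, w2 = 0.5, -0.25
b = 0.0

zh1 = x1 * w1 + x2 * w2 + b   # 1.0*0.5 + 2.0*(-0.25) + 0.0 = 0.0
ah1 = sigmoid(zh1)            # sigmoid(0) = 0.5

print(zh1, ah1)
```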

##### Phase 2

To calculate the values for the output layer, the values in the hidden layer nodes are treated as inputs. Therefore, to calculate the output, multiply the values of the hidden layer nodes with their corresponding weights and pass the result through an activation function, which will be softmax in this case.

This operation can be mathematically expressed by the following equation:

$$zo1 = ah1w9 + ah2w10 + ah3w11 + ah4w12$$

$$zo2 = ah1w13 + ah2w14 + ah3w15 + ah4w16$$

$$zo3 = ah1w17 + ah2w18 + ah3w19 + ah4w20$$

Here zo1, zo2, and zo3 form the vector that we will use as input to the softmax function. Let's name this vector "zo".

zo = [zo1, zo2, zo3]



Now, to find the output value ao1, we can use the softmax function as follows:

$$ao1(zo) = \frac{e^{zo1}}{ \sum\nolimits_{k=1}^{3}{e^{zok}} }$$

Here "ao1" is the output for the top-most node in the output layer. In the same way, you can use the softmax function to calculate the values for ao2 and ao3.
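
As a rough sketch of this second phase, the snippet below computes the vector "zo" from four hidden-layer activations and passes it through softmax. The activations and weights are hypothetical numbers. As in the equations above, the output-layer bias is left out here, although the full script later adds it.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical hidden-layer activations ah1..ah4
ah = np.array([0.45, 0.60, 0.30, 0.80])

# Hypothetical output-layer weights: column j feeds output node j
wo = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6],
               [0.7, 0.8, 0.9],
               [1.0, 1.1, 1.2]])

zo = np.dot(ah, wo)   # the vector [zo1, zo2, zo3]
ao = softmax(zo)      # the vector [ao1, ao2, ao3]

print(zo)
print(ao, ao.sum())
```

Whatever the values in "zo", the three outputs are positive and sum to 1, so they can be read as class probabilities.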

You can see that the feed-forward step for a neural network with multi-class output is pretty similar to the feed-forward step of the neural network for binary classification problems. The only difference is that here we are using softmax function at the output layer rather than the sigmoid function.

#### Back-Propagation

The basic idea behind back-propagation remains the same. We have to define a cost function and then optimize that cost function by updating the weights such that the cost is minimized. However, unlike the previous articles, where we used mean squared error as the cost function, in this article we will use the cross-entropy function.

Back-propagation is an optimization problem where we have to find the function minima for our cost function.

To find the minima of a function, we can use the gradient descent algorithm, which can be mathematically represented as follows:

$$repeat \ until \ convergence: \begin{Bmatrix} w_j := w_j - \alpha \frac{\partial }{\partial w_j} J(w_0, w_1, \ldots, w_n) \end{Bmatrix}$$

The details of how gradient descent minimizes the cost have already been discussed in the previous article. Here we will just see the mathematical operations that we need to perform.
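
To make the update rule concrete, here is a minimal sketch that applies it to a toy cost function with a single weight. The function `(w - 3)^2`, the starting point, and the learning rate are assumptions picked for illustration. They are not part of the network in this article.

```python
def cost(w):
    # Toy cost with its minimum at w = 3
    return (w - 3.0) ** 2

def cost_gradient(w):
    # Derivative of the toy cost with respect to w
    return 2.0 * (w - 3.0)

w = 0.0        # initial weight
alpha = 0.1    # learning rate

for _ in range(200):
    # w := w - alpha * dJ/dw
    w = w - alpha * cost_gradient(w)

print(w, cost(w))
```

Each step moves the weight a little in the direction that lowers the cost, so `w` ends up very close to 3.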

Our cost function is:

$$H(y,\hat{y}) = -\sum_i y_i \log \hat{y_i}$$

In our neural network, we have an output vector where each element of the vector corresponds to output from one node in the output layer. The output vector is calculated using the softmax function. If "ao" is the vector of the predicted outputs from all output nodes and "y" is the vector of the actual outputs of the corresponding nodes in the output vector, we have to basically minimize this function:

$$cost(y, {ao}) = -\sum_i y_i \log {ao_i}$$
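
The following sketch evaluates this cost for one record. The two prediction vectors are hypothetical and are only meant to show that a confident correct prediction costs less than a wrong one.

```python
import numpy as np

def cross_entropy(y, ao):
    # cost(y, ao) = -sum_i y_i * log(ao_i)
    return -np.sum(y * np.log(ao))

y = np.array([0.0, 1.0, 0.0])        # one-hot actual output: class 1

ao_good = np.array([0.1, 0.8, 0.1])  # hypothetical prediction favouring class 1
ao_bad = np.array([0.6, 0.2, 0.2])   # hypothetical prediction favouring class 0

print(cross_entropy(y, ao_good))
print(cross_entropy(y, ao_bad))
```

Because "y" is one-hot, only the predicted probability of the actual class contributes to the sum.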

##### Phase 1

In the first phase, we need to update weights w9 up to w20. These are the weights of the output layer nodes.

From the previous article, we know that to minimize the cost function, we have to update weight values such that the cost decreases. To do so, we need to take the derivative of the cost function with respect to each weight. Mathematically we can represent it as:

$$\frac {dcost}{dwo} = \frac {dcost}{dao} *\ \frac {dao}{dzo} * \frac {dzo}{dwo} ..... (1)$$

Here "wo" refers to the weights in the output layer.

The first part of the equation can be represented as:

$$\frac {dcost}{dao} *\ \frac {dao}{dzo} ....... (2)$$

The detailed derivation of cross-entropy loss function with softmax activation function can be found at this link.

Equation 2 simplifies to:

$$\frac {dcost}{dao} *\ \frac {dao}{dzo} = ao - y ....... (3)$$

Here "ao" is the predicted output and "y" is the actual output.

Finally, we need to find the derivative of "zo" with respect to "wo" from Equation 1. This derivative is simply the output coming from the hidden layer, as shown below:

$$\frac {dzo}{dwo} = ah$$

To find new weight values, the values returned by Equation 1 can be simply multiplied with the learning rate and subtracted from the current weight values.
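
One way to convince yourself of Equations 1 and 3 is to compare them against a numerical estimate of the derivative. The sketch below does this for a single record with random, hypothetical values. It uses an outer product because there is only one record, whereas the full script uses a dot product over all records.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def cost(wo, ah, y):
    ao = softmax(np.dot(ah, wo))
    return -np.sum(y * np.log(ao))

np.random.seed(0)
ah = np.random.rand(4)          # hypothetical hidden-layer outputs
wo = np.random.rand(4, 3)       # hypothetical output-layer weights
y = np.array([0.0, 0.0, 1.0])   # one-hot actual output

ao = softmax(np.dot(ah, wo))
dcost_dzo = ao - y                     # Equation 3
dcost_dwo = np.outer(ah, dcost_dzo)    # Equation 1: ah times (ao - y)

# Numerical estimate of the derivative for one weight
eps = 1e-6
wo_shift = wo.copy()
wo_shift[1, 2] += eps
numeric = (cost(wo_shift, ah, y) - cost(wo, ah, y)) / eps

# One gradient descent step on the output-layer weights
lr = 0.1
wo_new = wo - lr * dcost_dwo

print(dcost_dwo[1, 2], numeric)
print(cost(wo, ah, y), cost(wo_new, ah, y))
```

The analytical and numerical values should agree to several decimal places, and the cost after the update should be lower than before.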

We also need to update the bias "bo" for the output layer. We need to differentiate our cost function with respect to bias to get new bias value as shown below:

$$\frac {dcost}{dbo} = \frac {dcost}{dao} *\ \frac {dao}{dzo} * \frac {dzo}{dbo} ..... (4)$$

The first part of Equation 4 has already been calculated in Equation 3. Here we only need the derivative of "zo" with respect to "bo", which is simply 1. So:

$$\frac {dcost}{dbo} = ao - y ........... (5)$$

To find new bias values for output layer, the values returned by Equation 5 can be simply multiplied with the learning rate and subtracted from the current bias value.
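
For a batch of records, Equation 5 gives one row of gradients per record, and the rows are summed before the update, which is what the full script later in this article does. The predictions, labels, and learning rate below are hypothetical.

```python
import numpy as np

# Hypothetical predictions and one-hot labels for two records
ao = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.6, 0.3]])
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

bo = np.zeros(3)   # current output-layer bias
lr = 0.1           # learning rate

dcost_dbo = ao - y                         # Equation 5, one row per record
bo_new = bo - lr * dcost_dbo.sum(axis=0)   # sum over records, then update

print(bo_new)
```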

##### Phase 2

In this section, we will back-propagate our error to the previous layer and find the new weight values for hidden layer weights i.e. weights w1 to w8.

Let's collectively denote hidden layer weights as "wh". We basically have to differentiate the cost function with respect to "wh".

Mathematically, we can use the chain rule of differentiation to represent it as:

$$\frac {dcost}{dwh} = \frac {dcost}{dah} *\ \frac {dah}{dzh} * \frac {dzh}{dwh} ...... (6)$$

Here again, we will break Equation 6 into individual terms.

The first term, dcost/dah, can be expanded using the chain rule of differentiation as follows:

$$\frac {dcost}{dah} = \frac {dcost}{dzo} *\ \frac {dzo}{dah} ...... (7)$$

Let's again break Equation 7 into individual terms. From Equation 3, we know that:

$$\frac {dcost}{dao} *\ \frac {dao}{dzo} =\frac {dcost}{dzo} = ao - y ........ (8)$$

Now we need to find dzo/dah from Equation 7, which is equal to the weights of the output layer as shown below:

$$\frac {dzo}{dah} = wo ...... (9)$$

Now we can find the value of dcost/dah by replacing the values from Equations 8 and 9 in Equation 7.

Coming back to Equation 6, we have yet to find dah/dzh and dzh/dwh.

The first term dah/dzh can be calculated as:

$$\frac {dah}{dzh} = sigmoid(zh) * (1-sigmoid(zh)) ........ (10)$$

And finally, dzh/dwh is simply the input values:

$$\frac {dzh}{dwh} = \text{input features} ........ (11)$$

If we replace the values from Equations 7, 10 and 11 in Equation 6, we can get the updated matrix for the hidden layer weights. To find new weight values for the hidden layer weights "wh", the values returned by Equation 6 can be simply multiplied with the learning rate and subtracted from the current hidden layer weight values.
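
The chain of Equations 6 to 11 can be checked in the same way as the output layer, by comparing the analytical gradient with a numerical estimate. The sketch below uses a tiny random batch of five records. All values are hypothetical, and the helper `cost` is an assumption written only for this check.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def cost(wh, bh, wo, bo, X, Y):
    ah = sigmoid(np.dot(X, wh) + bh)
    ao = softmax(np.dot(ah, wo) + bo)
    return -np.sum(Y * np.log(ao))

np.random.seed(1)
X = np.random.randn(5, 2)            # five records, two features
Y = np.eye(3)[[0, 1, 2, 0, 1]]       # one-hot labels
wh, bh = np.random.rand(2, 4), np.random.randn(4)
wo, bo = np.random.rand(4, 3), np.random.randn(3)

# Feed-forward
zh = np.dot(X, wh) + bh
ah = sigmoid(zh)
ao = softmax(np.dot(ah, wo) + bo)

# Back-propagation to the hidden layer
dcost_dzo = ao - Y                              # Equation 8
dcost_dah = np.dot(dcost_dzo, wo.T)             # Equations 7 and 9
dah_dzh = sigmoid(zh) * (1 - sigmoid(zh))       # Equation 10
dcost_dwh = np.dot(X.T, dah_dzh * dcost_dah)    # Equations 6 and 11

# Central-difference estimate for one hidden-layer weight
eps = 1e-6
wh_plus, wh_minus = wh.copy(), wh.copy()
wh_plus[0, 3] += eps
wh_minus[0, 3] -= eps
numeric = (cost(wh_plus, bh, wo, bo, X, Y)
           - cost(wh_minus, bh, wo, bo, X, Y)) / (2 * eps)

print(dcost_dwh[0, 3], numeric)
```

If the two printed numbers agree closely, the chain rule has been applied correctly for that weight.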

Similarly, the derivative of the cost function with respect to hidden layer bias "bh" can simply be calculated as:

$$\frac {dcost}{dbh} = \frac {dcost}{dah} *\ \frac {dah}{dzh} * \frac {dzh}{dbh} ...... (12)$$

Which is simply equal to:

$$\frac {dcost}{dbh} = \frac {dcost}{dah} *\ \frac {dah}{dzh} ...... (13)$$

because,

$$\frac {dzh}{dbh} = 1$$

To find new bias values for the hidden layer, the values returned by Equation 13 can be simply multiplied with the learning rate and subtracted from the current hidden layer bias values. That's it for back-propagation.

You can see that the feed-forward and back-propagation processes are quite similar to the ones we saw in the previous articles. The only things we changed are the activation function at the output layer and the cost function.

### Code for Neural Networks for Multi-class Classification

We have covered the theory behind the neural network for multi-class classification, and now is the time to put that theory into practice.

Take a look at the following script:

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(42)

    # Three clusters of 700 two-feature points, one cluster per class
    cat_images = np.random.randn(700, 2) + np.array([0, -3])
    mouse_images = np.random.randn(700, 2) + np.array([3, 3])
    dog_images = np.random.randn(700, 2) + np.array([-3, 3])

    feature_set = np.vstack([cat_images, mouse_images, dog_images])

    labels = np.array([0]*700 + [1]*700 + [2]*700)

    # One-hot encode the labels
    one_hot_labels = np.zeros((2100, 3))

    for i in range(2100):
        one_hot_labels[i, labels[i]] = 1

    plt.figure(figsize=(10,7))
    plt.scatter(feature_set[:,0], feature_set[:,1], c=labels, cmap='plasma', s=100, alpha=0.5)
    plt.show()

    def sigmoid(x):
        return 1/(1+np.exp(-x))

    def sigmoid_der(x):
        return sigmoid(x) * (1 - sigmoid(x))

    def softmax(A):
        expA = np.exp(A)
        return expA / expA.sum(axis=1, keepdims=True)

    instances = feature_set.shape[0]
    attributes = feature_set.shape[1]
    hidden_nodes = 4
    output_labels = 3

    wh = np.random.rand(attributes,hidden_nodes)
    bh = np.random.randn(hidden_nodes)

    wo = np.random.rand(hidden_nodes,output_labels)
    bo = np.random.randn(output_labels)
    lr = 10e-4  # i.e. 0.001

    error_cost = []

    for epoch in range(50000):
        ############# feedforward

        # Phase 1
        zh = np.dot(feature_set, wh) + bh
        ah = sigmoid(zh)

        # Phase 2
        zo = np.dot(ah, wo) + bo
        ao = softmax(zo)

        ########## Back Propagation

        ########## Phase 1

        dcost_dzo = ao - one_hot_labels
        dzo_dwo = ah

        dcost_wo = np.dot(dzo_dwo.T, dcost_dzo)

        dcost_bo = dcost_dzo

        ########## Phase 2

        dzo_dah = wo
        dcost_dah = np.dot(dcost_dzo , dzo_dah.T)
        dah_dzh = sigmoid_der(zh)
        dzh_dwh = feature_set
        dcost_wh = np.dot(dzh_dwh.T, dah_dzh * dcost_dah)

        dcost_bh = dcost_dah * dah_dzh

        # Update Weights ================

        wh -= lr * dcost_wh
        bh -= lr * dcost_bh.sum(axis=0)

        wo -= lr * dcost_wo
        bo -= lr * dcost_bo.sum(axis=0)

        if epoch % 200 == 0:
            loss = np.sum(-one_hot_labels * np.log(ao))
            print('Loss function value: ', loss)
            error_cost.append(loss)



The code is pretty similar to the one we created in the previous article. In the feed-forward section, the only difference is that "ao", which is the final output, is being calculated using the softmax function.

Similarly, in the back-propagation section, to find the new weights for the output layer, the derivative of the cost function is taken through the softmax function rather than the sigmoid function.
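
The script reports only the loss. If you also want a predicted class for each record, one common approach, which is not part of the original script, is to pick the output node with the highest softmax value. The sketch below assumes a small hypothetical `ao` matrix and label vector rather than the trained network's output.

```python
import numpy as np

# Hypothetical softmax outputs for four records (each row sums to 1)
ao = np.array([[0.80, 0.10, 0.10],
               [0.20, 0.30, 0.50],
               [0.30, 0.60, 0.10],
               [0.40, 0.35, 0.25]])

# Hypothetical actual class labels for the same records
labels = np.array([0, 2, 1, 1])

predicted = np.argmax(ao, axis=1)         # index of the largest value in each row
accuracy = np.mean(predicted == labels)   # fraction of correct predictions

print(predicted)
print(accuracy)
```

With the full script, the same two lines could be applied to the final `ao` and `labels` arrays after training.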

If you run the above script, you will see that the final error cost will be 0.5. The following figure shows how the cost decreases with the number of epochs.

As you can see, not many epochs are needed to reach our final error cost.

Similarly, if you run the same script with the sigmoid function at the output layer, the minimum error cost that you will achieve after 50000 epochs will be around 1.5, which is greater than the 0.5 achieved with softmax.

### Conclusion

Real-world neural networks are capable of solving multi-class classification problems. In this article, we saw how we can create a very simple neural network for multi-class classification from scratch in Python. This is the final article of the series "Neural Network from Scratch in Python". In future articles, I will explain how we can create more specialized neural networks, such as recurrent neural networks and convolutional neural networks, from scratch in Python.

17 Oct 2018 12:50pm GMT

#### Eli Bendersky: Covariance and contravariance in subtyping

Many programming languages support subtyping, a kind of polymorphism that lets us define hierarchical relations on types, with specific types being subtypes of more generic types. For example, a Cat could be a subtype of Mammal, which itself is a subtype of Vertebrate.

Intuitively, functions that accept any Mammal would accept a Cat too. More formally, this is known as the Liskov substitution principle:

Let $\phi(x)$ be a property provable about objects x of type T. Then $\phi(y)$ should be true for objects y of type S where S is a subtype of T.

A shorter way to say S is a subtype of T is S <: T. The relation <: is also sometimes expressed as $\le$, and can be thought of as "is less general than". So Cat <: Mammal and Mammal <: Vertebrate. Naturally, <: is transitive, so Cat <: Vertebrate; it's also reflexive, as T <: T for any type T [1].
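
As a small illustration, here is how these relations look with Python classes. This sketch treats Python's nominal subclassing as a stand-in for the <: relation, which is a simplification, and the function `describe` is a hypothetical helper.

```python
class Vertebrate:
    pass

class Mammal(Vertebrate):
    pass

class Cat(Mammal):
    pass

def describe(m: Mammal) -> str:
    # Written against Mammal, but works for any subtype of Mammal
    return type(m).__name__

# Substitution: a Cat can be passed wherever a Mammal is expected
print(describe(Cat()))

# Cat <: Mammal and Mammal <: Vertebrate
print(issubclass(Cat, Mammal), issubclass(Mammal, Vertebrate))

# Transitive: Cat <: Vertebrate. Reflexive: Mammal <: Mammal
print(issubclass(Cat, Vertebrate), issubclass(Mammal, Mammal))
```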

## Kinds of variance in subtyping

Variance refers to how subtyping between composite types (e.g. list of Cats versus list of Mammals) relates to subtyping between their components (e.g. Cats and Mammals). Let's use the general Composite<T> to refer to some composite type with components of type T.

Given types S and T with the relation S <: T, variance is a way to describe the relation between the composite types:

• Covariant means the ordering of component types is preserved: Composite<S> <: Composite<T>.
• Contravariant means the ordering is reversed: Composite<T> <: Composite<S> [2].
• Bivariant means both covariant and contravariant.
• Invariant means neither covariant nor contravariant.

That's a lot of theory and rules right in the beginning; the following examples should help clarify all of this.

## Covariance in return types of overriding methods in C++

In C++, when a subclass method overrides a similarly named method in a superclass, their signatures have to match. There is an important exception to this rule, however. When the original return type is B* or B&, the return type of the overriding function is allowed to be D* or D& respectively, provided that D is a public subclass of B. This rule is important to implement methods like Clone:

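struct Mammal {
  // A pure virtual destructor would still need a definition to link,
  // so the destructor is defaulted here; Clone keeps the class abstract.
  virtual ~Mammal() = default;
  virtual Mammal* Clone() = 0;
};

struct Cat : public Mammal {
  virtual ~Cat() {}

  // Covariant return type: Cat* instead of Mammal*.
  Cat* Clone() override {
    return new Cat(*this);
  }
};

struct Dog : public Mammal {
  virtual ~Dog() {}

  // Covariant return type: Dog* instead of Mammal*.
  Dog* Clone() override {
    return new Dog(*this);
  }
};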


And we can write functions like the following:

Mammal* DoSomething(Mammal* m) {
  Mammal* cloned = m->Clone();
  // Do something with cloned
  return cloned;
}


No matter what the concrete run-time class of m is, m->Clone() will return the right kind of object.

Armed with our new terminology, we can say that the return type rule for overriding methods is covariant for pointer and reference types. In other words, given Cat <: Mammal we have Cat* <: Mammal*.
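The same covariant-return rule can be sketched with Python type hints, where a checker such as mypy accepts an override whose declared return type is a subtype of the base method's return type. This is an illustrative translation rather than part of the C++ example; the lowercase names clone and do_something are assumptions made for the sketch.

```python
from abc import ABC, abstractmethod


class Mammal(ABC):
    @abstractmethod
    def clone(self) -> "Mammal":
        """Return a copy of this object."""


class Cat(Mammal):
    # Narrowing the return type from Mammal to Cat is a covariant override.
    def clone(self) -> "Cat":
        return Cat()


def do_something(m: Mammal) -> Mammal:
    # Statically the result is a Mammal; at run time it is whatever m really is.
    return m.clone()


cloned = do_something(Cat())
print(type(cloned).__name__)
```

Callers that only know about Mammal keep working, while callers that hold a Cat get a Cat back without a cast.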

Being able to replace Mammal* by Cat* seems like a natural thing to do in C++, but not all typing rules are covariant. Consider this code:

struct MammalClinic {
  virtual void Accept(Mammal* m);
};

struct CatClinic : public MammalClinic {
  virtual void Accept(Cat* c);
};


Looks legit? We have general MammalClinics that accept all mammals, and more specialized CatClinics that only accept cats. Given a MammalClinic*, we should be able to call Accept and the right one will be invoked at run-time, right? Wrong. CatClinic::Accept does not actually override MammalClinic::Accept; it simply overloads it. If we try to add the override keyword (as we should always do starting with C++11):

struct CatClinic : public MammalClinic {
  virtual void Accept(Cat* c) override;
};


We'll get:

error: 'virtual void CatClinic::Accept(Cat*)' marked 'override', but does not override
virtual void Accept(Cat* c) override;
^


This is precisely what the override keyword was created for: to help us find erroneous assumptions about methods overriding other methods. The reality is that the parameter types of overriding functions are not covariant for pointer types. They are invariant. In fact, the vast majority of typing rules in C++ are invariant; std::vector<Cat> is not a subtype of std::vector<Mammal>, even though Cat <: Mammal. As the next section demonstrates, there's a good reason for that.

## Covariant arrays in Java

Suppose we have PersianCat <: Cat, and some class representing a list of cats. Does it make sense for lists to be covariant? On initial thought, yes. Say we have this (pseudocode) function:

MakeThemMeow(List<Cat> lst) {
  for each cat in lst {
    cat->Meow()
  }
}


Why shouldn't we be able to pass a List<PersianCat> into it? After all, all Persian cats are cats, so they can all meow! As long as lists are immutable, this is actually safe. The problem appears when lists can be modified. This is best demonstrated with actual Java code, since in Java the array type constructor is covariant:

class Main {
  public static void main(String[] args) {
    String strings[] = {"house", "daisy"};
    Object objects[] = strings; // covariant

    objects[1] = "cauliflower"; // works fine
    objects[0] = 5;             // throws exception
  }
}


In Java, String <: Object, and since arrays are covariant, it means that String[] <: Object[], which makes the assignment on the line marked with "covariant" type-check successfully. From that point on, objects is an array of Object as far as the compiler is concerned, so assigning anything that's a subclass of Object to its elements is kosher, including integers [3]. Therefore the last line in main throws an exception at run-time:

Exception in thread "main" java.lang.ArrayStoreException: java.lang.Integer
at Main.main(Main.java:7)


Assigning an integer fails because at run-time it's known that objects is actually an array of strings. Thus, covariance together with mutability makes array types unsound. Note, however, that this is not just a mistake; it's a deliberate historical decision made when Java didn't have generics and polymorphism was still desired. The same problem exists in C#.

Other languages have immutable containers, which can then be made covariant without jeopardizing the soundness of the type system. For example in OCaml lists are immutable and covariant.

## Contravariance for function types

Covariance seems like a pretty intuitive concept, but what about contravariance? When does it make sense to reverse the subtyping relation for composite types to get Composite<T> <: Composite<S> for S <: T?

An important use case is function types. Consider a function that takes a Mammal and returns a Mammal; in functional programming the type of this function is commonly referred to as Mammal -> Mammal. Which function types are valid subtypes of this type?

Here's a pseudo-code definition that makes it easier to discuss:

func user(f : Mammal -> Mammal) {
  // do stuff with 'f'
}


Can we call user providing it a function of type Mammal -> Cat as f? Inside its body, user may invoke f and expect its return value to be a Mammal. Since Mammal -> Cat returns cats, that's fine, so this usage is safe. It aligns with our earlier intuition that covariance makes sense for function return types.

Note that passing a Mammal -> Vertebrate function as f does not work, because user expects f to return Mammals, but our function may return a Vertebrate that's not a Mammal (maybe a Bird). Therefore, function return types are not contravariant.

But what about function parameters? So far we've been looking at function types that take Mammal - an exact match for the expected signature of f. Can we call user with a function of type Cat -> Mammal? No, because user expects to be able to pass any kind of Mammal into f, not just Cats. So function parameters are not covariant. On the other hand, it should be safe to pass a function of type Vertebrate -> Mammal as f, because it can take any Mammal, and that's what user is going to pass to it. So contravariance makes sense for function parameters.

Most generally, we can say that Vertebrate -> Cat is a subtype of Mammal -> Mammal, because parameter types are contravariant and return types are covariant. A nice quote that can help remember these rules is: be liberal in what you accept and conservative in what you produce.

This is not just theory; if we go back to C++, this is exactly how function types with std::function behave:

#include <functional>

struct Vertebrate {};
struct Mammal : public Vertebrate {};
struct Cat : public Mammal {};

Cat* f1(Vertebrate* v) {
  return nullptr;
}

Vertebrate* f2(Vertebrate* v) {
  return nullptr;
}

Cat* f3(Cat* v) {
  return nullptr;
}

void User(std::function<Mammal*(Mammal*)> f) {
  // do stuff with 'f'
}

int main() {
  User(f1);       // works

  return 0;
}


The invocation User(f1) compiles, because f1 is convertible to the type std::function<Mammal*(Mammal*)> [4]. Had we tried to invoke User(f2) or User(f3), they would fail because neither f2 nor f3 is convertible to std::function<Mammal*(Mammal*)>: f2 returns a Vertebrate*, which is too general, and f3 accepts only a Cat*, which is too specific.
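The same rule can be sketched with Python's typing.Callable, which static checkers such as mypy treat as contravariant in its parameter types and covariant in its return type. The function names below are hypothetical, and the rejection mentioned in the comment comes from the type checker only; Python itself does not enforce it at run time.

```python
from typing import Callable


class Vertebrate:
    pass


class Mammal(Vertebrate):
    pass


class Cat(Mammal):
    pass


def user(f: Callable[[Mammal], Mammal]) -> Mammal:
    # user may pass any Mammal to f and expects a Mammal back.
    return f(Mammal())


def vertebrate_to_cat(v: Vertebrate) -> Cat:
    # Accepts more than user requires and returns something more specific.
    return Cat()


def cat_to_mammal(c: Cat) -> Mammal:
    # Accepts less than user requires.
    return Mammal()


result = user(vertebrate_to_cat)  # accepted by a type checker

# user(cat_to_mammal) would be flagged by a type checker, because the
# parameter type Cat is narrower than the Mammal that user may pass in.
```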

## Bivariance

So far we've seen examples of invariance, covariance and contravariance. What about bivariance? Recall, bivariance means that given S <: T, both Composite<S> <: Composite<T> and Composite<T> <: Composite<S> are true. When is this useful? Not often at all, it turns out.

In TypeScript, function parameters are bivariant. The following code compiles correctly but fails at run-time:

function trainDog(d: Dog) { ... }
function cloneAnimal(source: Animal, done: (result: Animal) => void): void { ... }
let c = new Cat();

// Runtime error here occurs because we end up invoking 'trainDog' with a 'Cat'
cloneAnimal(c, trainDog);


Once again, this is not because the TypeScript designers are incompetent. The reason is fairly intricate and explained on this page; the summary is that it's needed to help the type-checker treat functions that don't mutate their arguments as covariant for arrays.

That said, TypeScript 2.6 adds a new strictness flag (--strictFunctionTypes) under which function type parameters are checked contravariantly only.

## Explicit variance specification in Python type-checking

If you had to guess which of the mainstream languages has the most advanced support for variance in its type system, Python probably wouldn't be your first guess, right? I admit it wasn't mine either, because Python is dynamically (duck) typed. But the new type hinting support (described in PEP 484, with more details in PEP 483) is actually fairly advanced.

Here's an example:

from typing import List

class Mammal:
    pass

class Cat(Mammal):
    pass

def count_mammals_list(seq : List[Mammal]) -> int:
    return len(seq)

mlst = [Mammal(), Mammal()]
print(count_mammals_list(mlst))


If we run mypy type-checking on this code, it will succeed. count_mammals_list takes a list of Mammals, and this is what we passed in; so far, so good. However, the following will fail:

clst = [Cat(), Cat()]
print(count_mammals_list(clst))


This fails because List is invariant rather than covariant. The type checker doesn't know whether count_mammals_list will modify the list, so allowing calls with a list of Cats is potentially unsafe.
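To see why the call would be unsafe in general, here is a minimal sketch of what could go wrong if a list of Cats were accepted where a list of Mammals is expected. The Dog class and the functions add_a_dog and make_them_meow are hypothetical names introduced for illustration; the classes are redefined so that the sketch is self-contained.

```python
from typing import List, Optional


class Mammal:
    pass


class Cat(Mammal):
    def meow(self) -> str:
        return "meow"


class Dog(Mammal):
    pass


def add_a_dog(mammals: List[Mammal]) -> None:
    # Perfectly legal for a list of Mammals.
    mammals.append(Dog())


def make_them_meow(cats: List[Cat]) -> List[str]:
    return [cat.meow() for cat in cats]


clst: List[Cat] = [Cat(), Cat()]

# A type checker rejects this call because List is invariant;
# at run time nothing stops it, and a Dog ends up in the list of Cats.
add_a_dog(clst)

failure: Optional[Exception] = None
try:
    make_them_meow(clst)
except AttributeError as exc:
    # The Dog has no meow() method.
    failure = exc
```

This is the same failure mode as the Java array example, except that Python only notices when the missing method is actually called.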

It turns out that the typing module lets us express the variance of types explicitly. Here's a very minimal "immutable list" implementation that only supports counting elements:

from typing import Generic, Iterable, TypeVar

T_co = TypeVar('T_co', covariant=True)

class ImmutableList(Generic[T_co]):
    def __init__(self, items: Iterable[T_co]) -> None:
        self.lst = list(items)

    def __len__(self) -> int:
        return len(self.lst)


And now if we define:

def count_mammals_ilist(seq : ImmutableList[Mammal]) -> int:
    return len(seq)


We can actually invoke it with an ImmutableList of Cats, and this will pass type checking:

cimmlst = ImmutableList([Cat(), Cat()])
print(count_mammals_ilist(cimmlst))


Similarly, we can support contravariant types, etc. The typing module also provides a number of useful built-ins; for example, it's not really necessary to create an ImmutableList type, as there's already a Sequence type that is covariant.
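As a sketch of both points, the code below uses the built-in covariant Sequence in place of ImmutableList, and declares a contravariant type variable for a class that only consumes values. The Feeder class and the functions count_mammals_seq and feed_cats are hypothetical names chosen for illustration.

```python
from typing import Generic, Sequence, TypeVar


class Mammal:
    pass


class Cat(Mammal):
    pass


def count_mammals_seq(seq: Sequence[Mammal]) -> int:
    # Sequence is read-only and covariant, so a list of Cats is accepted.
    return len(seq)


T_contra = TypeVar('T_contra', contravariant=True)


class Feeder(Generic[T_contra]):
    # T_contra appears only in parameter position, never as a return type.
    def __init__(self) -> None:
        self.fed = 0

    def feed(self, animal: T_contra) -> None:
        self.fed += 1


def feed_cats(feeder: Feeder[Cat]) -> int:
    feeder.feed(Cat())
    return feeder.fed


clst = [Cat(), Cat()]
count = count_mammals_seq(clst)

# Contravariance: a Feeder[Mammal] can stand in for a Feeder[Cat],
# because something that can feed any Mammal can certainly feed a Cat.
mammal_feeder: Feeder[Mammal] = Feeder()
fed = feed_cats(mammal_feeder)
```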

 [1] In most cases <: is also antisymmetric, making it a partial order, but in some cases it isn't; for example, structs with permuted fields can be considered subtypes of each other (in most languages they aren't!) but such subtyping is not antisymmetric.
 [2] These terms come from math, and a good rule of thumb to remember how they apply is: co means together, while contra means against. As long as the composite types vary together (in the same direction) as their component types, they are co-variant. When they vary against their component types (in the reverse direction), they are contra-variant.
 [3] Strictly speaking, integer literals like 5 are primitives in Java and not objects at all. However, due to autoboxing, this is equivalent to wrapping the 5 in Integer prior to the assignment.
 [4] Note that we're using pointer types here. The same example would work with std::function and corresponding f1 taking and returning value types. It's just that in C++ value types are not very useful for polymorphism, so pointer (or reference) values are much more commonly used.

17 Oct 2018 12:35pm GMT

#### PyCon: PyCon 2019 Launches Financial Aid

The PyCon conference prides itself on being affordable. However, registration is only one of several expenses an attendee must incur, and it's likely the smallest one. Flying, whether halfway around the world or from a few hundred miles away, is more expensive. Staying in a hotel for a few days is also more expensive. All together, the cost of attending a conference can become prohibitively expensive. That's where PyCon's Financial Aid program comes in. We're opening applications for Financial Aid today, and we'll be accepting them through February 12, 2019.

To apply, first set up an account on the site, and then you will be able to fill out the application here or through your dashboard.

For those proposing talks, tutorials, or posters, selecting the "I require a speaker grant if my proposal is accepted" box on your speaker profile serves as your request, so you do not need to fill out the financial aid application. Upon acceptance, we'll contact the speakers who checked that box to gather the appropriate information. Accepted speakers and presenters are prioritized for travel grants. Additionally, we do not expose grant requests to reviewers while evaluating proposals. The Program Committee evaluates proposals on the basis of their presentation, and later the Financial Aid team comes in and looks at how we can help our speakers.

We offer need-based grants to enable people from across our community to attend PyCon. The criteria for evaluating requests take into account several things, such as whether the applicant is a student, unemployed, or underemployed; their geographic location; and their involvement in both the conference and the greater Python community.

Our process aims to help a large number of people with partial grants, as opposed to covering full expenses for a small number of people. Based on individual need, we craft grant amounts that we hope can turn PyCon from inaccessible to reality. While some direct costs, like those associated with PyCon itself, are discounted or waived, external costs such as travel are handled via reimbursement, where the attendee pays and then submits receipts to be paid back an amount based on their grant. For the full details, see our FAQ at https://us.pycon.org/2019/financial-assistance/faq/ and contact pycon-aid@python.org with further questions.

The Python Software Foundation & PyLadies make Financial Aid possible. This year the Python Software Foundation is providing $110,000 USD towards financial aid and PyLadies will contribute as much as they can based on the contributions they get throughout 2018. For more information about Financial Aid, see https://us.pycon.org/2019/financial-assistance.

Our Call for Proposals is open! Tutorial presentations are due November 26, while talk, poster, and education summit proposals are due January 3. For more information, see https://us.pycon.org/2019/speaking/.

*Note: Main content is from post written by Brian Curtin for 2018 launch

17 Oct 2018 10:01am GMT

#### Mike Driscoll: Jupyter Notebook Debugging

Debugging is the process of figuring out what is wrong with your code, or simply of understanding how the code works. There are many times when I come to unfamiliar code and need to step through it in a debugger to grasp how it works. Most Python IDEs have good debuggers built into them. I personally like Wing IDE, for instance. Others like PyCharm or PyDev. But what if you want to debug the code in your Jupyter Notebook? How does that work?

In this chapter we will look at a couple of different methods of debugging a Notebook. The first one is by using Python's own pdb module.

### Using pdb

The pdb module is Python's debugger module. Just as C++ has gdb, Python has pdb.

Let's start by opening up a new Notebook and adding a cell with the following code in it:

def bad_function(var):
    return var + 0

bad_function("Mike")  # any string argument triggers the TypeError



If you run this code, you should end up with some output that looks like this:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-2f23ed1cac1e> in <module>()
2         return var + 0
3

----> 2         return var + 0
3

TypeError: cannot concatenate 'str' and 'int' objects


What this means is that you cannot concatenate a string and an integer. This is a pretty common problem if you don't know what types a function accepts. You will find that this is especially true when working with complex functions and classes, unless they happen to be using type hinting. One way to figure out what is going on is by adding a breakpoint using pdb's set_trace() function:

def bad_function(var):
    import pdb
    pdb.set_trace()
    return var + 0

bad_function("Mike")



Now when you run the cell, you will get a prompt in the output which you can use to inspect the variables and basically run code live. If you happen to have Python 3.7 or newer, then you can simplify the example above by using the breakpoint built-in, like this:

def bad_function(var):
    breakpoint()
    return var + 0

bad_function("Mike")



This code is functionally equivalent to the previous example but uses the new breakpoint function instead. When you run this code, it should act the same way as the code in the previous section did.
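To make the equivalence a little more concrete: breakpoint() simply calls sys.breakpointhook(), whose default implementation starts pdb. The sketch below swaps in a hypothetical, non-interactive hook (recording_hook is made up for illustration and is not part of pdb) that records the caller's local variables instead of opening a prompt. It calls the function with an integer so the code runs to completion:

```python
import sys

captured = []  # snapshots of local variables, one per breakpoint() call


def recording_hook(*args, **kwargs):
    """Hypothetical stand-in for the debugger: record the caller's locals."""
    caller = sys._getframe(1)  # frame of the function that called breakpoint()
    captured.append(dict(caller.f_locals))


def bad_function(var):
    breakpoint()  # dispatches to whatever sys.breakpointhook currently is
    return var + 0


previous_hook = sys.breakpointhook
sys.breakpointhook = recording_hook
try:
    result = bad_function(41)
finally:
    # Put the original hook back so later breakpoint() calls behave as before.
    sys.breakpointhook = previous_hook

print(result, captured)
```

A real debugging session would of course keep the default hook; the point is only that breakpoint() is a thin dispatcher, which is why it behaves like the explicit pdb.set_trace() call.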

You can use any of pdb's commands right inside of your Jupyter Notebook. Here are some examples:

• w(here) - Print the stack trace
• d(own) - Move the current frame X number of levels down. Defaults to one.
• u(p) - Move the current frame X number of levels up. Defaults to one.
• b(reak) - With a *lineno* argument, set a break point at that line number in the current file / context
• s(tep) - Execute the current line and stop at the next possible line
• c(ontinue) - Continue execution

Note that only the letter outside the parentheses needs to be typed: w, d, u, b, s and c are the commands. You can use these commands to interactively debug your code in your Notebook, along with the other commands listed in the pdb documentation.
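If you would like to see what a short session looks like without typing anything, pdb can be fed its commands from a text buffer. The following is only a sketch: inspect_with_pdb is a hypothetical helper, and it uses the p (print expression) command, which is part of pdb but not in the list above, together with c(ontinue):

```python
import io
import pdb


def inspect_with_pdb(var):
    # Commands the debugger will read instead of keyboard input.
    script = io.StringIO("p var\np type(var)\nc\n")
    transcript = io.StringIO()  # everything pdb would have printed
    debugger = pdb.Pdb(stdin=script, stdout=transcript, readrc=False)
    debugger.set_trace()  # pauses here and runs the scripted commands
    result = str(var) + "0"
    return result, transcript.getvalue()


result, transcript = inspect_with_pdb("Mike")
print(transcript)
```

The transcript contains the same prompts and answers you would see in the Notebook output, which makes this a convenient way to experiment with individual commands.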

### ipdb

IPython also has a debugger called ipdb. However, it does not work with Jupyter Notebook directly. You would need to connect to the kernel using something like Jupyter console and run it from there to use it. If you would like to go that route, you can read more in the Jupyter console documentation.

However there is an IPython debugger that we can use called IPython.core.debugger.set_trace. Let's create a cell with the following code:

from IPython.core.debugger import set_trace

def bad_function(var):
    set_trace()
    return var + 0

bad_function("Mike")



Now you can run this cell and get the ipdb debugger.

The IPython debugger uses the same commands as the Python debugger does. The main difference is that it provides syntax highlighting and was originally designed to work in the IPython console.

There is one other way to open up the ipdb debugger and that is by using the %pdb magic. Here is some sample code you can try in a Notebook cell:

%pdb

def bad_function(var):
    return var + 0

bad_function("Mike")



When you run this code, you should end up seeing the TypeError traceback and then the ipdb prompt will appear in the output, which you can then use as before.

There is yet another way that you can open up a debugger in your Notebook. You can use %%debug to debug the entire cell like this:

%%debug

def bad_function(var):
    return var + 0

bad_function("Mike")



This will start the debugging session immediately when you run the cell. What that means is that you would want to use some of the commands that pdb supports to step into the code and examine the function or variables as needed.

Note that you could also use %debug if you want to debug a single line.
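Both %pdb and a bare %debug work post-mortem, that is, they open the debugger on the traceback of an exception that has already been raised. As a rough sketch of what that gives you access to, the plain-Python snippet below (the helper name locals_at_failure is made up for illustration) walks a caught traceback to its innermost frame and reads the local variables there, which is the same information you would inspect at the ipdb prompt:

```python
def bad_function(var):
    return var + 0


def locals_at_failure(exc):
    """Return the local variables of the frame where `exc` was raised."""
    tb = exc.__traceback__
    while tb.tb_next is not None:  # walk down to the innermost frame
        tb = tb.tb_next
    return dict(tb.tb_frame.f_locals)


try:
    bad_function("Mike")
except TypeError as exc:
    failure_locals = locals_at_failure(exc)

print(failure_locals)
```

Seeing that var holds a string at the point of failure is usually all it takes to explain the TypeError.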

### Wrapping Up

In this chapter we learned of several different methods that you can use to debug the code in your Jupyter Notebook. I personally prefer to use Python's pdb module, but you can use the IPython.core.debugger to get the same functionality and it could be better if you prefer to have syntax highlighting.

There is also a newer "visual debugger" package called the PixieDebugger from the pixiedust package.

I haven't used it myself. Some reviewers say it is amazing and others have said it is pretty buggy. I will leave that one up to you to determine if it is something you want to add to your toolset.

As far as I am concerned, I think using pdb or IPython's debugger works quite well and should work for you too.

17 Oct 2018 5:05am GMT

#### Vasudev Ram: The 2018 Python Developer Survey

By Vasudev Ram

Reposting a PSF-Community email as a PSA:

Participate in the 2018 Python Developer Survey.

Excerpt from an email to the psf-community@python.org and psf-members-announce@python.org mailing lists:

[ As some of you may have seen, the 2018 Python Developer Survey is available. If you haven't taken the survey yet, please do so soon! Additionally, we'd appreciate any assistance you all can provide with sharing the survey with your local Python groups, schools, work colleagues, etc. We will keep the survey open through October 26th, 2018.

Python Developers Survey 2018

We're counting on your help to better understand how different Python developers use Python and related frameworks, tools, and technologies. We also hope you'll enjoy going through the questions.

The survey is organized in partnership between the Python Software Foundation and JetBrains. Together we will publish the aggregated results. We will randomly choose and announce 100 winners to receive a Python Surprise Gift Pack (must complete the full survey to qualify). ]

To my readers: I'll post the answer to A Python email signature puzzle soon, in my next post.

- Vasudev Ram - Online Python training and consulting

17 Oct 2018 1:51am GMT

## 16 Oct 2018

### Planet Python

#### Andrea Grandi: Using ipdb with Python 3.7.x breakpoint

Python 3.7 introduced a new way to insert a breakpoint in the code. Before Python 3.7, to insert a debugging point we had to write import pdb; pdb.set_trace(), which honestly I could never remember (I even created a snippet in VS Code to autocomplete it).

Now you can just write breakpoint() and that's it!

Now... the only problem is that by default that command will use pdb, which is not exactly the best debugger you can have. I usually use ipdb, but there wasn't an intuitive way of using it... and no, just installing it in your virtual environment is not enough: it won't be used by default.

How to use it then? It's very simple. The new debugging command will read an environment variable named PYTHONBREAKPOINT. If you set it properly, you will be able to use ipdb instead of pdb.

export PYTHONBREAKPOINT=ipdb.set_trace


At this point, any time you use breakpoint() in your code, ipdb will be used instead of pdb.
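The same variable can also be set from inside Python, for example at the top of a script, because the default hook reads PYTHONBREAKPOINT each time breakpoint() is called. A small sketch, assuming you want to fall back to pdb when ipdb is not installed (the helper choose_breakpoint_hook is hypothetical, not part of either debugger):

```python
import importlib.util
import os


def choose_breakpoint_hook(preferred="ipdb.set_trace", fallback="pdb.set_trace"):
    """Return `preferred` if its module can be found, otherwise `fallback`."""
    module_name = preferred.rpartition(".")[0]
    try:
        found = importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        found = False  # parent package missing or malformed name
    return preferred if found else fallback


# From here on, breakpoint() will start the chosen debugger.
os.environ["PYTHONBREAKPOINT"] = choose_breakpoint_hook()
print(os.environ["PYTHONBREAKPOINT"])
```

Setting the variable to 0 instead disables breakpoint() entirely, which can be handy for runs where a forgotten breakpoint must not block execution.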

#### References

• https://hackernoon.com/python-3-7s-new-builtin-breakpoint-a-quick-tour-4f1aebc444c

16 Oct 2018 9:00pm GMT

### Introduction

Python has a wide variety of useful packages for machine learning and statistical analysis such as TensorFlow, NumPy, scikit-learn, Pandas, and more. One package that is essential to most data science projects is matplotlib.

Available for any Python distribution, it can be installed on Python 3 with pip. Other methods are also available; check https://matplotlib.org/ for more details.

### Installation

If you use an OS with a terminal, the following command would install matplotlib with pip:

pip install matplotlib


### Importing & Environment

In a Python file, we want to import the pyplot module, which allows us to interface with a MATLAB-like plotting environment. We also import the lines module, which lets us add lines to plots:

import matplotlib.pyplot as plt
import matplotlib.lines as mlines



Essentially, this plotting environment lets us save figures and their attributes as variables. These plots can then be printed and viewed with a simple command. For an example, we can look at the stock price of Google: specifically the date, open, close, volume, and adjusted close price (date is stored as an np.datetime64) for the most recent 250 days:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook

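with cbook.get_sample_data('goog.npz') as datafile:
    # load the structured array and view it as a record array for attribute access
    price_data = np.load(datafile)['price_data'].view(np.recarray)
price_data = price_data[-250:] # get the most recent 250 trading days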



We then transform the data in a way that is done quite often for time series. We find the relative difference, $d_i$, between each observation and the one before it:

$$d_i = \frac{y_i - y_{i - 1}}{y_{i - 1}}$$

delta1 = np.diff(price_data.adj_close) / price_data.adj_close[:-1]
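
The line above divides each difference by the previous value, so it yields relative changes rather than raw differences. A tiny worked example with made-up prices (not the Google data) shows the arithmetic and how consecutive changes are later paired up for the scatter plot:

```python
import numpy as np

prices = np.array([100.0, 110.0, 99.0, 108.9])  # made-up closing prices

# Relative change between consecutive observations: (y_i - y_{i-1}) / y_{i-1}
delta = np.diff(prices) / prices[:-1]

# Pair each change with the one that follows it, as the scatter plot does
x, y = delta[:-1], delta[1:]
print(delta)
```

Note that delta has one element fewer than prices, and x and y have one element fewer again, which is why the marker sizes and colors further down are trimmed with [:-2].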



We can also look at the transformations of different variables, such as volume and closing price:

# Marker size in units of points^2
volume = (15 * price_data.volume[:-2] / price_data.volume[0])**2
close = 0.003 * price_data.close[:-2] / 0.003 * price_data.open[:-2]



### Plotting a Scatter Plot

To actually plot this data, you can use the subplots() function from plt (matplotlib.pyplot). By default this generates the area for the figure and the axes of a plot.

Here we will make a scatter plot of the differences between successive days. To elaborate, x is the difference between day i and the previous day. y is the difference between day i+1 and the previous day (i):

fig, ax = plt.subplots()
ax.scatter(delta1[:-1], delta1[1:], c=close, s=volume, alpha=0.5)

ax.set_xlabel(r'$\Delta_i$', fontsize=15)
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=15)
ax.set_title('Volume and percent change')

ax.grid(True)
fig.tight_layout()

plt.show()



We then create labels for the x and y axes, as well as a title for the plot. We choose to plot this data with grids and a tight layout.

plt.show() displays the plot for us.

We can add a line to this plot by providing x and y coordinates as lists to a Line2D instance:

import matplotlib.lines as mlines

fig, ax = plt.subplots()
line = mlines.Line2D([-.15,0.25], [-.07,0.09], color='red')
ax.add_line(line)  # attach the line to the axes so that it is drawn

# reusing scatterplot code
ax.scatter(delta1[:-1], delta1[1:], c=close, s=volume, alpha=0.5)

ax.set_xlabel(r'$\Delta_i$', fontsize=15)
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=15)
ax.set_title('Volume and percent change')

ax.grid(True)
fig.tight_layout()

plt.show()



### Plotting Histograms

To plot a histogram, we follow a similar process and use the hist() function from pyplot. We will generate 10000 random data points, x, with a mean of 100 and standard deviation of 15.

The hist function takes the data, x, number of bins, and other arguments such as density, which normalizes the data to a probability density, or alpha, which sets the transparency of the histogram.

We will also add a line representing a normal density function with the same mean and standard deviation. Older matplotlib releases offered mlab.normpdf for this, but it has since been removed, so the density is computed directly with NumPy:

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 100, 15
x = mu + sigma*np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 30, density=True, facecolor='blue', alpha=0.75)

# add a 'best fit' line: the normal density evaluated at the bin edges
y = np.exp(-0.5 * ((bins - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))
l = plt.plot(bins, y, 'r--', linewidth=4)

plt.xlabel('IQ')
plt.ylabel('Probability')
plt.title(r'$\mathrm{Histogram\ of\ IQ:}\ \mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)

plt.show()



### Bar Charts

While histograms helped us with visual densities, bar charts help us view counts of data. To plot a bar chart with matplotlib, we use the bar() function. This takes the bar positions as x and the counts as the bar heights, along with other arguments; the data labels are attached to the positions with xticks().

As an example, we could look at a sample of the number of programmers that use different languages:

import numpy as np
import matplotlib.pyplot as plt

objects = ('Python', 'C++', 'Java', 'Perl', 'Scala', 'Lisp')
y_pos = np.arange(len(objects))
performance = [10,8,6,4,2,1]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Usage')
plt.title('Programming language usage')

plt.show()



### Plotting Images

Analyzing images is very common in Python. Not surprisingly, we can use matplotlib to view images. We use the cv2 library to read in images.

The read_image() function summary is below:

• splits the color channels
• changes them to RGB
• resizes the image
• returns a matrix of RGB values

The rest of the code reads in the first five images of cats and dogs from data used in an image recognition CNN. The pictures are concatenated and printed on the same axis:

import matplotlib.pyplot as plt
import numpy as np
import os, cv2

cwd = os.getcwd()
TRAIN_DIR = cwd + '/data/train/'

ROWS = 256
COLS = 256
CHANNELS = 3

train_images = [TRAIN_DIR+i for i in os.listdir(TRAIN_DIR)] # use this for full dataset
train_dogs =   [TRAIN_DIR+i for i in os.listdir(TRAIN_DIR) if 'dog' in i]
train_cats =   [TRAIN_DIR+i for i in os.listdir(TRAIN_DIR) if 'cat' in i]

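def read_image(file_path):
    img = cv2.imread(file_path, cv2.IMREAD_COLOR)  # OpenCV loads images as BGR
    b,g,r = cv2.split(img)
    img2 = cv2.merge([r,g,b])  # reorder the channels to RGB for matplotlib
    return cv2.resize(img2, (ROWS, COLS), interpolation=cv2.INTER_CUBIC)

for a in range(0,5):
    cat = read_image(train_cats[a])
    dog = read_image(train_dogs[a])
    pair = np.concatenate((cat, dog), axis=1)  # place the two images side by side
    plt.figure(figsize=(10,5))
    plt.imshow(pair)
    plt.show()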
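
The channel reordering and the side-by-side concatenation can be tried without any image files. The sketch below uses a made-up array in place of a real picture and NumPy slicing in place of the split and merge calls; it is an illustration of the idea, not a replacement for read_image():

```python
import numpy as np

# Made-up 2x4 "image" in OpenCV's BGR channel order
bgr = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3)

# Reversing the last axis swaps B and R, like cv2.split + cv2.merge([r, g, b])
rgb = bgr[:, :, ::-1]

# Side-by-side concatenation along the width axis, as done for each cat/dog pair
pair = np.concatenate((rgb, rgb), axis=1)
print(rgb[0, 0], pair.shape)
```

Concatenating along axis=1 only works when both images have the same height, which is one reason every image is resized to the same dimensions first.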



### Conclusion

In this post we saw a brief introduction of how to use matplotlib to plot data in scatter plots, histograms, and bar charts. We also added lines to these plots. Finally, we saw how to read in images using the cv2 library and used matplotlib to plot the images.

16 Oct 2018 1:10pm GMT

#### A. Jesse Jiryu Davis: Recap: PyGotham 2018 Speaker Coaching

With your help, we raised money for twelve PyGotham speakers to receive free training from opera singer and speaking coach Melissa Collom. Most of the speakers were new to the conference scene; Melissa helped them focus on their value to the audience, clarify their ideas, and speak with confidence and charisma. In a survey, nearly all speakers said the session was "very beneficial" and made them "much more likely" to propose conference talks again.

16 Oct 2018 11:14am GMT

#### Codementor: Celery Task Routing: The Basics

16 Oct 2018 11:00am GMT

#### PyBites: Code Challenge 55 - #100DaysOfCode Curriculum Generator

There is an immense amount to be learned simply by tinkering with things. - Henry Ford

Hey Pythonistas,

It's time for another code challenge! This week we're asking you to create your own #100DaysOfCode Curriculum Generator.

Sounds exciting? It gets even better: with this challenge you can even be featured on our platform! Read on ...

## The Challenge

Did you notice that all serious progress starts with a plan? This is why we are big advocates of the #100DaysOfCode. Heck, we even built a whole Python course around it.

So here is the deal: PyBites is expanding its 100 Days tracker ("grid") feature: we want folks to add their own curriculums or learning paths.

### Only one requirement: return a valid JSON response

You can make this as simple or sophisticated as you want, the only thing we request is a standard response JSON template so we can easily parse it on the platform:

Built with ObjGen -> http://www.objgen.com/json/models/q2S4Q

    {
      "title": "title of your 100 days",
      "version": 0.1,
      "github_repo": "https://github.com/pybites/100DaysOfCode",
        {
          "day": 1,
          "activity": "what you need to do this day?",
          "done": false
        },
        {
          "day": 2,
          "activity": "what you need to do this day?",
          "done": false
        },
        {
          "day": 3,
          "activity": "what you need to do this day?",
          "done": false
        },
        ...
        ...
        {
          "day": 100,
          "activity": "milestone ... 100 days done",
          "done": false
        }
      ]
    }


Update 17/10/2018: we took startDate and goals out because these are not relevant for the learning path, more for the consumers of it. github_repo is optional.

### An example

Here is what we plan to do, maybe it serves as an idea how you could code this challenge up:

• as I (Bob) want to learn Data Science I am selecting 4 or 5 books I want to go through
• as #100DaysOfCode works best by spending an hour a day I am dividing the books in n pages to read every day
• I am going to add the books to our reading list app
• keeping it generic, my script will accept a bunch of book IDs (URLs) from that app and scrape the title and number of pages for each book
• I calculate the daily number of pages to read every day and define page ranges for each of the 100 days
• I convert this to the required JSON output above
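The last two steps of this plan can be sketched roughly as follows. This is only an illustration under a few assumptions: the book titles and page counts are placeholders rather than scraped data, the function names are made up, and "tasks" is an assumed name for the key that holds the list of days, since the template as shown does not name it:

```python
import json


def daily_activities(books, days=100):
    """Split the combined page count of `books` into `days` reading tasks.

    `books` is a list of (title, number_of_pages) tuples.
    """
    total = sum(pages for _, pages in books)
    # Cumulative page offsets at which each day ends; integer division
    # spreads any remainder over the days.
    bounds = [total * d // days for d in range(days + 1)]
    activities = []
    for start, end in zip(bounds, bounds[1:]):
        parts = []
        offset = 0  # global offset of the current book's first page
        for title, pages in books:
            lo, hi = max(start, offset), min(end, offset + pages)
            if lo < hi:  # this day's slice overlaps the current book
                parts.append(f"{title}: pages {lo - offset + 1}-{hi - offset}")
            offset += pages
        activities.append("; ".join(parts))
    return activities


def build_curriculum(title, books, days=100):
    """Build a dict shaped like the template; "tasks" is an assumed key name."""
    return {
        "title": title,
        "version": 0.1,
        "tasks": [
            {"day": day, "activity": activity, "done": False}
            for day, activity in enumerate(daily_activities(books, days), start=1)
        ],
    }


books = [("Book A", 250), ("Book B", 150)]  # placeholder titles and page counts
curriculum = build_curriculum("100 days of Data Science", books)
print(json.dumps(curriculum, indent=2)[:300])
```

With these placeholder numbers each day covers four pages, and a day that straddles two books lists both page ranges in its activity. If the books have fewer pages in total than there are days, some activities come out empty, so a real script would want to handle that case.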

If you like this idea, we opened an API endpoint to more easily pull in book info based on (Google) book ID, for example: http://pbreadinglist.herokuapp.com/api/books/bRpYDgAAQBAJ. Just replace the bookid in this endpoint.

### More ideas

Of course it does not have to be centered around books, it can be any other way you like to plan your #100DaysOfCode. As long as you return the required JSON.

Other ideas that come to mind:

• Set out your plan in a Google sheet and parse that,
• Make a curriculum pointing to various Lynda/Safaribooks/Pluralsight courses and try to make a daily task list scraping those sites,
• Make a curriculum parsing one or more (Pycon) YouTube feeds,
• Make a curriculum parsing our blog challenges and Bites of Py exercises,
• It all comes down to planning your resources and breaking them down into 100 digestible units.

As usual, this is a challenge that came about wanting to scratch our own itch. Lack ideas? Remember there is always something you can enhance or automate for yourself or somebody else, and by doing so sharpening your coding skills!

### Be featured

If you want to share your learning path with our community let us know in your PR linking to your JSON file and a short description. We will then add it to our 100 days grid app.

If you need help getting ready with Github, see our new instruction video.

## PyBites Community

A few more things before we take off:

• Do you want to discuss this challenge and share your Pythonic journey with other passionate Pythonistas? Confirm your email on our platform then request access to our Slack via settings.

• PyBites is here to challenge you because becoming a better Pythonista requires practice, a lot of it. For any feedback, issues or ideas use GH Issues, tweet us or ping us on our Slack.

>>> from pybites import Bob, Julian

Keep Calm and Code in Python!


16 Oct 2018 10:47am GMT

#### PyBites: Code Challenge 54 - Query the Spotify API - Review

In this article we review last week's Python Clipboard History code challenge.

## Reminder: new structure review post / Hacktoberfest is back!

From now on we will merge our solution into our Community branch and include anything noteworthy here, because:

• we are learning just like you, we are all equals :)

• we need the PRs too ;) ... as part of Hacktoberfest No. 5 that just kicked off (5 PRs and you get a cool t-shirt)

Don't be shy, share your work!

## Community Pull Requests

A good 10+ PRs this week, amazing!

Check out the awesome PRs by our community for PCC54 (or from fork: git checkout community && git merge upstream/community):

### Featured

vipinreyo's Clipboard Viewer

Lanseuo's Clipboard

### PCC54 Lessons

Refreshed pypeclip and sqlite modules. PyQT5 documentation is evolving. Hence there are not much code available in the public domain to play around with, which is a constraint in designing GUIs for Python apps using QT.

I had to really think about how to monitor the clipboard and copy the text from it just ONCE, ie, no immediate duplicates. It was more the thought process around it.

I learned some new things about tkinter

Gave me the chance to finally play with Python 3.7's dataclasses, although not by much.

Really nice one to practice various skills. I made a clipboard cache queue, a bit like vim buffers (used: deque, clear terminal, class, property, pyperclip, termcolor)
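The "no immediate duplicates" and "clipboard cache queue" ideas from the lessons above can be sketched roughly as follows. This is only an illustration, not any contributor's actual solution: the `ClipboardHistory` class and its method names are made up for this example, and the clipboard polling itself (for instance with pyperclip) is left out so the snippet runs anywhere.

```python
from collections import deque


class ClipboardHistory:
    """Hypothetical bounded clipboard history that skips immediate duplicates."""

    def __init__(self, maxlen=5):
        # A deque with maxlen silently drops the oldest entry once it is full
        self._items = deque(maxlen=maxlen)

    def push(self, text):
        """Store text unless it is empty or equal to the most recent entry."""
        if not text or (self._items and self._items[-1] == text):
            return False
        self._items.append(text)
        return True

    @property
    def items(self):
        # Return a copy so callers cannot modify the internal queue
        return list(self._items)


history = ClipboardHistory(maxlen=3)
# Simulated clipboard readings; a real tool would poll the clipboard instead
for clip in ["a", "a", "b", "b", "a", "c", "d"]:
    history.push(clip)

print(history.items)  # ['a', 'c', 'd']
```

Only the latest entry is compared, so the same text can reappear later in the history; it is just never stored twice in a row.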

## Read Code for Fun and Profit

You can look at all submitted code here and/or on our Community branch.

Other learnings we spotted in Pull Requests for other challenges this week:

(PCC01) How the with statement works in Python

(PCC13) I tweaked your tests in order to make it pass with my data structure.

(PCC39) Played around with 'fixture' and the scope of the fixture.

(PCC47) This one was time consuming because I had to look up how to graph all of these, but it was an excellent learning exercise!

(PCC51) Expanded my skills of working with the databases within python and brushed up on some rusty SQL skills

Thanks to everyone for your participation in our blog code challenges! Keep the PRs coming and include a README.md with one or more screenshots if you want to be featured in this weekly review post.

Keep the PRs coming, again this month it counts for Hacktoberfest!

## Need more Python Practice?

Subscribe to our blog (sidebar) to get a new PyBites Code Challenge (PCC) in your inbox every start of the week.

And/or take any of our 50+ challenges on our platform.

Prefer coding self contained Python exercises in the comfort of your browser? Try our growing collection of Bites of Py.

Want to do the #100DaysOfCode but not sure what to work on? Take our course and/or start logging your progress on our platform.

Keep Calm and Code in Python!

-- Bob and Julian

16 Oct 2018 10:40am GMT

#### Python Bytes: #99 parse - the regex antidote in Python

16 Oct 2018 8:00am GMT

#### Mike Driscoll: Testing Jupyter Notebooks

The more programming you do, the more you will hear about how you should test your code. You will hear about things like Extreme Programming and Test Driven Development (TDD). These are great ways to create quality code. But how does testing fit in with Jupyter? Frankly, it really doesn't. If you want to test your code properly, you should write your code outside of Jupyter and import it into cells if you need to. This allows you to use Python's unittest module or py.test to write tests for your code separately from Jupyter. This will also let you add on test runners like nose or put your code into a Continuous Integration setup using something like Travis CI or Jenkins.

However, all is not lost. You can do some testing of your Jupyter Notebooks even though you won't have the full flexibility that you would get from keeping your code separate. We will look at some ideas that you can use to do some basic testing with Jupyter.

### Execute and Check

One popular method of "testing" a Notebook is to run it from the command line and send its output to a file. Here is the example syntax that you could use if you wanted to do the execution on the command line:

jupyter-nbconvert --to notebook --execute --output output_file_path input_file_path


Of course, we want to do this programmatically and we want to be able to capture errors. To do that, we will take our Notebook runner code from my exporting Jupyter Notebook article and re-use it. Here it is again for your convenience:

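# notebook_runner.py

import nbformat
import os

from nbconvert.preprocessors import ExecutePreprocessor

def run_notebook(notebook_path):
    nb_name, _ = os.path.splitext(os.path.basename(notebook_path))
    dirname = os.path.dirname(notebook_path)

    # Load the Notebook into memory
    with open(notebook_path) as f:
        nb = nbformat.read(f, as_version=4)

    proc = ExecutePreprocessor(timeout=600, kernel_name='python3')
    # Keep executing after a failing cell so every error gets collected
    proc.allow_errors = True

    # Run all the cells, using the Notebook's own folder as the working directory
    proc.preprocess(nb, {'metadata': {'path': dirname or '.'}})
    output_path = os.path.join(dirname, '{}_all_output.ipynb'.format(nb_name))

    # Save the executed Notebook, outputs included
    with open(output_path, mode='wt') as f:
        nbformat.write(nb, f)

    # Collect every error output produced by any cell
    errors = []
    for cell in nb.cells:
        if 'outputs' in cell:
            for output in cell['outputs']:
                if output.output_type == 'error':
                    errors.append(output)

    return nb, errors

if __name__ == '__main__':
    nb, errors = run_notebook('Testing.ipynb')
    print(errors)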


You will note that I have updated the code to run a new Notebook. Let's go ahead and create a Notebook that has two cells of code in it. After creating the Notebook, change the title to Testing and save it. That will cause Jupyter to save the file as Testing.ipynb. Now enter the following code in the first cell:

def add(a, b):
    return a + b



And enter the following code into cell #2:

1 / 0


Now you can run the Notebook runner code. When you do, you should get the following output:

[{'ename': 'ZeroDivisionError',
'evalue': 'integer division or modulo by zero',
'output_type': 'error',
'traceback': ['\x1b[0;31m\x1b[0m',
'\x1b[0;31mZeroDivisionError\x1b[0mTraceback (most recent call '
'last)',
'\x1b[0;32m<ipython-input-2-bc757c3fda29>\x1b[0m in '
'\x1b[0;36m<module>\x1b[0;34m()\x1b[0m\n'
'\x1b[0;32m----> 1\x1b[0;31m \x1b[0;36m1\x1b[0m '
'\x1b[0;34m/\x1b[0m '
'\x1b[0;36m0\x1b[0m\x1b[0;34m\x1b[0m\x1b[0m\n'
'\x1b[0m',
'\x1b[0;31mZeroDivisionError\x1b[0m: integer division or '
'modulo by zero']}]


This indicates that we have some code that outputs an error. In this case, we did expect that as this is a very contrived example. In your own code, you probably wouldn't want any of your code to output an error. Regardless, this Notebook runner script isn't enough to actually do a real test. You need to wrap this code with testing code. So let's create a new file that we will save to the same location as our Notebook runner code. We will save this script with the name "test_runner.py". Put the following code in your new script:

import unittest

# notebook_runner.py is the Notebook runner script shown above
import notebook_runner

class TestNotebook(unittest.TestCase):

    def test_runner(self):
        nb, errors = notebook_runner.run_notebook('Testing.ipynb')
        self.assertEqual(errors, [])

if __name__ == '__main__':
    unittest.main()


This code uses Python's unittest module. Here we create a testing class with a single test function inside of it called test_runner. This function calls our Notebook runner and asserts that the errors list should be empty. To run this code, open up a terminal and navigate to the folder that contains your code. Then run the following command:

python test_runner.py


When I ran this, I got the following output:

F
======================================================================
FAIL: test_runner (__main__.TestNotebook)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_runner.py", line 10, in test_runner
self.assertEqual(errors, [])
AssertionError: Lists differ: [{'output_type': u'error', 'ev... != []

First list contains 1 additional elements.
First extra element 0:
{'ename': 'ZeroDivisionError',
'evalue': 'integer division or modulo by zero',
'output_type': 'error',
'traceback': ['\x1b[0;31m---------------------------------------------------------------------------\x1b[0m',
'\x1b[0;31mZeroDivisionError\x1b[0m                         '
'Traceback (most recent call last)',
'\x1b[0;32m<ipython-input-2-bc757c3fda29>\x1b[0m in '
'\x1b[0;36m<module>\x1b[0;34m()\x1b[0m\n'
'\x1b[0;32m----> 1\x1b[0;31m \x1b[0;36m1\x1b[0m '
'\x1b[0;34m/\x1b[0m \x1b[0;36m0\x1b[0m\x1b[0;34m\x1b[0m\x1b[0m\n'
'\x1b[0m',
'\x1b[0;31mZeroDivisionError\x1b[0m: integer division or modulo '
'by zero']}

Diff is 677 characters long. Set self.maxDiff to None to see it.

----------------------------------------------------------------------
Ran 1 test in 1.463s

FAILED (failures=1)


This clearly shows that our code failed. If you remove the cell that has the divide by zero issue and re-run your test, you should get this:

.
----------------------------------------------------------------------
Ran 1 test in 1.324s

OK


By removing the cell (or just correcting the error in that cell), you can make your tests pass.
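If a Notebook is supposed to raise a particular error (as our contrived cell does), the test could check the collected errors by exception name instead of demanding an empty list. The following is a minimal sketch of that idea. The helper names `unexpected_errors` and `missing_errors` are made up for this example, and the sample data is a hand-written stand-in for what `run_notebook` returns. It relies only on each error output being dictionary-like with an 'ename' key, as in the output shown earlier.

```python
def unexpected_errors(errors, allowed=()):
    """Return the error outputs whose exception name is not in `allowed`."""
    return [err for err in errors if err.get('ename') not in allowed]


def missing_errors(errors, expected):
    """Return the expected exception names that never showed up."""
    seen = {err.get('ename') for err in errors}
    return [name for name in expected if name not in seen]


# Hand-written stand-in for the list returned by run_notebook()
sample_errors = [
    {'ename': 'ZeroDivisionError',
     'evalue': 'division by zero',
     'output_type': 'error'},
]

print(unexpected_errors(sample_errors, allowed=('ZeroDivisionError',)))  # []
print(missing_errors(sample_errors, expected=('ZeroDivisionError', 'KeyError')))  # ['KeyError']
```

Inside a test method you could then assert that both helpers return an empty list for the errors you expect.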

### The py.test Plugin

I discovered a neat plugin that makes this workflow a bit easier. I am referring to nbval, the py.test plugin for Jupyter, which you can learn more about here.

Basically it gives py.test the ability to recognize Jupyter Notebooks, re-run their cells, check whether the resulting outputs match the outputs stored in the Notebook and also verify that Notebooks run without error. After installing the nbval package, you can run it with py.test like this (assuming you have py.test installed):

py.test --nbval


Frankly, you can also run plain py.test with no extra options on the test file we already created and it will use our test code as is. The main benefit of adding nbval is that you won't necessarily need to write wrapper code around Jupyter.

### Testing within the Notebook

Another way to run tests is to just include some tests in the Notebook itself. Let's add a new cell to our Testing Notebook that contains the following code:

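import unittest

class TestNotebook(unittest.TestCase):

    def test_add(self):
        # add() is the function defined in the first cell
        self.assertEqual(add(2, 2), 4)
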

This will test the add function in the first cell eventually. We could add a bunch of different tests here. For example, we might want to test what happens if we add a string type with a None type. But you may have noticed that if you try to run this cell, you get no output. The reason is that nothing is running the test class yet. We need to call unittest.main to do that. So while it's good to run that cell to get it into Jupyter's memory, we actually need to add one more cell with the following code:

unittest.main(argv=[''], verbosity=2, exit=False)


This code should be put in the last cell of your Notebook so it can run all the tests that you have added. It tells unittest to ignore the command-line arguments that the Jupyter kernel was started with (argv=['']), to run with a verbosity level of 2 and not to exit the interpreter when the tests finish (exit=False). When you run this code you should see the following output in your Notebook:

test_add (__main__.TestNotebook) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.003s

OK

<unittest.main.TestProgram at 0x7fbc8fffc0d0>
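As a rough sketch of the string-plus-None case mentioned earlier, a second test class could look something like the following. The class name `TestAddEdgeCases` is made up for this example, and `add` is repeated here only so the snippet runs on its own. The suite is run with a TextTestRunner rather than unittest.main simply to make the result object easy to inspect.

```python
import unittest


def add(a, b):
    # Same function as in the first Notebook cell
    return a + b


class TestAddEdgeCases(unittest.TestCase):

    def test_add_strings(self):
        # The + operator concatenates strings
        self.assertEqual(add('a', 'b'), 'ab')

    def test_add_string_and_none(self):
        # Adding None to a string is not supported and raises TypeError
        with self.assertRaises(TypeError):
            add('a', None)


# Load and run only this test class; nothing here exits the interpreter
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAddEdgeCases)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print(result.wasSuccessful())
```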


You can do something similar with Python's doctest module inside of Jupyter Notebooks as well.
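Here is one way that might look. It is a sketch rather than the only approach: the docstring examples are invented for illustration, and the doctest is parsed and run explicitly so the snippet behaves the same inside and outside a Notebook.

```python
import doctest


def add(a, b):
    """Return the sum of a and b.

    >>> add(2, 2)
    4
    >>> add('a', 'b')
    'ab'
    """
    return a + b


# Build a doctest from the docstring and run it with add() in scope
parser = doctest.DocTestParser()
test = parser.get_doctest(add.__doc__, {'add': add}, 'add', None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(results.failed, results.attempted)
```

In a Notebook cell, calling doctest.testmod() is the usual shortcut, since it collects the docstrings of everything defined in the Notebook's __main__ namespace.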

### Wrapping Up

As I mentioned at the beginning, while you can test your code in your Jupyter Notebooks, it is actually much better if you just test your code outside of it. However there are workarounds and since some people like to use Jupyter for documentation purposes, it is good to have a way to verify that they are working correctly. In this chapter you learned how to run Notebooks programmatically and verify that the output was as you expected. You could enhance that code to verify certain errors are present if you wanted to as well.

You also learned how to use Python's unittest module in your Notebook cells directly. This does offer some nice flexibility as you can now run your code all in one place. Use these tools wisely and they will serve you well.

16 Oct 2018 5:05am GMT

## 10 Nov 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: King William's Town Train Station

Yesterday morning I had to go to the station in KWT to pick up our reserved bus tickets for the Christmas holidays in Cape Town. The station itself has had no train service since December for cost reasons, but Translux and co., the long-distance bus companies, have their offices there.

10 Nov 2011 10:57am GMT

## 09 Nov 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein

Nobody worries about this sort of thing: by car you simply drive through, and in the city, near Gnobie, the word was "nah, it only gets dangerous once the fire brigade is there". 30 minutes later, on the way back, the fire brigade was there.

09 Nov 2011 8:25pm GMT

## 08 Nov 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: Brai Party

Brai = a barbecue evening or similar.

The would-be technicians patching up their SpeakOn / phone-jack plug splitters...

The ladies ("Mamas") of the settlement at the official opening speech

Even though fewer people came than expected, there was loud music and plenty of people...

And of course a fire with real wood for the barbecue.

08 Nov 2011 2:30pm GMT

## 07 Nov 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux Server back to her room. While doing that we saw her new decoration.

Björn and Simphiwe carried the PC to Katja's school

07 Nov 2011 2:00pm GMT

## 06 Nov 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's guest family for a special party. First of all we visited some friends of Nelisa (yes, the one I'm working with in Quigney, Katja's guest father's sister), who gave her a haircut.

African women usually get their hair done by arranging extensions, not just by cutting some hair like Europeans.

In between she looked like this...

And then she was done. It looks amazing considering the amount of hair she had last week, doesn't it?

06 Nov 2011 7:45pm GMT

## 05 Nov 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: My Saturday

Somehow it struck me today that I need to restructure my blog posts a bit: if I only ever report on new places, I would have to be on a permanent round trip. So here are a few things from my everyday life today.

First of all, Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm. Katja and Björn are by now at their placements, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza.
Robin, Rommel's wife, has been in Woodie Cape since Thursday to take care of a few things there.
Anyway, this morning we first treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist: Vodacom and Ethienne (the estate agent), plus dropping off the missing items in NeedsCamp on the way back.

Just after we had set off on the dirt road, we realised that we had not packed the things for NeedsCamp and Ethienne, but did have the pump for the water supply in the car.

So in East London we first drove to Farmerama (no, not the online game Farmville, but a shop with lots of things for a farm) in Berea, a northern part of town.

At Farmerama we got advice on a quick-release coupling that should make life with the pump easier, and we also dropped off a lighter pump for repair, so that it is not always such a big effort when the water has run out again.

Fego Caffé is in the Hemmingways Mall; there we had to ask for the PIN and PUK of one of our data SIM cards, because two digits had unfortunately been swapped when entering the PIN. In any case, shops in South Africa store data as sensitive as a PUK, which in principle gives access to a locked phone.

In the café Rommel then carried out a few online transactions with the 3G modem, which was now working again and which, by the way, now works perfectly under Ubuntu, my Linux system.

In the meantime I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltop's centres. The picture shows the UMTS coverage in NeedsCamp, Katja's village. 8ta is a new telephone provider from Telkom; after Vodafone bought Telkom's shares in Vodacom, they have to build everything up from scratch.
We decided to get hold of a free prepaid card to test, because who knows how accurate the map above is... Before signing even the cheapest deal for 24 months, you should know whether it works.

After that we went to Checkers in Vincent, looking for two hotplates for WoodyCape: R 129.00 each, so about 12€ for a two-burner hotplate.
As you can see in the background, there are already Christmas decorations, at the beginning of November, and that in South Africa in sunny, warm weather of at least 25°C.

We treated ourselves to lunch at a Pakistani curry takeaway. Highly recommended!
And after we got back an hour or so ago, I cleaned the fridge, which I had simply put outside this morning to defrost. Now it is clean again and without a 3 m thick layer of ice...

Tomorrow... well, I will report on that separately... but probably not until Monday, because then I will be in Quigney (East London) again and have free internet.

05 Nov 2011 4:33pm GMT

## 31 Oct 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltop's Computer Centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

 Pupils in the big classroom

 The Trainer

 School in Countryside

 "Town"

31 Oct 2011 4:58pm GMT

#### Benedict Stein: Technical Issues

What do you do in an internet cafe when your ADSL and fax line have been discontinued before the end of the month? Well, my idea was to sit outside and eat some ice cream.
At least it's sunny and not as rainy as on the weekend.

31 Oct 2011 3:11pm GMT

## 30 Oct 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: Nellis Restaurant

For those who are traveling through Zastron: there is a very nice restaurant which serves delicious food at reasonable prices.

 interior

 home made specialities - the shop in the shop

 the Bar

30 Oct 2011 4:47pm GMT

## 29 Oct 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: The way back from J'burg

On the 10 to 12 hour trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadsides.

 Plain Street

 Orange River in its beginnings (near Lesotho)

 Zastron Anglican Church

 The Bridge in Between "Free State" and Eastern Cape next to Zastron

 my new Background ;)

 If you listen to Google Maps you'll end up traveling 50 km of gravel road. As it had just been renewed, we didn't have that many problems and saved 1h compared to going the official way with all its construction sites

 Freeway

 getting dark

29 Oct 2011 4:23pm GMT

## 28 Oct 2011

### Python Software Foundation | GSoC'11 Students

#### Benedict Stein: How does a construction site actually work?

Sure, some things may be different and many the same, but take a road construction site, an everyday sight in Germany: how does that actually work in South Africa?

First of all: NO, there are no natives digging with their hands. Even though more manpower is used here, they are busy working with technology.

 An ordinary "Bundesstraße" (national road)

 and how it is being widened

 loooots of trucks

 because here one side is closed completely over a long stretch, resulting in traffic-light control with, in this case, a 45-minute wait

 But at least they seem to be having fun ;) As did we, since luckily we never had to wait longer than 10 minutes.