22 Aug 2019


Stack Abuse: Python for NLP: Creating Multi-Data-Type Classification Models with Keras

This is the 18th article in my series of articles on Python for NLP. In my previous article, I explained how to create a deep learning-based movie sentiment analysis model using Python's Keras library. In that article, we saw how to perform sentiment analysis of user reviews of different movies on IMDB, using the text of the review to predict the sentiment.

However, in text classification tasks, we can also make use of non-textual information to classify the text. For instance, gender may have an impact on the sentiment of the review. Furthermore, nationality may affect public opinion about a particular movie. This associated information, also known as meta data, can therefore also be used to improve the accuracy of a statistical model.

In this article, we will build upon the concepts that we studied in the last two articles and will see how to create a text classification system that classifies user reviews of different businesses into one of three predefined categories, i.e. "good", "bad", and "average". However, in addition to the text of the review, we will use the associated meta data of the review to perform classification. Since we have two different types of inputs, i.e. textual input and numerical input, we need to create a multiple-input model. We will be using the Keras Functional API since it supports models with multiple inputs and multiple outputs.

After reading this article, you will be able to create a deep learning model in Keras that is capable of accepting multiple inputs, concatenating the two outputs and then performing classification or regression using the aggregated input.

Before we dive into the details of creating such a model, let's first briefly review the dataset that we are going to use.

The Dataset

The dataset for this article can be downloaded from this Kaggle link. The dataset contains multiple files, but we are only interested in the yelp_review.csv file. The file contains more than 5.2 million reviews about different businesses, including restaurants, bars, dentists, doctors, beauty salons, etc. For our purposes we will only be using the first 50,000 records to train our model. Download the dataset to your local machine.

Let's first import all the libraries that we will be using in this article before importing the dataset.

from numpy import array
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
# All layers are imported from the public keras.layers namespace
from keras.layers import Activation, Dropout, Dense, Flatten, LSTM
from keras.layers import GlobalMaxPooling1D, Embedding, Input, Concatenate
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import re

As a first step, we need to load the dataset. The following script does that:

yelp_reviews = pd.read_csv("/content/drive/My Drive/yelp_review_short.csv")

The dataset contains a column stars that holds the ratings for different businesses. The stars column can have values between 1 and 5. We will simplify our problem by converting the numerical values for the reviews into categorical ones. We will add a new column reviews_score to our dataset. If the user review has a value of 1 in the stars column, the reviews_score column will have the string value bad. If the rating is 2 or 3 in the stars column, the reviews_score column will contain the value average. Finally, a review rating of 4 or 5 will have a corresponding value of good in the reviews_score column.

The following script performs this preprocessing:

bins = [0,1,3,5]
review_names = ['bad', 'average', 'good']
yelp_reviews['reviews_score'] = pd.cut(yelp_reviews['stars'], bins, labels=review_names)
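To see how the bin edges map star ratings to labels, here is a minimal sketch on five hand-made ratings (the toy Series is an assumption for illustration, not part of the dataset). By default, pd.cut treats each bin as open on the left and closed on the right, so the edges [0, 1, 3, 5] produce the intervals (0, 1], (1, 3], and (3, 5].

```python
import pandas as pd

# Toy ratings standing in for the 'stars' column (illustrative only).
stars = pd.Series([1, 2, 3, 4, 5])

bins = [0, 1, 3, 5]
review_names = ['bad', 'average', 'good']

# right=True is the default: each bin includes its right edge.
labels = pd.cut(stars, bins, labels=review_names)

# Pair every rating with the label it was assigned.
mapping = dict(zip(stars.tolist(), labels.astype(str).tolist()))
print(mapping)
```

A rating that falls outside every bin (for example 0) would become NaN rather than raise an error.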

Next, we will check whether our dataframe contains any NULL values, and then print the shape and the header of the dataset.

yelp_reviews.isnull().values.any()

print(yelp_reviews.shape)

yelp_reviews.head()

In the output you will see (50000,10), which means that our dataset contains 50,000 records with 10 columns. The header of the yelp_reviews dataframe looks like this:

[Image: the first five rows of the yelp_reviews dataframe]

You can see the 10 columns that our dataframe contains, including the newly added reviews_score column. The text column contains the text of the review, while the useful column contains a numerical value that represents the number of people who found the review useful. Similarly, the funny and cool columns contain the counts of people who found the review funny or cool, respectively.

Let's randomly choose a review. If you look at the 4th review (review with index 3), it has 4 stars and hence it is marked as good. Let's view the complete text of this review:

print(yelp_reviews["text"][3])

The output looks like this:

Love coming here. Yes the place always needs the floor swept but when you give out  peanuts in the shell how won't it always be a bit dirty.

The food speaks for itself, so good. Burgers are made to order and the meat is put on the grill when you order your sandwich. Getting the small burger just means 1 patty, the regular is a 2 patty burger which is twice the deliciousness.

Getting the Cajun fries adds a bit of spice to them and whatever size you order they always throw more fries (a lot more fries) into the bag.

You can clearly see that this is a positive review.

Let's now plot the number of good, average, and bad reviews.

import seaborn as sns

sns.countplot(x='reviews_score', data=yelp_reviews)

[Image: count plot of the reviews_score column]

It is evident from the above plot that the majority of the reviews are good, followed by the average reviews. The number of bad reviews is very small.

We have preprocessed our data and now we will create three models in this article. The first model will only use text inputs for predicting whether a review is good, average, or bad. In the second model, we will not use text. We will only use the meta information such as useful, funny, and cool to predict the sentiment of the review. Finally, we will create a model that accepts multiple inputs i.e. text and meta information for text classification.

Creating a Model with Text Inputs Only

The first step is to define a function that cleans the textual data.

def preprocess_text(sen):

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
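As a quick check of what the three substitutions do, the sketch below repeats the function and runs it on two made-up sentences (the sample strings are assumptions, not taken from the dataset). Printing with repr makes any leftover leading or trailing space visible, since the function collapses whitespace but does not strip it.

```python
import re

def preprocess_text(sen):
    # Same three steps as the function above.
    sentence = re.sub('[^a-zA-Z]', ' ', sen)             # keep letters only
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)  # drop isolated single letters
    sentence = re.sub(r'\s+', ' ', sentence)             # collapse runs of whitespace
    return sentence

# Made-up review snippets, used only for illustration.
cleaned = preprocess_text("The burger was 10/10, really good!")
no_single_letters = preprocess_text("I ate a burger")

print(repr(cleaned))
print(repr(no_single_letters))
```

Note that a single letter at the very start of the text is kept, because the second pattern requires whitespace on both sides of the letter.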

Since we are only using text in this model, we will take all the text reviews and store them in a list. The text reviews will be cleaned using the preprocess_text function, which removes punctuation and numbers from the text.

X = []
sentences = list(yelp_reviews["text"])
for sen in sentences:
    X.append(preprocess_text(sen))

y = yelp_reviews['reviews_score']

Our X variable here contains the text reviews while the y variable contains the corresponding reviews_score values. The reviews_score column has data in text format. We need to convert the text to a one-hot encoded vector. We can use the to_categorical method from the keras.utils module. However, first we have to convert the text into integer labels using the LabelEncoder class from the sklearn.preprocessing module.

from sklearn import preprocessing

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode the labels from the 'reviews_score' column as integers.
y = label_encoder.fit_transform(y)

Let's now divide our data into testing and training sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

Now we can convert both the training and test labels into one-hot encoded vectors:

from keras.utils import to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
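The two-step label conversion can be illustrated with a NumPy-only sketch on toy labels. This is not how the libraries are implemented internally; it only mirrors the observable result. LabelEncoder assigns integer codes in sorted order of the class names, so here average becomes 0, bad becomes 1, and good becomes 2, and one-hot encoding turns each code into a vector with a single 1.

```python
import numpy as np

# Toy labels standing in for the reviews_score column (illustrative only).
labels = np.array(['good', 'bad', 'average', 'good', 'average'])

# np.unique returns the sorted class names and, with return_inverse=True,
# the integer code of each label (its position in the sorted class list).
classes, codes = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes))[codes]

print(classes)
print(codes)
print(one_hot)
```

Knowing the column order matters later: the three softmax outputs of the model correspond to average, bad, and good, in that order.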

I explained in my article on word embeddings that textual data has to be converted into some sort of numeric form before it can be used by statistical algorithms such as machine learning and deep learning models. One way to convert text to numbers is via word embeddings. If you are unaware of how to implement word embeddings via Keras, I highly recommend that you read this article before moving on to the next sections of the code.

The first step in word embeddings is to convert the words into their corresponding numeric indexes. To do so, we can use the Tokenizer class from the keras.preprocessing.text module.

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
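The idea behind the tokenizer can be sketched in plain Python. The helpers build_word_index and to_sequences below are hypothetical, simplified stand-ins (they only lowercase and split on whitespace) and are not the Keras implementation. They illustrate three points: words are ranked by frequency starting at index 1, words unseen during fitting are dropped, and only words whose index is below num_words are kept.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace known words by their index; skip unknown or too-rare words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

# Toy corpus (illustrative only).
train_texts = ["good food good service", "bad food"]
word_index = build_word_index(train_texts)

sequences = to_sequences(["good food", "terrible food"], word_index)
limited = to_sequences(["good food"], word_index, num_words=2)
print(word_index, sequences, limited)
```

This is also why the tokenizer is fitted on the training texts only: the test set is encoded with the vocabulary learned from the training data.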

Sentences can have different lengths, and therefore the sequences returned by the Tokenizer class also have variable lengths. We specify that the maximum length of a sequence will be 200 (although you can try any number). For sentences shorter than 200 tokens, the remaining positions will be padded with zeros at the end (padding='post'). Sequences longer than 200 tokens will be truncated to 200; note that pad_sequences truncates from the beginning of the sequence by default, since the truncating argument is left at its default value of 'pre'.

Look at the following script:

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
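The effect of these settings on a single sequence can be mimicked with a small helper. The function pad_post below is a hypothetical illustration, not part of Keras; it reproduces padding='post' together with the default truncating='pre'.

```python
def pad_post(sequence, maxlen):
    """Mimic pad_sequences(padding='post') with its default truncating='pre'."""
    # Keep the last maxlen tokens when the sequence is too long...
    trimmed = sequence[-maxlen:]
    # ...and append zeros when it is too short.
    return trimmed + [0] * (maxlen - len(trimmed))

short = pad_post([5, 8, 2], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

If the beginning of a long review should be kept instead of its end, pad_sequences accepts truncating='post'.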

Next, we need to load the pre-trained GloVe word embeddings.

from numpy import array
from numpy import asarray
from numpy import zeros

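embeddings_dictionary = dict()

# Open the GloVe file before reading it; the path matches the one used later in this article
glove_file = open('/content/drive/My Drive/glove.6B.100d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions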

glove_file.close()

Finally, we will create an embedding matrix in which the number of rows equals the number of words in the vocabulary plus 1 (index 0 is reserved for padding). The number of columns will be 100 since each word in the GloVe word embeddings that we loaded is represented as a 100-dimensional vector.

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
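The same loop is easier to follow on a toy example. In the sketch below, the 3-dimensional vectors and the three-word vocabulary are made up for illustration; the point is that row 0 (padding) and the row of any word without a pre-trained vector stay all zeros.

```python
import numpy as np

# Toy stand-ins: 3-dimensional vectors instead of GloVe's 100 dimensions.
embeddings_dictionary = {
    'good': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'food': np.array([0.4, 0.5, 0.6], dtype='float32'),
}
word_index = {'good': 1, 'food': 2, 'yummmy': 3}  # 'yummmy' has no vector

vocab_size = len(word_index) + 1   # +1 because index 0 is reserved for padding
embedding_matrix = np.zeros((vocab_size, 3))

for word, index in word_index.items():
    vector = embeddings_dictionary.get(word)
    if vector is not None:
        embedding_matrix[index] = vector   # copy the pre-trained vector into its row

print(embedding_matrix)
```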

Once the word embedding step is completed, we are ready to create our model. We will be using Keras' functional API. Single-input models like the one we are creating now can be developed using the sequential API as well, but since the multiple-input model that we develop in a later section can only be built using the Keras functional API, we will stick to the functional API in this section too.

We will create a very simple model with an input layer, an embedding layer, one LSTM layer with 128 neurons, and one dense layer that acts as the output layer. Since we have 3 possible outputs, the output layer has 3 neurons and uses the softmax activation function. We will use categorical_crossentropy as our loss function and adam as the optimizer.

deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(deep_inputs)
LSTM_Layer_1 = LSTM(128)(embedding_layer)
dense_layer_1 = Dense(3, activation='softmax')(LSTM_Layer_1)
model = Model(inputs=deep_inputs, outputs=dense_layer_1)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Let's print the summary of our model:

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 200)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 200, 100)          5572900
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 387
=================================================================
Total params: 5,690,535
Trainable params: 117,635
Non-trainable params: 5,572,900
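The parameter counts in the summary above can be reproduced by hand. The short calculation below uses the standard formulas for LSTM and dense layers and only the numbers printed in the summary.

```python
embedding_dim = 100
lstm_units = 128
num_classes = 3

# An LSTM has four weight blocks (input, forget, and output gates plus the
# cell candidate), each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# A dense layer has one weight per input-output pair plus one bias per output.
dense_params = lstm_units * num_classes + num_classes

# The embedding layer stores one 100-dimensional row per vocabulary index,
# so the vocabulary size can be recovered from its parameter count.
embedding_params = 5572900
vocab_size = embedding_params // embedding_dim

print(lstm_params, dense_params, vocab_size)
```

The embedding parameters are reported as non-trainable because the layer was created with trainable=False, which keeps the GloVe vectors fixed during training.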

Finally, let's print the block diagram of our neural network:

from keras.utils import plot_model
plot_model(model, to_file='model_plot1.png', show_shapes=True, show_layer_names=True)

The file model_plot1.png will be created in your local file path. If you open the image, it will look like this:

[Image: block diagram of the text-only model]

You can see that the model has 1 input layer, 1 embedding layer, 1 LSTM, and one dense layer which serves as the output layer as well.

Let's now train our model:

history = model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=1, validation_split=0.2)

The model will be trained on 80% of the training data and validated on the remaining 20%. The results for the 10 epochs are as follows:

Train on 32000 samples, validate on 8000 samples
Epoch 1/10
32000/32000 [==============================] - 81s 3ms/step - loss: 0.8640 - acc: 0.6623 - val_loss: 0.8356 - val_acc: 0.6730
Epoch 2/10
32000/32000 [==============================] - 80s 3ms/step - loss: 0.8508 - acc: 0.6618 - val_loss: 0.8399 - val_acc: 0.6690
Epoch 3/10
32000/32000 [==============================] - 84s 3ms/step - loss: 0.8461 - acc: 0.6647 - val_loss: 0.8374 - val_acc: 0.6726
Epoch 4/10
32000/32000 [==============================] - 82s 3ms/step - loss: 0.8288 - acc: 0.6709 - val_loss: 0.7392 - val_acc: 0.6861
Epoch 5/10
32000/32000 [==============================] - 82s 3ms/step - loss: 0.7444 - acc: 0.6804 - val_loss: 0.6371 - val_acc: 0.7311
Epoch 6/10
32000/32000 [==============================] - 83s 3ms/step - loss: 0.5969 - acc: 0.7484 - val_loss: 0.5602 - val_acc: 0.7682
Epoch 7/10
32000/32000 [==============================] - 82s 3ms/step - loss: 0.5484 - acc: 0.7623 - val_loss: 0.5244 - val_acc: 0.7814
Epoch 8/10
32000/32000 [==============================] - 86s 3ms/step - loss: 0.5052 - acc: 0.7866 - val_loss: 0.4971 - val_acc: 0.7950
Epoch 9/10
32000/32000 [==============================] - 84s 3ms/step - loss: 0.4753 - acc: 0.8032 - val_loss: 0.4839 - val_acc: 0.7965
Epoch 10/10
32000/32000 [==============================] - 82s 3ms/step - loss: 0.4539 - acc: 0.8110 - val_loss: 0.4622 - val_acc: 0.8046

You can see that the final training accuracy of the model is 81.10% while the validation accuracy is 80.46%. The difference is very small, and therefore we assume that our model is not overfitting on the training data.
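The sample counts in the training log follow from the two splits applied so far, as the small calculation below shows. It is also worth knowing that validation_split takes the last fraction of the arrays passed to fit, before any shuffling, as the validation set.

```python
total_records = 50000

# train_test_split(test_size=0.20) holds out one fifth of the records for testing.
test_records = total_records // 5
train_records = total_records - test_records

# model.fit(validation_split=0.2) holds out one fifth of the training records.
val_records = train_records // 5
fit_records = train_records - val_records

print(test_records, fit_records, val_records)
```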

Let's now evaluate the performance of our model on test set:

score = model.evaluate(X_test, y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

The output looks like this:

10000/10000 [==============================] - 37s 4ms/step
Test Score: 0.4592904740810394
Test Accuracy: 0.8101

Finally, let's plot the values for loss and accuracy for both the training and validation sets:

import matplotlib.pyplot as plt

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()

You should see the following two plots:

[Image: accuracy and loss curves for the text-only model]

You can see that the lines for the training and validation accuracies and losses are pretty close to each other, which means that the model is not overfitting.

Creating a Model with Meta Information Only

In this section, we will create a classification model that uses information from the useful, funny, and cool columns of the Yelp reviews. Since the data in these columns is well structured and doesn't contain any sequential or spatial pattern, we can use a simple densely connected neural network to make predictions.

Let's plot the average counts for useful, funny, and cool reviews against the review score.

import seaborn as sns
sns.barplot(x='reviews_score', y='useful', data=yelp_reviews)

[Image: bar plot of the average useful count for each reviews_score value]

From the output, you can see that the average count for reviews marked as useful is the highest for the bad reviews, followed by the average reviews and the good reviews.

Let's now plot the average count for funny reviews:

sns.barplot(x='reviews_score', y='funny', data=yelp_reviews)

[Image: bar plot of the average funny count for each reviews_score value]

The output shows that again, the average count for reviews marked as funny is highest for the bad reviews.

Finally, let's plot the average value for the cool column against the reviews_score column. We expect that the average count for the cool column will be the highest for good reviews since people often mark positive or good reviews as cool:

sns.barplot(x='reviews_score', y='cool', data=yelp_reviews)

[Image: bar plot of the average cool count for each reviews_score value]

As expected, the average cool count is highest for the good reviews. From this information, we can reasonably assume that the count values in the useful, funny, and cool columns have some correlation with the reviews_score column. Therefore, we will try to use the data from these three columns to train our algorithm that predicts the value of the reviews_score column.

Let's filter these three columns from our dataset:

yelp_reviews_meta = yelp_reviews[['useful', 'funny', 'cool']]

X = yelp_reviews_meta.values

y = yelp_reviews['reviews_score']

Next, we will convert our labels into one-hot encoded values and then split our data into train and test sets:

from sklearn import preprocessing

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode the labels from the 'reviews_score' column as integers.
y = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

from keras.utils import to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

The next step is to create our model. Our model will consist of four layers (you can try any number): the input layer, two dense hidden layers with 10 neurons and relu activation functions, and finally an output dense layer with 3 neurons and softmax activation function. The loss function and optimizer will be categorical_crossentropy and adam, respectively.

The following script defines the model:

input2 = Input(shape=(3,))
dense_layer_1 = Dense(10, activation='relu')(input2)
dense_layer_2 = Dense(10, activation='relu')(dense_layer_1)
output = Dense(3, activation='softmax')(dense_layer_2)

model = Model(inputs=input2, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Let's print the summary of the model:

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 3)                 0
_________________________________________________________________
dense_1 (Dense)              (None, 10)                40
_________________________________________________________________
dense_2 (Dense)              (None, 10)                110
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 33
=================================================================
Total params: 183
Trainable params: 183
Non-trainable params: 0

Finally, the block diagram for the model can be created via the following script:

from keras.utils import plot_model
plot_model(model, to_file='model_plot2.png', show_shapes=True, show_layer_names=True)

Now, if you open the model_plot2.png file from your local file path, it looks like this:

[Image: block diagram of the meta-information model]

Let's now train the model and print the accuracy and loss values for each epoch:

history = model.fit(X_train, y_train, batch_size=16, epochs=10, verbose=1, validation_split=0.2)

Train on 32000 samples, validate on 8000 samples
Epoch 1/10
32000/32000 [==============================] - 8s 260us/step - loss: 0.8429 - acc: 0.6649 - val_loss: 0.8166 - val_acc: 0.6734
Epoch 2/10
32000/32000 [==============================] - 7s 214us/step - loss: 0.8203 - acc: 0.6685 - val_loss: 0.8156 - val_acc: 0.6737
Epoch 3/10
32000/32000 [==============================] - 7s 217us/step - loss: 0.8187 - acc: 0.6685 - val_loss: 0.8150 - val_acc: 0.6736
Epoch 4/10
32000/32000 [==============================] - 7s 220us/step - loss: 0.8183 - acc: 0.6695 - val_loss: 0.8160 - val_acc: 0.6740
Epoch 5/10
32000/32000 [==============================] - 7s 227us/step - loss: 0.8177 - acc: 0.6686 - val_loss: 0.8149 - val_acc: 0.6751
Epoch 6/10
32000/32000 [==============================] - 7s 219us/step - loss: 0.8175 - acc: 0.6686 - val_loss: 0.8157 - val_acc: 0.6744
Epoch 7/10
32000/32000 [==============================] - 7s 216us/step - loss: 0.8172 - acc: 0.6696 - val_loss: 0.8145 - val_acc: 0.6733
Epoch 8/10
32000/32000 [==============================] - 7s 214us/step - loss: 0.8175 - acc: 0.6689 - val_loss: 0.8139 - val_acc: 0.6734
Epoch 9/10
32000/32000 [==============================] - 7s 215us/step - loss: 0.8169 - acc: 0.6691 - val_loss: 0.8160 - val_acc: 0.6744
Epoch 10/10
32000/32000 [==============================] - 7s 216us/step - loss: 0.8167 - acc: 0.6694 - val_loss: 0.8138 - val_acc: 0.6736

From the output, you can see that our model barely improves: the accuracy values remain between 66% and 67% across all the epochs.

Let's see how the model performs on test set:

score = model.evaluate(X_test, y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

10000/10000 [==============================] - 0s 34us/step
Test Score: 0.8206425309181213
Test Accuracy: 0.6669
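An accuracy that stays flat can mean that a network is predicting the most frequent class for nearly every input. One way to check this, sketched below with made-up labels, is to compare the model's accuracy with the accuracy of always predicting the majority class. The helper majority_baseline is a hypothetical name introduced here for illustration; whether this explains the result above would have to be verified on the actual labels.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Toy label distribution (illustrative only, not the Yelp data).
toy_labels = ['good'] * 6 + ['average'] * 3 + ['bad'] * 1

label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)
```

On the real data, the same function could be applied to list(yelp_reviews['reviews_score']).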

We can plot the loss and accuracy values for the training and validation sets via the following script:

import matplotlib.pyplot as plt

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()

[Image: accuracy and loss curves for the meta-information model]

From the output, you can see that the accuracy values are relatively low and barely change between epochs. Hence, we can say that our model is underfitting. The accuracy might be increased by increasing the number of dense layers or the number of epochs; however, I will leave that to you.

Let's move on to the final and most important section of this article where we will use multiple inputs of different types to train our model.

Creating a Model with Multiple Inputs

In the previous sections, we saw how to train deep learning models using either textual data or meta information. What if we want to combine textual information with meta information and use that as input to our model? We can do so using the Keras functional API. In this section we will create two submodels.

The first submodel will accept textual input in the form of text reviews. This submodel will consist of an input layer, an embedding layer, and an LSTM layer of 128 neurons. The second submodel will accept input in the form of meta information from the useful, funny, and cool columns. The second submodel also consists of three layers: an input layer and two dense layers.

The output from the LSTM layer of the first submodel and the output from the second dense layer of the second submodel will be concatenated and used as input to another dense layer with 10 neurons. Finally, the output dense layer will have three neurons, one for each review type.

Let's see how we can create such a concatenated model.

First we have to create two different types of inputs. To do so, we will divide our data into a feature set and label set, as shown below:

X = yelp_reviews.drop('reviews_score', axis=1)

y = yelp_reviews['reviews_score']

The X variable contains the feature set, whereas the y variable contains the label set. We need to convert our labels into one-hot encoded vectors. We can do so using the label encoder and the to_categorical function of the keras.utils module. We will also divide our data into training and test sets.

from sklearn import preprocessing

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode the labels from the 'reviews_score' column as integers.
y = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

from keras.utils import to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

Now our label set is in the required form. Since there will be only one output, we don't need to process our label set any further. However, there will be multiple inputs to the model. Therefore, we need to preprocess our feature set.

Let's first create the preprocess_text function that will be used to preprocess our dataset:

def preprocess_text(sen):

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

As a first step, we will create textual input for the training and test set. Look at the following script:

X1_train = []
sentences = list(X_train["text"])
for sen in sentences:
    X1_train.append(preprocess_text(sen))

Now X1_train contains the textual input for the training set. Similarly, the following script preprocesses the textual input data for the test set:

X1_test = []
sentences = list(X_test["text"])
for sen in sentences:
    X1_test.append(preprocess_text(sen))

Now we need to convert textual input for the training and test sets into numeric form using word embeddings. The following script does that:

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X1_train)

X1_train = tokenizer.texts_to_sequences(X1_train)
X1_test = tokenizer.texts_to_sequences(X1_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X1_train = pad_sequences(X1_train, padding='post', maxlen=maxlen)
X1_test = pad_sequences(X1_test, padding='post', maxlen=maxlen)

We will again use GloVe word embeddings for creating word vectors:

from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

glove_file = open('/content/drive/My Drive/glove.6B.100d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions

glove_file.close()

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

We have preprocessed our textual input. The second input type is the meta information in the useful, funny, and cool columns. We will filter these columns from the feature set to create meta input for training the algorithms. Look at the following script:

X2_train = X_train[['useful', 'funny', 'cool']].values
X2_test = X_test[['useful', 'funny', 'cool']].values

Let's now create our two input layers. The first input layer will be used to input the textual input and the second input layer will be used to input meta information from the three columns.

input_1 = Input(shape=(maxlen,))

input_2 = Input(shape=(3,))

You can see that the first input layer input_1 is used for the textual input. Its shape is set to maxlen, the length of each padded input sequence. For the second input layer, the shape corresponds to the three columns.

Let's now create the first submodel that accepts data from first input layer:

embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(input_1)
LSTM_Layer_1 = LSTM(128)(embedding_layer)

Similarly, the following script creates a second submodel that accepts input from the second input layer:

dense_layer_1 = Dense(10, activation='relu')(input_2)
dense_layer_2 = Dense(10, activation='relu')(dense_layer_1)

We now have two submodels. What we want to do is concatenate the output from the first submodel with the output from the second submodel. The output from the first submodel is the output of LSTM_Layer_1 and, similarly, the output from the second submodel is the output of dense_layer_2. We can use the Concatenate class, imported at the beginning of this article, to concatenate the two outputs.
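At the level of shapes, concatenation simply places the two output vectors of each sample side by side. The NumPy sketch below uses random arrays as stand-ins for the two submodel outputs (the batch size of 4 is arbitrary) to show how a 128-dimensional and a 10-dimensional output become one 138-dimensional vector per sample.

```python
import numpy as np

batch_size = 4                                   # arbitrary toy batch
lstm_output = np.random.rand(batch_size, 128)    # stands in for LSTM_Layer_1
dense_output = np.random.rand(batch_size, 10)    # stands in for dense_layer_2

# Keras' Concatenate joins along the last axis by default (axis=-1),
# which np.concatenate reproduces here.
merged = np.concatenate([lstm_output, dense_output], axis=-1)
print(merged.shape)
```

Both arrays must have the same number of rows, which is why the text and meta inputs passed to the model have to come from the same samples in the same order.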

The following script creates our final model:

concat_layer = Concatenate()([LSTM_Layer_1, dense_layer_2])
dense_layer_3 = Dense(10, activation='relu')(concat_layer)
output = Dense(3, activation='softmax')(dense_layer_3)
model = Model(inputs=[input_1, input_2], outputs=output)

You can see that now our model has a list of inputs with two items. The following script compiles the model and prints its summary:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
print(model.summary())

The model summary is as follows:

Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 200)          0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 3)            0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 200, 100)     5572900     input_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 10)           40          input_2[0][0]
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 128)          117248      embedding_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 10)           110         dense_1[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 138)          0           lstm_1[0][0]
                                                                 dense_2[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 10)           1390        concatenate_1[0][0]
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 3)            33          dense_3[0][0]
==================================================================================================
Total params: 5,691,721
Trainable params: 118,821
Non-trainable params: 5,572,900
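The dense-layer counts and the trainable total in this summary can be checked with a few lines of arithmetic. The LSTM count is taken directly from the summary, and the helper dense_params is a hypothetical name used only for this illustration.

```python
lstm_units, meta_features, hidden_units, num_classes = 128, 3, 10, 3

def dense_params(n_inputs, n_outputs):
    # One weight per input-output pair plus one bias per output.
    return n_inputs * n_outputs + n_outputs

concat_width = lstm_units + hidden_units              # 128 + 10 features after merging
dense_1 = dense_params(meta_features, hidden_units)   # meta input -> first dense layer
dense_2 = dense_params(hidden_units, hidden_units)    # first -> second dense layer
dense_3 = dense_params(concat_width, hidden_units)    # merged vector -> dense layer
dense_4 = dense_params(hidden_units, num_classes)     # dense layer -> softmax output

lstm = 117248                                         # taken from the summary above
trainable = lstm + dense_1 + dense_2 + dense_3 + dense_4
print(concat_width, dense_3, trainable)
```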

Finally, we can plot the complete network model using the following script:

from keras.utils import plot_model
plot_model(model, to_file='model_plot3.png', show_shapes=True, show_layer_names=True)

If you open the model_plot3.png file, you should see the following network diagram:

[Image: block diagram of the multi-input model]

The above figure clearly explains how we have concatenated multiple inputs into one input to create our model.

Let's now train our model and see the results:

history = model.fit(x=[X1_train, X2_train], y=y_train, batch_size=128, epochs=10, verbose=1, validation_split=0.2)

Here is the result for the 10 epochs:

Train on 32000 samples, validate on 8000 samples
Epoch 1/10
32000/32000 [==============================] - 155s 5ms/step - loss: 0.9006 - acc: 0.6509 - val_loss: 0.8233 - val_acc: 0.6704
Epoch 2/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.8212 - acc: 0.6670 - val_loss: 0.8141 - val_acc: 0.6745
Epoch 3/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.8151 - acc: 0.6691 - val_loss: 0.8086 - val_acc: 0.6740
Epoch 4/10
32000/32000 [==============================] - 155s 5ms/step - loss: 0.8121 - acc: 0.6701 - val_loss: 0.8039 - val_acc: 0.6776
Epoch 5/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.8027 - acc: 0.6740 - val_loss: 0.7467 - val_acc: 0.6854
Epoch 6/10
32000/32000 [==============================] - 155s 5ms/step - loss: 0.6791 - acc: 0.7158 - val_loss: 0.5764 - val_acc: 0.7560
Epoch 7/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.5333 - acc: 0.7744 - val_loss: 0.5076 - val_acc: 0.7881
Epoch 8/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.4857 - acc: 0.7973 - val_loss: 0.4849 - val_acc: 0.7970
Epoch 9/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.4697 - acc: 0.8034 - val_loss: 0.4709 - val_acc: 0.8024
Epoch 10/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.4479 - acc: 0.8123 - val_loss: 0.4592 - val_acc: 0.8079

To evaluate our model, we will have to pass both test inputs to the evaluate function, as shown below:

score = model.evaluate(x=[X1_test, X2_test], y=y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

Here is the result:

10000/10000 [==============================] - 18s 2ms/step
Test Score: 0.4576087875843048
Test Accuracy: 0.8053

Our test accuracy is 80.53%, which is slightly lower than the 81.01% achieved by our first model, which uses textual input only. This suggests that the meta information in yelp_reviews adds little to sentiment prediction beyond what the text already provides.

Anyway, now you know how to create a multiple-input model for text classification in Keras!
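
To put the gap between the two test accuracies in perspective, it helps to convert it into a number of reviews. The snippet below is only arithmetic on the figures reported in this article (0.8101 for the text-only model, 0.8053 for the multiple-input model, 10,000 test reviews). The variable names are mine.

```python
n_test = 10000              # size of the test set reported by evaluate()
text_only_acc = 0.8101      # test accuracy of the text-only model
multi_input_acc = 0.8053    # test accuracy of the multiple-input model

# Number of test reviews on which the two models' totals differ.
diff_reviews = round((text_only_acc - multi_input_acc) * n_test)
print(diff_reviews)
```

A difference this small could plausibly be caused by random weight initialization alone, so repeating the training with several seeds would be needed before drawing a firm conclusion.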

Finally, let's plot the loss and accuracy for the training and validation sets (the validation curves are labeled test in the legend):

import matplotlib.pyplot as plt

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()

[Figures: accuracy and loss curves for the multiple-input model]

You can see that the differences between the training and validation values for loss and accuracy are minimal, hence our model is not overfitting.

Final Thoughts and Improvements

In this article, we built a very simple neural network, since the purpose of the article is to explain how to create a deep learning model that accepts multiple inputs of different types.

Here are some tips that you can follow to further improve the performance of the text classification model:

  1. We only used 50,000 out of 5.2 million records in this article because of hardware constraints. You can try training your model on a larger number of records and see if you can achieve better performance.
  2. Try adding more LSTM and dense layers to the model. If the model overfits, try adding dropout.
  3. Try changing the optimizer and training the model for a higher number of epochs.

Please share your results, along with the neural network configuration, in the comments section. I would love to see how well you performed.

22 Aug 2019 12:51pm GMT

Stack Abuse: Python for NLP: Creating Multi-Data-Type Classification Models with Keras

This is the 18th article in my series of articles on Python for NLP. In my previous article, I explained how to create a deep learning-based movie sentiment analysis model using Python's Keras library. In that article, we saw how we can perform sentiment analysis of user reviews regarding different movies on IMDB. We used the text of the review to predict the sentiment.

However, in text classification tasks, we can also make use of non-textual information to classify the text. For instance, gender may have an impact on the sentiment of the review. Furthermore, nationalities may affect the public opinion about a particular movie. Therefore, this associated information, also known as meta data, can also be used to improve the accuracy of a statistical model.

In this article, we will build upon the concepts that we studied in the last two articles and will see how to create a text classification system that classifies user reviews regarding different businesses into one of three predefined categories, i.e. "good", "bad", and "average". However, in addition to the text of the review, we will use the associated meta data of the review to perform classification. Since we have two different types of inputs, i.e. textual input and numerical input, we need to create a multiple-input model. We will be using the Keras Functional API since it supports multiple-input and multiple-output models.

After reading this article, you will be able to create a deep learning model in Keras that is capable of accepting multiple inputs, concatenating the two outputs and then performing classification or regression using the aggregated input.

Before we dive into the details of creating such a model, let's first briefly review the dataset that we are going to use.

The Dataset

The dataset for this article can be downloaded from this Kaggle link. The dataset contains multiple files, but we are only interested in the yelp_review.csv file. The file contains more than 5.2 million reviews about different businesses, including restaurants, bars, dentists, doctors, beauty salons, etc. For our purposes we will only be using the first 50,000 records to train our model. Download the dataset to your local machine.

Let's first import all the libraries that we will be using in this article before importing the dataset.

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten, LSTM
from keras.layers import GlobalMaxPooling1D
from keras.models import Model
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.layers import Input
from keras.layers.merge import Concatenate

import pandas as pd
import numpy as np
import re

As a first step, we need to load the dataset. The following script does that:

yelp_reviews = pd.read_csv("/content/drive/My Drive/yelp_review_short.csv")

The dataset contains a column stars that contains ratings for different businesses. The stars column can have values between 1 and 5. We will simplify our problem by converting the numerical values for the reviews into categorical ones. We will add a new column reviews_score to our dataset. If the user review has a value of 1 in the stars column, the reviews_score column will have a string value bad. If the rating is 2 or 3 in the stars column, the reviews_score column will contain a value average. Finally, a review rating of 4 or 5 will have a corresponding value of good in the reviews_score column.

The following script performs this preprocessing:

bins = [0,1,3,5]
review_names = ['bad', 'average', 'good']
yelp_reviews['reviews_score'] = pd.cut(yelp_reviews['stars'], bins, labels=review_names)
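
To see why the bins [0,1,3,5] produce exactly this mapping, recall that pd.cut uses right-closed intervals by default, i.e. (0, 1], (1, 3], and (3, 5]. The following is a small pure-Python sketch of the same rule. The score_label helper is a name introduced here for illustration and is not part of pandas.

```python
from bisect import bisect_left

bins = [0, 1, 3, 5]
review_names = ['bad', 'average', 'good']

def score_label(stars):
    # Right-closed intervals: (0, 1] -> bad, (1, 3] -> average, (3, 5] -> good.
    return review_names[bisect_left(bins, stars) - 1]

labels = {stars: score_label(stars) for stars in range(1, 6)}
print(labels)
```

Ratings of 1 land in the first interval, 2 and 3 in the second, and 4 and 5 in the third, which matches the description above.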

Next, we will check whether our dataframe contains any NULL values, and will print the shape and the first few rows of the dataset.

yelp_reviews.isnull().values.any()

print(yelp_reviews.shape)

yelp_reviews.head()

In the output you will see (50000, 10), which means that our dataset contains 50,000 records with 10 columns. The header of the yelp_reviews dataframe looks like this:

[Figure: first five rows of the yelp_reviews dataframe]

You can see the 10 columns that our dataframe contains, including the newly added reviews_score column. The text column contains the text of the review while the useful column contains a numerical value that represents the count of the people who found the review useful. Similarly, the funny and cool columns contain the counts of people who found reviews funny or cool, respectively.

Let's randomly choose a review. If you look at the 4th review (review with index 3), it has 4 stars and hence it is marked as good. Let's view the complete text of this review:

print(yelp_reviews["text"][3])

The output looks like this:

Love coming here. Yes the place always needs the floor swept but when you give out  peanuts in the shell how won't it always be a bit dirty.

The food speaks for itself, so good. Burgers are made to order and the meat is put on the grill when you order your sandwich. Getting the small burger just means 1 patty, the regular is a 2 patty burger which is twice the deliciousness.

Getting the Cajun fries adds a bit of spice to them and whatever size you order they always throw more fries (a lot more fries) into the bag.

You can clearly see that this is a positive review.

Let's now plot the number of good, average, and bad reviews.

import seaborn as sns

sns.countplot(x='reviews_score', data=yelp_reviews)

[Figure: count plot of good, average, and bad reviews]

It is evident from the above plot that the majority of the reviews are good, followed by the average reviews. The number of bad reviews is very small.

We have preprocessed our data and now we will create three models in this article. The first model will only use text inputs for predicting whether a review is good, average, or bad. In the second model, we will not use text. We will only use the meta information such as useful, funny, and cool to predict the sentiment of the review. Finally, we will create a model that accepts multiple inputs i.e. text and meta information for text classification.

Creating a Model with Text Inputs Only

The first step is to define a function that cleans the textual data.

def preprocess_text(sen):

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
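
To make the effect of the three substitutions concrete, here is the same function applied to a made-up sentence (the sample text is mine, not from the dataset). Note one side effect: the apostrophe in "don't" is replaced by a space, and the stray "t" is then removed as a single character.

```python
import re

def preprocess_text(sen):
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    # Remove single characters surrounded by whitespace
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

sample = "I don't like the 2 burgers!"   # made-up example sentence
cleaned = preprocess_text(sample)
print(cleaned.split())
```

The leading "I" survives because the single-character pattern requires whitespace on both sides, and there is none before the first character of the string.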

Since we are only using text in this model, we will collect all the text reviews and store them in a list. The text reviews will be cleaned using the preprocess_text function, which removes punctuation and numbers from the text.

X = []
sentences = list(yelp_reviews["text"])
for sen in sentences:
    X.append(preprocess_text(sen))

y = yelp_reviews['reviews_score']

Our X variable here contains the text reviews while the y variable contains the corresponding reviews_score values. The reviews_score column has data in text format. We need to convert the text to a one-hot encoded vector. We can use the to_categorical function from the keras.utils module. However, first we have to convert the text into integer labels using the LabelEncoder class from the sklearn.preprocessing module.

from sklearn import preprocessing

# LabelEncoder maps each string label to an integer.
label_encoder = preprocessing.LabelEncoder()

# Encode the labels from the 'reviews_score' column.
y = label_encoder.fit_transform(y)

Let's now divide our data into testing and training sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

Now we can convert both the training and test labels into one-hot encoded vectors:

from keras.utils import to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
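
The two encoding steps can be illustrated without any library. LabelEncoder assigns integers to the classes in sorted order, so here average becomes 0, bad becomes 1, and good becomes 2. The one-hot step then turns each integer into a vector with a single 1. The sketch below mimics that behavior in plain Python on a few made-up labels.

```python
labels = ['good', 'bad', 'average', 'good']   # made-up sample labels

# Step 1: integer encoding, classes taken in sorted order.
classes = sorted(set(labels))
class_to_index = {name: i for i, name in enumerate(classes)}
encoded = [class_to_index[name] for name in labels]

# Step 2: one-hot encoding, one column per class.
one_hot = [[1 if i == idx else 0 for i in range(len(classes))]
           for idx in encoded]

print(classes)
print(encoded)
print(one_hot)
```

Keeping the class order in mind matters later: the three softmax outputs of the model correspond to average, bad, and good, in that order.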

I explained in my article on word embeddings that textual data has to be converted into some sort of numeric form before it can be used by statistical algorithms like machine learning and deep learning models. One way to convert text to numbers is via word embeddings. If you are unaware of how to implement word embeddings via Keras, I highly recommend that you read this article before moving on to the next sections of the code.

The first step in word embeddings is to convert the words into their corresponding numeric indexes. To do so, we can use the Tokenizer class from the keras.preprocessing.text module.

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
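
The following is a rough pure-Python sketch of what the Tokenizer does, assuming simple whitespace splitting (the real class also lowercases the text and strips punctuation). Words are ranked by frequency, the most frequent word gets index 1, and num_words only limits which indexes are emitted by texts_to_sequences. The texts and the to_sequence helper are made up for illustration.

```python
from collections import Counter

texts = ["the food was good", "the service was slow", "the food was the best"]

# Rank words by frequency; index 0 is reserved for padding.
counts = Counter(word for text in texts for word in text.split())
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

num_words = 4   # only indexes below num_words are kept in the sequences

def to_sequence(text):
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

sequence = to_sequence("the food was good")
vocab_size = len(word_index) + 1   # still counts every word seen during fitting
print(sequence, vocab_size)
```

This is why the embedding layer in this article ends up with far more than 5,000 rows: the vocabulary size is computed from the full word_index, not from num_words.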

Sentences can have different lengths, and therefore the sequences returned by the Tokenizer class also have variable lengths. We specify that the maximum length of a sequence will be 200 (although you can try any number). For sentences with fewer than 200 tokens, the remaining positions will be padded with zeros. For sentences with more than 200 tokens, the extra tokens will be truncated (with the default truncating='pre' setting, pad_sequences drops them from the beginning of the sequence).

Look at the following script:

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
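
A minimal sketch of what this call does to a single sequence, assuming the settings used above: padding='post' appends zeros at the end, while the default truncating='pre' removes tokens from the start of sequences that are too long. The pad_post helper is a stand-in written for illustration, not the Keras implementation.

```python
def pad_post(seq, maxlen, value=0):
    # Default truncating='pre': keep only the last maxlen tokens.
    seq = seq[-maxlen:]
    # padding='post': fill the remaining positions at the end.
    return seq + [value] * (maxlen - len(seq))

short = pad_post([1, 2, 3], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

If the beginning of a review matters more than its end for your task, passing truncating='post' to pad_sequences keeps the first maxlen tokens instead.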

Next, we need to load the pre-trained GloVe word embeddings.

from numpy import array
from numpy import asarray
from numpy import zeros

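embeddings_dictionary = dict()

# Open the GloVe file (same path as used later in this article)
glove_file = open('/content/drive/My Drive/glove.6B.100d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions

glove_file.close()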

Finally, we will create an embedding matrix where rows will be equal to the number of words in the vocabulary (plus 1). The number of columns will be 100 since each word in the GloVe word embeddings that we loaded is represented as a 100 dimensional vector.

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
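
The same loop is easier to follow on a toy example. In the sketch below, the word index and the three-dimensional vectors are made up, and plain lists stand in for NumPy arrays. It shows two details of the matrix: row 0 stays all zeros because index 0 is reserved for padding, and any word missing from the pre-trained embeddings also keeps an all-zero row.

```python
embedding_dim = 3   # toy size; the article uses 100

word_index = {'good': 1, 'food': 2, 'zzyzx': 3}        # hypothetical tokenizer output
pretrained = {'good': [0.1, 0.2, 0.3],                 # made-up vectors
              'food': [0.4, 0.5, 0.6]}

vocab_size = len(word_index) + 1   # +1 for the padding index 0
embedding_matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]

for word, index in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

for row in embedding_matrix:
    print(row)
```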

Once the word embedding step is completed, we are ready to create our model. We will be using Keras' functional API to create our model. Single-input models like the one we are creating now can be developed using the sequential API as well. However, since we are later going to develop a multiple-input model, which can only be developed using the Keras functional API, we will stick to the functional API in this section too.

We will create a very simple model with one input layer, one embedding layer, one LSTM layer with 128 neurons, and one dense layer that will act as the output layer as well. Since we have 3 possible outputs, the number of neurons will be 3 and the activation function will be softmax. We will use categorical_crossentropy as our loss function and adam as the optimizer.

deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(deep_inputs)
LSTM_Layer_1 = LSTM(128)(embedding_layer)
dense_layer_1 = Dense(3, activation='softmax')(LSTM_Layer_1)
model = Model(inputs=deep_inputs, outputs=dense_layer_1)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
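
For readers who want to see what the softmax output and the categorical_crossentropy loss compute for a single review, here is a small pure-Python sketch. The logits are made up, and the two helper functions are simplified stand-ins for what Keras does internally.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred):
    # Negative log-probability assigned to the true class.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

probs = softmax([2.0, 0.5, -1.0])                 # made-up logits for 3 classes
loss = categorical_crossentropy([1, 0, 0], probs)
uniform_loss = categorical_crossentropy([1, 0, 0], [1 / 3, 1 / 3, 1 / 3])
print(probs, loss, uniform_loss)
```

A model that assigns equal probability to all three classes has a loss of ln 3, roughly 1.0986, which gives a reference point for reading the loss values in the training logs.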

Let's print the summary of our model:

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 200)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 200, 100)          5572900
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 387
=================================================================
Total params: 5,690,535
Trainable params: 117,635
Non-trainable params: 5,572,900

Finally, let's print the block diagram of our neural network:

from keras.utils import plot_model
plot_model(model, to_file='model_plot1.png', show_shapes=True, show_layer_names=True)

The file model_plot1.png will be created in your local file path. If you open the image, it will look like this:

[Figure: model_plot1.png, the network diagram of the text-only model]

You can see that the model has 1 input layer, 1 embedding layer, 1 LSTM, and one dense layer which serves as the output layer as well.

Let's now train our model:

history = model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=1, validation_split=0.2)

The model will be trained on 80% of the training data and validated on the remaining 20%. The results for the 10 epochs are as follows:

Train on 32000 samples, validate on 8000 samples
Epoch 1/10
32000/32000 [==============================] - 81s 3ms/step - loss: 0.8640 - acc: 0.6623 - val_loss: 0.8356 - val_acc: 0.6730
Epoch 2/10
32000/32000 [==============================] - 80s 3ms/step - loss: 0.8508 - acc: 0.6618 - val_loss: 0.8399 - val_acc: 0.6690
Epoch 3/10
32000/32000 [==============================] - 84s 3ms/step - loss: 0.8461 - acc: 0.6647 - val_loss: 0.8374 - val_acc: 0.6726
Epoch 4/10
32000/32000 [==============================] - 82s 3ms/step - loss: 0.8288 - acc: 0.6709 - val_loss: 0.7392 - val_acc: 0.6861
Epoch 5/10
32000/32000 [==============================] - 82s 3ms/step - loss: 0.7444 - acc: 0.6804 - val_loss: 0.6371 - val_acc: 0.7311
Epoch 6/10
32000/32000 [==============================] - 83s 3ms/step - loss: 0.5969 - acc: 0.7484 - val_loss: 0.5602 - val_acc: 0.7682
Epoch 7/10
32000/32000 [==============================] - 82s 3ms/step - loss: 0.5484 - acc: 0.7623 - val_loss: 0.5244 - val_acc: 0.7814
Epoch 8/10
32000/32000 [==============================] - 86s 3ms/step - loss: 0.5052 - acc: 0.7866 - val_loss: 0.4971 - val_acc: 0.7950
Epoch 9/10
32000/32000 [==============================] - 84s 3ms/step - loss: 0.4753 - acc: 0.8032 - val_loss: 0.4839 - val_acc: 0.7965
Epoch 10/10
32000/32000 [==============================] - 82s 3ms/step - loss: 0.4539 - acc: 0.8110 - val_loss: 0.4622 - val_acc: 0.8046

You can see that the final training accuracy of the model is 81.10% while the validation accuracy is 80.46%. The difference is very small and therefore we assume that our model is not overfitting on the training data.

Let's now evaluate the performance of our model on test set:

score = model.evaluate(X_test, y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

The output looks like this:

10000/10000 [==============================] - 37s 4ms/step
Test Score: 0.4592904740810394
Test Accuracy: 0.8101

Finally, let's plot the values for loss and accuracy for both the training and validation sets (the validation curves are labeled test in the legend):

import matplotlib.pyplot as plt

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()

You should see the following two plots:

[Figures: accuracy and loss curves for the text-only model]

You can see that the lines for the training and validation accuracies and losses are pretty close to each other, which means that the model is not overfitting.

Creating a Model with Meta Information Only

In this section, we will create a classification model that uses information from the useful, funny, and cool columns of the yelp reviews. Since the data for these columns is well structured and doesn't contain any sequential or spatial pattern, we can use simple densely connected neural networks to make predictions.

Let's plot the average counts for useful, funny, and cool reviews against the review score.

import seaborn as sns
sns.barplot(x='reviews_score', y='useful', data=yelp_reviews)

[Figure: bar plot of the average useful count for each reviews_score value]

From the output, you can see that the average count for reviews marked as useful is the highest for the bad reviews, followed by the average reviews and the good reviews.

Let's now plot the average count for funny reviews:

sns.barplot(x='reviews_score', y='funny', data=yelp_reviews)

[Figure: bar plot of the average funny count for each reviews_score value]

The output shows that again, the average count for reviews marked as funny is highest for the bad reviews.

Finally, let's plot the average value for the cool column against the reviews_score column. We expect that the average count for the cool column will be the highest for good reviews since people often mark positive or good reviews as cool:

sns.barplot(x='reviews_score', y='cool', data=yelp_reviews)

[Figure: bar plot of the average cool count for each reviews_score value]

As expected, the average cool count for the good reviews is the highest. From this information, we can assume that the count values for the useful, funny, and cool columns have some correlation with the reviews_score column. Therefore, we will try to use the data from these three columns to train our algorithm that predicts the value for the reviews_score column.

Let's filter these three columns from our dataset:

yelp_reviews_meta = yelp_reviews[['useful', 'funny', 'cool']]

X = yelp_reviews_meta.values

y = yelp_reviews['reviews_score']

Next, we will convert our labels into one-hot encoded values and then split our data into train and test sets:

from sklearn import preprocessing

# LabelEncoder maps each string label to an integer.
label_encoder = preprocessing.LabelEncoder()

# Encode the labels from the 'reviews_score' column.
y = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

from keras.utils import to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

The next step is to create our model. Our model will consist of four layers (you can try any number): the input layer, two dense hidden layers with 10 neurons and relu activation functions, and finally an output dense layer with 3 neurons and softmax activation function. The loss function and optimizer will be categorical_crossentropy and adam, respectively.

The following script defines the model:

input2 = Input(shape=(3,))
dense_layer_1 = Dense(10, activation='relu')(input2)
dense_layer_2 = Dense(10, activation='relu')(dense_layer_1)
output = Dense(3, activation='softmax')(dense_layer_2)

model = Model(inputs=input2, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Let's print the summary of the model:

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 3)                 0
_________________________________________________________________
dense_1 (Dense)              (None, 10)                40
_________________________________________________________________
dense_2 (Dense)              (None, 10)                110
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 33
=================================================================
Total params: 183
Trainable params: 183
Non-trainable params: 0

Finally, the block diagram for the model can be created via the following script:

from keras.utils import plot_model
plot_model(model, to_file='model_plot2.png', show_shapes=True, show_layer_names=True)

Now, if you open the model_plot2.png file from your local file path, it looks like this:

[Figure: model_plot2.png, the network diagram of the meta-information model]

Let's now train the model and print the accuracy and loss values for each epoch:

history = model.fit(X_train, y_train, batch_size=16, epochs=10, verbose=1, validation_split=0.2)

Train on 32000 samples, validate on 8000 samples
Epoch 1/10
32000/32000 [==============================] - 8s 260us/step - loss: 0.8429 - acc: 0.6649 - val_loss: 0.8166 - val_acc: 0.6734
Epoch 2/10
32000/32000 [==============================] - 7s 214us/step - loss: 0.8203 - acc: 0.6685 - val_loss: 0.8156 - val_acc: 0.6737
Epoch 3/10
32000/32000 [==============================] - 7s 217us/step - loss: 0.8187 - acc: 0.6685 - val_loss: 0.8150 - val_acc: 0.6736
Epoch 4/10
32000/32000 [==============================] - 7s 220us/step - loss: 0.8183 - acc: 0.6695 - val_loss: 0.8160 - val_acc: 0.6740
Epoch 5/10
32000/32000 [==============================] - 7s 227us/step - loss: 0.8177 - acc: 0.6686 - val_loss: 0.8149 - val_acc: 0.6751
Epoch 6/10
32000/32000 [==============================] - 7s 219us/step - loss: 0.8175 - acc: 0.6686 - val_loss: 0.8157 - val_acc: 0.6744
Epoch 7/10
32000/32000 [==============================] - 7s 216us/step - loss: 0.8172 - acc: 0.6696 - val_loss: 0.8145 - val_acc: 0.6733
Epoch 8/10
32000/32000 [==============================] - 7s 214us/step - loss: 0.8175 - acc: 0.6689 - val_loss: 0.8139 - val_acc: 0.6734
Epoch 9/10
32000/32000 [==============================] - 7s 215us/step - loss: 0.8169 - acc: 0.6691 - val_loss: 0.8160 - val_acc: 0.6744
Epoch 10/10
32000/32000 [==============================] - 7s 216us/step - loss: 0.8167 - acc: 0.6694 - val_loss: 0.8138 - val_acc: 0.6736

From the output, you can see that our model does not improve during training: the accuracy values remain between 66% and 67% across all the epochs.
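
One way to interpret a flat accuracy curve like this is to compare it with the majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent label. The article does not report the exact class proportions, so the labels in the sketch below are made up. The majority_baseline_accuracy helper is likewise introduced here for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy of a "model" that always predicts the most frequent class.
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up label distribution, for illustration only.
toy_labels = ['good'] * 6 + ['average'] * 3 + ['bad'] * 1
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

If the same calculation on the real reviews_score column gives a value close to the accuracy in the logs, the network is probably doing little more than predicting the dominant class.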

Let's see how the model performs on test set:

score = model.evaluate(X_test, y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

10000/10000 [==============================] - 0s 34us/step
Test Score: 0.8206425309181213
Test Accuracy: 0.6669

We can plot the loss and accuracy values for the training and validation sets via the following script (the validation curves are labeled test in the legend):

import matplotlib.pyplot as plt

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()

[Figures: accuracy and loss curves for the meta-information model]

From the output, you can see that the accuracy values are relatively low. Hence, we can say that our model is underfitting. The accuracy might be increased by increasing the number of dense layers or the number of epochs; however, I will leave that to you.

Let's move on to the final and most important section of this article where we will use multiple inputs of different types to train our model.

Creating a Model with Multiple Inputs

In the previous sections, we saw how to train deep learning models using either textual data or meta information. What if we want to combine textual information with meta information and use that as input to our model? We can do so using the Keras functional API. In this section we will create two submodels.

The first submodel will accept textual input in the form of text reviews. This submodel will consist of an input layer, an embedding layer, and an LSTM layer of 128 neurons. The second submodel will accept input in the form of meta information from the useful, funny, and cool columns. The second submodel also consists of three layers: an input layer and two dense layers.

The output from the LSTM layer of the first submodel and the output from the second dense layer of the second submodel will be concatenated together and will be used as concatenated input to another dense layer with 10 neurons. Finally, the output dense layer will have three neurons, one for each review type.

Let's see how we can create such a concatenated model.

First we have to create two different types of inputs. To do so, we will divide our data into a feature set and label set, as shown below:

X = yelp_reviews.drop('reviews_score', axis=1)

y = yelp_reviews['reviews_score']

The X variable contains the feature set, whereas the y variable contains the label set. We need to convert our labels into one-hot encoded vectors. We can do so using the label encoder and the to_categorical function of the keras.utils module. We will also divide our data into training and test sets.

from sklearn import preprocessing

# LabelEncoder maps each string label to an integer.
label_encoder = preprocessing.LabelEncoder()

# Encode the labels from the 'reviews_score' column.
y = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

from keras.utils import to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

Now our label set is in the required form. Since there will be only one output, we don't need to process our label set any further. However, there will be multiple inputs to the model. Therefore, we need to preprocess our feature set.

Let's first create the preprocess_text function that will be used to preprocess our dataset:

def preprocess_text(sen):

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

As a first step, we will create textual input for the training and test set. Look at the following script:

X1_train = []
sentences = list(X_train["text"])
for sen in sentences:
    X1_train.append(preprocess_text(sen))

Now X1_train contains the textual input for the training set. Similarly, the following script preprocesses the textual input data for the test set:

X1_test = []
sentences = list(X_test["text"])
for sen in sentences:
    X1_test.append(preprocess_text(sen))

Now we need to convert textual input for the training and test sets into numeric form using word embeddings. The following script does that:

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X1_train)

X1_train = tokenizer.texts_to_sequences(X1_train)
X1_test = tokenizer.texts_to_sequences(X1_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X1_train = pad_sequences(X1_train, padding='post', maxlen=maxlen)
X1_test = pad_sequences(X1_test, padding='post', maxlen=maxlen)

We will again use GloVe word embeddings for creating word vectors:

from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

glove_file = open('/content/drive/My Drive/glove.6B.100d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions

glove_file.close()

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

We have preprocessed our textual input. The second input type is the meta information in the useful, funny, and cool columns. We will filter these columns from the feature set to create meta input for training the algorithms. Look at the following script:

X2_train = X_train[['useful', 'funny', 'cool']].values
X2_test = X_test[['useful', 'funny', 'cool']].values
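
Because both the textual input and the meta input are taken from the same X_train and X_test dataframes, row i of X1_train and row i of X2_train always describe the same review. The sketch below illustrates this idea in pure Python with made-up rows. It uses random.Random to shuffle indexes as a stand-in for train_test_split, so it is not the exact split that scikit-learn would produce.

```python
import random

# Made-up rows standing in for the yelp_reviews dataframe.
rows = [
    {'text': 'great burger', 'useful': 2, 'funny': 0, 'cool': 1},
    {'text': 'slow service', 'useful': 5, 'funny': 1, 'cool': 0},
    {'text': 'average fries', 'useful': 0, 'funny': 0, 'cool': 0},
    {'text': 'loved the staff', 'useful': 1, 'funny': 0, 'cool': 3},
    {'text': 'never again', 'useful': 7, 'funny': 2, 'cool': 0},
]
meta_columns = ('useful', 'funny', 'cool')

# Shuffle the row indexes once, then cut them into train and test parts.
indices = list(range(len(rows)))
random.Random(42).shuffle(indices)
cut = int(len(rows) * 0.8)
train_idx, test_idx = indices[:cut], indices[cut:]

# Both inputs are sliced with the same indexes, so they stay row-aligned.
X1_train = [rows[i]['text'] for i in train_idx]
X2_train = [[rows[i][c] for c in meta_columns] for i in train_idx]
X1_test = [rows[i]['text'] for i in test_idx]
X2_test = [[rows[i][c] for c in meta_columns] for i in test_idx]

print(len(X1_train), len(X2_train), len(X1_test), len(X2_test))
```

Splitting the text and the meta columns separately, with different shuffles, would silently pair each review with another review's counts, so it is worth keeping a single split as done in this article.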

Let's now create our two input layers. The first input layer will be used to input the textual input and the second input layer will be used to input meta information from the three columns.

input_1 = Input(shape=(maxlen,))

input_2 = Input(shape=(3,))

You can see that the first input layer input_1 is used for the textual input. Its shape is set to maxlen, the length of each padded input sequence. For the second input layer, the shape corresponds to the three meta columns.

Let's now create the first submodel that accepts data from first input layer:

embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(input_1)
LSTM_Layer_1 = LSTM(128)(embedding_layer)

Similarly, the following script creates a second submodel that accepts input from the second input layer:

dense_layer_1 = Dense(10, activation='relu')(input_2)
dense_layer_2 = Dense(10, activation='relu')(dense_layer_1)

We now have two submodels. What we want to do is concatenate the output from the first submodel with the output from the second submodel. The output from the first submodel is the output from the LSTM_Layer_1 and similarly, the output from the second submodel is the output from the dense_layer_2. We can use the Concatenate class from the keras.layers.merge module to concatenate two inputs.

The following script creates our final model:

concat_layer = Concatenate()([LSTM_Layer_1, dense_layer_2])
dense_layer_3 = Dense(10, activation='relu')(concat_layer)
output = Dense(3, activation='softmax')(dense_layer_3)
model = Model(inputs=[input_1, input_2], outputs=output)
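As a rough illustration of what the concatenation step does for a single review, the sketch below joins two plain Python lists of the same widths as the two submodel outputs. The zero values are placeholders and an assumption made for illustration; only the widths matter here.

```python
# Placeholder outputs for one review; real layers would emit learned floats.
lstm_output = [0.0] * 128   # LSTM_Layer_1 produces 128 values per review
dense_output = [0.0] * 10   # dense_layer_2 produces 10 values per review

# Concatenation joins the two vectors end to end along the feature axis.
concatenated = lstm_output + dense_output
print(len(concatenated))
```

The joined vector of 128 + 10 = 138 features is what dense_layer_3 receives.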

You can see that now our model has a list of inputs with two items. The following script compiles the model and prints its summary:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

The model summary is as follows:

Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 200)          0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 3)            0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 200, 100)     5572900     input_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 10)           40          input_2[0][0]
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 128)          117248      embedding_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 10)           110         dense_1[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 138)          0           lstm_1[0][0]
                                                                 dense_2[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 10)           1390        concatenate_1[0][0]
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 3)            33          dense_3[0][0]
==================================================================================================
Total params: 5,691,721
Trainable params: 118,821
Non-trainable params: 5,572,900
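The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for dense and LSTM layers; the helper names dense_params and lstm_params are hypothetical, and the embedding count is copied from the summary rather than recomputed.

```python
def dense_params(inputs, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return inputs * units + units


def lstm_params(inputs, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((inputs + units) * units + units)


embedding = 5572900                       # frozen GloVe weights, from the summary
lstm = lstm_params(100, 128)              # 100-dim embeddings into 128 units
dense_1 = dense_params(3, 10)             # three meta columns into 10 units
dense_2 = dense_params(10, 10)
dense_3 = dense_params(128 + 10, 10)      # concatenated width is 138
dense_4 = dense_params(10, 3)             # three output classes

trainable = lstm + dense_1 + dense_2 + dense_3 + dense_4
total = trainable + embedding
print(trainable, total)
```

Because the embedding layer was created with trainable=False, its weights make up the whole non-trainable count.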

Finally, we can plot the complete network model using the following script:

from keras.utils import plot_model
plot_model(model, to_file='model_plot3.png', show_shapes=True, show_layer_names=True)

If you open the model_plot3.png file, you should see the following network diagram:

[Image: network diagram of the multi-input model]

The above figure clearly explains how we have concatenated multiple inputs into one input to create our model.

Let's now train our model and see the results:

history = model.fit(x=[X1_train, X2_train], y=y_train, batch_size=128, epochs=10, verbose=1, validation_split=0.2)

Here is the result for the 10 epochs:

Train on 32000 samples, validate on 8000 samples
Epoch 1/10
32000/32000 [==============================] - 155s 5ms/step - loss: 0.9006 - acc: 0.6509 - val_loss: 0.8233 - val_acc: 0.6704
Epoch 2/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.8212 - acc: 0.6670 - val_loss: 0.8141 - val_acc: 0.6745
Epoch 3/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.8151 - acc: 0.6691 - val_loss: 0.8086 - val_acc: 0.6740
Epoch 4/10
32000/32000 [==============================] - 155s 5ms/step - loss: 0.8121 - acc: 0.6701 - val_loss: 0.8039 - val_acc: 0.6776
Epoch 5/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.8027 - acc: 0.6740 - val_loss: 0.7467 - val_acc: 0.6854
Epoch 6/10
32000/32000 [==============================] - 155s 5ms/step - loss: 0.6791 - acc: 0.7158 - val_loss: 0.5764 - val_acc: 0.7560
Epoch 7/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.5333 - acc: 0.7744 - val_loss: 0.5076 - val_acc: 0.7881
Epoch 8/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.4857 - acc: 0.7973 - val_loss: 0.4849 - val_acc: 0.7970
Epoch 9/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.4697 - acc: 0.8034 - val_loss: 0.4709 - val_acc: 0.8024
Epoch 10/10
32000/32000 [==============================] - 154s 5ms/step - loss: 0.4479 - acc: 0.8123 - val_loss: 0.4592 - val_acc: 0.8079

To evaluate our model, we will have to pass both test inputs to the evaluate function, as shown below:

score = model.evaluate(x=[X1_test, X2_test], y=y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

Here are the results:

10000/10000 [==============================] - 18s 2ms/step
Test Score: 0.4576087875843048
Test Accuracy: 0.8053

Our test accuracy is 80.53%, which is slightly less than our first model that uses textual input only. This shows that meta information in yelp_reviews is not very useful for sentiment prediction.

Anyway, now you know how to create a multiple-input model for text classification in Keras!

Finally, let's now plot the loss and accuracy for the training and validation sets:

import matplotlib.pyplot as plt

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

[Image: model accuracy and model loss plots]

You can see that the differences in loss and accuracy between the training and validation sets are minimal, so our model is not overfitting.
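The same conclusion can be read off the numbers. The sketch below copies the epoch 10 values from the training log above into a dictionary and computes the gaps; the variable names are assumptions made for illustration.

```python
# Values copied from the epoch 10 line of the training log.
final_epoch = {"acc": 0.8123, "val_acc": 0.8079, "loss": 0.4479, "val_loss": 0.4592}

# A large positive gap in either direction would suggest overfitting.
accuracy_gap = final_epoch["acc"] - final_epoch["val_acc"]
loss_gap = final_epoch["val_loss"] - final_epoch["loss"]

print(f"accuracy gap: {accuracy_gap:.4f}")
print(f"loss gap: {loss_gap:.4f}")
```

Both gaps are around one percentage point or less, which is consistent with the curves staying close together.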

Final Thoughts and Improvements

In this article, we built a very simple neural network, since the purpose of the article is to explain how to create a deep learning model that accepts multiple inputs of different types.

Following are some tips that you can follow to further improve the performance of the text classification model:

  1. We used only 50,000 of the 5.2 million records in this article because of hardware constraints. You can try training your model on a larger number of records and see if you can achieve better performance.
  2. Try adding more LSTM and dense layers to the model. If the model overfits, try adding dropout.
  3. Try changing the optimizer and training the model for a higher number of epochs.

Please share your results along with the neural network configuration in the comments section. I would love to see how well you performed.

22 Aug 2019 12:51pm GMT

PSF GSoC students blogs: Last Blog Post

Hey, everyone!!!

So, I have already written my last blog post, which you can read at this link: https://blogs.python-gsoc.org/en/iflameings-blog/. That post covers the entire GSoC period. Since the title of this post is "Last Blog Post", I will share my experience and how I got selected for GSoC. GSoC is a great platform if you want to get started with open source. It doesn't just teach you to contribute to open source; it also gives you a brand, which you will need at some point in your life. It will help you in getting internships and jobs. It also provides you with a community for a lifetime if you keep contributing to the project. The past 3 months were the best days of my life. I learned something new every day, and it enhanced my coding and programming skills. I saw the power of tests and how they help in refactoring and adding new features to a project. I learnt some trending industry frameworks (Jest, GraphQL, React, WebSocket) and many more small libraries, which I installed from npm and used in my project. I started contributing to gatsby-source-plone when Iodide didn't make it into GSoC. I solved some issues before the announcement of the GSoC organisations, but when Google announced the organisations, Plone was not selected. I got very frustrated, but I kept contributing to gatsby-source-plone. On the 7th of March I found that Plone comes under the PSF organisation, and I became very happy. I kept contributing to it and finally got selected for this project :)

22 Aug 2019 12:24pm GMT

Matt Layman: Celery In A Shiv App - Building SaaS #31

In this episode, we baked the Celery worker and beat scheduler tool into the Shiv app. This is one more step on the path to simplifying the set of tools on the production server. I started the stream by reviewing the refactoring that I did to conductor/main.py. The main file is used to dispatch to different tools with the Shiv bundle. The refactored version can pass control to Gunicorn, the Django management tools, or Celery.

22 Aug 2019 7:42am GMT

PSF GSoC students blogs: Week 11 Check-in

What did you do this week?

Documentation is written (Crystallinity map GUI, Clustering GUI, Crystallinity map + Clustering Jupyter Notebook)

Default frame view is reorganized (added new dropdown "ROI", code looks more consistent and nice)

What is coming up next?

Final evaluation

Did you get stuck anywhere?

No

22 Aug 2019 7:34am GMT

PSF GSoC students blogs: It's the final countdown - 6th blog post

Hello everyone!

This week started with an extra dose of anxiety for me. The deadline is coming, and I know I still have a lot of work to do.

Even though it looked like too much to do in so little time, that actually got me motivated, and I started to not let the anxiety get to me.

I coded a lot this week. And I enjoyed it like I hadn't for a couple of weeks. Not only did I have to create a new part of the setup for the CLI, I also had to test and document it. Julio said that I should've split it into smaller PRs, but honestly, even though I iterated on my code, I only thought about reviewing every part on its own and writing more, until it was finished. Every little method I wrote, I reviewed the next day, until I knew I was satisfied with what I wrote. The feeling of having thought of every case, every part of your code, is really rewarding, and that's what I was looking for in GSoC.

So, yes, I am happy that it is coming to an end; not because it's ending, but because I learned a lot and, even though I doubted myself many times, I was able to get through it, and that's something that no one can evaluate for me.

I look forward to contributing even more to the open source community. I was always afraid of the feedback and the interaction with so many diverse people. But, to be honest, it seems to be where I can learn the most, and learning is what I like to do the most. So that's also thanks to the GSoC program.

I'm thankful for my mentors as well; they were with me on this journey, and I look forward to working with them in the Python community. I have also decided to go to Python Brazil in October, and enjoy it for the community as well.

Leonardo Rodrigues.

22 Aug 2019 2:38am GMT

PSF GSoC students blogs: Blog #6

In the past week, my mentor and I tried to fix the Dockerfile that sets up Hadoop in an Ubuntu container from scratch. Since that was becoming tedious, we tried setting up a mini Hadoop cluster.

Apache has a mini Hadoop cluster setup that gives a single-node cluster. I tried building this using a Maven Docker image. The documentation has very little information on where Hadoop is actually getting downloaded and the ports it'll be connecting to by default. My mentor and I debugged the Dockerfile and tried to get this up and running, but there is still a problem with ports and I'm working on it. Also, we figured out how to get the files from HDFS, which can be either CSV or JSON files. I have implemented those changes as well.

Hopefully by next week I can finish this project.

22 Aug 2019 2:17am GMT

PSF GSoC students blogs: Weekly Check-in #10

In the past week, I was trying to set up Hadoop using a Dockerfile.

What did I do this week?

Setting up Hadoop in Docker with my limited knowledge of both is becoming a difficult task because few resources are available on how to set this up in Docker using a Dockerfile. Also, every time I have to build the container from scratch, downloading all the files again and setting it up is a time-consuming process. I tried an approach this week that got most of the instructions in the Dockerfile working, yet there is an issue with starting the containers. I have added the corresponding config files that would be used by Docker and also a start-up shell script that is run while building the container to start Hadoop after installing it.

What is coming up next?

I need to get the Dockerfile working by this week so that I can move ahead and refine the Hadoop source classes and add more tests if possible.

Did you get stuck anywhere?

Debugging the dockerfile was a difficult task for me. My mentor was very understanding and helped me in fixing it.

22 Aug 2019 1:25am GMT

PSF GSoC students blogs: Blog #5

This week I was trying my hand at setting up Hadoop in Docker.

The next phase of our project involves making it compatible with input from a Hadoop data source. With my limited knowledge of Hadoop and Docker, I was trying to set it up. First I set it up on my local computer and made it work. I wrote the basic classes that will be needed to establish a connection and successfully set up a connection.

I also added config() and args() methods that can be used to fetch the arguments and their corresponding values specific to the Hadoop source. In Hadoop, the challenging part is handling the files from HDFS. These files can be either CSV or JSON files, so I have to discuss with my mentor how I can handle this.

22 Aug 2019 1:06am GMT

21 Aug 2019

feedPlanet Python

Kushal Das: Remember to mark drive as removable for tails vm install

If you are installing Tails into a VM for testing or anything else, always remember to mark the drive as a removable USB drive. Otherwise, the installation step will finish properly, but you will get errors like the one in the following screenshot while booting from the drive.

[Screenshot: Tails error while booting]

The option to do so is available in the details section for the vm in virt-manager.

[Screenshot: Where in libvirt]

I wasted a few hours today while trying to get a new VM for the SecureDrop admin setup tests.

21 Aug 2019 8:18pm GMT

TechBeamers Python: Python Arrays in a Nutshell

Python arrays are a homogeneous data structure. They are used to store multiple items but allow only the same type of data. They are available in Python by importing the array module. Lists, a built-in type in Python, are also capable of storing multiple values. But they are different from arrays because they are not bound to any specific type. So, to summarize, arrays are not a fundamental type, but lists are built into Python. An array accepts values of one kind while lists are independent of the data type. In this tutorial, you'll get to know how to create

The post Python Arrays in a Nutshell appeared first on Learn Programming and Software Testing.

21 Aug 2019 6:49pm GMT

PyCharm: Python 3.8 support in PyCharm

The release of Python 3.8 brought new features to the Python coding realm. The language is evolving according to its community's needs by addressing cases where new syntax or logic becomes necessary. From new ways of assigning expressions to restrictions on the usage of function declarations, calls, and variable assignments, this latest release presents new options to code. Of course, PyCharm couldn't fall behind, so we now support some of the major features coming with this new version.

This article will walk you through the features currently supported by our latest PyCharm release. To try them out, get the latest version of PyCharm and download the current beta release of Python 3.8 from here. From there you will just need to switch to Python 3.8 as your interpreter in PyCharm (if you're not sure how to switch the interpreter, jump into our documentation for help).

Positional-only parameters

Function definitions are a key element when designing libraries and APIs for user consumption. The more explicit these definitions are, the easier they are to implement. One way to achieve such explicitness is through how the function can be called with its arguments. Until now, Python only had the option to define arguments as positional, keyword, or keyword-only, but with this new version we now have another way to define them by using positional-only parameters.

To use this feature, just set the arguments in your function definition and write a forward slash (/) after the last positional-only argument you want to declare. This is closely analogous to the keyword-only arguments syntax, but instead of setting the arguments after the asterisk (*), you do it before the slash (/).

Let's look at an example. Say, you have a function in your library that selects a value randomly from different values passed as arguments. Such values can be passed in any position and the function will return you a random choice. The semantic meaning stays the same, regardless of the order of the values in the function call. By design, you decide that those arguments should be positional-only:

[Image: positional-only-random]

By doing this, you ensure that your library's users won't be able to call your function with the arguments' keywords. In the past, if you renamed the arguments of your function for refactoring purposes (or any other reason), the code of your library's users would be at risk if they were to make a function call with keyword arguments (for example, select_random(a=3, b=89, c=54)). One of the advantages of positional-only parameters is that, if you decide to change the variable names in the function definition, your library's users won't be affected as there are no keyword dependencies in the function calls they make to begin with.
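Since the original screenshot is not reproduced here, the following is a minimal sketch of what such a function might look like. The name select_random comes from the text above; the body and the three-parameter signature are assumptions made for illustration. It needs Python 3.8 or later.

```python
import random


def select_random(a, b, c, /):
    """Return one of the three values at random; all parameters are positional-only."""
    return random.choice([a, b, c])


# Positional calls work as usual.
choice = select_random(3, 89, 54)
print(choice)

# Keyword calls are rejected, so callers cannot depend on the parameter names.
keyword_call_error = None
try:
    select_random(a=3, b=89, c=54)
except TypeError as error:
    keyword_call_error = error
print(keyword_call_error)
```

Renaming a, b, and c later would therefore not break any existing caller.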

Assignment expressions

A new way to assign values to variables is available with this latest Python version. Now, expressions can also assign values as part of their declaration, which removes the necessity to initialize variables in advance. As a result, you can make your code less verbose and add compactness, as declarations can be made within expressions in the same line.

The syntax to declare a variable consists of the walrus operator := inside an expression enclosed by parentheses. An important note is that the walrus operator is different from the equals operator. For example, comma-separated assignments with the equals operator are not the same as the ones made by the walrus operator.

One example of such usage can be a while loop with a control variable. When you use this feature, the loop's control expression will also hold the variable definition and reassignment.

[Image: assignment-expressions]

In the previous example, the 'before' while loop has a variable assignment before it and also inside its execution code. The 'after' loop has the assignment inside its control statement definition by using an assignment expression.
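The original before-and-after screenshot is not reproduced here, so the sketch below shows one possible pair of loops in the same spirit. Reading lines from an in-memory stream is an assumption chosen for illustration; it needs Python 3.8 or later.

```python
import io

stream = io.StringIO("alpha\nbeta\ngamma\n")

# Before: the control variable is assigned ahead of the loop and again inside it.
before = []
line = stream.readline()
while line:
    before.append(line.strip())
    line = stream.readline()

stream.seek(0)  # rewind so the second loop reads the same data

# After: the assignment expression does both jobs in the loop condition.
after = []
while (line := stream.readline()):
    after.append(line.strip())

print(before)
print(after)
```

Both loops collect the same lines; the second one simply has a single place where line is assigned.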

Final annotation and final decorator

When you design a class, you have to make sure your methods are used properly. With this new version, a final decorator and a Final type annotation are introduced to help restrict the usage of methods, classes, and variables. If needed, this feature will let you declare that a method should not be overridden, that a class should not be subclassed, or that a variable or attribute should not be reassigned.

The final decorator prohibits any class decorated with @final from being subclassed, and any method decorated with @final from being overridden in a subclass. Let's say you have a class that declares a method and that method is being used inside the class at different points. If the user modifies that method by overriding it while subclassing, there are risks that the base class behavior might change or run into errors. To avoid this, you can use the final decorator to prevent the user from overriding such a method.

Let's say you have a signature generator class like the following:

[Image: final-decorator-attribute]

When initialized, the signature is generated through a create_signature method which is called within the __init__ constructor method. Depending on your class design, you may opt to protect your create_signature method with the final decorator so it is not overridden if subclassed. With the final decorator, you ensure that any other method that depends on this method inside the class is not affected by a method override. In this case, the __init__ constructor method is using the create_signature method. By using the final decorator, you ensure that the initialization of the class will not be affected by any change that might be introduced by subclassing.

Another thing to notice is that in this example, we use the Final annotation with the ENCODER attribute. This class attribute holds the type of string encoding used in the create_signature method. By class design, we choose to use the Final annotation because we use that value within the methods of the class and we don't want it to be overridden as that would change the methods' behavior.
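The original screenshot of the class is not reproduced here. The sketch below is one way such a signature generator could look: the names create_signature and ENCODER come from the text, while the class name SignatureGenerator and the SHA-256 hashing are assumptions made for illustration. It needs Python 3.8 or later.

```python
import hashlib
from typing import Final, final


class SignatureGenerator:
    # Final tells type checkers that this attribute must not be reassigned.
    ENCODER: Final = "utf-8"

    def __init__(self, message: str) -> None:
        # The constructor relies on create_signature behaving as written.
        self.signature = self.create_signature(message)

    @final
    def create_signature(self, message: str) -> str:
        # @final tells type checkers that subclasses must not override this.
        return hashlib.sha256(message.encode(self.ENCODER)).hexdigest()


generator = SignatureGenerator("hello")
print(generator.signature)
```

Note that Python itself does not enforce either marker at runtime; violations are reported by static type checkers and by IDE inspections.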

Equals sign in f-strings

String formatting makes code more concise, readable, and less prone to error when exposing values. Variable names and values now can coexist in string contexts with the introduction of the equals sign in f-strings.

To take advantage of this new feature, type your f-string as follows: f'{expr=}' where expr is the variable that you want to expose. In this way, you get to generate a string that will show both your expression and its output.

[Image: f-strings-equals]

This feature is helpful when you'd like to write variable values to your log. If you happen to use this for debugging purposes, you may want to check out PyCharm's debugger.
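As a small illustration of the f'{expr=}' form, the sketch below formats two variables and one expression. The variable names are assumptions made for illustration; it needs Python 3.8 or later.

```python
user = "guido"
attempts = 3

# Each replacement field expands to the expression text, '=', and the repr of its value.
message = f"{user=} {attempts=} {attempts * 2=}"
print(message)
```

The text before the equals sign is reproduced exactly as typed, including any spaces inside the expression.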

Learn more

For more information about usages and examples where these features can be useful, take a look at PEP-0570, PEP-0572, PEP-0591, and bpo-36817.

We at PyCharm continue to work on supporting Python 3.8 fully, and we hope these features will come in handy for you when setting up or working with a project using Python 3.8. Support for other features should be expected in the near future, so be sure to pay close attention to our latest releases.

If you have any questions or suggestions, drop us a comment. Thanks!

21 Aug 2019 4:17pm GMT

PyCharm: Python 3.8 support in PyCharm

The release of Python 3.8 brought new features to the Python coding realm. The language is evolving according to its community's needs by addressing cases where new syntax or logic become necessary. From new ways of assigning expressions to restriction of usage of function declarations, calls, and variable assignations, this latest release presents new options to code. Of course, PyCharm couldn't get behind, so we now support some of the major features coming with this new version.

This article will walk you through the features currently supported by our latest PyCharm release. To try them out, get the latest version of PyCharm and download the current beta release of Python 3.8 from here. From there you will just need to switch to Python 3.8 as your interpreter in PyCharm (if you're not sure how to switch the interpreter, jump into our documentation for help).

Positional-only parameters

Function definitions are a key element when designing libraries and APIs for user consumption. The more explicit these definitions are, the easier they are to implement. One way to achieve such explicitness is to control how the function can be called with its arguments. Until now, Python only had the option to define arguments as positional, keyword, or keyword-only, but with this new version we have another way to define them: positional-only parameters.

To use this feature, just set the arguments in your function definition and write a forward slash / after the last positional-only argument you want to declare. This is closely analogous to the keyword-only arguments syntax, but instead of setting the arguments after the asterisk *, you do it before the slash /.

Let's look at an example. Say, you have a function in your library that selects a value randomly from different values passed as arguments. Such values can be passed in any position and the function will return you a random choice. The semantic meaning stays the same, regardless of the order of the values in the function call. By design, you decide that those arguments should be positional-only:

positional-only-random

By doing this, you ensure that your library's users won't be able to call your function with the arguments' keywords. In the past, if you renamed the arguments of your function for refactoring purposes (or any other reason), the code of your library's users would be at risk if they were to make a function call with keyword arguments (for example, select_random(a=3, b=89, c=54)). One of the advantages of positional-only parameters is that, if you decide to change the variable names in the function definition, your library's users won't be affected as there are no keyword dependencies in the function calls they make to begin with.
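The original screenshot is not available here, so the following is only a minimal sketch of what such a function might look like. The name select_random comes from the text; the body and the parameter names a, b, and c are assumptions made for illustration. It requires Python 3.8 or later.

```python
import random


def select_random(a, b, c, /):
    """Return one of the three values; every parameter is positional-only."""
    return random.choice([a, b, c])


# A positional call works as usual.
choice = select_random(3, 89, 54)

# A keyword call is rejected because of the / in the signature.
try:
    select_random(a=3, b=89, c=54)
except TypeError as error:
    keyword_error = error
```

Because callers can never name a, b, or c, those names can later be changed without breaking any calling code.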

Assignment expressions

A new way to assign values to variables is available with this latest Python version. Now, expressions can also assign values as part of their declaration, which removes the necessity to initialize variables in advance. As a result, you can make your code less verbose and add compactness, as declarations can be made within expressions in the same line.

The syntax to declare a variable consists of the walrus operator := inside an expression enclosed by parentheses. An important note is that the walrus operator is different from the equals operator. For example, comma-separated assignments with the equals operator are not the same as the ones made by the walrus operator.

One example of such usage can be a while loop with a control variable. When you use this feature, the loop's control expression will also hold the variable definition and reassignment.

assignment-expressions

In the previous example, the 'before' while loop has a variable assignment before it and also inside its execution code. The 'after' loop has the assignment inside its control statement definition by using an assignment expression.
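As a rough reconstruction of the 'before' and 'after' loops described here (the original image is not available), the sketch below reads lines from an in-memory stream. The stream contents and variable names are assumptions chosen for illustration.

```python
import io

stream = io.StringIO("first\nsecond\nthird\n")

# Before: the control variable is assigned ahead of the loop and again inside it.
before = []
line = stream.readline()
while line:
    before.append(line.strip())
    line = stream.readline()

stream.seek(0)

# After: the assignment expression lives in the loop's control expression.
after = []
while (line := stream.readline()):
    after.append(line.strip())
```

Both loops collect the same three lines; the second one simply needs one assignment instead of two.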

Final annotation and final decorator

When you design a class, you have to make sure your methods are used properly. With this new version, a final decorator and a Final type annotation are introduced to help restrict the usage of methods, classes, and variables. If needed, this feature will let you declare that a method should not be overridden, that a class should not be subclassed, or that a variable or attribute should not be reassigned.

The final decorator prohibits any class decorated with @final from being subclassed, and any method decorated with @final from being overridden in a subclass. Let's say you have a class that declares a method and that method is being used inside the class at different points. If the user modifies that method by overriding it while subclassing, there are risks that the base class behavior might change or run into errors. To avoid this, you can use the final decorator to prevent the user from overriding such a method.

Let's say you have a signature generator class like the following:

final-decorator-attribute

When initialized, the signature is generated through a create_signature method which is called within the __init__ constructor method. Depending on your class design, you may opt to protect your create_signature method with the final decorator so it is not overridden if subclassed. With the final decorator, you ensure that any other method that depends on this method inside the class is not affected by a method override. In this case, the __init__ constructor method is using the create_signature method. By using the final decorator, you ensure that the initialization of the class will not be affected by any change that might be introduced by subclassing.

Another thing to notice is that in this example, we use the Final annotation with the ENCODER attribute. This class attribute holds the type of string encoding used in the create_signature method. By class design, we choose to use the Final annotation because we use that value within the methods of the class and we don't want it to be overridden as that would change the methods' behavior.
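The class from the screenshot is not reproduced in this text, so here is a hedged sketch of how it could look. The names ENCODER and create_signature come from the article; the class name SignatureGenerator, the message parameter, and the SHA-256 hashing are assumptions. Note that typing.final and typing.Final are hints for static checkers (such as the one in PyCharm) and are not enforced when the code runs.

```python
import hashlib
from typing import Final, final


class SignatureGenerator:
    # Final tells type checkers that ENCODER must not be reassigned or overridden.
    ENCODER: Final = "utf-8"

    def __init__(self, message: str) -> None:
        # The constructor depends on create_signature behaving as defined here.
        self.signature = self.create_signature(message)

    @final
    def create_signature(self, message: str) -> str:
        # Type checkers report an error if a subclass overrides this method.
        return hashlib.sha256(message.encode(self.ENCODER)).hexdigest()


generator = SignatureGenerator("hello")
```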

Equals sign in f-strings

String formatting makes code more concise, readable, and less prone to error when exposing values. Variable names and values now can coexist in string contexts with the introduction of the equals sign in f-strings.

To take advantage of this new feature, type your f-string as follows: f'{expr=}' where expr is the variable that you want to expose. In this way, you get to generate a string that will show both your expression and its output.

f-strings-equals

This feature is helpful when you'd like to write variable values to your log. If you happen to use this for debugging purposes, you may want to check out PyCharm's debugger.
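A short illustration of the syntax, with variable names made up for the example (Python 3.8 or later):

```python
user = "guido"
attempts = 3

# Each {expr=} expands to the expression text, an equals sign, and the repr of its value.
message = f"{user=} {attempts=} {attempts + 1=}"
```

Here message holds the text user='guido' attempts=3 attempts + 1=4, which is convenient to pass straight to a logging call.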

Learn more

For more information about usages and examples where these features can be useful, take a look at PEP-0570, PEP-0572, PEP-0591, and bpo-36817.

We at PyCharm continue to work on supporting Python 3.8 fully, and we hope these features will come in handy for you when setting up or working with a project using Python 3.8. Support for other features should be expected in the near future, so be sure to pay close attention to our latest releases.

If you have any questions or suggestions, drop us a comment. Thanks!

21 Aug 2019 4:17pm GMT

Real Python: Your Guide to the CPython Source Code

Are there certain parts of Python that just seem magic? Like how dictionaries are so much faster than looping over a list to find an item? How does a generator remember the state of the variables each time it yields a value, and why do you never have to allocate memory like in other languages? It turns out that CPython, the most popular Python runtime, is written in human-readable C and Python code. This tutorial will walk you through the CPython source code.

You'll cover all the concepts behind the internals of CPython and how they work, with visual explanations as you go.

You'll learn how to:

Yes, this is a very long article. If you just made yourself a fresh cup of tea, coffee or your favorite beverage, it's going to be cold by the end of Part 1.

This tutorial is split into five parts. Take your time for each part and make sure you try out the demos and the interactive components. You can feel a sense of achievement that you grasp the core concepts of Python that can make you a better Python programmer.

Free Bonus: 5 Thoughts On Python Mastery, a free course for Python developers that shows you the roadmap and the mindset you'll need to take your Python skills to the next level.

Part 1: Introduction to CPython

When you type python at the console or install a Python distribution from python.org, you are running CPython. CPython is one of the many Python runtimes, maintained and written by different teams of developers. Some other runtimes you may have heard of are PyPy, Cython, and Jython.

The unique thing about CPython is that it contains both a runtime and the shared language specification that all Python runtimes use. CPython is the "official," or reference implementation of Python.

The Python language specification is the document that describes the Python language. For example, it says that assert is a reserved keyword, and that [] is used for indexing, slicing, and creating empty lists.

Think about what you expect to be inside the Python distribution on your computer:

These are all part of the CPython distribution. There's a lot more than just a compiler.

Note: This article is written against version 3.8.0b3 of the CPython source code.

What's in the Source Code?

The CPython source distribution comes with a whole range of tools, libraries, and components. We'll explore those in this article. First we are going to focus on the compiler.

To download a copy of the CPython source code, you can use git to pull the latest version to a working copy locally:

git clone https://github.com/python/cpython

Note: If you don't have Git available, you can download the source in a ZIP file directly from the GitHub website.

Inside of the newly downloaded cpython directory, you will find the following subdirectories:

cpython/
│
├── Doc      ← Source for the documentation
├── Grammar  ← A computer-readable language definition
├── Include  ← The C header files
├── Lib      ← Standard library modules written in Python
├── Mac      ← macOS support files
├── Misc     ← Miscellaneous files
├── Modules  ← Standard Library Modules written in C
├── Objects  ← Core types and the object model
├── Parser   ← The Python parser source code
├── PC       ← Windows build support files
├── PCbuild  ← Windows build support files for older Windows versions
├── Programs ← Source code for the python executable and other binaries
├── Python   ← The CPython interpreter source code
└── Tools    ← Standalone tools useful for building or extending Python

Next, we'll compile CPython from the source code. This step requires a C compiler, and some build tools, which depend on the operating system you're using.

Compiling CPython (macOS)

Compiling CPython on macOS is straightforward. You will first need the essential C compiler toolkit. The Command Line Development Tools is an app that you can update in macOS through the App Store. You need to perform the initial installation on the terminal.

To open up a terminal in macOS, go to the Launchpad, then Other, then choose the Terminal app. You will want to save this app to your Dock, so right-click the icon and select Keep in Dock.

Now, within the terminal, install the C compiler and toolkit by running the following:

$ xcode-select --install

This command will pop up with a prompt to download and install a set of tools, including Git, Make, and the Clang C compiler.

You will also need a working copy of OpenSSL to use for fetching packages from the PyPI.org website. If you later plan on using this build to install additional packages, SSL validation is required.

The simplest way to install OpenSSL on macOS is by using HomeBrew. If you already have HomeBrew installed, you can install the dependencies for CPython with the brew install command:

$ brew install openssl xz zlib

Now that you have the dependencies, you can run the configure script, enabling SSL support by discovering the location that HomeBrew installed to and enabling the debug hooks --with-pydebug:

$ CPPFLAGS="-I$(brew --prefix zlib)/include" \
 LDFLAGS="-L$(brew --prefix zlib)/lib" \
 ./configure --with-openssl=$(brew --prefix openssl) --with-pydebug

This will generate a Makefile in the root of the repository that you can use to automate the build process. The ./configure step only needs to be run once. You can build the CPython binary by running:

$ make -j2 -s

The -j2 flag allows make to run 2 jobs simultaneously. If you have 4 cores, you can change this to 4. The -s flag stops the Makefile from printing every command it runs to the console. You can remove this, but the output is very verbose.

During the build, you may receive some errors, and in the summary, it will notify you that not all packages could be built. For example, _dbm, _sqlite3, _uuid, nis, ossaudiodev, spwd, and _tkinter would fail to build with this set of instructions. That's okay if you aren't planning on developing against those packages. If you are, then check out the dev guide website for more information.

The build will take a few minutes and generate a binary called python.exe. Every time you make changes to the source code, you will need to re-run make with the same flags. The python.exe binary is the debug binary of CPython. Execute python.exe to see a working REPL:

$ ./python.exe
Python 3.8.0b3 (tags/v3.8.0b3:4336222407, Aug 21 2019, 10:00:03) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

Note: Yes, that's right, the macOS build has the file extension .exe. This is not because it's a Windows binary. Because macOS has a case-insensitive filesystem, the developers didn't want people working with the binary to accidentally refer to the directory Python/, so .exe was appended to avoid ambiguity. If you later run make install or make altinstall, it will rename the file back to python.

Compiling CPython (Linux)

For Linux, the first step is to download and install make, gcc, configure, and pkgconfig.

For Fedora Core, RHEL, CentOS, or other yum-based systems:

$ sudo yum install yum-utils

For Debian, Ubuntu, or other apt-based systems:

$ sudo apt install build-essential

Then install the required packages, for Fedora Core, RHEL, CentOS or other yum-based systems:

$ sudo yum-builddep python3

For Debian, Ubuntu, or other apt-based systems:

$ sudo apt install libssl-dev zlib1g-dev libncurses5-dev \
  libncursesw5-dev libreadline-dev libsqlite3-dev libgdbm-dev \
  libdb5.3-dev libbz2-dev libexpat1-dev liblzma-dev libffi-dev

Now that you have the dependencies, you can run the configure script, enabling the debug hooks --with-pydebug:

$ ./configure --with-pydebug

Review the output to ensure that OpenSSL support was marked as YES. Otherwise, check with your distribution for instructions on installing the headers for OpenSSL.

Next, you can build the CPython binary by running the generated Makefile:

$ make -j2 -s

During the build, you may receive some errors, and in the summary, it will notify you that not all packages could be built. That's okay if you aren't planning on developing against those packages. If you are, then check out the dev guide website for more information.

The build will take a few minutes and generate a binary called python. This is the debug binary of CPython. Execute ./python to see a working REPL:

$ ./python
Python 3.8.0b3 (tags/v3.8.0b3:4336222407, Aug 21 2019, 10:00:03) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

Compiling CPython (Windows)

Inside the PCbuild folder is a Visual Studio project file for building and exploring CPython. To use this, you need to have Visual Studio installed on your PC.

The newest version of Visual Studio, Visual Studio 2019, makes it easier to work with Python and the CPython source code, so it is recommended for use in this tutorial. If you already have Visual Studio 2017 installed, that would also work fine.

None of the paid features are required for compiling CPython or this tutorial. You can use the Community edition of Visual Studio, which is available for free from Microsoft's Visual Studio website.

Once you've downloaded the installer, you'll be asked to select which components you want to install. The bare minimum for this tutorial is:

Any other optional features can be deselected if you want to be more conscientious with disk space:

Visual Studio Options Window

The installer will then download and install all of the required components. The installation could take an hour, so you may want to read on and come back to this section.

Once the installer has completed, click the Launch button to start Visual Studio. You will be prompted to sign in. If you have a Microsoft account you can log in, or skip that step.

Once Visual Studio starts, you will be prompted to Open a Project. A shortcut to getting started with the Git configuration and cloning CPython is to choose the Clone or check out code option:

Choosing a Project Type in Visual Studio

For the project URL, type https://github.com/python/cpython to clone:

Cloning projects in Visual Studio

Visual Studio will then download a copy of CPython from GitHub using the version of Git bundled with Visual Studio. This step also saves you the hassle of having to install Git on Windows. The download may take 10 minutes.

Once the project has downloaded, you need to point it to the pcbuild Solution file, by clicking on Solutions and Projects and selecting pcbuild.sln:

Selecting a solution

When the solution is loaded, it will prompt you to retarget the projects inside the solution to the version of the C/C++ compiler you have installed. Visual Studio will also target the version of the Windows SDK you have installed.

Ensure that you change the Windows SDK version to the newest installed version and the platform toolset to the latest version. If you missed this window, you can right-click on the Solution in the Solutions and Projects window and click Retarget Solution.

Once this is complete, you need to download some source files to be able to build the whole CPython package. Inside the PCBuild folder there is a .bat file that automates this for you. Open up a command-line prompt inside the downloaded PCBuild and run get_externals.bat:

 > get_externals.bat
Using py -3.7 (found 3.7 with py.exe)
Fetching external libraries...
Fetching bzip2-1.0.6...
Fetching sqlite-3.21.0.0...
Fetching xz-5.2.2...
Fetching zlib-1.2.11...
Fetching external binaries...
Fetching openssl-bin-1.1.0j...
Fetching tcltk-8.6.9.0...
Finished.

Next, back within Visual Studio, build CPython by pressing Ctrl+Shift+B, or choosing Build Solution from the top menu. If you receive any errors about the Windows SDK being missing, make sure you set the right targeting settings in the Retarget Solution window. You should also see Windows Kits inside your Start Menu, and Windows Software Development Kit inside of that menu.

The build stage could take 10 minutes or more the first time. Once the build is completed, you may see a few warnings that you can ignore.

To start the debug version of CPython, press F5 and CPython will start in Debug mode straight into the REPL:

CPython debugging Windows

Once this is completed, you can run the Release build by changing the build configuration from Debug to Release on the top menu bar and running Build Solution again. You now have both Debug and Release versions of the CPython binary within PCBuild\win32\.

You can set up Visual Studio to be able to open a REPL with either the Release or Debug build by choosing Tools->Python->Python Environments from the top menu:

Choosing Python environments

Then click Add Environment and then target the Debug or Release binary. The Debug binary will end in _d.exe, for example, python_d.exe and pythonw_d.exe. You will most likely want to use the debug binary as it comes with Debugging support in Visual Studio and will be useful for this tutorial.

In the Add Environment window, target the python_d.exe file as the interpreter inside the PCBuild/win32 and the pythonw_d.exe as the windowed interpreter:

Adding an environment in VS2019

Now, you can start a REPL session by clicking Open Interactive Window in the Python Environments window and you will see the REPL for the compiled version of Python:

Python Environment REPL

During this tutorial there will be REPL sessions with example commands. I encourage you to use the Debug binary to run these REPL sessions in case you want to put in any breakpoints within the code.

Lastly, to make it easier to navigate the code, in the Solution View, click on the toggle button next to the Home icon to switch to Folder view:

Switching Environment Mode

Now that you have a version of CPython compiled and ready to go, let's find out how the CPython compiler works.

What Does a Compiler Do?

The purpose of a compiler is to convert one language into another. Think of a compiler like a translator. You would hire a translator to listen to you speaking in English and then speak in Japanese:

Translating from English to Japanese

Some compilers will compile into a low-level machine code which can be executed directly on a system. Other compilers will compile into an intermediary language, to be executed by a virtual machine.

One important decision to make when choosing a compiler is the system portability requirements. Java and .NET CLR will compile into an Intermediary Language so that the compiled code is portable across multiple systems architectures. C, Go, C++, and Pascal will compile into a low-level executable that will only work on systems similar to the one it was compiled on.

Because Python applications are typically distributed as source code, the role of the Python runtime is to convert the Python source code and execute it in one step. Internally, the CPython runtime does compile your code. A popular misconception is that Python is an interpreted language. It is actually compiled.

Python code is not compiled into machine code. It is compiled into a special low-level intermediary language called bytecode that only CPython understands. This bytecode is stored in .pyc files in a hidden directory (__pycache__) and cached for execution. If you run the same Python application twice without changing the source code, it will usually start faster the second time, because CPython loads the cached bytecode and executes it directly instead of compiling the source again.
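To make the compile step visible, here is a small sketch using only standard library calls. The one-line source string and the file name example.py are placeholders invented for the illustration; no file is actually written.

```python
import dis
import importlib.util

source = "total = 1 + 2\n"

# compile() turns source text into a code object that holds the bytecode.
code = compile(source, "example.py", "exec")

# dis lists the individual bytecode instructions by name.
instructions = [instruction.opname for instruction in dis.get_instructions(code)]

# Where CPython would cache the bytecode for a module with this file name.
cache_path = importlib.util.cache_from_source("example.py")

# The code object can be executed directly, without parsing the source again.
namespace = {}
exec(code, namespace)
```

The raw bytes are available as code.co_code, and by default cache_path points into a __pycache__ directory next to the source file.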

Why Is CPython Written in C and Not Python?

The C in CPython is a reference to the C programming language, implying that this Python distribution is written in the C language.

This statement is largely true: the compiler in CPython is written in pure C. However, many of the standard library modules are written in pure Python or a combination of C and Python.

So why is CPython written in C and not Python?

The answer is located in how compilers work. There are two types of compiler:

  1. Self-hosted compilers are compilers written in the language they compile, such as the Go compiler.
  2. Source-to-source compilers are compilers written in another language that already has a compiler.

If you're writing a new programming language from scratch, you need an executable application to compile your compiler! You need a compiler to execute anything, so when new languages are developed, they're often written first in an older, more established language.

A good example would be the Go programming language. The first Go compiler was written in C, then once Go could be compiled, the compiler was rewritten in Go.

CPython kept its C heritage: many of the standard library modules, like the ssl module or the socket module, are written in C to access low-level operating system APIs. The APIs in the Windows and Linux kernels for creating network sockets, working with the filesystem or interacting with the display are all written in C. It made sense for Python's extensibility layer to be focused on the C language. Later in this article, we will cover the Python Standard Library and the C modules.

There is a Python compiler written in Python called PyPy. PyPy's logo is an Ouroboros to represent the self-hosting nature of the compiler.

Another example of a cross-compiler for Python is Jython. Jython is written in Java and compiles from Python source code into Java bytecode. In the same way that CPython makes it easy to import C libraries and use them from Python, Jython makes it easy to import and reference Java modules and classes.

The Python Language Specification

Contained within the CPython source code is the definition of the Python language. This is the reference specification used by all the Python interpreters.

The specification is in both human-readable and machine-readable format. Inside the documentation is a detailed explanation of the Python language, what is allowed, and how each statement should behave.

Documentation

Located inside the Doc/reference directory are reStructuredText explanations of each of the features in the Python language. This forms the official Python reference guide on docs.python.org.

Inside the directory are the files you need to understand the whole language, structure, and keywords:

cpython/Doc/reference
|
├── compound_stmts.rst
├── datamodel.rst
├── executionmodel.rst
├── expressions.rst
├── grammar.rst
├── import.rst
├── index.rst
├── introduction.rst
├── lexical_analysis.rst
├── simple_stmts.rst
└── toplevel_components.rst

Inside compound_stmts.rst, the documentation for compound statements, you can see a simple example defining the with statement.

The with statement can be used in multiple ways in Python, the simplest being the instantiation of a context-manager and a nested block of code:

with x():
   ...

You can assign the result to a variable using the as keyword:

with x() as y:
   ...

You can also chain context managers together with a comma:

with x() as y, z() as jk:
   ...
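The x() and z() calls above are placeholders. As a runnable sketch of the same three forms, the example below defines a hypothetical context manager called managed with contextlib and records the order in which blocks are entered and exited.

```python
from contextlib import contextmanager

events = []


@contextmanager
def managed(name):
    """Hypothetical context manager that logs entry and exit."""
    events.append(f"enter {name}")
    try:
        yield name.upper()
    finally:
        events.append(f"exit {name}")


# Simplest form: no variable is bound.
with managed("x"):
    pass

# The as keyword binds whatever the context manager yields.
with managed("x") as y:
    first = y

# Chained context managers are entered left to right and exited in reverse.
with managed("x") as y, managed("z") as jk:
    pair = (y, jk)
```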

Next, we'll explore the computer-readable documentation of the Python language.

Grammar

The documentation contains the human-readable specification of the language, and the machine-readable specification is housed in a single file, Grammar/Grammar.

The Grammar file is written in a context-free notation called Backus-Naur Form (BNF). BNF is not specific to Python and is often used as the notation for grammars in many other languages.

The concept of grammatical structure in a programming language is inspired by Noam Chomsky's work on Syntactic Structures in the 1950s!

Python's grammar file uses the Extended-BNF (EBNF) specification with regular-expression syntax. So, in the grammar file you can use:

If you search for the with statement in the grammar file, at around line 80 you'll see the definitions for the with statement:

with_stmt: 'with' with_item (',' with_item)*  ':' suite
with_item: test ['as' expr]

Anything in quotes is a string literal, which is how keywords are defined. So the with_stmt is specified as:

  1. Starting with the word with
  2. Followed by a with_item, which is a test and (optionally), the word as, and an expression
  3. Followed by zero or more further with_items, each preceded by a comma
  4. Ending with a :
  5. Followed by a suite
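One way to see this structure from Python itself is the ast module, which exposes the parsed form of a statement. This is only an illustration of the shape of the rule, not of the grammar file or parser tables; the source strings are made up for the example.

```python
import ast

# Two with_items separated by a comma, each with an 'as' target.
with_node = ast.parse("with x() as y, z() as jk:\n    pass\n").body[0]
item_count = len(with_node.items)
targets = [item.optional_vars.id for item in with_node.items]

# The 'as' part is optional, so optional_vars is None when it is omitted.
bare_item = ast.parse("with x():\n    pass\n").body[0].items[0]
```

Each entry in with_node.items corresponds to one with_item, and with_node.body holds the statements of the suite.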

There are references to some other definitions in these two lines:

If you want to explore those in detail, the whole of the Python grammar is defined in this single file.

If you want to see a recent example of how grammar is used, in PEP 572 the colon equals operator was added to the grammar file in this Git commit.

Using pgen

The grammar file itself is never used by the Python compiler. Instead, a parser table created by a tool called pgen is used. pgen reads the grammar file and converts it into a parser table. If you make changes to the grammar file, you must regenerate the parser table and recompile Python.

Note: The pgen application was rewritten in Python 3.8 from C to pure Python.

To see pgen in action, let's change part of the Python grammar. Around line 51 you will see the definition of a pass statement:

pass_stmt: 'pass'

Change that line to accept either 'pass' or 'proceed' as the keyword:

pass_stmt: 'pass' | 'proceed'

Now you need to rebuild the grammar files. On macOS and Linux, run make regen-grammar to run pgen over the altered grammar file. For Windows, there is no officially supported way of running pgen. However, you can clone my fork and run build.bat --regen from within the PCBuild directory.

You should see an output similar to this, showing that the new Include/graminit.h and Python/graminit.c files have been generated:

# Regenerate Doc/library/token-list.inc from Grammar/Tokens
# using Tools/scripts/generate_token.py
...
python3 ./Tools/scripts/update_file.py ./Include/graminit.h ./Include/graminit.h.new
python3 ./Tools/scripts/update_file.py ./Python/graminit.c ./Python/graminit.c.new

Note: pgen works by converting the EBNF statements into a Non-deterministic Finite Automaton (NFA), which is then turned into a Deterministic Finite Automaton (DFA). The DFAs are used by the parser as parsing tables in a special way that's unique to CPython. This technique was formed at Stanford University and developed in the 1980s, just before the advent of Python.
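To give a feel for what a DFA-based parsing table does, here is a deliberately simplified sketch for the with_stmt rule shown earlier. The table layout, state numbers, and the accepts function are invented for this illustration and do not reflect the actual format of the generated graminit.c tables.

```python
# Hypothetical transition table for:
#   with_stmt: 'with' with_item (',' with_item)* ':' suite
# Each state maps an input symbol to the next state.
WITH_STMT_DFA = {
    0: {"with": 1},
    1: {"with_item": 2},
    2: {",": 1, ":": 3},  # a comma loops back to expect another with_item
    3: {"suite": 4},
    4: {},
}
ACCEPTING_STATES = {4}


def accepts(symbols):
    """Return True if the symbol sequence is a complete with_stmt."""
    state = 0
    for symbol in symbols:
        transitions = WITH_STMT_DFA[state]
        if symbol not in transitions:
            return False
        state = transitions[symbol]
    return state in ACCEPTING_STATES


single = accepts(["with", "with_item", ":", "suite"])
chained = accepts(["with", "with_item", ",", "with_item", ":", "suite"])
missing_item = accepts(["with", ":", "suite"])
```

Because every state has at most one transition per symbol, the parser never has to backtrack: it follows the table one symbol at a time.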

With the regenerated parser tables, you need to recompile CPython to see the new syntax. Use the same compilation steps you used earlier for your operating system.

If the code compiled successfully, you can execute your new CPython binary and start a REPL.

In the REPL, you can now try defining a function and instead of using the pass statement, use the proceed keyword alternative that you compiled into the Python grammar:

Python 3.8.0b3 (tags/v3.8.0b3:4336222407, Aug 21 2019, 10:00:03) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def example():
...    proceed
... 
>>> example()

Well done! You've changed the CPython syntax and compiled your own version of CPython. Ship it!

Next, we'll explore tokens and their relationship to grammar.

Tokens

Alongside the grammar file in the Grammar folder is a Tokens file, which contains each of the unique types found as a leaf node in a parse tree. We will cover parse trees in depth later. Each token also has a name and a generated unique ID. The names are used to make it simpler to refer to in the tokenizer.

Note: The Tokens file is a new feature in Python 3.8.

For example, the left parenthesis is called LPAR, and semicolons are called SEMI. You'll see these tokens later in the article:

LPAR                    '('
RPAR                    ')'
LSQB                    '['
RSQB                    ']'
COLON                   ':'
COMMA                   ','
SEMI                    ';'

As with the Grammar file, if you change the Tokens file, you need to run pgen again.
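The same names are exposed at runtime by the standard library's token module in Python 3.8 and later, which is generated from the Tokens file. A small sketch that looks up the names for the symbols listed above:

```python
import token

# Map each symbol to its token name via the generated lookup tables.
names = {
    symbol: token.tok_name[token.EXACT_TOKEN_TYPES[symbol]]
    for symbol in "()[]:,;"
}
```

The numeric IDs behind these names are the generated unique IDs mentioned earlier, so it is safer to rely on the names than on the numbers.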

To see tokens in action, you can use the tokenize module in CPython. Create a simple Python script called test_tokens.py:

# Hello world!
def my_function():
   proceed

For the rest of this tutorial, ./python.exe will refer to the compiled version of CPython. However, the actual command will depend on your system.

For Windows:

 > python.exe

For Linux:

 > ./python

For macOS:

 > ./python.exe

Then pass this file through a module built into the standard library called tokenize. You will see the list of tokens, by line and character. Use the -e flag to output the exact token name:

$ ./python.exe -m tokenize -e test_tokens.py

0,0-0,0:            ENCODING       'utf-8'        
1,0-1,14:           COMMENT        '# Hello world!'
1,14-1,15:          NL             '\n'           
2,0-2,3:            NAME           'def'          
2,4-2,15:           NAME           'my_function'  
2,15-2,16:          LPAR           '('            
2,16-2,17:          RPAR           ')'            
2,17-2,18:          COLON          ':'            
2,18-2,19:          NEWLINE        '\n'           
3,0-3,3:            INDENT         '   '          
3,3-3,10:           NAME           'proceed'      
3,10-3,11:          NEWLINE        '\n'           
4,0-4,0:            DEDENT         ''             
4,0-4,0:            ENDMARKER      ''              

In the output, the first column is the range of the line/column coordinates, the second column is the name of the token, and the final column is the value of the token.

In the output, the tokenize module has implied some tokens that were not in the file: the ENCODING token for utf-8, and a blank line at the end, giving a DEDENT to close the function declaration and an ENDMARKER to end the file.
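The same information is available programmatically. The sketch below feeds the contents of test_tokens.py to the tokenize module as an in-memory byte string (an assumption made so the example does not depend on a file on disk) and collects the exact token names, much like the -e flag does.

```python
import io
import tokenize

source = b"# Hello world!\ndef my_function():\n   proceed\n"

# tokenize.tokenize() expects a readline callable that returns bytes.
tokens = [
    (tokenize.tok_name[tok.exact_type], tok.string)
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]
```

Note that this pure-Python tokenizer treats proceed as an ordinary NAME token, so the snippet also runs on an unmodified Python build.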

It is best practice to have a blank line at the end of your Python source files. If you omit it, CPython adds it for you, with a tiny performance penalty.

The tokenize module is written in pure Python and is located in Lib/tokenize.py within the CPython source code.
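Because the module is plain Python, you can also call it from your own code rather than from the command line. The following is a minimal sketch that feeds the same three-line script to tokenize.tokenize() from an in-memory buffer. The helper name token_names() and the SOURCE constant are invented for this illustration and are not part of CPython:

```python
import io
import tokenize

# The same script as test_tokens.py, held in memory (illustrative constant)
SOURCE = "# Hello world!\ndef my_function():\n   pass\n"


def token_names(source):
    """Return (exact token name, token string) pairs for a source string."""
    # tokenize.tokenize() expects a readline callable that yields bytes
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [
        (tokenize.tok_name[tok.exact_type], tok.string)
        for tok in tokenize.tokenize(readline)
    ]


for name, value in token_names(SOURCE):
    print(f"{name:<10} {value!r}")
```

Using exact_type here plays the same role as the -e flag on the command line: operators are reported as LPAR, RPAR and COLON rather than as a generic OP token.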

Important: There are two tokenizers in the CPython source code: one written in Python, demonstrated here, and another written in C. The tokenizer written in Python is meant as a utility, and the one written in C is used by the Python compiler. They have identical output and behavior. The version written in C is designed for performance and the module in Python is designed for debugging.

To see a verbose readout of the C tokenizer, you can run Python with the -d flag. Using the test_tokens.py script you created earlier, run it with the following:

$ ./python.exe -d test_tokens.py

Token NAME/'def' ... It's a keyword
 DFA 'file_input', state 0: Push 'stmt'
 DFA 'stmt', state 0: Push 'compound_stmt'
 DFA 'compound_stmt', state 0: Push 'funcdef'
 DFA 'funcdef', state 0: Shift.
Token NAME/'my_function' ... It's a token we know
 DFA 'funcdef', state 1: Shift.
Token LPAR/'(' ... It's a token we know
 DFA 'funcdef', state 2: Push 'parameters'
 DFA 'parameters', state 0: Shift.
Token RPAR/')' ... It's a token we know
 DFA 'parameters', state 1: Shift.
  DFA 'parameters', state 2: Direct pop.
Token COLON/':' ... It's a token we know
 DFA 'funcdef', state 3: Shift.
Token NEWLINE/'' ... It's a token we know
 DFA 'funcdef', state 5: [switch func_body_suite to suite] Push 'suite'
 DFA 'suite', state 0: Shift.
Token INDENT/'' ... It's a token we know
 DFA 'suite', state 1: Shift.
Token NAME/'pass' ... It's a keyword
 DFA 'suite', state 3: Push 'stmt'
...
  ACCEPT.

In the output, you can see that it highlighted pass as a keyword. In the next chapter, we'll see how executing the Python binary gets to the tokenizer and what happens from there to execute your code.

Now that you have an overview of the Python grammar and the relationship between tokens and statements, there is a way to convert the pgen output into an interactive graph.

Here is a screenshot of the Python 3.8a2 grammar:

Python 3.8 DFA node graph

The Python package used to generate this graph, instaviz, will be covered in a later chapter.

Memory Management in CPython

Throughout this article, you will see references to a PyArena object. The arena is one of CPython's memory management structures. The code is within Python/pyarena.c and contains a wrapper around C's memory allocation and deallocation functions.

In a traditionally written C program, the developer should allocate memory for data structures before writing into that data. This allocation marks the memory as belonging to the process with the operating system.

It is also up to the developer to deallocate, or "free," the allocated memory when it's no longer being used and return it to the operating system's block table of free memory. If a process allocates memory for a variable, say within a function or loop, when that function has completed, the memory is not automatically given back to the operating system in C. So if it hasn't been explicitly deallocated in the C code, it causes a memory leak. The process will continue to take more memory each time that function runs until eventually, the system runs out of memory, and crashes!

Python takes that responsibility away from the programmer and uses two algorithms: a reference counter and a garbage collector.

Whenever an interpreter is instantiated, a PyArena is created and attached to one of the fields in the interpreter. During the lifecycle of a CPython interpreter, many arenas could be allocated. They are connected with a linked list. The arena stores a list of pointers to Python Objects as a PyListObject. Whenever a new Python object is created, a pointer to it is added using PyArena_AddPyObject(). This function call stores a pointer in the arena's list, a_objects.

The PyArena serves a second function, which is to allocate and reference a list of raw memory blocks. For example, a PyList would need extra memory if you added thousands of additional values. The PyList object's C code does not allocate memory directly. The object gets raw blocks of memory from the PyArena by calling PyArena_Malloc() from the PyObject with the required memory size. This task is completed by another abstraction in Objects/obmalloc.c. In the object allocation module, memory can be allocated, freed, and reallocated for a Python Object.

A linked list of allocated blocks is stored inside the arena, so that when an interpreter is stopped, all managed memory blocks can be deallocated in one go using PyArena_Free().

Take the PyListObject example. If you were to .append() an object to the end of a Python list, you don't need to reallocate the memory used in the existing list beforehand. The .append() method calls list_resize() which handles memory allocation for lists. Each list object keeps a list of the amount of memory allocated. If the item you're appending will fit inside the existing free memory, it is simply added. If the list needs more memory space, it is expanded. Lists are expanded in length as 0, 4, 8, 16, 25, 35, 46, 58, 72, 88.
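You can observe this over-allocation from Python without reading the C code. The sketch below is only an illustration: resize_points() is a made-up helper, it relies on sys.getsizeof() reporting the allocated size of a list (a CPython implementation detail), and the exact lengths it prints depend on the CPython version you compiled:

```python
import sys


def resize_points(appends=64):
    """Return the list lengths at which the allocated size changed."""
    items = []
    last_size = sys.getsizeof(items)
    points = []
    for i in range(appends):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:
            # The reported size only changes when the list had to grow,
            # so most appends reuse memory that was already allocated
            points.append(len(items))
            last_size = size
    return points


print(resize_points())
```

The output is much shorter than the number of appends, which is the point of the growth pattern: a resize is the exception rather than the rule.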

PyMem_Realloc() is called to expand the memory allocated in a list. PyMem_Realloc() is an API wrapper for pymalloc_realloc().

Python also has a special wrapper for the C call malloc(), which sets the max size of the memory allocation to help prevent buffer overflow errors (See PyMem_RawMalloc()).

In summary:

More information on the API is detailed on the CPython documentation.

Reference Counting

To create a variable in Python, you have to assign a value to a uniquely named variable:

my_variable = 180392

Whenever a value is assigned to a variable in Python, the name of the variable is checked within the locals and globals scope to see if it already exists.

Because my_variable is not already within the locals() or globals() dictionary, this new object is created, and the value is assigned as being the numeric constant 180392.

There is now one reference to my_variable, so the reference counter for my_variable is incremented by 1.

You will see function calls Py_INCREF() and Py_DECREF() throughout the C source code for CPython. These functions increment and decrement the count of references to that object.
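From Python, you can watch the counter move with sys.getrefcount(). This is a small sketch rather than a view into the C macros themselves: refcounts_around_alias() is a hypothetical helper, and the absolute numbers are inflated by one because passing the object to getrefcount() creates a temporary reference:

```python
import sys


def refcounts_around_alias():
    """Return the reference count before, during, and after an alias exists."""
    value = object()
    before = sys.getrefcount(value)

    alias = value                     # a second name for the same object
    during = sys.getrefcount(value)

    del alias                         # the extra reference goes away again
    after = sys.getrefcount(value)
    return before, during, after


print(refcounts_around_alias())
```

The middle number is one higher than the other two, which matches an increment when the alias is bound and a decrement when it is deleted.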

References to an object are decremented when a variable falls outside of the scope in which it was declared. Scope in Python can refer to a function or method, a comprehension, or a lambda function. These are some of the more literal scopes, but there are many other implicit scopes, like passing variables to a function call.

The handling of incrementing and decrementing references based on the language is built into the CPython compiler and the core execution loop, ceval.c, which we will cover in detail later in this article.

Whenever Py_DECREF() is called, and the counter becomes 0, the PyObject_Free() function is called. For that object PyArena_Free() is called for all of the memory that was allocated.

Garbage Collection

How often does your garbage get collected? Weekly, or fortnightly?

When you're finished with something, you discard it and throw it in the trash. But that trash won't get collected straight away. You need to wait for the garbage trucks to come and pick it up.

CPython has the same principle, using a garbage collection algorithm. CPython's garbage collector is enabled by default, happens in the background and works to deallocate memory that's been used for objects which are no longer in use.

Because the garbage collection algorithm is a lot more complex than the reference counter, it doesn't happen all the time, otherwise, it would consume a huge amount of CPU resources. It happens periodically, after a set number of operations.

CPython's standard library comes with a Python module to interface with the arena and the garbage collector, the gc module. Here's how to use the gc module in debug mode:

>>> import gc
>>> gc.set_debug(gc.DEBUG_STATS)

This will print the statistics whenever the garbage collector is run.

You can get the threshold after which the garbage collector is run by calling get_threshold():

>>> gc.get_threshold()
(700, 10, 10)

You can also get the current threshold counts:

>>> gc.get_count()
(688, 1, 1)

Lastly, you can run the collection algorithm manually:

>>> gc.collect()
24

This will call collect() inside the Modules/gcmodule.c file which contains the implementation of the garbage collector algorithm.
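Reference cycles are the typical case where the reference counter alone is not enough and the collector has to step in. The following sketch assumes nothing beyond the gc and weakref modules. Node and collect_cycle() are names invented for the example:

```python
import gc
import weakref


class Node:
    """A tiny object that can point at another Node."""

    def __init__(self):
        self.other = None


def collect_cycle():
    """Return whether a cyclic object is alive before and after gc.collect()."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep an automatic collection from interfering with the demo
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a      # a and b now reference each other
        probe = weakref.ref(a)       # lets us check on a without keeping it alive

        del a, b                     # counts stay above zero because of the cycle
        alive_before = probe() is not None

        gc.collect()                 # the collector finds and frees the cycle
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        if was_enabled:
            gc.enable()


print(collect_cycle())
```

The object survives the del statement and only disappears once the collection has run.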

Conclusion

In Part 1, you covered the structure of the source code repository, how to compile from source, and the Python language specification. These core concepts will be critical in Part 2 as you dive deeper into the Python interpreter process.

Part 2: The Python Interpreter Process

Now that you've seen the Python grammar and memory management, you can follow the process from typing python to the part where your code is executed.

There are five ways the python binary can be called:

  1. To run a single command with -c and a Python command
  2. To start a module with -m and the name of a module
  3. To run a file with the filename
  4. To run the stdin input using a shell pipe
  5. To start the REPL and execute commands one at a time
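Four of these modes are easy to try from a script. The sketch below drives whichever interpreter is running it through subprocess, so it does not need your compiled ./python.exe. run_python() and demo() are hypothetical helpers, json.tool is just a convenient standard library module to run with -m, and the REPL is left out because it needs an interactive terminal:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python(args, stdin_text=None):
    """Run the current interpreter with args and return its stripped stdout."""
    result = subprocess.run(
        [sys.executable, *args],
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo():
    outputs = {}
    # 1. A single command with -c
    outputs["-c"] = run_python(["-c", "print('hi from -c')"])
    # 2. A module with -m (json.tool reads JSON from stdin and pretty-prints it)
    outputs["-m"] = run_python(["-m", "json.tool"], stdin_text='{"a": 1}')
    # 3. A file, given by filename
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp, "hello.py")
        script.write_text("print('hi from a file')\n")
        outputs["file"] = run_python([str(script)])
    # 4. Source code piped in on stdin
    outputs["stdin"] = run_python([], stdin_text="print('hi from stdin')\n")
    return outputs


for mode, output in demo().items():
    print(mode, "->", output)
```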

The three source files you need to inspect to see this process are:

  1. Programs/python.c is a simple entry point.
  2. Modules/main.c contains the code to bring together the whole process, loading configuration, executing code and clearing up memory.
  3. Python/initconfig.c loads the configuration from the system environment and merges it with any command-line flags.

This diagram shows how each of those functions is called:

Python run swim lane diagram

The execution mode is determined from the configuration.

The CPython source code style:

There is an official style guide for the CPython C code, designed originally in 2001 and updated for modern versions.

There are some naming standards which help when navigating the source code:

  • Use a Py prefix for public functions, never for static functions. The Py_ prefix is reserved for global service routines like Py_FatalError. Specific groups of routines (like specific object type APIs) use a longer prefix, such as PyString_ for string functions.

  • Public functions and variables use MixedCase with underscores, like this: PyObject_GetAttr, Py_BuildValue, PyExc_TypeError.

  • Occasionally an "internal" function has to be visible to the loader. We use the _Py prefix for this, for example, _PyObject_Dump.

  • Macros should have a MixedCase prefix and then use upper case, for example PyString_AS_STRING, Py_PRINT_RAW.

Establishing Runtime Configuration

Python run swim lane diagram

In the swimlanes, you can see that before any Python code is executed, the runtime first establishes the configuration. The configuration of the runtime is a data structure defined in Include/cpython/initconfig.h named PyConfig.

The configuration data structure includes things like:

The configuration data is primarily used by the CPython runtime to enable and disable various features.

Python also comes with several Command Line Interface Options. In Python you can enable verbose mode with the -v flag. In verbose mode, Python will print messages to the screen when modules are loaded:

$ ./python.exe -v -c "print('hello world')"


# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
...

You will see a hundred lines or more with all the imports of your user site-packages and anything else in the system environment.

You can see the definition of this flag within Include/cpython/initconfig.h inside the struct for PyConfig:

/* --- PyConfig ---------------------------------------------- */

typedef struct {
    int _config_version;  /* Internal configuration version,
                             used for ABI compatibility */
    int _config_init;     /* _PyConfigInitEnum value */

    ...

    /* If greater than 0, enable the verbose mode: print a message each time a
       module is initialized, showing the place (filename or built-in module)
       from which it is loaded.

       If greater or equal to 2, print a message for each file that is checked
       for when searching for a module. Also provides information on module
       cleanup at exit.

       Incremented by the -v option. Set by the PYTHONVERBOSE environment
       variable. If set to -1 (default), inherit Py_VerboseFlag value. */
    int verbose;

In Python/initconfig.c, the logic for reading settings from environment variables and runtime command-line flags is established.

In the config_read_env_vars function, the environment variables are read and used to assign the values for the configuration settings:

static PyStatus
config_read_env_vars(PyConfig *config)
{
    PyStatus status;
    int use_env = config->use_environment;

    /* Get environment variables */
    _Py_get_env_flag(use_env, &config->parser_debug, "PYTHONDEBUG");
    _Py_get_env_flag(use_env, &config->verbose, "PYTHONVERBOSE");
    _Py_get_env_flag(use_env, &config->optimization_level, "PYTHONOPTIMIZE");
    _Py_get_env_flag(use_env, &config->inspect, "PYTHONINSPECT");

For the verbose setting, you can see that the value of PYTHONVERBOSE is used to set the value of &config->verbose, if PYTHONVERBOSE is found. If the environment variable does not exist, then the default value of -1 will remain.

Then in config_parse_cmdline within initconfig.c again, the command-line flag is used to set the value, if provided:

static PyStatus
config_parse_cmdline(PyConfig *config, PyWideStringList *warnoptions,
                     Py_ssize_t *opt_index)
{
...

        switch (c) {
...

        case 'v':
            config->verbose++;
            break;
...
        /* This space reserved for other options */

        default:
            /* unknown argument: parsing failed */
            config_usage(1, program);
            return _PyStatus_EXIT(2);
        }
    } while (1);

This value is later copied to a global variable Py_VerboseFlag by the _Py_GetGlobalVariablesAsDict function.

Within a Python session, you can access the runtime flags, such as verbose mode and quiet mode, using the sys.flags named tuple. The -X flags are all available inside the sys._xoptions dictionary:

$ ./python.exe -X dev -q       

>>> import sys
>>> sys.flags
sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, 
 no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, 
 quiet=1, hash_randomization=1, isolated=0, dev_mode=True, utf8_mode=0)

>>> sys._xoptions
{'dev': True}
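To see both configuration sources feed the same field, you can start child interpreters with different flags and environment variables and read sys.flags.verbose back. This is a sketch under a few assumptions: verbose_level() and PROBE are names made up for the example, and it uses the interpreter running the script rather than your compiled binary:

```python
import os
import subprocess
import sys

# One-liner executed in the child interpreter (illustrative constant)
PROBE = "import sys; print(sys.flags.verbose)"


def verbose_level(extra_args=(), extra_env=None):
    """Return sys.flags.verbose as seen by a child interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONVERBOSE", None)   # start from a clean slate
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    # Verbose import messages go to stderr, so the last stdout line is the number
    return int(result.stdout.strip().splitlines()[-1])


print("no flag:        ", verbose_level())
print("-v:             ", verbose_level(["-v"]))
print("-v -v:          ", verbose_level(["-v", "-v"]))
print("PYTHONVERBOSE=1:", verbose_level(extra_env={"PYTHONVERBOSE": "1"}))
```

Each -v adds one to the level, which lines up with the config->verbose++ statement in the switch shown above.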

As well as the runtime configuration in Include/cpython/initconfig.h, there is also the build configuration, which is located inside pyconfig.h in the root folder. This file is created dynamically in the configure step in the build process, or by Visual Studio for Windows systems.

You can see the build configuration by running:

$ ./python.exe -m sysconfig

Reading Files/Input

Once CPython has the runtime configuration and the command-line arguments, it can establish what it needs to execute.

This task is handled by the pymain_main function inside Modules/main.c. Depending on the newly created config instance, CPython will now execute code provided via several options.

Input via -c

The simplest is providing CPython a command with the -c option and a Python program inside quotes.

For example:

$ ./python.exe -c "print('hi')"
hi

Here is the full flowchart of how this happens:

Flow chart of pymain_run_command

First, the pymain_run_command() function is executed inside Modules/main.c taking the command passed in -c as an argument in the C type wchar_t*. The wchar_t* type is often used as a low-level storage type for Unicode data across CPython as the size of the type can store UTF8 characters.

When converting the wchar_t* to a Python string, the Objects/unicodeobject.c file has a helper function PyUnicode_FromWideChar() that returns a PyObject, of type str. The encoding to UTF8 is then done by PyUnicode_AsUTF8String() on the Python str object to convert it to a Python bytes object.

Once this is complete, pymain_run_command() will then pass the Python bytes object to PyRun_SimpleStringFlags() for execution, but first converting the bytes to a str type again:

static int
pymain_run_command(wchar_t *command, PyCompilerFlags *cf)
{
    PyObject *unicode, *bytes;
    int ret;

    unicode = PyUnicode_FromWideChar(command, -1);
    if (unicode == NULL) {
        goto error;
    }

    if (PySys_Audit("cpython.run_command", "O", unicode) < 0) {
        return pymain_exit_err_print();
    }

    bytes = PyUnicode_AsUTF8String(unicode);
    Py_DECREF(unicode);
    if (bytes == NULL) {
        goto error;
    }

    ret = PyRun_SimpleStringFlags(PyBytes_AsString(bytes), cf);
    Py_DECREF(bytes);
    return (ret != 0);

error:
    PySys_WriteStderr("Unable to decode the command from the command line:\n");
    return pymain_exit_err_print();
}

The conversion of wchar_t* to Unicode, bytes, and then a string is roughly equivalent to the following:

unicode = str(command)
bytes_ = bytes(unicode.encode('utf8'))
# call PyRun_SimpleStringFlags with bytes_

The PyRun_SimpleStringFlags() function is part of Python/pythonrun.c. Its purpose is to turn this simple command into a Python module and then send it on to be executed. Since a Python module needs to have __main__ to be executed as a standalone module, it creates that automatically:

int
PyRun_SimpleStringFlags(const char *command, PyCompilerFlags *flags)
{
    PyObject *m, *d, *v;
    m = PyImport_AddModule("__main__");
    if (m == NULL)
        return -1;
    d = PyModule_GetDict(m);
    v = PyRun_StringFlags(command, Py_file_input, d, d, flags);
    if (v == NULL) {
        PyErr_Print();
        return -1;
    }
    Py_DECREF(v);
    return 0;
}

Once PyRun_SimpleStringFlags() has created a module and a dictionary, it calls PyRun_StringFlags(), which creates a fake filename and then calls the Python parser to create an AST from the string and return a module, mod:

PyObject *
PyRun_StringFlags(const char *str, int start, PyObject *globals,
                  PyObject *locals, PyCompilerFlags *flags)
{
...
    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod != NULL)
        ret = run_mod(mod, filename, globals, locals, flags, arena);
    PyArena_Free(arena);
    return ret;
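The same sequence can be approximated in pure Python. Treat the following as a rough analogy rather than a translation of the C code: run_simple_string() is a hypothetical helper, it builds a fresh module object named __main__ instead of touching the real one, and the "<string>" filename is an assumption standing in for the fake filename mentioned above:

```python
import ast
import types


def run_simple_string(command):
    """Run a command string inside a fresh module named __main__."""
    # Stand-in for PyImport_AddModule("__main__") and PyModule_GetDict()
    module = types.ModuleType("__main__")
    namespace = module.__dict__

    # Parse the string into an AST, using a placeholder filename
    tree = ast.parse(command, filename="<string>", mode="exec")

    # Compile and execute with the module dict as both globals and locals
    code = compile(tree, filename="<string>", mode="exec")
    exec(code, namespace, namespace)
    return namespace


result = run_simple_string("x = 6 * 7")
print(result["__name__"], result["x"])
```

Passing the same dictionary as globals and locals mirrors the d, d arguments in the call to PyRun_StringFlags().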

You'll dive into the AST and Parser code in the next section.

Input via -m

Another way to execute Python commands is by using the -m option with the name of a module. A typical example is python -m unittest to run the unittest module in the standard library.

The ability to execute modules as scripts was initially proposed in PEP 338, and the standard for explicit relative imports was then defined in PEP 366.

The use of the -m flag implies that within the module package, you want to execute whatever is inside __main__. It also implies that you want to search sys.path for the named module.

This search mechanism is why you don't need to remember where the unittest module is stored on your filesystem.

Inside Modules/main.c there is a function called when the command-line is run with the -m flag. The name of the module is passed as the modname argument.

CPython will then import a standard library module, runpy and execute it using PyObject_Call(). The import is done using the C API function PyImport_ImportModule(), found within the Python/import.c file:

static int
pymain_run_module(const wchar_t *modname, int set_argv0)
{
    PyObject *module, *runpy, *runmodule, *runargs, *result;
    runpy = PyImport_ImportModule("runpy");
 ...
    runmodule = PyObject_GetAttrString(runpy, "_run_module_as_main");
 ...
    module = PyUnicode_FromWideChar(modname, wcslen(modname));
 ...
    runargs = Py_BuildValue("(Oi)", module, set_argv0);
 ...
    result = PyObject_Call(runmodule, runargs, NULL);
 ...
    if (result == NULL) {
        return pymain_exit_err_print();
    }
    Py_DECREF(result);
    return 0;
}

In this function you'll also see 2 other C API functions: PyObject_Call() and PyObject_GetAttrString(). Because PyImport_ImportModule() returns a PyObject*, the core object type, you need to call special functions to get attributes and to call it.

In Python, if you had an object and wanted to get an attribute, then you could call getattr(). In the C API, this call is PyObject_GetAttrString(), which is found in Objects/object.c. If you wanted to run a callable, you would give it parentheses, or you can run the __call__() property on any Python object. The __call__() method is implemented inside Objects/object.c:

hi = "hi!"
hi.upper() == hi.upper.__call__()  # this is the same

The runpy module is written in pure Python and located in Lib/runpy.py.

Executing python -m <module> is equivalent to running python -m runpy <module>. The runpy module was created to abstract the process of locating and executing modules on an operating system.
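You can call runpy yourself to get a feel for what it does. The sketch below uses the public runpy.run_module() function rather than the private _run_module_as_main() that CPython calls from C. The module name demo_mod_for_runpy and the helper run_temp_module() are made up for the example:

```python
import runpy
import sys
import tempfile
from pathlib import Path


def run_temp_module():
    """Write a throwaway module, run it as __main__, and return its globals."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_mod_for_runpy.py").write_text(
            "ANSWER = 6 * 7\nRUN_AS = __name__\n"
        )
        sys.path.insert(0, tmp)      # make the module findable on sys.path
        try:
            # Locate the module by name and execute it as if it were a script
            return runpy.run_module("demo_mod_for_runpy", run_name="__main__")
        finally:
            sys.path.remove(tmp)


module_globals = run_temp_module()
print(module_globals["RUN_AS"], module_globals["ANSWER"])
```

The module sees __name__ set to __main__, which is what makes an if __name__ == "__main__": block fire when you use -m.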

runpy does a few things to run the target module:

The runpy module also supports executing directories and zip files.

Input via Filename

If the first argument to python was a filename, such as python test.py, then CPython will open a file handle, similar to using open() in Python and pass the handle to PyRun_SimpleFileExFlags() inside Python/pythonrun.c.

There are 3 paths this function can take:

  1. If the file path is a .pyc file, it will call run_pyc_file().
  2. If the file path is a script file (.py) it will run PyRun_FileExFlags().
  3. If the filepath is stdin because the user ran command | python then treat stdin as a file handle and run PyRun_FileExFlags().

int
PyRun_SimpleFileExFlags(FILE *fp, const char *filename, int closeit,
                        PyCompilerFlags *flags)
{
 ...
    m = PyImport_AddModule("__main__");
 ...
    if (maybe_pyc_file(fp, filename, ext, closeit)) {
 ...
        v = run_pyc_file(pyc_fp, filename, d, d, flags);
    } else {
        /* When running from stdin, leave __main__.__loader__ alone */
        if (strcmp(filename, "<stdin>") != 0 &&
            set_main_loader(d, filename, "SourceFileLoader") < 0) {
            fprintf(stderr, "python: failed to set __main__.__loader__\n");
            ret = -1;
            goto done;
        }
        v = PyRun_FileExFlags(fp, filename, Py_file_input, d, d,
                              closeit, flags);
    }
 ...
    return ret;
}

Input via File With PyRun_FileExFlags()

For stdin and basic script files, CPython will pass the file handle to PyRun_FileExFlags() located in the pythonrun.c file.

The purpose of PyRun_FileExFlags() is similar to PyRun_SimpleStringFlags() used for the -c input. CPython will load the file handle into PyParser_ASTFromFileObject(). We'll cover the Parser and AST modules in the next section. Because this is a full script, it doesn't need the PyImport_AddModule("__main__"); step used by -c:

PyObject *
PyRun_FileExFlags(FILE *fp, const char *filename_str, int start, PyObject *globals,
                  PyObject *locals, int closeit, PyCompilerFlags *flags)
{
 ...
    mod = PyParser_ASTFromFileObject(fp, filename, NULL, start, 0, 0,
 ...
    ret = run_mod(mod, filename, globals, locals, flags, arena);
}

As with PyRun_SimpleStringFlags(), once PyRun_FileExFlags() has created a Python module from the file, it sends it to run_mod() to be executed.

run_mod() is found within Python/pythonrun.c, and sends the module to the AST to be compiled into a code object. Code objects are a format used to store the bytecode operations and the format kept in .pyc files:

static PyObject *
run_mod(mod_ty mod, PyObject *filename, PyObject *globals, PyObject *locals,
            PyCompilerFlags *flags, PyArena *arena)
{
    PyCodeObject *co;
    PyObject *v;
    co = PyAST_CompileObject(mod, filename, flags, -1, arena);
    if (co == NULL)
        return NULL;

    if (PySys_Audit("exec", "O", co) < 0) {
        Py_DECREF(co);
        return NULL;
    }

    v = run_eval_code_obj(co, globals, locals);
    Py_DECREF(co);
    return v;
}

We will cover the CPython compiler and bytecodes in the next section. The call to run_eval_code_obj() is a simple wrapper function that calls PyEval_EvalCode() in the Python/ceval.c file. The PyEval_EvalCode() function is the main evaluation loop for CPython. It iterates over each bytecode statement and executes it on your local machine.
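The built-in compile() and exec() functions give you a Python-level view of the same three stages. This is a loose sketch and not the code path CPython takes internally. compile_and_run() is a hypothetical helper, and the comments mapping each line to a C function are approximations:

```python
import ast
import types


def compile_and_run(source):
    """Parse, compile, and execute source, returning the code object and globals."""
    tree = ast.parse(source, filename="<example>", mode="exec")  # roughly the mod_ty stage
    code = compile(tree, filename="<example>", mode="exec")      # roughly PyAST_CompileObject()
    namespace = {}
    exec(code, namespace)                                         # roughly run_eval_code_obj()
    return code, namespace


code, namespace = compile_and_run("total = sum(range(5))")
print(isinstance(code, types.CodeType))   # the compiled result is a code object
print(code.co_names)                      # names the bytecode refers to
print(len(code.co_code), "bytes of bytecode")
print(namespace["total"])
```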

Input via Compiled Bytecode With run_pyc_file()

In the PyRun_SimpleFileExFlags() there was a clause for the user providing a file path to a .pyc file. If the file path ended in .pyc then instead of loading the file as a plain text file and parsing it, it will assume that the .pyc file contains a code object written to disk.

The run_pyc_file() function inside Python/pythonrun.c then marshals the code object from the .pyc file by using the file handle. Marshaling is a technical term for copying the contents of a file into memory and converting them to a specific data structure. The code object data structure on the disk is the CPython compiler's way of caching compiled code so that it doesn't need to parse it every time the script is called:

static PyObject *
run_pyc_file(FILE *fp, const char *filename, PyObject *globals,
             PyObject *locals, PyCompilerFlags *flags)
{
    PyCodeObject *co;
    PyObject *v;
  ...
    v = PyMarshal_ReadLastObjectFromFile(fp);
  ...
    if (v == NULL || !PyCode_Check(v)) {
        Py_XDECREF(v);
        PyErr_SetString(PyExc_RuntimeError,
                   "Bad code object in .pyc file");
        goto error;
    }
    fclose(fp);
    co = (PyCodeObject *)v;
    v = run_eval_code_obj(co, globals, locals);
    if (v && flags)
        flags->cf_flags |= (co->co_flags & PyCF_MASK);
    Py_DECREF(co);
    return v;
}

Once the code object has been marshaled to memory, it is sent to run_eval_code_obj(), which calls Python/ceval.c to execute the code.
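The standard library exposes enough to repeat this round trip by hand. The following sketch writes a .pyc with py_compile, then reads it back with marshal. It assumes the 16-byte header layout used by Python 3.7 and later, and run_pyc_roundtrip(), HEADER_SIZE, and the file names are invented for the example:

```python
import importlib.util
import marshal
import py_compile
import tempfile
import types
from pathlib import Path

# Assumption: magic number + flags + 8 bytes of source metadata (Python 3.7+)
HEADER_SIZE = 16


def run_pyc_roundtrip(source):
    """Compile source to a .pyc file, load it back, and execute it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "cached_demo.py")
        pyc = Path(tmp, "cached_demo.pyc")
        src.write_text(source)
        py_compile.compile(str(src), cfile=str(pyc), doraise=True)
        data = pyc.read_bytes()

    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise RuntimeError("Bad magic number in .pyc file")

    # Everything after the header is the marshaled code object
    code = marshal.loads(data[HEADER_SIZE:])
    if not isinstance(code, types.CodeType):
        raise RuntimeError("Bad code object in .pyc file")

    namespace = {}
    exec(code, namespace)   # no parsing or compiling happens at this point
    return namespace


print(run_pyc_roundtrip("value = 21 * 2\n")["value"])
```

The isinstance() check plays the same role as PyCode_Check() in the C function above.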

Lexing and Parsing

In the exploration of reading and executing Python files, we dived as deep as the parser and AST modules, with function calls to PyParser_ASTFromFileObject().

Sticking within Python/pythonrun.c, the PyParser_ASTFromFileObject() function will take a file handle, compiler flags and a PyArena instance and convert the file object into a node object using PyParser_ParseFileObject().

With the node object, it will then convert that into a module using the AST function PyAST_FromNodeObject():

mod_ty
PyParser_ASTFromFileObject(FILE *fp, PyObject *filename, const char* enc,
                           int start, const char *ps1,
                           const char *ps2, PyCompilerFlags *flags, int *errcode,
                           PyArena *arena)
{
    ...
    node *n = PyParser_ParseFileObject(fp, filename, enc,
                                       &_PyParser_Grammar,
                                       start, ps1, ps2, &err, &iflags);
    ...
    if (n) {
        flags->cf_flags |= iflags & PyCF_MASK;
        mod = PyAST_FromNodeObject(n, flags, filename, arena);
        PyNode_Free(n);
    ...
    return mod;
}

For PyParser_ParseFileObject() we switch to Parser/parsetok.c and the parser-tokenizer stage of the CPython interpreter. This function has two important tasks:

  1. Instantiate a tokenizer state tok_state using PyTokenizer_FromFile() in Parser/tokenizer.c
  2. Convert the tokens into a concrete parse tree (a list of node) using parsetok() in Parser/parsetok.c

node *
PyParser_ParseFileObject(FILE *fp, PyObject *filename,
                         const char *enc, grammar *g, int start,
                         const char *ps1, const char *ps2,
                         perrdetail *err_ret, int *flags)
{
    struct tok_state *tok;
...
    if ((tok = PyTokenizer_FromFile(fp, enc, ps1, ps2)) == NULL) {
        err_ret->error = E_NOMEM;
        return NULL;
    }
...
    return parsetok(tok, g, start, err_ret, flags);
}

tok_state (defined in Parser/tokenizer.h) is the data structure to store all temporary data generated by the tokenizer. It is returned to the parser-tokenizer as the data structure is required by parsetok() to develop the concrete syntax tree.

Inside parsetok(), it will use the tok_state structure and make calls to tok_get() in a loop until the file is exhausted and no more tokens can be found.

tok_get(), defined in Parser/tokenizer.c behaves like an iterator. It will keep returning the next token in the parse tree.

tok_get() is one of the most complex functions in the whole CPython codebase. It has over 640 lines and includes decades of heritage with edge cases, new language features, and syntax.

One of the simpler examples would be the part that converts a newline break into a NEWLINE token:

static int
tok_get(struct tok_state *tok, char **p_start, char **p_end)
{
...
    /* Newline */
    if (c == '\n') {
        tok->atbol = 1;
        if (blankline || tok->level > 0) {
            goto nextline;
        }
        *p_start = tok->start;
        *p_end = tok->cur - 1; /* Leave '\n' out of the string */
        tok->cont_line = 0;
        if (tok->async_def) {
            /* We're somewhere inside an 'async def' function, and
               we've encountered a NEWLINE after its signature. */
            tok->async_def_nl = 1;
        }
        return NEWLINE;
    }
...
}

In this case, NEWLINE is a token, with a value defined in Include/token.h. All tokens are constant int values, and the Include/token.h file was generated earlier when we ran make regen-grammar.
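The C snippet skips the NEWLINE token for blank lines and for line breaks inside brackets (tok->level > 0). The pure-Python tokenize module makes that distinction visible by reporting those line breaks as a separate NL token type, so you can count both kinds. This is an illustration using the Python tokenizer, not the C one, and newline_counts() and SOURCE are names made up for the example:

```python
import io
import tokenize
from collections import Counter

# One line break inside parentheses, one blank line, two real statement ends
SOURCE = "x = (1,\n     2)\n\ny = 3\n"


def newline_counts(source):
    """Count NEWLINE (end of a logical line) and NL (any other line break)."""
    counts = Counter()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            counts[tokenize.tok_name[tok.type]] += 1
    return dict(counts)


print(newline_counts(SOURCE))
```

Only the two line breaks that end a statement are counted as NEWLINE. The break inside the parentheses and the blank line are the cases the C code jumps past with goto nextline.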

The node type returned by PyParser_ParseFileObject() is going to be essential for the next stage, converting a parse tree into an Abstract-Syntax-Tree (AST):

typedef struct _node {
    short               n_type;
    char                *n_str;
    int                 n_lineno;
    int                 n_col_offset;
    int                 n_nchildren;
    struct _node        *n_child;
    int                 n_end_lineno;
    int                 n_end_col_offset;
} node;

Since the CST is a tree of syntax, token IDs, and symbols, it would be difficult for the compiler to make quick decisions based on the Python language.

That is why the next stage is to convert the CST into an AST, a much higher-level structure. This task is performed by the Python/ast.c module, which has both a C and Python API.

Before you jump into the AST, there is a way to access the output from the parser stage. CPython has a standard library module parser, which exposes the C functions with a Python API.

The module is documented as an implementation detail of CPython so that you won't see it in other Python interpreters. Also the output from the functions is not that easy to read.

The output will be in the numeric form, using the token and symbol numbers generated by the make regen-grammar stage, stored in Include/token.h and Include/symbol.h:

>>> from pprint import pprint
>>> import parser
>>> st = parser.expr('a + 1')
>>> pprint(parser.st2list(st))
[258,
 [332,
  [306,
   [310,
    [311,
     [312,
      [313,
       [316,
        [317,
         [318,
          [319,
           [320,
            [321, [322, [323, [324, [325, [1, 'a']]]]]],
            [14, '+'],
            [321, [322, [323, [324, [325, [2, '1']]]]]]]]]]]]]]]]],
 [4, ''],
 [0, '']]

To make it easier to understand, you can take all the numbers in the symbol and token modules, put them into a dictionary and recursively replace the values in the output of parser.st2list() with the names:

import symbol
import token
import parser

def lex(expression):
    symbols = {v: k for k, v in symbol.__dict__.items() if isinstance(v, int)}
    tokens = {v: k for k, v in token.__dict__.items() if isinstance(v, int)}
    lexicon = {**symbols, **tokens}
    st = parser.expr(expression)
    st_list = parser.st2list(st)

    def replace(l: list):
        r = []
        for i in l:
            if isinstance(i, list):
                r.append(replace(i))
            else:
                if i in lexicon:
                    r.append(lexicon[i])
                else:
                    r.append(i)
        return r

    return replace(st_list)

You can run lex() with a simple expression, like a + 1, to see how this is represented as a parse tree:

>>> from pprint import pprint
>>> pprint(lex('a + 1'))

['eval_input',
 ['testlist',
  ['test',
   ['or_test',
    ['and_test',
     ['not_test',
      ['comparison',
       ['expr',
        ['xor_expr',
         ['and_expr',
          ['shift_expr',
           ['arith_expr',
            ['term',
             ['factor', ['power', ['atom_expr', ['atom', ['NAME', 'a']]]]]],
            ['PLUS', '+'],
            ['term',
             ['factor',
              ['power', ['atom_expr', ['atom', ['NUMBER', '1']]]]]]]]]]]]]]]]],
 ['NEWLINE', ''],
 ['ENDMARKER', '']]

In the output, you can see the symbols in lowercase, such as 'test' and the tokens in uppercase, such as 'NUMBER'.
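As a cross-check on the token names, the same tokens can be listed without going through the parser module. The following is a minimal sketch that assumes only the standard library tokenize and token modules; the helper name token_names is made up for this illustration and is not part of CPython:

```python
import io
import token
import tokenize


def token_names(source):
    """Return (token name, text) pairs for a snippet of Python source."""
    readline = io.StringIO(source).readline
    return [
        # exact_type resolves the generic OP token to a specific one such as PLUS
        (token.tok_name[tok.exact_type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


names = token_names("a + 1")
for name, text in names:
    print(name, repr(text))
```

The uppercase names printed here should line up with the uppercase entries in the lex() output, while the lowercase grammar symbols only exist in the parser's output.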

Abstract Syntax Trees

The next stage in the CPython interpreter is to convert the CST generated by the parser into something more logical that can be executed. The structure is a higher-level representation of the code, called an Abstract Syntax Tree (AST).

ASTs are produced inline with the CPython interpreter process, but you can also generate them in Python using the ast module in the standard library, as well as through the C API.

Before diving into the C implementation of the AST, it would be useful to understand what an AST looks like for a simple piece of Python code.

To do this, here's a simple app called instaviz for this tutorial. It displays the AST and bytecode instructions (which we'll cover later) in a Web UI.

To install instaviz:

$ pip install instaviz

Then, open up a REPL by running python at the command line with no arguments:

>>> import instaviz
>>> def example():
       a = 1
       b = a + 1
       return b

>>> instaviz.show(example)

You'll see a notification on the command-line that a web server has started on port 8080. If you were using that port for something else, you can change it by calling instaviz.show(example, port=9090) or another port number.

In the web browser, you can see the detailed breakdown of your function:

Instaviz screenshot

The bottom left graph is the function you declared in the REPL, represented as an Abstract Syntax Tree. Each node in the tree is an AST type. They are found in the ast module, and all inherit from _ast.AST.

Some of the nodes have properties which link them to child nodes, unlike the CST, which has a generic child node property.

For example, if you click on the Assign node in the center, this links to the line b = a + 1:

Instaviz screenshot 2

It has two properties:

  1. targets is a list of names to assign. It is a list because you can assign to multiple variables with a single expression using unpacking.
  2. value is the value to assign, which in this case is a BinOp expression, a + 1.

If you click on the BinOp expression, it shows the properties of relevance:

Instaviz screenshot 3
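If you don't have instaviz to hand, the same properties can be inspected with the ast module directly. This is a small sketch assuming nothing beyond the standard library; the variable names are only for illustration:

```python
import ast

# Parse the single assignment used in the example function
tree = ast.parse("b = a + 1")
assign = tree.body[0]

# targets is a list of nodes; value is the expression being assigned
target_names = [target.id for target in assign.targets]
value = assign.value

print(type(assign).__name__)    # the statement node type
print(target_names)             # names on the left of the =
print(type(value).__name__)     # node type of the right-hand side
print(type(value.op).__name__)  # operator node type
print(ast.dump(assign))         # full textual dump of the node
```

The BinOp node carries left, op, and right fields, which correspond to the properties shown in the web UI.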

Compiling an AST in C is not a straightforward task, so the Python/ast.c module is over 5000 lines of code.

There are a few entry points, forming part of the AST's public API. In the last section on the lexer and parser, you stopped when you'd reached the call to PyAST_FromNodeObject(). By this stage, the Python interpreter process had created a CST in the format of node * tree.

Jumping then into PyAST_FromNodeObject() inside Python/ast.c, you can see it receives the node * tree, the filename, compiler flags, and the PyArena.

The return type from this function is mod_ty, defined in Include/Python-ast.h. mod_ty is a container structure for one of the 5 module types in Python:

  1. Module
  2. Interactive
  3. Expression
  4. FunctionType
  5. Suite

In Include/Python-ast.h you can see that an Expression type requires a field body, which is an expr_ty type. The expr_ty type is also defined in Include/Python-ast.h:

enum _mod_kind {Module_kind=1, Interactive_kind=2, Expression_kind=3,
                 FunctionType_kind=4, Suite_kind=5};
struct _mod {
    enum _mod_kind kind;
    union {
        struct {
            asdl_seq *body;
            asdl_seq *type_ignores;
        } Module;

        struct {
            asdl_seq *body;
        } Interactive;

        struct {
            expr_ty body;
        } Expression;

        struct {
            asdl_seq *argtypes;
            expr_ty returns;
        } FunctionType;

        struct {
            asdl_seq *body;
        } Suite;

    } v;
};

The AST types are all listed in Parser/Python.asdl. You will see the module types, statement types, expression types, operators, and comprehensions all listed. The names of the types in this document relate to the classes generated by the AST and the same classes named in the ast standard library module.

The parameters and names in Include/Python-ast.h correlate directly to those specified in Parser/Python.asdl:

-- ASDL's 5 builtin types are:
-- identifier, int, string, object, constant

module Python
{
    mod = Module(stmt* body, type_ignore *type_ignores)
        | Interactive(stmt* body)
        | Expression(expr body)
        | FunctionType(expr* argtypes, expr returns)

The C header file and structures are there so that the Python/ast.c program can quickly generate the structures with pointers to the relevant data.

Looking at PyAST_FromNodeObject() you can see that it is essentially a switch statement around the result from TYPE(n). TYPE() is one of the core functions used by the AST to determine what type a node in the concrete syntax tree is. In the case of PyAST_FromNodeObject() it's just looking at the first node, so it can only be one of the module types defined as Module, Interactive, Expression, FunctionType.

The result of TYPE() will be either a symbol or token type, which we're very familiar with by this stage.

For file_input, the results should be a Module. Modules are a series of statements, of which there are a few types. The logic to traverse the children of n and create statement nodes is within ast_for_stmt(). This function is called either once, if there is only 1 statement in the module, or in a loop if there are many. The resulting Module is then returned with the PyArena.

For eval_input, the result should be an Expression. The result from CHILD(n, 0), which is the first child of n, is passed to ast_for_testlist(), which returns an expr_ty type. This expr_ty is sent to Expression() with the PyArena to create an expression node, and then passed back as a result:

mod_ty
PyAST_FromNodeObject(const node *n, PyCompilerFlags *flags,
                     PyObject *filename, PyArena *arena)
{
    ...
    switch (TYPE(n)) {
        case file_input:
            stmts = _Py_asdl_seq_new(num_stmts(n), arena);
            if (!stmts)
                goto out;
            for (i = 0; i < NCH(n) - 1; i++) {
                ch = CHILD(n, i);
                if (TYPE(ch) == NEWLINE)
                    continue;
                REQ(ch, stmt);
                num = num_stmts(ch);
                if (num == 1) {
                    s = ast_for_stmt(&c, ch);
                    if (!s)
                        goto out;
                    asdl_seq_SET(stmts, k++, s);
                }
                else {
                    ch = CHILD(ch, 0);
                    REQ(ch, simple_stmt);
                    for (j = 0; j < num; j++) {
                        s = ast_for_stmt(&c, CHILD(ch, j * 2));
                        if (!s)
                            goto out;
                        asdl_seq_SET(stmts, k++, s);
                    }
                }
            }

            /* Type ignores are stored under the ENDMARKER in file_input. */
            ...

            res = Module(stmts, type_ignores, arena);
            break;
        case eval_input: {
            expr_ty testlist_ast;

            /* XXX Why not comp_for here? */
            testlist_ast = ast_for_testlist(&c, CHILD(n, 0));
            if (!testlist_ast)
                goto out;
            res = Expression(testlist_ast, arena);
            break;
        }
        case single_input:
            ...
            break;
        case func_type_input:
            ...
        ...
    return res;
}
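The link between the input type and the resulting module type can be observed from Python. The sketch below assumes that the mode argument of ast.parse() selects the grammar start symbol (exec for file_input, eval for eval_input, single for single_input); the sources dictionary is illustrative only:

```python
import ast

# One small source string per compile mode
sources = {
    "exec": "a = 1\nb = a + 1\n",   # a module made of statements
    "eval": "a + 1",                # a single expression
    "single": "a = 1\n",            # one interactive statement
}

# Record the class name of the root node returned for each mode
root_types = {
    mode: type(ast.parse(source, mode=mode)).__name__
    for mode, source in sources.items()
}

for mode, root in root_types.items():
    print(mode, "->", root)
```

Each mode yields a different root node class, matching the Module, Expression, and Interactive entries in Parser/Python.asdl.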

Inside the ast_for_stmt() function, there is another switch statement for each possible statement type (simple_stmt, compound_stmt, and so on) and the code to determine the arguments to the node class.

One of the simpler functions is for the power expression, e.g., 2**4 is 2 to the power of 4. This function starts by calling ast_for_atom_expr() to get the left-hand side, which is the number 2 in our example. If the node has only one child, it returns the atomic expression. If it has more than one child, it gets the right-hand side (the number 4) and returns a BinOp (binary operation) with the operator Pow (power), the left-hand side e (2), and the right-hand side f (4):

static expr_ty
ast_for_power(struct compiling *c, const node *n)
{
    /* power: atom trailer* ('**' factor)*
     */
    expr_ty e;
    REQ(n, power);
    e = ast_for_atom_expr(c, CHILD(n, 0));
    if (!e)
        return NULL;
    if (NCH(n) == 1)
        return e;
    if (TYPE(CHILD(n, NCH(n) - 1)) == factor) {
        expr_ty f = ast_for_expr(c, CHILD(n, NCH(n) - 1));
        if (!f)
            return NULL;
        e = BinOp(e, Pow, f, LINENO(n), n->n_col_offset,
                  n->n_end_lineno, n->n_end_col_offset, c->c_arena);
    }
    return e;
}

You can see the result of this if you send a short function to the instaviz module:

>>> def foo():
       2**4
>>> import instaviz
>>> instaviz.show(foo)

Instaviz screenshot 4

In the UI you can also see the corresponding properties:

Instaviz screenshot 5
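The two branches of the power logic can also be checked with the ast module. This is a sketch that assumes Python 3.8 or later, where numeric literals are represented by ast.Constant:

```python
import ast

# More than one child: the result is a BinOp with a Pow operator
power = ast.parse("2**4", mode="eval").body

# Only one child: the atom expression comes back unchanged
single = ast.parse("2", mode="eval").body

print(ast.dump(power))
print(ast.dump(single))
```

The first dump shows a BinOp whose left and right fields hold the constants 2 and 4, while the second shows just the constant.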

In summary, each statement type and expression has a corresponding ast_for_*() function to create it. The arguments are defined in Parser/Python.asdl and exposed via the ast module in the standard library. If an expression or statement has children, then it will call the corresponding ast_for_* child function in a depth-first traversal.

Conclusion

CPython's versatility and low-level execution API make it the ideal candidate for an embedded scripting engine. You will see CPython used in many UI applications, such as Game Design, 3D graphics and system automation.

The interpreter process is flexible and efficient, and now that you have an understanding of how it works, you're ready to understand the compiler.

Part 3: The CPython Compiler and Execution Loop

In Part 2, you saw how the CPython interpreter takes an input, such as a file or string, and converts it into a logical Abstract Syntax Tree. We're still not at the stage where this code can be executed. Next, we have to go deeper to convert the Abstract Syntax Tree into a set of sequential commands that CPython's execution loop can run.

Compiling

Now the interpreter has an AST with the properties required for each of the operations, functions, classes, and namespaces. It is the job of the compiler to turn the AST into something CPython's execution loop can run.

This compilation task is split into 2 parts:

  1. Traverse the tree and create a control-flow graph (CFG), which represents the logical sequence for execution
  2. Convert the nodes in the CFG to smaller, executable statements, known as bytecode

Earlier, we were looking at how files are executed, and the PyRun_FileExFlags() function in Python/pythonrun.c. Inside this function, we converted the FILE handle into a mod, of type mod_ty. This task was completed by PyParser_ASTFromFileObject(), which in turn calls the tokenizer, parser-tokenizer and then the AST:

PyObject *
PyRun_FileExFlags(FILE *fp, const char *filename_str, int start, PyObject *globals,
                  PyObject *locals, int closeit, PyCompilerFlags *flags)
{
 ...
    mod = PyParser_ASTFromFileObject(fp, filename, NULL, start, 0, 0,
 ...
    ret = run_mod(mod, filename, globals, locals, flags, arena);
}

The resulting module from the call to PyParser_ASTFromFileObject() is sent to run_mod(), still in Python/pythonrun.c. This is a small function that gets a PyCodeObject from PyAST_CompileObject() and sends it on to run_eval_code_obj(). You will tackle run_eval_code_obj() in the next section:

static PyObject *
run_mod(mod_ty mod, PyObject *filename, PyObject *globals, PyObject *locals,
            PyCompilerFlags *flags, PyArena *arena)
{
    PyCodeObject *co;
    PyObject *v;
    co = PyAST_CompileObject(mod, filename, flags, -1, arena);
    if (co == NULL)
        return NULL;

    if (PySys_Audit("exec", "O", co) < 0) {
        Py_DECREF(co);
        return NULL;
    }

    v = run_eval_code_obj(co, globals, locals);
    Py_DECREF(co);
    return v;
}
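The same parse, compile, and run sequence can be approximated from Python. This is only a rough sketch using the built-ins ast.parse(), compile(), and exec(); it is not how run_mod() is implemented, and the file name example.py is a placeholder:

```python
import ast

source = "result = 6 * 7\n"

# Roughly the role of PyParser_ASTFromFileObject(): source text to AST
tree = ast.parse(source, filename="example.py", mode="exec")

# Roughly the role of PyAST_CompileObject(): AST to code object
code = compile(tree, filename="example.py", mode="exec")

# Roughly the role of run_eval_code_obj(): execute with a globals dict
namespace = {}
exec(code, namespace)

print(type(code).__name__, namespace["result"])
```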

The PyAST_CompileObject() function is the main entry point to the CPython compiler. It takes a Python AST module as its primary argument, along with the name of the file, the compiler flags, the optimization level, and the PyArena created earlier in the interpreter process.

We're starting to get into the guts of the CPython compiler now, with decades of development and Computer Science theory behind it. Don't be put off by the language. Once we break down the compiler into logical steps, it'll make sense.

Before the compiler starts, a global compiler state is created. This type, compiler, is defined in Python/compile.c and contains properties used by the compiler to remember the compiler flags, the stack, and the PyArena:

struct compiler {
    PyObject *c_filename;
    struct symtable *c_st;
    PyFutureFeatures *c_future; /* pointer to module's __future__ */
    PyCompilerFlags *c_flags;

    int c_optimize;              /* optimization level */
    int c_interactive;           /* true if in interactive mode */
    int c_nestlevel;

    PyObject *c_const_cache;     /* Python dict holding all constants,
                                    including names tuple */
    struct compiler_unit *u; /* compiler state for current block */
    PyObject *c_stack;           /* Python list holding compiler_unit ptrs */
    PyArena *c_arena;            /* pointer to memory allocation arena */
};

Inside PyAST_CompileObject(), there are 11 main steps happening:

  1. Create an empty __doc__ property to the module if it doesn't exist.
  2. Create an empty __annotations__ property to the module if it doesn't exist.
  3. Set the filename of the global compiler state to the filename argument.
  4. Set the memory allocation arena for the compiler to the one used by the interpreter.
  5. Copy any __future__ flags in the module to the future flags in the compiler.
  6. Merge runtime flags provided by the command-line or environment variables.
  7. Enable any __future__ features in the compiler.
  8. Set the optimization level to the provided argument, or default.
  9. Build a symbol table from the module object.
  10. Run the compiler with the compiler state and return the code object.
  11. Free any allocated memory by the compiler.

PyCodeObject *
PyAST_CompileObject(mod_ty mod, PyObject *filename, PyCompilerFlags *flags,
                   int optimize, PyArena *arena)
{
    struct compiler c;
    PyCodeObject *co = NULL;
    PyCompilerFlags local_flags;
    int merged;

    if (!__doc__) {                                                      // 1.
        __doc__ = PyUnicode_InternFromString("__doc__");
        if (!__doc__)
            return NULL;
    }
    if (!__annotations__) {
        __annotations__ = PyUnicode_InternFromString("__annotations__"); // 2.
        if (!__annotations__)
            return NULL;
    }
    if (!compiler_init(&c))
        return NULL;
    Py_INCREF(filename);
    c.c_filename = filename;                                             // 3.
    c.c_arena = arena;                                                   // 4.
    c.c_future = PyFuture_FromASTObject(mod, filename);                  // 5.
    if (c.c_future == NULL)
        goto finally;
    if (!flags) {
        local_flags.cf_flags = 0;
        local_flags.cf_feature_version = PY_MINOR_VERSION;
        flags = &local_flags;
    }
    merged = c.c_future->ff_features | flags->cf_flags;                  // 6.
    c.c_future->ff_features = merged;                                    // 7.
    flags->cf_flags = merged;
    c.c_flags = flags;
    c.c_optimize = (optimize == -1) ? Py_OptimizeFlag : optimize;        // 8.
    c.c_nestlevel = 0;

    if (!_PyAST_Optimize(mod, arena, c.c_optimize)) {
        goto finally;
    }

    c.c_st = PySymtable_BuildObject(mod, filename, c.c_future);          // 9.
    if (c.c_st == NULL) {
        if (!PyErr_Occurred())
            PyErr_SetString(PyExc_SystemError, "no symtable");
        goto finally;
    }

    co = compiler_mod(&c, mod);                                          // 10.

 finally:
    compiler_free(&c);                                                   // 11.
    assert(co || PyErr_Occurred());
    return co;
}

Future Flags and Compiler Flags

Before the compiler runs, there are two types of flags to toggle the features inside the compiler. These come from two places:

  1. The interpreter state, which may have been command-line options, set in pyconfig.h or via environment variables
  2. The use of __future__ statements inside the actual source code of the module

To distinguish the two types of flags, consider that the __future__ flags are required because of the syntax or features used in that specific module. For example, Python 3.7 introduced delayed evaluation of type hints through the annotations future flag:

from __future__ import annotations

The code after this statement might use unresolved type hints, so the __future__ statement is required. Otherwise, the module wouldn't import. It would be unmaintainable to manually request that the person importing the module enable this specific compiler flag.

The other compiler flags are specific to the environment, so they might change the way the code executes or the way the compiler runs, but they shouldn't link to the source in the same way that __future__ statements do.

One example of a compiler flag would be the -O flag for optimizing the use of assert statements. This flag disables any assert statements, which may have been put in the code for debugging purposes. It can also be enabled with the PYTHONOPTIMIZE=1 environment variable setting.
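The effect of the optimization level on assert statements can be tried without restarting the interpreter, because the built-in compile() accepts an optimize argument. The helper run() and the sample source below are assumptions made for this sketch:

```python
source = "assert False, 'debug check'\nstatus = 'ran'\n"


def run(optimize):
    """Compile the sample source at the given level and report what happened."""
    code = compile(source, "example.py", "exec", optimize=optimize)
    namespace = {}
    try:
        exec(code, namespace)
    except AssertionError:
        return "AssertionError"
    return namespace["status"]


# Level 0 keeps asserts; level 1 corresponds to the -O command-line flag
results = {level: run(level) for level in (0, 1)}
print(results)
```

At level 0 the assert fires before the assignment runs, while at level 1 the assert is removed at compile time and the assignment executes.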

Symbol Tables

In PyAST_CompileObject() there was a reference to a symtable and a call to PySymtable_BuildObject() with the module to be executed.

The purpose of the symbol table is to provide a list of namespaces, globals, and locals for the compiler to use for referencing and resolving scopes.

The symtable structure in Include/symtable.h is well documented, so it's clear what each of the fields is for. There should be one symtable instance for the compiler, so namespacing becomes essential.

If you create a function called resolve_names() in one module and declare another function with the same name in another module, you want to be sure which one is called. The symtable serves this purpose, as well as ensuring that variables declared within a narrow scope don't automatically become globals (after all, this isn't JavaScript):

struct symtable {
    PyObject *st_filename;          /* name of file being compiled,
                                       decoded from the filesystem encoding */
    struct _symtable_entry *st_cur; /* current symbol table entry */
    struct _symtable_entry *st_top; /* symbol table entry for module */
    PyObject *st_blocks;            /* dict: map AST node addresses
                                     *       to symbol table entries */
    PyObject *st_stack;             /* list: stack of namespace info */
    PyObject *st_global;            /* borrowed ref to st_top->ste_symbols */
    int st_nblocks;                 /* number of blocks used. kept for
                                       consistency with the corresponding
                                       compiler structure */
    PyObject *st_private;           /* name of current class or NULL */
    PyFutureFeatures *st_future;    /* module's future features that affect
                                       the symbol table */
    int recursion_depth;            /* current recursion depth */
    int recursion_limit;            /* recursion limit */
};

Some of the symbol table API is exposed via the symtable module in the standard library. You can provide an expression or a module and receive a symtable.SymbolTable instance.

You can provide a string with a Python expression and the compile_type of "eval", or a module, function or class, and the compile_type of "exec" to get a symbol table.

Looping over the elements in the table we can see some of the public and private fields and their types:

>>> import symtable
>>> s = symtable.symtable('b + 1', filename='test.py', compile_type='eval')
>>> [symbol.__dict__ for symbol in s.get_symbols()]
[{'_Symbol__name': 'b', '_Symbol__flags': 6160, '_Symbol__scope': 3, '_Symbol__namespaces': ()}]
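Rather than reading the private fields, the public methods on each symbol give the same information in a friendlier form. The following sketch assumes a small made-up function f that uses a parameter, a local, and an undeclared global:

```python
import symtable

source = "def f(x):\n    y = x + g\n    return y\n"

# The top-level table is for the module; the function gets a child table
top = symtable.symtable(source, "test.py", "exec")
func = top.get_children()[0]

# Map each symbol name to (is_parameter, is_local, is_global)
summary = {
    sym.get_name(): (sym.is_parameter(), sym.is_local(), sym.is_global())
    for sym in func.get_symbols()
}

print(func.get_name())
for name, flags in sorted(summary.items()):
    print(name, flags)
```

Here x is reported as a parameter, y as a local, and g as a global, which is the scope information the compiler relies on later.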

The C code behind this is all within Python/symtable.c and the primary interface is the PySymtable_BuildObject() function.

Similar to the top-level AST function we covered earlier, the PySymtable_BuildObject() function switches between the mod_ty possible types (Module, Expression, Interactive, Suite, FunctionType), and visits each of the statements inside them.

Remember, mod_ty is an AST instance, so the function will now recursively explore the nodes and branches of the tree and add entries to the symtable:

struct symtable *
PySymtable_BuildObject(mod_ty mod, PyObject *filename, PyFutureFeatures *future)
{
    struct symtable *st = symtable_new();
    asdl_seq *seq;
    int i;
    PyThreadState *tstate;
    int recursion_limit = Py_GetRecursionLimit();
...
    st->st_top = st->st_cur;
    switch (mod->kind) {
    case Module_kind:
        seq = mod->v.Module.body;
        for (i = 0; i < asdl_seq_LEN(seq); i++)
            if (!symtable_visit_stmt(st,
                        (stmt_ty)asdl_seq_GET(seq, i)))
                goto error;
        break;
    case Expression_kind:
        ...
    case Interactive_kind:
        ...
    case Suite_kind:
        ...
    case FunctionType_kind:
        ...
    }
    ...
}

So for a module, PySymtable_BuildObject() will loop through each statement in the module and call symtable_visit_stmt(). The symtable_visit_stmt() function is a huge switch statement with a case for each statement type (defined in Parser/Python.asdl).

For each statement type, there is specific logic to that statement type. For example, a function definition has particular logic for:

  1. If the recursion depth is beyond the limit, raise a recursion depth error
  2. The name of the function to be added as a local variable
  3. The default values for sequential arguments to be resolved
  4. The default values for keyword arguments to be resolved
  5. Any annotations for the arguments or the return type are resolved
  6. Any function decorators are resolved
  7. The code block with the contents of the function is visited in symtable_enter_block()
  8. The arguments are visited
  9. The body of the function is visited

Note: If you've ever wondered why Python's default arguments are mutable, the reason is in this function. You can see they are a pointer to the variable in the symtable. No extra work is done to copy any values to an immutable type.

static int
symtable_visit_stmt(struct symtable *st, stmt_ty s)
{
    if (++st->recursion_depth > st->recursion_limit) {                          // 1.
        PyErr_SetString(PyExc_RecursionError,
                        "maximum recursion depth exceeded during compilation");
        VISIT_QUIT(st, 0);
    }
    switch (s->kind) {
    case FunctionDef_kind:
        if (!symtable_add_def(st, s->v.FunctionDef.name, DEF_LOCAL))            // 2.
            VISIT_QUIT(st, 0);
        if (s->v.FunctionDef.args->defaults)                                    // 3.
            VISIT_SEQ(st, expr, s->v.FunctionDef.args->defaults);
        if (s->v.FunctionDef.args->kw_defaults)                                 // 4.
            VISIT_SEQ_WITH_NULL(st, expr, s->v.FunctionDef.args->kw_defaults);
        if (!symtable_visit_annotations(st, s, s->v.FunctionDef.args,           // 5.
                                        s->v.FunctionDef.returns))
            VISIT_QUIT(st, 0);
        if (s->v.FunctionDef.decorator_list)                                    // 6.
            VISIT_SEQ(st, expr, s->v.FunctionDef.decorator_list);
        if (!symtable_enter_block(st, s->v.FunctionDef.name,                    // 7.
                                  FunctionBlock, (void *)s, s->lineno,
                                  s->col_offset))
            VISIT_QUIT(st, 0);
        VISIT(st, arguments, s->v.FunctionDef.args);                            // 8.
        VISIT_SEQ(st, stmt, s->v.FunctionDef.body);                             // 9.
        if (!symtable_exit_block(st, s))
            VISIT_QUIT(st, 0);
        break;
    case ClassDef_kind: {
        ...
    }
    case Return_kind:
        ...
    case Delete_kind:
        ...
    case Assign_kind:
        ...
    case AnnAssign_kind:
        ...
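The behavior of default arguments mentioned in the note can be observed from Python. This sketch only demonstrates the visible effect, namely that a default value is created once when the function is defined and then shared between calls; the function append_to is a made-up example:

```python
def append_to(item, bucket=[]):
    """Append item to bucket, which defaults to one shared list."""
    bucket.append(item)
    return bucket


first = append_to(1)
second = append_to(2)

# Both calls returned the very same list object
print(first is second)
print(second)
print(append_to.__defaults__[0] is first)
```

A common way to avoid the sharing is to use None as the default and create the list inside the function body.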

Once the resulting symtable has been created, it is sent back to be used for the compiler.

Core Compilation Process

Now that PyAST_CompileObject() has a compiler state, a symtable, and a module in the form of the AST, the actual compilation can begin.

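The purpose of the core compiler is to carry out the two parts of the compilation task described earlier: convert the compiler state, symtable, and AST into a control-flow graph, and then turn that graph into bytecode.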

You can call the CPython compiler in Python code by calling the built-in function compile(). It returns a code object instance:

>>> co = compile('b+1', 'test.py', mode='eval')
>>> co
<code object <module> at 0x10f222780, file "test.py", line 1>

The same as with the symtable() function, a simple expression should have a mode of 'eval' and a module, function, or class should have a mode of 'exec'.

The compiled code can be found in the co_code property of the code object:

>>> co.co_code
b'e\x00d\x00\x17\x00S\x00'

There is also a dis module in the standard library, which disassembles the bytecode instructions and can print them on the screen or give you a list of Instruction instances.

If you import dis and give the dis() function the code object's co_code property, it disassembles it and prints the instructions on the REPL:

>>> import dis
>>> dis.dis(co.co_code)
          0 LOAD_NAME                0 (0)
          2 LOAD_CONST               0 (0)
          4 BINARY_ADD
          6 RETURN_VALUE

LOAD_NAME, LOAD_CONST, BINARY_ADD, and RETURN_VALUE are all bytecode instructions. They're called bytecode because, in binary form, they were a byte long. However, since Python 3.6 the storage format was changed to a word, so now they're technically wordcode, not bytecode.

The full list of bytecode instructions is available for each version of Python, and it does change between versions. For example, in Python 3.7, some new bytecode instructions were introduced to speed up execution of specific method calls.
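Because opcode names vary between releases, it can be safer to inspect the instructions programmatically than to rely on a fixed listing. The sketch below assumes the standard dis.get_instructions() helper and passes it the code object itself rather than the raw co_code bytes, so that names can be resolved:

```python
import dis

co = compile("b+1", "test.py", mode="eval")

# Each Instruction has an opname, an argument, and a resolved argument value
opnames = [instr.opname for instr in dis.get_instructions(co)]

print(opnames)
print(co.co_names)           # names referenced by the code object
print(len(co.co_code) % 2)   # wordcode: the length is a multiple of two
```

The exact list printed depends on the Python version; for example, the addition may appear under a different opcode name than BINARY_ADD on newer releases.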

In an earlier section, we explored the instaviz package. This included a visualization of the code object type by running the compiler. It also displays the bytecode operations inside the code objects.

Execute instaviz again to see the code object and bytecode for a function defined on the REPL:

>>> import instaviz
>>> def example():
       a = 1
       b = a + 1
       return b
>>> instaviz.show(example)

We now jump into compiler_mod(), a function used to switch to different compiler functions depending on the module type. We'll assume that mod is a Module. The module is compiled into the compiler state and then assemble() is run to create a PyCodeObject.

The new code object is returned to PyAST_CompileObject() and sent on for execution:

static PyCodeObject *
compiler_mod(struct compiler *c, mod_ty mod)
{
    PyCodeObject *co;
    int addNone = 1;
    static PyObject *module;
    ...
    switch (mod->kind) {
    case Module_kind:
        if (!compiler_body(c, mod->v.Module.body)) {
            compiler_exit_scope(c);
            return 0;
        }
        break;
    case Interactive_kind:
        ...
    case Expression_kind:
        ...
    case Suite_kind:
        ...
    ...
    co = assemble(c, addNone);
    compiler_exit_scope(c);
    return co;
}

The compiler_body() function has some optimization flags and then loops over each statement in the module and visits it, similar to how the symtable functions worked:

static int
compiler_body(struct compiler *c, asdl_seq *stmts)
{
    int i = 0;
    stmt_ty st;
    PyObject *docstring;
    ...
    for (; i < asdl_seq_LEN(stmts); i++)
        VISIT(c, stmt, (stmt_ty)asdl_seq_GET(stmts, i));
    return 1;
}

Each statement is fetched from the sequence with asdl_seq_GET(), and the statement type is then determined from the AST node's kind.

Through some smart macros, VISIT calls a function in Python/compile.c for each statement type:

#define VISIT(C, TYPE, V) {\
    if (!compiler_visit_ ## TYPE((C), (V))) \
        return 0; \
}

For a stmt (the category for a statement) the compiler will then drop into compiler_visit_stmt() and switch through all of the potential statement types found in Parser/Python.asdl:

static int
compiler_visit_stmt(struct compiler *c, stmt_ty s)
{
    Py_ssize_t i, n;

    /* Always assign a lineno to the next instruction for a stmt. */
    c->u->u_lineno = s->lineno;
    c->u->u_col_offset = s->col_offset;
    c->u->u_lineno_set = 0;

    switch (s->kind) {
    case FunctionDef_kind:
        return compiler_function(c, s, 0);
    case ClassDef_kind:
        return compiler_class(c, s);
    ...
    case For_kind:
        return compiler_for(c, s);
    ...
    }

    return 1;
}
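The ast module offers a similar dispatch-by-type pattern in Python through ast.NodeVisitor, which looks up a method named after the node class. This is an analogy rather than a description of the C code, and the class StatementCounter and sample source are hypothetical:

```python
import ast


class StatementCounter(ast.NodeVisitor):
    """Record FunctionDef and For nodes in the order they are visited."""

    def __init__(self):
        self.seen = []

    def visit_FunctionDef(self, node):
        self.seen.append(("FunctionDef", node.name))
        self.generic_visit(node)  # keep walking into the function body

    def visit_For(self, node):
        self.seen.append(("For", node.lineno))
        self.generic_visit(node)  # keep walking into the loop body


source = (
    "def total(items):\n"
    "    acc = 0\n"
    "    for item in items:\n"
    "        acc += item\n"
    "    return acc\n"
)

counter = StatementCounter()
counter.visit(ast.parse(source))
print(counter.seen)
```

Node types without a matching visit_ method fall through to generic_visit(), which simply continues into the child nodes.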

As an example, let's focus on the For statement, which in Python is:

for i in iterable:
    # block
else:  # optional, runs if the loop finishes without a break
    # block

If the statement is a For type, it calls compiler_for(). There is an equivalent compiler_*() function for all of the statement and expression types. The more straightforward types create the bytecode instructions inline, while some of the more complex statement types call other functions.

Many of the statements can have sub-statements. A for loop has a body, but you can also have complex expressions in the assignment and the iterator.
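As a reminder of what the optional else section does at runtime, here is a short sketch. The else block runs only when the loop finishes without reaching a break; the function find_first_even is a made-up example:

```python
def find_first_even(numbers):
    """Return the first even number, or None if there isn't one."""
    for number in numbers:
        if number % 2 == 0:
            result = number
            break  # skips the else block
    else:
        # Reached only when the loop was not interrupted by break
        result = None
    return result


print(find_first_even([1, 3, 4, 5]))
print(find_first_even([1, 3]))
print(find_first_even([]))
```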

The compiler's compiler_*() functions send blocks to the compiler state. These blocks contain instructions. The instruction data structure in Python/compile.c has the opcode, any arguments, the target block (if this is a jump instruction), and the line number.

Jump statements can be either absolute or relative. Jump statements are used to "jump" from one operation to another. Absolute jump statements specify the exact operation number in the compiled code object, whereas relative jump statements specify the jump target relative to another operation:

struct instr {
    unsigned i_jabs : 1;
    unsigned i_jrel : 1;
    unsigned char i_opcode;
    int i_oparg;
    struct basicblock_ *i_target; /* target block (if jump instruction) */
    int i_lineno;
};

So a code block (of type basicblock) contains the following fields:

typedef struct basicblock_ {
    /* Each basicblock in a compilation unit is linked via b_list in the
       reverse order that the block are allocated.  b_list points to the next
       block, not to be confused with b_next, which is next by control flow. */
    struct basicblock_ *b_list;
    /* number of instructions used */
    int b_iused;
    /* length of instruction array (b_instr) */
    int b_ialloc;
    /* pointer to an array of instructions, initially NULL */
    struct instr *b_instr;
    /* If b_next is non-NULL, it is a pointer to the next
       block reached by normal control flow. */
    struct basicblock_ *b_next;
    /* b_seen is used to perform a DFS of basicblocks. */
    unsigned b_seen : 1;
    /* b_return is true if a RETURN_VALUE opcode is inserted. */
    unsigned b_return : 1;
    /* depth of stack upon entry of block, computed by stackdepth() */
    int b_startdepth;
    /* instruction offset for block, computed by assemble_jump_offsets() */
    int b_offset;
} basicblock;

The For statement is somewhere in the middle in terms of complexity. There are 15 steps in the compilation of a For statement with the for <target> in <iterator>: syntax:

  1. Create a new code block called start, this allocates memory and creates a basicblock pointer
  2. Create a new code block called cleanup
  3. Create a new code block called end
  4. Push a frame block of type FOR_LOOP to the stack with start as the entry block and end as the exit block
  5. Visit the iterator expression, which adds any operations for the iterator
  6. Add the GET_ITER operation to the compiler state
  7. Switch to the start block
  8. Call ADDOP_JREL which calls compiler_addop_j() to add the FOR_ITER operation with an argument of the cleanup block
  9. Visit the target and add any special code, like tuple unpacking, to the start block
  10. Visit each statement in the body of the for loop
  11. Call ADDOP_JABS which calls compiler_addop_j() to add the JUMP_ABSOLUTE operation, which jumps back to the start of the loop after the body is executed
  12. Move to the cleanup block
  13. Pop the FOR_LOOP frame block off the stack
  14. Visit the statements inside the else section of the for loop
  15. Use the end block

Referring back to the basicblock structure, you can see how, in the compilation of the for statement, the various blocks are created and pushed onto the compiler's frame block stack:

static int
compiler_for(struct compiler *c, stmt_ty s)
{
    basicblock *start, *cleanup, *end;

    start = compiler_new_block(c);                       // 1.
    cleanup = compiler_new_block(c);                     // 2.
    end = compiler_new_block(c);                         // 3.
    if (start == NULL || end == NULL || cleanup == NULL)
        return 0;

    if (!compiler_push_fblock(c, FOR_LOOP, start, end))  // 4.
        return 0;

    VISIT(c, expr, s->v.For.iter);                       // 5.
    ADDOP(c, GET_ITER);                                  // 6.
    compiler_use_next_block(c, start);                   // 7.
    ADDOP_JREL(c, FOR_ITER, cleanup);                    // 8.
    VISIT(c, expr, s->v.For.target);                     // 9.
    VISIT_SEQ(c, stmt, s->v.For.body);                   // 10.
    ADDOP_JABS(c, JUMP_ABSOLUTE, start);                 // 11.
    compiler_use_next_block(c, cleanup);                 // 12.

    compiler_pop_fblock(c, FOR_LOOP, start);             // 13.

    VISIT_SEQ(c, stmt, s->v.For.orelse);                 // 14.
    compiler_use_next_block(c, end);                     // 15.
    return 1;
}

Depending on the type of operation, there are different arguments required. For example, we used ADDOP_JREL and ADDOP_JABS here, which refer to "ADD Operation with Jump to a RELative position" and "ADD Operation with Jump to an ABSolute position". The ADDOP_JREL and ADDOP_JABS macros call compiler_addop_j(struct compiler *c, int opcode, basicblock *b, int absolute) and set the absolute argument to 0 and 1 respectively.

There are some other macros: ADDOP_I calls compiler_addop_i(), which adds an operation with an integer argument, and ADDOP_O calls compiler_addop_o(), which adds an operation with a PyObject argument.

Once these stages have completed, the compiler has a list of frame blocks, each containing a list of instructions and a pointer to the next block.
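You can observe the result of these steps from Python with the dis module. The following is a minimal sketch (the function name loop_example is only for illustration). It looks for GET_ITER and FOR_ITER, which compiler_for() adds in steps 6 and 8. The opcode used for the jump back to the start of the loop is JUMP_ABSOLUTE in the version described here, but its name may differ in other CPython releases, so the sketch does not rely on it:

```python
import dis


def loop_example(items):
    # A for loop with an else clause, the shape handled by compiler_for()
    for item in items:
        print(item)
    else:
        print("done")


# Collect the opcode names emitted for the function body
opnames = [instruction.opname for instruction in dis.get_instructions(loop_example)]

print("GET_ITER" in opnames)  # iterator creation (step 6)
print("FOR_ITER" in opnames)  # loop header that jumps to cleanup when exhausted (step 8)
```

Running dis.dis(loop_example) instead prints the full listing, including the jump targets.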

Assembly

With the compiler state, the assembler performs a "depth-first-search" of the blocks and merges the instructions into a single bytecode sequence. The assembler state is declared in Python/compile.c:

struct assembler {
    PyObject *a_bytecode;  /* string containing bytecode */
    int a_offset;              /* offset into bytecode */
    int a_nblocks;             /* number of reachable blocks */
    basicblock **a_postorder; /* list of blocks in dfs postorder */
    PyObject *a_lnotab;    /* string containing lnotab */
    int a_lnotab_off;      /* offset into lnotab */
    int a_lineno;              /* last lineno of emitted instruction */
    int a_lineno_off;      /* bytecode offset of last lineno */
};

The assemble() function has a few tasks:

static PyCodeObject *
assemble(struct compiler *c, int addNone)
{
    basicblock *b, *entryblock;
    struct assembler a;
    int i, j, nblocks;
    PyCodeObject *co = NULL;

    /* Make sure every block that falls off the end returns None.
       XXX NEXT_BLOCK() isn't quite right, because if the last
       block ends with a jump or return b_next shouldn't set.
     */
    if (!c->u->u_curblock->b_return) {
        NEXT_BLOCK(c);
        if (addNone)
            ADDOP_LOAD_CONST(c, Py_None);
        ADDOP(c, RETURN_VALUE);
    }
    ...
    dfs(c, entryblock, &a, nblocks);

    /* Can't modify the bytecode after computing jump offsets. */
    assemble_jump_offsets(&a, c);

    /* Emit code in reverse postorder from dfs. */
    for (i = a.a_nblocks - 1; i >= 0; i--) {
        b = a.a_postorder[i];
        for (j = 0; j < b->b_iused; j++)
            if (!assemble_emit(&a, &b->b_instr[j]))
                goto error;
    }
    ...

    co = makecode(c, &a);
 error:
    assemble_free(&a);
    return co;
}

The depth-first-search is performed by the dfs() function in Python/compile.c, which follows the b_next pointers in each of the blocks, marks them as seen by setting b_seen, and then adds them to the assembler's a_postorder list in reverse order.

The function then loops back over those blocks and, for each block that has a jump operation, recursively calls dfs() for that jump:

static void
dfs(struct compiler *c, basicblock *b, struct assembler *a, int end)
{
    int i, j;

    /* Get rid of recursion for normal control flow.
       Since the number of blocks is limited, unused space in a_postorder
       (from a_nblocks to end) can be used as a stack for still not ordered
       blocks. */
    for (j = end; b && !b->b_seen; b = b->b_next) {
        b->b_seen = 1;
        assert(a->a_nblocks < j);
        a->a_postorder[--j] = b;
    }
    while (j < end) {
        b = a->a_postorder[j++];
        for (i = 0; i < b->b_iused; i++) {
            struct instr *instr = &b->b_instr[i];
            if (instr->i_jrel || instr->i_jabs)
                dfs(c, instr->i_target, a, j);
        }
        assert(a->a_nblocks < j);
        a->a_postorder[a->a_nblocks++] = b;
    }
}
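To make the ordering easier to follow, here is a simplified sketch of the same idea in Python. It is not a port of dfs(): the Block class, its attribute names, and the fully recursive style are assumptions made for illustration, whereas the C version avoids recursion along b_next by reusing spare space in a_postorder. The blocks are wired up like the for loop compiled earlier:

```python
class Block:
    """Hypothetical stand-in for a basicblock; not CPython's real structure."""

    def __init__(self, name):
        self.name = name
        self.next = None    # plays the role of b_next
        self.jumps = []     # targets of jump instructions (i_target)
        self.seen = False   # plays the role of b_seen


def visit(block, postorder):
    """Append each block after everything reachable from it."""
    if block is None or block.seen:
        return
    block.seen = True
    visit(block.next, postorder)   # normal control flow first
    for target in block.jumps:     # then any jump targets
        visit(target, postorder)
    postorder.append(block)


def emit_order(entry):
    """Return block names in reverse postorder, the order used to emit code."""
    postorder = []
    visit(entry, postorder)
    return [block.name for block in reversed(postorder)]


# Blocks shaped like the for loop: entry -> start -> cleanup -> end
entry, start, cleanup, end = (Block(n) for n in ("entry", "start", "cleanup", "end"))
entry.next = start
start.next = cleanup
cleanup.next = end
start.jumps = [cleanup, start]  # FOR_ITER -> cleanup, JUMP_ABSOLUTE -> start

order = emit_order(entry)
print(order)  # ['entry', 'start', 'cleanup', 'end']
```

The post-order list holds the blocks from last to first, which is why assemble() walks a_postorder backwards when emitting instructions.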

Creating a Code Object

The task of makecode() is to go through the compiler state and some of the assembler's properties, and to put these into a PyCodeObject by calling PyCode_New():

PyCodeObject structure

The variable names and constants are put as properties on the code object:

static PyCodeObject *
makecode(struct compiler *c, struct assembler *a)
{
...

    consts = consts_dict_keys_inorder(c->u->u_consts);
    names = dict_keys_inorder(c->u->u_names, 0);
    varnames = dict_keys_inorder(c->u->u_varnames, 0);
...
    cellvars = dict_keys_inorder(c->u->u_cellvars, 0);
...
    freevars = dict_keys_inorder(c->u->u_freevars, PyTuple_GET_SIZE(cellvars));
...
    flags = compute_code_flags(c);
    if (flags < 0)
        goto error;

    bytecode = PyCode_Optimize(a->a_bytecode, consts, names, a->a_lnotab);
...
    co = PyCode_New(argcount, kwonlyargcount,
                    nlocals_int, maxdepth, flags,
                    bytecode, consts, names, varnames,
                    freevars, cellvars,
                    c->c_filename, c->u->u_name,
                    c->u->u_firstlineno,
                    a->a_lnotab);
...
    return co;
}

You may also notice that the bytecode is sent to PyCode_Optimize() before it is sent to PyCode_New(). This function is part of the bytecode optimization process in Python/peephole.c.

The peephole optimizer goes through the bytecode instructions and, in certain scenarios, replaces them with other instructions. For example, there is an optimization called "constant folding", so if you put the following statement into your script:

a = 1 + 5

It optimizes that to:

a = 6

Because 1 and 5 are constant values, the result will always be the same.
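You can check for this optimization without reading any C. The sketch below compiles the statement and looks for an addition instruction in the result. It assumes that binary operations appear under opcode names beginning with "BINARY" (for example BINARY_ADD), which is true of the CPython releases I am aware of but is not guaranteed. Which internal component performs the folding can also vary between releases:

```python
import dis

# Compile the statement from the example above
code = compile("a = 1 + 5", "<example>", "exec")
opnames = [instruction.opname for instruction in dis.get_instructions(code)]

# If the constant was folded, no addition is left to perform at runtime
has_runtime_add = any(name.startswith("BINARY") for name in opnames)

# Execute the compiled code to confirm the value that is stored
namespace = {}
exec(code, namespace)

print(has_runtime_add)
print(namespace["a"])
```

The expectation is that no addition instruction remains and that a is bound to 6.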

Conclusion

We can pull together all of these stages with the instaviz module:

import instaviz

def foo():
    a = 2**4
    b = 1 + 5
    c = [1, 4, 6]
    for i in c:
        print(i)
    else:
        print(a)
    return c


instaviz.show(foo)

This will produce an AST graph:

Instaviz screenshot 6

With bytecode instructions in sequence:

Instaviz screenshot 7

Also, the code object with the variable names, constants, and binary co_code:

Instaviz screenshot 8
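If you don't have instaviz installed, the same properties can be read directly from a function's __code__ attribute. This is a small sketch with a made-up function, add_one(), showing the values that makecode() collected:

```python
def add_one(x):
    y = x + 1
    print(y)
    return y


code = add_one.__code__

print(code.co_argcount)    # number of positional arguments
print(code.co_varnames)    # local variable names, arguments first
print(code.co_names)       # global names referenced by the body
print(type(code.co_code))  # the raw bytecode is a bytes object
```

co_varnames and co_names correspond to the varnames and names values built with dict_keys_inorder() in makecode().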

Execution

In Python/pythonrun.c we broke out just before the call to run_eval_code_obj().

This call takes a code object, either fetched from the marshaled .pyc file, or compiled through the AST and compiler stages.

run_eval_code_obj() will pass the globals, locals, PyArena, and compiled PyCodeObject to PyEval_EvalCode() in Python/ceval.c.

This stage forms the execution component of CPython. Each of the bytecode operations is taken and executed using a "Stack Frame" based system.

What is a Stack Frame?

Stack Frames are a data type used by many runtimes, not just Python, that allows functions to be called and variables to be returned between functions. Stack Frames also contain arguments, local variables, and other state information.

Typically, a Stack Frame exists for every function call, and they are stacked in sequence. You can see CPython's frame stack anytime an exception is unhandled and the stack is printed on the screen.

PyEval_EvalCode() is the public API for evaluating a code object. The logic for evaluation is split between _PyEval_EvalCodeWithName() and _PyEval_EvalFrameDefault(), which are both in ceval.c.

The public API PyEval_EvalCode() will construct an execution frame from the top of the stack by calling _PyEval_EvalCodeWithName().

The construction of the first execution frame has many steps:

  1. Keyword and positional arguments are resolved.
  2. The use of *args and **kwargs in function definitions is resolved.
  3. Arguments are added as local variables to the scope.
  4. Co-routines and Generators are created, including the Asynchronous Generators.

The frame object looks like this:

PyFrameObject structure

Let's step through those sequences.

1. Constructing Thread State

Before a frame can be executed, it needs to be referenced from a thread. CPython can have many threads running at any one time within a single interpreter. An Interpreter state includes a list of those threads as a linked list. The thread structure is called PyThreadState, and there are many references throughout ceval.c.

Here is the structure of the thread state object:

PyThreadState structure

2. Constructing Frames

The input to PyEval_EvalCode(), and therefore _PyEval_EvalCodeWithName(), has arguments for the code object (_co), the globals, and the locals.

The other arguments are optional, and not used for the basic API:

PyObject *
_PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
           PyObject *const *args, Py_ssize_t argcount,
           PyObject *const *kwnames, PyObject *const *kwargs,
           Py_ssize_t kwcount, int kwstep,
           PyObject *const *defs, Py_ssize_t defcount,
           PyObject *kwdefs, PyObject *closure,
           PyObject *name, PyObject *qualname)
{
    ...

    /* Create the frame */
    tstate = _PyThreadState_GET();
    assert(tstate != NULL);
    f = _PyFrame_New_NoTrack(tstate, co, globals, locals);
    if (f == NULL) {
        return NULL;
    }
    fastlocals = f->f_localsplus;
    freevars = f->f_localsplus + co->co_nlocals;

3. Converting Keyword Parameters to a Dictionary

If the function definition contained a **kwargs style catch-all for keyword arguments, then a new dictionary is created, and the values are copied across. The kwargs name is then set as a variable, like in this example:

def example(arg, arg2=None, **kwargs):
    print(kwargs['extra'])  # this would resolve to a dictionary key

The logic for creating a keyword argument dictionary is in the next part of _PyEval_EvalCodeWithName():

    /* Create a dictionary for keyword parameters (**kwargs) */
    if (co->co_flags & CO_VARKEYWORDS) {
        kwdict = PyDict_New();
        if (kwdict == NULL)
            goto fail;
        i = total_args;
        if (co->co_flags & CO_VARARGS) {
            i++;
        }
        SETLOCAL(i, kwdict);
    }
    else {
        kwdict = NULL;
    }

The kwdict variable will reference a PyDictObject if the function accepts **kwargs (the CO_VARKEYWORDS flag is set). Otherwise, it is NULL.
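The flags tested in this C code are visible from Python through co_flags, and the inspect module exposes the same constants. Here is a short sketch with two illustrative functions (plain and catch_all are made-up names):

```python
import inspect


def plain(a, b):
    return a, b


def catch_all(a, *args, **kwargs):
    return a, args, kwargs


def has_flag(func, flag):
    """Return True if the function's code object has the given co_flags bit set."""
    return bool(func.__code__.co_flags & flag)


print(has_flag(plain, inspect.CO_VARKEYWORDS))      # False: no **kwargs
print(has_flag(catch_all, inspect.CO_VARKEYWORDS))  # True: accepts **kwargs
print(has_flag(catch_all, inspect.CO_VARARGS))      # True: accepts *args
```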

4. Converting Positional Arguments Into Variables

Next, each of the positional arguments (if provided) is set as a local variable:

    for (i = 0; i < n /* argcount */; i++) {
        x = args[i];
        Py_INCREF(x);
        SETLOCAL(i, x);
    }

At the end of each loop iteration, you'll see a call to SETLOCAL() with the value, so if a positional argument is passed with a value, it is available within this scope:

def example(arg1, arg2):
    print(arg1, arg2)  # both args are already local variables.

Also, the reference counter for those variables is incremented, so the garbage collector won't remove them until the frame has evaluated.

5. Packing Positional Arguments Into *args

Similar to **kwargs, a function argument prepended with a * can be set to catch all remaining positional arguments. This argument is a tuple and the *args name is set as a local variable:

    /* Pack other positional arguments into the *args argument */
    if (co->co_flags & CO_VARARGS) {
        u = _PyTuple_FromArray(args + n, argcount - n);
        if (u == NULL) {
            goto fail;
        }
        SETLOCAL(total_args, u);
    }
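The slot positions used by SETLOCAL() here can be seen in co_varnames: named positional arguments come first, then keyword-only arguments, then the *args tuple at index total_args, then the **kwargs dictionary. The function below is an illustrative example, not taken from CPython:

```python
def example(a, b, *rest, key=None, **extra):
    return rest, extra


# Order of the local variable slots in the frame
print(example.__code__.co_varnames)

# Extra positional arguments are packed into the tuple,
# and unmatched keyword arguments go into the dictionary
result = example(1, 2, 3, 4, key="k", other=5)
print(result)
```

Here total_args is 3 (a, b, and key), so rest occupies slot 3 and extra occupies slot 4, matching the i++ in the **kwargs code when CO_VARARGS is set.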

6. Loading Keyword Arguments

If the function was called with keyword arguments and values, the kwdict dictionary created in step 3 is now filled with any remaining keyword arguments passed by the caller that don't resolve to named arguments or positional arguments.

For example, the e argument was neither positional nor named, so it is added to **remaining:

>>>
>>> def my_function(a, b, c=None, d=None, **remaining):
...     print(a, b, c, d, remaining)
...
>>> my_function(a=1, b=2, c=3, d=4, e=5)
1 2 3 4 {'e': 5}

The resolution of the keyword argument dictionary values comes after the unpacking of all other arguments. PyDict_SetItem() is called for each remaining argument to add it to the kwdict dictionary:

    for (i = 0; i < kwcount; i += kwstep) {
        PyObject **co_varnames;
        PyObject *keyword = kwnames[i];
        PyObject *value = kwargs[i];
        ...

        if (PyDict_SetItem(kwdict, keyword, value) == -1) {
            goto fail;
        }
        continue;

      kw_found:
        ...
        Py_INCREF(value);
        SETLOCAL(j, value);
    }
    ...

At the end of the loop, you'll see a call to SETLOCAL() with the value. If a keyword argument is defined with a value, that is available within this scope:

def example(arg1, arg2, example_kwarg=None):
    print(example_kwarg)  # example_kwarg is already a local variable.

7. Adding Missing Positional Arguments

Any positional arguments that the function defines but the call did not supply are filled in from their default values (defs). If an argument is missing and has no default value, a failure is raised:

    /* Add missing positional arguments (copy default values from defs) */
    if (argcount < co->co_argcount) {
        Py_ssize_t m = co->co_argcount - defcount;
        Py_ssize_t missing = 0;
        for (i = argcount; i < m; i++) {
            if (GETLOCAL(i) == NULL) {
                missing++;
            }
        }
        if (missing) {
            missing_arguments(co, missing, defcount, fastlocals);
            goto fail;
        }
        if (n > m)
            i = n - m;
        else
            i = 0;
        for (; i < defcount; i++) {
            if (GETLOCAL(m+i) == NULL) {
                PyObject *def = defs[i];
                Py_INCREF(def);
                SETLOCAL(m+i, def);
            }
        }
    }
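From Python, the defs array corresponds to the function's __defaults__ tuple, and the failure path shows up as a TypeError. This is a brief sketch using a made-up greet() function:

```python
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}"


# Default values for the trailing positional arguments
print(greet.__defaults__)

# The missing 'greeting' argument is copied from the defaults
print(greet("Ada"))

# 'name' has no default, so leaving it out is an error
error_message = None
try:
    greet()
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```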

8. Adding Missing Keyword Arguments

Any keyword-only arguments that the function defines but the call did not supply are filled in from their default values (kwdefs). If an argument is missing and has no default value, a failure is raised:

    /* Add missing keyword arguments (copy default values from kwdefs) */
    if (co->co_kwonlyargcount > 0) {
        Py_ssize_t missing = 0;
        for (i = co->co_argcount; i < total_args; i++) {
            PyObject *name;
            if (GETLOCAL(i) != NULL)
                continue;
            name = PyTuple_GET_ITEM(co->co_varnames, i);
            if (kwdefs != NULL) {
                PyObject *def = PyDict_GetItemWithError(kwdefs, name);
                ...
            }
            missing++;
        }
        ...
    }
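The kwdefs dictionary is exposed in Python as __kwdefaults__. The sketch below uses an illustrative connect() function with one keyword-only argument that has a default and one that does not:

```python
def connect(host, *, port=5432, timeout):
    return host, port, timeout


# Defaults for keyword-only arguments, keyed by name
print(connect.__kwdefaults__)

# 'port' is filled in from the defaults
print(connect("db", timeout=3))

# 'timeout' has no default, so it is counted as missing
kw_error = None
try:
    connect("db")
except TypeError as exc:
    kw_error = str(exc)
print(kw_error)
```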

9. Collapsing Closures

Any closure variables are copied into the frame's free variable slots, which sit after the cell variables:

    /* Copy closure variables to free variables */
    for (i = 0; i < PyTuple_GET_SIZE(co->co_freevars); ++i) {
        PyObject *o = PyTuple_GET_ITEM(closure, i);
        Py_INCREF(o);
        freevars[PyTuple_GET_SIZE(co->co_cellvars) + i] = o;
    }
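The closure tuple read by this loop is available on a function object as __closure__, and the matching names are in co_freevars. Here is a small sketch using a made-up counter factory:

```python
def make_counter(start):
    count = start

    def increment():
        nonlocal count
        count += 1
        return count

    return increment


counter = make_counter(10)

print(make_counter.__code__.co_cellvars)     # names the outer function shares
print(counter.__code__.co_freevars)          # names the inner function borrows
print(counter.__closure__[0].cell_contents)  # current value held in the cell
```

Each entry in __closure__ is a cell object, so the inner and outer functions share one reference rather than each holding a copy of the value.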

10. Creating Generators, Coroutines, and Asynchronous Generators

If the evaluated code object has a flag that it is a generator, coroutine or async generator, then a new frame is created using one of the unique methods in the Generator, Coroutine or Async libraries and the current frame is added as a property.

The new frame is then returned, and the original frame is not evaluated. The frame is only evaluated when the generator/coroutine/async method is called on to execute its target:

    /* Handle generator/coroutine/asynchronous generator */
    if (co->co_flags & (CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR)) {
        ...

        /* Create a new generator that owns the ready to run frame
         * and return that as the value. */
        if (is_coro) {
            gen = PyCoro_New(f, name, qualname);
        } else if (co->co_flags & CO_ASYNC_GENERATOR) {
            gen = PyAsyncGen_New(f, name, qualname);
        } else {
            gen = PyGen_NewWithQualName(f, name, qualname);
        }
        ...

        return gen;
    }
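This behavior is easy to confirm from Python: calling a generator function returns a generator object immediately, and none of the body runs until the first next(). The sketch below records when the body starts, using an illustrative numbers() function:

```python
import inspect

calls = []


def numbers():
    calls.append("started")
    yield 1
    yield 2


# The code object carries the generator flag checked in the C code
is_generator_code = bool(numbers.__code__.co_flags & inspect.CO_GENERATOR)

gen = numbers()               # creates the generator, body not yet run
ran_before_next = list(calls)

first = next(gen)             # the frame is evaluated up to the first yield
ran_after_next = list(calls)

print(is_generator_code, ran_before_next, first, ran_after_next)
```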

Lastly, PyEval_EvalFrameEx() is called with the new frame:

    retval = PyEval_EvalFrameEx(f,0);
    ...
}

Frame Execution

As covered earlier in the compiler and AST chapters, the code object contains a binary encoding of the bytecode to be executed. It also contains a list of variables and a symbol table.

The local and global variables are determined at runtime based on how that function, module, or block was called. This information is added to the frame by the _PyEval_EvalCodeWithName() function. There are other usages of frames, like the coroutine decorator, which dynamically generates a frame with the target as a variable.

The public API, PyEval_EvalFrameEx(), calls the interpreter's configured frame evaluation function in the eval_frame property. Frame evaluation was made pluggable in Python 3.6 with PEP 523.

_PyEval_EvalFrameDefault() is the default function, and it is unusual to use anything other than this.

Frames are executed in the main execution loop inside _PyEval_EvalFrameDefault(). This function is the central function that brings everything together and brings your code to life. It contains decades of optimization since even a single line of code can have a significant impact on performance for the whole of CPython.

Everything that gets executed in CPython goes through this function.

Note: Something you might notice when reading ceval.c is how many times C macros have been used. C macros are a way of having DRY-compliant code without the overhead of making function calls. The C preprocessor expands the macros into C code, and the compiler then compiles the generated code.

If you want to see the expanded code, you can run gcc -E on Linux and macOS:

$ gcc -E Python/ceval.c

Alternatively, Visual Studio Code can do inline macro expansion once you have installed the official C/C++ extension:

C Macro expansion with VScode

We can step through frame execution in Python 3.7 and beyond by enabling the tracing attribute on the current thread.

This code example sets the global tracing function to a function called trace() that gets the stack from the current frame, prints the disassembled opcodes to the screen, and some extra information for debugging:

import sys
import dis
import traceback
import io


def trace(frame, event, args):
    # Ask the interpreter to emit 'opcode' events for this frame
    frame.f_trace_opcodes = True
    stack = traceback.extract_stack(frame)
    # Indent the output by the depth of the call stack
    pad = "   " * len(stack) + "|"
    if event == 'opcode':
        # Disassemble the frame's code, marking the current instruction
        with io.StringIO() as out:
            dis.disco(frame.f_code, frame.f_lasti, file=out)
            lines = out.getvalue().split('\n')
        for line in lines:
            print(f"{pad}{line}")
    elif event == 'call':
        print(f"{pad}Calling {frame.f_code}")
    elif event == 'return':
        print(f"{pad}Returning {args}")
    elif event == 'line':
        print(f"{pad}Changing line to {frame.f_lineno}")
    else:
        print(f"{pad}{frame} ({event} - {args})")
    print(f"{pad}----------------------------------")
    return trace


sys.settrace(trace)

# Run some code for a demo
eval('"-".join([letter for letter in "hello"])')

This prints the code within each stack frame and points to the next operation before it is executed. When a frame returns a value, the return statement is printed:

Evaluating frame with tracing

The full list of instructions is available on the dis module documentation.

The Value Stack

Inside the core evaluation loop, a value stack is created. This stack is a list of pointers to sequential PyObject instances.

One way to think of the value stack is like a wooden peg on which you can stack cylinders. You would only add or remove one item at a time. This is done using the PUSH(a) macro, where a is a pointer to a PyObject.

For example, if you created a PyLong with the value 10 and pushed it onto the value stack:

PyObject *a = PyLong_FromLong(10);
PUSH(a);

This action would have the following effect:

PUSH()

In the next operation, to fetch that value, you would use the POP() macro to take the top value from the stack:

PyObject *a = POP();  // a is PyLongObject with a value of 10

This action would return the top value and end up with an empty value stack:

POP()

If you were to add 2 values to the stack:

PyObject *a = PyLong_FromLong(10);
PyObject *b = PyLong_FromLong(20);
PUSH(a);
PUSH(b);

They would end up in the order in which they were added, so a would be pushed to the second position in the stack:

PUSH();PUSH()

If you were to fetch the top value in the stack, you would get a pointer to b because it is at the top:

POP();

If you need to fetch the pointer to the top value in the stack without popping it, you can use the PEEK(v) operation, where v is the stack position counted from the top:

PyObject *first = PEEK(1);

1 represents the top of the stack, 2 would be the second position:

PEEK()

To clone the value at the top of the stack, the DUP_TOP() macro can be used, or the DUP_TOP opcode:

DUP_TOP();

This action would copy the value at the top to form 2 pointers to the same object:

DUP_TOP()

There is a rotation macro ROT_TWO that swaps the first and second values:

ROT_TWO()
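The operations above can be modeled with a Python list. This is only a toy model for building intuition: the ValueStack class and its method names are invented here, and the real value stack is a C array of PyObject pointers addressed through stack_pointer. The peek(n) method counts from 1 at the top, mirroring how the macro indexes stack_pointer[-(n)]:

```python
class ValueStack:
    """Toy model of the value stack, backed by a Python list."""

    def __init__(self):
        self._items = []

    def push(self, value):
        self._items.append(value)

    def pop(self):
        return self._items.pop()

    def peek(self, n):
        # n=1 is the top of the stack, n=2 is the item below it
        return self._items[-n]

    def dup_top(self):
        # Push a second reference to the object already on top
        self.push(self.peek(1))

    def rot_two(self):
        # Swap the top two items
        self._items[-1], self._items[-2] = self._items[-2], self._items[-1]

    def depth(self):
        return len(self._items)


stack = ValueStack()
stack.push(10)
stack.push(20)
top_before = stack.peek(1)   # 20, the last value pushed

stack.rot_two()
top_after_rot = stack.peek(1)  # 10, now swapped to the top

stack.dup_top()
print(top_before, top_after_rot, stack.depth())
```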

Each of the opcodes has a predefined "stack effect," calculated by the stack_effect() function inside Python/compile.c. This function returns the delta in the number of values inside the stack for each opcode.
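The same calculation is exposed in Python as dis.stack_effect(). The sketch below queries two opcodes that I expect to exist across CPython 3 releases. Opcodes that take an argument need one passed in, and the 0 used for LOAD_CONST is an arbitrary index chosen for illustration:

```python
import dis

# POP_TOP removes one value from the stack
pop_effect = dis.stack_effect(dis.opmap["POP_TOP"])

# LOAD_CONST pushes one value; it requires an argument
load_effect = dis.stack_effect(dis.opmap["LOAD_CONST"], 0)

print(pop_effect, load_effect)
```

A negative number means the opcode leaves fewer values on the stack than it found, and a positive number means it leaves more.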

Example: Adding an Item to a List

In Python, when you create a list, the .append() method is available on the list object:

my_list = []
my_list.append(obj)

Where obj is an object you want to append to the end of the list.

There are 2 operations involved in this process: LOAD_FAST, to load the object obj to the top of the value stack from the list of locals in the frame, and LIST_APPEND, to add the object.

First exploring LOAD_FAST, there are 5 steps:

  1. The pointer to obj is loaded from GETLOCAL(), where the variable to load is the operation argument. The list of variable pointers is stored in fastlocals, which is a copy of the PyFrame attribute f_localsplus. The operation argument is a number, pointing to the index in the fastlocals array pointer. This means that the loading of a local is simply a copy of the pointer instead of having to look up the variable name.

  2. If the variable no longer exists, an unbound local variable error is raised.

  3. The reference counter for value (in our case, obj) is increased by 1.

  4. The pointer to obj is pushed to the top of the value stack.

  5. The FAST_DISPATCH macro is called. If tracing is enabled, the loop goes over again (with all the tracing). If tracing is not enabled, a goto is called to fast_next_opcode, which jumps back to the top of the loop for the next instruction.

 ... 
    case TARGET(LOAD_FAST): {
        PyObject *value = GETLOCAL(oparg);                 // 1.
        if (value == NULL) {
            format_exc_check_arg(
                PyExc_UnboundLocalError,
                UNBOUNDLOCAL_ERROR_MSG,
                PyTuple_GetItem(co->co_varnames, oparg));
            goto error;                                    // 2.
        }
        Py_INCREF(value);                                  // 3.
        PUSH(value);                                       // 4.
        FAST_DISPATCH();                                   // 5.
    }
 ...

Now the pointer to obj is at the top of the value stack. The next instruction LIST_APPEND is run.

Many of the bytecode operations reference the base types, like PyUnicode and PyNumber. For example, LIST_APPEND appends an object to the end of a list. To achieve this, it pops the pointer from the value stack with the POP() macro, which returns the pointer to the last object in the stack. The macro is a shortcut for:

PyObject *v = (*--stack_pointer);

Now the pointer to obj is stored as v. The list pointer is loaded from PEEK(oparg).

Then the C API for Python lists is called for list and v. The code for this is inside Objects/listobject.c, which we go into in the next chapter.

A call to PREDICT is made, which guesses that the next operation will be JUMP_ABSOLUTE. The PREDICT macro has compiler-generated goto statements for each of the potential operations' case statements. This means the CPU can jump to that instruction and not have to go through the loop again:

 ...
        case TARGET(LIST_APPEND): {
            PyObject *v = POP();
            PyObject *list = PEEK(oparg);
            int err;
            err = PyList_Append(list, v);
            Py_DECREF(v);
            if (err != 0)
                goto error;
            PREDICT(JUMP_ABSOLUTE);
            DISPATCH();
        }
 ...
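One place where you can reliably find LIST_APPEND is the bytecode of a list comprehension. The sketch below is an illustration, with a made-up helper named collect_opnames(). It walks nested code objects as well, because some CPython versions compile the comprehension into its own code object stored in co_consts:

```python
import dis
import types


def collect_opnames(code):
    """Gather opcode names from a code object and any nested code objects."""
    names = [instruction.opname for instruction in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names.extend(collect_opnames(const))
    return names


def squares(values):
    return [v * v for v in values]


opnames = collect_opnames(squares.__code__)
print("LIST_APPEND" in opnames)
```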

Opcode predictions: Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. For example, COMPARE_OP is often followed by POP_JUMP_IF_FALSE or POP_JUMP_IF_TRUE.

"Verifying the prediction costs a single high-speed test of a register variable against a constant. If the pairing was good, then the processor's own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. A successful prediction saves a trip through the eval-loop including its unpredictable switch-case branch. Combined with the processor's internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined."

If collecting opcode statistics, you have two choices:

  1. Keep the predictions turned-on and interpret the results as if some opcodes had been combined
  2. Turn off predictions so that the opcode frequency counter updates for both opcodes

Opcode prediction is disabled with threaded code since the latter allows the CPU to record separate branch prediction information for each opcode.

Some of the operations, such as CALL_FUNCTION, CALL_METHOD, have an operation argument referencing another compiled function. In these cases, another frame is pushed to the frame stack in the thread, and the evaluation loop is run for that function until the function completes. Each time a new frame is created and pushed onto the stack, the value of the frame's f_back is set to the current frame before the new one is created.

This nesting of frames is clear when you see a stack trace. Take this example script:

def function2():
  raise RuntimeError

def function1():
  function2()

if __name__ == '__main__':
  function1()

Calling this on the command line will give you:

$ ./python.exe example_stack.py

Traceback (most recent call last):
  File "example_stack.py", line 8, in <module>
    function1()
  File "example_stack.py", line 5, in function1
    function2()
  File "example_stack.py", line 2, in function2
    raise RuntimeError
RuntimeError

In traceback.py, the walk_stack() function is used to print tracebacks:

def walk_stack(f):
    """Walk a stack yielding the frame and line number for each frame.

    This will follow f.f_back from the given frame. If no frame is given, the
    current stack is used. Usually used with StackSummary.extract.
    """
    if f is None:
        f = sys._getframe().f_back.f_back
    while f is not None:
        yield f, f.f_lineno
        f = f.f_back

Here you can see that the current frame is fetched by calling sys._getframe(), and the parent's parent is set as the starting frame. You don't want to see the call to walk_stack() or print_trace() in the traceback, so those function frames are skipped.

Then the f_back pointer is followed to the top.

sys._getframe() is the Python API to get the frame attribute of the current thread.

Here is how that frame stack would look visually, with 3 frames each with its code object and a thread state pointing to the current frame:

Example frame stack
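You can follow the same f_back chain yourself. The sketch below reuses the function1() and function2() names from the example script and adds a hypothetical helper, frame_names(), that records the name of each frame's code object while walking toward the outermost frame:

```python
import sys


def frame_names():
    """Walk f_back from the current frame and collect each code object's name."""
    names = []
    frame = sys._getframe()
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names


def function2():
    return frame_names()


def function1():
    return function2()


names = function1()
print(names[:3])  # ['frame_names', 'function2', 'function1']
```

The frames beyond the first three depend on how the script was started, so only the innermost three are shown.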

Conclusion

In this Part, you explored the most complex element of CPython: the compiler. The original author of Python, Guido van Rossum, made the statement that CPython's compiler should be "dumb" so that people can understand it.

By breaking down the compilation process into small, logical steps, it is far easier to understand.

In the next chapter, we connect the compilation process with the basis of all Python code, the object.

Part 4: Objects in CPython

CPython comes with a collection of basic types like strings, lists, tuples, dictionaries, and objects.

All of these types are built-in. You don't need to import any libraries, even from the standard library. Also, the instantiation of these built-in types has some handy shortcuts.

For example, to create a new list, you can call:

lst = list()

Or, you can use square brackets:

lst = []

Strings can be instantiated from a string-literal by using either double or single quotes. We explored the grammar definitions earlier that cause the compiler to interpret double quotes as a string literal.

All types in Python inherit from object, a built-in base type. Even strings, tuples, and lists inherit from object. During the walk-through of the C code, you have read lots of references to PyObject*, the C-API structure for an object.

Because C is not object-oriented like Python, objects in C don't inherit from one another. PyObject is the data structure for the beginning of the Python object's memory.

Much of the base object API is declared in Objects/object.c, like the function PyObject_Repr(), which the built-in repr() function calls. You will also find PyObject_Hash() and other APIs.

All of these functions can be overridden in a custom object by implementing "dunder" methods on a Python object:

class MyObject(object): 
    def __init__(self, id, name):
        self.id = id
        self.name = name

    def __repr__(self):
        return "<{0} id={1}>".format(self.name, self.id)

This code is implemented in PyObject_Repr(), inside Objects/object.c. The type of the target object, v will be inferred through a call to Py_TYPE() and if the tp_repr field is set, then the function pointer is called. If the tp_repr field is not set, i.e. the object doesn't declare a custom __repr__ method, then the default behavior is run, which is to return "<%s object at %p>" with the type name and the ID:

PyObject *
PyObject_Repr(PyObject *v)
{
    PyObject *res;
    if (PyErr_CheckSignals())
        return NULL;
...
    if (v == NULL)
        return PyUnicode_FromString("<NULL>");
    if (Py_TYPE(v)->tp_repr == NULL)
        return PyUnicode_FromFormat("<%s object at %p>",
                                    v->ob_type->tp_name, v);

...
}
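Seen from Python, the two outcomes look roughly like the sketch below. The class names Plain and Named are hypothetical. For a class written in Python, the default text normally comes from the repr inherited from object, which uses the same "type name plus address" shape as the C fallback above:

```python
import re


class Plain:
    """Defines no __repr__, so the inherited default is used."""


class Named:
    def __init__(self, id, name):
        self.id = id
        self.name = name

    def __repr__(self):
        return "<{0} id={1}>".format(self.name, self.id)


default_text = repr(Plain())
custom_text = repr(Named(7, "widget"))

print(default_text)  # shaped like <module.Plain object at 0x...>
print(custom_text)

# The default form embeds the type name and the object's address.
default_pattern = re.compile(r"^<.*Plain object at 0x[0-9a-fA-F]+>$")
matches_default_form = bool(default_pattern.match(default_text))
```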

The ob_type field for a given PyObject* will point to the data structure PyTypeObject, defined in Include/cpython/object.h. This data structure lists all the built-in functions as fields, along with the arguments they should receive.

Take tp_repr as an example:

typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name; /* For printing, in format "<module>.<name>" */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */

    /* Methods to implement standard operations */
...
    reprfunc tp_repr;

Where reprfunc is a typedef for PyObject *(*reprfunc)(PyObject *);, a function that takes 1 pointer to PyObject (self).

Some of the dunder APIs are optional, because they only apply to certain types, like numbers:

    /* Method suites for standard classes */

    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;

A sequence, like a list, would implement the following methods:

typedef struct {
    lenfunc sq_length; // len(v)
    binaryfunc sq_concat; // v + x
    ssizeargfunc sq_repeat; // v * n
    ssizeargfunc sq_item; // v[x]
    void *was_sq_slice; // v[x:y:z]
    ssizeobjargproc sq_ass_item; // v[x] = z
    void *was_sq_ass_slice; // v[x:y] = z
    objobjproc sq_contains; // x in v

    binaryfunc sq_inplace_concat;
    ssizeargfunc sq_inplace_repeat;
} PySequenceMethods;
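To make the slot comments concrete, here is a small sketch of a Python class that supports the same operations. The Deck class is hypothetical, and the mapping to C slots is approximate: for classes written in Python, CPython may route + and * through the number slots rather than sq_concat and sq_repeat.

```python
class Deck:
    """A tiny sequence type; the name and contents are illustrative only."""

    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):             # len(v)
        return len(self._cards)

    def __getitem__(self, index):  # v[x]
        return self._cards[index]

    def __contains__(self, card):  # x in v
        return card in self._cards

    def __add__(self, other):      # v + x
        return type(self)(self._cards + list(other))

    def __mul__(self, count):      # v * n
        return type(self)(self._cards * count)


deck = Deck(["A", "K", "Q"])
bigger = deck + Deck(["J"])
doubled = deck * 2

print(len(deck), deck[1], "Q" in deck)
print(list(bigger), list(doubled))
```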

All of these built-in functions are called the Python Data Model. One of the great resources for the Python Data Model is "Fluent Python" by Luciano Ramalho.

Base Object Type

In Objects/object.c, the base implementation of object type is written as pure C code. There are some concrete implementations of basic logic, like shallow comparisons.

Not all methods in a Python object are part of the Data Model, so a Python object can also contain its own attributes (either class or instance attributes) and methods.

A simple way to think of a Python object is as consisting of 2 things:

  1. The core data model, with pointers to compiled functions
  2. A dictionary with any custom attributes and methods

The core data model is defined in the PyTypeObject, and the functions are defined in:

We're going to dive into 3 of these types:

  1. Booleans
  2. Integers
  3. Generators

Booleans and Integers have a lot in common, so we'll cover those first.

The Bool and Long Integer Type

The bool type is the most straightforward implementation of the built-in types. It inherits from long and has the predefined constants, Py_True and Py_False. These constants are immutable instances, created on the instantiation of the Python interpreter.

Inside Objects/boolobject.c, you can see the helper function to create a bool instance from a number:

PyObject *PyBool_FromLong(long ok)
{
    PyObject *result;

    if (ok)
        result = Py_True;
    else
        result = Py_False;
    Py_INCREF(result);
    return result;
}

This function uses the C evaluation of a numeric type to assign Py_True or Py_False to a result and increment the reference counters.

The numeric functions for and, xor, and or are implemented, but addition, subtraction, and division are dereferenced from the base long type since it would make no sense to divide two boolean values.

The implementation of and for a bool value checks whether a and b are both booleans. If they are, it compares their references to Py_True and returns the matching bool. Otherwise, the operands are treated as numbers, and the and operation of the long type is run on them:

static PyObject *
bool_and(PyObject *a, PyObject *b)
{
    if (!PyBool_Check(a) || !PyBool_Check(b))
        return PyLong_Type.tp_as_number->nb_and(a, b);
    return PyBool_FromLong((a == Py_True) & (b == Py_True));
}
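The same behavior is visible from Python. This short sketch only exercises built-in operators, so nothing in it is specific to a CPython version:

```python
# Both operands are bools: the result stays a bool.
both_bools = True & False

# One operand is a plain int: the operation falls back to the integer
# implementation, so the result is an int (1 & 3).
mixed = True & 3

# Addition is not overridden by bool, so it comes from the integer type.
total = True + True

# bool() hands back one of the two shared constants.
same_constant = bool(5) is True

print(type(both_bools).__name__, both_bools)
print(type(mixed).__name__, mixed)
print(type(total).__name__, total)
```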

The long type is a bit more complex, as the memory requirements are expansive. In the transition from Python 2 to 3, CPython dropped support for the int type and instead used the long type as the primary integer type. Python's long type is quite special in that it can store a variable-length number. The maximum length is set in the compiled binary.

The data structure of a Python long consists of the PyObject header and a list of digits. The list of digits, ob_digit, is initially set to have one digit, but it is later expanded to a longer length when initialized:

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};

Memory is allocated to a new long through _PyLong_New(). This function takes a fixed length and makes sure it is smaller than MAX_LONG_DIGITS. Then it reallocates the memory for ob_digit to match the length.

To convert a C long type to a Python long type, the long is converted to a list of digits, the memory for the Python long is assigned, and then each of the digits is set. If the absolute value fits in a single digit (it is smaller than 2 to the power of PyLong_SHIFT), a fast path allocates one digit and sets it directly, skipping the loop that counts digits:

PyObject *
PyLong_FromLong(long ival)
{
    PyLongObject *v;
    unsigned long abs_ival;
    unsigned long t;  /* unsigned so >> doesn't propagate sign bit */
    int ndigits = 0;
    int sign;

    CHECK_SMALL_INT(ival);
...
    /* Fast path for single-digit ints */
    if (!(abs_ival >> PyLong_SHIFT)) {
        v = _PyLong_New(1);
        if (v) {
            Py_SIZE(v) = sign;
            v->ob_digit[0] = Py_SAFE_DOWNCAST(
                abs_ival, unsigned long, digit);
        }
        return (PyObject*)v;
    }
...
    /* Larger numbers: loop to determine number of digits */
    t = abs_ival;
    while (t) {
        ++ndigits;
        t >>= PyLong_SHIFT;
    }
    v = _PyLong_New(ndigits);
    if (v != NULL) {
        digit *p = v->ob_digit;
        Py_SIZE(v) = ndigits*sign;
        t = abs_ival;
        while (t) {
            *p++ = Py_SAFE_DOWNCAST(
                t & PyLong_MASK, unsigned long, digit);
            t >>= PyLong_SHIFT;
        }
    }
    return (PyObject *)v;
}
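The digit loop can be mirrored in Python as a rough sketch. The function names to_digits and from_digits are hypothetical, and the default shift of 30 bits is an assumption that matches many 64-bit builds. The value for your interpreter is reported by sys.int_info.bits_per_digit:

```python
import sys


def to_digits(value, shift=30):
    """Split an int into little-endian digits of `shift` bits each.

    Returns (size, digits), where size is the digit count with the sign
    of the value, loosely modelled on how Py_SIZE is used above.
    """
    sign = -1 if value < 0 else 1
    remaining = abs(value)
    mask = (1 << shift) - 1
    digits = []
    while remaining:
        digits.append(remaining & mask)  # lowest `shift` bits
        remaining >>= shift              # move on to the next digit
    return sign * len(digits), digits


def from_digits(size, digits, shift=30):
    """Rebuild the int from the signed size and the digit list."""
    magnitude = sum(d << (shift * i) for i, d in enumerate(digits))
    return -magnitude if size < 0 else magnitude


print("bits per digit on this build:", sys.int_info.bits_per_digit)
print(to_digits(5))
print(to_digits(2**30))
print(to_digits(-(2**30 + 7)))
```

A value below 2**30 needs one digit here, which is the case the single-digit fast path handles.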

To convert a double-precision floating point number to a Python long, PyLong_FromDouble() does the math for you:

PyObject *
PyLong_FromDouble(double dval)
{
    PyLongObject *v;
    double frac;
    int i, ndig, expo, neg;
    neg = 0;
    if (Py_IS_INFINITY(dval)) {
        PyErr_SetString(PyExc_OverflowError,
                        "cannot convert float infinity to integer");
        return NULL;
    }
    if (Py_IS_NAN(dval)) {
        PyErr_SetString(PyExc_ValueError,
                        "cannot convert float NaN to integer");
        return NULL;
    }
    if (dval < 0.0) {
        neg = 1;
        dval = -dval;
    }
    frac = frexp(dval, &expo); /* dval = frac*2**expo; 0.0 <= frac < 1.0 */
    if (expo <= 0)
        return PyLong_FromLong(0L);
    ndig = (expo-1) / PyLong_SHIFT + 1; /* Number of 'digits' in result */
    v = _PyLong_New(ndig);
    if (v == NULL)
        return NULL;
    frac = ldexp(frac, (expo-1) % PyLong_SHIFT + 1);
    for (i = ndig; --i >= 0; ) {
        digit bits = (digit)frac;
        v->ob_digit[i] = bits;
        frac = frac - (double)bits;
        frac = ldexp(frac, PyLong_SHIFT);
    }
    if (neg)
        Py_SIZE(v) = -(Py_SIZE(v));
    return (PyObject *)v;
}
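The same arithmetic can be followed step by step in Python with math.frexp() and math.ldexp(). This is a sketch under the assumption of 30-bit digits. The function name long_from_double is hypothetical, and the final sum only exists to turn the digit list back into an int for comparison:

```python
import math


def long_from_double(dval, shift=30):
    """Truncate a float to an int by peeling off `shift`-bit digits."""
    if math.isinf(dval):
        raise OverflowError("cannot convert float infinity to integer")
    if math.isnan(dval):
        raise ValueError("cannot convert float NaN to integer")
    negative = dval < 0.0
    dval = abs(dval)

    frac, expo = math.frexp(dval)  # dval == frac * 2**expo, 0.5 <= frac < 1
    if expo <= 0:
        return 0                   # magnitude below 1 truncates to zero

    ndig = (expo - 1) // shift + 1  # number of digits in the result
    digits = [0] * ndig
    # Scale so that the integer part of frac is the most significant digit.
    frac = math.ldexp(frac, (expo - 1) % shift + 1)
    for i in range(ndig - 1, -1, -1):
        bits = int(frac)                # integer part becomes this digit
        digits[i] = bits
        frac -= bits                    # keep only the fractional part
        frac = math.ldexp(frac, shift)  # expose the next digit

    magnitude = sum(d << (shift * i) for i, d in enumerate(digits))
    return -magnitude if negative else magnitude


print(long_from_double(3.7))
print(long_from_double(-2.5e20))
```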

The remaining functions in longobject.c are utilities, such as PyLong_FromUnicodeObject(), which converts a Unicode string into a number.

A Review of the Generator Type

Python Generators are functions that contain a yield statement and can be called continually to generate further values.

Commonly they are used as a more memory efficient way of looping through values in a large block of data, like a file, a database or over a network.

Generator objects are returned in place of a value when yield is used instead of return. The generator object is created from the yield statement and returned to the caller.

Let's create a simple generator with a list of 4 constant values:

>>> def example():
...   lst = [1,2,3,4]
...   for i in lst:
...     yield i
... 
>>> gen = example()
>>> gen
<generator object example at 0x100bcc480>

If you explore the contents of the generator object, you can see some of the fields starting with gi_:

>>> dir(gen)
[ ...
 'close', 
 'gi_code', 
 'gi_frame', 
 'gi_running', 
 'gi_yieldfrom', 
 'send', 
 'throw']
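Some of these fields can be watched as the generator moves through its life cycle. The sketch below reuses the example() function and leans on inspect.getgeneratorstate() from the standard library:

```python
import inspect


def example():
    lst = [1, 2, 3, 4]
    for i in lst:
        yield i


gen = example()

# Before the first __next__() call the generator has not started.
states = [inspect.getgeneratorstate(gen)]

# gi_code is the very same code object that the function holds.
shares_code = gen.gi_code is example.__code__

next(gen)
states.append(inspect.getgeneratorstate(gen))

# Paused at a yield, so it is not running right now.
running_while_paused = gen.gi_running

for _ in gen:  # drain the remaining values
    pass
states.append(inspect.getgeneratorstate(gen))

# Once exhausted, the generator lets go of its frame.
frame_after_exhaustion = gen.gi_frame

print(states)
print(shares_code, running_while_paused, frame_after_exhaustion)
```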

The PyGenObject type is defined in Include/cpython/genobject.h and there are 3 flavors:

  1. Generator objects
  2. Coroutine objects
  3. Async generator objects

All 3 share the same subset of fields used in generators, and have similar behaviors:

[Figure: Structure of generator types]

Focusing first on generators, you can see the fields:

The coroutine and async generators have the same fields but prefixed with cr_ and ag_ respectively.

If you call __next__() on the generator object, the next value is yielded until eventually a StopIteration is raised:

>>> gen.__next__()
1
>>> gen.__next__()
2
>>> gen.__next__()
3
>>> gen.__next__()
4
>>> gen.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
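A for loop does this bookkeeping for you. As a sketch, the hypothetical helper drain() below spells out the loop by hand so that the role of StopIteration is visible:

```python
def example():
    lst = [1, 2, 3, 4]
    for i in lst:
        yield i


def drain(iterator):
    """Hand-written for loop: call __next__() until StopIteration."""
    values = []
    while True:
        try:
            values.append(iterator.__next__())
        except StopIteration:
            break
    return values


manual = drain(example())
looped = [value for value in example()]

# The built-in next() accepts a default, which is returned instead of
# letting StopIteration propagate.
spent = example()
drain(spent)
fallback = next(spent, "exhausted")

print(manual, looped, fallback)
```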

Each time __next__() is called, the code object inside the generator's gi_code field is executed as a new frame and the return value is pushed to the value stack.

You can also see that gi_code is the compiled code object for the generator function by importing the dis module and disassembling the bytecode inside:

>>> gen = example()
>>> import dis
>>> dis.disco(gen.gi_code)
  2           0 LOAD_CONST               1 (1)
              2 LOAD_CONST               2 (2)
              4 LOAD_CONST               3 (3)
              6 LOAD_CONST               4 (4)
              8 BUILD_LIST               4
             10 STORE_FAST               0 (lst)

  3          12 SETUP_LOOP              18 (to 32)
             14 LOAD_FAST                0 (lst)
             16 GET_ITER
        >>   18 FOR_ITER                10 (to 30)
             20 STORE_FAST               1 (i)

  4          22 LOAD_FAST                1 (i)
             24 YIELD_VALUE
             26 POP_TOP
             28 JUMP_ABSOLUTE           18
        >>   30 POP_BLOCK
        >>   32 LOAD_CONST               0 (None)
             34 RETURN_VALUE

Whenever __next__() is called on a generator object, gen_iternext() is called with the generator instance, which immediately calls gen_send_ex() inside Objects/genobject.c.

gen_send_ex() is the function that converts a generator object into the next yielded result. You'll see many similarities with the way frames are constructed in Python/ceval.c from a code object as these functions have similar tasks.

The gen_send_ex() function is shared with generators, coroutines, and async generators and has the following steps:

  1. The current thread state is fetched

  2. The frame object from the generator object is fetched

  3. If the generator is running when __next__() was called, raise a ValueError

  4. If the generator has no frame, or the frame's stack top is NULL, the generator is exhausted:

    • In the case of a coroutine, if the coroutine is not already marked as closing, a RuntimeError is raised
    • If this is an async generator, raise a StopAsyncIteration
    • For a standard generator, a StopIteration is raised.
  5. If the last instruction in the frame (f->f_lasti) is still -1 because the generator has just been started, then a non-None value can't be passed as an argument, so a TypeError is raised

  6. Else, this is not the first time it's being called, and arguments are allowed. The value of the argument is pushed to the frame's value stack

  7. The f_back field of the frame is the caller to which return values are sent, so this is set to the current frame in the thread. This means that the return value is sent to the caller, not the creator of the generator

  8. The generator is marked as running

  9. The last exception in the generator's exception info is copied from the last exception in the thread state

  10. The thread state exception info is set to the address of the generator's exception info. This means that if the caller enters a breakpoint around the execution of a generator, the stack trace goes through the generator and the offending code is clear

  11. The frame inside the generator is executed within the Python/ceval.c main execution loop, and the value returned

  12. The thread state last exception is reset to the value before the frame was called

  13. The generator is marked as not running

  14. The following cases then match the return value and any exceptions thrown by the call to the generator. Remember that generators should raise a StopIteration when they are exhausted, either manually, or by not yielding a value. Coroutines and async generators should not:

    • If no result was returned from the frame, raise a StopIteration for generators and StopAsyncIteration for async generators
    • If a StopIteration was explicitly raised, but this is a coroutine or an async generator, raise a RuntimeError as this is not allowed
    • If a StopAsyncIteration was explicitly raised and this is an async generator, raise a RuntimeError, as this is not allowed
  15. Lastly, the result is returned back to the caller of __next__()

static PyObject *
gen_send_ex(PyGenObject *gen, PyObject *arg, int exc, int closing)
{
    PyThreadState *tstate = _PyThreadState_GET();       // 1.
    PyFrameObject *f = gen->gi_frame;                   // 2.
    PyObject *result;

    if (gen->gi_running) {     // 3.
        const char *msg = "generator already executing";
        if (PyCoro_CheckExact(gen)) {
            msg = "coroutine already executing";
        }
        else if (PyAsyncGen_CheckExact(gen)) {
            msg = "async generator already executing";
        }
        PyErr_SetString(PyExc_ValueError, msg);
        return NULL;
    }
    if (f == NULL || f->f_stacktop == NULL) { // 4.
        if (PyCoro_CheckExact(gen) && !closing) {
            /* `gen` is an exhausted coroutine: raise an error,
               except when called from gen_close(), which should
               always be a silent method. */
            PyErr_SetString(
                PyExc_RuntimeError,
                "cannot reuse already awaited coroutine"); // 4a.
        }
        else if (arg && !exc) {
            /* `gen` is an exhausted generator:
               only set exception if called from send(). */
            if (PyAsyncGen_CheckExact(gen)) {
                PyErr_SetNone(PyExc_StopAsyncIteration); // 4b.
            }
            else {
                PyErr_SetNone(PyExc_StopIteration);      // 4c.
            }
        }
        return NULL;
    }

    if (f->f_lasti == -1) {
        if (arg && arg != Py_None) { // 5.
            const char *msg = "can't send non-None value to a "
                              "just-started generator";
            if (PyCoro_CheckExact(gen)) {
                msg = NON_INIT_CORO_MSG;
            }
            else if (PyAsyncGen_CheckExact(gen)) {
                msg = "can't send non-None value to a "
                      "just-started async generator";
            }
            PyErr_SetString(PyExc_TypeError, msg);
            return NULL;
        }
    } else { // 6.
        /* Push arg onto the frame's value stack */
        result = arg ? arg : Py_None;
        Py_INCREF(result);
        *(f->f_stacktop++) = result;
    }

    /* Generators always return to their most recent caller, not
     * necessarily their creator. */
    Py_XINCREF(tstate->frame);
    assert(f->f_back == NULL);
    f->f_back = tstate->frame;                          // 7.

    gen->gi_running = 1;                                // 8.
    gen->gi_exc_state.previous_item = tstate->exc_info; // 9.
    tstate->exc_info = &gen->gi_exc_state;              // 10.
    result = PyEval_EvalFrameEx(f, exc);                // 11.
    tstate->exc_info = gen->gi_exc_state.previous_item; // 12.
    gen->gi_exc_state.previous_item = NULL;             
    gen->gi_running = 0;                                // 13.

    /* Don't keep the reference to f_back any longer than necessary.  It
     * may keep a chain of frames alive or it could create a reference
     * cycle. */
    assert(f->f_back == tstate->frame);
    Py_CLEAR(f->f_back);

    /* If the generator just returned (as opposed to yielding), signal
     * that the generator is exhausted. */
    if (result && f->f_stacktop == NULL) {  // 14a.
        if (result == Py_None) {
            /* Delay exception instantiation if we can */
            if (PyAsyncGen_CheckExact(gen)) {
                PyErr_SetNone(PyExc_StopAsyncIteration);
            }
            else {
                PyErr_SetNone(PyExc_StopIteration);
            }
        }
        else {
            /* Async generators cannot return anything but None */
            assert(!PyAsyncGen_CheckExact(gen));
            _PyGen_SetStopIterationValue(result);
        }
        Py_CLEAR(result);
    }
    else if (!result && PyErr_ExceptionMatches(PyExc_StopIteration)) { // 14b.
        const char *msg = "generator raised StopIteration";
        if (PyCoro_CheckExact(gen)) {
            msg = "coroutine raised StopIteration";
        }
        else if PyAsyncGen_CheckExact(gen) {
            msg = "async generator raised StopIteration";
        }
        _PyErr_FormatFromCause(PyExc_RuntimeError, "%s", msg);

    }
    else if (!result && PyAsyncGen_CheckExact(gen) &&
             PyErr_ExceptionMatches(PyExc_StopAsyncIteration))  // 14c.
    {
        /* code in `gen` raised a StopAsyncIteration error:
           raise a RuntimeError.
        */
        const char *msg = "async generator raised StopAsyncIteration";
        _PyErr_FormatFromCause(PyExc_RuntimeError, "%s", msg);
    }
...

    return result; // 15.
}
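Several of these branches can be triggered from plain Python. The sketch below assumes Python 3.7 or later, where a StopIteration raised inside a generator is always converted to a RuntimeError. The function names are hypothetical, and the step numbers in the comments refer to the list above:

```python
def counter():
    received = yield 1
    yield received
    return "done"


gen = counter()

# Step 5: a just-started generator rejects a non-None value.
try:
    gen.send("too early")
except TypeError as exc:
    early_error = str(exc)

first = next(gen)           # runs up to the first yield
echoed = gen.send("hello")  # step 6: the value is handed to the frame

# Step 14: a return statement surfaces as StopIteration carrying the value.
try:
    next(gen)
except StopIteration as stop:
    return_value = stop.value


def reentrancy_error():
    """Step 3: resume a generator from inside its own running frame."""
    holder = {}

    def reentrant():
        yield next(holder["gen"])

    holder["gen"] = reentrant()
    try:
        next(holder["gen"])
    except ValueError as exc:
        return str(exc)


def leaky():
    raise StopIteration  # not allowed inside a generator body
    yield


# Step 14, second case: the explicit StopIteration becomes a RuntimeError.
try:
    next(leaky())
except RuntimeError as exc:
    leak_error = str(exc)

running_error = reentrancy_error()
print(early_error)
print(first, echoed, return_value)
print(running_error)
print(leak_error)
```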

Going back to the evaluation of code objects whenever a function or module is called, there was a special case for generators, coroutines, and async generators in _PyEval_EvalCodeWithName(). This function checks for the CO_GENERATOR, CO_COROUTINE, and CO_ASYNC_GENERATOR flags on the code object.

A new coroutine is created using PyCoro_New(), a new async generator with PyAsyncGen_New(), and a generator with PyGen_NewWithQualName(). These objects are returned early instead of returning an evaluated frame, which is why you get a generator object after calling a function with a yield statement:

PyObject *
_PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals, ...
...
    /* Handle generator/coroutine/asynchronous generator */
    if (co->co_flags & (CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR)) {
        PyObject *gen;
        PyObject *coro_wrapper = tstate->coroutine_wrapper;
        int is_coro = co->co_flags & CO_COROUTINE;
        ...
        /* Create a new generator that owns the ready to run frame
         * and return that as the value. */
        if (is_coro) {
            gen = PyCoro_New(f, name, qualname);
        } else if (co->co_flags & CO_ASYNC_GENERATOR) {
            gen = PyAsyncGen_New(f, name, qualname);
        } else {
            gen = PyGen_NewWithQualName(f, name, qualname);
        }
        ...
        return gen;
    }
...

The flags in the code object were injected by the compiler after traversing the AST and seeing the yield or yield from statements or seeing the coroutine decorator.
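The flags can be read back from Python, since the inspect module exposes the same constants. In this sketch the four small functions and the generator_flags() helper are made up for illustration:

```python
import inspect


def plain():
    return 1


def gen():
    yield 1


async def coro():
    return 1


async def agen():
    yield 1


FLAGS = {
    "CO_GENERATOR": inspect.CO_GENERATOR,
    "CO_COROUTINE": inspect.CO_COROUTINE,
    "CO_ASYNC_GENERATOR": inspect.CO_ASYNC_GENERATOR,
}


def generator_flags(func):
    """Names of the generator-related flags set on func's code object."""
    return sorted(
        name for name, bit in FLAGS.items() if func.__code__.co_flags & bit
    )


flags_by_function = {
    func.__name__: generator_flags(func) for func in (plain, gen, coro, agen)
}
print(flags_by_function)

# Calling the generator function returns an object without running the body.
kind = type(gen()).__name__
print(kind)
```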

PyGen_NewWithQualName() will call gen_new_with_qualname() with the generated frame and then create the PyGenObject with NULL values and the compiled code object:

static PyObject *
gen_new_with_qualname(PyTypeObject *type, PyFrameObject *f,
                      PyObject *name, PyObject *qualname)
{
    PyGenObject *gen = PyObject_GC_New(PyGenObject, type);
    if (gen == NULL) {
        Py_DECREF(f);
        return NULL;
    }
    gen->gi_frame = f;
    f->f_gen = (PyObject *) gen;
    Py_INCREF(f->f_code);
    gen->gi_code = (PyObject *)(f->f_code);
    gen->gi_running = 0;
    gen->gi_weakreflist = NULL;
    gen->gi_exc_state.exc_type = NULL;
    gen->gi_exc_state.exc_value = NULL;
    gen->gi_exc_state.exc_traceback = NULL;
    gen->gi_exc_state.previous_item = NULL;
    if (name != NULL)
        gen->gi_name = name;
    else
        gen->gi_name = ((PyCodeObject *)gen->gi_code)->co_name;
    Py_INCREF(gen->gi_name);
    if (qualname != NULL)
        gen->gi_qualname = qualname;
    else
        gen->gi_qualname = gen->gi_name;
    Py_INCREF(gen->gi_qualname);
    _PyObject_GC_TRACK(gen);
    return (PyObject *)gen;
}
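The name, qualified name, and code object stored here are reachable from Python as attributes of the generator. A brief sketch, with a hypothetical Library class:

```python
class Library:
    def titles(self):
        yield "Fluent Python"


gen = Library().titles()

name = gen.__name__          # backed by gi_name
qualname = gen.__qualname__  # backed by gi_qualname, includes the class
same_code = gen.gi_code is Library.titles.__code__

print(name, qualname, same_code)
```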

Bringing this all together, you can see how the generator function is a powerful syntax where a single keyword, yield, triggers a whole flow to create a unique object, copy a compiled code object as a property, set a frame, and store a list of variables in the local scope.

To the user of the generator, this all seems like magic, but under the covers it's not that complex.

Conclusion

Now that you understand how some built-in types work, you can explore other types.

When exploring Python classes, it is important to remember there are built-in types, written in C and classes inheriting from those types, written in Python or C.

Some libraries have types written in C instead of inheriting from the built-in types. One example is numpy, a library for numeric arrays. Its ndarray type is written in C and is highly efficient and performant.

In the next Part, we will explore the classes and functions defined in the standard library.

Part 5: The CPython Standard Library

Python has always come "batteries included." This statement means that with a standard CPython distribution, there are libraries for working with files, threads, networks, web sites, music, keyboards, screens, text, and all manner of utilities.

Some of the batteries that come with CPython are more like AA batteries. They're useful for everything, like the collections module and the sys module. Some of them are a bit more obscure, like a small watch battery that you never know when it might come in useful.

There are 2 types of modules in the CPython standard library:

  1. Those written in pure Python that provide a utility
  2. Those written in C with Python wrappers

We will explore both types.

Python Modules

The modules written in pure Python are all located in the Lib/ directory in the source code. Some of the larger modules have submodules in subfolders, like the email module.

An easy module to look at would be the colorsys module. It's only a few hundred lines of Python code. You may not have come across it before. The colorsys module has some utility functions for converting color scales.

When you install a Python distribution from source, standard library modules are copied from the Lib folder into the distribution folder. This folder is always part of your path when you start Python, so you can import the modules without having to worry about where they're located.

For example:

>>> import colorsys
>>> colorsys
<module 'colorsys' from '/usr/shared/lib/python3.7/colorsys.py'>

>>> colorsys.rgb_to_hls(255,0,0)
(0.0, 127.5, -1.007905138339921) 

We can see the source code of rgb_to_hls() inside Lib/colorsys.py:

# HLS: Hue, Luminance, Saturation
# H: position in the spectrum
# L: color lightness
# S: color saturation

def rgb_to_hls(r, g, b):
    maxc = max(r, g, b)
    minc = min(r, g, b)
    # XXX Can optimize (maxc+minc) and (maxc-minc)
    l = (minc+maxc)/2.0
    if minc == maxc:
        return 0.0, l, 0.0
    if l <= 0.5:
        s = (maxc-minc) / (maxc+minc)
    else:
        s = (maxc-minc) / (2.0-maxc-minc)
    rc = (maxc-r) / (maxc-minc)
    gc = (maxc-g) / (maxc-minc)
    bc = (maxc-b) / (maxc-minc)
    if r == maxc:
        h = bc-gc
    elif g == maxc:
        h = 2.0+rc-bc
    else:
        h = 4.0+gc-rc
    h = (h/6.0) % 1.0
    return h, l, s

There's nothing special about this function. It's just standard Python. You'll find similar things with all of the pure Python standard library modules. They're just written in plain Python, well laid out and easy to understand. You may even spot improvements or bugs, so you can make changes to them and contribute them to the Python distribution. We'll cover that toward the end of this article.
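One detail worth noting when reading the source: the arithmetic treats each channel as a float between 0 and 1, which is why the call with 255 earlier produced a negative saturation. A minimal sketch of scaling 8-bit values first, using a hypothetical wrapper named rgb255_to_hls():

```python
import colorsys


def rgb255_to_hls(r, g, b):
    """Scale 0-255 channel values to 0-1 before converting."""
    return colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)


red = rgb255_to_hls(255, 0, 0)
print(red)

# Converting back should land on pure red again, up to rounding.
round_trip = colorsys.hls_to_rgb(*red)
print(round_trip)
```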

Python and C Modules

The remainder of the modules are written in C, or a combination of Python and C. The source code for these is in Lib/ for the Python component, and Modules/ for the C component. There are two exceptions to this rule, the sys module, found in Python/sysmodule.c and the __builtins__ module, found in Python/bltinmodule.c.

Python will import * from __builtins__ when an interpreter is instantiated, so all of the functions like print(), chr(), format(), etc. are found within Python/bltinmodule.c.

Because the sys module is so specific to the interpreter and the internals of CPython, it is found directly inside the Python directory. It is also marked as an "implementation detail" of CPython and not found in other distributions.
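Both points can be checked from a running interpreter. This sketch uses the builtins module, which is how the same namespace is imported by name in Python 3, and sys.builtin_module_names, which lists the modules compiled into the interpreter:

```python
import builtins
import sys

# The print you call without importing anything is the one in builtins.
same_print = builtins.print is print

# sys and builtins are compiled into the interpreter itself, while a pure
# Python module such as colorsys is loaded from a .py file instead.
compiled_in = {
    name: name in sys.builtin_module_names
    for name in ("sys", "builtins", "colorsys")
}

print(same_print)
print(compiled_in)
```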

The built-in print() function was probably the first thing you learned to do in Python. So what happens when you type print("hello world!")?

  1. The argument "hello world!" was converted from a string constant to a PyUnicodeObject by the compiler
  2. builtin_print() was executed with 1 argument, and NULL kwnames
  3. The file variable is set to PyId_stdout, the system's stdout handle
  4. Each argument is sent to file
  5. A line break, \n, is sent to file

static PyObject *
builtin_print(PyObject *self, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)
{
    ...
    if (file == NULL || file == Py_None) {
        file = _PySys_GetObjectId(&PyId_stdout);
        ...
    }
    ...
    for (i = 0; i < nargs; i++) {
        if (i > 0) {
            if (sep == NULL)
                err = PyFile_WriteString(" ", file);
            else
                err = PyFile_WriteObject(sep, file,
                                         Py_PRINT_RAW);
            if (err)
                return NULL;
        }
        err = PyFile_WriteObject(args[i], file, Py_PRINT_RAW);
        if (err)
            return NULL;
    }

    if (end == NULL)
        err = PyFile_WriteString("\n", file);
    else
        err = PyFile_WriteObject(end, file, Py_PRINT_RAW);
    ...
    Py_RETURN_NONE;
}
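The control flow of builtin_print() is compact enough to restate in Python. The function sketch_print() below is a hypothetical, simplified stand-in: it leaves out flushing and argument validation, and only follows the separator and line-ending logic shown above:

```python
import io
import sys


def sketch_print(*args, sep=None, end=None, file=None):
    """Write each argument to file, separated by sep and finished by end."""
    if file is None:
        file = sys.stdout           # same default as the C function
    for i, arg in enumerate(args):
        if i > 0:
            file.write(" " if sep is None else sep)
        file.write(str(arg))        # str(), not repr(): the "raw" form
    file.write("\n" if end is None else end)


buffer = io.StringIO()
sketch_print("hello world!", 42, file=buffer)
sketched = buffer.getvalue()

reference = io.StringIO()
print("hello world!", 42, file=reference)
expected = reference.getvalue()

print(repr(sketched), repr(expected))
```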

The contents of some modules written in C expose operating system functions. Because the CPython source code needs to compile to macOS, Windows, Linux, and other *nix-based operating systems, there are some special cases.

The time module is a good example. The way that Windows keeps and stores time in the Operating System is fundamentally different than Linux and macOS. This is one of the reasons why the accuracy of the clock functions differs between operating systems.

In Modules/timemodule.c, the operating system time functions for Unix-based systems are imported from <sys/times.h>:

#ifdef HAVE_SYS_TIMES_H
#include <sys/times.h>
#endif
...
#ifdef MS_WINDOWS
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "pythread.h"
#endif /* MS_WINDOWS */
...

Later in the file, time_process_time_ns() is defined as a wrapper for _PyTime_GetProcessTimeWithInfo():

static PyObject *
time_process_time_ns(PyObject *self, PyObject *unused)
{
    _PyTime_t t;
    if (_PyTime_GetProcessTimeWithInfo(&t, NULL) < 0) {
        return NULL;
    }
    return _PyTime_AsNanosecondsObject(t);
}

_PyTime_GetProcessTimeWithInfo() is implemented multiple different ways in the source code, but only certain parts are compiled into the binary for the module, depending on the operating system. Windows systems will call GetProcessTimes() and Unix systems will call clock_gettime().
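You can ask the interpreter which implementation was compiled in. The sketch below uses time.get_clock_info() and time.process_time_ns(), both available from Python 3.7. The reported implementation string depends on your platform, so no particular value is assumed here:

```python
import time

# Describes the clock behind time.process_time() on this platform.
info = time.get_clock_info("process_time")
print("implementation:", info.implementation)
print("resolution:", info.resolution)

# The wrapper discussed above, returning CPU time as integer nanoseconds.
elapsed_ns = time.process_time_ns()
print("process time (ns):", elapsed_ns)
```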

Other modules that have multiple implementations for the same API are the threading module, the file system module, and the networking modules. Because the Operating Systems behave differently, the CPython source code implements the same behavior as best as it can and exposes it using a consistent, abstracted API.

The CPython Regression Test Suite

CPython has a robust and extensive test suite covering the core interpreter, the standard library, the tooling and distribution for both Windows and Linux/macOS.

The test suite is located in Lib/test and written almost entirely in Python.

The full test suite is a Python package, so it can be run using the Python interpreter that you've compiled. Change directory to the Lib directory and run python -m test -j2, where j2 means to use 2 CPUs.

On Windows use the rt.bat script inside the PCbuild folder, ensuring that you have built the Release configuration from Visual Studio in advance:

$ cd PCbuild
$ rt.bat -q

C:\repos\cpython\PCbuild>"C:\repos\cpython\PCbuild\win32\python.exe"  -u -Wd -E -bb -m test
== CPython 3.8.0a3+
== Windows-10-10.0.17134-SP0 little-endian
== cwd: C:\repos\cpython\build\test_python_2784
== CPU count: 2
== encodings: locale=cp1252, FS=utf-8
Run tests sequentially
0:00:00 [  1/420] test_grammar
0:00:00 [  2/420] test_opcodes
0:00:00 [  3/420] test_dict
0:00:00 [  4/420] test_builtin
...

On Linux:

$ cd Lib
$ ../python -m test -j2   
== CPython 3.8.0a2+
== macOS-10.14.3-x86_64-i386-64bit little-endian
== cwd: /Users/anthonyshaw/cpython/build/test_python_23399
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
Run tests in parallel using 2 child processes
0:00:00 load avg: 2.14 [  1/420] test_opcodes passed
0:00:00 load avg: 2.14 [  2/420] test_grammar passed
...

On macOS:

$ cd Lib
$ ../python.exe -m test -j2   
== CPython 3.8.0a2+
== macOS-10.14.3-x86_64-i386-64bit little-endian
== cwd: /Users/anthonyshaw/cpython/build/test_python_23399
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
Run tests in parallel using 2 child processes
0:00:00 load avg: 2.14 [  1/420] test_opcodes passed
0:00:00 load avg: 2.14 [  2/420] test_grammar passed
...

Some tests require certain flags; otherwise they are skipped. For example, many of the IDLE tests require a GUI.

To see a list of test suites in the configuration, use the --list-tests flag:

$ ../python.exe -m test --list-tests

test_grammar
test_opcodes
test_dict
test_builtin
test_exceptions
...

You can run specific tests by providing the test suite as the first argument:

$ ../python.exe -m test test_webbrowser

Run tests sequentially
0:00:00 load avg: 2.74 [1/1] test_webbrowser

== Tests result: SUCCESS ==

1 test OK.

Total duration: 117 ms
Tests result: SUCCESS

You can also see a detailed list of tests that were executed with the result using the -v argument:

$ ../python.exe -m test test_webbrowser -v

== CPython 3.8.0a2+ 
== macOS-10.14.3-x86_64-i386-64bit little-endian
== cwd: /Users/anthonyshaw/cpython/build/test_python_24562
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
Run tests sequentially
0:00:00 load avg: 2.36 [1/1] test_webbrowser
test_open (test.test_webbrowser.BackgroundBrowserCommandTest) ... ok
test_register (test.test_webbrowser.BrowserRegistrationTest) ... ok
test_register_default (test.test_webbrowser.BrowserRegistrationTest) ... ok
test_register_preferred (test.test_webbrowser.BrowserRegistrationTest) ... ok
test_open (test.test_webbrowser.ChromeCommandTest) ... ok
test_open_new (test.test_webbrowser.ChromeCommandTest) ... ok
...
test_open_with_autoraise_false (test.test_webbrowser.OperaCommandTest) ... ok

----------------------------------------------------------------------

Ran 34 tests in 0.056s

OK (skipped=2)

== Tests result: SUCCESS ==

1 test OK.

Total duration: 134 ms
Tests result: SUCCESS
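The per-test lines in the verbose output are in the format produced by the unittest module. To get a feel for it, here is a small stand-alone sketch. The ColorsysSmokeTest class and its test names are hypothetical and are not taken from Lib/test:

```python
import colorsys
import io
import unittest


class ColorsysSmokeTest(unittest.TestCase):
    def test_grey_has_no_saturation(self):
        h, l, s = colorsys.rgb_to_hls(0.5, 0.5, 0.5)
        self.assertEqual((h, s), (0.0, 0.0))
        self.assertEqual(l, 0.5)

    def test_pure_red(self):
        self.assertEqual(colorsys.rgb_to_hls(1.0, 0.0, 0.0), (0.0, 0.5, 1.0))


# Run the tests with verbose output captured into a string.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ColorsysSmokeTest)
stream = io.StringIO()
result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
report = stream.getvalue()

print(report)
```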

Understanding how to use the test suite and checking the state of the version you have compiled is very important if you wish to make changes to CPython. Before you start making changes, you should run the whole test suite and make sure everything is passing.

Installing a Custom Version

From your source repository, if you're happy with your changes and want to use them inside your system, you can install it as a custom version.

For macOS and Linux, you can use the altinstall command, which won't create symlinks for python3 and installs a standalone version:

$ make altinstall

For Windows, you have to change the build configuration from Debug to Release, then copy the packaged binaries to a directory on your computer which is part of the system path.

The CPython Source Code: Conclusion

Congratulations, you made it! Did your tea get cold? Make yourself another cup. You've earned it.

Now that you've seen the CPython source code, the modules, the compiler, and the tooling, you may wish to make some changes and contribute them back to the Python ecosystem.

The official dev guide contains plenty of resources for beginners. You've already taken the first step: understanding the source and knowing how to change, compile, and test CPython.

Think back to all the things you've learned about CPython over this article. All the pieces of magic to which you've learned the secrets. The journey doesn't stop here.

This might be a good time to learn more about Python and C. Who knows: you could be contributing more and more to the CPython project!



21 Aug 2019 4:10pm GMT

Real Python: Your Guide to the CPython Source Code

Are there certain parts of Python that just seem magic? Like how dictionaries are so much faster than looping over a list to find an item? How does a generator remember the state of the variables each time it yields a value, and why do you never have to allocate memory like in other languages? It turns out that CPython, the most popular Python runtime, is written in human-readable C and Python code. This tutorial will walk you through the CPython source code.

You'll cover all the concepts behind the internals of CPython and how they work, with visual explanations as you go.

You'll learn how to:

Yes, this is a very long article. If you just made yourself a fresh cup of tea, coffee or your favorite beverage, it's going to be cold by the end of Part 1.

This tutorial is split into five parts. Take your time for each part and make sure you try out the demos and the interactive components. You can feel a sense of achievement that you grasp the core concepts of Python that can make you a better Python programmer.


Part 1: Introduction to CPython

When you type python at the console or install a Python distribution from python.org, you are running CPython. CPython is one of the many Python runtimes, maintained and written by different teams of developers. Some other runtimes you may have heard of are PyPy, Cython, and Jython.

The unique thing about CPython is that it contains both a runtime and the shared language specification that all Python runtimes use. CPython is the "official," or reference implementation of Python.

The Python language specification is the document that describes the Python language. For example, it says that assert is a reserved keyword, and that [] is used for indexing, slicing, and creating empty lists.

Think about what you expect to be inside the Python distribution on your computer:

These are all part of the CPython distribution. There's a lot more than just a compiler.

Note: This article is written against version 3.8.0b3 of the CPython source code.

What's in the Source Code?

The CPython source distribution comes with a whole range of tools, libraries, and components. We'll explore those in this article. First we are going to focus on the compiler.

To download a copy of the CPython source code, you can use git to pull the latest version to a working copy locally:

git clone https://github.com/python/cpython

Note: If you don't have Git available, you can download the source in a ZIP file directly from the GitHub website.

Inside of the newly downloaded cpython directory, you will find the following subdirectories:

cpython/
│
├── Doc      ← Source for the documentation
├── Grammar  ← The computer-readable language definition
├── Include  ← The C header files
├── Lib      ← Standard library modules written in Python
├── Mac      ← macOS support files
├── Misc     ← Miscellaneous files
├── Modules  ← Standard Library Modules written in C
├── Objects  ← Core types and the object model
├── Parser   ← The Python parser source code
├── PC       ← Windows build support files
├── PCbuild  ← Windows build support files for older Windows versions
├── Programs ← Source code for the python executable and other binaries
├── Python   ← The CPython interpreter source code
└── Tools    ← Standalone tools useful for building or extending Python

Next, we'll compile CPython from the source code. This step requires a C compiler, and some build tools, which depend on the operating system you're using.

Compiling CPython (macOS)

Compiling CPython on macOS is straightforward. You will first need the essential C compiler toolkit. The Command Line Development Tools is an app that you can update in macOS through the App Store. You need to perform the initial installation on the terminal.

To open up a terminal in macOS, go to the Launchpad, then Other, then choose the Terminal app. You will want to save this app to your Dock, so right-click the icon and select Keep in Dock.

Now, within the terminal, install the C compiler and toolkit by running the following:

$ xcode-select --install

This command will pop up a prompt to download and install a set of tools, including Git, Make, and a C compiler (Apple's Clang, which also answers to the gcc command).

You will also need a working copy of OpenSSL to use for fetching packages from the PyPI.org website. If you later plan on using this build to install additional packages, SSL validation is required.

The simplest way to install OpenSSL on macOS is by using Homebrew. If you already have Homebrew installed, you can install the dependencies for CPython with the brew install command:

$ brew install openssl xz zlib

Now that you have the dependencies, you can run the configure script. Enable SSL support by pointing it at the location where Homebrew installed OpenSSL, and enable the debug hooks with --with-pydebug:

$ CPPFLAGS="-I$(brew --prefix zlib)/include" \
 LDFLAGS="-L$(brew --prefix zlib)/lib" \
 ./configure --with-openssl=$(brew --prefix openssl) --with-pydebug

This will generate a Makefile in the root of the repository that you can use to automate the build process. The ./configure step only needs to be run once. You can build the CPython binary by running:

$ make -j2 -s

The -j2 flag allows make to run 2 jobs simultaneously. If you have 4 cores, you can change this to 4. The -s flag stops the Makefile from printing every command it runs to the console. You can remove this, but the output is very verbose.
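If you'd rather not hard-code the job count, you can derive it from the machine. The following is a minimal sketch: make_command() is a hypothetical helper written for this example, not part of the CPython build tooling.

```python
import os


def make_command(jobs=None, silent=True):
    """Build the argument list for make, defaulting to one job per CPU core."""
    if jobs is None:
        # os.cpu_count() can return None, so fall back to a single job
        jobs = os.cpu_count() or 1
    command = ["make", f"-j{jobs}"]
    if silent:
        # -s stops make from echoing every command it runs
        command.append("-s")
    return command


print(" ".join(make_command(jobs=2)))
```

You could pass the resulting list to subprocess.run() from the root of the repository, or simply copy the printed command into your terminal.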

During the build, you may receive some errors, and in the summary, it will notify you that not all packages could be built. For example, _dbm, _sqlite3, _uuid, nis, ossaudiodev, spwd, and _tkinter would fail to build with this set of instructions. That's okay if you aren't planning on developing against those packages. If you are, then check out the dev guide website for more information.

The build will take a few minutes and generate a binary called python.exe. Every time you make changes to the source code, you will need to re-run make with the same flags. The python.exe binary is the debug binary of CPython. Execute python.exe to see a working REPL:

$ ./python.exe
Python 3.8.0b3 (tags/v3.8.0b3:4336222407, Aug 21 2019, 10:00:03) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

Note: Yes, that's right, the macOS build has a .exe file extension. This is not because it's a Windows binary. macOS has a case-insensitive filesystem, and the developers didn't want people working with the binary to accidentally refer to the directory Python/, so .exe was appended to avoid ambiguity. If you later run make install or make altinstall, it will rename the file back to python.
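If you're curious whether your own filesystem ignores case, you can check from Python. This is a small sketch using only the standard library; is_case_insensitive() is a hypothetical helper written for this example:

```python
import os
import tempfile


def is_case_insensitive(directory):
    """Return True if names differing only by case refer to the same file."""
    with tempfile.NamedTemporaryFile(prefix="CaseCheck", dir=directory) as handle:
        head, tail = os.path.split(handle.name)
        # If the lowercased name also resolves, the filesystem ignores case
        return os.path.exists(os.path.join(head, tail.lower()))


with tempfile.TemporaryDirectory() as scratch:
    print(is_case_insensitive(scratch))
```

On a filesystem where this prints True, a binary called python and a directory called Python could not sit side by side in the repository root.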

Compiling CPython (Linux)

For Linux, the first step is to download and install make, gcc, configure, and pkgconfig.

For Fedora Core, RHEL, CentOS, or other yum-based systems:

$ sudo yum install yum-utils

For Debian, Ubuntu, or other apt-based systems:

$ sudo apt install build-essential

Then install the required packages, for Fedora Core, RHEL, CentOS or other yum-based systems:

$ sudo yum-builddep python3

For Debian, Ubuntu, or other apt-based systems:

$ sudo apt install libssl-dev zlib1g-dev libncurses5-dev \
  libncursesw5-dev libreadline-dev libsqlite3-dev libgdbm-dev \
  libdb5.3-dev libbz2-dev libexpat1-dev liblzma-dev libffi-dev

Now that you have the dependencies, you can run the configure script, enabling the debug hooks --with-pydebug:

$ ./configure --with-pydebug

Review the output to ensure that OpenSSL support was marked as YES. Otherwise, check with your distribution for instructions on installing the headers for OpenSSL.

Next, you can build the CPython binary by running the generated Makefile:

$ make -j2 -s

During the build, you may receive some errors, and in the summary, it will notify you that not all packages could be built. That's okay if you aren't planning on developing against those packages. If you are, then check out the dev guide website for more information.

The build will take a few minutes and generate a binary called python. This is the debug binary of CPython. Execute ./python to see a working REPL:

$ ./python
Python 3.8.0b3 (tags/v3.8.0b3:4336222407, Aug 21 2019, 10:00:03) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

Compiling CPython (Windows)

Inside the PCbuild folder is a Visual Studio solution file for building and exploring CPython. To use this, you need to have Visual Studio installed on your PC.

The newest version of Visual Studio, Visual Studio 2019, makes it easier to work with Python and the CPython source code, so it is recommended for use in this tutorial. If you already have Visual Studio 2017 installed, that would also work fine.

None of the paid features are required for compiling CPython or this tutorial. You can use the Community edition of Visual Studio, which is available for free from Microsoft's Visual Studio website.

Once you've downloaded the installer, you'll be asked to select which components you want to install. The bare minimum for this tutorial is:

Any other optional features can be deselected if you want to be more conscientious with disk space:

Visual Studio Options Window

The installer will then download and install all of the required components. The installation could take an hour, so you may want to read on and come back to this section.

Once the installer has completed, click the Launch button to start Visual Studio. You will be prompted to sign in. If you have a Microsoft account you can log in, or skip that step.

Once Visual Studio starts, you will be prompted to Open a Project. A shortcut to getting started with the Git configuration and cloning CPython is to choose the Clone or check out code option:

Choosing a Project Type in Visual Studio

For the project URL, type https://github.com/python/cpython to clone:

Cloning projects in Visual Studio

Visual Studio will then download a copy of CPython from GitHub using the version of Git bundled with Visual Studio. This step also saves you the hassle of having to install Git on Windows. The download may take 10 minutes.

Once the project has downloaded, you need to point it to the pcbuild Solution file, by clicking on Solutions and Projects and selecting pcbuild.sln:

Selecting a solution

When the solution is loaded, it will prompt you to retarget the projects inside the solution to the version of the C/C++ compiler you have installed. Visual Studio will also target the version of the Windows SDK you have installed.

Ensure that you change the Windows SDK version to the newest installed version and the platform toolset to the latest version. If you missed this window, you can right-click on the Solution in the Solutions and Projects window and click Retarget Solution.

Once this is complete, you need to download some source files to be able to build the whole CPython package. Inside the PCBuild folder there is a .bat file that automates this for you. Open up a command-line prompt inside the downloaded PCBuild and run get_externals.bat:

 > get_externals.bat
Using py -3.7 (found 3.7 with py.exe)
Fetching external libraries...
Fetching bzip2-1.0.6...
Fetching sqlite-3.21.0.0...
Fetching xz-5.2.2...
Fetching zlib-1.2.11...
Fetching external binaries...
Fetching openssl-bin-1.1.0j...
Fetching tcltk-8.6.9.0...
Finished.

Next, back within Visual Studio, build CPython by pressing Ctrl+Shift+B, or choosing Build Solution from the top menu. If you receive any errors about the Windows SDK being missing, make sure you set the right targeting settings in the Retarget Solution window. You should also see Windows Kits inside your Start Menu, and Windows Software Development Kit inside of that menu.

The build stage could take 10 minutes or more the first time. Once the build has completed, you may see a few warnings, which you can ignore.

To start the debug version of CPython, press F5 and CPython will start in Debug mode straight into the REPL:

CPython debugging Windows

Once this is completed, you can run the Release build by changing the build configuration from Debug to Release on the top menu bar and running Build Solution again. You now have both Debug and Release versions of the CPython binary within PCBuild\win32\.

You can set up Visual Studio to be able to open a REPL with either the Release or Debug build by choosing Tools->Python->Python Environments from the top menu:

Choosing Python environments

Then click Add Environment and then target the Debug or Release binary. The Debug binary will end in _d.exe, for example, python_d.exe and pythonw_d.exe. You will most likely want to use the debug binary as it comes with Debugging support in Visual Studio and will be useful for this tutorial.

In the Add Environment window, target the python_d.exe file as the interpreter inside the PCBuild/win32 and the pythonw_d.exe as the windowed interpreter:

Adding an environment in VS2019

Now, you can start a REPL session by clicking Open Interactive Window in the Python Environments window and you will see the REPL for the compiled version of Python:

Python Environment REPL

During this tutorial there will be REPL sessions with example commands. I encourage you to use the Debug binary to run these REPL sessions in case you want to put in any breakpoints within the code.

Lastly, to make it easier to navigate the code, in the Solution View, click on the toggle button next to the Home icon to switch to Folder view:

Switching Environment Mode

Now that you have a version of CPython compiled and ready to go, let's find out how the CPython compiler works.

What Does a Compiler Do?

The purpose of a compiler is to convert one language into another. Think of a compiler like a translator. You would hire a translator to listen to you speaking in English and then speak in Japanese:

Translating from English to Japanese

Some compilers will compile into a low-level machine code which can be executed directly on a system. Other compilers will compile into an intermediary language, to be executed by a virtual machine.

One important decision to make when choosing a compiler is the system portability requirements. Java and .NET CLR will compile into an Intermediary Language so that the compiled code is portable across multiple systems architectures. C, Go, C++, and Pascal will compile into a low-level executable that will only work on systems similar to the one it was compiled on.

Because Python applications are typically distributed as source code, the role of the Python runtime is to convert the Python source code and execute it in one step. Internally, the CPython runtime does compile your code. A popular misconception is that Python is an interpreted language. It is actually compiled.

Python code is not compiled into machine code. It is compiled into a special low-level intermediary language called bytecode that only CPython understands. This code is stored in .pyc files in a __pycache__ directory and cached for execution. If you run the same Python application twice without changing the source code, it will usually start faster the second time. This is because CPython loads the cached bytecode and executes it directly instead of compiling the source again.
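You can watch the compile step from Python itself. The snippet below is a minimal sketch using the built-in compile() function and the dis and importlib.util modules; the source string and the file name example.py are made up for illustration:

```python
import dis
import importlib.util

source = "total = 1 + 2"

# compile() turns source text into a code object without running it
code = compile(source, "<example>", "exec")

# The raw bytecode lives in the co_code attribute as a bytes object
print(type(code.co_code))

# dis prints the bytecode as readable instructions
dis.dis(code)

# The location CPython would use to cache bytecode for a module file
print(importlib.util.cache_from_source("example.py"))
```

The exact instructions printed by dis vary between Python versions, which is one reason the cached file name includes a version tag.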

Why Is CPython Written in C and Not Python?

The C in CPython is a reference to the C programming language, implying that this Python distribution is written in the C language.

This statement is largely true: the compiler in CPython is written in pure C. However, many of the standard library modules are written in pure Python or a combination of C and Python.

So why is CPython written in C and not Python?

The answer is located in how compilers work. There are two types of compiler:

  1. Self-hosted compilers are compilers written in the language they compile, such as the Go compiler.
  2. Source-to-source compilers are compilers written in another language that already has a compiler.

If you're writing a new programming language from scratch, you need an executable application to compile your compiler! You need a compiler to execute anything, so when new languages are developed, they're often written first in an older, more established language.

A good example would be the Go programming language. The first Go compiler was written in C, then once Go could be compiled, the compiler was rewritten in Go.

CPython kept its C heritage: many of the standard library modules, like the ssl module or the socket module, are written in C to access low-level operating system APIs. The APIs in the Windows and Linux kernels for creating network sockets, working with the filesystem or interacting with the display are all written in C. It made sense for Python's extensibility layer to be focused on the C language. Later in this article, we will cover the Python Standard Library and the C modules.
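You can see this split between Python and C from a running interpreter. As a rough sketch, the snippet below asks the import system where the socket module and its C counterpart, _socket, come from. The exact paths depend on your build:

```python
import importlib.util

# socket is a Python wrapper module; the low-level work happens in _socket
wrapper = importlib.util.find_spec("socket")
backend = importlib.util.find_spec("_socket")

# The wrapper's origin is a Python source file
print(wrapper.origin)

# The backend's origin is either "built-in" or a compiled extension file
print(backend.origin)
```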

There is a Python compiler written in Python called PyPy. PyPy's logo is an Ouroboros to represent the self-hosting nature of the compiler.

Another example of a cross-compiler for Python is Jython. Jython is written in Java and compiles from Python source code into Java bytecode. In the same way that CPython makes it easy to import C libraries and use them from Python, Jython makes it easy to import and reference Java modules and classes.

The Python Language Specification

Contained within the CPython source code is the definition of the Python language. This is the reference specification used by all the Python interpreters.

The specification is in both human-readable and machine-readable format. Inside the documentation is a detailed explanation of the Python language, what is allowed, and how each statement should behave.

Documentation

Located inside the Doc/reference directory are reStructuredText explanations of each of the features in the Python language. This forms the official Python reference guide on docs.python.org.

Inside the directory are the files you need to understand the whole language, structure, and keywords:

cpython/Doc/reference
│
├── compound_stmts.rst
├── datamodel.rst
├── executionmodel.rst
├── expressions.rst
├── grammar.rst
├── import.rst
├── index.rst
├── introduction.rst
├── lexical_analysis.rst
├── simple_stmts.rst
└── toplevel_components.rst

Inside compound_stmts.rst, the documentation for compound statements, you can see a simple example defining the with statement.

The with statement can be used in multiple ways in Python, the simplest being the instantiation of a context-manager and a nested block of code:

with x():
   ...

You can assign the result to a variable using the as keyword:

with x() as y:
   ...

You can also chain context managers together with a comma:

with x() as y, z() as jk:
   ...
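To try these forms yourself, you need something that behaves as a context manager. Here is a minimal sketch using contextlib.contextmanager, where managed() and the events list are hypothetical names that exist only to record the order of entry and exit:

```python
from contextlib import contextmanager

events = []


@contextmanager
def managed(name):
    """Hypothetical context manager that records when it is entered and exited."""
    events.append(f"enter {name}")
    try:
        # The yielded value is what the as keyword binds to
        yield name.upper()
    finally:
        events.append(f"exit {name}")


# Two context managers chained with a comma, each bound with as
with managed("x") as y, managed("z") as jk:
    events.append(f"body {y} {jk}")

print(events)
```

The managers are entered left to right and exited in reverse order, just as if the with statements had been nested.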

Next, we'll explore the computer-readable documentation of the Python language.

Grammar

The documentation contains the human-readable specification of the language, and the machine-readable specification is housed in a single file, Grammar/Grammar.

The Grammar file is written in a context-notation called Backus-Naur Form (BNF). BNF is not specific to Python and is often used as the notation for grammars in many other languages.

The concept of grammatical structure in a programming language is inspired by Noam Chomsky's work on Syntactic Structures in the 1950s!

Python's grammar file uses the Extended-BNF (EBNF) specification with regular-expression syntax. So, in the grammar file you can use:

If you search for the with statement in the grammar file, at around line 80 you'll see the definitions for the with statement:

with_stmt: 'with' with_item (',' with_item)*  ':' suite
with_item: test ['as' expr]

Anything in quotes is a string literal, which is how keywords are defined. So the with_stmt is specified as:

  1. Starting with the word with
  2. Followed by a with_item, which is a test and, optionally, the word as and an expression
  3. Followed by zero or more further with_items, each preceded by a comma
  4. Ending with a :
  5. Followed by a suite
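The structure described by these steps is visible after parsing, too. As a small sketch, the standard library ast module can parse a with statement and show one entry per with_item (the names x, y, z, and jk are only placeholders):

```python
import ast

tree = ast.parse("with x() as y, z() as jk:\n    pass\n")
with_node = tree.body[0]

# Each with_item in the grammar becomes one entry in the With node's items
for item in with_node.items:
    manager = item.context_expr.func.id  # the test part, here a call
    target = item.optional_vars.id       # the optional as target
    print(manager, "as", target)
```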

There are references to some other definitions in these two lines:

If you want to explore those in detail, the whole of the Python grammar is defined in this single file.

If you want to see a recent example of how grammar is used, in PEP 572 the colon equals operator was added to the grammar file in this Git commit.

Using pgen

The grammar file itself is never used by the Python compiler. Instead, a parser table created by a tool called pgen is used. pgen reads the grammar file and converts it into a parser table. If you make changes to the grammar file, you must regenerate the parser table and recompile Python.

Note: The pgen application was rewritten in Python 3.8 from C to pure Python.

To see pgen in action, let's change part of the Python grammar. Around line 51 you will see the definition of a pass statement:

pass_stmt: 'pass'

Change that line to accept either 'pass' or 'proceed' as the keyword:

pass_stmt: 'pass' | 'proceed'

Now you need to rebuild the grammar files. On macOS and Linux, run make regen-grammar to run pgen over the altered grammar file. For Windows, there is no officially supported way of running pgen. However, you can clone my fork and run build.bat --regen from within the PCBuild directory.

You should see an output similar to this, showing that the new Include/graminit.h and Python/graminit.c files have been generated:

# Regenerate Doc/library/token-list.inc from Grammar/Tokens
# using Tools/scripts/generate_token.py
...
python3 ./Tools/scripts/update_file.py ./Include/graminit.h ./Include/graminit.h.new
python3 ./Tools/scripts/update_file.py ./Python/graminit.c ./Python/graminit.c.new

Note: pgen works by converting the EBNF statements into a Non-deterministic Finite Automaton (NFA), which is then turned into a Deterministic Finite Automaton (DFA). The DFAs are used by the parser as parsing tables in a special way that's unique to CPython. This technique was formed at Stanford University and developed in the 1980s, just before the advent of Python.
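To get a feel for the NFA-to-DFA step, here is a toy subset construction in Python. It is only an illustration of the general technique and is not how pgen is implemented: the NFA table, the state numbers, and the helper names are all made up for the rule pass_stmt: 'pass' | 'proceed'.

```python
EPSILON = None  # label for transitions that consume no token

# Hypothetical NFA for the rule  pass_stmt: 'pass' | 'proceed'
# Each state maps a label to the list of states it can move to
NFA = {
    0: {EPSILON: [1, 3]},
    1: {"pass": [2]},
    2: {EPSILON: [5]},
    3: {"proceed": [4]},
    4: {EPSILON: [5]},
}
NFA_START, NFA_ACCEPT = 0, 5


def epsilon_closure(states):
    """Return every NFA state reachable from states without consuming a token."""
    closure, stack = set(states), list(states)
    while stack:
        for target in NFA.get(stack.pop(), {}).get(EPSILON, []):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def build_dfa():
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = epsilon_closure({NFA_START})
    dfa, pending = {}, [start]
    while pending:
        current = pending.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        labels = {
            label
            for state in current
            for label in NFA.get(state, {})
            if label is not EPSILON
        }
        for label in labels:
            targets = {t for s in current for t in NFA.get(s, {}).get(label, [])}
            dfa[current][label] = epsilon_closure(targets)
            pending.append(dfa[current][label])
    return start, dfa


def accepts(tokens):
    """Walk the DFA over a token sequence and report whether it is accepted."""
    state, dfa = build_dfa()
    for tok in tokens:
        if tok not in dfa[state]:
            return False
        state = dfa[state][tok]
    return NFA_ACCEPT in state


print(accepts(["pass"]), accepts(["proceed"]), accepts(["pass", "pass"]))
```

The DFA has no epsilon moves, so following it is a simple table lookup per token, which is what makes it suitable as a parsing table.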

With the regenerated parser tables, you need to recompile CPython to see the new syntax. Use the same compilation steps you used earlier for your operating system.

If the code compiled successfully, you can execute your new CPython binary and start a REPL.

In the REPL, you can now try defining a function and instead of using the pass statement, use the proceed keyword alternative that you compiled into the Python grammar:

Python 3.8.0b3 (tags/v3.8.0b3:4336222407, Aug 21 2019, 10:00:03) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def example():
...    proceed
... 
>>> example()

Well done! You've changed the CPython syntax and compiled your own version of CPython. Ship it!

Next, we'll explore tokens and their relationship to grammar.

Tokens

Alongside the grammar file in the Grammar folder is a Tokens file, which contains each of the unique types found as a leaf node in a parse tree. We will cover parse trees in depth later. Each token also has a name and a generated unique ID. The names are used to make it simpler to refer to in the tokenizer.

Note: The Tokens file is a new feature in Python 3.8.

For example, the left parenthesis is called LPAR, and semicolons are called SEMI. You'll see these tokens later in the article:

LPAR                    '('
RPAR                    ')'
LSQB                    '['
RSQB                    ']'
COLON                   ':'
COMMA                   ','
SEMI                    ';'

As with the Grammar file, if you change the Tokens file, you need to run pgen again.
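The generated names and IDs are exposed to Python code through the token module in the standard library. A short sketch, assuming Python 3.8 or later for EXACT_TOKEN_TYPES:

```python
import token

# Each token name maps to a numeric ID, and tok_name maps the ID back
for name in ("LPAR", "RPAR", "COLON", "SEMI"):
    token_id = getattr(token, name)
    print(name, token_id, token.tok_name[token_id])

# EXACT_TOKEN_TYPES maps the literal text of a token to its ID
print(token.EXACT_TOKEN_TYPES["("] == token.LPAR)
```

The numeric IDs are generated and can differ between Python versions, so code should refer to tokens by name.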

To see tokens in action, you can use the tokenize module in CPython. Create a simple Python script called test_tokens.py:

# Hello world!
def my_function():
   proceed

For the rest of this tutorial, ./python.exe will refer to the compiled version of CPython. However, the actual command will depend on your system.

For Windows:

 > python.exe

For Linux:

 > ./python

For macOS:

 > ./python.exe

Then pass this file through a module built into the standard library called tokenize. You will see the list of tokens, by line and character. Use the -e flag to output the exact token name:

$ ./python.exe -m tokenize -e test_tokens.py

0,0-0,0:            ENCODING       'utf-8'        
1,0-1,14:           COMMENT        '# Hello world!'
1,14-1,15:          NL             '\n'           
2,0-2,3:            NAME           'def'          
2,4-2,15:           NAME           'my_function'  
2,15-2,16:          LPAR           '('            
2,16-2,17:          RPAR           ')'            
2,17-2,18:          COLON          ':'            
2,18-2,19:          NEWLINE        '\n'           
3,0-3,3:            INDENT         '   '          
3,3-3,10:           NAME           'proceed'      
3,10-3,11:          NEWLINE        '\n'           
4,0-4,0:            DEDENT         ''             
4,0-4,0:            ENDMARKER      ''              

In the output, the first column is the range of the line/column coordinates, the second column is the name of the token, and the final column is the value of the token.

In the output, the tokenize module has implied some tokens that were not in the file: the ENCODING token for utf-8, and a blank line at the end, which gives a DEDENT to close the function declaration and an ENDMARKER to end the file.

It is best practice to have a blank line at the end of your Python source files. If you omit it, CPython adds it for you, with a tiny performance penalty.

The tokenize module is written in pure Python and is located in Lib/tokenize.py within the CPython source code.
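Because it is a regular Python module, you can also call it from your own code instead of the command line. This sketch tokenizes the same source from a string. Note that generate_tokens() works on text, so it does not emit the ENCODING token:

```python
import io
import token
import tokenize

source = "# Hello world!\ndef my_function():\n   proceed\n"

# generate_tokens() takes a readline callable that returns str lines
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # exact_type resolves generic OP tokens to names such as LPAR or COLON
    print(tok.start, tok.end, token.tok_name[tok.exact_type], repr(tok.string))
```

This works on a stock Python build as well, because the pure-Python tokenizer treats proceed like any other NAME token.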

Important: There are two tokenizers in the CPython source code: one written in Python, demonstrated here, and another written in C. The tokenizer written in Python is meant as a utility, and the one written in C is used by the Python compiler. They have identical output and behavior. The version written in C is designed for performance and the module in Python is designed for debugging.

To see a verbose readout of the C tokenizer, you can run Python with the -d flag. Using the test_tokens.py script you created earlier, run it with the following:

$ ./python.exe -d test_tokens.py

Token NAME/'def' ... It's a keyword
 DFA 'file_input', state 0: Push 'stmt'
 DFA 'stmt', state 0: Push 'compound_stmt'
 DFA 'compound_stmt', state 0: Push 'funcdef'
 DFA 'funcdef', state 0: Shift.
Token NAME/'my_function' ... It's a token we know
 DFA 'funcdef', state 1: Shift.
Token LPAR/'(' ... It's a token we know
 DFA 'funcdef', state 2: Push 'parameters'
 DFA 'parameters', state 0: Shift.
Token RPAR/')' ... It's a token we know
 DFA 'parameters', state 1: Shift.
  DFA 'parameters', state 2: Direct pop.
Token COLON/':' ... It's a token we know
 DFA 'funcdef', state 3: Shift.
Token NEWLINE/'' ... It's a token we know
 DFA 'funcdef', state 5: [switch func_body_suite to suite] Push 'suite'
 DFA 'suite', state 0: Shift.
Token INDENT/'' ... It's a token we know
 DFA 'suite', state 1: Shift.
Token NAME/'proceed' ... It's a keyword
 DFA 'suite', state 3: Push 'stmt'
...
  ACCEPT.

In the output, you can see that it highlighted proceed as a keyword. In the next chapter, we'll see how executing the Python binary gets to the tokenizer and what happens from there to execute your code.

Now that you have an overview of the Python grammar and the relationship between tokens and statements, you can also convert the pgen output into an interactive graph.

Here is a screenshot of the Python 3.8a2 grammar:

Python 3.8 DFA node graph

The Python package used to generate this graph, instaviz, will be covered in a later chapter.

Memory Management in CPython

Throughout this article, you will see references to a PyArena object. The arena is one of CPython's memory management structures. The code is within Python/pyarena.c and contains a wrapper around C's memory allocation and deallocation functions.

In a traditionally written C program, the developer should allocate memory for data structures before writing into that data. This allocation marks the memory as belonging to the process with the operating system.

It is also up to the developer to deallocate, or "free," the allocated memory when it's no longer being used and return it to the operating system's block table of free memory. If a process allocates memory for a variable, say within a function or loop, when that function has completed, the memory is not automatically given back to the operating system in C. So if it hasn't been explicitly deallocated in the C code, it causes a memory leak. The process will continue to take more memory each time that function runs until eventually, the system runs out of memory, and crashes!

Python takes that responsibility away from the programmer and uses two algorithms: a reference counter and a garbage collector.

Whenever an interpreter is instantiated, a PyArena is created and attached to one of the fields in the interpreter. During the lifecycle of a CPython interpreter, many arenas could be allocated. They are connected with a linked list. The arena stores a list of pointers to Python Objects as a PyListObject. Whenever a new Python object is created, a pointer to it is added using PyArena_AddPyObject(). This function call stores a pointer in the arena's list, a_objects.

The PyArena serves a second function, which is to allocate and reference a list of raw memory blocks. For example, a PyList would need extra memory if you added thousands of additional values. The PyList object's C code does not allocate memory directly. The object gets raw blocks of memory from the PyArena by calling PyArena_Malloc() from the PyObject with the required memory size. This task is completed by another abstraction in Objects/obmalloc.c. In the object allocation module, memory can be allocated, freed, and reallocated for a Python Object.

A linked list of allocated blocks is stored inside the arena, so that when an interpreter is stopped, all managed memory blocks can be deallocated in one go using PyArena_Free().

Take the PyListObject example. If you were to .append() an object to the end of a Python list, you don't need to reallocate the memory used in the existing list beforehand. The .append() method calls list_resize() which handles memory allocation for lists. Each list object keeps a record of the amount of memory allocated. If the item you're appending will fit inside the existing free memory, it is simply added. If the list needs more memory space, it is expanded. The allocated length grows in the pattern 0, 4, 8, 16, 25, 35, 46, 58, 72, 88.
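That growth pattern can be reproduced in a few lines of Python. The sketch below re-expresses the over-allocation rule as I understand it from list_resize() in the 3.8 source (other versions use a different formula); the function names are made up for this example:

```python
def next_allocation(newsize):
    """Over-allocation rule, re-expressed from list_resize() in the 3.8 source."""
    return newsize + (newsize >> 3) + (3 if newsize < 9 else 6)


def growth_pattern(steps):
    """Return the successive allocated sizes when appending one item at a time."""
    allocated = 0
    pattern = [allocated]
    while len(pattern) < steps:
        # A resize only happens when the list is full and one more item arrives
        allocated = next_allocation(allocated + 1)
        pattern.append(allocated)
    return pattern


print(growth_pattern(10))
```

Growing by roughly an eighth of the new size each time means that a long run of appends triggers only occasional reallocations.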

PyMem_Realloc() is called to expand the memory allocated in a list. PyMem_Realloc() is an API wrapper for pymalloc_realloc().

Python also has a special wrapper for the C call malloc(), which sets the max size of the memory allocation to help prevent buffer overflow errors (See PyMem_RawMalloc()).

In summary:

More information on the API is detailed on the CPython documentation.

Reference Counting

To create a variable in Python, you have to assign a value to a uniquely named variable:

my_variable = 180392

Whenever a value is assigned to a variable in Python, the name of the variable is checked within the locals and globals scope to see if it already exists.

Because my_variable is not already within the locals() or globals() dictionary, this new object is created, and the value is assigned as being the numeric constant 180392.

There is now one reference to my_variable, so the reference counter for my_variable is incremented by 1.

You will see function calls Py_INCREF() and Py_DECREF() throughout the C source code for CPython. These functions increment and decrement the count of references to that object.

References to an object are decremented when a variable falls outside of the scope in which it was declared. Scope in Python can refer to a function or method, a comprehension, or a lambda function. These are some of the more literal scopes, but there are many other implicit scopes, like passing variables to a function call.

The handling of incrementing and decrementing references based on the language is built into the CPython compiler and the core execution loop, ceval.c, which we will cover in detail later in this article.

Whenever Py_DECREF() is called, and the counter becomes 0, the PyObject_Free() function is called. For that object PyArena_Free() is called for all of the memory that was allocated.
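
You can watch the counter move from Python with sys.getrefcount(). This is a minimal sketch with a hypothetical helper name. The absolute numbers depend on the interpreter version, so only the differences between the readings are meaningful:

```python
import sys


def refcount_changes():
    """Return the reference count of an object before, during, and after aliasing."""
    obj = object()                    # a fresh object owned by one local name
    before = sys.getrefcount(obj)     # the reading itself may add a temporary reference
    alias = obj                       # binding a second name increments the count
    during = sys.getrefcount(obj)
    del alias                         # removing the name decrements it again
    after = sys.getrefcount(obj)
    return before, during, after


before, during, after = refcount_changes()
print(before, during, after)
```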

Garbage Collection

How often does your garbage get collected? Weekly, or fortnightly?

When you're finished with something, you discard it and throw it in the trash. But that trash won't get collected straight away. You need to wait for the garbage trucks to come and pick it up.

CPython has the same principle, using a garbage collection algorithm. CPython's garbage collector is enabled by default, happens in the background and works to deallocate memory that's been used for objects which are no longer in use.

Because the garbage collection algorithm is a lot more complex than the reference counter, it doesn't happen all the time. Otherwise, it would consume a huge amount of CPU resources. It happens periodically, after a set number of operations.

CPython's standard library comes with a Python module to interface with the arena and the garbage collector, the gc module. Here's how to use the gc module in debug mode:

>>> import gc
>>> gc.set_debug(gc.DEBUG_STATS)

This will print the statistics whenever the garbage collector is run.

You can get the threshold after which the garbage collector is run by calling get_threshold():

>>> gc.get_threshold()
(700, 10, 10)

You can also get the current threshold counts:

>>> gc.get_count()
(688, 1, 1)

Lastly, you can run the collection algorithm manually:

>>> gc.collect()
24

This will call collect() inside the Modules/gcmodule.c file which contains the implementation of the garbage collector algorithm.
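
Reference counting alone cannot free objects that refer to each other, which is the case the garbage collector exists for. The sketch below builds such a cycle and lets gc.collect() clean it up. The Node class and collect_cycle() helper are hypothetical names used only for this illustration:

```python
import gc
import weakref


class Node:
    """A minimal object that can point at another Node."""

    def __init__(self):
        self.other = None


def collect_cycle():
    was_enabled = gc.isenabled()
    gc.disable()                      # keep automatic collections out of the way
    try:
        gc.collect()                  # start from a clean slate
        a, b = Node(), Node()
        a.other, b.other = b, a       # a <-> b reference cycle
        probe = weakref.ref(a)        # observe a without keeping it alive
        del a, b                      # the counts never reach 0 because of the cycle
        alive_before = probe() is not None
        found = gc.collect()          # number of unreachable objects found
        alive_after = probe() is not None
        return alive_before, found, alive_after
    finally:
        if was_enabled:
            gc.enable()


alive_before, found, alive_after = collect_cycle()
print(alive_before, found, alive_after)
```

The exact number returned by gc.collect() varies between versions, because instance dictionaries may or may not be counted separately.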

Conclusion

In Part 1, you covered the structure of the source code repository, how to compile from source, and the Python language specification. These core concepts will be critical in Part 2 as you dive deeper into the Python interpreter process.

Part 2: The Python Interpreter Process

Now that you've seen the Python grammar and memory management, you can follow the process from typing python to the part where your code is executed.

There are five ways the python binary can be called:

  1. To run a single command with -c and a Python command
  2. To start a module with -m and the name of a module
  3. To run a file with the filename
  4. To run the stdin input using a shell pipe
  5. To start the REPL and execute commands one at a time
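
The first four modes can be compared side by side by launching child interpreters from Python. This is a sketch that assumes sys.executable points at a working interpreter. The module name hello_mod and the helper names are hypothetical, and the REPL is left out because it needs an interactive terminal:

```python
import os
import subprocess
import sys
import tempfile

SOURCE = "print('hi')\n"


def run(args, **kwargs):
    """Run the current interpreter with `args` and return its stripped stdout."""
    result = subprocess.run([sys.executable, *args], capture_output=True,
                            text=True, check=True, **kwargs)
    return result.stdout.strip()


def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "hello_mod.py")   # hypothetical module name
        with open(path, "w") as f:
            f.write(SOURCE)
        return {
            "-c": run(["-c", SOURCE]),                    # command string
            "-m": run(["-m", "hello_mod"], cwd=folder),   # module found via sys.path
            "file": run([path]),                          # script path
            "stdin": run([], input=SOURCE),               # piped standard input
        }


outputs = demo()
print(outputs)
```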

The three source files you need to inspect to see this process are:

  1. Programs/python.c is a simple entry point.
  2. Modules/main.c contains the code to bring together the whole process, loading configuration, executing code and clearing up memory.
  3. Python/initconfig.c loads the configuration from the system environment and merges it with any command-line flags.

This diagram shows how each of those functions is called:

Python run swim lane diagram

The execution mode is determined from the configuration.

The CPython source code style:

There is an official style guide for the CPython C code, designed originally in 2001 and updated for modern versions.

There are some naming standards which help when navigating the source code:

  • Use a Py prefix for public functions, never for static functions. The Py_ prefix is reserved for global service routines like Py_FatalError. Specific groups of routines (like specific object type APIs) use a longer prefix, such as PyString_ for string functions.

  • Public functions and variables use MixedCase with underscores, like this: PyObject_GetAttr, Py_BuildValue, PyExc_TypeError.

  • Occasionally an "internal" function has to be visible to the loader. We use the _Py prefix for this, for example, _PyObject_Dump.

  • Macros should have a MixedCase prefix and then use upper case, for example PyString_AS_STRING, Py_PRINT_RAW.

Establishing Runtime Configuration

Python run swim lane diagram

In the swimlanes, you can see that before any Python code is executed, the runtime first establishes the configuration. The configuration of the runtime is a data structure defined in Include/cpython/initconfig.h named PyConfig.

The configuration data structure includes things like:

The configuration data is primarily used by the CPython runtime to enable and disable various features.

Python also comes with several Command Line Interface Options. In Python you can enable verbose mode with the -v flag. In verbose mode, Python will print messages to the screen when modules are loaded:

$ ./python.exe -v -c "print('hello world')"


# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
...

You will see a hundred lines or more with all the imports of your user site-packages and anything else in the system environment.

You can see the definition of this flag within Include/cpython/initconfig.h inside the struct for PyConfig:

/* --- PyConfig ---------------------------------------------- */

typedef struct {
    int _config_version;  /* Internal configuration version,
                             used for ABI compatibility */
    int _config_init;     /* _PyConfigInitEnum value */

    ...

    /* If greater than 0, enable the verbose mode: print a message each time a
       module is initialized, showing the place (filename or built-in module)
       from which it is loaded.

       If greater or equal to 2, print a message for each file that is checked
       for when searching for a module. Also provides information on module
       cleanup at exit.

       Incremented by the -v option. Set by the PYTHONVERBOSE environment
       variable. If set to -1 (default), inherit Py_VerboseFlag value. */
    int verbose;

In Python/initconfig.c, the logic for reading settings from environment variables and runtime command-line flags is established.

In the config_read_env_vars function, the environment variables are read and used to assign the values for the configuration settings:

static PyStatus
config_read_env_vars(PyConfig *config)
{
    PyStatus status;
    int use_env = config->use_environment;

    /* Get environment variables */
    _Py_get_env_flag(use_env, &config->parser_debug, "PYTHONDEBUG");
    _Py_get_env_flag(use_env, &config->verbose, "PYTHONVERBOSE");
    _Py_get_env_flag(use_env, &config->optimization_level, "PYTHONOPTIMIZE");
    _Py_get_env_flag(use_env, &config->inspect, "PYTHONINSPECT");

For the verbose setting, you can see that the value of PYTHONVERBOSE is used to set the value of &config->verbose, if PYTHONVERBOSE is found. If the environment variable does not exist, then the default value of -1 will remain.

Then in config_parse_cmdline within initconfig.c again, the command-line flag is used to set the value, if provided:

static PyStatus
config_parse_cmdline(PyConfig *config, PyWideStringList *warnoptions,
                     Py_ssize_t *opt_index)
{
...

        switch (c) {
...

        case 'v':
            config->verbose++;
            break;
...
        /* This space reserved for other options */

        default:
            /* unknown argument: parsing failed */
            config_usage(1, program);
            return _PyStatus_EXIT(2);
        }
    } while (1);

This value is later copied to a global variable Py_VerboseFlag by the _Py_GetGlobalVariablesAsDict function.
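
Because each -v increments the counter, the effect is easy to observe through sys.flags.verbose in a child interpreter. This is a sketch with a hypothetical helper name. It removes PYTHONVERBOSE from the child's environment so that only the command-line flags are measured:

```python
import os
import subprocess
import sys


def verbose_level(*flags):
    """Start a child interpreter with `flags` and report its sys.flags.verbose."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONVERBOSE"}
    result = subprocess.run(
        [sys.executable, *flags, "-c", "import sys; print(sys.flags.verbose)"],
        capture_output=True, text=True, check=True, env=env)
    # The import trace is written to stderr, so stdout ends with the printed value.
    return int(result.stdout.strip().splitlines()[-1])


levels = [verbose_level(), verbose_level("-v"), verbose_level("-v", "-v")]
print(levels)
```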

Within a Python session, you can access the runtime flags, such as verbose mode and quiet mode, using the sys.flags named tuple. The -X flags are all available inside the sys._xoptions dictionary:

$ ./python.exe -X dev -q       

>>> import sys
>>> sys.flags
sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, 
 no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, 
 quiet=1, hash_randomization=1, isolated=0, dev_mode=True, utf8_mode=0)

>>> sys._xoptions
{'dev': True}

As well as the runtime configuration in initconfig.h, there is also the build configuration, which is located inside pyconfig.h in the root folder. This file is created dynamically in the configure step in the build process, or by Visual Studio for Windows systems.

You can see the build configuration by running:

$ ./python.exe -m sysconfig

Reading Files/Input

Once CPython has the runtime configuration and the command-line arguments, it can establish what it needs to execute.

This task is handled by the pymain_main function inside Modules/main.c. Depending on the newly created config instance, CPython will now execute code provided via several options.

Input via -c

The simplest is providing CPython a command with the -c option and a Python program inside quotes.

For example:

$ ./python.exe -c "print('hi')"
hi

Here is the full flowchart of how this happens:

Flow chart of pymain_run_command

First, the pymain_run_command() function is executed inside Modules/main.c taking the command passed in -c as an argument in the C type wchar_t*. The wchar_t* type is often used as a low-level storage type for Unicode data across CPython as the size of the type can store UTF8 characters.

When converting the wchar_t* to a Python string, the Objects/unicodeobject.c file has a helper function PyUnicode_FromWideChar() that returns a PyObject, of type str. The encoding to UTF8 is then done by PyUnicode_AsUTF8String() on the Python str object to convert it to a Python bytes object.

Once this is complete, pymain_run_command() will then pass the Python bytes object to PyRun_SimpleStringFlags() for execution, but first converting the bytes to a str type again:

static int
pymain_run_command(wchar_t *command, PyCompilerFlags *cf)
{
    PyObject *unicode, *bytes;
    int ret;

    unicode = PyUnicode_FromWideChar(command, -1);
    if (unicode == NULL) {
        goto error;
    }

    if (PySys_Audit("cpython.run_command", "O", unicode) < 0) {
        return pymain_exit_err_print();
    }

    bytes = PyUnicode_AsUTF8String(unicode);
    Py_DECREF(unicode);
    if (bytes == NULL) {
        goto error;
    }

    ret = PyRun_SimpleStringFlags(PyBytes_AsString(bytes), cf);
    Py_DECREF(bytes);
    return (ret != 0);

error:
    PySys_WriteStderr("Unable to decode the command from the command line:\n");
    return pymain_exit_err_print();
}

The conversion of wchar_t* to Unicode, bytes, and then a string is roughly equivalent to the following:

unicode = str(command)
bytes_ = bytes(unicode.encode('utf8'))
# call PyRun_SimpleStringFlags with bytes_

The PyRun_SimpleStringFlags() function is part of Python/pythonrun.c. Its purpose is to turn this simple command into a Python module and then send it on to be executed. Since a Python module needs to have __main__ to be executed as a standalone module, it creates that automatically:

int
PyRun_SimpleStringFlags(const char *command, PyCompilerFlags *flags)
{
    PyObject *m, *d, *v;
    m = PyImport_AddModule("__main__");
    if (m == NULL)
        return -1;
    d = PyModule_GetDict(m);
    v = PyRun_StringFlags(command, Py_file_input, d, d, flags);
    if (v == NULL) {
        PyErr_Print();
        return -1;
    }
    Py_DECREF(v);
    return 0;
}

Once PyRun_SimpleStringFlags() has created a module and a dictionary, it calls PyRun_StringFlags(), which creates a fake filename and then calls the Python parser to create an AST from the string and return a module, mod:

PyObject *
PyRun_StringFlags(const char *str, int start, PyObject *globals,
                  PyObject *locals, PyCompilerFlags *flags)
{
...
    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod != NULL)
        ret = run_mod(mod, filename, globals, locals, flags, arena);
    PyArena_Free(arena);
    return ret;
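
The same flow can be approximated in pure Python. This is a rough analogue rather than CPython's implementation: run_simple_string() is a hypothetical helper, and it creates a private module named __main__ instead of reusing the one registered in sys.modules, so that the caller's namespace is left untouched:

```python
import traceback
import types


def run_simple_string(command, module=None):
    """Rough Python analogue of PyRun_SimpleStringFlags()."""
    if module is None:
        module = types.ModuleType("__main__")    # stand-in for PyImport_AddModule()
    namespace = module.__dict__                  # like PyModule_GetDict(m)
    try:
        code = compile(command, "<string>", "exec")
        exec(code, namespace, namespace)         # globals and locals are the same dict
    except Exception:
        traceback.print_exc()                    # like PyErr_Print()
        return -1, module
    return 0, module


status, main = run_simple_string("answer = 6 * 7")
print(status, main.answer)
```

Passing the same dictionary as both globals and locals mirrors the d, d arguments in the C code above.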

You'll dive into the AST and Parser code in the next section.

Input via -m

Another way to execute Python commands is by using the -m option with the name of a module. A typical example is python -m unittest to run the unittest module in the standard library.

Being able to execute modules as scripts was initially proposed in PEP 338, and the standard for explicit relative imports was then defined in PEP 366.

The use of the -m flag implies that within the module package, you want to execute whatever is inside __main__. It also implies that you want to search sys.path for the named module.

This search mechanism is why you don't need to remember where the unittest module is stored on your filesystem.

Inside Modules/main.c there is a function called when the command-line is run with the -m flag. The name of the module is passed as the modname argument.

CPython will then import a standard library module, runpy and execute it using PyObject_Call(). The import is done using the C API function PyImport_ImportModule(), found within the Python/import.c file:

static int
pymain_run_module(const wchar_t *modname, int set_argv0)
{
    PyObject *module, *runpy, *runmodule, *runargs, *result;
    runpy = PyImport_ImportModule("runpy");
 ...
    runmodule = PyObject_GetAttrString(runpy, "_run_module_as_main");
 ...
    module = PyUnicode_FromWideChar(modname, wcslen(modname));
 ...
    runargs = Py_BuildValue("(Oi)", module, set_argv0);
 ...
    result = PyObject_Call(runmodule, runargs, NULL);
 ...
    if (result == NULL) {
        return pymain_exit_err_print();
    }
    Py_DECREF(result);
    return 0;
}

In this function you'll also see 2 other C API functions: PyObject_Call() and PyObject_GetAttrString(). Because PyImport_ImportModule() returns a PyObject*, the core object type, you need to call special functions to get attributes and to call it.

In Python, if you had an object and wanted to get an attribute, then you could call getattr(). In the C API, this call is PyObject_GetAttrString(), which is found in Objects/object.c. If you wanted to run a callable, you would give it parentheses, or you could call the __call__() method on any Python object. The __call__() method is implemented inside Objects/object.c:

hi = "hi!"
hi.upper() == hi.upper.__call__()  # this is the same

The runpy module is written in pure Python and located in Lib/runpy.py.

Executing python -m <module> is equivalent to running python -m runpy <module>. The runpy module was created to abstract the process of locating and executing modules on an operating system.

runpy does a few things to run the target module:

The runpy module also supports executing directories and zip files.
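
You can call runpy directly to see what the -m switch relies on. The sketch below writes a small module to a temporary folder and executes it with runpy.run_module(). The module name demo_mod and the helper run_as_main() are hypothetical:

```python
import os
import runpy
import sys
import tempfile


def run_as_main(source, name="demo_mod"):         # hypothetical module name
    """Write `source` to a temporary module and run it like `python -m name`."""
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, name + ".py"), "w") as f:
            f.write(source)
        sys.path.insert(0, folder)                # make the module discoverable
        try:
            # run_name="__main__" makes the module see itself as the main script
            return runpy.run_module(name, run_name="__main__")
        finally:
            sys.path.remove(folder)


result = run_as_main("who = __name__\nvalue = 21 * 2\n")
print(result["who"], result["value"])
```

run_module() returns the globals of the executed module, which makes it convenient for inspecting what the module defined.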

Input via Filename

If the first argument to python was a filename, such as python test.py, then CPython will open a file handle, similar to using open() in Python and pass the handle to PyRun_SimpleFileExFlags() inside Python/pythonrun.c.

There are 3 paths this function can take:

  1. If the file path is a .pyc file, it will call run_pyc_file().
  2. If the file path is a script file (.py) it will run PyRun_FileExFlags().
  3. If the filepath is stdin because the user ran command | python then treat stdin as a file handle and run PyRun_FileExFlags().

int
PyRun_SimpleFileExFlags(FILE *fp, const char *filename, int closeit,
                        PyCompilerFlags *flags)
{
 ...
    m = PyImport_AddModule("__main__");
 ...
    if (maybe_pyc_file(fp, filename, ext, closeit)) {
 ...
        v = run_pyc_file(pyc_fp, filename, d, d, flags);
    } else {
        /* When running from stdin, leave __main__.__loader__ alone */
        if (strcmp(filename, "<stdin>") != 0 &&
            set_main_loader(d, filename, "SourceFileLoader") < 0) {
            fprintf(stderr, "python: failed to set __main__.__loader__\n");
            ret = -1;
            goto done;
        }
        v = PyRun_FileExFlags(fp, filename, Py_file_input, d, d,
                              closeit, flags);
    }
 ...
    return ret;
}

Input via File With PyRun_FileExFlags()

For stdin and basic script files, CPython will pass the file handle to PyRun_FileExFlags() located in the pythonrun.c file.

The purpose of PyRun_FileExFlags() is similar to PyRun_SimpleStringFlags() used for the -c input. CPython will load the file handle into PyParser_ASTFromFileObject(). We'll cover the Parser and AST modules in the next section. Because this is a full script, it doesn't need the PyImport_AddModule("__main__"); step used by -c:

PyObject *
PyRun_FileExFlags(FILE *fp, const char *filename_str, int start, PyObject *globals,
                  PyObject *locals, int closeit, PyCompilerFlags *flags)
{
 ...
    mod = PyParser_ASTFromFileObject(fp, filename, NULL, start, 0, 0,
 ...
    ret = run_mod(mod, filename, globals, locals, flags, arena);
}

Identical to PyRun_SimpleStringFlags(), once PyRun_FileExFlags() has created a Python module from the file, it sends it to run_mod() to be executed.

run_mod() is found within Python/pythonrun.c, and sends the module to the AST to be compiled into a code object. Code objects are a format used to store the bytecode operations and the format kept in .pyc files:

static PyObject *
run_mod(mod_ty mod, PyObject *filename, PyObject *globals, PyObject *locals,
            PyCompilerFlags *flags, PyArena *arena)
{
    PyCodeObject *co;
    PyObject *v;
    co = PyAST_CompileObject(mod, filename, flags, -1, arena);
    if (co == NULL)
        return NULL;

    if (PySys_Audit("exec", "O", co) < 0) {
        Py_DECREF(co);
        return NULL;
    }

    v = run_eval_code_obj(co, globals, locals);
    Py_DECREF(co);
    return v;
}

We will cover the CPython compiler and bytecodes in the next section. The call to run_eval_code_obj() is a simple wrapper function that calls PyEval_EvalCode() in the Python/ceval.c file. The PyEval_EvalCode() function is the main evaluation loop for CPython. It iterates over each bytecode statement and executes it on your local machine.
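
The same three steps (source to AST, AST to code object, code object to evaluation) are available from Python through ast.parse(), compile(), and exec(). This is a sketch of the pipeline, not the C implementation, and run_source() is a hypothetical helper:

```python
import ast
import types


def run_source(source, filename="<example>"):
    """Parse, compile, and evaluate `source`; return the code object and globals."""
    tree = ast.parse(source, filename=filename, mode="exec")   # source -> AST
    code = compile(tree, filename, "exec")                     # AST -> code object
    namespace = {}
    exec(code, namespace)                                      # evaluate the bytecode
    return code, namespace


code, namespace = run_source("total = sum(range(5))")
print(type(code).__name__, len(code.co_code), namespace["total"])
```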

Input via Compiled Bytecode With run_pyc_file()

In the PyRun_SimpleFileExFlags() there was a clause for the user providing a file path to a .pyc file. If the file path ended in .pyc then instead of loading the file as a plain text file and parsing it, it will assume that the .pyc file contains a code object written to disk.

The run_pyc_file() function inside Python/pythonrun.c then marshals the code object from the .pyc file by using the file handle. Marshaling is a technical term for copying the contents of a file into memory and converting them to a specific data structure. The code object data structure on the disk is the CPython compiler's way of caching compiled code so that it doesn't need to parse it every time the script is called:

static PyObject *
run_pyc_file(FILE *fp, const char *filename, PyObject *globals,
             PyObject *locals, PyCompilerFlags *flags)
{
    PyCodeObject *co;
    PyObject *v;
  ...
    v = PyMarshal_ReadLastObjectFromFile(fp);
  ...
    if (v == NULL || !PyCode_Check(v)) {
        Py_XDECREF(v);
        PyErr_SetString(PyExc_RuntimeError,
                   "Bad code object in .pyc file");
        goto error;
    }
    fclose(fp);
    co = (PyCodeObject *)v;
    v = run_eval_code_obj(co, globals, locals);
    if (v && flags)
        flags->cf_flags |= (co->co_flags & PyCF_MASK);
    Py_DECREF(co);
    return v;
}

Once the code object has been marshaled to memory, it is sent to run_eval_code_obj(), which calls Python/ceval.c to execute the code.
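
The marshal module exposes the same mechanism to Python code. The sketch below compiles a small script to a .pyc file, skips the file header, and loads the code object back. It assumes the 16-byte header used since Python 3.7, and the file name cached.py and both helper names are hypothetical:

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile
import types


def load_code_from_pyc(pyc_path):
    """Read a .pyc file and unmarshal the code object stored after its header."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)            # assumed header size for Python 3.7+
        if header[:4] != importlib.util.MAGIC_NUMBER:
            raise ValueError("pyc was written by a different Python version")
        return marshal.load(f)


def roundtrip(source):
    """Compile `source` to a .pyc file, load it back, and execute it."""
    with tempfile.TemporaryDirectory() as folder:
        src = os.path.join(folder, "cached.py")        # hypothetical file name
        with open(src, "w") as f:
            f.write(source)
        pyc = py_compile.compile(src, cfile=src + "c", doraise=True)
        code = load_code_from_pyc(pyc)
    namespace = {}
    exec(code, namespace)              # no parsing needed, the bytecode is ready
    return code, namespace


code, namespace = roundtrip("x = 3 * 4\n")
print(type(code).__name__, namespace["x"])
```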

Lexing and Parsing

In the exploration of reading and executing Python files, we dived as deep as the parser and AST modules, with function calls to PyParser_ASTFromFileObject().

Sticking within Python/pythonrun.c, the PyParser_ASTFromFileObject() function will take a file handle, compiler flags and a PyArena instance and convert the file object into a node object using PyParser_ParseFileObject().

With the node object, it will then convert that into a module using the AST function PyAST_FromNodeObject():

mod_ty
PyParser_ASTFromFileObject(FILE *fp, PyObject *filename, const char* enc,
                           int start, const char *ps1,
                           const char *ps2, PyCompilerFlags *flags, int *errcode,
                           PyArena *arena)
{
    ...
    node *n = PyParser_ParseFileObject(fp, filename, enc,
                                       &_PyParser_Grammar,
                                       start, ps1, ps2, &err, &iflags);
    ...
    if (n) {
        flags->cf_flags |= iflags & PyCF_MASK;
        mod = PyAST_FromNodeObject(n, flags, filename, arena);
        PyNode_Free(n);
    ...
    return mod;
}

For PyParser_ParseFileObject() we switch to Parser/parsetok.c and the parser-tokenizer stage of the CPython interpreter. This function has two important tasks:

  1. Instantiate a tokenizer state tok_state using PyTokenizer_FromFile() in Parser/tokenizer.c
  2. Convert the tokens into a concrete parse tree (a list of node) using parsetok() in Parser/parsetok.c

node *
PyParser_ParseFileObject(FILE *fp, PyObject *filename,
                         const char *enc, grammar *g, int start,
                         const char *ps1, const char *ps2,
                         perrdetail *err_ret, int *flags)
{
    struct tok_state *tok;
...
    if ((tok = PyTokenizer_FromFile(fp, enc, ps1, ps2)) == NULL) {
        err_ret->error = E_NOMEM;
        return NULL;
    }
...
    return parsetok(tok, g, start, err_ret, flags);
}

tok_state (defined in Parser/tokenizer.h) is the data structure to store all temporary data generated by the tokenizer. It is returned to the parser-tokenizer as the data structure is required by parsetok() to develop the concrete syntax tree.

Inside parsetok(), it will use the tok_state structure and make calls to tok_get() in a loop until the file is exhausted and no more tokens can be found.

tok_get(), defined in Parser/tokenizer.c, behaves like an iterator. It will keep returning the next token in the parse tree.

tok_get() is one of the most complex functions in the whole CPython codebase. It has over 640 lines and includes decades of heritage with edge cases, new language features, and syntax.

One of the simpler examples would be the part that converts a newline break into a NEWLINE token:

static int
tok_get(struct tok_state *tok, char **p_start, char **p_end)
{
...
    /* Newline */
    if (c == '\n') {
        tok->atbol = 1;
        if (blankline || tok->level > 0) {
            goto nextline;
        }
        *p_start = tok->start;
        *p_end = tok->cur - 1; /* Leave '\n' out of the string */
        tok->cont_line = 0;
        if (tok->async_def) {
            /* We're somewhere inside an 'async def' function, and
               we've encountered a NEWLINE after its signature. */
            tok->async_def_nl = 1;
        }
        return NEWLINE;
    }
...
}

In this case, NEWLINE is a token, with a value defined in Include/token.h. All tokens are constant int values, and the Include/token.h file was generated earlier when we ran make regen-grammar.
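
The standard library's tokenize module gives a view of the token stream from Python. Depending on the Python version it may be a separate implementation from the C tokenizer shown above, so treat this as an illustration of the output rather than of tok_get() itself. The helper token_names() is hypothetical:

```python
import io
import token
import tokenize


def token_names(source):
    """Return (token name, text) pairs for each token found in `source`."""
    readline = io.StringIO(source).readline
    return [(token.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)]


names = token_names("a + 1\n")
for name, text in names:
    print(name, repr(text))

# Token types are plain integers, as in Include/token.h
print(token.NEWLINE, isinstance(token.NEWLINE, int))
```

Note that tokenize reports operators with the generic OP type, whereas the C tokenizer distinguishes them with specific tokens such as PLUS.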

The node type returned by PyParser_ParseFileObject() is going to be essential for the next stage, converting a parse tree into an Abstract-Syntax-Tree (AST):

typedef struct _node {
    short               n_type;
    char                *n_str;
    int                 n_lineno;
    int                 n_col_offset;
    int                 n_nchildren;
    struct _node        *n_child;
    int                 n_end_lineno;
    int                 n_end_col_offset;
} node;

Since the CST is a tree of syntax, token IDs, and symbols, it would be difficult for the compiler to make quick decisions based on the Python language.

That is why the next stage is to convert the CST into an AST, a much higher-level structure. This task is performed by the Python/ast.c module, which has both a C and Python API.

Before you jump into the AST, there is a way to access the output from the parser stage. CPython has a standard library module parser, which exposes the C functions with a Python API.

The module is documented as an implementation detail of CPython, so you won't see it in other Python interpreters. Also, the output from the functions is not that easy to read.

The output will be in the numeric form, using the token and symbol numbers generated by the make regen-grammar stage, stored in Include/token.h and Include/symbol.h:

>>> from pprint import pprint
>>> import parser
>>> st = parser.expr('a + 1')
>>> pprint(parser.st2list(st))
[258,
 [332,
  [306,
   [310,
    [311,
     [312,
      [313,
       [316,
        [317,
         [318,
          [319,
           [320,
            [321, [322, [323, [324, [325, [1, 'a']]]]]],
            [14, '+'],
            [321, [322, [323, [324, [325, [2, '1']]]]]]]]]]]]]]]]],
 [4, ''],
 [0, '']]

To make it easier to understand, you can take all the numbers in the symbol and token modules, put them into a dictionary and recursively replace the values in the output of parser.st2list() with the names:

import symbol
import token
import parser

def lex(expression):
    symbols = {v: k for k, v in symbol.__dict__.items() if isinstance(v, int)}
    tokens = {v: k for k, v in token.__dict__.items() if isinstance(v, int)}
    lexicon = {**symbols, **tokens}
    st = parser.expr(expression)
    st_list = parser.st2list(st)

    def replace(l: list):
        r = []
        for i in l:
            if isinstance(i, list):
                r.append(replace(i))
            else:
                if i in lexicon:
                    r.append(lexicon[i])
                else:
                    r.append(i)
        return r

    return replace(st_list)

You can run lex() with a simple expression, like a + 1, to see how this is represented as a parse tree:

>>> from pprint import pprint
>>> pprint(lex('a + 1'))

['eval_input',
 ['testlist',
  ['test',
   ['or_test',
    ['and_test',
     ['not_test',
      ['comparison',
       ['expr',
        ['xor_expr',
         ['and_expr',
          ['shift_expr',
           ['arith_expr',
            ['term',
             ['factor', ['power', ['atom_expr', ['atom', ['NAME', 'a']]]]]],
            ['PLUS', '+'],
            ['term',
             ['factor',
              ['power', ['atom_expr', ['atom', ['NUMBER', '1']]]]]]]]]]]]]]]]],
 ['NEWLINE', ''],
 ['ENDMARKER', '']]

In the output, you can see the symbols in lowercase, such as 'test' and the tokens in uppercase, such as 'NUMBER'.

Abstract Syntax Trees

The next stage in the CPython interpreter is to convert the CST generated by the parser into something more logical that can be executed. The structure is a higher-level representation of the code, called an Abstract Syntax Tree (AST).

ASTs are produced inline with the CPython interpreter process, but you can also generate them from Python, using the ast module in the Standard Library, and through the C API.

Before diving into the C implementation of the AST, it would be useful to understand what an AST looks like for a simple piece of Python code.

To do this, here's a simple app called instaviz for this tutorial. It displays the AST and bytecode instructions (which we'll cover later) in a Web UI.

To install instaviz:

$ pip install instaviz

Then, open up a REPL by running python at the command line with no arguments:

>>> import instaviz
>>> def example():
       a = 1
       b = a + 1
       return b

>>> instaviz.show(example)

You'll see a notification on the command-line that a web server has started on port 8080. If you were using that port for something else, you can change it by calling instaviz.show(example, port=9090) or another port number.

In the web browser, you can see the detailed breakdown of your function:

Instaviz screenshot

The bottom left graph is the function you declared in REPL, represented as an Abstract Syntax Tree. Each node in the tree is an AST type. They are found in the ast module, and all inherit from _ast.AST.

Some of the nodes have properties which link them to child nodes, unlike the CST, which has a generic child node property.

For example, if you click on the Assign node in the center, this links to the line b = a + 1:

Instaviz screenshot 2

It has two properties:

  1. targets is a list of names to assign. It is a list because you can assign to multiple variables with a single expression using unpacking.
  2. value is the value to assign, which in this case is a BinOp statement, a + 1.

If you click on the BinOp statement, it shows the properties of relevance:

Instaviz screenshot 3
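
If you prefer to inspect the same nodes without a browser, the ast module can parse the function and expose the Assign and BinOp properties directly. This is a small sketch, and describe_assign() is a hypothetical helper:

```python
import ast

SOURCE = """
def example():
    a = 1
    b = a + 1
    return b
"""


def describe_assign(source, index):
    """Summarise the Assign node at position `index` inside the first function."""
    func = ast.parse(source).body[0]              # the FunctionDef node
    node = func.body[index]                       # one statement in its body
    return {
        "node": type(node).__name__,
        "targets": [target.id for target in node.targets],
        "value": type(node.value).__name__,
        "op": type(getattr(node.value, "op", None)).__name__,
    }


summary = describe_assign(SOURCE, 1)              # the line b = a + 1
print(summary)
```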

Compiling an AST in C is not a straightforward task, so the Python/ast.c module is over 5000 lines of code.

There are a few entry points, forming part of the AST's public API. In the last section on the lexer and parser, you stopped when you'd reached the call to PyAST_FromNodeObject(). By this stage, the Python interpreter process had created a CST in the format of node * tree.

Jumping then into PyAST_FromNodeObject() inside Python/ast.c, you can see it receives the node * tree, the filename, compiler flags, and the PyArena.

The return type from this function is mod_ty, defined in Include/Python-ast.h. mod_ty is a container structure for one of the 5 module types in Python:

  1. Module
  2. Interactive
  3. Expression
  4. FunctionType
  5. Suite
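
From Python, the mode argument of ast.parse() selects which of these root types is produced. The sketch below covers three of them, and root_type() is a hypothetical helper. The FunctionType root is reachable through the func_type mode in Python 3.8 and later, and is left out here:

```python
import ast


def root_type(source, mode):
    """Return the name of the root AST node produced for `source` in `mode`."""
    return type(ast.parse(source, mode=mode)).__name__


roots = {
    "exec": root_type("a = 1\n", "exec"),        # a series of statements
    "eval": root_type("a + 1", "eval"),          # a single expression
    "single": root_type("a + 1\n", "single"),    # one interactive statement
}
print(roots)
```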

In Include/Python-ast.h you can see that an Expression type requires a field body, which is an expr_ty type. The expr_ty type is also defined in Include/Python-ast.h:

enum _mod_kind {Module_kind=1, Interactive_kind=2, Expression_kind=3,
                 FunctionType_kind=4, Suite_kind=5};
struct _mod {
    enum _mod_kind kind;
    union {
        struct {
            asdl_seq *body;
            asdl_seq *type_ignores;
        } Module;

        struct {
            asdl_seq *body;
        } Interactive;

        struct {
            expr_ty body;
        } Expression;

        struct {
            asdl_seq *argtypes;
            expr_ty returns;
        } FunctionType;

        struct {
            asdl_seq *body;
        } Suite;

    } v;
};

The AST types are all listed in Parser/Python.asdl. You will see the module types, statement types, expression types, operators, and comprehensions all listed. The names of the types in this document relate to the classes generated by the AST and the same classes named in the ast standard module library.

The parameters and names in Include/Python-ast.h correlate directly to those specified in Parser/Python.asdl:

-- ASDL's 5 builtin types are:
-- identifier, int, string, object, constant

module Python
{
    mod = Module(stmt* body, type_ignore *type_ignores)
        | Interactive(stmt* body)
        | Expression(expr body)
        | FunctionType(expr* argtypes, expr returns)

The C header file and structures are there so that the Python/ast.c program can quickly generate the structures with pointers to the relevant data.

Looking at PyAST_FromNodeObject() you can see that it is essentially a switch statement around the result from TYPE(n). TYPE() is one of the core functions used by the AST to determine what type a node in the concrete syntax tree is. In the case of PyAST_FromNodeObject() it's just looking at the first node, so it can only be one of the module types defined as Module, Interactive, Expression, FunctionType.

The result of TYPE() will be either a symbol or token type, which we're very familiar with by this stage.

For file_input, the results should be a Module. Modules are a series of statements, of which there are a few types. The logic to traverse the children of n and create statement nodes is within ast_for_stmt(). This function is called either once, if there is only 1 statement in the module, or in a loop if there are many. The resulting Module is then returned with the PyArena.

For eval_input, the result should be an Expression. The result from CHILD(n, 0), which is the first child of n, is passed to ast_for_testlist(), which returns an expr_ty type. This expr_ty is sent to Expression() with the PyArena to create an expression node, and then passed back as a result:

mod_ty
PyAST_FromNodeObject(const node *n, PyCompilerFlags *flags,
                     PyObject *filename, PyArena *arena)
{
    ...
    switch (TYPE(n)) {
        case file_input:
            stmts = _Py_asdl_seq_new(num_stmts(n), arena);
            if (!stmts)
                goto out;
            for (i = 0; i < NCH(n) - 1; i++) {
                ch = CHILD(n, i);
                if (TYPE(ch) == NEWLINE)
                    continue;
                REQ(ch, stmt);
                num = num_stmts(ch);
                if (num == 1) {
                    s = ast_for_stmt(&c, ch);
                    if (!s)
                        goto out;
                    asdl_seq_SET(stmts, k++, s);
                }
                else {
                    ch = CHILD(ch, 0);
                    REQ(ch, simple_stmt);
                    for (j = 0; j < num; j++) {
                        s = ast_for_stmt(&c, CHILD(ch, j * 2));
                        if (!s)
                            goto out;
                        asdl_seq_SET(stmts, k++, s);
                    }
                }
            }

            /* Type ignores are stored under the ENDMARKER in file_input. */
            ...

            res = Module(stmts, type_ignores, arena);
            break;
        case eval_input: {
            expr_ty testlist_ast;

            /* XXX Why not comp_for here? */
            testlist_ast = ast_for_testlist(&c, CHILD(n, 0));
            if (!testlist_ast)
                goto out;
            res = Expression(testlist_ast, arena);
            break;
        }
        case single_input:
            ...
            break;
        case func_type_input:
            ...
        ...
    return res;
}

Inside the ast_for_stmt() function, there is another switch statement for each possible statement type (simple_stmt, compound_stmt, and so on) and the code to determine the arguments to the node class.

One of the simpler functions is for the power expression, i.e., 2**4 is 2 to the power of 4. This function starts by calling ast_for_atom_expr() on the first child, which is the number 2 in our example. If the node has only one child, it returns that atomic expression. If it has more than one child, it gets the right-hand side (the number 4) and returns a BinOp (binary operation) with the operator as Pow (power), the left-hand side e (2), and the right-hand side f (4):

static expr_ty
ast_for_power(struct compiling *c, const node *n)
{
    /* power: atom trailer* ('**' factor)*
     */
    expr_ty e;
    REQ(n, power);
    e = ast_for_atom_expr(c, CHILD(n, 0));
    if (!e)
        return NULL;
    if (NCH(n) == 1)
        return e;
    if (TYPE(CHILD(n, NCH(n) - 1)) == factor) {
        expr_ty f = ast_for_expr(c, CHILD(n, NCH(n) - 1));
        if (!f)
            return NULL;
        e = BinOp(e, Pow, f, LINENO(n), n->n_col_offset,
                  n->n_end_lineno, n->n_end_col_offset, c->c_arena);
    }
    return e;
}

You can see the result of this if you send a short function to the instaviz module:

>>> def foo():
...     2**4
...
>>> import instaviz
>>> instaviz.show(foo)

Instaviz screenshot 4

In the UI you can also see the corresponding properties:

Instaviz screenshot 5
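If instaviz is not available, the same node can be inspected with the standard library alone. This is a small sketch, assuming Python 3.8 or later, where numeric literals are represented as ast.Constant nodes:

```python
import ast

# Parse the power expression on its own and look at the resulting node.
tree = ast.parse("2**4", mode="eval")
power = tree.body

# Prints a BinOp whose operator is Pow, with 2 on the left and 4 on the right.
print(ast.dump(power))
```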

In summary, each statement type and expression has a corresponding ast_for_*() function to create it. The arguments are defined in Parser/Python.asdl and exposed via the ast module in the standard library. If an expression or statement has children, then it will call the corresponding ast_for_* child function in a depth-first traversal.

Conclusion

CPython's versatility and low-level execution API make it the ideal candidate for an embedded scripting engine. You will see CPython used in many UI applications, such as game design, 3D graphics, and system automation.

The interpreter process is flexible and efficient, and now that you understand how it works, you're ready to understand the compiler.

Part 3: The CPython Compiler and Execution Loop

In Part 2, you saw how the CPython interpreter takes an input, such as a file or string, and converts it into a logical Abstract Syntax Tree. We're still not at the stage where this code can be executed. Next, we have to go deeper to convert the Abstract Syntax Tree into a set of sequential commands that the interpreter's execution loop can run.

Compiling

Now the interpreter has an AST with the properties required for each of the operations, functions, classes, and namespaces. It is the job of the compiler to turn the AST into bytecode that the interpreter can execute.

This compilation task is split into 2 parts:

  1. Traverse the tree and create a control-flow-graph, which represents the logical sequence for execution
  2. Convert the nodes in the CFG to smaller, executable statements, known as byte-code

Earlier, we were looking at how files are executed, and the PyRun_FileExFlags() function in Python/pythonrun.c. Inside this function, we converted the FILE handle into a mod, of type mod_ty. This task was completed by PyParser_ASTFromFileObject(), which in turn calls the tokenizer, parser-tokenizer, and then the AST:

PyObject *
PyRun_FileExFlags(FILE *fp, const char *filename_str, int start, PyObject *globals,
                  PyObject *locals, int closeit, PyCompilerFlags *flags)
{
 ...
    mod = PyParser_ASTFromFileObject(fp, filename, NULL, start, 0, 0,
 ...
    ret = run_mod(mod, filename, globals, locals, flags, arena);
}

The resulting module from the call to PyParser_ASTFromFileObject() is sent to run_mod(), still in Python/pythonrun.c. This is a small function that gets a PyCodeObject from PyAST_CompileObject() and sends it on to run_eval_code_obj(). You will tackle run_eval_code_obj() in the next section:

static PyObject *
run_mod(mod_ty mod, PyObject *filename, PyObject *globals, PyObject *locals,
            PyCompilerFlags *flags, PyArena *arena)
{
    PyCodeObject *co;
    PyObject *v;
    co = PyAST_CompileObject(mod, filename, flags, -1, arena);
    if (co == NULL)
        return NULL;

    if (PySys_Audit("exec", "O", co) < 0) {
        Py_DECREF(co);
        return NULL;
    }

    v = run_eval_code_obj(co, globals, locals);
    Py_DECREF(co);
    return v;
}

The PyAST_CompileObject() function is the main entry point to the CPython compiler. It takes a Python module as its primary argument, along with the name of the file, the compiler flags, the optimization level, and the PyArena created earlier in the interpreter process.
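The same hand-off from AST to code object to execution can be reproduced from Python, because the built-in compile() accepts an AST as well as a source string. The sketch below uses a made-up one-line module and a hypothetical filename, demo.py:

```python
import ast
import types

source = "answer = 6 * 7\n"

# Source text -> AST (a Module node, the counterpart of mod_ty).
tree = ast.parse(source, filename="demo.py", mode="exec")

# AST -> code object (the counterpart of the PyCodeObject from the compiler).
code = compile(tree, "demo.py", "exec")

# Code object -> execution, with a fresh dict acting as the globals.
namespace = {}
exec(code, namespace)
print(namespace["answer"])
```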

We're starting to get into the guts of the CPython compiler now, with decades of development and Computer Science theory behind it. Don't be put off by the language. Once we break down the compiler into logical steps, it'll make sense.

Before the compiler starts, a global compiler state is created. This type, compiler, is defined in Python/compile.c and contains properties used by the compiler to remember the compiler flags, the stack, and the PyArena:

struct compiler {
    PyObject *c_filename;
    struct symtable *c_st;
    PyFutureFeatures *c_future; /* pointer to module's __future__ */
    PyCompilerFlags *c_flags;

    int c_optimize;              /* optimization level */
    int c_interactive;           /* true if in interactive mode */
    int c_nestlevel;

    PyObject *c_const_cache;     /* Python dict holding all constants,
                                    including names tuple */
    struct compiler_unit *u; /* compiler state for current block */
    PyObject *c_stack;           /* Python list holding compiler_unit ptrs */
    PyArena *c_arena;            /* pointer to memory allocation arena */
};

Inside PyAST_CompileObject(), there are 11 main steps happening:

  1. Create an empty __doc__ property to the module if it doesn't exist.
  2. Create an empty __annotations__ property to the module if it doesn't exist.
  3. Set the filename of the global compiler state to the filename argument.
  4. Set the memory allocation arena for the compiler to the one used by the interpreter.
  5. Copy any __future__ flags in the module to the future flags in the compiler.
  6. Merge runtime flags provided by the command-line or environment variables.
  7. Enable any __future__ features in the compiler.
  8. Set the optimization level to the provided argument, or default.
  9. Build a symbol table from the module object.
  10. Run the compiler with the compiler state and return the code object.
  11. Free any allocated memory by the compiler.

PyCodeObject *
PyAST_CompileObject(mod_ty mod, PyObject *filename, PyCompilerFlags *flags,
                   int optimize, PyArena *arena)
{
    struct compiler c;
    PyCodeObject *co = NULL;
    PyCompilerFlags local_flags;
    int merged;

    if (!__doc__) {                                                      // 1.
        __doc__ = PyUnicode_InternFromString("__doc__");
        if (!__doc__)
            return NULL;
    }
    if (!__annotations__) {
        __annotations__ = PyUnicode_InternFromString("__annotations__"); // 2.
        if (!__annotations__)
            return NULL;
    }
    if (!compiler_init(&c))
        return NULL;
    Py_INCREF(filename);
    c.c_filename = filename;                                             // 3.
    c.c_arena = arena;                                                   // 4.
    c.c_future = PyFuture_FromASTObject(mod, filename);                  // 5.
    if (c.c_future == NULL)
        goto finally;
    if (!flags) {
        local_flags.cf_flags = 0;
        local_flags.cf_feature_version = PY_MINOR_VERSION;
        flags = &local_flags;
    }
    merged = c.c_future->ff_features | flags->cf_flags;                  // 6.
    c.c_future->ff_features = merged;                                    // 7.
    flags->cf_flags = merged;
    c.c_flags = flags;
    c.c_optimize = (optimize == -1) ? Py_OptimizeFlag : optimize;        // 8.
    c.c_nestlevel = 0;

    if (!_PyAST_Optimize(mod, arena, c.c_optimize)) {
        goto finally;
    }

    c.c_st = PySymtable_BuildObject(mod, filename, c.c_future);          // 9.
    if (c.c_st == NULL) {
        if (!PyErr_Occurred())
            PyErr_SetString(PyExc_SystemError, "no symtable");
        goto finally;
    }

    co = compiler_mod(&c, mod);                                          // 10.

 finally:
    compiler_free(&c);                                                   // 11.
    assert(co || PyErr_Occurred());
    return co;
}

Future Flags and Compiler Flags

Before the compiler runs, there are two types of flags to toggle the features inside the compiler. These come from two places:

  1. The interpreter state, which may have been command-line options, set in pyconfig.h or via environment variables
  2. The use of __future__ statements inside the actual source code of the module

To distinguish the two types of flags, remember that the __future__ flags are required because of the syntax or features used in that specific module. For example, Python 3.7 introduced delayed evaluation of type hints through the annotations future flag:

from __future__ import annotations

The code after this statement might use unresolved type hints, so the __future__ statement is required. Otherwise, the module wouldn't import. It would be unmaintainable to manually request that the person importing the module enable this specific compiler flag.
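The effect of this future flag can be observed without editing a module, because compile() accepts the same flag through its flags argument. This is a sketch under the assumption of Python 3.7 or later. The function and the two type names in the source string are invented, and neither type is defined anywhere:

```python
import __future__

source = (
    "def greet(user: NotDefinedYet) -> AlsoNotDefined:\n"
    "    return user\n"
)

# Pass the compiler flag that `from __future__ import annotations` would set.
flag = __future__.annotations.compiler_flag
code = compile(source, "<future-demo>", "exec", flags=flag, dont_inherit=True)

namespace = {}
exec(code, namespace)

# With the flag set, annotations are kept as strings instead of being evaluated.
annotations = namespace["greet"].__annotations__
print(annotations)
```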

The other compiler flags are specific to the environment, so they might change the way the code executes or the way the compiler runs, but they shouldn't link to the source in the same way that __future__ statements do.

One example of a compiler flag would be the -O flag for optimizing the use of assert statements. This flag disables any assert statements, which may have been put in the code for debugging purposes. It can also be enabled with the PYTHONOPTIMIZE=1 environment variable setting.
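The optimization level can also be set per call through the optimize argument of the built-in compile(), which makes the effect on assert statements easy to see. In this sketch, runs_cleanly() is a helper invented for illustration:

```python
source = "assert False, 'debug check failed'"


def runs_cleanly(optimize_level):
    """Compile the snippet at the given level and report whether it ran without error."""
    code = compile(source, "<flags-demo>", "exec", optimize=optimize_level)
    try:
        exec(code, {})
    except AssertionError:
        return False
    return True


with_asserts = runs_cleanly(0)     # level 0 keeps assert statements
without_asserts = runs_cleanly(1)  # level 1 behaves like -O and drops them
print(with_asserts, without_asserts)
```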

Symbol Tables

In PyAST_CompileObject() there was a reference to a symtable and a call to PySymtable_BuildObject() with the module to be executed.

The purpose of the symbol table is to provide a list of namespaces, globals, and locals for the compiler to use for referencing and resolving scopes.

The symtable structure in Include/symtable.h is well documented, so it's clear what each of the fields is for. There should be one symtable instance for the compiler, so namespacing becomes essential.

If you create a function called resolve_names() in one module and declare another function with the same name in another module, you want to be sure which one is called. The symtable serves this purpose, as well as ensuring that variables declared within a narrow scope don't automatically become globals (after all, this isn't JavaScript):

struct symtable {
    PyObject *st_filename;          /* name of file being compiled,
                                       decoded from the filesystem encoding */
    struct _symtable_entry *st_cur; /* current symbol table entry */
    struct _symtable_entry *st_top; /* symbol table entry for module */
    PyObject *st_blocks;            /* dict: map AST node addresses
                                     *       to symbol table entries */
    PyObject *st_stack;             /* list: stack of namespace info */
    PyObject *st_global;            /* borrowed ref to st_top->ste_symbols */
    int st_nblocks;                 /* number of blocks used. kept for
                                       consistency with the corresponding
                                       compiler structure */
    PyObject *st_private;           /* name of current class or NULL */
    PyFutureFeatures *st_future;    /* module's future features that affect
                                       the symbol table */
    int recursion_depth;            /* current recursion depth */
    int recursion_limit;            /* recursion limit */
};

Some of the symbol table API is exposed via the symtable module in the standard library. You can provide a string with a Python expression and a compile_type of "eval", or the source of a module, function, or class and a compile_type of "exec", and receive a symtable.SymbolTable instance.

Looping over the elements in the table we can see some of the public and private fields and their types:

>>> import symtable
>>> s = symtable.symtable('b + 1', filename='test.py', compile_type='eval')
>>> [symbol.__dict__ for symbol in s.get_symbols()]
[{'_Symbol__name': 'b', '_Symbol__flags': 6160, '_Symbol__scope': 3, '_Symbol__namespaces': ()}]
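The public methods on the table and its symbols are usually easier to read than the private fields. The sketch below uses a made-up module source with one function to show how the table separates parameters, locals, and globals:

```python
import symtable

source = """
counter = 0

def bump(step):
    total = counter + step
    return total
"""

module_table = symtable.symtable(source, "demo.py", "exec")

# Each function or class body gets its own child table (namespace).
function_table = module_table.get_children()[0]

for name in ("step", "total", "counter"):
    symbol = function_table.lookup(name)
    print(name, symbol.is_parameter(), symbol.is_local(), symbol.is_global())
```

Here counter is only read inside bump(), so the symbol table resolves it as a global rather than a local.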

The C code behind this is all within Python/symtable.c and the primary interface is the PySymtable_BuildObject() function.

Similar to the top-level AST function we covered earlier, the PySymtable_BuildObject() function switches between the mod_ty possible types (Module, Expression, Interactive, Suite, FunctionType), and visits each of the statements inside them.

Remember, mod_ty is an AST instance, so the function will now recursively explore the nodes and branches of the tree and add entries to the symtable:

struct symtable *
PySymtable_BuildObject(mod_ty mod, PyObject *filename, PyFutureFeatures *future)
{
    struct symtable *st = symtable_new();
    asdl_seq *seq;
    int i;
    PyThreadState *tstate;
    int recursion_limit = Py_GetRecursionLimit();
...
    st->st_top = st->st_cur;
    switch (mod->kind) {
    case Module_kind:
        seq = mod->v.Module.body;
        for (i = 0; i < asdl_seq_LEN(seq); i++)
            if (!symtable_visit_stmt(st,
                        (stmt_ty)asdl_seq_GET(seq, i)))
                goto error;
        break;
    case Expression_kind:
        ...
    case Interactive_kind:
        ...
    case Suite_kind:
        ...
    case FunctionType_kind:
        ...
    }
    ...
}

So for a module, PySymtable_BuildObject() will loop through each statement in the module and call symtable_visit_stmt(). The symtable_visit_stmt() function is a huge switch statement with a case for each statement type (defined in Parser/Python.asdl).

For each statement type, there is specific logic to that statement type. For example, a function definition has particular logic for:

  1. If the recursion depth is beyond the limit, raise a recursion depth error
  2. The name of the function to be added as a local variable
  3. The default values for sequential arguments to be resolved
  4. The default values for keyword arguments to be resolved
  5. Any annotations for the arguments or the return type are resolved
  6. Any function decorators are resolved
  7. The code block with the contents of the function is visited in symtable_enter_block()
  8. The arguments are visited
  9. The body of the function is visited

Note: If you've ever wondered why Python's default arguments are mutable, the reason is in this function. You can see they are a pointer to the variable in the symtable. No extra work is done to copy any values to an immutable type.
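The behavior the note refers to can be seen from Python. This is only an illustration of the observable effect, not of the C code: the default value is created once, when the def statement runs, and the same object is reused on every call. The function remember() is invented for the example:

```python
def remember(item, seen=[]):
    # The list bound to `seen` is created once, at definition time.
    seen.append(item)
    return seen


first = remember("a")
second = remember("b")

# Both calls returned the same list object, which is stored on the function.
print(first is second, remember.__defaults__)
```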

static int
symtable_visit_stmt(struct symtable *st, stmt_ty s)
{
    if (++st->recursion_depth > st->recursion_limit) {                          // 1.
        PyErr_SetString(PyExc_RecursionError,
                        "maximum recursion depth exceeded during compilation");
        VISIT_QUIT(st, 0);
    }
    switch (s->kind) {
    case FunctionDef_kind:
        if (!symtable_add_def(st, s->v.FunctionDef.name, DEF_LOCAL))            // 2.
            VISIT_QUIT(st, 0);
        if (s->v.FunctionDef.args->defaults)                                    // 3.
            VISIT_SEQ(st, expr, s->v.FunctionDef.args->defaults);
        if (s->v.FunctionDef.args->kw_defaults)                                 // 4.
            VISIT_SEQ_WITH_NULL(st, expr, s->v.FunctionDef.args->kw_defaults);
        if (!symtable_visit_annotations(st, s, s->v.FunctionDef.args,           // 5.
                                        s->v.FunctionDef.returns))
            VISIT_QUIT(st, 0);
        if (s->v.FunctionDef.decorator_list)                                    // 6.
            VISIT_SEQ(st, expr, s->v.FunctionDef.decorator_list);
        if (!symtable_enter_block(st, s->v.FunctionDef.name,                    // 7.
                                  FunctionBlock, (void *)s, s->lineno,
                                  s->col_offset))
            VISIT_QUIT(st, 0);
        VISIT(st, arguments, s->v.FunctionDef.args);                            // 8.
        VISIT_SEQ(st, stmt, s->v.FunctionDef.body);                             // 9.
        if (!symtable_exit_block(st, s))
            VISIT_QUIT(st, 0);
        break;
    case ClassDef_kind: {
        ...
    }
    case Return_kind:
        ...
    case Delete_kind:
        ...
    case Assign_kind:
        ...
    case AnnAssign_kind:
        ...

Once the resulting symtable has been created, it is sent back to be used for the compiler.

Core Compilation Process

Now that the PyAST_CompileObject() has a compiler state, a symtable, and a module in the form of the AST, the actual compilation can begin.

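The purpose of the core compiler is to carry out the two-part task described at the start of this section: build a control-flow graph from the AST and convert its nodes into bytecode.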

You can call the CPython compiler in Python code by calling the built-in function compile(). It returns a code object instance:

>>> co = compile('b+1', 'test.py', mode='eval')
>>> co
<code object <module> at 0x10f222780, file "test.py", line 1>

The same as with the symtable() function, a simple expression should have a mode of 'eval' and a module, function, or class should have a mode of 'exec'.

The compiled code can be found in the co_code property of the code object:

>>> co.co_code
b'e\x00d\x00\x17\x00S\x00'

There is also a dis module in the standard library, which disassembles the bytecode instructions and can print them on the screen or give you a list of Instruction instances.

If you import dis and give the dis() function the code object's co_code property it disassembles it and prints the instructions on the REPL:

>>> import dis
>>> dis.dis(co.co_code)
          0 LOAD_NAME                0 (0)
          2 LOAD_CONST               0 (0)
          4 BINARY_ADD
          6 RETURN_VALUE

LOAD_NAME, LOAD_CONST, BINARY_ADD, and RETURN_VALUE are all bytecode instructions. They're called bytecode because, in binary form, they were a byte long. However, since Python 3.6 the storage format was changed to a word, so now they're technically wordcode, not bytecode.

The full list of bytecode instructions is available for each version of Python, and it does change between versions. For example, in Python 3.7, some new bytecode instructions were introduced to speed up execution of specific method calls.
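Because the instruction set changes between versions, the exact listing above will differ on newer interpreters. The sketch below only relies on properties that have held since Python 3.6: the expression loads the name b, ends with RETURN_VALUE, and is stored in 2-byte units. It uses dis.get_instructions() to get Instruction instances instead of printed output:

```python
import dis

co = compile("b + 1", "test.py", mode="eval")

# Collect the opcode names instead of printing a disassembly.
opnames = [instruction.opname for instruction in dis.get_instructions(co)]
print(opnames)

# The code object can be evaluated once a value for `b` is supplied.
result = eval(co, {"b": 41})
print(result)
```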

In an earlier section, we explored the instaviz package. This included a visualization of the code object type by running the compiler. It also displays the bytecode operations inside the code objects.

Execute instaviz again to see the code object and bytecode for a function defined on the REPL:

>>> import instaviz
>>> def example():
...     a = 1
...     b = a + 1
...     return b
...
>>> instaviz.show(example)

Let's now jump into compiler_mod(), a function used to switch to different compiler functions depending on the module type. We'll assume that mod is a Module. The module is compiled into the compiler state and then assemble() is run to create a PyCodeObject.

The new code object is returned to PyAST_CompileObject() and sent on for execution:

static PyCodeObject *
compiler_mod(struct compiler *c, mod_ty mod)
{
    PyCodeObject *co;
    int addNone = 1;
    static PyObject *module;
    ...
    switch (mod->kind) {
    case Module_kind:
        if (!compiler_body(c, mod->v.Module.body)) {
            compiler_exit_scope(c);
            return 0;
        }
        break;
    case Interactive_kind:
        ...
    case Expression_kind:
        ...
    case Suite_kind:
        ...
    ...
    co = assemble(c, addNone);
    compiler_exit_scope(c);
    return co;
}

The compiler_body() function has some optimization flags and then loops over each statement in the module and visits it, similar to how the symtable functions worked:

static int
compiler_body(struct compiler *c, asdl_seq *stmts)
{
    int i = 0;
    stmt_ty st;
    PyObject *docstring;
    ...
    for (; i < asdl_seq_LEN(stmts); i++)
        VISIT(c, stmt, (stmt_ty)asdl_seq_GET(stmts, i));
    return 1;
}

Each statement is fetched from the sequence with asdl_seq_GET() and cast to stmt_ty. The statement's type is then read from its kind field when it is visited.

Through some smart macros, VISIT calls a function in Python/compile.c for each statement type:

#define VISIT(C, TYPE, V) {\
    if (!compiler_visit_ ## TYPE((C), (V))) \
        return 0; \
}

For a stmt (the category for a statement) the compiler will then drop into compiler_visit_stmt() and switch through all of the potential statement types found in Parser/Python.asdl:

static int
compiler_visit_stmt(struct compiler *c, stmt_ty s)
{
    Py_ssize_t i, n;

    /* Always assign a lineno to the next instruction for a stmt. */
    c->u->u_lineno = s->lineno;
    c->u->u_col_offset = s->col_offset;
    c->u->u_lineno_set = 0;

    switch (s->kind) {
    case FunctionDef_kind:
        return compiler_function(c, s, 0);
    case ClassDef_kind:
        return compiler_class(c, s);
    ...
    case For_kind:
        return compiler_for(c, s);
    ...
    }

    return 1;
}
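The standard library's ast.NodeVisitor uses a similar dispatch idea at the Python level: it looks up a visit_<ClassName> method for each node, much as the VISIT macro and the switch on s->kind select a compiler_*() function. The sketch below is an analogy, not a description of the C implementation. StatementRecorder and the sample source are invented for illustration:

```python
import ast


class StatementRecorder(ast.NodeVisitor):
    """Record the function definitions and for loops in the order they are visited."""

    def __init__(self):
        self.seen = []

    def visit_FunctionDef(self, node):
        self.seen.append(("function", node.name))
        self.generic_visit(node)  # keep descending into the function body

    def visit_For(self, node):
        self.seen.append(("for", node.target.id))
        self.generic_visit(node)


source = """
def total(items):
    result = 0
    for item in items:
        result += item
    return result
"""

recorder = StatementRecorder()
recorder.visit(ast.parse(source))
print(recorder.seen)
```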

As an example, let's focus on the For statement, which in Python is:

for i in iterable:
    # block
else:  # optional, runs if the loop finishes without a break
    # block

If the statement is a For type, it calls compiler_for(). There is an equivalent compiler_*() function for all of the statement and expression types. The more straightforward types create the bytecode instructions inline, while some of the more complex statement types call other functions.

Many of the statements can have sub-statements. A for loop has a body, but you can also have complex expressions in the assignment and the iterator.

The compiler's compiler_*() functions send blocks to the compiler state. These blocks contain instructions. The instruction data structure in Python/compile.c has the opcode, any arguments, the target block (if this is a jump instruction), and the line number.

Jump statements can be either absolute or relative. They are used to "jump" from one operation to another. Absolute jump statements specify the exact operation number in the compiled code object, whereas relative jump statements specify the jump target relative to another operation:

struct instr {
    unsigned i_jabs : 1;
    unsigned i_jrel : 1;
    unsigned char i_opcode;
    int i_oparg;
    struct basicblock_ *i_target; /* target block (if jump instruction) */
    int i_lineno;
};

So a frame block (of type basicblock) contains the following fields:

typedef struct basicblock_ {
    /* Each basicblock in a compilation unit is linked via b_list in the
       reverse order that the block are allocated.  b_list points to the next
       block, not to be confused with b_next, which is next by control flow. */
    struct basicblock_ *b_list;
    /* number of instructions used */
    int b_iused;
    /* length of instruction array (b_instr) */
    int b_ialloc;
    /* pointer to an array of instructions, initially NULL */
    struct instr *b_instr;
    /* If b_next is non-NULL, it is a pointer to the next
       block reached by normal control flow. */
    struct basicblock_ *b_next;
    /* b_seen is used to perform a DFS of basicblocks. */
    unsigned b_seen : 1;
    /* b_return is true if a RETURN_VALUE opcode is inserted. */
    unsigned b_return : 1;
    /* depth of stack upon entry of block, computed by stackdepth() */
    int b_startdepth;
    /* instruction offset for block, computed by assemble_jump_offsets() */
    int b_offset;
} basicblock;

The For statement is somewhere in the middle in terms of complexity. There are 15 steps in the compilation of a For statement with the for <target> in <iterator>: syntax:

  1. Create a new code block called start, this allocates memory and creates a basicblock pointer
  2. Create a new code block called cleanup
  3. Create a new code block called end
  4. Push a frame block of type FOR_LOOP to the stack with start as the entry block and end as the exit block
  5. Visit the iterator expression, which adds any operations for the iterator
  6. Add the GET_ITER operation to the compiler state
  7. Switch to the start block
  8. Call ADDOP_JREL which calls compiler_addop_j() to add the FOR_ITER operation with an argument of the cleanup block
  9. Visit the target and add any special code, like tuple unpacking, to the start block
  10. Visit each statement in the body of the for loop
  11. Call ADDOP_JABS which calls compiler_addop_j() to add the JUMP_ABSOLUTE operation which indicates after the body is executed, jumps back to the start of the loop
  12. Move to the cleanup block
  13. Pop the FOR_LOOP frame block off the stack
  14. Visit the statements inside the else section of the for loop
  15. Use the end block

Referring back to the basicblock structure, you can see how, in the compilation of the for statement, the various blocks are created and pushed into the compiler's frame block and stack:

static int
compiler_for(struct compiler *c, stmt_ty s)
{
    basicblock *start, *cleanup, *end;

    start = compiler_new_block(c);                       // 1.
    cleanup = compiler_new_block(c);                     // 2.
    end = compiler_new_block(c);                         // 3.
    if (start == NULL || end == NULL || cleanup == NULL)
        return 0;

    if (!compiler_push_fblock(c, FOR_LOOP, start, end))  // 4.
        return 0;

    VISIT(c, expr, s->v.For.iter);                       // 5.
    ADDOP(c, GET_ITER);                                  // 6.
    compiler_use_next_block(c, start);                   // 7.
    ADDOP_JREL(c, FOR_ITER, cleanup);                    // 8.
    VISIT(c, expr, s->v.For.target);                     // 9.
    VISIT_SEQ(c, stmt, s->v.For.body);                   // 10.
    ADDOP_JABS(c, JUMP_ABSOLUTE, start);                 // 11.
    compiler_use_next_block(c, cleanup);                 // 12.

    compiler_pop_fblock(c, FOR_LOOP, start);             // 13.

    VISIT_SEQ(c, stmt, s->v.For.orelse);                 // 14.
    compiler_use_next_block(c, end);                     // 15.
    return 1;
}
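The operations added in steps 6 and 8 can be spotted in the disassembly of any for loop. The sketch below, with an invented total() function, checks only for GET_ITER and FOR_ITER. The jump back to the start of the loop is deliberately not checked, since newer Python versions use a different instruction than JUMP_ABSOLUTE for it:

```python
import dis


def total(items):
    result = 0
    for item in items:
        result += item
    return result


# Opcode names in the order they appear in the compiled function.
opnames = [instruction.opname for instruction in dis.get_instructions(total)]
print(opnames)
```

GET_ITER appears before FOR_ITER, matching the order in which compiler_for() adds them.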

Depending on the type of operation, there are different arguments required. For example, we used ADDOP_JREL and ADDOP_JABS here, which stand for "ADD Operation with Jump to a RELative position" and "ADD Operation with Jump to an ABSolute position". Both macros call compiler_addop_j(struct compiler *c, int opcode, basicblock *b, int absolute) and set the absolute argument to 0 and 1 respectively.

There are some other macros too: ADDOP_I calls compiler_addop_i(), which adds an operation with an integer argument, and ADDOP_O calls compiler_addop_o(), which adds an operation with a PyObject argument.

Once these stages have completed, the compiler has a list of frame blocks, each containing a list of instructions and a pointer to the next block.

Assembly

With the compiler state, the assembler performs a "depth-first-search" of the blocks and merges the instructions into a single bytecode sequence. The assembler state is declared in Python/compile.c:

struct assembler {
    PyObject *a_bytecode;  /* string containing bytecode */
    int a_offset;              /* offset into bytecode */
    int a_nblocks;             /* number of reachable blocks */
    basicblock **a_postorder; /* list of blocks in dfs postorder */
    PyObject *a_lnotab;    /* string containing lnotab */
    int a_lnotab_off;      /* offset into lnotab */
    int a_lineno;              /* last lineno of emitted instruction */
    int a_lineno_off;      /* bytecode offset of last lineno */
};

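The assemble() function has a few tasks. It makes sure every block that falls off the end returns None, runs a depth-first search over the blocks with dfs(), computes the jump offsets, emits the instructions in reverse post-order, and finally calls makecode() to build the code object: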

static PyCodeObject *
assemble(struct compiler *c, int addNone)
{
    basicblock *b, *entryblock;
    struct assembler a;
    int i, j, nblocks;
    PyCodeObject *co = NULL;

    /* Make sure every block that falls off the end returns None.
       XXX NEXT_BLOCK() isn't quite right, because if the last
       block ends with a jump or return b_next shouldn't set.
     */
    if (!c->u->u_curblock->b_return) {
        NEXT_BLOCK(c);
        if (addNone)
            ADDOP_LOAD_CONST(c, Py_None);
        ADDOP(c, RETURN_VALUE);
    }
    ...
    dfs(c, entryblock, &a, nblocks);

    /* Can't modify the bytecode after computing jump offsets. */
    assemble_jump_offsets(&a, c);

    /* Emit code in reverse postorder from dfs. */
    for (i = a.a_nblocks - 1; i >= 0; i--) {
        b = a.a_postorder[i];
        for (j = 0; j < b->b_iused; j++)
            if (!assemble_emit(&a, &b->b_instr[j]))
                goto error;
    }
    ...

    co = makecode(c, &a);
 error:
    assemble_free(&a);
    return co;
}

The depth-first-search is performed by the dfs() function in Python/compile.c, which follows the b_next pointers in each of the blocks, marks them as seen by setting b_seen, and then adds them to the assembler's a_postorder list in reverse order.

The function then loops back over the assembler's post-order list and, for each block that has a jump operation, recursively calls dfs() for that jump target:

static void
dfs(struct compiler *c, basicblock *b, struct assembler *a, int end)
{
    int i, j;

    /* Get rid of recursion for normal control flow.
       Since the number of blocks is limited, unused space in a_postorder
       (from a_nblocks to end) can be used as a stack for still not ordered
       blocks. */
    for (j = end; b && !b->b_seen; b = b->b_next) {
        b->b_seen = 1;
        assert(a->a_nblocks < j);
        a->a_postorder[--j] = b;
    }
    while (j < end) {
        b = a->a_postorder[j++];
        for (i = 0; i < b->b_iused; i++) {
            struct instr *instr = &b->b_instr[i];
            if (instr->i_jrel || instr->i_jabs)
                dfs(c, instr->i_target, a, j);
        }
        assert(a->a_nblocks < j);
        a->a_postorder[a->a_nblocks++] = b;
    }
}
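
The traversal can be modeled in a few lines of Python. This is only a sketch: the Block class and its attribute names are hypothetical stand-ins for basicblock, b_next, and b_seen, and plain recursion replaces the in-place stack that the C code builds inside a_postorder. It shows that unreachable blocks never enter the post-order list, and that emitting in reverse post-order puts the entry block first:

```python
class Block:
    """Hypothetical stand-in for the compiler's basicblock struct."""

    def __init__(self, name):
        self.name = name
        self.next = None     # plays the role of b_next
        self.targets = []    # jump targets found in the block's instructions
        self.seen = False    # plays the role of b_seen


def dfs(block, postorder):
    """Append reachable blocks to postorder, deepest block first."""
    if block is None or block.seen:
        return
    block.seen = True
    dfs(block.next, postorder)       # follow the fall-through chain first
    for target in block.targets:     # then follow any jumps
        dfs(target, postorder)
    postorder.append(block)


entry, loop, done, orphan = (Block(n) for n in ("entry", "loop", "done", "orphan"))
entry.next = loop
loop.next = done
loop.targets = [entry, done]         # a back edge and a forward jump

postorder = []
dfs(entry, postorder)

# The assembler emits code in reverse post-order.
emit_order = [block.name for block in reversed(postorder)]
print(emit_order)
```

The orphan block is never linked to the entry block, so it is not emitted.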

Creating a Code Object

The task of makecode() is to go through the compiler state and some of the assembler's properties, and to put these into a PyCodeObject by calling PyCode_New():

PyCodeObject structure

The variable names and constants are stored as properties of the code object:

static PyCodeObject *
makecode(struct compiler *c, struct assembler *a)
{
...

    consts = consts_dict_keys_inorder(c->u->u_consts);
    names = dict_keys_inorder(c->u->u_names, 0);
    varnames = dict_keys_inorder(c->u->u_varnames, 0);
...
    cellvars = dict_keys_inorder(c->u->u_cellvars, 0);
...
    freevars = dict_keys_inorder(c->u->u_freevars, PyTuple_GET_SIZE(cellvars));
...
    flags = compute_code_flags(c);
    if (flags < 0)
        goto error;

    bytecode = PyCode_Optimize(a->a_bytecode, consts, names, a->a_lnotab);
...
    co = PyCode_New(argcount, kwonlyargcount,
                    nlocals_int, maxdepth, flags,
                    bytecode, consts, names, varnames,
                    freevars, cellvars,
                    c->c_filename, c->u->u_name,
                    c->u->u_firstlineno,
                    a->a_lnotab);
...
    return co;
}

You may also notice that the bytecode is sent to PyCode_Optimize() before it is sent to PyCode_New(). This function is part of the bytecode optimization process in Python/peephole.c.

The peephole optimizer goes through the bytecode instructions and, in certain scenarios, replaces them with other instructions. For example, there is an optimization called "constant folding", so if you put the following statement into your script:

a = 1 + 5

It optimizes that to:

a = 6

Because 1 and 5 are constant values, the result will always be the same.
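
You can observe the effect from Python with the dis module. The following is a small sketch: it compiles the statement and checks that no binary operation remains in the bytecode. Which optimizer stage performs the folding, and how the folded constant is loaded, can differ between CPython versions, so the check is kept deliberately loose:

```python
import dis

code = compile("a = 1 + 5", "<example>", "exec")

# Collect the names of the emitted instructions.
opnames = [instruction.opname for instruction in dis.get_instructions(code)]

# If the addition was folded at compile time, no BINARY_* opcode is left.
has_binary_op = any(name.startswith("BINARY") for name in opnames)

namespace = {}
exec(code, namespace)
print(opnames, has_binary_op, namespace["a"])
```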

Conclusion

We can pull together all of these stages with the instaviz module:

import instaviz

def foo():
    a = 2**4
    b = 1 + 5
    c = [1, 4, 6]
    for i in c:
        print(i)
    else:
        print(a)
    return c


instaviz.show(foo)

This will produce an AST graph:

Instaviz screenshot 6

With bytecode instructions in sequence:

Instaviz screenshot 7

Also, the code object with the variable names, constants, and binary co_code:

Instaviz screenshot 8

Execution

In Python/pythonrun.c we broke out just before the call to run_eval_code_obj().

This call takes a code object, either fetched from the marshaled .pyc file, or compiled through the AST and compiler stages.

run_eval_code_obj() will pass the globals, locals, PyArena, and compiled PyCodeObject to PyEval_EvalCode() in Python/ceval.c.
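
From Python, the closest everyday equivalent is calling exec() with a compiled code object and explicit namespaces. The sketch below assumes nothing beyond the built-in compile() and exec() functions; the variable names are illustrative only:

```python
# Compile a statement into a code object, as the compiler stages would.
code = compile("result = x * 2", "<example>", "exec")

# Explicit globals and locals, mirroring the two namespaces handed to the evaluator.
globals_ns = {"x": 21}
locals_ns = {}

exec(code, globals_ns, locals_ns)

# The assignment lands in the locals mapping, while x was read from globals.
print(locals_ns["result"])
```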

This stage forms the execution component of CPython. Each of the bytecode operations is taken and executed using a "Stack Frame" based system.

What is a Stack Frame?

Stack Frames are a data type used by many runtimes, not just Python, that allows functions to be called and variables to be returned between functions. Stack Frames also contain arguments, local variables, and other state information.

Typically, a Stack Frame exists for every function call, and they are stacked in sequence. You can see CPython's frame stack anytime an exception is unhandled and the stack is printed on the screen.

PyEval_EvalCode() is the public API for evaluating a code object. The logic for evaluation is split between _PyEval_EvalCodeWithName() and _PyEval_EvalFrameDefault(), which are both in ceval.c.

The public API PyEval_EvalCode() will construct an execution frame from the top of the stack by calling _PyEval_EvalCodeWithName().

The construction of the first execution frame has many steps:

  1. Keyword and positional arguments are resolved.
  2. The use of *args and **kwargs in function definitions is resolved.
  3. Arguments are added as local variables to the scope.
  4. Co-routines and Generators are created, including the Asynchronous Generators.

The frame object looks like this:

PyFrameObject structure

Let's step through those sequences.

1. Constructing Thread State

Before a frame can be executed, it needs to be referenced from a thread. CPython can have many threads running at any one time within a single interpreter. An Interpreter state includes a list of those threads as a linked list. The thread structure is called PyThreadState, and there are many references throughout ceval.c.

Here is the structure of the thread state object:

PyThreadState structure

2. Constructing Frames

The input to PyEval_EvalCode(), and therefore _PyEval_EvalCodeWithName(), has arguments for the code object (_co), the global variables (globals), and the local variables (locals).

The other arguments are optional, and not used for the basic API:

PyObject *
_PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals,
           PyObject *const *args, Py_ssize_t argcount,
           PyObject *const *kwnames, PyObject *const *kwargs,
           Py_ssize_t kwcount, int kwstep,
           PyObject *const *defs, Py_ssize_t defcount,
           PyObject *kwdefs, PyObject *closure,
           PyObject *name, PyObject *qualname)
{
    ...

    /* Create the frame */
    tstate = _PyThreadState_GET();
    assert(tstate != NULL);
    f = _PyFrame_New_NoTrack(tstate, co, globals, locals);
    if (f == NULL) {
        return NULL;
    }
    fastlocals = f->f_localsplus;
    freevars = f->f_localsplus + co->co_nlocals;

3. Converting Keyword Parameters to a Dictionary

If the function definition contained a **kwargs style catch-all for keyword arguments, then a new dictionary is created, and the values are copied across. The kwargs name is then set as a variable, like in this example:

def example(arg, arg2=None, **kwargs):
    print(kwargs['extra'])  # this would resolve to a dictionary key

The logic for creating a keyword argument dictionary is in the next part of _PyEval_EvalCodeWithName():

    /* Create a dictionary for keyword parameters (**kwargs) */
    if (co->co_flags & CO_VARKEYWORDS) {
        kwdict = PyDict_New();
        if (kwdict == NULL)
            goto fail;
        i = total_args;
        if (co->co_flags & CO_VARARGS) {
            i++;
        }
        SETLOCAL(i, kwdict);
    }
    else {
        kwdict = NULL;
    }

The kwdict variable will reference a PyDictObject if the function definition accepts **kwargs; otherwise it is set to NULL.
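
The same flag is visible from Python through the code object's co_flags attribute and the constants in the inspect module. A minimal sketch, with two illustrative functions:

```python
import inspect


def with_kwargs(arg, **kwargs):
    return kwargs


def without_kwargs(arg):
    return arg


# CO_VARKEYWORDS is set only when the definition has a **kwargs catch-all.
accepts_kwargs = bool(with_kwargs.__code__.co_flags & inspect.CO_VARKEYWORDS)
rejects_kwargs = bool(without_kwargs.__code__.co_flags & inspect.CO_VARKEYWORDS)

print(accepts_kwargs, rejects_kwargs)
```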

4. Converting Positional Arguments Into Variables

Next, each of the positional arguments (if provided) is set as a local variable:

    for (i = 0; i < n /* argcount */; i++) {
        x = args[i];
        Py_INCREF(x);
        SETLOCAL(i, x);
    }

At the end of the loop, you'll see a call to SETLOCAL() with the value, so if a positional argument is defined with a value, that is available within this scope:

def example(arg1, arg2):
    print(arg1, arg2)  # both args are already local variables.

Also, the reference counter for those variables is incremented, so the garbage collector won't remove them until the frame has evaluated.

5. Packing Positional Arguments Into *args

Similar to **kwargs, a function argument prepended with a * can be set to catch all remaining positional arguments. This argument is a tuple and the *args name is set as a local variable:

    /* Pack other positional arguments into the *args argument */
    if (co->co_flags & CO_VARARGS) {
        u = _PyTuple_FromArray(args + n, argcount - n);
        if (u == NULL) {
            goto fail;
        }
        SETLOCAL(total_args, u);
    }
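
The slot arithmetic in the C code (total_args for *args, and one slot further for **kwargs) matches the order of names in the code object's co_varnames tuple. The sketch below uses an illustrative function called handler to show this from Python:

```python
def handler(a, b, *rest, key=None, **extra):
    return rest, extra


code = handler.__code__

# total_args counts positional and keyword-only parameters.
total_args = code.co_argcount + code.co_kwonlyargcount

# The *args name sits at index total_args, and **kwargs directly after it.
varargs_name = code.co_varnames[total_args]
varkw_name = code.co_varnames[total_args + 1]

print(code.co_varnames, varargs_name, varkw_name)
print(handler(1, 2, 3, 4, x=5))
```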

6. Loading Keyword Arguments

If the function was called with keyword arguments and values, the kwdict dictionary created in step 3 is now filled with any remaining keyword arguments passed by the caller that don't resolve to named arguments or positional arguments.

For example, the e argument was neither positional nor named, so it is added to **remaining:

>>> def my_function(a, b, c=None, d=None, **remaining):
...     print(a, b, c, d, remaining)
...
>>> my_function(a=1, b=2, c=3, d=4, e=5)
1 2 3 4 {'e': 5}

The resolution of the keyword argument dictionary values comes after the unpacking of all other arguments. PyDict_SetItem() is called for each remaining argument to add it to the kwdict dictionary:

    for (i = 0; i < kwcount; i += kwstep) {
        PyObject **co_varnames;
        PyObject *keyword = kwnames[i];
        PyObject *value = kwargs[i];
        ...

        if (PyDict_SetItem(kwdict, keyword, value) == -1) {
            goto fail;
        }
        continue;

      kw_found:
        ...
        Py_INCREF(value);
        SETLOCAL(j, value);
    }
    ...

At the end of the loop, you'll see a call to SETLOCAL() with the value. If a keyword argument is defined with a value, that is available within this scope:

def example(arg1, arg2, example_kwarg=None):
    print(example_kwarg)  # example_kwarg is already a local variable.

7. Adding Missing Positional Arguments

Any positional arguments that the caller did not provide are filled in from the function's default values (defs). If a positional argument was not provided and has no default value, a failure is raised:

    /* Add missing positional arguments (copy default values from defs) */
    if (argcount < co->co_argcount) {
        Py_ssize_t m = co->co_argcount - defcount;
        Py_ssize_t missing = 0;
        for (i = argcount; i < m; i++) {
            if (GETLOCAL(i) == NULL) {
                missing++;
            }
        }
        if (missing) {
            missing_arguments(co, missing, defcount, fastlocals);
            goto fail;
        }
        if (n > m)
            i = n - m;
        else
            i = 0;
        for (; i < defcount; i++) {
            if (GETLOCAL(m+i) == NULL) {
                PyObject *def = defs[i];
                Py_INCREF(def);
                SETLOCAL(m+i, def);
            }
        }
    }
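
Both outcomes can be reproduced from Python. In this sketch, greet is an illustrative function; its defaults are stored in the __defaults__ attribute, and calling it without the required argument raises TypeError:

```python
def greet(name, greeting="hello"):
    return f"{greeting}, {name}"


# Default values for trailing positional parameters live on the function object.
defaults = greet.__defaults__

# The missing greeting is copied from the defaults.
filled = greet("world")

# A missing parameter with no default is an error.
try:
    greet()
except TypeError as exc:
    message = str(exc)

print(defaults, filled, message)
```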

8. Adding Missing Keyword Arguments

Any keyword-only arguments that the caller did not provide are filled in from the keyword defaults dictionary (kwdefs). If a keyword-only argument was not provided and has no default value, a failure is raised:

    /* Add missing keyword arguments (copy default values from kwdefs) */
    if (co->co_kwonlyargcount > 0) {
        Py_ssize_t missing = 0;
        for (i = co->co_argcount; i < total_args; i++) {
            PyObject *name;
            if (GETLOCAL(i) != NULL)
                continue;
            name = PyTuple_GET_ITEM(co->co_varnames, i);
            if (kwdefs != NULL) {
                PyObject *def = PyDict_GetItemWithError(kwdefs, name);
                ...
            }
            missing++;
        }
        ...
    }
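
The keyword-only case looks similar from Python, except that the defaults are kept in a dictionary, __kwdefaults__. The function connect below is illustrative only:

```python
def connect(host, *, port=80, timeout):
    return host, port, timeout


# Keyword-only defaults are stored by name rather than by position.
kwdefaults = connect.__kwdefaults__

# port is copied from the keyword defaults.
filled = connect("example.org", timeout=5)

# timeout has no default, so omitting it is an error.
try:
    connect("example.org")
except TypeError as exc:
    message = str(exc)

print(kwdefaults, filled, message)
```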

9. Collapsing Closures

Any closure variables are copied into the frame's free variable slots, which follow the cell variables:

    /* Copy closure variables to free variables */
    for (i = 0; i < PyTuple_GET_SIZE(co->co_freevars); ++i) {
        PyObject *o = PyTuple_GET_ITEM(closure, i);
        Py_INCREF(o);
        freevars[PyTuple_GET_SIZE(co->co_cellvars) + i] = o;
    }
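
The closure tuple and the free variable names are both exposed on Python function objects. A short sketch, using an illustrative make_counter function:

```python
def make_counter(start):
    def counter():
        return start + 1
    return counter


counter = make_counter(10)

# The inner function lists start as a free variable...
free_names = counter.__code__.co_freevars

# ...and the outer function lists it as a cell variable.
cell_names = make_counter.__code__.co_cellvars

# The closure holds one cell per free variable, in the same order.
captured = counter.__closure__[0].cell_contents

print(free_names, cell_names, captured, counter())
```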

10. Creating Generators, Coroutines, and Asynchronous Generators

If the evaluated code object has a flag that it is a generator, coroutine or async generator, then a new frame is created using one of the unique methods in the Generator, Coroutine or Async libraries and the current frame is added as a property.

The new frame is then returned, and the original frame is not evaluated. The frame is only evaluated when the generator/coroutine/async method is called on to execute its target:

    /* Handle generator/coroutine/asynchronous generator */
    if (co->co_flags & (CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR)) {
        ...

        /* Create a new generator that owns the ready to run frame
         * and return that as the value. */
        if (is_coro) {
            gen = PyCoro_New(f, name, qualname);
        } else if (co->co_flags & CO_ASYNC_GENERATOR) {
            gen = PyAsyncGen_New(f, name, qualname);
        } else {
            gen = PyGen_NewWithQualName(f, name, qualname);
        }
        ...

        return gen;
    }
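
This is why calling a generator function does not run its body. The sketch below uses an illustrative countdown generator and records when the body actually starts executing:

```python
import inspect

events = []


def countdown(n):
    events.append("body started")
    while n > 0:
        yield n
        n -= 1


# The compiler marks the code object as a generator.
is_generator = bool(countdown.__code__.co_flags & inspect.CO_GENERATOR)

gen = countdown(2)        # returns a generator object that owns the frame
before = list(events)     # nothing has run yet

first = next(gen)         # the frame is evaluated up to the first yield
after = list(events)

print(is_generator, before, first, after)
```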

Lastly, PyEval_EvalFrameEx() is called with the new frame:

    retval = PyEval_EvalFrameEx(f,0);
    ...
}

Frame Execution

As covered earlier in the compiler and AST chapters, the code object contains a binary encoding of the bytecode to be executed. It also contains a list of variables and a symbol table.

The local and global variables are determined at runtime based on how that function, module, or block was called. This information is added to the frame by the _PyEval_EvalCodeWithName() function. There are other usages of frames, like the coroutine decorator, which dynamically generates a frame with the target as a variable.

The public API, PyEval_EvalFrameEx(), calls the interpreter's configured frame evaluation function in the eval_frame property. Frame evaluation was made pluggable in Python 3.6 with PEP 523.

_PyEval_EvalFrameDefault() is the default function, and it is unusual to use anything other than this.

Frames are executed in the main execution loop inside _PyEval_EvalFrameDefault(). This function is the central function that brings everything together and brings your code to life. It contains decades of optimization since even a single line of code can have a significant impact on performance for the whole of CPython.

Everything that gets executed in CPython goes through this function.

Note: Something you might notice when reading ceval.c is how many times C macros have been used. C macros are a way of having DRY-compliant code without the overhead of making function calls. The compiler expands the macros into C code and then compiles the generated code.

If you want to see the expanded code, you can run gcc -E on Linux and macOS:

$ gcc -E Python/ceval.c

Alternatively, Visual Studio Code can do inline macro expansion once you have installed the official C/C++ extension:

C Macro expansion with VScode

We can step through frame execution in Python 3.7 and beyond by enabling the tracing attribute on the current thread.

This code example sets the global tracing function to a function called trace() that gets the stack from the current frame, prints the disassembled opcodes to the screen, and some extra information for debugging:

import sys
import dis
import traceback
import io

def trace(frame, event, args):
   frame.f_trace_opcodes = True
   stack = traceback.extract_stack(frame)
   pad = "   "*len(stack) + "|"
   if event == 'opcode':
      with io.StringIO() as out:
         dis.disco(frame.f_code, frame.f_lasti, file=out)
         lines = out.getvalue().split('\n')
         for line in lines:
            print(f"{pad}{line}")
   elif event == 'call':
      print(f"{pad}Calling {frame.f_code}")
   elif event == 'return':
      print(f"{pad}Returning {args}")
   elif event == 'line':
      print(f"{pad}Changing line to {frame.f_lineno}")
   else:
      print(f"{pad}{frame} ({event} - {args})")
   print(f"{pad}----------------------------------")
   return trace
sys.settrace(trace)

# Run some code for a demo
eval('"-".join([letter for letter in "hello"])')

This prints the code within each stack and points to the next operation before it is executed. When a frame returns a value, the return statement is printed:

Evaluating frame with tracing

The full list of instructions is available in the dis module documentation.

The Value Stack

Inside the core evaluation loop, a value stack is created. This stack is a list of pointers to sequential PyObject instances.

One way to think of the value stack is like a wooden peg on which you can stack cylinders. You would only add or remove one item at a time. This is done using the PUSH(a) macro, where a is a pointer to a PyObject.

For example, if you created a PyLong with the value 10 and pushed it onto the value stack:

PyObject *a = PyLong_FromLong(10);
PUSH(a);

This action would have the following effect:

PUSH()

In the next operation, to fetch that value, you would use the POP() macro to take the top value from the stack:

PyObject *a = POP();  // a is PyLongObject with a value of 10

This action would return the top value and end up with an empty value stack:

POP()

If you were to add 2 values to the stack:

PyObject *a = PyLong_FromLong(10);
PyObject *b = PyLong_FromLong(20);
PUSH(a);
PUSH(b);

They would end up in the order in which they were added, so a would be pushed to the second position in the stack:

PUSH();PUSH()

If you were to fetch the top value in the stack, you would get a pointer to b because it is at the top:

POP();

If you need to fetch the pointer to the top value in the stack without popping it, you can use the PEEK(v) operation, where v is the stack position:

PyObject *first = PEEK(1);

1 represents the top of the stack, 2 would be the second position:

PEEK()

To clone the value at the top of the stack, the DUP_TOP() macro can be used, or the DUP_TOP opcode:

DUP_TOP();

This action would copy the value at the top to form 2 pointers to the same object:

DUP_TOP()

There is a rotation macro ROT_TWO that swaps the first and second values:

ROT_TWO()
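
The operations above can be modeled with a Python list, where the end of the list is the top of the stack. This is only a sketch: the ValueStack class and its method names are hypothetical, and the real value stack is a C array of PyObject pointers moved through with stack_pointer:

```python
class ValueStack:
    """Hypothetical model of the value stack; the list's end is the top."""

    def __init__(self):
        self._items = []

    def push(self, obj):
        self._items.append(obj)

    def pop(self):
        return self._items.pop()

    def dup_top(self):
        # Push a second reference to the object at the top.
        self._items.append(self._items[-1])

    def rot_two(self):
        # Swap the two topmost items.
        self._items[-1], self._items[-2] = self._items[-2], self._items[-1]

    def as_list(self):
        return list(self._items)   # bottom first, top last


stack = ValueStack()
stack.push(10)
stack.push(20)
after_push = stack.as_list()

stack.rot_two()
after_rot = stack.as_list()

stack.dup_top()
after_dup = stack.as_list()

top = stack.pop()
print(after_push, after_rot, after_dup, top)
```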

Each of the opcodes has a predefined "stack effect," calculated by the stack_effect() function inside Python/compile.c. This function returns the delta in the number of values inside the stack for each opcode.
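
The dis module exposes this calculation to Python as dis.stack_effect(). The sketch below looks up two opcodes that exist across CPython 3 releases; other opcode names vary between versions, so treat the choice as illustrative:

```python
import dis

# LOAD_CONST takes an argument and pushes one value onto the stack.
load_const_effect = dis.stack_effect(dis.opmap["LOAD_CONST"], 0)

# POP_TOP takes no argument and removes one value from the stack.
pop_top_effect = dis.stack_effect(dis.opmap["POP_TOP"])

print(load_const_effect, pop_top_effect)
```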

Example: Adding an Item to a List

In Python, when you create a list, the .append() method is available on the list object:

my_list = []
my_list.append(obj)

Where obj is an object you want to append to the end of the list.

There are 2 operations involved. LOAD_FAST, to load the object obj to the top of the value stack from the list of locals in the frame, and LIST_APPEND to add the object.

First exploring LOAD_FAST, there are 5 steps:

  1. The pointer to obj is loaded from GETLOCAL(), where the variable to load is the operation argument. The list of variable pointers is stored in fastlocals, which is a copy of the PyFrame attribute f_localsplus. The operation argument is a number, pointing to the index in the fastlocals array pointer. This means that the loading of a local is simply a copy of the pointer instead of having to look up the variable name.

  2. If the variable no longer exists, an unbound local variable error is raised.

  3. The reference counter for value (in our case, obj) is increased by 1.

  4. The pointer to obj is pushed to the top of the value stack.

  5. The FAST_DISPATCH macro is called. If tracing is enabled, the loop goes over again (with all the tracing). If tracing is not enabled, a goto is called to fast_next_opcode, which jumps back to the top of the loop for the next instruction.

 ... 
    case TARGET(LOAD_FAST): {
        PyObject *value = GETLOCAL(oparg);                 // 1.
        if (value == NULL) {
            format_exc_check_arg(
                PyExc_UnboundLocalError,
                UNBOUNDLOCAL_ERROR_MSG,
                PyTuple_GetItem(co->co_varnames, oparg));
            goto error;                                    // 2.
        }
        Py_INCREF(value);                                  // 3.
        PUSH(value);                                       // 4.
        FAST_DISPATCH();                                   // 5.
    }
 ...

Now the pointer to obj is at the top of the value stack. The next instruction LIST_APPEND is run.

Many of the bytecode operations reference the base types, like PyUnicode and PyNumber. For example, LIST_APPEND appends an object to the end of a list. To achieve this, it pops the pointer from the value stack, which returns the pointer to the last object in the stack. The POP() macro is a shortcut for:

PyObject *v = (*--stack_pointer);

Now the pointer to obj is stored as v. The list pointer is loaded from PEEK(oparg).

Then the C API for Python lists is called for list and v. The code for this is inside Objects/listobject.c, which we go into in the next chapter.

A call to PREDICT is made, which guesses that the next operation will be JUMP_ABSOLUTE. The PREDICT macro has compiler-generated goto statements for each of the potential operations' case statements. This means the CPU can jump to that instruction and not have to go through the loop again:

 ...
        case TARGET(LIST_APPEND): {
            PyObject *v = POP();
            PyObject *list = PEEK(oparg);
            int err;
            err = PyList_Append(list, v);
            Py_DECREF(v);
            if (err != 0)
                goto error;
            PREDICT(JUMP_ABSOLUTE);
            DISPATCH();
        }
 ...
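
One place where you can reliably see LIST_APPEND from Python is a list comprehension. The sketch below assumes only the dis and types modules. The opnames() helper is hypothetical, and it recurses into nested code objects because some CPython versions compile the comprehension body as a separate code object:

```python
import dis
import types


def opnames(code):
    """Collect opcode names from a code object and any nested code objects."""
    names = [instruction.opname for instruction in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names.extend(opnames(const))
    return names


def squares(values):
    return [v * v for v in values]


ops = opnames(squares.__code__)
uses_list_append = "LIST_APPEND" in ops

print(uses_list_append, squares([1, 2, 3]))
```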

Opcode predictions: Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. For example, COMPARE_OP is often followed by POP_JUMP_IF_FALSE or POP_JUMP_IF_TRUE.

"Verifying the prediction costs a single high-speed test of a register variable against a constant. If the pairing was good, then the processor's own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. A successful prediction saves a trip through the eval-loop including its unpredictable switch-case branch. Combined with the processor's internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined."

If collecting opcode statistics, you have two choices:

  1. Keep the predictions turned-on and interpret the results as if some opcodes had been combined
  2. Turn off predictions so that the opcode frequency counter updates for both opcodes

Opcode prediction is disabled with threaded code since the latter allows the CPU to record separate branch prediction information for each opcode.

Some of the operations, such as CALL_FUNCTION, CALL_METHOD, have an operation argument referencing another compiled function. In these cases, another frame is pushed to the frame stack in the thread, and the evaluation loop is run for that function until the function completes. Each time a new frame is created and pushed onto the stack, the value of the frame's f_back is set to the current frame before the new one is created.

This nesting of frames is clear when you see a stack trace. Take this example script:

def function2():
  raise RuntimeError

def function1():
  function2()

if __name__ == '__main__':
  function1()

Calling this on the command line will give you:

$ ./python.exe example_stack.py

Traceback (most recent call last):
  File "example_stack.py", line 8, in <module>
    function1()
  File "example_stack.py", line 5, in function1
    function2()
  File "example_stack.py", line 2, in function2
    raise RuntimeError
RuntimeError

In traceback.py, the walk_stack() function is used to print tracebacks:

def walk_stack(f):
    """Walk a stack yielding the frame and line number for each frame.

    This will follow f.f_back from the given frame. If no frame is given, the
    current stack is used. Usually used with StackSummary.extract.
    """
    if f is None:
        f = sys._getframe().f_back.f_back
    while f is not None:
        yield f, f.f_lineno
        f = f.f_back

Here you can see that the current frame is fetched by calling sys._getframe(), and the parent's parent is set as the starting frame. You don't want to see the call to walk_stack() or print_trace() in the traceback, so those function frames are skipped.

Then the f_back pointer is followed to the top.

sys._getframe() is the Python API to get the frame attribute of the current thread.
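
You can follow the same f_back chain yourself. In this sketch, frame_names(), inner(), and outer() are illustrative functions. Only the three innermost names are inspected, because the frames above them depend on how the script is run:

```python
import sys


def frame_names():
    """Return the function name of each frame, innermost first."""
    names = []
    frame = sys._getframe()
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names


def inner():
    return frame_names()


def outer():
    return inner()


names = outer()
print(names[:3])
```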

Here is how that frame stack would look visually, with 3 frames each with its code object and a thread state pointing to the current frame:

Example frame stack

Conclusion

In this Part, you explored the most complex element of CPython: the compiler. The original author of Python, Guido van Rossum, made the statement that CPython's compiler should be "dumb" so that people can understand it.

By breaking down the compilation process into small, logical steps, it is far easier to understand.

In the next chapter, we connect the compilation process with the basis of all Python code, the object.

Part 4: Objects in CPython

CPython comes with a collection of basic types like strings, lists, tuples, dictionaries, and objects.

All of these types are built-in. You don't need to import any libraries, even from the standard library. Also, the instantiation of these built-in types has some handy shortcuts.

For example, to create a new list, you can call:

lst = list()

Or, you can use square brackets:

lst = []

Strings can be instantiated from a string-literal by using either double or single quotes. We explored the grammar definitions earlier that cause the compiler to interpret double quotes as a string literal.

All types in Python inherit from object, a built-in base type. Even strings, tuples, and lists inherit from object. During the walk-through of the C code, you have read lots of references to PyObject*, the C-API structure for an object.

Because C is not object-oriented like Python, objects in C don't inherit from one another. PyObject is the data structure for the beginning of the Python object's memory.

Much of the base object API is declared in Objects/object.c, like the function PyObject_Repr(), which the built-in repr() function calls. You will also find PyObject_Hash() and other APIs.

All of these functions can be overridden in a custom object by implementing "dunder" methods on a Python object:

class MyObject(object): 
    def __init__(self, id, name):
        self.id = id
        self.name = name

    def __repr__(self):
        return "<{0} id={1}>".format(self.name, self.id)

This code is implemented in PyObject_Repr(), inside Objects/object.c. The type of the target object, v will be inferred through a call to Py_TYPE() and if the tp_repr field is set, then the function pointer is called. If the tp_repr field is not set, i.e. the object doesn't declare a custom __repr__ method, then the default behavior is run, which is to return "<%s object at %p>" with the type name and the ID:

PyObject *
PyObject_Repr(PyObject *v)
{
    PyObject *res;
    if (PyErr_CheckSignals())
        return NULL;
...
    if (v == NULL)
        return PyUnicode_FromString("<NULL>");
    if (Py_TYPE(v)->tp_repr == NULL)
        return PyUnicode_FromFormat("<%s object at %p>",
                                    v->ob_type->tp_name, v);

...
}
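
The difference between the two paths is easy to see from Python. The classes Plain and Named below are illustrative. The check on the default output is kept loose because the module prefix and the address vary:

```python
class Plain:
    pass


class Named:
    def __repr__(self):
        return "<Named custom repr>"


# No custom __repr__: the type name and the object's address are shown.
plain_repr = repr(Plain())

# Custom __repr__: the method defined on the class is called instead.
named_repr = repr(Named())

print(plain_repr)
print(named_repr)
```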

The ob_type field for a given PyObject* will point to the data structure PyTypeObject, defined in Include/cpython/object.h. This data structure lists all the built-in functions as fields, along with the arguments they should receive.

Take tp_repr as an example:

typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name; /* For printing, in format "<module>.<name>" */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */

    /* Methods to implement standard operations */
...
    reprfunc tp_repr;

Where reprfunc is a typedef for PyObject *(*reprfunc)(PyObject *);, a function that takes 1 pointer to PyObject (self).

Some of the dunder APIs are optional, because they only apply to certain types, like numbers:

    /* Method suites for standard classes */

    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;

A sequence, like a list, would implement the following methods:

typedef struct {
    lenfunc sq_length; // len(v)
    binaryfunc sq_concat; // v + x
    ssizeargfunc sq_repeat; // v * n
    ssizeargfunc sq_item; // v[x]
    void *was_sq_slice; // v[x:y] (no longer used)
    ssizeobjargproc sq_ass_item; // v[x] = z
    void *was_sq_ass_slice; // v[x:y] = z
    objobjproc sq_contains; // x in v

    binaryfunc sq_inplace_concat;
    ssizeargfunc sq_inplace_repeat;
} PySequenceMethods;

All of these built-in functions are called the Python Data Model. One of the great resources for the Python Data Model is "Fluent Python" by Luciano Ramalho.
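
A Python class can take part in the sequence protocol by defining the corresponding dunder methods. The Deck class below is an illustrative sketch. The slot names in the comments are the ones from the struct above that these methods are generally understood to correspond to:

```python
class Deck:
    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):                 # len(v), sq_length
        return len(self._cards)

    def __getitem__(self, index):      # v[x], sq_item
        return self._cards[index]

    def __contains__(self, card):      # x in v, sq_contains
        return card in self._cards


deck = Deck(["ace", "king", "queen"])

size = len(deck)
second = deck[1]
has_king = "king" in deck

# Iteration falls back to __getitem__ with 0, 1, 2, ... until IndexError.
as_list = list(deck)

print(size, second, has_king, as_list)
```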

Base Object Type

In Objects/object.c, the base implementation of object type is written as pure C code. There are some concrete implementations of basic logic, like shallow comparisons.

Not all methods in a Python object are part of the Data Model. A Python object can also contain attributes (either class or instance attributes) and methods of its own.

A simple way to think of a Python object is as consisting of 2 things:

  1. The core data model, with pointers to compiled functions
  2. A dictionary with any custom attributes and methods

The core data model is defined in the PyTypeObject, and the functions are defined in:

We're going to dive into 3 of these types:

  1. Booleans
  2. Integers
  3. Generators

Booleans and Integers have a lot in common, so we'll cover those first.

The Bool and Long Integer Type

The bool type is the most straightforward implementation of the built-in types. It inherits from long and has the predefined constants, Py_True and Py_False. These constants are immutable instances, created on the instantiation of the Python interpreter.

Inside Objects/boolobject.c, you can see the helper function to create a bool instance from a number:

PyObject *PyBool_FromLong(long ok)
{
    PyObject *result;

    if (ok)
        result = Py_True;
    else
        result = Py_False;
    Py_INCREF(result);
    return result;
}

This function uses the C evaluation of a numeric type to assign Py_True or Py_False to a result and increment the reference counters.

The numeric functions for and, xor, and or are implemented, but addition, subtraction, and division are dereferenced from the base long type since it would make no sense to divide two boolean values.

The implementation of and for a bool value checks whether a and b are both booleans. If they are, it compares their references to Py_True. Otherwise, they are treated as numbers, and the and operation is run on the two numbers:

static PyObject *
bool_and(PyObject *a, PyObject *b)
{
    if (!PyBool_Check(a) || !PyBool_Check(b))
        return PyLong_Type.tp_as_number->nb_and(a, b);
    return PyBool_FromLong((a == Py_True) & (b == Py_True));
}
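
Both branches are visible from Python through the & operator. A short sketch:

```python
# Both operands are bools, so the result stays a bool.
both_bools = True & False

# One operand is a plain integer, so the integer implementation is used.
mixed = True & 3

print(both_bools, type(both_bools).__name__)
print(mixed, type(mixed).__name__)
print(issubclass(bool, int))
```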

The long type is a bit more complex, as the memory requirements are expansive. In the transition from Python 2 to 3, CPython dropped support for the int type and instead used the long type as the primary integer type. Python's long type is quite special in that it can store a variable-length number. The maximum length is set in the compiled binary.

The data structure of a Python long consists of the PyObject header and a list of digits. The list of digits, ob_digit, is initially set to have one digit, but it is later expanded to a longer length when initialized:

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};

Memory is allocated to a new long through _PyLong_New(). This function takes a fixed length and makes sure it is smaller than MAX_LONG_DIGITS. Then it reallocates the memory for ob_digit to match the length.

To convert a C long type to a Python long type, the long is converted to a list of digits, the memory for the Python long is assigned, and then each of the digits is set. If the absolute value fits into a single digit (it is smaller than 2 to the power of PyLong_SHIFT), a fast path allocates a one-digit long and sets ob_digit[0] directly, without looping to count the digits:

PyObject *
PyLong_FromLong(long ival)
{
    PyLongObject *v;
    unsigned long abs_ival;
    unsigned long t;  /* unsigned so >> doesn't propagate sign bit */
    int ndigits = 0;
    int sign;

    CHECK_SMALL_INT(ival);
...
    /* Fast path for single-digit ints */
    if (!(abs_ival >> PyLong_SHIFT)) {
        v = _PyLong_New(1);
        if (v) {
            Py_SIZE(v) = sign;
            v->ob_digit[0] = Py_SAFE_DOWNCAST(
                abs_ival, unsigned long, digit);
        }
        return (PyObject*)v;
    }
...
    /* Larger numbers: loop to determine number of digits */
    t = abs_ival;
    while (t) {
        ++ndigits;
        t >>= PyLong_SHIFT;
    }
    v = _PyLong_New(ndigits);
    if (v != NULL) {
        digit *p = v->ob_digit;
        Py_SIZE(v) = ndigits*sign;
        t = abs_ival;
        while (t) {
            *p++ = Py_SAFE_DOWNCAST(
                t & PyLong_MASK, unsigned long, digit);
            t >>= PyLong_SHIFT;
        }
    }
    return (PyObject *)v;
}
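
The digit loop at the end of PyLong_FromLong() can be mirrored in Python to see how a number is split up. This is only a sketch: to_digits() and from_digits() are hypothetical helpers, not CPython APIs, and the digit width is read from sys.int_info rather than assumed:

```python
import sys

# Bits stored per digit in this build (commonly 30 or 15).
SHIFT = sys.int_info.bits_per_digit
MASK = (1 << SHIFT) - 1


def to_digits(ival):
    """Return (sign, digits) with the least significant digit first."""
    sign = -1 if ival < 0 else 1
    t = abs(ival)
    digits = []
    while t:
        digits.append(t & MASK)  # Same as: t & PyLong_MASK
        t >>= SHIFT              # Same as: t >>= PyLong_SHIFT
    return sign, digits


def from_digits(sign, digits):
    """Rebuild the integer from its digits to check the round trip."""
    return sign * sum(d << (SHIFT * i) for i, d in enumerate(digits))


sign, digits = to_digits(-(2 ** 64))
print(len(digits), from_digits(sign, digits) == -(2 ** 64))
```

Each entry in digits plays the role of one ob_digit element, and the sign is kept separately, just as the C code stores it in the object's size field.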

To convert a double-precision floating point number to a Python long, PyLong_FromDouble() does the math for you:

PyObject *
PyLong_FromDouble(double dval)
{
    PyLongObject *v;
    double frac;
    int i, ndig, expo, neg;
    neg = 0;
    if (Py_IS_INFINITY(dval)) {
        PyErr_SetString(PyExc_OverflowError,
                        "cannot convert float infinity to integer");
        return NULL;
    }
    if (Py_IS_NAN(dval)) {
        PyErr_SetString(PyExc_ValueError,
                        "cannot convert float NaN to integer");
        return NULL;
    }
    if (dval < 0.0) {
        neg = 1;
        dval = -dval;
    }
    frac = frexp(dval, &expo); /* dval = frac*2**expo; 0.0 <= frac < 1.0 */
    if (expo <= 0)
        return PyLong_FromLong(0L);
    ndig = (expo-1) / PyLong_SHIFT + 1; /* Number of 'digits' in result */
    v = _PyLong_New(ndig);
    if (v == NULL)
        return NULL;
    frac = ldexp(frac, (expo-1) % PyLong_SHIFT + 1);
    for (i = ndig; --i >= 0; ) {
        digit bits = (digit)frac;
        v->ob_digit[i] = bits;
        frac = frac - (double)bits;
        frac = ldexp(frac, PyLong_SHIFT);
    }
    if (neg)
        Py_SIZE(v) = -(Py_SIZE(v));
    return (PyObject *)v;
}
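
The same arithmetic can be followed step by step in Python, since math.frexp() and math.ldexp() wrap the C functions of the same name. This is a sketch under assumptions: long_from_double() is a hypothetical helper, and the digit width of 30 bits is assumed for illustration (the algorithm works for any width):

```python
import math

SHIFT = 30  # Assumed digit width; CPython calls this PyLong_SHIFT


def long_from_double(dval):
    """Mirror PyLong_FromDouble(): truncate a float toward zero."""
    if math.isinf(dval):
        raise OverflowError("cannot convert float infinity to integer")
    if math.isnan(dval):
        raise ValueError("cannot convert float NaN to integer")
    neg = dval < 0.0
    dval = abs(dval)
    frac, expo = math.frexp(dval)      # dval == frac * 2**expo
    if expo <= 0:
        return 0
    ndig = (expo - 1) // SHIFT + 1     # Number of digits in the result
    digits = [0] * ndig
    frac = math.ldexp(frac, (expo - 1) % SHIFT + 1)
    for i in range(ndig - 1, -1, -1):  # Most significant digit first
        bits = int(frac)
        digits[i] = bits
        frac = math.ldexp(frac - bits, SHIFT)
    value = sum(d << (SHIFT * i) for i, d in enumerate(digits))
    return -value if neg else value


print(long_from_double(1e20) == int(1e20))
```

Peeling off the integer part and shifting the remainder left by one digit width is what fills ob_digit from the top down in the C version.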

The remaining functions in longobject.c are utilities, such as PyLong_FromUnicodeObject(), which converts a Unicode string into a number.

A Review of the Generator Type

Python generators are functions that contain a yield statement and can be resumed repeatedly to generate further values.

Commonly they are used as a more memory efficient way of looping through values in a large block of data, like a file, a database or over a network.

Generator objects are returned in place of a value when yield is used instead of return. The generator object is created from the yield statement and returned to the caller.

Let's create a simple generator with a list of 4 constant values:

>>> def example():
...   lst = [1,2,3,4]
...   for i in lst:
...     yield i
... 
>>> gen = example()
>>> gen
<generator object example at 0x100bcc480>

If you explore the contents of the generator object, you can see some of the fields starting with gi_:

>>> dir(gen)
[ ...
 'close', 
 'gi_code', 
 'gi_frame', 
 'gi_running', 
 'gi_yieldfrom', 
 'send', 
 'throw']

The PyGenObject type is defined in Include/cpython/genobject.h and there are 3 flavors:

  1. Generator objects
  2. Coroutine objects
  3. Async generator objects

All 3 share the same subset of fields used in generators, and have similar behaviors.

Figure: Structure of generator types

Focusing first on generators, the fields are prefixed with gi_, such as gi_code, gi_frame, gi_running, and gi_yieldfrom from the dir() output above.

The coroutine and async generator types have the same fields but prefixed with cr_ and ag_ respectively.

If you call __next__() on the generator object, the next value is yielded until eventually a StopIteration is raised:

>>> gen.__next__()
1
>>> gen.__next__()
2
>>> gen.__next__()
3
>>> gen.__next__()
4
>>> gen.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
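
In everyday code you rarely call __next__() yourself. The sketch below repeats the example() generator and shows the more common ways to drive it; the variable names are only for illustration:

```python
def example():
    lst = [1, 2, 3, 4]
    for i in lst:
        yield i


# A for loop calls __next__() for you and stops quietly on StopIteration.
collected = [value for value in example()]

# Driving the generator by hand with the built-in next().
gen = example()
first = next(gen)                  # Same as gen.__next__()
rest = list(gen)                   # Consumes the remaining values
frame_after = gen.gi_frame         # None once the generator is exhausted
fallback = next(gen, "exhausted")  # A default suppresses StopIteration

print(collected, first, rest, frame_after, fallback)
```

Once the generator is exhausted, its gi_frame field is cleared, which is the condition the C code checks before raising StopIteration again.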

Each time __next__() is called, the code object in the generator's gi_code field is executed in the generator's frame, and the return value is pushed to the value stack.

You can also see that gi_code is the compiled code object for the generator function by importing the dis module and disassembling the bytecode inside:

>>> gen = example()
>>> import dis
>>> dis.disco(gen.gi_code)
  2           0 LOAD_CONST               1 (1)
              2 LOAD_CONST               2 (2)
              4 LOAD_CONST               3 (3)
              6 LOAD_CONST               4 (4)
              8 BUILD_LIST               4
             10 STORE_FAST               0 (lst)

  3          12 SETUP_LOOP              18 (to 32)
             14 LOAD_FAST                0 (lst)
             16 GET_ITER
        >>   18 FOR_ITER                10 (to 30)
             20 STORE_FAST               1 (i)

  4          22 LOAD_FAST                1 (i)
             24 YIELD_VALUE
             26 POP_TOP
             28 JUMP_ABSOLUTE           18
        >>   30 POP_BLOCK
        >>   32 LOAD_CONST               0 (None)
             34 RETURN_VALUE

Whenever __next__() is called on a generator object, gen_iternext() is called with the generator instance, which immediately calls gen_send_ex() inside Objects/genobject.c.

gen_send_ex() is the function that converts a generator object into the next yielded result. You'll see many similarities with the way frames are constructed in Python/ceval.c from a code object as these functions have similar tasks.

The gen_send_ex() function is shared with generators, coroutines, and async generators and has the following steps:

  1. The current thread state is fetched

  2. The frame object from the generator object is fetched

  3. If the generator is running when __next__() was called, raise a ValueError

  4. If the generator has no frame, or the frame's stack top is NULL, the generator is exhausted:

    • In the case of a coroutine, if the coroutine is not already marked as closing, a RuntimeError is raised
    • If this is an async generator, raise a StopAsyncIteration
    • For a standard generator, a StopIteration is raised.

  5. If the last instruction in the frame (f->f_lasti) is still -1 because it has just been started, then a non-None value can't be passed as an argument, so a TypeError is raised

  6. Else, the generator has already been started, and arguments are allowed. The value of the argument is pushed to the frame's value stack

  7. The f_back field of the frame is the caller to which return values are sent, so this is set to the current frame in the thread. This means that the return value is sent to the caller, not the creator of the generator

  8. The generator is marked as running

  9. The last exception in the generator's exception info is copied from the last exception in the thread state

  10. The thread state exception info is set to the address of the generator's exception info. This means that if the caller enters a breakpoint around the execution of a generator, the stack trace goes through the generator and the offending code is clear

  11. The frame inside the generator is executed within the Python/ceval.c main execution loop, and the value returned

  12. The thread state last exception is reset to the value before the frame was called

  13. The generator is marked as not running

  14. The following cases then match the return value and any exceptions thrown by the call to the generator. Remember that generators signal exhaustion with a StopIteration when they return instead of yielding a value, and async generators with a StopAsyncIteration:

    • If the frame returned instead of yielding, raise a StopIteration for generators and StopAsyncIteration for async generators
    • If a StopIteration was explicitly raised inside the frame, raise a RuntimeError, as this is not allowed
    • If a StopAsyncIteration was explicitly raised and this is an async generator, raise a RuntimeError, as this is not allowed

  15. Lastly, the result is returned back to the caller of __next__()

static PyObject *
gen_send_ex(PyGenObject *gen, PyObject *arg, int exc, int closing)
{
    PyThreadState *tstate = _PyThreadState_GET();       // 1.
    PyFrameObject *f = gen->gi_frame;                   // 2.
    PyObject *result;

    if (gen->gi_running) {     // 3.
        const char *msg = "generator already executing";
        if (PyCoro_CheckExact(gen)) {
            msg = "coroutine already executing";
        }
        else if (PyAsyncGen_CheckExact(gen)) {
            msg = "async generator already executing";
        }
        PyErr_SetString(PyExc_ValueError, msg);
        return NULL;
    }
    if (f == NULL || f->f_stacktop == NULL) { // 4.
        if (PyCoro_CheckExact(gen) && !closing) {
            /* `gen` is an exhausted coroutine: raise an error,
               except when called from gen_close(), which should
               always be a silent method. */
            PyErr_SetString(
                PyExc_RuntimeError,
                "cannot reuse already awaited coroutine"); // 4a.
        }
        else if (arg && !exc) {
            /* `gen` is an exhausted generator:
               only set exception if called from send(). */
            if (PyAsyncGen_CheckExact(gen)) {
                PyErr_SetNone(PyExc_StopAsyncIteration); // 4b.
            }
            else {
                PyErr_SetNone(PyExc_StopIteration);      // 4c.
            }
        }
        return NULL;
    }

    if (f->f_lasti == -1) {
        if (arg && arg != Py_None) { // 5.
            const char *msg = "can't send non-None value to a "
                              "just-started generator";
            if (PyCoro_CheckExact(gen)) {
                msg = NON_INIT_CORO_MSG;
            }
            else if (PyAsyncGen_CheckExact(gen)) {
                msg = "can't send non-None value to a "
                      "just-started async generator";
            }
            PyErr_SetString(PyExc_TypeError, msg);
            return NULL;
        }
    } else { // 6.
        /* Push arg onto the frame's value stack */
        result = arg ? arg : Py_None;
        Py_INCREF(result);
        *(f->f_stacktop++) = result;
    }

    /* Generators always return to their most recent caller, not
     * necessarily their creator. */
    Py_XINCREF(tstate->frame);
    assert(f->f_back == NULL);
    f->f_back = tstate->frame;                          // 7.

    gen->gi_running = 1;                                // 8.
    gen->gi_exc_state.previous_item = tstate->exc_info; // 9.
    tstate->exc_info = &gen->gi_exc_state;              // 10.
    result = PyEval_EvalFrameEx(f, exc);                // 11.
    tstate->exc_info = gen->gi_exc_state.previous_item; // 12.
    gen->gi_exc_state.previous_item = NULL;             
    gen->gi_running = 0;                                // 13.

    /* Don't keep the reference to f_back any longer than necessary.  It
     * may keep a chain of frames alive or it could create a reference
     * cycle. */
    assert(f->f_back == tstate->frame);
    Py_CLEAR(f->f_back);

    /* If the generator just returned (as opposed to yielding), signal
     * that the generator is exhausted. */
    if (result && f->f_stacktop == NULL) {  // 14a.
        if (result == Py_None) {
            /* Delay exception instantiation if we can */
            if (PyAsyncGen_CheckExact(gen)) {
                PyErr_SetNone(PyExc_StopAsyncIteration);
            }
            else {
                PyErr_SetNone(PyExc_StopIteration);
            }
        }
        else {
            /* Async generators cannot return anything but None */
            assert(!PyAsyncGen_CheckExact(gen));
            _PyGen_SetStopIterationValue(result);
        }
        Py_CLEAR(result);
    }
    else if (!result && PyErr_ExceptionMatches(PyExc_StopIteration)) { // 14b.
        const char *msg = "generator raised StopIteration";
        if (PyCoro_CheckExact(gen)) {
            msg = "coroutine raised StopIteration";
        }
        else if (PyAsyncGen_CheckExact(gen)) {
            msg = "async generator raised StopIteration";
        }
        _PyErr_FormatFromCause(PyExc_RuntimeError, "%s", msg);

    }
    else if (!result && PyAsyncGen_CheckExact(gen) &&
             PyErr_ExceptionMatches(PyExc_StopAsyncIteration))  // 14c.
    {
        /* code in `gen` raised a StopAsyncIteration error:
           raise a RuntimeError.
        */
        const char *msg = "async generator raised StopAsyncIteration";
        _PyErr_FormatFromCause(PyExc_RuntimeError, "%s", msg);
    }
...

    return result; // 15.
}
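
Several of the branches in gen_send_ex() can be triggered from plain Python. The following sketch assumes Python 3.7 or later; every function in it is a hypothetical example written for this purpose, and observe() simply collects what each case produces:

```python
def reentrant(holder):
    # Resuming the generator from inside itself hits the gi_running check.
    yield next(holder[0])


def plain():
    received = yield "started"
    yield received


def leaky():
    raise StopIteration  # Not allowed inside a generator body
    yield


def returns_value():
    yield 1
    return "done"


def observe():
    seen = {}

    holder = []
    holder.append(reentrant(holder))
    try:
        next(holder[0])
    except ValueError as exc:          # Step 3
        seen["running"] = str(exc)

    try:
        plain().send("too early")
    except TypeError as exc:           # Step 5
        seen["just_started"] = str(exc)

    gen = plain()
    next(gen)
    seen["sent"] = gen.send("hello")   # Step 6: value pushed to the frame

    try:
        next(leaky())
    except RuntimeError as exc:        # Step 14b
        seen["leaky"] = str(exc)

    gen = returns_value()
    next(gen)
    try:
        next(gen)
    except StopIteration as exc:       # Step 14a with a return value
        seen["return_value"] = exc.value

    return seen


for key, value in observe().items():
    print(key, "->", value)
```

The error messages printed here come from the string constants in the C function above, which makes it easy to find the matching branch.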

Going back to the evaluation of code objects whenever a function or module is called, there was a special case for generators, coroutines, and async generators in _PyEval_EvalCodeWithName(). This function checks for the CO_GENERATOR, CO_COROUTINE, and CO_ASYNC_GENERATOR flags on the code object.

A new coroutine is created using PyCoro_New(), a new async generator with PyAsyncGen_New(), or a generator with PyGen_NewWithQualName(). These objects are returned early instead of returning an evaluated frame, which is why you get a generator object after calling a function with a yield statement:

PyObject *
_PyEval_EvalCodeWithName(PyObject *_co, PyObject *globals, PyObject *locals, ...
...
    /* Handle generator/coroutine/asynchronous generator */
    if (co->co_flags & (CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR)) {
        PyObject *gen;
        PyObject *coro_wrapper = tstate->coroutine_wrapper;
        int is_coro = co->co_flags & CO_COROUTINE;
        ...
        /* Create a new generator that owns the ready to run frame
         * and return that as the value. */
        if (is_coro) {
            gen = PyCoro_New(f, name, qualname);
        } else if (co->co_flags & CO_ASYNC_GENERATOR) {
            gen = PyAsyncGen_New(f, name, qualname);
        } else {
            gen = PyGen_NewWithQualName(f, name, qualname);
        }
        ...
        return gen;
    }
...
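
You can inspect these flags from Python, because the inspect module exposes the same constants and every function carries its code object. This is a minimal sketch; the three example functions and the flag_names() helper are hypothetical:

```python
import inspect


def gen_func():
    print("body started")  # Not reached until the generator is resumed
    yield 1


async def coro_func():
    return 1


async def agen_func():
    yield 1


def flag_names(func):
    """Return the names of the generator-related flags set on func's code."""
    flags = func.__code__.co_flags
    candidates = {
        "CO_GENERATOR": inspect.CO_GENERATOR,
        "CO_COROUTINE": inspect.CO_COROUTINE,
        "CO_ASYNC_GENERATOR": inspect.CO_ASYNC_GENERATOR,
    }
    return [name for name, bit in candidates.items() if flags & bit]


for func in (gen_func, coro_func, agen_func):
    print(func.__name__, flag_names(func))

# Calling the function returns a generator object without running the body.
gen = gen_func()
print(inspect.isgenerator(gen))
```

Notice that "body started" is never printed: the call returns early with the generator object, exactly as the C code above does.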

The flags in the code object were injected by the compiler after traversing the AST and seeing the yield or yield from statements or seeing the coroutine decorator.

PyGen_NewWithQualName() will call gen_new_with_qualname() with the generated frame and then create the PyGenObject with NULL values and the compiled code object:

static PyObject *
gen_new_with_qualname(PyTypeObject *type, PyFrameObject *f,
                      PyObject *name, PyObject *qualname)
{
    PyGenObject *gen = PyObject_GC_New(PyGenObject, type);
    if (gen == NULL) {
        Py_DECREF(f);
        return NULL;
    }
    gen->gi_frame = f;
    f->f_gen = (PyObject *) gen;
    Py_INCREF(f->f_code);
    gen->gi_code = (PyObject *)(f->f_code);
    gen->gi_running = 0;
    gen->gi_weakreflist = NULL;
    gen->gi_exc_state.exc_type = NULL;
    gen->gi_exc_state.exc_value = NULL;
    gen->gi_exc_state.exc_traceback = NULL;
    gen->gi_exc_state.previous_item = NULL;
    if (name != NULL)
        gen->gi_name = name;
    else
        gen->gi_name = ((PyCodeObject *)gen->gi_code)->co_name;
    Py_INCREF(gen->gi_name);
    if (qualname != NULL)
        gen->gi_qualname = qualname;
    else
        gen->gi_qualname = gen->gi_name;
    Py_INCREF(gen->gi_qualname);
    _PyObject_GC_TRACK(gen);
    return (PyObject *)gen;
}

Bringing this all together, you can see how the generator function is a powerful syntax where a single keyword, yield, triggers a whole flow to create a unique object, copy a compiled code object as a property, set a frame, and store a list of variables in the local scope.

To the user of the generator, this all seems like magic, but under the covers it's not that complex.

Conclusion

Now that you understand how some built-in types work, you can explore other types.

When exploring Python classes, it is important to remember there are built-in types, written in C and classes inheriting from those types, written in Python or C.

Some libraries have types written in C instead of inheriting from the built-in types. One example is numpy, a library for numeric arrays. Its ndarray type is written in C and is highly efficient and performant.

In the next Part, we will explore the classes and functions defined in the standard library.

Part 5: The CPython Standard Library

Python has always come "batteries included." This statement means that with a standard CPython distribution, there are libraries for working with files, threads, networks, web sites, music, keyboards, screens, text, and all manner of utilities.

Some of the batteries that come with CPython are more like AA batteries. They're useful for everything, like the collections module and the sys module. Some of them are a bit more obscure, like a small watch battery that you never know when it might come in useful.

There are 2 types of modules in the CPython standard library:

  1. Those written in pure Python that provide a utility
  2. Those written in C with Python wrappers

We will explore both types.

Python Modules

The modules written in pure Python are all located in the Lib/ directory in the source code. Some of the larger modules have submodules in subfolders, like the email module.

An easy module to look at would be the colorsys module. It's only a few hundred lines of Python code. You may not have come across it before. The colorsys module has some utility functions for converting color scales.

When you install a Python distribution from source, standard library modules are copied from the Lib folder into the distribution folder. This folder is always part of your path when you start Python, so you can import the modules without having to worry about where they're located.

For example:

>>> import colorsys
>>> colorsys
<module 'colorsys' from '/usr/shared/lib/python3.7/colorsys.py'>

>>> colorsys.rgb_to_hls(255,0,0)
(0.0, 127.5, -1.007905138339921) 

We can see the source code of rgb_to_hls() inside Lib/colorsys.py:

# HLS: Hue, Luminance, Saturation
# H: position in the spectrum
# L: color lightness
# S: color saturation

def rgb_to_hls(r, g, b):
    maxc = max(r, g, b)
    minc = min(r, g, b)
    # XXX Can optimize (maxc+minc) and (maxc-minc)
    l = (minc+maxc)/2.0
    if minc == maxc:
        return 0.0, l, 0.0
    if l <= 0.5:
        s = (maxc-minc) / (maxc+minc)
    else:
        s = (maxc-minc) / (2.0-maxc-minc)
    rc = (maxc-r) / (maxc-minc)
    gc = (maxc-g) / (maxc-minc)
    bc = (maxc-b) / (maxc-minc)
    if r == maxc:
        h = bc-gc
    elif g == maxc:
        h = 2.0+rc-bc
    else:
        h = 4.0+gc-rc
    h = (h/6.0) % 1.0
    return h, l, s
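
Reading the source also explains the odd result in the REPL session above: rgb_to_hls() performs no range checking, and the formulas assume each channel is a float between 0.0 and 1.0. The sketch below scales 8-bit channels first; rgb255_to_hls() is a hypothetical helper, not part of colorsys:

```python
import colorsys


def rgb255_to_hls(r, g, b):
    """Hypothetical helper: scale 0-255 channels to 0.0-1.0 first."""
    return colorsys.rgb_to_hls(r / 255, g / 255, b / 255)


pure_red = rgb255_to_hls(255, 0, 0)
print(pure_red)  # (0.0, 0.5, 1.0)
```

With scaled inputs, the lightness and saturation land in the expected 0.0 to 1.0 range instead of 127.5 and a negative number.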

There's nothing special about this function, it's just standard Python. You'll find similar things with all of the pure Python standard library modules. They're just written in plain Python, well laid out and easy to understand. You may even spot improvements or bugs, so you can make changes to them and contribute it to the Python distribution. We'll cover that toward the end of this article.

Python and C Modules

The remainder of modules are written in C, or a combination of Python and C. The source code for these is in Lib/ for the Python component, and Modules/ for the C component. There are two exceptions to this rule, the sys module, found in Python/sysmodule.c and the __builtins__ module, found in Python/bltinmodule.c.

Python will import * from __builtins__ when an interpreter is instantiated, so all of the functions like print(), chr(), format(), etc. are found within Python/bltinmodule.c.
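
You can confirm this from Python through the importable builtins module. A minimal sketch, with variable names chosen only for illustration:

```python
import builtins

# The names you use without importing anything live in the builtins module.
same_print = builtins.print is print
same_chr = builtins.chr is chr

# Functions implemented in C report themselves as built-in functions.
kind = type(print).__name__

print(same_print, same_chr, kind)
```

The type name printed at the end is a hint that the function body lives in C rather than in a .py file.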

Because the sys module is so specific to the interpreter and the internals of CPython, it is found inside the Python directory. It is also marked as an "implementation detail" of CPython and not found in other distributions.

The built-in print() function was probably the first thing you learned to do in Python. So what happens when you type print("hello world!")?

  1. The argument "hello world!" was converted from a string constant to a PyUnicodeObject by the compiler
  2. builtin_print() was executed with 1 argument, and NULL kwnames
  3. The file variable is set to PyId_stdout, the system's stdout handle
  4. Each argument is sent to file
  5. A line break, \n, is sent to file

static PyObject *
builtin_print(PyObject *self, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)
{
    ...
    if (file == NULL || file == Py_None) {
        file = _PySys_GetObjectId(&PyId_stdout);
        ...
    }
    ...
    for (i = 0; i < nargs; i++) {
        if (i > 0) {
            if (sep == NULL)
                err = PyFile_WriteString(" ", file);
            else
                err = PyFile_WriteObject(sep, file,
                                         Py_PRINT_RAW);
            if (err)
                return NULL;
        }
        err = PyFile_WriteObject(args[i], file, Py_PRINT_RAW);
        if (err)
            return NULL;
    }

    if (end == NULL)
        err = PyFile_WriteString("\n", file);
    else
        err = PyFile_WriteObject(end, file, Py_PRINT_RAW);
    ...
    Py_RETURN_NONE;
}
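
The control flow of builtin_print() is small enough to mirror in Python. This is only a sketch of the visible loop, not the real implementation: sketch_print() is a hypothetical function, and the parts elided with ... in the C listing (such as flushing) are left out:

```python
import io
import sys


def sketch_print(*args, sep=None, end=None, file=None):
    """Hypothetical pure-Python mirror of the loop in builtin_print()."""
    if file is None:
        file = sys.stdout          # Default to the system's stdout handle
    for i, arg in enumerate(args):
        if i > 0:
            file.write(" " if sep is None else sep)
        file.write(str(arg))       # Py_PRINT_RAW means str(), not repr()
    file.write("\n" if end is None else end)


buffer = io.StringIO()
sketch_print("hello", "world!", file=buffer)
sketch_print("a", "b", sep="-", end="", file=buffer)
print(repr(buffer.getvalue()))
```

Writing to an io.StringIO buffer makes it easy to compare the sketch against the real print() with the same arguments.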

The contents of some modules written in C expose operating system functions. Because the CPython source code needs to compile to macOS, Windows, Linux, and other *nix-based operating systems, there are some special cases.

The time module is a good example. The way that Windows keeps and stores time in the Operating System is fundamentally different than Linux and macOS. This is one of the reasons why the accuracy of the clock functions differs between operating systems.

In Modules/timemodule.c, the operating system time functions for Unix-based systems are imported from <sys/times.h>:

#ifdef HAVE_SYS_TIMES_H
#include <sys/times.h>
#endif
...
#ifdef MS_WINDOWS
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "pythread.h"
#endif /* MS_WINDOWS */
...

Later in the file, time_process_time_ns() is defined as a wrapper for _PyTime_GetProcessTimeWithInfo():

static PyObject *
time_process_time_ns(PyObject *self, PyObject *unused)
{
    _PyTime_t t;
    if (_PyTime_GetProcessTimeWithInfo(&t, NULL) < 0) {
        return NULL;
    }
    return _PyTime_AsNanosecondsObject(t);
}

_PyTime_GetProcessTimeWithInfo() is implemented multiple different ways in the source code, but only certain parts are compiled into the binary for the module, depending on the operating system. Windows systems will call GetProcessTimes() and Unix systems will call clock_gettime().
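
From Python, you can ask which implementation was compiled into your binary. The sketch below assumes Python 3.7 or later, where time.process_time_ns() is available; the workload is an arbitrary placeholder:

```python
import time

# Metadata about the clock behind time.process_time() on this platform.
info = time.get_clock_info("process_time")

start = time.process_time_ns()
sum(range(100_000))                # A little CPU work to measure
elapsed_ns = time.process_time_ns() - start

print(info.implementation, info.resolution)
print(elapsed_ns)
```

The implementation string differs between operating systems, which reflects the conditional compilation described above.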

Other modules that have multiple implementations for the same API are the threading module, the file system module, and the networking modules. Because the Operating Systems behave differently, the CPython source code implements the same behavior as best as it can and exposes it using a consistent, abstracted API.

The CPython Regression Test Suite

CPython has a robust and extensive test suite covering the core interpreter, the standard library, the tooling and distribution for both Windows and Linux/macOS.

The test suite is located in Lib/test and written almost entirely in Python.

The full test suite is a Python package, so it can be run using the Python interpreter that you've compiled. Change directory to the Lib directory and run python -m test -j2, where j2 means to use 2 CPUs.

On Windows, use the rt.bat script inside the PCbuild folder, ensuring that you have built the Release configuration from Visual Studio in advance:

$ cd PCbuild
$ rt.bat -q

C:\repos\cpython\PCbuild>"C:\repos\cpython\PCbuild\win32\python.exe"  -u -Wd -E -bb -m test
== CPython 3.8.0a3+
== Windows-10-10.0.17134-SP0 little-endian
== cwd: C:\repos\cpython\build\test_python_2784
== CPU count: 2
== encodings: locale=cp1252, FS=utf-8
Run tests sequentially
0:00:00 [  1/420] test_grammar
0:00:00 [  2/420] test_opcodes
0:00:00 [  3/420] test_dict
0:00:00 [  4/420] test_builtin
...

On Linux:

$ cd Lib
$ ../python -m test -j2   
== CPython 3.8.0a2+
== macOS-10.14.3-x86_64-i386-64bit little-endian
== cwd: /Users/anthonyshaw/cpython/build/test_python_23399
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
Run tests in parallel using 2 child processes
0:00:00 load avg: 2.14 [  1/420] test_opcodes passed
0:00:00 load avg: 2.14 [  2/420] test_grammar passed
...

On macOS:

$ cd Lib
$ ../python.exe -m test -j2   
== CPython 3.8.0a2+
== macOS-10.14.3-x86_64-i386-64bit little-endian
== cwd: /Users/anthonyshaw/cpython/build/test_python_23399
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
Run tests in parallel using 2 child processes
0:00:00 load avg: 2.14 [  1/420] test_opcodes passed
0:00:00 load avg: 2.14 [  2/420] test_grammar passed
...

Some tests require certain flags; otherwise they are skipped. For example, many of the IDLE tests require a GUI.

To see a list of test suites in the configuration, use the --list-tests flag:

$ ../python.exe -m test --list-tests

test_grammar
test_opcodes
test_dict
test_builtin
test_exceptions
...

You can run specific tests by providing the test suite as the first argument:

$ ../python.exe -m test test_webbrowser

Run tests sequentially
0:00:00 load avg: 2.74 [1/1] test_webbrowser

== Tests result: SUCCESS ==

1 test OK.

Total duration: 117 ms
Tests result: SUCCESS

You can also see a detailed list of tests that were executed with the result using the -v argument:

$ ../python.exe -m test test_webbrowser -v

== CPython 3.8.0a2+ 
== macOS-10.14.3-x86_64-i386-64bit little-endian
== cwd: /Users/anthonyshaw/cpython/build/test_python_24562
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
Run tests sequentially
0:00:00 load avg: 2.36 [1/1] test_webbrowser
test_open (test.test_webbrowser.BackgroundBrowserCommandTest) ... ok
test_register (test.test_webbrowser.BrowserRegistrationTest) ... ok
test_register_default (test.test_webbrowser.BrowserRegistrationTest) ... ok
test_register_preferred (test.test_webbrowser.BrowserRegistrationTest) ... ok
test_open (test.test_webbrowser.ChromeCommandTest) ... ok
test_open_new (test.test_webbrowser.ChromeCommandTest) ... ok
...
test_open_with_autoraise_false (test.test_webbrowser.OperaCommandTest) ... ok

----------------------------------------------------------------------

Ran 34 tests in 0.056s

OK (skipped=2)

== Tests result: SUCCESS ==

1 test OK.

Total duration: 134 ms
Tests result: SUCCESS

Understanding how to use the test suite and checking the state of the version you have compiled is very important if you wish to make changes to CPython. Before you start making changes, you should run the whole test suite and make sure everything is passing.

Installing a Custom Version

From your source repository, if you're happy with your changes and want to use them inside your system, you can install it as a custom version.

For macOS and Linux, you can use the altinstall command, which installs a standalone version and won't create symlinks for python3:

$ make altinstall

For Windows, you have to change the build configuration from Debug to Release, then copy the packaged binaries to a directory on your computer which is part of the system path.

The CPython Source Code: Conclusion

Congratulations, you made it! Did your tea get cold? Make yourself another cup. You've earned it.

Now that you've seen the CPython source code, the modules, the compiler, and the tooling, you may wish to make some changes and contribute them back to the Python ecosystem.

The official dev guide contains plenty of resources for beginners. You've already taken the first step, to understand the source, knowing how to change, compile, and test the CPython applications.

Think back to all the things you've learned about CPython over this article. All the pieces of magic to which you've learned the secrets. The journey doesn't stop here.

This might be a good time to learn more about Python and C. Who knows: you could be contributing more and more to the CPython project!



21 Aug 2019 4:10pm GMT

Stack Abuse: Python String Interpolation with the Percent (%) Operator

There are a number of different ways to format strings in Python, one of which is done using the % operator, which is known as the string formatting (or interpolation) operator. In this article we'll show you how to use this operator to construct strings with a template string and variables containing your data.

The % Operator

This way of working with text has been shipped with Python since the beginning, and it's also known as C-style formatting, as it originates from the C programming language. Another description for it is simple positional formatting.

The % operator tells the Python interpreter to format a string using a given set of variables, enclosed in a tuple, following the operator. A very simple example of this is as follows:

'%s is smaller than %s' % ('one', 'two')

The Python interpreter substitutes the first occurrence of %s in the string by the given string "one", and the second %s by the string "two". These %s strings are actually placeholders in our "template" string, and they indicate that strings will be placed there.

As a first example, below we demonstrate using the Python REPL how to print a string value and a float value:

>>> print("Mr. %s, the total is %.2f." % ("Jekyll", 15.53))
Mr. Jekyll, the total is 15.53.

Just like the %s is a placeholder for strings, %f is a placeholder for floating point numbers. The ".2" before the f is what indicates how many digits we want displayed after the decimal point.

These are just two simple examples of what is possible, and a lot more placeholder types are supported. Here is the full list of placeholder types in more detail:

%c

This placeholder represents a single character.

>>> print("The character after %c is %c." % ("B", "C"))
The character after B is C.

Providing more than a single character as the variable here will raise an exception.
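
A short sketch of that behavior, which also shows that %c accepts an integer code point; the variable names are only for illustration:

```python
# %c accepts a single character or an integer code point.
from_char = "%c" % "B"
from_int = "%c" % 66

# A longer string does not fit the placeholder and raises TypeError.
try:
    "%c" % "BC"
except TypeError as exc:
    error = str(exc)

print(from_char, from_int, error)
```

The exact wording of the error message varies between Python versions, so it is safer to catch the exception type than to match the text.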

%s

This placeholder uses string conversion via str() prior to formatting. So any value that can be converted to a string via str() can be used here.

>>> place = "New York"
>>> print("Welcome to %s!" % place)
Welcome to New York!

Here we only have a single element to be used in our string formatting, and thus we're not required to enclose the element in a tuple like the previous examples.

%i and %d

These placeholders represent a signed decimal integer.

>>> year = 2019
>>> print("%i will be a perfect year." % year)
2019 will be a perfect year.

Since this placeholder expects a decimal integer, a floating point value provided instead is truncated to an integer.
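
A minimal sketch of what happens to floating point values here; the numbers are arbitrary:

```python
truncated = "%d" % 3.99      # The fractional part is dropped, not rounded
negative = "%i" % -3.99      # Truncation is toward zero
print(truncated, negative)   # 3 -3

# A string is not converted and raises TypeError instead.
try:
    "%d" % "3"
except TypeError:
    rejected = True
print(rejected)
```

If you want rounding rather than truncation, round the value yourself before formatting it.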

%u

This placeholder represents an unsigned decimal integer. In Python it is obsolete and behaves identically to %d.

%o

This placeholder represents an octal integer.

>>> number = 15
>>> print("%i in octal is %o" % (number, number))
15 in octal is 17

%x

Represents a hexadecimal integer using lowercase letters (a-f).

>>> number = 15
>>> print("%i in hex is %02x" % (number, number))
15 in hex is 0f

By using the "02" prefix in our placeholder, we're telling Python to print a two-character hex string.

%X

Represents a hexadecimal integer using uppercase letters (A-F).

>>> number = 15
>>> print("%i in hex is %04X" % (number, number))
15 in hex is 000F

And like the previous example, by using the "04" prefix in our placeholder, we're telling Python to print a four-character hex string.

%e

Represents an exponential notation with a lowercase "e".

%E

Represents an exponential notation with an uppercase "E".

%f

Represents a floating point real number.

>>> price = 15.95
>>> print("the price is %.2f" % price)
the price is 15.95

%g

The shorter version of %f and %e.

%G

The shorter version of %f and %E.
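
The exponential and general placeholders have no examples above, so here is a small sketch with arbitrary values. With %g and %G, Python picks fixed or exponential notation depending on the magnitude of the number:

```python
value = 1234.5678
tiny = 0.00001234

print("%e" % value)    # 1.234568e+03
print("%.2E" % value)  # 1.23E+03
print("%g" % value)    # 1234.57
print("%g" % tiny)     # 1.234e-05
print("%G" % tiny)     # 1.234E-05
```

By default %g keeps six significant digits, which is why 1234.5678 is shortened to 1234.57.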

The placeholders shown above allow you to format strings by specifying data types in your templates. However, these aren't the only features of the interpolation operator. In the next subsection we'll see how we can pad our strings with spaces using the % operator.

Aligning the Output

Up until now we've only shown how to format text strings by specifying simple placeholders. With the help of an additional numerical value, you can define the total space that shall be reserved on either side of a variable in the output string.

As an example, %10s reserves 10 characters and puts any extra spacing on the left side of the value, while %-10s puts the extra space to the right of the value. The default padding character is a space; for numeric placeholders, a leading 0 in the width (as in %02x earlier) pads with zeros instead.

>>> place = "London"
>>> print ("%10s is not a place in France" % place)  # Pad to the left
    London is not a place in France
>>> print ("%-10s is not a place in France" % place) # Pad to the right
London     is not a place in France

Dealing with numbers works in the same way:

>>> print ("The postcode is %10d." % 25000)    # Padding on the left side
The postcode is      25000.
>>> print ("The postcode is %-10d." % 25000)   # Padding on the right side
The postcode is 25000     .
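The width can be combined with a few other flags. The following sketch is not part of the examples above; it reuses the postcode value and the earlier price to show zero padding, an explicit sign, and a width taken from the argument tuple via "*".

```python
zero_padded = "%010d" % 25000    # leading 0 flag: pad with zeros, not spaces
signed = "%+d" % 25000           # + flag: always show the sign
dynamic = "%*d" % (10, 25000)    # * reads the width (10) from the tuple
price = "%08.2f" % 15.95         # width 8 includes the point and the decimals

print(zero_padded)  # 0000025000
print(signed)       # +25000
print(dynamic)      #      25000
print(price)        # 00015.95
```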

Truncating strings and rounding numbers is the counterpart to padding. Have a look at Rounding Numbers in Python in order to learn more about the traps that are hiding here.
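A brief sketch of both operations, using sample values chosen for illustration: a precision on %s cuts the string, while a precision on %f rounds the number. The last two lines show the kind of trap meant above, which comes from binary floating point representation and from ties being rounded to the nearest even digit.

```python
short = "%.3s" % "London"        # keep only the first three characters
column = "%-8.3s|" % "London"    # truncate to 3, then pad on the right to 8
rounded = "%.2f" % 2.675         # 2.675 is stored slightly below 2.675
half = "%.0f" % 0.5              # an exact tie rounds to the even neighbour

print(short)    # Lon
print(column)   # Lon     |
print(rounded)  # 2.67
print(half)     # 0
```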

Conclusion

In this article we saw how the interpolation (aka formatting) operator is a powerful way to format strings, which allows you to specify data type, floating point precision, and even spacing/padding.

21 Aug 2019 2:35pm GMT

Codementor: iOS and Android Localization Tool

Command Line Interface that converts CSV file to iOS, Android or JSON localizable strings

21 Aug 2019 12:40pm GMT

Martijn Faassen: Refactoring to Multiple Exit Points

Introduction

Functions should have only a single entry point. We all agree on that. But some people also argue that functions should have a single exit that returns the value. More people don't seem to care enough about how their functions are organized. I think that makes functions a lot more complicated than they have to be. So let's talk about function organization and how multiple exit points can help.

I'm going to use Python in the examples, but these examples apply to many other languages such as JavaScript and Ruby as well, so do keep reading.

Starting point

Let's consider the following function:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            if item.match == "A":
                result = item.payload
            elif item.match == "B":
                continue
            else:
                if item.other == "C":
                    result = item.override
                else:
                    result = bar
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result

It's a silly function, it's a hypothetical function, but there are plenty of functions with this kind of structure. They might not be born this way, but they've certainly grown into it. I find them difficult to follow. You can recognize them by one symptom already: quite a bit of indentation. You can also recognize them by trying to trace what happens in them; notice how your working memory fills up quickly.

Extract function from loop body

How would we go about refactoring it? The first step I would take is to extract the loop body into a separate function. You may say, why do so? Objections could be:

  • The loop body isn't reused in multiple places, so why should it be a function?
  • You have to manage function parameters whereas before all was conveniently available in the body of process_items.

That is all so, but let's do it anyway and see what happens, and then get back to this in the end:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result

def process_item(item, bar):
    if item.match == "A":
        result = item.payload
    elif item.match == "B":
        result = None
    else:
        if item.other == "C":
            result = item.override
        else:
            result = bar
    return result

We've had to extract two parameters - item and bar. It turns out process_item doesn't care about default. We've had to convert the continue to a result = None to keep things working properly, as now we always run into the if result is not None check whereas before we did not.
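A claim like "to keep things working properly" can be checked mechanically. The following is only a sketch: Item is a hypothetical stand-in (the article never defines what an item is), the two versions are renamed so they can coexist, and the sample values are arbitrary. It compares both versions on every list of up to two sample items.

```python
from collections import namedtuple
from itertools import product

# Hypothetical item type; only the four attributes used in the article.
Item = namedtuple("Item", ["match", "other", "payload", "override"])


def process_items_original(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            if item.match == "A":
                result = item.payload
            elif item.match == "B":
                continue
            else:
                if item.other == "C":
                    result = item.override
                else:
                    result = bar
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result


def process_item(item, bar):
    if item.match == "A":
        result = item.payload
    elif item.match == "B":
        result = None
    else:
        if item.other == "C":
            result = item.override
        else:
            result = bar
    return result


def process_items_extracted(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result


def check_equivalence():
    """Run both versions on the same inputs and count the agreeing cases."""
    samples = [
        Item(match, other, payload, override)
        for match, other, payload, override in product(
            ["A", "B", "Z"], ["C", "D"], [None, "payload"], [None, "override"]
        )
    ]
    sequences = [[]] + [[item] for item in samples]
    sequences += [list(pair) for pair in product(samples, repeat=2)]
    checked = 0
    for bar in (None, "bar"):
        for items in sequences:
            expected = process_items_original(items, bar, "default")
            actual = process_items_extracted(items, bar, "default")
            assert actual == expected, (items, bar)
            checked += 1
    return checked


print(check_equivalence(), "cases agree")
```

A check like this can be rerun after each of the later refactoring steps by swapping in the newer version of the function.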

Multiple exit points

We notice that result is only touched once in each code path in process_item. This means we can convert the function to use multiple exit points with the return statement, so let's do that:

def process_item(item, bar):
    if item.match == "A":
        return item.payload
    elif item.match == "B":
        return None
    else:
        if item.other == "C":
            return item.override
        else:
            return bar

Convert to guard clauses

That's still more complicated than it should be. Since we have early exit points, we can get rid of the elif and else clauses:

def process_item(item, bar):
    if item.match == "A":
        return item.payload
    if item.match == "B":
        return None
    if item.other == "C":
        return item.override
    else:
        return bar

Some indentation is gone, which is a good sign. And we see another else we can get rid of now:

def process_item(item, bar):
    if item.match == "A":
        return item.payload
    if item.match == "B":
        return None
    if item.other == "C":
        return item.override
    return bar

Pay attention to None

I think the return None case is special, so let's move that up. That's safe as A and B for item.match are mutually exclusive and this function has no side effects:

def process_item(item, bar):
    if item.match == "B":
        return None
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar

This function is now a lot more regular. If you read it past return None you can forget about the case where item.match == "B", and then forget about the case where item.match == "A", and then forget about the case where item.other == "C". In the original version that was a lot harder to see.

Why pay attention to None?

This last reorganization of the guard clauses may seem like a useless action. But I pay special attention to None (or null or undefined or however your language may name the absence of value). If you organize the guard clauses that deal with None to come earlier, it makes your functions more regular and thus more easy to read.

It also triggers you to consider whether perhaps item.match == "B" is something you can handle at the call site, which can lead to further refactorings. Later we'll consider that further in a bonus refactoring.

Languages that have an Option or Maybe type such as Haskell and Rust make this more obvious and have special ways to handle these cases -- the language forces you. TypeScript also tracks null/undefined in its type system. But in many other languages, such as Python, we're on our own. But we certainly still have to pay attention to None.

See also my article The Story of None.
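Python does offer optional help here through type hints. The sketch below is an assumption-laden illustration rather than part of the refactoring: the Item class, the choice of str for every field, and the describe function are all hypothetical. With annotations like these, an external checker such as mypy can point out places where a possibly-None result is used without a guard.

```python
from typing import NamedTuple, Optional


class Item(NamedTuple):
    # Hypothetical item type; the field types are assumed for illustration.
    match: str
    other: str
    payload: Optional[str]
    override: Optional[str]


def process_item(item: Item, bar: str) -> Optional[str]:
    if item.match == "B":
        return None
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar


def describe(item: Item, bar: str) -> str:
    result = process_item(item, bar)
    # Without this guard a type checker would reject result.upper(),
    # because result may still be None at this point.
    if result is None:
        return "nothing"
    return result.upper()


print(describe(Item("B", "C", "p", "o"), "bar"))  # nothing
print(describe(Item("A", "C", "p", "o"), "bar"))  # P
```

The annotations change nothing at runtime; they only make the nullable return value visible to tools and readers.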

Back to process_items

Now let's look at the process_items function again:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result

Multiple exit points

Let's first transform this so we return early when we can:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    else:
        return "No bar"
    if result is None:
        return default
    return result

Flip condition to create a guard

We can see clearly that "No bar" is returned if bar is None, so let's flip that condition:

def process_items(items, bar, default):
    result = None
    if bar is None:
        return "No bar"
    else:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    if result is None:
        return default
    return result

We can now see the else clause is not needed anymore, so let's unindent the for loop. We also move result = None below that guard clause for bar is None, as it's not needed until that point:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    result = None
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            break
    if result is None:
        return default
    return result

So it turns out in the rest of the function we can completely forget about bar being None. That's good. Maybe that guard can even be removed if we can somehow guarantee the non-None nature of bar at the call site. But we can't determine that in this limited example. Let's go on refactoring this function a bit more.

Turn loop break into early return

We take a look at the break. If result is not None, we break. Then after that we check if result is None. This can only happen if the loop never broke. If the loop did break we end up returning result.

So we can just as well do the return result immediately in the loop:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    result = None
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            return result
    if result is None:
        return default
    return result

Let's look at the bit of code past the end of the loop again. We know that result has to be None if it reaches there. It's initialized to None and the loop returns early if it's ever not None. So why do we even check whether result is None anymore? We can simply always return default:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    result = None
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            return result
    return default

We have no more business setting result to None before the loop starts. It's a local variable within the loop body now:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            return result
    return default

In review

Let's look at where we started and ended.

We started with this:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            if item.match == "A":
                result = item.payload
            elif item.match == "B":
                continue
            else:
                if item.other == "C":
                    result = item.override
                else:
                    result = bar
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result

And we ended with this:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            return result
    return default

def process_item(item, bar):
    if item.match == "B":
        return None
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar

The second version is much easier to follow, I think. (It's also a few lines shorter, but that's not that important.)

In defense of single-use functions

So we created a process_item function even though we only use it in one place. Earlier I asked why you would do such a thing. What benefits does that have?

  • We could convert the function to use guard clauses, removing a level of nesting and letting us come up with followup refactoring steps that simplified our code.
  • It's clearer to see what actually really matters in the loop and what doesn't, as it's spelled out in the parameters of the function.
  • We gave what happens in the for loop a name. process_item doesn't say much in this case, but in a real-world code base your function name can help you read your code more easily.
  • Maybe we'll end up reusing it after all!

It also can lead to interesting future refactorings as it's easier to see patterns. If you do OOP for instance, you may end up with a group of functions that all share the same set of arguments and this would suggest creating a class with methods. But let's leave OOP be and consider None.

A possible followup refactoring

We know bar cannot be None when process_item is called -- see our guard clause. If we know (or find a way to guarantee) that item.payload and item.override can never be None either, we can do this:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    for item in items:
        if item.match != "B":
            return process_item(item, bar)
    return default

def process_item(item, bar):
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar

Which then leads to the question whether we should filter items with item.match != "B" before they even reach process_items in the first place -- another potential refactoring.
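One way that filtering could look is sketched below. This goes a step beyond the code shown so far, so treat it as an extrapolation: Item, without_b and the _MISSING sentinel are hypothetical names, and the sketch keeps the assumption that item.payload and item.override are never None. Once the "B" items are removed at the call site, every remaining item yields a value, so process_items only needs to look at the first one.

```python
from collections import namedtuple

# Hypothetical item type with the attributes used in the article.
Item = namedtuple("Item", ["match", "other", "payload", "override"])

_MISSING = object()  # sentinel that marks "there were no items at all"


def process_item(item, bar):
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar


def without_b(items):
    """Call-site filter: drop the items that would have been skipped."""
    return (item for item in items if item.match != "B")


def process_items(items, bar, default):
    # Assumes the caller has already removed every item with match == "B".
    if bar is None:
        return "No bar"
    first = next(iter(items), _MISSING)
    if first is _MISSING:
        return default
    return process_item(first, bar)


items = [
    Item("B", "C", "p1", "o1"),
    Item("Z", "C", "p2", "o2"),
    Item("A", "D", "p3", "o3"),
]
print(process_items(without_b(items), "bar", "default"))  # o2
```

Whether this is an improvement depends on whether every caller can reasonably be made responsible for the filtering.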

All of these refactorings require knowledge of what's impossible in the code and the data -- its invariants. We don't know this in this contrived example. But in a real code base, you can find out. A static type system can help make these invariants explicit, but that doesn't mean that in a dynamically typed language we should forget about them.

Yes, I'm saying the same as what I said about None before -- whether something is nullable is an important example of an invariant.

Conclusion

It's sometimes claimed that not only should a function have a single entry point, but that it should also have a single exit. One could argue this from a sense of mathematical purity. But unless you work in a programming language that combines mathematical purity with convenience (compile-time checked match expressions help), that point seems moot to me. Many of us do not. (And no, we can't easily switch either.)

Another argument for single exit points comes from languages like C, where you have to free the memory you allocated before you exit a function, and you want to have a single place where you do the cleanup. But again that's irrelevant to many of us who use languages with automated garbage collection.
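Even when a Python function does hold a resource that needs explicit cleanup, multiple exit points need not get in the way, because a finally clause (or a with block) runs on every exit path. A minimal sketch, with a made-up helper function and an in-memory stream standing in for a real file:

```python
import io


def first_nonblank_line(stream, default):
    """Return the first non-blank line of stream, or default if there is none."""
    try:
        for line in stream:
            stripped = line.strip()
            if stripped:
                return stripped   # early exit from inside the loop
        return default            # second exit point
    finally:
        # Runs whichever return statement was taken, and also on exceptions.
        stream.close()


stream = io.StringIO("\n\nhello\nworld\n")
print(first_nonblank_line(stream, "nothing"))  # hello
print(stream.closed)                           # True
```

The cleanup lives in one place, just as it would with a single exit, while the body keeps its early returns.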

I hope to have shown you that for many of us, in many languages, multiple exit points can make code a lot more clear. It helps to expose invariants and potential invariants, which can then lead to followup refactorings.

P.S. If you like this content, consider following @faassen on Twitter. That's me! Besides many other things, I sometimes talk about code there too.

21 Aug 2019 11:12am GMT

Martijn Faassen: Refactoring to Multiple Exit Points

Introduction

Functions should have only a single entry point. We all agree on that. But some people also argue that functions should have a single exit that returns the value. More people don't seem to care enough about how their functions are organized. I think that makes functions a lot more complicated than they have to be. So let's talk about function organization and how multiple exit points can help.

I'm going to use Python in the examples, but these examples apply to many other languages such as JavaScript and Ruby as well, so do keep reading.

Starting point

Let's consider the following function:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            if item.match == "A":
                result = item.payload
            elif item.match == "B":
                continue
            else:
                if item.other == "C":
                    result = item.override
                else:
                    result = bar
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result

It's a silly function, it's a hypothetical function, but there are plenty of functions with this kind of structure. They might not be born this way, but they've certainly grown into it. I find them difficult to follow. You can recognize them by one symptom already: quite a bit of indentation. You can also recognize them by trying to trace what happens in them; notice how your working memory fills up quickly.

Extract function from loop body

How would we go about refactoring it? The first step I would take is to extract the loop body into a separate function. You may say, why do so? Objections could be:

  • The loop body isn't reused in multiple places, so why should it be a function?
  • You have to manage function parameters whereas before all was conveniently available in the body of foo.

That is all so, but let's do it anyway and see what happens, and then get back to this in the end:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result

def process_item(item, bar):
    if item.match == "A":
        result = item.payload
    elif item.match == "B":
        result = None
    else:
        if item.other == "C":
            result = item.override
        else:
            result = bar
    return result

We've had to extract two parameters - item and bar. It turns out process_item doesn't care about default. We've had to convert the continue to a result = None to keep things working properly, as now we always run into the if result is not None check whereas before we did not.

Multiple exit points

We notice that result is only touched once in each code path in process_item. This means we can convert the function to use multiple exit points with the return statement, so let's do that:

def process_item(item, bar):
    if item.match == "A":
        return item.payload
    elif item.match == "B":
        return None
    else:
        if item.other == "C":
            return item.override
        else:
            return bar

Convert to guard clauses

That's still more complicated than it should be. Since we have early exit points, we can get rid of the elif and else clauses:

def process_item(item, bar):
    if item.match == "A":
        return item.payload
    if item.match == "B":
        return None
    if item.other == "C":
        return item.override
    else:
        return bar

Some indentation is gone, which is a good sign. And we see another else we can get rid of now:

def process_item(item, bar):
    if item.match == "A":
        return item.payload
    if item.match == "B":
        return None
    if item.other == "C":
        return item.override
    return bar

Pay attention to None

I think the return None case is special, so let's move that up. That's safe as A and B for item.match are mutually exclusive and this function has no side effects:

def process_item(item, bar):
    if item.match == "B":
        return None
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar

This function is now a lot more regular. If you read it past return None you can forget about the case where item.match == "B", and then forget about the case where item.match == "A", and then forget about the case where item.other == "C". In the original version that was a lot harder to see.

Why pay attention to None?

This last reorganization of the guard clauses may seem like a useless action. But I pay special attention to None (or null or undefined or however your language may name the absence of value). If you organize the guard clauses that deal with None to come earlier, it makes your functions more regular and thus more easy to read.

It also triggers you to consider whether perhaps item.match == "B" is something you can handle at the call site, which can lead to further refactorings. Later we'll consider that further in a bonus refactoring.

Languages that have an Option or Maybe type such as Haskell and Rust make this more obvious and have special ways to handle these cases -- the language forces you. TypeScript also tracks tracks null/undefined in its type system. But in many other languages, such as Python, we're on our own. But we certainly still have to pay attention to None.

See also my the Story of None.

Back to process_items

Now let's look at the process_items function again:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result

Multiple exit points

Let's first transform this so we return early when we can:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    else:
        return "No bar"
    if result is None:
        return default
    return result

Flip condition to create a guard

We can see clearly that "No bar" is returned if bar is None, so let's flip that condition:

def process_items(items, bar, default):
    result = None
    if bar is None:
        return "No bar"
    else:
        for item in items:
            result = process_item(item, bar)
            if result is not None:
                break
    if result is None:
        return default
    return result

We can now see the else clause is not needed anymore, so let's unindent the for loop. We also move result = None below that guard clause for bar is None, as it's not needed until that point:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    result = None
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            break
    if result is None:
        return default
    return result

So it turns out in the rest of the function we can completely forget about bar being None. That's good. Maybe that guard can even be removed if we can somehow guarantee the non-None nature of bar at the call site. But we can't determine that in this limited example. Let's go on refactoring this function a bit more.

Turn loop break into early return

We take a look at the break. If result is not None, we break. Then after that we check if result is None. This can only happen if the loop never breaked. If the loop did break we end up returning result.

So we can just as well do the return result immediately in the loop:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    result = None
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            return result
    if result is None:
        return default
    return result

Let's look at the bit of code past the end of the loop again. We know that result has to be None if it reaches there. It's initialized to None and the loop returns early if it's ever not None. So why do we even check whether result is None anymore? We can simply always return default:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    result = None
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            return result
    return default

We have no more business setting result to None before the loop starts. It's a local variable within the loop body now:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            return result
    return default

In review

Let's look at where we started and ended.

We started with this:

def process_items(items, bar, default):
    result = None
    if bar is not None:
        for item in items:
            if item.match == "A":
                result = item.payload
            elif item.match == "B":
                continue
            else:
                if item.other == "C":
                    result = item.override
                else:
                    result = bar
            if result is not None:
                break
    else:
        result = "No bar"
    if result is None:
        result = default
    return result

And we ended with this:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    for item in items:
        result = process_item(item, bar)
        if result is not None:
            return result
    return default

def process_item(item, bar):
    if item.match == "B":
        return None
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar

The second version is much easier to follow, I think. (it's also a few lines less code, but that's not that important.)

In defense of single-use functions

So we created a process_item function even though we only use it in one place. Earlier asked why you would do such a thing. What benefits does that have?

  • We could convert the function to use guard clauses, removing a level of nesting and letting us come up with followup refactoring steps that simplified our code.
  • It's clearer to see what actually really matters in the loop and what doesn't, as it's spelled out in the parameters of the function.
  • We gave what happens in the for loop a name. process_item doesn't say much in this case, but in a real-world code base your function name can help you read your code more easily.
  • Maybe we'll end up reusing it after all!

It can also lead to interesting future refactorings, as it's easier to see patterns. If you do OOP, for instance, you may end up with a group of functions that all share the same set of arguments, which would suggest creating a class with methods. But let's leave OOP be and consider None.

A possible followup refactoring

We know bar cannot be None when process_item is called -- see our guard clause. If we know (or find a way to guarantee) that item.payload and item.override can never be None either, we can do this:

def process_items(items, bar, default):
    if bar is None:
        return "No bar"
    for item in items:
        if item.match != "B":
            return process_item(item, bar)
    return default

def process_item(item, bar):
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar

Which then leads to the question whether we should filter items with item.match != "B" before they even reach process_items in the first place -- another potential refactoring.
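Here is a rough sketch of what that filtering could look like. It still assumes that item.payload and item.override are never None. The without_b helper is a hypothetical name, and the sample items are invented for illustration.

```python
from types import SimpleNamespace


def process_item(item, bar):
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar


def process_items(items, bar, default):
    # Assumes the caller already removed items with match == "B"
    # and that payload and override are never None.
    if bar is None:
        return "No bar"
    for item in items:
        # The first item decides the result.
        return process_item(item, bar)
    return default


def without_b(items):
    """Hypothetical helper: skip items that process_items should never see."""
    return (item for item in items if item.match != "B")


items = [
    SimpleNamespace(match="B", other="C", payload="skipped", override="skipped"),
    SimpleNamespace(match="A", other="C", payload="first A", override="unused"),
]
print(process_items(without_b(items), "bar", "default"))  # first A
```

With the filter moved out, the loop only exists to pick the first item, which might suggest yet another simplification.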

All of these refactorings require knowledge of what's impossible in the code and the data -- its invariants. We don't know this in this contrived example. But in a real code base, you can find out. A static type system can help make these invariants explicit, but that doesn't mean that in a dynamically typed language we should forget about them.

Yes, I'm saying the same as what I said about None before -- whether something is nullable is an important example of an invariant.
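In Python, type hints are one way to write such invariants down. The sketch below is an assumption on my part: the example never says what types the attributes have, so the Item dataclass and its str fields are invented for illustration. Python does not enforce these hints at runtime; you need a separate type checker such as mypy for that.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Item:
    # Field types are assumptions made for this sketch.
    match: str
    other: str
    payload: str   # not Optional: we claim this is never None
    override: str  # not Optional: we claim this is never None


def process_item(item: Item, bar: str) -> str:
    # bar is a plain str here: the guard clause in process_items
    # guarantees it is not None by the time we get called.
    if item.match == "A":
        return item.payload
    if item.other == "C":
        return item.override
    return bar


def process_items(items: Iterable[Item], bar: Optional[str], default: str) -> str:
    if bar is None:
        return "No bar"
    for item in items:
        if item.match != "B":
            return process_item(item, bar)
    return default


example = [Item("B", "C", "skipped", "skipped"), Item("X", "C", "unused", "override")]
print(process_items(example, "bar", "default"))  # override
```

The signature of process_item now says what the guard clause only implied: bar may be None when it enters process_items, but never when it reaches process_item.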

Conclusion

It's sometimes claimed that not only should a function have a single entry point, but that it should also have a single exit. One could argue this from a sense of mathematical purity. But unless you work in a programming language that combines mathematical purity with convenience (compile-time checked match expressions help), that point seems moot to me. Many of us do not. (And no, we can't easily switch either.)

Another argument for single exit points comes from languages like C, where you have to free the memory you allocated before you exit a function, and you want to have a single place where you do the cleanup. But again that's irrelevant to many of us who use languages with automated garbage collection.
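Even for resources a garbage collector does not manage for you, such as files or locks, Python does not force a single exit: a with block runs its cleanup however the function returns. Here's a minimal sketch; resource, cleanup_log and first_positive are made-up names for illustration.

```python
from contextlib import contextmanager

cleanup_log = []


@contextmanager
def resource(name):
    """Hypothetical resource that records when it is released."""
    try:
        yield name
    finally:
        # Runs on every way out of the with block.
        cleanup_log.append(f"released {name}")


def first_positive(numbers):
    with resource("buffer"):
        for number in numbers:
            if number > 0:
                return number  # early exit; cleanup still runs
        return None


print(first_positive([-1, 3, 5]))  # 3
print(cleanup_log)  # ['released buffer']
```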

I hope to have shown you that for many of us, in many languages, multiple exit points can make code a lot clearer. They help to expose invariants and potential invariants, which can then lead to followup refactorings.

P.S. If you like this content, consider following @faassen on Twitter. That's me! Besides many other things, I sometimes talk about code there too.

21 Aug 2019 11:12am GMT

Catalin George Festila: Python Qt5 - contextMenu example.

A context menu is a menu in a graphical user interface (GUI) that appears upon user interaction, such as a right-click mouse operation. I create the default application and use QMenu to create this context menu with New, Open and Quit.

from PyQt5 import QtGui
from PyQt5.QtWidgets import QApplication, QMainWindow, QMenu
import sys

class Window(QMainWindow):
    def __init__(self):

21 Aug 2019 11:06am GMT

10 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: King William's Town Station

Yesterday morning I had to go to the station in KWT to pick up our reserved bus tickets for the Christmas holidays in Cape Town. The station itself has had no train service since December for cost reasons, but Translux and co., the long-distance bus companies, have their offices there.


© benste CC NC SA

10 Nov 2011 10:57am GMT

09 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein

Nobody worries about something like this. By car you simply drive through, and in the city, near Gnobie, you hear: "nah, it only gets dangerous once the fire brigade is there". 30 minutes later, on the way back, the fire brigade was there.




© benste CC NC SA

09 Nov 2011 8:25pm GMT

08 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Brai Party

Brai = a barbecue evening or similar.

The would-be technicians patching up their SpeakOn / jack plug splitters...

The ladies ("Mamas") of the settlement during the official opening speech

Even though fewer people came than expected: loud music and lots of people ...

And of course a fire with real wood for the barbecue.

© benste CC NC SA

08 Nov 2011 2:30pm GMT

07 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Lumanyano Primary

One of our missions was bringing Katja's Linux Server back to her room. While doing that we saw her new decoration.

Björn and Simphiwe carried the PC to Katja's school


© benste CC NC SA

07 Nov 2011 2:00pm GMT

06 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nelisa Haircut

Today I went with Björn to Needs Camp to visit Katja's host family for a special party. First of all we visited some friends of Nelisa (yes, the one I'm working with in Quigney; she is Katja's host father's sister), who did her hair.

African women usually get their hair done by arranging extensions, rather than just cutting some hair as Europeans do.

In between she looked like this...

And then she was done - looks amazing considering the amount of hair she had last week, doesn't it?

© benste CC NC SA

06 Nov 2011 7:45pm GMT

05 Nov 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: My Saturday

Somehow it struck me today that I need to restructure my blog posts a bit: if I only ever report on new places, I would have to be on a round trip. So here are a few things from my everyday life today.

First of all, Saturday counts as a day off, at least for us volunteers.

This weekend only Rommel and I are on the farm. Katja and Björn are by now at their placements, and my housemates Kyle and Jonathan are at home in Grahamstown, as is Sipho, who lives in Dimbaza.
Robin, Rommel's wife, has been in Woodie Cape since Thursday to take care of a few things there.
Anyway, this morning we first treated ourselves to a shared Weetbix/muesli breakfast and then set off for East London. Two things were on the checklist: Vodacom and Ethienne (an estate agent), plus dropping off the missing things in NeedsCamp on the way back.

Just after we had set off on the dirt road, we realised that we had not packed the things for Needscamp and Ethienne, but did have the pump for the water supply in the car.

So in East London we first drove to Farmerama (no, not the online game Farmville, but a shop with lots of things for a farm) in Berea, a northern district of the city.

At Farmerama we got advice on a quick-release coupling that should make life with the pump easier, and we also brought in a lighter pump for repair, so that it is not always such a big effort when the water has run out again.

Fego Caffé is in the Hemmingways Mall. There we had to get the PIN and PUK of one of our data SIM cards, because unfortunately two digits had been transposed when entering the PIN. Anyway, shops in South Africa store data as sensitive as a PUK, which in principle gives access to a locked phone.

In the café Rommel then carried out a few online transactions with the 3G modem, which was now working again and which, by the way, now works perfectly in Ubuntu, my Linux system.

Meanwhile I went to 8ta to find out about their new deals, since we want to offer internet in some of Hilltop's centres. The picture shows the UMTS coverage in NeedsCamp, Katja's village. 8ta is a new phone provider from Telkom; after Vodafone bought Telkom's shares in Vodacom, they have to build up everything from scratch.
We decided to get a free prepaid card to test, because who knows how accurate the map above is ... Before you sign even the cheapest deal for 24 months, you should know whether it works.

After that we went to Checkers in Vincent, looking for two hotplates for WoodyCape: R 129.00 each, so about 12€ for a two-burner hotplate.
As you can see in the background, there are already Christmas decorations, at the beginning of November, and that in South Africa at a sunny, warm 25°C or more.

We treated ourselves to lunch at a Pakistani curry takeaway - highly recommended!
Well, and after we got back about an hour ago, I cleaned the fridge, which I had simply put outside this morning to defrost. Now it is clean again, and without a 3 m thick layer of ice...

Tomorrow ... well, I will report on that separately ... but probably not until Monday, because then I will be back in Quigney (East London) and have free internet.

© benste CC NC SA

05 Nov 2011 4:33pm GMT

31 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Sterkspruit Computer Center

Sterkspruit is one of Hilltop's computer centres in the far north of the Eastern Cape. On the trip to J'burg we used the opportunity to take a look at the centre.

Pupils in the big classroom


The Trainer


School in Countryside


Adult Class in the Afternoon


"Town"


© benste CC NC SA

31 Oct 2011 4:58pm GMT

Benedict Stein: Technical Issues

What do you do in an internet cafe if your ADSL and fax line have been discontinued before month's end? Well, my idea was to sit outside and eat some ice cream.
At least it's sunny and not as rainy as on the weekend.


© benste CC NC SA

31 Oct 2011 3:11pm GMT

30 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: Nellis Restaurant

For those who are traveling through Zastron - there is a very nice restaurant which serves delicious food at reasonable prices.
In addition they sell home-made juices, jams and honey.




interior


home made specialities - the shop in the shop


the Bar


© benste CC NC SA

30 Oct 2011 4:47pm GMT

29 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: The way back from J'burg

On the 10-12h trip from J'burg back to ELS I was able to take a lot of pictures, including these different roadsides

Plain Street


Orange River in its beginnings (near Lesotho)


Zastron Anglican Church


The bridge between "Free State" and Eastern Cape, next to Zastron


my new Background ;)


If you listen to Google Maps you'll end up traveling 50km of gravel road. As it had just been renewed we didn't have that many problems, and we saved 1h compared to going the official way with all its construction sites




Freeway


getting dark


© benste CC NC SA

29 Oct 2011 4:23pm GMT

28 Oct 2011

feedPython Software Foundation | GSoC'11 Students

Benedict Stein: How does a construction site actually work?

Sure, some things may be different and many the same, but take a road construction site, an everyday sight in Germany: how does that actually work in South Africa?

First of all: NO, there are no natives digging with their hands. Even though more manpower is used here, they are busy working with technology.

A perfectly normal "national road"


and how it is being widened


loooots of trucks


because here one side is closed completely over a long stretch, resulting in a traffic-light cycle with, in this case, a 45-minute wait


But at least they seem to be having fun ;) - As were we, because luckily we never had to wait longer than 10 min.

© benste CC NC SA

28 Oct 2011 4:20pm GMT