05 Dec 2024
Planet Python
Christian Ledermann: Trusted publishing – It has never been easier to publish your Python packages
Publishing Python packages used to be a daunting task, but not any more. Even better, it has become significantly more secure. Gone are the days of juggling usernames, passwords, or API tokens while relying on CLI tools. With trusted publishing, you simply provide PyPI with the details of your GitHub repository, and GitHub Actions takes care of the heavy lifting.
How to Publish Your Python Package with Trusted Publishing
I will introduce a workflow that publishes your package to TestPyPI when a tag is created (on the development branch), or to PyPI when you merge to the main branch.
Prepare Your Package for Publishing
Ensure your Python package follows PyPI's packaging guidelines. At a minimum, you'll need:
- A setup.py or pyproject.toml file defining your package metadata.
- Properly structured code with a clear directory layout.
- A README file to showcase your project on PyPI.
For a detailed checklist, refer to the Python Packaging User Guide.
Configure GitHub Actions in Your Repository
Let's start by creating a new GitHub Actions workflow at .github/workflows/test-build-publish.yml.
name: test-build-publish
on: [push, pull_request]
permissions:
  contents: read
jobs:
  build-and-check-package:
    name: Build & inspect our package.
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hynek/build-and-inspect-python-package@v2
This action builds your package and uploads the built wheel and the source distribution (SDist) as GitHub Actions artefacts.
Next, we add a step to publish to TestPyPI. This step will run whenever a tag is created, ensuring that the build from the previous step has completed successfully. Replace PROJECT_OWNER and PROJECT_NAME with the appropriate values for your repository.
  test-publish:
    if: >-
      github.event_name == 'push' &&
      github.repository == 'PROJECT_OWNER/PROJECT_NAME' &&
      startsWith(github.ref, 'refs/tags')
    needs: build-and-check-package
    name: Test publish on TestPyPI
    runs-on: ubuntu-latest
    environment: test-release
    permissions:
      id-token: write
    steps:
      - name: Download packages built by build-and-check-package
        uses: actions/download-artifact@v4
        with:
          name: Packages
          path: dist
      - name: Upload package to Test PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          repository-url: https://test.pypi.org/legacy/
This step downloads the artefacts created during the build process and uploads them to TestPyPI for testing.
In the last step, we will upload the package to PyPI when a pull request is merged into the main branch.
  publish:
    if: >-
      github.event_name == 'push' &&
      github.repository == 'PROJECT_OWNER/PROJECT_NAME' &&
      github.ref == 'refs/heads/main'
    needs: build-and-check-package
    name: Publish to PyPI
    runs-on: ubuntu-latest
    environment: release
    permissions:
      id-token: write
    steps:
      - name: Download packages built by build-and-check-package
        uses: actions/download-artifact@v4
        with:
          name: Packages
          path: dist
      - name: Publish distribution 📦 to PyPI for push to main
        uses: pypa/gh-action-pypi-publish@release/v1
Configure GitHub Environments
Next, configure GitHub environments to ensure that only specific tags trigger the publishing workflow and to maintain control over your release process.
Create a new environment called test-release by navigating to Settings -> Environments in your GitHub repository.
Set up the environment and add a deployment tag rule. Limit which branches and tags can deploy to this environment based on naming patterns, then configure the target tags.
The pattern [0-9]*.[0-9]*.[0-9]* matches semantic versioning tags such as 1.2.3, 0.1.0, or 2.5.1b3, but it excludes arbitrary tags like bugfix-567 or feature-update.
Repeat this for the release environment, protecting it in the same way, but this time targeting the main branch instead of tags.
Set Up a PyPI Project and Link Your GitHub Repository
Create an account on TestPyPI if you don't have one.
Navigate to your account's Publishing settings and add a new pending publisher.
Link your GitHub repository to the PyPI project by providing its name, your GitHub username, the repository name, the workflow name (test-build-publish.yml), and the environment name (test-release).
Repeat the above on PyPI with the environment name set to release.
Test the Workflow
Now, whenever you create a tag on your development branch, a release will be uploaded to TestPyPI, and merging the development branch into main will upload a release to PyPI.
What Wasn't Covered
While this guide provides an introduction to trusted publishing workflows, there are additional steps and best practices you might consider implementing. For example, setting up branch protection rules can ensure only authorized collaborators can push tags or merge to protected branches, like main or develop. You can also enforce status checks or require pull request reviews before merging, adding another layer of quality assurance.
Have a look at my python-repository-template, which covers additional enhancements to this workflow, such as requiring unit and static tests to pass, checking the package with pyroma, and ensuring that your tag matches the version of your package with vercheck.
Summary
If you've been holding back on sharing your work, now is the perfect time to try trusted publishing.
- Introducing 'Trusted Publishers': the Python Package Index blog highlights a more secure publishing method that does not require long-lived passwords or API tokens to be shared with external systems.
- Publishing to PyPI with a Trusted Publisher: the official PyPI documentation to get started with using trusted publishers on PyPI.
- Building and testing Python in the official GitHub docs.
05 Dec 2024 6:52pm GMT
Luke Plant: Check if a point is in a cylinder - geometry and code
In my current project I'm doing a fair amount of geometry, and one small problem I needed to solve a while back was finding whether a point is inside a cylinder.
The accepted answer for this on math.stackexchange.com wasn't ideal - part of it was very over-complicated, and also didn't work under some circumstances. So I contributed my own answer. In this post, in addition to the maths, I'll give an implementation in Python.
Method
We can solve this problem by constructing the cylinder negatively:
- Start with an infinite space.
- Throw out everything that isn't within the cylinder.
This is a classic mathematician's approach, but it works great here, and it also works pretty well for any simply-connected solid object with only straight or concave surfaces, depending on how complex those surfaces are.
First some definitions:
- Our cylinder is defined by two points, A and B, and a radius R.
- The point we want to test is P.
- The vectors from the origin to points A, B and P are \(\boldsymbol{r}_A\), \(\boldsymbol{r}_B\) and \(\boldsymbol{r}_P\) respectively.
We start with the infinite space, that is we assume all points are within the cylinder until we show they aren't.
Then we construct 3 cuts to exclude certain values of \(\boldsymbol{r}_P\).
First, a cylindrical cut of radius R about an infinite line that goes through A and B. (This was taken from John Alexiou's answer in the link above):
- The vector from point A to point B is:
\begin{equation*} \boldsymbol{e} = \boldsymbol{r}_B-\boldsymbol{r}_A \end{equation*}
This defines the direction of the line through A and B.
- The distance from any point P at vector \(\boldsymbol{r}_P\) to the line is:
\begin{equation*} d = \frac{\| \boldsymbol{e}\times\left(\boldsymbol{r}_{P}-\boldsymbol{r}_{A}\right) \|}{\|\boldsymbol{e}\|} \end{equation*}
This is based on finding the distance of a point to a line, and using point A as an arbitrary point on the line. We could equally have used B.
- We then simply exclude all points with \(d > R\). This can be optimised slightly by squaring both sides of the comparison to avoid two square root operations, as sketched just after this list.
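To make that optimisation concrete, here is a minimal sketch (not from the original post) of the squared form of the test; it reuses the cross_product and dot_product helpers from the implementation further down:

def point_outside_radius(e, r_p_minus_r_a, radius):
    # d > R is equivalent to |e x (r_P - r_A)|^2 > R^2 * |e|^2,
    # which avoids the square roots hidden inside the two vector norms.
    c = cross_product(e, r_p_minus_r_a)
    return dot_product(c, c) > radius**2 * dot_product(e, e)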
Second, a planar cut that throws away the space above the "top" of the cylinder, which I'm calling A.
- The plane is defined by any point on it (A will do), and any normal pointing out of the cylinder; \(-\boldsymbol{e}\) will do (i.e. in the opposite direction to \(\boldsymbol{e}\) as defined above).
- We can see which side of this plane a point is on, as per Relation between a point and a plane. The point is "above" the plane (in the direction of the normal) if:
\begin{equation*} (\boldsymbol{r}_P - \boldsymbol{r}_A) \cdot -\boldsymbol{e} > 0 \end{equation*}
- We exclude points which match the above.
Third, a planar cut which throws away the space below the bottom of the cylinder, B.
- This is the same as the previous step, but with the other end of the cylinder and the normal vector in the other direction, so the condition is:
\begin{equation*} (\boldsymbol{r}_P - \boldsymbol{r}_B) \cdot \boldsymbol{e} > 0 \end{equation*}
Python implementation
Below is a minimal implementation with zero dependencies outside the standard lib, in which there are just enough classes, with just enough methods, to express the algorithm neatly, following the above steps exactly.
In a real implementation:
- You might separate out a Point class as being semantically different from Vec.
- You could move some functions to be methods (in my real implementation, I have a Cylinder.contains_point() method, for example).
- You should probably use @dataclass(frozen=True) - immutable objects are a good default. I didn't use them here because they aren't needed and I'm focusing on clarity of the code.
- Conversely, if performance is more of a consideration, and you don't have a more general need for classes like Vec and Cylinder:
  - you might use a more efficient representation, such as a tuple or list, or a numpy array, especially if you wanted bulk operations.
  - you could use more generic dot product functions etc. from numpy.
  - you might inline more of the algorithm into a single function.
- For clarity, I also have not implemented the optimisation mentioned above in which you can avoid doing some square root operations.
from __future__ import annotations

import math
from dataclasses import dataclass

# Implementation of https://math.stackexchange.com/questions/3518495/check-if-a-general-point-is-inside-a-given-cylinder
# See the accompanying blog post http://lukeplant.me.uk/blog/posts/check-if-a-point-is-in-a-cylinder-geometry-and-code/


# -- Main algorithm --

def cylinder_contains_point(cylinder: Cylinder, point: Vec) -> bool:
    # First condition: distance from axis
    cylinder_direction: Vec = cylinder.end - cylinder.start
    point_distance_from_axis: float = abs(
        cross_product(
            cylinder_direction,
            (point - cylinder.start),
        )
    ) / abs(cylinder_direction)
    if point_distance_from_axis > cylinder.radius:
        return False

    # Second condition: point must lie below the top plane.
    # Third condition: point must lie above the bottom plane
    # We construct planes with normals pointing out of the cylinder at both
    # ends, and exclude points that are outside ("above") either plane.
    start_plane = Plane(cylinder.start, -cylinder_direction)
    if point_is_above_plane(point, start_plane):
        return False

    end_plane = Plane(cylinder.end, cylinder_direction)
    if point_is_above_plane(point, end_plane):
        return False

    return True


# -- Supporting classes and functions --

@dataclass
class Vec:
    """
    A Vector in 3 dimensions, also used to represent points in space
    """

    x: float
    y: float
    z: float

    def __add__(self, other: Vec) -> Vec:
        return Vec(self.x + other.x, self.y + other.y, self.z + other.z)

    def __sub__(self, other: Vec) -> Vec:
        return self + (-other)

    def __neg__(self) -> Vec:
        return -1 * self

    def __mul__(self, scalar: float) -> Vec:
        return Vec(self.x * scalar, self.y * scalar, self.z * scalar)

    def __rmul__(self, scalar: float) -> Vec:
        return self * scalar

    def __abs__(self) -> float:
        return math.sqrt(self.x**2 + self.y**2 + self.z**2)


@dataclass
class Plane:
    """
    A plane defined by a point on the plane, `origin`, and a `normal` vector
    to the plane.
    """

    origin: Vec
    normal: Vec


@dataclass
class Cylinder:
    """
    A closed cylinder defined by start and end points along the center line
    and a radius
    """

    start: Vec
    end: Vec
    radius: float


def cross_product(a: Vec, b: Vec) -> Vec:
    return Vec(
        a.y * b.z - a.z * b.y,
        a.z * b.x - a.x * b.z,
        a.x * b.y - a.y * b.x,
    )


def dot_product(a: Vec, b: Vec) -> float:
    return a.x * b.x + a.y * b.y + a.z * b.z


def point_is_above_plane(point: Vec, plane: Plane) -> bool:
    """
    Returns True if `point` is above the plane - that is on the side of the
    plane which is in the direction of the plane `normal`.
    """
    # See https://math.stackexchange.com/a/2998886/78071
    return dot_product((point - plane.origin), plane.normal) > 0


# -- Tests --

def test_cylinder_contains_point():
    # Test cases constructed with help of Geogebra - https://www.geogebra.org/calculator/tnc3arfm
    cylinder = Cylinder(start=Vec(1, 0, 0), end=Vec(6.196, 3, 0), radius=0.5)

    # In the Z plane:
    assert cylinder_contains_point(cylinder, Vec(1.02, 0, 0))
    assert not cylinder_contains_point(cylinder, Vec(0.98, 0, 0))  # outside bottom plane
    assert cylinder_contains_point(cylinder, Vec(0.8, 0.4, 0))
    assert not cylinder_contains_point(cylinder, Vec(0.8, 0.5, 0))  # too far from center
    assert not cylinder_contains_point(cylinder, Vec(0.8, 0.3, 0))  # outside bottom plane
    assert cylinder_contains_point(cylinder, Vec(1.4, -0.3, 0))
    assert not cylinder_contains_point(cylinder, Vec(1.4, -0.4, 0))  # too far from center
    assert cylinder_contains_point(cylinder, Vec(6.2, 2.8, 0))
    assert not cylinder_contains_point(cylinder, Vec(6.2, 2.2, 0))  # too far from center
    assert not cylinder_contains_point(cylinder, Vec(6.2, 3.2, 0))  # outside top plane

    # Away from Z plane
    assert cylinder_contains_point(cylinder, Vec(1.02, 0, 0.2))
    assert not cylinder_contains_point(cylinder, Vec(1.02, 0, 1))  # too far from center
    assert not cylinder_contains_point(cylinder, Vec(0.8, 0.3, 2))  # too far from center, and outside bottom plane
05 Dec 2024 4:40pm GMT
EuroPython Society: EPS Board 2024-2025
We're happy to announce our new board for the 2024-2025 term:
- Anders Hammarquist
- Aris Nivorils
- Artur Czepiel (Chair)
- Cyril Bitterich
- Ege Akman
- Mia Bajić (Vice Chair)
- Shekhar Koirala
You can read more about them in their nomination post at https://www.europython-society.org/list-of-eps-board-candidates-for-2024-2025/. The minutes and the video recording of the General Assembly 2024 will be published soon.
Together, we will continue to serve the community and begin preparations for EuroPython 2025!
05 Dec 2024 3:53pm GMT
PyCharm: How to Do Sentiment Analysis With Large Language Models
Sentiment analysis is a powerful tool for understanding emotions in text. While there are many ways to approach sentiment analysis, including more traditional lexicon-based and machine learning approaches, today we'll be focusing on one of the most cutting-edge ways of working with text - large language models (LLMs). We'll explain how you can use these powerful models to predict the sentiment expressed in a text.
As a practical tutorial, this post will introduce you to the types of LLMs most suited for sentiment analysis tasks and then show you how to choose the right model for your specific task.
We'll cover using models that other people have fine-tuned for sentiment analysis and how to fine-tune one yourself. We'll also look at some of the powerful tools and resources available that can help you work with these models easily, while demystifying what can feel like an overly complex and overwhelming topic.
To get the most out of this blog post, we'd recommend you have some experience training machine learning or deep learning models and be confident using Python. That said, you don't necessarily need to have a background in large language models to enjoy it.
Let's get started!
What are large language models?
Large language models are some of the latest and most powerful tools for solving natural language problems. In brief, they are generalist language models that can complete a range of natural language tasks, from named entity recognition to question answering. LLMs are based on the transformer architecture, a type of neural network that uses a mechanism called attention to represent complex and nuanced relationships between words in a piece of text. This design allows LLMs to accurately represent the information being conveyed in a piece of text.
The full transformer model architecture consists of two blocks. Encoder blocks are designed to receive text inputs and build a representation of them, creating a feature set based on the text corpus over which the model is trained. Decoder blocks take the features generated by the encoder and other inputs and attempt to generate a sequence based on these.
Transformer models can be divided up based on whether they contain encoder blocks, decoder blocks, or both.
- Encoder-only models tend to be good at tasks requiring a detailed understanding of the input to do downstream tasks, like text classification and named entity recognition.
- Decoder-only models are best for tasks such as text generation.
- Encoder-decoder, or sequence-to-sequence models are mainly used for tasks that require the model to evaluate an input and generate a different output, such as translation. In fact, translation was the original task that transformer models were designed for!
This Hugging Face table (also featured below), which I took from their course on natural language processing, gives an overview of what each model tends to be strongest at.
After finishing this blog post and discovering what other natural language tasks you can perform with the Transformers library, I recommend the course if you'd like to learn more about LLMs. It strikes an excellent balance between accessibility and technical depth.
| Model type | Examples | Tasks |
| --- | --- | --- |
| Encoder-only | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering |
| Decoder-only | CTRL, GPT, GPT-2, Transformer XL | Text generation |
| Encoder-decoder | BART, T5, Marian, mBART | Summarization, translation, generative question answering |
Sentiment analysis is usually treated as a text or sentence classification problem with LLMs, meaning that encoder-only models such as RoBERTa, BERT, and ELECTRA are most often used for this task. However, there are some exceptions. For example, the top scoring model for aspect-based sentiment analysis, InstructABSA, is based on a fine-tuned version of T5, an encoder-decoder model.
Using large language models for sentiment analysis
With all of the background out of the way, we can now get started with using LLMs to do sentiment analysis.
Install PyCharm to get started with sentiment analysis
We'll use PyCharm Professional for this demo, but you can follow along with any other IDE that supports Python development.
PyCharm Professional is a powerful Python IDE for data science. It supports advanced Python code completion, inspections and debugging, rich databases, Jupyter, Git, Conda, and more right out of the box. You can try out great features such as our DataFrame Column Statistics and Chart View, as well as Hugging Face integrations, which make working with LLMs much simpler and faster.
If you'd like to follow along with this tutorial, you can activate your free three-month subscription to PyCharm using this special promo code: PCSA. Click on the link below, and enter the code. You'll then receive an activation code through your email.
Import the required libraries
There are two parts to this tutorial: using an LLM that someone else has fine-tuned for sentiment analysis, and fine-tuning a model ourselves.
In order to run both parts of this tutorial, we need to import the following packages:
- Transformers: As described, this will allow us to use fine-tuned LLMs for sentiment analysis and fine-tune our own models.
- PyTorch, Tensorflow, or Flax: Transformers acts as a high-level interface for deep learning frameworks, reusing their functionality for building, training, and running neural networks. In order to actually work with LLMs using the Transformers package, you will need to install your choice of PyTorch, Tensorflow, or Flax. PyTorch supports the largest number of models of the three frameworks, so that's the one we'll use in this tutorial.
- Datasets: This is another package from Hugging Face that allows you to easily work with the datasets hosted on Hugging Face Hub. We'll need this package to get a dataset to fine-tune an LLM for sentiment analysis.
In order to fine-tune our own model, we also need to import these additional packages:
- NumPy: NumPy allows us to work with arrays. We'll need this to do some post-processing on the predictions generated by our LLM.
- scikit-learn: This package contains a huge range of functionality for machine learning. We'll use it to evaluate the performance of our model.
- Evaluate: This is another package from Hugging Face. Evaluate adds a convenient interface for measuring the performance of models. It will give us an alternative way of measuring our model's performance.
- Accelerate: This final package from Hugging Face, Accelerate, takes care of distributed model training.
We can easily find and install these in PyCharm. Make sure you're using a Python 3.7 or higher interpreter. For this demo, we'll be using Python 3.11.7.
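For reference, here's a rough sketch of the imports these packages give us (Accelerate is used by the Trainer behind the scenes, so it normally doesn't need to be imported directly):

# Part one: using a model someone else has fine-tuned
import torch
from transformers import pipeline

# Part two: fine-tuning our own model
import numpy as np
import evaluate
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score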
Pick the right model
The next step is picking the right model. Before we get into that, we need to cover some terminology.
LLMs are made up of two components: an architecture and a checkpoint. The architecture is like the blueprint of the model, and describes what will be contained in each layer and each operation that takes place within the model.
The checkpoint refers to the weights that will be used within each layer. Each of the pretrained models will use an architecture like T5 or GPT, and obtain the specific weights (the model checkpoint) by training the model over a huge corpus of text data.
Fine-tuning will adjust the weights in the checkpoint by retraining the last layer(s) on a dataset specialized in a certain task or domain. To make predictions (called inference), an architecture will load in the checkpoint and use this to process text inputs, and together this is called a model.
If you've ever looked at the models available on Hugging Face, you might have been overwhelmed by the sheer number of them (even when we narrow them down to encoder-only models).
So, how do you know which one to use for sentiment analysis?
One useful place to start is the sentiment analysis page on Papers With Code. This page includes a very helpful overview of this task and a Benchmarks table that includes the top-performing models for each sentiment analysis benchmarking dataset. From this page, we can see that some of the commonly appearing models are those based on BERT and RoBERTa architectures.
While we may not be able to access these exact model checkpoints on Hugging Face (as not all of them will be uploaded there), it can give us a guide for what sorts of models might perform well at this task. Papers With Code also has similar pages for a range of other natural language tasks: If you search for the task in the upper left-hand corner of the site, you can navigate to these.
Now that we know what kinds of architectures are likely to do well for this problem, we can start searching for a specific model.
PyCharm has a built-in integration with Hugging Face that allows us to search for models directly. Simply right-click anywhere in your Jupyter notebook or Python script, and select Insert HF model. You'll be presented with the following window:
You can see that we can find Hugging Face models either by the task type (which we can select from the menu on the left-hand side), by keyword search in the search box at the top of the window, or by a combination of both. Models are ranked by the number of likes by default, but we can also select models based on downloads or when the model was created or last modified.
When you use a model for a task, the checkpoint is downloaded and cached, making it faster the next time you need to use that model. You can see all of the models you've downloaded in the Hugging Face tool window.
Once we've downloaded the model, we can also look at its model card again by hovering over the model name in our Jupyter notebook or Python script. We can do the same thing with dataset cards.
Use a fine-tuned LLM for sentiment analysis
Let's move on to how we can use a model that someone else has already fine-tuned for sentiment analysis.
As mentioned, sentiment analysis is usually treated as a text classification problem for LLMs. This means that in our Hugging Face model selection window, we'll select Text Classification, which can be found under Natural Language Processing on the left-hand side. To narrow the results down to sentiment analysis models, we'll type "sentiment" in the search box in the upper left-hand corner.
We can see various fine-tuned models, and as expected from what we saw on the Papers With Code Benchmarks table, most of them use RoBERTa or BERT architectures. Let's try out the top ranked model, Twitter-roBERTa-base for Sentiment Analysis.
You can see that after we select Use Model in the Hugging Face model selection window, code is automatically generated at the caret in our Jupyter notebook or Python script to allow us to start working with this model.
from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
Before we can do inference with this model, we'll need to modify this code.
The first thing we can check is whether we have a GPU available, which will make the model run faster. We'll check for two types: NVIDIA GPUs, which support CUDA, and Apple GPUs, which support MPS.
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")
My computer supports MPS, so we can add a device argument to the pipeline and set it to "mps". If your computer supports CUDA, you can instead add the argument device=0.
from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest", device="mps")
Finally, we can get the fine-tuned LLM to run inference over our example text.
result = pipe("I love PyCharm! It's my favorite Python IDE.")
result
[{'label': 'positive', 'score': 0.9914802312850952}]
You can see that this model predicts that the text will be positive, with 99% probability.
Fine-tune your own LLM for sentiment analysis
The other way we can use LLMs for sentiment analysis is to fine-tune our own model.
You might wonder why you'd bother doing this, given the huge number of fine-tuned models that already exist on Hugging Face Hub. The main reason you might want to fine-tune a model is so that you can tailor it to your specific use case.
Most models are fine-tuned on public datasets, especially social media posts and movie reviews, and you might need your model to be more sensitive to your specific domain or use case.
Model fine-tuning can be quite a complex topic, so in this demonstration, I'll explain how to do it at a more general level. However, if you want to understand this in more detail, you can read more about it in Hugging Face's excellent NLP course, which I recommended earlier. In their tutorial, they explain in detail how to process data for fine-tuning models and two different approaches to fine-tuning: with the trainer API and without it.
To demonstrate how to fine-tune a model, we'll use the SST-2 dataset, which is composed of single lines pulled from movie reviews that have been annotated as either negative or positive.
As mentioned earlier, BERT models consistently show up as top performers on the Papers With Code benchmarks, so we'll fine-tune a BERT checkpoint.
We can again search for these models in PyCharm's Hugging Face model selection window.
We can see that the most popular BERT model is bert-base-uncased. This is perfect for our use case, as it was also trained on lowercase text, so it will match the casing of our dataset.
We could have used the popular bert-large-uncased, but the base model has only 110 million parameters compared to BERT large, which has 340 million, so the base model is a bit friendlier for fine-tuning on a local machine.
If you still want to use a smaller model, you could also try this with a DistilBERT model, which has far fewer parameters but still preserves most of the performance of the original BERT models.
Let's start by reading in our dataset. We can do so using the load_dataset() function from the Datasets package. SST-2 is part of the GLUE dataset, which is designed to see how well a model can complete a range of natural language tasks.
from datasets import load_dataset

sst_2_raw = load_dataset("glue", "sst2")
sst_2_raw
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})
This dataset has already been split into the train, validation, and test sets. We have 67,349 training examples - quite a modest number for fine-tuning such a large model.
Here's an example from this dataset.
sst_2_raw["train"][1]
{'sentence': 'contains no wit , only labored gags ', 'label': 0, 'idx': 1}
We can see what the labels mean by calling the features attribute on the training set.
sst_2_raw["train"].features
{'sentence': Value(dtype='string', id=None), 'label': ClassLabel(names=['negative', 'positive'], id=None), 'idx': Value(dtype='int32', id=None)}
0 indicates a negative sentiment, and 1 indicates a positive one.
Let's look at the number in each class:
print(f'Number of negative examples: {sst_2_raw["train"]["label"].count(0)}')
print(f'Number of positive examples: {sst_2_raw["train"]["label"].count(1)}')
Number of negative examples: 29780
Number of positive examples: 37569
The classes in our training data are a tad unbalanced, but they aren't excessively skewed.
We now need to tokenize our data, transforming the raw text into a form that our model can use. To do this, we need to use the same tokenizer that was used to train the bert-base-uncased model in the first place. The AutoTokenizer class will take care of all of the under-the-hood details for us.
from transformers import AutoTokenizer

checkpoint = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Once we've loaded in the correct tokenizer, we can apply this to the training data.
tokenised_sentences = tokenizer(sst_2_raw["train"]["sentence"])
Finally, we need to add a function to pad our tokenized sentences. This will make sure all of the inputs in a training batch are the same length - text inputs are rarely the same length and models require a consistent number of features for each input.
from transformers import DataCollatorWithPadding

def tokenize_function(example):
    return tokenizer(example["sentence"])

tokenized_datasets = sst_2_raw.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Now that we've prepared our dataset, we need to determine how well the model is fitting to the data as it trains. To do this, we need to decide which metrics to use to evaluate the model's prediction performance.
As we're dealing with a binary classification problem, we have a few choices of metrics, the most popular of which are accuracy, precision, recall, and the F1 score. In the "Evaluate the model" section, we'll discuss the pros and cons of using each of these measures.
We have two ways of creating an evaluation function for our model. The first is using the Evaluate package. This package allows us to use the specific evaluator for the SST-2 dataset, meaning we'll evaluate the model fine-tuning using the specific metrics for this task. In the case of SST-2, the metric used is accuracy.
import evaluate
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "sst2")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
However, if we want to customize the metrics used, we can also create our own evaluation function.
In this case, I've imported the accuracy, precision, recall, and F1 score metrics from scikit-learn. I've then created a function which takes in the predicted labels versus actual labels for each sentence and calculates the four required metrics. We'll use this function, as it gives us a wider variety of metrics we can check our model performance against.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='macro'),
        'precision': precision_score(labels, predictions, average='macro'),
        'recall': recall_score(labels, predictions, average='macro')
    }
Now that we've done all of the setup, we're ready to train the model. The first thing we need to do is define some parameters that will control the training process using the TrainingArguments class. We've only specified a few parameters here, but this class has an enormous number of possible arguments allowing you to calibrate your model training to a high degree of specificity.
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="sst2-bert-fine-tuning", eval_strategy="epoch", num_train_epochs=3)
In our case, we've used the following arguments:
- output_dir: The output directory where we want our model predictions and checkpoints saved.
- eval_strategy="epoch": This ensures that the evaluation is performed at the end of each training epoch. Other possible values are "steps" (meaning that evaluation is done at regular step intervals) and "no" (meaning that evaluation is not done during training).
- num_train_epochs=3: This sets the number of training epochs (or the number of times the training loop will repeat over all of the data). In this case, it's set to train on the data three times.
The next step is to load in our pre-trained BERT model.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Let's break this down step-by-step:
- The AutoModelForSequenceClassification class does two things. First, it automatically identifies the appropriate model architecture from the Hugging Face model hub given the provided checkpoint string. In our case, this would be the BERT architecture. Second, it converts this model into one we can use for classification. It does this by discarding the weights in the model's final layer(s) so that we can retrain these using our sentiment analysis dataset.
- The from_pretrained() method loads in our selected checkpoint, which in this case is bert-base-uncased.
- The argument num_labels=2 indicates that we have two classes to predict in our model: positive and negative.
We get a message telling us that some model weights were not initialized when we ran this code. This message is exactly the one we want - it tells us that the AutoModelForSequenceClassification class reset the final model weights in preparation for our fine-tuning.
The last step is to set up our Trainer object. This stage takes in the model, the training arguments, the train and validation datasets, our tokenizer and padding function, and our evaluation function. It uses all of these to train the weights for the head (or final layers) of the BERT model, evaluating the performance of the model after each epoch on the validation set.
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
We can now kick off the training. The Trainer class gives us a nice timer that tells us both the elapsed time and how much longer the training is estimated to take. We can also see the metrics after each epoch, as we requested when creating the TrainingArguments.
trainer.train()
Evaluate the model
Classification metrics
Before we have a look at how our model performed, let's first discuss the evaluation metrics we used in more detail:
- Accuracy: As mentioned, this is the default evaluation metric for the SST-2 dataset. Accuracy is the simplest metric for evaluating classification models, being the ratio of correct predictions to all predictions. Accuracy is a good choice when the target classes are well balanced, meaning each class has an approximately equal number of instances.
- Precision: Precision calculates the percentage of the correctly predicted positive observations to the total predicted positives. It is important when the cost of a false positive is high. For example, in spam detection, you would rather miss a spam email (false negative) than have non-spam emails land in your spam folder (false positive).
- Recall (also known as sensitivity): Recall calculates the percentage of the correctly predicted positive observations to all observations in the actual class. It is of interest when the cost of false negatives is high, meaning classifying a positive class incorrectly as negative. For example, in disease diagnosis, you would rather have false alarms (false positives) than miss someone who is actually ill (false negatives).
- F1-score: The F1-score is the harmonic mean of precision and recall. It tries to find the balance between both measures. It is a more reliable metric than accuracy when dealing with imbalanced classes.
In our case, we had slightly imbalanced classes, so it's a good idea to check both accuracy and the F1 score. If they differ, the F1 score is likely to be more trustworthy. However, if they are roughly the same, it is nice to be able to use accuracy, as it is easily interpretable.
Knowing whether your model is better at predicting one class versus the other is also useful. Depending on your application, capturing all customers who are unhappy with your service may be more important, even if you sometimes get false negatives. In this case, a model with high recall would be a priority over high precision.
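As a toy illustration (made-up labels, not part of the tutorial's dataset), you can see how accuracy can look flattering on an imbalanced set while the other metrics tell a fuller story:

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # imbalanced: 8 positive, 2 negative
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # one negative misclassified as positive

print(accuracy_score(y_true, y_pred))   # 0.9
print(precision_score(y_true, y_pred))  # 0.888... (8 true positives out of 9 predicted positives)
print(recall_score(y_true, y_pred))     # 1.0 (every actual positive was found)
print(f1_score(y_true, y_pred))         # 0.941...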
Model predictions
Now that we've trained our model, we need to evaluate it. Normally, we would use the test set to get a final, unbiased evaluation, but the SST-2 test set does not have labels, so we cannot use it for evaluation. In this case, we'll use the validation set accuracy scores for our final evaluation. We can do this using the following code:
trainer.evaluate(eval_dataset=tokenized_datasets["validation"])
{'eval_loss': 0.4223457872867584, 'eval_accuracy': 0.9071100917431193, 'eval_f1': 0.9070209502998072, 'eval_precision': 0.9074841225920363, 'eval_recall': 0.9068472678285763, 'eval_runtime': 3.9341, 'eval_samples_per_second': 221.649, 'eval_steps_per_second': 27.706, 'epoch': 3.0}
We see that the model has 90% accuracy on the validation set, comparable to other BERT models trained on SST-2. If we wanted to improve our model performance, we could investigate a few things:
- Check whether the model is overfitting: While small by LLM standards, the BERT model we used for fine-tuning is still very large, and our training set was quite modest. In such cases, overfitting is quite common. To check this, we should compare our validation set metrics with our training set metrics. If the training set metrics are much higher than the validation set metrics, then we have overfit the model. You can adjust a range of parameters during model training to help mitigate this.
- Train on more epochs: In this example, we only trained the model for three epochs. If the model is not overfitting, continuing to train it for longer may improve its performance.
- Check where the model has misclassified: We could dig into where the model is classifying correctly and incorrectly to see if we can spot a pattern. This may allow us to spot any issues with ambiguous cases or mislabelled data. Perhaps the fact that this is a binary classification problem with no label for "neutral" sentiment means there is a subset of sentences that the model cannot properly classify. A rough sketch of how to do this follows this list.
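As a rough sketch of that last idea (illustrative only, reusing the trainer and datasets defined above), you could print out the misclassified validation sentences like this:

import numpy as np

# Run the fine-tuned model over the validation split and compare predictions with labels
output = trainer.predict(tokenized_datasets["validation"])
predicted_labels = np.argmax(output.predictions, axis=-1)

for example, predicted in zip(sst_2_raw["validation"], predicted_labels):
    if predicted != example["label"]:
        print(f'true={example["label"]} predicted={predicted}: {example["sentence"]}')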
To finish our section on evaluating this model, let's see how it goes with our test sentence. We'll pass our fine-tuned model and tokenizer to a TextClassificationPipeline, then pass our sentence to this pipeline:
from transformers import TextClassificationPipeline

pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
predictions = pipeline("I love PyCharm! It's my favourite Python IDE.")
print(predictions)
[[{'label': 'LABEL_0', 'score': 0.0006891043740324676}, {'label': 'LABEL_1', 'score': 0.9993108510971069}]]
Our model assigns LABEL_0 (negative) a probability of 0.0007 and LABEL_1 (positive) a probability of 0.999, indicating it predicts that the sentence has a positive sentiment with 99% certainty. This result is similar to the one we got from the fine-tuned RoBERTa model we used earlier in the post.
Sentiment analysis benchmarks
Instead of evaluating the model on only the dataset it was trained on, we could also assess it on other datasets.
As you can see from the Papers With Code benchmarking table, you can use a wide variety of labeled datasets to assess the performance of your sentiment classifiers. These datasets include the SST-5 fine-grained classification, IMDB dataset, Yelp binary and fine-grained classification, Amazon review polarity, TweetEval, and the SemEval Aspect-based sentiment analysis dataset.
When evaluating your model, the main thing is to ensure that the datasets represent your problem domain.
Most of the benchmarking datasets contain either reviews or social media texts, so if your problem is in either of these domains, you may find an existing benchmark that mirrors your business domain closely enough. However, suppose you are applying sentiment analysis to a more specialized problem. In that case, it may be necessary to create your own benchmarks to ensure your model can generalize to your problem domain properly.
Since there are multiple ways of measuring sentiment, it's also necessary to make sure that any benchmarks you use to assess your model have the same target as the dataset you trained your model on.
For example, it wouldn't be a fair measure of a model's performance to fine-tune it on the SST-2 with a binary target, and then test it on the SST-5. As the model has never seen the very positive, very negative, and neutral categories, it will not be able to accurately predict texts with these labels and hence will perform poorly.
Wrapping up
In this blog post, we saw how LLMs can be a powerful way of classifying the sentiment expressed in a piece of text and took a hands-on approach to fine-tuning an LLM for this purpose.
We saw how understanding which types of models are most suited for sentiment analysis, as well as how being able to see the top performing models on different benchmarks with resources like Papers With Code can help you narrow down your options for which models to use.
We also learned how Hugging Face's powerful tooling for using these models and their integration into PyCharm makes using LLMs for sentiment analysis approachable for anyone with a background in machine learning.
If you'd like to continue learning about large language models, check out our guest blog post by Dido Grigorov, who explains how to build a chatbot using the LangChain package.
Get started with sentiment analysis with PyCharm today
If you're ready to get started on your own sentiment analysis project, you can activate your free three-month subscription of PyCharm. Click on the link below, and enter this promo code: PCSA. You'll then receive an activation code through your email.
05 Dec 2024 10:49am GMT
04 Dec 2024
Planet Python
Django Weblog: Django security releases issued: 5.1.4, 5.0.10, and 4.2.17
In accordance with our security release policy, the Django team is issuing releases for Django 5.1.4, Django 5.0.10, and Django 4.2.17. These releases address the security issues detailed below. We encourage all users of Django to upgrade as soon as possible.
CVE-2024-53907: Potential denial-of-service in django.utils.html.strip_tags()
The strip_tags() method and striptags template filter are subject to a potential denial-of-service attack via certain inputs containing large sequences of nested incomplete HTML entities.
Thanks to jiangniao for the report.
This issue has severity "moderate" according to the Django security policy.
CVE-2024-53908: Potential SQL injection in HasKey(lhs, rhs) on Oracle
Direct usage of the django.db.models.fields.json.HasKey lookup on Oracle is subject to SQL injection if untrusted data is used as a lhs value. Applications that use the jsonfield.has_key lookup through the __ syntax are unaffected.
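As an illustrative sketch (the Entry model and the untrusted_lhs variable are hypothetical, not from the advisory), the difference is roughly between these two styles of query:

from django.db.models.fields.json import HasKey

# Unaffected: the usual double-underscore lookup syntax
Entry.objects.filter(data__has_key="title")

# Potentially affected on Oracle: constructing the HasKey lookup directly,
# with a left-hand side value that comes from untrusted input
Entry.objects.filter(HasKey(untrusted_lhs, "title"))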
Thanks to Seokchan Yoon for the report.
This issue has severity "high" according to the Django security policy.
Affected supported versions
- Django main
- Django 5.1
- Django 5.0
- Django 4.2
Resolution
Patches to resolve the issue have been applied to Django's main, 5.1, 5.0, and 4.2 branches. The patches may be obtained from the following changesets.
CVE-2024-53907: Potential denial-of-service in django.utils.html.strip_tags()
- On the main branch
- On the 5.1 branch
- On the 5.0 branch
- On the 4.2 branch
CVE-2024-53908: Potential SQL injection in HasKey(lhs, rhs) on Oracle
- On the main branch
- On the 5.1 branch
- On the 5.0 branch
- On the 4.2 branch
The following releases have been issued
- Django 5.1.4 (download Django 5.1.4 | 5.1.4 checksums)
- Django 5.0.10 (download Django 5.0.10 | 5.0.10 checksums)
- Django 4.2.17 (download Django 4.2.17 | 4.2.17 checksums)
The PGP key ID used for this release is Sarah Boyce: 3955B19851EA96EF
General notes regarding security reporting
As always, we ask that potential security issues be reported via private email to security@djangoproject.com, and not via Django's Trac instance, nor via the Django Forum, nor via the django-developers list. Please see our security policies for further information.
04 Dec 2024 3:40pm GMT
Real Python: Expression vs Statement in Python: What's the Difference?
After working with Python for a while, you'll eventually come across two seemingly similar terms: expression and statement. When you browse the official documentation or dig through a Python-related thread on an online forum, you may get the impression that people use these terms interchangeably. That's often true, but confusingly enough, there are cases when the expression vs statement distinction becomes important.
So, what's the difference between expressions and statements in Python?
Get Your Code: Click here to download the free sample code you'll use to learn about the difference between expressions and statements.
Take the Quiz: Test your knowledge with our interactive "Expression vs Statement in Python: What's the Difference?" quiz. You'll receive a score upon completion to help you track your learning progress:
Interactive Quiz
Expression vs Statement in Python: What's the Difference? In this quiz, you'll test your understanding of Python expressions vs statements. Knowing the difference between these two is crucial for writing efficient and readable Python code.
In Short: Expressions Have Values and Statements Cause Side Effects
When you open the Python glossary, you'll find the following two definitions:
Expression: A piece of syntax which can be evaluated to some value. (…) (Source)
Statement: A statement is part of a suite (a "block" of code). A statement is either an expression or one of several constructs with a keyword, (…) (Source)
Well, that isn't particularly helpful, is it? Fortunately, you can summarize the most important facts about expressions and statements in as little as three points:
- All instructions in Python fall under the broad category of statements.
- By this definition, all expressions are also statements, sometimes called expression statements.
- Not every statement is an expression.
In a technical sense, every line or block of code is a statement in Python. That includes expressions, which represent a special kind of statement. What makes an expression special? You'll find out now.
Expressions: Statements With Values
Essentially, you can substitute all expressions in your code with the computed values, which they'd produce at runtime, without changing the overall behavior of your program. Statements, on the other hand, can't be replaced with equivalent values unless they're expressions.
Consider the following code snippet:
>>> x = 42
>>> y = x + 8
>>> print(y)
50
In this example, all three lines of code contain statements. The first two are assignment statements, while the third one is a call to the print() function.
When you look at each line more closely, you can start disassembling the corresponding statement into subcomponents. For example, the assignment operator (=) consists of the parts on the left and the right. The part to the left of the equal sign indicates the variable name, such as x or y, and the part on the right is the value assigned to that variable.
The word value is the key here. Notice that the variable x is assigned a literal value, 42, that's baked right into your code. In contrast, the following line assigns an arithmetic expression, x + 8, to the variable y. Python must first calculate or evaluate such an expression to determine the final value for the variable when your program is running.
Arithmetic expressions are just one example of Python expressions. Others include logical expressions, conditional expressions, and more. What they all have in common is a value to which they evaluate, although each value will generally be different. As a result, you can safely substitute any expression with the corresponding value:
>>> x = 42
>>> y = 50
>>> print(y)
50
This short program gives the same result as before and is functionally identical to the previous one. You've calculated the arithmetic expression by hand and inserted the resulting value in its place.
Note that you can evaluate x + 8, but you can't do the same with the assignment y = x + 8, even though it incorporates an expression. The whole line of code represents a pure statement with no intrinsic value. So, what's the point of having such statements? It's time to dive into Python statements and find out.
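You can see the difference in a quick REPL session (an illustrative snippet, not from the original article): an expression echoes its value back, while an assignment statement echoes nothing.
>>> x = 42
>>> x + 8
50
>>> y = x + 8
>>>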
Statements: Instructions With Side Effects
Statements that aren't expressions cause side effects, which change the state of your program or affect an external resource, such as a file on disk. For example, when you assign a value to a variable, you define or redefine that variable somewhere in Python's memory. Similarly, when you call print(), you effectively write to the standard output stream (stdout), which, by default, displays text on the screen.
Note: While statements encompass expressions, most people use the word statement informally when they refer to pure statements or instructions with no value.
Okay. You've covered statements that are expressions and statements that aren't expressions. From now on, you can refer to them as pure expressions and pure statements, respectively. But it turns out there's a middle ground here.
Read the full article at https://realpython.com/python-expression-vs-statement/ »
[ Improve Your Python With 🐍 Python Tricks 💌 - Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
04 Dec 2024 2:00pm GMT
Ned Batchelder: Testing some tidbits
I posted a Python tidbit about checking if a string consists entirely of zeros and ones:
I got a bunch of replies suggesting other ways. I wanted to post those, but I also wanted to check if they were right. A classic testing structure would have required putting them all in functions, etc, which I didn't want to bother with.
So I cobbled together a test harness for them (also in a gist if you want):
GOOD = [
"",
"0",
"1",
"000000000000000000",
"111111111111111111",
"101000100011110101010000101010101001001010101",
]
BAD = [
"x",
"nedbat",
"x000000000000000000000000000000000000",
"111111111111111111111111111111111111x",
"".join(chr(i) for i in range(10000)),
]
TESTS = """
# The original checks
all(c in "01" for c in s)
set(s).issubset({"0", "1"})
set(s) <= {"0", "1"}
re.fullmatch(r"[01]*", s)
s.strip("01") == ""
not s.strip("01")
# Using min/max
"0" <= min(s or "0") <= max(s or "1") <= "1"
not s or (min(s) in "01" and max(s) in "01")
((ss := sorted(s or "0")) and ss[0] in "01" and ss[-1] in "01")
# Using counting
s.count("0") + s.count("1") == len(s)
(not (ctr := Counter(s)) or (ctr["0"] + ctr["1"] == len(s)))
# Using numeric tests
all(97*c - c*c > 2351 for c in s.encode())
max((abs(ord(c) - 48.5) for c in "0"+s)) < 1
all(map(lambda x: (ord(x) ^ 48) < 2, s))
# Removing all the 0 and 1
re.sub(r"[01]", "", s) == ""
len((s).translate(str.maketrans("", "", "01"))) == 0
len((s).replace("0", "").replace("1", "")) == 0
"".join(("1".join((s).split("0"))).split("1")) == ""
# A few more for good measure
set(s + "01") == set("01")
not (set(s) - set("01"))
not any(filter(lambda x: x not in {"0", "1"}, s))
all(map(lambda x: x in "01", s))
"""
import re
from collections import Counter
from inspect import cleandoc
g = {
"re": re,
"Counter": Counter,
}
for test in cleandoc(TESTS).splitlines():
    test = test.partition("#")[0]
    if not test:
        continue
    for ss, expected in [(GOOD, True), (BAD, False)]:
        for s in ss:
            result = eval(test, {"s": s} | g)
            if bool(result) != expected:
                print("OOPS:")
                print(f" {s = }")
                print(f" {test}")
                print(f" {expected = }")
It's a good thing I did this because a few of the suggestions needed adjusting, especially for dealing with the empty string. But now they all work, and are checked!
BTW, if you prefer Mastodon to BlueSky, the posts are there too: first and second.
04 Dec 2024 12:03pm GMT
Django Weblog: Help us make it happen ❤️
And just like that, 2024 is almost over! If your finances allow, donate to the Django Software Foundation to support the long-term future of Django.
Of our US $200,000.00 goal for 2024, as of December 4th, 2024, we are at:
- 83.6% funded
- $167,272.85 donated
Other ways to give
- Official merchandise store - Buy official t-shirts, accessories, and more to support Django.
- Sponsor Django via GitHub Sponsors.
- Benevity Workplace Giving Program - If your employer participates, you can make donations to the DSF via payroll deduction.
Why give to the Django Software Foundation?
Our main focus is direct support of Django's developers. This means:
- Organizing and funding development sprints so that Django's developers can meet in person.
- Helping key developers attend these sprints and other community events by covering travel expenses to official Django events.
- Providing financial assistance to community development and outreach projects such as Django Girls.
- Providing financial assistance to individuals so they can attend major conferences and events.
- Funding the Django Fellowship program, which provides full-time staff to perform community management tasks in the Django community.
Still curious? See our Frequently Asked Questions about donations.
04 Dec 2024 8:53am GMT
03 Dec 2024
Planet Python
Kushal Das: Basedpyright and neovim
Basedpyright is a fork of pyright with various type checking improvements, improved vscode support and pylance features built into the language server. It has a list of benefits over Pyright.
In case you want to use that inside of neovim using Mason, you will have to remember to have the configuration inside of a settings key. The following is from my setup.
basedpyright = {
settings = {
basedpyright = {
analysis = {
diagnosticMode = 'openFilesOnly',
typeCheckingMode = 'basic',
capabilities = capabilities,
useLibraryCodeForTypes = true,
diagnosticSeverityOverrides = {
autoSearchPaths = true,
enableTypeIgnoreComments = false,
reportGeneralTypeIssues = 'none',
reportArgumentType = 'none',
reportUnknownMemberType = 'none',
reportAssignmentType = 'none',
},
},
},
},
},
I struggled for a few hours to fix this a couple of days ago.
03 Dec 2024 8:50pm GMT
PyCoderโs Weekly: Issue #658 (Dec. 3, 2024)
#658 - DECEMBER 3, 2024
View in Browser »
Django Performance: Scaling and Optimization
Performance tuning in the context of Django applications is the practice of enhancing both the efficiency and effectiveness of your web project to optimize its runtime behavior. This article tells you a lot of what you need to know.
LOADFORGE.COM
Python's pathlib Module
Python's pathlib module is the tool to use for working with file paths. This post contains pathlib quick reference tables and examples.
TREY HUNNER
Python Developers: Scrape Any Website Without Getting Blocked
ZenRows handles all anti-bot bypass for you, from rotating proxies and headless browsers to CAPTCHAs and AI. Get a complete web scraping toolkit to extract all the data you need with a single API call. Try ZenRows now for free.
ZENROWS sponsor
Managing Dependencies With Python Poetry
Learn how Python Poetry can help you start new projects, maintain existing ones, and master dependency management.
REAL PYTHON course
Discussions
Articles & Tutorials
Advent of Code 2024
This annual tradition is a series of small programming puzzles that can be completed in any programming language.
ADVENTOFCODE.COM
CI/CD for Python With GitHub Actions
With most software following agile methodologies, it's essential to have robust DevOps systems in place to manage, maintain, and automate common tasks with a continually changing codebase. By using GitHub Actions, you can automate your workflows efficiently, especially for Python projects.
REAL PYTHON
How to Debug Your Textual Application
Textual is a great Python package for creating a lightweight, powerful, text-based user interface. Debugging TUIs can be a challenge though, as you can no longer use print() and the application may not even run in your IDE's terminal interface. This post talks about how to debug a TUI.
MIKE DRISCOLL
Reactive Notebooks and Deployable Web Apps in Python
What are common issues with using notebooks for Python development? How do you know the current state, share reproducible results, or create interactive applications? This week on the show, we speak with Akshay Agrawal about the open-source reactive marimo notebook for Python.
REAL PYTHON podcast
Constraints Are Good: Python's Metadata Dilemma
Python's initial flexibility in packaging with the executable setup.py has meant that people have come to expect this power. In this post Armin argues that if constraints had been there in the first place we'd be in a better place now.
ARMIN RONACHER
Demystifying ODBC With Python
Open Database Connectivity (ODBC) is used to connect to various databases. This article aims to help you understand ODBC better by implementing database communications from scratch only using Python.
PRESTON BLACKBURN โข Shared by Preston Blackburn
What the PSF Conduct WG Does
In the past week Brett has had two different people tell him what the PSF Conduct Working Group did, and both were wrong. This post tries to correct what might be common misconceptions.
BRETT CANNON
Introduction to Retrogame Programming With Pyxel
Pyxel is a Rust based framework for building retro games that comes with a Python API wrapper. This step-by-step tutorial shows you how to do some basic sprite animation to get started.
MATHIEU LECARME
Speeding Up Data Retrieval From PostgreSQL With Psycopg
Formatting and concatenating query result columns on the PostgreSQL side and then parsing them in Python might sometimes be faster than fetching the columns as separate values.
ALIAKSEI YALETSKI โข Shared by Tiendil
Django Application Performance Optimization Checklist
"Improve the performance of your Django application by understanding, testing, and implementing some common optimization techniques."
SANKET RAI
Top 10 Rules of Continuous Integration
Continuous Integration (CI) is key to rapid deployment of new features. This post gives you ten rules to consider when doing CI.
KRISTINA NIKOLOVA
Projects & Code
Sensei: Simplifying API Client Generation
CROCOFACTORY.DEV โข Shared by Alexey
Peek: Like print, but Easy
SALABIM.ORG โข Shared by Ruud van der Ham
Events
Weekly Real Python Office Hours Q&A (Virtual)
December 4, 2024
REALPYTHON.COM
PyCon Tanzania 2024
December 4 to December 6, 2024
PYCON.OR.TZ
DELSU Tech Invasion 2.0
December 4 to December 6, 2024
HAMPLUSTECH.COM
Canberra Python Meetup
December 5, 2024
MEETUP.COM
Sydney Python User Group (SyPy)
December 5, 2024
SYPY.ORG
PyLadies Amsterdam: Introduction to Data Storytelling
December 7, 2024
MEETUP.COM
Happy Pythoning!
This was PyCoder's Weekly Issue #658.
View in Browser »
[ Subscribe to 🐍 PyCoder's Weekly 💌 - Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]
03 Dec 2024 7:30pm GMT
Python Insider: Python 3.13.1, 3.12.8, 3.11.11, 3.10.16 and 3.9.21 are now available
Another big release day! Python 3.13.1 and 3.12.8 were regularly scheduled releases, but they do contain a few security fixes. That makes it a nice time to release the security-fix-only versions too, so everything is as secure as we can make it.
Python 3.13.1
Python 3.13's first maintenance release. My child is all growed up now, I guess! Almost 400 bugfixes, build improvements and documentation changes went in since 3.13.0, making this the very best Python release to date.
https://www.python.org/downloads/release/python-3131/
Python 3.12.8
Python 3.12 might be slowly reaching middle age, but still received over 250 bugfixes, build improvements and documentation changes since 3.12.7.
https://www.python.org/downloads/release/python-3128/
Python 3.11.11
I know it's probably hard to hear, but this is the second security-only release of Python 3.11. Yes, really! Oh yes, I know, I know, but it's true! Only 11 commits went in since 3.11.10.
https://www.python.org/downloads/release/python-31111/
Python 3.10.16
Python 3.10 received a total of 14 commits since 3.10.15. Why more than 3.11? Because it needed a little bit of extra attention to keep working with current GitHub practices, I guess.
https://www.python.org/downloads/release/python-31016/
Python 3.9.21
Python 3.9 isn't quite ready for pasture yet, as it's set to receive security fixes for at least another 10 months. Very similarly to 3.10, it received 14 commits since 3.9.20.
https://www.python.org/downloads/release/python-3921/
Stay safe and upgrade!
As always, upgrading is highly recommended to all users of affected versions.
Enjoy the new releases
Thanks to all of the many volunteers who help make Python Development and these releases possible! Please consider supporting our efforts by volunteering yourself or through organization contributions to the Python Software Foundation.
Regards from your tireless, tireless release team,
Thomas Wouters
Ned Deily
Steve Dower
Pablo Galindo Salgado
Łukasz Langa
03 Dec 2024 7:01pm GMT
Real Python: Handling or Preventing Errors in Python: LBYL vs EAFP
Dealing with errors and exceptional situations is a common requirement in programming. You can either prevent errors before they happen or handle errors after they've happened. In general, you'll have two coding styles matching these strategies: look before you leap (LBYL), and easier to ask forgiveness than permission (EAFP). In this video course, you'll dive into the questions and considerations surrounding LBYL vs EAFP in Python.
By learning about Python's LBYL and EAFP coding styles, you'll be able to decide which strategy and coding style to use when you're dealing with errors in your code.
In this video course, you'll learn how to:
- Use the LBYL and EAFP styles in your Python code
- Understand the pros and cons of LBYL vs EAFP
- Decide when to use either LBYL or EAFP
[ Improve Your Python With 🐍 Python Tricks 💌 - Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
03 Dec 2024 2:00pm GMT
Seth Michael Larson: New era of slop security reports for open source
New era of slop security reports for open source
I'm on the security report triage team for CPython, pip, urllib3, Requests, and a handful of other open source projects. I'm also in a trusted position such that I get "tagged in" to other open source projects to help others when they need help with security.
Recently I've noticed an uptick in extremely low-quality, spammy, and LLM-hallucinated security reports to open source projects. The issue is in the age of LLMs, these reports appear at first-glance to be potentially legitimate and thus require time to refute. Other projects such as curl have reported similar findings.
Some reporters will run a variety of security scanning tools and open vulnerability reports based on the results seemingly without a moment of critical thinking. For example, urllib3 recently received a report because a tool was detecting our usage of SSLv2 as insecure even though our usage is to explicitly disable SSLv2.
This issue is tough to tackle because it's distributed across thousands of open source projects and due to the security-sensitive nature of reports open source maintainers are discouraged from sharing their experiences or asking for help. Sharing experiences takes time and effort, something that is in short supply amongst maintainers.
Responding to security reports is expensive
If this is happening to a handful of projects that I have visibility for, then I suspect that this is happening on a large scale to open source projects. This is a very concerning trend.
Security is already a topic that is not aligned with why many maintainers contribute their time to open source software, instead seeing security as important to help protect their users. It's critical as reporters to respect this often volunteered time.
Security reports that waste maintainers' time result in confusion, stress, frustration, and to top it off a sense of isolation due to the secretive nature of security reports. All of these feelings can add to burn-out of likely highly-trusted contributors to open source projects.
In many ways, these low-quality reports should be treated as if they are malicious. Even if this is not their intent, the outcome is maintainers that are burnt out and more averse to legitimate security work.
What platforms can do
If you're a platform accepting vulnerability reports on behalf of open source projects, here are things you can do:
- Add systems to prevent automated or abusive creation of security reports. Require reporters to solve CAPTCHAs or heavily rate-limit security report creation using automation.
- Allow a security report to be made public without publishing a vulnerability record. This would allow maintainers to "name-and-shame" offenders and better collaborate as a community how to fight back against low-quality reports. Today many of these reports aren't seen due to being private by default or when closed.
- Remove the public attribution of reporters that abuse the system, even removing previously credited reports in the case of abuse.
- Take away any positive incentive to reporting security issues, for example GitHub showing the number of GitHub Security Advisory "credits" a user appears on.
- Prevent or hamper newly registered users from reporting security issues.
What reporters can do
If you're starting a new campaign of scanning open source projects and reporting potential vulnerabilities upstream:
- DO NOT use AI / LLM systems for "detecting" vulnerabilities. These systems today cannot understand code, finding security vulnerabilities requires understanding code AND understanding human-level concepts like intent, common usage, and context.
- DO NOT run experiments on open source volunteers. My alma-mater the University of Minnesota rightfully had its reputation thrown in the trash in 2021 over their experiment to knowingly socially deceive Linux maintainers.
- DO NOT submit reports that haven't been reviewed BY A HUMAN. This reviewing time should be paid first by you, not open source volunteers.
- DO NOT spam projects, open a handful of reports and then WAIT. You could run the script and open tons of reports all-at-once, but likely you have faults in your process that will cause mass-frustration at scale. Learn from early mistakes and feedback.
- Have someone with experience in open source maintenance for the size of projects you are scanning review your plan before you begin. If that person is not on your team, then pay them for their time and expertise.
- Show up with patches, not just reports. Providing patches makes the work of maintainers much easier.
Doing all of the above will likely lead to better outcomes for everyone.
What maintainers can do
Put the same amount of effort into responding as the reporter put into submitting a sloppy report: i.e., near zero. If you receive a report that you suspect is AI or LLM generated, reply with a short response and close the report:
"I suspect this report is (AI-generated|incorrect|spam). Please respond with more justification for this report. See: https://sethmlarson.dev/slop-security-reports"
If you hear back at all, then admit your mistake and move on with the security report. Maybe the reporter will fix their process and you'll have helped other open source maintainers along the way to helping yourself.
If you don't hear back: great, you saved time and can get back to actually useful work.
Here are some questions to ask of a security report and reporter:
- If you aren't sure: ask for help! Is there someone I trust in my community that I can ask for another look? You are not alone; there are many people around who are willing to help. For Python open source projects you can ask for help from me if needed.
- Does the reporter have a new account, no public identity, or multiple "credited" security reports of low quality? There are sometimes legitimate reasons to want anonymity, but I've seen this commonly on very low-stakes vulnerability reports.
- Is the vulnerability in the proof-of-concept code or the project itself? Oftentimes the proof-of-concept code will be using the project insecurely, and thus the vulnerability is in the proof-of-concept code, not your code.
Most vulnerability reporters are acting in good faith
I wanted to end this article with a note that many vulnerability reporters are acting in good faith and are submitting high quality reports. Please keep in mind that vulnerability reporters are humans: not perfect and trying their best to make the world a better place.
Unfortunately, an increasing majority of reports are of low quality and are ruining the experience for others. I hope we're able to fix this issue before it gets out of hand.
Have thoughts or questions? Let's chat over email or social:
sethmichaellarson@gmail.com
@sethmlarson@fosstodon.org
Want more articles like this one? Get notified of new posts by subscribing to the RSS feed or the email newsletter. I won't share your email or send spam, only whatever this is!
Want more content now? This blog's archive has ready-to-read articles. I also curate a list of cool URLs I find on the internet.
Find a typo? This blog is open source, pull requests are appreciated.
Thanks for reading! This work is licensed under CC BY-SA 4.0
03 Dec 2024 12:00am GMT
02 Dec 2024
Planet Python
Python Engineering at Microsoft: Announcing: Azure Developers โ Python Day
We're thrilled to announce Azure Developers - Python Day! Join us on December 5th for a full day of online training and discover the latest services and features in Azure designed specifically for Python developers. You'll learn cutting-edge cloud development techniques that can save you time and money while providing your customers with the best experience possible.
December 5, 2024 from 9:30 am - 4:00 pm (Pacific Time) / 17:30 - 00:00 (UTC)
Select "Notify Me" on the YouTube Video to ensure you don't miss the event!
During the event, you'll hear directly from the experts behind the latest features in Azure designed for Python developers, techniques to save time and money, and a special session on our recently announced AI Toolkit for VS Code.
Whether you're a beginner or an experienced Python developer, this event is for you. We'll cover seven main topic areas: Application Development, Artificial Intelligence, Cloud Native, Data Services, Security, Serverless, and Developer Productivity.
Agenda
Session Title | Theme | Speaker | Time (PT / UTC)
---|---|---|---
Welcome to Azure Developers - Python Day | | Dawn Wages, Senior Program Manager |
 | | Jay Gordon, Senior Program Manager |
 | | Zhidi Shang, Principal Program Manager Lead |
 | | Abigail Gbadago, Senior Software Engineer, Python MVP |
 | | Jay Gordon, Senior Program Manager | 9:50 AM / 16:50
Dev Containers and Codespaces for quick skilling and deployments | Developer Productivity | Sarah Kaiser, Senior Cloud Developer Advocate | 10:00 AM / 17:00
Cloudy with a Chance of Jupyter - Install JupyterHub on Azure in 30 mins | Data Services | Dharhas Pothina, CTO, Quansight | 10:30 AM / 17:30
Langchain on Azure SQL to enlighten AI with your own data | AI | Davide Mauri, Principal Program Manager | 11:00 AM / 18:00
Securing Python Applications | Security, Cloud Native | Joylynn Kirui, Senior Security Cloud Advocate | 11:30 AM / 18:30
Getting started with Python on Azure Cosmos DB | App Development | Theo Van Kraay, Principal Program Manager | 12:00 PM / 19:00
Transforming AI development in VS Code | AI, App Development, Developer Productivity | Rong Lu, Principal Mgr, Product Manager | 12:30 PM / 19:30
Building Scalable GenAI Apps with Azure Cosmos DB & LangChain | AI, App Development | James Codella, Principal Product Mgr | 1:00 PM / 20:00
Python + Azure for Absolute Beginners | App Development | Rohit Ganguly, Product Manager II | 1:30 PM / 20:30
Deploying Python apps with GitHub Copilot for @azure | AI, Developer Productivity | Pamela Fox, Principal Cloud Developer Advocate | 2:00 PM / 21:00
Integrating AI into your Python apps with App Service Sidecars | AI, App Development | Tulika Chaudharie, Principal Product Manager | 2:30 PM / 21:30
Your First Full Stack Python Web Application | App Development | Renee Noble, Senior Cloud Developer Advocate | 2:45 PM / 22:00
Deploying a scalable Django app with Microsoft Azure | App Development | Velda Kiara, Senior Software Engineer, Python MVP | 3:30 PM / 22:30
Closing Remarks | | Dawn Wages, Senior Program Manager | 4:00 PM / 23:00
Don't miss this opportunity to build the best applications with Python. Join us on December 5th on the Azure Developers YouTube and Twitch channels. See you there!
The post Announcing: Azure Developers - Python Day appeared first on Python.
02 Dec 2024 3:41pm GMT
Real Python: Basic Input and Output in Python
For a program to be useful, it often needs to communicate with the outside world. In Python, the input() function allows you to capture user input from the keyboard, while you can use the print() function to display output to the console.
These built-in functions allow for basic user interaction in Python scripts, enabling you to gather data and provide feedback. If you want to go beyond the basics, then you can even use them to develop applications that are not only functional but also user-friendly and responsive.
By the end of this tutorial, you'll know how to:
- Take user input from the keyboard with input()
- Display output to the console with print()
- Use readline to improve the user experience when collecting input on UNIX-like systems
- Format output using the sep and end keyword arguments of print()
To get the most out of this tutorial, you should have a basic understanding of Python syntax and familiarity with using the Python interpreter and running Python scripts.
Get Your Code: Click here to download the free sample code that you'll use to learn about basic input and output in Python.
Take the Quiz: Test your knowledge with our interactive "Basic Input and Output in Python" quiz. You'll receive a score upon completion to help you track your learning progress:
Interactive Quiz
Basic Input and Output in Python
In this quiz, you'll test your understanding of Python's built-in functions for user interaction, namely input() and print(). These functions allow you to capture user input from the keyboard and display output to the console, respectively.
Reading Input From the Keyboard
Programs often need to obtain data from users, typically through keyboard input. In Python, one way to collect user input from the keyboard is by calling the built-in input() function.
The input() function pauses program execution to allow you to type in a line of input from the keyboard. Once you press the Enter key, all characters typed are read and returned as a string, excluding the newline character generated by pressing Enter.
If you add text in between the parentheses, effectively passing a value to the optional prompt argument, then input() displays the text you entered as a prompt:
>>> name = input("Please enter your name: ")
Please enter your name: John Doe
>>> name
'John Doe'
Adding a meaningful prompt will assist your user in understanding what they're supposed to input, which makes for a better user experience.
The input() function always reads the user's input as a string. Even if you type characters that resemble numbers, Python will still treat them as a string:
 1 >>> number = input("Enter a number: ")
 2 Enter a number: 50
 3
 4 >>> type(number)
 5 <class 'str'>
 6
 7 >>> number + 100
 8 Traceback (most recent call last):
 9   File "<python-input-1>", line 1, in <module>
10     number + 100
11     ~~~~~~~^~~~~
12 TypeError: can only concatenate str (not "int") to str
In the example above, you wanted to add 100 to the number entered by the user. However, the expression number + 100 on line 7 doesn't work because number is a string ("50") and 100 is an integer. In Python, you can't combine a string and an integer using the plus (+) operator.
You wanted to perform a mathematical operation using two integers, but because input() always returns a string, you need a way to read user input as a numeric type. So, you'll need to convert the string to the appropriate type:
>>> number = int(input("Enter a number: "))
Enter a number: 50
>>> type(number)
<class 'int'>
>>> number + 100
150
In this updated code snippet, you use int() to convert the user input to an integer right after collecting it. Then, you assign the converted value to the name number. That way, the calculation number + 100 has two integers to add. The calculation succeeds and Python returns the correct sum.
Note: When you convert user input to a numeric type using functions like int() in a real-world scenario, it's crucial to handle potential exceptions to prevent your program from crashing due to invalid input.
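For example, a minimal sketch of that kind of guard (not part of the original example) wraps the conversion in a try ... except block and re-prompts until the input is valid:
>>> while True:
...     try:
...         number = int(input("Enter a number: "))
...         break
...     except ValueError:
...         print("Please enter a whole number.")
...
Enter a number: fifty
Please enter a whole number.
Enter a number: 50
>>> number + 100
150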
The input() function lets you collect information from your users. But once your program has calculated a result, how do you display it back to them? Up to this point, you've seen results displayed automatically as output in the interactive Python interpreter session.
However, if you ran the same code from a file instead, then Python would still calculate the values, but you wouldn't see the results. To display output in the console, you can use Python's print() function, which lets you show text and data to your users.
Writing Output to the Console
In addition to obtaining data from the user, a program will often need to present data back to the user. In Python, you can display data to the console with the print() function.
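As a quick, minimal sketch of the sep and end keyword arguments mentioned earlier (the values are chosen purely for illustration):
print("2024", "12", "03", sep="-")  # prints: 2024-12-03
print("Loading", end="")            # suppresses the trailing newline
print("... done")                   # continues on the same line: Loading... done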
Read the full article at https://realpython.com/python-input-output/ »
[ Improve Your Python With 🐍 Python Tricks 💌 - Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
02 Dec 2024 2:00pm GMT
PyCharm: The State of Data Science 2024: 6 Key Data Science Trends
Generative AI and LLMs have been hot topics this year, but are they affecting trends in data science and machine learning? What new trends in data science are worth following? Every year, JetBrains collaborates with the Python Software Foundation to carry out the Python Developer Survey, which can offer some useful insight into these questions.
The results from the latest iteration of the survey, collected between November 2023 and February 2024, included a new Data Science section. This allowed us to get a more complete picture of data science trends over the past year and highlighted how important Python remains in this domain.
While 48% of Python developers are involved in data exploration and processing, the percentage of respondents using Python for data analysis dropped from 51% in 2022 to 44% in 2023. The percentage of respondents using Python for machine learning dropped from 36% in 2022 to 34% in 2023. At the same time, 27% of respondents use Python for data engineering, and 8% use it for MLOps - two new categories that were added to the survey in 2023.
Let's take a closer look at the trends in the survey results to put these numbers into context and get a better sense of what they mean. Read on to learn about the latest developments in the fields of data science and machine learning to prepare yourself for 2025.
Data processing: pandas remains the top choice, but Polars is gaining ground
Data processing is an essential part of data science. pandas, a project that is 15 years old, is still at the top of the list of the most commonly used data processing tools. It is used by 77% of respondents who do data exploration and processing. As a mature project, its API is stable, and many working examples can be found on the internet. It's no surprise that pandas is still the obvious choice. As a NumFOCUS sponsored project, pandas has proven to the community that it is sustainable and its governance model has gained user trust. It is a great choice for beginners who may still be learning the ropes of data processing, as it's a stable project that does not undergo rapid changes.
On the other hand, Polars, which pitches itself as DataFrames for the new era, has been in the spotlight quite a bit both last year and this year, thanks to the advantages it provides in terms of speed and parallel processing. In 2023, a company led by the creator of Polars, Ritchie Vink, was formed to support the development of the project. This ensures Polars will be able to maintain its rapid pace of development. In July of 2024, version 1.0 of Polars was released. Later, Polars expanded its compatibility with other popular data science tools like Hugging Face and NVIDIA RAPIDS. It also provides a lightweight plotting backend, just like pandas.
So, for working professionals in data science, there is an advantage to switching to Polars. As the project matures, it can become a load-bearing tool in your data science workflow and can be used to process more data faster. In the 2023 survey, 10% of respondents said that they are using Polars as their data processing tool. It is not hard to imagine this figure being higher in this year's survey.
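To make the comparison concrete, here is a minimal sketch of the same aggregation in both libraries; the column names and values are made up:
import pandas as pd
import polars as pl

data = {"city": ["Oslo", "Oslo", "Paris"], "temp": [2.0, 4.0, 9.0]}

# pandas: eager, index-based API
pd_mean = pd.DataFrame(data).groupby("city")["temp"].mean()

# Polars: expression-based API, also available as a lazy, parallel query engine
pl_mean = pl.DataFrame(data).group_by("city").agg(pl.col("temp").mean())

print(pd_mean)
print(pl_mean)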
Whether you are a working professional or just starting to process your first dataset, it is important to have an efficient tool at hand that can make your work more enjoyable. With PyCharm, you can inspect your data as interactive tables, which you can scroll, sort, filter, convert to plots, or use to generate heat maps. Moreover, you can get analytics for each column and use AI assistance to explain DataFrames or create visualizations. Apart from pandas and Polars, PyCharm provides this functionality for Hugging Face datasets, NumPy, PyTorch, and TensorFlow.
An interactive table in PyCharm 2024.2.2 Pro provides tools for inspecting pandas and Polars DataFrames
The popularity of Polars has led to the creation of a new project called Narwhals. Independent from pandas and Polars, Narwhals aims to unite the APIs of both tools (and many others). Since it is a very young project (started in February 2024), it hasn't yet shown up on our list of the most popular data processing tools, but we suspect it may get there in the next few years.
Also worth mentioning are Spark (16%) and Dask (7%), which are useful for processing large quantities of data thanks to their parallel processes. These tools require a bit more engineering capability to set up. However, as the amount of data that projects depend on increasingly exceeds what a traditional Python program can handle, these tools will become more important and we may see these figures go up.
Data visualization: Will HoloViz Panel surpass Plotly Dash and Streamlit within the next year?
Data scientists have to be able to create reports and explain their findings to businesses. Various interactive visualization dashboard tools have been developed for working with Python. According to the survey results, the most popular of them is Plotly Dash.
Plotly is well known in the data science community for its graphing libraries, including an R package that is popular with users of ggplot2, a highly popular visualization library for the R language. Ever since Python became popular for data science, Plotly has also provided a Python library, which gives you a similar experience in Python. In recent years, Dash, a Python framework for building reactive web apps developed by Plotly, has become an obvious choice for those who are used to Plotly and need to build an interactive dashboard. However, Dash's API requires some basic understanding of the elements used in HTML when designing the layout of an app. For users who have little to no frontend experience, this could be a hurdle they need to overcome before making effective use of Dash.
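For a sense of what that HTML-flavoured layout looks like in practice, here is a minimal Dash sketch (assuming a recent Dash release; the figure data is made up):
from dash import Dash, dcc, html

app = Dash(__name__)

# The layout mirrors HTML structure: Div, H1, and so on are declared explicitly
app.layout = html.Div(
    [
        html.H1("Monthly signups"),
        dcc.Graph(figure={"data": [{"type": "bar", "x": [1, 2, 3], "y": [10, 30, 20]}]}),
    ]
)

if __name__ == "__main__":
    app.run(debug=True)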
Second place for "best visualization dashboard" goes to Streamlit, which has now joined forces with Snowflake. It doesn't have as long of a history as Plotly, but it has been gaining a lot of momentum over the past few years because it's easy to use and comes packaged with a command line tool. Although Streamlit is not as customizable as Plotly, building the layout of the dashboard is quite straightforward, and it supports multipage apps, making it possible to build more complex applications.
However, in the 2024 results these numbers may change a little. There are up-and-coming tools that could catch up to - or even surpass - these apps in popularity. One of them is HoloViz Panel. As one of the libraries in the HoloViz ecosystem, it is sponsored by NumFOCUS and is gaining traction among the PyData community. Panel lets users generate reports in the HTML format and also works very well with Jupyter Notebook. It offers templates to help new users get started, as well as a great deal of customization options for expert users who want to fine-tune their dashboards.
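By way of comparison, a Panel report can be quite compact; this is only a sketch with illustrative names, served with: panel serve app.py
import panel as pn

pn.extension()

slider = pn.widgets.IntSlider(name="Window size", start=1, end=30, value=7)

def describe(window):
    return f"Smoothing with a rolling window of {window} days"

# pn.bind re-renders the text whenever the slider changes
pn.Column("# Minimal Panel report", slider, pn.bind(describe, slider)).servable()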
ML models: scikit-learn is still prominent, while PyTorch is the most popular for deep learning
Because generative AI and LLMs have been such hot topics in recent years, you might expect deep learning frameworks and libraries to have completely taken over. However, this isn't entirely true. There is still a lot of insight that can be extracted from data using traditional statistics-based methods offered by scikit-learn, a well-known machine learning library mostly maintained by researchers. Sponsored by NumFOCUS since 2020, it remains the most important library in machine learning and data science. SciPy, another Python library that provides support for scientific calculations, is also one of the most used libraries in data science.
Having said that, we cannot ignore the impact of deep learning and the increase in popularity of deep learning frameworks. PyTorch, a machine learning library created by Meta, is now under the governance of the Linux Foundation. In light of this change, we can expect PyTorch to continue being a load-bearing library in the open-source ecosystem and to maintain its level of active community involvement. As the most used deep learning framework, it is loved by Python users - especially those who are familiar with numpy, since "tensors", the basic data structures in PyTorch, are very similar to numpy arrays.
You can inspect PyTorch tensors in PyCharm 2024.2.2 Pro just like you inspect NumPy arrays
Unlike TensorFlow, which uses a static computational graph, PyTorch uses a dynamic one - and this makes profiling in Python a blast. To top it all off, PyTorch also provides a profiling API, making it a good choice for research and experimentation. However, if your deep learning project needs to be scalable in deployment and needs to support multiple programming languages, TensorFlow may be a better choice, as it is compatible with many languages, including C++, JavaScript, Python, C#, Ruby, and Swift. Keras is a tool that makes TensorFlow more accessible and is itself a popular choice for deep learning.
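As a small illustration of the NumPy-like feel and the dynamically built graph (values chosen purely for illustration):
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = (x ** 2).sum()         # the graph is recorded dynamically, as the code executes
y.backward()               # backpropagate through that graph
print(x.grad)              # d(sum of squares)/dx = 2 * x
print(x.detach().numpy())  # tensors convert to and from NumPy arrays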
Another framework we cannot ignore for deep learning is Hugging Face Transformers. Hugging Face is a hub that provides many state-of-the-art pre-trained deep learning models that are popular in the data science and machine learning community, which you can download and train further yourself. Transformers is a library maintained by Hugging Face and the community for state-of-the-art machine learning with PyTorch, TensorFlow, and JAX. We can expect Hugging Face Transformers will gain more users in 2024 due to the popularity of LLMs.
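As a minimal sketch of pulling a pre-trained model through the Transformers pipeline API (the default sentiment model is downloaded from the Hugging Face Hub on first use; the example sentence is made up):
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # loads a default pre-trained model from the Hub
print(classifier("Polars 1.0 made my data pipeline noticeably faster."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]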
With PyCharm you can identify and manage Hugging Face models in a dedicated tool window. PyCharm can also help you to choose the right model for your use case from the large variety of Hugging Face models directly in the IDE.
One new library that is worth paying attention to in 2024 is Scikit-LLM, which allows you to tap into OpenAI models like ChatGPT and integrate them with scikit-learn. This is very handy when text analysis is needed, and you can perform analysis using models from scikit-learn with the power of modern LLMs.
MLOps: The future of data science projects
One aspect of data science projects that is essential but frequently overlooked is MLOps (machine learning operations). In the workflow of a data science project, data scientists need to manage data, retrain the model, and have version control for all the data and models used. Sometimes, when a machine learning application is deployed in production, performance and usage also need to be observed and monitored.
In recent years, MLOps tools designed for data science projects have emerged. One of the issues that has been bothering data scientists and data engineers is versioning the data, which is crucial when your pipeline constantly has data flowing in.
Data scientists and engineers also need to track their experiments. Since the machine learning model will be retrained with new data and hyperparameters will be fine-tuned, it's important to keep track of model training and experiment results. Right now, the most popular tool is TensorBoard. However, this may be changing soon. TensorBoard.dev has been deprecated, which means users are now forced to deploy their own TensorBoard installations locally or share results using the TensorBoard integration with Google Colab. As a result, we may see a drop in the usage of TensorBoard and an uptick in that of other tools like MLflow and PyTorch.
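As a rough sketch of what experiment tracking with MLflow looks like (the run, parameter, and metric names here are invented for illustration):
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)   # hyperparameters used for this run
    mlflow.log_metric("val_accuracy", 0.87)   # results to compare across runs

# Browse and compare runs locally with the CLI: mlflow ui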
Another MLOps step that is necessary for ensuring that data projects run smoothly is shipping the development environment for production. The use of Docker containers, a common development practice among software engineers, seems to have been adopted by the data science community. This ensures that the development environment and the production environment remain consistent, which is important for data science projects involving machine learning models that need to be deployed as applications. We can see that Docker is a popular tool among Python users who need to deploy services to the cloud.
This year, Docker containers are slightly ahead of Anaconda in the "Python installation and upgrade" category.
Charts: 2023 survey results and 2022 survey results
Big data: How much is enough?
One common misconception is that we will need more data to train better, more complex models in order to improve prediction. However, this is not the case. Since models can be overfitted, more is not always better in machine learning. Different tools and approaches will be required depending on the use case, the model, and how much data is being handled at the same time.
The challenge of handling a huge amount of data in Python is that most Python libraries rely on the data being stored in the memory. We could just deploy cloud computing resources with huge amounts of memory, but even this approach has its limitations and would sometimes be slow and costly.
When handling huge amounts of data that are hard to fit in memory, a common solution is to use distributed computing resources. Computation tasks and data are distributed over a cluster to be performed and handled in parallel. This approach makes data science and machine learning operations scalable, and the most popular engine for this is Apache Spark. Spark can be used with PySpark, the Python API library for it.
As of Spark 2.0, anyone using the Spark RDD API is encouraged to switch to Spark SQL, which provides better performance. Spark SQL also makes it easier for data scientists to handle data because it enables SQL queries to be executed. We can expect PySpark to remain the most popular choice in 2024.
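A minimal PySpark sketch of that Spark SQL style, with made-up table and column names:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame(
    [("Oslo", 2.0), ("Oslo", 4.0), ("Paris", 9.0)],
    ["city", "temp"],
)
df.createOrReplaceTempView("weather")

# The query runs through the same optimizer as the DataFrame API and is distributed across the cluster
spark.sql("SELECT city, AVG(temp) AS avg_temp FROM weather GROUP BY city").show()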
Another popular tool for managing data in clusters is Databricks. If you are using Databricks to work with your data in clusters, now you can benefit from the powerful integration of Databricks and PyCharm. You can write code for your pipelines and jobs in PyCharm, then deploy, test, and run it in real time on your Databricks cluster without any additional configuration.
Communities: Events shifting focus toward data science
Many newcomers to Python are using it for data science, and thus more Python libraries have been catering to data science use cases. In that same vein, Python events like PyCon and EuroPython are beginning to include more tracks, talks, and workshops that focus on data science, while events that are specific to data science, like PyData and SciPy, remain popular, as well.
Final thoughts
Data science and machine learning are becoming increasingly active, and together with the popularity of AI and LLMs, more and more new open source tools have become available for use in data science. The landscape of data science continues to change rapidly, and we are excited to see what becomes most popular in the 2024 survey results.
Enhance your data science experience with PyCharm
Modern data science demands skills for a wide range of tasks, including data processing and visualization, coding, model deployment, and managing large datasets. As an integrated development environment (IDE), PyCharm helps you efficiently build this skill set. It provides intelligent coding assistance, top-tier debugging, version control, integrated database management, and seamless Docker integration. For data science, PyCharm supports Jupyter notebooks, as well as key scientific and machine learning libraries, and it integrates with tools like the Hugging Face models library, Anaconda, and Databricks.
Start using PyCharm for your data science projects today and enjoy its latest improvements, including features for inspecting pandas and Polars DataFrames, and for layer-by-layer inspection of PyTorch tensors, which is handy when exploring data and building deep learning models.
02 Dec 2024 10:43am GMT