17 Apr 2024

Kubernetes v1.30: Uwubernetes

Editors: Amit Dsouza, Frederick Kautz, Kristin Martin, Abigail McCarthy, Natali Vlatko

Announcing the release of Kubernetes v1.30: Uwubernetes, the cutest release!

Similar to previous releases, the release of Kubernetes v1.30 introduces new stable, beta, and alpha features. The consistent delivery of top-notch releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 45 enhancements. Of those enhancements, 17 have graduated to Stable, 18 are entering Beta, and 10 are new in Alpha.

Release theme and logo

Kubernetes v1.30: Uwubernetes

Kubernetes v1.30 makes your clusters cuter!

Kubernetes is built and released by thousands of people from all over the world and all walks of life. Most contributors are not being paid to do this; we build it for fun, to solve a problem, to learn something, or for the simple love of the community. Many of us found our homes, our friends, and our careers here. The Release Team is honored to be a part of the continued growth of Kubernetes.

For the people who built it, for the people who release it, and for the furries who keep all of our clusters online, we present to you Kubernetes v1.30: Uwubernetes, the cutest release to date. The name is a portmanteau of "kubernetes" and "UwU," an emoticon used to indicate happiness or cuteness. We've found joy here, but we've also brought joy from our outside lives that helps to make this community as weird and wonderful and welcoming as it is. We're so happy to share our work with you.

UwU ♥️

Improvements that graduated to stable in Kubernetes v1.30

This is a selection of some of the improvements that are now stable following the v1.30 release.

Robust VolumeManager reconstruction after kubelet restart (SIG Storage)

This is a volume manager refactoring that allows the kubelet to populate additional information about how existing volumes are mounted during the kubelet startup. In general, this makes volume cleanup after kubelet restart or machine reboot more robust.

This does not bring any changes for users or cluster administrators. We used the feature process and feature gate NewVolumeManagerReconstruction to be able to fall back to the previous behavior in case something goes wrong. Now that the feature is stable, the feature gate is locked and cannot be disabled.

Prevent unauthorized volume mode conversion during volume restore (SIG Storage)

For Kubernetes 1.30, the control plane always prevents unauthorized changes to volume modes when restoring a snapshot into a PersistentVolume. As a cluster administrator, you'll need to grant permissions to the appropriate identity principals (for example: ServiceAccounts representing a storage integration) if you need to allow that kind of change at restore time.

For more information on this feature, read converting the volume mode of a Snapshot.

Pod Scheduling Readiness (SIG Scheduling)

Pod scheduling readiness graduates to stable this release, after being promoted to beta in Kubernetes v1.27.

This now-stable feature lets Kubernetes avoid trying to schedule a Pod that has been defined, when the cluster doesn't yet have the resources provisioned to allow actually binding that Pod to a node. That's not the only use case; the custom control on whether a Pod can be allowed to schedule also lets you implement quota mechanisms, security controls, and more.

Crucially, marking these Pods as exempt from scheduling cuts the work that the scheduler would otherwise do, churning through Pods that can't or won't schedule onto the nodes your cluster currently has. If you have cluster autoscaling active, using scheduling gates doesn't just cut the load on the scheduler, it can also save money. Without scheduling gates, the autoscaler might launch a node that isn't actually needed yet.

In Kubernetes v1.30, by specifying (or removing) a Pod's .spec.schedulingGates, you can control when a Pod is ready to be considered for scheduling. This is a stable feature and is now formally part of the Kubernetes API definition for Pod.
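
As a sketch of how this looks in practice, the hypothetical Pod below is created with a scheduling gate and remains unscheduled until the gate is removed (the gate name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  schedulingGates:
    - name: example.com/quota-check   # the scheduler ignores this Pod until the gate is removed
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9

Once your controller removes the last entry from .spec.schedulingGates, the scheduler starts considering the Pod as usual.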

Min domains in PodTopologySpread (SIG Scheduling)

The minDomains parameter for PodTopologySpread constraints graduates to stable this release, which allows you to define the minimum number of domains. This feature is designed to be used with Cluster Autoscaler.

If you attempted to use this and there weren't enough domains already present, Pods would be marked as unschedulable. The Cluster Autoscaler would then provision node(s) in new domain(s), and you'd eventually get Pods spreading over enough domains.
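
For illustration, here is a hedged example of a spread constraint that uses minDomains (label values, counts, and the image are made up); with fewer than three zones available, matching Pods stay unschedulable and can trigger the Cluster Autoscaler:

apiVersion: v1
kind: Pod
metadata:
  name: example-spread-pod
  labels:
    app: web
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      minDomains: 3                      # require at least three zones before spreading is considered satisfied
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule   # minDomains only takes effect with DoNotSchedule
      labelSelector:
        matchLabels:
          app: web
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9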

Go workspaces for k/k (SIG Architecture)

The Kubernetes repo now uses Go workspaces. This should not impact end users at all, but does have an impact for developers of downstream projects. Switching to workspaces caused some breaking changes in the flags to the various k8s.io/code-generator tools. Downstream consumers should look at staging/src/k8s.io/code-generator/kube_codegen.sh to see the changes.

For full details on the changes and reasons why Go workspaces was introduced, read Using Go workspaces in Kubernetes.

Improvements that graduated to beta in Kubernetes v1.30

This is a selection of some of the improvements that are now beta following the v1.30 release.

Node log query (SIG Windows)

To help with debugging issues on nodes, Kubernetes v1.27 introduced a feature that allows fetching logs of services running on the node. To use the feature, ensure that the NodeLogQuery feature gate is enabled for that node, and that the kubelet configuration options enableSystemLogHandler and enableSystemLogQuery are both set to true.

Following the v1.30 release, this is now beta (you still need to enable the feature to use it, though).

On Linux the assumption is that service logs are available via journald. On Windows the assumption is that service logs are available in the application log provider. Logs are also available by reading files within /var/log/ (Linux) or C:\var\log\ (Windows). For more information, see the log query documentation.
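
As a rough sketch, the kubelet configuration for a node that should serve log queries could look like this (assuming the feature gate still needs to be enabled on your version):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeLogQuery: true        # beta in v1.30, still needs to be enabled explicitly
enableSystemLogHandler: true
enableSystemLogQuery: true

With that in place, logs can be fetched through the kubelet's node proxy endpoint; see the log query documentation for the exact request format.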

CRD validation ratcheting (SIG API Machinery)

You need to enable the CRDValidationRatcheting feature gate to use this behavior, which then applies to all CustomResourceDefinitions in your cluster.

Provided you enabled the feature gate, Kubernetes implements validation ratcheting for CustomResourceDefinitions. The API server is willing to accept updates to resources that are not valid after the update, provided that each part of the resource that failed to validate was not changed by the update operation. In other words, any invalid part of the resource that remains invalid must have already been wrong. You cannot use this mechanism to update a valid resource so that it becomes invalid.

This feature allows authors of CRDs to confidently add new validations to the OpenAPIV3 schema under certain conditions. Users can update to the new schema safely without bumping the version of the object or breaking workflows.
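
As a small, hypothetical illustration (the field name and limit below are made up): suppose a CRD schema gains a stricter constraint on an existing field. With ratcheting enabled, existing objects whose old value violates the new constraint can still be updated, as long as that particular field is left unchanged.

# Excerpt of a hypothetical CRD schema after tightening validation.
# Objects created before this change with a longer displayName can still be
# updated, provided the update does not touch displayName itself.
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      properties:
        displayName:
          type: string
          maxLength: 63   # newly added constraint; ratcheting tolerates old, unchanged values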

Contextual logging (SIG Instrumentation)

Contextual Logging advances to beta in this release, empowering developers and operators to inject customizable, correlatable contextual details like service names and transaction IDs into logs through WithValues and WithName. This enhancement simplifies the correlation and analysis of log data across distributed systems, significantly improving the efficiency of troubleshooting efforts. By offering a clearer insight into the workings of your Kubernetes environments, Contextual Logging ensures that operational challenges are more manageable, marking a notable step forward in Kubernetes observability.

Make Kubernetes aware of the LoadBalancer behaviour (SIG Network)

The LoadBalancerIPMode feature gate graduates to beta and is now enabled by default. This feature allows you to set the .status.loadBalancer.ingress.ipMode for a Service with type set to LoadBalancer. The .status.loadBalancer.ingress.ipMode specifies how the load-balancer IP behaves. It may be specified only when the .status.loadBalancer.ingress.ip field is also specified. For more details, see specifying IPMode of load balancer status.
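
The field lives in the Service status and is set by the cloud provider's controller rather than by end users. A hedged sketch of what that status might look like (the address is an example):

apiVersion: v1
kind: Service
metadata:
  name: example-lb
spec:
  type: LoadBalancer
  # ports and selector omitted for brevity
status:
  loadBalancer:
    ingress:
      - ip: 192.0.2.10
        ipMode: Proxy   # or VIP; tells kube-proxy how the load-balancer IP behaves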

New alpha features

Speed up recursive SELinux label change (SIG Storage)

Since the v1.27 release, Kubernetes has included an optimization that sets SELinux labels on the contents of volumes in constant time. Kubernetes achieves that speed-up using a mount option. The slower legacy behavior requires the container runtime to recursively walk through the whole volume and apply SELinux labelling individually to each file and directory; this is especially noticeable for volumes with a large number of files and directories.

Kubernetes 1.27 graduated this feature as beta, but limited it to ReadWriteOncePod volumes. The corresponding feature gate is SELinuxMountReadWriteOncePod. It's still enabled by default and remains beta in 1.30.

Kubernetes 1.30 extends support for SELinux mount option to all volumes as alpha, with a separate feature gate: SELinuxMount. This feature gate introduces a behavioral change when multiple Pods with different SELinux labels share the same volume. See KEP for details.

We strongly encourage users that run Kubernetes with SELinux enabled to test this feature and provide any feedback on the KEP issue.

Feature gate                    Stage in v1.30    Behavior change
SELinuxMountReadWriteOncePod    Beta              No
SELinuxMount                    Alpha             Yes

Both feature gates SELinuxMountReadWriteOncePod and SELinuxMount must be enabled to test this feature on all volumes.

This feature has no effect on Windows nodes or on Linux nodes without SELinux support.

Recursive Read-only (RRO) mounts (SIG Node)

Recursive Read-Only (RRO) mounts, introduced as alpha in this release, add a new layer of security for your data. This feature lets you set volumes and their submounts as read-only, preventing accidental modifications. Imagine deploying a critical application where data integrity is key: RRO mounts ensure that your data stays untouched, reinforcing your cluster's security with an extra safeguard. This is especially crucial in tightly controlled environments, where even the slightest change can have significant implications.

Job success/completion policy (SIG Apps)

From Kubernetes v1.30, indexed Jobs support .spec.successPolicy to define when a Job can be declared succeeded based on succeeded Pods. This allows you to define two types of criteria: succeededIndexes, the set of indexes that must succeed, and succeededCount, the minimal number of succeeded indexes required.

After the Job meets the success policy, the Job controller terminates the lingering Pods.
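
A hedged example of an Indexed Job using a success policy, assuming the JobSuccessPolicy feature gate is enabled (the counts and image are illustrative); here the Job is declared succeeded once two of the listed indexes finish successfully:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-success-policy
spec:
  completionMode: Indexed        # successPolicy requires an Indexed Job
  completions: 5
  parallelism: 5
  successPolicy:
    rules:
      - succeededIndexes: "0,2-3"   # only these indexes are considered
        succeededCount: 2           # the Job succeeds once 2 of them succeed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.k8s.io/pause:3.9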

Traffic distribution for services (SIG Network)

Kubernetes v1.30 introduces the spec.trafficDistribution field within a Kubernetes Service as alpha. This allows you to express preferences for how traffic should be routed to Service endpoints. While traffic policies focus on strict semantic guarantees, traffic distribution allows you to express preferences (such as routing to topologically closer endpoints). This can help optimize for performance, cost, or reliability. You can use this field by enabling the ServiceTrafficDistribution feature gate for your cluster and all of its nodes. In Kubernetes v1.30, the following field value is supported:

PreferClose: Indicates a preference for routing traffic to endpoints that are topologically proximate to the client. The interpretation of "topologically proximate" may vary across implementations and could encompass endpoints within the same node, rack, zone, or even region. Setting this value gives implementations permission to make different tradeoffs, for example optimizing for proximity rather than equal distribution of load. You should not set this value if such tradeoffs are not acceptable.

If the field is not set, the implementation (like kube-proxy) will apply its default routing strategy.
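
A minimal sketch, assuming the ServiceTrafficDistribution feature gate is enabled (names and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: example-svc
spec:
  selector:
    app: example
  ports:
    - port: 80
      targetPort: 8080
  trafficDistribution: PreferClose   # prefer topologically closer endpoints when possible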

See Traffic Distribution for more details.

Graduations, deprecations and removals for Kubernetes v1.30

Graduated to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 17 enhancements promoted to Stable:

Deprecations and removals

Removed the SecurityContextDeny admission plugin, deprecated since v1.27 (SIG Auth, SIG Security, and SIG Testing). With the removal of the SecurityContextDeny admission plugin, the Pod Security Admission plugin, available since v1.25, is recommended instead.

Release notes

Check out the full details of the Kubernetes 1.30 release in our release notes.

Availability

Kubernetes 1.30 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install 1.30 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to deliver the Kubernetes v1.30 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out to our release lead, Kat Cosgrove, for supporting us through a successful release cycle, advocating for us, making sure that we could all contribute in the best way possible, and challenging us to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.30 release cycle, which ran for 14 weeks (January 8 to April 17), we saw contributions from 863 companies and 1391 individuals.

Event update

Upcoming release webinar

Join members of the Kubernetes v1.30 release team on Thursday, May 23rd, 2024, at 9 A.M. PT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.


Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.


11 Apr 2024

Spotlight on SIG Architecture: Code Organization

This is the third interview in a SIG Architecture Spotlight series covering its different subprojects. In this article, we cover SIG Architecture: Code Organization.

In this SIG Architecture spotlight I talked with Madhav Jivrajani (VMware), a member of the Code Organization subproject.

Introducing the Code Organization subproject

Frederico (FSM): Hello Madhav, thank you for your availability. Could you start by telling us a bit about yourself, your role and how you got involved in Kubernetes?

Madhav Jivrajani (MJ): Hello! My name is Madhav Jivrajani, I serve as a technical lead for SIG Contributor Experience and a GitHub Admin for the Kubernetes project. Apart from that I also contribute to SIG API Machinery and SIG Etcd, but more recently, I've been helping out with the work that is needed to help Kubernetes stay on supported versions of Go, and it is through this that I am involved with the Code Organization subproject of SIG Architecture.

FSM: A project the size of Kubernetes must have unique challenges in terms of code organization -- is this a fair assumption? If so, what would you pick as some of the main challenges that are specific to Kubernetes?

MJ: That's a fair assumption! The first interesting challenge comes from the sheer size of the Kubernetes codebase. We have ≅2.2 million lines of Go code (which is steadily decreasing thanks to dims and other folks in this sub-project!), and a little over 240 dependencies that we rely on either directly or indirectly, which is why having a sub-project dedicated to helping out with dependency management is crucial: we need to know what dependencies we're pulling in, what versions these dependencies are at, and tooling to help make sure we are managing these dependencies across different parts of the codebase in a consistent manner.

Another interesting challenge with Kubernetes is that we publish a lot of Go modules as part of the Kubernetes release cycles; one example of this is client-go. However, we as a project would also like the benefits of having everything in one repository, like atomic commits... so, because of this, code organization works with other SIGs (like SIG Release) to automate the process of publishing code from the monorepo to downstream individual repositories, which are much easier to consume. This way you won't have to import the entire Kubernetes codebase!

Code organization and Kubernetes

FSM: For someone just starting contributing to Kubernetes code-wise, what are the main things they should consider in terms of code organization? How would you sum up the key concepts?

MJ: I think one of the key things to keep in mind at least as you're starting off is the concept of staging directories. In the kubernetes/kubernetes repository, you will come across a directory called staging/. The sub-folders in this directory serve as a bunch of pseudo-repositories. For example, the kubernetes/client-go repository that publishes releases for client-go is actually a staging repo.

FSM: So the concept of staging directories fundamentally impacts contributions?

MJ: Precisely, because if you'd like to contribute to any of the staging repos, you will need to send in a PR to its corresponding staging directory in kubernetes/kubernetes. Once the code merges there, we have a bot called the publishing-bot that will sync the merged commits to the required staging repositories (like kubernetes/client-go). This way we get the benefits of a monorepo but we also can modularly publish code for downstream consumption. PS: The publishing-bot needs more folks to help out!

For more information on staging repositories, please see the contributor documentation.

FSM: Speaking of contributions, the very high number of contributors, both individuals and companies, must also be a challenge: how does the subproject operate in terms of making sure that standards are being followed?

MJ: When it comes to dependency management in the project, there is a dedicated team that helps review and approve dependency changes. These are folks who have helped lay the foundation of much of the tooling that Kubernetes uses today for dependency management. This tooling helps ensure there is a consistent way that contributors can make changes to dependencies. The project has also worked on additional tooling to signal statistics of dependencies that are being added or removed: depstat

Apart from dependency management, another crucial task that the project does is management of the staging repositories. The tooling for achieving this (publishing-bot) is completely transparent to contributors and helps ensure that the staging repos get a consistent view of contributions that are submitted to kubernetes/kubernetes.

Code Organization also works towards making sure that Kubernetes stays on supported versions of Go. The linked KEP provides more context on why we need to do this. We collaborate with SIG Release to ensure that we are testing Kubernetes as rigorously and as early as we can on Go releases and working on changes that break our CI as a part of this. An example of how we track this process can be found here.

Release cycle and current priorities

FSM: Is there anything that changes during the release cycle?

MJ: During the release cycle, specifically before code freeze, there are often changes that go in that add, update, or delete dependencies, or fix code as part of our effort to stay on supported versions of Go.

Furthermore, some of these changes are also candidates for backporting to our supported release branches.

FSM: Is there any major project or theme the subproject is working on right now that you would like to highlight?

MJ: I think one very interesting and immensely useful change that has been recently added (and I take the opportunity to specifically highlight the work of Tim Hockin on this) is the introduction of Go workspaces to the Kubernetes repo. A lot of our current tooling for dependency management and code publishing, as well as the experience of editing code in the Kubernetes repo, can be significantly improved by this change.

Wrapping up

FSM: How would someone interested in the topic start helping the subproject?

MJ: The first step, as with any project in Kubernetes, is to join our Slack: slack.k8s.io, and after that join the #k8s-code-organization channel. There are also code organization office hours that you can choose to attend. Timezones are hard, so feel free to also look at the recordings or meeting notes and follow up on Slack!

FSM: Excellent, thank you! Any final comments you would like to share?

MJ: The Code Organization subproject always needs help! Especially areas like the publishing bot, so don't hesitate to get involved in the #k8s-code-organization Slack channel.


05 Apr 2024

DIY: Create Your Own Cloud with Kubernetes (Part 3)

Approaching the most interesting phase, this article delves into running Kubernetes within Kubernetes. Technologies such as Kamaji and Cluster API are highlighted, along with their integration with KubeVirt.

Previous discussions have covered preparing Kubernetes on bare metal and how to turn Kubernetes into a virtual machine management system. This article concludes the series by explaining how, using all of the above, you can build a full-fledged managed Kubernetes and run virtual Kubernetes clusters with just a click.

First up, let's dive into the Cluster API.

Cluster API

Cluster API is an extension for Kubernetes that allows the management of Kubernetes clusters as custom resources within another Kubernetes cluster.

The main goal of the Cluster API is to provide a unified interface for describing the basic entities of a Kubernetes cluster and managing their lifecycle. This enables the automation of processes for creating, updating, and deleting clusters, simplifying scaling, and infrastructure management.

Within the context of Cluster API, there are two terms: management cluster and tenant clusters.

It's important to understand that physically, tenant clusters do not necessarily have to run on the same infrastructure with the management cluster; more often, they are running elsewhere.

A diagram showing interaction of management Kubernetes cluster and tenant Kubernetes clusters using Cluster API

For its operation, Cluster API utilizes the concept of providers, which are separate controllers responsible for specific components of the cluster being created. Within Cluster API, there are several types of providers. The major ones are the infrastructure provider, the control plane provider, and the bootstrap provider, each of which is described below.

To get started, you will need to install the Cluster API itself and one provider of each type. You can find a complete list of supported providers in the project's documentation.

For installation, you can use the clusterctl utility, or Cluster API Operator as the more declarative method.

Choosing providers

Infrastructure provider

To run Kubernetes clusters using KubeVirt, the KubeVirt Infrastructure Provider must be installed. It enables the deployment of virtual machines for worker nodes in the same management cluster, where the Cluster API operates.

Control plane provider

The Kamaji project offers a ready solution for running the Kubernetes control plane for tenant clusters as containers within the management cluster. This approach has several significant advantages.

Bootstrap provider

Kubeadm serves as the bootstrap provider, the standard method for preparing clusters in Cluster API. This provider is developed as part of the Cluster API project itself. It requires only a prepared system image with kubelet and kubeadm installed, and allows generating configs in the cloud-init and ignition formats.

It's worth noting that Talos Linux also supports provisioning via the Cluster API and has providers for this. Although previous articles discussed using Talos Linux to set up a management cluster on bare-metal nodes, to provision tenant clusters the Kamaji+Kubeadm approach has more advantages. It facilitates the deployment of Kubernetes control planes in containers, thus removing the need for separate virtual machines for control plane instances. This simplifies the management and reduces costs.

How it works

The primary object in Cluster API is the Cluster resource, which acts as the parent for all the others. Typically, this resource references two others: a resource describing the control plane and a resource describing the infrastructure, each managed by a separate provider.

Unlike the Cluster, these two resources are not standardized, and their kind depends on the specific provider you are using:

A diagram showing the relationship of a Cluster resource and the resources it links to in Cluster API
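
As an illustration, here is a hedged sketch of what such a Cluster resource could look like when using the Kamaji control plane provider and the KubeVirt infrastructure provider (the names are placeholders, and the referenced API versions should be checked against the provider versions you install):

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: tenant-1
spec:
  controlPlaneRef:                     # managed by the control plane provider (Kamaji)
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
    kind: KamajiControlPlane
    name: tenant-1
  infrastructureRef:                   # managed by the infrastructure provider (KubeVirt)
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: KubevirtCluster
    name: tenant-1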

Within Cluster API, there is also a resource named MachineDeployment, which describes a group of nodes, whether they are physical servers or virtual machines. This resource functions similarly to standard Kubernetes resources such as Deployment, ReplicaSet, and Pod, providing a mechanism for the declarative description of a group of nodes and automatic scaling.

In other words, the MachineDeployment resource allows you to declaratively describe nodes for your cluster, automating their creation, deletion, and updating according to specified parameters and the requested number of replicas.

A diagram showing the relationship of a MachineDeployment resource and its children in Cluster API

To create machines, MachineDeployment refers to a template for generating the machine itself and a template for generating its cloud-init config:

A diagram showing the relationship of a MachineDeployment resource and the resources it links to in Cluster API

To deploy a new Kubernetes cluster using Cluster API, you will need to prepare the following set of resources: a Cluster, the provider-specific control plane and infrastructure resources it references, and a MachineDeployment together with its machine template and bootstrap config template.

Polishing the cluster

In most cases, this is sufficient, but depending on the providers used, you may need other resources as well. You can find examples of the resources created for each type of provider in the Kamaji project documentation.

At this stage, you already have a ready tenant Kubernetes cluster, but so far, it contains nothing but API workers and a few core plugins that are included by default in the installation of any Kubernetes cluster: kube-proxy and CoreDNS. For full integration, you will need to install several more components:

To install additional components, you can use a separate Cluster API Add-on Provider for Helm, or the same FluxCD discussed in previous articles.

When creating resources in FluxCD, it's possible to specify the target cluster by referring to the kubeconfig generated by Cluster API. Then, the installation will be performed directly into it. Thus, FluxCD becomes a universal tool for managing resources both in the management cluster and in the user tenant clusters.

A diagram showing the interaction scheme of fluxcd, which can install components in both management and tenant Kubernetes clusters
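
For example, a Flux HelmRelease can target a tenant cluster by pointing at the kubeconfig Secret that Cluster API generates. A hedged sketch (chart and repository names are illustrative; Cluster API typically names the Secret after the cluster):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cilium
  namespace: tenant-1
spec:
  interval: 5m
  kubeConfig:
    secretRef:
      name: tenant-1-kubeconfig   # kubeconfig Secret created by Cluster API
  chart:
    spec:
      chart: cilium
      sourceRef:
        kind: HelmRepository
        name: cilium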

What components are being discussed here? Generally, the set includes the following:

CNI Plugin

To ensure communication between pods in a tenant Kubernetes cluster, it's necessary to deploy a CNI plugin. This plugin creates a virtual network that allows pods to interact with each other and is traditionally deployed as a Daemonset on the cluster's worker nodes. You can choose and install any CNI plugin that you find suitable.

A diagram showing a CNI plugin installed inside the tenant Kubernetes cluster on a scheme of nested Kubernetes clusters

Cloud Controller Manager

The main task of the Cloud Controller Manager (CCM) is to integrate Kubernetes with the cloud infrastructure provider's environment (in your case, this is the management Kubernetes cluster in which all the workers of the tenant Kubernetes cluster are provisioned). Here are some tasks it performs:

  1. When a service of type LoadBalancer is created, the CCM initiates the process of creating a cloud load balancer, which directs traffic to your Kubernetes cluster.
  2. If a node is removed from the cloud infrastructure, the CCM ensures its removal from your cluster as well, maintaining the cluster's current state.
  3. When using the CCM, nodes are added to the cluster with a special taint, node.cloudprovider.kubernetes.io/uninitialized, which allows for the processing of additional business logic if necessary. After successful initialization, this taint is removed from the node.

Depending on the cloud provider, the CCM can operate both inside and outside the tenant cluster.

The KubeVirt Cloud Provider is designed to be installed in the external parent management cluster. Thus, creating services of type LoadBalancer in the tenant cluster initiates the creation of LoadBalancer services in the parent cluster, which direct traffic into the tenant cluster.

A diagram showing a Cloud Controller Manager installed outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters and the mapping of services it manages from the parent to the child Kubernetes cluster

CSI Driver

The Container Storage Interface (CSI) is divided into two main parts for interacting with storage in Kubernetes:

In the context of using the KubeVirt CSI Driver, a unique opportunity arises. Since virtual machines in KubeVirt run within the management Kubernetes cluster, where a full-fledged Kubernetes API is available, this opens the path for running the csi-controller outside of the user's tenant cluster. This approach is popular in the KubeVirt community and offers several key advantages.

However, the CSI-node must necessarily run inside the tenant cluster, as it directly interacts with kubelet on each node. This component is responsible for the mounting and unmounting of volumes into pods, requiring close integration with processes occurring directly on the cluster nodes.

The KubeVirt CSI Driver acts as a proxy for ordering volumes. When a PVC is created inside the tenant cluster, a PVC is created in the management cluster, and then the created PV is connected to the virtual machine.

A diagram showing a CSI plugin components installed on both inside and outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters and the mapping of persistent volumes it manages from the parent to the child Kubernetes cluster

Cluster Autoscaler

The Cluster Autoscaler is a versatile component that can work with various cloud APIs, and its integration with Cluster API is just one of the available functions. For proper configuration, it requires access to two clusters: the tenant cluster, to track pods and determine the need for adding new nodes, and the management Kubernetes cluster, where it interacts with the MachineDeployment resource and adjusts the number of replicas.

Although Cluster Autoscaler usually runs inside the tenant Kubernetes cluster, in this situation, it is suggested to install it outside for the same reasons described before. This approach is simpler to maintain and more secure as it prevents users of tenant clusters from accessing the management API of the management cluster.

A diagram showing a Cluster Autoscaler installed outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters

Konnectivity

There's another additional component I'd like to mention: Konnectivity. You will likely need it later on to get webhooks and the API aggregation layer working in your tenant Kubernetes cluster. This topic is covered in detail in one of my previous articles.

Unlike the components presented above, Kamaji allows you to easily enable Konnectivity and manage it as one of the core components of your tenant cluster, alongside kube-proxy and CoreDNS.

Conclusion

Now you have a fully functional Kubernetes cluster with the capability for dynamic scaling, automatic provisioning of volumes, and load balancers.

Going forward, you might consider metrics and logs collection from your tenant clusters, but that goes beyond the scope of this article.

Of course, all the components necessary for deploying a Kubernetes cluster can be packaged into a single Helm chart and deployed as a unified application. This is precisely how we organize the deployment of managed Kubernetes clusters with the click of a button on our open PaaS platform, Cozystack, where you can try all the technologies described in the article for free.


DIY: Create Your Own Cloud with Kubernetes (Part 2)

Continuing our series of posts on how to build your own cloud using just the Kubernetes ecosystem: in the previous article, we explained how we prepare a basic Kubernetes distribution based on Talos Linux and Flux CD. In this article, we'll show you a few virtualization technologies in Kubernetes and prepare everything needed to run virtual machines in Kubernetes, primarily storage and networking.

We will talk about technologies such as KubeVirt, LINSTOR, and Kube-OVN.

But first, let's explain what virtual machines are needed for, and why you can't just use Docker containers to build a cloud. The reason is that containers do not provide a sufficient level of isolation. Although the situation improves year by year, we often encounter vulnerabilities that allow escaping the container sandbox and elevating privileges in the system.

On the other hand, Kubernetes was not originally designed to be a multi-tenant system, meaning the basic usage pattern involves creating a separate Kubernetes cluster for every independent project and development team.

Virtual machines are the primary means of isolating tenants from each other in a cloud environment. In virtual machines, users can execute code and programs with administrative privilege, but this doesn't affect other tenants or the environment itself. In other words, virtual machines allow you to achieve hard multi-tenancy isolation and to run in environments where tenants do not trust each other.

Virtualization technologies in Kubernetes

There are several different technologies that bring virtualization into the Kubernetes world: KubeVirt and Kata Containers are the most popular ones. But you should know that they work differently.

Kata Containers implements the CRI (Container Runtime Interface) and provides an additional level of isolation for standard containers by running them in virtual machines. But they still run within the same, single Kubernetes cluster.

A diagram showing how container isolation is ensured by running containers in virtual machines with Kata Containers

KubeVirt allows running traditional virtual machines using the Kubernetes API. KubeVirt virtual machines are run as regular Linux processes in containers. In other words, in KubeVirt, a container is used as a sandbox for running virtual machine (QEMU) processes. This can be clearly seen in the figure below, by looking at how live migration of virtual machines is implemented in KubeVirt. When migration is needed, the virtual machine moves from one container to another.

A diagram showing live migration of a virtual machine from one container to another in KubeVirt

There is also an alternative project - Virtink, which implements lightweight virtualization using Cloud-Hypervisor and is initially focused on running virtual Kubernetes clusters using the Cluster API.

Considering our goals, we decided to use KubeVirt as the most popular project in this area. Besides, we have extensive expertise and have already made a lot of contributions to KubeVirt.

KubeVirt is easy to install and allows you to run virtual machines out-of-the-box using the containerDisk feature, which allows you to store and distribute VM images directly as OCI images from a container image registry. Virtual machines with containerDisk are well suited for creating Kubernetes worker nodes and other VMs that do not require state persistence.
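
A hedged example of a KubeVirt VirtualMachine that boots from a containerDisk image (the image reference and sizing are illustrative):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: worker-example
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 2Gi
        devices:
          disks:
            - name: system
              disk:
                bus: virtio
      volumes:
        - name: system
          containerDisk:
            image: ghcr.io/example/talos-vm-image:latest   # illustrative OCI image reference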

For managing persistent data, KubeVirt offers a separate tool, Containerized Data Importer (CDI). It allows for cloning PVCs and populating them with data from base images. The CDI is necessary if you want to automatically provision persistent volumes for your virtual machines, and it is also required for the KubeVirt CSI Driver, which is used to handle persistent volume claims from tenant Kubernetes clusters.

But first, you have to decide where and how you will store this data.

Storage for Kubernetes VMs

With the introduction of the CSI (Container Storage Interface), a wide range of technologies that integrate with Kubernetes has become available. In fact, KubeVirt fully utilizes the CSI interface, aligning the choice of storage for virtualization closely with the choice of storage for Kubernetes itself. However, there are nuances which you need to consider. Unlike containers, which typically use a standard filesystem, block devices are more efficient for virtual machines.

Although the CSI interface in Kubernetes allows requesting both types of volumes, filesystems and block devices, it's important to verify that your storage backend supports this.

Using block devices for virtual machines eliminates the need for an additional abstraction layer, such as a filesystem, that makes it more performant and in most cases enables the use of the ReadWriteMany mode. This mode allows concurrent access to the volume from multiple nodes, which is a critical feature for enabling the live migration of virtual machines in KubeVirt.
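
For example, requesting a raw block volume with shared access for a VM disk might look like this (assuming your CSI driver supports block mode and RWX; the storage class name and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-example
spec:
  accessModes:
    - ReadWriteMany        # required for KubeVirt live migration
  volumeMode: Block        # raw block device, no filesystem layer
  storageClassName: replicated-block   # illustrative storage class name
  resources:
    requests:
      storage: 30Gi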

The storage system can be external or internal (in the case of hyper-converged infrastructure). Using external storage in many cases makes the whole system more stable, as your data is stored separately from compute nodes.

A diagram showing external data storage communication with the compute nodes

External storage solutions are often popular in enterprise systems because such storage is frequently provided by an external vendor who takes care of its operation. The integration with Kubernetes involves only a small component installed in the cluster: the CSI driver. This driver is responsible for provisioning volumes in this storage and attaching them to pods run by Kubernetes. However, such storage solutions can also be implemented using purely open-source technologies. One of the popular solutions is TrueNAS powered by the democratic-csi driver.

A diagram showing local data storage running on the compute nodes

On the other hand, hyper-converged systems are often implemented using local storage (when you do not need replication) and with software-defined storage, often installed directly in Kubernetes, such as Rook/Ceph, OpenEBS, Longhorn, LINSTOR, and others.

A diagram showing clustered data storage running on the compute nodes

A hyper-converged system has its advantages. For example, data locality: when your data is stored locally, access to such data is faster. But there are disadvantages as such a system is usually more difficult to manage and maintain.

At Ænix, we wanted to provide a ready-to-use solution that could be used without the need to purchase and set up additional external storage, and that was optimal in terms of speed and resource utilization. LINSTOR became that solution. Its use of time-tested, industry-popular technologies such as LVM and ZFS as backends gives confidence that data is stored securely. DRBD-based replication is incredibly fast and consumes a small amount of computing resources.

For installing LINSTOR in Kubernetes, there is the Piraeus project, which already provides a ready-made block storage to use with KubeVirt.

Networking for Kubernetes VMs

Despite having a similar interface (CNI), the network architecture in Kubernetes is actually more complex and typically consists of many independent components that are not directly connected to each other. In fact, you can split Kubernetes networking into four layers, which are described below.

Node Network (Data Center Network)

The network through which nodes are interconnected with each other. This network is usually not managed by Kubernetes, but it is an important one because, without it, nothing would work. In practice, the bare metal infrastructure usually has more than one such network, e.g. one for node-to-node communication, a second for storage replication, and a third for external access.

A diagram showing the role of the node network (data center network) on the Kubernetes networking scheme

Configuring the physical network interaction between nodes goes beyond the scope of this article, as in most situations, Kubernetes utilizes already existing network infrastructure.

Pod Network

This is the network provided by your CNI plugin. The task of the CNI plugin is to ensure transparent connectivity between all containers and nodes in the cluster. Most CNI plugins implement a flat network from which separate blocks of IP addresses are allocated for use on each node.

A diagram showing the role of the pod network (CNI-plugin) on the Kubernetes network scheme

In practice, your cluster can have several CNI plugins managed by Multus. This approach is often used in virtualization solutions based on KubeVirt, such as Rancher and OpenShift. The primary CNI plugin is used for integration with Kubernetes services, while additional CNI plugins are used to implement private networks (VPC) and integration with the physical networks of your data center.

The default CNI plugins can be used to connect bridges or physical interfaces. Additionally, there are specialized plugins such as macvtap-cni, which are designed to provide better performance.

One additional aspect to keep in mind when running virtual machines in Kubernetes is the need for IPAM (IP Address Management), especially for secondary interfaces provided by Multus. This is commonly managed by a DHCP server operating within your infrastructure. Additionally, the allocation of MAC addresses for virtual machines can be managed by Kubemacpool.

In our platform, however, we decided to go another way and rely fully on Kube-OVN. This CNI plugin is based on OVN (Open Virtual Network), which was originally developed for OpenStack. It provides a complete network solution for virtual machines in Kubernetes, features Custom Resources for managing IPs and MAC addresses, supports live migration with IP address preservation between nodes, and enables the creation of VPCs for physical network separation between tenants.

In Kube-OVN you can assign separate subnets to an entire namespace or connect them as additional network interfaces using Multus.
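
As a rough sketch of the first approach, a Kube-OVN Subnet bound to a namespace could look roughly like this (the names and CIDR are made up, and field names should be checked against the Kube-OVN version you deploy):

apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: tenant-a
spec:
  protocol: IPv4
  cidrBlock: 10.66.0.0/16
  gateway: 10.66.0.1
  namespaces:
    - tenant-a           # pods in this namespace get addresses from the subnet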

Services Network

In addition to the CNI plugin, Kubernetes also has a services network, which is primarily needed for service discovery. Unlike traditional virtual machines, Kubernetes was originally designed to run pods with random addresses. And the services network provides a convenient abstraction (stable IP addresses and DNS names) that will always direct traffic to the correct pod. The same approach is also commonly used with virtual machines in clouds, despite the fact that their IPs are usually static.

A diagram showing the role of the services network (services network plugin) on the Kubernetes network scheme

The implementation of the services network in Kubernetes is handled by the services network plugin. The standard implementation is called kube-proxy and is used in most clusters. But nowadays, this functionality might be provided as part of the CNI plugin. The most advanced implementation is offered by the Cilium project, which can be run in kube-proxy replacement mode.

Cilium is based on the eBPF technology, which allows for efficient offloading of the Linux networking stack, thereby improving performance and security compared to traditional methods based on iptables.

In practice, Cilium and Kube-OVN can be easily integrated to provide a unified solution that offers seamless, multi-tenant networking for virtual machines, as well as advanced network policies and combined services network functionality.

External Traffic Load Balancer

At this stage, you already have everything needed to run virtual machines in Kubernetes. But there is actually one more thing. You still need to access your services from outside your cluster, and an external load balancer will help you with organizing this.

For bare metal Kubernetes clusters, there are several load balancers available: MetalLB, kube-vip, and LoxiLB; Cilium and Kube-OVN also provide built-in implementations.

The role of an external load balancer is to provide a stable, externally reachable address and direct external traffic to the services network. The services network plugin will then direct it to your pods and virtual machines as usual.

A diagram showing the role of the external load balancer on the Kubernetes network scheme

In most cases, setting up a load balancer on bare metal is achieved by creating floating IP address on the nodes within the cluster, and announce it externally using ARP/NDP or BGP protocols.

After exploring various options, we decided that MetalLB is the simplest and most reliable solution, although we do not strictly enforce the use of only it.
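
A minimal sketch of a MetalLB L2 setup (the pool name and address range are examples):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: external-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.100-192.0.2.120   # example floating IP range announced to the physical network
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: external-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - external-pool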

Another benefit is that in L2 mode, MetalLB speakers continuously check their neighbours' state by performing liveness checks using the memberlist protocol. This enables failover that works independently of the Kubernetes control plane.

Conclusion

This concludes our overview of virtualization, storage, and networking in Kubernetes. The technologies mentioned here are available and already pre-configured on the Cozystack platform, where you can try them with no limitations.

In the next article, I'll detail how, on top of this, you can implement the provisioning of fully functional Kubernetes clusters with just the click of a button.


DIY: Create Your Own Cloud with Kubernetes (Part 1)

At Ænix, we have a deep affection for Kubernetes and dream that all modern technologies will soon start utilizing its remarkable patterns.

Have you ever thought about building your own cloud? I bet you have. But is it possible to do this using only modern technologies and approaches, without leaving the cozy Kubernetes ecosystem? Our experience in developing Cozystack required us to delve deeply into it.

You might argue that Kubernetes is not intended for this purpose and why not simply use OpenStack for bare metal servers and run Kubernetes inside it as intended. But by doing so, you would simply shift the responsibility from your hands to the hands of OpenStack administrators. This would add at least one more huge and complex system to your ecosystem.

Why complicate things? After all, Kubernetes already has everything needed to run tenant Kubernetes clusters.

I want to share with you our experience in developing a cloud platform based on Kubernetes, highlighting the open-source projects that we use ourselves and believe deserve your attention.

In this series of articles, I will tell you our story about how we prepare managed Kubernetes from bare metal using only open-source technologies. Starting from the basic level of data center preparation, running virtual machines, isolating networks, setting up fault-tolerant storage to provisioning full-featured Kubernetes clusters with dynamic volume provisioning, load balancers, and autoscaling.

With this article, I start a series consisting of several parts:

I will try to describe various technologies as independently as possible, but at the same time, I will share our experience and why we came to one solution or another.

To begin with, let's understand the main advantage of Kubernetes and how it has changed the approach to using cloud resources.

It is important to understand that the use of Kubernetes in the cloud and on bare metal differs.

Kubernetes in the cloud

When you operate Kubernetes in the cloud, you don't worry about persistent volumes, cloud load balancers, or the process of provisioning nodes. All of this is handled by your cloud provider, who accepts your requests in the form of Kubernetes objects. In other words, the server side is completely hidden from you, and you don't really want to know how exactly the cloud provider implements it, as it's not in your area of responsibility.

A diagram showing cloud Kubernetes, with load balancing and storage done outside the cluster

Kubernetes offers convenient abstractions that work the same everywhere, allowing you to deploy your application on any Kubernetes in any cloud.

In the cloud, you very commonly have several separate entities: the Kubernetes control plane, virtual machines, persistent volumes, and load balancers as distinct entities. Using these entities, you can create highly dynamic environments.

Thanks to Kubernetes, virtual machines are now only seen as a utility entity for utilizing cloud resources. You no longer store data inside virtual machines. You can delete all your virtual machines at any moment and recreate them without breaking your application. The Kubernetes control plane will continue to hold information about what should run in your cluster. The load balancer will keep sending traffic to your workload, simply changing the endpoint to send traffic to a new node. And your data will be safely stored in external persistent volumes provided by cloud.

This approach is fundamental when using Kubernetes in clouds. The reason for it is quite obvious: the simpler the system, the more stable it is, and this simplicity is what you are buying when you use Kubernetes in the cloud.

Kubernetes on bare metal

Using Kubernetes in the clouds is really simple and convenient, which cannot be said about bare metal installations. In the bare metal world, Kubernetes, on the contrary, becomes unbearably complex. Firstly, because the entire network, backend storage, load balancers, etc. are usually run not outside, but inside your cluster. As a result, such a system is much more difficult to update and maintain.

A diagram showing bare metal Kubernetes, with load balancing and storage done inside the cluster

Judge for yourself: in the cloud, to update a node, you typically delete the virtual machine (or even use kubectl delete node) and you let your node management tooling create a new one, based on an immutable image. The new node will join the cluster and "just work" as a node; following a very simple and commonly used pattern in the Kubernetes world. Many clusters order new virtual machines every few minutes, simply because they can use cheaper spot instances. However, when you have a physical server, you can't just delete and recreate it, firstly because it often runs some cluster services, stores data, and its update process is significantly more complicated.

There are different approaches to solving this problem, ranging from in-place updates, as done by kubeadm, kubespray, and k3s, to full automation of provisioning physical nodes through Cluster API and Metal3.

I like the hybrid approach offered by Talos Linux, where your entire system is described in a single configuration file. Most parameters of this file can be applied without rebooting or recreating the node, including the version of Kubernetes control-plane components. However, it still keeps the maximum declarative nature of Kubernetes. This approach minimizes unnecessary impact on cluster services when updating bare metal nodes. In most cases, you won't need to migrate your virtual machines and rebuild the cluster filesystem on minor updates.

Preparing a base for your future cloud

So, suppose you've decided to build your own cloud. To start somewhere, you need a base layer. You need to think not only about how you will install Kubernetes on your servers but also about how you will update and maintain it. Consider the fact that you will have to think about things like updating the kernel, installing necessary modules, as well as packages and security patches. There is now much more to think about, things you don't have to worry about when using ready-made Kubernetes in the cloud.

Of course, you can use standard distributions like Ubuntu or Debian, or you can consider specialized ones like Flatcar Container Linux, Fedora CoreOS, and Talos Linux. Each has its advantages and disadvantages.

What about us? At Ænix, we use quite a few specific kernel modules like ZFS, DRBD, and OpenvSwitch, so we decided to go the route of forming a system image with all the necessary modules in advance. In this case, Talos Linux turned out to be the most convenient for us. For example, such a config is enough to build a system image with all the necessary kernel modules:

arch: amd64
platform: metal
secureboot: false
version: v1.6.4
input:
  kernel:
    path: /usr/install/amd64/vmlinuz
  initramfs:
    path: /usr/install/amd64/initramfs.xz
  baseInstaller:
    imageRef: ghcr.io/siderolabs/installer:v1.6.4
  systemExtensions:
    - imageRef: ghcr.io/siderolabs/amd-ucode:20240115
    - imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240115
    - imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240115
    - imageRef: ghcr.io/siderolabs/i915-ucode:20240115
    - imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240115
    - imageRef: ghcr.io/siderolabs/intel-ucode:20231114
    - imageRef: ghcr.io/siderolabs/qlogic-firmware:20240115
    - imageRef: ghcr.io/siderolabs/drbd:9.2.6-v1.6.4
    - imageRef: ghcr.io/siderolabs/zfs:2.1.14-v1.6.4
output:
  kind: installer
  outFormat: raw

Then we use the docker command line tool to build an OS image:

cat config.yaml | docker run --rm -i -v /dev:/dev --privileged "ghcr.io/siderolabs/imager:v1.6.4" -

And as a result, we get a Docker container image with everything we need, which we can use to install Talos Linux on our servers. You can do the same; this image will contain all the necessary firmware and kernel modules.

But the question arises, how do you deliver the freshly formed image to your nodes?

I have been contemplating the idea of PXE booting for quite some time. For example, the Kubefarm project that I wrote an article about two years ago was entirely built using this approach. But unfortunately, it does not help you deploy your very first parent cluster that will hold the others. So now I have prepared a solution that helps you do the same using the PXE approach.

Essentially, all you need to do is run temporary DHCP and PXE servers inside containers. Then your nodes will boot from your image, and you can use a simple Debian-flavored script to help you bootstrap your nodes.

(asciicast: a recorded demo of this bootstrap process)

The source for that talos-bootstrap script is available on GitHub.

This script allows you to deploy Kubernetes on bare metal in five minutes and obtain a kubeconfig for accessing it. However, many unresolved issues still lie ahead.

Delivering system components

At this stage, you already have a Kubernetes cluster capable of running various workloads. However, it is not fully functional yet. In other words, you need to set up networking and storage, and install the necessary cluster extensions, such as KubeVirt to run virtual machines, as well as the monitoring stack and other system-wide components.

Traditionally, this is solved by installing Helm charts into your cluster. You can do this by running helm install commands locally, but this approach becomes inconvenient when you want to track updates, or when you have multiple clusters and want to keep them uniform. In fact, there are plenty of ways to do this declaratively. To solve this, I recommend following GitOps best practices, using tools like ArgoCD and FluxCD.

While ArgoCD is more convenient for dev purposes with its graphical interface and a central control plane, FluxCD, on the other hand, is better suited for creating Kubernetes distributions. With FluxCD, you can specify which charts with what parameters should be launched and describe dependencies. Then, FluxCD will take care of everything for you.
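As an illustration, here is a minimal sketch of how charts and their ordering can be declared with FluxCD's HelmRepository and HelmRelease objects; the repository URL, chart names, and versions are placeholders:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: platform-charts
  namespace: flux-system
spec:
  interval: 1h
  url: https://charts.example.com
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cni
  namespace: flux-system
spec:
  interval: 10m
  chart:
    spec:
      chart: cni
      version: "1.x"
      sourceRef:
        kind: HelmRepository
        name: platform-charts
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: monitoring
  namespace: flux-system
spec:
  interval: 10m
  # FluxCD reconciles this release only after the CNI release is ready
  dependsOn:
    - name: cni
  chart:
    spec:
      chart: monitoring
      version: "2.x"
      sourceRef:
        kind: HelmRepository
        name: platform-charts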

The idea is to perform a one-time installation of FluxCD in your newly created cluster and provide it with the configuration. FluxCD will then automatically deploy all the essentials, bringing the cluster to the expected state and keeping it there. For example, after installing our platform you'll see the following pre-configured Helm charts with system components:

NAMESPACE NAME AGE READY STATUS
cozy-cert-manager cert-manager 4m1s True Release reconciliation succeeded
cozy-cert-manager cert-manager-issuers 4m1s True Release reconciliation succeeded
cozy-cilium cilium 4m1s True Release reconciliation succeeded
cozy-cluster-api capi-operator 4m1s True Release reconciliation succeeded
cozy-cluster-api capi-providers 4m1s True Release reconciliation succeeded
cozy-dashboard dashboard 4m1s True Release reconciliation succeeded
cozy-fluxcd cozy-fluxcd 4m1s True Release reconciliation succeeded
cozy-grafana-operator grafana-operator 4m1s True Release reconciliation succeeded
cozy-kamaji kamaji 4m1s True Release reconciliation succeeded
cozy-kubeovn kubeovn 4m1s True Release reconciliation succeeded
cozy-kubevirt-cdi kubevirt-cdi 4m1s True Release reconciliation succeeded
cozy-kubevirt-cdi kubevirt-cdi-operator 4m1s True Release reconciliation succeeded
cozy-kubevirt kubevirt 4m1s True Release reconciliation succeeded
cozy-kubevirt kubevirt-operator 4m1s True Release reconciliation succeeded
cozy-linstor linstor 4m1s True Release reconciliation succeeded
cozy-linstor piraeus-operator 4m1s True Release reconciliation succeeded
cozy-mariadb-operator mariadb-operator 4m1s True Release reconciliation succeeded
cozy-metallb metallb 4m1s True Release reconciliation succeeded
cozy-monitoring monitoring 4m1s True Release reconciliation succeeded
cozy-postgres-operator postgres-operator 4m1s True Release reconciliation succeeded
cozy-rabbitmq-operator rabbitmq-operator 4m1s True Release reconciliation succeeded
cozy-redis-operator redis-operator 4m1s True Release reconciliation succeeded
cozy-telepresence telepresence 4m1s True Release reconciliation succeeded
cozy-victoria-metrics-operator victoria-metrics-operator 4m1s True Release reconciliation succeeded

Conclusion

As a result, you achieve a highly repeatable environment that you can provide to anyone, knowing that it operates exactly as intended. This is actually what the Cozystack project does, which you can try out for yourself absolutely free.

In the following articles, I will discuss how to prepare Kubernetes for running virtual machines and how to run Kubernetes clusters with the click of a button. Stay tuned, it'll be fun!

05 Apr 2024 7:30am GMT

03 Apr 2024

feedKubernetes Blog

Introducing the Windows Operational Readiness Specification

Since Windows support graduated to stable with Kubernetes 1.14 in 2019, the capability to run Windows workloads has been much appreciated by the end user community. The level and availability of Windows workload support has consistently been a major differentiator for Kubernetes distributions used by large enterprises. However, with more Windows workloads being migrated to Kubernetes and new Windows features being continuously released, it became challenging to test Windows worker nodes in an effective and standardized way.

The Kubernetes project values the ability to certify conformance without requiring a closed-source license for a certified distribution or service that has no intention of offering Windows.

Some notable examples brought to the attention of SIG Windows were:

SIG Windows therefore recognized the need for a tailored solution to ensure Windows nodes' operational readiness before their deployment into production environments. Thus, the idea to develop a Windows Operational Readiness Specification was born.

Can't we just run the official Conformance tests?

The Kubernetes project contains a set of conformance tests, which are standardized tests designed to ensure that a Kubernetes cluster meets the required Kubernetes specifications.

However, these tests were originally defined at a time when Linux was the only operating system compatible with Kubernetes, and thus, they were not easily extendable for use with Windows. Given that Windows workloads, despite their importance, account for a smaller portion of the Kubernetes community, it was important to ensure that the primary conformance suite, relied upon by many Kubernetes distributions to certify Linux conformance, didn't become encumbered with Windows-specific features or enhancements such as GMSA or multi-operating system kube-proxy behavior.

Therefore, since there was a specialized need for Windows conformance testing, SIG Windows went down the path of offering Windows specific conformance tests through the Windows Operational Readiness Specification.

Can't we just run the Kubernetes end-to-end test suite?

In the Linux world, tools such as Sonobuoy simplify execution of the conformance suite, relieving users from needing to be aware of Kubernetes' compilation paths or the semantics of Ginkgo tags.

Regarding needing to compile the Kubernetes tests, we realized that Windows users might find the process of compiling and running the Kubernetes e2e suite from scratch similarly undesirable; hence, there was a clear need to provide a user-friendly, "push-button" solution that is ready to go. Moreover, regarding Ginkgo tags, applying conformance tests to Windows nodes through a set of Ginkgo tags would also be burdensome for any user, Linux enthusiasts and experienced Windows system admins alike.

To bridge the gap and give users a straightforward way to confirm that their clusters support a variety of features, SIG Windows therefore found it necessary to create the Windows Operational Readiness application. This application, written in Go, simplifies the process of running the necessary Windows-specific tests while delivering results in a clear, accessible format.

This initiative has been a collaborative effort, with contributions from different cloud providers and platforms, including Amazon, Microsoft, SUSE, and Broadcom.

A closer look at the Windows Operational Readiness Specification

The Windows Operational Readiness specification specifically targets and executes tests found within the Kubernetes repository in a more user-friendly way than simply targeting Ginkgo tags. It introduces a structured test suite that is split into sets of core and extended tests, with each set of tests containing categories directed at a specific area of testing, such as networking. Core tests target fundamental and critical functionalities that Windows nodes should support as defined by the Kubernetes specification. On the other hand, extended tests cover more complex features, more aligned with diving deeper into Windows-specific capabilities such as integrations with Active Directory. The goal of these tests is to be extensive, covering a wide array of Windows-specific capabilities to ensure compatibility with a diverse set of workloads and configurations, extending beyond basic requirements. Below is the current list of categories.

Category Name Category Description
Core.Network Tests minimal networking functionality (ability to access pod-by-pod IP).
Core.Storage Tests minimal storage functionality (ability to mount a hostPath storage volume).
Core.Scheduling Tests minimal scheduling functionality (ability to schedule a pod with CPU limits).
Core.Concurrent Tests minimal concurrent functionality (ability of a node to handle traffic to multiple pods concurrently).
Extend.HostProcess Tests features related to Windows HostProcess pod functionality.
Extend.ActiveDirectory Tests features related to Active Directory functionality.
Extend.NetworkPolicy Tests features related to Network Policy functionality.
Extend.Network Tests advanced networking functionality (ability to support IPv6).
Extend.Worker Tests features related to Windows worker node functionality (ability for nodes to access TCP and UDP services in the same cluster).

How to conduct operational readiness tests for Windows nodes

To run the Windows Operational Readiness test suite, refer to the test suite's README, which explains how to set it up and run it. The test suite offers flexibility in how you can execute tests, either using a compiled binary or a Sonobuoy plugin. You also have the choice to run the tests against the entire test suite or by specifying a list of categories. Cloud providers have the choice of uploading their conformance results, enhancing transparency and reliability.

Once you have checked out that code, you can run a test. For example, this sample command runs the tests from the Core.Concurrent category:

./op-readiness --kubeconfig $KUBE_CONFIG --category Core.Concurrent

As a contributor to Kubernetes, if you want to test your changes against a specific pull request using the Windows Operational Readiness Specification, use the following bot command in the new pull request.

/test operational-tests-capz-windows-2019

Looking ahead

We're looking to improve our curated list of Windows-specific tests by adding new tests to the Kubernetes repository and also identifying existing test cases that can be targeted. The long term goal for the specification is to continually enhance test coverage for Windows worker nodes and improve the robustness of Windows support, facilitating a seamless experience across diverse cloud environments. We also have plans to integrate the Windows Operational Readiness tests into the official Kubernetes conformance suite.

If you are interested in helping us out, please reach out to us! We welcome help in any form, from giving once-off feedback to making a code contribution, to having long-term owners to help us drive changes. The Windows Operational Readiness specification is owned by the SIG Windows team. You can reach out to the team on the Kubernetes Slack workspace #sig-windows channel. You can also explore the Windows Operational Readiness test suite and make contributions directly to the GitHub repository.

Special thanks to Kulwant Singh (AWS), Pramita Gautam Rana (VMWare), Xinqi Li (Google) and Marcio Morales (AWS) for their help in making notable contributions to the specification. Additionally, appreciation goes to James Sturtevant (Microsoft), Mark Rossetti (Microsoft), Claudiu Belu (Cloudbase Solutions) and Aravindh Puthiyaparambil (Softdrive Technologies Group Inc.) from the SIG Windows team for their guidance and support.

03 Apr 2024 12:00am GMT

12 Mar 2024

feedKubernetes Blog

A Peek at Kubernetes v1.30

A quick look: exciting changes in Kubernetes v1.30

It's a new year and a new Kubernetes release. We're halfway through the release cycle and have quite a few interesting and exciting enhancements coming in v1.30. From brand new features in alpha, to established features graduating to stable, to long-awaited improvements, this release has something for everyone to pay attention to!

To tide you over until the official release, here's a sneak peek of the enhancements we're most excited about in this cycle!

Major changes for Kubernetes v1.30

Structured parameters for dynamic resource allocation (KEP-4381)

Dynamic resource allocation was added to Kubernetes as an alpha feature in v1.26. It defines an alternative to the traditional device-plugin API for requesting access to third-party resources. By design, dynamic resource allocation uses parameters for resources that are completely opaque to core Kubernetes. This approach poses a problem for the Cluster Autoscaler (CA) or any higher-level controller that needs to make decisions for a group of pods (e.g. a job scheduler). It cannot simulate the effect of allocating or deallocating claims over time. Only the third-party DRA drivers have the information available to do this.

​​Structured Parameters for dynamic resource allocation is an extension to the original implementation that addresses this problem by building a framework to support making these claim parameters less opaque. Instead of handling the semantics of all claim parameters themselves, drivers could manage resources and describe them using a specific "structured model" pre-defined by Kubernetes. This would allow components aware of this "structured model" to make decisions about these resources without outsourcing them to some third-party controller. For example, the scheduler could allocate claims rapidly without back-and-forth communication with dynamic resource allocation drivers. Work done for this release centers on defining the framework necessary to enable different "structured models" and to implement the "named resources" model. This model allows listing individual resource instances and, compared to the traditional device plugin API, adds the ability to select those instances individually via attributes.

Node memory swap support (KEP-2400)

In Kubernetes v1.30, memory swap support on Linux nodes gets a big change to how it works - with a strong emphasis on improving system stability. In previous Kubernetes versions, the NodeSwap feature gate was disabled by default, and when enabled, it used UnlimitedSwap behavior as the default behavior. To achieve better stability, UnlimitedSwap behavior (which might compromise node stability) will be removed in v1.30.

The updated, still-beta support for swap on Linux nodes will be available by default. However, the default behavior will be to run the node set to NoSwap (not UnlimitedSwap) mode. In NoSwap mode, the kubelet supports running on a node where swap space is active, but Pods don't use any of the page file. You'll still need to set --fail-swap-on=false for the kubelet to run on that node. However, the big change is the other mode: LimitedSwap. In this mode, the kubelet actually uses the page file on that node and allows Pods to have some of their virtual memory paged out. Containers (and their parent pods) do not have access to swap beyond their memory limit, but the system can still use the swap space if available.
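As a rough sketch, opting a node into LimitedSwap could look like the following kubelet configuration (assuming the beta NodeSwap feature gate is enabled, which it is by default in v1.30):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# allow the kubelet to start on a node that has swap enabled
failSwapOn: false
memorySwap:
  # NoSwap is the default; LimitedSwap lets Pods use swap within their limits
  swapBehavior: LimitedSwap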

Kubernetes' Node special interest group (SIG Node) will also update the documentation to help you understand how to use the revised implementation, based on feedback from end users, contributors, and the wider Kubernetes community.

Read the previous blog post or the node swap documentation for more details on Linux node swap support in Kubernetes.

Support user namespaces in pods (KEP-127)

User namespaces is a Linux-only feature that better isolates pods to prevent or mitigate several CVEs rated high/critical, including CVE-2024-21626, published in January 2024. In Kubernetes 1.30, support for user namespaces is migrating to beta and now supports pods with and without volumes, custom UID/GID ranges, and more!
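As a sketch, opting a Pod into a user namespace is done with the hostUsers field; this assumes the beta UserNamespacesSupport feature gate is enabled on your cluster, and the image is just a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  # run the Pod in its own user namespace instead of the host's
  hostUsers: false
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9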

Structured authorization configuration (KEP-3221)

Support for structured authorization configuration is moving to beta and will be enabled by default. This feature enables the creation of authorization chains with multiple webhooks with well-defined parameters that validate requests in a particular order and allows fine-grained control - such as explicit Deny on failures. The configuration file approach even allows you to specify CEL rules to pre-filter requests before they are dispatched to webhooks, helping you to prevent unnecessary invocations. The API server also automatically reloads the authorizer chain when the configuration file is modified.

You must specify the path to that authorization configuration using the --authorization-config command line argument. If you want to keep using command line flags instead of a configuration file, those will continue to work as-is. To gain access to new authorization webhook capabilities like multiple webhooks, failure policy, and pre-filter rules, switch to putting options in an --authorization-config file. From Kubernetes 1.30, the configuration file format is beta-level, and only requires specifying --authorization-config since the feature gate is enabled by default. An example configuration with all possible values is provided in the Authorization docs. For more details, read the Authorization docs.
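As a hedged sketch, an authorization configuration file could look roughly like this; the webhook name, kubeconfig path, and CEL expression are examples only:

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthorizationConfiguration
authorizers:
  - type: Webhook
    name: example-policy-webhook
    webhook:
      authorizedTTL: 30s
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      matchConditionSubjectAccessReviewVersion: v1
      # explicitly deny the request if this webhook cannot be reached
      failurePolicy: Deny
      connectionInfo:
        type: KubeConfigFile
        kubeConfigFile: /etc/kubernetes/authz-webhook.kubeconfig
      # CEL pre-filter: only send resource requests to this webhook
      matchConditions:
        - expression: has(request.resourceAttributes)
  - type: RBAC
    name: rbac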

Container resource based pod autoscaling (KEP-1610)

Horizontal pod autoscaling based on ContainerResource metrics will graduate to stable in v1.30. This new behavior for HorizontalPodAutoscaler allows you to configure automatic scaling based on the resource usage for individual containers, rather than the aggregate resource use over a Pod. See our previous article for further details, or read container resource metrics.
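For example, a HorizontalPodAutoscaler that scales on the CPU usage of one container (ignoring sidecars) might look like this sketch; the Deployment name, container name, and thresholds are placeholders:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: ContainerResource
      containerResource:
        name: cpu
        # only this container's usage is considered, not the whole Pod's
        container: application
        target:
          type: Utilization
          averageUtilization: 60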

CEL for admission control (KEP-3488)

Integrating Common Expression Language (CEL) for admission control in Kubernetes introduces a more dynamic and expressive way of evaluating admission requests. This feature allows complex, fine-grained policies to be defined and enforced directly through the Kubernetes API, enhancing security and governance capabilities without compromising performance or flexibility.

CEL's addition to Kubernetes admission control empowers cluster administrators to craft intricate rules that can evaluate the content of API requests against the desired state and policies of the cluster without resorting to Webhook-based access controllers. This level of control is crucial for maintaining the integrity, security, and efficiency of cluster operations, making Kubernetes environments more robust and adaptable to various use cases and requirements. For more information on using CEL for admission control, see the API documentation for ValidatingAdmissionPolicy.
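As a brief sketch, a CEL-based policy and its binding could look like the following; the names, resource rule, and replica limit are illustrative, and on clusters where this API has not yet reached v1 the v1beta1 version applies:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: replica-limit
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    # reject Deployments that ask for more than 10 replicas
    - expression: "object.spec.replicas <= 10"
      message: "Deployments may not request more than 10 replicas."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: replica-limit-binding
spec:
  policyName: replica-limit
  validationActions: ["Deny"]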

We hope you're as excited for this release as we are. Keep an eye out for the official release blog in a few weeks for more highlights!

12 Mar 2024 12:00am GMT

07 Mar 2024

feedKubernetes Blog

CRI-O: Applying seccomp profiles from OCI registries

Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.

But distributing those seccomp profiles is a major challenge in Kubernetes, because the JSON files have to be available on all nodes where a workload can possibly run. Projects like the Security Profiles Operator solve that problem by running as a daemon within the cluster, which makes me wonder which part of that distribution could be done by the container runtime.

Runtimes usually apply the profiles from a local path, for example:

apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
    - name: container
      image: nginx:1.25.3
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: nginx-1.25.3.json

The profile nginx-1.25.3.json has to be available in the root directory of the kubelet, appended by the seccomp directory. This means the default location for the profile on-disk would be /var/lib/kubelet/seccomp/nginx-1.25.3.json. If the profile is not available, then runtimes will fail on container creation like this:

kubectl get pods
NAME READY STATUS RESTARTS AGE
pod 0/1 CreateContainerError 0 38s
kubectl describe pod/pod | tail
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
 Type Reason Age From Message
 ---- ------ ---- ---- -------
 Normal Scheduled 117s default-scheduler Successfully assigned default/pod to 127.0.0.1
 Normal Pulling 117s kubelet Pulling image "nginx:1.25.3"
 Normal Pulled 111s kubelet Successfully pulled image "nginx:1.25.3" in 5.948s (5.948s including waiting)
 Warning Failed 7s (x10 over 111s) kubelet Error: setup seccomp: unable to load local profile "/var/lib/kubelet/seccomp/nginx-1.25.3.json": open /var/lib/kubelet/seccomp/nginx-1.25.3.json: no such file or directory
 Normal Pulled 7s (x9 over 111s) kubelet Container image "nginx:1.25.3" already present on machine

The major obstacle of having to manually distribute the Localhost profiles will lead many end-users to fall back to RuntimeDefault or even running their workloads as Unconfined (with disabled seccomp).

CRI-O to the rescue

The Kubernetes container runtime CRI-O provides various features using custom annotations. The v1.30 release adds support for a new set of annotations called seccomp-profile.kubernetes.cri-o.io/POD and seccomp-profile.kubernetes.cri-o.io/<CONTAINER>. Those annotations allow you to specify:

CRI-O will only respect the annotation if the runtime is configured to allow it, as well as for workloads running as Unconfined. All other workloads will still use the value from the securityContext with a higher priority.

The annotations alone will not help much with the distribution of the profiles, but the way they can be referenced will! For example, you can now specify seccomp profiles like regular container images by using OCI artifacts:

apiVersion: v1
kind: Pod
metadata:
  name: pod
  annotations:
    seccomp-profile.kubernetes.cri-o.io/POD: quay.io/crio/seccomp:v2
spec: 

The image quay.io/crio/seccomp:v2 contains a seccomp.json file, which contains the actual profile content. Tools like ORAS or Skopeo can be used to inspect the contents of the image:

oras pull quay.io/crio/seccomp:v2
Downloading 92d8ebfa89aa seccomp.json
Downloaded 92d8ebfa89aa seccomp.json
Pulled [registry] quay.io/crio/seccomp:v2
Digest: sha256:f0205dac8a24394d9ddf4e48c7ac201ca7dcfea4c554f7ca27777a7f8c43ec1b
jq . seccomp.json | head
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "defaultErrno": "ENOSYS",
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": [
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
# Inspect the plain manifest of the image
skopeo inspect --raw docker://quay.io/crio/seccomp:v2 | jq .
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.cncf.seccomp-profile.config.v1+json",
    "digest": "sha256:ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356",
    "size": 3
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar",
      "digest": "sha256:92d8ebfa89aa6dd752c6443c27e412df1b568d62b4af129494d7364802b2d476",
      "size": 18853,
      "annotations": {
        "org.opencontainers.image.title": "seccomp.json"
      }
    }
  ],
  "annotations": {
    "org.opencontainers.image.created": "2024-02-26T09:03:30Z"
  }
}

The image manifest contains a reference to a specific required config media type (application/vnd.cncf.seccomp-profile.config.v1+json) and a single layer (application/vnd.oci.image.layer.v1.tar) pointing to the seccomp.json file. But now, let's give that new feature a try!

Using the annotation for a specific container or whole pod

CRI-O needs to be configured adequately before it can utilize the annotation. To do this, add the annotation to the allowed_annotations array for the runtime. This can be done by using a drop-in configuration /etc/crio/crio.conf.d/10-crun.conf like this:

[crio.runtime]
default_runtime = "crun"

[crio.runtime.runtimes.crun]
allowed_annotations = [
 "seccomp-profile.kubernetes.cri-o.io",
]

Now, let's run CRI-O from the latest main commit. This can be done by either building it from source, using the static binary bundles or the prerelease packages.

To demonstrate this, I ran the crio binary from my command line using a single node Kubernetes cluster via local-up-cluster.sh. Now that the cluster is up and running, let's try a pod without the annotation running as seccomp Unconfined:

cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
    - name: container
      image: nginx:1.25.3
      securityContext:
        seccompProfile:
          type: Unconfined
kubectl apply -f pod.yaml

The workload is up and running:

kubectl get pods
NAME READY STATUS RESTARTS AGE
pod 1/1 Running 0 15s

And no seccomp profile got applied if I inspect the container using crictl:

export CONTAINER_ID=$(sudo crictl ps --name container -q)
sudo crictl inspect $CONTAINER_ID | jq .info.runtimeSpec.linux.seccomp
null

Now, let's modify the pod to apply the profile quay.io/crio/seccomp:v2 to the container:

apiVersion: v1
kind: Pod
metadata:
  name: pod
  annotations:
    seccomp-profile.kubernetes.cri-o.io/container: quay.io/crio/seccomp:v2
spec:
  containers:
    - name: container
      image: nginx:1.25.3

I have to delete and recreate the Pod, because only recreation will apply a new seccomp profile:

kubectl delete pod/pod
pod "pod" deleted
kubectl apply -f pod.yaml
pod/pod created

The CRI-O logs will now indicate that the runtime pulled the artifact:

WARN[…] Allowed annotations are specified for workload [seccomp-profile.kubernetes.cri-o.io]
INFO[…] Found container specific seccomp profile annotation: seccomp-profile.kubernetes.cri-o.io/container=quay.io/crio/seccomp:v2 id=26ddcbe6-6efe-414a-88fd-b1ca91979e93 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Pulling OCI artifact from ref: quay.io/crio/seccomp:v2 id=26ddcbe6-6efe-414a-88fd-b1ca91979e93 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Retrieved OCI artifact seccomp profile of len: 18853 id=26ddcbe6-6efe-414a-88fd-b1ca91979e93 name=/runtime.v1.RuntimeService/CreateContainer

And the container is finally using the profile:

export CONTAINER_ID=$(sudo crictl ps --name container -q)
sudo crictl inspect $CONTAINER_ID | jq .info.runtimeSpec.linux.seccomp | head
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {

The same would work for every container in the pod, if users replace the /container suffix with the reserved name /POD, for example:

apiVersion: v1
kind: Pod
metadata:
  name: pod
  annotations:
    seccomp-profile.kubernetes.cri-o.io/POD: quay.io/crio/seccomp:v2
spec:
  containers:
    - name: container
      image: nginx:1.25.3

Using the annotation for a container image

While specifying seccomp profiles as OCI artifacts on certain workloads is a cool feature, the majority of end users would like to link seccomp profiles to published container images. This can be done by using a container image annotation; instead of being applied to a Kubernetes Pod, the annotation is some metadata applied at the container image itself. For example, Podman can be used to add the image annotation directly during image build:

podman build \
 --annotation seccomp-profile.kubernetes.cri-o.io=quay.io/crio/seccomp:v2 \
 -t quay.io/crio/nginx-seccomp:v2 .

The pushed image then contains the annotation:

skopeo inspect --raw docker://quay.io/crio/nginx-seccomp:v2 |
 jq '.annotations."seccomp-profile.kubernetes.cri-o.io"'
"quay.io/crio/seccomp:v2"

If I now use that image in a CRI-O test pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: pod
  # no Pod annotations set
spec:
  containers:
    - name: container
      image: quay.io/crio/nginx-seccomp:v2

Then the CRI-O logs will indicate that the image annotation got evaluated and the profile got applied:

kubectl delete pod/pod
pod "pod" deleted
kubectl apply -f pod.yaml
pod/pod created
INFO[…] Found image specific seccomp profile annotation: seccomp-profile.kubernetes.cri-o.io=quay.io/crio/seccomp:v2 id=c1f22c59-e30e-4046-931d-a0c0fdc2c8b7 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Pulling OCI artifact from ref: quay.io/crio/seccomp:v2 id=c1f22c59-e30e-4046-931d-a0c0fdc2c8b7 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Retrieved OCI artifact seccomp profile of len: 18853 id=c1f22c59-e30e-4046-931d-a0c0fdc2c8b7 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Created container 116a316cd9a11fe861dd04c43b94f45046d1ff37e2ed05a4e4194fcaab29ee63: default/pod/container id=c1f22c59-e30e-4046-931d-a0c0fdc2c8b7 name=/runtime.v1.RuntimeService/CreateContainer
export CONTAINER_ID=$(sudo crictl ps --name container -q)
sudo crictl inspect $CONTAINER_ID | jq .info.runtimeSpec.linux.seccomp | head
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {

For container images, the annotation seccomp-profile.kubernetes.cri-o.io will be treated in the same way as seccomp-profile.kubernetes.cri-o.io/POD and applies to the whole pod. In addition to that, the whole feature also works when using the container specific annotation on an image, for example if a container is named container1:

skopeo inspect --raw docker://quay.io/crio/nginx-seccomp:v2-container |
 jq '.annotations."seccomp-profile.kubernetes.cri-o.io/container1"'
"quay.io/crio/seccomp:v2"

The cool thing about this whole feature is that users can now create seccomp profiles for specific container images and store them side by side in the same registry. Linking the images to the profiles provides a great flexibility to maintain them over the whole application's life cycle.

Pushing profiles using ORAS

The actual creation of the OCI object that contains a seccomp profile requires a bit more work when using ORAS. I have the hope that tools like Podman will simplify the overall process in the future. Right now, the container registry needs to be OCI compatible, which is also the case for Quay.io. CRI-O expects the seccomp profile object to have a container image media type (application/vnd.cncf.seccomp-profile.config.v1+json), while ORAS uses application/vnd.oci.empty.v1+json by default. To achieve all of that, the following commands can be executed:

echo "{}" > config.json
oras push \
 --config config.json:application/vnd.cncf.seccomp-profile.config.v1+json \
 quay.io/crio/seccomp:v2 seccomp.json

The resulting image contains the mediaType that CRI-O expects. ORAS pushes a single layer seccomp.json to the registry. The name of the profile does not matter much. CRI-O will pick the first layer and check if that can act as a seccomp profile.

Future work

CRI-O internally manages the OCI artifacts like regular files. This provides the benefit of moving them around, removing them when they're no longer used, and making data other than seccomp profiles available in the same way. This enables future enhancements in CRI-O on top of OCI artifacts, but also allows thinking about stacking seccomp profiles as part of having multiple layers in an OCI artifact. The limitation that it only works for Unconfined workloads in the v1.30.x releases is something CRI-O would like to address in the future. Simplifying the overall user experience without compromising security seems to be the key to a successful future for seccomp in container workloads.

The CRI-O maintainers will be happy to listen to any feedback or suggestions on the new feature! Thank you for reading this blog post, feel free to reach out to the maintainers via the Kubernetes Slack channel #crio or create an issue in the GitHub repository.

07 Mar 2024 12:00am GMT

01 Mar 2024

feedKubernetes Blog

Spotlight on SIG Cloud Provider

One of the most popular ways developers use Kubernetes-related services is via cloud providers, but have you ever wondered how cloud providers can do that? How does this whole process of integration of Kubernetes to various cloud providers happen? To answer that, let's put the spotlight on SIG Cloud Provider.

SIG Cloud Provider works to create seamless integrations between Kubernetes and various cloud providers. Their mission? Keeping the Kubernetes ecosystem fair and open for all. By setting clear standards and requirements, they ensure every cloud provider plays nicely with Kubernetes. It is their responsibility to configure cluster components to enable cloud provider integrations.

In this blog of the SIG Spotlight series, Arujjwal Negi interviews Michael McCune (Red Hat), also known as elmiko, co-chair of SIG Cloud Provider, to give us an insight into the workings of this group.

Introduction

Arujjwal: Let's start by getting to know you. Can you give us a small intro about yourself and how you got into Kubernetes?

Michael: Hi, I'm Michael McCune, most people around the community call me by my handle, elmiko. I've been a software developer for a long time now (Windows 3.1 was popular when I started!), and I've been involved with open-source software for most of my career. I first got involved with Kubernetes as a developer of machine learning and data science applications; the team I was on at the time was creating tutorials and examples to demonstrate the use of technologies like Apache Spark on Kubernetes. That said, I've been interested in distributed systems for many years and when an opportunity arose to join a team working directly on Kubernetes, I jumped at it!

Functioning and working

Arujjwal: Can you give us an insight into what SIG Cloud Provider does and how it functions?

Michael: SIG Cloud Provider was formed to help ensure that Kubernetes provides a neutral integration point for all infrastructure providers. Our largest task to date has been the extraction and migration of in-tree cloud controllers to out-of-tree components. The SIG meets regularly to discuss progress and upcoming tasks and also to answer questions and bugs that arise. Additionally, we act as a coordination point for cloud provider subprojects such as the cloud provider framework, specific cloud controller implementations, and the Konnectivity proxy project.

Arujjwal: After going through the project README, I learned that SIG Cloud Provider works with the integration of Kubernetes with cloud providers. How does this whole process go?

Michael: One of the most common ways to run Kubernetes is by deploying it to a cloud environment (AWS, Azure, GCP, etc). Frequently, the cloud infrastructures have features that enhance the performance of Kubernetes, for example, by providing elastic load balancing for Service objects. To ensure that cloud-specific services can be consistently consumed by Kubernetes, the Kubernetes community has created cloud controllers to address these integration points. Cloud providers can create their own controllers either by using the framework maintained by the SIG or by following the API guides defined in the Kubernetes code and documentation. One thing I would like to point out is that SIG Cloud Provider does not deal with the lifecycle of nodes in a Kubernetes cluster; for those types of topics, SIG Cluster Lifecycle and the Cluster API project are more appropriate venues.
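To make that concrete, the familiar Service of type LoadBalancer is one such integration point; a cloud controller manager typically fulfils a manifest like this sketch by provisioning a load balancer in the underlying infrastructure (the name, selector, and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  # the cloud controller manager provisions an external load balancer for this
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080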

Important subprojects

Arujjwal: There are a lot of subprojects within this SIG. Can you highlight some of the most important ones and what job they do?

Michael: I think the two most important subprojects today are the cloud provider framework and the extraction/migration project. The cloud provider framework is a common library to help infrastructure integrators build a cloud controller for their infrastructure. This project is most frequently the starting point for new people coming to the SIG. The extraction and migration project is the other big subproject and a large part of why the framework exists. A little history might help explain further: for a long time, Kubernetes needed some integration with the underlying infrastructure, not necessarily to add features but to be aware of cloud events like instance termination. The cloud provider integrations were built into the Kubernetes code tree, and thus the term "in-tree" was created (check out this article on the topic for more info). The activity of maintaining provider-specific code in the main Kubernetes source tree was considered undesirable by the community. The community's decision inspired the creation of the extraction and migration project to remove the "in-tree" cloud controllers in favor of "out-of-tree" components.

Arujjwal: What makes [the cloud provider framework] a good place to start? Does it have consistent good beginner work? What kind?

Michael: I feel that the cloud provider framework is a good place to start as it encodes the community's preferred practices for cloud controller managers and, as such, will give a newcomer a strong understanding of how and what the managers do. Unfortunately, there is not a consistent stream of beginner work on this component; this is due in part to the mature nature of the framework and that of the individual providers as well. For folks who are interested in getting more involved, having some Go language knowledge is good and also having an understanding of how at least one cloud API (e.g., AWS, Azure, GCP) works is also beneficial. In my personal opinion, being a newcomer to SIG Cloud Provider can be challenging as most of the code around this project deals directly with specific cloud provider interactions. My best advice to people wanting to do more work on cloud providers is to grow your familiarity with one or two cloud APIs, then look for open issues on the controller managers for those clouds, and always communicate with the other contributors as much as possible.

Accomplishments

Arujjwal: Can you share about an accomplishment(s) of the SIG that you are proud of?

Michael: Since I joined the SIG, more than a year ago, we have made great progress in advancing the extraction and migration subproject. We have moved from an alpha status on the defining KEP to a beta status and are inching ever closer to removing the old provider code from the Kubernetes source tree. I've been really proud to see the active engagement from our community members and to see the progress we have made towards extraction. I have a feeling that, within the next few releases, we will see the final removal of the in-tree cloud controllers and the completion of the subproject.

Advice for new contributors

Arujjwal: Is there any suggestion or advice for new contributors on how they can start at SIG Cloud Provider?

Michael: This is a tricky question in my opinion. SIG Cloud Provider is focused on the code pieces that integrate between Kubernetes and an underlying infrastructure. It is very common, but not necessary, for members of the SIG to be representing a cloud provider in an official capacity. I recommend that anyone interested in this part of Kubernetes should come to an SIG meeting to see how we operate and also to study the cloud provider framework project. We have some interesting ideas for future work, such as a common testing framework, that will cut across all cloud providers and will be a great opportunity for anyone looking to expand their Kubernetes involvement.

Arujjwal: Are there any specific skills you're looking for that we should highlight? To give you an example from our own SIG ContribEx (https://github.com/kubernetes/community/blob/master/sig-contributor-experience/README.md): if you're an expert in Hugo, we can always use some help with k8s.dev!

Michael: The SIG is currently working through the final phases of our extraction and migration process, but we are looking toward the future and starting to plan what will come next. One of the big topics that the SIG has discussed is testing. Currently, we do not have a generic common set of tests that can be exercised by each cloud provider to confirm the behaviour of their controller manager. If you are an expert in Ginkgo and the Kubetest framework, we could probably use your help in designing and implementing the new tests.


This is where the conversation ends. I hope this gave you some insights into SIG Cloud Provider's aims and inner workings. This is just the tip of the iceberg. To know more and get involved with SIG Cloud Provider, try attending their meetings here.

01 Mar 2024 12:00am GMT

22 Feb 2024

feedKubernetes Blog

A look into the Kubernetes Book Club

Learning Kubernetes and the entire ecosystem of technologies around it is not without its challenges. In this interview, we will talk with Carlos Santana (AWS) to learn a bit more about how he created the Kubernetes Book Club, how it works, and how anyone can join in to take advantage of a community-based learning experience.

Carlos Santana speaking at KubeCon NA 2023

Frederico Muñoz (FSM): Hello Carlos, thank you so much for your availability. To start with, could you tell us a bit about yourself?

Carlos Santana (CS): Of course. My experience in deploying Kubernetes in production six years ago opened the door for me to join Knative and then contribute to Kubernetes through the Release Team. Working on upstream Kubernetes has been one of the best experiences I've had in open-source. Over the past two years, in my role as a Senior Specialist Solutions Architect at AWS, I have been assisting large enterprises build their internal developer platforms (IDP) on top of Kubernetes. Going forward, my open source contributions are directed towards CNOE and CNCF projects like Argo, Crossplane, and Backstage.

Creating the Book Club

FSM: So your path led you to Kubernetes, and at that point what was the motivating factor for starting the Book Club?

CS: The idea for the Kubernetes Book Club sprang from a casual suggestion during a TGIK livestream. For me, it was more than just about reading a book; it was about creating a learning community. This platform has not only been a source of knowledge but also a support system, especially during the challenging times of the pandemic. It's gratifying to see how this initiative has helped members cope and grow. The first book, Production Kubernetes, took 36 weeks to cover when we started on March 5th, 2021. It doesn't take that long to cover a book these days; we go through one or two chapters per week.

FSM: Could you describe the way the Kubernetes Book Club works? How do you select the books and how do you go through them?

CS: We collectively choose books based on the interests and needs of the group. This practical approach helps members, especially beginners, grasp complex concepts more easily. We have two weekly series, one for the EMEA timezone, and I organize the US one. Each organizer works with their co-host and picks a book on Slack, then sets up a lineup of hosts for a couple of weeks to discuss each chapter.

FSM: If I'm not mistaken, the Kubernetes Book Club is in its 17th book, which is significant: is there any secret recipe for keeping things active?

CS: The secret to keeping the club active and engaging lies in a couple of key factors.

Firstly, consistency has been crucial. We strive to maintain a regular schedule, only cancelling meetups for major events like holidays or KubeCon. This regularity helps members stay engaged and builds a reliable community.

Secondly, making the sessions interesting and interactive has been vital. For instance, I often introduce pop-up quizzes during the meetups, which not only tests members' understanding but also adds an element of fun. This approach keeps the content relatable and helps members understand how theoretical concepts are applied in real-world scenarios.

Topics covered in the Book Club

FSM: The main topics of the books have been Kubernetes, GitOps, Security, SRE, and Observability: is this a reflection of the cloud native landscape, especially in terms of popularity?

CS: Our journey began with 'Production Kubernetes', setting the tone for our focus on practical, production-ready solutions. Since then, we've delved into various aspects of the CNCF landscape, aligning our books with a different theme. Each theme, whether it be Security, Observability, or Service Mesh, is chosen based on its relevance and demand within the community. For instance, in our recent themes on Kubernetes Certifications, we brought the book authors into our fold as active hosts, enriching our discussions with their expertise.

FSM: I know that the project had recent changes, namely being integrated into the CNCF as a Cloud Native Community Group. Could you talk a bit about this change?

CS: The CNCF graciously accepted the book club as a Cloud Native Community Group. This is a significant development that has streamlined our operations and expanded our reach. This alignment has been instrumental in enhancing our administrative capabilities, similar to those used by Kubernetes Community Days (KCD) meetups. Now, we have a more robust structure for memberships, event scheduling, mailing lists, hosting web conferences, and recording sessions.

FSM: How has your involvement with the CNCF impacted the growth and engagement of the Kubernetes Book Club over the past six months?

CS: Since becoming part of the CNCF community six months ago, we've witnessed significant quantitative changes within the Kubernetes Book Club. Our membership has surged to over 600 members, and we've successfully organized and conducted more than 40 events during this period. What's even more promising is the consistent turnout, with an average of 30 attendees per event. This growth and engagement are clear indicators of the positive influence of our CNCF affiliation on the Kubernetes Book Club's reach and impact in the community.

Joining the Book Club

FSM: For anyone wanting to join, what should they do?

CS: There are three steps to join:

FSM: Excellent, thank you! Any final comments you would like to share?

CS: The Kubernetes Book Club is more than just a group of professionals discussing books; it's a vibrant community, with amazing volunteers who help organize and host it: Neependra Khare, Eric Smalling, Sevi Karakulak, Chad M. Crowell, and Walid (CNJ) Shaari. Look us up at KubeCon and get your Kubernetes Book Club sticker!

22 Feb 2024 12:00am GMT

23 Jan 2024

feedKubernetes Blog

Image Filesystem: Configuring Kubernetes to store containers on a separate filesystem

A common issue in running/operating Kubernetes clusters is running out of disk space. When the node is provisioned, you should aim to have a good amount of storage space for your container images and running containers. The container runtime usually writes to /var. This can be located as a separate partition or on the root filesystem. CRI-O, by default, writes its containers and images to /var/lib/containers, while containerd writes its containers and images to /var/lib/containerd.

In this blog post, we want to bring attention to ways that you can configure your container runtime to store its content separately from the default partition.
This allows for more flexibility in configuring Kubernetes and provides support for adding a larger disk for the container storage while keeping the default filesystem untouched.

One area that needs more explaining is where/what Kubernetes is writing to disk.

Understanding Kubernetes disk usage

Kubernetes has persistent data and ephemeral data. The base path for the kubelet and local Kubernetes-specific storage is configurable, but it is usually assumed to be /var/lib/kubelet. In the Kubernetes docs, this is sometimes referred to as the root or node filesystem. The bulk of this data can be categorized into:

This is different from most POSIX systems as the root/node filesystem is not / but the disk that /var/lib/kubelet is on.

Ephemeral storage

Pods and containers can require temporary or transient local storage for their operation. The lifetime of the ephemeral storage does not extend beyond the life of the individual pod, and the ephemeral storage cannot be shared across pods.
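As a sketch, a Pod can declare how much ephemeral storage it expects to use and mount an emptyDir volume for scratch space; the names and sizes here are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          ephemeral-storage: "1Gi"
        limits:
          ephemeral-storage: "2Gi"
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      # filesystem-backed emptyDir volumes consume space from the node filesystem
      emptyDir:
        sizeLimit: 500Mi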

Logs

By default, Kubernetes stores the logs of each running container, as files within /var/log. These logs are ephemeral and are monitored by the kubelet to make sure that they do not grow too large while the pods are running.

You can customize the log rotation settings for each node to manage the size of these logs, and configure log shipping (using a 3rd party solution) to avoid relying on the node-local storage.
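For example, the per-node rotation thresholds can be set in the kubelet configuration; the values below are only illustrative and should be adapted to your environment:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# rotate a container's log file once it reaches this size
containerLogMaxSize: 10Mi
# keep at most this many rotated log files per container
containerLogMaxFiles: 5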

Container runtime

The container runtime has two different areas of storage for containers and images.

The container runtime filesystem contains both the read-only layer and the writeable layer. This is considered the imagefs in Kubernetes documentation.

Container runtime configurations

CRI-O

CRI-O uses a storage configuration file in TOML format that lets you control how the container runtime stores persistent and temporary data. CRI-O utilizes the storage library.
Some Linux distributions have a manual entry for storage (man 5 containers-storage.conf). The main configuration for storage is located in /etc/containers/storage.conf and one can control the location for temporary data and the root directory.
The root directory is where CRI-O stores the persistent data.

[storage]
# Default storage driver
driver = "overlay"
# Temporary storage location
runroot = "/var/run/containers/storage"
# Primary read/write location of container storage 
graphroot = "/var/lib/containers/storage"

Here is a quick way to relabel your graphroot directory to match /var/lib/containers/storage:

semanage fcontext -a -e /var/lib/containers/storage <YOUR-STORAGE-PATH>
restorecon -R -v <YOUR-STORAGE-PATH>

containerd

The containerd runtime uses a TOML configuration file to control where persistent and ephemeral data is stored. The default path for the config file is located at /etc/containerd/config.toml.

The relevant fields for containerd storage are root and state.

Kubernetes node pressure eviction

Kubernetes will automatically detect if the container filesystem is split from the node filesystem. When one separates the filesystem, Kubernetes is responsible for monitoring both the node filesystem and the container runtime filesystem. Kubernetes documentation refers to the node filesystem and the container runtime filesystem as nodefs and imagefs. If either nodefs or the imagefs are running out of disk space, then the overall node is considered to have disk pressure. Kubernetes will first reclaim space by deleting unused containers and images, and then it will resort to evicting pods. On a node that has a nodefs and an imagefs, the kubelet will garbage collect unused container images on imagefs and will remove dead pods and their containers from the nodefs. If there is only a nodefs, then Kubernetes garbage collection includes dead containers, dead pods and unused images.

Kubernetes allows more configurations for determining if your disk is full.
The eviction manager within the kubelet has some configuration settings that let you control the relevant thresholds. For filesystems, the relevant measurements are nodefs.available, nodefs.inodesFree, imagefs.available, and imagefs.inodesFree. If there is not a dedicated disk for the container runtime then imagefs is ignored.

Users can use the existing defaults:

Kubernetes allows you to set user defined values in EvictionHard and EvictionSoft in the kubelet configuration file.

EvictionHard: defines limits; once these limits are exceeded, pods will be evicted without any grace period.
EvictionSoft: defines limits; once these limits are exceeded, pods will be evicted with a grace period that can be set per signal.

If you specify a value for EvictionHard, it will replace the defaults.
This means it is important to set all signals in your configuration.

For example, the following kubelet configuration could be used to configure eviction signals and grace period options.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
  imagefs.inodesFree: "5%"
evictionSoft:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
  imagefs.inodesFree: "5%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
  nodefs.inodesFree: "2m"
  imagefs.available: "2m"
  imagefs.inodesFree: "2m"
evictionMaxPodGracePeriod: 60

Problems

The Kubernetes project recommends that you either use the default settings for eviction or you set all the fields for eviction. You can use the default settings or specify your own evictionHard settings. If you miss a signal, then Kubernetes will not monitor that resource. One common misconfiguration administrators or users can hit is mounting a new filesystem to /var/lib/containers/storage or /var/lib/containerd. Kubernetes will detect a separate filesystem, so you want to make sure to check that imagefs.inodesFree and imagefs.available match your needs if you've done this.

Another area of confusion is that ephemeral storage reporting does not change if you define an image filesystem for your node. The image filesystem (imagefs) is used to store container image layers; if a container writes to its own root filesystem, that local write doesn't count towards the size of the container image. The place where the container runtime stores those local modifications is runtime-defined, but is often the image filesystem. If a container in a pod is writing to a filesystem-backed emptyDir volume, then this uses space from the nodefs filesystem. The kubelet always reports ephemeral storage capacity and allocations based on the filesystem represented by nodefs; this can be confusing when ephemeral writes are actually going to the image filesystem.

Future work

To fix the ephemeral storage reporting limitations and provide more configuration options to the container runtime, SIG Node are working on KEP-4191. In KEP-4191, Kubernetes will detect if the writeable layer is separated from the read-only layer (images). This would allow us to have all ephemeral storage, including the writeable layer, on the same disk as well as allowing for a separate disk for images.

Getting involved

If you would like to get involved, you can join Kubernetes Node Special-Interest-Group (SIG).

If you would like to share feedback, you can do so on our #sig-node Slack channel. If you're not already part of that Slack workspace, you can visit https://slack.k8s.io/ for an invitation.

Special thanks to all the contributors who provided great reviews, shared valuable insights or suggested the topic idea.

23 Jan 2024 12:00am GMT

15 Jan 2024

feedKubernetes Blog

Spotlight on SIG Release (Release Team Subproject)

The Release Special Interest Group (SIG Release) is where Kubernetes sharpens its blade with cutting-edge features and bug fixes every 4 months. Have you ever considered how such a big project as Kubernetes manages its timeline so efficiently to release its new versions, or what the internal workings of the Release Team look like? If you're curious about these questions or want to know more and get involved with the work SIG Release does, read on!

SIG Release plays a crucial role in the development and evolution of Kubernetes. Its primary responsibility is to manage the release process of new versions of Kubernetes. It operates on a regular release cycle, typically every three to four months. During this cycle, the Kubernetes Release Team works closely with other SIGs and contributors to ensure a smooth and well-coordinated release. This includes planning the release schedule, setting deadlines for code freeze and testing phases, as well as creating release artefacts like binaries, documentation, and release notes.

Before you read further, it is important to note that there are two subprojects under SIG Release - Release Engineering and Release Team.

In this blog post, Nitish Kumar interviews Verónica López (PlanetScale), Technical Lead of SIG Release, with the spotlight on the Release Team subproject, what the release process looks like, and ways to get involved.

  1. What is the typical release process for a new version of Kubernetes, from initial planning to the final release? Are there any specific methodologies and tools that you use to ensure a smooth release?

    The release process for a new Kubernetes version is a well-structured and community-driven effort. There are no specific methodologies or tools as such that we follow, except a calendar with a series of steps to keep things organised. The complete release process looks like this:

  2. How do you handle the balance between stability and introducing new features in each release? What criteria are used to determine which features make it into a release?

    It's a never-ending mission; however, we think the key is respecting our process and guidelines. Our guidelines are the result of hours of discussions and feedback from dozens of members of the community who bring a wealth of knowledge and experience to the project. If we didn't have strict guidelines, we would keep having the same discussions over and over again, instead of using our time for more productive topics that need our attention. All critical exceptions require consensus from most of the team members, so we can ensure quality.

    The process of deciding what makes it into a release starts way before the Release Team takes over the workflows. Each individual SIG, along with its most experienced contributors, gets to decide whether they'd like to include a feature or change, so the planning and ultimate approval usually belong to them. Then, the Release Team makes sure those contributions meet the requirements for documentation, testing, and backwards compatibility, among others, before officially allowing them in. A similar process happens with cherry-picks for the monthly patch releases, where we have strict policies about not accepting PRs that would require a full KEP, or fixes that don't include all the affected branches.

  3. What are some of the most significant challenges you've encountered while developing and releasing Kubernetes? How have you overcome these challenges?

    Every cycle of release brings its own array of challenges. It might involve tackling last-minute concerns like newly discovered Common Vulnerabilities and Exposures (CVEs), resolving bugs within our internal tools, or addressing unexpected regressions caused by features from previous releases. Another obstacle we often face is that, although our team is substantial, most of us contribute on a volunteer basis. Sometimes it can feel like we're a bit understaffed, however we always manage to get organised and make it work.

  4. As a new contributor, what should be my ideal path to get involved with SIG Release? In a community where everyone is busy with their own tasks, how can I find the right set of tasks to contribute effectively to it?

    Everyone's way of getting involved within the Open Source community is different. SIG Release is a self-serving team, meaning that we write our own tools to be able to ship releases. We collaborate a lot with other SIGs, such as SIG K8s Infra, but all the tools that we use need to be tailor-made for our massive technical needs, while reducing costs. This means that we are constantly looking for volunteers who'd like to help with different types of projects, beyond "just" cutting a release.

    Our current project requires a mix of skills like Go programming, understanding Kubernetes internals, Linux packaging, supply chain security, technical writing, and general open-source project maintenance. This skill set is always evolving as our project grows.

    For an ideal path, this is what we suggest:

    • Get yourself familiar with the code, including how features are managed, the release calendar, and the overall structure of the Release Team.
    • Join the Kubernetes community communication channels, such as Slack (#sig-release), where we are particularly active.
    • Join the SIG Release weekly meetings which are open to all in the community. Participating in these meetings is a great way to learn about ongoing and future projects that you might find relevant for your skillset and interests.

    Remember, every experienced contributor was once in your shoes, and the community is often more than willing to guide and support newcomers. Don't hesitate to ask questions, engage in discussions, and take small steps to contribute.

  5. What is the Release Shadow Program and how is it different from other shadow programs included in various other SIGs?

    The Release Shadow Program offers a chance for interested individuals to shadow experienced members of the Release Team throughout a Kubernetes release cycle. This is a unique chance to see all the hard work that a Kubernetes release requires across sub-teams. A lot of people think that all we do is cut a release every three months, but that's just the tip of the iceberg.

    Our program typically aligns with a specific Kubernetes release cycle, which has a predictable timeline of approximately three months. While this program doesn't involve writing new Kubernetes features, it still requires a high sense of responsibility since the Release Team is the last step between a new release and thousands of contributors, so it's a great opportunity to learn a lot about modern software development cycles at an accelerated pace.

  6. What are the qualifications that you generally look for in a person to volunteer as a release shadow/release lead for the next Kubernetes release?

    While all the roles require some degree of technical ability, some require more hands-on experience with Go and familiarity with the Kubernetes API, while others require people who are good at communicating technical content in a clear and concise way. It's important to mention that we value enthusiasm and commitment over technical expertise from day one. If you have the right attitude and show us that you enjoy working with Kubernetes and/or release engineering, even if it's only through a personal project that you put together in your spare time, the team will make sure to guide you. Being a self-starter and not being afraid to ask questions can take you a long way in our team.

  7. What would you suggest to someone who has been rejected from the Release Shadow Program several times?

    Keep applying.

    With every release cycle we have had an exponential growth in the number of applicants, so it gets harder to be selected, which can be discouraging, but please know that getting rejected doesn't mean you're not talented. It's just practically impossible to accept every applicant, however here's an alternative that we suggest:

    Start attending our weekly Kubernetes SIG Release meetings to introduce yourself and get familiar with the team and the projects we are working on.

    The Release Team is one of the ways to join SIG Release, but we are always looking for more hands to help. Again, in addition to a certain technical ability, the most sought-after trait that we look for is people we can trust, and that requires time.

  8. Can you discuss any ongoing initiatives or upcoming features that the release team is particularly excited about for Kubernetes v1.28? How do these advancements align with the long-term vision of Kubernetes?

    We are excited about finally publishing Kubernetes packages on community infrastructure. It's something we have wanted to do for a few years now, but it's a project with many technical implications that must be in place before making the transition. Once that's done, we'll be able to increase our productivity and take control of the entire workflow.

Final thoughts

Well, this conversation ends here but not the learning. I hope this interview has given you some idea about what SIG Release does and how to get started in helping out. It is important to mention again that this article covers the first subproject under SIG Release, the Release Team. In the next Spotlight blog on SIG Release, we will provide a spotlight on the Release Engineering subproject, what it does and how to get involved. Finally, you can go through the SIG Release charter to get a more in-depth understanding of how SIG Release operates.

15 Jan 2024 12:00am GMT

20 Dec 2023

feedKubernetes Blog

Contextual logging in Kubernetes 1.29: Better troubleshooting and enhanced logging

Authors: Mengjiao Liu (DaoCloud), Patrick Ohly (Intel)

On behalf of the Structured Logging Working Group and SIG Instrumentation, we are pleased to announce that the contextual logging feature introduced in Kubernetes v1.24 has now been successfully migrated to two components (kube-scheduler and kube-controller-manager) as well as some directories. This feature aims to provide more useful logs for better troubleshooting of Kubernetes and to empower developers to enhance Kubernetes.

What is contextual logging?

Contextual logging is based on the go-logr API. The key idea is that libraries are passed a logger instance by their caller and use that for logging instead of accessing a global logger. The binary decides the logging implementation, not the libraries. The go-logr API is designed around structured logging and supports attaching additional information to a logger.

This enables additional use cases:

One of the design decisions for contextual logging was to allow attaching a logger as a value to a context.Context. Since the logger encapsulates all aspects of the intended logging for the call, it is part of the context, not just something that uses the context. A practical advantage is that many APIs already have a ctx parameter or can add one. This brings additional benefits, like being able to get rid of context.TODO() calls inside those functions.

How to use it

The contextual logging feature is alpha starting from Kubernetes v1.24, so it requires the ContextualLogging feature gate to be enabled. If you want to test the feature while it is alpha, you need to enable this feature gate on the kube-controller-manager and the kube-scheduler.
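
As a sketch of what enabling the gate can look like on a kubeadm-provisioned control plane (the manifest path and image tag below are assumptions; adjust them for your cluster), you would add the flag to the component's static Pod manifest; the same flag applies to the kube-controller-manager:

# /etc/kubernetes/manifests/kube-scheduler.yaml (kubeadm default location)
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.29.0 # example tag
    command:
    - kube-scheduler
    - --feature-gates=ContextualLogging=true # enable the alpha contextual logging feature
    - --v=3 # production verbosity; higher levels add logging overhead
    # ...remaining flags and fields unchanged...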

For the kube-scheduler, there is one thing to note: in addition to enabling the ContextualLogging feature gate, the instrumentation also depends on log verbosity. To avoid slowing down the scheduler with the logging instrumentation added for contextual logging in 1.29, it is important to choose carefully when to add additional information.

Here is an example that demonstrates the effect:

I1113 08:43:37.029524 87144 default_binder.go:53] "Attempting to bind pod to node" logger="Bind.DefaultBinder" pod="kube-system/coredns-69cbfb9798-ms4pq" node="127.0.0.1"

The immediate benefit is that the operation and plugin name are visible in the logger field. pod and node are already logged as parameters in individual log calls in kube-scheduler code. Once contextual logging is supported by more packages outside of kube-scheduler, they will also be visible there (for example, in client-go). Once it is GA, log calls can be simplified to avoid repeating those values.

In kube-controller-manager, WithName is used to add the user-visible controller name to log output, for example:

I1113 08:43:29.284360 87141 graph_builder.go:285] "garbage controller monitor not synced: no monitors" logger="garbage-collector-controller"

The logger="garbage-collector-controller" was added by the kube-controller-manager core when instantiating that controller and appears in all of its log entries - at least as long as the code that it calls supports contextual logging. Further work is needed to convert shared packages like client-go.

Performance impact

Supporting contextual logging in a package, i.e. accepting a logger from a caller, is cheap. No performance impact was observed for the kube-scheduler. As noted above, adding WithName and WithValues needs to be done more carefully.

In Kubernetes 1.29, enabling contextual logging at production verbosity (-v3 or lower) caused no measurable slowdown for the kube-scheduler and is not expected for the kube-controller-manager either. At debug levels, a 28% slowdown for some test cases is still reasonable given that the resulting logs make debugging easier. For details, see the discussion around promoting the feature to beta.

Impact on downstream users

Log output is not part of the Kubernetes API and changes regularly in each release, whether it is because developers work on the code or because of the ongoing conversion to structured and contextual logging.

If downstream users have dependencies on specific logs, they need to be aware of how this change affects them.

Further reading

Get involved

If you're interested in getting involved, we always welcome new contributors to join us. Contextual logging provides a fantastic opportunity for you to contribute to Kubernetes development and make a meaningful impact. By joining the Structured Logging WG, you can actively participate in the development of Kubernetes and make your first contribution. It's a great way to learn and engage with the community while gaining valuable experience.

We encourage you to explore the repository and familiarize yourself with the ongoing discussions and projects. It's a collaborative environment where you can exchange ideas, ask questions, and work together with other contributors.

If you have any questions or need guidance, don't hesitate to reach out to us and you can do so on our public Slack channel. If you're not already part of that Slack workspace, you can visit https://slack.k8s.io/ for an invitation.

We would like to express our gratitude to all the contributors who provided excellent reviews, shared valuable insights, and assisted in the implementation of this feature (in alphabetical order):

20 Dec 2023 5:30pm GMT

19 Dec 2023

feedKubernetes Blog

Kubernetes 1.29: PodReadyToStartContainers Condition Moves to Beta

Authors: Zefeng Chen (independent), Kevin Hannon (Red Hat)

With the recent release of Kubernetes 1.29, the PodReadyToStartContainers condition is available by default. The kubelet manages the value for that condition throughout a Pod's lifecycle, in the status field of a Pod. The kubelet will use the PodReadyToStartContainers condition to accurately surface the initialization state of a Pod, from the perspective of Pod sandbox creation and network configuration by a container runtime.
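
For example, on a running Pod the new condition appears in status.conditions alongside the familiar ones. The snippet below is an abridged, illustrative view of kubectl get pod <name> -o yaml output (the timestamps are made up):

status:
  conditions:
  - type: PodReadyToStartContainers # sandbox created and network configured by the runtime
    status: "True"
    lastTransitionTime: "2023-12-19T09:01:02Z"
  - type: Initialized
    status: "True"
    lastTransitionTime: "2023-12-19T09:01:05Z"
  - type: Ready
    status: "True"
    lastTransitionTime: "2023-12-19T09:01:10Z"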

What's the motivation for this feature?

Cluster administrators did not have a clear and easily accessible way to view the completion of a Pod's sandbox creation and initialization. As of 1.28, the Initialized condition in Pods tracks the execution of init containers. However, it has limitations in accurately reflecting the completion of sandbox creation and readiness to start containers for all Pods in a cluster. This distinction is particularly important in multi-tenant clusters where tenants own the Pod specifications, including the set of init containers, while cluster administrators manage storage plugins, networking plugins, and container runtime handlers. Therefore, there is a need for an improved mechanism to provide cluster administrators with a clear and comprehensive view of Pod sandbox creation completion and container readiness.

What's the benefit?

  1. Improved Visibility: Cluster administrators gain a clearer and more comprehensive view of Pod sandbox creation completion and container readiness. This enhanced visibility allows them to make better-informed decisions and troubleshoot issues more effectively.
  2. Metric Collection and Monitoring: Monitoring services can leverage the fields associated with the PodReadyToStartContainers condition to report sandbox creation state and latency. Metrics can be collected at per-Pod cardinality or aggregated based on various properties of the Pod, such as volumes, runtimeClassName, custom annotations for CNI and IPAM plugins or arbitrary labels and annotations, and storageClassName of PersistentVolumeClaims. This enables comprehensive monitoring and analysis of Pod readiness across the cluster.
  3. Enhanced Troubleshooting: With a more accurate representation of Pod sandbox creation and container readiness, cluster administrators can quickly identify and address any issues that may arise during the initialization process. This leads to improved troubleshooting capabilities and reduced downtime.

What's next?

Due to feedback and adoption, the Kubernetes team promoted PodReadyToStartContainersCondition to Beta in 1.29. Your comments will help determine if this condition continues forward to get promoted to GA, so please submit additional feedback on this feature!

How can I learn more?

Please check out the documentation for the PodReadyToStartContainersCondition to learn more about it and how it fits in relation to other Pod conditions.

How to get involved?

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

19 Dec 2023 12:00am GMT

Kubernetes 1.29: Decoupling taint-manager from node-lifecycle-controller

Authors: Yuan Chen (Apple), Andrea Tosatto (Apple)

This blog discusses a new feature in Kubernetes 1.29 to improve the handling of taint-based pod eviction.

Background

In Kubernetes 1.29, an improvement has been introduced to enhance the taint-based pod eviction handling on nodes. This blog discusses the changes made to node-lifecycle-controller to separate its responsibilities and improve overall code maintainability.

Summary of changes

node-lifecycle-controller previously combined two independent functions:

With the Kubernetes 1.29 release, the taint-based eviction implementation has been moved out of node-lifecycle-controller into a separate and independent component called taint-eviction-controller. This separation aims to disentangle code, enhance code maintainability, and facilitate future extensions to either component.

As part of the change, additional metrics were introduced to help you monitor taint-based pod evictions:

How to use the new feature?

A new feature gate, SeparateTaintEvictionController, has been added. The feature is enabled by default as Beta in Kubernetes 1.29. Please refer to the feature gate document.

When this feature is enabled, users can optionally disable taint-based eviction by excluding the controller from kube-controller-manager's --controllers list (for example, --controllers=*,-taint-eviction-controller).

To disable the new feature and use the legacy taint-manager within node-lifecycle-controller, users can set the feature gate SeparateTaintEvictionController=false.
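
As a sketch, both options are ordinary kube-controller-manager flags. On a kubeadm-provisioned control plane (the manifest path and image tag below are assumptions; adjust them for your cluster), they would be set in the component's static Pod manifest:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm default location)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.29.0 # example tag
    command:
    - kube-controller-manager
    # keep the default controllers but do not run the new taint-eviction-controller
    - --controllers=*,-taint-eviction-controller
    # or fall back to the legacy taint-manager inside node-lifecycle-controller
    # - --feature-gates=SeparateTaintEvictionController=false
    # ...remaining flags and fields unchanged...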

Use cases

This new feature will allow cluster administrators to extend and enhance the default taint-eviction-controller and even replace the default taint-eviction-controller with a custom implementation to meet different needs. An example is to better support stateful workloads that use PersistentVolume on local disks.

FAQ

Does this feature change the existing behavior of taint-based pod evictions?

No, the taint-based pod eviction behavior remains unchanged. If the feature gate SeparateTaintEvictionController is turned off, the legacy node-lifecycle-controller with taint-manager will continue to be used.

Will enabling/using this feature result in an increase in the time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling/using this feature result in an increase in resource usage (CPU, RAM, disk, IO, ...)?

The increase in resource usage by running a separate taint-eviction-controller will be negligible.

Learn more

For more details, refer to the KEP.

Acknowledgments

As with any Kubernetes feature, multiple community members have contributed, from writing the KEP to implementing the new controller and reviewing the KEP and code. Special thanks to:

19 Dec 2023 12:00am GMT

18 Dec 2023

feedKubernetes Blog

Kubernetes 1.29: Single Pod Access Mode for PersistentVolumes Graduates to Stable

Author: Chris Henzie (Google)

With the release of Kubernetes v1.29, the ReadWriteOncePod volume access mode has graduated to general availability: it's part of Kubernetes' stable API. In this blog post, I'll take a closer look at this access mode and what it does.

What is ReadWriteOncePod?

ReadWriteOncePod is an access mode for PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs) introduced in Kubernetes v1.22. This access mode enables you to restrict volume access to a single pod in the cluster, ensuring that only one pod can write to the volume at a time. This can be particularly useful for stateful workloads that require single-writer access to storage.

For more context on access modes and how ReadWriteOncePod works read What are access modes and why are they important? in the Introducing Single Pod Access Mode for PersistentVolumes article from 2021.

How can I start using ReadWriteOncePod?

The ReadWriteOncePod volume access mode is available by default in Kubernetes versions v1.27 and beyond. In Kubernetes v1.29 and later, the Kubernetes API always recognizes this access mode.

Note that ReadWriteOncePod is only supported for CSI volumes, and before using this feature, you will need to update the following CSI sidecars to these versions or greater:

To start using ReadWriteOncePod, you need to create a PVC with the ReadWriteOncePod access mode:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: single-writer-only
spec:
  accessModes:
  - ReadWriteOncePod # Allows only a single pod to access single-writer-only.
  resources:
    requests:
      storage: 1Gi

If your storage plugin supports dynamic provisioning, new PersistentVolumes will be created with the ReadWriteOncePod access mode applied.
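
For instance, a claim that references a StorageClass backed by a CSI driver with ReadWriteOncePod support will get a dynamically provisioned PersistentVolume carrying that access mode; the class and driver names below are placeholders:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-rwop # placeholder class name
provisioner: csi.example.com # placeholder CSI driver that supports ReadWriteOncePod
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer-only
spec:
  storageClassName: fast-rwop
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 1Gi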

Read Migrating existing PersistentVolumes for details on migrating existing volumes to use ReadWriteOncePod.

How can I learn more?

Please see the blog posts alpha, beta, and KEP-2485 for more details on the ReadWriteOncePod access mode and motivations for CSI spec changes.

How do I get involved?

The Kubernetes #csi Slack channel and any of the standard SIG Storage communication channels are great ways to reach out to SIG Storage and the CSI teams.

Special thanks to the following people whose thoughtful reviews and feedback helped shape this feature:

If you're interested in getting involved with the design and development of CSI or any part of the Kubernetes storage system, join the Kubernetes Storage Special Interest Group (SIG). We're rapidly growing and always welcome new contributors.

18 Dec 2023 12:00am GMT