23 Aug 2024
Kubernetes Blog
Kubernetes v1.31: kubeadm v1beta4
As part of the Kubernetes v1.31 release, kubeadm
is adopting a new (v1beta4) version of its configuration file format. Configuration in the previous v1beta3 format is now formally deprecated, which means it's supported but you should migrate to v1beta4 and stop using the deprecated format. Support for v1beta3 configuration will be removed after a minimum of 3 Kubernetes minor releases.
In this article, I'll walk you through key changes; I'll explain about the kubeadm v1beta4 configuration format, and how to migrate from v1beta3 to v1beta4.
You can read the reference for the v1beta4 configuration format: kubeadm Configuration (v1beta4).
A list of changes since v1beta3
This version improves on the v1beta3 format by fixing some minor issues and adding a few new fields.
To put it simply,
- Two new configuration elements: ResetConfiguration and UpgradeConfiguration
- For InitConfiguration and JoinConfiguration,
dryRun
mode andnodeRegistration.imagePullSerial
are supported - For ClusterConfiguration, there are new fields including
certificateValidityPeriod
,caCertificateValidityPeriod
,encryptionAlgorithm
,dns.disabled
andproxy.disabled
. - Support
extraEnvs
for all control plan components extraArgs
changed from a map to structured extra arguments for duplicates- Add a
timeouts
structure for init, join, upgrade and reset.
For details, you can see the official document below:
- Support custom environment variables in control plane components under
ClusterConfiguration
. UseapiServer.extraEnvs
,controllerManager.extraEnvs
,scheduler.extraEnvs
,etcd.local.extraEnvs
. - The ResetConfiguration API type is now supported in v1beta4. Users are able to reset a node by passing a
--config
file tokubeadm reset
. dryRun
mode is now configurable in InitConfiguration and JoinConfiguration.- Replace the existing string/string extra argument maps with structured extra arguments that support duplicates. The change applies to
ClusterConfiguration
-apiServer.extraArgs
,controllerManager.extraArgs
,scheduler.extraArgs
,etcd.local.extraArgs
. Also tonodeRegistrationOptions.kubeletExtraArgs
. - Added
ClusterConfiguration.encryptionAlgorithm
that can be used to set the asymmetric encryption algorithm used for this cluster's keys and certificates. Can be one of "RSA-2048" (default), "RSA-3072", "RSA-4096" or "ECDSA-P256". - Added
ClusterConfiguration.dns.disabled
andClusterConfiguration.proxy.disabled
that can be used to disable the CoreDNS and kube-proxy addons during cluster initialization. Skipping the related addons phases, during cluster creation will set the same fields totrue
. - Added the
nodeRegistration.imagePullSerial
field inInitConfiguration
andJoinConfiguration
, which can be used to control if kubeadm pulls images serially or in parallel. - The UpgradeConfiguration kubeadm API is now supported in v1beta4 when passing
--config
tokubeadm upgrade
subcommands. For upgrade subcommands, the usage of component configuration for kubelet and kube-proxy, as well as InitConfiguration and ClusterConfiguration, is now deprecated and will be ignored when passing--config
. - Added a
timeouts
structure toInitConfiguration
,JoinConfiguration
,ResetConfiguration
andUpgradeConfiguration
that can be used to configure various timeouts. TheClusterConfiguration.timeoutForControlPlane
field is replaced bytimeouts.controlPlaneComponentHealthCheck
. TheJoinConfiguration.discovery.timeout
is replaced bytimeouts.discovery
. - Added a
certificateValidityPeriod
andcaCertificateValidityPeriod
fields toClusterConfiguration
. These fields can be used to control the validity period of certificates generated by kubeadm during sub-commands such asinit
,join
,upgrade
andcerts
. Default values continue to be 1 year for non-CA certificates and 10 years for CA certificates. Also note that only non-CA certificates are renewable bykubeadm certs renew
.
These changes simplify the configuration of tools that use kubeadm and improve the extensibility of kubeadm itself.
How to migrate v1beta3 configuration to v1beta4?
If your configuration is not using the latest version, it is recommended that you migrate using the kubeadm config migrate command.
This command reads an existing configuration file that uses the old format, and writes a new file that uses the current format.
Example
Using kubeadm v1.31, run kubeadm config migrate --old-config old-v1beta3.yaml --new-config new-v1beta4.yaml
How do I get involved?
Huge thanks to all the contributors who helped with the design, implementation, and review of this feature:
- Lubomir I. Ivanov (neolit123)
- Dave Chen(chendave)
- Paco Xu (pacoxu)
- Sata Qiu(sataqiu)
- Baofa Fan(carlory)
- Calvin Chen(calvin0327)
- Ruquan Zhao(ruquanzhao)
For those interested in getting involved in future discussions on kubeadm configuration, you can reach out kubeadm or SIG-cluster-lifecycle by several means:
- v1beta4 related items are tracked in kubeadm issue #2890.
- Slack: #kubeadm or #sig-cluster-lifecycle
- Mailing list
23 Aug 2024 12:00am GMT
22 Aug 2024
Kubernetes Blog
Kubernetes v1.31: New Kubernetes CPUManager Static Policy: Distribute CPUs Across Cores
In Kubernetes v1.31, we are excited to introduce a significant enhancement to CPU management capabilities: the distribute-cpus-across-cores
option for the CPUManager static policy. This feature is currently in alpha and hidden by default, marking a strategic shift aimed at optimizing CPU utilization and improving system performance across multi-core processors.
Understanding the feature
Traditionally, Kubernetes' CPUManager tends to allocate CPUs as compactly as possible, typically packing them onto the fewest number of physical cores. However, allocation strategy matters, CPUs on the same physical host still share some resources of the physical core, such as the cache and execution units, etc.
While default approach minimizes inter-core communication and can be beneficial under certain scenarios, it also poses a challenge. CPUs sharing a physical core can lead to resource contention, which in turn may cause performance bottlenecks, particularly noticeable in CPU-intensive applications.
The new distribute-cpus-across-cores
feature addresses this issue by modifying the allocation strategy. When enabled, this policy option instructs the CPUManager to spread out the CPUs (hardware threads) across as many physical cores as possible. This distribution is designed to minimize contention among CPUs sharing the same physical core, potentially enhancing the performance of applications by providing them dedicated core resources.
Technically, within this static policy, the free CPU list is reordered in the manner depicted in the diagram, aiming to allocate CPUs from separate physical cores.
Enabling the feature
To enable this feature, users firstly need to add --cpu-manager-policy=static
kubelet flag or the cpuManagerPolicy: static
field in KubeletConfiuration. Then user can add --cpu-manager-policy-options distribute-cpus-across-cores=true
or distribute-cpus-across-cores=true
to their CPUManager policy options in the Kubernetes configuration or. This setting directs the CPUManager to adopt the new distribution strategy. It is important to note that this policy option cannot currently be used in conjunction with full-pcpus-only
or distribute-cpus-across-numa
options.
Current limitations and future directions
As with any new feature, especially one in alpha, there are limitations and areas for future improvement. One significant current limitation is that distribute-cpus-across-cores
cannot be combined with other policy options that might conflict in terms of CPU allocation strategies. This restriction can affect compatibility with certain workloads and deployment scenarios that rely on more specialized resource management.
Looking forward, we are committed to enhancing the compatibility and functionality of the distribute-cpus-across-cores
option. Future updates will focus on resolving these compatibility issues, allowing this policy to be combined with other CPUManager policies seamlessly. Our goal is to provide a more flexible and robust CPU allocation framework that can adapt to a variety of workloads and performance demands.
Conclusion
The introduction of the distribute-cpus-across-cores
policy in Kubernetes CPUManager is a step forward in our ongoing efforts to refine resource management and improve application performance. By reducing the contention on physical cores, this feature offers a more balanced approach to CPU resource allocation, particularly beneficial for environments running heterogeneous workloads. We encourage Kubernetes users to test this new feature and provide feedback, which will be invaluable in shaping its future development.
This draft aims to clearly explain the new feature while setting expectations for its current stage and future improvements.
Further reading
Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.
Getting involved
This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.
22 Aug 2024 12:00am GMT
Kubernetes 1.31: Fine-grained SupplementalGroups control
This blog discusses a new feature in Kubernetes 1.31 to improve the handling of supplementary groups in containers within Pods.
Motivation: Implicit group memberships defined in /etc/group
in the container image
Although this behavior may not be popular with many Kubernetes cluster users/admins, kubernetes, by default, merges group information from the Pod with information defined in /etc/group
in the container image.
Let's see an example, below Pod specifies runAsUser=1000
, runAsGroup=3000
and supplementalGroups=4000
in the Pod's security context.
apiVersion: v1
kind: Pod
metadata:
name: implicit-groups
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
supplementalGroups: [4000]
containers:
- name: ctr
image: registry.k8s.io/e2e-test-images/agnhost:2.45
command: [ "sh", "-c", "sleep 1h" ]
securityContext:
allowPrivilegeEscalation: false
What is the result of id
command in the ctr
container?
# Create the Pod:
$ kubectl apply -f https://k8s.io/blog/2024-08-22-Fine-grained-SupplementalGroups-control/implicit-groups.yaml
# Verify that the Pod's Container is running:
$ kubectl get pod implicit-groups
# Check the id command
$ kubectl exec implicit-groups -- id
Then, output should be similar to this:
uid=1000 gid=3000 groups=3000,4000,50000
Where does group ID 50000
in supplementary groups (groups
field) come from, even though 50000
is not defined in the Pod's manifest at all? The answer is /etc/group
file in the container image.
Checking the contents of /etc/group
in the container image should show below:
$ kubectl exec implicit-groups -- cat /etc/group
...
user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image
Aha! The container's primary user 1000
belongs to the group 50000
in the last entry.
Thus, the group membership defined in /etc/group
in the container image for the container's primary user is implicitly merged to the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.
What's wrong with it?
The implicitly merged group information from /etc/group
in the container image may cause some concerns particularly in accessing volumes (see kubernetes/kubernetes#112879 for details) because file permission is controlled by uid/gid in Linux. Even worse, the implicit gids from /etc/group
can not be detected/validated by any policy engines because there is no clue for the implicit group information in the manifest. This can also be a concern for Kubernetes security.
Fine-grained SupplementalGroups control in a Pod: SupplementaryGroupsPolicy
To tackle the above problem, Kubernetes 1.31 introduces new field supplementalGroupsPolicy
in Pod's .spec.securityContext
.
This field provies a way to control how to calculate supplementary groups for the container processes in a Pod. The available policy is below:
-
Merge: The group membership defined in
/etc/group
for the container's primary user will be merged. If not specified, this policy will be applied (i.e. as-is behavior for backword compatibility). -
Strict: it only attaches specified group IDs in
fsGroup
,supplementalGroups
, orrunAsGroup
fields as the supplementary groups of the container processes. This means no group membership defined in/etc/group
for the container's primary user will be merged.
Let's see how Strict
policy works.
apiVersion: v1
kind: Pod
metadata:
name: strict-supplementalgroups-policy
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
supplementalGroups: [4000]
supplementalGroupsPolicy: Strict
containers:
- name: ctr
image: registry.k8s.io/e2e-test-images/agnhost:2.45
command: [ "sh", "-c", "sleep 1h" ]
securityContext:
allowPrivilegeEscalation: false
# Create the Pod:
$ kubectl apply -f https://k8s.io/blog/2024-08-22-Fine-grained-SupplementalGroups-control/strict-supplementalgroups-policy.yaml
# Verify that the Pod's Container is running:
$ kubectl get pod strict-supplementalgroups-policy
# Check the process identity:
kubectl exec -it strict-supplementalgroups-policy -- id
The output should be similar to this:
uid=1000 gid=3000 groups=3000,4000
You can see Strict
policy can exclude group 50000
from groups
!
Thus, ensuring supplementalGroupsPolicy: Strict
(enforced by some policy mechanism) helps prevent the implicit supplementary groups in a Pod.
Note:
Actually, this is not enough because container with sufficient privileges / capability can change its process identity. Please see the following section for details.Attached process identity in Pod status
This feature also exposes the process identity attached to the first container process of the container via .status.containerStatuses[].user.linux
field. It would be helpful to see if implicit group IDs are attached.
...
status:
containerStatuses:
- name: ctr
user:
linux:
gid: 3000
supplementalGroups:
- 3000
- 4000
uid: 1000
...
Note:
Please note that the values instatus.containerStatuses[].user.linux
field is the firstly attached process identity to the first container process in the container. If the container has sufficient privilege to call system calls related to process identity (e.g. setuid(2)
, setgid(2)
or setgroups(2)
, etc.), the container process can change its identity. Thus, the actual process identity will be dynamic.Feature availability
To enable supplementalGroupsPolicy
field, the following components have to be used:
- Kubernetes: v1.31 or later, with the
SupplementalGroupsPolicy
feature gate enabled. As of v1.31, the gate is marked as alpha. - CRI runtime:
- containerd: v2.0 or later
- CRI-O: v1.31 or later
You can see if the feature is supported in the Node's .status.features.supplementalGroupsPolicy
field.
apiVersion: v1
kind: Node
...
status:
features:
supplementalGroupsPolicy: true
What's next?
Kubernetes SIG Node hope - and expect - that the feature will be promoted to beta and eventually general availability (GA) in future releases of Kubernetes, so that users no longer need to enable the feature gate manually.
Merge
policy is applied when supplementalGroupsPolicy
is not specified, for backwards compatibility.
How can I learn more?
- Configure a Security Context for a Pod or Container for the further details of
supplementalGroupsPolicy
- KEP-3619: Fine-grained SupplementalGroups control
How to get involved?
This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!
22 Aug 2024 12:00am GMT
Kubernetes 1.31: Custom Profiling in Kubectl Debug Graduates to Beta
There are many ways of troubleshooting the pods and nodes in the cluster. However, kubectl debug
is one of the easiest, highly used and most prominent ones. It provides a set of static profiles and each profile serves for a different kind of role. For instance, from the network administrator's point of view, debugging the node should be as easy as this:
$ kubectl debug node/mynode -it --image=busybox --profile=netadmin
On the other hand, static profiles also bring about inherent rigidity, which has some implications for some pods contrary to their ease of use. Because there are various kinds of pods (or nodes) that all have their specific necessities, and unfortunately, some can't be debugged by only using the static profiles.
Take an instance of a simple pod consisting of a container whose healthiness relies on an environment variable:
apiVersion: v1
kind: Pod
metadata:
name: example-pod
spec:
containers:
- name: example-container
image: customapp:latest
env:
- name: REQUIRED_ENV_VAR
value: "value1"
Currently, copying the pod is the sole mechanism that supports debugging this pod in kubectl debug. Furthermore, what if user needs to modify the REQUIRED_ENV_VAR
to something different for advanced troubleshooting?. There is no mechanism to achieve this.
Custom Profiling
Custom profiling is a new functionality available under --custom
flag, introduced in kubectl debug to provide extensibility. It expects partial Container
spec in either YAML or JSON format. In order to debug the example-container above by creating an ephemeral container, we simply have to define this YAML:
# partial_container.yaml
env:
- name: REQUIRED_ENV_VAR
value: value2
and execute:
kubectl debug example-pod -it --image=customapp --custom=partial_container.yaml
Here is another example that modifies multiple fields at once (change port number, add resource limits, modify environment variable) in JSON:
{
"ports": [
{
"containerPort": 80
}
],
"resources": {
"limits": {
"cpu": "0.5",
"memory": "512Mi"
},
"requests": {
"cpu": "0.2",
"memory": "256Mi"
}
},
"env": [
{
"name": "REQUIRED_ENV_VAR",
"value": "value2"
}
]
}
Constraints
Uncontrolled extensibility hurts the usability. So that, custom profiling is not allowed for certain fields such as command, image, lifecycle, volume devices and container name. In the future, more fields can be added to the disallowed list if required.
Limitations
The kubectl debug
command has 3 aspects: Debugging with ephemeral containers, pod copying, and node debugging. The largest intersection set of these aspects is the container spec within a Pod That's why, custom profiling only supports the modification of the fields that are defined with containers
. This leads to a limitation that if user needs to modify the other fields in the Pod spec, it is not supported.
Acknowledgments
Special thanks to all the contributors who reviewed and commented on this feature, from the initial conception to its actual implementation (alphabetical order):
22 Aug 2024 12:00am GMT
21 Aug 2024
Kubernetes Blog
Kubernetes 1.31: Autoconfiguration For Node Cgroup Driver (beta)
Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs
and systemd
. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would exit with an error. This was a source of headaches for many cluster admins. However, there is light at the end of the tunnel!
Automated cgroup driver detection
In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI
, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. A few minor releases of Kubernetes happened whilst we waited for support to land in the major two CRI implementations (containerd and CRI-O), but as of v1.31.0, this feature is now beta!
In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:
- containerd: Support was added in v2.0.0
- CRI-O: Support was added in v1.28.0
Then, they should ensure their CRI implementation is configured to the cgroup_driver they would like to use.
Future work
Eventually, support for the kubelet's cgroupDriver
configuration field will be dropped, and the kubelet will fail to start if the CRI implementation isn't new enough to have support for this feature.
21 Aug 2024 12:00am GMT
20 Aug 2024
Kubernetes Blog
Kubernetes 1.31: Streaming Transitions from SPDY to WebSockets
In Kubernetes 1.31, by default kubectl now uses the WebSocket protocol instead of SPDY for streaming.
This post describes what these changes mean for you and why these streaming APIs matter.
Streaming APIs in Kubernetes
In Kubernetes, specific endpoints that are exposed as an HTTP or RESTful interface are upgraded to streaming connections, which require a streaming protocol. Unlike HTTP, which is a request-response protocol, a streaming protocol provides a persistent connection that's bi-directional, low-latency, and lets you interact in real-time. Streaming protocols support reading and writing data between your client and the server, in both directions, over the same connection. This type of connection is useful, for example, when you create a shell in a running container from your local workstation and run commands in the container.
Why change the streaming protocol?
Before the v1.31 release, Kubernetes used the SPDY/3.1 protocol by default when upgrading streaming connections. SPDY/3.1 has been deprecated for eight years, and it was never standardized. Many modern proxies, gateways, and load balancers no longer support the protocol. As a result, you might notice that commands like kubectl cp
, kubectl attach
, kubectl exec
, and kubectl port-forward
stop working when you try to access your cluster through a proxy or gateway.
As of Kubernetes v1.31, SIG API Machinery has modified the streaming protocol that a Kubernetes client (such as kubectl
) uses for these commands to the more modern WebSocket streaming protocol. The WebSocket protocol is a currently supported standardized streaming protocol that guarantees compatibility and interoperability with different components and programming languages. The WebSocket protocol is more widely supported by modern proxies and gateways than SPDY.
How streaming APIs work
Kubernetes upgrades HTTP connections to streaming connections by adding specific upgrade headers to the originating HTTP request. For example, an HTTP upgrade request for running the date
command on an nginx
container within a cluster is similar to the following:
$ kubectl exec -v=8 nginx -- date
GET https://127.0.0.1:43251/api/v1/namespaces/default/pods/nginx/exec?command=date…
Request Headers:
Connection: Upgrade
Upgrade: websocket
Sec-Websocket-Protocol: v5.channel.k8s.io
User-Agent: kubectl/v1.31.0 (linux/amd64) kubernetes/6911225
If the container runtime supports the WebSocket streaming protocol and at least one of the subprotocol versions (e.g. v5.channel.k8s.io
), the server responds with a successful 101 Switching Protocols
status, along with the negotiated subprotocol version:
Response Status: 101 Switching Protocols in 3 milliseconds
Response Headers:
Upgrade: websocket
Connection: Upgrade
Sec-Websocket-Accept: j0/jHW9RpaUoGsUAv97EcKw8jFM=
Sec-Websocket-Protocol: v5.channel.k8s.io
At this point the TCP connection used for the HTTP protocol has changed to a streaming connection. Subsequent STDIN, STDOUT, and STDERR data (as well as terminal resizing data and process exit code data) for this shell interaction is then streamed over this upgraded connection.
How to use the new WebSocket streaming protocol
If your cluster and kubectl are on version 1.29 or later, there are two control plane feature gates and two kubectl environment variables that govern the use of the WebSockets rather than SPDY. In Kubernetes 1.31, all of the following feature gates are in beta and are enabled by default:
- Feature gates
TranslateStreamCloseWebsocketRequests
.../exec
.../attach
PortForwardWebsockets
.../port-forward
- kubectl feature control environment variables
KUBECTL_REMOTE_COMMAND_WEBSOCKETS
kubectl exec
kubectl cp
kubectl attach
KUBECTL_PORT_FORWARD_WEBSOCKETS
kubectl port-forward
If you're connecting to an older cluster but can manage the feature gate settings, turn on both TranslateStreamCloseWebsocketRequests
(added in Kubernetes v1.29) and PortForwardWebsockets
(added in Kubernetes v1.30) to try this new behavior. Version 1.31 of kubectl
can automatically use the new behavior, but you do need to connect to a cluster where the server-side features are explicitly enabled.
Learn more about streaming APIs
- KEP 4006 - Transitioning from SPDY to WebSockets
- RFC 6455 - The WebSockets Protocol
- Container Runtime Interface streaming explained
20 Aug 2024 12:00am GMT
19 Aug 2024
Kubernetes Blog
Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA
This post describes Pod failure policy, which graduates to stable in Kubernetes 1.31, and how to use it in your Jobs.
About Pod failure policy
When you run workloads on Kubernetes, Pods might fail for a variety of reasons. Ideally, workloads like Jobs should be able to ignore transient, retriable failures and continue running to completion.
To allow for these transient failures, Kubernetes Jobs include the backoffLimit
field, which lets you specify a number of Pod failures that you're willing to tolerate during Job execution. However, if you set a large value for the backoffLimit
field and rely solely on this field, you might notice unnecessary increases in operating costs as Pods restart excessively until the backoffLimit is met.
This becomes particularly problematic when running large-scale Jobs with thousands of long-running Pods across thousands of nodes.
The Pod failure policy extends the backoff limit mechanism to help you reduce costs in the following ways:
- Gives you control to fail the Job as soon as a non-retriable Pod failure occurs.
- Allows you to ignore retriable errors without increasing the
backoffLimit
field.
For example, you can use a Pod failure policy to run your workload on more affordable spot machines by ignoring Pod failures caused by graceful node shutdown.
The policy allows you to distinguish between retriable and non-retriable Pod failures based on container exit codes or Pod conditions in a failed Pod.
How it works
You specify a Pod failure policy in the Job specification, represented as a list of rules.
For each rule you define match requirements based on one of the following properties:
- Container exit codes: the
onExitCodes
property. - Pod conditions: the
onPodConditions
property.
Additionally, for each rule, you specify one of the following actions to take when a Pod matches the rule:
Ignore
: Do not count the failure towards thebackoffLimit
orbackoffLimitPerIndex
.FailJob
: Fail the entire Job and terminate all running Pods.FailIndex
: Fail the index corresponding to the failed Pod. This action works with the Backoff limit per index feature.Count
: Count the failure towards thebackoffLimit
orbackoffLimitPerIndex
. This is the default behavior.
When Pod failures occur in a running Job, Kubernetes matches the failed Pod status against the list of Pod failure policy rules, in the specified order, and takes the corresponding actions for the first matched rule.
Note that when specifying the Pod failure policy, you must also set the Job's Pod template with restartPolicy: Never
. This prevents race conditions between the kubelet and the Job controller when counting Pod failures.
Kubernetes-initiated Pod disruptions
To allow matching Pod failure policy rules against failures caused by disruptions initiated by Kubernetes, this feature introduces the DisruptionTarget
Pod condition.
Kubernetes adds this condition to any Pod, regardless of whether it's managed by a Job controller, that fails because of a retriable disruption scenario. The DisruptionTarget
condition contains one of the following reasons that corresponds to these disruption scenarios:
PreemptionByKubeScheduler
: Preemption bykube-scheduler
to accommodate a new Pod that has a higher priority.DeletionByTaintManager
- the Pod is due to be deleted bykube-controller-manager
due to aNoExecute
taint that the Pod doesn't tolerate.EvictionByEvictionAPI
- the Pod is due to be deleted by an API-initiated eviction.DeletionByPodGC
- the Pod is bound to a node that no longer exists, and is due to be deleted by Pod garbage collection.TerminationByKubelet
- the Pod was terminated by graceful node shutdown, node pressure eviction or preemption for system critical pods.
In all other disruption scenarios, like eviction due to exceeding Pod container limits, Pods don't receive the DisruptionTarget
condition because the disruptions were likely caused by the Pod and would reoccur on retry.
Example
The Pod failure policy snippet below demonstrates an example use:
podFailurePolicy:
rules:
- action: Ignore
onPodConditions:
- type: DisruptionTarget
- action: FailJob
onPodConditions:
- type: ConfigIssue
- action: FailJob
onExitCodes:
operator: In
values: [ 42 ]
In this example, the Pod failure policy does the following:
- Ignores any failed Pods that have the built-in
DisruptionTarget
condition. These Pods don't count towards Job backoff limits. - Fails the Job if any failed Pods have the custom user-supplied
ConfigIssue
condition, which was added either by a custom controller or webhook. - Fails the Job if any containers exited with the exit code 42.
- Counts all other Pod failures towards the default
backoffLimit
(orbackoffLimitPerIndex
if used).
Learn more
- For a hands-on guide to using Pod failure policy, see Handling retriable and non-retriable pod failures with Pod failure policy
- Read the documentation for Pod failure policy and Backoff limit per index
- Read the documentation for Pod disruption conditions
- Read the KEP for Pod failure policy
Related work
Based on the concepts introduced by Pod failure policy, the following additional work is in progress:
- JobSet integration: Configurable Failure Policy API
- Pod failure policy extension to add more granular failure reasons
- Support for Pod failure policy via JobSet in Kubeflow Training v2
- Proposal: Disrupted Pods should be removed from endpoints
Get involved
This work was sponsored by batch working group in close collaboration with the SIG Apps, and SIG Node, and SIG Scheduling communities.
If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.
Acknowledgments
I would love to thank everyone who was involved in this project over the years - it's been a journey and a joint community effort! The list below is my best-effort attempt to remember and recognize people who made an impact. Thank you!
- Aldo Culquicondor for guidance and reviews throughout the process
- Jordan Liggitt for KEP and API reviews
- David Eads for API reviews
- Maciej Szulik for KEP reviews from SIG Apps PoV
- Clayton Coleman for guidance and SIG Node reviews
- Sergey Kanzhelev for KEP reviews from SIG Node PoV
- Dawn Chen for KEP reviews from SIG Node PoV
- Daniel Smith for reviews from SIG API machinery PoV
- Antoine Pelisse for reviews from SIG API machinery PoV
- John Belamaric for PRR reviews
- Filip Křepinský for thorough reviews from SIG Apps PoV and bug-fixing
- David Porter for thorough reviews from SIG Node PoV
- Jensen Lo for early requirements discussions, testing and reporting issues
- Daniel Vega-Myhre for advancing JobSet integration and reporting issues
- Abdullah Gharaibeh for early design discussions and guidance
- Antonio Ojea for test reviews
- Yuki Iwai for reviews and aligning implementation of the closely related Job features
- Kevin Hannon for reviews and aligning implementation of the closely related Job features
- Tim Bannister for docs reviews
- Shannon Kularathna for docs reviews
- Paola Cortés for docs reviews
19 Aug 2024 12:00am GMT
16 Aug 2024
Kubernetes Blog
Kubernetes 1.31: Read Only Volumes Based On OCI Artifacts (alpha)
The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future. While the project has been designed to fulfill microservice architectures in the past, it's now time to listen to the end users and introduce features which have a stronger focus on AI/ML.
One of these requirements is to support Open Container Initiative (OCI) compatible images and artifacts (referred as OCI objects) directly as a native volume source. This allows users to focus on OCI standards as well as enables them to store and distribute any content using OCI registries. A feature like this gives the Kubernetes project a chance to grow into use cases which go beyond running particular images.
Given that, the Kubernetes community is proud to present a new alpha feature introduced in v1.31: The Image Volume Source (KEP-4639). This feature allows users to specify an image reference as volume in a pod while reusing it as volume mount within containers:
…
kind: Pod
spec:
containers:
- …
volumeMounts:
- name: my-volume
mountPath: /path/to/directory
volumes:
- name: my-volume
image:
reference: my-image:tag
The above example would result in mounting my-image:tag
to /path/to/directory
in the pod's container.
Use cases
The goal of this enhancement is to stick as close as possible to the existing container image implementation within the kubelet, while introducing a new API surface to allow more extended use cases.
For example, users could share a configuration file among multiple containers in a pod without including the file in the main image, so that they can minimize security risks and the overall image size. They can also package and distribute binary artifacts using OCI images and mount them directly into Kubernetes pods, so that they can streamline their CI/CD pipeline as an example.
Data scientists, MLOps engineers, or AI developers, can mount large language model weights or machine learning model weights in a pod alongside a model-server, so that they can efficiently serve them without including them in the model-server container image. They can package these in an OCI object to take advantage of OCI distribution and ensure efficient model deployment. This allows them to separate the model specifications/content from the executables that process them.
Another use case is that security engineers can use a public image for a malware scanner and mount in a volume of private (commercial) malware signatures, so that they can load those signatures without baking their own combined image (which might not be allowed by the copyright on the public image). Those files work regardless of the OS or version of the scanner software.
But in the long term it will be up to you as an end user of this project to outline further important use cases for the new feature. SIG Node is happy to retrieve any feedback or suggestions for further enhancements to allow more advanced usage scenarios. Feel free to provide feedback by either using the Kubernetes Slack (#sig-node) channel or the SIG Node mailinglist.
Detailed example
The Kubernetes alpha feature gate ImageVolume
needs to be enabled on the API Server as well as the kubelet to make it functional. If that's the case and the container runtime has support for the feature (like CRI-O ≥ v1.31), then an example pod.yaml
like this can be created:
apiVersion: v1
kind: Pod
metadata:
name: pod
spec:
containers:
- name: test
image: registry.k8s.io/e2e-test-images/echoserver:2.3
volumeMounts:
- name: volume
mountPath: /volume
volumes:
- name: volume
image:
reference: quay.io/crio/artifact:v1
pullPolicy: IfNotPresent
The pod declares a new volume using the image.reference
of quay.io/crio/artifact:v1
, which refers to an OCI object containing two files. The pullPolicy
behaves in the same way as for container images and allows the following values:
Always
: the kubelet always attempts to pull the reference and the container creation will fail if the pull fails.Never
: the kubelet never pulls the reference and only uses a local image or artifact. The container creation will fail if the reference isn't present.IfNotPresent
: the kubelet pulls if the reference isn't already present on disk. The container creation will fail if the reference isn't present and the pull fails.
The volumeMounts
field is indicating that the container with the name test
should mount the volume under the path /volume
.
If you now create the pod:
kubectl apply -f pod.yaml
And exec into it:
kubectl exec -it pod -- sh
Then you're able to investigate what has been mounted:
/ # ls /volume
dir file
/ # cat /volume/file
2
/ # ls /volume/dir
file
/ # cat /volume/dir/file
1
You managed to consume an OCI artifact using Kubernetes!
The container runtime pulls the image (or artifact), mounts it to the container and makes it finally available for direct usage. There are a bunch of details in the implementation, which closely align to the existing image pull behavior of the kubelet. For example:
- If a
:latest
tag asreference
is provided, then thepullPolicy
will default toAlways
, while in any other case it will default toIfNotPresent
if unset. - The volume gets re-resolved if the pod gets deleted and recreated, which means that new remote content will become available on pod recreation. A failure to resolve or pull the image during pod startup will block containers from starting and may add significant latency. Failures will be retried using normal volume backoff and will be reported on the pod reason and message.
- Pull secrets will be assembled in the same way as for the container image by looking up node credentials, service account image pull secrets, and pod spec image pull secrets.
- The OCI object gets mounted in a single directory by merging the manifest layers in the same way as for container images.
- The volume is mounted as read-only (
ro
) and non-executable files (noexec
). - Sub-path mounts for containers are not supported (
spec.containers[*].volumeMounts.subpath
). - The field
spec.securityContext.fsGroupChangePolicy
has no effect on this volume type. - The feature will also work with the
AlwaysPullImages
admission plugin if enabled.
Thank you for reading through the end of this blog post! SIG Node is proud and happy to deliver this feature as part of Kubernetes v1.31.
As writer of this blog post, I would like to emphasize my special thanks to all involved individuals out there! You all rock, let's keep on hacking!
Further reading
16 Aug 2024 12:00am GMT
Kubernetes 1.31: Prevent PersistentVolume Leaks When Deleting out of Order
PersistentVolume (or PVs for short) are associated with Reclaim Policy. The reclaim policy is used to determine the actions that need to be taken by the storage backend on deletion of the PVC Bound to a PV. When the reclaim policy is Delete
, the expectation is that the storage backend releases the storage resource allocated for the PV. In essence, the reclaim policy needs to be honored on PV deletion.
With the recent Kubernetes v1.31 release, a beta feature lets you configure your cluster to behave that way and honor the configured reclaim policy.
How did reclaim work in previous Kubernetes releases?
PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PV and PVC are considered Bound if a newly created PV or a matching PV is found. The PVs themselves are backed by volumes allocated by the storage backend.
Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.
First, I'll demonstrate the behavior for clusters running an older version of Kubernetes.
Retrieve a PVC that is bound to a PV
Retrieve an existing PVC example-vanilla-block-pvc
kubectl get pvc example-vanilla-block-pvc
The following output shows the PVC and its bound PV; the PV is shown under the VOLUME
column:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
example-vanilla-block-pvc Bound pvc-6791fdd4-5fad-438e-a7fb-16410363e3da 5Gi RWO example-vanilla-block-sc 19s
Delete PV
When I try to delete a bound PV, the kubectl session blocks and the kubectl
tool does not return back control to the shell; for example:
kubectl delete pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
persistentvolume "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" deleted
^C
Retrieving the PV
kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
It can be observed that the PV is in a Terminating
state
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-6791fdd4-5fad-438e-a7fb-16410363e3da 5Gi RWO Delete Terminating default/example-vanilla-block-pvc example-vanilla-block-sc 2m23s
Delete PVC
kubectl delete pvc example-vanilla-block-pvc
The following output is seen if the PVC gets successfully deleted:
persistentvolumeclaim "example-vanilla-block-pvc" deleted
The PV object from the cluster also gets deleted. When attempting to retrieve the PV it will be observed that the PV is no longer found:
kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
Error from server (NotFound): persistentvolumes "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" not found
Although the PV is deleted, the underlying storage resource is not deleted and needs to be removed manually.
To sum up, the reclaim policy associated with the PersistentVolume is currently ignored under certain circumstances. For a Bound
PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.
PV reclaim policy with Kubernetes v1.31
The new behavior ensures that the underlying storage object is deleted from the backend when users attempt to delete a PV manually.
How to enable new behavior?
To take advantage of the new behavior, you must have upgraded your cluster to the v1.31 release of Kubernetes and run the CSI external-provisioner
version 5.0.1
or later.
How does it work?
For CSI volumes, the new behavior is achieved by adding a finalizer external-provisioner.volume.kubernetes.io/finalizer
on new and existing PVs. The finalizer is only removed after the storage from the backend is deleted. `
An example of a PV with the finalizer, notice the new finalizer in the finalizers list
kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
annotations:
pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
creationTimestamp: "2021-11-17T19:28:56Z"
finalizers:
- kubernetes.io/pv-protection
- external-provisioner.volume.kubernetes.io/finalizer
name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
resourceVersion: "194711"
uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
accessModes:
- ReadWriteOnce
capacity:
storage: 1Gi
claimRef:
apiVersion: v1
kind: PersistentVolumeClaim
name: example-vanilla-block-pvc
namespace: default
resourceVersion: "194677"
uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
csi:
driver: csi.vsphere.vmware.com
fsType: ext4
volumeAttributes:
storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.vsphere.vmware.com
type: vSphere CNS Block Volume
volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
persistentVolumeReclaimPolicy: Delete
storageClassName: example-vanilla-block-sc
volumeMode: Filesystem
status:
phase: Bound
The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.
Similarly, the finalizer kubernetes.io/pv-controller
is added to dynamically provisioned in-tree plugin volumes.
What about CSI migrated volumes?
The fix applies to CSI migrated volumes as well.
Some caveats
The fix does not apply to statically provisioned in-tree plugin volumes.
References
How do I get involved?
The Kubernetes Slack channel SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.
Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution:
- Fan Baofa (carlory)
- Jan Šafránek (jsafrane)
- Xing Yang (xing-yang)
- Matthew Wong (wongma7)
Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We're rapidly growing and always welcome new contributors.
16 Aug 2024 12:00am GMT
Kubernetes 1.31: MatchLabelKeys in PodAffinity graduates to beta
Kubernetes 1.29 introduced new fields matchLabelKeys
and mismatchLabelKeys
in podAffinity
and podAntiAffinity
.
In Kubernetes 1.31, this feature moves to beta and the corresponding feature gate (MatchLabelKeysInPodAffinity
) gets enabled by default.
matchLabelKeys
- Enhanced scheduling for versatile rolling updates
During a workload's (e.g., Deployment) rolling update, a cluster may have Pods from multiple versions at the same time. However, the scheduler cannot distinguish between old and new versions based on the labelSelector
specified in podAffinity
or podAntiAffinity
. As a result, it will co-locate or disperse Pods regardless of their versions.
This can lead to sub-optimal scheduling outcome, for example:
- New version Pods are co-located with old version Pods (
podAffinity
), which will eventually be removed after rolling updates. - Old version Pods are distributed across all available topologies, preventing new version Pods from finding nodes due to
podAntiAffinity
.
matchLabelKeys
is a set of Pod label keys and addresses this problem. The scheduler looks up the values of these keys from the new Pod's labels and combines them with labelSelector
so that podAffinity matches Pods that have the same key-value in labels.
By using label pod-template-hash in matchLabelKeys
, you can ensure that only Pods of the same version are evaluated for podAffinity
or podAntiAffinity
.
apiVersion: apps/v1
kind: Deployment
metadata:
name: application-server
...
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- database
topologyKey: topology.kubernetes.io/zone
matchLabelKeys:
- pod-template-hash
The above matchLabelKeys
will be translated in Pods like:
kind: Pod
metadata:
name: application-server
labels:
pod-template-hash: xyz
...
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- database
- key: pod-template-hash # Added from matchLabelKeys; Only Pods from the same replicaset will match this affinity.
operator: In
values:
- xyz
topologyKey: topology.kubernetes.io/zone
matchLabelKeys:
- pod-template-hash
mismatchLabelKeys
- Service isolation
mismatchLabelKeys
is a set of Pod label keys, like matchLabelKeys
, which looks up the values of these keys from the new Pod's labels, and merge them with labelSelector
as key notin (value)
so that podAffinity
does not match Pods that have the same key-value in labels.
Suppose all Pods for each tenant get tenant
label via a controller or a manifest management tool like Helm.
Although the value of tenant
label is unknown when composing each workload's manifest, the cluster admin wants to achieve exclusive 1:1 tenant to domain placement for a tenant isolation.
mismatchLabelKeys
works for this usecase; By applying the following affinity globally using a mutating webhook, the cluster admin can ensure that the Pods from the same tenant will land on the same domain exclusively, meaning Pods from other tenants won't land on the same domain.
affinity:
podAffinity: # ensures the pods of this tenant land on the same node pool
requiredDuringSchedulingIgnoredDuringExecution:
- matchLabelKeys:
- tenant
topologyKey: node-pool
podAntiAffinity: # ensures only Pods from this tenant lands on the same node pool
requiredDuringSchedulingIgnoredDuringExecution:
- mismatchLabelKeys:
- tenant
labelSelector:
matchExpressions:
- key: tenant
operator: Exists
topologyKey: node-pool
The above matchLabelKeys
and mismatchLabelKeys
will be translated to like:
kind: Pod
metadata:
name: application-server
labels:
tenant: service-a
spec:
affinity:
podAffinity: # ensures the pods of this tenant land on the same node pool
requiredDuringSchedulingIgnoredDuringExecution:
- matchLabelKeys:
- tenant
topologyKey: node-pool
labelSelector:
matchExpressions:
- key: tenant
operator: In
values:
- service-a
podAntiAffinity: # ensures only Pods from this tenant lands on the same node pool
requiredDuringSchedulingIgnoredDuringExecution:
- mismatchLabelKeys:
- tenant
labelSelector:
matchExpressions:
- key: tenant
operator: Exists
- key: tenant
operator: NotIn
values:
- service-a
topologyKey: node-pool
Getting involved
These features are managed by Kubernetes SIG Scheduling.
Please join us and share your feedback. We look forward to hearing from you!
How can I learn more?
- The official document of podAffinity
- KEP-3633: Introduce matchLabelKeys and mismatchLabelKeys to podAffinity and podAntiAffinity
16 Aug 2024 12:00am GMT
15 Aug 2024
Kubernetes Blog
Kubernetes v1.31: Accelerating Cluster Performance with Consistent Reads from Cache
Kubernetes is renowned for its robust orchestration of containerized applications, but as clusters grow, the demands on the control plane can become a bottleneck. A key challenge has been ensuring strongly consistent reads from the etcd datastore, requiring resource-intensive quorum reads.
Today, the Kubernetes community is excited to announce a major improvement: consistent reads from cache, graduating to Beta in Kubernetes v1.31.
Why consistent reads matter
Consistent reads are essential for ensuring that Kubernetes components have an accurate view of the latest cluster state. Guaranteeing consistent reads is crucial for maintaining the accuracy and reliability of Kubernetes operations, enabling components to make informed decisions based on up-to-date information. In large-scale clusters, fetching and processing this data can be a performance bottleneck, especially for requests that involve filtering results. While Kubernetes can filter data by namespace directly within etcd, any other filtering by labels or field selectors requires the entire dataset to be fetched from etcd and then filtered in-memory by the Kubernetes API server. This is particularly impactful for components like the kubelet, which only needs to list pods scheduled to its node - but previously required the API Server and etcd to process all pods in the cluster.
The breakthrough: Caching with confidence
Kubernetes has long used a watch cache to optimize read operations. The watch cache stores a snapshot of the cluster state and receives updates through etcd watches. However, until now, it couldn't serve consistent reads directly, as there was no guarantee the cache was sufficiently up-to-date.
The consistent reads from cache feature addresses this by leveraging etcd's progress notifications mechanism. These notifications inform the watch cache about how current its data is compared to etcd. When a consistent read is requested, the system first checks if the watch cache is up-to-date. If the cache is not up-to-date, the system queries etcd for progress notifications until it's confirmed that the cache is sufficiently fresh. Once ready, the read is efficiently served directly from the cache, which can significantly improve performance, particularly in cases where it would require fetching a lot of data from etcd. This enables requests that filter data to be served from the cache, with only minimal metadata needing to be read from etcd.
Important Note: To benefit from this feature, your Kubernetes cluster must be running etcd version 3.4.31+ or 3.5.13+. For older etcd versions, Kubernetes will automatically fall back to serving consistent reads directly from etcd.
Performance gains you'll notice
This seemingly simple change has a profound impact on Kubernetes performance and scalability:
- Reduced etcd Load: Kubernetes v1.31 can offload work from etcd, freeing up resources for other critical operations.
- Lower Latency: Serving reads from cache is significantly faster than fetching and processing data from etcd. This translates to quicker responses for components, improving overall cluster responsiveness.
- Improved Scalability: Large clusters with thousands of nodes and pods will see the most significant gains, as the reduction in etcd load allows the control plane to handle more requests without sacrificing performance.
5k Node Scalability Test Results: In recent scalability tests on 5,000 node clusters, enabling consistent reads from cache delivered impressive improvements:
- 30% reduction in kube-apiserver CPU usage
- 25% reduction in etcd CPU usage
- Up to 3x reduction (from 5 seconds to 1.5 seconds) in 99th percentile pod LIST request latency
What's next?
With the graduation to beta, consistent reads from cache are enabled by default, offering a seamless performance boost to all Kubernetes users running a supported etcd version.
Our journey doesn't end here. Kubernetes community is actively exploring pagination support in the watch cache, which will unlock even more performance optimizations in the future.
Getting started
Upgrading to Kubernetes v1.31 and ensuring you are using etcd version 3.4.31+ or 3.5.13+ is the easiest way to experience the benefits of consistent reads from cache. If you have any questions or feedback, don't hesitate to reach out to the Kubernetes community.
Let us know how consistent reads from cache transforms your Kubernetes experience!
Special thanks to @ah8ad3 and @p0lyn0mial for their contributions to this feature!
15 Aug 2024 12:00am GMT
Kubernetes 1.31: VolumeAttributesClass for Volume Modification Beta
Volumes in Kubernetes have been described by two attributes: their storage class, and their capacity. The storage class is an immutable property of the volume, while the capacity can be changed dynamically with volume resize.
This complicates vertical scaling of workloads with volumes. While cloud providers and storage vendors often offer volumes which allow specifying IO quality of service (Performance) parameters like IOPS or throughput and tuning them as workloads operate, Kubernetes has no API which allows changing them.
We are pleased to announce that the VolumeAttributesClass KEP, alpha since Kubernetes 1.29, will be beta in 1.31. This provides a generic, Kubernetes-native API for modifying volume parameters like provisioned IO.
Like all new volume features in Kubernetes, this API is implemented via the container storage interface (CSI). In addition to the VolumeAttributesClass feature gate, your provisioner-specific CSI driver must support the new ModifyVolume API which is the CSI side of this feature.
See the full documentation for all details. Here we show the common workflow.
Dynamically modifying volume attributes.
A VolumeAttributesClass
is a cluster-scoped resource that specifies provisioner-specific attributes. These are created by the cluster administrator in the same way as storage classes. For example, a series of gold, silver and bronze volume attribute classes can be created for volumes with greater or lessor amounts of provisioned IO.
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
name: silver
driverName: your-csi-driver
parameters:
provisioned-iops: "500"
provisioned-throughput: "50MiB/s"
---
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
name: gold
driverName: your-csi-driver
parameters:
provisioned-iops: "10000"
provisioned-throughput: "500MiB/s"
An attribute class is added to a PVC in much the same way as a storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-pv-claim
spec:
storageClassName: any-storage-class
volumeAttributesClassName: silver
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 64Gi
Unlike a storage class, the volume attributes class can be changed:
kubectl patch pvc test-pv-claim -p '{"spec": "volumeAttributesClassName": "gold"}'
Kubernetes will work with the CSI driver to update the attributes of the volume. The status of the PVC will track the current and desired attributes class. The PV resource will also be updated with the new volume attributes class which will be set to the currently active attributes of the PV.
Limitations with the beta
As a beta feature, there are still some features which are planned for GA but not yet present. The largest is quota support, see the KEP and discussion in sig-storage for details.
See the Kubernetes CSI driver list for up-to-date information of support for this feature in CSI drivers.
15 Aug 2024 12:00am GMT
14 Aug 2024
Kubernetes Blog
Kubernetes v1.31: PersistentVolume Last Phase Transition Time Moves to GA
Announcing the graduation to General Availability (GA) of the PersistentVolume lastTransitionTime
status field, in Kubernetes v1.31!
The Kubernetes SIG Storage team is excited to announce that the "PersistentVolumeLastPhaseTransitionTime" feature, introduced as an alpha in Kubernetes v1.28, has now reached GA status and is officially part of the Kubernetes v1.31 release. This enhancement helps Kubernetes users understand when a PersistentVolume transitions between different phases, allowing for more efficient and informed resource management.
For a v1.31 cluster, you can now assume that every PersistentVolume object has a .status.lastTransitionTime
field, that holds a timestamp of when the volume last transitioned its phase. This change is not immediate; the new field will be populated whenever a PersistentVolume is updated and first transitions between phases (Pending
, Bound
, or Released
) after upgrading to Kubernetes v1.31.
What changed?
The API strategy for updating PersistentVolume objects has been modified to populate the .status.lastTransitionTime
field with the current timestamp whenever a PersistentVolume transitions phases. Users are allowed to set this field manually if needed, but it will be overwritten when the PersistentVolume transitions phases again.
For more details, read about Phase transition timestamp in the Kubernetes documentation. You can also read the previous blog post announcing the feature as alpha in v1.28.
To provide feedback, join our Kubernetes Storage Special-Interest-Group (SIG) or participate in discussions on our public Slack channel.
14 Aug 2024 12:00am GMT
Kubernetes 1.31: Moving cgroup v1 Support into Maintenance Mode
As Kubernetes continues to evolve and adapt to the changing landscape of container orchestration, the community has decided to move cgroup v1 support into maintenance mode in v1.31. This shift aligns with the broader industry's move towards cgroup v2, offering improved functionalities: including scalability and a more consistent interface. Before we dive into the consequences for Kubernetes, let's take a step back to understand what cgroups are and their significance in Linux.
Understanding cgroups
Control groups, or cgroups, are a Linux kernel feature that allows the allocation, prioritization, denial, and management of system resources (such as CPU, memory, disk I/O, and network bandwidth) among processes. This functionality is crucial for maintaining system performance and ensuring that no single process can monopolize system resources, which is especially important in multi-tenant environments.
There are two versions of cgroups: v1 and v2. While cgroup v1 provided sufficient capabilities for resource management, it had limitations that led to the development of cgroup v2. Cgroup v2 offers a more unified and consistent interface, on top of better resource control features.
Cgroups in Kubernetes
For Linux nodes, Kubernetes relies heavily on cgroups to manage and isolate the resources consumed by containers running in pods. Each container in Kubernetes is placed in its own cgroup, which allows Kubernetes to enforce resource limits, monitor usage, and ensure fair resource distribution among all containers.
How Kubernetes uses cgroups
- Resource Allocation
- Ensures that containers do not exceed their allocated CPU and memory limits.
- Isolation
- Isolates containers from each other to prevent resource contention.
- Monitoring
- Tracks resource usage for each container to provide insights and metrics.
Transitioning to Cgroup v2
The Linux community has been focusing on cgroup v2 for new features and improvements. Major Linux distributions and projects like systemd are transitioning towards cgroup v2. Using cgroup v2 provides several benefits over cgroupv1, such as Unified Hierarchy, Improved Interface, Better Resource Control, cgroup aware OOM killer, rootless support etc.
Given these advantages, Kubernetes is also making the move to embrace cgroup v2 more fully. However, this transition needs to be handled carefully to avoid disrupting existing workloads and to provide a smooth migration path for users.
Moving cgroup v1 support into maintenance mode
What does maintenance mode mean?
When cgroup v1 is placed into maintenance mode in Kubernetes, it means that:
- Feature Freeze: No new features will be added to cgroup v1 support.
- Security Fixes: Critical security fixes will still be provided.
- Best-Effort Bug Fixes: Major bugs may be fixed if feasible, but some issues might remain unresolved.
Why move to maintenance mode?
The move to maintenance mode is driven by the need to stay in line with the broader ecosystem and to encourage the adoption of cgroup v2, which offers better performance, security, and usability. By transitioning cgroup v1 to maintenance mode, Kubernetes can focus on enhancing support for cgroup v2 and ensure it meets the needs of modern workloads. It's important to note that maintenance mode does not mean deprecation; cgroup v1 will continue to receive critical security fixes and major bug fixes as needed.
What this means for cluster administrators
Users currently relying on cgroup v1 are highly encouraged to plan for the transition to cgroup v2. This transition involves:
- Upgrading Systems: Ensuring that the underlying operating systems and container runtimes support cgroup v2.
- Testing Workloads: Verifying that workloads and applications function correctly with cgroup v2.
Further reading
14 Aug 2024 12:00am GMT
13 Aug 2024
Kubernetes Blog
Kubernetes v1.31: Elli
Editors: Matteo Bianchi, Yigit Demirbas, Abigail McCarthy, Edith Puclla, Rashan Smith
Announcing the release of Kubernetes v1.31: Elli!
Similar to previous releases, the release of Kubernetes v1.31 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community. This release consists of 45 enhancements. Of those enhancements, 11 have graduated to Stable, 22 are entering Beta, and 12 have graduated to Alpha.
Release theme and logo
The Kubernetes v1.31 Release Theme is "Elli".
Kubernetes v1.31's Elli is a cute and joyful dog, with a heart of gold and a nice sailor's cap, as a playful wink to the huge and diverse family of Kubernetes contributors.
Kubernetes v1.31 marks the first release after the project has successfully celebrated its first 10 years. Kubernetes has come a very long way since its inception, and it's still moving towards exciting new directions with each release. After 10 years, it is awe-inspiring to reflect on the effort, dedication, skill, wit and tiring work of the countless Kubernetes contributors who have made this a reality.
And yet, despite the herculean effort needed to run the project, there is no shortage of people who show up, time and again, with enthusiasm, smiles and a sense of pride for contributing and being part of the community. This "spirit" that we see from new and old contributors alike is the sign of a vibrant community, a "joyful" community, if we might call it that.
Kubernetes v1.31's Elli is all about celebrating this wonderful spirit! Here's to the next decade of Kubernetes!
Highlights of features graduating to Stable
This is a selection of some of the improvements that are now stable following the v1.31 release.
AppArmor support is now stable
Kubernetes support for AppArmor is now GA. Protect your containers using AppArmor by setting the appArmorProfile.type
field in the container's securityContext
. Note that before Kubernetes v1.30, AppArmor was controlled via annotations; starting in v1.30 it is controlled using fields. It is recommended that you should migrate away from using annotations and start using the appArmorProfile.type
field.
To learn more read the AppArmor tutorial. This work was done as a part of KEP #24, by SIG Node.
Improved ingress connectivity reliability for kube-proxy
Kube-proxy improved ingress connectivity reliability is stable in v1.31. One of the common problems with load balancers in Kubernetes is the synchronization between the different components involved to avoid traffic drop. This feature implements a mechanism in kube-proxy for load balancers to do connection draining for terminating Nodes exposed by services of type: LoadBalancer
and externalTrafficPolicy: Cluster
and establish some best practices for cloud providers and Kubernetes load balancers implementations.
To use this feature, kube-proxy needs to run as default service proxy on the cluster and the load balancer needs to support connection draining. There are no specific changes required for using this feature, it has been enabled by default in kube-proxy since v1.30 and been promoted to stable in v1.31.
For more details about this feature please visit the Virtual IPs and Service Proxies documentation page.
This work was done as part of KEP #3836 by SIG Network.
Persistent Volume last phase transition time
Persistent Volume last phase transition time feature moved to GA in v1.31. This feature adds a PersistentVolumeStatus
field which holds a timestamp of when a PersistentVolume last transitioned to a different phase. With this feature enabled, every PersistentVolume object will have a new field .status.lastTransitionTime
, that holds a timestamp of when the volume last transitioned its phase. This change is not immediate; the new field will be populated whenever a PersistentVolume is updated and first transitions between phases (Pending
, Bound
, or Released
) after upgrading to Kubernetes v1.31. This allows you to measure time between when a PersistentVolume moves from Pending
to Bound
. This can be also useful for providing metrics and SLOs.
For more details about this feature please visit the PersistentVolume documentation page.
This work was done as a part of KEP #3762 by SIG Storage.
Highlights of features graduating to Beta
This is a selection of some of the improvements that are now beta following the v1.31 release.
nftables backend for kube-proxy
The nftables backend moves to beta in v1.31, behind the NFTablesProxyMode
feature gate which is now enabled by default.
The nftables API is the successor to the iptables API and is designed to provide better performance and scalability than iptables. The nftables
proxy mode is able to process changes to service endpoints faster and more efficiently than the iptables
mode, and is also able to more efficiently process packets in the kernel (though this only becomes noticeable in clusters with tens of thousands of services).
As of Kubernetes v1.31, the nftables
mode is still relatively new, and may not be compatible with all network plugins; consult the documentation for your network plugin. This proxy mode is only available on Linux nodes, and requires kernel 5.13 or later. Before migrating, note that some features, especially around NodePort services, are not implemented exactly the same in nftables mode as they are in iptables mode. Check the migration guide to see if you need to override the default configuration.
This work was done as part of KEP #3866 by SIG Network.
Changes to reclaim policy for PersistentVolumes
The Always Honor PersistentVolume Reclaim Policy feature has advanced to beta in Kubernetes v1.31. This enhancement ensures that the PersistentVolume (PV) reclaim policy is respected even after the associated PersistentVolumeClaim (PVC) is deleted, thereby preventing the leakage of volumes.
Prior to this feature, the reclaim policy linked to a PV could be disregarded under specific conditions, depending on whether the PV or PVC was deleted first. Consequently, the corresponding storage resource in the external infrastructure might not be removed, even if the reclaim policy was set to "Delete". This led to potential inconsistencies and resource leaks.
With the introduction of this feature, Kubernetes now guarantees that the "Delete" reclaim policy will be enforced, ensuring the deletion of the underlying storage object from the backend infrastructure, regardless of the deletion sequence of the PV and PVC.
This work was done as a part of KEP #2644 and by SIG Storage.
Bound service account token improvements
The ServiceAccountTokenNodeBinding
feature is promoted to beta in v1.31. This feature allows requesting a token bound only to a node, not to a pod, which includes node information in claims in the token and validates the existence of the node when the token is used. For more information, read the bound service account tokens documentation.
This work was done as part of KEP #4193 by SIG Auth.
Multiple Service CIDRs
Support for clusters with multiple Service CIDRs moves to beta in v1.31 (disabled by default).
There are multiple components in a Kubernetes cluster that consume IP addresses: Nodes, Pods and Services. Nodes and Pods IP ranges can be dynamically changed because depend on the infrastructure or the network plugin respectively. However, Services IP ranges are defined during the cluster creation as a hardcoded flag in the kube-apiserver. IP exhaustion has been a problem for long lived or large clusters, as admins needed to expand, shrink or even replace entirely the assigned Service CIDR range. These operations were never supported natively and were performed via complex and delicate maintenance operations, often causing downtime on their clusters. This new feature allows users and cluster admins to dynamically modify Service CIDR ranges with zero downtime.
For more details about this feature please visit the Virtual IPs and Service Proxies documentation page.
This work was done as part of KEP #1880 by SIG Network.
Traffic distribution for Services
Traffic distribution for Services moves to beta in v1.31 and is enabled by default.
After several iterations on finding the best user experience and traffic engineering capabilities for Services networking, SIG Networking implemented the trafficDistribution
field in the Service specification, which serves as a guideline for the underlying implementation to consider while making routing decisions.
For more details about this feature please read the 1.30 Release Blog or visit the Service documentation page.
This work was done as part of KEP #4444 by SIG Network.
Kubernetes VolumeAttributesClass ModifyVolume
VolumeAttributesClass API is moving to beta in v1.31. The VolumeAttributesClass provides a generic, Kubernetes-native API for modifying dynamically volume parameters like provisioned IO. This allows workloads to vertically scale their volumes on-line to balance cost and performance, if supported by their provider. This feature had been alpha since Kubernetes 1.29.
This work was done as a part of KEP #3751 and lead by SIG Storage.
New features in Alpha
This is a selection of some of the improvements that are now alpha following the v1.31 release.
New DRA APIs for better accelerators and other hardware management
Kubernetes v1.31 brings an updated dynamic resource allocation (DRA) API and design. The main focus in the update is on structured parameters because they make resource information and requests transparent to Kubernetes and clients and enable implementing features like cluster autoscaling. DRA support in the kubelet was updated such that version skew between kubelet and the control plane is possible. With structured parameters, the scheduler allocates ResourceClaims while scheduling a pod. Allocation by a DRA driver controller is still supported through what is now called "classic DRA".
With Kubernetes v1.31, classic DRA has a separate feature gate named DRAControlPlaneController
, which you need to enable explicitly. With such a control plane controller, a DRA driver can implement allocation policies that are not supported yet through structured parameters.
This work was done as part of KEP #3063 by SIG Node.
Support for image volumes
The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future.
One of the requirements to fulfill these use cases is to support Open Container Initiative (OCI) compatible images and artifacts (referred as OCI objects) directly as a native volume source. This allows users to focus on OCI standards as well as enables them to store and distribute any content using OCI registries.
Given that, v1.31 adds a new alpha feature to allow using an OCI image as a volume in a Pod. This feature allows users to specify an image reference as volume in a pod while reusing it as volume mount within containers. You need to enable the ImageVolume
feature gate to try this out.
This work was done as part of KEP #4639 by SIG Node and SIG Storage.
Exposing device health information through Pod status
Expose device health information through the Pod Status is added as a new alpha feature in v1.31, disabled by default.
Before Kubernetes v1.31, the way to know whether or not a Pod is associated with the failed device is to use the PodResources API.
By enabling this feature, the field allocatedResourcesStatus
will be added to each container status, within the .status
for each Pod. The allocatedResourcesStatus
field reports health information for each device assigned to the container.
This work was done as part of KEP #4680 by SIG Node.
Finer-grained authorization based on selectors
This feature allows webhook authorizers and future (but not currently designed) in-tree authorizers to allow list and watch requests, provided those requests use label and/or field selectors. For example, it is now possible for an authorizer to express: this user cannot list all pods, but can list all pods where .spec.nodeName
matches some specific value. Or to allow a user to watch all Secrets in a namespace that are not labelled as confidential: true
. Combined with CRD field selectors (also moving to beta in v1.31), it is possible to write more secure per-node extensions.
This work was done as part of KEP #4601 by SIG Auth.
Restrictions on anonymous API access
By enabling the feature gate AnonymousAuthConfigurableEndpoints
users can now use the authentication configuration file to configure the endpoints that can be accessed by anonymous requests. This allows users to protect themselves against RBAC misconfigurations that can give anonymous users broad access to the cluster.
This work was done as a part of KEP #4633 and by SIG Auth.
Graduations, deprecations, and removals in 1.31
Graduations to Stable
This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.
This release includes a total of 11 enhancements promoted to Stable:
- PersistentVolume last phase transition time
- Metric cardinality enforcement
- Kube-proxy improved ingress connectivity reliability
- Add CDI devices to device plugin API
- Move cgroup v1 support into maintenance mode
- AppArmor support
- PodHealthyPolicy for PodDisruptionBudget
- Retriable and non-retriable Pod failures for Jobs
- Elastic Indexed Jobs
- Allow StatefulSet to control start replica ordinal numbering
- Random Pod selection on ReplicaSet downscaling
Deprecations and Removals
As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones for the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process.
Cgroup v1 enters the maintenance mode
As Kubernetes continues to evolve and adapt to the changing landscape of container orchestration, the community has decided to move cgroup v1 support into maintenance mode in v1.31. This shift aligns with the broader industry's move towards cgroup v2, offering improved functionality, scalability, and a more consistent interface. Kubernetes maintance mode means that no new features will be added to cgroup v1 support. Critical security fixes will still be provided, however, bug-fixing is now best-effort, meaning major bugs may be fixed if feasible, but some issues might remain unresolved.
It is recommended that you start switching to use cgroup v2 as soon as possible. This transition depends on your architecture, including ensuring the underlying operating systems and container runtimes support cgroup v2 and testing workloads to verify that workloads and applications function correctly with cgroup v2.
Please report any problems you encounter by filing an issue.
This work was done as part of KEP #4569 by SIG Node.
A note about SHA-1 signature support
In go1.18 (released in March 2022), the crypto/x509 library started to reject certificates signed with a SHA-1 hash function. While SHA-1 is established to be unsafe and publicly trusted Certificate Authorities have not issued SHA-1 certificates since 2015, there might still be cases in the context of Kubernetes where user-provided certificates are signed using a SHA-1 hash function through private authorities with them being used for Aggregated API Servers or webhooks. If you have relied on SHA-1 based certificates, you must explicitly opt back into its support by setting GODEBUG=x509sha1=1
in your environment.
Given Go's compatibility policy for GODEBUGs, the x509sha1
GODEBUG and the support for SHA-1 certificates will fully go away in go1.24 which will be released in the first half of 2025. If you rely on SHA-1 certificates, please start moving off them.
Please see Kubernetes issue #125689 to get a better idea of timelines around the support for SHA-1 going away, when Kubernetes releases plans to adopt go1.24, and for more details on how to detect usage of SHA-1 certificates via metrics and audit logging.
Deprecation of status.nodeInfo.kubeProxyVersion
field for Nodes (KEP 4004)
The .status.nodeInfo.kubeProxyVersion
field of Nodes has been deprecated in Kubernetes v1.31, and will be removed in a later release. It's being deprecated because the value of this field wasn't (and isn't) accurate. This field is set by the kubelet, which does not have reliable information about the kube-proxy version or whether kube-proxy is running.
The DisableNodeKubeProxyVersion
feature gate will be set to true
in by default in v1.31 and the kubelet will no longer attempt to set the .status.kubeProxyVersion
field for its associated Node.
Removal of all in-tree integrations with cloud providers
As highlighted in a previous article, the last remaining in-tree support for cloud provider integration has been removed as part of the v1.31 release. This doesn't mean you can't integrate with a cloud provider, however you now must use the recommended approach using an external integration. Some integrations are part of the Kubernetes project and others are third party software.
This milestone marks the completion of the externalization process for all cloud providers' integrations from the Kubernetes core (KEP-2395), a process started with Kubernetes v1.26. This change helps Kubernetes to get closer to being a truly vendor-neutral platform.
For further details on the cloud provider integrations, read our v1.29 Cloud Provider Integrations feature blog. For additional context about the in-tree code removal, we invite you to check the (v1.29 deprecation blog).
The latter blog also contains useful information for users who need to migrate to version v1.29 and later.
Removal of in-tree provider feature gates
In Kubernetes v1.31, the following alpha feature gates InTreePluginAWSUnregister
, InTreePluginAzureDiskUnregister
, InTreePluginAzureFileUnregister
, InTreePluginGCEUnregister
, InTreePluginOpenStackUnregister
, and InTreePluginvSphereUnregister
have been removed. These feature gates were introduced to facilitate the testing of scenarios where in-tree volume plugins were removed from the codebase, without actually removing them. Since Kubernetes 1.30 had deprecated these in-tree volume plugins, these feature gates were redundant and no longer served a purpose. The only CSI migration gate still standing is InTreePluginPortworxUnregister
, which will remain in alpha until the CSI migration for Portworx is completed and its in-tree volume plugin will be ready for removal.
Removal of kubelet --keep-terminated-pod-volumes
command line flag
The kubelet flag --keep-terminated-pod-volumes
, which was deprecated in 2017, has been removed as part of the v1.31 release.
You can find more details in the pull request #122082.
Removal of CephFS volume plugin
CephFS volume plugin was removed in this release and the cephfs
volume type became non-functional.
It is recommended that you use the CephFS CSI driver as a third-party storage driver instead. If you were using the CephFS volume plugin before upgrading the cluster version to v1.31, you must re-deploy your application to use the new driver.
CephFS volume plugin was formally marked as deprecated in v1.28.
Removal of Ceph RBD volume plugin
The v1.31 release removes the Ceph RBD volume plugin and its CSI migration support, making the rbd
volume type non-functional.
It's recommended that you use the RBD CSI driver in your clusters instead. If you were using Ceph RBD volume plugin before upgrading the cluster version to v1.31, you must re-deploy your application to use the new driver.
The Ceph RBD volume plugin was formally marked as deprecated in v1.28.
Deprecation of non-CSI volume limit plugins in kube-scheduler
The v1.31 release will deprecate all non-CSI volume limit scheduler plugins, and will remove some already deprected plugins from the default plugins, including:
AzureDiskLimits
CinderLimits
EBSLimits
GCEPDLimits
It's recommended that you use the NodeVolumeLimits
plugin instead because it can handle the same functionality as the removed plugins since those volume types have been migrated to CSI. Please replace the deprecated plugins with the NodeVolumeLimits
plugin if you explicitly use them in the scheduler config. The AzureDiskLimits
, CinderLimits
, EBSLimits
, and GCEPDLimits
plugins will be removed in a future release.
These plugins will be removed from the default scheduler plugins list as they have been deprecated since Kubernetes v1.14.
Release notes and upgrade actions required
Check out the full details of the Kubernetes v1.31 release in our release notes.
Scheduler now uses QueueingHint when SchedulerQueueingHints
is enabled
Added support to the scheduler to start using a QueueingHint registered for Pod/Updated events, to determine whether updates to previously unschedulable Pods have made them schedulable. The new support is active when the feature gate SchedulerQueueingHints
is enabled.
Previously, when unschedulable Pods were updated, the scheduler always put Pods back to into a queue (activeQ
/ backoffQ
). However not all updates to Pods make Pods schedulable, especially considering many scheduling constraints nowadays are immutable. Under the new behaviour, once unschedulable Pods are updated, the scheduling queue checks with QueueingHint(s) whether the update may make the pod(s) schedulable, and requeues them to activeQ
or backoffQ
only when at least one QueueingHint returns Queue
.
Action required for custom scheduler plugin developers: Plugins have to implement a QueueingHint for Pod/Update event if the rejection from them could be resolved by updating unscheduled Pods themselves. Example: suppose you develop a custom plugin that denies Pods that have a schedulable=false
label. Given Pods with a schedulable=false
label will be schedulable if the schedulable=false
label is removed, this plugin would implement QueueingHint for Pod/Update event that returns Queue when such label changes are made in unscheduled Pods. You can find more details in the pull request #122234.
Removal of kubelet --keep-terminated-pod-volumes command line flag
The kubelet flag --keep-terminated-pod-volumes
, which was deprecated in 2017, was removed as part of the v1.31 release.
You can find more details in the pull request #122082.
Availability
Kubernetes v1.31 is available for download on GitHub or on the Kubernetes download page.
To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.31 using kubeadm.
Release team
Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.
We would like to thank the entire release team for the hours spent hard at work to deliver the Kubernetes v1.31 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out our release lead, Angelos Kolaitis, for supporting us through a successful release cycle, advocating for us, making sure that we could all contribute in the best way possible, and challenging us to improve the release process.
Project velocity
The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.
In the v1.31 release cycle, which ran for 14 weeks (May 7th to August 13th), we saw contributions to Kubernetes from 113 different companies and 528 individuals.
In the whole Cloud Native ecosystem we have 379 companies counting 2268 total contributors - which means that respect to the previous release cycle we experienced an astounding 63% increase on individuals contributing!
Source for this data:
By contribution we mean when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.
If you are interested in contributing visit this page to get started.
Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.
Event update
Explore the upcoming Kubernetes and cloud-native events from August to November 2024, featuring KubeCon, KCD, and other notable conferences worldwide. Stay informed and engage with the Kubernetes community.
August 2024
- KubeCon + CloudNativeCon + Open Source Summit China 2024: August 21-23, 2024 | Hong Kong
- KubeDay Japan: August 27, 2024 | Tokyo, Japan
September 2024
- KCD Lahore - Pakistan 2024: September 1, 2024 | Lahore, Pakistan
- KuberTENes Birthday Bash Stockholm: September 5, 2024 | Stockholm, Sweden
- KCD Sydney '24: September 5-6, 2024 | Sydney, Australia
- KCD Washington DC 2024: September 24, 2024 | Washington, DC, United States
- KCD Porto 2024: September 27-28, 2024 | Porto, Portugal
October 2024
- KCD Austria 2024: October 8-10, 2024 | Wien, Austria
- KubeDay Australia: October 15, 2024 | Melbourne, Australia
- KCD UK - London 2024: October 22-23, 2024 | Greater London, United Kingdom
November 2024
- KubeCon + CloudNativeCon North America 2024: November 12-15, 2024 | Salt Lake City, United States
- Kubernetes on EDGE Day North America: November 12, 2024 | Salt Lake City, United States
Upcoming release webinar
Join members of the Kubernetes v1.31 release team on Thursday, Thu Sep 12, 2024 10am PT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.
Get involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.
- Follow us on X @Kubernetesio for latest updates
- Join the community discussion on Discuss
- Join the community on Slack
- Post questions (or answer questions) on Stack Overflow
- Share your Kubernetes story
- Read more about what's happening with Kubernetes on the blog
- Learn more about the Kubernetes Release Team
13 Aug 2024 12:00am GMT
12 Aug 2024
Kubernetes Blog
Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control
Kubernetes components use on-off switches called feature gates to manage the risk of adding a new feature. The feature gate mechanism is what enables incremental graduation of a feature through the stages Alpha, Beta, and GA.
Kubernetes components, such as kube-controller-manager and kube-scheduler, use the client-go library to interact with the API. The same library is used across the Kubernetes ecosystem to build controllers, tools, webhooks, and more. client-go now includes its own feature gating mechanism, giving developers and cluster administrators more control over how they adopt client features.
To learn more about feature gates in Kubernetes, visit Feature Gates.
Motivation
In the absence of client-go feature gates, each new feature separated feature availability from enablement in its own way, if at all. Some features were enabled by updating to a newer version of client-go. Others needed to be actively configured in each program that used them. A few were configurable at runtime using environment variables. Consuming a feature-gated functionality exposed by the kube-apiserver sometimes required a client-side fallback mechanism to remain compatible with servers that don't support the functionality due to their age or configuration. In cases where issues were discovered in these fallback mechanisms, mitigation required updating to a fixed version of client-go or rolling back.
None of these approaches offer good support for enabling a feature by default in some, but not all, programs that consume client-go. Instead of enabling a new feature at first only for a single component, a change in the default setting immediately affects the default for all Kubernetes components, which broadens the blast radius significantly.
Feature gates in client-go
To address these challenges, substantial client-go features will be phased in using the new feature gate mechanism. It will allow developers and users to enable or disable features in a way that will be familiar to anyone who has experience with feature gates in the Kubernetes components.
Out of the box, simply by using a recent version of client-go, this offers several benefits.
For people who use software built with client-go:
- Early adopters can enable a default-off client-go feature on a per-process basis.
- Misbehaving features can be disabled without building a new binary.
- The state of all known client-go feature gates is logged, allowing users to inspect it.
For people who develop software built with client-go:
- By default, client-go feature gate overrides are read from environment variables. If a bug is found in a client-go feature, users will be able to disable it without waiting for a new release.
- Developers can replace the default environment-variable-based overrides in a program to change defaults, read overrides from another source, or disable runtime overrides completely. The Kubernetes components use this customizability to integrate client-go feature gates with the existing
--feature-gates
command-line flag, feature enablement metrics, and logging.
Overriding client-go feature gates
Note: This describes the default method for overriding client-go feature gates at runtime. It can be disabled or customized by the developer of a particular program. In Kubernetes components, client-go feature gate overrides are controlled by the --feature-gates
flag.
Features of client-go can be enabled or disabled by setting environment variables prefixed with KUBE_FEATURE
. For example, to enable a feature named MyFeature
, set the environment variable as follows:
KUBE_FEATURE_MyFeature=true
To disable the feature, set the environment variable to false
:
KUBE_FEATURE_MyFeature=false
Note: Environment variables are case-sensitive on some operating systems. Therefore, KUBE_FEATURE_MyFeature
and KUBE_FEATURE_MYFEATURE
would be considered two different variables.
Customizing client-go feature gates
The default environment-variable based mechanism for feature gate overrides can be sufficient for many programs in the Kubernetes ecosystem, and requires no special integration. Programs that require different behavior can replace it with their own custom feature gate provider. This allows a program to do things like force-disable a feature that is known to work poorly, read feature gates directly from a remote configuration service, or accept feature gate overrides through command-line options.
The Kubernetes components replace client-go's default feature gate provider with a shim to the existing Kubernetes feature gate provider. For all practical purposes, client-go feature gates are treated the same as other Kubernetes feature gates: they are wired to the --feature-gates
command-line flag, included in feature enablement metrics, and logged on startup.
To replace the default feature gate provider, implement the Gates interface and call ReplaceFeatureGates at package initialization time, as in this simple example:
import (
"k8s.io/client-go/features"
)
type AlwaysEnabledGates struct{}
func (AlwaysEnabledGates) Enabled(features.Feature) bool {
return true
}
func init() {
features.ReplaceFeatureGates(AlwaysEnabledGates{})
}
Implementations that need the complete list of defined client-go features can get it by implementing the Registry interface and calling AddFeaturesToExistingFeatureGates
. For a complete example, refer to the usage within Kubernetes.
Summary
With the introduction of feature gates in client-go v1.30, rolling out a new client-go feature has become safer and easier. Users and developers can control the pace of their own adoption of client-go features. The work of Kubernetes contributors is streamlined by having a common mechanism for graduating features that span both sides of the Kubernetes API boundary.
Special shoutout to @sttts and @deads2k for their help in shaping this feature.
12 Aug 2024 12:00am GMT