21 Nov 2024
Kubernetes Blog
Gateway API v1.2: WebSockets, Timeouts, Retries, and More
Kubernetes SIG Network is delighted to announce the general availability of Gateway API v1.2! This version of the API was released on October 3, and we're delighted to report that we now have a number of conformant implementations of it for you to try out.
Gateway API v1.2 brings a number of new features to the Standard channel (Gateway API's GA release channel), introduces some new experimental features, and inaugurates our new release process - but it also brings two breaking changes that you'll want to be careful of.
Breaking changes
GRPCRoute and ReferenceGrant v1alpha2 removal
Now that the v1 versions of GRPCRoute and ReferenceGrant have graduated to Standard, the old v1alpha2 versions have been removed from both the Standard and Experimental channels, in order to ease the maintenance burden that perpetually supporting the old versions would place on the Gateway API community.
Before upgrading to Gateway API v1.2, you'll want to confirm that any implementations of Gateway API have been upgraded to support the v1 API version of these resources instead of the v1alpha2 API version. Note that even if you've been using v1 in your YAML manifests, a controller may still be using v1alpha2 which would cause it to fail during this upgrade. Additionally, Kubernetes itself goes to some effort to stop you from removing a CRD version that it thinks you're using: check out the release notes for more information about what you need to do to safely upgrade.
Change to .status.supportedFeatures (experimental)
A much smaller breaking change: .status.supportedFeatures in a Gateway is now a list of objects instead of a list of strings. The objects have a single name field, so the translation from the strings is straightforward, but moving to objects permits a lot more flexibility for the future. This stanza is not yet present in the Standard channel.
Graduations to the standard channel
Gateway API 1.2.0 graduates four features to the Standard channel, meaning that they can now be considered generally available. Inclusion in the Standard release channel denotes a high level of confidence in the API surface and provides guarantees of backward compatibility. Of course, as with any other Kubernetes API, Standard channel features can continue to evolve with backward-compatible additions over time, and we certainly expect further refinements and improvements to these new features in the future. For more information on how all of this works, refer to the Gateway API Versioning Policy.
HTTPRoute timeouts
GEP-1742 introduced the timeouts stanza into HTTPRoute, permitting configuring basic timeouts for HTTP traffic. This is a simple but important feature for proper resilience when handling HTTP traffic, and it is now Standard.
For example, this HTTPRoute configuration sets a timeout of 300ms for traffic to the /face path:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: face-with-timeouts
  namespace: faces
spec:
  parentRefs:
  - name: my-gateway
    kind: Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /face
    backendRefs:
    - name: face
      port: 80
    timeouts:
      request: 300ms
For more information, check out the HTTP routing documentation. (Note that this applies only to HTTPRoute timeouts. GRPCRoute timeouts are not yet part of Gateway API.)
Gateway infrastructure labels and annotations
Gateway API implementations are responsible for creating the backing infrastructure needed to make each Gateway work. For example, implementations running in a Kubernetes cluster often create Services and Deployments, while cloud-based implementations may be creating cloud load balancer resources. In many cases, it can be helpful to be able to propagate labels or annotations to these generated resources.
In v1.2.0, the Gateway infrastructure stanza moves to the Standard channel, allowing you to specify labels and annotations for the infrastructure created by the Gateway API controller. For example, if your Gateway infrastructure is running in-cluster, you can specify both Linkerd and Istio injection using the following Gateway configuration, making it simpler for the infrastructure to be incorporated into whichever service mesh you've installed:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: meshed-gateway
  namespace: incoming
spec:
  gatewayClassName: meshed-gateway-class
  listeners:
  - name: http-listener
    protocol: HTTP
    port: 80
  infrastructure:
    labels:
      istio-injection: enabled
    annotations:
      linkerd.io/inject: enabled
For more information, check out the infrastructure API reference.
Backend protocol support
Since Kubernetes v1.20, the Service and EndpointSlice resources have supported a stable appProtocol field to allow users to specify the L7 protocol that the Service supports. With the adoption of KEP-3726, Kubernetes now supports three new appProtocol values:
- kubernetes.io/h2c - HTTP/2 over cleartext, as described in RFC 7540
- kubernetes.io/ws - WebSocket over cleartext, as described in RFC 6455
- kubernetes.io/wss - WebSocket over TLS, as described in RFC 6455
With Gateway API 1.2.0, support for honoring appProtocol is now Standard. For example, given the following Service:
apiVersion: v1
kind: Service
metadata:
  name: websocket-service
  namespace: my-namespace
spec:
  selector:
    app.kubernetes.io/name: websocket-app
  ports:
  - name: http
    port: 80
    targetPort: 9376
    protocol: TCP
    appProtocol: kubernetes.io/ws
then an HTTPRoute that includes this Service as a backendRef will automatically upgrade the connection to use WebSockets rather than assuming that the connection is pure HTTP.
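For illustration, here's a minimal sketch of such a route (the route name is illustrative, and the Gateway reference reuses my-gateway from the earlier examples); note that nothing WebSocket-specific is needed on the HTTPRoute itself, since the Service's appProtocol carries that information:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: websocket-route   # illustrative name
  namespace: my-namespace
spec:
  parentRefs:
  - name: my-gateway      # assumed Gateway from the earlier examples
    kind: Gateway
  rules:
  - backendRefs:
    - name: websocket-service   # the Service defined above
      port: 80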
For more information, check out GEP-1911.
New additions to experimental channel
Named rules for *Route resources
The rules field in HTTPRoute and GRPCRoute resources can now be named, in order to make it easier to reference the specific rule, for example:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: multi-color-route
  namespace: faces
spec:
  parentRefs:
  - name: my-gateway
    kind: Gateway
    port: 80
  rules:
  - name: center-rule
    matches:
    - path:
        type: PathPrefix
        value: /color/center
    backendRefs:
    - name: color-center
      port: 80
  - name: edge-rule
    matches:
    - path:
        type: PathPrefix
        value: /color/edge
    backendRefs:
    - name: color-edge
      port: 80
Logging or status messages can now refer to these two rules as center-rule or edge-rule instead of being forced to refer to them by index. For more information, see GEP-995.
HTTPRoute retry support
Gateway API 1.2.0 introduces experimental support for counted HTTPRoute retries. For example, the following HTTPRoute configuration retries requests to the /face path up to 3 times with a 500ms delay between retries:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: face-with-retries
  namespace: faces
spec:
  parentRefs:
  - name: my-gateway
    kind: Gateway
    port: 80
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /face
    backendRefs:
    - name: face
      port: 80
    retry:
      codes: [ 500, 502, 503, 504 ]
      attempts: 3
      backoff: 500ms
For more information, check out GEP-1731.
HTTPRoute percentage-based mirroring
Gateway API has long supported the Request Mirroring feature, which allows sending the same request to multiple backends. In Gateway API 1.2.0, we're introducing percentage-based mirroring, which allows you to specify a percentage of requests to mirror to a different backend. For example, the following HTTPRoute configuration mirrors 42% of requests to the color-mirror
backend:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: color-mirror-route
  namespace: faces
spec:
  parentRefs:
  - name: mirror-gateway
  hostnames:
  - mirror.example
  rules:
  - backendRefs:
    - name: color
      port: 80
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: color-mirror
          port: 80
        percent: 42 # This value must be an integer.
There's also a fraction stanza which can be used in place of percent, to allow for more precise control over exactly what amount of traffic is mirrored, for example:
...
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: color-mirror
          port: 80
        fraction:
          numerator: 1
          denominator: 10000
This configuration mirrors 1 in 10,000 requests to the color-mirror backend, which may be relevant with very high request rates. For more details, see GEP-3171.
Additional backend TLS configuration
This release includes three additions related to TLS configuration for communications between a Gateway and a workload (a backend):
- A new backendTLS field on Gateway: this new field allows you to specify the client certificate that a Gateway should use when connecting to backends.
- A new subjectAltNames field on BackendTLSPolicy: previously, the hostname field was used to configure both the SNI that a Gateway should send to a backend and the identity that should be provided by a certificate. When the new subjectAltNames field is specified, any certificate matching at least one of the specified SANs will be considered valid. This is particularly critical for SPIFFE, where URI-based SANs may not be valid SNIs.
- A new options field on BackendTLSPolicy: similar to the TLS options field on Gateway Listeners, we believe the same concept will be broadly useful for TLS-specific configuration for backend TLS.
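To make the shape of these fields concrete, here's a sketch of a BackendTLSPolicy that combines them. This assumes the Experimental channel's v1alpha3 API; the policy name, target Service, and SPIFFE ID are purely illustrative, and the exact field layout may change while the feature remains experimental:
apiVersion: gateway.networking.k8s.io/v1alpha3
kind: BackendTLSPolicy
metadata:
  name: color-backend-tls      # illustrative name
  namespace: faces
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: color
  validation:
    hostname: color.faces.svc.cluster.local
    subjectAltNames:           # any one match makes the certificate valid
    - type: URI
      uri: spiffe://cluster.local/ns/faces/sa/color   # illustrative SPIFFE ID
    wellKnownCACertificates: System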
For more information, check out GEP-3135.
More changes
For a full list of the changes included in this release, please refer to the v1.2.0 release notes.
Project updates
Beyond the technical, the v1.2 release also marks a few milestones in the life of the Gateway API project itself.
Release process improvements
Gateway API has never been intended to be a static API, and as more projects use it as a component to build on, it's become clear that we need to bring some more predictability to Gateway API releases. To that end, we're pleased - and a little nervous! - to announce that we've formalized a new release process:
- Scoping (4-6 weeks): maintainers and community determine the set of features we want to include in the release. A particular emphasis here is getting features out of the Experimental channel - ideally this involves moving them to Standard, but it can also mean removing them.
- GEP Iteration and Review (5-7 weeks): contributors write or update Gateway Enhancement Proposals (GEPs) for features accepted into the release, with emphasis on getting consensus around the design and graduation criteria of the feature.
- API Refinement and Documentation (3-5 weeks): contributors implement the features in the Gateway API controllers and write the necessary documentation.
- SIG Network Review and Release Candidates (2-4 weeks): maintainers get the required upstream review, build release candidates, and release the new version.
Gateway API 1.2.0 was the first release to use the new process, and although there are the usual rough edges of anything new, we believe that it went well. We've already completed the Scoping phase for Gateway API 1.3, with the release expected around the end of January 2025.
gwctl moves out
The gwctl CLI tool has moved into its very own repository: https://github.com/kubernetes-sigs/gwctl. gwctl has proven a valuable tool for the Gateway API community; moving it into its own repository will, we believe, make it easier to maintain and develop. As always, we welcome contributions; while still experimental, gwctl already helps make working with Gateway API a bit easier - especially for newcomers to the project!
Maintainer changes
Rounding out our changes to the project itself, we're pleased to announce that Mattia Lavacca has joined the ranks of Gateway API Maintainers! We're also sad to announce that Keith Mattix has stepped down as a GAMMA lead - happily, Mike Morris has returned to the role. We're grateful for everything Keith has done, and excited to have Mattia and Mike on board.
Try it out
Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.
To try out the API, follow our Getting Started Guide. As of this writing, five implementations are already conformant with Gateway API v1.2. In alphabetical order:
- Cilium v1.17.0-pre.1, Experimental channel
- Envoy Gateway v1.2.0-rc.1, Experimental channel
- Istio v1.24.0-alpha.0, Experimental channel
- Kong v3.2.0-244-gea4944bb0, Experimental channel
- Traefik v3.2, Experimental channel
Get involved
There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.
- Check out the user guides to see what use-cases can be addressed.
- Try out one of the existing Gateway controllers.
- Or join us in the community and help us build the future of Gateway API together!
The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have gotten this far without the support of this dedicated and active community.
Related Kubernetes blog articles
- Gateway API v1.1: Service mesh, GRPCRoute, and a whole lot more
- New Experimental Features in Gateway API v1.0 11/2023
- Gateway API v1.0: GA Release 10/2023
- Introducing ingress2gateway; Simplifying Upgrades to Gateway API 10/2023
- Gateway API v0.8.0: Introducing Service Mesh Support 08/2023
21 Nov 2024 5:00pm GMT
How we built a dynamic Kubernetes API Server for the API Aggregation Layer in Cozystack
Hi there! I'm Andrei Kvapil, but you might know me as @kvaps in communities dedicated to Kubernetes and cloud-native tools. In this article, I want to share how we implemented our own extension api-server in the open-source PaaS platform, Cozystack.
Kubernetes truly amazes me with its powerful extensibility features. You're probably already familiar with the controller concept and frameworks like kubebuilder and operator-sdk that help you implement it. In a nutshell, they allow you to extend your Kubernetes cluster by defining custom resources (CRDs) and writing additional controllers that handle your business logic for reconciling and managing these kinds of resources. This approach is well-documented, with a wealth of information available online on how to develop your own operators.
However, this is not the only way to extend the Kubernetes API. For more complex scenarios, such as implementing imperative logic, managing subresources, or dynamically generating responses, the Kubernetes API aggregation layer provides an effective alternative. Through the aggregation layer, you can develop a custom extension API server and seamlessly integrate it within the broader Kubernetes API framework.
In this article, I will explore the API aggregation layer, the types of challenges it is well-suited to address, cases where it may be less appropriate, and how we utilized this model to implement our own extension API server in Cozystack.
What Is the API Aggregation Layer?
First, let's get definitions straight to avoid any confusion down the road. The API aggregation layer is a feature in Kubernetes, while an extension api-server is a specific implementation of an API server for the aggregation layer. An extension API server is just like the standard Kubernetes API server, except it runs separately and handles requests for your specific resource types.
So, the aggregation layer lets you write your own extension API server, integrate it easily into Kubernetes, and directly process requests for resources in a certain group. Unlike the CRD mechanism, the extension API is registered in Kubernetes as an APIService, telling Kubernetes to consider this new API server and acknowledge that it serves certain APIs.
You can execute this command to list all registered apiservices:
kubectl get apiservices.apiregistration.k8s.io
Example APIService:
NAME                         SERVICE                     AVAILABLE   AGE
v1alpha1.apps.cozystack.io   cozy-system/cozystack-api   True        7h29m
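The registration behind that entry is an APIService object. A minimal sketch of what it might look like is below; the priority values and the CA bundle placeholder are assumptions, not taken from Cozystack's actual manifest:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.apps.cozystack.io
spec:
  group: apps.cozystack.io
  version: v1alpha1
  groupPriorityMinimum: 1000   # assumed values
  versionPriority: 15
  service:
    name: cozystack-api        # the extension api-server's Service
    namespace: cozy-system
  caBundle: <base64-encoded CA certificate>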
As soon as the Kubernetes api-server receives requests for resources in the apps.cozystack.io/v1alpha1 group, it redirects all those requests to our extension api-server, which can handle them based on the business logic we've built into it.
When to use the API Aggregation Layer
The API Aggregation Layer helps solve several issues where the usual CRD mechanism might not be enough. Let's break them down.
Imperative Logic and Subresources
Besides regular resources, Kubernetes also has something called subresources.
In Kubernetes, subresources are additional actions or operations you can perform on primary resources (like Pods, Deployments, Services) via the Kubernetes API. They provide interfaces to manage specific aspects of resources without affecting the entire object.
A simple example is status, which is traditionally exposed as a separate subresource that you can access independently from the parent object. The status field isn't meant to be changed directly by users; it's updated by controllers to reflect the observed state of the object.
But beyond /status, Pods in Kubernetes also have subresources like /exec, /portforward, and /log. Interestingly, instead of the usual declarative resources in Kubernetes, these represent endpoints for imperative operations like viewing logs, proxying connections, executing commands in a running container, and so on.
To support such imperative commands in your own API, you need to implement an extension API and an extension API server. Here are some well-known examples:
- KubeVirt: an add-on for Kubernetes that extends its API capabilities to run traditional virtual machines. The extension api-server created as part of KubeVirt handles subresources like /restart, /console, and /vnc for virtual machines.
- Knative: a Kubernetes add-on that extends its capabilities for serverless computing, implementing the /scale subresource to set up autoscaling for its resource types.
By the way, even though subresource logic in Kubernetes can be imperative, you can manage access to subresources declaratively using Kubernetes' standard RBAC model.
For example, this is how you can control access to the /log and /exec subresources of the Pod kind:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: pod-and-pod-logs-reader
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
You aren't tied to etcd
Usually, the Kubernetes API server uses etcd as its backend. However, implementing your own API server doesn't lock you into using etcd. If it doesn't make sense to store your server's state in etcd, you can keep the information in any other system and generate responses on the fly. Here are a few cases to illustrate:
- metrics-server is a standard extension for Kubernetes which allows you to view real-time metrics of your nodes and pods. It defines alternative Pod and Node kinds in its own metrics.k8s.io API. Requests to these resources are translated into metrics fetched directly from the Kubelet: when you run kubectl top node or kubectl top pod, metrics-server queries cAdvisor in real time and returns the result to you. Since the information is generated in real time and is only relevant at the moment of the request, there is no need to store it in etcd. This approach saves resources.
- If needed, you can use a backend other than etcd, and you can even implement a Kubernetes-compatible API for it. For example, if you use Postgres, you can create a transparent representation of its entities in the Kubernetes API: databases, users, and grants within Postgres would appear as regular Kubernetes resources, thanks to your extension API server. You could manage them using kubectl or any other Kubernetes-compatible tool. Unlike controllers, which implement business logic using custom resources and reconciliation methods, an extension API server eliminates the need for separate controllers for every kind. This means you don't have to sync state between the Kubernetes API and your backend.
One-Time resources
- Kubernetes has a special API used to provide users with information about their permissions: the SelfSubjectAccessReview API. One unusual detail of these resources is that you can't view them using the get or list verbs; you can only create them (using the create verb) and receive output with information about what you have access to at that moment. If you try to run kubectl get selfsubjectaccessreviews directly, you'll just get an error like this:

Error from server (MethodNotAllowed): the server does not allow this method on the requested resource

The reason is that the Kubernetes API server doesn't support any other interaction with this type of resource (you can only CREATE them). The SelfSubjectAccessReview API supports commands such as:

kubectl auth can-i create deployments --namespace dev

When you run the command above, kubectl creates a SelfSubjectAccessReview using the Kubernetes API (see the sketch after this list). This allows Kubernetes to fetch a list of possible permissions for your user, and it then generates a personalized response to your request in real time. This logic is different from a scenario where the resource is simply stored in etcd.
- Similarly, in KubeVirt's CDI (Containerized Data Importer) extension, which allows file uploads into a PVC from a local machine using the virtctl tool, a special token is required before the upload process begins. This token is generated by creating an UploadTokenRequest resource via the Kubernetes API. Kubernetes routes (proxies) all UploadTokenRequest creation requests to the CDI extension API server, which generates and returns the token in response.
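As a sketch of the create-only interaction described in the first item above, this is roughly the object that kubectl auth can-i submits on your behalf (the exact body is illustrative; the API group and kind are the standard Kubernetes ones):
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
spec:
  resourceAttributes:
    namespace: dev
    verb: create
    group: apps
    resource: deployments
Creating this object (for example, with kubectl create -f ssar.yaml -o yaml) returns it immediately with a populated status.allowed field; nothing is persisted.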
Full control over conversion, validation, and output formatting
- Your own API server can have all the capabilities of the vanilla Kubernetes API server. The resources you create in your API server can be validated immediately on the server side without additional webhooks. While CRDs also support server-side validation using Common Expression Language (CEL) for declarative validation and ValidatingAdmissionPolicies without the need for webhooks, a custom API server allows for more complex and tailored validation logic if needed.
- Kubernetes allows you to serve multiple API versions for each resource type, traditionally v1alpha1, v1beta1, and v1. Only one version can be specified as the storage version. All requests to other versions must be automatically converted to the version specified as the storage version. With CRDs, this mechanism is implemented using conversion webhooks. Whereas in an extension API server, you can implement your own conversion mechanism, choose to mix up different storage versions (one object might be serialized as v1, another as v2), or rely on an external backing API.
- Directly implementing the Kubernetes API lets you format table output however you like and doesn't force you to follow the additionalPrinterColumns logic in CRDs. Instead, you can write your own formatter that formats the table output and custom fields in it. For example, when using additionalPrinterColumns, you can display field values only following the JSONPath logic. In your own API server, you can generate and insert values on the fly, formatting the table output as you wish.
Dynamic resource registration
- The resources served by an extension api-server don't need to be pre-registered as CRDs. Once your extension API server is registered using an APIService, Kubernetes starts polling it to discover APIs and resources it can serve. After receiving a discovery response, the Kubernetes API server automatically registers all available types for this API group. Although this isn't considered common practice, you can implement logic that dynamically registers the resource types you need in your Kubernetes cluster.
When not to use the API Aggregation Layer
There are some anti-patterns where using the API Aggregation Layer isn't recommended. Let's go through them.
Unstable backend
If your API server stops responding, whether due to an unavailable backend or other issues, it may block some Kubernetes functionality. For example, when deleting namespaces, Kubernetes will wait for a response from your API server to check whether there are any remaining resources. If the response doesn't come, the namespace deletion will be blocked.
Also, you might have encountered a situation where, when the metrics-server is unavailable, an extra message appears in stderr after every API request (even unrelated to metrics) stating that metrics.k8s.io is unavailable. This is another example of how using the API Aggregation Layer can lead to problems when the api-server handling requests is unavailable.
Slow requests
If you can't guarantee an instant response for user requests, it's better to consider using a CustomResourceDefinition and controller. Otherwise, you might make your cluster less stable. Many projects implement an extension API server only for a limited set of resources, particularly for imperative logic and subresources. This recommendation is also mentioned in the official Kubernetes documentation.
Why we needed it in Cozystack
As a reminder, we're developing the open-source PaaS platform Cozystack, which can also be used as a framework for building your own private cloud. Therefore, the ability to easily extend the platform is crucial for us.
Cozystack is built on top of FluxCD. Every application is packaged into its own Helm chart, ready for deployment in a tenant namespace. Deploying any application on the platform is done by creating a HelmRelease resource that specifies the chart name and the parameters for the application. All remaining logic is handled by FluxCD. This pattern allows us to easily extend the platform with new applications: adding one just requires packaging it into the appropriate Helm chart.
So, in our platform, everything is configured as HelmRelease resources. However, we ran into two problems: limitations of the RBAC model and the need for a public API. Let's delve into these.
Limitations of the RBAC model
The widely-deployed RBAC system in Kubernetes doesn't allow you to restrict access to a list of resources of the same kind based on labels or specific fields in the spec. When creating a role, you can limit access to resources of the same kind only by enumerating specific resource names in resourceNames. This works for verbs like get or update; however, filtering by resourceNames doesn't work for the list verb. Thus, you can limit listing to certain kinds but not to specific names.
So, we decided to introduce new resource types based on the names of the Helm charts they use, and to generate the list of available kinds dynamically at runtime in our extension api-server. This way, we can reuse Kubernetes' standard RBAC model to manage access to specific resource types.
Need for a public API
Since our platform provides capabilities for deploying various managed services, we want to organize public access to the platform's API. However, we can't allow users to interact directly with resources like HelmRelease because that would let them specify arbitrary names and parameters for Helm charts to deploy, potentially compromising our system.
We wanted to give users the ability to deploy a specific service simply by creating a resource of the corresponding kind in Kubernetes. The type of this resource should be named the same as the chart from which it's deployed. Here are some examples:
- kind: Kubernetes → chart: kubernetes
- kind: Postgres → chart: postgres
- kind: Redis → chart: redis
- kind: VirtualMachine → chart: virtual-machine
Moreover, we don't want to have to add a new type to codegen and recompile our extension API server every time we add a new chart for it to start being served. The schema update should be done dynamically or provided via a ConfigMap by the administrator.
Two-Way conversion
Currently, we already have integrations and a dashboard that continue to use HelmRelease resources. At this stage, we didn't want to lose the ability to support this API. Considering that we're simply translating one resource into another, support is maintained and it works both ways. If you create a HelmRelease, you'll get a custom resource in Kubernetes, and if you create a custom resource in Kubernetes, it will also be available as a HelmRelease.
We don't have any additional controllers that synchronize state between these resources. All requests to resources in our extension API server are transparently proxied to HelmRelease and vice versa. This eliminates intermediate states and the need to write controllers and synchronization logic.
Implementation
To implement the Aggregation API, you might consider starting with the following projects:
- apiserver-builder: Currently in alpha and hasn't been updated for two years. It works like kubebuilder, providing a framework for creating an extension API server, allowing you to sequentially create a project structure and generate code for your resources.
- sample-apiserver: A ready-made example of an implemented API server, based on official Kubernetes libraries, which you can use as a foundation for your project.
For practical reasons, we chose the second project. Here's what we needed to do:
Disable etcd support
In our case, we don't need it since all resources are stored directly in the Kubernetes API.
You can disable etcd options by passing nil to RecommendedOptions.Etcd:
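A minimal sketch of that change, using the option names from the sample-apiserver skeleton (the surrounding wiring is assumed):

// defaultEtcdPathPrefix and Codecs come from the sample-apiserver skeleton.
o := genericoptions.NewRecommendedOptions(
	defaultEtcdPathPrefix,
	Codecs.LegacyCodec(v1alpha1.SchemeGroupVersion),
)

// Passing nil disables the etcd storage options entirely: the generic
// apiserver library skips etcd flag registration and storage setup
// when Etcd is nil.
o.Etcd = nil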
Generate a common resource kind
We called it Application, and it looks like this:
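The original post includes the Go definition here; as a stand-in, a minimal sketch of such a generic kind might look like this (field names are assumptions, not Cozystack's exact code; deepcopy methods would be generated as usual):

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// Application is the single generic kind behind every chart-backed resource
// type; the concrete Kind (Postgres, Redis, ...) is assigned at registration.
type Application struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ApplicationSpec   `json:"spec,omitempty"`
	Status ApplicationStatus `json:"status,omitempty"`
}

// ApplicationSpec carries free-form chart values, passed through to Helm.
type ApplicationSpec struct {
	Values runtime.RawExtension `json:"values,omitempty"`
}

// ApplicationStatus mirrors the readiness of the underlying HelmRelease.
type ApplicationStatus struct {
	Version string `json:"version,omitempty"`
	Ready   bool   `json:"ready,omitempty"`
}

// ApplicationList is the corresponding list type.
type ApplicationList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []Application `json:"items"`
}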
This is a generic type used for any application type, and its handling logic is the same for all charts.
Set up configuration loading
Since we want to configure our extension api-server via a config file, we formed the config structure in Go:
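A sketch of what that structure might look like (the field names are assumptions based on this article's description):

// Config is loaded from a file at startup and lists the resource
// types that the extension api-server should serve.
type Config struct {
	Resources []Resource `yaml:"resources"`
}

// Resource maps one dynamically registered Kind onto a Helm chart.
type Resource struct {
	Kind   string `yaml:"kind"`   // e.g. "Postgres"
	Chart  string `yaml:"chart"`  // e.g. "postgres"
	Prefix string `yaml:"prefix"` // HelmRelease name prefix, e.g. "postgres-"
}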
We also modified the resource registration logic so that the resources we create are registered in the scheme with different Kind values:
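A sketch of that registration loop, reusing the hypothetical Config and Application types from above (AddKnownTypeWithName is the standard runtime.Scheme API for registering a type under an arbitrary Kind):

// Register the same generic Go type under every configured Kind, so the
// scheme serves e.g. Postgres and Redis with the same Application struct.
gv := schema.GroupVersion{Group: "apps.cozystack.io", Version: "v1alpha1"}
for _, r := range cfg.Resources {
	scheme.AddKnownTypeWithName(gv.WithKind(r.Kind), &Application{})
	scheme.AddKnownTypeWithName(gv.WithKind(r.Kind+"List"), &ApplicationList{})
}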
As a result, we got a config where you can pass all possible types and specify what they should map to:
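The file itself might then look something like this (the format and key names are assumptions; the kind-to-chart pairs are the real ones listed earlier):

resources:
- kind: Kubernetes
  chart: kubernetes
  prefix: kubernetes-
- kind: Postgres
  chart: postgres
  prefix: postgres-
- kind: Redis
  chart: redis
  prefix: redis-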
Implement our own registry
To avoid storing state in etcd and instead translate it directly into Kubernetes HelmRelease resources (and vice versa), we wrote conversion functions from Application to HelmRelease and from HelmRelease to Application:
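A sketch of the forward direction, using the fluxcd helm-controller v2beta1 API and the hypothetical types from above (the name-prefix scheme matches the HelmRelease names shown later in this article):

import (
	helmv2 "github.com/fluxcd/helm-controller/api/v2beta1"
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// appToHelmRelease converts an Application into the HelmRelease that FluxCD
// will reconcile: the configured prefix is prepended to the name, the chart
// comes from the config entry, and the free-form values pass straight through.
func appToHelmRelease(app *Application, r Resource) *helmv2.HelmRelease {
	return &helmv2.HelmRelease{
		ObjectMeta: metav1.ObjectMeta{
			Name:      r.Prefix + app.Name, // e.g. "redis-" + "test"
			Namespace: app.Namespace,
		},
		Spec: helmv2.HelmReleaseSpec{
			Chart: helmv2.HelmChartTemplate{
				Spec: helmv2.HelmChartTemplateSpec{
					Chart: r.Chart, // e.g. "redis"
				},
			},
			Values: &apiextensionsv1.JSON{Raw: app.Spec.Values.Raw},
		},
	}
}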
We implemented logic to filter resources by chart name, sourceRef, and prefix in the HelmRelease name:
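A sketch of that filter (the HelmRepository source kind is an assumption about the setup, not something stated in this article):

// matchesConfig decides whether a HelmRelease should be surfaced as an
// Application: its chart, source kind, and name prefix must all match a
// configured entry.
func matchesConfig(hr *helmv2.HelmRelease, r Resource) bool {
	spec := hr.Spec.Chart.Spec
	return spec.Chart == r.Chart &&
		spec.SourceRef.Kind == "HelmRepository" &&
		strings.HasPrefix(hr.Name, r.Prefix)
}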
Then, using this logic, we implemented the Get(), Delete(), List(), and Create() methods.
You can see the full example here:
At the end of each method, we set the correct Kind and return an unstructured.Unstructured{} object so that Kubernetes serializes the object correctly. Otherwise, it would always serialize objects with kind: Application, which we don't want.
What did we achieve?
In Cozystack, all our types from the ConfigMap are now available in Kubernetes as-is:
kubectl api-resources | grep cozystack
buckets           apps.cozystack.io/v1alpha1   true   Bucket
clickhouses       apps.cozystack.io/v1alpha1   true   ClickHouse
etcds             apps.cozystack.io/v1alpha1   true   Etcd
ferretdb          apps.cozystack.io/v1alpha1   true   FerretDB
httpcaches        apps.cozystack.io/v1alpha1   true   HTTPCache
ingresses         apps.cozystack.io/v1alpha1   true   Ingress
kafkas            apps.cozystack.io/v1alpha1   true   Kafka
kuberneteses      apps.cozystack.io/v1alpha1   true   Kubernetes
monitorings       apps.cozystack.io/v1alpha1   true   Monitoring
mysqls            apps.cozystack.io/v1alpha1   true   MySQL
natses            apps.cozystack.io/v1alpha1   true   NATS
postgreses        apps.cozystack.io/v1alpha1   true   Postgres
rabbitmqs         apps.cozystack.io/v1alpha1   true   RabbitMQ
redises           apps.cozystack.io/v1alpha1   true   Redis
seaweedfses       apps.cozystack.io/v1alpha1   true   SeaweedFS
tcpbalancers      apps.cozystack.io/v1alpha1   true   TCPBalancer
tenants           apps.cozystack.io/v1alpha1   true   Tenant
virtualmachines   apps.cozystack.io/v1alpha1   true   VirtualMachine
vmdisks           apps.cozystack.io/v1alpha1   true   VMDisk
vminstances       apps.cozystack.io/v1alpha1   true   VMInstance
vpns              apps.cozystack.io/v1alpha1   true   VPN
We can work with them just like regular Kubernetes resources.
Listing S3 Buckets:
kubectl get buckets.apps.cozystack.io -n tenant-kvaps
Example output:
NAME       READY   AGE   VERSION
foo        True    22h   0.1.0
testaasd   True    27h   0.1.0
Listing Kubernetes Clusters:
kubectl get kuberneteses.apps.cozystack.io -n tenant-kvaps
Example output:
NAME    READY   AGE   VERSION
abc     False   19h   0.14.0
asdte   True    22h   0.13.0
Listing Virtual Machine Disks:
kubectl get vmdisks.apps.cozystack.io -n tenant-kvaps
Example output:
NAME             READY   AGE   VERSION
docker           True    21d   0.1.0
test             True    18d   0.1.0
win2k25-iso      True    21d   0.1.0
win2k25-system   True    21d   0.1.0
Listing Virtual Machine Instances:
kubectl get vminstances.apps.cozystack.io -n tenant-kvaps
Example output:
NAME      READY   AGE   VERSION
docker    True    21d   0.1.0
test      True    18d   0.1.0
win2k25   True    20d   0.1.0
We can create, modify, and delete each of them, and any interaction with them will be translated into HelmRelease resources, with the configured resource structure and name prefix applied.
To see all related Helm releases:
kubectl get helmreleases -n tenant-kvaps -l cozystack.io/ui
Example output:
NAME                     AGE   READY
bucket-foo               22h   True
bucket-testaasd          27h   True
kubernetes-abc           19h   False
kubernetes-asdte         22h   True
redis-test               18d   True
redis-yttt               12d   True
vm-disk-docker           21d   True
vm-disk-test             18d   True
vm-disk-win2k25-iso      21d   True
vm-disk-win2k25-system   21d   True
vm-instance-docker       21d   True
vm-instance-test         18d   True
vm-instance-win2k25      20d   True
Next Steps
We don't intend to stop here with our API. In the future, we plan to add new features:
- Add validation based on an OpenAPI spec generated directly from Helm charts.
- Develop a controller that collects release notes from deployed releases and shows users access information for specific services.
- Revamp our dashboard to work directly with the new API.
Conclusion
The API Aggregation Layer allowed us to quickly and efficiently solve our problem by providing a flexible mechanism for extending the Kubernetes API with dynamically registered resources and converting them on the fly. Ultimately, this made our platform even more flexible and extensible without the need to write code for each new resource.
You can test the API yourself in the open-source PaaS platform Cozystack, starting from version v0.18.
21 Nov 2024 12:00am GMT
08 Nov 2024
Kubernetes Blog
Kubernetes v1.32 sneak peek
As we get closer to the release date for Kubernetes v1.32, the project develops and matures. Features may be deprecated, removed, or replaced with better ones for the project's overall health.
This blog outlines some of the planned changes for the Kubernetes v1.32 release, that the release team feels you should be aware of, for the continued maintenance of your Kubernetes environment and keeping up to date with the latest changes. Information listed below is based on the current status of the v1.32 release and may change before the actual release date.
The Kubernetes API removal and deprecation process
The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that API is available and that APIs have a minimum lifetime for each stability level. A deprecated API that has been marked for removal in a future Kubernetes release will continue to function until removal (at least one year from the deprecation), and its usage will result in a warning being displayed. Removed APIs are no longer available in the current version, so you must migrate to the replacement instead.
- Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.
- Beta or pre-release API versions must be supported for 3 releases after the deprecation.
- Alpha or experimental API versions may be removed in any release without prior deprecation notice; this process can become a withdrawal in cases where a different implementation for the same feature is already in place.
Whether an API is removed due to a feature graduating from beta to stable or because that API did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the deprecation guide.
Note on the withdrawal of the old DRA implementation
The enhancement #3063 introduced Dynamic Resource Allocation (DRA) in Kubernetes 1.26.
However, in Kubernetes v1.32, this approach to DRA will be significantly changed. Code related to the original implementation will be removed, leaving KEP #4381 as the "new" base functionality.
The decision to change the existing approach originated from its incompatibility with cluster autoscaling: resource availability was not transparent, complicating decision-making for both the Cluster Autoscaler and controllers. The newly added Structured Parameters model substitutes the original functionality.
This removal will allow Kubernetes to handle new hardware requirements and resource claims more predictably, bypassing the complexities of back-and-forth API calls to the kube-apiserver.
Please also see the enhancement issue #3063 to find out more.
API removal
There is only a single API removal planned for Kubernetes v1.32:
- The flowcontrol.apiserver.k8s.io/v1beta3 API version of FlowSchema and PriorityLevelConfiguration has been removed. To prepare for this, you can edit your existing manifests and rewrite client software to use the flowcontrol.apiserver.k8s.io/v1 API version, available since v1.29. All existing persisted objects are accessible via the new API. Notable changes in flowcontrol.apiserver.k8s.io/v1 include that the PriorityLevelConfiguration spec.limited.nominalConcurrencyShares field only defaults to 30 when unspecified, and an explicit value of 0 is not changed to 30.
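For illustration, a minimal PriorityLevelConfiguration under the v1 API might look like this (the name and values are hypothetical):
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: example-priority-level   # hypothetical name
spec:
  type: Limited
  limited:
    # In v1 this defaults to 30 only when unspecified; an explicit 0 stays 0.
    nominalConcurrencyShares: 30
    limitResponse:
      type: Reject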
For more information, please refer to the API deprecation guide.
Sneak peek of Kubernetes v1.32
The following list of enhancements is likely to be included in the v1.32 release. This is not a commitment and the release content is subject to change.
Even more DRA enhancements!
In this release, like the previous one, the Kubernetes project continues to propose a number of enhancements to Dynamic Resource Allocation (DRA), a key component of the Kubernetes resource management system. These enhancements aim to improve the flexibility and efficiency of resource allocation for workloads that require specialized hardware, such as GPUs, FPGAs, and network adapters. This release introduces improvements including the addition of resource health status in the Pod status, as outlined in KEP #4680.
Add resource health status to the Pod status
It isn't easy to know when a Pod uses a device that has failed or is temporarily unhealthy. KEP #4680 proposes exposing device health via the Pod status, making troubleshooting of Pod crashes easier.
Windows strikes back!
KEP #4802 adds support for graceful shutdowns of Windows nodes in Kubernetes clusters. Before this release, Kubernetes provided graceful node shutdown functionality for Linux nodes but lacked equivalent support for Windows. This enhancement enables the kubelet on Windows nodes to handle system shutdown events properly, ensuring that Pods running on Windows nodes are gracefully terminated and workloads can be rescheduled without disruption. This improvement enhances the reliability and stability of clusters that include Windows nodes, especially during planned maintenance or system updates.
Allow special characters in environment variables
With the graduation of this enhancement to beta, Kubernetes now allows almost all printable ASCII characters (excluding "=") to be used as environment variable names. This change addresses the limitations previously imposed on variable naming, facilitating a broader adoption of Kubernetes by accommodating various application needs. The relaxed validation will be enabled by default via the RelaxedEnvironmentVariableValidation feature gate, ensuring that users can easily utilize environment variables without strict constraints, enhancing flexibility for developers working with applications like .NET Core that require special characters in their configurations.
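For example, a Pod could now pass a .NET-style configuration key, whose name contains a colon, directly as an environment variable (a minimal sketch; the names and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: dotnet-app   # illustrative name
spec:
  containers:
  - name: app
    image: example.com/dotnet-app:latest   # illustrative image
    env:
    - name: "ConnectionStrings:DefaultConnection"   # ':' was previously rejected
      value: "Server=db;Database=app"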
Make Kubernetes aware of the LoadBalancer behavior
KEP #1860 graduates to GA, introducing the ipMode field for a Service of type: LoadBalancer, which can be set to either "VIP" or "Proxy". This enhancement is aimed at improving how cloud providers' load balancers interact with kube-proxy, and it is a change transparent to the end user. The existing behavior of kube-proxy is preserved when using "VIP", where kube-proxy handles the load balancing. Using "Proxy" results in traffic being sent directly to the load balancer, giving cloud providers greater control without relying on kube-proxy; this means that you could see an improvement in the performance of your load balancer for some cloud providers.
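The field lives in the Service's status, where the cloud provider's controller sets it; a sketch of what that looks like (the address and value are illustrative):
# Excerpt of a Service of type LoadBalancer, as populated by the cloud provider
status:
  loadBalancer:
    ingress:
    - ip: 203.0.113.10   # example address
      ipMode: Proxy      # or VIP, the pre-existing default behavior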
Retry generate name for resources
This enhancement improves how name conflicts are handled for Kubernetes resources created with the generateName field. Previously, if a name conflict occurred, the API server returned a 409 HTTP Conflict error and clients had to manually retry the request. With this update, the API server automatically retries generating a new name up to seven times in case of a conflict. This significantly reduces the chances of collision, ensuring smooth generation of up to 1 million names with less than a 0.1% probability of a conflict, providing more resilience for large-scale workloads.
Want to know more?
New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.32 as part of the CHANGELOG for this release.
You can also see announcements of changes in the release notes for previous releases.
08 Nov 2024 12:00am GMT
28 Oct 2024
Kubernetes Blog
Spotlight on Kubernetes Upstream Training in Japan
We are organizers of Kubernetes Upstream Training in Japan. Our team is composed of members who actively contribute to Kubernetes, including individuals who hold roles such as member, reviewer, approver, and chair.
Our goal is to increase the number of Kubernetes contributors and foster the growth of the community. While the Kubernetes community is friendly and collaborative, newcomers may find the first step of contributing a bit challenging. Our training program aims to lower that barrier and create an environment where even beginners can participate smoothly.
What is Kubernetes upstream training in Japan?
Our training started in 2019 and is held 1 to 2 times a year. Initially, Kubernetes Upstream Training was conducted as a co-located event of KubeCon (Kubernetes Contributor Summit), but we launched Kubernetes Upstream Training in Japan with the aim of increasing Japanese contributors by hosting a similar event in Japan.
Before the pandemic, the training was held in person, but since 2020, it has been conducted online. The training offers the following content for those who have not yet contributed to Kubernetes:
- Introduction to Kubernetes community
- Overview of Kubernetes codebase and how to create your first PR
- Tips and encouragement to lower participation barriers, such as language
- How to set up the development environment
- Hands-on session using kubernetes-sigs/contributor-playground
At the beginning of the program, we explain why contributing to Kubernetes is important and who can contribute. We emphasize that contributing to Kubernetes allows you to make a global impact and that the Kubernetes community is looking forward to your contributions!
We also explain the Kubernetes community, SIGs, and Working Groups. Next, we explain the roles and responsibilities of Member, Reviewer, Approver, Tech Lead, and Chair. Additionally, we introduce the communication tools we primarily use, such as Slack, GitHub, and mailing lists. Some Japanese speakers may feel that communicating in English is a barrier, and those who are new to the community need to understand where and how communication takes place. We emphasize the importance of taking that first step, which is the most important aspect we focus on in our training!
We then go over the structure of Kubernetes codebase, the main repositories, how to create a PR, and the CI/CD process using Prow. We explain in detail the process from creating a PR to getting it merged.
After several lectures, participants get to experience hands-on work using kubernetes-sigs/contributor-playground, where they can create a simple PR. The goal is for participants to get a feel for the process of contributing to Kubernetes.
At the end of the program, we also provide a detailed explanation of setting up the development environment for contributing to the kubernetes/kubernetes repository, including building the code locally, running tests efficiently, and setting up clusters.
Interview with participants
We conducted interviews with those who participated in our training program. We asked them about their reasons for joining, their impressions, and their future goals.
Keita Mochizuki (NTT DATA Group Corporation)
Keita Mochizuki is a contributor who consistently contributes to Kubernetes and related projects. Keita is also a professional in container security and has recently published a book. Additionally, he has made available a Roadmap for New Contributors, which is highly beneficial for those new to contributing.
Junya: Why did you decide to participate in Kubernetes Upstream Training?
Keita: Actually, I participated twice, in 2020 and 2022. In 2020, I had just started learning about Kubernetes and wanted to try getting involved in activities outside of work, so I signed up after seeing the event on Twitter by chance. However, I didn't have much knowledge at the time, and contributing to OSS felt like something beyond my reach. As a result, my understanding after the training was shallow, and I left with more of a "hmm, okay" feeling.
In 2022, I participated again when I was at a stage where I was seriously considering starting contributions. This time, I did prior research and was able to resolve my questions during the lectures, making it a very productive experience.
Junya: How did you feel after participating?
Keita: I felt that the significance of this training greatly depends on the participant's mindset. The training itself consists of general explanations and simple hands-on exercises, but it doesn't mean that attending the training will immediately lead to contributions.
Junya: What is your purpose for contributing?
Keita: My initial motivation was to "gain a deep understanding of Kubernetes and build a track record," meaning "contributing itself was the goal." Nowadays, I also contribute to address bugs or constraints I discover during my work. Additionally, through contributing, I've become less hesitant to analyze undocumented features directly from the source code.
Junya: What has been challenging about contributing?
Keita: The most difficult part was taking the first step. Contributing to OSS requires a certain level of knowledge, and leveraging resources like this training and support from others was essential. One phrase that stuck with me was, "Once you take the first step, it becomes easier to move forward." Also, in terms of continuing contributions as part of my job, the most challenging aspect is presenting the outcomes as achievements. To keep contributing over time, it's important to align it with business goals and strategies, but upstream contributions don't always lead to immediate results that can be directly tied to performance. Therefore, it's crucial to ensure mutual understanding with managers and gain their support.
Junya: What are your future goals?
Keita: My goal is to contribute to areas with a larger impact. So far, I've mainly contributed by fixing smaller bugs as my primary focus was building a track record, but moving forward, I'd like to challenge myself with contributions that have a greater impact on Kubernetes users or that address issues related to my work. Recently, I've also been working on reflecting the changes I've made to the codebase into the official documentation, and I see this as a step toward achieving my goals.
Junya: Thank you very much!
Yoshiki Fujikane (CyberAgent, Inc.)
Yoshiki Fujikane is one of the maintainers of PipeCD, a CNCF Sandbox project. In addition to developing new features for Kubernetes support in PipeCD, Yoshiki actively participates in community management and speaks at various technical conferences.
Junya: Why did you decide to participate in the Kubernetes Upstream Training?
Yoshiki: At the time I participated, I was still a student. I had only briefly worked with EKS, but I thought Kubernetes seemed complex yet cool, and I was casually interested in it. Back then, OSS felt like something out of reach, and upstream development for Kubernetes seemed incredibly daunting. While I had always been interested in OSS, I didn't know where to start. It was during this time that I learned about the Kubernetes Upstream Training and decided to take the challenge of contributing to Kubernetes.
Junya: What were your impressions after participating?
Yoshiki: I found it extremely valuable as a way to understand what it's like to be part of an OSS community. At the time, my English skills weren't very strong, so accessing primary sources of information felt like a big hurdle for me. Kubernetes is a very large project, and I didn't have a clear understanding of the overall structure, let alone what was necessary for contributing. The upstream training provided a Japanese explanation of the community structure and allowed me to gain hands-on experience with actual contributions. Thanks to the guidance I received, I was able to learn how to approach primary sources and use them as entry points for further investigation, which was incredibly helpful. This experience made me realize the importance of organizing and reviewing primary sources, and now I often dive into GitHub issues and documentation when something piques my interest. As a result, while I am no longer contributing to Kubernetes itself, the experience has been a great foundation for contributing to other projects.
Junya: What areas are you currently contributing to, and what are the other projects you're involved in?
Yoshiki: Right now, I'm no longer working with Kubernetes, but instead, I'm a maintainer of PipeCD, a CNCF Sandbox project. PipeCD is a CD tool that supports GitOps-style deployments for various application platforms. The tool originally started as an internal project at CyberAgent. With different teams adopting different platforms, PipeCD was developed to provide a unified CD platform with a consistent user experience. Currently, it supports Kubernetes, AWS ECS, Lambda, Cloud Run, and Terraform.
Junya: What role do you play within the PipeCD team?
Yoshiki: I work full-time on improving and developing Kubernetes-related features within the team. Since we provide PipeCD as a SaaS internally, my main focus is on adding new features and improving existing ones as part of that support. In addition to code contributions, I also contribute by giving talks at various events and managing community meetings to help grow the PipeCD community.
Junya: Could you explain what kind of improvements or developments you are working on with regards to Kubernetes?
Yoshiki: PipeCD supports GitOps and Progressive Delivery for Kubernetes, so I'm involved in the development of those features. Recently, I've been working on features that streamline deployments across multiple clusters.
Junya: Have you encountered any challenges while contributing to OSS?
Yoshiki: One challenge is developing features that maintain generality while meeting user use cases. When we receive feature requests while operating the internal SaaS, we first consider adding features to solve those issues. At the same time, we want PipeCD to be used by a broader audience as an OSS tool. So, I always think about whether a feature designed for one use case could be applied to another, ensuring the software remains flexible and widely usable.
Junya: What are your goals moving forward?
Yoshiki: I want to focus on expanding PipeCD's functionality. Currently, we are developing PipeCD under the slogan "One CD for All." As I mentioned earlier, it supports Kubernetes, AWS ECS, Lambda, Cloud Run, and Terraform, but there are many other platforms out there, and new platforms may emerge in the future. For this reason, we are currently developing a plugin system that will allow users to extend PipeCD on their own, and I want to push this effort forward. I'm also working on features for multi-cluster deployments in Kubernetes, and I aim to continue making impactful contributions.
Junya: Thank you very much!
Future of Kubernetes upstream training
We plan to continue hosting Kubernetes Upstream Training in Japan and look forward to welcoming many new contributors. Our next session is scheduled to take place at the end of November during CloudNative Days Winter 2024.
Moreover, our goal is to expand these training programs not only in Japan but also around the world. Kubernetes celebrated its 10th anniversary this year, and for the community to become even more active, it's crucial for people across the globe to continue contributing. While Upstream Training is already held in several regions, we aim to bring it to even more places.
We hope that as more people join Kubernetes community and contribute, our community will become even more vibrant!
28 Oct 2024 12:00am GMT
02 Oct 2024
Kubernetes Blog
Announcing the 2024 Steering Committee Election Results
The 2024 Steering Committee Election is now complete. The Kubernetes Steering Committee consists of 7 seats, 3 of which were up for election in 2024. Incoming committee members serve a term of 2 years, and all members are elected by the Kubernetes Community.
This community body is significant since it oversees the governance of the entire Kubernetes project. With that great power comes great responsibility. You can learn more about the steering committee's role in their charter.
Thank you to everyone who voted in the election; your participation helps support the community's continued health and success.
Results
Congratulations to the elected committee members whose two year terms begin immediately (listed in alphabetical order by GitHub handle):
- Antonio Ojea (@aojea), Google
- Benjamin Elder (@BenTheElder), Google
- Sascha Grunert (@saschagrunert), Red Hat
They join continuing members:
- Stephen Augustus (@justaugustus), Cisco
- Paco Xu 徐俊杰 (@pacoxu), DaoCloud
- Patrick Ohly (@pohly), Intel
- Maciej Szulik (@soltysh), Defense Unicorns
Benjamin Elder is a returning Steering Committee Member.
Big thanks!
Thank you and congratulations on a successful election to this round's election officers:
- Bridget Kromhout (@bridgetkromhout)
- Christoph Blecker (@cblecker)
- Priyanka Saggu (@Priyankasaggu11929)
Thanks to the Emeritus Steering Committee Members. Your service is appreciated by the community:
- Bob Killen (@mrbobbytables)
- Nabarun Pal (@palnabarun)
And thank you to all the candidates who came forward to run for election.
Get involved with the Steering Committee
This governing body, like all of Kubernetes, is open to all. You can follow along with Steering Committee meeting notes and weigh in by filing an issue or creating a PR against their repo. They hold an open meeting on the first Monday of every month at 8am PT. They can also be contacted at their public mailing list steering@kubernetes.io.
You can see what the Steering Committee meetings are all about by watching past meetings on the YouTube Playlist.
If you want to meet some of the newly elected Steering Committee members, join us for the Steering AMA at the Kubernetes Contributor Summit North America 2024 in Salt Lake City.
This post was adapted from one written by the Contributor Comms Subproject. If you want to write stories about the Kubernetes community, learn more about us.
02 Oct 2024 8:10pm GMT
30 Sep 2024
Kubernetes Blog
Spotlight on CNCF Deaf and Hard-of-hearing Working Group (DHHWG)
In recognition of Deaf Awareness Month and the importance of inclusivity in the tech community, we are spotlighting Catherine Paganini, facilitator and one of the founding members of CNCF Deaf and Hard-of-Hearing Working Group (DHHWG). In this interview, Sandeep Kanabar, a deaf member of the DHHWG and part of the Kubernetes SIG ContribEx Communications team, sits down with Catherine to explore the impact of the DHHWG on cloud native projects like Kubernetes.
Sandeep's journey is a testament to the power of inclusion. Through his involvement in the DHHWG, he connected with members of the Kubernetes community who encouraged him to join SIG ContribEx - the group responsible for sustaining the Kubernetes contributor experience. In an ecosystem where open-source projects are actively seeking contributors and maintainers, this story highlights how important it is to create pathways for underrepresented groups, including those with disabilities, to contribute their unique perspectives and skills.
In this interview, we delve into Catherine's journey, the challenges and triumphs of establishing the DHHWG, and the vision for a more inclusive future in cloud native. We invite Kubernetes contributors, maintainers, and community members to reflect on the significance of empathy, advocacy, and community in fostering a truly inclusive environment for all, and to think about how they can support efforts to increase diversity and accessibility within their own projects.
Introduction
Sandeep Kanabar (SK): Hello Catherine, could you please introduce yourself, share your professional background, and explain your connection to the Kubernetes ecosystem?
Catherine Paganini (CP): I'm the Head of Marketing at Buoyant, the creator of Linkerd, the CNCF-graduated service mesh, and 5th CNCF project. Four years ago, I started contributing to open source. The initial motivation was to make cloud native concepts more accessible to newbies and non-technical people. Without a technical background, it was hard for me to understand what Kubernetes, containers, service meshes, etc. mean. All content was targeted at engineers already familiar with foundational concepts. Clearly, I couldn't be the only one struggling with wrapping my head around cloud native.
My first contribution was the CNCF Landscape Guide, which I co-authored with my former colleague Jason Morgan. Next, we started the CNCF Glossary, which explains cloud native concepts in simple terms. Today, the glossary has been (partially) localised into 14 languages!
Currently, I'm the co-chair of the TAG Contributor Strategy and the Facilitator of the Deaf and Hard of Hearing Working Group (DHHWG) and Blind and Visually Impaired WG (BVIWG), which is still in formation. I'm also working on a new Linux Foundation (LF) initiative called ABIDE (Accessibility and Belonging through Inclusion, Diversity, and Equity), so stay tuned to learn more about it!
Motivation and early milestones
SK: That's inspiring! Building on your passion for accessibility, what motivated you to facilitate the creation of the DHHWG? Was there a specific moment or experience that sparked this initiative?
CP: Last year at KubeCon Amsterdam, I learned about a great initiative by Jay Tihema that creates pathways for Maori youth into cloud native and open source. While telling my CODA (children of deaf adults) high school friend about it, I thought it'd be great to create something similar for deaf folks. A few months later, I posted about it in a LinkedIn post that the CNCF shared. Deaf people started to reach out, wanting to participate. And the rest is history.
SK: Speaking of history, since its launch, how has the DHHWG evolved? Could you highlight some of the key milestones or achievements the group has reached recently?
CP: Our WG is about a year old. It started with a few deaf engineers and me brainstorming how to make KubeCon more accessible. We published an initial draft of Best practices for an inclusive conference and shared it with the LF events team. KubeCon Chicago was two months later, and we had a couple of deaf attendees. It was the first KubeCon accessible to deaf signers. Destiny, one of our co-chairs, even participated in a keynote panel. It was incredible how quickly everything happened!
DHHWG members at KubeCon Chicago
The team has grown since then, and we've been able to do much more. With a kiosk in the project pavilion, an open space discussion, a sign language crash course, and a few media interviews, KubeCon Paris had a stronger advocacy and outreach focus. Check out this video of our team in Paris to get a glimpse of all the different KubeCon activities - it was such a great event! The team also launched the first CNCF Community Group in sign language, Deaf in Cloud Native, a glossary team that creates sign language videos for each technical term to help standardize technical signs across the globe. It's crazy to think that it all happened within one year!
Overcoming challenges and addressing misconceptions
SK: That's remarkable progress in just a year! Building such momentum must have come with its challenges. What barriers have you encountered in facilitating the DHHWG, and how did you and the group work to overcome them?
CP: The support from the community, LF, and CNCF has been incredible. The fact that we achieved so much is proof of it. The challenges are more in helping some team members overcome their fear of contributing. Most are new to open source, and it can be intimidating to put your work out there for everyone to see. The fear of being criticized in public is real; however, as they will hopefully realize over time, our community is incredibly supportive. Instead of criticizing, people tend to help improve the work, leading to better outcomes.
SK: Are there any misconceptions about the deaf and hard-of-hearing community in tech that you'd like to address?
CP: Deaf and hard of hearing individuals are very diverse - there is no one-size-fits-all. Some deaf people are oral (speak), others sign, while some lip read or prefer captions. It generally depends on how people grew up. While some people come from deaf families and sign language is their native language, others were born into hearing families who may or may not have learned how to sign. Some deaf people grew up surrounded by hearing people, while others grew up deeply embedded in Deaf culture. Hard-of-hearing individuals, on the other hand, typically can communicate well with hearing peers one-on-one in quiet settings, but loud environments or conversations with multiple people can make it hard to follow the conversation. Most rely heavily on captions. Each background and experience will shape their communication style and preferences. In short, what works for one person doesn't necessarily work for others. So never assume and always ask about accessibility needs and preferences.
Impact and the role of allies
SK: Can you share some key impacts/outcomes of the conference best practices document?
CP: Here are the two most important ones: Captions should be on the monitor, not in an app. That's especially important during technical talks with live demos. Deaf and hard of hearing attendees will miss important information switching between captions on their phone and code on the screen.
Interpreters are most valuable during networking, not in talks (with captions). Most people come to conferences for the hallway track. That is no different for deaf attendees. If they can't network, they are missing out on key professional connections, affecting their career prospects.
SK: In your view, how crucial is the role of allies within the DHHWG, and what contributions have they made to the group's success?
CP: Deaf and hard of hearing individuals are a minority and can only do so much. Allies are the key to any diversity and inclusion initiative. As a majority, allies can help spread the word and educate their peers, playing a key role in scaling advocacy efforts. They also have the power to demand change. It's easy for companies to ignore minorities, but if the majority demands that their employers be accessible, environmentally conscious, and good citizens, they will ultimately be pushed to adapt to new societal values.
Expanding DEI efforts and future vision
SK: The importance of allies in driving change is clear. Beyond the DHHWG, are you involved in any other DEI groups or initiatives within the tech community?
CP: As mentioned above, I'm working on an initiative called ABIDE, which is still work in progress. I don't want to share too much about it yet, but what I can say is that the DHHWG will be part of it and that we just started a Blind and Visually Impaired WG (BVIWG). ABIDE will start by focusing on accessibility, so if anyone reading this has an idea for another WG, please reach out to me via the CNCF Slack @Catherine Paganini.
SK: What does the future hold for the DHHWG? Can you share details about any ongoing or upcoming initiatives?
CP: I think we've been very successful in terms of visibility and awareness so far. We can't stop, though. Awareness work is ongoing, and most people in our community haven't heard about us or met anyone on our team yet, so a lot of work still lies ahead.
DHHWG members at KubeCon Paris
The next step is to refocus on advocacy. The same thing we did with the conference best practices but for other areas. The goal is to help educate the community about what real accessibility looks like, how projects can be more accessible, and why employers should seriously consider deaf candidates while providing them with the tools they need to conduct successful interviews and employee onboarding. We need to capture all that in documents, publish it, and then get the word out. That last part is certainly the most challenging, but it's also where everyone can get involved.
Call to action
SK: Thank you for sharing your insights, Catherine. As we wrap up, do you have any final thoughts or a call to action for our readers?
CP: As we build our accessibility page, check in regularly to see what's new. Share the docs with your team, employer, and network - anyone, really. The more people understand what accessibility really means and why it matters, the more people will recognize when something isn't accessible, and be able to call out marketing-BS, which, unfortunately, is more often the case than not. We need allies to help push for change. No minority can do this on their own. So please learn about accessibility, keep an eye out for it, and call it out when something isn't accessible. We need your help!
Wrapping up
Catherine and the DHHWG's work exemplify the power of community and advocacy. As we celebrate Deaf Awareness Month, let's reflect on her role as an ally and consider how we can all contribute to building a more inclusive tech community, particularly within open-source projects like Kubernetes.
Together, we can break down barriers, challenge misconceptions, and ensure that everyone feels welcome and valued. By advocating for accessibility, supporting initiatives like the DHHWG, and fostering a culture of empathy, we can create a truly inclusive and welcoming space for all.
30 Sep 2024 12:00am GMT
24 Sep 2024
Kubernetes Blog
Spotlight on SIG Scheduling
In this SIG Scheduling spotlight we talked with Kensei Nakada, an approver in SIG Scheduling.
Introductions
Arvind: Hello, thank you for the opportunity to learn more about SIG Scheduling! Would you like to introduce yourself and tell us a bit about your role, and how you got involved with Kubernetes?
Kensei: Hi, thanks for the opportunity! I'm Kensei Nakada (@sanposhiho), a software engineer at Tetrate.io. I have been contributing to Kubernetes in my free time for more than 3 years, and now I'm an approver of SIG Scheduling in Kubernetes. Also, I'm a founder/owner of two SIG subprojects, kube-scheduler-simulator and kube-scheduler-wasm-extension.
About SIG Scheduling
AP: That's awesome! You've been involved with the project for a long time. Can you provide a brief overview of SIG Scheduling and explain its role within the Kubernetes ecosystem?
KN: As the name implies, our responsibility is to enhance scheduling within Kubernetes. Specifically, we develop the components that determine which Node is the best place for each Pod. In Kubernetes, our main focus is on maintaining the kube-scheduler, along with other scheduling-related components as part of our SIG subprojects.
AP: I see, got it! That makes me curious: what recent innovations or developments has SIG Scheduling introduced to Kubernetes scheduling?
KN: From a feature perspective, there have been several enhancements to PodTopologySpread recently. PodTopologySpread is a relatively new feature in the scheduler, and we are still in the process of gathering feedback and making improvements.
Most recently, we have been focusing on a new internal enhancement called QueueingHint which aims to enhance scheduling throughput. Throughput is one of our crucial metrics in scheduling. Traditionally, we have primarily focused on optimizing the latency of each scheduling cycle. QueueingHint takes a different approach, optimizing when to retry scheduling, thereby reducing the likelihood of wasting scheduling cycles.
AP: That sounds interesting! Are there any other interesting topics or projects you are currently working on within SIG Scheduling?
KN: I'm leading the development of QueueingHint, which I just shared. Given that it's a big new challenge for us, we've been facing many unexpected challenges, especially around scalability, and we're trying to solve each of them to eventually enable it by default.
And also, I believe kube-scheduler-wasm-extension (a SIG subproject) that I started last year would be interesting to many people. Kubernetes has various extensions from many components. Traditionally, extensions are provided via webhooks (extender in the scheduler) or Go SDK (Scheduling Framework in the scheduler). However, these come with drawbacks - performance issues with webhooks and the need to rebuild and replace schedulers with Go SDK, posing difficulties for those seeking to extend the scheduler but lacking familiarity with it. The project is trying to introduce a new solution to this general challenge - a WebAssembly based extension. Wasm allows users to build plugins easily, without worrying about recompiling or replacing their scheduler, and sidestepping performance concerns.
Through this project, SIG Scheduling has been learning valuable insights about WebAssembly's interaction with large Kubernetes objects. And I believe the experience that we're gaining should be useful broadly within the community, beyond SIG Scheduling.
AP: Definitely! Now, there are 8 subprojects inside SIG Scheduling. Would you like to talk about them? Are there some interesting contributions by those teams you want to highlight?
KN: Let me pick three subprojects: Kueue, KWOK and descheduler.
- Kueue: Recently, many people have been trying to manage batch workloads with Kubernetes, and in 2022 the Kubernetes community founded WG Batch to better support such batch workloads in Kubernetes. Kueue is a project that plays a crucial role in that effort. It's a job queueing controller, deciding when a job should wait, when a job should be admitted to start, and when a job should be preempted. Kueue aims to be installed on a vanilla Kubernetes cluster while cooperating with existing mature controllers (scheduler, cluster-autoscaler, kube-controller-manager, etc).
- KWOK: KWOK is a component with which you can create a cluster of thousands of Nodes in seconds. It's mostly useful for simulation and testing as a lightweight cluster, and another SIG subproject, kube-scheduler-simulator, actually uses KWOK in the background.
- descheduler: The descheduler is a component that recreates pods that are running on undesired Nodes. In Kubernetes, scheduling constraints (PodAffinity, NodeAffinity, PodTopologySpread, etc.) are honored only at Pod scheduling time; it's not guaranteed that the constraints remain satisfied afterwards. The descheduler evicts Pods violating their scheduling constraints (or other undesired conditions) so that they're recreated and rescheduled.
- Descheduling Framework: One very interesting ongoing project, similar to the Scheduling Framework in the scheduler, aiming to make descheduling logic extensible and allow maintainers to focus on building the core engine of the descheduler.
AP: Thank you for letting us know! And I have to ask, what are some of your favorite things about this SIG?
KN: What I really like about this SIG is how actively engaged everyone is. We come from various companies and industries, bringing diverse perspectives to the table. Instead of these differences causing division, they actually generate a wealth of opinions. Each view is respected, and this makes our discussions both rich and productive.
I really appreciate this collaborative atmosphere, and I believe it has been key to continuously improving our components over the years.
Contributing to SIG Scheduling
AP: Kubernetes is a community-driven project. Any recommendations for new contributors or beginners looking to get involved and contribute to SIG scheduling? Where should they start?
KN: Let me start with a general recommendation for contributing to any SIG: a common approach is to look for good-first-issue. However, you'll soon realize that many people worldwide are trying to contribute to the Kubernetes repository.
I suggest starting by examining the implementation of a component that interests you. If you have any questions about it, ask in the corresponding Slack channel (e.g., #sig-scheduling for the scheduler, #sig-node for kubelet, etc). Once you have a rough understanding of the implementation, look at issues within the SIG (e.g., sig-scheduling), where you'll find more unassigned issues compared to good-first-issue ones. You may also want to filter issues with the kind/cleanup label, which often indicates lower-priority tasks and can be starting points.
Specifically for SIG Scheduling, you should first understand the Scheduling Framework, which is the fundamental architecture of kube-scheduler. Most of the implementation is found in pkg/scheduler. I suggest starting with the ScheduleOne function and then exploring deeper from there.
Additionally, apart from the main kubernetes/kubernetes repository, consider looking into sub-projects. These typically have fewer maintainers and offer more opportunities to make a significant impact. Despite being called "sub" projects, many have a large number of users and a considerable impact on the community.
And last but not least, remember contributing to the community isn't just about code. While I talked a lot about the implementation contribution, there are many ways to contribute, and each one is valuable. One comment to an issue, one feedback to an existing feature, one review comment in PR, one clarification on the documentation; every small contribution helps drive the Kubernetes ecosystem forward.
AP: Those are some pretty useful tips! And if I may ask, how do you assist new contributors in getting started, and what skills are contributors likely to learn by participating in SIG Scheduling?
KN: Our maintainers are available to answer your questions in the #sig-scheduling Slack channel. By participating, you'll gain a deeper understanding of Kubernetes scheduling and have the opportunity to collaborate and network with maintainers from diverse backgrounds. You'll learn not just how to write code, but also how to maintain a large project, design and discuss new features, address bugs, and much more.
Future Directions
AP: What are some Kubernetes-specific challenges in terms of scheduling? Are there any particular pain points?
KN: Scheduling in Kubernetes can be quite challenging because of the diverse needs of different organizations with different business requirements. Supporting all possible use cases in kube-scheduler is impossible. Therefore, extensibility is a key focus for us. A few years ago, we rearchitected kube-scheduler with Scheduling Framework, which offers flexible extensibility for users to implement various scheduling needs through plugins. This allows maintainers to focus on the core scheduling features and the framework runtime.
Another major issue is maintaining sufficient scheduling throughput. Typically, a Kubernetes cluster has only one kube-scheduler, so its throughput directly affects the overall scheduling scalability and, consequently, the cluster's scalability. Although we have an internal performance test (scheduler_perf), unfortunately, we sometimes overlook performance degradation in less common scenarios. It's difficult as even small changes, which look irrelevant to performance, can lead to degradation.
AP: What are some upcoming goals or initiatives for SIG Scheduling? How do you envision the SIG evolving in the future?
KN: Our primary goal is always to build and maintain extensible and stable scheduling runtime, and I bet this goal will remain unchanged forever.
As already mentioned, extensibility is key to solving the challenge of the diverse needs of scheduling. Rather than trying to support every different use case directly in kube-scheduler, we will continue to focus on enhancing extensibility so that it can accommodate various use cases. kube-scheduler-wasm-extension that I mentioned is also part of this initiative.
Regarding stability, introducing new optimizations like QueueingHint is one of our strategies. Additionally, maintaining throughput is also a crucial goal for the future. We're planning to enhance our throughput monitoring (ref), so that we can notice degradations ourselves as much as possible before releasing. But, realistically, we can't cover every possible scenario. We highly appreciate any attention the community can give to scheduling throughput and encourage feedback and alerts regarding performance issues!
Closing Remarks
AP: Finally, what message would you like to convey to those who are interested in learning more about SIG Scheduling?
KN: Scheduling is one of the most complicated areas in Kubernetes, and you may find it difficult at first. But, as I shared earlier, you can find many opportunities for contributions, and many maintainers are willing to help you understand things. We know your unique perspective and skills are what makes our open source so powerful 😊
Feel free to reach out to us in Slack (#sig-scheduling) or meetings. I hope this article interests everyone and we can see new contributors!
AP: Thank you so much for taking the time to do this! I'm confident that many will find this information invaluable for understanding more about SIG Scheduling and for contributing to the SIG.
24 Sep 2024 12:00am GMT
23 Aug 2024
Kubernetes Blog
Kubernetes v1.31: kubeadm v1beta4
As part of the Kubernetes v1.31 release, kubeadm is adopting a new (v1beta4) version of its configuration file format. Configuration in the previous v1beta3 format is now formally deprecated, which means it's supported but you should migrate to v1beta4 and stop using the deprecated format. Support for v1beta3 configuration will be removed after a minimum of 3 Kubernetes minor releases.
In this article, I'll walk you through the key changes: I'll explain the kubeadm v1beta4 configuration format and how to migrate from v1beta3 to v1beta4.
You can read the reference for the v1beta4 configuration format: kubeadm Configuration (v1beta4).
A list of changes since v1beta3
This version improves on the v1beta3 format by fixing some minor issues and adding a few new fields.
To put it simply, v1beta4 includes the following changes:
- Two new configuration elements: ResetConfiguration and UpgradeConfiguration
- For InitConfiguration and JoinConfiguration, dryRun mode and nodeRegistration.imagePullSerial are supported
- For ClusterConfiguration, there are new fields including certificateValidityPeriod, caCertificateValidityPeriod, encryptionAlgorithm, dns.disabled and proxy.disabled
- Support for extraEnvs for all control plane components
- extraArgs changed from a map to structured extra arguments that support duplicates
- A timeouts structure for init, join, upgrade and reset
For details, you can see the official document below:
- Support custom environment variables in control plane components under ClusterConfiguration. Use apiServer.extraEnvs, controllerManager.extraEnvs, scheduler.extraEnvs and etcd.local.extraEnvs.
- The ResetConfiguration API type is now supported in v1beta4. Users are able to reset a node by passing a --config file to kubeadm reset.
- dryRun mode is now configurable in InitConfiguration and JoinConfiguration.
- Replace the existing string/string extra argument maps with structured extra arguments that support duplicates. The change applies to ClusterConfiguration - apiServer.extraArgs, controllerManager.extraArgs, scheduler.extraArgs and etcd.local.extraArgs - as well as to nodeRegistrationOptions.kubeletExtraArgs.
- Added ClusterConfiguration.encryptionAlgorithm, which can be used to set the asymmetric encryption algorithm used for this cluster's keys and certificates. It can be one of "RSA-2048" (default), "RSA-3072", "RSA-4096" or "ECDSA-P256".
- Added ClusterConfiguration.dns.disabled and ClusterConfiguration.proxy.disabled, which can be used to disable the CoreDNS and kube-proxy addons during cluster initialization. Skipping the related addon phases during cluster creation will set the same fields to true.
- Added the nodeRegistration.imagePullSerial field in InitConfiguration and JoinConfiguration, which can be used to control whether kubeadm pulls images serially or in parallel.
- The UpgradeConfiguration kubeadm API is now supported in v1beta4 when passing --config to kubeadm upgrade subcommands. For upgrade subcommands, the usage of component configuration for kubelet and kube-proxy, as well as InitConfiguration and ClusterConfiguration, is now deprecated and will be ignored when passing --config.
- Added a timeouts structure to InitConfiguration, JoinConfiguration, ResetConfiguration and UpgradeConfiguration that can be used to configure various timeouts. The ClusterConfiguration.timeoutForControlPlane field is replaced by timeouts.controlPlaneComponentHealthCheck. The JoinConfiguration.discovery.timeout is replaced by timeouts.discovery.
- Added certificateValidityPeriod and caCertificateValidityPeriod fields to ClusterConfiguration. These fields can be used to control the validity period of certificates generated by kubeadm during sub-commands such as init, join, upgrade and certs. Default values continue to be 1 year for non-CA certificates and 10 years for CA certificates. Also note that only non-CA certificates are renewable by kubeadm certs renew.
These changes simplify the configuration of tools that use kubeadm and improve the extensibility of kubeadm itself.
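To make this concrete, here is a minimal sketch of a v1beta4 ClusterConfiguration exercising several of the new fields. The specific values (algorithm, validity durations, proxy address) are illustrative assumptions rather than recommendations:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.31.0
encryptionAlgorithm: ECDSA-P256 # new in v1beta4
certificateValidityPeriod: 8760h0m0s # 1 year (the default, shown for illustration)
caCertificateValidityPeriod: 87600h0m0s # 10 years (the default)
proxy:
  disabled: true # skip the kube-proxy addon
apiServer:
  extraArgs: # structured extra arguments that support duplicates
  - name: authorization-mode
    value: Node,RBAC
  extraEnvs: # custom environment variables for the component
  - name: HTTP_PROXY
    value: http://proxy.example.com:3128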
How to migrate v1beta3 configuration to v1beta4?
If your configuration is not using the latest version, it is recommended that you migrate using the kubeadm config migrate command.
This command reads an existing configuration file that uses the old format, and writes a new file that uses the current format.
Example
Using kubeadm v1.31, run kubeadm config migrate --old-config old-v1beta3.yaml --new-config new-v1beta4.yaml
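For instance, a hypothetical v1beta3 input that still sets the old control plane timeout:
# old-v1beta3.yaml (illustrative)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 4m0s
should come out of the migration roughly as the v1beta4 equivalent below, with the timeout moved into the new timeouts structure (the exact output can differ between kubeadm patch versions):
# new-v1beta4.yaml (illustrative)
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
timeouts:
  controlPlaneComponentHealthCheck: 4m0s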
How do I get involved?
Huge thanks to all the contributors who helped with the design, implementation, and review of this feature:
- Lubomir I. Ivanov (neolit123)
- Dave Chen (chendave)
- Paco Xu (pacoxu)
- Sata Qiu (sataqiu)
- Baofa Fan (carlory)
- Calvin Chen (calvin0327)
- Ruquan Zhao (ruquanzhao)
For those interested in getting involved in future discussions on kubeadm configuration, you can reach out to the kubeadm maintainers or SIG Cluster Lifecycle by several means:
- v1beta4 related items are tracked in kubeadm issue #2890.
- Slack: #kubeadm or #sig-cluster-lifecycle
- Mailing list
23 Aug 2024 12:00am GMT
22 Aug 2024
Kubernetes Blog
Kubernetes v1.31: New Kubernetes CPUManager Static Policy: Distribute CPUs Across Cores
In Kubernetes v1.31, we are excited to introduce a significant enhancement to CPU management capabilities: the distribute-cpus-across-cores option for the CPUManager static policy. This feature is currently in alpha and hidden by default, marking a strategic shift aimed at optimizing CPU utilization and improving system performance across multi-core processors.
Understanding the feature
Traditionally, Kubernetes' CPUManager tends to allocate CPUs as compactly as possible, typically packing them onto the fewest number of physical cores. However, the allocation strategy matters: CPUs on the same physical core still share some resources of that core, such as the cache and execution units.
While the default approach minimizes inter-core communication and can be beneficial under certain scenarios, it also poses a challenge. CPUs sharing a physical core can lead to resource contention, which in turn may cause performance bottlenecks, particularly noticeable in CPU-intensive applications.
The new distribute-cpus-across-cores feature addresses this issue by modifying the allocation strategy. When enabled, this policy option instructs the CPUManager to spread out the CPUs (hardware threads) across as many physical cores as possible. This distribution is designed to minimize contention among CPUs sharing the same physical core, potentially enhancing the performance of applications by providing them dedicated core resources.
Technically, within this static policy, the free CPU list is reordered so that CPUs are preferentially allocated from separate physical cores.
Enabling the feature
To enable this feature, users first need to set the --cpu-manager-policy=static kubelet flag, or the cpuManagerPolicy: static field in the KubeletConfiguration. Then, they can add distribute-cpus-across-cores=true to the --cpu-manager-policy-options kubelet flag, or to the cpuManagerPolicyOptions field in the kubelet configuration file. This setting directs the CPUManager to adopt the new distribution strategy. It is important to note that this policy option cannot currently be used in conjunction with the full-pcpus-only or distribute-cpus-across-numa options.
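For illustration, a kubelet configuration enabling the option could look like the minimal sketch below. The CPUManagerPolicyAlphaOptions feature gate shown here is an assumption, on the basis that alpha-quality policy options are normally gated behind it:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyAlphaOptions: true # assumed requirement while the option is alpha
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  distribute-cpus-across-cores: "true"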
Current limitations and future directions
As with any new feature, especially one in alpha, there are limitations and areas for future improvement. One significant current limitation is that distribute-cpus-across-cores cannot be combined with other policy options that might conflict in terms of CPU allocation strategies. This restriction can affect compatibility with certain workloads and deployment scenarios that rely on more specialized resource management.
Looking forward, we are committed to enhancing the compatibility and functionality of the distribute-cpus-across-cores option. Future updates will focus on resolving these compatibility issues, allowing this policy to be combined with other CPUManager policies seamlessly. Our goal is to provide a more flexible and robust CPU allocation framework that can adapt to a variety of workloads and performance demands.
Conclusion
The introduction of the distribute-cpus-across-cores policy in Kubernetes CPUManager is a step forward in our ongoing efforts to refine resource management and improve application performance. By reducing the contention on physical cores, this feature offers a more balanced approach to CPU resource allocation, particularly beneficial for environments running heterogeneous workloads. We encourage Kubernetes users to test this new feature and provide feedback, which will be invaluable in shaping its future development.
Further reading
Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.
Getting involved
This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.
22 Aug 2024 12:00am GMT
Kubernetes 1.31: Fine-grained SupplementalGroups control
This blog discusses a new feature in Kubernetes 1.31 to improve the handling of supplementary groups in containers within Pods.
Motivation: Implicit group memberships defined in /etc/group in the container image
Although this behavior may not be popular with many Kubernetes cluster users/admins, Kubernetes, by default, merges group information from the Pod with information defined in /etc/group in the container image.
Let's look at an example. The Pod below specifies runAsUser=1000, runAsGroup=3000 and supplementalGroups=4000 in the Pod's security context.
apiVersion: v1
kind: Pod
metadata:
name: implicit-groups
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
supplementalGroups: [4000]
containers:
- name: ctr
image: registry.k8s.io/e2e-test-images/agnhost:2.45
command: [ "sh", "-c", "sleep 1h" ]
securityContext:
allowPrivilegeEscalation: false
What is the result of the id command in the ctr container?
# Create the Pod:
$ kubectl apply -f https://k8s.io/blog/2024-08-22-Fine-grained-SupplementalGroups-control/implicit-groups.yaml
# Verify that the Pod's Container is running:
$ kubectl get pod implicit-groups
# Check the id command
$ kubectl exec implicit-groups -- id
Then, the output should be similar to this:
uid=1000 gid=3000 groups=3000,4000,50000
Where does group ID 50000 in the supplementary groups (the groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is the /etc/group file in the container image.
Checking the contents of /etc/group in the container image shows the following:
$ kubectl exec implicit-groups -- cat /etc/group
...
user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image
Aha! The container's primary user 1000 belongs to the group 50000 in the last entry.
Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged with the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.
What's wrong with it?
The implicitly merged group information from /etc/group in the container image may cause some concerns, particularly in accessing volumes (see kubernetes/kubernetes#112879 for details), because file permissions are controlled by uid/gid in Linux. Even worse, the implicit gids from /etc/group cannot be detected or validated by any policy engines because there is no clue about the implicit group information in the manifest. This can also be a concern for Kubernetes security.
Fine-grained SupplementalGroups control in a Pod: supplementalGroupsPolicy
To tackle the above problem, Kubernetes 1.31 introduces a new field, supplementalGroupsPolicy, in the Pod's .spec.securityContext.
This field provides a way to control how supplementary groups are calculated for the container processes in a Pod. The available policies are below:
- Merge: The group membership defined in /etc/group for the container's primary user will be merged. If not specified, this policy will be applied (i.e. the as-is behavior, for backward compatibility).
- Strict: Only the group IDs specified in the fsGroup, supplementalGroups, or runAsGroup fields are attached as the supplementary groups of the container processes. This means no group membership defined in /etc/group for the container's primary user will be merged.
Let's see how the Strict policy works.
apiVersion: v1
kind: Pod
metadata:
name: strict-supplementalgroups-policy
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
supplementalGroups: [4000]
supplementalGroupsPolicy: Strict
containers:
- name: ctr
image: registry.k8s.io/e2e-test-images/agnhost:2.45
command: [ "sh", "-c", "sleep 1h" ]
securityContext:
allowPrivilegeEscalation: false
# Create the Pod:
$ kubectl apply -f https://k8s.io/blog/2024-08-22-Fine-grained-SupplementalGroups-control/strict-supplementalgroups-policy.yaml
# Verify that the Pod's Container is running:
$ kubectl get pod strict-supplementalgroups-policy
# Check the process identity:
$ kubectl exec -it strict-supplementalgroups-policy -- id
The output should be similar to this:
uid=1000 gid=3000 groups=3000,4000
You can see that the Strict policy excludes group 50000 from groups!
Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent the implicit supplementary groups in a Pod.
Note: Actually, this is not enough, because a container with sufficient privileges / capability can change its process identity. Please see the following section for details.
Attached process identity in Pod status
This feature also exposes the process identity attached to the first container process of the container via the .status.containerStatuses[].user.linux field. It can be helpful for checking whether implicit group IDs are attached.
...
status:
containerStatuses:
- name: ctr
user:
linux:
gid: 3000
supplementalGroups:
- 3000
- 4000
uid: 1000
...
Note: Please note that the values in the status.containerStatuses[].user.linux field are the process identity first attached to the first container process in the container. If the container has sufficient privilege to call system calls related to process identity (e.g. setuid(2), setgid(2) or setgroups(2), etc.), the container process can change its identity. Thus, the actual process identity will be dynamic.
Feature availability
To enable the supplementalGroupsPolicy field, the following components have to be used:
- Kubernetes: v1.31 or later, with the SupplementalGroupsPolicy feature gate enabled. As of v1.31, the gate is marked as alpha.
- CRI runtime:
- containerd: v2.0 or later
- CRI-O: v1.31 or later
You can see if the feature is supported in the Node's .status.features.supplementalGroupsPolicy field.
apiVersion: v1
kind: Node
...
status:
features:
supplementalGroupsPolicy: true
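If you'd like to check this from the command line, a jsonpath query such as the one below reads that field directly; the node name node-1 is a placeholder:
$ kubectl get node node-1 -o jsonpath='{.status.features.supplementalGroupsPolicy}'
true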
What's next?
Kubernetes SIG Node hopes - and expects - that the feature will be promoted to beta and eventually general availability (GA) in future releases of Kubernetes, so that users no longer need to enable the feature gate manually.
The Merge policy is applied when supplementalGroupsPolicy is not specified, for backwards compatibility.
How can I learn more?
- Configure a Security Context for a Pod or Container, for further details of supplementalGroupsPolicy
- KEP-3619: Fine-grained SupplementalGroups control
How to get involved?
This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!
22 Aug 2024 12:00am GMT
Kubernetes 1.31: Custom Profiling in Kubectl Debug Graduates to Beta
There are many ways of troubleshooting the pods and nodes in a cluster. However, kubectl debug is one of the easiest, most used and most prominent ones. It provides a set of static profiles, and each profile serves a different kind of role. For instance, from the network administrator's point of view, debugging a node should be as easy as this:
$ kubectl debug node/mynode -it --image=busybox --profile=netadmin
On the other hand, static profiles also bring inherent rigidity, which, despite their ease of use, has implications for some pods. There are various kinds of pods (or nodes) that all have their specific necessities, and unfortunately, some can't be debugged by using only the static profiles.
Consider a simple pod consisting of a container whose health relies on an environment variable:
apiVersion: v1
kind: Pod
metadata:
name: example-pod
spec:
containers:
- name: example-container
image: customapp:latest
env:
- name: REQUIRED_ENV_VAR
value: "value1"
Currently, copying the pod is the sole mechanism in kubectl debug that supports debugging this pod. Furthermore, what if a user needs to modify REQUIRED_ENV_VAR to something different for advanced troubleshooting? There is no mechanism to achieve this.
Custom Profiling
Custom profiling is a new functionality available under the --custom flag, introduced in kubectl debug to provide extensibility. It expects a partial Container spec in either YAML or JSON format. In order to debug the example-container above by creating an ephemeral container, we simply have to define this YAML:
# partial_container.yaml
env:
- name: REQUIRED_ENV_VAR
value: value2
and execute:
kubectl debug example-pod -it --image=customapp --custom=partial_container.yaml
Here is another example that modifies multiple fields at once (change port number, add resource limits, modify environment variable) in JSON:
{
"ports": [
{
"containerPort": 80
}
],
"resources": {
"limits": {
"cpu": "0.5",
"memory": "512Mi"
},
"requests": {
"cpu": "0.2",
"memory": "256Mi"
}
},
"env": [
{
"name": "REQUIRED_ENV_VAR",
"value": "value2"
}
]
}
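Assuming the JSON above is saved as partial_container.json (an illustrative filename), it can be passed to the same --custom flag:
kubectl debug example-pod -it --image=customapp --custom=partial_container.json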
Constraints
Uncontrolled extensibility hurts usability, so custom profiling is not allowed for certain fields, such as command, image, lifecycle, volume devices and container name. In the future, more fields can be added to the disallowed list if required.
Limitations
The kubectl debug command has 3 aspects: debugging with ephemeral containers, pod copying, and node debugging. The largest intersection of these aspects is the container spec within a Pod. That's why custom profiling only supports the modification of fields that are defined within containers. This leads to a limitation: if a user needs to modify other fields in the Pod spec, it is not supported.
Acknowledgments
Special thanks to all the contributors who reviewed and commented on this feature, from the initial conception to its actual implementation (alphabetical order):
22 Aug 2024 12:00am GMT
21 Aug 2024
Kubernetes Blog
Kubernetes 1.31: Autoconfiguration For Node Cgroup Driver (beta)
Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would exit with an error. This was a source of headaches for many cluster admins. However, there is light at the end of the tunnel!
Automated cgroup driver detection
In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. A few minor releases of Kubernetes happened whilst we waited for support to land in the major two CRI implementations (containerd and CRI-O), but as of v1.31.0, this feature is now beta!
In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:
- containerd: Support was added in v2.0.0
- CRI-O: Support was added in v1.28.0
Then, they should ensure their CRI implementation is configured to use the cgroup_driver they would like.
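For illustration, on clusters where the gate isn't enabled by default, a kubelet configuration sketch like this would turn it on (in v1.31 the gate is beta and on by default, so this is usually unnecessary):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletCgroupDriverFromCRI: true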
Future work
Eventually, support for the kubelet's cgroupDriver configuration field will be dropped, and the kubelet will fail to start if the CRI implementation isn't new enough to have support for this feature.
21 Aug 2024 12:00am GMT
20 Aug 2024
Kubernetes Blog
Kubernetes 1.31: Streaming Transitions from SPDY to WebSockets
In Kubernetes 1.31, by default kubectl now uses the WebSocket protocol instead of SPDY for streaming.
This post describes what these changes mean for you and why these streaming APIs matter.
Streaming APIs in Kubernetes
In Kubernetes, specific endpoints that are exposed as an HTTP or RESTful interface are upgraded to streaming connections, which require a streaming protocol. Unlike HTTP, which is a request-response protocol, a streaming protocol provides a persistent connection that's bi-directional, low-latency, and lets you interact in real-time. Streaming protocols support reading and writing data between your client and the server, in both directions, over the same connection. This type of connection is useful, for example, when you create a shell in a running container from your local workstation and run commands in the container.
Why change the streaming protocol?
Before the v1.31 release, Kubernetes used the SPDY/3.1 protocol by default when upgrading streaming connections. SPDY/3.1 has been deprecated for eight years, and it was never standardized. Many modern proxies, gateways, and load balancers no longer support the protocol. As a result, you might notice that commands like kubectl cp, kubectl attach, kubectl exec, and kubectl port-forward stop working when you try to access your cluster through a proxy or gateway.
As of Kubernetes v1.31, SIG API Machinery has modified the streaming protocol that a Kubernetes client (such as kubectl) uses for these commands to the more modern WebSocket streaming protocol. The WebSocket protocol is a currently supported standardized streaming protocol that guarantees compatibility and interoperability with different components and programming languages. The WebSocket protocol is more widely supported by modern proxies and gateways than SPDY.
How streaming APIs work
Kubernetes upgrades HTTP connections to streaming connections by adding specific upgrade headers to the originating HTTP request. For example, an HTTP upgrade request for running the date command on an nginx container within a cluster is similar to the following:
$ kubectl exec -v=8 nginx -- date
GET https://127.0.0.1:43251/api/v1/namespaces/default/pods/nginx/exec?command=date…
Request Headers:
Connection: Upgrade
Upgrade: websocket
Sec-Websocket-Protocol: v5.channel.k8s.io
User-Agent: kubectl/v1.31.0 (linux/amd64) kubernetes/6911225
If the container runtime supports the WebSocket streaming protocol and at least one of the subprotocol versions (e.g. v5.channel.k8s.io), the server responds with a successful 101 Switching Protocols status, along with the negotiated subprotocol version:
Response Status: 101 Switching Protocols in 3 milliseconds
Response Headers:
Upgrade: websocket
Connection: Upgrade
Sec-Websocket-Accept: j0/jHW9RpaUoGsUAv97EcKw8jFM=
Sec-Websocket-Protocol: v5.channel.k8s.io
At this point the TCP connection used for the HTTP protocol has changed to a streaming connection. Subsequent STDIN, STDOUT, and STDERR data (as well as terminal resizing data and process exit code data) for this shell interaction is then streamed over this upgraded connection.
How to use the new WebSocket streaming protocol
If your cluster and kubectl are on version 1.29 or later, there are two control plane feature gates and two kubectl environment variables that govern the use of WebSockets rather than SPDY. In Kubernetes 1.31, all of the following feature gates are in beta and are enabled by default:
- Feature gates
  - TranslateStreamCloseWebsocketRequests: applies to the .../exec and .../attach endpoints
  - PortForwardWebsockets: applies to the .../port-forward endpoint
- kubectl feature control environment variables
  - KUBECTL_REMOTE_COMMAND_WEBSOCKETS: applies to kubectl exec, kubectl cp and kubectl attach
  - KUBECTL_PORT_FORWARD_WEBSOCKETS: applies to kubectl port-forward
If you're connecting to an older cluster but can manage the feature gate settings, turn on both TranslateStreamCloseWebsocketRequests (added in Kubernetes v1.29) and PortForwardWebsockets (added in Kubernetes v1.30) to try this new behavior. Version 1.31 of kubectl can automatically use the new behavior, but you do need to connect to a cluster where the server-side features are explicitly enabled.
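Since these are control plane gates, enabling them on such a cluster usually means passing something like the following flag to the API server; treat this as a sketch, as the exact mechanism depends on how your control plane is deployed:
kube-apiserver --feature-gates=TranslateStreamCloseWebsocketRequests=true,PortForwardWebsockets=true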
Learn more about streaming APIs
- KEP 4006 - Transitioning from SPDY to WebSockets
- RFC 6455 - The WebSockets Protocol
- Container Runtime Interface streaming explained
20 Aug 2024 12:00am GMT
19 Aug 2024
Kubernetes Blog
Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA
This post describes Pod failure policy, which graduates to stable in Kubernetes 1.31, and how to use it in your Jobs.
About Pod failure policy
When you run workloads on Kubernetes, Pods might fail for a variety of reasons. Ideally, workloads like Jobs should be able to ignore transient, retriable failures and continue running to completion.
To allow for these transient failures, Kubernetes Jobs include the backoffLimit field, which lets you specify a number of Pod failures that you're willing to tolerate during Job execution. However, if you set a large value for the backoffLimit field and rely solely on this field, you might notice unnecessary increases in operating costs as Pods restart excessively until the backoffLimit is met.
This becomes particularly problematic when running large-scale Jobs with thousands of long-running Pods across thousands of nodes.
The Pod failure policy extends the backoff limit mechanism to help you reduce costs in the following ways:
- Gives you control to fail the Job as soon as a non-retriable Pod failure occurs.
- Allows you to ignore retriable errors without increasing the backoffLimit field.
For example, you can use a Pod failure policy to run your workload on more affordable spot machines by ignoring Pod failures caused by graceful node shutdown.
The policy allows you to distinguish between retriable and non-retriable Pod failures based on container exit codes or Pod conditions in a failed Pod.
How it works
You specify a Pod failure policy in the Job specification, represented as a list of rules.
For each rule you define match requirements based on one of the following properties:
- Container exit codes: the onExitCodes property.
- Pod conditions: the onPodConditions property.
Additionally, for each rule, you specify one of the following actions to take when a Pod matches the rule:
- Ignore: Do not count the failure towards the backoffLimit or backoffLimitPerIndex.
- FailJob: Fail the entire Job and terminate all running Pods.
- FailIndex: Fail the index corresponding to the failed Pod. This action works with the Backoff limit per index feature.
- Count: Count the failure towards the backoffLimit or backoffLimitPerIndex. This is the default behavior.
When Pod failures occur in a running Job, Kubernetes matches the failed Pod status against the list of Pod failure policy rules, in the specified order, and takes the corresponding actions for the first matched rule.
Note that when specifying the Pod failure policy, you must also set the Job's Pod template with restartPolicy: Never. This prevents race conditions between the kubelet and the Job controller when counting Pod failures.
Kubernetes-initiated Pod disruptions
To allow matching Pod failure policy rules against failures caused by disruptions initiated by Kubernetes, this feature introduces the DisruptionTarget Pod condition.
Kubernetes adds this condition to any Pod, regardless of whether it's managed by a Job controller, that fails because of a retriable disruption scenario. The DisruptionTarget condition contains one of the following reasons that corresponds to these disruption scenarios:
- PreemptionByKubeScheduler: preemption by kube-scheduler to accommodate a new Pod that has a higher priority.
- DeletionByTaintManager: the Pod is due to be deleted by kube-controller-manager due to a NoExecute taint that the Pod doesn't tolerate.
- EvictionByEvictionAPI: the Pod is due to be deleted by an API-initiated eviction.
- DeletionByPodGC: the Pod is bound to a node that no longer exists, and is due to be deleted by Pod garbage collection.
- TerminationByKubelet: the Pod was terminated by graceful node shutdown, node pressure eviction or preemption for system critical pods.
In all other disruption scenarios, like eviction due to exceeding Pod container limits, Pods don't receive the DisruptionTarget condition because the disruptions were likely caused by the Pod and would reoccur on retry.
Example
The Pod failure policy snippet below demonstrates an example use:
podFailurePolicy:
rules:
- action: Ignore
onPodConditions:
- type: DisruptionTarget
- action: FailJob
onPodConditions:
- type: ConfigIssue
- action: FailJob
onExitCodes:
operator: In
values: [ 42 ]
In this example, the Pod failure policy does the following:
- Ignores any failed Pods that have the built-in DisruptionTarget condition. These Pods don't count towards Job backoff limits.
- Fails the Job if any failed Pods have the custom user-supplied ConfigIssue condition, which was added either by a custom controller or webhook.
- Fails the Job if any containers exited with the exit code 42.
- Counts all other Pod failures towards the default backoffLimit (or backoffLimitPerIndex if used).
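Putting the policy in context, a minimal complete Job using these rules might look like the sketch below; the image name and completion counts are placeholders:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy
spec:
  completions: 4
  parallelism: 2
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never # required when specifying a Pod failure policy
      containers:
      - name: main
        image: registry.example/my-batch-app:latest # placeholder image
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob
      onExitCodes:
        operator: In
        values: [42]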
Learn more
- For a hands-on guide to using Pod failure policy, see Handling retriable and non-retriable pod failures with Pod failure policy
- Read the documentation for Pod failure policy and Backoff limit per index
- Read the documentation for Pod disruption conditions
- Read the KEP for Pod failure policy
Related work
Based on the concepts introduced by Pod failure policy, the following additional work is in progress:
- JobSet integration: Configurable Failure Policy API
- Pod failure policy extension to add more granular failure reasons
- Support for Pod failure policy via JobSet in Kubeflow Training v2
- Proposal: Disrupted Pods should be removed from endpoints
Get involved
This work was sponsored by the batch working group in close collaboration with the SIG Apps, SIG Node, and SIG Scheduling communities.
If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.
Acknowledgments
I would love to thank everyone who was involved in this project over the years - it's been a journey and a joint community effort! The list below is my best-effort attempt to remember and recognize people who made an impact. Thank you!
- Aldo Culquicondor for guidance and reviews throughout the process
- Jordan Liggitt for KEP and API reviews
- David Eads for API reviews
- Maciej Szulik for KEP reviews from SIG Apps PoV
- Clayton Coleman for guidance and SIG Node reviews
- Sergey Kanzhelev for KEP reviews from SIG Node PoV
- Dawn Chen for KEP reviews from SIG Node PoV
- Daniel Smith for reviews from SIG API machinery PoV
- Antoine Pelisse for reviews from SIG API machinery PoV
- John Belamaric for PRR reviews
- Filip Křepinský for thorough reviews from SIG Apps PoV and bug-fixing
- David Porter for thorough reviews from SIG Node PoV
- Jensen Lo for early requirements discussions, testing and reporting issues
- Daniel Vega-Myhre for advancing JobSet integration and reporting issues
- Abdullah Gharaibeh for early design discussions and guidance
- Antonio Ojea for test reviews
- Yuki Iwai for reviews and aligning implementation of the closely related Job features
- Kevin Hannon for reviews and aligning implementation of the closely related Job features
- Tim Bannister for docs reviews
- Shannon Kularathna for docs reviews
- Paola Cortés for docs reviews
19 Aug 2024 12:00am GMT
16 Aug 2024
Kubernetes Blog
Kubernetes 1.31: Read Only Volumes Based On OCI Artifacts (alpha)
The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future. While the project was designed to serve microservice architectures in the past, it's now time to listen to the end users and introduce features which have a stronger focus on AI/ML.
One of these requirements is to support Open Container Initiative (OCI) compatible images and artifacts (referred to as OCI objects) directly as a native volume source. This allows users to focus on OCI standards as well as enabling them to store and distribute any content using OCI registries. A feature like this gives the Kubernetes project a chance to grow into use cases which go beyond running particular images.
Given that, the Kubernetes community is proud to present a new alpha feature introduced in v1.31: the Image Volume Source (KEP-4639). This feature allows users to specify an image reference as a volume in a pod while reusing it as a volume mount within containers:
…
kind: Pod
spec:
containers:
- …
volumeMounts:
- name: my-volume
mountPath: /path/to/directory
volumes:
- name: my-volume
image:
reference: my-image:tag
The above example would result in mounting my-image:tag to /path/to/directory in the pod's container.
Use cases
The goal of this enhancement is to stick as close as possible to the existing container image implementation within the kubelet, while introducing a new API surface to allow more extended use cases.
For example, users could share a configuration file among multiple containers in a pod without including the file in the main image, so that they can minimize security risks and the overall image size. They can also package and distribute binary artifacts using OCI images and mount them directly into Kubernetes pods, so that they can streamline their CI/CD pipeline as an example.
Data scientists, MLOps engineers, or AI developers, can mount large language model weights or machine learning model weights in a pod alongside a model-server, so that they can efficiently serve them without including them in the model-server container image. They can package these in an OCI object to take advantage of OCI distribution and ensure efficient model deployment. This allows them to separate the model specifications/content from the executables that process them.
Another use case is that security engineers can use a public image for a malware scanner and mount in a volume of private (commercial) malware signatures, so that they can load those signatures without baking their own combined image (which might not be allowed by the copyright on the public image). Those files work regardless of the OS or version of the scanner software.
But in the long term, it will be up to you as an end user of this project to outline further important use cases for the new feature. SIG Node is happy to receive feedback or suggestions for further enhancements that would allow more advanced usage scenarios. Feel free to provide feedback by either using the Kubernetes Slack (#sig-node) channel or the SIG Node mailing list.
Detailed example
The Kubernetes alpha feature gate ImageVolume needs to be enabled on the API server as well as the kubelet to make it functional. If that's the case, and the container runtime supports the feature (like CRI-O ≥ v1.31), then an example pod.yaml like this can be created:
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
  - name: test
    image: registry.k8s.io/e2e-test-images/echoserver:2.3
    volumeMounts:
    - name: volume
      mountPath: /volume
  volumes:
  - name: volume
    image:
      reference: quay.io/crio/artifact:v1
      pullPolicy: IfNotPresent
The pod declares a new volume using the image.reference of quay.io/crio/artifact:v1, which refers to an OCI object containing two files. The pullPolicy behaves in the same way as for container images and allows the following values:
- Always: the kubelet always attempts to pull the reference, and container creation will fail if the pull fails.
- Never: the kubelet never pulls the reference and only uses a local image or artifact. Container creation will fail if the reference isn't present.
- IfNotPresent: the kubelet pulls if the reference isn't already present on disk. Container creation will fail if the reference isn't present and the pull fails.
The volumeMounts field indicates that the container named test should mount the volume under the path /volume.
If you now create the pod:
kubectl apply -f pod.yaml
And exec into it:
kubectl exec -it pod -- sh
Then you're able to investigate what has been mounted:
/ # ls /volume
dir file
/ # cat /volume/file
2
/ # ls /volume/dir
file
/ # cat /volume/dir/file
1
You managed to consume an OCI artifact using Kubernetes!
The container runtime pulls the image (or artifact), mounts it to the container, and finally makes it available for direct usage. There are a number of details in the implementation which closely align with the existing image pull behavior of the kubelet. For example:
- If a :latest tag is provided as the reference, then the pullPolicy will default to Always, while in any other case it will default to IfNotPresent if unset.
- The volume gets re-resolved if the pod gets deleted and recreated, which means that new remote content will become available on pod recreation. A failure to resolve or pull the image during pod startup will block containers from starting and may add significant latency. Failures will be retried using normal volume backoff and will be reported on the pod reason and message.
- Pull secrets will be assembled in the same way as for the container image, by looking up node credentials, service account image pull secrets, and pod spec image pull secrets.
- The OCI object gets mounted in a single directory by merging the manifest layers, in the same way as for container images.
- The volume is mounted read-only (ro) and with non-executable files (noexec); see the quick check after this list.
- Sub-path mounts for containers are not supported (spec.containers[*].volumeMounts.subPath).
- The field spec.securityContext.fsGroupChangePolicy has no effect on this volume type.
- The feature will also work with the AlwaysPullImages admission plugin if enabled.
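To see the read-only behavior from the list above in practice, you can try to write into the mounted volume from inside the example pod; the exact error wording may vary by shell, but the write should be rejected:
/ # touch /volume/newfile
touch: /volume/newfile: Read-only file system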
Thank you for reading through to the end of this blog post! SIG Node is proud and happy to deliver this feature as part of Kubernetes v1.31.
As the writer of this blog post, I would like to extend my special thanks to all the individuals involved! You all rock, let's keep on hacking!
16 Aug 2024 12:00am GMT
Kubernetes 1.31: Prevent PersistentVolume Leaks When Deleting out of Order
PersistentVolumes (or PVs for short) are associated with a reclaim policy. The reclaim policy determines the actions that the storage backend needs to take when the PVC bound to a PV is deleted. When the reclaim policy is Delete, the expectation is that the storage backend releases the storage resource allocated for the PV. In essence, the reclaim policy needs to be honored on PV deletion.
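For dynamically provisioned volumes, the reclaim policy of a PV is typically inherited from its StorageClass. As a minimal sketch, reusing the class and provisioner names that appear in the example later in this post:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-vanilla-block-sc
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete   # PVs provisioned from this class carry this reclaim policy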
With the recent Kubernetes v1.31 release, a beta feature lets you configure your cluster to behave that way and honor the configured reclaim policy.
How did reclaim work in previous Kubernetes releases?
PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PV and a PVC are considered Bound when a newly created PV, or an existing matching PV, is assigned to the claim. The PVs themselves are backed by volumes allocated by the storage backend.
Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.
First, I'll demonstrate the behavior for clusters running an older version of Kubernetes.
Retrieve a PVC that is bound to a PV
Retrieve an existing PVC example-vanilla-block-pvc:
kubectl get pvc example-vanilla-block-pvc
The following output shows the PVC and its bound PV; the PV is shown under the VOLUME column:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
example-vanilla-block-pvc Bound pvc-6791fdd4-5fad-438e-a7fb-16410363e3da 5Gi RWO example-vanilla-block-sc 19s
Delete PV
When I try to delete a bound PV, the kubectl session blocks and the kubectl tool does not return control to the shell; for example:
kubectl delete pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
persistentvolume "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" deleted
^C
Retrieving the PV
kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
It can be observed that the PV is in a Terminating state:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-6791fdd4-5fad-438e-a7fb-16410363e3da 5Gi RWO Delete Terminating default/example-vanilla-block-pvc example-vanilla-block-sc 2m23s
Delete PVC
kubectl delete pvc example-vanilla-block-pvc
The following output is seen if the PVC gets successfully deleted:
persistentvolumeclaim "example-vanilla-block-pvc" deleted
The PV object from the cluster also gets deleted. When attempting to retrieve the PV, you will observe that it is no longer found:
kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
Error from server (NotFound): persistentvolumes "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" not found
Although the PV is deleted, the underlying storage resource is not deleted and needs to be removed manually.
To sum up, the reclaim policy associated with a PersistentVolume is currently ignored under certain circumstances. For a Bound PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.
PV reclaim policy with Kubernetes v1.31
The new behavior ensures that the underlying storage object is deleted from the backend when users attempt to delete a PV manually.
How to enable the new behavior?
To take advantage of the new behavior, you must have upgraded your cluster to the v1.31 release of Kubernetes and run the CSI external-provisioner version 5.0.1 or later.
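One way to confirm which external-provisioner version your CSI driver runs is to list the container images of its pods. The namespace below is an assumption (drivers are often deployed in kube-system or a vendor-specific namespace), and the sample output line is illustrative:
kubectl -n kube-system get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | grep csi-provisioner
registry.k8s.io/sig-storage/csi-provisioner:v5.0.1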
How does it work?
For CSI volumes, the new behavior is achieved by adding a finalizer external-provisioner.volume.kubernetes.io/finalizer on new and existing PVs. The finalizer is only removed after the storage from the backend is deleted.
Here is an example of a PV with the finalizer; notice the new finalizer in the finalizers list:
kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
  creationTimestamp: "2021-11-17T19:28:56Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  resourceVersion: "194711"
  uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: example-vanilla-block-pvc
    namespace: default
    resourceVersion: "194677"
    uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  csi:
    driver: csi.vsphere.vmware.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.vsphere.vmware.com
      type: vSphere CNS Block Volume
    volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
  persistentVolumeReclaimPolicy: Delete
  storageClassName: example-vanilla-block-sc
  volumeMode: Filesystem
status:
  phase: Bound
The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.
Similarly, the finalizer kubernetes.io/pv-controller is added to dynamically provisioned in-tree plugin volumes.
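To check which of these finalizers a given PV carries, a quick jsonpath query works; the PV name below is taken from the example output earlier:
kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o jsonpath='{.metadata.finalizers[*]}'
kubernetes.io/pv-protection external-provisioner.volume.kubernetes.io/finalizer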
What about CSI migrated volumes?
The fix applies to CSI migrated volumes as well.
Some caveats
The fix does not apply to statically provisioned in-tree plugin volumes.
How do I get involved?
The SIG Storage communication channels, including the Kubernetes Slack, are great mediums to reach out to the SIG Storage team and the migration working group.
Special thanks to the following people for the insightful reviews, thorough consideration and valuable contributions:
- Fan Baofa (carlory)
- Jan Šafránek (jsafrane)
- Xing Yang (xing-yang)
- Matthew Wong (wongma7)
Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We're rapidly growing and always welcome new contributors.
16 Aug 2024 12:00am GMT