28 Nov 2023
Blog: New Experimental Features in Gateway API v1.0
Authors: Candace Holman (Red Hat), Dave Protasowski (VMware), Gaurav K Ghildiyal (Google), John Howard (Google), Simone Rodigari (IBM)
Recently, the Gateway API announced its v1.0 GA release, marking a huge milestone for the project.
Along with stabilizing some of the core functionality in the API, a number of exciting new experimental features have been added.
Backend TLS Policy
BackendTLSPolicy is a new Gateway API type used for specifying the TLS configuration of the connection from the Gateway to backend Pods via the Service API object. It is specified as a Direct PolicyAttachment without defaults or overrides, applied to a Service that accesses a backend, where the BackendTLSPolicy resides in the same namespace as the Service to which it is applied. All Gateway API Routes that point to a referenced Service should respect a configured BackendTLSPolicy.
While there were existing ways provided for TLS to be configured for edge and passthrough termination, this new API object specifically addresses the configuration of TLS in order to convey HTTPS from the Gateway dataplane to the backend. This is referred to as "backend TLS termination" and enables the Gateway to know how to connect to a backend Pod that has its own certificate.
The specification of a BackendTLSPolicy consists of:
- targetRef - Defines the targeted API object of the policy. Only Service is allowed.
- tls - Defines the configuration for TLS, including hostname, caCertRefs, and wellKnownCACerts. Either caCertRefs or wellKnownCACerts may be specified, but not both.
- hostname - Defines the Server Name Indication (SNI) that the Gateway uses to connect to the backend. The certificate served by the backend must match this SNI.
- caCertRefs - Defines one or more references to objects that contain PEM-encoded TLS certificates, which are used to establish a TLS handshake between the Gateway and backend.
- wellKnownCACerts - Specifies whether or not system CA certificates may be used in the TLS handshake between the Gateway and backend.
Examples
Using System Certificates
In this example, the BackendTLSPolicy is configured to use system certificates to connect with a TLS-encrypted upstream connection where Pods backing the dev Service are expected to serve a valid certificate for dev.example.com.
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-dev
spec:
  targetRef:
    kind: Service
    name: dev-service
    group: ""
  tls:
    wellKnownCACerts: "System"
    hostname: dev.example.com
Using Explicit CA Certificates
In this example, the BackendTLSPolicy is configured to use certificates defined in the configuration map auth-cert to connect with a TLS-encrypted upstream connection where Pods backing the auth Service are expected to serve a valid certificate for auth.example.com.
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-auth
spec:
  targetRef:
    kind: Service
    name: auth-service
    group: ""
  tls:
    caCertRefs:
    - kind: ConfigMapReference
      name: auth-cert
      group: ""
    hostname: auth.example.com
The following illustrates a BackendTLSPolicy that configures TLS for a Service serving a backend:
Route"] style httproute fill:#02f,color:#fff service["Service"] style service fill:#02f,color:#fff pod1["Pod"] style pod1 fill:#02f,color:#fff pod2["Pod"] style pod2 fill:#02f,color:#fff client -.->|HTTP
request| gateway gateway --> httproute httproute -.->|BackendTLSPolicy|service service --> pod1 & pod2
For more information, refer to the documentation for TLS.
HTTPRoute Timeouts
A key enhancement in Gateway API's latest release (v1.0) is the introduction of the timeouts field within HTTPRoute Rules. This feature offers a dynamic way to manage timeouts for incoming HTTP requests, adding precision and reliability to your gateway setups.
With Timeouts, developers can fine-tune their Gateway API's behavior in two fundamental ways:
- Request Timeout: The request timeout is the duration within which the Gateway API implementation must send a response to a client's HTTP request. It allows flexibility in specifying when this timeout starts, either before or after the entire client request stream is received, making it implementation-specific. This timeout efficiently covers the entire request-response transaction, enhancing the responsiveness of your services.
- Backend Request Timeout: The backendRequest timeout is a game-changer for those dealing with backends. It sets a timeout for a single request sent from the Gateway to a backend service. This timeout spans from the initiation of the request to the reception of the full response from the backend. This feature is particularly helpful in scenarios where the Gateway needs to retry connections to a backend, ensuring smooth communication under various conditions.
Notably, the request timeout encompasses the backendRequest timeout. Hence, the value of backendRequest should never exceed the value of the request timeout.
The ability to configure these timeouts adds a new layer of reliability to your Kubernetes services. Whether it's ensuring client requests are processed within a specified timeframe or managing backend service communications, Gateway API's Timeouts offer the control and predictability you need.
To get started, you can define timeouts in your HTTPRoute Rules using the Timeouts field, specifying their type as Duration. A zero-valued timeout (0s) disables the timeout, while a valid non-zero-valued timeout should be at least 1ms.
Here's an example of setting request and backendRequest timeouts in an HTTPRoute:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: timeout-example
spec:
  parentRefs:
  - name: example-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /timeout
    timeouts:
      request: 10s
      backendRequest: 2s
    backendRefs:
    - name: timeout-svc
      port: 8080
In this example, a request timeout of 10 seconds is defined, ensuring that client requests are processed within that timeframe. Additionally, a 2-second backendRequest timeout is set for individual requests from the Gateway to a backend service called timeout-svc.
These new HTTPRoute Timeouts provide Kubernetes users with more control and flexibility in managing network communications, helping ensure a smoother and more predictable experience for both clients and backends. For additional details and examples, refer to the official timeouts API documentation.
Gateway Infrastructure Labels
While Gateway API provides a common API for different implementations, each implementation will have different resources created under the hood to apply users' intent. This could be configuring cloud load balancers, creating in-cluster Pods and Services, or more.
While the API has always provided an extension point -- parametersRef in GatewayClass -- to customize implementation-specific things, there was no portable way in the core API to express common infrastructure customizations.
Gateway API v1.0 paves the way for this with a new infrastructure field on the Gateway object, allowing customization of the underlying infrastructure. For now, this starts small with two critical fields: labels and annotations. When these are set, any generated infrastructure will have the provided labels and annotations set on them.
For example, I may want to group all my resources for one application together:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: hello-world
spec:
  infrastructure:
    labels:
      app.kubernetes.io/name: hello-world
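Annotations can be set the same way. As a minimal sketch (the annotation key below is hypothetical; substitute whatever key your implementation or cloud provider understands), this Gateway asks for generated infrastructure to carry an annotation alongside the label:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: hello-world
spec:
  infrastructure:
    labels:
      app.kubernetes.io/name: hello-world
    annotations:
      # Hypothetical key, for illustration only; use one that your
      # implementation or cloud provider acts on.
      example.com/cost-center: team-hello-world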
In the future, we are looking into more common infrastructure configurations, such as resource sizing.
For more information, refer to the documentation for this feature.
Support for Websockets, HTTP/2 and more!
Not all implementations of Gateway API support automatic protocol selection. In some cases protocols are disabled without an explicit opt-in.
When a Route's backend references a Kubernetes Service, application developers can specify the protocol using the ServicePort appProtocol field.
For example, the following store Kubernetes Service indicates that port 8080 supports HTTP/2 Prior Knowledge.
apiVersion: v1
kind: Service
metadata:
  name: store
spec:
  selector:
    app: store
  ports:
  - protocol: TCP
    appProtocol: kubernetes.io/h2c
    port: 8080
    targetPort: 8080
Currently, Gateway API has conformance testing for:
- kubernetes.io/h2c - HTTP/2 Prior Knowledge
- kubernetes.io/ws - WebSocket over HTTP
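For comparison, a sketch of a Service advertising WebSocket support looks almost identical; the chat Service name and its ports are hypothetical, chosen only to mirror the conformance entry above:

apiVersion: v1
kind: Service
metadata:
  name: chat                          # hypothetical Service, for illustration only
spec:
  selector:
    app: chat
  ports:
  - protocol: TCP
    appProtocol: kubernetes.io/ws     # WebSocket over HTTP, per the list above
    port: 80
    targetPort: 8080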
For more information, refer to the documentation for Backend Protocol Selection.
gwctl, our new Gateway API command line tool
gwctl is a command line tool that aims to be a kubectl replacement for viewing Gateway API resources.
The initial release of gwctl that comes bundled with the Gateway API v1.0 release includes helpful features for managing Gateway API Policies. Gateway API Policies serve as powerful extension mechanisms for modifying the behavior of Gateway resources. One challenge with using policies is that it may be hard to discover which policies are affecting which Gateway resources. gwctl helps bridge this gap by answering questions like:
- Which policies are available for use in the Kubernetes cluster?
- Which policies are attached to a particular Gateway, HTTPRoute, etc?
- If policies are applied to multiple resources in the Gateway resource hierarchy, what is the effective policy that is affecting a particular resource? (For example, if an HTTP request timeout policy is applied to both an HTTPRoute and its parent Gateway, what is the effective timeout for the HTTPRoute?)
gwctl is still in the very early phases of development and hence may be a bit rough around the edges. Follow the instructions in the repository to install and try out gwctl.
Examples
Here are some examples of how gwctl can be used:
# List all policies in the cluster. This will also give the resource they bind
# to.
gwctl get policies -A
# List all available policy types.
gwctl get policycrds
# Describe all HTTPRoutes in namespace ns2. (Output includes effective policies)
gwctl describe httproutes -n ns2
# Describe a single HTTPRoute in the default namespace. (Output includes
# effective policies)
gwctl describe httproutes my-httproute-1
# Describe all Gateways across all namespaces. (Output includes effective
# policies)
gwctl describe gateways -A
# Describe a single GatewayClass. (Output includes effective policies)
gwctl describe gatewayclasses foo-com-external-gateway-class
Get involved
These projects, and many more, continue to be improved in Gateway API. There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both Ingress and Mesh.
If this is interesting to you, please join us in the community and help us build the future of Gateway API together!
24 Nov 2023
Blog: Spotlight on SIG Testing
Author: Sandipan Panda
Welcome to another edition of the SIG spotlight blog series, where we highlight the incredible work being done by various Special Interest Groups (SIGs) within the Kubernetes project. In this edition, we turn our attention to SIG Testing, a group interested in effective testing of Kubernetes and automating away project toil. SIG Testing focuses on creating and running tools and infrastructure that make it easier for the community to write and run tests, and to contribute, analyze, and act upon test results.
To gain some insights into SIG Testing, Sandipan Panda spoke with Michelle Shepardson, a senior software engineer at Google and a chair of SIG Testing, and Patrick Ohly, a software engineer and architect at Intel and a SIG Testing Tech Lead.
Meet the contributors
Sandipan: Could you tell us a bit about yourself, your role, and how you got involved in the Kubernetes project and SIG Testing?
Michelle: Hi! I'm Michelle, a senior software engineer at Google. I first got involved in Kubernetes through working on tooling for SIG Testing, like the external instance of TestGrid. I'm part of oncall for TestGrid and Prow, and am now a chair for the SIG.
Patrick: Hello! I work as a software engineer and architect in a team at Intel which focuses on open source Cloud Native projects. When I ramped up on Kubernetes to develop a storage driver, my very first question was "how do I test it in a cluster and how do I log information?" That interest led to various enhancement proposals until I had (re)written enough code that also took over official roles as SIG Testing Tech Lead (for the E2E framework) and structured logging WG lead.
Testing practices and tools
Sandipan: Testing is a field in which multiple approaches and tools exist; how did you arrive at the existing practices?
Patrick: I can't speak about the early days because I wasn't around yet 😆, but looking back at some of the commit history it's pretty obvious that developers just took what was available and started using it. For E2E testing, that was Ginkgo+Gomega. Some hacks were necessary, for example around cleanup after a test run and for categorising tests. Eventually this led to Ginkgo v2 and revised best practices for E2E testing. Regarding unit testing, opinions are pretty diverse: some maintainers prefer to use just the Go standard library with hand-written checks. Others use helper packages like stretchr/testify. That diversity is okay because unit tests are self-contained - contributors just have to be flexible when working on many different areas. Integration testing falls somewhere in the middle. It's based on Go unit tests, but needs complex helper packages to bring up an apiserver and other components, then runs tests that are more like E2E tests.
Subprojects owned by SIG Testing
Sandipan: SIG Testing is pretty diverse. Can you give a brief overview of the various subprojects owned by SIG Testing?
Michelle: Broadly, we have subprojects related to testing frameworks, and infrastructure, though they definitely overlap. So for the former, there's e2e-framework (used externally), test/e2e/framework (used for Kubernetes itself) and kubetest2 for end-to-end testing, as well as boskos (resource rental for e2e tests), KIND (Kubernetes-in-Docker, for local testing and development), and the cloud provider for KIND. For the latter, there's Prow (K8s-based CI/CD and chatops), and a litany of other tools and utilities for triage, analysis, coverage, Prow/TestGrid config generation, and more in the test-infra repo.
If you are willing to learn more and get involved with any of the SIG Testing subprojects, check out the SIG Testing README.
Key challenges and accomplishments
Sandipan: What are some of the key challenges you face?
Michelle: Kubernetes is a gigantic project in every aspect, from contributors to code to users and more. Testing and infrastructure have to meet that scale, keeping up with every change from every repo under Kubernetes while facilitating developing, improving, and releasing the project as much as possible, though of course, we're not the only SIG involved in that. I think another challenge is staffing subprojects. SIG Testing has a number of subprojects that have existed for years, but many of the original maintainers for them have moved on to other areas or no longer have the time to maintain them. We need to grow long-term expertise and owners in those subprojects.
Patrick: As Michelle said, the sheer size can be a challenge. It's not just the infrastructure, also our processes must scale with the number of contributors. It's good to document best practices, but not good enough: we have many new contributors, which is good, but having reviewers explain best practices doesn't scale - assuming that the reviewers even know about them! It also doesn't help that existing code cannot get updated immediately because there is so much of it, in particular for E2E testing. The initiative to apply stricter linting to new or modified code while accepting that existing code doesn't pass those same linter checks helps a bit.
Sandipan: Any SIG accomplishments that you are proud of and would like to highlight?
Patrick: I am biased because I have been driving this, but I think that the E2E framework and linting are now in a much better shape than they used to be. We may soon be able to run integration tests with race detection enabled, which is important because we currently only have that for unit tests and those tend to be less complex.
Sandipan: Testing is always important, but is there anything specific to your work in terms of the Kubernetes release process?
Patrick: test flakes… if we have too many of those, development velocity goes down because PRs cannot be merged without clean test runs and those become less likely. Developers also lose trust in testing and just "retest" until they have a clean run, without checking whether failures might indeed be related to a regression in their current change.
The people and the scope
Sandipan: What are some of your favourite things about this SIG?
Michelle: The people, of course 🙂. Aside from that, I like the broad scope SIG Testing has. I feel like even small changes can make a big difference for fellow contributors, and even if my interests change over time, I'll never run out of projects to work on.
Patrick: I can work on things that make my life and the life of my fellow developers better, like the tooling that we have to use every day while working on some new feature elsewhere.
Sandipan: Are there any funny / cool / TIL anecdotes that you could tell us?
Patrick: I started working on E2E framework enhancements five years ago, then was less active there for a while. When I came back and wanted to test some new enhancement, I asked about how to write unit tests for the new code and was pointed to some existing tests which looked vaguely familiar, as if I had seen them before. I looked at the commit history and found that I had written them! I'll let you decide whether that says something about my failing long-term memory or simply is normal… Anyway, folks, remember to write good commit messages and comments; someone will need them at some point - it might even be yourself!
Looking ahead
Sandipan: What areas and/or subprojects does your SIG need help with?
Michelle: Some subprojects aren't staffed at the moment and could use folks willing to learn more about them. boskos and kubetest2 especially stand out to me, since both are important for testing but lack dedicated owners.
Sandipan: Are there any useful skills that new contributors to SIG Testing can bring to the table? What are some things that people can do to help this SIG if they come from a background that isn't directly linked to programming?
Michelle: I think user empathy, writing clear feedback, and recognizing patterns are really useful. Someone who uses the test framework or tooling and can outline pain points with clear examples, or who can recognize a wider issue in the project and pull data to inform solutions for it.
Sandipan: What's next for SIG Testing?
Patrick: Stricter linting will soon become mandatory for new code. There are several E2E framework sub-packages that could be modernised, if someone wants to take on that work. I also see an opportunity to unify some of our helper code for E2E and integration testing, but that needs more thought and discussion.
Michelle: I'm looking forward to making some usability improvements for some of our tools and infra, and to supporting more long-term contributions and growth of contributors into long-term roles within the SIG. If you're interested, hit us up!
Looking ahead, SIG Testing has exciting plans in store. You can get in touch with the folks at SIG Testing in their Slack channel or attend one of their regular bi-weekly meetings on Tuesdays. If you are interested in making it easier for the community to run tests and contribute test results, to ensure Kubernetes is stable across a variety of cluster configurations and cloud providers, join the SIG Testing community today!
16 Nov 2023
Blog: The Case for Kubernetes Resource Limits: Predictability vs. Efficiency
Author: Milan Plžík (Grafana Labs)
There's been quite a lot of posts suggesting that not using Kubernetes resource limits might be a fairly useful thing (for example, For the Love of God, Stop Using CPU Limits on Kubernetes or Kubernetes: Make your services faster by removing CPU limits ). The points made there are totally valid - it doesn't make much sense to pay for compute power that will not be used due to limits, nor to artificially increase latency. This post strives to argue that limits have their legitimate use as well.
As a Site Reliability Engineer on the Grafana Labs platform team, which maintains and improves internal infrastructure and tooling used by the product teams, I primarily try to make Kubernetes upgrades as smooth as possible. But I also spend a lot of time going down the rabbit hole of various interesting Kubernetes issues. This article reflects my personal opinion, and others in the community may disagree.
Let's flip the problem upside down. Every pod in a Kubernetes cluster has inherent resource limits - the actual CPU, memory, and other resources of the machine it's running on. If those physical limits are reached by a pod, it will experience throttling similar to what is caused by reaching Kubernetes limits.
The problem
Pods without (or with generous) limits can easily consume the extra resources on the node. This, however, has a hidden cost - the amount of extra resources available often heavily depends on pods scheduled on the particular node and their actual load. These extra resources make each pod a special snowflake when it comes to real resource allocation. Even worse, it's fairly hard to figure out the resources that the pod had at its disposal at any given moment - certainly not without unwieldy data mining of pods running on a particular node, their resource consumption, and similar. And finally, even if we pass this obstacle, we can only have data sampled up to a certain rate and get profiles only for a certain fraction of our calls. This can be scaled up, but the amount of observability data generated might easily reach diminishing returns. Thus, there's no easy way to tell if a pod had a quick spike and for a short period of time used twice as much memory as usual to handle a request burst.
Now, with Black Friday and Cyber Monday approaching, businesses expect a surge in traffic. Good performance data/benchmarks of the past performance allow businesses to plan for some extra capacity. But is data about pods without limits reliable? With memory or CPU instant spikes handled by the extra resources, everything might look good according to past data. But once the pod bin-packing changes and the extra resources get more scarce, everything might start looking different - ranging from request latencies rising negligibly to requests slowly snowballing and causing pod OOM kills. While almost no one actually cares about the former, the latter is a serious issue that requires instant capacity increase.
Configuring the limits
Not using limits takes a tradeoff - it opportunistically improves the performance if there are extra resources available, but lowers predictability of the performance, which might strike back in the future. There are a few approaches that can be used to increase the predictability again. Let's pick two of them to analyze:
- Configure workload limits to be a fixed (and small) percentage more than the requests - I'll call it fixed-fraction headroom. This allows the use of some extra shared resources, but keeps the per-node overcommit bound and can be taken to guide worst-case estimates for the workload. Note that the bigger the limits percentage is, the bigger the variance in the performance that might happen across the workloads.
- Configure workloads with requests = limits. From some point of view, this is equivalent to giving each pod its own tiny machine with constrained resources; the performance is fairly predictable. This also puts the pod into the Guaranteed QoS class, which makes it get evicted only after BestEffort and Burstable pods have been evicted by a node under resource pressure (see Quality of Service for Pods). Both approaches are sketched in the manifest fragments after this list.
Some other cases might also be considered, but these are probably the two simplest ones to discuss.
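To make the two options concrete, here is a minimal sketch of each (the Pod names, container, image, and numbers are hypothetical, chosen only to illustrate the shape of the configuration):

# Variant 1: fixed-fraction headroom (limits = requests + 20%)
apiVersion: v1
kind: Pod
metadata:
  name: headroom-example        # hypothetical name
spec:
  containers:
  - name: app
    image: example.com/app:latest   # hypothetical image
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 600m               # 500m * 1.2
        memory: 307Mi           # ~256Mi * 1.2

# Variant 2: requests = limits (puts the Pod in the Guaranteed QoS class)
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example      # hypothetical name
spec:
  containers:
  - name: app
    image: example.com/app:latest
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 256Mi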
Cluster resource economy
Note that in both cases discussed above, we're effectively preventing the workloads from using some cluster resources they could otherwise have, at the cost of getting more predictability - which might sound like a steep price to pay for a bit more stable performance. Let's try to quantify the impact there.
Bin-packing and cluster resource allocation
Firstly, let's discuss bin-packing and cluster resource allocation. There's some inherent cluster inefficiency that comes into play - it's hard to achieve 100% resource allocation in a Kubernetes cluster. Thus, some percentage will be left unallocated.
When configuring fixed-fraction headroom limits, a proportional amount of this will be available to the pods. If the percentage of unallocated resources in the cluster is lower than the constant we use for setting fixed-fraction headroom limits (see the figure, line 2), all the pods together are able to theoretically use up all the node's resources; otherwise there are some resources that will inevitably be wasted (see the figure, line 1). In order to eliminate the inevitable resource waste, the percentage for fixed-fraction headroom limits should be configured so that it's at least equal to the expected percentage of unallocated resources.
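As a rough, illustrative calculation (the numbers are assumptions, not measurements): suppose pod requests add up to 90% of a node, leaving 10% unallocated. With a 15% fixed-fraction headroom, the limits add up to 0.90 × 1.15 ≈ 103% of the node, so the pods can collectively reach everything the node has. With only a 5% headroom, the limits add up to 0.90 × 1.05 ≈ 94.5%, and roughly 5.5% of the node can never be used by these pods, no matter how idle it is.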
For requests = limits (see the figure, line 3), this does not hold: unless we're able to allocate all of a node's resources, some resources will inevitably be wasted. Without any knobs to turn on the requests/limits side, the only suitable approach here is to ensure efficient bin-packing on the nodes by configuring correct machine profiles. This can be done either manually or by using a variety of cloud service provider tooling - for example Karpenter for EKS or GKE Node auto provisioning.
Optimizing actual resource utilization
Free resources also come in the form of unused resources of other pods (reserved vs. actual CPU utilization, etc.), and their availability can't be predicted in any reasonable way. Configuring limits makes it next to impossible to utilize these. Looking at this from a different perspective, if a workload wastes a significant amount of the resources it has requested, revisiting its resource requests might be a fair thing to do. Looking at past data and picking more fitting resource requests might help make the packing tighter (although at the price of worsening its performance - for example, increasing long-tail latencies).
Conclusion
Optimizing resource requests and limits is hard. Although it's much easier to break things when setting limits, those breakages might help prevent a catastrophe later by giving more insight into how the workload behaves in boundary conditions. There are cases where setting limits makes less sense: batch workloads (which are not latency-sensitive - for example non-live video encoding), best-effort services (which don't need that level of availability and can be preempted), and clusters that have a lot of spare resources by design (various cases of specialty workloads - for example services that handle spikes by design).
On the other hand, setting limits shouldn't be avoided at all costs - even though figuring out the "right" value for limits is harder and configuring a wrong value yields less forgiving situations. Configuring limits helps you learn about a workload's behavior in corner cases, and there are simple strategies that can help when reasoning about the right value. It's a tradeoff between efficient resource usage and performance predictability and should be considered as such.
There's also an economic aspect of workloads with spiky resource usage. Having "freebie" resources always at hand does not serve as an incentive to improve performance for the product team. Big enough spikes might easily trigger efficiency issues or even problems when trying to defend a product's SLA - and thus, might be a good candidate to mention when assessing any risks.