Kueue - Bringing Order to Kubernetes Task Queues

Imagine this: your Kubernetes cluster is running at full capacity. Critical ML models are training, analysts have launched resource-intensive ETL processes, and dozens of CI/CD jobs are running in the background. All of them need GPU, CPU, and memory, and everyone wants resources "right now". Sound familiar? As a result, some tasks sit idle waiting for resources while others take resources away from higher-priority work, and the cluster operates inefficiently. This is a headache for anyone managing batch workloads in Kubernetes.

This is exactly the problem Kueue is designed to solve. Kueue is a project from kubernetes-sigs whose name is a play on the word "queue". It's not just another scheduler, but a full-fledged queue manager that integrates deeply with Kubernetes and lets you manage the lifecycle of your tasks efficiently.

What is Kueue and who needs it?

Kueue is a set of APIs and a controller that acts as an intelligent dispatcher for your Kubernetes tasks. Its main job is to decide when a task can be admitted for execution (i.e., when pods can be created for it) and when it might be worth stopping it (removing active pods) to free up resources for higher-priority tasks.

Who will benefit from this? First and foremost, teams that actively use Kubernetes for:

  • Machine learning and data processing: ML engineers and data scientists often launch numerous training jobs requiring large amounts of GPU and CPU. Kueue helps fairly distribute these expensive resources.
  • ETL processes: Data extraction, transformation, and loading tasks can be very resource-intensive and require careful planning.
  • CI/CD pipelines: Automated builds and tests, especially in large projects, can generate peak loads on the cluster.
  • Any other batch tasks: If you have background processes that run periodically and compete for resources, Kueue is your savior.

Essentially, Kueue allows you to transform a chaotic stream of tasks into an orderly, efficiently managed queue where resources are distributed according to your rules and priorities.

Key Kueue features that change the game

Kueue doesn't just put tasks in a queue — it offers a whole arsenal of tools for fine-tuning and optimization. Let's look at the most interesting ones.

1. Smart task and priority management

Forget about manual resource allocation or scripts that try to simulate a queue. Kueue provides flexible task management mechanisms:

  • Priorities: You can assign priorities to various tasks. For example, a task for training a critical model can have a higher priority than a nightly report.
  • Queue strategies: Kueue supports two main strategies:
    • StrictFIFO: Classic "first come, first served" queue. Simple and straightforward.
    • BestEffortFIFO: A more flexible approach that tries to start tasks as early as possible, even if they are not at the very front of the queue, provided there are free resources. This prevents cluster idle time when resources are available but the "head" of the queue is waiting for something very specific.
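The difference between the two strategies is easiest to see in a toy model. The sketch below is not Kueue's implementation: `Workload`, `admit`, and the CPU numbers are made up for illustration, and it ignores priorities and everything else a real admission cycle considers. It only shows how a blocked head of the queue is handled.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpus: int  # CPUs requested; a stand-in for any quota-managed resource


def admit(queue, free_cpus, strategy):
    """Return the names of workloads admitted in one pass over the queue.

    `queue` is ordered oldest-first. Toy model only: it does not sort by
    priority or re-evaluate as quota changes.
    """
    admitted = []
    for wl in queue:
        if wl.cpus <= free_cpus:
            free_cpus -= wl.cpus
            admitted.append(wl.name)
        elif strategy == "StrictFIFO":
            # The head does not fit, so everything behind it waits.
            break
        # BestEffortFIFO: skip the blocked workload and try the next one.
    return admitted


queue = [Workload("train-big", 8), Workload("report", 2), Workload("lint", 1)]
print(admit(queue, free_cpus=4, strategy="StrictFIFO"))      # []
print(admit(queue, free_cpus=4, strategy="BestEffortFIFO"))  # ['report', 'lint']
```

With only 4 CPUs free, the 8-CPU job at the head blocks the whole queue under StrictFIFO, while BestEffortFIFO lets the two small jobs behind it run.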

Imagine you have multiple teams, each launching their own tasks. Kueue allows you to define who gets access to the cluster and when, based on predefined rules.

2. Advanced resource management and fair distribution

This is arguably one of Kueue's most powerful aspects. It goes far beyond basic Kubernetes scheduling, offering:

  • Resource Flavor Fungibility: Let's say you have GPUs of different models (e.g., NVIDIA A100 and V100). Kueue can be configured so that a task requiring a GPU can use any of them if it's free, instead of waiting for a specific model. This maximizes hardware utilization.
  • Fair Sharing and Cohorts: If multiple teams or departments share one cluster, Kueue helps keep any one of them from monopolizing resources. You can group ClusterQueues into "cohorts" and give each a quota, so resources are split fairly and quota one team is not using can be borrowed by another. For example, the ML team might get 60% of resources and the analytics team 40%.
  • Preemption: In critical situations, Kueue can preempt (stop) lower-priority tasks to free up resources for more important ones. This is especially valuable when urgent tasks or recovery from failures are involved.
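To make the 60/40 example concrete, here is a small sketch of quota checks inside a cohort. It is a simplified model under stated assumptions, not Kueue's algorithm: the `ClusterQueue` class and `can_admit` function are hypothetical, there is a single resource (10 GPUs split 6/4), and borrowing limits and reclaiming borrowed quota through preemption are left out.

```python
from dataclasses import dataclass


@dataclass
class ClusterQueue:
    name: str
    nominal: int   # GPUs this queue is entitled to
    used: int = 0  # GPUs currently held by its admitted workloads


def can_admit(cq, cohort, request):
    """Decide whether `cq` can admit a workload asking for `request` GPUs.

    Returns "fits" when the request stays inside the queue's own quota,
    "borrows" when it needs quota that other queues in the cohort are not
    using, and "waits" otherwise.
    """
    if cq.used + request <= cq.nominal:
        return "fits"
    total_nominal = sum(q.nominal for q in cohort)
    total_used = sum(q.used for q in cohort)
    if total_used + request <= total_nominal:
        return "borrows"
    return "waits"


ml = ClusterQueue("ml-team", nominal=6, used=6)
analytics = ClusterQueue("analytics-team", nominal=4, used=1)
cohort = [ml, analytics]

print(can_admit(ml, cohort, 2))         # borrows
print(can_admit(ml, cohort, 4))         # waits
print(can_admit(analytics, cohort, 3))  # fits
```

The ML team has used up its own 6 GPUs, but the analytics team is using only 1 of its 4, so a 2-GPU job can run on borrowed quota while a 4-GPU job has to wait.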

3. Wide integration with popular task types

Kueue is not tied to any single task type. It has built-in support for many popular workloads, making it a versatile tool:

  • Standard Kubernetes Job (batch/v1): Of course, you can't do without it.
  • Kubeflow training jobs: Perfect for ML engineers using Kubeflow for model training.
  • RayJob and RayCluster: Support for Ray-based distributed computing.
  • JobSet: For managing groups of related jobs.
  • Plain Pod and Pod Groups: Even for simple pods and their groups.
  • Deployments and StatefulSets: Interestingly, Kueue can manage even serving workloads, allowing you to mix training and inference, dynamically allocating resources.

This means you won't have to reinvent the wheel for each task type — Kueue is ready to work with your stack out of the box.
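For the simplest of these integrations, a plain batch Job, opting in comes down to one label. The sketch below builds such a Job as a Python dict and prints it as JSON, which kubectl accepts as well as YAML. The `kueue.x-k8s.io/queue-name` label and the `spec.suspend` field are the real mechanism; the helper `queued_job`, the job name, the queue name `user-queue`, and the busybox image are placeholders chosen for illustration.

```python
import json


def queued_job(name, queue_name, image, command, cpu="1", memory="200Mi"):
    """Build a batch/v1 Job manifest that targets a Kueue LocalQueue."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # This label is what hands the Job over to Kueue.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            # Created suspended; Kueue unsuspends the Job on admission.
            "suspend": True,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "command": command,
                        # Resource requests are what gets counted against quota.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                }
            },
        },
    }


job = queued_job("sample-job", "user-queue", "busybox:1.36", ["sleep", "30"])
print(json.dumps(job, indent=2))
```

Assuming the script is saved as make_job.py (a hypothetical file name), its output can be piped straight into kubectl: `python make_job.py | kubectl create -f -`.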

4. Autoscaling and multi-clustering

In the modern world, clusters are rarely static, and sometimes aren't limited to a single geographic location. Kueue accounts for these realities:

  • Advanced autoscaling support: Kueue can integrate with Cluster Autoscaler through the ProvisioningRequest API, so new nodes are requested only when tasks in the queue truly need them.
  • MultiKueue for multi-cluster dispatching: This is fantastic! If you have multiple clusters (e.g., in different regions or clouds), MultiKueue lets you look for free capacity across them and dispatch tasks to a cluster that can run them. This adds flexibility and resilience and makes better use of capacity spread across clusters.
  • Topology-Aware Scheduling: Optimizing communication bandwidth between pods through scheduling that considers datacenter topology. This is critical for high-performance computing.

Technical details: under the hood of Kueue

Kueue is built as a native Kubernetes controller, which means deep integration with the ecosystem. It extends Kubernetes with its own Custom Resource Definitions (CRDs) for defining queues, quotas, and workloads, so you can manage it with standard kubectl commands.

The project is under active development under kubernetes-sigs (the GitHub organization for projects sponsored by Kubernetes Special Interest Groups), which is a good sign for alignment with Kubernetes conventions and for long-term maintenance. Currently, the API is at version v1beta2: stable enough to build on, although a beta API can still change. The team is actively working on transitioning to v1.

I was pleasantly surprised by the project's testing level: extensive unit, integration, and E2E tests for various Kubernetes versions (up to 1.35) and use cases, including MultiKueue and Topology Aware Scheduling. This instills confidence in the solution's reliability.

Additionally, Kueue provides Prometheus metrics, making it easy to monitor queue and resource states, and has detailed documentation to help you get up to speed quickly.

Practical application: what it looks like in real life

Let's look at how Kueue can change your workflow:

  1. ML platform: A data scientist submits a model training task. Instead of waiting for a specific GPU to become free, Kueue puts the task in a queue. When a suitable GPU becomes available (possibly after a lower-priority task completes or even after preemption), Kueue starts the training. If the cluster is overloaded, MultiKueue can automatically redirect the task to another, less loaded cluster.
  2. Big data processing: A nightly ETL process starts but finds that resources are limited due to daytime analytical queries. Kueue puts it in a queue, and when resources are freed (or lower-priority tasks are preempted), the process starts. In this case, Kueue can guarantee that no team will "eat up" all resources, ensuring fair distribution.
  3. CI/CD for microservices: A development team is actively committing code, launching dozens of builds and tests. Kueue manages these tasks, guaranteeing that critical builds (e.g., for production) get priority over test branches, and cluster resources are used as efficiently as possible, without idle time.
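The preemption step that appears in these scenarios can be sketched as a simple selection problem: which lower-priority workloads have to stop so that the new one fits? The code below is a toy model with hypothetical names (`Running`, `pick_victims`) and made-up priorities. Real Kueue preemption is governed by policies configured on the queues, so treat this as an illustration of the idea rather than its actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Running:
    name: str
    priority: int
    gpus: int


def pick_victims(running, free_gpus, incoming_priority, incoming_gpus):
    """Choose lower-priority workloads to stop so the incoming one fits.

    Returns a list of names to preempt ([] if it already fits), or None
    if even preempting every lower-priority workload would not be enough.
    """
    victims = []
    # Consider the lowest-priority workloads first.
    for wl in sorted(running, key=lambda w: w.priority):
        if free_gpus >= incoming_gpus:
            break
        if wl.priority >= incoming_priority:
            break  # never preempt equal or higher priority in this model
        victims.append(wl.name)
        free_gpus += wl.gpus
    return victims if free_gpus >= incoming_gpus else None


running = [
    Running("nightly-report", priority=10, gpus=2),
    Running("experiment", priority=50, gpus=2),
    Running("prod-training", priority=100, gpus=4),
]
print(pick_victims(running, free_gpus=0, incoming_priority=80, incoming_gpus=3))
# ['nightly-report', 'experiment']
print(pick_victims(running, free_gpus=0, incoming_priority=80, incoming_gpus=6))
# None
```

In the second call the only way to free 6 GPUs would be to stop the priority-100 training job, which the priority-80 newcomer is not allowed to do, so it stays in the queue.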

Installing Kueue is quite simple and requires Kubernetes 1.29 or newer. Just one kubectl apply command:

kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.15.2/manifests.yaml

After that, you can configure queues and launch your tasks using examples from the documentation.
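As a rough idea of what "configuring queues" involves, the sketch below generates the three objects a minimal setup needs: a ResourceFlavor, a ClusterQueue holding the quota, and a namespaced LocalQueue that jobs point at. It is modeled on a basic single-queue setup and uses v1beta1 field names. Since the API is now at v1beta2, treat the `apiVersion` and the exact field names as assumptions to check against the reference for the version you installed. The object names and quota values are placeholders.

```python
import json

# Assumption: adjust to the API version your Kueue installation serves.
API_VERSION = "kueue.x-k8s.io/v1beta1"


def queue_setup(namespace="default", cpu=9, memory="36Gi"):
    """Return ResourceFlavor, ClusterQueue and LocalQueue manifests."""
    flavor = {
        "apiVersion": API_VERSION,
        "kind": "ResourceFlavor",
        "metadata": {"name": "default-flavor"},
    }
    cluster_queue = {
        "apiVersion": API_VERSION,
        "kind": "ClusterQueue",
        "metadata": {"name": "cluster-queue"},
        "spec": {
            "namespaceSelector": {},  # empty selector: accept all namespaces
            "queueingStrategy": "BestEffortFIFO",
            "resourceGroups": [{
                "coveredResources": ["cpu", "memory"],
                "flavors": [{
                    "name": "default-flavor",
                    "resources": [
                        {"name": "cpu", "nominalQuota": cpu},
                        {"name": "memory", "nominalQuota": memory},
                    ],
                }],
            }],
        },
    }
    local_queue = {
        "apiVersion": API_VERSION,
        "kind": "LocalQueue",
        "metadata": {"name": "user-queue", "namespace": namespace},
        "spec": {"clusterQueue": "cluster-queue"},
    }
    return [flavor, cluster_queue, local_queue]


# A v1 List lets kubectl apply all three objects from one JSON document.
manifest = {"apiVersion": "v1", "kind": "List", "items": queue_setup()}
print(json.dumps(manifest, indent=2))
```

Assuming the script is saved as queues.py (a hypothetical file name), `python queues.py | kubectl apply -f -` would create the queues, after which jobs can target `user-queue`.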

Is Kueue worth trying?

Definitely yes, if you're facing batch task management problems in Kubernetes. Kueue is not just a tool — it's a whole philosophy of efficient resource utilization and fair load distribution.

It's especially suitable for:

  • Cluster administrators and SRE engineers: For bringing order, optimizing resource utilization, and ensuring stability.
  • MLOps engineers and Data Scientists: For efficiently managing training tasks, inference, and experiments.
  • Developers using Kubernetes for CI/CD or background tasks: For speeding up processes and reducing infrastructure costs.

Kueue is a mature, well-tested, and actively developed project with a strong community. It's already used in production by many companies, which is the best proof of its reliability and practical value. If you want to get the most out of your Kubernetes cluster and forget about task chaos, give Kueue a chance. It won't disappoint you!

Check out the Kueue documentation and join the community on Slack to learn more and start using this powerful tool today.
