kubeflow/trainer

Daily Information Dashboard · 2026-02-28

- Category: Open Source Project
- Source: github_search
- Score: 1
- Published: 2026-02-28T01:49:26Z

AI Summary

Kubeflow Trainer is a Kubernetes-native distributed AI training platform that supports multiple frameworks and multi-node, multi-GPU scheduling. Version 2.1 introduces a distributed data cache and topology-aware scheduling to improve the efficiency and resource utilization of large-scale LLM training.
#GitHub #repo #OpenSourceProject #Kubeflow #Kubernetes #PyTorch #MPI

Content Excerpt

Kubeflow Trainer


![Kubeflow Trainer logo](./docs/images/trainer-logo.svg)

Latest News 🔥

- [2025/11] Kubeflow Trainer v2.1 is officially released with support for the Distributed Data Cache, topology-aware scheduling with Kueue and Volcano, and LLM post-training enhancements. Check out the GitHub release notes.
- [2025/09] Kubeflow SDK v0.1 is officially released with support for CustomTrainer, BuiltinTrainer, and local PyTorch execution. Check out the GitHub release notes.
- [2025/07] PyTorch on Kubernetes: Kubeflow Trainer Joins the PyTorch Ecosystem. Find the announcement in the PyTorch blog post.

<details>
<summary>More</summary>

- [2025/07] Kubeflow Trainer v2.0 has been officially released. Check out the blog post announcement and the release notes.
- [2025/04] From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob. See the KubeCon + CloudNativeCon London talk.

</details>
Overview

Kubeflow Trainer is a Kubernetes-native distributed AI platform for scalable training and fine-tuning of large language models (LLMs) and other AI models across a wide range of frameworks, including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and more.

Kubeflow Trainer brings MPI to Kubernetes, orchestrating multi-node, multi-GPU distributed
jobs efficiently across high-performance computing (HPC) clusters. This enables high-throughput
communication between processes, making it ideal for large-scale AI training that requires
ultra-fast synchronization between GPU nodes.

Kubeflow Trainer seamlessly integrates with the Cloud Native AI ecosystem, including
Kueue for topology-aware scheduling and
multi-cluster job dispatching, as well as JobSet and
LeaderWorkerSet for AI workload orchestration.

Kubeflow Trainer provides a distributed data cache designed to stream large-scale data with zero-copy
transfer directly to GPU nodes. This ensures memory-efficient training jobs while maximizing
GPU utilization.

With the Kubeflow Python SDK, AI practitioners can effortlessly
develop and fine-tune LLMs while leveraging the Kubeflow Trainer APIs: TrainJob and Runtimes.
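
As a rough sketch of that workflow, the snippet below shows how a TrainJob might be submitted from the Python SDK using the CustomTrainer mentioned in the news above. The client and method names (`TrainerClient`, `get_runtime`, `train`), the parameter names (`num_nodes`, `resources_per_node`), and the `torch-distributed` runtime name are assumptions for illustration; consult the SDK reference for the exact API.

```python
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_fn():
    # Runs on every node of the TrainJob; the runtime is expected to inject
    # the usual torch.distributed environment (RANK, WORLD_SIZE, MASTER_ADDR).
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()


client = TrainerClient()

# Submit a two-node TrainJob against a PyTorch runtime registered in the
# cluster. The runtime name and resource keys are placeholders and may not
# match your installation.
job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=train_fn,
        num_nodes=2,
        resources_per_node={"cpu": 2, "memory": "4Gi"},
    ),
)
print(f"Submitted TrainJob: {job_name}")
```

If the call succeeds, the resulting TrainJob should be visible in the cluster (for example via `kubectl get trainjobs`), with the pod template defined by the selected runtime replicated across the requested nodes.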

![Kubeflow Trainer tech stack](./docs/images/trainer-tech-stack.drawio.svg)
Kubeflow Trainer Introduction

Check out the following KubeCon + CloudNativeCon talks for an overview of Kubeflow Trainer capabilities:

- Kubeflow Trainer

Additional talks:

- From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob
- Streamline LLM Fine-tuning on Kubernetes With Kubeflow LLM Trainer
Getting Started

Please check the official Kubeflow Trainer documentation
to install and get started with Kubeflow Trainer.
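
After installation, a quick smoke test from Python can confirm that the control plane and its training runtimes are reachable. This is only a sketch: it assumes the SDK's `TrainerClient` exposes a `list_runtimes()` helper and that runtimes carry a `name` attribute; check the SDK reference for the exact API.

```python
from kubeflow.trainer import TrainerClient

# Connect using the local kubeconfig and print the training runtimes that the
# Kubeflow Trainer manifests registered in the cluster.
# Both list_runtimes() and the .name attribute are assumptions for illustration.
client = TrainerClient()
for runtime in client.list_runtimes():
    print(runtime.name)
```
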
Community

The following links provide information on how to get involved in the community:

- Join our #kubeflow-trainer Slack channel.
- Attend the bi-weekly AutoML and Training Working Group community meeting.
- Check out who is using Kubeflow Trainer.
Contributing

Please refer to the CONTRIBUTING guide.
Changelog

Please refer to the CHANGELOG.
Kubeflow Training Operator V1

The Kubeflow Trainer project is currently in <strong>alpha</strong> status, and APIs may change.
If you are using Kubeflow Training Operator V1, please refer to this migration document.

The Kubeflow community will maintain the Training Operator V1 source code in the release-1.9 branch.

You can find the documentation for Kubeflow Training Operator V1 in these guides.
Acknowledgement

This project originally started as a distributed training operator for TensorFlow; we later merged efforts from the other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators:
- PyTorch Operator: list of contributors and maintainers.
- MPI Operator: list of contributors and maintainers.
- XGBoost Operator: list of contributors and maintainers.
- Common library: list of contributors and maintainers.