A high-performance AI training infrastructure deployment solution for Kubernetes, optimized for NVIDIA A100/A800 GPU clusters with InfiniBand networking.
- 🚀 High Performance: Optimized for NVIDIA A100/A800 GPU clusters
- 🌐 Advanced Networking: InfiniBand support with RDMA
- 📊 Comprehensive Monitoring: GPU and network metrics tracking
- 🔄 Automated Deployment: Streamlined setup process
- 🛡️ Production Ready: Enterprise-grade security and stability
```mermaid
graph TB
    subgraph "Physical Network"
        B[Bond4]
        IB[InfiniBand Network]
        lan0[LAN0] --> B
        lan1[LAN1] --> B
        lan2[LAN2] --> IB
        lan3[LAN3] --> IB
        lan4[LAN4] --> IB
        lan5[LAN5] --> IB
    end

    subgraph "Network Control Plane"
        NO[NVIDIA Network Operator]
        VPC[VPC CNI]
        MC[Multus CNI]
        SRIOV[SR-IOV Device Plugin]
        RDMA[RDMA Device Plugin]
        NO --> VPC
        NO --> MC
        NO --> SRIOV
        NO --> RDMA
    end

    subgraph "Pod Networking"
        P1[AI Training Pod]
        eth0[eth0]
        rdma[RDMA Interface]
        P1 --> eth0
        P1 --> rdma
        eth0 --> B
        rdma --> IB
    end

    subgraph "Monitoring System"
        PM[Prometheus]
        GF[Grafana]
        PM --> GF
    end
```
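In this layout, a training pod's default `eth0` rides on the bonded management network while a secondary RDMA interface is injected via Multus and the SR-IOV/RDMA device plugins. The sketch below shows what such a pod spec might look like; the network attachment name (`ib-sriov-network`), the RDMA resource name (`rdma/rdma_shared_device_a`), and the container image are placeholders that depend on how the Network Operator is configured in your cluster.

```bash
# Sketch: attach a training pod to the InfiniBand secondary network via Multus.
# The NetworkAttachmentDefinition and RDMA resource names below are placeholders --
# use the ones created by your Network Operator configuration.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ai-training-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-sriov-network   # hypothetical secondary network
spec:
  containers:
  - name: trainer
    image: my-registry/ai-trainer:latest            # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 8
        rdma/rdma_shared_device_a: 1                # placeholder RDMA resource name
EOF
```

Inside such a pod, the RDMA interface appears alongside `eth0`, which is what the "Pod Networking" subgraph above depicts.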
- Kubernetes 1.20+
- NVIDIA A100/A800 GPUs
- Mellanox InfiniBand NICs
- Helm 3.0+
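A quick way to confirm these prerequisites before installing; the commands assume `kubectl`, `helm`, the NVIDIA driver, and MLNX_OFED are already installed on the relevant machines.

```bash
# Kubernetes control plane and node versions (1.20+ expected)
kubectl version
kubectl get nodes -o wide

# Helm 3.0+
helm version

# On each GPU node: A100/A800 GPUs visible to the driver
nvidia-smi -L

# Mellanox InfiniBand NICs present and ports active
lspci | grep -i mellanox
ibstat
```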
- Configure network environment: `./scripts/setup-network.sh`
- Deploy NVIDIA Network Operator: `./scripts/deploy-network-operator.sh`
- Verify deployment: `./scripts/test-network.sh`

- Bond4 configuration for management traffic
- InfiniBand network for high-speed data transfer
- RDMA support for direct memory access
- SR-IOV for network virtualization
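To sanity-check this network layout directly on a node, something like the following can be used; the bond name `bond4` follows the diagram above, and the `mlx5_*` device names are assumptions about typical Mellanox hardware.

```bash
# Bonding status for the management bond (name assumed to be bond4)
cat /proc/net/bonding/bond4

# Map InfiniBand devices to their netdev names (ibdev2netdev ships with MLNX_OFED)
ibdev2netdev

# Port state and link rate for each HCA
ibstat | grep -E "State|Rate"

# RDMA devices exposed to the kernel
ls /sys/class/infiniband/
```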
- Prometheus for metrics collection
- Grafana for visualization
- Custom exporters for GPU and network metrics
- Comprehensive alerting rules
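As a rough sketch of how the collected metrics can be queried, the commands below assume a typical dcgm-exporter / node-exporter setup and a Prometheus service named `prometheus-server` in the `monitoring` namespace; adjust names to match your install.

```bash
# Reach Prometheus locally (service name and namespace are assumptions)
kubectl -n monitoring port-forward svc/prometheus-server 9090:9090 &

# GPU utilization per GPU, as exposed by dcgm-exporter
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'

# InfiniBand transmit throughput from node-exporter's infiniband collector
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(node_infiniband_port_data_transmitted_bytes_total[5m])'
```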
- Distributed training support
- GPU-aware scheduling
- NCCL optimization
- Topology-aware placement
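The NCCL side of this usually comes down to a handful of environment variables; the sketch below is illustrative only, and the HCA prefix, interface name, and `torchrun` launch line are assumptions rather than settings shipped by this project.

```bash
# Illustrative NCCL settings for an InfiniBand fabric (values are assumptions)
export NCCL_DEBUG=INFO           # log which transports and HCAs NCCL selects
export NCCL_IB_HCA=mlx5          # restrict NCCL to the Mellanox mlx5 HCAs
export NCCL_SOCKET_IFNAME=bond4  # keep bootstrap/control traffic on the management bond

# Example multi-node launch with torchrun (script name and endpoint are placeholders)
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=master-node:29500 \
  train.py
```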
- NVLink: Up to 600 GB/s bidirectional bandwidth
- InfiniBand: Up to 200 Gb/s network speed
- RDMA: Ultra-low latency communication
- GPUDirect: Optimized GPU-to-GPU transfer
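To verify that the fabric actually delivers numbers in this range, the standard perftest and nccl-tests binaries can be used; device names, hostnames, and paths below are assumptions.

```bash
# NVLink link status on a GPU node
nvidia-smi nvlink --status

# Raw InfiniBand bandwidth between two nodes with perftest
ib_write_bw -d mlx5_0 --report_gbits                     # on the server node
ib_write_bw -d mlx5_0 --report_gbits <server-hostname>   # on the client node

# End-to-end collective bandwidth with nccl-tests (binary path is a placeholder)
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
```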
Contributions are welcome! Please read our Contributing Guidelines for details.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.