# K8s AI Infrastructure

A high-performance AI training infrastructure deployment solution for Kubernetes, optimized for NVIDIA A100/A800 GPU clusters with InfiniBand networking.

## ✨ Features

- 🚀 High Performance: Optimized for NVIDIA A100/A800 GPU clusters
- 🌐 Advanced Networking: InfiniBand support with RDMA
- 📊 Comprehensive Monitoring: GPU and network metrics tracking
- 🔄 Automated Deployment: Streamlined setup process
- 🛡️ Production Ready: Enterprise-grade security and stability

## 🏗️ System Architecture

```mermaid
graph TB
    subgraph "Physical Network"
        B[Bond4]
        IB[InfiniBand Network]
        lan0[LAN0] --> B
        lan1[LAN1] --> B
        lan2[LAN2] --> IB
        lan3[LAN3] --> IB
        lan4[LAN4] --> IB
        lan5[LAN5] --> IB
    end

    subgraph "Network Control Plane"
        NO[NVIDIA Network Operator]
        VPC[VPC CNI]
        MC[Multus CNI]
        SRIOV[SR-IOV Device Plugin]
        RDMA[RDMA Device Plugin]

        NO --> VPC
        NO --> MC
        NO --> SRIOV
        NO --> RDMA
    end

    subgraph "Pod Networking"
        P1[AI Training Pod]
        eth0[eth0]
        rdma[RDMA Interface]

        P1 --> eth0
        P1 --> rdma
        eth0 --> B
        rdma --> IB
    end

    subgraph "Monitoring System"
        PM[Prometheus]
        GF[Grafana]
        PM --> GF
    end
```
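For concreteness, the secondary RDMA interface in the diagram is typically wired up through a Multus `NetworkAttachmentDefinition`. The sketch below is illustrative, not taken from this repo: the network name `rdma-net`, the resource name `rdma/rdma_shared_device_a`, and the master interface `ibs1` are assumptions that must match your device-plugin and node configuration.

```bash
# Hypothetical NetworkAttachmentDefinition for the pods' RDMA interface.
# All names here (rdma-net, rdma/rdma_shared_device_a, ibs1) are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "ipoib",
      "master": "ibs1",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }
EOF
```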

## 🚀 Quick Start

### Prerequisites

- Kubernetes 1.20+
- NVIDIA A100/A800 GPUs
- Mellanox InfiniBand NICs
- Helm 3.0+
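A quick sanity check of the prerequisites before installing (the grep patterns assume the GPU and RDMA device plugins advertise the usual resource names once installed):

```bash
# Check cluster and tooling versions
kubectl version
helm version

# Confirm worker nodes advertise GPU (and, after setup, RDMA) resources
kubectl describe nodes | grep -E 'nvidia.com/gpu|rdma/'

# On each worker, confirm the Mellanox/NVIDIA InfiniBand NICs are present
lspci | grep -i mellanox
```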

### Installation

1. Configure the network environment:

   ```bash
   ./scripts/setup-network.sh
   ```

2. Deploy the NVIDIA Network Operator:

   ```bash
   ./scripts/deploy-network-operator.sh
   ```

3. Verify the deployment:

   ```bash
   ./scripts/test-network.sh
   ```
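Beyond `test-network.sh`, a few manual spot checks can confirm that everything landed (the namespace and resource names below are assumptions; adjust them to your deployment):

```bash
# Operator and CNI pods should be Running (namespace is an assumption)
kubectl -n nvidia-network-operator get pods

# The Multus CRD should now exist and list the attachable networks
kubectl get network-attachment-definitions -A

# Workers should advertise GPU and RDMA extended resources
kubectl describe node <worker-node> | grep -E 'nvidia.com/gpu|rdma/'
```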

## 📚 Documentation

## 🛠️ Components

### Network Infrastructure

- Bond4 configuration for management traffic
- InfiniBand network for high-speed data transfer
- RDMA support for direct memory access (see the pod sketch below)
- SR-IOV for network virtualization
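A minimal sketch of a pod that consumes both paths, assuming the hypothetical `rdma-net` attachment from the architecture section: `eth0` comes from the default CNI over the bond, the second interface from Multus. `IPC_LOCK` is needed so the RDMA stack can pin memory.

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: rdma-smoke-test
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net   # hypothetical attachment name
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3   # example image only
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]   # allows RDMA memory registration
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1   # assumed device-plugin resource name
EOF
```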

### Monitoring Stack

- Prometheus for metrics collection
- Grafana for visualization
- Custom exporters for GPU and network metrics (a dcgm-exporter sketch follows)
- Comprehensive alerting rules
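One common way to feed the GPU side of this stack, if the repo's manifests don't already do it, is NVIDIA's dcgm-exporter Helm chart; Prometheus then scrapes per-GPU metrics such as `DCGM_FI_DEV_GPU_UTIL`. A sketch, assuming NVIDIA's published chart repo:

```bash
# Deploy dcgm-exporter for GPU metrics (chart repo per NVIDIA's docs)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

# Example PromQL once metrics are flowing: average utilization per GPU
#   avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)
```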

### Ray Integration

- Distributed training support
- GPU-aware scheduling
- NCCL optimization (environment sketch below)
- Topology-aware placement
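The NCCL optimization typically amounts to environment variables on the Ray workers so that collectives run over InfiniBand instead of TCP. A sketch, with the HCA and interface names as assumptions (check yours with `ibdev2netdev` on the nodes):

```bash
# Route NCCL collectives over the IB HCAs; keep bootstrap TCP on the bond
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3   # assumed HCA names
export NCCL_SOCKET_IFNAME=bond4                  # assumed bond interface name
export NCCL_DEBUG=INFO                           # logs which transport NCCL picked
```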

## 📊 Performance

- NVLink: Up to 600 GB/s bidirectional bandwidth per A100 GPU
- InfiniBand: Up to 200 Gb/s per link
- RDMA: Ultra-low-latency communication
- GPUDirect: Optimized GPU-to-GPU transfers
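These are link-rate ceilings; to measure what a given cluster actually delivers, the standard tools are `ib_write_bw` from the perftest package and `all_reduce_perf` from nccl-tests (host names and GPU counts below are placeholders):

```bash
# Raw IB bandwidth between two hosts (start the server side first)
ib_write_bw -d mlx5_0 --report_gbits                 # on the server
ib_write_bw -d mlx5_0 --report_gbits <server-ip>     # on the client; expect ~200 Gb/s

# End-to-end GPU collective bandwidth (nccl-tests, 16 GPUs via MPI)
mpirun -np 16 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```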

## 🤝 Contributing

Contributions are welcome! Please read our Contributing Guidelines for details.

## 📝 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
