PufferLib Integration Guide

This guide explains how to use the optional PufferLib reinforcement learning (RL) integration in cluster-code for intelligent cluster management and troubleshooting.

Overview

PufferLib is a high-performance reinforcement learning library. The integration with cluster-code allows you to train AI agents that can automatically diagnose and manage Kubernetes clusters.

Key Features:

Train RL agents to identify and resolve cluster issues
Simulate cluster problems for safe training
Run trained agents on real clusters for diagnostics
GPU acceleration support (optional)

Prerequisites

Python 3.8 or later
pip (Python package manager)
Optional: NVIDIA GPU with CUDA drivers for accelerated training
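
The prerequisites above can be checked before running setup. The following is a minimal sketch, not part of cluster-code: the function name `check_prerequisites` is hypothetical, and finding `nvidia-smi` on the PATH is only a rough hint that NVIDIA drivers are present, not proof that CUDA works.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a dict describing which prerequisites appear to be met.

    Hypothetical helper for illustration; cluster-code may perform
    its own checks differently.
    """
    return {
        "python_ok": tuple(version_info[:2]) >= MIN_PYTHON,
        "pip_found": which("pip") is not None or which("pip3") is not None,
        # Assumption: nvidia-smi on PATH suggests NVIDIA drivers are installed.
        "nvidia_smi_found": which("nvidia-smi") is not None,
    }


# Print a short report for the current machine.
for name, ok in check_prerequisites().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

The `version_info` and `which` parameters exist only so the check can be exercised without depending on the local machine.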

Quick Start

1. Set up the RL Environment

During cluster-code init, you’ll be asked if you want to set up the PufferLib environment. You can also set it up manually:

# Basic setup (CPU only)
cluster-code rl setup

# With GPU/CUDA support
cluster-code rl setup --cuda

2. Check Status

cluster-code rl status

This shows:

Python and PufferLib versions
CUDA availability
Trained model locations
Configuration details

3. Train an Agent

# Train with default settings (100 episodes, simulation mode)
cluster-code rl train

# Custom training
cluster-code rl train --episodes 500 --steps 200 --verbose

# Train on real cluster (be careful!)
cluster-code rl train --no-simulation

4. Run Diagnostics

# Run RL-based diagnostics
cluster-code rl diagnose

# Run with specific model
cluster-code rl diagnose --model ~/.cluster-code/models/cluster_agent.pt

# Run on real cluster
cluster-code rl diagnose --no-simulation

How It Works

Cluster Environment

The RL environment models your Kubernetes cluster as a Gymnasium environment:

Observation Space (18 dimensions):

Node metrics (count, ready/not-ready status)
Pod metrics (running, pending, failed, unknown)
Deployment health
Resource usage (CPU, memory)
Event counts (warning vs normal)
Issue flags (PVC, network, resource pressure)
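
To make the 18-dimensional observation concrete, here is one possible layout, sketched under assumptions. The guide does not list the individual features, so every field name below is hypothetical, as is the split of "deployment health" into four values. The actual ClusterEnv may order or define them differently.

```python
from dataclasses import dataclass, astuple


@dataclass
class ClusterSnapshot:
    """Hypothetical 18-feature view of a cluster (names are assumptions)."""
    # Node metrics
    nodes_total: float = 0.0
    nodes_ready: float = 0.0
    nodes_not_ready: float = 0.0
    # Pod metrics
    pods_running: float = 0.0
    pods_pending: float = 0.0
    pods_failed: float = 0.0
    pods_unknown: float = 0.0
    # Deployment health (assumed to be split into four values)
    deployments_total: float = 0.0
    deployments_available: float = 0.0
    deployments_unavailable: float = 0.0
    deployments_health_ratio: float = 0.0
    # Resource usage, as fractions in [0, 1]
    cpu_usage: float = 0.0
    memory_usage: float = 0.0
    # Event counts
    events_warning: float = 0.0
    events_normal: float = 0.0
    # Issue flags (0.0 or 1.0)
    has_pvc_issue: float = 0.0
    has_network_issue: float = 0.0
    has_resource_pressure: float = 0.0


def to_observation(snapshot):
    """Flatten a snapshot into a fixed-order list of 18 floats."""
    return [float(value) for value in astuple(snapshot)]


obs = to_observation(ClusterSnapshot(nodes_total=3, nodes_ready=2, nodes_not_ready=1))
print(len(obs))
```

A Gymnasium environment would typically expose such a vector as a NumPy array. A plain list is used here to keep the sketch free of dependencies.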

Reward Function:

Positive rewards for resolving issues
Negative rewards for cluster degradation
Bonus for maintaining healthy cluster state
Small step penalty to encourage efficiency
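
The four reward components can be combined as in the sketch below. This is an illustration, not the integration's actual reward function: the state keys, the definition of an "issue", and all weight values are assumptions.

```python
def count_issues(state):
    """Count active issues in a state dict (hypothetical keys)."""
    return state["pods_failed"] + state["pods_pending"] + state["nodes_not_ready"]


def calculate_reward(prev_state, curr_state,
                     resolve_reward=1.0, degrade_penalty=1.0,
                     healthy_bonus=0.5, step_penalty=0.01):
    """Illustrative reward: all weights are assumed values."""
    delta = count_issues(prev_state) - count_issues(curr_state)
    reward = -step_penalty                    # small cost for every step taken
    if delta > 0:
        reward += resolve_reward * delta      # issues were resolved
    elif delta < 0:
        reward -= degrade_penalty * (-delta)  # the cluster got worse
    if count_issues(curr_state) == 0:
        reward += healthy_bonus               # bonus for a healthy cluster
    return reward


before = {"pods_failed": 2, "pods_pending": 1, "nodes_not_ready": 0}
after = {"pods_failed": 0, "pods_pending": 0, "nodes_not_ready": 0}
print(calculate_reward(before, after))
```

Because the step penalty applies on every step, an agent that resolves the same issues in fewer steps collects a higher total reward.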

Training Modes

Simulation Mode (Default)

Trains on a simulated cluster with randomly generated issues. This mode is safe for experimentation and does not affect your real cluster.

Real Cluster Mode

Connects to your configured Kubernetes cluster. Use with caution! By default, only read operations are executed.

Configuration

PufferLib configuration is stored in ~/.cluster-code/config.json:

{
  "pufferlib": {
    "enabled": true,
    "pythonPath": "~/.cluster-code/pufferlib-env/bin/python",
    "envPath": "~/.cluster-code/pufferlib-env",
    "modelPath": "~/.cluster-code/models",
    "trainingConfig": {
      "learningRate": 0.0003,
      "batchSize": 64,
      "numEpochs": 10,
      "gamma": 0.99,
      "numEnvs": 4,
      "numSteps": 128
    }
  }
}
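
If you want to read this configuration from your own scripts, a sketch like the following may help. JSON itself does not expand `~`, so the path entries are expanded explicitly. The function name `load_pufferlib_config` is hypothetical, and whether cluster-code expands these paths the same way is an assumption.

```python
import json
import os

# Keys in the "pufferlib" section that hold filesystem paths.
PATH_KEYS = ("pythonPath", "envPath", "modelPath")


def load_pufferlib_config(config_file="~/.cluster-code/config.json"):
    """Read the "pufferlib" section and expand "~" in its path entries."""
    with open(os.path.expanduser(config_file), encoding="utf-8") as fh:
        section = json.load(fh).get("pufferlib", {})
    for key in PATH_KEYS:
        if key in section:
            section[key] = os.path.expanduser(section[key])
    return section
```

For example, `load_pufferlib_config()["trainingConfig"]["gamma"]` would return the configured discount factor, provided the file exists and contains that key.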

Command Reference

`cluster-code rl status`

Display the status of the RL environment.

`cluster-code rl setup`

Set up the PufferLib RL environment.

Options:

--cuda - Install with CUDA/GPU support
--force - Force reinstall if environment exists

`cluster-code rl remove`

Remove the PufferLib RL environment.

`cluster-code rl train`

Train an RL agent for cluster management.

Options:

-e, --episodes <n> - Number of training episodes (default: 100)
-s, --steps <n> - Steps per episode (default: 100)
--no-simulation - Train on real cluster
-v, --verbose - Show verbose output

`cluster-code rl diagnose`

Run RL-based cluster diagnostics.

Options:

-m, --model <path> - Path to trained model
-s, --steps <n> - Maximum steps (default: 20)
--no-simulation - Run on real cluster
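
When driving these commands from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags documented above. The helper names are hypothetical, and the sketch assumes `cluster-code` is on your PATH.

```python
import os


def build_train_command(episodes=None, steps=None, simulation=True, verbose=False):
    """Assemble argv for `cluster-code rl train` from the documented options."""
    cmd = ["cluster-code", "rl", "train"]
    if episodes is not None:
        cmd += ["--episodes", str(episodes)]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    if not simulation:
        cmd.append("--no-simulation")  # trains against the real cluster
    if verbose:
        cmd.append("--verbose")
    return cmd


def build_diagnose_command(model=None, steps=None, simulation=True):
    """Assemble argv for `cluster-code rl diagnose` from the documented options."""
    cmd = ["cluster-code", "rl", "diagnose"]
    if model is not None:
        # No shell is involved when passing a list, so expand "~" here.
        cmd += ["--model", os.path.expanduser(model)]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    if not simulation:
        cmd.append("--no-simulation")  # runs against the real cluster
    return cmd


print(" ".join(build_train_command(episodes=500, steps=200, verbose=True)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping `simulation=True` as the default mirrors the CLI's own default, so the real cluster is only touched when requested explicitly.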

Troubleshooting

“Python is not installed”

Install Python 3.8+ from python.org or via your package manager.

“PufferLib is not installed”

Run cluster-code rl setup to install PufferLib.

“CUDA not available”

Ensure NVIDIA drivers are installed
Reinstall with cluster-code rl setup --cuda --force

Training is slow

Use GPU acceleration: cluster-code rl setup --cuda
Reduce episodes or steps
Use simulation mode for initial training

Best Practices

Start with simulation - Always train initially in simulation mode
Validate on simulation - Test trained agents on simulation before real clusters
Use read-only on real clusters - The default mode only performs read operations
Monitor training - Use TensorBoard to track training progress
Save checkpoints - Trained models are saved to ~/.cluster-code/models/

Advanced: Custom Environments

You can extend the ClusterEnv class to add custom observations or actions:

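from cluster_env import ClusterEnv, ClusterAction


class CustomClusterEnv(ClusterEnv):
    """ClusterEnv subclass with room for custom observations, actions, and rewards."""

    def __init__(self, **kwargs):
        # Forward all keyword arguments to the base environment.
        super().__init__(**kwargs)
        # Add custom initialization here.

    def _calculate_reward(self, prev_state, curr_state, action, result):
        # Start from the built-in reward, then adjust it as needed.
        reward = super()._calculate_reward(prev_state, curr_state, action, result)
        return reward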
