PufferLib Integration Guide
This guide explains how to use the optional PufferLib reinforcement learning (RL) integration in cluster-code for intelligent cluster management and troubleshooting.
Overview
PufferLib is a high-performance reinforcement learning library. The integration with cluster-code allows you to train AI agents that can automatically diagnose and manage Kubernetes clusters.
Key Features:
- Train RL agents to identify and resolve cluster issues
- Simulate cluster problems for safe training
- Run trained agents on real clusters for diagnostics
- GPU acceleration support (optional)
Prerequisites
- Python 3.8 or later
- pip (Python package manager)
- Optional: NVIDIA GPU with CUDA drivers for accelerated training
Quick Start
1. Set up the RL Environment
During cluster-code init, you’ll be asked if you want to set up the PufferLib environment. You can also set it up manually:
# Basic setup (CPU only)
cluster-code rl setup
# With GPU/CUDA support
cluster-code rl setup --cuda
2. Check Status
cluster-code rl status
This shows:
- Python and PufferLib versions
- CUDA availability
- Trained model locations
- Configuration details
3. Train an Agent
# Train with default settings (100 episodes, simulation mode)
cluster-code rl train
# Custom training
cluster-code rl train --episodes 500 --steps 200 --verbose
# Train on real cluster (be careful!)
cluster-code rl train --no-simulation
4. Run Diagnostics
# Run RL-based diagnostics
cluster-code rl diagnose
# Run with specific model
cluster-code rl diagnose --model ~/.cluster-code/models/cluster_agent.pt
# Run on real cluster
cluster-code rl diagnose --no-simulation
How It Works
Cluster Environment
The RL environment models your Kubernetes cluster as a Gymnasium environment:
Observation Space (18 dimensions):
- Node metrics (count, ready/not-ready status)
- Pod metrics (running, pending, failed, unknown)
- Deployment health
- Resource usage (CPU, memory)
- Event counts (warning vs normal)
- Issue flags (PVC, network, resource pressure)
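As a rough sketch, this is how an 18-dimensional observation could be declared with the Gymnasium API; the exact field order and bounds inside ClusterEnv are not spelled out here, so treat the layout as an assumption:

```python
import numpy as np
from gymnasium import spaces

# Hypothetical sketch: an 18-dimensional observation declared with Gymnasium.
# The real ClusterEnv defines its own layout; bounds here are illustrative.
observation_space = spaces.Box(low=0.0, high=np.inf, shape=(18,), dtype=np.float32)

# Example observation vector: cluster metrics packed into one array
# (grouping follows the list above; the exact ordering is an assumption).
example_obs = np.zeros(18, dtype=np.float32)
example_obs[0] = 3.0  # e.g. node count
example_obs[1] = 3.0  # e.g. ready nodes
```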
Action Space (15 actions):

| Action | Description |
|--------|-------------|
| GET_NODES | Retrieve node information |
| GET_PODS | Retrieve pod information |
| GET_EVENTS | Get cluster events |
| GET_DEPLOYMENTS | Get deployment status |
| CHECK_LOGS | Check pod logs |
| DESCRIBE_UNHEALTHY | Describe unhealthy resources |
| RESTART_FAILED_PODS | Restart failed pods |
| SCALE_UP_DEPLOYMENT | Scale up deployment |
| SCALE_DOWN_DEPLOYMENT | Scale down deployment |
| DRAIN_NODE | Drain a node |
| UNCORDON_NODE | Uncordon a node |
| TOP_NODES | Get node resource usage |
| TOP_PODS | Get pod resource usage |
| GET_RESOURCE_QUOTAS | Check resource quotas |
| WAIT | No action |
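For reference, the action space could be modeled as a Python enum like the one below; the real ClusterAction definition ships with cluster_env, so the exact numbering here is an assumption:

```python
from enum import IntEnum

# Hypothetical sketch of the 15-action space as an IntEnum; the actual
# definition is provided by cluster_env.ClusterAction and may differ.
class ClusterAction(IntEnum):
    GET_NODES = 0
    GET_PODS = 1
    GET_EVENTS = 2
    GET_DEPLOYMENTS = 3
    CHECK_LOGS = 4
    DESCRIBE_UNHEALTHY = 5
    RESTART_FAILED_PODS = 6
    SCALE_UP_DEPLOYMENT = 7
    SCALE_DOWN_DEPLOYMENT = 8
    DRAIN_NODE = 9
    UNCORDON_NODE = 10
    TOP_NODES = 11
    TOP_PODS = 12
    GET_RESOURCE_QUOTAS = 13
    WAIT = 14
```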
Reward Function:
- Positive rewards for resolving issues
- Negative rewards for cluster degradation
- Bonus for maintaining healthy cluster state
- Small step penalty to encourage efficiency
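The sketch below combines these four components with made-up weights to show the shape of such a reward; the actual values used inside ClusterEnv may differ:

```python
# Hypothetical reward shaping along the lines described above; all weights
# are illustrative, not the values used by ClusterEnv.
def sketch_reward(issues_resolved: int, issues_introduced: int, cluster_healthy: bool) -> float:
    reward = 0.0
    reward += 1.0 * issues_resolved    # positive reward for resolving issues
    reward -= 1.0 * issues_introduced  # negative reward for cluster degradation
    if cluster_healthy:
        reward += 0.5                  # bonus for maintaining a healthy cluster state
    reward -= 0.01                     # small per-step penalty to encourage efficiency
    return reward
```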
Training Modes
Simulation Mode (Default)
Trains on a simulated cluster with randomly generated issues. Safe for experimentation and doesn’t affect your real cluster.
Real Cluster Mode
Connects to your configured Kubernetes cluster. Use with caution! By default, only read operations are executed.
Configuration
PufferLib configuration is stored in ~/.cluster-code/config.json:
{
"pufferlib": {
"enabled": true,
"pythonPath": "~/.cluster-code/pufferlib-env/bin/python",
"envPath": "~/.cluster-code/pufferlib-env",
"modelPath": "~/.cluster-code/models",
"trainingConfig": {
"learningRate": 0.0003,
"batchSize": 64,
"numEpochs": 10,
"gamma": 0.99,
"numEnvs": 4,
"numSteps": 128
}
}
}
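If you want to inspect these settings programmatically, the standard library is enough; the snippet below only assumes the file layout shown above:

```python
import json
from pathlib import Path

# Load the cluster-code config and print the PufferLib training settings.
config_path = Path.home() / ".cluster-code" / "config.json"
config = json.loads(config_path.read_text())
print(config["pufferlib"]["trainingConfig"])
```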
Command Reference
cluster-code rl status
Display the status of the RL environment.
cluster-code rl setup
Set up the PufferLib RL environment.
Options:
- --cuda - Install with CUDA/GPU support
- --force - Force reinstall if the environment already exists
cluster-code rl remove
Remove the PufferLib RL environment.
cluster-code rl train
Train an RL agent for cluster management.
Options:
- -e, --episodes <n> - Number of training episodes (default: 100)
- -s, --steps <n> - Steps per episode (default: 100)
- --no-simulation - Train on the real cluster
- -v, --verbose - Show verbose output
cluster-code rl diagnose
Run RL-based cluster diagnostics.
Options:
- -m, --model <path> - Path to a trained model
- -s, --steps <n> - Maximum steps (default: 20)
- --no-simulation - Run on the real cluster
Troubleshooting
“Python is not installed”
Install Python 3.8+ from python.org or via your package manager.
“PufferLib is not installed”
Run cluster-code rl setup to install PufferLib.
“CUDA not available”
- Ensure NVIDIA drivers are installed
- Reinstall with cluster-code rl setup --cuda --force
Training is slow
- Use GPU acceleration: cluster-code rl setup --cuda
- Reduce episodes or steps
- Use simulation mode for initial training
Best Practices
- Start with simulation - Always train initially in simulation mode
- Validate on simulation - Test trained agents on simulation before real clusters
- Use read-only on real clusters - The default mode only performs read operations
- Monitor training - Use TensorBoard to track training progress (see the logging sketch after this list)
- Save checkpoints - Trained models are saved to ~/.cluster-code/models/
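If you run your own training loops on top of the environment, one option for getting TensorBoard curves is PyTorch's SummaryWriter (assuming PyTorch and TensorBoard are available in the PufferLib environment; the log directory below is just a suggestion, not something cluster-code creates for you):

```python
from torch.utils.tensorboard import SummaryWriter

# Hypothetical logging setup for a custom training loop.
writer = SummaryWriter(log_dir="runs/cluster-agent")
for episode, episode_reward in enumerate([1.0, 2.5, 3.0]):  # placeholder rewards
    writer.add_scalar("train/episode_reward", episode_reward, episode)
writer.close()
```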
Advanced: Custom Environments
You can extend the ClusterEnv class to add custom observations or actions:
from cluster_env import ClusterEnv, ClusterAction

class CustomClusterEnv(ClusterEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Add custom initialization

    def _calculate_reward(self, prev_state, curr_state, action, result):
        # Custom reward function
        return super()._calculate_reward(prev_state, curr_state, action, result)
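A usage sketch, assuming ClusterEnv follows the standard Gymnasium reset/step API described earlier; constructor arguments are omitted here and depend on your setup:

```python
# Hypothetical rollout with the custom environment.
env = CustomClusterEnv()
obs, info = env.reset()
for _ in range(20):
    action = env.action_space.sample()  # replace with a trained policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
```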