Local LLM Support for Cluster Code
Overview
Cluster Code integrates with locally hosted Large Language Models (LLMs), so you can run self-hosted or open-source models instead of proprietary cloud-based models. This gives you privacy, cost control, and flexibility in your Kubernetes cluster management workflows.
Architecture
Local LLM support works through a LiteLLM proxy that translates Cluster Code requests into OpenAI-compatible API calls that your local model can understand.
Cluster Code -> LiteLLM Proxy -> Local LLM (Ollama/vLLM/etc.)
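A quick way to see this flow end to end is to send an OpenAI-style request directly to the proxy and let it route to the local model. This assumes the proxy from the setup below is already running on port 4000 with a model named `local-cluster-model`:

```bash
# Send an OpenAI-compatible chat request to the LiteLLM proxy, which forwards
# it to the configured local model (model name and port follow the setup below).
curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-cluster-model",
    "messages": [{"role": "user", "content": "List common causes of CrashLoopBackOff"}]
  }'
```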
Supported Local LLM Providers
- Ollama: Most popular for local model serving
- vLLM: High-performance inference server
- Llama.cpp: C++ implementation for efficient inference
- Text Generation Inference (TGI): Hugging Face’s inference server
- Custom OpenAI-compatible endpoints: Any service with OpenAI-compatible API
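If you serve models with Ollama, for example, it helps to confirm the backend is reachable before putting the proxy in front of it. The commands below assume Ollama's default port of 11434:

```bash
# Pull a model and confirm the Ollama API answers on its default port
ollama pull llama3.1:8b
curl -s http://localhost:11434/api/tags   # lists the models available locally
```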
Quick Setup Guide
1. Install LiteLLM Proxy
```bash
pip install 'litellm[proxy]'
```
2. Create LiteLLM Configuration
Create a config.yaml file:
```yaml
model_list:
  - model_name: local-cluster-model
    litellm_params:
      model: ollama/deepseek-coder-v2
      api_base: "http://localhost:11434"
      # Optional: Add temperature, max_tokens, etc.
      temperature: 0.1
      max_tokens: 4096

  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: "http://localhost:11434"

# Optional: Load balancing between models
router_settings:
  model_group_alias:
    local-cluster-model-group:
      - local-cluster-model
      - local-llama
```
3. Start LiteLLM Proxy
```bash
# Set master key (replace with a secure key)
export LITELLM_MASTER_KEY="your-secure-master-key-here"

# Start proxy with your config
litellm --config config.yaml --port 4000
```
4. Configure Cluster Code
```bash
# Set environment variables for Cluster Code
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="your-secure-master-key-here"
```
5. Use Local Model with Cluster Code
```bash
# Run cluster operations with the local model
cluster-code --model local-cluster-model diagnose
cluster-code --model local-cluster-model "Analyze pod failures in production namespace"
cluster-code --model local-llama chat "Help me troubleshoot service connectivity issues"
```
Configuration Options
Environment Variables
| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_BASE_URL` | Yes | URL of your LiteLLM proxy |
| `ANTHROPIC_AUTH_TOKEN` | Yes | Master key for the LiteLLM proxy |
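Because these variables must be set in every shell that runs Cluster Code, it can be convenient to keep them in a small env file and source it before use. The file name below is only an example:

```bash
# local-llm.env -- example file; source it before running Cluster Code
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="your-secure-master-key-here"
```

Then run `source local-llm.env` in any shell where Cluster Code should use the local proxy.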
Recommended Models for Cluster Management
Based on testing, these models work well for Kubernetes cluster operations:
Code-Specialized Models
- deepseek-coder-v2: Excellent for troubleshooting and code analysis
- codellama: Good for configuration and script generation
- starcoder2: Strong performance on technical tasks
General Purpose Models
- llama3.1:8b: Good balance of performance and resource usage
- mistral:7b: Fast and efficient for cluster diagnostics
- qwen2.5:7b: Strong reasoning capabilities
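If you use Ollama as the backend, these models can be pulled ahead of time so the first Cluster Code request does not block on a download. The tags below match the names used throughout this guide:

```bash
# Pre-pull the recommended models (Ollama tags as referenced in this guide)
ollama pull deepseek-coder-v2
ollama pull codellama
ollama pull starcoder2
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:7b
```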
Advanced Configuration
Model Selection Strategy
Configure different models for different tasks:
```yaml
model_list:
  # For code analysis and troubleshooting
  - model_name: code-specialist
    litellm_params:
      model: ollama/deepseek-coder-v2
      api_base: "http://localhost:11434"
      temperature: 0.1

  # For general cluster diagnostics
  - model_name: diagnostics-general
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: "http://localhost:11434"
      temperature: 0.3

  # For creative problem-solving
  - model_name: creative-solver
    litellm_params:
      model: ollama/mistral:7b
      api_base: "http://localhost:11434"
      temperature: 0.7
```
Performance Optimization
```yaml
# Add request/rate limiting
general_settings:
  master_key: "your-secure-key"
  database_url: "sqlite:///litellm.db"  # Enable caching

litellm_settings:
  set_verbose: true
  drop_params: true  # Handle unsupported parameters

# Configure timeouts and retries
router_settings:
  timeout: 120  # seconds
  retries: 3
```
Resource Management
Monitor and limit resource usage:
```bash
# Add to your proxy startup
litellm --config config.yaml \
  --num_workers 4 \
  --max_parallel_requests 10 \
  --timeout 120
```
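If Ollama is the backend, its own concurrency can be capped as well. The environment variables below are Ollama server settings, shown with illustrative values:

```bash
# Limit how many requests Ollama serves in parallel and how many models it
# keeps loaded in memory at once (values are illustrative)
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve
```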
Usage Examples
Interactive Troubleshooting
```bash
# Use the local model for an interactive session
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="your-key"

# Start chat with the local model
cluster-code --model local-cluster-model chat

# Specific troubleshooting
cluster-code --model local-cluster-model "Why are my pods in CrashLoopBackOff?"
cluster-code --model code-specialist "Help me write a deployment YAML"
```
Automated Diagnostics
```bash
# Run comprehensive diagnostics with the local model
cluster-code --model diagnostics-general diagnose --namespace production

# Analyze specific resources
cluster-code --model local-cluster-model analyze pod my-app-pod-xyz
cluster-code --model code-specialist describe deployment my-api
```
Script Integration
```bash
#!/bin/bash
# cluster-health-check.sh
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="your-key"

# Use the local model for health analysis
cluster-code --model diagnostics-general status --output json > cluster-status.json
cluster-code --model local-cluster-model diagnose --severity critical > issues.txt

# Process results...
```
Hardware Requirements
Minimum Requirements
- CPU: 4+ cores
- RAM: 16GB+ (32GB recommended for larger models)
- Storage: 10GB+ for model files
- GPU: Optional but recommended (NVIDIA GPU with 8GB+ VRAM)
Performance Tiers
| Use Case | Recommended Model | RAM | GPU |
|---|---|---|---|
| Basic Diagnostics | mistral:7b | 16GB | Optional |
| Code Analysis | deepseek-coder-v2 | 24GB | Recommended |
| Advanced Operations | llama3.1:8b | 32GB | Recommended |
| Production Use | Custom fine-tuned | 64GB+ | Required |
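To see where a host falls within these tiers before pulling a model, standard system tools are enough:

```bash
# Quick hardware inventory before choosing a model tier
nproc        # CPU cores
free -h      # total and available RAM
df -h .      # free disk space for model files
nvidia-smi --query-gpu=name,memory.total --format=csv   # GPU VRAM, if present
```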
Limitations and Considerations
Known Limitations
- Model Compatibility: Not all commands work perfectly with every local model
- Output Format: Different models may format outputs differently
- Performance: Local models are typically slower than cloud APIs
- Resource Usage: High memory and CPU consumption
- Feature Compatibility: Some advanced features may require specific model capabilities
Best Practices
- Start Small: Begin with smaller models (7B parameters) before scaling up
- Monitor Resources: Use `htop` or similar tools to monitor resource usage
- Cache Results: Enable LiteLLM caching to reduce redundant requests
- Test Thoroughly: Validate that your local model works with your specific use cases
- Have Fallback: Keep cloud model access as a backup for critical operations (see the sketch after this list)
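A minimal sketch of the fallback idea from the last bullet: check the proxy's health endpoint and only point Cluster Code at the local model when it responds, otherwise leave the default cloud configuration untouched. It assumes the endpoint, key variable, and model name from the setup earlier in this guide, and that Cluster Code falls back to its default model when the `ANTHROPIC_*` variables are unset:

```bash
#!/bin/bash
# Use the local proxy only if it reports healthy; otherwise fall back to the
# default cloud configuration (sketch -- adjust names to your setup).
if curl -sf -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
     http://localhost:4000/health > /dev/null; then
  export ANTHROPIC_BASE_URL="http://localhost:4000"
  export ANTHROPIC_AUTH_TOKEN="$LITELLM_MASTER_KEY"
  MODEL_ARGS=(--model local-cluster-model)
else
  unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN
  MODEL_ARGS=()
fi

cluster-code "${MODEL_ARGS[@]}" diagnose
```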
Troubleshooting Common Issues
Proxy Connection Issues
```bash
# Check if the proxy is running
curl http://localhost:4000/health

# Check proxy logs
litellm --config config.yaml --debug
```
Model Performance Issues
```bash
# Monitor system resources
htop
nvidia-smi  # If using GPU

# Test the model directly
ollama list
ollama run deepseek-coder-v2
```
Cluster Code Connection Issues
```bash
# Test the connection to the proxy
curl -H "Authorization: Bearer your-key" \
  http://localhost:4000/v1/models

# Check environment variables
echo $ANTHROPIC_BASE_URL
echo $ANTHROPIC_AUTH_TOKEN
```
Security Considerations
Authentication
- Use strong, unique master keys for LiteLLM proxy
- Rotate keys regularly
- Limit access to proxy server
Network Security
- Run proxy on localhost or secure internal network
- Use TLS/SSL for remote connections
- Implement firewall rules as needed
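One way to apply these points, assuming a LiteLLM version that accepts a `--host` flag and a host that uses ufw as its firewall:

```bash
# Bind the proxy to loopback so it is not reachable from other machines
litellm --config config.yaml --host 127.0.0.1 --port 4000

# Or, if the proxy must be reachable on a trusted internal network, restrict
# the port at the firewall (ufw shown as an example)
sudo ufw allow from 10.0.0.0/8 to any port 4000 proto tcp
sudo ufw deny 4000/tcp
```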
Data Privacy
- Local models keep your data on-premises
- No data sent to external APIs
- Full control over logging and data retention
Integration with Existing Workflows
CI/CD Integration
```yaml
# .github/workflows/cluster-check.yml
name: Cluster Health Check
on: [push, pull_request]

jobs:
  diagnose:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Local LLM
        run: |
          pip install 'litellm[proxy]'
          litellm --config config.yaml --port 4000 &
      - name: Run Diagnostics
        env:
          ANTHROPIC_BASE_URL: "http://localhost:4000"
          ANTHROPIC_AUTH_TOKEN: ${{ secrets.LITELLM_MASTER_KEY }}  # assumes the master key is stored as a repository secret with this name
        run: |
          cluster-code --model local-cluster-model diagnose --output json
```
Team Collaboration
Share your LiteLLM configuration with team members:
```bash
# Create the team configuration
git add config.yaml
git commit -m "Add shared LLM configuration"

# Team members can then set up quickly
pip install 'litellm[proxy]'
litellm --config config.yaml
```
Future Enhancements
Planned improvements to Local LLM support:
- Native Integration: Direct LLM endpoint configuration without proxy
- Model Auto-Selection: Automatically choose best model for specific tasks
- Performance Monitoring: Built-in metrics and alerting
- Fine-tuning Support: Integration with custom fine-tuned models
- Edge Deployment: Support for edge computing scenarios
Support
For Local LLM support issues:
- Documentation: https://docs.cluster-code.io/local-llm
- Community: https://discord.gg/cluster-code
- Issues: https://github.com/your-org/cluster-code/issues
- LiteLLM Docs: https://docs.litellm.ai/
Contributing
We welcome contributions to improve Local LLM support:
- Test with different local models
- Share performance benchmarks
- Contribute model-specific optimizations
- Improve documentation and examples
- Report bugs and suggest enhancements
Local LLM support enables Cluster Code to work with your preferred models while maintaining privacy and control over your cluster management workflows.