RLHF Strategy (Reinforcement Learning from Human Feedback)¶
Advanced fine-tuning using human preference learning with PPO (Proximal Policy Optimization).
Overview¶
RLHF is an advanced training strategy that aligns language models with human preferences through reinforcement learning. It's the strategy behind models like ChatGPT and Claude.
What is RLHF?¶
RLHF trains models to maximize reward signals based on human feedback:
- Supervised Fine-Tuning (SFT): Train base model on demonstrations
- Reward Model Training: Train model to predict human preferences
- RL Optimization: Use PPO to optimize for high reward
ModelForge focuses on the RL Optimization phase, assuming you have a reward model or preference data.
Features¶
✅ State-of-the-art alignment - Best quality for human-aligned outputs
✅ Preference learning - Learn from human feedback
✅ Iterative improvement - Continuously improve model behavior
⚠️ Complex setup - Requires reward model or preference data
⚠️ Computationally expensive - Slower than SFT/QLoRA
⚠️ Advanced technique - Recommended for experienced users
When to Use RLHF¶
✅ Use RLHF When:¶
- Aligning model outputs with human preferences
- Training conversational AI or assistants
- Need highest quality, human-like responses
- Have preference pairs or reward model
- Quality matters more than speed
- Have sufficient computational resources (high-end hardware)
❌ Don't Use RLHF When:¶
- First time fine-tuning (start with SFT)
- Limited VRAM (< 12GB)
- Simple supervised learning task
- Need fast training
- Don't have preference data
Dataset Format¶
RLHF requires datasets with preference pairs:
{"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "I don't know."}
{"prompt": "Explain quantum computing", "chosen": "Quantum computing uses quantum bits...", "rejected": "It's complicated."}
{"prompt": "Write a haiku about coding", "chosen": "Code flows like water\nBugs hide in silent shadows\nDebug brings the light", "rejected": "Coding is fun"}
Required Fields:
- prompt: Input prompt or question
- chosen: Preferred/better response
- rejected: Non-preferred/worse response
Dataset Preparation¶
- Collect Human Feedback: Get humans to rank multiple responses
- Create Preference Pairs: Pair each prompt with chosen and rejected responses
- Quality Control: Ensure clear preference distinctions
- Balance Dataset: Include diverse prompts and preferences
Example Dataset Creation¶
# From human ratings
ratings = [
{"prompt": "Tell me a joke", "response_a": "Why did the chicken...", "response_b": "Haha funny", "preference": "a"},
]
# Convert to RLHF format
rlhf_data = []
for item in ratings:
rlhf_data.append({
"prompt": item["prompt"],
"chosen": item["response_a"] if item["preference"] == "a" else item["response_b"],
"rejected": item["response_b"] if item["preference"] == "a" else item["response_a"]
})
Configuration¶
Basic RLHF Configuration¶
{
"strategy": "rlhf",
"task": "text-generation",
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"dataset": "/path/to/preference-data.jsonl",
"provider": "huggingface",
"num_train_epochs": 1,
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 8,
"learning_rate": 1.41e-5,
"lora_r": 16,
"lora_alpha": 32,
"use_4bit": true,
"bf16": true
}
Advanced RLHF Configuration¶
{
"strategy": "rlhf",
"task": "text-generation",
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"dataset": "/path/to/preference-data.jsonl",
"provider": "unsloth",
"num_train_epochs": 1,
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 4,
"learning_rate": 1.41e-5,
"lora_r": 64,
"lora_alpha": 16,
"use_4bit": true,
"bf16": true,
"max_seq_length": 2048,
"warmup_ratio": 0.1,
"eval_split": 0.1
}
How RLHF Works¶
Training Process¶
1. Load Pre-trained Model
↓
2. Apply LoRA Adapters
↓
3. Add Value Head (for reward prediction)
↓
4. Load Preference Dataset
↓
5. PPO Training Loop:
- Generate responses
- Calculate rewards
- Update policy
- Clip gradients
↓
6. Save Fine-tuned Model
Key Differences from SFT¶
| Aspect | SFT | RLHF |
|---|---|---|
| Objective | Minimize loss on examples | Maximize reward |
| Training | Supervised learning | Reinforcement learning |
| Dataset | Input-output pairs | Preference pairs |
| Complexity | Simple | Complex |
| Speed | Fast | Slow |
| Quality | High | Very High |
Hardware Requirements¶
Minimum Requirements¶
- GPU: 12GB+ VRAM
- RAM: 16GB+ system RAM
- Recommended: Mid-range to high-end profile
- Provider: HuggingFace (Unsloth experimental)
Memory Usage¶
RLHF requires more memory than SFT due to: - Value head network - PPO buffers - Advantage estimation
Example (7B model): - SFT with 4-bit: ~6-8 GB VRAM - RLHF with 4-bit: ~10-12 GB VRAM
Recommended Settings by Hardware¶
Mid Range (12-16GB VRAM)¶
{
"strategy": "rlhf",
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"use_4bit": true,
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 8,
"lora_r": 16,
"max_seq_length": 1024
}
High End (16GB+ VRAM)¶
{
"strategy": "rlhf",
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"use_4bit": true,
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 4,
"lora_r": 64,
"max_seq_length": 2048
}
Hyperparameter Tuning¶
Learning Rate¶
RLHF typically uses lower learning rates than SFT:
- SFT: 2e-4
- RLHF: 1.41e-5 (recommended)
LoRA Configuration¶
{
"lora_r": 16, // Can use 16, 32, 64
"lora_alpha": 32, // Usually 2x rank
"lora_dropout": 0.05 // Lower dropout for RLHF
}
Training Epochs¶
RLHF requires fewer epochs than SFT:
- SFT: 3-5 epochs
- RLHF: 1-2 epochs (risk of reward hacking with more)
Evaluation¶
RLHF models are evaluated using:
- Reward Score: Average predicted reward
- KL Divergence: Deviation from base model (prevents over-optimization)
- Human Evaluation: Manual quality assessment
Common Issues¶
Reward Hacking¶
Problem: Model learns to game the reward without improving quality
Solutions: - Use KL penalty to stay close to base model - Limit training to 1-2 epochs - Monitor KL divergence - Use diverse preference data
High Memory Usage¶
Problem: RLHF runs out of memory
Solutions: - Use 4-bit quantization - Reduce batch size to 1 - Increase gradient accumulation - Reduce sequence length - Use smaller model
Unstable Training¶
Problem: Loss oscillates or diverges
Solutions: - Lower learning rate (1e-5 or lower) - Increase warmup steps - Use gradient clipping - Check dataset quality
Advanced Topics¶
Custom Reward Models¶
You can use custom reward models:
from transformers import AutoModelForSequenceClassification
reward_model = AutoModelForSequenceClassification.from_pretrained(
"your-reward-model"
)
Multi-Objective RLHF¶
Optimize for multiple objectives:
reward = 0.7 * quality_score + 0.3 * safety_score
Comparison with DPO¶
| Feature | RLHF | DPO |
|---|---|---|
| Complexity | High | Lower |
| Reward Model | Required | Not required |
| Stability | Less stable | More stable |
| Performance | Excellent | Excellent |
| Speed | Slower | Faster |
| Memory | Higher | Lower |
Recommendation: Try DPO first - it's simpler and often performs similarly.
Example: Training a Helpful Assistant¶
{
"strategy": "rlhf",
"task": "text-generation",
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"dataset": "/data/helpful-assistant-preferences.jsonl",
"provider": "unsloth",
"num_train_epochs": 1,
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 4,
"learning_rate": 1.41e-5,
"warmup_ratio": 0.1,
"lora_r": 64,
"lora_alpha": 128,
"use_4bit": true,
"bf16": true,
"max_seq_length": 2048,
"eval_split": 0.1
}
Best Practices¶
- ✅ Start with SFT before RLHF
- ✅ Use high-quality preference data
- ✅ Monitor KL divergence to prevent over-optimization
- ✅ Use lower learning rates than SFT
- ✅ Limit to 1-2 epochs
- ✅ Include diverse prompts in dataset
- ✅ Evaluate with human feedback
Next Steps¶
- DPO Strategy - Simpler alternative to RLHF
- SFT Strategy - Start here before RLHF
- QLoRA Strategy - Memory-efficient training
- Strategy Overview - Compare all strategies
RLHF: The gold standard for human-aligned AI! ⭐ Complex but powerful.