Performance Optimization¶
Advanced techniques for maximizing ModelForge performance.
Overview¶
This guide covers advanced optimization techniques to get the best training performance from ModelForge on your hardware.
Quick Wins¶
1. Use Unsloth Provider¶
Impact: 2x faster training
Effort: Low
{
"provider": "unsloth"
}
Requirements:
- Linux or WSL (not native Windows)
- Compatible model architecture (Llama, Mistral, Qwen, Gemma, Phi)
Speedup: Unsloth's optimized kernels account for the roughly 2x gain; the exact figure depends on the model and hardware.
2. Enable BF16 on Ampere+ GPUs¶
Impact: 20-30% faster
Effort: Low
{
"bf16": true,
"fp16": false
}
Requirements:
- NVIDIA RTX 30xx, 40xx, or A100 GPUs
- PyTorch 2.0+
Benefits:
- Faster training than FP16
- Better numerical stability
- No accuracy degradation
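The choice between BF16 and FP16 can be derived from the GPU's CUDA compute capability. The sketch below is illustrative and not part of ModelForge: `precision_flags` is a hypothetical helper, and it relies on the fact that Ampere and newer GPUs report a major compute capability of 8 or higher.

```python
def precision_flags(compute_capability_major: int) -> dict:
    """Return mixed-precision flags for a GPU's CUDA compute capability.

    Ampere and newer GPUs report a major version of 8 or higher and
    support BF16; older tensor-core GPUs fall back to FP16.
    """
    if compute_capability_major >= 8:
        return {"bf16": True, "fp16": False}
    return {"bf16": False, "fp16": True}


# With PyTorch installed, the major version can be read with:
#   major, _minor = torch.cuda.get_device_capability()
print(precision_flags(8))  # RTX 30xx / 40xx / A100 class
print(precision_flags(7))  # RTX 20xx class
```

The returned dictionary can be merged into the JSON configuration shown above.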
3. Optimize Batch Size¶
Impact: Variable (can be 2-3x faster)
Effort: Medium
Strategy:
1. Start with the smallest batch size (1)
2. Gradually increase until you hit OOM
3. Back off to the largest size that fits
{
"per_device_train_batch_size": 4, // Increase from default 1
"gradient_accumulation_steps": 2 // Decrease from default 4
}
Formula:
Effective Batch Size = per_device_train_batch_size × gradient_accumulation_steps
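Keep the effective batch size constant while maximizing per_device_train_batch_size. With the defaults (1 × 4 = 4), the example above doubles the effective batch size to 8 (4 × 2); set gradient_accumulation_steps to 1 to stay at 4.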
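The search strategy and the formula can be combined in a few lines. This is a minimal sketch under stated assumptions: `fits` is a hypothetical callable that runs a few training steps at a given batch size and returns False on a CUDA out-of-memory error, and neither function is part of ModelForge.

```python
def grad_accum_for(effective_batch_size: int, per_device_batch_size: int) -> int:
    """Gradient accumulation steps that keep the effective batch size fixed."""
    if effective_batch_size % per_device_batch_size != 0:
        raise ValueError("per-device batch size must divide the effective batch size")
    return effective_batch_size // per_device_batch_size


def largest_fitting_batch(fits, candidates=(1, 2, 4, 8, 16)):
    """Return the largest candidate for which fits(batch_size) is True.

    `fits` is a hypothetical callable that runs a few training steps and
    returns False when the run hits an out-of-memory error.
    """
    best = None
    for batch_size in candidates:
        if not fits(batch_size):
            break  # stop at the first size that runs out of memory
        best = batch_size
    return best


# Pretend only batch sizes up to 4 fit in VRAM.
batch = largest_fitting_batch(lambda b: b <= 4)
print(batch, grad_accum_for(8, batch))
```

In a real run, `fits` would wrap a short training job rather than a lambda.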
4. Use Flash Attention¶
Impact: 20-40% faster for long sequences
Effort: Medium
Flash Attention is automatically used by Unsloth when available.
Manual Installation:
pip install flash-attn --no-build-isolation
Benefits:
- Faster attention computation
- Lower memory usage
- Enables longer sequences
5. Reduce Sequence Length¶
Impact: Proportional to reduction
Effort: Low
{
"max_seq_length": 1024 // Instead of 2048
}
Trade-off: Shorter context windows, but much faster training.
Recommendations:
- Text generation: 1024-2048
- Summarization: 512-1024
- Question answering: 512
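One way to choose a value within these ranges is to look at the token lengths in your own dataset and pick the smallest length that covers most examples. The helper below is a hypothetical illustration, not a ModelForge feature; the 90% coverage and the rounding to a multiple of 64 are arbitrary assumptions.

```python
def suggest_max_seq_length(token_lengths, coverage_percent=95, multiple=64):
    """Smallest length, rounded up to `multiple`, that covers the given
    percentage of examples without truncation."""
    ordered = sorted(token_lengths)
    # Ceiling division with integers avoids floating-point rounding surprises.
    needed = -(-coverage_percent * len(ordered) // 100)
    index = max(0, needed - 1)
    return -(-ordered[index] // multiple) * multiple


# Illustrative token counts; one long outlier should not set the limit.
lengths = [120, 300, 310, 450, 480, 500, 620, 700, 900, 3000]
print(suggest_max_seq_length(lengths, coverage_percent=90))
```

Examples longer than the suggested length are truncated, so check that the truncated tail does not carry the part of the example the task depends on.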
VRAM Optimization¶
Memory Usage Formula¶
VRAM ≈ (Model Size × Precision Factor) + (Batch Size × Sequence Length × Hidden Size)
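The formula can be turned into a rough calculator. This sketch follows the formula term by term; the `activation_bytes` and `layers` parameters are assumptions added here so the second term has units of bytes, and the estimate ignores gradients, optimizer state, and framework overhead, so treat the result as a lower bound rather than a prediction.

```python
def estimate_vram_gb(params, bytes_per_param, batch_size, seq_len,
                     hidden_size, activation_bytes=2, layers=1):
    """Rough VRAM estimate in GB following the formula above.

    bytes_per_param plays the role of the precision factor:
    2 for FP16/BF16 weights, 0.5 for 4-bit weights.
    activation_bytes and layers are assumptions of this sketch.
    """
    weights = params * bytes_per_param
    activations = batch_size * seq_len * hidden_size * activation_bytes * layers
    return (weights + activations) / 1e9


# An 8B-parameter model with hidden size 4096 (illustrative values).
for label, factor in (("16-bit", 2), ("4-bit", 0.5)):
    gb = estimate_vram_gb(8e9, factor, batch_size=1, seq_len=2048, hidden_size=4096)
    print(f"{label}: {gb:.2f} GB")
```

The weight term dominates for large models, which is why quantization is listed first among the memory reductions.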
Reduce Memory Usage¶
1. Use 4-bit Quantization¶
Savings: ~75% of model weight memory compared with 16-bit weights
{
"use_4bit": true,
"bnb_4bit_compute_dtype": "bfloat16",
"bnb_4bit_quant_type": "nf4"
}
2. Enable Gradient Checkpointing¶
Savings: ~30-50% activation memory
Cost: 10-15% slower
{
"gradient_checkpointing": true
}
3. Reduce LoRA Rank¶
Savings: Proportional to rank reduction
{
"lora_r": 16 // Instead of 64
}
Trade-off: May reduce model quality with very low ranks.
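The proportionality follows from how LoRA is built: each adapted weight matrix gains two low-rank factors whose size scales linearly with the rank. The sketch below counts those parameters for one square matrix; the hidden size of 4096 is an illustrative assumption, not a value read from ModelForge.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


hidden = 4096  # illustrative hidden size
for rank in (64, 16):
    print(rank, lora_params(hidden, hidden, rank))
```

Going from rank 64 to rank 16 cuts the adapter parameters, and with them the adapter gradients and optimizer state, by a factor of four.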
4. Use Paged Optimizers¶
Savings: Optimizer states paged to CPU RAM
{
"optim": "paged_adamw_32bit"
}
CPU Optimization¶
1. Set Number of Workers¶
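import os
# Let the Hugging Face tokenizers library tokenize with multiple threads.
# Set this before the tokenizer is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "true"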
2. Use Multiple CPU Cores¶
Set in environment:
export OMP_NUM_THREADS=8 # Use 8 CPU cores
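The same setting can be applied from Python when the core count is not known in advance. This is a sketch, not a ModelForge default: leaving one core free is a rule of thumb, and the variable has to be set before libraries such as PyTorch or NumPy are imported for it to take effect.

```python
import os

# Leave one core free for data loading and the OS (a rule of thumb).
cpu_count = os.cpu_count() or 1
threads = max(1, cpu_count - 1)

# setdefault keeps any value already exported in the shell.
os.environ.setdefault("OMP_NUM_THREADS", str(threads))
print(os.environ["OMP_NUM_THREADS"])
```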
3. Pin Memory¶
{
"dataloader_pin_memory": true
}
Data Loading Optimization¶
1. Preprocess Dataset¶
Tokenize dataset once before training:
from datasets import load_dataset
from transformers import AutoTokenizer
dataset = load_dataset("json", data_files="data.jsonl")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=2048)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset.save_to_disk("preprocessed_dataset")
Then use the preprocessed dataset in training.
2. Group by Length¶
Impact: 10-20% faster
{
"group_by_length": true
}
Groups similar-length sequences to minimize padding.
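The effect of grouping can be seen by counting padded tokens. This sketch compares batches taken in arrival order with batches taken from a length-sorted list; the sequence lengths are made up, and real trainers typically group approximately and keep some shuffling rather than sorting the whole dataset.

```python
def padded_tokens(lengths, batch_size):
    """Total tokens after padding each batch to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


lengths = [900, 40, 850, 60, 30, 880, 50, 910]  # illustrative token counts
unsorted_cost = padded_tokens(lengths, batch_size=4)
grouped_cost = padded_tokens(sorted(lengths), batch_size=4)
print(unsorted_cost, grouped_cost, sum(lengths))
```

The closer the padded total is to the sum of the real lengths, the less compute is spent on padding.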
3. Enable Packing¶
Impact: 20-40% faster for short sequences
{
"packing": true
}
Packs multiple sequences into one to maximize GPU utilization.
Caution: May reduce quality for some tasks.
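The idea behind packing can be illustrated with a simple first-fit scheme. This is not ModelForge's packing implementation; trainers often concatenate token streams and cut them into fixed-size blocks instead. The sketch only shows why fewer, fuller sequences mean less padding.

```python
def pack_greedy(lengths, max_seq_length):
    """First-fit packing: place each sequence in the first bin with room."""
    bins = []  # each bin is a list of sequence lengths
    for length in lengths:
        if length > max_seq_length:
            raise ValueError("sequence longer than max_seq_length; truncate first")
        for current in bins:
            if sum(current) + length <= max_seq_length:
                current.append(length)
                break
        else:
            # No existing bin had room, so start a new one.
            bins.append([length])
    return bins


packed = pack_greedy([300, 700, 200, 500, 100, 240], max_seq_length=1024)
print(len(packed), packed)
```

Six short sequences end up in three packed rows. Depending on how the trainer masks attention, tokens in one packed example may be able to attend to a neighbouring example, which is one plausible source of the quality caveat above.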
Training Hyperparameter Tuning¶
Learning Rate¶
Higher LR = faster convergence but less stable
Recommendations:
- Start with default (2e-4)
- Increase to 3e-4 or 5e-4 if training is stable
- Decrease to 1e-4 if seeing instability
{
"learning_rate": 3e-4
}
Warmup Ratio¶
Lower warmup = faster initial training
{
"warmup_ratio": 0.03 // 3% of training
}
Learning Rate Scheduler¶
{
"lr_scheduler_type": "cosine" // Recommended
}
Options:
- cosine: Best for most cases
- linear: Simpler, sometimes faster
- constant: No decay, fastest but may not converge well
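To see how the warmup ratio and the scheduler type interact, the learning rate at any step can be computed directly. This is a simplified sketch of the usual schedule shapes, not the code ModelForge runs; it applies linear warmup to all three schedules for simplicity, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, total_steps, base_lr=2e-4, warmup_ratio=0.03, schedule="cosine"):
    """Learning rate at `step` with linear warmup followed by decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if schedule == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    if schedule == "linear":
        return base_lr * (1.0 - progress)
    if schedule == "constant":
        return base_lr
    raise ValueError(f"unknown schedule: {schedule}")


for name in ("cosine", "linear", "constant"):
    print(name, [lr_at(s, 1000, schedule=name) for s in (0, 30, 515, 1000)])
```

With 1000 steps and a warmup ratio of 0.03, the first 30 steps ramp up and the remaining 970 follow the chosen decay.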
Model-Specific Optimizations¶
Llama Models¶
{
"provider": "unsloth",
"use_4bit": true,
"bf16": true,
"max_seq_length": 2048,
"gradient_checkpointing": true
}
Qwen Models¶
{
"provider": "unsloth",
"use_4bit": true,
"bf16": true,
"max_seq_length": 2048
}
T5/BART Models¶
{
"provider": "huggingface", // Unsloth doesn't support encoder-decoder
"use_4bit": true,
"bf16": true,
"max_seq_length": 1024
}
Hardware-Specific Optimizations¶
RTX 30xx/40xx (Ampere/Ada)¶
{
"provider": "unsloth",
"bf16": true,
"fp16": false,
"use_4bit": true,
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 2
}
Key Features:
- BF16 support
- Tensor cores
- High memory bandwidth
RTX 20xx (Turing)¶
{
"provider": "unsloth",
"fp16": true,
"bf16": false,
"use_4bit": true,
"per_device_train_batch_size": 2,
"gradient_accumulation_steps": 4
}
Key Features:
- FP16 support (no BF16)
- Tensor cores
GTX 16xx/10xx (Turing/Pascal)¶
{
"provider": "huggingface", // Unsloth may not work well
"fp16": true,
"use_4bit": true,
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 8,
"max_seq_length": 512
}
Limitations:
- Limited FP16 support
- No tensor cores
- Lower memory bandwidth
Benchmarking¶
Measure Training Speed¶
Monitor:
- Iterations per second
- Samples per second
- Seconds per epoch
Find Optimal Settings¶
# Test different batch sizes
for batch_size in [1, 2, 4, 8]:
    for grad_accum in [16, 8, 4, 2]:
        if batch_size * grad_accum == 16:  # Keep effective batch size constant
            print(f"Testing: batch={batch_size}, grad_accum={grad_accum}")
            # Run short training and measure speed
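The "run short training and measure speed" step can be sketched with a small timing helper. Everything here is illustrative: `train_step` is a hypothetical callable standing in for one real training step, and `fake_step` only sleeps so the example runs without a GPU.

```python
import time


def samples_per_second(train_step, batch_size, steps=20):
    """Time `steps` calls of a hypothetical train_step(batch_size) callable."""
    start = time.perf_counter()
    for _ in range(steps):
        train_step(batch_size)
    # On a GPU, call torch.cuda.synchronize() here before reading the clock,
    # because CUDA kernels run asynchronously.
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed


def fake_step(batch_size):
    # Stand-in for one real training step; replace with a short training run.
    time.sleep(0.001)


effective = 16  # keep the effective batch size constant
for batch_size in (1, 2, 4, 8):
    grad_accum = effective // batch_size
    rate = samples_per_second(fake_step, batch_size, steps=5)
    print(f"batch={batch_size} grad_accum={grad_accum} {rate:.0f} samples/sec")
```

Skip the first few steps when timing a real run, since the first iterations include one-off setup costs.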
Performance Metrics¶
Track these metrics:
- VRAM usage (nvidia-smi)
- Training speed (samples/sec)
- Time per epoch
- GPU utilization (should be >90%)
Common Bottlenecks¶
1. Low GPU Utilization¶
Symptoms: GPU utilization < 80%
Causes:
- CPU preprocessing bottleneck
- Small batch size
- Data loading slow
Solutions:
- Increase batch size
- Preprocess dataset
- Increase data loading workers
2. OOM (Out of Memory)¶
Symptoms: CUDA out of memory error
Solutions:
1. Reduce batch size
2. Increase gradient accumulation
3. Enable 4-bit quantization
4. Reduce sequence length
5. Enable gradient checkpointing
6. Use smaller LoRA rank
3. Slow Data Loading¶
Symptoms: GPU waits for data
Solutions:
- Preprocess dataset
- Increase num_workers
- Use faster storage (SSD vs HDD)
- Enable data caching
4. Gradient Overflow (FP16)¶
Symptoms: Loss becomes NaN
Solutions:
- Use BF16 instead of FP16
- Reduce learning rate
- Enable gradient clipping
- Use mixed precision with loss scaling
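The narrow range of FP16 behind these symptoms can be demonstrated with the standard library alone. The sketch below round-trips values through IEEE 754 half precision using the `struct` module's `e` format; `to_fp16` is a helper defined here for illustration, and the scale factor of 65536 is an arbitrary example value.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(to_fp16(65504.0))  # the largest finite FP16 value survives
print(to_fp16(1e-8))     # a tiny gradient is flushed to zero

try:
    to_fp16(1e5)
except OverflowError:
    print("1e5 does not fit in FP16")

# Loss scaling multiplies the loss (and so the gradients) before the
# backward pass and divides afterwards, keeping small values representable.
scale = 65536.0
print(to_fp16(1e-8 * scale) / scale)
```

BF16 keeps the same exponent range as FP32, which is why switching to it usually removes the need for loss scaling.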
Advanced Techniques¶
1. Gradient Accumulation Optimization¶
Find optimal balance:
total_batch_size = 16
# Test configurations
configs = [
    (1, 16),  # batch=1, grad_accum=16
    (2, 8),   # batch=2, grad_accum=8
    (4, 4),   # batch=4, grad_accum=4
    (8, 2),   # batch=8, grad_accum=2
]
# Larger batch_size usually faster if it fits in VRAM
2. Mixed Precision Training¶
{
"bf16": true,
"fp16": false,
"tf32": true // Enable TF32 on Ampere+
}
3. Compilation (PyTorch 2.0+)¶
import torch
model = torch.compile(model, mode="max-autotune")
Benefits: 10-30% speedup
Caution: Experimental, may cause issues
4. Multi-GPU Training¶
For multiple GPUs, ModelForge automatically uses DataParallel.
CUDA_VISIBLE_DEVICES=0,1 modelforge run
Performance Checklist¶
Before training, ensure:
- [ ] Using Unsloth provider (if compatible)
- [ ] BF16 enabled (Ampere+ GPUs)
- [ ] 4-bit quantization enabled
- [ ] Batch size maximized (no OOM)
- [ ] Gradient checkpointing enabled
- [ ] Dataset preprocessed
- [ ] Sequence length optimized for task
- [ ] Flash Attention installed
- [ ] GPU utilization > 90%
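The configuration-related items in this checklist can be checked automatically before a run. The function below is a hypothetical sketch, not a ModelForge command; it only inspects keys that appear in the configuration examples in this guide and cannot verify runtime items such as GPU utilization.

```python
def check_config(config: dict, ampere_or_newer: bool) -> list:
    """Return warnings for checklist items the config does not satisfy."""
    warnings = []
    if config.get("provider") != "unsloth":
        warnings.append("provider is not unsloth (expected for unsupported models)")
    if ampere_or_newer and not config.get("bf16", False):
        warnings.append("bf16 is disabled on an Ampere+ GPU")
    if config.get("bf16") and config.get("fp16"):
        warnings.append("bf16 and fp16 are both enabled")
    if not config.get("use_4bit", False):
        warnings.append("4-bit quantization is disabled")
    if not config.get("gradient_checkpointing", False):
        warnings.append("gradient checkpointing is disabled")
    return warnings


llama_config = {
    "provider": "unsloth",
    "use_4bit": True,
    "bf16": True,
    "max_seq_length": 2048,
    "gradient_checkpointing": True,
}
print(check_config(llama_config, ampere_or_newer=True))
print(check_config({"provider": "huggingface"}, ampere_or_newer=True))
```

A warning is a prompt to double-check, not an error: some items, such as the provider, are legitimately different for encoder-decoder models or older GPUs.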
Troubleshooting Performance Issues¶
Training is Slower Than Expected¶
- Check GPU utilization: nvidia-smi dmon
- Verify Unsloth is being used (check logs)
- Ensure BF16 is enabled on compatible GPUs
- Check if CPU is bottleneck (htop)
- Verify no background processes using GPU
Memory Usage Too High¶
- Enable 4-bit quantization
- Reduce batch size
- Reduce sequence length
- Lower LoRA rank
- Enable gradient checkpointing
Training is Unstable¶
- Reduce learning rate
- Increase warmup steps
- Enable gradient clipping
- Switch to BF16 from FP16
- Check dataset quality
Performance Comparison¶
Training Time (1000 examples, 3 epochs)¶
| Configuration | Time | Speedup |
|---|---|---|
| Baseline (HF, FP16, batch=1) | 90 min | 1x |
| + Unsloth | 45 min | 2x |
| + BF16 | 35 min | 2.6x |
| + Larger batch (4) | 25 min | 3.6x |
| + Optimized settings | 20 min | 4.5x |
Hardware: RTX 4070 (12GB), Llama-3.1-8B, QLoRA
Next Steps¶
- Hardware Profiles - Optimize for your GPU
- Configuration Guide - All settings
- Troubleshooting - Fix common problems
Maximize your training performance! Every optimization counts.