Performance Optimization¶

Advanced techniques for maximizing ModelForge performance.

Overview¶

This guide covers advanced optimization techniques to get the best training performance from ModelForge on your hardware.

Quick Wins¶

1. Use Unsloth Provider¶

Impact: 2x faster training
Effort: Low

{
  "provider": "unsloth"
}

Requirements:

- Linux or WSL (not native Windows)
- Compatible model architecture (Llama, Mistral, Qwen, Gemma, Phi)

Speedup: Unsloth's optimized kernels provide roughly 2x faster training on supported models.
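The requirements above can be turned into a small pre-flight check. This is a minimal sketch, not part of ModelForge: `choose_provider` and `UNSLOTH_ARCHITECTURES` are hypothetical names, and the fallback value `"huggingface"` is taken from the model-specific configs later in this guide.

```python
import platform
from typing import Optional

# Architectures listed as Unsloth-compatible in the requirements above.
UNSLOTH_ARCHITECTURES = {"llama", "mistral", "qwen", "gemma", "phi"}


def choose_provider(model_family: str, system: Optional[str] = None) -> str:
    """Return "unsloth" when the stated requirements are met, else "huggingface"."""
    system = system or platform.system()
    # WSL reports itself as "Linux"; native Windows reports "Windows".
    if system == "Linux" and model_family.lower() in UNSLOTH_ARCHITECTURES:
        return "unsloth"
    return "huggingface"


print(choose_provider("Llama", system="Linux"))
print(choose_provider("t5", system="Linux"))
```

The check only mirrors the two listed requirements; it does not verify that Unsloth is actually installed.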

2. Enable BF16 on Ampere+ GPUs¶

Impact: 20-30% faster
Effort: Low

{
  "bf16": true,
  "fp16": false
}

Requirements:

- NVIDIA RTX 30xx, 40xx, or A100 GPUs
- PyTorch 2.0+

Benefits:

- Faster training than FP16
- Better numerical stability
- No accuracy degradation
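One way to pick these flags automatically is from the GPU's CUDA compute capability, since Ampere and newer GPUs report a major version of 8 or higher. The sketch below assumes you already have that number (with PyTorch it can be read from `torch.cuda.get_device_capability()`); `precision_flags` is a hypothetical helper, not a ModelForge function.

```python
def precision_flags(compute_capability_major: int) -> dict:
    """Pick bf16/fp16 flags from the CUDA compute capability major version."""
    # Ampere and newer (compute capability 8.x and up) support BF16.
    supports_bf16 = compute_capability_major >= 8
    return {"bf16": supports_bf16, "fp16": not supports_bf16}


print(precision_flags(8))  # Ampere / Ada
print(precision_flags(7))  # Turing
```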

3. Optimize Batch Size¶

Impact: Variable (can be 2-3x faster)
Effort: Medium

Strategy:

1. Start with the smallest batch size (1)
2. Gradually increase until you hit OOM
3. Back off to the largest size that fits

{
  "per_device_train_batch_size": 4,  // Increase from default 1
  "gradient_accumulation_steps": 2   // Decrease from default 4
}

Formula:

Effective Batch Size = per_device_train_batch_size × gradient_accumulation_steps

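Keep the effective batch size constant while maximizing per_device_train_batch_size. Note that the example above raises the effective batch size from 4 (1 × 4 with the defaults) to 8 (4 × 2); to hold it at 4, set gradient_accumulation_steps to 1.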
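The search strategy and the formula can be combined into one routine. This is a sketch under stated assumptions: `fits` is a hypothetical callback that you supply, which runs a short trial at the given per-device batch size and returns False if it hits OOM. Only batch sizes that divide the target evenly are tried, so the product stays exactly constant.

```python
from typing import Callable, Tuple


def find_batch_config(fits: Callable[[int], bool],
                      effective_batch_size: int = 16) -> Tuple[int, int]:
    """Return (per_device_train_batch_size, gradient_accumulation_steps)."""
    best = None
    for batch_size in range(1, effective_batch_size + 1):
        if effective_batch_size % batch_size != 0:
            continue  # keep the product exactly equal to the target
        if not fits(batch_size):
            break  # first OOM: back off to the last size that fit
        best = batch_size
    if best is None:
        raise RuntimeError("Even batch size 1 does not fit in memory")
    return best, effective_batch_size // best


# Stand-in for a real trial run: pretend anything above 5 runs out of memory.
print(find_batch_config(lambda batch_size: batch_size <= 5))
```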

4. Use Flash Attention¶

Impact: 20-40% faster for long sequences
Effort: Medium

Flash Attention is automatically used by Unsloth when available.

Manual Installation:

pip install flash-attn --no-build-isolation

Benefits:

- Faster attention computation
- Lower memory usage
- Enables longer sequences

5. Reduce Sequence Length¶

Impact: Proportional to reduction
Effort: Low

{
  "max_seq_length": 1024  // Instead of 2048
}

Trade-off: Shorter context windows, but much faster training.

Recommendations:

- Text generation: 1024-2048
- Summarization: 512-1024
- Question answering: 512

VRAM Optimization¶

Memory Usage Formula¶

VRAM ≈ (Model Size × Precision Factor) + (Batch Size × Sequence Length × Hidden Size)
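The formula can be evaluated directly to compare configurations. The sketch below is only as accurate as the formula itself: it treats "Precision Factor" as bytes per parameter, assumes 2 bytes per activation value (`bytes_per_activation` is not in the formula above), and ignores gradients, optimizer state, and per-layer activations, so treat the result as a rough lower bound.

```python
def estimate_vram_gb(num_params: float, bytes_per_param: float,
                     batch_size: int, seq_length: int, hidden_size: int,
                     bytes_per_activation: int = 2) -> float:
    """Rough VRAM estimate in GiB following the formula above."""
    weights = num_params * bytes_per_param
    # Assumption: each activation value takes `bytes_per_activation` bytes.
    activations = batch_size * seq_length * hidden_size * bytes_per_activation
    return (weights + activations) / 1024**3


# Example: an 8B-parameter model in 4-bit (0.5 bytes per parameter),
# batch size 4, sequence length 2048, hidden size 4096.
print(f"{estimate_vram_gb(8e9, 0.5, 4, 2048, 4096):.2f} GiB")
```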

Reduce Memory Usage¶

1. Use 4-bit Quantization¶

Savings: ~75% model memory

{
  "use_4bit": true,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_type": "nf4"
}
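The ~75% figure follows from the bit widths, assuming the comparison baseline is 16-bit weights. A short worked calculation for an 8B-parameter model is sketched below; it ignores the small overhead of the quantization constants, so real savings are slightly lower.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory taken by the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(8e9, 16)
nf4_gb = weight_memory_gb(8e9, 4)
savings = 1 - nf4_gb / fp16_gb
print(f"16-bit: {fp16_gb} GB, 4-bit: {nf4_gb} GB, savings: {savings:.0%}")
```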

2. Enable Gradient Checkpointing¶

Savings: ~30-50% activation memory
Cost: 10-15% slower

{
  "gradient_checkpointing": true
}

3. Reduce LoRA Rank¶

Savings: Proportional to rank reduction

{
  "lora_r": 16  // Instead of 64
}

Trade-off: May reduce model quality with very low ranks.
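To see why the savings are proportional to the rank: for each adapted weight matrix of shape d_out × d_in, LoRA trains two small matrices, A (rank × d_in) and B (d_out × rank). The sketch below counts those parameters for a single 4096 × 4096 projection; the matrix size is an illustrative assumption, and the total depends on which modules are adapted.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


r64 = lora_param_count(4096, 4096, rank=64)
r16 = lora_param_count(4096, 4096, rank=16)
print(f"rank 64: {r64} params, rank 16: {r16} params, ratio: {r64 // r16}x")
```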

4. Use Paged Optimizers¶

Savings: Optimizer states paged to CPU RAM

{
  "optim": "paged_adamw_32bit"
}

CPU Optimization¶

1. Set Number of Workers¶

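import os

# Enables multithreaded tokenization in Hugging Face tokenizers; set it before the tokenizer is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "true"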

2. Use Multiple CPU Cores¶

Set in environment:

export OMP_NUM_THREADS=8  # Use 8 CPU cores

3. Pin Memory¶

{
  "dataloader_pin_memory": true
}

Data Loading Optimization¶

1. Preprocess Dataset¶

Tokenize dataset once before training:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("json", data_files="data.jsonl")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=2048)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset.save_to_disk("preprocessed_dataset")

Then use preprocessed dataset in training.

2. Group by Length¶

Impact: 10-20% faster

{
  "group_by_length": true
}

Groups similar-length sequences to minimize padding.
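The effect on padding can be illustrated without a GPU. The sketch below counts pad tokens when each batch is padded to its longest sequence, first with lengths in random order and then sorted by length. A full sort is a simplification of how a trainer typically groups by length, and the sequence lengths are synthetic.

```python
import random


def padding_tokens(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


random.seed(0)
lengths = [random.randint(16, 1024) for _ in range(256)]  # synthetic lengths
shuffled = padding_tokens(lengths, batch_size=8)
grouped = padding_tokens(sorted(lengths), batch_size=8)
print(f"padding tokens: shuffled={shuffled}, grouped={grouped}")
```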

3. Enable Packing¶

Impact: 20-40% faster for short sequences

{
  "packing": true
}

Packs multiple sequences into one to maximize GPU utilization.

Caution: May reduce quality for some tasks.
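The idea behind packing can be sketched with a greedy first-fit strategy over sequence lengths. This illustrates the concept only and is not how ModelForge or its providers implement packing; `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(lengths, max_seq_length):
    """Greedy first-fit packing; returns bins, each a list of sequence lengths."""
    bins = []
    for n in lengths:
        if n > max_seq_length:
            raise ValueError(f"sequence of length {n} exceeds max_seq_length")
        for current in bins:
            if sum(current) + n <= max_seq_length:
                current.append(n)  # fits into an existing packed sample
                break
        else:
            bins.append([n])  # start a new packed sample
    return bins


packed = pack_sequences([600, 500, 400, 300, 200], max_seq_length=1024)
print(f"{len(packed)} packed samples instead of 5: {packed}")
```

Fewer, fuller samples mean fewer padded positions per step, which is where the speedup for short sequences comes from.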

Training Hyperparameter Tuning¶

Learning Rate¶

Higher LR = faster convergence but less stable

Recommendations:

- Start with default (2e-4)
- Increase to 3e-4 or 5e-4 if training is stable
- Decrease to 1e-4 if seeing instability

{
  "learning_rate": 3e-4
}

Warmup Ratio¶

Lower warmup = faster initial training

{
  "warmup_ratio": 0.03  // 3% of training
}
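To see what a warmup ratio means in optimizer steps, the sketch below derives the total step count from the dataset size and batch settings, then applies the ratio. The rounding (ceiling) and the way steps per epoch are counted are assumptions; the trainer's exact arithmetic may differ by a step or two.

```python
import math


def warmup_steps(num_examples, epochs, per_device_batch_size,
                 grad_accum_steps, warmup_ratio, num_devices=1):
    """Return (warmup_steps, total_steps) for a training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return math.ceil(total_steps * warmup_ratio), total_steps


# 1000 examples, 3 epochs, batch size 4, gradient accumulation 2, 3% warmup.
warmup, total = warmup_steps(1000, 3, 4, 2, 0.03)
print(f"{warmup} warmup steps out of {total} total")
```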

Learning Rate Scheduler¶

{
  "lr_scheduler_type": "cosine"  // Recommended
}

Options:

- cosine: Best for most cases
- linear: Simpler, sometimes faster
- constant: No decay, fastest but may not converge well
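The three schedules differ only in how the learning rate decays after warmup. The sketch below uses the standard textbook definitions with a linear warmup; it is meant for plotting or comparing schedules and is not taken from ModelForge's source.

```python
import math


def lr_at_step(step, total_steps, base_lr, schedule="cosine", warmup_steps=0):
    """Learning rate at a given step for cosine, linear, or constant decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)  # linear warmup
    if schedule == "constant":
        return base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if schedule == "linear":
        return base_lr * max(0.0, 1.0 - progress)
    if schedule == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")


for name in ("cosine", "linear", "constant"):
    midpoint = lr_at_step(50, 100, 2e-4, schedule=name)
    print(f"{name}: lr at the halfway point = {midpoint:.2e}")
```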

Model-Specific Optimizations¶

Llama Models¶

{
  "provider": "unsloth",
  "use_4bit": true,
  "bf16": true,
  "max_seq_length": 2048,
  "gradient_checkpointing": true
}

Qwen Models¶

{
  "provider": "unsloth",
  "use_4bit": true,
  "bf16": true,
  "max_seq_length": 2048
}

T5/BART Models¶

{
  "provider": "huggingface",  // Unsloth doesn't support encoder-decoder
  "use_4bit": true,
  "bf16": true,
  "max_seq_length": 1024
}

Hardware-Specific Optimizations¶

RTX 30xx/40xx (Ampere/Ada)¶

{
  "provider": "unsloth",
  "bf16": true,
  "fp16": false,
  "use_4bit": true,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 2
}

Key Features:

- BF16 support
- Tensor cores
- High memory bandwidth

RTX 20xx (Turing)¶

{
  "provider": "unsloth",
  "fp16": true,
  "bf16": false,
  "use_4bit": true,
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 4
}

Key Features:

- FP16 support (no BF16)
- Tensor cores

GTX 16xx/10xx (Turing/Pascal)¶

{
  "provider": "huggingface",  // Unsloth may not work well
  "fp16": true,
  "use_4bit": true,
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 8,
  "max_seq_length": 512
}

Limitations:

- Limited FP16 support
- No tensor cores
- Lower memory bandwidth

Benchmarking¶

Measure Training Speed¶

Monitor:

- Iterations per second
- Samples per second
- Seconds per epoch
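All three numbers can be derived from a step count and a wall-clock measurement. The helper below is a sketch: `throughput` is a hypothetical function, and `samples_per_step` should be the number of samples consumed per counted iteration.

```python
def throughput(steps, samples_per_step, elapsed_seconds, dataset_size):
    """Derive the three speed metrics from a timed run."""
    iterations_per_sec = steps / elapsed_seconds
    samples_per_sec = iterations_per_sec * samples_per_step
    return {
        "iterations_per_sec": iterations_per_sec,
        "samples_per_sec": samples_per_sec,
        "seconds_per_epoch": dataset_size / samples_per_sec,
    }


# Example: 100 iterations of 8 samples each took 50 seconds on 1000 examples.
print(throughput(steps=100, samples_per_step=8,
                 elapsed_seconds=50.0, dataset_size=1000))
```

In a real run, wrap the training steps with `time.perf_counter()` to obtain `elapsed_seconds`, and skip the first few steps so that warmup and compilation time do not skew the result.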

Find Optimal Settings¶

# Test different batch sizes while keeping the effective batch size constant
effective_batch_size = 16
for batch_size in [1, 2, 4, 8]:
    grad_accum = effective_batch_size // batch_size
    print(f"Testing: batch={batch_size}, grad_accum={grad_accum}")
    # Run short training and measure speed

Performance Metrics¶

Track these metrics:

- VRAM usage (nvidia-smi)
- Training speed (samples/sec)
- Time per epoch
- GPU utilization (should be >90%)
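VRAM usage and GPU utilization can be sampled programmatically through nvidia-smi's CSV query mode. This is a sketch: `parse_gpu_stats` and `read_gpu_stats` are hypothetical helpers, and the parser assumes every field is numeric (some GPUs report `[N/A]` for certain fields, which would need extra handling).

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Parsing a sample line in the format nvidia-smi emits with the flags above.
print(parse_gpu_stats("97, 10240, 12288\n"))
```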

Common Bottlenecks¶

1. Low GPU Utilization¶

Symptoms: GPU utilization < 80%

Causes:

- CPU preprocessing bottleneck
- Small batch size
- Slow data loading

Solutions:

- Increase batch size
- Preprocess dataset
- Increase data loading workers

2. OOM (Out of Memory)¶

Symptoms: CUDA out of memory error

Solutions:

1. Reduce batch size
2. Increase gradient accumulation
3. Enable 4-bit quantization
4. Reduce sequence length
5. Enable gradient checkpointing
6. Use smaller LoRA rank

3. Slow Data Loading¶

Symptoms: GPU waits for data

Solutions:

- Preprocess dataset
- Increase num_workers
- Use faster storage (SSD vs HDD)
- Enable data caching

4. Gradient Overflow (FP16)¶

Symptoms: Loss becomes NaN

Solutions:

- Use BF16 instead of FP16
- Reduce learning rate
- Enable gradient clipping
- Use mixed precision with loss scaling

Advanced Techniques¶

1. Gradient Accumulation Optimization¶

Find optimal balance:

total_batch_size = 16

# Test configurations
configs = [
    (1, 16),   # batch=1, grad_accum=16
    (2, 8),    # batch=2, grad_accum=8
    (4, 4),    # batch=4, grad_accum=4
    (8, 2),    # batch=8, grad_accum=2
]

# Larger batch_size usually faster if it fits in VRAM

2. Mixed Precision Training¶

{
  "bf16": true,
  "fp16": false,
  "tf32": true  // Enable TF32 on Ampere+
}

3. Compilation (PyTorch 2.0+)¶

import torch
model = torch.compile(model, mode="max-autotune")

Benefits: 10-30% speedup
Caution: Experimental, may cause issues

4. Multi-GPU Training¶

For multiple GPUs, ModelForge automatically uses DataParallel.

CUDA_VISIBLE_DEVICES=0,1 modelforge run

Performance Checklist¶

Before training, ensure:

[ ] Using Unsloth provider (if compatible)
[ ] BF16 enabled (Ampere+ GPUs)
[ ] 4-bit quantization enabled
[ ] Batch size maximized (no OOM)
[ ] Gradient checkpointing enabled
[ ] Dataset preprocessed
[ ] Sequence length optimized for task
[ ] Flash Attention installed
[ ] GPU utilization > 90%

Troubleshooting Performance Issues¶

Training is Slower Than Expected¶

Check GPU utilization: nvidia-smi dmon
Verify Unsloth is being used (check logs)
Ensure BF16 is enabled on compatible GPUs
Check if CPU is bottleneck (htop)
Verify no background processes using GPU

Memory Usage Too High¶

Enable 4-bit quantization
Reduce batch size
Reduce sequence length
Lower LoRA rank
Enable gradient checkpointing

Training is Unstable¶

Reduce learning rate
Increase warmup steps
Enable gradient clipping
Switch to BF16 from FP16
Check dataset quality

Performance Comparison¶

Training Time (1000 examples, 3 epochs)¶

Configuration	Time	Speedup
Baseline (HF, FP16, batch=1)	90 min	1x
+ Unsloth	45 min	2x
+ BF16	35 min	2.6x
+ Larger batch (4)	25 min	3.6x
+ Optimized settings	20 min	4.5x

Hardware: RTX 4070 (12GB), Llama-3.1-8B, QLoRA
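The speedup column is the baseline time divided by each configuration's time, with the rows applied cumulatively. The short calculation below reproduces it from the times in the table; the same pattern works for your own benchmark timings.

```python
# Times in minutes, copied from the table above (rows are cumulative).
timings_min = {
    "Baseline (HF, FP16, batch=1)": 90,
    "+ Unsloth": 45,
    "+ BF16": 35,
    "+ Larger batch (4)": 25,
    "+ Optimized settings": 20,
}

baseline = timings_min["Baseline (HF, FP16, batch=1)"]
speedups = {name: round(baseline / minutes, 1) for name, minutes in timings_min.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup}x")
```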

Next Steps¶

Hardware Profiles - Optimize for your GPU
Configuration Guide - All settings
Troubleshooting - Fix common problems

Maximize your training performance! Every optimization counts.