Hardware Profiles¶

Optimize ModelForge for your GPU with hardware-aware model recommendations and configurations.

Overview¶

Hardware profiles automatically detect your system capabilities and recommend optimal models and settings. This ensures you get the best performance without manual configuration.

How Hardware Profiles Work¶

System Scan
    ↓
Detect GPU VRAM + System RAM
    ↓
Classify into Profile (low_end/mid_range/high_end)
    ↓
Recommend Models + Settings
    ↓
Apply Optimizations

Available Profiles¶

ModelForge classifies hardware into three profiles based on GPU VRAM and system RAM:

1. Low End Profile¶

Hardware Requirements: - GPU VRAM: < 7.2 GB - OR: GPU VRAM < 15.2 GB AND System RAM < 15.2 GB

Typical Hardware: - NVIDIA GTX 1650 (4GB) - NVIDIA GTX 1660 (6GB) - NVIDIA RTX 3050 (4-6GB) - NVIDIA RTX A2000 (6GB)

Recommended Settings:

{
  "compute_specs": "low_end",
  "use_4bit": true,
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 8,
  "max_seq_length": 512
}

Recommended Models:

Task	Primary Model	Size	VRAM Usage
Text Generation	qwen/Qwen2.5-3B	3B params	~4-5 GB
Summarization	google/flan-t5-large	770M params	~3-4 GB
Question Answering	deepset/roberta-base-squad2	125M params	~2-3 GB

Optimization Tips: - ✅ Use 4-bit quantization (use_4bit: true) - ✅ Small batch size (1-2) - ✅ Higher gradient accumulation (8-16) - ✅ Shorter sequences (512-1024 tokens) - ✅ QLoRA strategy for memory efficiency

2. Mid Range Profile¶

Hardware Requirements: - GPU VRAM: 7.2 - 15.2 GB - System RAM: ≥ 15.2 GB - OR: GPU VRAM ≥ 15.2 GB AND System RAM < 15.2 GB

Typical Hardware: - NVIDIA RTX 2070/2080 (8GB) - NVIDIA RTX 3060 Ti (8GB) - NVIDIA RTX 3070 (8GB) - NVIDIA RTX 4060 Ti (8-16GB) - NVIDIA RTX A4000 (12GB)

Recommended Settings:

{
  "compute_specs": "mid_range",
  "use_4bit": true,
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 4,
  "max_seq_length": 1024
}

Recommended Models:

Task	Primary Model	Size	VRAM Usage
Text Generation	mistralai/Mistral-Small-3.1-24B-Base-2503	24B params	~12-14 GB
Text Generation (alt)	meta-llama/Llama-3.1-8B-Instruct	8B params	~8-10 GB
Summarization	google/flan-t5-large	770M params	~4-5 GB
Question Answering	meta-llama/Llama-3.1-8B-Instruct	8B params	~8-10 GB

Optimization Tips: - ✅ 4-bit quantization recommended - ✅ Moderate batch size (2-4) - ✅ Standard gradient accumulation (4-8) - ✅ Medium sequences (1024-2048 tokens) - ✅ Unsloth provider for 2x speedup - ✅ Both SFT and QLoRA strategies work well

3. High End Profile¶

Hardware Requirements: - GPU VRAM: ≥ 15.2 GB - System RAM: ≥ 15.2 GB

Typical Hardware: - NVIDIA RTX 3080/3090 (10-24GB) - NVIDIA RTX 4080/4090 (12-24GB) - NVIDIA RTX A5000/A6000 (24-48GB) - NVIDIA Tesla V100 (16-32GB) - NVIDIA A100 (40-80GB)

Recommended Settings:

{
  "compute_specs": "high_end",
  "use_4bit": false,
  "bf16": true,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 2,
  "max_seq_length": 2048
}

Recommended Models:

Task	Primary Model	Size	VRAM Usage
Text Generation	meta-llama/Llama-4-Maverick-17B-128E-Instruct	17B params	~16-18 GB
Text Generation (large)	qwen/Qwen2.5-32B	32B params	~20-24 GB
Summarization	meta-llama/Llama-4-Maverick-17B-128E-Instruct	17B params	~16-18 GB
Question Answering	qwen/Qwen2.5-32B	32B params	~20-24 GB

Optimization Tips: - ✅ Optional 4-bit quantization (not required) - ✅ Use BF16 precision on Ampere+ GPUs (RTX 30xx/40xx) - ✅ Larger batch size (4-8) - ✅ Lower gradient accumulation (2-4) - ✅ Longer sequences (2048-4096 tokens) - ✅ Unsloth provider highly recommended - ✅ Can use advanced strategies (RLHF, DPO)

Profile Classification Rules¶

ModelForge uses these rules to classify your hardware:

if gpu_vram < 7.2 GB:
    profile = "low_end"

elif gpu_vram < 15.2 GB and ram < 15.2 GB:
    profile = "low_end"

elif gpu_vram < 15.2 GB and ram >= 15.2 GB:
    profile = "mid_range"

elif gpu_vram >= 15.2 GB and ram < 15.2 GB:
    profile = "mid_range"

else:  # gpu_vram >= 15.2 GB and ram >= 15.2 GB
    profile = "high_end"

Automatic Hardware Detection¶

Via UI¶

When you start training in the UI:

Select your task
Click "Detect Hardware"
ModelForge automatically:
Scans GPU and RAM
Classifies into profile
Recommends optimal models
Pre-fills configuration

Via API¶

curl -X POST http://localhost:8000/api/finetune/detect \
  -H "Content-Type: application/json" \
  -d '{"task": "text-generation"}'

Response:

{
  "hardware_specs": {
    "gpu_name": "NVIDIA RTX 3070",
    "gpu_memory_gb": 8.0,
    "ram_gb": 16.0,
    "cuda_version": "12.6"
  },
  "compute_profile": "mid_range",
  "recommended_model": "meta-llama/Llama-3.1-8B-Instruct",
  "possible_models": [
    "meta-llama/Llama-3.1-8B-Instruct",
    "qwen/Qwen2.5-7B",
    "mistralai/Mistral-Small-3.1-24B-Base-2503"
  ]
}

Manual Profile Selection¶

You can override automatic detection:

{
  "compute_specs": "mid_range",  // Force mid-range profile
  "model_name": "meta-llama/Llama-3.1-8B-Instruct",
  ...
}

Profile-Specific Optimizations¶

Memory Optimization by Profile¶

Profile	Quantization	Batch Size	Grad Accum	Max Seq Len
Low End	4-bit (required)	1	8-16	512-1024
Mid Range	4-bit (recommended)	2-4	4-8	1024-2048
High End	Optional	4-8	2-4	2048-4096

Provider Recommendations by Profile¶

Profile	Primary Provider	Secondary	Speedup
Low End	HuggingFace	-	1x
Mid Range	Unsloth	HuggingFace	2x
High End	Unsloth	HuggingFace	2x

Note: Unsloth requires Linux, WSL, or Docker. Not available on native Windows.

Strategy Recommendations by Profile¶

Profile	Recommended Strategy	Alternative
Low End	QLoRA	SFT
Mid Range	QLoRA or SFT	RLHF, DPO
High End	SFT, QLoRA, RLHF, DPO	Any

VRAM Usage Estimation¶

Formula¶

VRAM Usage ≈ Model Size × Precision Factor × Overhead Factor

Precision Factors¶

4-bit: ~0.5 GB per billion parameters
8-bit: ~1 GB per billion parameters
16-bit (FP16/BF16): ~2 GB per billion parameters
32-bit (FP32): ~4 GB per billion parameters

Example Calculations¶

7B model with 4-bit quantization:

7B × 0.5 GB/B × 1.5 (overhead) = ~5.25 GB VRAM

7B model with 16-bit precision:

7B × 2 GB/B × 1.5 (overhead) = ~21 GB VRAM

32B model with 4-bit quantization:

32B × 0.5 GB/B × 1.5 (overhead) = ~24 GB VRAM

Troubleshooting¶

Out of Memory (OOM) Errors¶

Error: CUDA out of memory

Solutions: 1. Reduce batch size: per_device_train_batch_size: 1 2. Increase gradient accumulation: gradient_accumulation_steps: 16 3. Enable 4-bit quantization: use_4bit: true 4. Reduce sequence length: max_seq_length: 512 5. Enable gradient checkpointing: gradient_checkpointing: true 6. Try a smaller model

Slow Training¶

Problem: Training is taking too long

Solutions: 1. Use Unsloth provider for 2x speedup (if on Linux/WSL) 2. Increase batch size if you have VRAM headroom 3. Reduce gradient accumulation steps 4. Use mixed precision (BF16 on RTX 30xx/40xx) 5. Consider a smaller model

Wrong Profile Detection¶

Problem: ModelForge detects wrong profile

Solutions: 1. Manually specify profile: "compute_specs": "mid_range" 2. Check GPU drivers are up to date 3. Verify CUDA is properly installed 4. Check nvidia-smi output matches expected VRAM

Best Practices¶

For Low End Hardware¶

✅ Always use 4-bit quantization
✅ Start with smallest recommended models
✅ Use QLoRA strategy
✅ Batch size = 1, gradient accumulation = 8-16
✅ Keep sequences short (512-1024)
✅ Close other GPU applications

For Mid Range Hardware¶

✅ Use 4-bit quantization for large models (7B+)
✅ Unsloth provider for best performance
✅ Batch size = 2-4
✅ Try both SFT and QLoRA strategies
✅ Medium sequences (1024-2048)

For High End Hardware¶

✅ Unsloth provider mandatory for speed
✅ Can skip quantization for small models
✅ Use BF16 on Ampere+ GPUs
✅ Larger batch sizes (4-8)
✅ Try advanced strategies (RLHF, DPO)
✅ Longer sequences (2048-4096)

Performance Comparison¶

Training Time Estimates (1000 examples, 3 epochs)¶

Model Size	Profile	Provider	Strategy	Time
3B	Low End	HuggingFace	QLoRA	~2 hours
7B	Mid Range	Unsloth	QLoRA	~45 min
7B	Mid Range	HuggingFace	QLoRA	~90 min
17B	High End	Unsloth	SFT	~60 min
32B	High End	Unsloth	QLoRA	~90 min

Times are approximate and vary based on exact hardware and configuration.

Next Steps¶

Configuration Guide - Detailed configuration options
Provider Overview - Choose HuggingFace or Unsloth
Training Strategies - Select optimal strategy
Performance Optimization - Fine-tune performance

Hardware profiles make ModelForge accessible to everyone! From 4GB to 80GB VRAM, we've got you covered.