Common Issues and Solutions¶

Troubleshooting guide for frequent ModelForge issues.

Installation Issues¶

CUDA Not Available¶

Symptom: torch.cuda.is_available() returns False

Solutions: 1. Verify NVIDIA drivers: nvidia-smi 2. Check CUDA installation: nvcc --version 3. Reinstall PyTorch with correct CUDA version:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

Python Version Issues¶

Symptom: Installation fails with "requires Python 3.11"

Solution: Install Python 3.11:

# Linux
sudo apt install python3.11

# Or use pyenv
pyenv install 3.11.0

Training Issues¶

CUDA Out of Memory¶

Symptom: Training crashes with OOM error

Solutions (in order of preference): 1. Use QLoRA strategy:

{"strategy": "qlora", "use_4bit": true}

2. Reduce batch size:

{"per_device_train_batch_size": 1}

3. Reduce sequence length:

{"max_seq_length": 1024}

4. Enable gradient checkpointing:

{"gradient_checkpointing": true}

5. Use smaller model

Training Very Slow¶

Solutions: 1. Use Unsloth provider (2x faster) 2. Use bf16 on Ampere+ GPUs:

{"bf16": true}

3. Increase batch size if VRAM allows 4. Use NVMe SSD for dataset

Model Not Found¶

Symptom: "Model X not found on HuggingFace Hub"

Solutions: 1. Check model ID is correct 2. Set HuggingFace token:

export HUGGINGFACE_TOKEN=your_token

3. For gated models, accept license on HuggingFace

Windows-Specific Issues¶

Unsloth Not Working¶

Symptom: "Unsloth is not installed" on Windows

Solution: Unsloth requires Linux. Use WSL or Docker.

See Windows Installation for details.

WSL GPU Not Detected¶

Solutions: 1. Update to latest Windows version 2. Update NVIDIA drivers (525.60+) 3. Ensure WSL 2: wsl --status 4. Restart WSL: wsl --shutdown

Dataset Issues¶

Dataset Validation Failed¶

Symptom: "Missing required field 'output'"

Solution: Ensure all examples have required fields:

{"input": "text", "output": "text"}

Invalid JSON¶

Symptom: "Invalid JSON on line X"

Solution: Validate JSON:

python -m json.tool dataset.jsonl

Provider Issues¶

Provider Not Found¶

Symptom: "Unknown provider 'unsloth'"

Solution: Install provider:

pip install unsloth

max_seq_length Error with Unsloth¶

Symptom: "max_seq_length cannot be -1"

Solution: Set fixed value:

{"max_seq_length": 2048}

API Issues¶

Port Already in Use¶

Symptom: "Address already in use: 8000"

Solutions: 1. Find process: lsof -i :8000 2. Kill process or use different port:

modelforge run --port 8080

Connection Refused¶

Solutions: 1. Check ModelForge is running: ps aux | grep modelforge 2. Check firewall settings 3. Try localhost: http://localhost:8000

Performance Issues¶

High Memory Usage¶

Solutions: 1. Use gradient checkpointing 2. Use 4-bit quantization 3. Reduce batch size 4. Close other applications

Slow Inference¶

Solutions: 1. Use smaller model 2. Reduce max_seq_length 3. Use quantization 4. Batch requests

More Help¶

Still having issues? Create an issue on GitHub.