Common Issues and Solutions¶
Troubleshooting guide for frequent ModelForge issues.
Installation Issues¶
CUDA Not Available¶
Symptom: torch.cuda.is_available() returns False
Solutions:
1. Verify NVIDIA drivers: nvidia-smi
2. Check CUDA installation: nvcc --version
3. Reinstall PyTorch with correct CUDA version:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
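To confirm whether the reinstall helped, the checks above can be gathered into one report from inside Python. This is a minimal sketch: the helper name `cuda_report` is hypothetical and not part of ModelForge, and it only uses standard PyTorch attributes.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA setup."""
    report = {
        "torch_installed": False,
        "torch_cuda_build": None,
        "cuda_available": False,
        "device_count": 0,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not importable in this environment

    import torch

    report["torch_installed"] = True
    # torch.version.cuda is None for CPU-only wheels
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `torch_cuda_build` is `None`, the installed wheel is likely a CPU-only build, which points to step 3 rather than a driver problem.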
Python Version Issues¶
Symptom: Installation fails with "requires Python 3.11"
Solution: Install Python 3.11:
# Linux
sudo apt install python3.11
# Or use pyenv
pyenv install 3.11.0
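After installing, it can help to verify which interpreter is actually active before retrying. The sketch below assumes the requirement is an exact match on major.minor (3.11), which is how the error message reads; `python_version_ok` is a hypothetical helper name.

```python
import sys

# Assumption: the installer wants exactly Python 3.11 (major.minor match).
REQUIRED = (3, 11)


def python_version_ok(version_info=None, required=REQUIRED):
    """Return True when the interpreter's major.minor equals `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_version_ok() else "mismatch"
    print(f"Python {current}: {status}")
```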
Training Issues¶
CUDA Out of Memory¶
Symptom: Training crashes with OOM error
Solutions (in order of preference):
1. Use QLoRA strategy:
{"strategy": "qlora", "use_4bit": true}
2. Reduce the batch size:
{"per_device_train_batch_size": 1}
3. Reduce the maximum sequence length:
{"max_seq_length": 1024}
4. Enable gradient checkpointing:
{"gradient_checkpointing": true}
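Because the settings above are listed in order of preference, one way to work through them is to apply them cumulatively and retry after each step. The following is only a sketch: the keys come from the snippets above, while `apply_memory_savers` and the starting config are hypothetical, and how ModelForge actually loads its config is not shown here.

```python
import json

# Overrides taken from the settings above, in the same order of preference.
MEMORY_SAVERS = [
    {"strategy": "qlora", "use_4bit": True},
    {"per_device_train_batch_size": 1},
    {"max_seq_length": 1024},
    {"gradient_checkpointing": True},
]


def apply_memory_savers(config, steps):
    """Return a copy of `config` with the first `steps` overrides applied."""
    updated = dict(config)
    for override in MEMORY_SAVERS[:steps]:
        updated.update(override)
    return updated


if __name__ == "__main__":
    # Hypothetical starting config, for illustration only.
    base = {"per_device_train_batch_size": 8, "max_seq_length": 4096}
    print(json.dumps(apply_memory_savers(base, 2), indent=2))
```

Increasing `steps` from 1 to 4 applies progressively more aggressive savings while leaving the original config untouched.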
Training Very Slow¶
Solutions:
1. Use the Unsloth provider (2x faster)
2. Use bf16 on Ampere+ GPUs:
{"bf16": true}
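"Ampere+" corresponds to CUDA compute capability 8.0 or higher. A small check like the one below can decide whether to enable the flag; `supports_bf16` is a hypothetical helper, not a ModelForge function.

```python
def supports_bf16(compute_capability):
    """Return True for Ampere or newer GPUs (compute capability 8.0+)."""
    major, _minor = compute_capability
    return major >= 8


if __name__ == "__main__":
    # Example capabilities: (7, 5) is Turing, (8, 0) and (8, 6) are Ampere.
    for capability in [(7, 5), (8, 0), (8, 6)]:
        print(capability, supports_bf16(capability))
```

With PyTorch installed and a GPU visible, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed in directly.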
Model Not Found¶
Symptom: "Model X not found on HuggingFace Hub"
Solutions:
1. Check that the model ID is correct
2. Set HuggingFace token:
export HUGGINGFACE_TOKEN=your_token
Windows-Specific Issues¶
Unsloth Not Working¶
Symptom: "Unsloth is not installed" on Windows
Solution: Unsloth requires Linux. Use WSL or Docker.
See Windows Installation for details.
WSL GPU Not Detected¶
Solutions:
1. Update to latest Windows version
2. Update NVIDIA drivers (525.60+)
3. Ensure WSL 2: wsl --status
4. Restart WSL: wsl --shutdown
Dataset Issues¶
Dataset Validation Failed¶
Symptom: "Missing required field 'output'"
Solution: Ensure all examples have required fields:
{"input": "text", "output": "text"}
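To locate the offending examples before training, the required fields can be checked per record. This sketch assumes `input` and `output` are the only required fields, as in the example above; `find_missing_fields` is a hypothetical helper and not ModelForge's own validator.

```python
# Assumption: these are the only required fields, per the example record above.
REQUIRED_FIELDS = ("input", "output")


def find_missing_fields(records, required=REQUIRED_FIELDS):
    """Return (index, missing_field_names) for every incomplete record."""
    problems = []
    for index, record in enumerate(records):
        missing = [name for name in required if name not in record]
        if missing:
            problems.append((index, missing))
    return problems


if __name__ == "__main__":
    sample = [
        {"input": "text", "output": "text"},
        {"input": "text"},  # missing "output"
    ]
    for index, missing in find_missing_fields(sample):
        print(f"record {index}: missing {', '.join(missing)}")
```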
Invalid JSON¶
Symptom: "Invalid JSON on line X"
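Solution: Validate the file line by line. A .jsonl file holds one JSON document per line, so json.tool needs the --json-lines flag (Python 3.8+):
python -m json.tool --json-lines dataset.jsonl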
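To see every bad line at once rather than stopping at the first failure, each line can be parsed separately. This is a sketch with a hypothetical helper name, `find_invalid_json_lines`; it assumes blank lines are acceptable and skips them.

```python
import json


def find_invalid_json_lines(text):
    """Return (line_number, error_message) for each non-blank line that fails to parse."""
    errors = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_number, exc.msg))
    return errors


if __name__ == "__main__":
    with open("dataset.jsonl", encoding="utf-8") as handle:
        for line_number, message in find_invalid_json_lines(handle.read()):
            print(f"line {line_number}: {message}")
```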
Provider Issues¶
Provider Not Found¶
Symptom: "Unknown provider 'unsloth'"
Solution: Install provider:
pip install unsloth
max_seq_length Error with Unsloth¶
Symptom: "max_seq_length cannot be -1"
Solution: Set fixed value:
{"max_seq_length": 2048}
API Issues¶
Port Already in Use¶
Symptom: "Address already in use: 8000"
Solutions:
1. Find process: lsof -i :8000
2. Kill process or use different port:
modelforge run --port 8080
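Where `lsof` is unavailable (for example on Windows), a port can be probed by trying to bind to it. The helper below is a generic sketch using only the standard library; `port_is_free` is a hypothetical name, and it checks only the loopback interface by default.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    for candidate in (8000, 8080):
        state = "free" if port_is_free(candidate) else "in use"
        print(f"port {candidate}: {state}")
```

The result is a snapshot: another process could still claim the port between the check and the server start.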
Connection Refused¶
Solutions:
1. Check ModelForge is running: ps aux | grep modelforge
2. Check firewall settings
3. Try localhost: http://localhost:8000
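A quick TCP probe can separate "nothing is listening" from problems higher up the stack. This sketch assumes the default host and port shown above (localhost, 8000); `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if can_connect():
        print("Something is listening on localhost:8000")
    else:
        print("No listener reachable on localhost:8000")
```

If the probe succeeds locally but fails from another machine, the firewall in step 2 is the more likely cause.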
Performance Issues¶
High Memory Usage¶
Solutions:
1. Use gradient checkpointing
2. Use 4-bit quantization
3. Reduce batch size
4. Close other applications
Slow Inference¶
Solutions:
1. Use a smaller model
2. Reduce max_seq_length
3. Use quantization
4. Batch requests
More Help¶
Still having issues? Create an issue on GitHub.