Dataset Formats¶

Learn how to prepare your data for training with ModelForge.

Overview¶

ModelForge uses JSONL (JSON Lines) format for training datasets. Each line is a valid JSON object representing one training example.

General Format¶

{"field1": "value1", "field2": "value2"}
{"field1": "value1", "field2": "value2"}

One JSON object per line
No commas between lines
UTF-8 encoding
File extension: .jsonl

Task-Specific Formats¶

Text Generation¶

Use Case: Chatbots, instruction following, code generation, creative writing

Format:

{"input": "What is machine learning?", "output": "Machine learning is a subset of artificial intelligence..."}
{"input": "Explain neural networks", "output": "Neural networks are computing systems inspired by biological neural networks..."}
{"input": "Write a Python function to sort a list", "output": "Here's a Python function:\n\ndef sort_list(lst):\n    return sorted(lst)"}

Required Fields: - input (string): The prompt or instruction - output (string): The expected response

Example Dataset (customer support):

{"input": "How do I reset my password?", "output": "To reset your password:\n1. Click 'Forgot Password' on the login page\n2. Enter your email address\n3. Check your email for reset link\n4. Follow the link and create new password"}
{"input": "Where is my order?", "output": "To track your order:\n1. Log into your account\n2. Go to 'My Orders'\n3. Click on the order number\n4. View tracking information"}

Summarization¶

Use Case: Document summarization, article condensing, meeting notes

Format:

{"input": "Long article or document text here...", "output": "Concise summary here."}

Required Fields: - input (string): The long text to summarize - output (string): The summary

Example Dataset (news summarization):

{"input": "The Federal Reserve announced today that it will maintain interest rates at their current level of 5.25-5.50%, citing ongoing concerns about inflation despite recent economic slowdowns. Federal Reserve Chair Jerome Powell stated in a press conference that the central bank remains data-dependent and will adjust policy as needed. Markets reacted positively to the news, with the S&P 500 gaining 1.2% in afternoon trading.", "output": "The Federal Reserve kept interest rates unchanged at 5.25-5.50% due to inflation concerns. Chair Powell emphasized data-dependent approach. Markets rose 1.2%."}
{"input": "Scientists at MIT have developed a new battery technology that could potentially triple the range of electric vehicles. The breakthrough involves using solid-state electrolytes instead of traditional liquid electrolytes, which allows for higher energy density and improved safety. The research team, led by Professor Jane Smith, published their findings in Nature Energy this week. Commercial applications are expected within 5-10 years.", "output": "MIT researchers developed solid-state battery technology that could triple EV range. The innovation improves energy density and safety. Commercial use expected in 5-10 years."}

Extractive Question Answering¶

Use Case: RAG systems, document search, FAQ systems

Format:

{"context": "Background information and document text", "question": "Question about the context", "answer": "Exact answer from context"}

Required Fields: - context (string): The paragraph or document containing the answer - question (string): The question being asked - answer (string): The answer extracted from context

Example Dataset (FAQ):

{"context": "ModelForge is a no-code toolkit for fine-tuning Large Language Models on your local GPU. It supports text generation, summarization, and question answering tasks. The tool uses LoRA (Low-Rank Adaptation) for efficient fine-tuning and supports both HuggingFace and Unsloth providers.", "question": "What tasks does ModelForge support?", "answer": "text generation, summarization, and question answering"}
{"context": "To install ModelForge, you need Python 3.11, an NVIDIA GPU with at least 4GB VRAM, CUDA installed, and a HuggingFace account. The installation process involves running 'pip install modelforge-finetuning' and then setting up your HuggingFace token.", "question": "What are the prerequisites for ModelForge?", "answer": "Python 3.11, NVIDIA GPU with 4GB+ VRAM, CUDA, HuggingFace account"}

Dataset Size Recommendations¶

Model Size	Min Examples	Recommended	Optimal
< 1B params	100	500	1,000+
1B-3B params	200	1,000	5,000+
3B-7B params	500	2,000	10,000+
7B+ params	1,000	5,000	20,000+

Quality > Quantity: 100 high-quality examples are better than 10,000 low-quality ones.

Data Quality Guidelines¶

DO:¶

✅ Use clean, grammatically correct text
✅ Ensure input-output pairs are logically related
✅ Include diverse examples covering different scenarios
✅ Use consistent formatting across examples
✅ Remove personal information (PII)
✅ Verify all data is relevant to your use case

DON'T:¶

❌ Include malformed JSON
❌ Use inconsistent field names
❌ Include duplicate or near-duplicate examples
❌ Mix different tasks in one dataset
❌ Use copyrighted content without permission
❌ Include biased or harmful content

Creating Your Dataset¶

Method 1: Manual Creation¶

Create a file dataset.jsonl:

{"input": "Example 1 input", "output": "Example 1 output"}
{"input": "Example 2 input", "output": "Example 2 output"}

Method 2: Python Script¶

import json

data = [
    {"input": "What is AI?", "output": "AI stands for Artificial Intelligence..."},
    {"input": "Explain ML", "output": "Machine Learning is..."},
]

with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for item in data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

Method 3: Convert from CSV¶

import pandas as pd
import json

# Read CSV
df = pd.read_csv('data.csv')

# Convert to JSONL
with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for _, row in df.iterrows():
        item = {"input": row['question'], "output": row['answer']}
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

Method 4: From HuggingFace Dataset¶

from datasets import load_dataset
import json

# Load dataset
dataset = load_dataset("squad", split="train[:1000]")

# Convert to JSONL
with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for item in dataset:
        formatted = {
            "context": item['context'],
            "question": item['question'],
            "answer": item['answers']['text'][0]
        }
        f.write(json.dumps(formatted, ensure_ascii=False) + '\n')

Validation¶

ModelForge automatically validates datasets before training:

✅ JSON syntax validation
✅ Required fields check
✅ Minimum size check (at least 10 examples)
✅ Field type validation

Common Validation Errors¶

Error: Missing required field 'output' - Fix: Ensure all examples have required fields

Error: Invalid JSON on line 42 - Fix: Check line 42 for syntax errors (missing quotes, commas, etc.)

Error: Dataset too small (5 examples, minimum 10) - Fix: Add more examples to your dataset

Error: Field 'input' must be a string - Fix: Ensure all field values are strings, not numbers or objects

Advanced Features¶

Multi-line Text¶

Use \n for line breaks:

{"input": "Write a haiku", "output": "Code flows like water\nBugs hide in shadows unseen\nDebug, test, deploy"}

Special Characters¶

Escape special characters:

{"input": "What is JSON?", "output": "JSON uses \"quotes\" for strings and {\"key\": \"value\"} for objects"}

Unicode Support¶

Full UTF-8 support:

{"input": "Translate: Hello", "output": "你好 (Chinese), Hola (Spanish), Bonjour (French)"}

Long Context¶

No hard limit on length, but consider model's max sequence length:

{"input": "Summarize this article", "output": "Summary here"}

Sample Datasets¶

ModelForge includes sample datasets for testing:

# Located in: ModelForge/test_datasets/
low_text_generation.jsonl          # Text generation examples
low_summarization_train_set.jsonl  # Summarization examples
low_qa_train_set.jsonl             # QA examples

Download from repository:

curl -O https://raw.githubusercontent.com/RETR0-OS/ModelForge/main/ModelForge/test_datasets/low_text_generation.jsonl

Best Practices¶

1. Data Splitting¶

ModelForge automatically splits data into train/validation sets:

{
  "eval_split": 0.2  // 20% for validation, 80% for training
}

2. Data Balancing¶

Ensure balanced representation: - Equal distribution of topics - Diverse input lengths - Varied complexity levels

3. Data Cleaning¶

Before creating JSONL: 1. Remove duplicates 2. Fix typos and grammar 3. Normalize formatting 4. Remove irrelevant examples

4. Iterative Improvement¶

Start with small dataset (100-500 examples)
Train and evaluate
Identify weak areas
Add targeted examples
Repeat

Troubleshooting¶

Dataset Won't Upload¶

Problem: Upload fails

Checks: 1. File is valid JSONL (one JSON object per line) 2. File size < 500MB 3. Proper UTF-8 encoding 4. No special characters in filename

Training Fails with Dataset Error¶

Problem: Training starts but fails immediately

Checks: 1. All required fields present 2. Fields are correct type (strings) 3. No empty values 4. No extremely long examples (> max_seq_length)

Poor Training Results¶

Problem: Model doesn't learn effectively

Solutions: 1. Add more examples (aim for 1,000+) 2. Improve data quality 3. Ensure examples are representative 4. Check for data leakage or duplicates

Next Steps¶

Configuration Guide - Learn about training parameters
Training Tasks - Understand different task types
Quick Start - Train your first model
Troubleshooting - Common issues

Good data is the foundation of good models! 📊