
Cleaning Chat Data for LLM Fine-Tuning with GPT-3.5 as a Curator (Part 2)
I used GPT-3.5 as a data curator to downsize ~20,000 raw Instagram DM pairs into ~2,000 high-quality English conversation pairs formatted for OpenAI fine-tuning. This is Dev Log 2 of the OmarAI series. Part 1 set up the pairing pipeline; this post is about why it broke when I tried to scale it, how I found the reversed-pair bug, and the GPT-3.5-as-curator loop that made the final dataset shippable.
TL;DR
- Too many messages = fine-tuning fails. My 20k-pair dataset hit a token/resource limit mid-training.
- Found a pairing bug where the export's newest-first order made every pair swap prompt and response. Reverse once, problem solved.
- Multilingual noise (English + Azerbaijani + invented slang) was dragging down the target model. I filtered with GPT-3.5.
- GPT-3.5 curator in batches of 50 → top 4–5 kept per batch → 20k pairs → ~2k clean multi-turn conversations.
- Final dataset shipped to gpt-4o-2024-08-06 fine-tuning (see Part 3).
What went wrong between Part 1 and here
Remember the gpt-4o-2024-08-06 fine-tune I kicked off at the end of Part 1? It failed. The run hit a resource limit I had set, because the dataset — ~20,000 pairs — was too large for the token budget to chew through in one pass.
That was frustrating for about twenty minutes and then useful, because it forced me to look at my data instead of my model. The data was worse than I thought.
Bug #1 — reversed pairs
The first thing I spotted while reviewing the dataset was that a lot of "prompts" were plainly responses to their own "replies." The pair felt backwards because it was backwards.
Root cause: Instagram's export lists messages newest-first, and my pairing algorithm assumed oldest-first. So the more recent message landed in the "prompt" slot, and the message sent before it became the "response."
# BEFORE: matches the array order, which is newest-first
for i in range(0, len(msgs) - 1):
    pair = (msgs[i], msgs[i + 1])  # reversed in time

# AFTER: reverse once, then pair
msgs = list(reversed(msgs))  # now oldest-first
for i in range(0, len(msgs) - 1):
    pair = (msgs[i], msgs[i + 1])  # chronological
One-line fix. Obvious in hindsight. Not obvious when you're writing the preprocessor at 2 a.m.
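A blind reversal still trusts the export to be newest-first every time. A slightly more defensive sketch sorts by timestamp and only pairs a message with a reply from the account owner. The field names `sender_name`, `timestamp_ms`, and `content` are assumed to match the Instagram JSON export. `ME`, `build_pairs`, and the sender filter are hypothetical additions, not the pairing logic from Part 1.

```python
ME = "Omar"  # hypothetical: the account owner's display name in the export


def build_pairs(msgs, me=ME):
    """Return (prompt, response) text pairs in chronological order."""
    # Sort explicitly rather than trusting the order of the export.
    ordered = sorted(msgs, key=lambda m: m["timestamp_ms"])
    pairs = []
    for prev, nxt in zip(ordered, ordered[1:]):
        # Assumption: keep only exchanges where someone else speaks and `me` replies.
        if prev["sender_name"] != me and nxt["sender_name"] == me:
            pairs.append((prev["content"], nxt["content"]))
    return pairs


# Newest-first sample, the way the export would list it.
sample = [
    {"sender_name": "Omar", "timestamp_ms": 2000, "content": "on my way"},
    {"sender_name": "Friend", "timestamp_ms": 1000, "content": "where are you?"},
]
print(build_pairs(sample))  # [('where are you?', 'on my way')]
```

Because the sort does not depend on input order, the same pairs come out whether the list arrives newest-first or oldest-first.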
Bug #2 — multilingual noise
Most of my conversations mix English, Azerbaijani, and an invented slang my friends and I developed. For an English-facing chatbot, the non-English conversations drag the model toward mediocre output across every language it touches. I needed to filter them.
The obvious fix — heuristic language detection — fails on messages that contain lol, fr, English emoji, and Azerbaijani in the same sentence. So I reached for a better classifier: another LLM.
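To make that failure concrete, here is a toy heuristic of the kind that looks reasonable and still misfires. It is a hypothetical illustration, not the detector used in this project: `looks_english` and its ASCII-ratio rule are invented for the example, and the sample message is made up. Azerbaijani typed without diacritics is plain ASCII, so a character-level check waves it through.

```python
def looks_english(text, threshold=0.9):
    """Naive check: call a message English if most characters are ASCII."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= threshold


mixed = "lol fr salam necesen"  # made-up message: English slang + Azerbaijani
print(looks_english(mixed))     # True, although half of it is not English
```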
The GPT-3.5 curator loop
The idea: use gpt-3.5-turbo as a data curator. Feed it batches of 50 pairs at a time and ask it to pick the best 4–5 English multi-turn conversations from each batch, reformatted as OpenAI fine-tuning JSONL.
Look at the following multiple lines of user-assistant interactions and combine
them into the top 4 most coherent and meaningful multi-turn conversations.
Each conversation should have multiple back-and-forth turns between the user and
the assistant, combining related exchanges together into one conversation.
The final output should be in a format that can be used to fine-tune an AI model.
The output must be JSONL-compatible, where each conversation is a single line,
combining relevant interactions into coherent, multi-turn dialogues.
Output format:
{"messages": [{"role": "user", "content": "User's message"},
{"role": "assistant", "content": "Assistant's response"}, ...]}
Rules:
1. Include multiple exchanges (combine related back-and-forth turns).
2. Exclude irrelevant or non-contributing messages.
3. Exclude conversations that contain non-standard ASCII characters
like â, , etc.
4. Exclude conversations that contain the word "attachment".
5. Only include conversations in clear, understandable English.
6. Pick the 4 best conversations based on coherence, relevance, and length.
7. Do not include any formatting like ```json or ``` around the output.
8. Each conversation must be a single line, flowing naturally with context.
Input:
{chunk}
The batch size of 50 was a sweet spot — big enough to give the model real choice, small enough that it stayed focused and didn't drop candidates.
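The loop around that prompt might look roughly like the sketch below. It is an assumption-heavy outline, not the original script: `chunked`, `render_chunk`, `curate`, and `openai_complete` are hypothetical names, and the `user:` / `assistant:` line format used to fill `{chunk}` is a guess at how the pairs were rendered. The model call is passed in as a plain function so the loop can run without network access.

```python
BATCH_SIZE = 50  # batch size from the post


def chunked(items, size=BATCH_SIZE):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def render_chunk(pairs):
    """Format (prompt, response) pairs as plain lines for the {chunk} slot."""
    return "\n".join(f"user: {p}\nassistant: {r}" for p, r in pairs)


def curate(pairs, prompt_template, complete):
    """Run the curator over every batch and collect its non-empty output lines.

    `complete` is any callable that takes a prompt string and returns text.
    """
    kept = []
    for batch in chunked(pairs):
        # str.replace instead of str.format: the prompt contains literal JSON braces.
        prompt = prompt_template.replace("{chunk}", render_chunk(batch))
        reply = complete(prompt)
        kept.extend(line for line in reply.splitlines() if line.strip())
    return kept


def openai_complete(prompt, model="gpt-3.5-turbo"):
    """One possible `complete`: a single chat completion request."""
    from openai import OpenAI  # needs the `openai` package and an API key

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Calling `curate(pairs, PROMPT, openai_complete)`, with `PROMPT` holding the text above, would return raw candidate lines that still need validation before upload.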
Results
| Stage | # conversation pairs | Language | Notes |
|---|---|---|---|
| Raw Instagram pairs | ~20,000 | Mixed (3 languages + slang) | Pairing bug fixed |
| After GPT-3.5 curator | ~2,000 | English only, multi-turn | 10× smaller, noticeably better |
A 10× reduction in volume came with a net quality increase. The curator's picks were coherent, English-only, multi-turn, and — critically — long enough to teach the target model to respond in multi-sentence paragraphs instead of fragments.
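The ~2,000 figure is consistent with the batch math. A quick check, treating the approximate counts from the post as exact:

```python
import math

total_pairs = 20_000
batch_size = 50
kept_low, kept_high = 4, 5  # conversations kept per batch

batches = math.ceil(total_pairs / batch_size)
print(batches)                                  # 400
print(batches * kept_low, batches * kept_high)  # 1600 2000
```

So 400 batches at 4 to 5 picks each gives between 1,600 and 2,000 conversations before any are dropped in validation.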
Format validation pass
GPT-3.5 sometimes wandered: escaped quotes inconsistently, slipped a stray ```json into one line, or produced a trailing comma. I ran a deterministic cleanup pass to fix every JSON parse error before upload:
- Strip markdown fences if present.
- Reject any line that json.loads can't parse.
- Assert every kept line has {messages: [{role, content}, ...]}.
- Assert alternating user/assistant roles.
Anything that failed the gate got dropped. The final dataset that shipped to gpt-4o-2024-08-06 was 100% valid, 100% English, 100% multi-turn.
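A gate along those lines can be sketched with the standard library alone. The function names are hypothetical, and two of the checks are assumptions beyond the list above: that every conversation starts with the user, and that it contains at least two messages.

```python
import json

FENCE = "`" * 3  # the three-backtick markdown fence marker


def clean_line(line):
    """Strip a stray markdown fence marker and surrounding whitespace."""
    line = line.strip()
    if line.startswith(FENCE):
        line = line[len(FENCE):]
        if line.startswith("json"):
            line = line[len("json"):]
    if line.endswith(FENCE):
        line = line[:-len(FENCE)]
    return line.strip()


def is_valid_conversation(obj):
    """Check the {messages: [{role, content}, ...]} shape and role order."""
    if not isinstance(obj, dict):
        return False
    messages = obj.get("messages")
    # Assumption: at least one user/assistant exchange per conversation.
    if not isinstance(messages, list) or len(messages) < 2:
        return False
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or set(msg) != {"role", "content"}:
            return False
        if not isinstance(msg["content"], str) or not msg["content"].strip():
            return False
        # Assumption: the user speaks first and roles strictly alternate.
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            return False
    return True


def validate(lines):
    """Return re-serialized JSONL lines that pass every check; drop the rest."""
    kept = []
    for raw in lines:
        line = clean_line(raw)
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable: trailing comma, bad quoting, and so on
        if is_valid_conversation(obj):
            kept.append(json.dumps(obj, ensure_ascii=False))
    return kept
```

Re-serializing with `json.dumps` instead of keeping the model's text means every surviving line uses the same quoting and escaping.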
I went to sleep waiting for that fine-tune. Part 3 is about what the model actually sounded like when it came back.
Key takeaways
- Look at your data before blaming your model. Most "the model is bad" is actually "the data is bad."
- Use a cheaper LLM as a data curator when heuristics can't capture what you want (multilingual filtering, multi-turn coherence).
- Deterministic format validation after an LLM step is non-negotiable. LLMs wander; JSON parsers don't.
Frequently Asked Questions
Why use GPT-3.5 to clean data for fine-tuning a GPT-4 model?
GPT-3.5 is cheap enough to run over tens of thousands of conversations in a batch loop without blowing the budget. Using a capable but inexpensive model to curate training data for a more capable target model is a standard pattern — the curator only needs to rank and reformat, not generate new content.
What's the trick to downsize a 20k-pair dataset without losing quality?
Chunk the pairs into batches of ~50 and ask the curator model to pick the top 4–5 most coherent, most English, most informative conversations from each batch. This concentrates signal and cuts noise from bad translations, single-word messages, and mixed-language threads. A 10× reduction in pairs can come with a net quality increase.
Why is your Instagram data in three languages?
I'm Azerbaijani-American, so my DMs include English, Azerbaijani, and a mixed slang we use with friends. For an English-facing chatbot, the non-English conversations drag the model toward mediocre output in every language. Filtering them out was required — the GPT-3.5 curator handled language detection as part of the prompt.
What's the JSONL format OpenAI fine-tuning expects?
One JSON object per line: no blank lines, no commas between objects. Each object has a "messages" field containing a list of role/content pairs following the chat-completions schema: {"role": "user", "content": ...}, {"role": "assistant", "content": ...}. Any deviation (trailing commas, markdown fences around the JSON, multi-line objects) will silently break the upload.
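As a minimal sketch of producing that shape, the snippet below serializes a made-up conversation with the standard `json` module. `to_jsonl` and the sample content are hypothetical; the point is that `json.dumps` without an `indent` argument keeps each object on a single line and escapes any newlines inside message text.

```python
import json

conversations = [
    {"messages": [
        {"role": "user", "content": "where are you?"},
        {"role": "assistant", "content": "on my way,\n5 min"},
    ]},
]


def to_jsonl(items):
    """Serialize each conversation as exactly one line of JSON."""
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items) + "\n"


jsonl_text = to_jsonl(conversations)
print(jsonl_text, end="")
```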
How did you find the reversed-pair bug?
When I reviewed the data to understand why the model's quality plateaued, I noticed many pairs where the 'prompt' was clearly responding to the 'reply.' Root cause: my pairing algorithm walked the Instagram export newest-first, so first-seen became 'first sent' when it was actually last. Reversing the array before pairing fixed it in one line.