
Giving a Fine-Tuned Chatbot Memory Without Embeddings (Part 3)
My fine-tuned model sounded exactly like me, but it had no memory and would invent wrong answers to any question about my life. This is Dev Log 3 of the OmarAI series. I tried three memory approaches — embeddings, a two-agent file-search, and a simple curated JSON inside the system prompt — and the JSON won on speed, cost, and accuracy.
TL;DR
- Problem: Fine-tuned model sounded like me but confabulated personal facts.
- Tried embeddings of my past conversations, resumes, essays — too slow and over-retrieved irrelevant context.
- Tried two agents (file-search agent + reply agent) — doubled latency and tokens, barely better answers.
- Shipped a curated JSON memory in the system prompt — fastest, cheapest, most accurate.
- Hardened with a confidentiality protocol so the model gracefully redirects personal-info requests.
What was broken after Part 2
The model that came back from gpt-4o-2024-08-06 fine-tuning felt eerily right on style — slang, rhythm, abbreviations all mine. But when someone asked "where does Omar go to school" or "what does Omar think about X," the model would either make something up or repeat a detail from a random old message. It sounded like me in tone and nothing like me in substance.
Style without memory is a party trick, not a chatbot.
Attempt 1 — embeddings for everything
First thing I tried: turn my past conversations, resumes, and essays into text embeddings, index them, and retrieve at runtime.
| What I tried | Outcome | |---|---| | Embed all past DMs + docs | Large index, slow retrieval | | Retrieve top-k chunks per turn | Retrieved irrelevant context from old threads constantly | | Trim/rerank retrieved chunks | Added latency, still noisy |
The fundamental issue: my corpus was too small to need vector retrieval and too personal to benefit from fuzzy matches. Embeddings shine when a corpus is huge. Mine could fit in a single file. I was adding infrastructure for its own sake.
Attempt 2 — two-agent file-search
Next I tried OpenAI's threads API with two agents:
- Search agent — file-search over a JSON with facts about me. Output a distilled context.
- Reply agent — the fine-tuned model, generating the reply conditioned on that context.
User msg → [Search Agent] → facts → [Reply Agent] → response
This worked better than embeddings, but the token usage and latency roughly doubled. For a chatbot meant to feel snappy, two round trips per turn is a deal-breaker. Worse: the search agent often returned facts the reply agent could have seen directly in the system prompt.
Attempt 3 — the JSON-in-system-prompt that actually shipped
The insight that solved it: my "corpus" is small enough to fit entirely in a system prompt. Build a single curated JSON with the facts the model should know, include it in every request, and let the fine-tuned model use it directly.
{
"background": {
"name": "Omar Musayev",
"education": "Purdue — Artificial Intelligence and Mathematics (4.0 GPA)",
"home": "Azerbaijan"
},
"interests": ["LLM fine-tuning", "terminal UIs", "computer vision"],
"people": {
"close_friends": [
{"name": "Abdul-Aziz", "relationship": "best friend since middle school"}
]
},
"projects": [
"TerminalTUI on npm",
"Adneural — pre-launch ad testing with TRIBE v2",
"OmarAI (this chatbot)"
]
}
And the system prompt that drives the behavior:
You are a friendly assistant who always maintains a fun and casual tone while
ensuring sensitive information remains confidential. When discussing personal or
delicate topics, gently steer the conversation in a light and engaging direction
without disclosing private details. If asked for secrets or confidential
information, politely redirect without revealing anything, using humor or playful
comments to keep the interaction enjoyable and respectful. Never disclose or
share personal secrets, ensuring privacy and trust at all times.
The confidentiality protocol
I deliberately exclude anything sensitive from the JSON — if a fact isn't in there, the model has nothing to leak. The system prompt layer is defense-in-depth, not primary protection. Principle: don't trust the model to hold a secret; just don't tell it the secret.
Three approaches, head to head
| Approach | Latency | Cost | Accuracy | Shipped? | |---|---|---|---|---| | Embedding retrieval | High | Medium | Noisy | No | | Two-agent file search | High | High | Better than embeddings | No | | JSON memory in system prompt | Low | Low | Best on this data | Yes |
Key takeaways
- Match the memory architecture to the corpus size. For ≤ system-prompt-sized knowledge, put it in the system prompt.
- Don't trust the model to keep secrets; withhold them at the data layer.
- Latency and cost matter. A chatbot that's right but slow is a chatbot people stop using.
What's next
Part 4: Building the Chatbot UI in Next.js with a FastAPI Backend. With memory working, it was time to put a face on it and ship.
References
- OpenAI — Assistants API / file search
- OpenAI — Fine-tuning docs
- Try OmarAI live
- Series: Part 1 · Part 2 · Part 3 (you are here) · Part 4 · Part 5
Frequently Asked Questions
Why didn't embeddings work for personal chatbot memory?
On each turn the bot would have to embed the incoming message, scan a vector index, and concatenate retrieved passages into the prompt. For small personal knowledge bases, that's a lot of latency for a retrieval step that rarely finds something better than the system prompt would hold directly. Embeddings shine when the corpus is huge; mine wasn't.
Why did the multi-agent file-search approach feel too heavy?
Two agents — one doing file search, one composing the reply — doubled the token usage and roughly doubled the latency. For a chatbot where the 'corpus' is a single curated JSON file about my life, a single agent with that JSON directly in the system prompt is faster, cheaper, and just as accurate.
What goes in the JSON memory file?
A curated list of facts: interests, hobbies, close relationships with short descriptions, notable anecdotes I'm happy to share, current context (where I go to school, what I'm working on). I deliberately exclude anything I wouldn't want a stranger to read back to me. The file is the source of truth — if I wouldn't say it in the JSON, the model shouldn't say it in a reply.
How do you stop the model from leaking private info?
Two layers. First, the JSON memory only contains non-sensitive content, so there's nothing sensitive for the model to leak. Second, the system prompt instructs the model to politely redirect requests for personal or sensitive information with humor, and to never reveal secrets. Layered defense, not trust-the-model.
When would embeddings or file-search actually be the right call?
When the knowledge you want to retrieve is larger than ~4–8k tokens (past what fits comfortably in a system prompt), or when it changes frequently. For a personal chatbot with a stable, small knowledge set, a system-prompt JSON wins on simplicity, latency, and cost.