Skip to content
OmarAI Post-Launch: Llama vs MPT vs Falcon vs GPT for a Personal Chatbot (Part 5)
AI Project

OmarAI Post-Launch: Llama vs MPT vs Falcon vs GPT for a Personal Chatbot (Part 5)

5 min read852 wordsUpdated Apr 24, 2026

OmarAI went live, feedback came back, and the model got meaningfully better across two dimensions: length of responses and persona consistency. I also ran comparisons of GPT-4o against Llama, MPT, and Falcon for the same use case. This is Dev Log 5 of the OmarAI series — the post-launch iteration post. Live model is at theomar.me/omar-ai.

TL;DR

  • Short-reply problem persisted after launch. Fixed by seeding longer examples into the fine-tune and rewriting the system prompt to reward depth.
  • Continuity problem fixed by expanding the JSON memory and tightening its structure.
  • Persona protocol hardened — playful-fallback mechanism for persistent personal-info probing.
  • Llama / MPT / Falcon head-to-head: each has strengths, but GPT-4o-based fine-tunes still won on coherence and creativity for this use case. Hybrid setups are the future.

What launch changed

Pre-launch, my test set was me and a handful of friends. Post-launch, strangers started probing the model in ways my private tests didn't cover:

  • Asking elaborate, open-ended questions and expecting elaborate, open-ended answers.
  • Asking clarifying follow-ups that depended on memory from earlier in the thread.
  • Testing the confidentiality protocol with increasing creativity.

The feedback was great data.

Problem 1 — replies were still too short

Initial training data had a lot of short casual exchanges. The model mirrored that style. Users liked the voice but wanted more depth when asking serious questions.

Fix: seed the fine-tune with real conversations where I wrote longer paragraphs — stories, explanations, opinions — and tune the system prompt to explicitly encourage length when the question warranted it.

# Added to system prompt
When the user asks an open-ended or elaborate question, respond with depth:
include context, a short story or example, and a concrete opinion. Match the
length of the user's intent — short questions get short answers, rich questions
get rich answers.

Average reply length on substantive questions roughly doubled. Voice stayed intact.

Problem 2 — forgetting recent details

Users would share something about themselves, then ask about it two turns later, and the model wouldn't use it. Short-term memory across a single conversation was fine (it's in the message list), but references to facts about me were sometimes random pulls from the old training data instead of the current curated JSON memory.

Fix: expand the JSON memory introduced in Part 3 with a tighter structure:

| JSON section | Examples | |---|---| | background | School, major, home country, current year | | interests | LLM fine-tuning, TUI frameworks, computer vision | | relationships | Close friends with short one-line descriptions | | anecdotes | A handful of short stories I'm happy to share | | protocols | Explicit directives (e.g. "never share address") |

The model now pulls from a clean, structured source for personal facts instead of making them up.

Problem 3 — persistent probing for private info

Original system prompt handled polite requests well. It failed on persistent probing — users who'd rephrase the same personal question four different ways.

Fix: a playful-fallback mechanism. After repeated pressing, the model deflects with a random fun fact, an anecdote from the anecdotes JSON, or a joke — but never the info being requested. Staying in-character matters; a curt refusal would break the persona.

Llama vs. MPT vs. Falcon vs. GPT-4o

I also experimented with open-source bases — Llama, MPT, Falcon — fine-tuned on the same 2,000 curated pairs.

| Model | Strength | Weakness vs. GPT-4o | Fit for this project | |---|---|---|---| | GPT-4o fine-tune | Coherence, creativity, ease of deployment | Cost, not self-hostable | Shipped | | Llama | Efficient, good zero-shot | Less fluent on my personal voice after fine-tuning | Strong candidate for hybrid | | MPT | Good at long-context | Noticeably less natural in casual banter | Not ideal here | | Falcon | Open-source flexibility | Training infra overhead, weaker persona capture | Revisit later |

GPT-4o won on the axes I cared about: coherence, creativity, and deployment simplicity. A future hybrid — GPT-4o for creative turns, a smaller self-hosted Llama for high-frequency low-value queries — is probably the right long-term architecture for cost control.

Reflections

Seeing strangers interact with an AI version of me has been simultaneously surreal and useful. It pushed me to think about persona, privacy, and failure modes in ways I hadn't before — the way any product launch forces you to.

The whole series in order: Part 1 · Part 2 · Part 3 · Part 4 · Part 5 (this post). Chat with the finished product at theomar.me/omar-ai.

Key takeaways

  • Post-launch feedback is training data. You can't fake the distribution of questions strangers ask.
  • Length and depth are fine-tuning problems, not just system-prompt problems. Seed the data, then prompt the behavior.
  • The shipping model can still be GPT-4o in 2024. Open-source alternatives get closer every month, but for a personal chatbot where voice matters most, GPT-4o's still the easiest path to good.

References

Frequently Asked Questions

Why did GPT-4o still win over Llama, MPT, and Falcon for this use case?

For a personal chatbot where the product is coherence, creativity, and conversational ease, GPT-4o delivered the best balance out of the models I tried. Llama's efficiency and Falcon's open-source flexibility are strong reasons to revisit them — but for deployment-level experience, GPT-4o-based models consistently felt more fluent on my data.

How did you fix the short-reply problem after launch?

Two things. First, I expanded fine-tuning examples that included longer multi-sentence responses from real conversations where I wrote at length. Second, I tuned the system prompt to encourage depth — explicitly rewarding the model for sharing stories, context, or opinions rather than stub answers.

How do you keep a personal chatbot from leaking private info?

Don't put sensitive info in the JSON memory. Layer a confidentiality protocol into the system prompt so the model redirects sensitive requests with humor instead of information. Add a playful-fallback mechanism for persistent probing. Log and review failure cases after launch — real users always probe in ways your private test set didn't.

What would a hybrid multi-model architecture look like?

Route by intent: GPT-4o for creative/conversational replies, a smaller self-hosted Llama for high-frequency, low-value queries (greetings, routine FAQ), and a cheap cache for repeats. You get GPT-4o quality where it matters and a better cost profile overall.

What's the single biggest thing you'd do differently?

Start with a smaller, cleaner dataset sooner. Most of my pain in Parts 1–2 was volume-for-volume's-sake — thinking more data would help. The model got dramatically better the moment I cut to 2k high-quality curated pairs.

llmopenaillamamptfalconfine-tuningpersonal-chatbotdev-log

related posts