
Omar Musayev

AI Engineer & Full Stack Developer

Azerbaijan

AI Engineer with extensive experience in competitive programming, machine learning, and web development. Passionate about solving real-world problems with cutting-edge technology.

JavaScript · React · Next.js · Node.js · Python · Machine Learning · AI · SQL · Git · TensorFlow · PyTorch · HTML · CSS · Java · C++

About Me

I am Omar Musayev, and I am passionate about leveraging AI, machine learning, and cutting-edge technology to address real-world challenges. I have extensive experience in competitive programming and research, and I have volunteered with international organizations like the UN. I'm always eager to collaborate on open-source projects and believe in the power of technology to create meaningful, accessible solutions.


Information

  • Location: Baku, Azerbaijan
  • Experience: 4+ years in AI and Web Development
  • Availability: Immediate
  • Relocation: Yes

OmarAi Dev Log 1

Hello Everybody,

Today I started working on a project that I have been meaning to do for a long time: creating an AI version of me. Well, at least something similar. We all know that AI can't replace me, but I need someone I can rely on to answer questions about me for other people.

Initial Plan: My initial plan for this project is to fine-tune one of OpenAI's LLMs to get an idea of what the final product could look like. The idea is to download my message data from Instagram, which you can easily do through Instagram's settings. However, the data Instagram gives you needs to be heavily reorganized and scrubbed before it can go into OpenAI's fine-tuning.
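For context, OpenAI's fine-tuning for chat models expects a JSONL file where every line is one complete training example. Something like this (the system message here is just a placeholder I made up):

{"messages": [{"role": "system", "content": "You are Omar Musayev."}, {"role": "user", "content": "Hey, how's it going?"}, {"role": "assistant", "content": "Not bad, you?"}]}

(Older completion-style models like babbage-002 instead take simple {"prompt": ..., "completion": ...} pairs.) Instagram's export looks nothing like this, which is where the reorganizing comes in.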

The file structure Instagram gives us is similar to this:

messages/inbox/Xpersonsname/mychatlogswiththatperson

And inside, you’ll find a file called message_1.json (sometimes more, depending on the length of the chat) and possibly folders for photos, videos, and other media shared in the chat.

The message_1.json files contain a series of messages exchanged between me and the other person, formatted as a JSON object. Here’s a simplified example of what these messages look like inside the JSON:

{
  "participants": [
    {
      "name": "Omar Musayev"
    },
    {
      "name": "Abdul-Aziz Mammadli"
    }
  ],
  "messages": [
    {
      "sender_name": "Abdul-Aziz Mammadli",
      "timestamp_ms": 1712667823182,
      "content": "Hey Omar, how’s it going?"
    },
    {
      "sender_name": "Omar Musayev",
      "timestamp_ms": 1712667805807,
      "content": "Not bad, just working on this cool AI project!"
    }
  ]
}
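Loading all of this is simple enough. Here's a minimal sketch in Python, assuming the standard export layout (the function and variable names are my own):

import json
from pathlib import Path

def load_chat(chat_dir):
    # Load every message_N.json part in one chat folder and
    # return the messages sorted oldest-first by timestamp.
    messages = []
    for part in sorted(chat_dir.glob("message_*.json")):
        with open(part, encoding="utf-8") as f:
            messages.extend(json.load(f)["messages"])
    return sorted(messages, key=lambda m: m["timestamp_ms"])

inbox = Path("messages/inbox")
chats = {chat.name: load_chat(chat) for chat in inbox.iterdir() if chat.is_dir()}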

There are a few ways I tried to organize this data into a format the fine-tuning can understand.

First, I tried pairing up every two consecutive messages where at least one was from me, in the structure:

Prompt : answer

But then I realized that when texting, I often double-text, which led to training pairs that made it look like the AI was talking to itself.

I then tried to fix this by writing a program to detect when I or the other person sent consecutive messages. I didn't realize it at the time, but this hadn't solved the problem either.
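The detection itself is straightforward. A sketch of the idea, using the field names from the JSON above:

def split_into_runs(messages):
    # Group consecutive messages from the same sender into runs,
    # so a double text becomes one run instead of two separate turns.
    runs = []
    for msg in messages:
        if runs and runs[-1][0]["sender_name"] == msg["sender_name"]:
            runs[-1].append(msg)
        else:
            runs.append([msg])
    return runs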

At the time, I was satisfied with the results and moved on to the fine-tuning. I tried fine-tuning 2 models:

  • Babbage-002
  • GPT-4o-mini-2024-07-18

However, they didn't give me the best results. While GPT-4o-mini-2024-07-18 could give me actual responses, they were usually very short or just irrelevant. One thing it did pick up on was my vocabulary, and it would use words like “I mean”, “kinda”, and “fr”.

This is when I realized that the reason it was giving short outputs was that I was giving it short inputs in the training data. You see, my data only paired two messages at a time, but we teenagers rarely fit our thoughts into a single message. Instead, we send multiple messages that together make up a complete sentence. So I decided on a new way to sort the data:

  1. Merge all consecutive messages from the other person into one complete prompt.
  2. Merge all consecutive responses from me into one complete reply.
  3. Use each merged exchange to form a better prompt-response pair for training.

After doing this, I put my new data into the GPT-4o-2024-08-06 model fine-tuning, and as I am writing this, I am waiting for the model to train.
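For the curious, the merging step boils down to walking the runs in pairs. A sketch, reusing load_chat and split_into_runs from above (the output file name is a placeholder):

import json

def runs_to_examples(runs, me="Omar Musayev"):
    # Take each pair of adjacent runs (their run, then my run) and
    # merge every run into a single newline-joined prompt or reply.
    examples = []
    for prev, curr in zip(runs, runs[1:]):
        if prev[0]["sender_name"] != me and curr[0]["sender_name"] == me:
            prompt = "\n".join(m.get("content", "") for m in prev)
            reply = "\n".join(m.get("content", "") for m in curr)
            if prompt.strip() and reply.strip():
                examples.append({"messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": reply},
                ]})
    return examples

# One JSON object per line, which is the JSONL shape the
# fine-tuning endpoint expects.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for messages in chats.values():
        for ex in runs_to_examples(split_into_runs(messages)):
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")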

I will update you guys on the results in the next dev log.

See you guys later!!