AI Engineer & Full Stack Developer
Azerbaijan
AI Engineer with extensive experience in competitive programming, machine learning, and web development. Passionate about solving real-world problems with cutting-edge technology.
I am Omar Musayev, passionate about leveraging AI, machine learning, and cutting-edge technology to address real-world challenges. I have extensive experience in competitive programming, research, and volunteering in international organizations like the UN. I'm always eager to collaborate on open-source projects and believe in the power of technology to create meaningful, accessible solutions.
Hello Everybody,
So, over the course of the past few days I have found some mistakes in my old code for OmarAI, and I have experimented with new things.
First things first: remember how I was waiting for my "GPT-4o-2024-08-06" model to fine-tune? Well, that failed because of the huge number of data points I had, around 20k+ message pairs. The job couldn't tokenize and train on all of that because I had set a limit on it so it wouldn't use too many resources.
This turned out to be a good thing, though, because it prompted me to look over my data and see if there was anything I could cut out. That's when I realized my data was heavily flawed. The algorithm I had used to convert my messages into conversation pairs walked through them from the most recent to the oldest, which caused a problem: the conversation pairs treated the last message sent as the first message sent, and vice versa.
The fix for this was quite easy. I just had to reverse the order of the messages before running the same pairing algorithm.
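Here's a minimal sketch of what the fix amounts to. The sample data, the field names ("author", "content"), and the build_pairs helper are all assumptions for illustration, not my actual script:

```python
# Minimal sketch: the export is newest-first, so reverse it before pairing.
# Field names ("author", "content") and the sample data are illustrative only.

messages_newest_first = [
    {"author": "omar",   "content": "sure, see you at 7"},
    {"author": "friend", "content": "want to grab dinner later?"},
]

def build_pairs(messages):
    """Pair a friend's message (user) with the reply that follows it (assistant)."""
    pairs = []
    for current, reply in zip(messages, messages[1:]):
        if current["author"] != "omar" and reply["author"] == "omar":
            pairs.append({
                "messages": [
                    {"role": "user", "content": current["content"]},
                    {"role": "assistant", "content": reply["content"]},
                ]
            })
    return pairs

# The bug: pairing the newest-first list treats the last message sent as the
# first one. Reversing restores chronological order before pairing.
pairs = build_pairs(list(reversed(messages_newest_first)))
print(pairs)
```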
Now, coming back to my original problem of having too many messages that ate up resources without contributing anything to the model, I first had to make sure my conversations were in English. While looking at the data, I noticed that most of the conversations included Azerbaijani, or even made-up languages that my friends and I had developed to talk to each other. This would have negatively affected the future model by confusing its language understanding, so I had to find a way to cut it out.
The solution I came up with? Use AI!
I used the OpenAI GPT-3.5 API to go over my message pairs and turn them into data that was more useful to the model I wanted to train. I wrote an algorithm to feed it 50 conversations at a time and have it pick the best 4-5 each time that were in English. The prompt I gave the model was something like:
Look at the following multiple lines of user-assistant interactions and combine them into the top 4 most coherent and meaningful multi-turn conversations. Each conversation should have multiple back-and-forth turns between the user and the assistant, combining related exchanges together into one conversation.
The final output should be in a format that can be used to fine-tune an AI model. The output must be JSONL-compatible, where each conversation is a single line, combining relevant interactions into coherent, multi-turn dialogues.
Make sure the output is in the following format:
{{"messages": [{{"role": "user", "content": "User's message"}}, {{"role": "assistant", "content": "Assistant's response"}}, ...]}}
Each conversation must follow these rules:
1. Include multiple exchanges (combine related back-and-forth turns between user and assistant).
2. Exclude irrelevant or non-contributing messages that don't add value to the conversation.
3. Exclude conversations that contain non-standard ASCII characters like \\u00e2, \\u0080, etc.
4. Exclude conversations that contain the word "attachment".
5. Only include conversations in clear, understandable English. Avoid nonsensical or overly short exchanges.
6. Pick the 4 best conversations based on coherence, relevance, and length.
7. Do not include any formatting like ```json or ``` around the output.
8. Each conversation must be a single line (no line breaks), and conversations must flow naturally with context.
Input:
{chunk}
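The batching loop around this prompt looked roughly like the sketch below. This is a reconstruction using the OpenAI Python SDK; the file names, chunk size, and the PROMPT_TEMPLATE variable are assumptions, not my exact script:

```python
# Sketch of the filtering pass: send 50 conversations at a time to GPT-3.5
# and keep only the few it judges coherent and in English.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CHUNK_SIZE = 50

# The prompt shown above, saved with a {chunk} placeholder (file name illustrative).
PROMPT_TEMPLATE = open("prompt.txt", encoding="utf-8").read()

with open("conversation_pairs.jsonl", encoding="utf-8") as f:
    pairs = [line.strip() for line in f if line.strip()]

with open("filtered_conversations.jsonl", "w", encoding="utf-8") as out:
    for i in range(0, len(pairs), CHUNK_SIZE):
        chunk = "\n".join(pairs[i:i + CHUNK_SIZE])
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(chunk=chunk)}],
        )
        # The model returns one JSONL conversation per line, ready to append.
        out.write(response.choices[0].message.content.strip() + "\n")
```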
This was a huge success and allowed me to downsize my dataset from around 20 thousand conversations to around 2 thousand high-quality conversations that could all contribute to fine-tuning "OmarAI".
The next step was using an algorithm to comb over the new data and make sure all of it was in the format appropriate for fine-tuning OpenAI models. The algorithm fixed all the inconsistencies GPT-3.5 had created in the syntax.
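I won't paste the whole cleanup script, but the validation part boiled down to something like this sketch; the exact checks and file names are assumptions:

```python
# Sketch of the validation pass: keep only lines that parse as JSON and match
# the {"messages": [{"role": ..., "content": ...}, ...]} schema that OpenAI
# fine-tuning expects. File names are illustrative.
import json

VALID_ROLES = {"system", "user", "assistant"}

def is_valid(line):
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    msgs = record.get("messages")
    if not isinstance(msgs, list) or len(msgs) < 2:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in VALID_ROLES
        and isinstance(m.get("content"), str)
        for m in msgs
    )

with open("filtered_conversations.jsonl", encoding="utf-8") as src, \
     open("training_data.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if line and is_valid(line):
            dst.write(line + "\n")
```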
After the data was ready, I uploaded it to the OpenAI fine-tuning API for GPT-4o and went to sleep so I could see the results the next day.
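For reference, kicking off the job with the OpenAI Python SDK looks roughly like this (a sketch; the training file name is the one from the cleanup step above):

```python
# Sketch of starting the fine-tuning job: upload the cleaned JSONL file,
# then create a job against the GPT-4o snapshot mentioned earlier.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)

# Check back later with client.fine_tuning.jobs.retrieve(job.id)
print(job.id, job.status)
```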
I will keep you guys updated on how it turns out.