Wes Bassler, Sr. Machine Learning Engineer
In the current landscape of artificial intelligence (AI) and machine learning (ML), large language models (LLMs) have become essential tools for businesses looking to enhance their customer experiences, automate workflows, and gain insights from their own data. However, vendor- or third-party-provided models often fall short when it comes to unique requirements such as ensuring data privacy, maintaining control over the model's outputs, minimizing operational costs, and tailoring capabilities to specific needs.
This blog post explores our journey in building and fine-tuning a custom open-source LLM that tackles these challenges head-on. By leveraging the right technologies and methodologies, we have developed a model that not only meets our unique business objectives but also positions us for future scalability and innovation. From investigating the right base models to refining the data and training processes, this post describes each phase of our journey, showing how businesses can achieve impactful results with a customized LLM solution.
Investigation: Selecting the Base Model
Our journey began with a thorough investigation phase. Although creating a completely new pre-trained model from scratch on our own data was an option, we opted to start with existing open-source Transformer models from the Hugging Face Hub. This allowed us to iterate and experiment quickly without the massive overhead and expense of developing a model from the ground up. We are also always mindful of the carbon footprint of everything we build. We explored several well-known open-source models generating buzz in the tech community, including Falcon, LLaMA 3, and Mistral, each offering strong results and unique strengths depending on the use case.
Our evaluation process included using few-shot learning to test these models against our specific use cases using our internal data. We focused on how well these models could generate consistent, relevant, and accurate outputs when given a few examples. This phase was crucial in understanding each model's capabilities and limitations in handling our needs.
LangChain played a huge part in this phase, simplifying many different types of testing such as Retrieval-Augmented Generation (RAG), instruction prompting, and output parsing. The Ollama project also let us experiment with different quantization methods and test everything locally on our MacBooks without spending a ton of money on GPU-enabled machines.
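To give a flavor of that local experimentation, here is a minimal sketch of driving a quantized Mistral build served by Ollama through LangChain. The model tag, prompt, and question are illustrative assumptions rather than our exact setup.

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Ollama serves a quantized build pulled locally beforehand, e.g.:
#   ollama pull mistral:7b-instruct-q4_K_M   (illustrative tag)
llm = ChatOllama(model="mistral:7b-instruct-q4_K_M", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer concisely."),
    ("human", "{question}"),
])

# Chain the prompt, the local model, and a simple string parser together.
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"question": "Summarize this support ticket in one sentence: ..."}))

Local experimentation sketch with LangChain and Ollama (illustrative)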
After evaluating the models mentioned above, we selected the latest Mistral 7B Base (version 3) as the most practical choice for our needs. This model consistently outperformed the others in accuracy and reliability, producing the fewest hallucinations during testing. Its performance was particularly strong in generating structured outputs in JSON format, which was crucial for integrating with our existing workloads. Notably, Mistral 7B performed exceptionally well in multi-language support, despite being primarily trained on English, further showcasing its versatility. Inference was also consistently a couple of seconds faster than the others thanks to its model architecture. It’s also worth noting that Mistral carries an Apache 2.0 license, something you should always consider carefully when choosing a model, since some models have limitations on enterprise usage.
While Mistral 7B is our current choice, we are committed to continuously evaluating the latest open-source models as they emerge to ensure we’re always using the best possible model for our needs.
Task: Classify the sentiment of the following movie reviews as Positive, Negative, or Neutral.
Example 1 - Review: "I absolutely loved this movie! The plot was engaging, and the characters were incredibly well-developed. A must-watch!"
Sentiment: Positive
Example 2 - Review: "The movie was alright, but it didn’t live up to the hype. Some parts were good, but overall it was pretty forgettable."
Sentiment: Neutral
Example 3 - Review: "I was really disappointed. The storyline was predictable, and the acting was subpar. I expected much more."
Sentiment: Negative
Now your turn - Review: "The film had some great visual effects, but the pacing was too slow, and I couldn’t stay interested."
Sentiment: ?
Few-Shot Prompt Example
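For reference, a prompt like the one above can be assembled programmatically with LangChain's FewShotPromptTemplate, which makes it easy to reuse the same examples against every candidate model. The snippet below is a small sketch built from the illustrative reviews in the prompt, not our production evaluation harness.

from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

# The labeled reviews from the prompt above, expressed as data.
examples = [
    {"review": "I absolutely loved this movie! The plot was engaging, and the characters were incredibly well-developed. A must-watch!", "sentiment": "Positive"},
    {"review": "The movie was alright, but it didn’t live up to the hype. Some parts were good, but overall it was pretty forgettable.", "sentiment": "Neutral"},
    {"review": "I was really disappointed. The storyline was predictable, and the acting was subpar. I expected much more.", "sentiment": "Negative"},
]

example_prompt = PromptTemplate.from_template('Review: "{review}"\nSentiment: {sentiment}')

few_shot_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix="Task: Classify the sentiment of the following movie reviews as Positive, Negative, or Neutral.",
    suffix='Review: "{review}"\nSentiment:',
    input_variables=["review"],
)

# The rendered prompt can be sent to any candidate model, e.g. the ChatOllama
# chain from the earlier snippet.
print(few_shot_prompt.format(review="The film had some great visual effects, but the pacing was too slow, and I couldn’t stay interested."))

Building the few-shot prompt with LangChain (illustrative sketch)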
Data: From Hand-Labeled to LLM-Enhanced
Initially, we had a dataset with a little more than 5,000 examples that had been hand-labeled by internal users. These initial examples were invaluable for setting a quality benchmark and guiding our early-stage fine-tuning efforts. However, we soon realized the limitations of this approach in terms of features, scalability and accuracy.
The introduction of a multimodal LLM transformed our data labeling process. By using an LLM to generate labeled data, we could quickly scale our dataset while maintaining high-quality annotations, generating tens of thousands of examples in just a couple of hours. We sped this up even further with parallelization frameworks like Dask, annotating multiple examples at the same time; a sketch of that parallel labeling step follows the prompt template below. This shift allowed us to build more robust instruction datasets in a structured way, advancing our goal of building our own instruct model.
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
Example Prompt for Instruction Dataset
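To make the labeling step concrete, below is a rough sketch of annotating raw examples in parallel with Dask and rendering each record into the template above (with the model response appended so each record becomes a complete training string). The label_with_llm function is a hypothetical stand-in for whatever multimodal LLM client actually performs the annotation; everything else is illustrative.

import dask.bag as db

# The template shown above, with the response appended so each labeled record
# renders to a single training string.
PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{response}"""

def label_with_llm(raw_text: str) -> dict:
    # Hypothetical placeholder: in practice this calls the multimodal LLM that
    # produces the instruction/input/response annotation for each raw example.
    return {
        "instruction": "Classify the sentiment of the review.",
        "input": raw_text,
        "response": "Neutral",
    }

def to_training_text(raw_text: str) -> str:
    return PROMPT_TEMPLATE.format(**label_with_llm(raw_text))

raw_examples = ["...unlabeled example 1...", "...unlabeled example 2..."]

# Dask fans the labeling calls out across partitions so many examples are
# annotated at the same time.
dataset = db.from_sequence(raw_examples, npartitions=8).map(to_training_text).compute()

Parallel labeling sketch with Dask (label_with_llm is a placeholder)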
Training: Iterative Refinement for Optimal Performance
The initial version of our custom model involved three major iterations, each of which improved results dramatically for our use case. We trained on a single machine with a 40GB GPU and used the Low-Rank Adaptation (LoRA) technique to speed up training, reduce the number of trainable parameters, and stay memory efficient; a minimal LoRA configuration sketch follows the iteration summary below.
1. First Iteration: We started by training the model on our hand-labeled data (5,000 examples), focusing on a single language. Training took roughly 8 hours. This provided a solid foundation and allowed us to understand the initial performance metrics and identify areas for improvement. Although the model performed well, we recognized the need for more diverse and extensive data to enhance its capabilities.
2. Second Iteration: Leveraging the power of a multimodal LLM, we expanded our dataset to include 20,000 examples, labeled across multiple languages. This iteration aimed to improve the model's multilingual understanding as well as improve accuracy. The inclusion of multiple languages was to cater to a broader audience, making our model more versatile and valuable to many more of our customers. This training took just over 24 hours to complete.
3. Third Iteration: The final phase involved scaling up to over 50,000 examples, again leveraging an LLM for labeling across multiple languages. This significant expansion in training data enabled us to refine the model further, enhancing its accuracy, reducing hallucinations, and improving its ability to understand and respond to instructions across different languages. This training iteration was the longest at just over 30 hours.
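For readers who want a concrete starting point, below is a minimal LoRA setup using Hugging Face's peft library with the Mistral 7B v0.3 base model. The rank, alpha, dropout, and target modules are illustrative defaults, not the exact hyperparameters we trained with.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",   # a 7B model in bf16 plus LoRA adapters fits on a single 40GB GPU
)

lora_config = LoraConfig(
    r=16,                                    # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the small LoRA adapter weights are trained.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of total parameters

Minimal LoRA configuration sketch (illustrative hyperparameters)

From here, the wrapped model can be handed to a standard Hugging Face Trainer, or trl's SFTTrainer, together with the instruction dataset described earlier.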
Next Steps: Exploring Model Deployment and Optimization
As we move forward with our custom LLM, our focus shifts to optimizing deployment. One of the key areas we are actively exploring is quantization, which can speed up inference and shrink the model's footprint. This optimization is critical as we plan to bring the same autoscaling capabilities we rely on elsewhere on our platform to the model, enabling it to handle workloads efficiently while maintaining low latency.
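As one example of the kind of quantization we are evaluating, the snippet below loads a model in 4-bit precision through Transformers and bitsandbytes, which shrinks its memory footprint considerably. This is a sketch of one possible approach, not our final deployment configuration, and it points at the public base model rather than our fine-tuned checkpoint.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit instead of 16-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",            # illustrative; in practice, the fine-tuned checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

4-bit quantized loading sketch with bitsandbytes (illustrative)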
For those interested in our platform’s architecture, we encourage you to visit a previous blog post, where we detail the current infrastructure that supports our AI and ML applications.
We will dive deeper into the technical aspects of model deployment, including quantization techniques and autoscaling, in an upcoming blog post. Stay tuned.
Conclusion
Our journey in building and fine-tuning a custom LLM highlights how it can solve key challenges. By focusing on accuracy, data privacy, and cost efficiency, we’ve created a powerful solution tailored to our needs. Leveraging technologies like multimodal LLMs for data labeling, LangChain for experimentation, and Hugging Face for transfer learning, we’ve ensured high accuracy, reliability, and consistency across multiple languages for our customers.
As we continue to innovate and evaluate new open-source models, our custom LLM is poised to drive smarter decisions, improve customer experiences, and enhance operational efficiency. Stay tuned for our upcoming blog post, where we’ll cover model deployment and optimization in greater detail.