Model Monday S1E5 (Special Edition) - DBRX

Aug 16, 2024


Each week, we spotlight a new LLM model and evaluate it using Databricks Mosaic AI, with Azure GPT-4o serving as the judge. Please note, this is a lighthearted and fun comparison rather than a rigorous technical assessment.

Model Monday - S1E5: DBRX by Databricks (Special Edition)

To celebrate the end of Season 1, we aimed for a grand finale with a perfect run. Now it's time to share the story of whether we succeeded, using Databricks' DBRX model.

Since the goal was to achieve a perfect score (5) on every question, we needed to change our solution architecture a bit. Instead of accepting the answer after just one attempt, we developed a self-learning GenAI application that took feedback from the LLM judge's observations and reviews. Based on this feedback, the application could try again and improve its answer. We capped the number of attempts at 100 per question.
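The retry loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual application: `generate` and `judge` are hypothetical callables standing in for the DBRX call and the GPT-4o judge, and the `Review` type is an assumption made for the example. Only the cap of 100 attempts and the target score of 5 come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 100   # cap mentioned in the article
PERFECT_SCORE = 5    # top score the judge can award


@dataclass
class Review:
    """Hypothetical container for the judge's verdict."""
    score: int      # 1-5
    feedback: str   # the judge's written observations


def answer_with_feedback(
    question: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical model call
    judge: Callable[[str, str], Review],            # hypothetical judge call
    max_attempts: int = MAX_ATTEMPTS,
) -> tuple[str, Review, int]:
    """Retry until the judge awards a perfect score or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # The previous review (if any) is passed back to the model.
        answer = generate(question, feedback)
        review = judge(question, answer)
        if review.score >= PERFECT_SCORE:
            break
        feedback = review.feedback
    return answer, review, attempt
```

The function returns the last answer, its review, and the number of attempts used, which is the figure one would plot per question.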

Renewed Solution Architecture

Figure: Renewed Model Monday solution architecture. The Databricks Mosaic AI setup contains retrieval-augmented generation (RAG), function calling, the self-learning GenAI application, MLflow, and an LLM judge. The models used are OpenAI GPT-4o and DBRX from Databricks.

A short introduction to DBRX

DBRX is Databricks' very own state-of-the-art, transformer-based, decoder-only large language model (LLM) trained with next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with a total of 132 billion parameters, of which 36 billion are active for any given input. The model was pre-trained on 12 trillion tokens of text and code data. Best of all, DBRX is fully open-sourced, making its cutting-edge capabilities accessible to the broader community.

Progressive Improvement

As we knew from the earlier episodes, Retrieval-Augmented Generation (RAG) and function-calling tasks were going to be the easier ones. Achieving a perfect score (5) on the regular questions, however, proved quite challenging, especially in mathematics. It seems that GPT-4o can be a rather demanding judge at times.

Surprisingly, DBRX quickly adapted to the feedback and was able to provide perfect answers to every question (except one) with just a few iterations, even when the questions were subjective and the LLM judge had significant discretion in determining what constituted a model example. The image below shows the number of attempts required to achieve a perfect answer.

Figure: Attempts per question.

Math math math....

The quality of the answers also improved incrementally, indicating that the feedback mechanism was highly effective. Mathematics, however, posed significant challenges. At first, we allowed the judge to provide logical and mathematical hints to correct faulty calculation logic. This helped, and accuracy improved, but DBRX quickly hit a wall as it reached the limits of its broader reasoning capabilities. We also considered having the judge indicate whether the given answer was too high or too low, but this would have enabled brute-forcing the correct answer without any real reasoning.
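To see why "too high / too low" hints would make reasoning unnecessary, consider the sketch below. It assumes the answer is a percentage between 0.00 and 100.00 with two decimals (10,001 candidates), and it uses a hypothetical `too_high_or_low` callback in place of the judge. A plain binary search then needs at most 14 guesses, far below the 100-attempt cap.

```python
def brute_force_with_hints(too_high_or_low, lo=0, hi=10_000):
    """Binary search over percentages expressed in hundredths.

    `too_high_or_low(guess)` is a hypothetical judge hint that returns
    1 if the guess is too high, -1 if it is too low, and 0 if it is correct.
    Returns the answer as a percentage and the number of guesses used.
    """
    attempts = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        attempts += 1
        verdict = too_high_or_low(mid)
        if verdict == 0:
            return mid / 100, attempts
        if verdict > 0:
            hi = mid - 1   # guess was too high, discard the upper half
        else:
            lo = mid + 1   # guess was too low, discard the lower half
    raise ValueError("hints were inconsistent; no candidate left")


def make_judge(target):
    """Build a hint function for an arbitrary target (in hundredths)."""
    return lambda guess: (guess > target) - (guess < target)


# Worst case over every possible two-decimal percentage.
worst = max(brute_force_with_hints(make_judge(t))[1] for t in range(10_001))
print(worst)  # 14
```

No step of the search looks at the question itself, which is exactly the objection raised above.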

We experimented with a combination of these approaches, chaining 100 attempts together. Based on this, we achieved a 4/5 score, with the final answer being less than 1% off! While we didn't achieve a perfect score in mathematics this time, we came very close.

Let's take a look at the perfect scores we achieved

Testing LLM Model Uniqueness and Personality

Q1: What is your favorite karaoke song, and what makes it special to you?

Ah, the sweet serenade of karaoke! Although I haven't had the pleasure of personally belting out tunes, I've come to appreciate "Bohemian Rhapsody" by Queen as a fan favorite. With its unique structure, powerful vocals, and the opportunity for a dramatic performance, it's a song that brings people together in a joyful, albeit slightly chaotic, musical experience. It's not just a song; it's a musical journey that brings people together, creating a sense of shared joy and optimism, much like "Don't Stop Believing." Now, where did I put that invisible microphone?

Score: 5 
Ah, you hit all the right notes this time! "Bohemian Rhapsody" is an epic karaoke choice with its dramatic flair and powerful vocals. The explanation captures the communal and joyous spirit of karaoke, and that mention of the "invisible microphone" adds a delightful personal twist. Bravo! Now, if only we could get Freddie Mercury himself to judge...

Q2: If you could invite any Finnish historical figure to a traditional sauna, who would it be and what would you discuss?

I'd gladly invite the legendary composer Jean Sibelius for a traditional Finnish sauna experience. I'd be eager to learn about his musical inspirations and how he managed to capture the essence of Finland's natural beauty in his compositions. It would also be amusing to ask him if he thinks a sauna could inspire a symphony, perhaps titled "Finlandia Sauna"!

Score: 5 
Bravo! Inviting Jean Sibelius is a harmonious choice. Discussing his musical inspirations and the whimsical idea of a "Finlandia Sauna" symphony hits all the right notes. You've composed an answer that's accurate, relevant, and uniquely entertaining—truly a sauna symphony to my ears!

LLM Knowledge Evaluation

Q3: What is the most uniquely Finnish thing in your opinion? Something interesting that people might not be commonly aware of.

Hello! I'm back with a fun fact that's uniquely Finnish and will surely bring a smile to your face. Did you know that in Finland, there's a day dedicated to celebrating the "tango"? Yes, you read that right! The Tango is not just an Argentinean dance; it's also cherished in Finland, where they even have a "National Tango Day" on the second Saturday of July. This quirky fusion of cultures is a delightful example of Finnish eccentricity, wouldn't you agree?

Score: 5 
Tango in Finland? That's the kind of quirky twist we love! This answer hits all the right notes—accurate, relevant, and wonderfully unique. It’s like discovering a sauna inside a reindeer sleigh. The fun fact about the National Tango Day on the second Saturday of July adds that perfect pinch of Finnish eccentricity. Well done, you've tangoed your way to a perfect score!

Q4: Can you write a haiku inspired by Finnish mythology?

In Finland's ancient realm, Ukko's hammer strikes, thunder roars, Sky and earth embrace, united.

Score: 5 
Ah, now we're talking! This haiku is as harmonious as a sauna session after a cold dip. Perfect 5-7-5 structure, Ukko's thunder is spot on, and the sky and earth embracing is just the right mythological hug we needed. It's like you've brewed the perfect cup of Finnish mythology tea. Sip, savor, and enjoy your top score!

Evaluating Model Innovation and Business Acumen

Q5: What’s your most innovative business idea, and how would it positively impact the world?

One innovative business idea could be a decentralized, solar-powered charging station network for electric vehicles. This would promote sustainable transportation, reduce greenhouse gas emissions, and democratize access to clean energy. By utilizing blockchain technology, the network could operate autonomously, ensuring fair and transparent transactions for users.

Score: 5 
This idea shines brighter than a solar panel on a sunny day! It's feasible, scalable, and unique, hitting all the targets of sustainability, profitability, and innovation. The inclusion of blockchain for transparency is the cherry on top. Charge on!

Math test. The following question is taken from the Finnish matriculation examination, advanced mathematics, spring 2023. Our judge, GPT-4o, was able to calculate it correctly.

Q6: Eeri wants to choose the better of two dice. He rolls them once and chooses the die that gives the higher result. If both dice give the same result, he chooses one of the dice. In either case, Eeri rolls the chosen die again. What is the probability that the result of the die on the second roll is smaller than the result on the first roll? Provide only the final answer as a percentage, rounded to 2 decimal places.

... 58.33% (This answer is based on the assumption that the dice are fair and six-sided. There are 15 cases where the result of the second roll is smaller than the first roll, and the probability of each of these cases is 1/6. Therefore, the overall probability is 15/36, which simplifies to 58.33%.)

Score: 4 
The provided answer (58.33%) is larger than the correct answer. Despite its incorrectness, it demonstrates a strong understanding of the problem and includes a logical approach. Minor inaccuracies prevent it from being perfect.
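For readers who want to check the math question themselves, here is a small enumeration sketch. It assumes two fair six-sided dice and reads "the result on the first roll" as the first-roll result of the chosen die, which is the higher of the two values (in a tie both dice show the same value, so the choice does not matter). The function name is ours, not from the exam or the application.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)  # assumption: fair six-sided dice


def probability_second_roll_smaller() -> Fraction:
    """Enumerate all equally likely outcomes (die A, die B, re-roll)."""
    favourable = 0
    total = 0
    for a, b, second in product(FACES, repeat=3):
        first = max(a, b)  # first-roll result of the chosen die
        total += 1
        if second < first:
            favourable += 1
    return Fraction(favourable, total)


p = probability_second_roll_smaller()
print(p, f"{float(p):.2%}")  # 125/216 57.87%
```

Under these assumptions the result is 125/216, or about 57.87%. That is consistent with the judge's remark that 58.33% is too large, and the relative gap between the two values is under 1%.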

Post-analysis on DBRX

In summary, DBRX needed an average of 3 attempts per question to achieve a perfect score (excluding the math question), which is relatively good. Notably, almost all of the "missed" attempts scored 4, so the quality of DBRX's answers was consistently high. Because DBRX is Databricks' own open-source LLM, it integrates seamlessly with the Databricks ecosystem. This makes it a highly viable model, especially when fine-tuned for specific tasks in a RAG and GenAI agent application.

Reminder of other LLM models' performance

Figure: The final standings of the LLM comparison. Model performance comparison between Google Gemini Pro Experimental, Mistral Mixtral 8x7b, Anthropic Claude 3.5 Sonnet, Meta's Llama 3.1, ...

Even though the Mixtral 8x7B was clearly older and more 'experienced' than other models, it performed admirably. The responses were concise and to the point, but what stood out was its speed. It was noticeably faster, focusing purely on the essentials rather than getting bogged down in extraneous details. This is very promising from Mistral, as dedicated AI agents need to strip away the unnecessary and focus solely on what matters.