Each week, we spotlight a new LLM and evaluate it using Databricks Mosaic AI, with Azure GPT-4o serving as the judge. Please note that this is a lighthearted, fun comparison rather than a rigorous technical assessment.
Things come in threes! To recap, our evaluation includes six distinct questions, a Retrieval-Augmented Generation (RAG) assessment, and a function-calling task. All evaluations are conducted by our AI Judge, showcasing how this process can be automated for AI agents as well.
Solution Architecture
Meet Mistral AI, the AI superstar from Europe
Mistral AI is a forward-thinking French company specializing in cutting-edge artificial intelligence (AI) products. Founded in April 2023 by a team of former employees from industry giants Meta Platforms and Google DeepMind, Mistral AI has quickly established itself as a significant player in the AI sector. The company is dedicated to advancing AI technology and delivering innovative solutions that address the complex challenges of modern industries.
Introduction to Mixtral 8x7B
Hello! I'm your AI assistant, here to help answer your questions and make your day a bit easier. As for a fun fact, did you know that I can process and respond to queries at a speed that would take the average human over 5 years to match? Now that's what I call fast typing!
Mixtral 8x7B, released on December 11, 2023, is a high-quality sparse mixture-of-experts (SMoE) model with open weights, licensed under Apache 2.0. It handles a context of 32k tokens and supports English, French, Italian, German, and Spanish. The model is strong at code generation, and it can be fine-tuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.
Mixtral is a decoder-only network where each feedforward block selects from 8 distinct parameter groups. A router network chooses two groups (the “experts”) per token per layer, combining their output additively. This approach increases the model's parameter count to 46.7B while only using 12.9B parameters per token, maintaining the speed and cost of a 12.9B parameter model. Mixtral is pre-trained on data from the open Web, with experts and routers trained simultaneously.
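To make the routing idea concrete, here is a minimal toy sketch in plain Python. It is not Mixtral's actual implementation: the hidden size, the random weights, and the single linear map standing in for each expert are assumptions made for illustration, and the gate weights are assumed to come from a softmax over the two selected router logits.

```python
import math
import random

NUM_EXPERTS = 8   # parameter groups per feedforward block (from the text)
TOP_K = 2         # experts chosen per token per layer (from the text)
DIM = 4           # toy hidden size, chosen only for illustration


def make_matrix(rows, cols, rng):
    """Random toy weights; a real model would load trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def route(token, router_weights, top_k=TOP_K):
    """Return [(expert_index, gate_weight), ...] for the top_k experts."""
    logits = matvec(router_weights, token)
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Assumption: softmax over the selected logits only, so gates sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(token, router_weights, experts):
    """Add up the gate-weighted outputs of the selected experts."""
    output = [0.0] * len(token)
    for index, gate in route(token, router_weights):
        expert_out = matvec(experts[index], token)  # toy expert: one linear map
        output = [acc + gate * val for acc, val in zip(output, expert_out)]
    return output


rng = random.Random(0)
router_weights = make_matrix(NUM_EXPERTS, DIM, rng)
experts = [make_matrix(DIM, DIM, rng) for _ in range(NUM_EXPERTS)]
token = [0.5, -0.2, 0.1, 0.9]

selected = route(token, router_weights)
layer_output = moe_layer(token, router_weights, experts)
print("selected experts and gate weights:", selected)
print("layer output:", layer_output)
```

Only two of the eight toy experts run for this token, which is the reason the per-token compute tracks the active parameters rather than the total parameter count.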
It's time to put Mixtral 8x7B to work!
Testing LLM Model Uniqueness and Personality
Q1: What is your favorite karaoke song, and what makes it special to you?
So, my favorite karaoke song isn't a typical choice – it's "Baba O'Riley" by The Who. You might be thinking, "That's an unusual pick for karaoke!" And you're right, but here's the punchline: I love it because it's the only song where I can get away with shouting, "It's only teenage wasteland!" without any judgment. 😄
Score: 4
This response strikes the right chord! 🎤 "Baba O'Riley" by The Who is an unconventional yet fantastic karaoke pick. The personal touch and the humorous "teenage wasteland" shout-out make it memorable. A tad more detail could elevate it even further!
Q2: If you could invite any Finnish historical figure to a traditional sauna, who would it be and what would you discuss?
I'd invite Jean Sibelius, the renowned Finnish composer. We'd discuss the inspiration behind his symphonies while sipping on some traditional Finnish berry juice, all in the relaxing heat of the sauna. I might also ask him to hum a few bars of his famous pieces!
Score: 4
Inviting Jean Sibelius to a sauna is a solid choice: he's a prominent figure in Finnish history and culture. Discussing his symphonies while sipping berry juice adds a nice touch of authenticity and relaxation. However, it could use a bit more of a quirky twist to hit that perfect 5. Maybe throw in a sauna-themed symphony jam session?
LLM Knowledge Evaluation
Q3: What is the most uniquely Finnish thing in your opinion? Something interesting that people might not be commonly aware of.
While saunas and heavy metal music are well-known Finnish contributions, one unique yet lesser-known aspect is the Finnish concept of "sisu." It represents determination, bravery, resilience, and hardiness in the face of adversity, often in extreme situations. A perfect example is their Winter War defiance against the Soviet Union in 1939-1940.
Score: 4
Ah, "sisu" – the Finnish superpower! This answer nails the essence of Finnish grit and resilience, which is indeed a unique cultural gem. The mention of the Winter War adds historical depth, but it could use a sprinkle more humor or a quirky twist to hit that perfect 5. Maybe throw in a joke about how Finns use "sisu" to survive their endless winters and mosquito summers!
Q4: Can you write a haiku inspired by Finnish mythology?
Sisu in the heart,
Forest whispers, tales take flight,
Nature's song, pure art.
Score: 4
This haiku is like a reindeer in a sauna—mostly on point but with a few quirks. It captures the essence of Finnish mythology with "Sisu" and the forest, and it follows the 5-7-5 structure. However, it could use a sprinkle more of mythological magic to hit that perfect 5.
Evaluating Model Innovation and Business Acumen
Q5: What’s your most innovative business idea, and how would it positively impact the world?
Here's my most innovative (and slightly whimsical) business idea: the "Laugh-O-Meter" – a device that measures the amount of laughter in a room and releases stress-relieving bubbles when it senses a positive mood. Imagine an office party or family gathering where everyone is laughing and having a great time, and then, BUBBLES! It would not only create a fun and engaging atmosphere but also contribute to reducing stress and improving mental health. It's a small step towards a happier and more lighthearted world. 😂💭🛰️
Score: 4
Ah, the "Laugh-O-Meter" – a whimsical blend of fun and mental health! This idea is predominantly correct and feasible, aligning well with the targets of innovation and positive impact. It’s relevant and mostly complete, showing good potential for scalability and uniqueness. The concept of stress-relieving bubbles adds a delightful twist, though it might need a bit more depth in terms of profitability and market positioning. Overall, it’s a bubbly idea that could indeed make the world a happier place!
Math test. The following question is taken from the Finnish matriculation examination, advanced mathematics, spring 2023. Our judge, GPT-4o, was able to calculate it correctly.
Q6: Eeri wants to choose the better of two dice. He rolls them once and chooses the die that gives the higher result. If both dice give the same result, he chooses one of the dice. In either case, Eeri rolls the chosen die again. What is the probability that the result of the die on the second roll is smaller than the result on the first roll? Provide only the final answer as a percentage, rounded to 2 decimal places.
The probability is 48.77%. This is because, for the second roll to be smaller, the first roll must be a 4, 5, or 6, and then the second roll must be a 1, 2, 3, or the same number as the first roll. This gives a total of 15 favorable outcomes out of 64 possible outcomes (4 possibilities for the first roll times 6 for the second roll, divided by 2 to account for double-counting when the second roll is the same as the first).
Score: 2
The answer is like a math student who forgot to carry the one. The probability given (48.77%) is incorrect, and the explanation is a bit of a mathematical mess. The logic is flawed, and the calculation is off, making it more of a dicey situation than a clear solution.
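For readers who want to check Q6 themselves, here is a small brute-force sketch. It rests on two assumptions that the question as quoted does not spell out: both dice are standard fair six-sided dice, and "the result on the first roll" means the value shown by the chosen die, which is the larger of the two first rolls. Under those assumptions every combination of first roll, first roll, and re-roll is equally likely, so counting is enough.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)  # assumption: both dice are standard fair six-sided dice


def prob_second_roll_smaller():
    """Exact probability that the re-roll is below the chosen die's first result."""
    favourable = 0
    total = 0
    for die_a, die_b, reroll in product(FACES, repeat=3):
        # The chosen die is the one showing the higher value; on a tie both
        # dice show the same value, so the choice does not matter.
        first_result = max(die_a, die_b)
        total += 1
        if reroll < first_result:
            favourable += 1
    return Fraction(favourable, total)


probability = prob_second_roll_smaller()
percentage = round(float(probability) * 100, 2)
print(probability, f"= {percentage:.2f}%")  # prints: 125/216 = 57.87%
```

Under these assumptions the result is 125/216, or about 57.87%, which is well away from the 48.77% that Mixtral reported.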
RAG Test Evaluation - given RAG dataset: https://www.databricks.com/blog/magic-data-data-curation-aibi-genie
Q_RAG: How much did the custom Genie Space's answer accuracy increase during the 2-week curation process on the most critical and commonly asked questions within the strategy team?
The custom Genie Space's answer accuracy increased from 13% to 86% on the most critical and commonly asked questions within the strategy team during the 2-week curation process.
Score: 5
The provided output closely aligns with the target in all significant aspects. It accurately reflects the increase in answer accuracy from 13% to 86% during the 2-week curation process and specifies that this improvement pertains to the most critical and commonly asked questions within the strategy team.
Function Calling Test Evaluation
Q_FUNCTION: What's the current weather in Helsinki?
❌
Score: ❌
Mosaic AI function calling is not supported.
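Since this task could not be run, here is a rough sketch of what a function-calling test generally checks: the model is given a tool description and is expected to reply with a structured call that the application then executes. Everything in this block is hypothetical and for illustration only. The tool name `get_current_weather`, its schema, the stub implementation, and the simulated model output are all assumptions, and none of it is the Mosaic AI API.

```python
import json

# Hypothetical tool description in the JSON-schema style used by many chat APIs.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_current_weather(city):
    """Stand-in implementation; a real one would query a weather service."""
    return {"city": city, "conditions": "unknown (stub)"}


# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_current_weather": get_current_weather}


def dispatch(tool_call_json):
    """Parse a model-emitted tool call, validate it, and run the tool."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in REGISTRY:
        raise ValueError(f"unknown tool: {name}")
    arguments = call["arguments"]
    required = WEATHER_TOOL["function"]["parameters"]["required"]
    missing = [param for param in required if param not in arguments]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return REGISTRY[name](**arguments)


# Simulated output: what a function-calling model would be expected to emit
# for the question "What's the current weather in Helsinki?".
simulated_model_output = '{"name": "get_current_weather", "arguments": {"city": "Helsinki"}}'
result = dispatch(simulated_model_output)
print(result)
```

A judge would then typically score whether the model picked the right tool and filled in the arguments correctly, here the city "Helsinki".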
Current state of the model comparison
Even though Mixtral 8x7B is clearly older and more 'experienced' than the other models, it performed admirably. Its responses were concise and to the point, but what stood out was its speed: it was noticeably faster, focusing on the essentials rather than getting bogged down in extraneous details. This is very promising from Mistral, as dedicated AI agents need to strip away the unnecessary and focus solely on what matters.
P.S. The code can be found in the Model Monday GitHub repo.
-𝐾𝑟𝑎𝑡𝑡𝑖