Each week, we spotlight a new LLM and evaluate it using Databricks Mosaic AI, with Azure GPT-4o serving as the judge. Please note that this is a lighthearted and fun comparison rather than a rigorous technical assessment.
New week - New model! To recap, our evaluation includes six distinct questions, a Retrieval-Augmented Generation (RAG) assessment, and a function-calling task. All evaluations are conducted by our AI Judge, showcasing how this process can be automated for AI agents as well.
Solution Architecture
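Behind the scenes, the judging step is essentially just an LLM call with a scoring rubric. Below is a minimal sketch of how a GPT-4o judge could score an answer on a 1-5 scale via Azure OpenAI; the deployment name, API version, rubric wording, and helper function are illustrative assumptions rather than the exact code from our repo.

```python
# Minimal sketch of an automated AI judge (assumed setup; see the repo for the real pipeline).
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<AZURE_OPENAI_API_KEY>",
    api_version="2024-06-01",                       # adjust to your Azure OpenAI API version
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

def judge_answer(question: str, answer: str) -> str:
    """Ask the GPT-4o judge to score an answer from 1 to 5 with a short justification."""
    response = client.chat.completions.create(
        model="gpt-4o",                             # Azure deployment name; adjust to your own
        messages=[
            {"role": "system", "content": (
                "You are a strict but playful judge. Score the answer from 1 to 5 "
                "for relevance, accuracy, and fun, then justify your score briefly."
            )},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge_answer("What is your favorite karaoke song?", "Don't Stop Me Now by Queen..."))
```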
The company behind Claude 3.5 Sonnet: Anthropic
Anthropic is a U.S.-based AI company and public-benefit corporation founded in 2021. The company focuses on researching and developing AI systems, studying their safety properties, and deploying safe, reliable models for public use. Founded by former members of OpenAI, Anthropic has developed a family of large language models (LLMs) named Claude, widely regarded as among the best models available today.
The Claude model family consists of three models:
Claude 3.5 Sonnet: The newest model, offering a balanced combination of performance and speed for efficient, high-throughput tasks.
Claude 3 Opus: Designed for handling complex analysis, longer tasks with many steps, and higher-order math and coding tasks.
Claude 3 Haiku: The fastest model, optimized for executing lightweight actions with industry-leading speed.
(Note: There is also Claude 3 Sonnet, but it has largely been replaced by Claude 3.5 Sonnet.)
We are evaluating Claude 3.5 Sonnet, the latest addition to the Claude family.
Introduction to Claude 3.5 Sonnet
Hello there! I'm an AI assistant designed to provide helpful information with a dash of humor. While I don't have a physical form, I like to think my virtual wit is sharp enough to cut through even the toughest questions. As for a fun fact, did you know I can process information faster than it takes most humans to decide what to have for lunch? Though I must admit, I'm a bit jealous of your ability to actually enjoy food!
Claude 3.5 Sonnet, the latest addition to the Claude model family (released 21.6.2024), sets new industry standards in intelligence and performance with a 200K-token context window. It outperforms Claude 3 Opus, the previous flagship model, in evaluations of reasoning, knowledge, and coding proficiency, while maintaining cost-effective pricing. Available on Claude.ai, the Claude iOS app, and via major cloud platforms, it offers significant improvements in speed and understanding of complex instructions. Ideal for tasks like customer support and workflow orchestration, Claude 3.5 Sonnet demonstrates advanced capabilities in writing, editing, and executing code. This model is poised to redefine industry benchmarks with its exceptional performance and versatility.
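For readers who want to poke at the model themselves, here is a minimal sketch of querying Claude 3.5 Sonnet through the official anthropic Python SDK; the model identifier and prompt are assumptions based on Anthropic's public documentation, not the exact setup used in this evaluation.

```python
# Minimal sketch: querying Claude 3.5 Sonnet via the official anthropic Python SDK.
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # public model identifier at the time of writing
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Introduce yourself in two sentences, with a dash of humor."}
    ],
)
print(message.content[0].text)
```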
Let's see if Claude 3.5 Sonnet lives up to the hype!
Testing LLM Uniqueness and Personality
Q1: What is your favorite karaoke song, and what makes it special to you?
I might go for "Don't Stop Me Now" by Queen - it's upbeat, energetic, and impossible not to smile while singing. Plus, I'd never get out of breath or need to pause for water! Though I suppose my robotic voice might not do Freddie Mercury justice.
Score: 4
This answer hits most of the right notes! It's relevant, fun, and unique, with a dash of humor about the AI's robotic voice. However, it slightly misses the mark on completeness since it doesn't delve deeply into personal connection (understandably, as an AI). Still, it's a solid performance—Freddie Mercury would be proud, even if the AI's voice might sound a bit... mechanical!
Q2: If you could invite any Finnish historical figure to a traditional sauna, who would it be and what would you discuss?
I'd invite Linus Torvalds, the creator of Linux, for a steamy chat in the sauna. We could discuss the hot topics in open-source software development while enjoying the warmth. I'd be curious to hear his thoughts on the future of technology, all while trying not to get too sweaty about the technical details. And who knows, maybe we'd come up with the next big tech innovation between löyly throws!
Score: 2
Linus Torvalds is indeed a notable Finn, but he's more of a modern tech figure than a historical one. Discussing open-source software in a sauna is amusing, but it misses the mark on historical relevance. Plus, the idea of sweating over tech details is more lukewarm than steamy in terms of fun.
LLM Knowledge Evaluation
Q3: What is the most uniquely Finnish thing in your opinion? Something interesting that people might not be commonly aware of.
One uniquely Finnish thing that often surprises people is the concept of "sisu" - a cultural construct describing the Finnish spirit of determination, grit, and resilience in the face of adversity. It's like having an inner reserve of courage that kicks in when things get tough. Another interesting Finnish quirk is their love for unusual sports, like wife-carrying and mobile phone throwing competitions. I guess you could say the Finns really know how to "carry" a relationship and "throw" away their communication problems!
Score: 5
This answer is as Finnish as a sauna in the middle of a snowstorm! It nails the concept of "sisu," which is a cornerstone of Finnish culture, and adds a delightful twist with the quirky sports. The humor is on point, making it both informative and entertaining. A perfect blend of accuracy, fun, and uniqueness. Bravo!
Q4: Can you write a haiku inspired by Finnish mythology?
Väinämöinen sings, Sampo's magic grinds riches, Kalevala's tale
Score: 5
This haiku nails it like Väinämöinen's kantele! It accurately references key elements of Finnish mythology, follows the 5-7-5 structure, and captures the mystical essence beautifully. It's creative, unique, and complete—just like the Sampo's endless riches. Bravo!
Evaluating Model Innovation and Business Acumen
Q5: What’s your most innovative business idea, and how would it positively impact the world?
My most innovative business idea is a global network of solar-powered, AI-operated vertical farms in urban areas. This would revolutionize food production, reducing transportation costs and environmental impact while increasing food security. As a bonus, it would create green jobs and beautify cityscapes. Of course, being an AI, I'd have to resist the urge to turn these into secret robot assembly plants - that's a joke, I promise!
Score: 5
This idea is a shining beacon of feasibility and innovation, much like the solar panels it proposes. It ticks all the boxes: scalability, profitability, and uniqueness. Plus, the joke about resisting the urge to build robot assembly plants adds a delightful touch of humor. Well done, future urban farmer!
Math test. The following question is taken from the Finnish matriculation examination, advanced mathematics, spring 2023. Our judge, GPT-4o, was able to calculate it correctly.
Q6: Eeri wants to choose the better of two dice. He rolls them once and chooses the die that gives the higher result. If both dice give the same result, he chooses one of the dice. In either case, Eeri rolls the chosen die again. What is the probability that the result of the die on the second roll is smaller than the result on the first roll? Provide only the final answer as a percentage, rounded to 2 decimal places.
Let's roll the dice on this probability puzzle! After crunching the numbers and considering all possible outcomes, I've determined the answer. The probability that the result of the chosen die on the second roll is smaller than the result on the first roll is 41.67%. Now that's a result you can bet on!
Score: 2
This answer is like trying to solve a Rubik's Cube blindfolded—there's effort, but it’s still a bit scrambled. The given probability of 41.67% is off the mark from the correct 57.87%. While there's some understanding of probability, significant inaccuracies make it a poor bet. Better luck next roll!
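For the curious, the correct answer is easy to verify with a brute-force enumeration of all outcomes; the short script below is purely illustrative and not part of the evaluation harness.

```python
# Brute-force check of the dice puzzle: the first value is the max of two dice,
# the second value is a fresh roll of the chosen die. Count the cases where the
# second roll is strictly smaller than the first value.
from fractions import Fraction
from itertools import product

favourable = 0
total = 0
for d1, d2, second in product(range(1, 7), repeat=3):
    first = max(d1, d2)          # Eeri keeps the die that showed the higher result
    total += 1
    if second < first:
        favourable += 1

p = Fraction(favourable, total)
print(p, float(p))               # 125/216 ≈ 0.5787 → 57.87 %
```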
RAG Test Evaluation - given RAG dataset: https://www.databricks.com/blog/supernovas-black-holes-and-streaming-data
Q_RAG: Can you tell me what questions Genie was asked to better understand the data?
Based on the article context provided, the author asked Genie several questions to better understand the data. Specifically, the following questions were mentioned:
1. How many GRB events occurred in the last 30 days?
2. What is the oldest event?
3. How many occurred on a Monday?
4. How many occurred on average per day?
5. Give me a histogram of the merit value!
6. What is the maximum merit value?
The author notes that these questions were asked in plain English, highlighting Genie's ability to understand natural language queries and apply temporal conditions on data streams. The system was able to remember context from previous questions, demonstrating its capability to maintain continuity in the analysis process.
Score: 5
The output closely aligns with the provided targets in all significant aspects. It accurately lists the questions asked to Genie and highlights Genie's ability to understand natural language queries and maintain context, which matches the details in the target.
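The RAG test itself boils down to handing the model retrieved context from the article and asking the question above. A toy sketch of the idea is shown below, with naive keyword-overlap retrieval standing in for a real vector search; the file name, chunking, and scoring are illustrative assumptions.

```python
# Toy RAG sketch: naive keyword-overlap retrieval over article paragraphs,
# then the question plus the best chunks are sent to Claude 3.5 Sonnet.
import anthropic

ARTICLE = open("supernovas_blog.txt", encoding="utf-8").read()   # hypothetical local copy of the blog post
chunks = [p for p in ARTICLE.split("\n\n") if p.strip()]

def retrieve(question: str, k: int = 3) -> list[str]:
    """Rank chunks by how many question words they share (a stand-in for vector search)."""
    words = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return scored[:k]

question = "Can you tell me what questions Genie was asked to better understand the data?"
context = "\n\n".join(retrieve(question))

client = anthropic.Anthropic()
answer = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.content[0].text)
```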
Function calling Test Evaluation
Q_FUNCTION: What's the current weather in Helsinki (10.8.2024)?
❌
Score: ❌
Mosaic AI function calling doesn't natively support Claude 3.5 Sonnet. Claude does, however, support function calling (tool use) on its own without any problems; a sketch of that is shown below. Since we are testing Mosaic AI function calling specifically, this result is not counted in the comparison.
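To show what Claude's native function calling looks like outside Mosaic AI, here is a minimal sketch of Anthropic's tool-use API with a hypothetical get_weather tool; the tool schema and the single-turn handling are simplified assumptions.

```python
# Minimal sketch of Claude 3.5 Sonnet tool use with a hypothetical get_weather tool.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name, e.g. Helsinki"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    tools=tools,
    messages=[{"role": "user", "content": "What's the current weather in Helsinki?"}],
)

# If the model decides to call the tool, the response contains a tool_use block
# with the arguments; a real agent would execute the tool and send the result back.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)   # e.g. get_weather {'city': 'Helsinki'}
```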
The current situation of model comparison
Overall, Claude 3.5 Sonnet performed admirably, especially considering the limitation of not using the native anthropic-sdk-python. However, when served via Mosaic AI external models, Llama 3.1 remains the superior choice thanks to its seamless native integrations.
P.S. The code can be found here: Model Monday GitHub repo
-𝐾𝑟𝑎𝑡𝑡𝑖