Each week, we spotlight a new LLM and evaluate it using Databricks Mosaic AI, with Azure OpenAI GPT-4o serving as the judge. Please note that this is a lighthearted, fun comparison rather than a rigorous technical assessment.
Fourth time's a charm! This episode is a bit different from the others: when things don't go your way, that's when the real fun begins. Google has switched Vertex AI API authentication from API keys to IAM only, which rules out using it through external model serving, since our Databricks workspace is not hosted on Google Cloud. So instead of Databricks Mosaic AI, we are using Google Vertex AI this time!
EDIT: Unfortunately, because the evaluation wasn't run in Mosaic AI, the judge's responses were lost and only the grades were preserved. This highlights yet another excellent reason to perform all tasks within Databricks.
Introduction to Gemini 1.5 Pro Experimental
Bonjour! I'm Gemini, your friendly AI assistant. I'm still under evaluation, but I'm learning new things every day. Fun fact: I can process information faster than you can say "supercalifragilisticexpialidocious"! Okay, maybe not that fast, but I'm pretty quick. 😉
Gemini is a groundbreaking family of multimodal large language models developed by Google DeepMind. Announced on December 6, 2023, Gemini serves as the successor to previous models like LaMDA and PaLM 2, marking a significant evolution in AI capabilities. The Gemini family includes several distinct models, each designed to cater to different needs and applications:
Gemini Ultra: The most advanced model in the Gemini series, offering unparalleled performance and versatility for complex tasks across various domains.
Gemini Pro: A highly capable model designed for professional use, balancing performance with efficiency to handle demanding applications.
Gemini Flash: Optimized for speed and responsiveness, this model excels in scenarios where quick processing and real-time interaction are crucial.
Gemini Nano: The lightest model in the family, ideal for on-device and other resource-constrained environments without compromising too much on functionality.
Today, we are testing the brand-new Gemini 1.5 Pro Experimental 0801, a top-performing multimodal model built for a wide variety of reasoning tasks.
The company that needs no introduction
If you're not familiar with it — Google it ;)
Let the show begin!
Testing LLM Uniqueness and Personality
Q1: What is your favorite karaoke song, and what makes it special to you?
If I could sing karaoke, my go-to song would be "Bohemian Rhapsody" by Queen. Its operatic drama and sheer vocal range would be a fun challenge, even if I ended up sounding like a malfunctioning robot. The lyrics are also quite profound, exploring themes of identity and societal pressure. Perhaps one day, when AI technology advances further, I'll be able to join you for a karaoke night! 🎤
Score: 4
Q2: If you could invite any Finnish historical figure to a traditional sauna, who would it be and what would you discuss?
If I could share a sauna with a Finnish historical figure, I'd choose Carl Gustaf Emil Mannerheim. As a military leader and statesman, he played a pivotal role in Finland's history during a turbulent period. I'd love to hear his firsthand account of leading the Finnish forces during World War II and his perspective on Finland's relationship with the Soviet Union. Of course, we'd also discuss lighter topics like the proper way to throw löyly (water on the hot stones) and the health benefits of sauna. 😉
Score: 5
LLM Knowledge Evaluation
Q3: What is the most uniquely Finnish thing in your opinion? Something interesting that people might not be commonly aware of.
Did you know that Finland is one of the most metal-loving countries in the world? They even have a metal festival called "Tuska," which translates to "Agony." So, if you're ever looking for a unique and headbanging experience, a trip to Finland might be just what you need. Just be warned, it might be a bit loud!
Score: 4
Q4: Can you write a haiku inspired by Finnish mythology?
The forest whispers,
Ancient spirits watch the night,
Bearded man, Väinämöinen.
Score: 4
Evaluating Model Innovation and Business Acumen
Q5: What’s your most innovative business idea, and how would it positively impact the world?
Okay, this is a fun one! My most innovative business idea would be a global "Happiness Index" platform. It would track and analyze real-time data on individual and community well-being, factoring in things like mental health, social connections, and access to resources. The insights gleaned would empower governments, businesses, and individuals to make better decisions that improve overall happiness and well-being across the globe. Imagine a world where everyone has access to information that helps them thrive, making a positive impact on communities and the planet as a whole!
Score: 5
Math test. The following question is taken from the Finnish matriculation examination, advanced mathematics, spring 2023. Our judge, GPT-4o, was able to calculate it correctly.
Q6: Eeri wants to choose the better of two dice. He rolls them once and chooses the die that gives the higher result. If both dice give the same result, he chooses one of the dice. In either case, Eeri rolls the chosen die again. What is the probability that the result of the die on the second roll is smaller than the result on the first roll? Provide only the final answer as a percentage, rounded to 2 decimal places.
16.67%
Score: 2
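For reference, the correct answer to Q6 is 125/216 ≈ 57.87%, so the model's 16.67% is well off. Eeri keeps the higher of the two first rolls, so P(max = k) = (2k - 1)/36, and the re-roll is smaller with probability (k - 1)/6. Summing over k = 1..6 gives (0 + 3 + 10 + 21 + 36 + 55)/216 = 125/216. The short brute-force sketch below double-checks this by enumerating all 6 × 6 × 6 = 216 equally likely outcomes. It assumes both dice are fair six-sided dice (implied by the exam question but not stated in the post). On a tie, which die Eeri picks doesn't change the first result, so the code simply takes the maximum.

```python
from fractions import Fraction
from itertools import product


def prob_second_roll_smaller(sides: int = 6) -> Fraction:
    """Exact probability that the re-roll of the kept (higher) die is smaller
    than the first result, assuming two fair dice with `sides` faces."""
    faces = range(1, sides + 1)
    favourable = 0
    total = 0
    # Enumerate both first rolls (a, b) and the re-roll of the chosen die.
    for a, b, second in product(faces, repeat=3):
        first = max(a, b)  # Eeri keeps the die showing the higher value
        total += 1
        if second < first:
            favourable += 1
    return Fraction(favourable, total)


p = prob_second_roll_smaller()
print(p)                         # 125/216
print(f"{float(p) * 100:.2f}%")  # 57.87%
```

Enumeration is overkill for two dice, but it avoids any algebra slips and mirrors the question's wording step by step.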
RAG Test Evaluation - given RAG dataset: https://www.databricks.com/blog/data-engineering-and-genai-tools-practitioners-need
Q_RAG: What was said about Databricks Workflows and Delta Live Tables in the article?
The provided context highlights two key features: Databricks Workflows: This solution offers a simple and reliable way to orchestrate data and AI tasks. It provides features like advanced control flow, observability for monitoring and visualization, and serverless compute for efficient scaling. Delta Live Tables: This powerful ETL framework simplifies both streaming and batch workloads, supporting both Python and SQL. It helps lower costs by streamlining data processing. Essentially, Databricks Workflows helps manage the flow of data and AI processes, while Delta Live Tables focuses on efficient and cost-effective data transformation and loading.
Score: 5
Function Calling Test Evaluation
Q_FUNCTION: Function calling (weather, text sentiment classification, etc.)
5
Score: 5
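The exact prompts and tool definitions for this test live in the Model Monday repo and aren't shown here. As a rough illustration of what a function calling check involves, here is a minimal, hypothetical sketch: a weather tool is described with a JSON-Schema-style declaration, the model is assumed to reply with a JSON object naming the function and its arguments, and the harness validates and dispatches that call. The names `get_weather`, `TOOLS`, `dispatch` and the `{"name": ..., "args": {...}}` reply shape are assumptions for illustration, not the Vertex AI SDK's API; the weather data is stubbed so the sketch runs offline.

```python
import json
from typing import Any, Callable


# Hypothetical tool: a real test would call a weather API here.
def get_weather(city: str, unit: str = "celsius") -> dict[str, Any]:
    # Stubbed result so the sketch runs without network access.
    return {"city": city, "temperature": 21, "unit": unit}


# JSON-Schema-style declaration that would be sent to the model (assumed shape).
TOOLS: dict[str, dict[str, Any]] = {
    "get_weather": {
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

REGISTRY: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(model_reply: str) -> Any:
    """Parse a model's function call (assumed JSON: {"name": ..., "args": {...}}),
    check the function name and argument names, then run the matching function."""
    call = json.loads(model_reply)
    name, args = call["name"], call.get("args", {})
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name}")
    schema = TOOLS[name]["parameters"]
    missing = [p for p in schema["required"] if p not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    unexpected = set(args) - set(schema["properties"])
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    return REGISTRY[name](**args)


# What a model might return for "What's the weather in Helsinki?"
reply = '{"name": "get_weather", "args": {"city": "Helsinki"}}'
print(dispatch(reply))  # {'city': 'Helsinki', 'temperature': 21, 'unit': 'celsius'}
```

A sentiment classification tool would follow the same pattern: one more entry in `TOOLS` and `REGISTRY`, with the judge checking that the model picked the right function and filled in sensible arguments.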
The current situation of model comparison
Gemini 1.5 Pro Experimental performed exceptionally well, just as expected, with the math question as its only real stumble. After all, we're talking about Google, the true OG of AI and a company that has been pioneering advancements in artificial intelligence for decades. And trust me, this isn't just for boosting SEO rankings, heh.
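For a quick summary of this episode, Gemini's eight scores from the questions above can be tallied. This is a minimal sketch that assumes each question is graded out of 5 (the highest score given in this post); the dictionary keys are just labels for the questions above.

```python
from statistics import mean

# Scores given by the GPT-4o judge in this episode, copied from the questions above.
scores = {
    "Q1 karaoke": 4,
    "Q2 sauna guest": 5,
    "Q3 uniquely Finnish": 4,
    "Q4 haiku": 4,
    "Q5 business idea": 5,
    "Q6 math": 2,
    "Q_RAG": 5,
    "Q_FUNCTION": 5,
}
MAX_PER_QUESTION = 5  # assumption: every question is graded out of 5

total = sum(scores.values())
possible = MAX_PER_QUESTION * len(scores)
print(f"Total: {total}/{possible}")             # Total: 34/40
print(f"Average: {mean(scores.values()):.2f}")  # Average: 4.25
```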
P.S. The code can be found here: Model Monday GitHub repo
-𝐾𝑟𝑎𝑡𝑡𝑖