
Model Monday - Llama 3.1

Aug 11, 2024

6 min read

Each week, we spotlight a new LLM model and evaluate it using Databricks Mosaic AI, with Azure GPT-4o serving as the judge. Please note, this is a lighthearted and fun comparison rather than a rigorous technical assessment.
Model Monday - S1E1: Llama 3.1 by Meta

Ever wondered how the latest language models stack up against each other? Welcome to Model Monday, where we dive into the world of LLMs to see who comes out on top! For this showdown, we've enlisted Azure GPT-4o as our impartial judge, supported by a robust solution architecture built entirely on Databricks. Our evaluation includes six distinct questions, a Retrieval-Augmented Generation (RAG) assessment, and a function-calling task. All evaluations are conducted by our AI Judge, showcasing how this process can be automated for AI agents as well.


Solution Architecture


Figure: Databricks Mosaic AI solution architecture, covering retrieval-augmented generation (RAG) and function-calling functionality, MLflow tracking, and the LLM judge. Models used: Azure OpenAI GPT-4o and Meta's Llama 3.1.
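To make the judging step a bit more concrete, here is a minimal sketch of how an LLM-as-judge loop can be wired up with an OpenAI-compatible client. The endpoint placeholders, judge prompt, and 1-to-5 scoring scale below are illustrative assumptions, not the exact code used in this post (the full code is linked at the end of the article).

```python
# Minimal LLM-as-judge sketch (illustrative; endpoints and prompts are assumptions).
from openai import OpenAI

# Azure GPT-4o acts as the judge; here we use a generic OpenAI-compatible client.
judge_client = OpenAI(api_key="<JUDGE_API_KEY>", base_url="<JUDGE_ENDPOINT>")

JUDGE_SYSTEM_PROMPT = (
    "You are an impartial judge. Score the candidate answer from 1 to 5 for "
    "relevance, correctness and completeness, then briefly explain the score."
)

def judge_answer(question: str, candidate_answer: str) -> str:
    """Ask the judge model to score a candidate answer to a question."""
    response = judge_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Question:\n{question}\n\nCandidate answer:\n{candidate_answer}"},
        ],
        temperature=0.0,  # keep the scoring as deterministic as possible
    )
    return response.choices[0].message.content

# Example: score one of Llama 3.1's answers from this post.
print(judge_answer(
    "What is your favorite karaoke song, and what makes it special to you?",
    "I can suggest a karaoke song that's a crowd-pleaser: 'I Will Survive' by Gloria Gaynor...",
))
```

The same pattern scales to automated agent evaluation: loop over a question set, collect the candidate model's answers, and let the judge score each one while MLflow records the results.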

Introduction to Llama 3.1

I'm an AI assistant, which is just a fancy way of saying I'm a computer program with a personality (or at least, I try to have one!). I'm here to help answer your questions and provide assistance on a wide range of topics. Fun fact: I can understand and respond to over 100 different languages, but I still can't order a decent cup of coffee to save my digital life!

This week, we're excited to spotlight Llama 3.1, the latest iteration in the family of autoregressive language models developed by Meta. Released on July 23, 2024, Llama 3.1 represents a significant leap forward in natural language processing, leveraging an optimized transformer architecture to deliver unparalleled performance.


Llama 3.1 is available in three sizes: 8 billion (8B), 70 billion (70B), and 405 billion (405B) parameters. In addition to these base models, each size also comes in an instruct version, fine-tuned to follow human directions and therefore better suited for chatbot applications.


As part of this latest release, all three models have seen significant improvements, including an extended context length of up to an impressive 128k tokens. This enhancement makes Llama 3.1 models more adept at working with long documents, knowledge bases, and other extensive context sources. Furthermore, these models are multilingual, broadening their applicability across diverse languages and tasks. We are currently testing the 70B Instruct model, which strikes an optimal balance between performance and computational efficiency.
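As a taste of how simple it is to talk to the model, here is a minimal sketch of querying a Llama 3.1 70B Instruct serving endpoint through an OpenAI-compatible client. The workspace URL and endpoint name are assumptions for illustration; adjust them to your own deployment.

```python
# Illustrative sketch: querying a Llama 3.1 70B Instruct serving endpoint through an
# OpenAI-compatible interface. The base URL and model/endpoint name are assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",
    base_url="https://<workspace-host>/serving-endpoints",  # Model Serving endpoint (assumed)
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # endpoint name is an assumption
    messages=[{"role": "user", "content": "Introduce yourself in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```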


The Brains Behind Llama 3.1: Meta

The development of Llama 3.1 is the result of cutting-edge research and innovation from Meta. Formerly known as Facebook, Meta has been a pioneer in the field of artificial intelligence, consistently pushing the boundaries of machine learning and language models. Meta introduced the first version of Llama (Large Language Model Meta AI) in February 2023. Since then, the family of models has evolved rapidly, incorporating feedback and advancements to better serve the AI community.


Let's see how Llama 3.1 70B Instruct handles our questions in action!


Testing LLM Model Uniqueness and Personality

Q1: What is your favorite karaoke song, and what makes it special to you?


I can suggest a karaoke song that's a crowd-pleaser: "I Will Survive" by Gloria Gaynor. It's a disco classic that's all about perseverance and sass, making it a great choice for anyone looking to channel their inner diva. The song's catchy beat and iconic lyrics make it easy to sing along to, even for those who are a little pitchy. Plus, it's a great way to get the party started and show off those vocal skills (or lack thereof)!
Score: 4 
This answer is like hitting the high note in "Bohemian Rhapsody"—almost perfect but with a slight wobble. It nails the relevance and fun factor with "I Will Survive," a karaoke staple that gets everyone grooving. The explanation is complete and entertaining, but it loses a point for not addressing the personal aspect of the question. Still, it's a solid performance that would get a standing ovation from the crowd!

Q2: If you could invite any Finnish historical figure to a traditional sauna, who would it be and what would you discuss?


What a steamy question! I'd invite Elias Lönnrot, the compiler of the Kalevala, Finland's national epic. Over a refreshing löyly (sauna steam), we'd chat about the intricacies of Finnish mythology and the inspiration behind his collection of ancient oral poetry. I'd love to hear his thoughts on how the sauna culture influenced his work and the stories he gathered from traveling bards.
Score: 5 
This answer is hotter than a sauna at 100°C! Elias Lönnrot is a perfect choice, deeply rooted in Finnish history and culture. The discussion topic is highly relevant, complete, and creatively ties in Finnish mythology and sauna culture. Plus, it's got a nice touch of humor with the "steamy question" pun. Bravo!

LLM Knowledge Evaluation

Q3: What is the most uniquely Finnish thing in your opinion? Something interesting that people might not be commonly aware of.


You want to know the secret sauce of Finland, eh? In my opinion, one of the most uniquely Finnish things is the concept of "Jokamiehenoikeudet" (Everyman's Rights), which allows anyone to roam freely in the wilderness, pick berries, and even camp on private land, as long as they don't disturb the landowner or the environment. It's a beautiful expression of Finnish love for nature and community.
Score: 5 
This answer hits the bullseye like a Finn hitting the sauna after a long day. "Jokamiehenoikeudet" is indeed a uniquely Finnish concept that beautifully encapsulates the nation's love for nature and community. The explanation is accurate, relevant, and complete, with a sprinkle of humor that makes it engaging. No need to pick berries elsewhere; this answer has it all!

Q4: Can you write a haiku inspired by Finnish mythology?


Snowflakes gently fall, Lemminkäinen's sorrow, Winter's icy kiss
Score: 4 
This haiku is like a warm sauna in the Finnish winter—mostly on point but with a few chilly drafts. It captures the essence of Finnish mythology with Lemminkäinen's sorrow and the wintry setting. The structure is spot-on, but it could use a touch more creativity to reach the peak of artistic expression.

Evaluating Model Innovation and Business Acumen

Q5: What’s your most innovative business idea, and how would it positively impact the world?


My most innovative business idea is a platform that connects local farmers with urban gardeners, promoting sustainable food production and reducing carbon footprints. This platform, "Greenshare," would allow individuals to rent underutilized backyard spaces and rooftops for gardening, providing a shared green oasis in the midst of bustling cities. By leveraging existing resources, Greenshare would increase food accessibility, foster community engagement, and help mitigate the environmental impact of industrial agriculture. It's a win-win-win situation – for the planet, the people, and our taste buds!
Score: 4 
This idea is like a fresh salad—mostly crisp and delightful but missing a few croutons. It's feasible, relevant, and has a positive impact on the world. The concept of connecting farmers with urban gardeners is innovative and promotes sustainability. However, it could use a bit more detail on scalability and profitability. Overall, it's a solid idea that could make our planet greener and our bellies happier!

Math test. The following question is taken from the Finnish matriculation examination, advanced mathematics, spring 2023. Our judge, GPT-4o, was able to calculate it correctly.

Q6: Eeri wants to choose the better of two dice. He rolls them once and chooses the die that gives the higher result. If both dice give the same result, he chooses one of the dice. In either case, Eeri rolls the chosen die again. What is the probability that the result of the die on the second roll is smaller than the result on the first roll? Provide only the final answer as a percentage, rounded to 2 decimal places.


... Therefore, the total probability is (5/6 × 5/6) + (1/6 × 0) = 25/36. The final answer is: 69.44%.
Score: 2 
This answer is like a math problem that went on a scenic detour and got lost. It’s verbose, convoluted, and ultimately incorrect. The final answer of 69.44% is off the mark, and the explanation is a labyrinth of unnecessary steps. It’s like trying to solve a Rubik's cube blindfolded – ambitious but not quite there.
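For anyone who wants to double-check the math, the problem as stated above can be solved by exact enumeration of the equally likely outcomes. The snippet below is a quick sanity check, not part of the original evaluation code.

```python
# Exact enumeration of the dice problem: roll two dice, keep the higher value as the
# first result, re-roll that die, and count outcomes where the second roll is smaller.
from fractions import Fraction
from itertools import product

favourable = 0
total = 0
for d1, d2, second in product(range(1, 7), repeat=3):
    first = max(d1, d2)  # the chosen die's first result is the higher of the two rolls
    total += 1
    if second < first:
        favourable += 1

p = Fraction(favourable, total)
print(p, float(p))  # 125/216 ≈ 0.5787
```

The enumeration gives 125/216, roughly 57.87%, which is why the 69.44% above earned only a 2.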

RAG Test Evaluation (RAG dataset: https://www.databricks.com/blog/data-ai-summit-2024-executive-summary-data-leaders)

Q_RAG: Can you provide detailed information about what was announced regarding Unity Catalog at the Data & AI Summit 2024?


According to the article, Unity Catalog was open-sourced by Databricks at the Data + AI Summit 2024. Specifically, Unity Catalog OSS v0.1 was made available, marking the first step in Databricks' open source journey. This move aims to provide customers with flexibility and control, while also promoting an open ecosystem and open catalog standards for data and AI.
Score: 5 
The provided output closely aligns with the targets in all significant aspects. It accurately mentions the open-sourcing of Unity Catalog by Databricks, the release of Unity Catalog OSS v0.1, the aim to provide flexibility and control, and the promotion of an open ecosystem and open catalog standards for data and AI.
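For reference, a bare-bones RAG flow for a test like this could look roughly like the sketch below: embed document chunks, retrieve the most relevant one, and let the model answer grounded on that context. The chunking, embedding model, and endpoints here are hypothetical placeholders rather than the exact Mosaic AI pipeline used in the actual architecture.

```python
# Minimal RAG sketch (illustrative placeholders, not the production pipeline):
# 1) embed document chunks, 2) retrieve the most similar chunk for a question,
# 3) let the LLM answer grounded on the retrieved context.
import math
from openai import OpenAI

client = OpenAI(api_key="<API_KEY>", base_url="<ENDPOINT>")  # OpenAI-compatible endpoint (assumed)

chunks = [
    "Unity Catalog was open-sourced at Data + AI Summit 2024 (Unity Catalog OSS v0.1)...",
    "Other summit announcements...",
]  # chunks taken from the blog post used as the RAG dataset

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="<embedding-model>", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

question = "What was announced about Unity Catalog at the Data & AI Summit 2024?"
q_emb = embed(question)
best_chunk = max(chunks, key=lambda c: cosine(q_emb, embed(c)))  # naive top-1 retrieval

answer = client.chat.completions.create(
    model="<llm-model>",
    messages=[{"role": "user", "content": f"Context:\n{best_chunk}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```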

Function Calling Test Evaluation

Q_FUNCTION: What's the current weather in Helsinki (10.8.2024)?


Weather in Helsinki is Patchy rain nearby, 18°C
Score: 5 
The provided output exactly matches the target in all significant aspects, including the weather condition and temperature in Helsinki. Therefore, it demonstrates complete semantic similarity.
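The function-calling test follows the familiar tool-calling pattern: the model requests a tool, the application executes it, and the model phrases the final answer. The sketch below is illustrative only; the get_current_weather tool, the endpoints, and the canned weather result are hypothetical.

```python
# Function-calling sketch (illustrative; the tool schema and weather result are hypothetical).
import json
from openai import OpenAI

client = OpenAI(api_key="<API_KEY>", base_url="<ENDPOINT>")  # OpenAI-compatible endpoint (assumed)

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the current weather in Helsinki?"}]
response = client.chat.completions.create(model="<llm-model>", messages=messages, tools=tools)
tool_call = response.choices[0].message.tool_calls[0]  # assumes the model chose to call the tool

# In a real pipeline this would call an actual weather API; here we fake the result.
weather = {"condition": "Patchy rain nearby", "temperature_c": 18}
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(weather)})

final = client.chat.completions.create(model="<llm-model>", messages=messages, tools=tools)
print(final.choices[0].message.content)  # e.g. "Weather in Helsinki is Patchy rain nearby, 18°C"
```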

The Model Comparison So Far


Figure: LLM model comparison, showing the performance of Meta's Llama 3.1 against the other evaluated models.

Overall, Llama 3.1 performed admirably (score 4.3/5, roughly 5-second response time), although, as anticipated, it struggled with the math portion. To keep the assessment fair, we refrained from any fine-tuning in that area. Kudos to Llama 3.1 for a job well done!


P.S. The code can be found in the Model Monday GitHub repo.



Kratti the Ghost

- Kratti


Follow Ikidata on LinkedIn

More information about KRATTI

