LLM Performance On M3 Max

Introduction

When Apple announced the M3 chip in the new MacBook Pro at their "Scary Fast" event in October, the first question a lot of us were asking was, "How fast can LLMs run locally on the M3 Max?" There has already been plenty of performance testing with the M2 Ultra in the Mac Studio, which is essentially two M2 Max chips fused together. Its large pool of unified memory made it powerful enough for many of these workloads, but this is the first time we get that much RAM (96 GB to 128 GB) in a MacBook Pro, letting us take those Mac Studio workloads on the road (or show them off in a coffee shop with the new Space Black color scheme).

In this blog post, we will focus on the performance of running LLMs locally and compare the tokens per second for each of the different models. We will stick to the default models pulled from Ollama and will not cover custom fine-tuned models or importing custom weights from PyTorch, even though Ollama supports those as well.

M3 Max LLM Testing Hardware

For this test, we are using the 14" MacBook Pro with the upgraded M3 Max chip and maximum RAM.

CPU: 16 cores (12 performance and 4 efficiency)
GPU: 40 cores
RAM: 128 GB
Storage: 1 TB

M3 Max Battery Settings

With the different power settings available and the Mac's ability to shift between performance states, we use the High Performance setting found under Battery settings to ensure no power or thermal throttling skews the results.
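As a quick sanity check, you can confirm the active energy mode from the terminal. This is a minimal sketch; the powermode field name and its values (0 = automatic, 1 = low power, 2 = high power) are an assumption and may vary across macOS versions and hardware:

% pmset -g custom | grep -i powermode
# Expect a value of 2 for both the AC Power and Battery Power sections when
# High Performance is selected for both power sources (assumption about exact output).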

Local LLMs with Ollama

Before we can explore the performance of Ollama on the M3 Max, we need to set it up. The process is simple and straightforward: download the Ollama application from the official website, follow the installation instructions, and you are ready to go.
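A quick way to confirm the install worked is to check the CLI version and list any models pulled so far:

% ollama --version   # prints the installed Ollama version
% ollama list        # lists locally downloaded models (empty on a fresh install)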

LLM Model Selection

Ollama out of the box allows you to run a blend of censored and uncensored models. To determine the tokens per second on the M3 Max chip, we will test each of the eight models listed on the Ollama GitHub page individually (a quick way to pull them all ahead of time is sketched after the table).

Model | Parameters | Size | Download
Mistral | 7B | 4.1GB | ollama run mistral
Llama 2 | 7B | 3.8GB | ollama run llama2
Code Llama | 7B | 3.8GB | ollama run codellama
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored
Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b
Llama 2 70B | 70B | 39GB | ollama run llama2:70b
Orca Mini | 3B | 1.9GB | ollama run orca-mini
Vicuna | 7B | 3.8GB | ollama run vicuna

List of models from the Ollama GitHub page.
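To keep download time out of the first timed run for each model, everything can be pulled up front. A minimal sketch using the tags from the table above:

# Pull each model ahead of time so download time does not affect the benchmarks
% for m in mistral llama2 codellama llama2-uncensored llama2:13b llama2:70b orca-mini vicuna; do ollama pull "$m"; done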

Running Mistral on M3 Max

Mistral is a 7B parameter model that is about 4.1 GB on disk. Run it locally via Ollama with the following command:

% ollama run mistral 
Mistral M3 Max Performance

Prompt eval rate comes in at 103 tokens/s. The eval rate of the response comes in at 65 tokens/s.
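The prompt eval rate and eval rate figures quoted throughout this post are the timing statistics Ollama prints when a run is started with the --verbose flag. A minimal sketch (the prompt is just an example, and the output below is abbreviated):

% ollama run mistral --verbose "Why is the sky blue?"
# ...response...
# prompt eval rate:     103 tokens/s
# eval rate:            65 tokens/s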

Running Llama 2 on M3 Max

Llama 2 is a 7B parameter model that is about 3.8 GB on disk. Run it locally via Ollama with the following command:

% ollama run llama2
Llama 2 M3 Max Performance

Prompt eval rate comes in at 124 tokens/s. The eval rate of the response comes in at 64 tokens/s.

Running Code Llama on M3 Max

Code Llama is a 7B parameter model tuned to generate software code and is about 3.8 GB on disk. Run it locally via Ollama with the following command:

% ollama run codellama
Code Llama M3 Max Performance

Prompt eval rate comes in at 140 tokens/s. The eval rate of the response comes in at 61 tokens/s.

Running Llama 2 Uncensored on M3 Max

Llama 2 Uncensored is a 7B parameter model that is about 3.8 GB on disk. Run it locally via Ollama with the following command:

% ollama run llama2-uncensored 
Llama 2 Uncensored M3 Max Performance

Prompt eval rate comes in at 192 tokens/s. The eval rate of the response comes in at 64 tokens/s.

Running Llama 2 13B on M3 Max

Llama 2 13B is a larger variant of Llama 2 with 13 billion parameters and is about 7.3 GB on disk. Run it locally via Ollama with the following command:

% ollama run llama2:13b 
Llama 2 13B M3 Max Performance

Prompt eval rate comes in at 17 tokens/s. The eval rate of the response comes in at 39 tokens/s.

Running Llama 2 70B on M3 Max

Llama 2 70B is the largest model in this test and is about 39 GB on disk. Run it locally via Ollama with the following command:

% ollama run llama2:70b
Llama 2 70B M3 Max Performance

Prompt eval rate comes in at 19 tokens/s. The eval rate of the response comes in at 8.5 tokens/s.
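Because the 70B model needs roughly 39 GB of unified memory for its weights alone, it is worth confirming how much RAM the machine reports before loading it. A minimal check on macOS:

% sysctl -n hw.memsize   # total physical memory in bytes; 137438953472 corresponds to 128 GB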

Running Orca Mini on M3 Max

Orca Mini is a 3B parameter model that is about 1.9 GB on disk. Run it locally via Ollama with the following command:

% ollama run orca-mini
Orca Mini M3 Max Performance

Prompt eval rate comes in at 298 tokens/s. The eval rate of the response comes in at 109 tokens/s.

Running Vicuna on M3 Max

Vicuna is a 7B parameter model that is about 3.8 GB on disk. Run it locally via Ollama with the following command:

% ollama run vicuna
Vicuna M3 Max Performance

Prompt eval rate comes in at 204 tokens/s. The eval rate of the response comes in at 67 tokens/s.

Summary of running LLMs locally on M3 Max

The M3 Max chip brings a lot of desktop-class compute to the laptop in a portable package. For smaller models like Orca Mini, which use only a small amount of RAM, we see blazing fast tokens per second. With the fine-tuning happening around Mistral, and the strong showing from Llama 2 and Code Llama, you can easily run your own models locally at high speed without risking the privacy of your data. The eval rates are summarized below, along with a sketch for reproducing the comparison.

Model | Tokens/sec
Mistral | 65
Llama 2 | 64
Code Llama | 61
Llama 2 Uncensored | 64
Llama 2 13B | 39
Llama 2 70B | 8.5
Orca Mini | 109
Vicuna | 67
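To reproduce this comparison, the same prompt can be run through every model with the --verbose flag and the rate lines captured (the prompt is only an example; averaging several runs per model gives more stable numbers):

# Run one prompt through each model and keep only the rate lines from the verbose stats
% for m in mistral llama2 codellama llama2-uncensored llama2:13b llama2:70b orca-mini vicuna; do echo "=== $m ==="; ollama run "$m" "Why is the sky blue?" --verbose 2>&1 | grep "eval rate"; done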

While tasks such as fine-tuning and training larger models are still not efficient on consumer laptops, the power of the M3 Max in a 14" chassis brings a lot of capability to developers building their own LLM projects who want to save on the costs of hosted models and APIs.

Stay tuned for more content around building LLM powered applications and SaaS products! Stay in the loop and subscribe to our newsletter!