LLM Performance On M3 Max
Introduction
When Apple announced the M3 chips in the new MacBook Pro at their "Scary Fast" event in October, the first question a lot of us were asking was, "How fast can LLMs run locally on the M3 Max?" There has been a lot of performance testing with the M2 Ultra in the Mac Studio, which is essentially two M2 Max chips joined together. Its large unified memory created a system powerful enough to run many of these workloads, but this is the first time we can get that much RAM (96 GB to 128 GB) in a MacBook, letting us take those Mac Studio workloads on the road (or show them off in a coffee shop with the new Space Black finish).
In this blog post, we will focus on the performance of running LLMs locally and compare the tokens per second for each of the different models. We will be using the default models pulled from Ollama, and we will not be covering custom fine-tuned models or importing custom weights from PyTorch, although Ollama supports those as well.
M3 Max LLM Testing Hardware
For this test, we are using the 14" MacBook Pro with the upgraded M3 Max chip and maximum RAM.
Component | Specification |
---|---|
CPU | 16 cores (12 performance and 4 efficiency) |
GPU | 40 cores |
RAM | 128 GB |
Storage | 1 TB |
M3 Max Battery Settings
With the different power settings available and the Mac's ability to shift between performance states, we are going to use the High Power Mode energy setting found under Battery settings to make sure the hardware is not being throttled.
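If you want to double-check the energy mode from the terminal, one quick sanity check (a hedged sketch; the powermode field only shows up on Apple Silicon MacBook Pros that support High Power Mode, and the exact output can vary by macOS version) is the built-in pmset utility, where a value of 2 corresponds to high power:
% pmset -g | grep powermode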
Local LLMs with Ollama
Before we can start exploring the performance of Ollama on the M3 Max chip, it is important to understand how to set it up. The process is simple and straightforward: download the Ollama application from the official website, follow the installation instructions, and you are ready to go.
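Once installed, a couple of quick commands confirm the setup and expose the same timing statistics used throughout this post. Running a model with the --verbose flag makes Ollama print stats, including the prompt eval rate and eval rate, after each response (mistral here is just an example model):
% ollama --version
% ollama run mistral --verbose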
LLM Model Selection
Ollama out of the box allows you to run a blend of censored and uncensored models. To determine the tokens per second on the M3 Max chip, we will test each of the eight models listed on the Ollama GitHub page individually.
Model | Parameters | Size | Download |
---|---|---|---|
Mistral | 7B | 4.1GB | ollama run mistral |
Llama 2 | 7B | 3.8GB | ollama run llama2 |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b |
Llama 2 70B | 70B | 39GB | ollama run llama2:70b |
Orca Mini | 3B | 1.9GB | ollama run orca-mini |
Vicuna | 7B | 3.8GB | ollama run vicuna |
List of models from Ollama GitHub page.
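To avoid waiting on downloads between tests, you can pull all eight models up front with a small shell loop using the names from the table above:
% for model in mistral llama2 codellama llama2-uncensored llama2:13b llama2:70b orca-mini vicuna; do ollama pull "$model"; done
Based on the sizes listed above, the full set takes roughly 68 GB of disk space, most of it from llama2:70b.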
Running Mistral on M3 Max
Mistral is a 7B parameter model that is about 4.1 GB on disk. To run it locally via Ollama, use the command:
% ollama run mistral
Mistral M3 Max Performance
Prompt eval rate comes in at 103 tokens/s. The eval rate of the response comes in at 65 tokens/s.
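If you would rather capture these numbers from a script than read them off the interactive output, one option (a rough sketch, assuming Ollama's local REST API is listening on its default port 11434 and that jq is installed) is to call the generate endpoint with streaming disabled and compute tokens per second from the returned token counts and durations, which are reported in nanoseconds:
% curl -s http://localhost:11434/api/generate \
    -d '{"model": "mistral", "prompt": "Why is the sky blue?", "stream": false}' \
    | jq '{prompt_eval_tps: (.prompt_eval_count / .prompt_eval_duration * 1000000000), eval_tps: (.eval_count / .eval_duration * 1000000000)}'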
Running Llama 2 on M3 Max
Llama 2 is a 7B parameter model that is about 3.8 GB on disk. To run it locally via Ollama, use the command:
% ollama run llama2
Llama 2 M3 Max Performance
Prompt eval rate comes in at 124 tokens/s. The eval rate of the response comes in at 64 tokens/s.
Running Code Llama on M3 Max
Code Llama is a 7B parameter model tuned to generate software code and is about 3.8 GB on disk. To run it locally via Ollama, use the command:
% ollama run codellama
Code Llama M3 Max Performance
Prompt eval rate comes in at 140 tokens/s. The eval rate of the response comes in at 61 tokens/s.
Running Llama 2 Uncensored on M3 Max
Llama 2 Uncensored is a 7B parameter model that is about 3.8 GB on disk. To run it locally via Ollama, use the command:
% ollama run llama2-uncensored
Llama 2 Uncensored M3 Max Performance
Prompt eval rate comes in at 192 tokens/s. The eval rate of the response comes in at 64 tokens/s.
Running Llama 2 13B on M3 Max
Llama 2 13B is the larger variant of Llama 2 and is about 7.3 GB on disk. To run it locally via Ollama, use the command:
% ollama run llama2:13b
Llama 2 13B M3 Max Performance
Prompt eval rate comes in at 17 tokens/s. The eval rate of the response comes in at 39 tokens/s.
Running Llama 2 70B on M3 Max
Llama 2 70B is the largest of the Llama 2 models and is about 39 GB on disk. To run it locally via Ollama, use the command:
% ollama run llama2:70b
Llama 2 70B M3 Max Performance
Prompt eval rate comes in at 19 tokens/s. The eval rate of the response comes in at 8.5 tokens/s.
Running Orca Mini on M3 Max
Orca Mini is a 3B parameter model that is about 1.9 GB on disk. To run it locally via Ollama, use the command:
% ollama run orca-mini
Orca Mini M3 Max Performance
Prompt eval rate comes in at 298 tokens/s. The eval rate of the response comes in at 109 tokens/s.
Running Vicuna on M3 Max
Vicuna is a 7B parameter model that is about 3.8 GB on disk. To run it locally via Ollama, use the command:
% ollama run vicuna
Vicuna M3 Max Performance
Prompt eval rate comes in at 204 tokens/s. The eval rate of the response comes in at 67 tokens/s.
Summary of running LLMs locally on M3 Max
The power of the M3 Max chip brings a lot of desktop compute to the laptop in a portable package. For smaller models like Orca Mini, which use only a small amount of RAM, we see blazing fast tokens per second. Given the quality of fine-tuned models like Mistral, and the solid performance of Llama 2 and Code Llama, you can easily run your own models locally at high speeds without risking the privacy of your data.
Model | Eval rate (response) |
---|---|
Mistral | 65 tokens/second |
Llama 2 | 64 tokens/second |
Code Llama | 61 tokens/second |
Llama 2 Uncensored | 64 tokens/second |
Llama 2 13B | 39 tokens/second |
Llama 2 70B | 8.5 tokens/second |
Orca Mini | 109 tokens/second |
Vicuna | 67 tokens/second |
While steps such as fine-tuning and training larger models are still not as efficient on consumer laptops, the power of the M3 Max chip in a 14" chassis brings a lot of capability to developers who are building their own LLM projects and want to save on the costs of hosted models and APIs.
Stay tuned for more content around building LLM-powered applications and SaaS products, and subscribe to our newsletter to stay in the loop!