How to Run Llama 3.1 405B on Home Devices? Build AI Cluster!

Bartłomiej Tadych
5 min read · Jul 28, 2024


Distributed Llama

In the race between open LLM models and closed LLM models, the biggest advantage of the open models is that you can run them locally. You don’t need to rely on external providers or pay anything beyond electricity and hardware costs. However, this advantage starts to wane as the size of the model increases. It’s not easy to run huge models that require large amounts of memory. Fortunately, tensor parallelism and distributed inference can help.

Tensor Parallelism

Most of the computation in LLMs is matrix multiplication, which accounts for around 97–98% of all operations. Matrix multiplication is quite easy to parallelize across multiple CPU/GPU cores, and we can do the same across multiple devices: the model is split in such a way that each device calculates only a slice of the matrix multiplication. If a single device can compute a matrix multiplication in n seconds, then two devices should compute it in roughly n / 2 seconds. This is tensor parallelism.
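
To make the idea concrete, here is a minimal NumPy sketch (my own illustration, not Distributed Llama code) of splitting a single matrix multiplication column-wise across two “devices”:

# A minimal NumPy sketch of tensor parallelism for one matmul.
# Each "device" owns a column slice of the weight matrix and computes
# only its part of the output; concatenating the parts gives the full result.
import numpy as np

d_model, d_out, n_devices = 4096, 4096, 2
x = np.random.randn(1, d_model).astype(np.float32)      # one token's activations
W = np.random.randn(d_model, d_out).astype(np.float32)  # full weight matrix

# Split W column-wise: device i holds its own slice of the columns
slices = np.split(W, n_devices, axis=1)

# Each device multiplies the same input by its own slice (in parallel in practice)
partial_outputs = [x @ W_i for W_i in slices]

# Synchronization step: gather the partial outputs into the full result
y_parallel = np.concatenate(partial_outputs, axis=1)
y_reference = x @ W
assert np.allclose(y_parallel, y_reference, atol=1e-2)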

Tensor parallelism, computation time

This sounds very promising, but the main bottleneck here is synchronization. We can speed up the multiplication, but at some point we need to synchronize the state of the neural network, and this takes time. Professional AI clusters use advanced links between GPUs (like NVLink) that achieve very high transfer speeds. Home devices, however, have slow Ethernet. Surprisingly, though, the amount of data required to synchronize an LLM can be very low if the model executor is designed to reduce the transfer size. For example, Llama 3 8B quantized to the Q40 format (6.3 GB) requires only about 1 MB of synchronization data per token if the cluster consists of 2 devices. This is very, very low.

Here we are. Tensor parallelism speeds up inference, but synchronization slows it down. The combination of these two factors determines the final performance. If you have 8 devices and can connect them with a fast link, you will observe a significant speedup (synchronization over USB4 looks very promising here: you can achieve 10 to 20 Gbps).
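
As a rough mental model (my own back-of-envelope, not a description of Distributed Llama's internals), you can think of the per-token time as the single-device compute time divided by the number of devices, plus the time needed to push the synchronization data over the link:

# Simplified per-token latency model: compute / n_devices + sync data / bandwidth.
# It ignores per-message latency and assumes all sync data for a token is sent once.
def token_time_s(compute_s: float, n_devices: int, sync_mb_per_token: float, link_gbps: float) -> float:
    sync_s = (sync_mb_per_token * 8 / 1000) / link_gbps  # MB -> gigabits -> seconds
    return compute_s / n_devices + sync_s

# Example: a hypothetical 2 s/token single-device compute cost and 1 MB of sync data per token
print(token_time_s(2.0, 2, 1.0, 1.0))   # ~1.008 s over Gigabit Ethernet
print(token_time_s(2.0, 2, 1.0, 20.0))  # ~1.0004 s over a 20 Gbps USB4 link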

So, how can we run a large model at home? You need a project that implements these ideas. Let me introduce the Distributed Llama project.

Distributed Llama

Distributed Llama is a project that allows you to run an LLM model across multiple devices. It uses tensor parallelism and is optimized for the low amount of data required for synchronization. Distributed Llama distinguishes between two types of nodes that you can run on your devices:

  • Root Node — the application that acts as the root node of your cluster, coordinating the cluster.
  • Worker Node — the application that functions as a worker, executing instructions from the root node.

Currently, Distributed Llama supports only CPU inference, but this will change in the future.

AI cluster topology, 4 devices, total 256 GB RAM

So, if your home cluster consists of 4 devices, you should run the root node on the first device and 3 worker nodes on the remaining devices. Distributed Llama splits RAM usage across all devices. For example, if an LLM model requires 238 GB of RAM and your cluster has n nodes, each node needs roughly 238 GB / n of RAM. The exception is the root node, which requires a few percent more than 238 GB / n because it needs to keep a few additional layers in memory. A quick sanity check for a 4-device cluster is sketched below.
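
Here is that sanity check in Python (the ~5% root overhead is my own assumption standing in for “a few percent more”):

# Rough per-node RAM estimate for a 238 GB model split across 4 nodes
model_gb, n_nodes, root_overhead = 238, 4, 1.05  # ~5% extra on the root is an assumption
worker_gb = model_gb / n_nodes
root_gb = worker_gb * root_overhead
print(f"worker: ~{worker_gb:.1f} GB, root: ~{root_gb:.1f} GB")  # worker: ~59.5 GB, root: ~62.5 GB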

Run 405B Model

To run Llama 3.1 405B, we need to clone the Distributed Llama repository and build the dllama application on every device you want to use for inference. A C++ compiler such as G++ is required.

git clone https://github.com/b4rtaz/distributed-llama.git
make dllama

Then, you need to connect all devices to the same local network. You can use any Ethernet switch for that. As I mentioned earlier, synchronization time is a significant factor, so you should use the fastest switch possible. Gigabit Ethernet is the minimum requirement. You can also consider connecting devices via USB4 and creating a USB4 mesh network. Next, you need to run worker nodes on the worker devices:

./dllama worker --port 9998 --nthreads 4

The --nthreads argument defines how many CPU cores should be used for processing. You should set this to the number of CPU cores in your device. As you can see, the worker does not need the model files. These files are only required for the root node. At the beginning, the root node distributes all slices of the model to the worker nodes.

Before we run the root node, we need to download the Llama 3.1 405B model to the root device and convert it to the Distributed Llama format. You can do this manually or simply download the pre-converted weights from Hugging Face. The launch.py script from the Distributed Llama repository downloads the model and the tokenizer with a single command. All files are placed into the models folder.

python launch.py llama3_1_405b_instruct_q40

Ensure that you have accepted the Llama 3.1 license on Hugging Face and that you have approximately 240 GB of free disk space.

Now you can run the inference on the root node.

./dllama inference \
--model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m \
--tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t \
--buffer-float-type q80 \
--prompt "Hello world" \
--steps 64 \
--nthreads 4 \
--workers 10.0.0.1:9998 10.0.0.2:9998 10.0.0.3:9998

Please note that the --workers argument accepts the IP addresses and ports of the worker nodes, separated by spaces. Additionally, you can define how many tokens to predict by setting the --steps N argument.

If you want to run an API service that exposes the /v1/chat/completions endpoint, you should build the dllama-api application and run it on the root device instead of dllama inference.

./dllama-api \
--model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m \
--tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t \
--buffer-float-type q80 \
--max-seq-len 2048 \
--nthreads 4
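
Once dllama-api is running, any OpenAI-style client can talk to it. Below is a minimal Python sketch; the host and port are placeholders, so point them at wherever your dllama-api instance actually listens, and note that I assume the response follows the usual chat-completions shape:

# Minimal client for the OpenAI-compatible /v1/chat/completions endpoint.
# The address and port below are placeholders; adjust them to your root device.
import json
import urllib.request

API_URL = "http://127.0.0.1:9990/v1/chat/completions"  # placeholder host and port

payload = {
    "messages": [{"role": "user", "content": "Hello world"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.loads(resp.read())
    print(answer["choices"][0]["message"]["content"])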

Distributed Llama also supports the --kv-cache-storage disk argument, which reduces RAM usage by moving the KV cache to the disk. Llama 3.1 models require ~34 GB of RAM to store the full context (F32) in memory (131k tokens). By setting this argument, you can reduce RAM usage, but you will need additional disk space. Please note that the KV cache is split across all nodes, so you need to set this option for each node.

The second option to reduce RAM usage is to use the --max-seq-len 2048 argument. If you don’t need the full context size, you can reduce it, which will simultaneously reduce memory consumption.
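
For a sense of scale, here is a back-of-envelope estimate of the KV-cache size, based on the published Llama 3.1 405B attention configuration (126 layers, 8 KV heads, head dimension 128) and F32 values; Distributed Llama's exact bookkeeping may differ, so treat the numbers as approximations:

# Back-of-envelope KV-cache size: 2 (K and V) * layers * KV heads * head dim * context * bytes
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=4):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

full = kv_cache_gb(126, 8, 128, 131072)  # full 131k context
short = kv_cache_gb(126, 8, 128, 2048)   # reduced context via --max-seq-len 2048
print(f"full context: ~{full:.0f} GB total, ~{full/4:.0f} GB per node in a 4-device cluster")
print(f"2048 tokens:  ~{short:.1f} GB total")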

That’s it! Don’t forget to share your results on GitHub.
