It’s time to start testing various Private AI models, and fortunately, the timing is just right. Meta has just released six new AI language models. These models run on-premises and do not interact with the cloud or OpenAI’s ChatGPT.
Llama 3.1 405B competes with leading models like GPT-4, GPT-4o, and Claude 3.5 Sonnet, while the smaller models match other open and closed models of similar size. Llama 3.1 405B is the first openly available model to rival top AI models in general knowledge, steerability, math, tool use, and multilingual translation. Meta has also upgraded the 8B and 70B multilingual models, which now feature an extended context length of 128K tokens and enhanced reasoning capabilities. These advancements support advanced applications such as long-form text summarization, multilingual conversational agents, and coding assistants.
Meta Llama 3.1 405B, 70B, and 8B are openly available under Meta's community license, and anyone can use them for commercial or educational purposes.
Most people say that running Meta Llama 3.1 405B at home is nearly impossible, but I wanted to give it a try. I’m curious about the setup process, performance, and overall experience.
Here’s a table showing the approximate memory needed for different configurations:
| Model Size | FP16 | FP8 | INT4 |
|------------|------|-----|------|
| 8B | 16 GB | 8 GB | 4 GB |
| 70B | 140 GB | 70 GB | 35 GB |
| 405B | 810 GB | 405 GB | 203 GB |
For more information, you can visit the Hugging Face and Meta websites.
- Hugging Face: https://huggingface.co/blog/llama31#how-much-memory-does-llama-31-need
- Meta: https://ai.meta.com/blog/meta-llama-3-1/
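As a quick cross-check, these figures are essentially the parameter count multiplied by the bytes per parameter (2 for FP16, 1 for FP8, 0.5 for INT4); the KV cache and runtime overhead come on top of that. A small Python sketch of the back-of-the-envelope estimate:

```python
def approx_weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights: parameters x bytes per parameter."""
    return params_billion * (bits_per_param / 8)  # 1e9 params * bytes / 1e9 = GB

for size in (8, 70, 405):
    row = " | ".join(f"{approx_weight_memory_gb(size, bits):.0f} GB" for bits in (16, 8, 4))
    print(f"{size}B | {row}")
# Output matches the table above to within rounding; real usage is higher because
# the KV cache, activations, and the runtime itself also need memory.
```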
I conducted a few different tests.
Test 1: I ran a bare-metal Windows 11 setup with Ollama, Docker, and OpenWebUI installed, along with a GPU (GeForce RTX 3090, 24 GB).
Test 2: I set up a Windows 11 VM and allocated all the server resources available on the host; the ESXi host was running nothing else except this VM. Of course, it's not advisable to allocate all resources to the VM, as the host itself also needs some RAM and CPU. I installed Ollama, Docker, and OpenWebUI, a web interface similar to ChatGPT (a small sketch of querying this stack directly follows the test descriptions). The VM did not have any GPU assigned.
VM: 120 vCPUs, 236 GB RAM
Test 3: I used the same setup as Test 2 but reduced the amount of RAM and the number of CPU cores.
VM: 60 vCPUs, 180 GB RAM (later increased to 236 GB)
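All three tests used the same software stack: Ollama serves the models locally (on port 11434 by default), and OpenWebUI, running in Docker, is just a ChatGPT-style front end on top of that endpoint. Here is a minimal Python sketch of pulling the models and sending a prompt straight to Ollama's API, bypassing the UI; the model tags and the plain HTTP call reflect a standard Ollama install and are not the exact commands from my tests:

```python
import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Download the three Llama 3.1 variants with the Ollama CLI.
# The 405B tag alone is roughly 230 GB on disk, so plan storage accordingly.
for tag in ("llama3.1:8b", "llama3.1:70b", "llama3.1:405b"):
    subprocess.run(["ollama", "pull", tag], check=True)

def ask(model: str, prompt: str) -> str:
    """Send one prompt to the local Ollama server and return the complete answer."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

print(ask("llama3.1:8b", "Who is the US President?"))
```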
Server Hardware:
RAM: 256 GB
CPU: 60 cores (120 threads)
Questions I asked:
- Who is the US President?
- Create a song similar to Taylor Swift’s “Style.”
- Find the value of y in the equation: 4y + 7 = 31
- If you have two buckets, one holding 5 liters and the other holding 3 liters, how can you measure out exactly 4 liters?
Test 1:
This test ran on the bare-metal Windows 11 setup with Ollama, Docker, and OpenWebUI installed, along with the GeForce RTX 3090 (24 GB). In the video, you can see the usage of RAM, CPU, storage, and power, as well as the response times; I included a timer to show how quickly it responds.
Test 2:
This test ran on the Windows 11 VM with all available host resources allocated (120 vCPUs, 236 GB RAM, no GPU assigned) and Ollama, Docker, and OpenWebUI installed; the ESXi host was running nothing else. The video again shows RAM, CPU, storage, and power usage, the response times, and the timer.
Test 3:
The same setup as Test 2, but with fewer resources: 60 vCPUs and 180 GB RAM, later increased to 236 GB. The video again shows RAM, CPU, storage, and power usage, the response times, and the timer.
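For anyone who wants to reproduce the timing side of these tests, the response times can also be measured programmatically against the same local Ollama endpoint instead of with an on-screen timer. A small Python sketch under that assumption (the videos themselves were timed manually, not with this script):

```python
import json
import time
import urllib.request

QUESTIONS = [
    "Who is the US President?",
    'Create a song similar to Taylor Swift\'s "Style."',
    "Find the value of y in the equation: 4y + 7 = 31",
    "If you have two buckets, one holding 5 liters and the other holding 3 liters, "
    "how can you measure out exactly 4 liters?",
]

def timed_ask(model: str, prompt: str) -> float:
    """Return the wall-clock seconds the local Ollama server needs for one answer."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        json.loads(response.read())  # block until the full answer has arrived
    return time.perf_counter() - start

for model in ("llama3.1:8b", "llama3.1:70b", "llama3.1:405b"):
    for question in QUESTIONS:
        print(f"{model:14} | {question[:40]:40} | {timed_ask(model, question):8.1f} s")
```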
Conclusion:
Installing Ollama, Docker, and the language models was very straightforward.
Based on my tests of the Meta Llama 3.1 405B, 70B, and 8B language models on a CPU-only VM and on bare metal with a GPU, I can summarize the following:
• For smaller language models, I do not need a GPU; they run well on the CPU.
• The Llama 3.1 8B model ran at a reasonably acceptable speed. The 70B model ran quite slowly, and loading the model into memory took too long.
• Loading the 405B model into memory took several minutes, and generating responses was also time-consuming. With the current setup, the 405B language model is not usable, as the response times are too long.
• When running with a GPU, the 8B model was extremely fast, and the 70B model also saw speed improvements. However, the 405B model’s performance did not improve and response times were actually longer in my test.
A surprising observation was that the initial load of the model into memory takes a while before the first response, and the response generation itself was also slow. I had expected the preprocessing to take longer and the responses to be quicker, but the model seems to process in real time, and I didn't notice much storage activity.

Currently, it appears that VRAM size is the most important factor for running large language models, followed by the amount of server RAM. If the language model exceeds the GPU's VRAM capacity, it falls back to the computer's RAM and CPU. In one of my VMs with 180 GB of RAM, the language model required 231 GB, causing the VM to hang and run into various issues: it tried to respond but was extremely slow, often freezing. It's important to remember that Docker, Windows 11, and so on also need RAM, not just the language model. I estimate the RAM requirement from the model's size on disk: the 405B model takes up 231 GB on disk and needs about the same amount in RAM. CPU usage was not significantly high during testing; even when I reduced the cores from 60 to 30, there was no noticeable change in performance.
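Based on that rule of thumb (the model needs roughly its on-disk size in memory, plus whatever Windows, Docker, and the KV cache take), a quick sanity check before loading a model might look like the sketch below. The overhead value and the ~4.7 GB disk size I use for the 8B model are my own assumptions:

```python
def fits_in_memory(model_disk_gb: float, ram_gb: float,
                   vram_gb: float = 0.0, overhead_gb: float = 8.0) -> bool:
    """Rough check: the model wants about its on-disk size in RAM + VRAM,
    plus some headroom for the OS, Docker, and the KV cache (overhead is a guess)."""
    return model_disk_gb + overhead_gb <= ram_gb + vram_gb

print(fits_in_memory(231, ram_gb=180))             # False: the VM that hung in my test
print(fits_in_memory(231, ram_gb=256))             # True: the full host RAM is just enough
print(fits_in_memory(4.7, ram_gb=16, vram_gb=24))  # True: the 8B model fits comfortably
```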
I still need to determine which GPU would be suitable to comfortably run the Meta Llama 3.1 405B model in my home lab. The current speed is unsatisfactory, but this is the language model I want to explore.
The videos show the usage of RAM, CPU, storage, and power, as well as the response times, and the timer shows how quickly each answer arrives. I did not focus much on the responses themselves; sometimes the model had already answered by the time I asked the next question. I definitely recommend experimenting with it.