Intro #
It seems pretty clear at this point that LLM-assisted coding is the future, especially in its agentic form. People are being remarkably productive, clearing out years of backlogged projects. The number of people I’ve met running multi-agent processes on Claude’s Max plan, with monthly token usage in the thousands of USD, is astounding. And after dabbling with a few free models, courtesy of OpenCode, the utility is undeniable.
Drawback of Cloud Models #
While it was tempting to buy a cheap subscription from OpenRouter, or even pay for an unlimited plan with one of the big providers, they all carry the fundamental issue of any cloud service: with a subscription model, you simply do not own what you are paying for. That means token and rate limits. Add on security issues like data leaks, or accidentally passing in data you didn’t mean to, which then becomes training data for the cloud model.
If I have to use AI, I’d like to use it on my own terms, and that means running models locally.
Hardware #
Before you can run any model you need hardware to run it on, and unfortunately there’s just no cheap option. If you want anything semi-comparable to Claude Code or other agentic setups, you’ll need enough RAM for a large context window so the model can keep chugging along at a task. While GPUs have very fast inference speeds, buying enough of them to load in a whole model looked prohibitively expensive. Instead, I opted for a machine with unified memory. Since I was going to run this as a server, it was important to me that I could install Linux on it. While Apple has some impressive specs, I am not a fan of their OS or hardware ecosystem, so I gave them a pass.
Instead I set my eyes on the AMD Ryzen™ AI Max+ 395. A whole host of mini PCs ship with this chip, but I went with the Framework Desktop. You can buy just the mainboard if you only want a server, but I bought the case and handle so it could be more portable; it’s quite small, so that’s not a bad option. I bought the 128GB version, since loading entire models into RAM was very important to me. You can also cluster machines later by connecting another mainboard.
And if it ended up being a bust, well, then I was just left with a really amazing server.
Cost #
With the current geopolitical situation and the RAM shortage being what it is, prices at this moment are basically the best they’ll be for about a year. That said, they’re quite high. Altogether the Framework Desktop ended up being ~$3k (and that was without storage), or $2.4k if you just get the mainboard. For the bare mainboard you’ll have to supply a rack, PSU, and fan yourself, but altogether those should only add ~$100.
I figured for my AI purposes I’d be aiming for something like a Max plan with unlimited tokens, which runs about $200/month at the moment. Over the course of a year that is $2.4k, exactly the cost of the mainboard. The full configuration I bought works out to about 15 months of Claude Code’s Max plan.
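The break-even point is simple arithmetic; a quick sketch using the rough figures above (estimates, not exact quotes):

```shell
# Back-of-the-envelope break-even: hardware cost vs. monthly plan cost.
hardware_cost=3000        # full Framework Desktop build, ~USD (estimate)
plan_cost_per_month=200   # Max-style plan, USD/month
echo $(( hardware_cost / plan_cost_per_month ))  # months for the hardware to pay for itself → 15
```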
It’s also important to keep in mind that these $200/month plans are hemorrhaging money, and when they stop being subsidized (which they eventually must), prices will rise significantly to match actual token usage. I’d rather not be reliant on any one cloud provider when that happens.
Setup #
This is just my first barebones LLM setup; right now I’m focusing on text generation and having something to hook into OpenCode, so I haven’t delved into image or video generation at all.
Stampby and kyuz0 have some more comprehensive setups for the AMD Strix Halo hardware. Halo-AI and StrixHalo_Toolboxes are also worth looking into.
BIOS #
Although the machine’s memory is unified, you can change the BIOS settings to reserve up to 96 GB for dedicated graphics memory. You can set kernel flags in the OS to further increase this size. When I first started out I had tried increasing the dedicated graphics memory pool, but quickly ran into memory errors, so I’ve just left it at default. I’m sure I’m leaving plenty of performance on the table, but I will revisit this later.
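For reference, the kernel-flag approach generally means raising the GTT limits through TTM module parameters on the kernel command line. Here is a rough sketch of what that looks like; the specific values below (~96 GB expressed in 4 KiB pages) are illustrative assumptions I haven’t verified on this machine, not settings I’m running:

```shell
# /etc/default/grub (sketch) — let the GPU borrow more system RAM via GTT.
# ttm.pages_limit / ttm.page_pool_size are counted in 4 KiB pages;
# 25165824 pages ≈ 96 GiB. Treat these values as a starting point, not gospel.
GRUB_CMDLINE_LINUX_DEFAULT="quiet ttm.pages_limit=25165824 ttm.page_pool_size=25165824"
```

Followed by `sudo update-grub` and a reboot for the flags to take effect.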
ROCm #
ROCm is AMD’s compute stack, specifically suited for AI usage, but it seems to be consistently outperformed by Vulkan on this hardware. Despite this I decided to try it anyway, but in the future I plan on benchmarking against Vulkan to properly gauge the performance gap.
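When I get to that benchmark, the plan is roughly: build a second copy of llama.cpp with the Vulkan backend and run `llama-bench` on the same model against both builds. A sketch, untested on my machine; the model path is a placeholder:

```shell
# Build a Vulkan-backed copy alongside the ROCm build (run from the llama.cpp checkout).
cmake -B build-vulkan -S . -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j"$(nproc)"

# Same model, same benchmark; compare the reported t/s numbers.
./build/bin/llama-bench -m ~/models/some-model.gguf          # ROCm build
./build-vulkan/bin/llama-bench -m ~/models/some-model.gguf   # Vulkan build
```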
Installation #
I chose Debian for my distribution (though I later tried a different distro with some newer packages). Regardless, AMD provides a useful Quick Setup:
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm

Driver Installation #
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo apt install "linux-headers-$(uname -r)"
sudo apt install amdgpu-dkms

Then reboot to apply settings.
sudo systemctl reboot

Post Installation #
Configure ROCm shared objects.
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig

Configure ROCm PATH.
# Display a list of all ROCm versions available:
sudo update-alternatives --display rocm
# If multiple ROCm versions are installed, switch between them using this command and selecting the ROCm version:
sudo update-alternatives --config rocm

Verify Installation #
# Check for rocm packages
apt list --installed | grep rocm
# Check for hip packages
apt list --installed | grep hip
# Verify Installation was successful
rocminfo | grep -i "Marketing Name:"

Llama User #
The server only stays up for as long as the command is running, so to keep it persistent I created a systemd service for it, and I wanted it to run as an unprivileged user.
# Create user llama with a home directory
sudo useradd -m -s /bin/bash llama
# Add it to the appropriate ROCm groups
sudo usermod -a -G video,render llama
# Open the user in an interactive shell
sudo -iu llama

The rest of the setup was done while logged in as llama.
Inference #
For inference I decided to go with llama.cpp. There wasn’t any specific reason; it just looked like a popular inference engine that supported my hardware. However, the base binary it ships does not support ROCm out of the box, so by default it will just use BLAS, which is very slow. To get ROCm support you have to build it from source. They do have some ROCm Docker images, as does AMD, but in my testing these resulted in a ~10 t/s hit, so I stuck to bare metal.
Clone and enter directory.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

For the build command I added flags to enhance flash attention performance with rocWMMA.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -B build -S . -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DGGML_HIP_ROCWMMA_FATTN=ON && time cmake --build build --config Release -j$(nproc)

Spinning up the llama.cpp server. Since I’ll be accessing this from different devices, I set it to listen on 0.0.0.0. I chose Qwen’s latest MoE model, since I’ve been really impressed by its results in previous testing.
./build/bin/llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q8_0 --host 0.0.0.0 --port 8080

Systemd Service #
I created a systemd service to keep the server persistent.
[Unit]
Description=Llama server
After=network.target
[Service]
User=llama
Group=llama
WorkingDirectory=/home/llama
ExecStart=/home/llama/llama.cpp/build/bin/llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q8_0 --host 0.0.0.0 --port 8080
Restart=always
RestartSec=5
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target

# Reload the Daemon
sudo systemctl daemon-reload
# Enable and Start the Service
sudo systemctl enable --now llama-server.service

After this I locked the user password.
sudo passwd -l llama

OpenCode #
Now that the server is set up, we can hook it into my agentic harness on any other client machine by adding it to opencode.json. Since I have Tailscale on all my machines, I’ll be using the Tailscale IP. If I ever plan on publicly exposing the server, I’ll add an API key.
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "my-model-name": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Llama.cpp",
      "options": {
        "baseURL": "http://<TAILSCALE_IP>:8080/v1"
      },
      "models": {
        "my-model-name": {
          "name": "Qwen 3.5 35B-A3B"
        }
      }
    }
  }
}

Repo #
An automated version of this setup is available in its repo.