Self-Hosting an LLM: A Scatter Pack

I got nerd-sniped (I think that's the term?) into running a text-generation LLM at home on a GPU I had lying around. In lieu of a well-formed blogpost, here's a collection of things that helped me along the way.

I'm calling this a scatter pack because it's kind of like a starter pack, but it's mostly a scattered mix of advice, tutorial, curation, and post-mortem blogging. I'm not getting very far into implementation detail, and I probably won't cover your use case.

Step 1: Do a little dreaming

What do you want to do with a model? Is this just to say you can do it, or do you want to accomplish something?

Here are some things that inspired me to host my own model:

Penelope (@penelope.hailey.at)

@hailey.at takes care of me. if there are problems, let her know.

https://bsky.app/profile/penelope.hailey.at

GitHub - letta-ai/letta-code: A self-improving, stateful coding agent that can learn from experience and improve with use.

A self-improving, stateful coding agent that can learn from experience and improve with use. - letta-ai/letta-code

https://github.com/letta-ai/letta-code

zzstoatzzdevlog.bsky.social

(maybe) interesting stuff i (@zzstoatzz.io) do - narrated by claude! see https://github.com/jlowin/fastmcp/pull/916

https://bsky.app/profile/zzstoatzzdevlog.bsky.social

Talking with Home Assistant - get your system up & running

Open source home automation that puts local control and privacy first. Powered by a worldwide community of tinkerers and DIY enthusiasts. Perfect to run on a Raspberry Pi or a local server.

https://www.home-assistant.io/voice_control/

Step 2: Choose your server

I'm assuming that you want to run an off-the-shelf model, and boy-oh-boy are there a lot of them. Deciding what you'll use to run your model will help you refine your search. For me, it came down to parallelism:

Parallelism =

No parallelism =

You can have many completions running at the same time

Multiple requests need to wait for the request before them to complete

I started with Ollama, but I quickly found out I needed to switch to vLLM. Here are some options, in addition to those two:

GitHub - turboderp-org/exllamav3: An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs

An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs - GitHub - turboderp-org/exllamav3: An optimized quantization and inference library for runni...

https://github.com/turboderp-org/exllamav3

Ollama

Get up and running with large language models.

https://ollama.com/

vLLM

You are viewing the latest developer preview docs. Click here to view docs for the latest stable release.

https://docs.vllm.ai/en/latest/

MLX — MLX 0.30.0 documentation

MLX is a NumPy-like array framework designed for efficient and flexible machine learning on Apple silicon, brought to you by Apple machine learning research.

https://ml-explore.github.io/mlx/build/html/index.html

GitHub - ggml-org/llama.cpp: LLM inference in C/C++

LLM inference in C/C++. Contribute to ggml-org/llama.cpp development by creating an account on GitHub.

https://github.com/ggml-org/llama.cpp

Why are we selecting the server before the model?

Again, I'm focusing on picking a model off-the-shelf, so your choice of server can introduce extra constraints on your selection. For example:

If you choose vLLM or parallelized solutions, you have to select a smaller model than you would otherwise, because the key-value cache needs to reside in VRAM.

If you have Apple Silicon, you probably want to use MLX, which has its own architecture that models need explicit conversion to support.

Ollama has its own catalog of models that you can deploy from, which can be easier to use than Hugging Face

These are all conditions that I wish I knew before I started to shop for models. I got excited about running Deepseek R1 7B, but I had to run a smaller model.

Here are some questions you should ask yourself during this process:

Where will it run (e.g. desktop environment or container)?

Does it need to be network available?

How many people/applications would like to access your server at once?

Step 3: Choose a model

Okay, hopefully you know what you're gonna use to run your model. If you're lost or confused, I'd recommend Ollama as the place to start.

The GPU I'm working with is an RTX 3060 Ti with 8GB of VRAM. To run our model, it needs to fit in VRAM--and I already told you about my parallelism requirement, so we're going to have to do a bit of code golfing.

There are three model characteristics that I needed to consider in selecting my model. I'll briefly describe them and link out to IBM, who can explain each in more detail:

Model Parameters

Quantization

Context Window

I'm not sure how much this affects non-parallel deployments, but this is a factor in computing the KV cache size. Fortunately, you can adjust this in vLLM, but it's good to know what size the model supports. Bigger context = bigger cache

This is the most important factor in how much "intelligence" the model has. More parameters = bigger model. Many models are available with different numbers of parameters.

Quantization is part compression, part optimization. It shrinks your model by reducing the precision of certain weights, but it that also increases the speed of execution. It can also make your model more unstable, so try to strike a balance.

What is a context window? | IBM

The context window (or “context length”) of a large language model (LLM) is the amount of text, in tokens, that the model can consider or “remember” at once.

https://www.ibm.com/think/topics/context-window

What are Model Parameters? | IBM

Model parameters are the internal configuration variables of a machine learning model which control how it processes data and makes predictions.

https://www.ibm.com/think/topics/model-parameters

What is Quantization? | IBM

Quantization is the process of reducing the precision of a digital signal, typically from a higher-precision format to a lower-precision format.

https://www.ibm.com/think/topics/quantization

Here are a couple of places where you can look for models:

Ollama Search

Search for models on Ollama.

https://ollama.com/search

Models – Hugging Face

Explore machine learning models.

https://huggingface.co/models

I recommend starting here, even if you're not using Ollama. It's a good place to see what's new and popular, and it concisely displays the available parameter versions.

Once you've got a clue, you can take your search over here to find a specific model that meets size/quantization constraints.

The model that I decided to run was Qwen3-4B-Thinking-2507, because it's small, recent, and scores well on benchmarks that don't mean all that much to me because I'm not a professional in this field.

Qwen/Qwen3-4B-Thinking-2507 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507

The base model on Hugging Face:

The quantized variant I'm actually running:

JunHowie/Qwen3-4B-Thinking-2507-GPTQ-Int4 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

https://huggingface.co/JunHowie/Qwen3-4B-Thinking-2507-GPTQ-Int4

The Hugging Face interface had a bit of a learning curve, so here are some scribbleshots that might save you some headache:

Step 4: Pick your extras

You probably want to use your LLM somehow, so here are some services that you might consider standing up next to your LLM. Open WebUI is a bit of a no-brainer--I found it a super easy way to test my model out once deployed.

LiteLLM

LLM Gateway (OpenAI Proxy) to manage authentication, loadbalancing, and spend tracking across 100+ LLMs. All in the OpenAI format.

https://www.litellm.ai/

Open WebUI

Open WebUI is an extensible, self-hosted interface for AI that adapts to your workflow, all while operating entirely offline; Supported LLM runners include Ollama and OpenAI-compatible APIs.

https://openwebui.com/

Letta

The platform for stateful agents. Build AI agents with long-term memory, advanced reasoning, and custom tools using the Letta API and Agent Development Environment (ADE).

https://www.letta.com/

GitHub - charmbracelet/crush: The glamourous AI coding agent for your favourite terminal 💘

The glamourous AI coding agent for your favourite terminal 💘 - charmbracelet/crush

https://github.com/charmbracelet/crush

GitHub - jekalmin/extended_openai_conversation: Home Assistant custom component of conversation agent. It uses OpenAI to control your devices.

Home Assistant custom component of conversation agent. It uses OpenAI to control your devices. - jekalmin/extended_openai_conversation

https://github.com/jekalmin/extended_openai_conversation

Step 5: Make it work!

I can only talk about my experience, as this isn't a real guide--but here are links that helped me.

Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit

Install the NVIDIA GPU driver for your Linux distribution. NVIDIA recommends installing the driver by using the package manager for your distribution. For information about installing the driver with a package manager, refer to the NVIDIA Driver Installation Quickstart Guide. Alternatively, you can install the driver by downloading a .run installer.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

My homeserver is fully managed via Podman Quadlets and Kubernetes manifests. If you're in a similar boat, you'll need to do some finagling to get your GPU into your container.

Using Docker - vLLM

vLLM offers an official Docker image for deployment. The image can be used to run OpenAI compatible server and is available on Docker Hub as vllm/vllm-openai.

https://docs.vllm.ai/en/v0.11.0/deployment/docker.html

My only advice on this front is that I couldn't get GPU mounting to work with podman kube play. I had to write a .container Quadlet in the end.

vLLM Engine Arguments | Unsloth Documentation

vLLM engine arguments, flags, options for serving models on vLLM.

https://docs.unsloth.ai/basics/inference-and-deployment/vllm-guide/vllm-engine-arguments

I also needed to include SecurityLabelDisable=true in my Quadlet, because of SELinux. Otherwise, my container was unable to access the GPU.

podman-systemd.unit — Podman documentation

name.container, name.volume, name.network, name.kube name.image, name.build name.pod, name.artifact

https://docs.podman.io/en/latest/markdown/podman-systemd.unit.5.html

If it's useful, here's the entire Quadlet I used to get vLLM running:

Step 6: Enjoy!

I hope you found something useful or interesting in this mess. Best of luck!