Self-Hosting an LLM: A Scatter Pack

I got nerd-sniped (I think that's the term?) into running a text-generation LLM at home on a GPU I had lying around. In lieu of a well-formed blogpost, here's a collection of things that helped me along the way.

I'm calling this a scatter pack because it's kind of like a starter pack, but it's mostly a scattered mix of advice, tutorial, curation, and post-mortem blogging. I'm not getting very far into implementation detail, and I probably won't cover your use case.

Step 1: Do a little dreaming

What do you want to do with a model? Is this just to say you can do it, or do you want to accomplish something?

Here are some things that inspired me to host my own model:

Step 2: Choose your server

I'm assuming that you want to run an off-the-shelf model, and boy-oh-boy are there a lot of them. Deciding what you'll use to run your model will help you refine your search. For me, it came down to parallelism:

Parallelism = you can have many completions running at the same time.

No parallelism = each request has to wait for the one before it to finish.
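To make that difference concrete, here's a rough sketch in Python: fire a few completions at an OpenAI-compatible endpoint at the same time and see how long they take. The URL, port, and model name are placeholders for whatever you end up running. On a parallel server the total wall-clock time is close to the slowest single request; on a serial one it's roughly the sum of all of them.

# Rough sketch: send several completions concurrently and time them.
# Assumes an OpenAI-compatible server; adjust BASE_URL and MODEL for yours.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1"   # vLLM's default port; Ollama listens on 11434
MODEL = "your-model-name-here"          # placeholder

def complete(prompt):
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={"model": MODEL, "prompt": prompt, "max_tokens": 64},
        timeout=120,
    )
    return resp.json()["choices"][0]["text"]

prompts = [f"Write a haiku about server number {i}." for i in range(4)]
start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(complete, prompts))
print(f"4 completions in {time.time() - start:.1f}s")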

I started with Ollama, but I quickly found out I needed to switch to vLLM. Here are some options, in addition to those two:

Why are we selecting the server before the model?

Again, I'm focusing on picking an off-the-shelf model, so your choice of server can introduce extra constraints on your selection. For example:

If you choose vLLM or another parallelized server, you'll have to select a smaller model than you otherwise would, because the key-value cache needs to live in VRAM alongside the model weights.

If you have Apple Silicon, you probably want to use MLX, which has its own format that models need to be explicitly converted into.

Ollama has its own catalog of models that you can deploy from, which can be easier to use than Hugging Face.

These are all conditions I wish I had known about before I started shopping for models. I got excited about running DeepSeek R1 7B, but I ended up having to run a smaller model.

Here are some questions you should ask yourself during this process:

Where will it run (e.g. desktop environment or container)?

Does it need to be network available?

How many people/applications would like to access your server at once?

Step 3: Choose a model

Okay, hopefully you know what you're gonna use to run your model. If you're lost or confused, I'd recommend Ollama as the place to start.

The GPU I'm working with is an RTX 3060 Ti with 8GB of VRAM. Whatever model I run needs to fit in that VRAM--and I already told you about my parallelism requirement, so we're going to have to do a bit of code golfing.

There are three model characteristics that I needed to consider in selecting my model. I'll briefly describe them and link out to IBM, who can explain each in more detail:

Model Parameters

This is the most important factor in how much "intelligence" the model has. More parameters = bigger model. Many models are available with different numbers of parameters.

Quantization

Quantization is part compression, part optimization. It shrinks your model by reducing the precision of certain weights, which also speeds up execution. It can also make your model more unstable, so try to strike a balance.

Context Window

I'm not sure how much this affects non-parallel deployments, but it's a factor in computing the KV cache size. Fortunately, you can adjust this in vLLM, but it's good to know what size the model supports. Bigger context = bigger cache.
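If you want to put rough numbers on all of that, here's some napkin math in Python. The architecture figures below (layer count, KV heads, head size) are ballpark assumptions for a 4B-class model with grouped-query attention, not the exact values for any particular model--check the model's config.json for the real ones.

# Rough VRAM budgeting: weights + KV cache have to fit on the card.
# All architecture numbers here are illustrative assumptions.
GIB = 1024**3

params = 4e9               # a "4B" model
bytes_per_weight = 0.5     # ~4-bit quantization; 2.0 for fp16
weights_gib = params * bytes_per_weight / GIB

n_layers = 36              # assumed
n_kv_heads = 8             # assumed (grouped-query attention)
head_dim = 128             # assumed
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2  # K and V, fp16

for context in (8_192, 32_768):
    kv_gib = kv_bytes_per_token * context / GIB
    print(f"weights ~{weights_gib:.1f} GiB + KV cache at {context} tokens ~{kv_gib:.1f} GiB")

# On my 8GB card, a 4-bit 4B model (~2 GiB of weights) plus a full 32k context
# (~9 GiB of cache) clearly doesn't fit, which is why trimming the context window matters.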

Here are a couple of places where you can look for models:

I recommend starting with Ollama's model library, even if you're not using Ollama. It's a good place to see what's new and popular, and it concisely displays the available parameter versions.

Once you've got a clue, you can take your search over to Hugging Face to find a specific model that meets your size/quantization constraints.

The model I decided to run was Qwen3-4B-Thinking-2507: it's small, recent, and scores well on benchmarks that, admittedly, don't mean all that much to me since I'm not a professional in this field.

The base model on Hugging Face:

The quantized variant I'm actually running:

The Hugging Face interface had a bit of a learning curve, so here are some scribbleshots that might save you some headache:
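If you'd rather script the size check than squint at the web UI, the huggingface_hub client can list a repo's file sizes, which is a decent proxy for how big the weights are. A rough sketch (the repo ID here is the base model above; swap in whatever you're considering):

# Sketch: sum a repo's weight files to estimate the model's on-disk size.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

repo_id = "Qwen/Qwen3-4B-Thinking-2507"  # the base model; swap in your candidate
info = HfApi().model_info(repo_id, files_metadata=True)

weight_bytes = sum(
    f.size or 0
    for f in info.siblings
    if f.rfilename.endswith((".safetensors", ".gguf", ".bin"))
)
print(f"{repo_id}: ~{weight_bytes / 1024**3:.1f} GiB of weight files")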

Step 4: Pick your extras

You probably want to use your LLM somehow, so here are some services that you might consider standing up next to it. Open WebUI is a bit of a no-brainer--I found it a super easy way to test my model once it was deployed.
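Beyond a chat UI, anything that speaks the OpenAI API can point at the same server, which is how most of these extras hook in. Here's a rough sketch with the official Python client--the host, port, and model name are placeholders for whatever your deployment uses:

# Sketch: talk to the self-hosted server with the OpenAI Python client.
# Requires: pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default; adjust for your host/port
    api_key="not-needed",                 # placeholder unless you set an API key on the server
)

reply = client.chat.completions.create(
    model="Qwen/Qwen3-4B-Thinking-2507",  # whatever name your server registered
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(reply.choices[0].message.content)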

Step 5: Make it work!

I can only talk about my experience, as this isn't a real guide--but here are links that helped me.

My homeserver is fully managed via Podman Quadlets and Kubernetes manifests. If you're in a similar boat, you'll need to do some finagling to get your GPU into your container.

My only advice on this front is that I couldn't get GPU mounting to work with podman kube play. I had to write a .container Quadlet in the end.

I also needed to include SecurityLabelDisable=true in my Quadlet, because of SELinux. Otherwise, my container was unable to access the GPU.
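To make that concrete, the GPU and SELinux pieces of a .container unit look something like this (the image, port, and Exec line are illustrative; the AddDevice and SecurityLabelDisable lines are the ones that matter):

# Sketch of the relevant bits of a vLLM .container unit
[Container]
Image=docker.io/vllm/vllm-openai:latest
PublishPort=8000:8000
# Hand the GPU to the container via CDI (needs nvidia-container-toolkit;
# generate the spec first, e.g. nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml)
AddDevice=nvidia.com/gpu=all
# Without this, SELinux blocks the container from reaching the GPU devices
SecurityLabelDisable=true
# Exec= passes arguments straight to vLLM, e.g. the model ID and a trimmed context
# Exec=--model Qwen/Qwen3-4B-Thinking-2507 --max-model-len 8192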

If it's useful, here's the entire Quadlet I used to get vLLM running:

Step 6: Enjoy!

I hope you found something useful or interesting in this mess. Best of luck!