<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Vllm | Derek Armstrong — Staff Engineer &amp; Solutions Architect</title><link>https://derekarmstrong.dev/tags/vllm/</link><atom:link href="https://derekarmstrong.dev/tags/vllm/index.xml" rel="self" type="application/rss+xml"/><description>Vllm</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://derekarmstrong.dev/media/sharing.png</url><title>Vllm</title><link>https://derekarmstrong.dev/tags/vllm/</link></image><item><title>Self Hosted AI: Actually Running Local LLMs for a Multi-User Household</title><link>https://derekarmstrong.dev/blog/self-hosted-ai-multi-user-household/</link><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><guid>https://derekarmstrong.dev/blog/self-hosted-ai-multi-user-household/</guid><description>&lt;h2 id="key-takeaways"&gt;Key Takeaways&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Self-hosting AI is the natural extension of the homelab philosophy: own your data, own your stack, own your intelligence.&lt;/li&gt;
&lt;li&gt;Ollama is great for solo tinkering but hits concurrency bottlenecks with certain models (especially Qwen) in multi-user setups.&lt;/li&gt;
&lt;li&gt;vLLM with PagedAttention delivers true parallel processing for enterprise-grade homelab workloads.&lt;/li&gt;
&lt;li&gt;Qwen 27B Dense excels at reasoning and coding, while Qwen 35B MoE wins on speed and context window size.&lt;/li&gt;
&lt;li&gt;AI tooling is 90% knowledge management and 10% inference; you need good documentation more than you need a shiny new model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To be real and honest, I did not start this homelab journey to become a local LLM overlord. I started it with a simple, pragmatic goal of hosting my own internal services. I wanted media, Git, storage, sync, backups, and game servers. That is the fun part of a homelab. It is the joy of knowing that if the internet goes down for a day, nothing really changes for me. I still have my movies, my code, my photos, and my games.&lt;/p&gt;
&lt;p&gt;It is also about having a whole lot fewer cloud accounts to manage, pay for, and keep secure. Why do I need an account on six different platforms just to exist in the modern world? I would rather have my data on a disk I physically own.&lt;/p&gt;
&lt;p&gt;Old school. Wild, I know.&lt;/p&gt;
&lt;p&gt;So, when I finally got serious about hosting local AI, it was not a departure from that philosophy. It was the natural conclusion of it. I wanted to cut the cloud inference costs, sure. But I also wanted 24/7 uptime for workflows like posting to socials, scanning the news, or maybe even crunching my grocery list, without feeding my brain to a corporation.&lt;/p&gt;
&lt;p&gt;Here is what I have learned so far running this setup, from hardware bottlenecks to the actual superpowers of self hosted inference.&lt;/p&gt;
&lt;h2 id="the-ollama-bottleneck-and-learning-the-quirks"&gt;The Ollama Bottleneck and Learning the Quirks&lt;/h2&gt;
&lt;p&gt;When I first got into local LLMs, I started with Ollama. If you are just dipping your toes in, it is fantastic. It is plug-and-play, handles quantized models effortlessly, and integrates with pretty much every UI or tool out there.&lt;/p&gt;
&lt;p&gt;But I hit a snag with concurrency.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s be fair to Ollama. The platform handles parallel requests just fine. The limitation I ran into was actually a bug specific to certain Qwen models at the time of writing. For some reason, those models queue requests instead of processing them in parallel.&lt;/p&gt;
&lt;p&gt;It does not matter if you have a 3x RTX 3090 setup with 72GB of total VRAM. Those models just will not run concurrently out of the box through Ollama.&lt;/p&gt;
&lt;p&gt;Here is the scenario that broke me. My wife is running a long code generation, my son is asking a basic question, and I am trying to run a research agent. The system just waits. One at a time. With a multi-user household and parallel automation workflows, that bottleneck made the whole system feel slow. I did not want my son waiting 60 seconds for a joke just because the model was busy helping me with a Dockerfile.&lt;/p&gt;
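&lt;p&gt;If you want to check whether your backend is actually queueing, a quick probe is enough. Here is a minimal sketch, assuming Ollama&amp;rsquo;s OpenAI-compatible endpoint on the default port and a hypothetical model tag; substitute whatever you actually pulled.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Rough concurrency probe: fire identical requests at once and compare timings.
# If they finish back to back instead of together, requests are being queued.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Assumption: Ollama exposing its OpenAI-compatible API on the default port.
client = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="unused")

async def one_request(i: int) -&gt; None:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="qwen2.5:32b",  # assumption: swap in the tag you actually run
        messages=[{"role": "user", "content": "Tell me a short joke."}],
    )
    print(f"request {i} finished after {time.perf_counter() - start:.1f}s")

async def main() -&gt; None:
    await asyncio.gather(*(one_request(i) for i in range(4)))

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Roughly equal finish times mean real parallelism; a staircase of times means you are stuck in a queue.&lt;/p&gt;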
&lt;p&gt;There are always pros and cons to every decision. Ollama is the easiest path and it is great for solo tinkering. But running local AI means adapting to today&amp;rsquo;s limitations while the tooling keeps advancing. I could not wait for the bug to be fixed, so I moved.&lt;/p&gt;
&lt;h2 id="enter-vllm-and-enterprise-grade-concurrency"&gt;Enter vLLM and Enterprise Grade Concurrency&lt;/h2&gt;
&lt;p&gt;I needed true parallel processing, and that led me to vLLM.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s be clear. vLLM is heavy. It is built for enterprise workloads, and setting it up initially is an absolute nightmare. You are dealing with Docker container orchestration, manual parameter tuning for your specific GPU topology, and wrestling with headroom allocation.&lt;/p&gt;
&lt;p&gt;But once it is running, it is so worth it.&lt;/p&gt;
&lt;p&gt;vLLM gave me the concurrency I actually needed. It uses PagedAttention, which manages the KV cache the way an operating system pages virtual memory, to handle multiple active sessions without choking. It is OpenAI API-compatible, so it slots into everything like Cursor, Open WebUI, or whatever tool you are messing with, with zero friction.&lt;/p&gt;
&lt;p&gt;I run it in Docker, spinning up different containers pointing to the same local model directory. This lets me test new models with different parameters without re-downloading 200GB of weights every time I want to swap. For personal use, having 4 or 5 concurrent requests running at once covers 99% of my bases.&lt;/p&gt;
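&lt;p&gt;For the curious, this is roughly what that pattern looks like when scripted. It is a sketch under assumptions: the official vllm/vllm-openai image, weights already sitting in a shared directory, and flags you will absolutely need to tune for your own GPU topology.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Hypothetical launcher for the "one model directory, many containers" pattern.
import subprocess

def launch_vllm(name: str, model_dir: str, port: int, gpu_fraction: float) -&gt; None:
    cmd = [
        "docker", "run", "-d", "--gpus", "all", "--ipc=host",
        "--name", name,
        "-v", "/srv/models:/models",  # shared weights, nothing gets re-downloaded
        "-p", f"{port}:8000",
        "vllm/vllm-openai:latest",
        "--model", f"/models/{model_dir}",
        "--gpu-memory-utilization", str(gpu_fraction),
        "--max-model-len", "32768",
    ]
    subprocess.run(cmd, check=True)

# Spin up a candidate model on a spare port without touching the stable container.
launch_vllm("vllm-test", "qwen-27b-dense", 8001, 0.30)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same trick works for parameter experiments: point a second container at the same weights with a different context length or memory fraction and compare.&lt;/p&gt;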
&lt;p&gt;The trade off is that vLLM is rigid. Unlike Ollama, swapping models requires spinning up new containers or changing launch parameters. You will likely run one stable model for your production homelab use.&lt;/p&gt;
&lt;h2 id="the-model-wars-qwen-27b-dense-vs-35b-moe"&gt;The Model Wars: Qwen 27B Dense vs 35B MoE&lt;/h2&gt;
&lt;p&gt;After burning through a few weeks of VRAM and electricity, I locked into the Qwen series. I found they punch way above their weight class, but picking between the 27B Dense and the 35B Mixture of Experts (MoE) depends entirely on your use case.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Qwen 27B Dense&lt;/strong&gt;
This is the workhorse. It does not use the fancy Mixture of Experts architecture. It is a straight-up dense model. I use it because it is the king of reasoning and coding. It is consistent and has great accuracy. It does not hallucinate as much when asked to debug complex logic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Qwen 35B MoE&lt;/strong&gt;
This thing is lightning fast. Because of the architecture, you can fit massive context windows, like 262K, and get high concurrency.&lt;/p&gt;
&lt;p&gt;If you are doing heavy web scraping, summarization, or research where speed is king, go with the 35B MoE. But if I am asking the AI to think through a problem, debug complex logic, or write code, the 27B Dense wins every time. It is just smarter.&lt;/p&gt;
&lt;h2 id="the-superpower-of-knowledge-management"&gt;The Superpower of Knowledge Management&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s stop pretending this is a competition over who has the biggest GPU.&lt;/p&gt;
&lt;p&gt;The biggest realization I have had is this. AI tooling is 90% knowledge management and 10% inference.&lt;/p&gt;
&lt;p&gt;Enterprises are throwing buckets of cash at custom models, hoping for ROI. But an AI is only as good as the documentation you feed it. If you ask it to review code, what matters is not how smart the model is. What matters is whether it knows your infrastructure stack, your deployment rules, and your codebase conventions.&lt;/p&gt;
&lt;p&gt;You do not need the newest, most expensive model on the leaderboard. You need a reliable model that you know how to interact with.&lt;/p&gt;
&lt;p&gt;You do not need to be a wizard who memorizes every granular API detail for 50 different platforms. You just need to be good at documentation, context gathering, and communication. Dump the manuals into a knowledge base, set clear system prompts, and let the AI act as your local, personalized search engine.&lt;/p&gt;
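&lt;p&gt;In practice that can be as simple as prepending your own docs to a system prompt. Here is a minimal sketch, assuming a local vLLM endpoint, a hypothetical markdown file, and a placeholder model name; a real setup would chunk and retrieve rather than paste whole files.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Feed local documentation to the model as context for a review task.
from pathlib import Path

from openai import OpenAI  # pip install openai

# Assumption: a vLLM container serving an OpenAI-compatible API on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Hypothetical knowledge-base file with your deployment conventions.
conventions = Path("docs/deployment-conventions.md").read_text()

reply = client.chat.completions.create(
    model="qwen-27b-dense",  # assumption: whatever name your container serves
    messages=[
        {"role": "system",
         "content": "You are the household infra assistant. Follow these "
                    "deployment rules exactly:\n" + conventions},
        {"role": "user",
         "content": "Review this compose change against our conventions."},
    ],
)
print(reply.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;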
&lt;p&gt;We are moving from the era of knowing the right Google search operators to an era of personalized context synthesis. You are no longer requesting a list of links. You are feeding the AI your baseline knowledge and telling it to go deep, skip the basics, and highlight edge cases.&lt;/p&gt;
&lt;h2 id="the-bottom-line"&gt;The Bottom Line&lt;/h2&gt;
&lt;p&gt;Self-hosting AI is less about having the shiniest new toy and more about betting on boring, reliable tech that actually solves a problem.&lt;/p&gt;
&lt;p&gt;If I can host a media server or Git server, why not an AI?&lt;/p&gt;
&lt;p&gt;Running it locally means I own the stack, the model, and the prompts. I am not limited to whatever agent wrapper a vendor ships this week. I do not have to guess which company is using my context window for training. It is less paranoia and more practical maintenance.&lt;/p&gt;
&lt;p&gt;You only have to worry about the power bill and GPU depreciation. No more per token tax. Your data never leaves your house. And with a local knowledge base and the right inference engine like vLLM, you have a system that actually understands your specific context.&lt;/p&gt;
&lt;p&gt;I could literally move to the middle of nowhere, bring a terabyte of markdown documentation, and still function perfectly. That is the same feeling I get when my ISP connection drops and I realize my movies still play and my music is still there. That is a level of independence the cloud will never give you.&lt;/p&gt;
&lt;h2 id="resources"&gt;Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Ollama - Simple, local LLM deployment&lt;/li&gt;
&lt;li&gt;vLLM - High-throughput inference with PagedAttention&lt;/li&gt;
&lt;li&gt;Qwen - Open-source language models from Alibaba&lt;/li&gt;
&lt;li&gt;Open WebUI - Web UI for local LLMs&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>