Self-Hosted AI Inference Cluster

Jun 1, 2025 · 2 min read

Overview

Running local LLMs isn’t a hobby — it’s a production system. This cluster powers everything from agentic coding workflows to personal AI assistance, running entirely on hardware I own with zero cloud dependency.

The core setup: dual NVIDIA RTX 3090s (48GB VRAM combined) running vLLM with tensor parallelism to serve Qwen 3.6 35B. 100K context window, 8 concurrent sequences, and the kind of throughput that actually makes local models usable for real work.

Hardware

GPU: 2x NVIDIA RTX 3090 (24GB VRAM each, 48GB combined via NVLink)
CPU: AMD Ryzen 9 5950X (16 cores)
RAM: 128GB DDR4
Storage: NVMe for model weights and cache

Software Stack

Inference: vLLM with tensor parallelism
Model: Qwen 3.6 35B (quantized)
Orchestration: Docker containers
API: OpenAI-compatible endpoint
Context: 100K token context window

What It Powers

Agentic coding workflows (this conversation, for example)
Code review and analysis
Research and documentation assistance
Personal AI tasks across the household
Model evaluation and benchmarking

Why Self-Hosted

Privacy, cost, and control. Every API call to a cloud LLM is a data leak waiting to happen. Every token costs money. Every rate limit is a deadline on your productivity. Self-hosted means your AI works when you need it, handles whatever you throw at it, and never charges per token.

The initial hardware investment pays for itself in a few months compared to equivalent cloud API usage. After that, it’s just electricity.

Last updated on Jun 1, 2025

Ai Vllm Cuda Docker Infrastructure Self-Hosted

Authors

Derek Armstrong

Software Engineer · AI · Infrastructure

I’m Derek — software engineer, infrastructure nerd, and chronic tinkerer. 10+ years building payment platforms, production systems, and the kind of infrastructure that has to work at 3am whether I’m awake or not. When I’m not at my day job, I’m running local LLMs on dual 3090s, 3D printing things my wife didn’t ask for, and writing about all of it here. Topics range from code to infrastructure, AI, and whatever I broke this week.

← Fraud Detection System Rewrite Jun 1, 2025

CLI AI Agents to Apple Ecosystem Bridge Mar 1, 2025 →