Self-Hosted AI Inference Cluster

Jun 1, 2025 · 2 min read

Overview

Running local LLMs isn’t a hobby — it’s a production system. This cluster powers everything from agentic coding workflows to personal AI assistance, running entirely on hardware I own with zero cloud dependency.

The core setup: dual NVIDIA RTX 3090s (48GB VRAM combined) running vLLM with tensor parallelism to serve Qwen 3.6 35B. 100K context window, 8 concurrent sequences, and the kind of throughput that actually makes local models usable for real work.

Hardware

  • GPU: 2x NVIDIA RTX 3090 (24GB VRAM each, 48GB combined via NVLink)
  • CPU: AMD Ryzen 9 5950X (16 cores)
  • RAM: 128GB DDR4
  • Storage: NVMe for model weights and cache

Software Stack

  • Inference: vLLM with tensor parallelism
  • Model: Qwen 3.6 35B (quantized)
  • Orchestration: Docker containers
  • API: OpenAI-compatible endpoint
  • Context: 100K token context window

What It Powers

  • Agentic coding workflows (this conversation, for example)
  • Code review and analysis
  • Research and documentation assistance
  • Personal AI tasks across the household
  • Model evaluation and benchmarking

Why Self-Hosted

Privacy, cost, and control. Every API call to a cloud LLM is a data leak waiting to happen. Every token costs money. Every rate limit is a deadline on your productivity. Self-hosted means your AI works when you need it, handles whatever you throw at it, and never charges per token.

The initial hardware investment pays for itself in a few months compared to equivalent cloud API usage. After that, it’s just electricity.

Derek Armstrong - Software Engineer · AI · Infrastructure
Authors
Software Engineer · AI · Infrastructure
I’m Derek — software engineer, infrastructure nerd, and chronic tinkerer. 10+ years building payment platforms, production systems, and the kind of infrastructure that has to work at 3am whether I’m awake or not. When I’m not at my day job, I’m running local LLMs on dual 3090s, 3D printing things my wife didn’t ask for, and writing about all of it here. Topics range from payments architecture and DevOps to self-hosted AI and whatever I broke this week.