Self-Hosted AI Inference Cluster

Overview
Running local LLMs isn’t a hobby — it’s a production system. This cluster powers everything from agentic coding workflows to personal AI assistance, running entirely on hardware I own with zero cloud dependency.
The core setup: dual NVIDIA RTX 3090s (48GB VRAM combined) running vLLM with tensor parallelism to serve Qwen 3.6 35B. 100K context window, 8 concurrent sequences, and the kind of throughput that actually makes local models usable for real work.
Hardware
- GPU: 2x NVIDIA RTX 3090 (24GB VRAM each, 48GB combined via NVLink)
- CPU: AMD Ryzen 9 5950X (16 cores)
- RAM: 128GB DDR4
- Storage: NVMe for model weights and cache
Software Stack
- Inference: vLLM with tensor parallelism
- Model: Qwen 3.6 35B (quantized)
- Orchestration: Docker containers
- API: OpenAI-compatible endpoint
- Context: 100K token context window
What It Powers
- Agentic coding workflows (this conversation, for example)
- Code review and analysis
- Research and documentation assistance
- Personal AI tasks across the household
- Model evaluation and benchmarking
Why Self-Hosted
Privacy, cost, and control. Every API call to a cloud LLM is a data leak waiting to happen. Every token costs money. Every rate limit is a deadline on your productivity. Self-hosted means your AI works when you need it, handles whatever you throw at it, and never charges per token.
The initial hardware investment pays for itself in a few months compared to equivalent cloud API usage. After that, it’s just electricity.