Self-Hosted AI Inference Cluster

Sun, 01 Jun 2025 00:00:00 +0000

Overview

Running local LLMs isn’t a hobby — it’s a production system. This cluster powers everything from agentic coding workflows to personal AI assistance, running entirely on hardware I own with zero cloud dependency.

The core setup: dual NVIDIA RTX 3090s (48GB VRAM combined) running vLLM with tensor parallelism to serve Qwen 3.6 35B. 100K context window, 8 concurrent sequences, and the kind of throughput that actually makes local models usable for real work.

Hardware

GPU: 2x NVIDIA RTX 3090 (24GB VRAM each, 48GB combined via NVLink)
CPU: AMD Ryzen 9 5950X (16 cores)
RAM: 128GB DDR4
Storage: NVMe for model weights and cache

Software Stack

Inference: vLLM with tensor parallelism
Model: Qwen 3.6 35B (quantized)
Orchestration: Docker containers
API: OpenAI-compatible endpoint
Context: 100K token context window

What It Powers

Agentic coding workflows (this conversation, for example)
Code review and analysis
Research and documentation assistance
Personal AI tasks across the household
Model evaluation and benchmarking

Why Self-Hosted

Privacy, cost, and control. Every API call to a cloud LLM is a data leak waiting to happen. Every token costs money. Every rate limit is a deadline on your productivity. Self-hosted means your AI works when you need it, handles whatever you throw at it, and never charges per token.