<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Cuda | Derek Armstrong — Software Engineer · AI · Infrastructure</title><link>https://derekarmstrong.dev/tags/cuda/</link><atom:link href="https://derekarmstrong.dev/tags/cuda/index.xml" rel="self" type="application/rss+xml"/><description>Cuda</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 01 Jun 2025 00:00:00 +0000</lastBuildDate><image><url>https://derekarmstrong.dev/media/sharing.png</url><title>Cuda</title><link>https://derekarmstrong.dev/tags/cuda/</link></image><item><title>Self-Hosted AI Inference Cluster</title><link>https://derekarmstrong.dev/projects/self-hosted-ai-inference-cluster/</link><pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate><guid>https://derekarmstrong.dev/projects/self-hosted-ai-inference-cluster/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;Running local LLMs isn&amp;rsquo;t a hobby — it&amp;rsquo;s a production system. This cluster powers everything from agentic coding workflows to personal AI assistance, running entirely on hardware I own with zero cloud dependency.&lt;/p&gt;
&lt;p&gt;The core setup: dual NVIDIA RTX 3090s (48GB VRAM combined) running vLLM with tensor parallelism to serve Qwen 3.6 35B. It is configured for a 100K-token context window and 8 concurrent sequences, with enough throughput to make local models usable for real work.&lt;/p&gt;
&lt;h2 id="hardware"&gt;Hardware&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: 2x NVIDIA RTX 3090 (24GB VRAM each, 48GB total, connected via NVLink)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: AMD Ryzen 9 5950X (16 cores)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM&lt;/strong&gt;: 128GB DDR4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage&lt;/strong&gt;: NVMe for model weights and cache&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="software-stack"&gt;Software Stack&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inference&lt;/strong&gt;: vLLM with tensor parallelism&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model&lt;/strong&gt;: Qwen 3.6 35B (quantized)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;: Docker containers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API&lt;/strong&gt;: OpenAI-compatible endpoint&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context&lt;/strong&gt;: 100K token context window&lt;/li&gt;
&lt;/ul&gt;
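&lt;p&gt;As a rough sketch of how the settings above might map onto a vLLM launch, the snippet below assembles a &lt;code&gt;vllm serve&lt;/code&gt; command line. The model ID, the port, and the reading of 100K as 100,000 tokens are assumptions for illustration, not the actual configuration of this cluster.&lt;/p&gt;

```python
# Sketch only: builds a `vllm serve` command line for the stack listed above.
# MODEL_ID is a hypothetical placeholder, not the actual model path in use.
import shlex

MODEL_ID = "your-org/your-quantized-model"  # assumption: replace with the real model


def build_vllm_command(model, gpus=2, max_context=100_000, max_seqs=8, port=8000):
    """Return the argv list for launching the vLLM OpenAI-compatible server."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(gpus),   # shard the model across both GPUs
        "--max-model-len", str(max_context),   # assumption: 100K read as 100,000 tokens
        "--max-num-seqs", str(max_seqs),       # cap on concurrently scheduled sequences
        "--port", str(port),                   # assumption: the default vLLM port
    ]


if __name__ == "__main__":
    # Print a shell-ready command rather than launching anything.
    print(shlex.join(build_vllm_command(MODEL_ID)))
```

&lt;p&gt;Building the arguments as a list keeps the three capacity knobs (GPU count, context length, concurrent sequences) in one place, which makes it easier to pass the same values to a Docker container entrypoint.&lt;/p&gt;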
&lt;h2 id="what-it-powers"&gt;What It Powers&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Agentic coding workflows (this conversation, for example)&lt;/li&gt;
&lt;li&gt;Code review and analysis&lt;/li&gt;
&lt;li&gt;Research and documentation assistance&lt;/li&gt;
&lt;li&gt;Personal AI tasks across the household&lt;/li&gt;
&lt;li&gt;Model evaluation and benchmarking&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="why-self-hosted"&gt;Why Self-Hosted&lt;/h2&gt;
&lt;p&gt;Privacy, cost, and control. Every API call to a cloud LLM is a data leak waiting to happen. Every token costs money. Every rate limit is a cap on your productivity. Self-hosted means your AI works when you need it, handles whatever you throw at it, and never charges per token.&lt;/p&gt;
&lt;p&gt;The initial hardware investment pays for itself in a few months compared to equivalent cloud API usage. After that, it&amp;rsquo;s just electricity.&lt;/p&gt;</description></item></channel></rss>