Running Qwen3.6 27B Locally on Dual RTX 3090s with vLLM v0.19

Apr 26, 2026 · Derek Armstrong · 8 min read

There is a certain satisfaction in running a frontier-class model locally that no cloud subscription can replicate. When Qwen3.6 dropped, I wanted it running on my homelab at full capability: 160k context, tool calling for Cline and Roo Code, speculative decoding, the works.

What I did not want was a shallow setup that left performance on the table. This walkthrough covers the full process: every config decision, every error, and what the logs actually mean. If you are running vLLM on consumer Ampere GPUs (RTX 3090, 3080, and friends), most of this is directly applicable.

Key Takeaways

  • Dual RTX 3090s can run Qwen3.6 27B AWQ-INT4 at usable agentic-coding performance with a 160k context window.
  • The biggest wins came from FP8 KV cache, FlashInfer attention plus sampler, and MTP speculative decoding tuned to one speculative token.
  • vLLM v0.19 V1 startup diagnostics are strict but useful: most “mystery” init crashes are actually memory hygiene issues.
  • On this setup, throughput landed around 116 to 124 tok/s batched, which is a major jump over previous Ollama-based serving.
  • Small config details matter. Changing block size from 32 to 16 alone reclaimed hundreds of MB of KV cache capacity.

🎯 The Goal

I wanted a production-ish local endpoint for real coding workflows, not a benchmark screenshot. That meant long context, stable tool-call parsing, strong multi-turn coherence, and enough throughput to keep Cline sessions feeling responsive instead of conversational molasses.

πŸ–₯️ Hardware

  • 2x NVIDIA RTX 3090 24GB (48GB VRAM total)
  • AMD Ryzen 9 5950X
  • Unraid with Docker containers

Dual 3090s are the key constraint. They are sm86 (Ampere): still very capable, but missing some newer architecture niceties. Tensor parallelism across both cards runs over PCIe with NCCL, not NVLink-style symmetric memory. It works fine, but there is overhead.
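A quick way to confirm how the two cards actually talk to each other is nvidia-smi's topology matrix. With no NVLink bridge installed, the GPU0/GPU1 cell shows a PCIe path (PIX, PXB, PHB, or NODE) rather than an NV# link:

nvidia-smi topo -m
# GPU0-GPU1 = PHB or NODE: tensor-parallel traffic crosses the PCIe/host bridge.
# GPU0-GPU1 = NV1/NV2: an NVLink bridge is present and active.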

🧠 The Model: Qwen3.6 27B AWQ-INT4

Qwen3.6 27B is a dense transformer, so all 27 billion parameters activate for every forward pass. That is different from MoE variants like 35B-A3B, and for this use case that distinction matters.

For agentic coding in Cline and Roo Code, where the model must track many tool-call results across long contexts and still emit reliable JSON, dense behavior is often an advantage: every token gets the full network. MoE buys speed by routing each token through only a subset of experts, but that can trade away some long-range consistency in complex sessions.

The quant used here is cyankiwi/Qwen3.6-27B-AWQ-INT4, a BF16-INT4 AWQ model in compressed-tensors format. vLLM can run this directly through MarlinLinearKernel on Ampere.
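If you would rather pre-pull the weights than let the container download them on first start, the Hugging Face CLI handles it; the local path here is just an Unraid-flavored example:

# Download once on the host, then bind-mount the directory into the container.
pip install -U "huggingface_hub[cli]"
huggingface-cli download cyankiwi/Qwen3.6-27B-AWQ-INT4 \
  --local-dir /mnt/user/models/Qwen3.6-27B-AWQ-INT4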

βš™οΈ Why vLLM v0.19

vLLM v0.19.1 (April 2026) runs V1 by default. V1 is a major engine redesign: it isolates EngineCore and overlaps tokenization, scheduling, and streaming with model execution instead of serializing the whole pipeline. The practical result is materially better throughput on the same hardware.

For this workload, two V1 features were especially relevant:

  • Zero-bubble async scheduling that can coexist with speculative decoding
  • Piecewise CUDA graphs, which help with more complex model architectures such as Qwen3.6's hybrid Mamba/attention stack

🧩 Final Launch Configuration

# vLLM server arguments (passed straight to the vllm/vllm-openai container)
cyankiwi/Qwen3.6-27B-AWQ-INT4 \
  --dtype bfloat16 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --gpu-memory-utilization 0.8349 \
  --max-model-len 160000 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 16384 \
  --block-size 16 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --attention-backend FLASHINFER \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --generation-config vllm \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

Environment variable:

VLLM_USE_FLASHINFER_SAMPLER=1

πŸ” Config Decisions Explained

--gpu-memory-utilization 0.8349

This sets the fraction of total VRAM vLLM is allowed to claim, not a dynamic runtime cap and not a KV-cache-only budget. vLLM profiles memory at startup, subtracts the weight footprint and activation overhead, then allocates KV blocks from whatever headroom remains under this ceiling.

With 48GB total VRAM and roughly 9.72GB used by weights (from startup logs), 0.8349 yielded around 7.7GB per GPU for KV cache, roughly 118,400 KV tokens total.

The oddly specific 0.8349 came straight from vLLM startup recommendations. v0.19 has more accurate CUDA graph profiling, and the log suggested bumping from 0.83 to 0.8349 to preserve equivalent effective KV capacity.

Important caveat: this assumes clean GPUs at container start. If Ollama, ComfyUI, or a stale container still holds VRAM, V1 now validates free memory up front and hard-fails with a clear message.

--kv-cache-dtype fp8

FP8 KV cache roughly halves KV memory versus BF16, which is what makes 160k context feasible on 48GB. Logs mention possible accuracy impact without scaling factors, but in this workload the practical tradeoff was negligible.

--tensor-parallel-size 2 + --disable-custom-all-reduce

The model is sharded across both GPUs. Disabling custom all-reduce falls back to plain NCCL all-reduce instead of vLLM's custom all-reduce kernels, which are built around NVLink-class peer-to-peer links. For PCIe-connected 3090s, this is the right call.

--attention-backend FLASHINFER

FlashInfer (bundled in vllm/vllm-openai:latest) improved decode-heavy performance versus default FlashAttention2 on this setup. Pairing it with VLLM_USE_FLASHINFER_SAMPLER=1 moves both attention and sampling to FlashInfer kernels.

Startup confirmation looked like this:

Using FlashInfer for top-p & top-k sampling.
Using AttentionBackendEnum.FLASHINFER backend.
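After any backend-related config change, it is worth re-checking that both pieces actually engaged; grepping the container log for those lines is enough (the container name here is an assumption):

docker logs vllm 2>&1 | grep -i flashinfer
# Expect both the sampler line and the attention backend line shown above.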

--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

MTP speculative decoding runs a lightweight multi-token-prediction draft head ahead of the main model; draft tokens the main model accepts during verification come almost for free. In practice this gave meaningful decode gains.

Two practical choices mattered:

  • "method":"mtp" is the modern path for v0.19
  • num_speculative_tokens=1 performed better than 2 in this quantized setup because the acceptance rate did not justify the extra draft compute (re-testing that tradeoff is a one-line change, sketched below)
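If you want to re-run that comparison on your own hardware or quant, only the speculative config changes; everything else in the launch command stays the same:

# Deeper draft: compare decode tok/s against num_speculative_tokens=1 under a real workload.
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'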

--block-size 16

Qwen3.6's hybrid Mamba/attention layers are sensitive to cache page alignment. At block size 32, the logs showed padding overhead:

Add 3 padding layers, may waste at most 6.25% KV cache memory

Dropping to 16 removed that waste. At roughly 7.7GB KV cache, 6.25% is about 480MB, which is not a rounding error when you care about long-context headroom.

--enable-prefix-caching + --enable-chunked-prefill

Prefix caching is huge for agentic sessions with repeated system prompts and code context. Once cached, repeated turns avoid redoing the same heavy prefill.

Chunked prefill prevents large prefill operations from monopolizing the engine, which keeps multi-request latency steadier.
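To actually collect those prefix-cache wins, the cached prefix has to stay byte-identical across requests: same system prompt, same leading context, new turns appended at the end. A curl sketch of that request shape (endpoint defaults and prompt text are placeholders):

# Keep the system prompt byte-identical across turns so its KV blocks get reused;
# only the trailing user/assistant messages should change.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "cyankiwi/Qwen3.6-27B-AWQ-INT4",
    "messages": [
      {"role": "system", "content": "You are a coding agent with access to these tools: ..."},
      {"role": "user", "content": "Refactor utils/io.py to use asyncio."}
    ],
    "max_tokens": 512
  }'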

πŸ› Debugging Startup Failures

Two failures were worth documenting because both were misleading at first glance.

Failure 1: WorkerProc Init Error That Was Really VRAM Contention

The first crash looked like a FlashInfer compatibility problem: worker process failure during init_device. Root cause was much simpler. Ollama was still holding about 21.8GB on each GPU.

V1 in v0.19 now validates free memory before loading weights. With about 1.74GB free per GPU against a roughly 19.56GB requirement, fail-fast behavior is expected.

Fix: run nvidia-smi before launch and confirm both GPUs are basically clear. On Unraid, “idle” is not the same as “released VRAM.” Containers must be stopped.
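Something like this works as a pre-launch sanity check; the query flags are just one way to do it, and the container names are examples from this homelab:

# Confirm both GPUs are actually clear, not just "idle".
nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# Stop anything still holding VRAM (Ollama, ComfyUI, stale containers), then re-check.
docker stop ollama comfyui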

Failure 2: Docker Image Selection

The Unraid Community Apps template pointed at a custom qwen3_5-cu130 image that predates Qwen3.6 and may not include FlashInfer.

Use:

vllm/vllm-openai:latest

Using stale community images can produce what looks like a backend compatibility problem when it is really a packaging problem.
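For reference, a minimal docker run sketch against that image; the container name and host cache path are assumptions, and the vLLM flags shown are a subset of the launch configuration above:

docker run -d --name vllm \
  --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -v /mnt/user/models/hf-cache:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  cyankiwi/Qwen3.6-27B-AWQ-INT4 \
  --tensor-parallel-size 2 \
  --max-model-len 160000
# ...followed by the rest of the flags from the launch configuration above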

⏱️ Startup Profile: Why Cold Start Feels Slow

Cold start was around 4 to 5 minutes. Breakdown:

Phase                           Time
Model weights load (27B AWQ)    ~15.5s
Drafter model load (MTP)        ~6.4s
torch.compile backbone          ~44.6s
torch.compile eagle head        ~7.2s
Profiling/warmup run            ~83.7s
CUDA graph capture              ~1s
Total                           ~144s

The shm_broadcast warnings that show up during profiling are informational in this context, usually just worker coordination while compilation completes. After cache warmup, restarts are much faster thanks to /root/.cache/vllm/torch_compile_cache/ reuse.
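Unraid recreates containers fairly often (template edits, image updates), so persisting that cache directory keeps warm restarts warm. The host path here is an assumption; add it to the container's volume mappings:

# Mount the vLLM cache so container re-creation does not force a recompile.
-v /mnt/user/appdata/vllm/cache:/root/.cache/vllm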

πŸ“ˆ Performance Results

Throughput test used 2,000-token completions.

Test     Concurrent    Wall Time    Batched Throughput    Per-Request
Run 1    2             34.2s        116.9 tok/s           58.5 tok/s
Run 2    4             64.6s        123.9 tok/s           31.0 tok/s

Observations:

  • Batched throughput increased with concurrency (116.9 to 123.9 tok/s), indicating remaining GPU headroom at 2-concurrent.
  • Earlier behavior that looked like “slow 4-concurrent” was expected scheduler behavior when --max-num-seqs was set too low: requests beyond the cap queue instead of batching.
  • KV cache capacity math aligned with measured behavior: real tool-call sessions usually run far below 160k/request, so effective concurrency is better than worst-case modeling suggests.

For perspective, prior Ollama serving of Qwen3.5 27B Q4_K_M on this hardware was around 15 to 25 tok/s single-threaded. This setup landed roughly 5 to 7x higher throughput in practical workloads.
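For anyone who wants to reproduce the numbers, a rough probe in the same spirit; the endpoint, prompt, and the assumption that every request emits the full 2,000 tokens (hence ignore_eos) are mine:

#!/usr/bin/env bash
# Fire N concurrent 2,000-token completions and derive batched tok/s from wall time.
HOST=${HOST:-http://localhost:8000}
MODEL="cyankiwi/Qwen3.6-27B-AWQ-INT4"
N=${N:-2}
start=$(date +%s)
for _ in $(seq "$N"); do
  curl -s "$HOST/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"Explain PCIe lane allocation in depth.\",\"max_tokens\":2000,\"temperature\":0,\"ignore_eos\":true}" \
    > /dev/null &
done
wait
elapsed=$(( $(date +%s) - start ))
echo "$N requests in ${elapsed}s -> $(( N * 2000 / elapsed )) tok/s batched"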

πŸ—οΈ Hybrid Architecture Notes (Mamba + Attention)

Qwen3.6 mixes transformer attention and Mamba layers. That is why you see Mamba-specific startup behavior:

Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
Setting attention block size to 1600 tokens to ensure that attention page size is >= mamba page size

align mode keeps Mamba and attention pages coherent for prefix caching. vLLM handles this automatically, but it is useful context when tuning block sizes and understanding why some values create avoidable padding overhead.

Aside: This is one of those places where “it runs” and “it runs well” diverge. The defaults are good. The logs are better.

πŸ§ͺ What I Plan to Test Next

  • Multi-agent orchestration: route complex subagents to this endpoint while a smaller model handles narrow fast tasks
  • Prefix cache observability: scrape /metrics into Grafana and measure actual hit rates per session type (a raw scrape works in the meantime, sketched after this list)
  • Long-context stress tests: validate stability with sustained 100k+ token sessions
  • MTP acceptance telemetry: log acceptance in production and validate whether num_speculative_tokens=1 remains optimal
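Until the Grafana wiring exists, a raw scrape already answers both of those telemetry questions; exact metric names shift between vLLM versions, so grep loosely:

# Prefix cache and speculative decoding counters from the Prometheus endpoint.
curl -s http://localhost:8000/metrics | grep -iE 'prefix|cache|spec|accept'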

πŸ—‚οΈ Final Config Reference

# Docker environment
image: vllm/vllm-openai:latest
runtime: nvidia
ipc: host
environment:
  VLLM_USE_FLASHINFER_SAMPLER: "1"

# vLLM args
model: cyankiwi/Qwen3.6-27B-AWQ-INT4
dtype: bfloat16
quantization: compressed-tensors
kv-cache-dtype: fp8
tensor-parallel-size: 2
disable-custom-all-reduce: true
gpu-memory-utilization: 0.8349
max-model-len: 160000
max-num-seqs: 4
max-num-batched-tokens: 16384
block-size: 16
enable-prefix-caching: true
enable-chunked-prefill: true
attention-backend: FLASHINFER
enable-auto-tool-choice: true
tool-call-parser: qwen3_coder
reasoning-parser: qwen3
speculative-config: '{"method":"mtp","num_speculative_tokens":1}'
generation-config: vllm
trust-remote-code: true

πŸ“š Resources

  1. Difference between Qwen 3.6 27b quants for vLLM (Reddit)
  2. Qwen Speed Benchmark
  3. vLLM Update April 2026: v0.19.0 Features and Upgrade Workflow
  4. vLLM V1 Alpha: A major upgrade to vLLM’s core architecture
  5. vLLM Updates 2026: The Shipping Log
  6. vLLM PR #22684 (MTP + FlashInfer MLA support)
  7. Performance improvements with speculative decoding in vLLM
  8. Speculative Decoding in vLLM docs
  9. vLLM V1 usage guide

If this saved you a few hours of log archaeology, pass it on to someone else trying to make consumer GPUs do unreasonable things.