One Year Later: The Agentic CLI Revolution Revisited

A year ago, I wrote “The Agentic CLI Revolution” with the wide-eyed enthusiasm of someone who’d just discovered a game-changing tool. I was excited—maybe a little too excited—about AI agents living in our terminals, writing code, and transforming how we build software.
Well, a year has passed. I’ve spent it in my homelab, in production, in code reviews I wouldn’t have caught without AI help — and in a few situations where overconfidence in these tools caused real problems. Time to account for all of it.
Some predictions aged well. Some aged like week-old sushi. And a few things happened that nobody saw coming—including me.
Here’s the honest accounting.
🎯 Key Takeaways
- Self-healing infrastructure was a bad idea: I said it, tried something adjacent to it, and watched similar bets cause real outages. Automated AI changes in production without human review are a category of mistake all their own.
- The killer apps weren’t what anyone expected: PR review augmentation, test generation, and legacy code archaeology dominated actual usage — not AI-written codebases from scratch.
- Context is the multiplier: AI that knows your project, stack, and conventions is dramatically more useful than generic prompts against a blank slate. This seems obvious in retrospect.
- Costs fell faster than expected: What cost me ~$85/month in early 2025 was down to $15-30 by year end. The pricing competition got loud, which is good news for everyone.
- Junior dev outcomes depend on how you deploy it: Teaching mode versus answer mode makes or breaks skill development. Most teams got this wrong initially — including some I observed up close.
📊 What Actually Happened: The Community Shift
The shift was real, but the shape of it surprised me. CLI AI went from niche experiment to standard toolkit faster than I expected — enterprise adoption didn’t lag far behind the hobbyist crowd, costs dropped, and the tooling matured. The big players kept shipping.
But the how of adoption was where I kept getting it wrong. In my own work — day job and homelab — I went from “occasionally useful” to “can’t imagine shipping without it.” Just not in the ways I’d predicted.
🔮 What I Got Right (Surprisingly Few)
Let’s bank the wins before this gets considerably more humbling.
✅ Win #1: CI/CD Integration Actually Happened
AI in CI/CD pipelines went mainstream, just like I predicted. But not the way I expected.
What I predicted:
```yaml
# Complex AI orchestration in pipelines
- name: AI Code Review
  run: |
    copilot review --comprehensive
    copilot fix --auto-apply
    copilot test --generate
```
What actually happened:
```yaml
# Targeted, focused AI operations
- name: Security Analysis
  run: gh copilot security-scan --critical-only
- name: Performance Review
  run: gh copilot perf-check --regression-only
- name: Generate Release Notes
  run: gh copilot release-notes --since-last-tag
```
The lesson: Teams wanted AI for specific high-value tasks, not to replace entire workflows. Think surgical strike, not carpet bombing.
✅ Win #2: The Documentation Revolution
This one exceeded expectations. AI-generated documentation went from “nice to have” to “absolutely essential.”
In my homelab projects:
```bash
# My documentation workflow now
$ gh copilot docs generate \
    --source=./src \
    --include=api,setup,deployment \
    --style=markdown

# Result: My personal projects actually have docs!
# And they stay current because updating them isn't painful
```
At work, we’ve integrated similar patterns into our workflow. The killer feature? AI can compare code changes to existing docs and flag inconsistencies. That documentation debt that always haunted us? Actually manageable now.
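That comparison doesn't even need a model for a first pass. Here's a crude sketch of the idea in plain shell, with hypothetical paths: flag any function defined under `src/` that `docs/` never mentions. An AI-backed check compares meaning rather than just names, but this catches the obvious gaps.

```shell
# Crude "docs drift" detector: list function names defined under src/
# and flag any that docs/ never mentions. Paths are illustrative; an
# AI-backed check would compare semantics, not just names.
find_undocumented() {
  local src_dir="$1" docs_dir="$2"
  grep -rhoE 'def [a-zA-Z_][a-zA-Z0-9_]*' "$src_dir" 2>/dev/null \
    | awk '{print $2}' | sort -u \
    | while read -r fn; do
        grep -rq "$fn" "$docs_dir" 2>/dev/null || echo "undocumented: $fn"
      done
}
```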
For someone like me who’d rather be building than writing docs (but knows docs are crucial), this has been transformative.
✅ Win #3: Lower Expert Barriers
Junior developers using AI to do senior-level work? Absolutely happened.
But it created the problem we’ll get to in the surprises section.
🤦 What I Got Spectacularly Wrong
These predictions aged like milk in the sun.
❌ Miss #1: “Self-Healing Infrastructure”
Remember when I said infrastructure would “literally heal itself”? Yeah… about that.
What I predicted:
```bash
# Magical self-healing
while true; do
  if [[ $HEALTH != "OK" ]]; then
    copilot diagnose --auto-fix
  fi
done
```
What actually happened: This caused three production incidents in the first month of 2025. Turns out, AI making automated infrastructure changes without human review is terrifying.
The few companies that tried this pattern quickly reverted to:
```bash
# What actually works
if [[ $HEALTH != "OK" ]]; then
  DIAGNOSIS=$(copilot diagnose --suggest-only)
  # Human reviews and approves
  echo "$DIAGNOSIS"
  read -p "Apply fix? (y/n) " confirm
  [[ $confirm == "y" ]] && apply_fix "$DIAGNOSIS"
fi
```
The lesson: AI can diagnose brilliantly. But production changes need human judgment. Always.
❌ Miss #2: Cost Assumptions
I massively underestimated how expensive AI operations would be… initially.
My early 2025 reality check:
- Started using AI in my homelab CI/CD
- Small personal project with maybe 50 commits/month
- Each commit triggered multiple AI operations
- First month bill: $85
- My reaction: “Wait, that’s more than my entire VPS budget!”
I saw similar sticker shock discussions across developer communities. People were excited about the tools but nervous about the costs.
What changed everything: Three major developments:
- Model optimization: Claude 3.5 Haiku and GPT-4o-mini dropped costs by 70%
- Caching strategies: Smart prompt caching reduced redundant operations by 80%
- Competitive pressure: Prices dropped as providers competed
By late 2025, my AI costs across all my projects were down to $15-25/month. Less than my coffee budget. Totally sustainable for homelab work.
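The caching piece is worth a closer look, because it's the lever you actually control. A minimal sketch of the idea in plain shell, where `call_model` is a hypothetical stand-in for whatever paid API your tooling hits: hash the prompt, and only pay for a model call when the hash is new.

```shell
# Prompt-response cache: identical prompts hit the API exactly once.
# call_model is a hypothetical stand-in for a real (paid) API call.
CACHE_DIR="${CACHE_DIR:-/tmp/ai-cache}"
mkdir -p "$CACHE_DIR"

call_model() {
  echo "model-response-for: $1"
}

cached_query() {
  local prompt="$1" key
  key=$(printf '%s' "$prompt" | sha256sum | awk '{print $1}')
  if [[ -f "$CACHE_DIR/$key" ]]; then
    cat "$CACHE_DIR/$key"                         # cache hit: zero cost
  else
    call_model "$prompt" | tee "$CACHE_DIR/$key"  # miss: call once, store
  fi
}
```

Real providers now offer server-side prompt caching too, but the client-side version still pays off for repeated CI operations on unchanged files.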
❌ Miss #3: The “AI-First Workflow” Pattern
I thought developers would start with AI describing intent, then refine. Nope.
What actually happened: Developers still code first, then use AI for:
- Refactoring (works well)
- Test generation (works very well)
- Documentation (works surprisingly well)
- Code review (surprisingly nuanced)
The coding-from-scratch use case? Way less common than I thought. Developers still want to write code. They just want AI to handle the tedious parts.
Think of it like this: You’re still the chef. AI just does the dishes.
🎭 What Nobody Predicted: The Surprises
Here’s where it gets genuinely interesting.
🚨 Surprise #1: Security Became a Real Concern
The more AI agents ended up in CI/CD pipelines, the more the security surface area grew — and the slower people were to notice. The risks aren’t hypothetical: prompt injection through code comments, agents making unauthorized writes, generated code with subtle vulnerabilities baked in. I won’t pretend I had all of this in my 2025 threat model.
When I reviewed my homelab CI/CD setups after reading through some incident discussions mid-year, I found gaps I wasn’t proud of. Cleaned them up.
What settled into practice:
```yaml
# My AI-safe pipeline pattern
ai-operations:
  permissions:
    read: [code, logs]
    write: [none]  # AI never writes directly
  verification:
    human-review: required
    automated-checks: [security-scan, test-suite]
  audit:
    log-all-operations: true
    prompt-sanitization: required
    output-validation: required
```
The golden rule: AI can suggest, never commit directly. All AI output goes through review.
🎯 Surprise #2: The Three Killer Apps
CLI AI usage didn’t spread evenly. In my observations and conversations with other developers, three use cases clearly dominated actual usage:
Killer App #1: PR Review Enhancement
```bash
# The pattern everyone actually uses
$ gh pr view 123 --json diffstat | \
    gh copilot review \
      --focus=security,performance,accessibility \
      --style=conversational
```
AI as PR review copilot became indispensable. Not replacing human review—augmenting it.
Killer App #2: Test Generation
```bash
# This became the most ROI-positive AI operation
$ gh copilot tests generate \
    --file=src/auth.js \
    --coverage-target=85 \
    --include-edge-cases
```
Why it worked: Tests are tedious to write, high-value to have, and easy to verify. Perfect AI task.
Killer App #3: Legacy Code Understanding
```bash
# The unexpected champion
$ gh copilot explain \
    --file=legacy/payment_processor.c \
    --depth=detailed \
    --output=documentation/payment-flow.md
```
AI became the archaeology tool for ancient codebases. I’ve seen it used (and used it myself) to:
- Understand code nobody remembered
- Generate documentation for undocumented systems
- Plan refactoring strategies
- Onboard to unfamiliar codebases
In one memorable case at work, we used AI to help understand a legacy payment processing system that the original developers had long since left. It gave us the confidence to actually modernize it instead of being paralyzed by fear of breaking something critical.
🤔 Surprise #3: The Learning Curve Question
The bigger surprise was what happened to junior developers using these tools without guardrails. Is AI a teaching tool or a crutch? The answer turned out to be: entirely depends on how you deploy it.
The concern I’ve observed:
```bash
# Quick-fix approach
$ copilot solve "Why is my API returning 500?"
# AI: "Change line 42 to use try/catch"
$ # Apply fix without understanding why
```
You get answers fast. You ship code faster. But are you building the deep understanding that makes you a better engineer?
What works better: The “AI-Paired Learning” approach I use:
```bash
# Better junior dev workflow
$ copilot explain "Why is my API returning 500?" \
    --teach-me \
    --show-alternatives \
    --explain-trade-offs

# AI teaches, doesn't just fix
# Junior learns debugging skills
# AI suggests they try debugging first
```
When mentoring or learning myself, I use AI in “teaching mode”:
- Ask AI to explain concepts before showing solutions
- Use it to explore “why” not just “what”
- Let it suggest learning resources and alternatives
- Treat it as a patient teacher, not a magic answer box
The key insight: AI as a learning partner beats AI as a solution machine for skill development.
🛠️ How I (and Others) Actually Use CLI AI in 2026
Note: The shell patterns below are illustrative — the CLI flags and subcommands are conceptual stand-ins for the actual tooling, which varies by provider and has changed significantly even in the last year. The workflows are real; the exact syntax should be adapted to whatever you’re actually running.
Here are the patterns that work, from my own experience and what I’ve seen in the community.
Pattern 1: The “AI-Enhanced Review Cycle”
My workflow before AI:
1. Write code (2 hours)
2. Self-review (15 min, often missed stuff)
3. Create PR (5 min)
4. Wait for review
5. Address feedback (30 min)
6. Rinse and repeat
My workflow now:
```bash
# Before creating PR
$ git diff | gh copilot pre-review \
    --checklist=security,performance,tests,docs

# Fix the obvious issues AI caught (15 min)

# Create PR with AI-generated description
$ gh pr create --fill-ai

# Reviewers focus on:
# - Architecture decisions
# - Business logic
# - Design patterns
# (Not formatting or obvious bugs)
```
The win: Reviews are faster AND higher quality. Plus, I catch embarrassing mistakes before anyone else sees them.
Pattern 2: The “Progressive Enhancement” Script
Instead of AI-first or AI-only, I layer AI into existing workflows:
```bash
#!/bin/bash
# deploy.sh - Progressively AI-enhanced

# Step 1: Traditional validation
npm test || exit 1
npm run lint || exit 1

# Step 2: AI-enhanced security scan
echo "Running AI security analysis..."
SECURITY=$(gh copilot security-scan --severity=high,critical)
if [[ -n "$SECURITY" ]]; then
  echo "⚠️ Security concerns found:"
  echo "$SECURITY"
  read -p "Continue anyway? (y/n) " confirm
  [[ $confirm != "y" ]] && exit 1
fi

# Step 3: AI-suggested deployment checks
echo "AI pre-flight checks..."
gh copilot deploy-checklist --environment=production

# Step 4: Traditional deployment
kubectl apply -f deployment.yaml

# Step 5: AI-monitored health check
gh copilot monitor-deployment \
  --timeout=5m \
  --alert-on-anomalies
```
Key insight: AI augments what works, doesn’t replace it. This pattern has served me well across homelab projects and production systems.
Pattern 3: The “Context-Aware Assistant”
One of my favorite discoveries: giving AI context about your project makes it 10x more useful:
The `.ai-context` file in the project root:

```json
{
  "project": "payment-processor",
  "stack": ["python", "fastapi", "postgresql"],
  "conventions": "./CONVENTIONS.md",
  "architecture": "./docs/architecture.md",
  "common-tasks": {
    "test": "pytest --cov=src tests/",
    "deploy": "./scripts/deploy.sh",
    "review": "gh copilot review --team-style"
  }
}
```

```bash
# AI reads context automatically
$ gh copilot task "add rate limiting to API"

# AI response includes:
# - Code following team conventions
# - Tests using team's test patterns
# - Documentation updates
# - Deployment considerations
```
Why it works: AI understands your project’s patterns, not just generic code examples. The context file pays for itself quickly.
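One way to get this behavior even from a generic CLI is a thin wrapper that prepends the context file to every prompt. Here's a sketch under obvious assumptions: `python3` is available, the `.ai-context` layout is the one above, and `run_model` is a hypothetical stand-in for the real invocation.

```shell
# Prepend project context to every prompt so a generic tool receives
# project-aware input. run_model is a hypothetical stand-in for the
# actual CLI/API call.
run_model() {
  echo "PROMPT RECEIVED: $1"
}

ai_with_context() {
  local prompt="$1" ctx=""
  if [[ -f .ai-context ]]; then
    # Pull the fields we care about out of the JSON context file
    ctx=$(python3 -c 'import json;d=json.load(open(".ai-context"));print("Project:",d["project"],"| Stack:",",".join(d["stack"]))')
  fi
  run_model "[$ctx] $prompt"
}
```

Same prompt, dramatically different answer quality, because the model no longer has to guess your stack.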
💰 The Economics: What Changed
Let’s talk money, because accessibility matters.
My Personal Cost Journey
Q1 2025:
- My monthly AI costs: ~$85
- My reaction: “This is steep for homelab work”
Q2 2025:
- Started optimizing usage
- Monthly cost: ~$45
- Reaction: “Getting more reasonable”
Q3-Q4 2025:
- Model prices dropped significantly
- Better caching strategies
- Monthly cost: ~$25
- Reaction: “Totally sustainable!”
Q1 2026 (today):
- My typical monthly spend: $15-30
- Heavy usage months: $40-50
- This is less than my streaming subscriptions
The Value Proposition
For me personally:
- Time saved: Probably 5-8 hours/week on tedious tasks
- Learning accelerated: Can explore new tech faster
- Quality improved: Catch bugs before they ship
- Documentation exists: My projects actually have usable docs
- Less burnout: AI handles the boring stuff
The real ROI: More interesting problems, better quality shipped, and documentation that actually exists. That last one still surprises me.
🔐 Security Maturity: Lessons Learned
As AI agents became more common in production workflows, the industry developed important security practices.
Best Practices That Emerged
Principle 1: Zero Trust for AI
```yaml
# AI gets minimal permissions
ai-agent-permissions:
  read: [source-code, logs, metrics]
  write: [none]                 # AI writes to temp/PR only
  execute: [linting, testing]   # Safe operations only
  deploy: [never]               # Humans deploy
```
Principle 2: Prompt Injection Defense
```bash
# Sanitize all AI inputs
sanitize_prompt() {
  local prompt="$1"
  local sanitized
  # Strip control characters that can smuggle hidden instructions
  sanitized=$(printf '%s' "$prompt" | tr -d '\000-\010\013\014\016-\037')
  # Remove common injection phrasing (illustrative, not exhaustive)
  sanitized=${sanitized//[Ii]gnore previous instructions/}
  echo "$sanitized"
}

SAFE_PROMPT=$(sanitize_prompt "$USER_INPUT")
gh copilot query "$SAFE_PROMPT"
```
Principle 3: Output Validation
```bash
# Validate all AI-generated code before it goes anywhere
validate_ai_output() {
  local ai_file="$1"  # path to the file containing AI-generated code

  # Security scan
  semgrep --config=auto "$ai_file" || return 1

  # Syntax validation (Python example)
  python -m py_compile "$ai_file" || return 1

  # Custom rules
  ./scripts/validate-conventions.sh "$ai_file" || return 1

  return 0
}
```
Principle 4: Audit Everything
Every AI operation gets logged:

```json
{
  "timestamp": "2026-01-15T10:30:00Z",
  "user": "alice@company.com",
  "operation": "code-review",
  "prompt_hash": "a1b2c3...",
  "output_hash": "d4e5f6...",
  "cost": "$0.04",
  "approved": false,
  "applied": false
}
```
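Producing records in that shape takes very little machinery. Here's a sketch in plain shell; the field names match the record above, while the 12-character hash truncation and the `AUDIT_LOG` variable are my own choices, not a standard.

```shell
# Append one audit record per AI operation to a JSONL log.
# Field names mirror the record above; hash truncation is arbitrary.
log_ai_operation() {
  local user="$1" op="$2" prompt="$3" output="$4" cost="$5"
  local ts phash ohash
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # Log hashes, not raw text: auditable without storing sensitive prompts
  phash=$(printf '%s' "$prompt" | sha256sum | cut -c1-12)
  ohash=$(printf '%s' "$output" | sha256sum | cut -c1-12)
  printf '{"timestamp":"%s","user":"%s","operation":"%s","prompt_hash":"%s","output_hash":"%s","cost":"%s","approved":false,"applied":false}\n' \
    "$ts" "$user" "$op" "$phash" "$ohash" "$cost" >> "${AUDIT_LOG:-ai-audit.log}"
}
```

One line per operation, greppable, and cheap enough that there's no excuse to skip it.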
These patterns are becoming standard practice across teams I’ve talked to and reviewed configs from.
🎓 What We Learned About Learning
The teams that navigated this well used a phased rollout: teaching mode only for the first six months, then full productivity tools once fundamentals were established. AI explains before solving, suggests resources, resists just handing over the answer.
The teams that didn’t do this got fast juniors with shallow understanding. Which is fine until something breaks at 2am and nobody knows why.
| Approach | Short-term velocity | Skill development |
|---|---|---|
| AI as answer machine | High | Low |
| AI as teaching partner | Medium | High |
| No AI | Low | High |
The middle row is the play. It’s not even close once you run the numbers over 12 months.
🚀 What’s Next: 2026 and Beyond
Aside: This is speculation, not roadmap. The commands below are illustrative of where things are trending based on what’s already in beta or early access — not things you can actually run today.
Near Term (Next 6 Months)
1. Context-Aware Everything
AI agents are getting dramatically better at understanding full project context:
```bash
# Coming soon
$ gh copilot task "optimize payment flow" \
    --context=entire-codebase

# AI analyzes:
# - All payment-related code
# - Database schema
# - API contracts
# - Performance metrics
# - Related tickets
# Suggests holistic optimization
```
2. Specialized Domain Agents
Instead of general-purpose AI, we’re seeing specialized agents:
```bash
# Domain-specific agents
$ gh copilot-security audit --compliance=PCI-DSS
$ gh copilot-performance profile --bottlenecks
$ gh copilot-accessibility check --WCAG-level=AA
```
Each agent is expert-level in its domain.
3. Team-Trained Models
Companies are starting to fine-tune models on their own codebases:
# Your team's AI, trained on your patterns
$ gh copilot-acme task "add feature" \
--uses-team-patterns \
--follows-team-style
This matters for consistency and velocity in ways that are hard to overstate.
Medium Term (6-12 Months)
1. Cross-Tool Intelligence
AI coordinating across your entire toolchain:
```bash
# AI orchestrates multiple tools
$ gh copilot workflow "deploy new feature" \
    --coordinate-tools=git,docker,kubernetes,datadog

# AI handles:
# - Git workflow
# - Container build
# - K8s deployment
# - Monitoring setup
# - Rollback planning
```
2. Predictive Assistance
AI anticipating what you need:
```text
# Working on auth code...
# AI proactively suggests:
"🤖 I noticed you're modifying authentication.
 Would you like me to:
 - Update related tests
 - Check for security implications
 - Update API documentation
 - Verify OAuth flow consistency"
```
3. Self-Improving Workflows
Workflows that optimize themselves:
```bash
# Workflow learns from experience
$ gh copilot optimize-workflow .github/workflows/ci.yml \
    --based-on-history \
    --reduce-time \
    --maintain-reliability

# AI analyzes 1000 runs
# Identifies bottlenecks
# Suggests optimizations
# You review and apply
```
Long Term (1-2 Years)
Intent-Driven Development:
You describe what you want to achieve. AI handles the how and adapts as requirements evolve.
```bash
$ gh copilot project init "real-time chat app" \
    --requirements=./requirements.md \
    --constraints=./constraints.md \
    --maintain-continuously

# AI:
# 1. Designs architecture
# 2. Implements core features
# 3. Sets up infrastructure
# 4. Creates monitoring
# 5. Evolves as needs change
```
Your role: Architect, reviewer, decision-maker. AI’s role: Builder, maintainer, optimizer.
🎯 Practical Advice for 2026
If you haven’t started yet: Just start. Install the Copilot CLI or set up the Claude API and run it against commands you’ve been looking up in man pages for the last five years. Use it as a reviewer on your next PR before you push. The ramp-up is fast. Don’t follow a structured onboarding plan — poke at it until something clicks.
If you’re already using it, three upgrades worth doing:
1. Add a project context file. AI that knows your stack, conventions, and architecture doc is substantially more useful than generic prompts. A `.ai-context` file in your repo root that points to `CONVENTIONS.md` and your architecture notes takes 20 minutes to set up.
2. Build security principles into your pipelines now. Zero-trust permissions, human-in-the-loop for any write operations, output validation before applying. If your current setup doesn’t have these, it’s technical debt with a timer.
3. Share patterns with your team. The useful aliases, the workflow scripts, the prompts that actually work — document them somewhere. AI tooling has a surprisingly high variance in effectiveness depending on how it’s prompted, and institutional knowledge matters here.
If you’re leading a team: Write an AI usage policy before you need one. The hard questions — where AI plugs in, where humans stay in the loop, how you handle junior developer learning — are easier to answer in advance than in the middle of an incident. And measure impact. Not to justify the budget, but to know where to tune.
🎬 The Real Story: Smaller Gap Than Expected, Bigger Than It Looks
A year ago I was excited about what these tools could do. Today I’m watching what they actually do.
The gap between those two things is real — smaller than the skeptics predicted, bigger than the true believers promised. The developers getting the most out of CLI AI aren’t the ones who’ve automated the most. They’re the ones who’ve figured out which parts of their workflow genuinely benefit from an assist, and which parts still need a human brain in the loop.
Self-healing infrastructure doesn’t make that cut. Neither does AI-first workflow design. But PR review augmentation, test generation, and legacy code archaeology? Those three earned their keep.
The line between “good use” and “bad use” isn’t the same for everyone — your stack, your team, your risk tolerance all factor in. Figuring out where it sits for your work is the actual job. The tooling is mature enough now that “we don’t use AI” is an active choice, not a default.
That choice might be the right one for your context. But it should be deliberate.
📚 Resources
Essential Tools (2026 Edition)
- GitHub Copilot CLI — Available with GitHub
- Claude API — With extended context and team features
- Gemini Code Assist — Google’s AI development tool
- Codeium — Free alternative with solid features
Security Resources
- OWASP AI Security — AI-specific security guidelines
- Anthropic’s Safety Research — Safety and alignment research
- GitHub Security Best Practices — Official guidance for safe AI deployment
Learning Resources
- GitHub Blog - AI & ML — Real-world AI development patterns
- Anthropic Documentation — Developer guides and best practices
- OpenAI Documentation — Comprehensive API and safety guides
🎤 Your Turn
If you’ve been running similar experiments — or found patterns that contradict mine — I’m genuinely curious. Drop me a note.
Related: “The Agentic CLI Revolution: When AI Meets the Terminal” — where this started.