
Complete Mode for Software & DevOps: Systematic Deploy Debugging

Apply Complete Problem Solving Mode to software deployment, service reliability, and infrastructure debugging with concrete acceptance gates.

Published October 29, 2025

Complete Mode for Software & DevOps

The Software/DevOps overlay adapts Complete Problem Solving Mode for deployment failures, service outages, build issues, and infrastructure problems. It emphasizes runtime health, log cleanliness, and test coverage as core acceptance gates.

When to Use This Overlay

Ideal for:

  • Deploy failures (staging or production)
  • Service 500s or timeouts
  • Build/CI pipeline breaks
  • Container or orchestration issues
  • Configuration drift
  • Performance degradation
  • Database migration problems

Not ideal for:

  • Architecture design discussions (use exploratory mode)
  • Feature planning (use Product/UX overlay)
  • Quick local dev fixes (overkill)

The Software/DevOps Overlay

[DONE overlay — Software/DevOps]

- **Cloud health**: primary endpoint (e.g., `/healthz`) returns **200** for **≥30 minutes**; process has **0 unexpected restarts**
- **Logs clean**: last **2 minutes** of runtime logs show **nothing above WARN** (no ERROR/FATAL entries)
- **Local gates**: build **+** tests **+** type/lint pass with **0 failures**
- **Zero criticals**: Problem Register has 0 P0/P1 remaining
- **Evidence pack**: health URL/screenshot, log excerpt, test summary, commit/PR refs

**Suggested toggles**: 
- `scope:start-command` — Focus on service startup
- `scope:db-migrations` — Focus on schema changes
- `scope:config` — Focus on environment/secrets
- `depth:deep` — Multiple discovery methods
- `risk_tolerance:low` — Production changes

Understanding the Gates

Gate 1: Cloud Health (30min + 0 restarts)

Why 30 minutes? Many issues only surface after:

  • Cold start optimizations expire
  • Connection pools stabilize
  • Caches populate
  • Background jobs run

Why zero restarts? A service that "works" but crashes every 15 minutes is not done.

How to verify:

# Check uptime
curl -i https://your-service.com/healthz

# Check process stability (depends on platform)
# Render: Check dashboard for restart count
# Railway: railway logs --tail 100
# Docker: docker ps -a (check STATUS)
# K8s: kubectl get pods -w (watch for restarts)

Independent verification:

  • Monitor dashboard (Datadog, New Relic, CloudWatch)
  • Synthetic probe from different region
  • Load test with realistic traffic
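If you want to script this gate rather than watch a dashboard, here is a minimal sketch, assuming Node 18+ (for built-in fetch) and a hypothetical HEALTH_URL environment variable:

// poll-health.ts: fail if the endpoint ever returns non-200 during the 30min window
const HEALTH_URL = process.env.HEALTH_URL ?? "https://your-service.com/healthz";
const WINDOW_MS = 30 * 60 * 1000;  // the 30-minute gate
const INTERVAL_MS = 30 * 1000;     // probe every 30 seconds

async function main(): Promise<void> {
  const deadline = Date.now() + WINDOW_MS;
  while (Date.now() < deadline) {
    const res = await fetch(HEALTH_URL);
    if (res.status !== 200) {
      console.error(`Gate failed: ${HEALTH_URL} returned ${res.status}`);
      process.exit(1);
    }
    await new Promise((r) => setTimeout(r, INTERVAL_MS));
  }
  console.log("Gate passed: 200 for the full 30-minute window");
}

main();

Note this covers only the 200-for-30min half of the gate; restart counts still come from the platform dashboard.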

Gate 2: Logs Clean (2min, no errors)

Why last 2 minutes? Recent logs show current state, not startup noise.

Why the WARN threshold? WARN is acceptable (expected edge cases); ERROR and FATAL are not.

How to verify:

# Platform-specific log commands
# Vercel: vercel logs --since 2m
# Render: render logs --tail 50
# Railway: railway logs --tail 50
# Docker: docker logs --since 2m container_name
# K8s: kubectl logs --since=2m pod_name

# Filter for errors
your-log-command | grep -i "error\|fatal\|exception"

Independent verification:

  • Log aggregation tool (Papertrail, Logtail, ELK)
  • Application monitoring (Sentry error count)
  • Custom healthcheck that validates internal state (see the sketch below)
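That last item is worth sketching: a healthcheck that exercises a real dependency catches failures a bare 200 cannot. A minimal version, where dbPing is a hypothetical stand-in for your own dependency check:

// healthz.ts: healthcheck that validates internal state, not just liveness
import { createServer } from "node:http";

// Hypothetical stand-in; replace with a real check, e.g. pool.query("SELECT 1")
async function dbPing(): Promise<void> {
  // ...
}

const server = createServer(async (req, res) => {
  if (req.url === "/healthz") {
    try {
      await dbPing(); // fail the check when a core dependency is down
      res.writeHead(200).end("ok");
    } catch {
      res.writeHead(503).end("dependency check failed");
    }
    return;
  }
  res.writeHead(404).end();
});

server.listen(Number(process.env.PORT ?? 3000));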

Gate 3: Local Gates (build + tests + lint)

Why local? Production health doesn't prove code quality. Broken tests or lint failures indicate technical debt.

How to verify:

# Example for Node.js/TypeScript
npm run build        # or: tsc --noEmit
npm run test         # or: jest --coverage
npm run lint         # or: eslint . --ext .ts,.tsx

# Example for Python
pytest tests/        # or: python -m pytest
mypy src/            # type checking
ruff check .         # or: flake8/black --check

# Example for Go
go build ./...
go test ./...
golangci-lint run

What counts as "pass":

  • Build: 0 compilation errors (warnings OK if documented)
  • Tests: 100% pass rate (skip flaky tests temporarily, document as P2)
  • Lint: 0 errors (warnings OK if tracked)
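To turn the three gates into one command, a small runner sketch (it assumes the npm scripts shown above exist in your package.json):

// local-gates.ts: run build, tests, and lint; stop at the first failure
import { spawnSync } from "node:child_process";

for (const gate of ["build", "test", "lint"]) {
  const result = spawnSync("npm", ["run", gate], { stdio: "inherit" });
  if (result.status !== 0) {
    console.error(`Local gate failed: npm run ${gate}`);
    process.exit(result.status ?? 1);
  }
}

console.log("All local gates passed: build + tests + lint");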

Gate 4: Zero Criticals

P0 definition (Critical):

  • Production completely down
  • Data loss or corruption
  • Security vulnerability
  • Wrong business logic (financial, legal)

P1 definition (High):

  • Partial outage (>10% users affected)
  • Broken core feature
  • Performance degradation (>2x normal)
  • Silent failures (errors not surfaced)

P2 definition (Medium):

  • Edge case bugs (<5% users)
  • Cosmetic issues
  • Technical debt
  • Nice-to-have improvements

Acceptance: All P0/P1 must be Resolved. P2 allowed with documented rationale.
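If you keep the register as data rather than a markdown table, this gate becomes a one-line check. A sketch of the shape used throughout this guide:

// register.ts: Problem Register entries and the zero-criticals gate
type Severity = "P0" | "P1" | "P2";
type Status = "Open" | "In Progress" | "Resolved";

interface Problem {
  id: string;
  sev: Severity;
  category: string;
  evidence: string;
  rootCause: string;
  action: string;
  status: Status;
  conf: number; // 0..1 confidence in the root-cause diagnosis
}

// Gate 4: every P0/P1 is Resolved; P2s may remain with documented rationale
function meetsZeroCriticals(register: Problem[]): boolean {
  return register
    .filter((p) => p.sev === "P0" || p.sev === "P1")
    .every((p) => p.status === "Resolved");
}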

Gate 5: Evidence Pack

Required artifacts:

  • Health URL: Link or screenshot of successful healthcheck
  • Log excerpt: Last 2min of logs showing clean state
  • Test summary: Output showing all tests pass
  • Commit/PR refs: Links to changes made

Format example:

**Evidence Pack:**
- Health: https://api.example.com/healthz (200 OK, checked 2:47pm)
- Logs: [See excerpt below] — No errors in 2min tail
- Tests: All 47 tests pass (see CI run #1234)
- Changes: PR #567 (merged), commit abc123f

Discovery Methods for Software/DevOps

Use different methods each pass to catch different issue types:

Pass 1: Deployment Logs

# Platform-specific
vercel logs --since 30m
render logs --tail 200
railway logs
heroku logs --tail=500

Catches: Startup failures, missing env vars, port binding

Pass 2: Application Logs

# Runtime errors, exceptions
kubectl logs -f pod-name
docker logs -f container_name

Catches: Runtime exceptions, database errors, API failures

Pass 3: Health/Status Endpoints

curl -i https://api.example.com/healthz
curl -i https://api.example.com/status

Catches: Misconfigured health checks, dependency failures

Pass 4: Metrics/Monitoring

  • Check dashboards (Datadog, Grafana, CloudWatch)
  • Look for: error rate, latency, memory/CPU

Catches: Performance degradation, resource leaks

Pass 5: Manual Testing

# Smoke test critical paths
curl -X POST https://api.example.com/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test"}'

Catches: Broken endpoints, wrong responses

Pass 6: CI/CD Pipeline

  • Check GitHub Actions / CircleCI / GitLab CI
  • Look for: test failures, build warnings, deployment steps

Catches: Test regressions, build configuration issues

Common Software/DevOps Problems & Register Entries

Problem: Port Binding Failure

| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-01 | P0 | Startup | "EADDRINUSE: port 3000 already in use" in logs | Start command doesn't use $PORT | Update start cmd to use process.env.PORT | Resolved | 0.95 |

Action Log:

### Pass 1 — Action Log
- Change: Updated package.json start script from `node server.js` to `node server.js --port $PORT`
- Before → After: Deploy logs showed port conflict → Now binds to dynamic port
- Verification (Primary): Health check returns 200
- Verification (Independent): Platform dashboard shows "Running" status
- New signals found? No additional issues
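For reference, the underlying fix is to bind to the platform-assigned port rather than a hardcoded one; whether the value arrives via a --port flag or is read directly, the server must honor $PORT. A minimal sketch of the direct approach (the 3000 fallback is an assumption for local dev):

// server.ts: bind to the platform-assigned port
import { createServer } from "node:http";

const server = createServer((req, res) => res.end("ok"));

// Platforms like Render and Railway inject PORT; hardcoding a port is what
// produced the EADDRINUSE failure in P-01. 3000 is only a local-dev fallback.
const port = Number(process.env.PORT ?? 3000);
server.listen(port, () => console.log(`listening on ${port}`));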

Problem: Missing Environment Variable

| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-02 | P1 | Config | "DATABASE_URL is not defined" in startup logs | Missing from platform env vars | Add to Render environment | Resolved | 1.0 |

Problem: Database Connection Timeout

| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-03 | P0 | Runtime | "connect ETIMEDOUT" every 30s in logs | Connection pool exhausted | Add connection retry with exponential backoff | Resolved | 0.85 |
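A sketch of that retry-with-backoff action, where connect is a hypothetical stand-in for your DB client's connect call:

// connect-retry.ts: exponential backoff with jitter for flaky DB connections
async function connectWithRetry(
  connect: () => Promise<void>, // hypothetical stand-in for your client's connect()
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await connect();
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // 500ms, 1s, 2s, 4s... plus jitter so replicas don't retry in lockstep
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      console.warn(`connect attempt ${attempt} failed; retrying in ${Math.round(delay)}ms`);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}

Backoff smooths transient exhaustion; if the pool is chronically undersized, also revisit the pool limits themselves.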

Re-scan discovery: Checking application logs revealed a secondary issue:

| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-04 | P2 | Performance | Query taking 4.2s avg (logs) | Missing index on users.email | Add index in migration | Resolved | 0.9 |
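The P-04 fix doubles as a chance to show a reversible migration (this sketch assumes Knex; any tool with up/down pairs works the same way, and the scope:db-migrations section later adds gates for exactly this):

// migrations/add_users_email_index.ts: reversible index migration (Knex assumed)
import type { Knex } from "knex";

export async function up(knex: Knex): Promise<void> {
  // Removes the 4.2s average lookup found in P-04
  await knex.schema.alterTable("users", (t) => t.index(["email"]));
}

export async function down(knex: Knex): Promise<void> {
  // A tested down migration is one of the extra scope:db-migrations gates
  await knex.schema.alterTable("users", (t) => t.dropIndex(["email"]));
}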

Real-World Example: Fixing a Vercel Deployment

Initial Request

@complete-mode

Task = Fix production deploy failing on Vercel
Context = Next.js 14, TypeScript, Postgres (Supabase), last successful deploy was 3 days ago

[DONE overlay — Software/DevOps]
- Cloud health: https://myapp.vercel.app/api/health returns 200 for ≥30min; 0 restarts
- Logs clean: last 2min show no errors
- Local gates: npm run build + test + lint = 0 failures
- Zero criticals
- Evidence: health URL, log excerpt, test output, deployment URL

scope:build depth:deep strict:on

Pass 1: Build Logs

Discovery:

[12:34:56] Error: Type error: Property 'user' does not exist on type 'Session'
[12:34:56] src/app/dashboard/page.tsx:45:28

Problem Register:

| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-01 | P0 | Build | Type error in dashboard/page.tsx:45 | Supabase types out of sync | Regenerate types from schema | In Progress | 0.9 |

Execute:

# Regenerate Supabase types
npx supabase gen types typescript --project-id abc123 > types/supabase.ts

# Verify locally
npm run build

Primary Verification: Local build succeeds
Independent Verification: Push to preview branch, check Vercel build

Pass 2: Re-scan with Different Method (Runtime Logs)

Discovery: Build now succeeds, but checking preview deployment logs shows:

[Error] Unhandled Promise Rejection: Missing environment variable: SUPABASE_SERVICE_KEY

Problem Register Update:

| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-01 | P0 | Build | Type error | Types out of sync | Regenerated types | Resolved | 0.9 |
| P-02 | P0 | Config | "Missing SUPABASE_SERVICE_KEY" in preview logs | Not set in Vercel env | Add to Vercel dashboard | In Progress | 1.0 |

Execute:

  • Go to Vercel dashboard → Settings → Environment Variables
  • Add SUPABASE_SERVICE_KEY with value from Supabase dashboard
  • Trigger redeploy

Primary Verification: Preview deployment succeeds, health endpoint returns 200
Independent Verification: Test actual API call with curl

Pass 3: Re-scan with Third Method (Manual Testing)

Discovery: Testing the user auth flow manually:

curl "https://preview-abc.vercel.app/api/auth/callback?code=xyz"

Returns 500. Logs show:

[Error] redirect_uri mismatch in OAuth callback

Problem Register Update:

| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-03 | P1 | Auth | OAuth callback 500, redirect_uri mismatch | Supabase only has prod URL in allowed list | Add preview URL pattern | In Progress | 0.85 |

Execute:

  • Add https://*.vercel.app to Supabase allowed redirect URLs
  • Test callback again

Primary Verification: Callback returns 302 redirect as expected
Independent Verification: Complete full login flow in browser

Final Status

FINAL STATUS: DONE ✅

Problem Register (Final):

| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-01 | P0 | Build | Type error | Types out of sync | Regenerated types | Resolved | 0.9 |
| P-02 | P0 | Config | Missing SUPABASE_SERVICE_KEY | Not in Vercel env | Added to dashboard | Resolved | 1.0 |
| P-03 | P1 | Auth | OAuth callback 500 | Redirect URI not allowed | Added pattern to Supabase | Resolved | 0.85 |

Evidence Pack:

  • Health: https://myapp.vercel.app/api/health (200 OK, stable 45min)
  • Logs: Clean, no errors in last 2min (see screenshot)
  • Tests: All 63 tests pass (see CI run #891)
  • Changes: PR #145, deployment abc123xyz

WHAT TO WATCH:

  • Preview branch OAuth may need periodic re-auth if URLs change
  • Supabase types should be regenerated after schema changes (add to docs)
  • Consider adding health check that validates all env vars on startup

Advanced: Scope-Specific Patterns

scope:start-command

Focus on process startup, port binding, signal handling.

Discovery methods:

  1. Platform startup logs
  2. Process manager logs (PM2, systemd)
  3. Docker ENTRYPOINT behavior
  4. Health check timing (does it respond immediately?)

scope:db-migrations

Focus on schema changes, data integrity, rollback safety.

Discovery methods:

  1. Migration logs
  2. Before/after schema diff
  3. Data validation queries
  4. Rollback dry-run

Extra gates:

  • Reversible: Down migration tested
  • Data safe: No data loss on existing records
  • Performance: Migration completes in <5 min on production data volume

scope:config

Focus on environment variables, secrets, feature flags.

Discovery methods:

  1. Env var audit (what's set, what's missing)
  2. Secret rotation logs
  3. Config precedence (env vs file vs defaults)
  4. Feature flag state in admin panel

Extra gates:

  • No hardcoded secrets: grep for API keys in code
  • Documented: All env vars in README with examples
  • Validated on startup: App checks required vars early (see the sketch below)
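A minimal sketch of that last gate, validating required vars before anything else runs (the variable names are examples drawn from earlier in this guide; substitute your own):

// env.ts: validate required env vars at startup so misconfig fails fast
const REQUIRED = ["DATABASE_URL", "SUPABASE_SERVICE_KEY"]; // example names

const missing = REQUIRED.filter((name) => !process.env[name]);
if (missing.length > 0) {
  // Crash immediately with a clear message instead of failing on first use
  console.error(`Missing required env vars: ${missing.join(", ")}`);
  process.exit(1);
}

Import this module first in your entrypoint so a missing secret surfaces as one clear startup error (like P-02 above) rather than an unhandled rejection at request time.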

Customization for Your Stack

For Kubernetes

[DONE overlay — Software/DevOps (K8s)]
- All base gates (health, logs, tests)
- **Pods stable**: 0 restarts across all replicas for ≥30min
- **Resources healthy**: CPU `<70%`, memory `<80%`
- **Readiness**: All pods passing readiness probe
- Evidence: kubectl get pods, Grafana dashboard, logs

For Serverless (AWS Lambda, Vercel Functions)

[DONE overlay — Software/DevOps (Serverless)]
- All base gates (health, logs, tests)
- **Cold start**: `<1s` on first invocation
- **Warm performance**: p95 `<200ms`
- **Error rate**: `<0.1%` over 100 invocations
- Evidence: CloudWatch metrics, test run, logs
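A rough way to spot-check the warm-performance and error-rate gates from a terminal, assuming Node 18+ and a hypothetical TARGET_URL (platform metrics like CloudWatch remain the authoritative source):

// probe.ts: 100 sequential invocations; report p95 latency and error rate
const TARGET_URL = process.env.TARGET_URL ?? "https://your-fn.example.com/api/health";
const N = 100;

async function main(): Promise<void> {
  const latencies: number[] = [];
  let errors = 0;

  for (let i = 0; i < N; i++) {
    const start = performance.now();
    try {
      const res = await fetch(TARGET_URL);
      if (!res.ok) errors++;
    } catch {
      errors++;
    }
    latencies.push(performance.now() - start);
  }

  latencies.sort((a, b) => a - b);
  const p95 = latencies[Math.floor(0.95 * (N - 1))];
  console.log(`p95: ${p95.toFixed(0)}ms  error rate: ${((100 * errors) / N).toFixed(1)}%`);
}

main();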

For CI/CD Pipelines

[DONE overlay — Software/DevOps (CI/CD)]
- **Pipeline green**: All stages pass
- **Fast feedback**: Total time `<10min`
- **Deterministic**: 3 consecutive runs, all pass
- **Artifacts**: Build outputs available, versioned
- Evidence: CI dashboard, artifact links, timing metrics

Measuring Success

Before Complete Mode

  • Deploy "succeeds" but breaks in production 2 hours later
  • Fix surface issue, miss database timeout
  • No record of what was checked

After Complete Mode

  • All three issues found before production merge
  • Evidence pack allows teammate to verify
  • Monitoring plan prevents recurrence

Metric to track: Time-to-stable-deploy (initial deploy → 30min healthy with no follow-up fixes)


Next Steps

  1. Try on a known-good system first — Deploy something that works and verify all gates pass
  2. Document your team's gates — What thresholds matter for your SLOs?
  3. Automate verification — Script the health check + log analysis
  4. Create runbook template — Adapt Problem Register for incident response