Complete Mode for Software & DevOps
The Software/DevOps overlay adapts Complete Problem Solving Mode for deployment failures, service outages, build issues, and infrastructure problems. It emphasizes runtime health, log cleanliness, and test coverage as core acceptance gates.
When to Use This Overlay
Ideal for:
- Deploy failures (staging or production)
- Service 500s or timeouts
- Build/CI pipeline breaks
- Container or orchestration issues
- Configuration drift
- Performance degradation
- Database migration problems
Not ideal for:
- Architecture design discussions (use exploratory mode)
- Feature planning (use Product/UX overlay)
- Quick local dev fixes (overkill)
The Software/DevOps Overlay
[DONE overlay — Software/DevOps]
- **Cloud health**: primary endpoint (e.g., `/healthz`) returns **200** for **≥30 minutes**; process has **0 unexpected restarts**
- **Logs clean**: last **2 minutes** of runtime logs show **no errors above INFO**
- **Local gates**: build **+** tests **+** type/lint pass with **0 failures**
- **Zero criticals**: Problem Register has 0 P0/P1 remaining
- **Evidence pack**: health URL/screenshot, log excerpt, test summary, commit/PR refs
**Suggested toggles**:
- `scope:start-command` — Focus on service startup
- `scope:db-migrations` — Focus on schema changes
- `scope:config` — Focus on environment/secrets
- `depth:deep` — Multiple discovery methods
- `risk_tolerance:low` — Production changes
Understanding the Gates
Gate 1: Cloud Health (30min + 0 restarts)
Why 30 minutes? Many issues only surface after:
- Cold start optimizations expire
- Connection pools stabilize
- Caches populate
- Background jobs run
Why zero restarts? A service that "works" but crashes every 15 minutes is not done.
How to verify:
```bash
# Check uptime
curl -i https://your-service.com/healthz

# Check process stability (depends on platform)
# Render: check dashboard for restart count
# Railway: railway logs --tail 100
# Docker: docker ps -a (check STATUS)
# K8s: kubectl get pods -w (watch for restarts)
```
Independent verification:
- Monitor dashboard (Datadog, New Relic, CloudWatch)
- Synthetic probe from different region
- Load test with realistic traffic
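The 30-minute gate is tedious to check by hand. As one way to automate it, here is a minimal polling sketch; `probe`, `clock`, and `sleep` are injected so the loop can be exercised without a live service, and all names and defaults are illustrative rather than part of the overlay:

```python
import time

def health_stable(probe, duration_s=1800, interval_s=60,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` (a callable returning an HTTP status code) until
    `duration_s` elapses; the gate passes only if every probe returned 200.
    `clock` and `sleep` are injectable for testing."""
    deadline = clock() + duration_s
    while clock() < deadline:
        if probe() != 200:
            return False  # fail the gate on the first bad response
        sleep(interval_s)
    return True
```

In real use, `probe` would issue an HTTP GET against your `/healthz` endpoint and return the status code; restart counting still comes from the platform dashboard.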
Gate 2: Logs Clean (2min, no errors)
Why last 2 minutes? Recent logs show current state, not startup noise.
Why INFO threshold? WARN is acceptable (expected edge cases). ERROR/FATAL are not.
How to verify:
```bash
# Platform-specific log commands
# Vercel: vercel logs --since 2m
# Render: render logs --tail 50
# Railway: railway logs --tail 50
# Docker: docker logs --since 2m container_name
# K8s: kubectl logs --since=2m pod_name

# Filter for errors
your-log-command | grep -i "error\|fatal\|exception"
```
Independent verification:
- Log aggregation tool (Papertrail, Logtail, ELK)
- Application monitoring (Sentry error count)
- Custom healthcheck that validates internal state
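The grep filter above can be made precise with a small helper that flags only ERROR/FATAL lines (WARN stays acceptable, per the gate). The severity tokens and log shape are assumptions; adjust the pattern to your logger's format:

```python
import re

# Lines carrying a severity stricter than WARN fail the "logs clean" gate.
FAIL_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")

def failing_lines(log_lines):
    """Return the subset of `log_lines` that contain an ERROR or FATAL
    severity token. Assumes logs embed the level as an uppercase word."""
    return [line for line in log_lines if FAIL_PATTERN.search(line)]
```

Feed it the last two minutes of output (e.g. from `docker logs --since 2m`); an empty result means the gate passes.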
Gate 3: Local Gates (build + tests + lint)
Why local? Production health doesn't prove code quality. Broken tests or lint failures indicate technical debt.
How to verify:
```bash
# Example for Node.js/TypeScript
npm run build   # or: tsc --noEmit
npm run test    # or: jest --coverage
npm run lint    # or: eslint . --ext .ts,.tsx

# Example for Python
pytest tests/   # or: python -m pytest
mypy src/       # type checking
ruff check .    # or: flake8 / black --check

# Example for Go
go build ./...
go test ./...
golangci-lint run
```
What counts as "pass":
- Build: 0 compilation errors (warnings OK if documented)
- Tests: 100% pass rate (skip flaky tests temporarily, document as P2)
- Lint: 0 errors (warnings OK if tracked)
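The three local gates can be chained in one fail-fast runner that reports exactly which command broke. This is a sketch, not a prescribed tool; the command lists are placeholders for your stack's build/test/lint invocations:

```python
import subprocess

def run_gates(commands):
    """Run each gate command in order; return (passed, failures).
    The overall gate passes only when every command exits 0."""
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failures.append(cmd)
    return (len(failures) == 0, failures)
```

Usage might look like `run_gates([["npm", "run", "build"], ["npm", "test"], ["npm", "run", "lint"]])`; wiring this into CI gives you the "0 failures" evidence automatically.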
Gate 4: Zero Criticals
P0 definition (Critical):
- Production completely down
- Data loss or corruption
- Security vulnerability
- Wrong business logic (financial, legal)
P1 definition (High):
- Partial outage (>10% users affected)
- Broken core feature
- Performance degradation (>2x normal)
- Silent failures (errors not surfaced)
P2 definition (Medium):
- Edge case bugs (<5% of users)
- Cosmetic issues
- Technical debt
- Nice-to-have improvements
Acceptance: All P0/P1 must be Resolved. P2 allowed with documented rationale.
Gate 5: Evidence Pack
Required artifacts:
- Health URL: Link or screenshot of successful healthcheck
- Log excerpt: Last 2min of logs showing clean state
- Test summary: Output showing all tests pass
- Commit/PR refs: Links to changes made
Format example:
**Evidence Pack:**
- Health: https://api.example.com/healthz (200 OK, checked 2:47pm)
- Logs: [See excerpt below] — No errors in 2min tail
- Tests: All 47 tests pass (see CI run #1234)
- Changes: PR #567 (merged), commit abc123f
Discovery Methods for Software/DevOps
Use different methods each pass to catch different issue types:
Pass 1: Deployment Logs
```bash
# Platform-specific
vercel logs --since 30m
render logs --tail 200
railway logs
heroku logs --num 500
```
Catches: Startup failures, missing env vars, port binding
Pass 2: Application Logs
```bash
# Runtime errors, exceptions
kubectl logs -f pod-name
docker logs -f container_name
```
Catches: Runtime exceptions, database errors, API failures
Pass 3: Health/Status Endpoints
```bash
curl -i https://api.example.com/healthz
curl -i https://api.example.com/status
```
Catches: Misconfigured health checks, dependency failures
Pass 4: Metrics/Monitoring
- Check dashboards (Datadog, Grafana, CloudWatch)
- Look for: error rate, latency, memory/CPU
Catches: Performance degradation, resource leaks
Pass 5: Manual Testing
```bash
# Smoke test critical paths
curl -X POST https://api.example.com/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test"}'
```
Catches: Broken endpoints, wrong responses
Pass 6: CI/CD Pipeline
- Check GitHub Actions / CircleCI / GitLab CI
- Look for: test failures, build warnings, deployment steps Catches: Test regressions, build configuration issues
Common Software/DevOps Problems & Register Entries
Problem: Port Binding Failure
| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-01 | P0 | Startup | "EADDRINUSE: port 3000 already in use" in logs | Start command doesn't use $PORT | Update start cmd to use process.env.PORT | Resolved | 0.95 |
Action Log:
### Pass 1 — Action Log
- Change: Updated package.json start script from `node server.js` to `node server.js --port $PORT`
- Before → After: Deploy logs showed port conflict → Now binds to dynamic port
- Verification (Primary): Health check returns 200
- Verification (Independent): Platform dashboard shows "Running" status
- New signals found? No additional issues
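The fix pattern behind P-01 generalizes to any managed platform: read the platform-injected `$PORT` with a local fallback instead of hardcoding one. A minimal sketch (shown in Python purely for illustration; the example above is Node):

```python
import os

def resolve_port(default=3000):
    """Resolve the listen port from the platform-injected $PORT env var,
    falling back to a local default. Hardcoding the port is the usual
    cause of EADDRINUSE on managed platforms."""
    return int(os.environ.get("PORT", default))
```

The Node equivalent is `process.env.PORT ?? 3000`; either way, the process binds wherever the platform tells it to.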
Problem: Missing Environment Variable
| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-02 | P1 | Config | "DATABASE_URL is not defined" in startup logs | Missing from platform env vars | Add to Render environment | Resolved | 1.0 |
Problem: Database Connection Timeout
| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-03 | P0 | Runtime | "connect ETIMEDOUT" every 30s in logs | Connection pool exhausted | Add connection retry with exponential backoff | Resolved | 0.85 |
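The retry action recorded for P-03 can be sketched as a generic backoff wrapper. `connect` stands in for your real driver's connect call and is an assumption, not a specific API; the delays and attempt cap are illustrative:

```python
import random
import time

def connect_with_retry(connect, max_attempts=5, base_delay=0.5):
    """Call `connect()` until it succeeds, sleeping with exponential
    backoff plus jitter between attempts; re-raise after the last try."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # 0.5s, 1s, 2s, ... scaled by up to 2x jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Backoff spreads reconnect attempts out so a recovering database is not immediately re-exhausted by the same pool that timed it out.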
Re-scan discovery: Checking application logs revealed secondary issue:
| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-04 | P2 | Performance | Query taking 4.2s avg (logs) | Missing index on users.email | Add index in migration | Resolved | 0.9 |
Real-World Example: Fixing a Vercel Deployment
Initial Request
@complete-mode
Task = Fix production deploy failing on Vercel
Context = Next.js 14, TypeScript, Postgres (Supabase), last successful deploy was 3 days ago
[DONE overlay — Software/DevOps]
- Cloud health: https://myapp.vercel.app/api/health returns 200 for ≥30min; 0 restarts
- Logs clean: last 2min show no errors
- Local gates: npm run build + test + lint = 0 failures
- Zero criticals
- Evidence: health URL, log excerpt, test output, deployment URL
scope:build depth:deep strict:on
Pass 1: Build Logs
Discovery:
```text
[12:34:56] Error: Type error: Property 'user' does not exist on type 'Session'
[12:34:56] src/app/dashboard/page.tsx:45:28
```
Problem Register:
| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-01 | P0 | Build | Type error in dashboard/page.tsx:45 | Supabase types out of sync | Regenerate types from schema | In Progress | 0.9 |
Execute:
```bash
# Regenerate Supabase types
npx supabase gen types typescript --project-id abc123 > types/supabase.ts

# Verify locally
npm run build
```
Primary Verification: Local build succeeds
Independent Verification: Push to preview branch, check Vercel build
Pass 2: Re-scan with Different Method (Runtime Logs)
Discovery: Build now succeeds, but checking preview deployment logs shows:
[Error] Unhandled Promise Rejection: Missing environment variable: SUPABASE_SERVICE_KEY
Problem Register Update:
| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-01 | P0 | Build | Type error | Types out of sync | Regenerated types | Resolved | 0.9 |
| P-02 | P0 | Config | "Missing SUPABASE_SERVICE_KEY" in preview logs | Not set in Vercel env | Add to Vercel dashboard | In Progress | 1.0 |
Execute:
- Go to Vercel dashboard → Settings → Environment Variables
- Add SUPABASE_SERVICE_KEY with value from Supabase dashboard
- Trigger redeploy
Primary Verification: Preview deployment succeeds, health endpoint returns 200
Independent Verification: Test actual API call with curl
Pass 3: Re-scan with Third Method (Manual Testing)
Discovery: Testing the user auth flow manually:
```bash
curl "https://preview-abc.vercel.app/api/auth/callback?code=xyz"
```
Returns 500. Logs show:
[Error] redirect_uri mismatch in OAuth callback
Problem Register Update:
| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-03 | P1 | Auth | OAuth callback 500, redirect_uri mismatch | Supabase only has prod URL in allowed list | Add preview URL pattern | In Progress | 0.85 |
Execute:
- Add https://*.vercel.app to Supabase allowed redirect URLs
- Test callback again
Primary Verification: Callback returns 302 redirect as expected
Independent Verification: Complete full login flow in browser
Final Status
FINAL STATUS: DONE ✅
Problem Register (Final):
| ID | Sev | Category | Evidence | Root Cause | Action | Status | Conf |
|----|-----|----------|----------|------------|--------|--------|------|
| P-01 | P0 | Build | Type error | Types out of sync | Regenerated types | Resolved | 0.9 |
| P-02 | P0 | Config | Missing SUPABASE_SERVICE_KEY | Not in Vercel env | Added to dashboard | Resolved | 1.0 |
| P-03 | P1 | Auth | OAuth callback 500 | Redirect URI not allowed | Added pattern to Supabase | Resolved | 0.85 |
Evidence Pack:
- Health: https://myapp.vercel.app/api/health (200 OK, stable 45min)
- Logs: Clean, no errors in last 2min (see screenshot)
- Tests: All 63 tests pass (see CI run #891)
- Changes: PR #145, deployment abc123xyz
WHAT TO WATCH:
- Preview branch OAuth may need periodic re-auth if URLs change
- Supabase types should be regenerated after schema changes (add to docs)
- Consider adding health check that validates all env vars on startup
Advanced: Scope-Specific Patterns
scope:start-command
Focus on process startup, port binding, signal handling.
Discovery methods:
- Platform startup logs
- Process manager logs (PM2, systemd)
- Docker ENTRYPOINT behavior
- Health check timing (does it respond immediately?)
scope:db-migrations
Focus on schema changes, data integrity, rollback safety.
Discovery methods:
- Migration logs
- Before/after schema diff
- Data validation queries
- Rollback dry-run
Extra gates:
- Reversible: Down migration tested
- Data safe: No data loss on existing records
- Performance: Migration completes in <5 min on production data volume
scope:config
Focus on environment variables, secrets, feature flags.
Discovery methods:
- Env var audit (what's set, what's missing)
- Secret rotation logs
- Config precedence (env vs file vs defaults)
- Feature flag state in admin panel
Extra gates:
- No hardcoded secrets: grep for API keys in code
- Documented: All env vars in README with examples
- Validated on startup: App checks required vars early
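The "validated on startup" gate is a short fail-fast check run before anything else. A sketch, where the variable names are taken from the examples earlier in this overlay and may differ in your app:

```python
import os

# Assumed required vars; replace with your app's actual list.
REQUIRED_VARS = ["DATABASE_URL", "SUPABASE_SERVICE_KEY"]

def missing_vars(required=REQUIRED_VARS, env=os.environ):
    """Return required variables that are unset or empty. Call at startup
    and exit nonzero on a non-empty result, instead of crashing later at
    first use of the missing value."""
    return [name for name in required if not env.get(name)]
```

Failing at boot with a named list of missing variables would have surfaced P-02 in seconds rather than as an unhandled rejection in preview logs.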
Customization for Your Stack
For Kubernetes
[DONE overlay — Software/DevOps (K8s)]
- All base gates (health, logs, tests)
- **Pods stable**: 0 restarts across all replicas for ≥30min
- **Resources healthy**: CPU `<70%`, memory `<80%`
- **Readiness**: All pods passing readiness probe
- Evidence: kubectl get pods, Grafana dashboard, logs
For Serverless (AWS Lambda, Vercel Functions)
[DONE overlay — Software/DevOps (Serverless)]
- All base gates (health, logs, tests)
- **Cold start**: `<1s` on first invocation
- **Warm performance**: p95 `<200ms`
- **Error rate**: `<0.1%` over 100 invocations
- Evidence: CloudWatch metrics, test run, logs
For CI/CD Pipelines
[DONE overlay — Software/DevOps (CI/CD)]
- **Pipeline green**: All stages pass
- **Fast feedback**: Total time `<10min`
- **Deterministic**: 3 consecutive runs, all pass
- **Artifacts**: Build outputs available, versioned
- Evidence: CI dashboard, artifact links, timing metrics
Measuring Success
Before Complete Mode
- Deploy "succeeds" but breaks in production 2 hours later
- Fix surface issue, miss database timeout
- No record of what was checked
After Complete Mode
- All three issues found before production merge
- Evidence pack allows teammate to verify
- Monitoring plan prevents recurrence
Metric to track: Time-to-stable-deploy (initial deploy → 30min healthy with no follow-up fixes)
Next Steps
- Try on a known-good system first — Deploy something that works and verify all gates pass
- Document your team's gates — What thresholds matter for your SLOs?
- Automate verification — Script the health check + log analysis
- Create runbook template — Adapt Problem Register for incident response
Related Resources
- Complete Mode Framework — Core concepts
- Quick Reference — Cheat sheet
- Data/Analytics Overlay — For metrics/experiments
- Windsurf Rule Setup — Installation guide