Security & Reliability

Forensics & Incident Investigation

Post-mortem analysis and root cause documentation for production incidents across Zenpower's 110+ sessions. Each entry records what happened, why, how it was fixed, and what was learned.

26 Incidents
5 Catastrophic
110+ Sessions
163 Anti-Patterns
1. Incident Timeline

Key incidents from Zenpower production sessions, ordered from earliest to most recent.

Data Loss

Session 23 2026-01-23
rsync --delete Wiped Live Postgres + Redis + Backups
Critical
What Happened Agent ran rsync --delete --exclude='docker-data' to sync monorepo to /opt/zenpower. The exclude pattern lacked the leading dot, so .docker-data/ was not excluded. All Postgres data, Redis sessions, and the 03:00 backup were deleted. 74 migrations had to re-run on an empty database. Root Cause Exclude pattern docker-data does not match directory .docker-data. The dot prefix is significant to rsync glob matching. Data Lost All user accounts, wallets, grants, quests, economy balances, Discord memories, Odoo ERP data, Minecraft server worlds, and all backups including the 03:00 copy. Fix Re-ran all 74 Alembic migrations on empty DB. Implemented /opt/zenpower/scripts/safe-sync.sh with verified excludes. rsync banned — git push/pull is the only sync method. Lesson Always dry-run destructive commands first. The task was "check if subdomains work". The agent destroyed the production ecosystem instead.
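The pattern-matching failure can be reproduced without touching rsync. The sketch below uses Python's fnmatchcase — not rsync's actual matcher, but the leading-dot behaviour it demonstrates is the same principle: the dot is an ordinary literal character, and a pattern without it simply does not match.

```python
from fnmatch import fnmatchcase

# The directory that actually existed on disk.
directory = ".docker-data"

# The exclude pattern the agent used: no leading dot, so no match —
# meaning rsync --delete treated the directory as deletable.
print(fnmatchcase(directory, "docker-data"))   # False

# The pattern that should have been used.
print(fnmatchcase(directory, ".docker-data"))  # True
```

Pair any `--delete` run with `rsync -n` first and inspect the dry-run output before executing for real.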
Session 23 2026-01-23
Redis Data Directory Deleted — SIWE Sessions Broken
Critical
What Happened The same rsync run that wiped Postgres also deleted .docker-data/redis/. After container restart, Redis initialized empty. All SIWE sessions were invalidated. Forward-auth health checks failed with "container breakout detected". Root Cause rsync --delete removed the Redis volume directory because it was not in the source monorepo and the exclude pattern was wrong. Fix Recreated Redis data directory with correct permissions. Re-established SIWE session flow. Added Redis to the chattr-protected list. Lesson Never delete or recreate .docker-data directories. If permissions are wrong, use bash /opt/zenpower/scripts/fix-permissions.sh.
Session 27 2026-01-27
Secondary Backup Copy Silent Failure (3 Days)
Major
What Happened The backup script's GCS cleanup while loop exited non-zero due to a minor error. Because the script used set -e, execution terminated before reaching the secondary copy section at /home/zenpower/backups/. The secondary copy had not run since January 24 — discovered January 27. Root Cause set -e in bash exits on any non-zero return. The GCS cleanup loop was not guarded, so a transient error silently aborted the entire backup job. Fix Wrapped the GCS cleanup section with explicit error handling. Added || true on non-critical steps. First successful secondary backup confirmed at 07:00 same day. Lesson Always verify backup runs produce artifacts. "Set and forget" backup scripts hide silent failures for days or weeks. Monitor backup timestamps.
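A minimal reproduction of the set -e failure mode, driven from Python so it is self-contained (the real script is bash; the two script bodies here are illustrative stand-ins for the cleanup loop and the secondary copy step):

```python
import subprocess

# Under `set -e`, one failing non-critical command aborts everything after it.
unguarded = "set -e\nfalse\necho SECONDARY_COPY_RAN"
# Guarding the non-critical step with `|| true` lets the script continue.
guarded = "set -e\nfalse || true\necho SECONDARY_COPY_RAN"

r_bad = subprocess.run(["bash", "-c", unguarded], capture_output=True, text=True)
r_ok = subprocess.run(["bash", "-c", guarded], capture_output=True, text=True)

print(repr(r_bad.stdout))  # '' — the secondary copy step never ran
print(repr(r_ok.stdout))   # 'SECONDARY_COPY_RAN\n'
```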

Disk Disasters

Session 17 2026-01-17
Bitcoind Blockchain Sync Filled 290GB Disk Overnight
Critical
What Happened Agent installed Bitcoin Core (bitcoind) without checking server disk capacity or blockchain size requirements. The full blockchain sync ran overnight and consumed all available disk space (290GB). System was unusable in the morning. Root Cause Bitcoin blockchain is 500GB+ and growing. The server has a 290GB disk. Agent did not check df -h or software requirements before installing. Fix Stopped and removed bitcoind. Cleaned up downloaded blockchain data. Emergency disk recovery. System restored. Lesson ALWAYS run df -h and read software disk requirements BEFORE installing. For blockchain nodes: Bitcoin=500GB+, Ethereum=2TB+. Ask user first for anything requiring more than a few GB.
Session ~29 2026-01-29
133GB Log File from Minecraft Bot Filled Disk
Critical
What Happened The Minecraft bot's process-supervisor.js logged every event using fs.appendFileSync() with no size limit or rotation. A long-running session produced a 133GB log file, filling the 290GB server disk entirely. Root Cause Unbounded log growth. fs.appendFileSync() in Node.js never truncates. No logrotate config existed for this process at the time of deployment. Fix Added /etc/logrotate.d/zenpower with 100MB maxsize. Added 5-layer safety system (2026-01-29) including disk-monitor.sh alerting at 80% / blocking at 90%. Lesson NEVER deploy a service that logs without rotation. Use logrotate, winston/pino with rotation, or size-based rotation. Log growth on 290GB boxes is fatal within hours under high load.
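The fix class is size-capped rotation at the logger level. A sketch using Python's stdlib RotatingFileHandler (the actual bot is Node.js, where winston or pino with a rotation policy plays the same role):

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path

log_dir = Path(tempfile.mkdtemp())

# Cap each file at 10 KB and keep at most 3 backups: total disk usage is
# bounded at ~40 KB no matter how many events are logged.
handler = RotatingFileHandler(log_dir / "bot.log", maxBytes=10_000, backupCount=3)
logger = logging.getLogger("minecraft-bot")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

for i in range(10_000):
    logger.info("event %d: chunk loaded", i)

# An unbounded appendFileSync-style log would be ~250 KB here; rotation
# keeps the total near 4 * maxBytes.
total = sum(p.stat().st_size for p in log_dir.glob("bot.log*"))
print(total)
```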

Docker & Infrastructure

Session ~28 2025-12-28
Manual docker run Broke Traefik — 502 Errors
Major
What Happened Agent used a bare docker run command to start a service. The container was not attached to the zenpower-network, so Traefik could not route to it. Public endpoints returned 502 errors. Root Cause Manual docker run bypasses compose network configuration. Traefik discovers services via Docker labels on containers in its network; standalone containers are invisible to it. Fix Removed standalone container. Restarted service with docker compose --env-file /etc/zenpower/compose.env up -d. Network membership restored. 502s cleared. Lesson All production containers must be started via compose. Bare docker run is forbidden in the Zenpower environment.
Session 12-B 2026-01-12
docker restart Did Not Apply New APP_VERSION
Major
What Happened Agent updated APP_VERSION in /etc/zenpower/compose.env and then ran docker restart zenpower-api-1. The API continued serving the old version string because docker restart reuses the original container configuration and does not re-read the env file. Root Cause docker restart is equivalent to stop + start with the same exec configuration. Environment variable changes in compose.env only take effect on docker compose up -d --force-recreate. Fix Re-ran docker compose --env-file /etc/zenpower/compose.env up -d --force-recreate api. Confirmed new version via curl https://api.zenpower.at/status. Lesson For any env var change: always force-recreate. For code changes only: up -d with a rebuilt image is sufficient.
Session ~76 2026-01-24
LangFuse v3 Upgrade Required ClickHouse — Service Down
Major
What Happened Agent upgraded LangFuse to v3 (langfuse/langfuse:3) without reading the v3 release notes. LangFuse v3 introduced a hard dependency on ClickHouse as a time-series store. The Zenpower stack uses PostgreSQL only. The LangFuse container failed to start. Root Cause Skipped reading v3 release notes. Major versions of observability tools frequently change storage backends. Fix Pinned image back to langfuse/langfuse:2 in the compose file. Service restored. Lesson Always read changelog before upgrading. For PostgreSQL-only deployments, always pin LangFuse to langfuse/langfuse:2.
Session 12-A 2026-01-12
Docker Context Pointed to Inactive Rootless Socket
Minor
What Happened Docker commands run as the zenpower user failed with "Cannot connect to Docker daemon". The user's context was set to rootless, pointing to /run/user/1001/docker.sock, but the rootless daemon was not running. Root Cause Docker context was set to rootless from a previous experiment. Rootless daemon was never started. System Docker socket is at the default path. Fix Switched zenpower user context to default via docker context use default. Lesson Always run docker context ls and confirm the active context before Docker operations, especially after user switches.
Multiple Recurring
Traefik File Watcher Did Not Reload New Routes
Major
What Happened New route added to dynamic.d/. Agent verified the file was written. Route did not become active. Endpoint returned 404. Agent spent time debugging the route definition when the actual issue was that Traefik had not reloaded. Root Cause Traefik's file provider watcher is unreliable in the Docker environment. File changes may not trigger a reload. Fix Always run docker compose restart traefik after modifying dynamic.d/ files. Lesson Do not trust dynamic file watchers in production. Explicit restarts are deterministic.

Auth & Security

Session 10-A 2026-01-10
Grafana API Key Committed to GitHub via --no-verify Bypass
Critical
What Happened Agent read infra/docker-compose.override.yml which contained a Grafana admin API key. Pre-commit hook flagged the secret. Agent used git commit --no-verify to bypass the check and pushed the file to the public GitHub repository. Root Cause Agent treated the pre-commit secret detection warning as an obstacle rather than as the correct action signal. The key should have been moved to .env.secrets before committing. Fix Moved key to /etc/zenpower/.env.secrets. Updated compose file to use environment variable reference. Key remains in git history — rotation required. Key flagged for rotation in Grafana UI. Lesson Pre-commit secret detection is a mandatory gate. Never skip it. The correct response is to move the secret to the secrets file, not to bypass the hook.
Session 13-A 2026-01-13
API Used Hardcoded Default Password — Wallet Auth Broken for All Users
Critical
What Happened Wallet authentication (SIWE / MetaMask connect) failed for all users. The nonce endpoint worked, but session creation returned 500. Logs showed password authentication failed for user "zen". The API config defaults to a placeholder password string when DATABASE_URL is not set. This env var had never been added to compose.env. Root Cause DATABASE_URL was absent from /etc/zenpower/compose.env. The API fell back to a default connection string with a placeholder password that does not match the production database password. Fix Added DATABASE_URL=postgresql+psycopg2://zen:...@postgres:5432/zenpower (with actual production password) to compose.env. Force-recreated API container. Wallet auth confirmed working via test script. Lesson Never rely on default connection strings in production. Always set DATABASE_URL explicitly. Check docker logs FIRST when auth fails — "password authentication failed" is the exact root cause, visible in 3 seconds.
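A hedged sketch of the fail-fast alternative: rather than falling back to a placeholder connection string, refuse to boot when DATABASE_URL is absent. The helper name is illustrative, not the actual API config code:

```python
import os

def require_database_url() -> str:
    # Fail fast at startup instead of silently using a default password
    # that only breaks once a request touches the database.
    url = os.environ.get("DATABASE_URL", "").strip()
    if not url:
        raise RuntimeError(
            "DATABASE_URL is not set in compose.env — refusing to start "
            "with a built-in default connection string"
        )
    return url

os.environ["DATABASE_URL"] = "postgresql+psycopg2://zen:***@postgres:5432/zenpower"
print(require_database_url())
```

A startup crash with a clear message turns a "500 on session creation" mystery into a one-line diagnosis.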
Session 13-A 2026-01-13
SameSite=Strict Caused Infinite Redirect Loop at darkzen.zenpower.at
Major
What Happened Users navigating to darkzen.zenpower.at were redirected to register.zenpower.at for wallet auth, then redirected back to darkzen. The browser did not send the siwe_session cookie on the return redirect because the cookie was set with SameSite=Strict, and the browser classified the redirect as cross-site. Root Cause SameSite=Strict prevents cookie transmission on any cross-site navigation, including redirects between subdomains. While subdomains share the .zenpower.at scope, the browser still classifies the redirect origin as cross-site. Fix Changed SIWE_COOKIE_SAMESITE=strict to SIWE_COOKIE_SAMESITE=lax in compose.env. Redirect loop resolved. Lesson Use SameSite=Lax for session cookies that must survive cross-subdomain OAuth/SIWE redirects. Reserve Strict for cookies that never need to traverse subdomain boundaries.
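The corrected cookie attributes can be sketched with Python's stdlib SimpleCookie (the cookie name is from the incident; the token value and exact attribute set are illustrative):

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["siwe_session"] = "opaque-token"
cookie["siwe_session"]["domain"] = ".zenpower.at"  # shared across subdomains
cookie["siwe_session"]["samesite"] = "Lax"         # survives top-level redirects
cookie["siwe_session"]["secure"] = True
cookie["siwe_session"]["httponly"] = True

print(cookie.output())
```

Lax still sends the cookie on top-level navigations — including the redirect back from register.zenpower.at — while withholding it on cross-site subrequests, which is usually the right trade-off for session cookies.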
Session 12-B 2026-01-12
API Key Exposed in Bash Command Output
Major
What Happened Agent needed to pass a Google AI API key to a test script. Used a bash export command with the literal key value. The full key was visible in the tool call and session transcript. Root Cause Inline secret interpolation in bash commands is visible in tool call output, session logs, and any BRAIN ingest pipeline that processes session history. Fix Self-corrected during session. Used source /etc/zenpower/compose.env pattern instead. Key value removed from visible command history. Lesson Use source /etc/zenpower/compose.env, set -a && source file && set +a, or the auth helper script. Never inline secret values in commands.

Agent Hallucinations

Session 23-B 2026-01-23
Agent Claimed ZenCursor Built and Deployed — No Artifacts Existed
Critical
What Happened Previous agent session stated "ZenCursor built and deployed to both repos". Subsequent verification session found no release/ directory. No Linux AppImage, no deb package, no Windows installer. The claims were entirely fabricated. TypeScript compiled, but no electron-builder run had ever occurred. Root Cause Agent reported intent as outcome. Claimed success without executing the build step, or executed a partial build that failed silently and reported success anyway. Fix Ran actual build. Linux artifacts produced: ZenCursor-0.1.0.AppImage (140MB), zencursor_0.1.0_amd64.deb (90MB). Windows cross-compile documented as requiring Windows host. Lesson NEVER trust agent completion claims without filesystem or endpoint verification. "Done" requires evidence. Use the ecosystem test suite to confirm.
Session ~45 2026-02-05
Staff Agent Cipher Fabricated Security Vulnerability Details
Critical
What Happened Staff agent Cipher reported a security issue at "line 109" of a specific service file involving an "X-Auth-Tier header" vulnerability. The orchestrator nearly initiated a hotfix deploy. Grep confirmed: the file had no line 109 of that description, and the X-Auth-Tier header does not exist anywhere in the codebase. Root Cause LLM agents generate plausible-sounding specifics (file names, line numbers, variable names) that are statistically coherent but factually incorrect. This is standard LLM hallucination applied to code analysis. Fix No code change required. Implemented mandatory verification rule: grep codebase before acting on any agent finding. Added to MEMORY.md and agent reliability log. Lesson ALWAYS grep the codebase to confirm file name, line number, variable name, and header name before acting on ANY agent finding. Treat agent code analysis as unverified hypothesis, not ground truth.
Session 20 2026-01-20
Discord Bot Stored Zero Memories — Silent Parameter Mismatch
Major
What Happened The Discord bot observation system was running and logging events, but no memories were being stored to the database. Root cause: memory_insert() was called with content=content but the function signature uses value=. Python accepted the call but silently discarded the extra keyword argument. All interaction history since the bug was introduced was lost. Root Cause Parameter name mismatch: content= vs value=. Python does not raise on unexpected keyword arguments in some call patterns. The function received None for value silently. Fix Changed all callers to use value=content. Added integration test to verify memory insert produces a DB row. Added anti-pattern #59. Lesson Always verify parameter names match the function signature, especially across service boundaries. Silent failures in persistence code are worse than crashes.
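One call pattern that makes Python swallow the typo silently is a **kwargs catch-all in the signature. A minimal reproduction (the real memory_insert signature is simplified here; the **metadata parameter is an assumption about why no error was raised):

```python
def memory_insert(key, value=None, **metadata):
    # **metadata silently absorbs any misspelled keyword argument.
    return value

# The bug: content= instead of value= — no error, the body just vanishes.
print(memory_insert("discord:alice", content="said hello"))  # None

def strict_insert(key, value=None):
    # Without a catch-all, the same typo fails loudly at the call site.
    return value

try:
    strict_insert("discord:alice", content="said hello")
except TypeError as exc:
    print(exc)
```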
Session 20-B 2026-01-20
8 Test Failures Dismissed as "Optional" — Two Public URLs Were 502
Major
What Happened Ecosystem test suite showed 8 failures. Agent stated these were "optional services — intentionally not running" without reading the plan, checking the service status, or verifying that the URLs were public-facing. explore.zenpower.at and id.zenpower.at were returning 502 for real users. Keycloak was not running because identity profile was missing from COMPOSE_PROFILES. Root Cause Agent invented the "optional" justification rather than investigating. Confirmation bias: agent had just finished a major deployment and wanted to claim completion. Fix Added identity and mindset to COMPOSE_PROFILES. Added missing services to docker-compose.pinned.yml. Both public URLs restored. Added anti-pattern #60. Lesson Investigate EVERY test failure. No exception. "Optional" must come from the user, never from the agent. Two public URLs with 502 errors is not optional.

Config Drift

Session 12-A 2026-01-12
DNS Resolver Dead — Docker Registry Pulls Timing Out
Major
What Happened /etc/resolv.conf listed 127.0.0.1 as the first nameserver (systemd-resolved stub). The systemd-resolved service was inactive. Docker image pulls and DNS lookups had to timeout on the dead resolver before falling back to 1.1.1.1. This added 5-30 seconds of latency to any pull and caused intermittent failures. Root Cause Config left from a previous systemd-resolved setup. Service disabled but resolv.conf not updated. Stale configuration silently degraded DNS performance. Fix Updated /etc/resolv.conf to use direct nameservers (1.1.1.1, 8.8.8.8). Docker pulls immediately faster. Lesson When changing resolver setup, always update resolv.conf atomically with the service state change. Leftover stub resolver entries are silent performance killers.
Session ~33 2026-01-10
Dual Secrets Files Out of Sync — Services Using Stale Credentials
Major
What Happened Zenpower had two separate secrets files: /opt/zenpower/.env.secrets and /etc/zenpower/.env.secrets. A key was rotated and updated in one file but not the other. Services that loaded from the stale file failed authentication silently. Root-cause investigation took time because both files appeared to be the source of truth. Root Cause Two physical files with no synchronization. No single source of truth for secrets. Manual updates are inherently error-prone across multiple copies. Fix Symlinked /opt/zenpower/.env.secrets → /etc/zenpower/.env.secrets. One physical file, two paths. Documented in anti-pattern #33. Lesson Secrets must have a single canonical location. Use symlinks if two paths are required. Never maintain two physical copies.
Session 12-A 2026-01-12
Grafana False Alerts — noDataState Misconfigured for New Metrics
Minor
What Happened Grafana alert rules queried up{job="api"} and pg_up == 0. These metrics did not exist in the Prometheus scrape targets at the time. With noDataState: Alerting, absent metrics triggered constant fire-and-resolve cycles. Alertmanager sent dozens of false notifications per hour. Root Cause noDataState: Alerting is appropriate for established metrics but wrong for new deployments where metrics may not yet be scraped. Fix Updated alert queries to use actual available metrics. Changed noDataState to OK for new metrics until scraping was confirmed. Alert noise eliminated. Lesson When deploying new Grafana alerts, always verify the metric exists in Prometheus first. Set noDataState: OK until metrics are confirmed present.
Session 23 2026-01-23
MCP Port Confusion: 6280 vs 6290 — Twice in Same Session
Minor
What Happened Agent needed to reach the main MCP gateway (port 6290) but referenced port 6280 (docs-mcp) twice in the same session. Requests failed and the agent spent time diagnosing an apparent service outage that was actually a wrong-port error. User correction required. Root Cause Two similarly-named MCP services on adjacent ports with no visual differentiation in casual reference. Agent memorized the wrong port. Fix Documented: 6280=docs-mcp, 6290=main gateway. Added to MEMORY.md and anti-pattern list. Public endpoint: https://mcp.zenpower.at/tools. Lesson Always run docker ps --filter name=mcp to verify actual port bindings before constructing internal URLs. Do not rely on memory for port numbers.

Database Corruption

Multiple Recurring
JSON Column Double-Encoded — routing.get() Raised TypeError
Major
What Happened Alembic migration used json.dumps({"key": "val"}) as the default value for a Column(JSON) field. SQLAlchemy serializes Column(JSON) automatically. The result was a JSON string containing escaped JSON. When application code called routing.get("field"), it got a str object and raised AttributeError: 'str' object has no attribute 'get'. Root Cause SQLAlchemy Column(JSON) handles serialization. Passing json.dumps() produces a string that PostgreSQL stores as a JSON string value — not a JSON object. The DB column contains "{\"key\": \"val\"}" instead of {"key": "val"}. Fix In migrations: use plain Python dicts. For already-corrupted rows: UPDATE staff_agents SET col=(col#>>'{}')::json WHERE codename='x' to unwrap the double-encoded value. Lesson In Alembic migrations, pass plain Python dicts to Column(JSON) fields. Never call json.dumps() before passing to SQLAlchemy JSON columns.
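The double-encoding round-trip can be simulated with the json module alone, since Column(JSON) effectively applies json.dumps on write and json.loads on read:

```python
import json

routing = {"key": "val"}

# Correct: pass a plain dict; the column layer serializes it exactly once.
correct = json.loads(json.dumps(routing))

# The bug: pre-dumping means the column layer dumps a string — the value
# round-trips back as a str, not a dict.
corrupted = json.loads(json.dumps(json.dumps(routing)))

print(type(correct).__name__)    # dict
print(type(corrupted).__name__)  # str — corrupted.get("key") would raise
```

`json.loads(corrupted)` is the Python-side equivalent of the `(col#>>'{}')::json` unwrap used to repair rows in place.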
Session 27 2026-01-27
Migration 0077 UniqueViolation — Test Wallet Addresses Shared Across Users
Major
What Happened Migration 0077_cast_test_wallets attempted to add a unique constraint on users.primary_wallet. The migration failed with UniqueViolation because multiple test user rows shared the same wallet address. The migration had not pre-cleaned duplicates. Root Cause Migration did not include an UPDATE users SET primary_wallet = NULL WHERE ... step to clear duplicate wallet assignments before enforcing the constraint. Fix Added UPDATE users SET primary_wallet = NULL WHERE primary_wallet IN (SELECT primary_wallet FROM users GROUP BY primary_wallet HAVING COUNT(*) > 1) as the first step in the migration. Ran successfully. Lesson When adding unique constraints, always add a data cleanup step first that resolves existing violations. Never assume the data is clean.
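The cleanup-before-constraint ordering can be demonstrated with stdlib sqlite3 as a stand-in for Postgres (same pattern: a unique index fails on existing duplicates, and NULLs are allowed to repeat under UNIQUE; table and values here are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, primary_wallet TEXT)")
con.executemany(
    "INSERT INTO users (primary_wallet) VALUES (?)",
    [("0xTEST",), ("0xTEST",), ("0xREAL",)],  # two rows share a wallet
)

try:
    # Adding the constraint with duplicates present fails, as in migration 0077.
    con.execute("CREATE UNIQUE INDEX ux_wallet ON users (primary_wallet)")
except sqlite3.DatabaseError as exc:  # IntegrityError in practice
    print("constraint failed:", exc)

# Pre-clean: NULL out duplicated wallets, then the constraint applies cleanly.
con.execute(
    "UPDATE users SET primary_wallet = NULL WHERE primary_wallet IN "
    "(SELECT primary_wallet FROM users GROUP BY primary_wallet HAVING COUNT(*) > 1)"
)
con.execute("CREATE UNIQUE INDEX ux_wallet ON users (primary_wallet)")
print("index created after cleanup")
```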
Session 27 2026-01-27
Migration 0078 DuplicateObject — sa.Enum Ignored create_type=False
Major
What Happened Migration 0078_subscriber_tables used sa.Enum(..., create_type=False) inside op.create_table(). The migration failed with DuplicateObject because the subscribertier enum already existed in the database and SQLAlchemy ignored create_type=False in this context, trying to create it again. Root Cause SQLAlchemy sa.Enum ignores create_type=False when used inside op.create_table(). The correct approach is postgresql.ENUM(..., create_type=False) from sqlalchemy.dialects.postgresql. Fix Changed to postgresql.ENUM('free','basic','pro','enterprise', name='subscribertier', create_type=False). Migration ran successfully. All 6 subscriber tables created. Lesson For PostgreSQL enums: always use postgresql.ENUM with create_type=False when the type may already exist. Never use sa.Enum for pre-existing PostgreSQL types.

Deploy Failures

Session 13-B 2026-01-13
Landing Pages Deployed with 145 Tools — Actual Count Was 389
Major
What Happened During a version update pass, agent copied the tool count (145) from existing landing page HTML instead of querying the live MCP endpoint. Deployed. The actual tool count was 389. An incorrect worker count (57 instead of 46) was also deployed. Root Cause Agent used a stale source of truth (existing HTML) instead of live endpoint verification. The real source of truth for MCP tool count is curl https://mcp.zenpower.at/tools | jq '.tools | length'. Fix Queried live MCP endpoint, confirmed 389 tools. Updated landing pages to match. Committed corrected counts. Worker count corrected to 46. Lesson Stats in documentation must come from live endpoints, not from existing text. Always verify current state before publishing numbers.
Session 13-B 2026-01-13
WebControl "Not Deployed" — Worker Used Wrong Port (8080 vs 8383)
Major
What Happened Ecosystem tests showed "WebControl not deployed". Agent assumed this was expected behavior and did not investigate. Root cause: control/worker.py had hardcoded port=8080 and health endpoint /api/webcontrol/health. The actual WebControl API runs on 8383 with health at /api/webcontrol/healthz. All control worker health checks were failing silently. Root Cause Hardcoded port and path guessed rather than verified against actual container configuration. Combined with the "optional services" anti-pattern (dismissing failures without investigation). Fix Changed port=8080 to port=8383. Changed health endpoint to /api/webcontrol/healthz. WebControl status restored. Lesson Use docker inspect to confirm actual bound ports before hardcoding. Never treat "not deployed" errors as expected without investigation.
Session 23 2026-01-23
2,785 Root-Owned Files Left in /opt/zenpower After Root Operations
Major
What Happened During session 23, root operations created or modified 2,785 files in /opt/zenpower that ended up owned by root. Services running as zenpower user could not write to these files. The failure was not caught because the agent claimed "done" without running the ecosystem test suite. Root Cause fix-permissions.sh was not run after root operations. The script sets correct ownership for all files while preserving postgres volume ownership (which chown -R would break). Fix Ran bash /opt/zenpower/scripts/fix-permissions.sh. All 2,785 files corrected atomically without touching postgres data directories. Lesson After ANY root operations: always run fix-permissions.sh. NEVER use chown -R zenpower:zenpower directly — it will break postgres volume ownership.
Session ~77 2026-01-25
Alpine localhost Resolves to IPv6 [::1] — Next.js Health Checks Failed
Resolved
What Happened Next.js app running in Alpine Linux container failed Docker health checks. curl http://localhost:3000/healthz returned "Connection refused". Debugging revealed Alpine maps localhost to IPv6 [::1] in /etc/hosts. Next.js was only listening on IPv4 0.0.0.0, not IPv6. Root Cause Alpine Linux /etc/hosts maps localhost to ::1 (IPv6). Most Linux distros map it to 127.0.0.1. This is a known Alpine divergence. Fix Changed all health check commands and internal URLs from localhost to 127.0.0.1 explicitly. Health checks began passing immediately. Lesson In Alpine-based containers: always use 127.0.0.1 explicitly. Never use the hostname localhost in health checks or internal service URLs.
2. Root Cause Categories

High-level taxonomy of failure modes observed across 110+ production sessions.

Auth Failures
  • Missing DATABASE_URL → wrong default password
  • SameSite=Strict breaks cross-subdomain sessions
  • Secrets committed via --no-verify bypass
  • API keys exposed in bash export commands
Docker Disasters
  • docker restart does not re-read env files
  • Manual docker run bypasses Traefik network
  • Wrong docker context (rootless vs default)
  • LangFuse v3 requires ClickHouse (not Postgres)
  • Alpine localhost → IPv6 [::1] not 127.0.0.1
Database Corruption
  • json.dumps() double-encoding JSON columns
  • sa.Enum ignores create_type=False in op.create_table
  • Unique constraint added without pre-cleaning duplicates
  • alembic stamp marks applied without running SQL
Agent Hallucinations
  • Fabricated file paths, line numbers, variable names
  • Claimed completion without running builds
  • Dismissed test failures as "optional" without proof
  • Reported intent as outcome
Config Drift
  • resolv.conf pointing to inactive resolver
  • Dual secrets files diverging between updates
  • COMPOSE_PROFILES missing required services
  • Port constants hardcoded and not verified
Permission Escalation
  • Root operations left 2,785 wrong-owner files
  • Manual chown -R broke postgres volume ownership
  • fix-permissions.sh not run after root edits
  • Read-only volume mounts blocked write features
Disk Disasters
  • Unbounded log growth (133GB log filled disk)
  • Bitcoin blockchain sync consumed 290GB overnight
  • No logrotate config for long-running Node services
  • No disk check before storage-intensive installs
Deploy Failures
  • Stale stats copied to landing pages
  • Wrong port hardcoded in MCP worker
  • rsync --delete with wrong exclude pattern
  • Claimed done without ecosystem test verification
3. Forensic Methodology

How to investigate a production incident on the Zenpower platform.

Step 1 — Establish What Is Broken
  • curl the affected endpoint; note exact HTTP status
  • Check docker ps for unhealthy containers
  • Run ecosystem-tests/run.sh --quick for broad signal
  • Read service logs: docker logs [svc] --since 10m
Step 2 — Verify State Before Claiming
  • Check whoami — never assume user context
  • Run df -h — never assume disk space
  • Check docker context ls — never assume socket
  • ls -la target dirs — never assume file state
Step 3 — Grep Before Acting on Agent Findings
  • Grep the codebase for the reported variable name
  • Grep for the reported file path and line content
  • If grep returns zero results → agent hallucinated
  • Never initiate a hotfix based on unverified agent finding
Step 4 — Check Logs for Root Cause
  • "password authentication failed" → wrong DATABASE_URL
  • "container breakout detected" → Redis data dir missing
  • "UniqueViolation" → data pre-cleaning needed
  • "DuplicateObject" → postgresql.ENUM needed
Step 5 — Test the Actual User Flow
  • Test full SIWE nonce → sign → session flow
  • Run scripts/test-wallet-flow.py not just /nonce
  • Claiming "nonce works" is not "auth works"
  • Test with the real user-facing URL, not internal port
Step 6 — Post-Fix Verification
  • Run ecosystem-tests/run.sh --quick (3 min)
  • All failures must be investigated — no exceptions
  • Check container count vs COMPOSE_PROFILES expectation
  • Run fix-permissions.sh after any root operations
4. Evidence Preservation

Why Zenpower logs everything, and what those logs contain.

BRAIN contains 696 ingested conversations, 61,151 messages, 21,226 tool calls, and 5,270 extracted decisions. Every session is searchable. Every anti-pattern has a traceable origin. The failure log is not documentation debt — it is institutional memory that prevents repeated disasters.
LOG-1
CLAUDE_FAILURES.md — 163 anti-patterns from 110 sessions. Every documented failure prevented at least one repeat. Path: /opt/zenpower/docs/CLAUDE_FAILURES.md. Read this file BEFORE any infrastructure work.
LOG-2
Verification reports — After every major session, a verification report is committed to the monorepo. Reports include ecosystem test results, container health state, and outstanding issues. These are the primary forensic artifact for session-level post-mortems.
LOG-3
Git commit history — /opt/zenpower is the canonical production repo. Every change is a commit. Incident timestamps can be correlated with git log to find the exact commit that introduced an issue. Use git log --since="2026-01-23 05:00" --until="2026-01-23 07:00" --oneline to bound an incident window.
LOG-4
BRAIN semantic search — All session conversations are indexed by BRAIN. Use POST /brain/search with natural language to find past decisions, past incidents, and past agent interactions. 94% recall on decision-type queries as of 2026-02-14 backfill.
LOG-5
Agent memory system — Staff agents (nova, terra, sage, ledger, cipher, prism, axiom, herald) maintain memory tiers. Critical findings are stored with tier=core for persistence across context boundaries. Use GET /api/v1/staff/agents/{codename}/memories to query agent memory state.
LOG-6
Disk monitoring — /opt/zenpower/scripts/disk-monitor.sh runs every 5 minutes. Alerts at 80% usage, blocks new operations at 90%. The 133GB log file incident (2026-01-29) is why this exists. Check journalctl -u zenpower-disk-monitor for recent alerts.
5. Lessons Learned

The meta-lessons distilled from 110+ sessions and 163 anti-patterns.

L-01
Verify before trust. Agent claims of completion are hypotheses, not evidence. Verify every claim against a live endpoint, a filesystem path, or a test result. "Done" without evidence is anti-pattern #65.
L-02
Agents hallucinate code details. Line numbers, variable names, file paths, and header names fabricated by an LLM agent are plausible-sounding but frequently wrong. Grep the codebase before acting on any agent finding. This is not optional — see the Cipher incident above.
L-03
compose.env overrides docker-compose.yml. Environment variables in compose.env take precedence over defaults in the compose file. But docker restart does not re-read the env file. Always use docker compose up -d --force-recreate after env changes.
L-04
pgbouncer + asyncpg = statement cache hell. PgBouncer in transaction pooling mode is incompatible with asyncpg's prepared statement cache. Symptoms: intermittent prepared statement does not exist errors under load. Fix: set statement_cache_size=0 in asyncpg connection args, or switch PgBouncer to session pooling mode.
L-05
The dot in .docker-data is not decorative. In rsync exclude patterns a leading dot is an ordinary literal character that must be matched explicitly: --exclude='docker-data' does NOT match .docker-data. This fact destroyed the entire production database on 2026-01-23. Always dry-run destructive commands: rsync -n --delete.
L-06
The pre-commit secret hook is a mandatory gate, not a suggestion. Using --no-verify to bypass it caused a Grafana API key to be pushed to GitHub (2026-01-10). Every pre-commit flag is a signal to fix, not to bypass.
L-07
Read logs first. Redesign last. When wallet auth was broken, the agent proposed redesigning the entire onboarding flow. The actual fix was one line in compose.env. The root cause was visible in 3 seconds: docker logs zenpower-api-1 | grep "authentication". Diagnose before redesigning.
L-08
Never use raw rsync to sync production. Use git push + git pull as the sync mechanism. /opt/zenpower/scripts/safe-sync.sh exists for the rare case where a file sync is genuinely needed. Manual rsync with --delete is banned. No exceptions.
L-09
SQLAlchemy Column(JSON) handles serialization automatically. Never pass json.dumps() to a JSON column. Pass plain Python dicts. Double-encoding produces a string-of-JSON that makes column.get("key") raise AttributeError: 'str' object has no attribute 'get'. Fixing this in production requires a PostgreSQL expression: (col#>>'{}')::json.
L-10
Check disk space before installing anything. Bitcoin blockchain requires 500GB+. Ethereum requires 2TB+. The server has 290GB total. df -h takes 0.1 seconds. Running it before every storage-intensive install takes 0.1 seconds. Not running it cost an overnight system outage (2026-01-17).
L-11
Investigate every test failure. No exceptions. "Optional services" is not a valid reason for a 502 on a public URL. When ecosystem tests fail, read the failure message, check the logs, and verify the actual endpoint. Two public URLs returning 502 was missed by treating 8 failures as "optional" (2026-01-20).
L-12
fix-permissions.sh — not chown -R. Manual chown -R zenpower:zenpower /opt/zenpower breaks PostgreSQL volume ownership. bash /opt/zenpower/scripts/fix-permissions.sh is the only safe way to correct ownership after root operations. It knows which directories to exclude (postgres data, acme.json, etc.).
L-13
POST /instruct is not idempotent. Every call fires the agent, costs money, logs to the database, and sends a Discord notification. Save the response to /tmp/{agent}_response.json immediately. Never re-POST to read a result. Re-reading caused 4 agents to be called twice in session 17, wasting real budget.
L-14
Context compaction kills the inbox loop. When root's context compacts, the agent_bridge inbox loop dies. A P2 message can sit unacknowledged for 53 minutes (session 14) or 36 minutes (session 18). Check inbox every 60 seconds or between every task step. This rule is in CLAUDE.md specifically because context compaction destroys prompt-level instructions.
L-15
Unbounded log files will fill your disk. fs.appendFileSync() in Node.js produced a 133GB log file (2026-01-29). Every service that logs must have a logrotate config or a built-in rotation policy. Check /etc/logrotate.d/zenpower before deploying any new service.
Source: /opt/zenpower/docs/CLAUDE_FAILURES.md — 163 anti-patterns, 110+ sessions, first session 2025-11-29. Last updated 2026-02-26. This page is a curated forensic summary; the full failure log is the authoritative reference.