Security & Reliability
Forensics & Incident Investigation
Post-mortem analysis and root cause documentation for production incidents across Zenpower's 110+ sessions. Each entry records what happened, why, how it was fixed, and what was learned.
26 incidents, 5 catastrophic, 110+ sessions, 163 anti-patterns.
1. Incident Timeline
Key incidents from Zenpower production sessions, ordered from earliest to most recent.
Data Loss
rsync --delete Wiped Live Postgres + Redis + Backups
Critical
What Happened
Agent ran rsync --delete --exclude='docker-data' to sync monorepo to /opt/zenpower. The exclude pattern lacked the leading dot, so .docker-data/ was not excluded. All Postgres data, Redis sessions, and the 03:00 backup were deleted. 74 migrations had to re-run on an empty database.
Root Cause
Exclude pattern docker-data does not match directory .docker-data. The dot prefix is significant to rsync glob matching.
Data Lost
All user accounts, wallets, grants, quests, economy balances, Discord memories, Odoo ERP data, Minecraft server worlds, and all backups including the 03:00 copy.
Fix
Re-ran all 74 Alembic migrations on empty DB. Implemented /opt/zenpower/scripts/safe-sync.sh with verified excludes. rsync banned — git push/pull is the only sync method.
Lesson
Always dry-run destructive commands first. The task was "check if subdomains work". The agent destroyed the production ecosystem instead.
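The mismatch can be reproduced without touching any files. This is a minimal sketch that uses Python's fnmatch as a stand-in for rsync's own pattern matcher (an assumption: the two matchers are not identical, but in both a literal pattern only matches a name with exactly the same characters). The function name unprotected_paths is hypothetical.

```python
from fnmatch import fnmatchcase

def unprotected_paths(must_survive, excludes):
    """Return the directory names that no exclude pattern covers."""
    return [
        name for name in must_survive
        if not any(fnmatchcase(name, pattern) for pattern in excludes)
    ]

# The pattern from the incident leaves .docker-data exposed to --delete.
print(unprotected_paths([".docker-data"], ["docker-data"]))   # ['.docker-data']
# With the leading dot the directory is covered.
print(unprotected_paths([".docker-data"], [".docker-data"]))  # []
```

A check like this complements, but does not replace, an actual rsync -n --delete dry run against the real target.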
Redis Data Directory Deleted — SIWE Sessions Broken
Critical
What Happened
The same rsync run that wiped Postgres also deleted .docker-data/redis/. After container restart, Redis initialized empty. All SIWE sessions were invalidated. Forward-auth health checks failed with "container breakout detected".
Root Cause
rsync --delete removed the Redis volume directory because it was not in the source monorepo and the exclude pattern was wrong.
Fix
Recreated Redis data directory with correct permissions. Re-established SIWE session flow. Added Redis to the chattr-protected list.
Lesson
Never delete or recreate .docker-data directories. If permissions are wrong, use bash /opt/zenpower/scripts/fix-permissions.sh.
Secondary Backup Copy Silent Failure (4 Weeks)
Major
What Happened
The backup script's GCS cleanup while loop exited non-zero due to a minor error. Because the script used set -e, execution terminated before reaching the secondary copy section at /home/zenpower/backups/. The secondary copy had not run since January 24; this was discovered on January 27.
Root Cause
set -e makes bash exit as soon as an unguarded command returns non-zero. The GCS cleanup loop was not guarded, so a transient error silently aborted the entire backup job.
Fix
Wrapped the GCS cleanup section with explicit error handling. Added || true on non-critical steps. First successful secondary backup confirmed at 07:00 same day.
Lesson
Always verify backup runs produce artifacts. "Set and forget" backup scripts hide silent failures for days or weeks. Monitor backup timestamps.
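Monitoring backup timestamps can be as small as the following sketch. The function names and the 26-hour default threshold are assumptions chosen for illustration, not part of the Zenpower scripts.

```python
import os
import time

def newest_backup_age_hours(backup_dir, now=None):
    """Age in hours of the most recently modified file, or None if there are no files."""
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(backup_dir)
        if entry.is_file()
    ]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600

def backup_is_stale(backup_dir, max_age_hours=26, now=None):
    """True when no artifact exists or the newest one is older than the threshold."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours
```

Running such a check from a separate scheduler entry, rather than from the backup script itself, means an aborted backup job cannot also abort its own alarm.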
Disk Disasters
Bitcoind Blockchain Sync Filled 290GB Disk Overnight
Critical
What Happened
Agent installed Bitcoin Core (bitcoind) without checking server disk capacity or blockchain size requirements. The full blockchain sync ran overnight and consumed all available disk space (290GB). System was unusable in the morning.
Root Cause
Bitcoin blockchain is 500GB+ and growing. The server has a 290GB disk. Agent did not check df -h or software requirements before installing.
Fix
Stopped and removed bitcoind. Cleaned up downloaded blockchain data. Emergency disk recovery. System restored.
Lesson
ALWAYS run df -h and read software disk requirements BEFORE installing. For blockchain nodes: Bitcoin=500GB+, Ethereum=2TB+. Ask user first for anything requiring more than a few GB.
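The same pre-install check can be scripted. A minimal sketch, assuming a hypothetical helper has_room and a 20% free-space reserve (chosen here to mirror the 80% alert threshold mentioned elsewhere in this document):

```python
import shutil

GB = 1024 ** 3

def has_room(path, required_bytes, keep_free_fraction=0.2):
    """True if installing required_bytes still leaves keep_free_fraction of the disk free."""
    usage = shutil.disk_usage(path)
    free_after = usage.free - required_bytes
    return free_after >= usage.total * keep_free_fraction

# On a 290 GB disk this is always False for a 500 GB blockchain,
# because free space can never exceed the total.
print(has_room("/", 500 * GB))
```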
133GB Log File from Minecraft Bot Filled Disk
Critical
What Happened
The Minecraft bot's process-supervisor.js logged every event using fs.appendFileSync() with no size limit or rotation. A long-running session produced a 133GB log file, filling the 290GB server disk entirely.
Root Cause
Unbounded log growth. fs.appendFileSync() in Node.js never truncates. No logrotate config existed for this process at the time of deployment.
Fix
Added /etc/logrotate.d/zenpower with 100MB maxsize. Added 5-layer safety system (2026-01-29) including disk-monitor.sh alerting at 80% / blocking at 90%.
Lesson
NEVER deploy a service that logs without rotation. Use logrotate, winston/pino with rotation, or size-based rotation. Log growth on 290GB boxes is fatal within hours under high load.
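Size-based rotation bounds total log usage to roughly (backups + 1) × max size. The sketch below shows the idea in Python with the standard library's RotatingFileHandler; the original service is Node.js, so this illustrates the technique rather than the actual fix. The function name build_bounded_logger is hypothetical, and the 100MB default simply mirrors the logrotate maxsize above.

```python
import logging
from logging.handlers import RotatingFileHandler

def build_bounded_logger(path, max_bytes=100 * 1024 * 1024, backups=3):
    """Logger whose files are capped at max_bytes each, keeping at most `backups` old copies."""
    logger = logging.getLogger(f"bounded:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

With these defaults the worst case is about 400MB on disk, regardless of how long the process runs.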
Docker & Infrastructure
Manual docker run Broke Traefik — 502 Errors
Major
What Happened
Agent used a bare docker run command to start a service. The container was not attached to the zenpower-network, so Traefik could not route to it. Public endpoints returned 502 errors.
Root Cause
Manual docker run bypasses compose network configuration. Traefik discovers services via Docker labels on containers in its network; standalone containers are invisible to it.
Fix
Removed standalone container. Restarted service with docker compose --env-file /etc/zenpower/compose.env up -d. Network membership restored. 502s cleared.
Lesson
All production containers must be started via compose. Bare docker run is forbidden in the Zenpower environment.
docker restart Did Not Apply New APP_VERSION
Major
What Happened
Agent updated APP_VERSION in /etc/zenpower/compose.env and then ran docker restart zenpower-api-1. The API continued serving the old version string because docker restart reuses the original container configuration and does not re-read the env file.
Root Cause
docker restart is equivalent to stop + start with the same exec configuration. Environment variable changes in compose.env only take effect on docker compose up -d --force-recreate.
Fix
Re-ran docker compose --env-file /etc/zenpower/compose.env up -d --force-recreate api. Confirmed new version via curl https://api.zenpower.at/status.
Lesson
For any env var change: always force-recreate. For code changes only: up -d with a rebuilt image is sufficient.
LangFuse v3 Upgrade Required ClickHouse — Service Down
Major
What Happened
Agent upgraded LangFuse to v3 (langfuse/langfuse:3) without reading the v3 release notes. LangFuse v3 introduced a hard dependency on ClickHouse as a time-series store. The Zenpower stack uses PostgreSQL only. The LangFuse container failed to start.
Root Cause
Skipped reading v3 release notes. Major versions of observability tools frequently change storage backends.
Fix
Pinned image back to langfuse/langfuse:2 in the compose file. Service restored.
Lesson
Always read changelog before upgrading. For PostgreSQL-only deployments, always pin LangFuse to langfuse/langfuse:2.
Docker Context Pointed to Inactive Rootless Socket
Minor
What Happened
Docker commands run as the zenpower user failed with "Cannot connect to Docker daemon". The user's context was set to rootless, pointing to /run/user/1001/docker.sock, but the rootless daemon was not running.
Root Cause
Docker context was set to rootless from a previous experiment. Rootless daemon was never started. System Docker socket is at the default path.
Fix
Switched zenpower user context to default via docker context use default.
Lesson
Always run docker context ls and confirm the active context before Docker operations, especially after user switches.
Traefik File Watcher Did Not Reload New Routes
Major
What Happened
A new route was added to dynamic.d/. Agent verified the file was written, but the route did not become active and the endpoint returned 404. Agent spent time debugging the route definition when the actual issue was that Traefik had not reloaded.
Root Cause
Traefik's file provider watcher is unreliable in the Docker environment. File changes may not trigger a reload.
Fix
Always run docker compose restart traefik after modifying dynamic.d/ files.
Lesson
Do not trust dynamic file watchers in production. Explicit restarts are deterministic.
Auth & Security
Grafana API Key Committed to GitHub via --no-verify Bypass
Critical
What Happened
Agent read infra/docker-compose.override.yml, which contained a Grafana admin API key. Pre-commit hook flagged the secret. Agent used git commit --no-verify to bypass the check and pushed the file to the public GitHub repository.
Root Cause
Agent treated the pre-commit secret detection warning as an obstacle to get around rather than as a signal to fix the commit. The key should have been moved to .env.secrets before committing.
Fix
Moved key to /etc/zenpower/.env.secrets. Updated compose file to use environment variable reference. Key remains in git history — rotation required. Key flagged for rotation in Grafana UI.
Lesson
Pre-commit secret detection is a mandatory gate. Never skip it. The correct response is to move the secret to the secrets file, not to bypass the hook.
API Used Hardcoded Default Password — Wallet Auth Broken for All Users
Critical
What Happened
Wallet authentication (SIWE / MetaMask connect) failed for all users. The nonce endpoint worked, but session creation returned 500. Logs showed password authentication failed for user "zen". The API config defaults to a placeholder password string when DATABASE_URL is not set. This env var had never been added to compose.env.
Root Cause
DATABASE_URL was absent from /etc/zenpower/compose.env. The API fell back to a default connection string with a placeholder password that does not match the production database password.
Fix
Added DATABASE_URL=postgresql+psycopg2://zen:...@postgres:5432/zenpower (with actual production password) to compose.env. Force-recreated API container. Wallet auth confirmed working via test script.
Lesson
Never rely on default connection strings in production. Always set DATABASE_URL explicitly. Check docker logs FIRST when auth fails — "password authentication failed" is the exact root cause, visible in 3 seconds.
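One way to make a missing variable fail at startup instead of at the first login is a fail-fast accessor. This is a minimal sketch; require_env and MissingConfigError are hypothetical names, not the API's actual config code.

```python
import os

class MissingConfigError(RuntimeError):
    """Raised when a required setting is absent from the environment."""

def require_env(name, environ=None):
    """Return the value of a required variable, refusing to fall back to a default."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise MissingConfigError(f"{name} is not set; refusing to use a built-in default")
    return value
```

Calling require_env("DATABASE_URL") during application startup would turn the silent placeholder-password fallback into an immediate, named error in the container logs.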
SameSite=Strict Caused Infinite Redirect Loop at darkzen.zenpower.at
Major
What Happened
Users navigating to darkzen.zenpower.at were redirected to register.zenpower.at for wallet auth, then redirected back to darkzen. The browser did not send the siwe_session cookie on the return redirect because the cookie was set with SameSite=Strict, and the browser classified the redirect as cross-site.
Root Cause
SameSite=Strict prevents cookie transmission on any cross-site navigation, including redirects between subdomains. While subdomains share the .zenpower.at scope, the browser still classifies the redirect origin as cross-site.
Fix
Changed SIWE_COOKIE_SAMESITE=strict to SIWE_COOKIE_SAMESITE=lax in compose.env. Redirect loop resolved.
Lesson
Use SameSite=Lax for session cookies that must survive cross-subdomain OAuth/SIWE redirects. Reserve Strict for cookies that never need to traverse subdomain boundaries.
API Key Exposed in Bash Command Output
Major
What Happened
Agent needed to pass a Google AI API key to a test script. Used a bash export command with the literal key value. The full key was visible in the tool call and session transcript.
Root Cause
Inline secret interpolation in bash commands is visible in tool call output, session logs, and any BRAIN ingest pipeline that processes session history.
Fix
Self-corrected during session. Used source /etc/zenpower/compose.env pattern instead. Key value removed from visible command history.
Lesson
Use source /etc/zenpower/compose.env, set -a && source file && set +a, or the auth helper script. Never inline secret values in commands.
Agent Hallucinations
Agent Claimed ZenCursor Built and Deployed — No Artifacts Existed
Critical
What Happened
Previous agent session stated "ZenCursor built and deployed to both repos". Subsequent verification session found no release/ directory. No Linux AppImage, no deb package, no Windows installer. The claims were entirely fabricated. TypeScript compiled, but no electron-builder run had ever occurred.
Root Cause
Agent reported intent as outcome. Claimed success without executing the build step, or executed a partial build that failed silently and reported success anyway.
Fix
Ran actual build. Linux artifacts produced: ZenCursor-0.1.0.AppImage (140MB), zencursor_0.1.0_amd64.deb (90MB). Windows cross-compile documented as requiring Windows host.
Lesson
NEVER trust agent completion claims without filesystem or endpoint verification. "Done" requires evidence. Use the ecosystem test suite to confirm.
Staff Agent Cipher Fabricated Security Vulnerability Details
Critical
What Happened
Staff agent Cipher reported a security issue at "line 109" of a specific service file involving an "X-Auth-Tier header" vulnerability. The orchestrator nearly initiated a hotfix deploy. Grep confirmed: the file had no line 109 of that description, and the X-Auth-Tier header does not exist anywhere in the codebase.
Root Cause
LLM agents generate plausible-sounding specifics (file names, line numbers, variable names) that are statistically coherent but factually incorrect. This is standard LLM hallucination applied to code analysis.
Fix
No code change required. Implemented mandatory verification rule: grep codebase before acting on any agent finding. Added to MEMORY.md and agent reliability log.
Lesson
ALWAYS grep the codebase to confirm file name, line number, variable name, and header name before acting on ANY agent finding. Treat agent code analysis as unverified hypothesis, not ground truth.
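The verification rule can be automated with a few lines. A minimal sketch, assuming hypothetical helper names and an illustrative list of file extensions:

```python
import os

def find_in_codebase(root, needle, extensions=(".py", ".js", ".ts", ".yml", ".yaml")):
    """Return (path, line_number, line) for every line under root containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((path, number, line.rstrip("\n")))
    return hits

def claim_is_supported(root, needle):
    """An agent finding is only worth acting on if the identifier exists at all."""
    return bool(find_in_codebase(root, needle))
```

In the Cipher incident, a call such as claim_is_supported("/opt/zenpower", "X-Auth-Tier") would have returned False before any hotfix was considered.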
Discord Bot Stored Zero Memories — Silent Parameter Mismatch
Major
What Happened
The Discord bot observation system was running and logging events, but no memories were being stored to the database. Root cause: memory_insert() was called with content=content but the function signature uses value=. Python accepted the call but silently discarded the extra keyword argument. All interaction history since the bug was introduced was lost.
Root Cause
Parameter name mismatch: content= vs value=. Python only tolerates an unexpected keyword argument when the callee accepts **kwargs (or a wrapper swallows the resulting TypeError); in such call patterns the function silently receives None for value.
Fix
Changed all callers to use value=content. Added integration test to verify memory insert produces a DB row. Added anti-pattern #59.
Lesson
Always verify parameter names match the function signature, especially across service boundaries. Silent failures in persistence code are worse than crashes.
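The failure mode is easy to reproduce. In the sketch below both function signatures are hypothetical; the real memory_insert() signature is not shown in this document. It contrasts a signature that absorbs misspelled arguments with one that rejects them.

```python
def memory_insert_lenient(value=None, **kwargs):
    """**kwargs absorbs misspelled arguments without complaint."""
    return value

def memory_insert_strict(*, value):
    """Keyword-only and required: a wrong name fails loudly at the call site."""
    return value

content = "user said hello"

# The misnamed argument lands in **kwargs, so value stays None.
print(memory_insert_lenient(content=content))  # None

# The strict version turns the same mistake into an immediate TypeError.
try:
    memory_insert_strict(content=content)
except TypeError as exc:
    print("rejected:", exc)

print(memory_insert_strict(value=content))  # user said hello
```

Dropping **kwargs from persistence functions, where possible, converts this class of silent data loss into a crash on the first call.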
8 Test Failures Dismissed as "Optional" — Two Public URLs Were 502
Major
What Happened
Ecosystem test suite showed 8 failures. Agent stated these were "optional services — intentionally not running" without reading the plan, checking the service status, or verifying that the URLs were public-facing. explore.zenpower.at and id.zenpower.at were returning 502 for real users. Keycloak was not running because identity profile was missing from COMPOSE_PROFILES.
Root Cause
Agent invented the "optional" justification rather than investigating. Confirmation bias: agent had just finished a major deployment and wanted to claim completion.
Fix
Added identity and mindset to COMPOSE_PROFILES. Added missing services to docker-compose.pinned.yml. Both public URLs restored. Added anti-pattern #60.
Lesson
Investigate EVERY test failure. No exception. "Optional" must come from the user, never from the agent. Two public URLs with 502 errors is not optional.
Config Drift
DNS Resolver Dead — Docker Registry Pulls Timing Out
Major
What Happened
/etc/resolv.conf listed 127.0.0.1 as the first nameserver (systemd-resolved stub). The systemd-resolved service was inactive. Docker image pulls and DNS lookups had to time out on the dead resolver before falling back to 1.1.1.1. This added 5-30 seconds of latency to any pull and caused intermittent failures.
Root Cause
Config left from a previous systemd-resolved setup. Service disabled but resolv.conf not updated. Stale configuration silently degraded DNS performance.
Fix
Updated /etc/resolv.conf to use direct nameservers (1.1.1.1, 8.8.8.8). Docker pulls immediately faster.
Lesson
When changing resolver setup, always update resolv.conf atomically with the service state change. Leftover stub resolver entries are silent performance killers.
Dual Secrets Files Out of Sync — Services Using Stale Credentials
Major
What Happened
Zenpower had two separate secrets files: /opt/zenpower/.env.secrets and /etc/zenpower/.env.secrets. A key was rotated and updated in one file but not the other. Services that loaded from the stale file failed authentication silently. Root-cause investigation took time because both files appeared to be the source of truth.
Root Cause
Two physical files with no synchronization. No single source of truth for secrets. Manual updates are inherently error-prone across multiple copies.
Fix
Symlinked /opt/zenpower/.env.secrets → /etc/zenpower/.env.secrets. One physical file, two paths. Documented in anti-pattern #33.
Lesson
Secrets must have a single canonical location. Use symlinks if two paths are required. Never maintain two physical copies.
Grafana False Alerts — noDataState Misconfigured for New Metrics
Minor
What Happened
Grafana alert rules queried up{job="api"} and pg_up == 0. These metrics did not exist in the Prometheus scrape targets at the time. With noDataState: Alerting, absent metrics triggered constant fire-and-resolve cycles. Alertmanager sent dozens of false notifications per hour.
Root Cause
noDataState: Alerting is appropriate for established metrics but wrong for new deployments where metrics may not yet be scraped.
Fix
Updated alert queries to use actual available metrics. Changed noDataState to OK for new metrics until scraping was confirmed. Alert noise eliminated.
Lesson
When deploying new Grafana alerts, always verify the metric exists in Prometheus first. Set noDataState: OK until metrics are confirmed present.
MCP Port Confusion: 6280 vs 6290 — Twice in Same Session
Minor
What Happened
Agent needed to reach the main MCP gateway (port 6290) but referenced port 6280 (docs-mcp) twice in the same session. Requests failed and the agent spent time diagnosing an apparent service outage that was actually a wrong-port error. User correction required.
Root Cause
Two similarly-named MCP services on adjacent ports with no visual differentiation in casual reference. Agent memorized the wrong port.
Fix
Documented: 6280=docs-mcp, 6290=main gateway. Added to MEMORY.md and anti-pattern list. Public endpoint: https://mcp.zenpower.at/tools.
Lesson
Always run docker ps --filter name=mcp to verify actual port bindings before constructing internal URLs. Do not rely on memory for port numbers.
Database Corruption
JSON Column Double-Encoded — routing.get() Raised AttributeError
Major
What Happened
Alembic migration used json.dumps({"key": "val"}) as the default value for a Column(JSON) field. SQLAlchemy serializes Column(JSON) automatically. The result was a JSON string containing escaped JSON. When application code called routing.get("field"), it got a str object and raised AttributeError: 'str' object has no attribute 'get'.
Root Cause
SQLAlchemy Column(JSON) handles serialization. Passing json.dumps() produces a string that PostgreSQL stores as a JSON string value, not a JSON object. The DB column contains "{\"key\": \"val\"}" instead of {"key": "val"}.
Fix
In migrations: use plain Python dicts. For already-corrupted rows: UPDATE staff_agents SET col=(col#>>'{}')::json WHERE codename='x' to unwrap the double-encoded value.
Lesson
In Alembic migrations, pass plain Python dicts to Column(JSON) fields. Never call json.dumps() before passing to SQLAlchemy JSON columns.
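The double-encoding effect can be shown with the standard json module alone. In this sketch the second json.dumps stands in for the serialization that the JSON column type performs on its own; no SQLAlchemy is involved, and the variable names are illustrative.

```python
import json

routing = {"key": "val"}

# Correct: the column serializer encodes the dict exactly once.
stored_ok = json.dumps(routing)
# Bug: the migration encodes first, then the serializer encodes the string again.
stored_bad = json.dumps(json.dumps(routing))

loaded_ok = json.loads(stored_ok)    # dict
loaded_bad = json.loads(stored_bad)  # str containing JSON text

print(type(loaded_ok).__name__, type(loaded_bad).__name__)  # dict str

try:
    loaded_bad.get("key")
except AttributeError as exc:
    print(exc)  # 'str' object has no attribute 'get'

# Decoding a second time recovers the original object,
# which is what the SQL unwrap expression does inside PostgreSQL.
repaired = json.loads(loaded_bad)
print(repaired == routing)  # True
```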
Migration 0077 UniqueViolation — Test Wallet Addresses Shared Across Users
Major
What Happened
Migration 0077_cast_test_wallets attempted to add a unique key on users.primary_wallet. The migration failed with UniqueViolation because multiple test user rows shared the same wallet address. The migration had not pre-cleaned duplicates.
Root Cause
Migration did not include an UPDATE users SET primary_wallet = NULL WHERE ... step to clear duplicate wallet assignments before enforcing the constraint.
Fix
Added UPDATE users SET primary_wallet = NULL WHERE primary_wallet IN (SELECT primary_wallet FROM users GROUP BY primary_wallet HAVING COUNT(*) > 1) as the first step in the migration. Ran successfully.
Lesson
When adding unique constraints, always add a data cleanup step first that resolves existing violations. Never assume the data is clean.
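The clean-then-constrain order can be rehearsed in memory. The sketch below uses SQLite from the standard library as a stand-in for PostgreSQL (an assumption; error class names differ, but both allow multiple NULLs under a unique constraint). The index name and sample wallet values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, primary_wallet TEXT)")
conn.executemany(
    "INSERT INTO users (primary_wallet) VALUES (?)",
    [("0xAAA",), ("0xAAA",), ("0xBBB",)],  # two test users share one wallet
)

def add_unique_wallet_index(connection):
    connection.execute(
        "CREATE UNIQUE INDEX uq_users_primary_wallet ON users (primary_wallet)"
    )

# Without cleanup the constraint is rejected, as in the failed migration.
try:
    add_unique_wallet_index(conn)
    constraint_failed = False
except sqlite3.IntegrityError:
    constraint_failed = True

# Step 1: null out every wallet value that appears more than once.
conn.execute(
    "UPDATE users SET primary_wallet = NULL WHERE primary_wallet IN ("
    "SELECT primary_wallet FROM users GROUP BY primary_wallet HAVING COUNT(*) > 1)"
)
# Step 2: the constraint now applies cleanly (NULLs do not collide).
add_unique_wallet_index(conn)

remaining = conn.execute("SELECT primary_wallet FROM users ORDER BY id").fetchall()
print(constraint_failed, remaining)  # True [(None,), (None,), ('0xBBB',)]
```

Note that this cleanup clears every copy of a duplicated value, which is acceptable for test wallets but would need a keep-one rule for real user data.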
Migration 0078 DuplicateObject — sa.Enum Ignored create_type=False
Major
What Happened
Migration 0078_subscriber_tables used sa.Enum(..., create_type=False) inside op.create_table(). The migration failed with DuplicateObject because the subscribertier enum already existed in the database and SQLAlchemy ignored create_type=False in this context, trying to create it again.
Root Cause
SQLAlchemy sa.Enum ignores create_type=False when used inside op.create_table(). The correct approach is postgresql.ENUM(..., create_type=False) from sqlalchemy.dialects.postgresql.
Fix
Changed to postgresql.ENUM('free','basic','pro','enterprise', name='subscribertier', create_type=False). Migration ran successfully. All 6 subscriber tables created.
Lesson
For PostgreSQL enums: always use postgresql.ENUM with create_type=False when the type may already exist. Never use sa.Enum for pre-existing PostgreSQL types.
Deploy Failures
Landing Pages Deployed with 145 Tools — Actual Count Was 389
Major
What Happened
During a version update pass, agent copied the tool count (145) from existing landing page HTML instead of querying the live MCP endpoint, then deployed. The actual tool count was 389. An incorrect worker count (57 instead of the actual 46) was deployed in the same pass.
Root Cause
Agent used stale source-of-truth (existing HTML) instead of live endpoint verification. The real source of truth for MCP tool count is curl https://mcp.zenpower.at/tools | jq '.tools | length'.
Fix
Queried live MCP endpoint, confirmed 389 tools. Updated landing pages to match. Committed corrected counts. Worker count corrected to 46.
Lesson
Stats in documentation must come from live endpoints, not from existing text. Always verify current state before publishing numbers.
WebControl "Not Deployed" — Worker Used Wrong Port (8080 vs 8383)
Major
What Happened
Ecosystem tests showed "WebControl not deployed". Agent assumed this was expected behavior and did not investigate. Root cause: control/worker.py had hardcoded port=8080 and health endpoint /api/webcontrol/health. The actual WebControl API runs on 8383 with health at /api/webcontrol/healthz. All control worker health checks were failing silently.
Root Cause
Hardcoded port and path guessed rather than verified against actual container configuration. Combined with the "optional services" anti-pattern (dismissing failures without investigation).
Fix
Changed port=8080 to port=8383. Changed health endpoint to /api/webcontrol/healthz. WebControl status restored.
Lesson
Use docker inspect to confirm actual bound ports before hardcoding. Never treat "not deployed" errors as expected without investigation.
2,785 Root-Owned Files Left in /opt/zenpower After Root Operations
Major
What Happened
During session 23, root operations created or modified 2,785 files in /opt/zenpower that ended up owned by root. Services running as zenpower user could not write to these files. The failure was not caught because the agent claimed "done" without running the ecosystem test suite.
Root Cause
fix-permissions.sh was not run after root operations. The script sets correct ownership for all files while preserving postgres volume ownership (which chown -R would break).
Fix
Ran bash /opt/zenpower/scripts/fix-permissions.sh. All 2,785 files corrected atomically without touching postgres data directories.
Lesson
After ANY root operations: always run fix-permissions.sh. NEVER use chown -R zenpower:zenpower directly — it will break postgres volume ownership.
Alpine localhost Resolves to IPv6 [::1] — Next.js Health Checks Failed
Resolved
What Happened
Next.js app running in Alpine Linux container failed Docker health checks. curl http://localhost:3000/healthz returned "Connection refused". Debugging revealed Alpine maps localhost to IPv6 [::1] in /etc/hosts. Next.js was only listening on IPv4 0.0.0.0, not IPv6.
Root Cause
Alpine Linux /etc/hosts maps localhost → ::1 (IPv6). Most Linux distros map it to 127.0.0.1. This is a known Alpine divergence.
Fix
Changed all health check commands and internal URLs from localhost to 127.0.0.1 explicitly. Health checks began passing immediately.
Lesson
In Alpine-based containers: always use 127.0.0.1 explicitly. Never use the hostname localhost in health checks or internal service URLs.
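A probe that names the IPv4 loopback address directly avoids depending on how the container resolves localhost. A minimal sketch with hypothetical helper names; it checks TCP reachability only, not the HTTP response of /healthz.

```python
import socket

def port_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy(port):
    """Probe the IPv4 loopback address instead of resolving 'localhost'."""
    return port_open("127.0.0.1", port)
```

For the Next.js container described above, healthy(3000) would reach the IPv4 listener even when localhost maps to ::1.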
2. Root Cause Categories
High-level taxonomy of failure modes observed across 110+ production sessions.
Auth Failures
- Missing DATABASE_URL → wrong default password
- SameSite=Strict breaks cross-subdomain sessions
- Secrets committed via --no-verify bypass
- API keys exposed in bash export commands
Docker Disasters
- docker restart does not re-read env files
- Manual docker run bypasses Traefik network
- Wrong docker context (rootless vs default)
- LangFuse v3 requires ClickHouse (not Postgres)
- Alpine localhost → IPv6 [::1] not 127.0.0.1
Database Corruption
- json.dumps() double-encoding JSON columns
- sa.Enum ignores create_type=False in op.create_table
- Unique constraint added without pre-cleaning duplicates
- alembic stamp marks migrations as applied without running their SQL
Agent Hallucinations
- Fabricated file paths, line numbers, variable names
- Claimed completion without running builds
- Dismissed test failures as "optional" without proof
- Reported intent as outcome
Config Drift
- resolv.conf pointing to inactive resolver
- Dual secrets files diverging between updates
- COMPOSE_PROFILES missing required services
- Port constants hardcoded and not verified
Permission Escalation
- Root operations left 2,785 wrong-owner files
- Manual chown -R broke postgres volume ownership
- fix-permissions.sh not run after root edits
- Read-only volume mounts blocked write features
Disk Disasters
- Unbounded log growth (133GB log filled disk)
- Bitcoin blockchain sync consumed 290GB overnight
- No logrotate config for long-running Node services
- No disk check before storage-intensive installs
Deploy Failures
- Stale stats copied to landing pages
- Wrong port hardcoded in control worker (8080 vs 8383)
- rsync --delete with wrong exclude pattern
- Claimed done without ecosystem test verification
3. Forensic Methodology
How to investigate a production incident on the Zenpower platform.
Step 1 — Establish What Is Broken
- curl the affected endpoint; note exact HTTP status
- Check docker ps for unhealthy containers
- Run ecosystem-tests/run.sh --quick for broad signal
- Read service logs: docker logs [svc] --since 10m
Step 2 — Verify State Before Claiming
- Check whoami — never assume user context
- Run df -h — never assume disk space
- Check docker context ls — never assume socket
- ls -la target dirs — never assume file state
Step 3 — Grep Before Acting on Agent Findings
- Grep the codebase for the reported variable name
- Grep for the reported file path and line content
- If grep returns zero results → agent hallucinated
- Never initiate a hotfix based on unverified agent finding
Step 4 — Check Logs for Root Cause
- "password authentication failed" → wrong DATABASE_URL
- "container breakout detected" → Redis data dir missing
- "UniqueViolation" → data pre-cleaning needed
- "DuplicateObject" → postgresql.ENUM needed
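The signature-to-cause mapping in Step 4 lends itself to a small lookup. A sketch only: the dictionary entries come from the list above, while the names LOG_SIGNATURES and diagnose are hypothetical.

```python
# Signatures and causes are taken from the Step 4 list.
LOG_SIGNATURES = {
    "password authentication failed": "wrong or missing DATABASE_URL",
    "container breakout detected": "Redis data directory missing",
    "UniqueViolation": "existing rows need cleanup before the constraint",
    "DuplicateObject": "use postgresql.ENUM with create_type=False",
}

def diagnose(log_text):
    """Return the likely causes for every known signature found in the log text."""
    return [cause for signature, cause in LOG_SIGNATURES.items() if signature in log_text]
```

Feeding it the output of docker logs [svc] --since 10m gives a first hypothesis to verify, not a conclusion.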
Step 5 — Test the Actual User Flow
- Test full SIWE nonce → sign → session flow
- Run scripts/test-wallet-flow.py not just /nonce
- Claiming "nonce works" is not "auth works"
- Test with the real user-facing URL, not internal port
Step 6 — Post-Fix Verification
- Run ecosystem-tests/run.sh --quick (3 min)
- All failures must be investigated — no exceptions
- Check container count vs COMPOSE_PROFILES expectation
- Run fix-permissions.sh after any root operations
4. Evidence Preservation
Why Zenpower logs everything, and what those logs contain.
BRAIN contains 696 ingested conversations, 61,151 messages, 21,226 tool calls, and 5,270 extracted decisions.
Every session is searchable. Every anti-pattern has a traceable origin. The failure log is not documentation debt — it is institutional memory that prevents repeated disasters.
LOG-1
CLAUDE_FAILURES.md — 163 anti-patterns from 110+ sessions. Every documented failure prevented at least one repeat.
Path: /opt/zenpower/docs/CLAUDE_FAILURES.md. Read this file BEFORE any infrastructure work.
LOG-2
Verification reports — After every major session, a verification report is committed to the monorepo.
Reports include ecosystem test results, container health state, and outstanding issues. These are the primary forensic artifact for session-level post-mortems.
LOG-3
Git commit history — /opt/zenpower is the canonical production repo. Every change is a commit.
Incident timestamps can be correlated with git log to find the exact commit that introduced an issue.
Use git log --since="2026-01-23 05:00" --until="2026-01-23 07:00" --oneline to bound an incident window.
LOG-4
BRAIN semantic search — All session conversations are indexed by BRAIN.
Use POST /brain/search with natural language to find past decisions, past incidents, and past agent interactions.
94% recall on decision-type queries as of 2026-02-14 backfill.
LOG-5
Agent memory system — Staff agents (nova, terra, sage, ledger, cipher, prism, axiom, herald) maintain memory tiers. Critical findings are stored with tier=core for persistence across context boundaries.
Use GET /api/v1/staff/agents/{codename}/memories to query agent memory state.
LOG-6
Disk monitoring — /opt/zenpower/scripts/disk-monitor.sh runs every 5 minutes.
Alerts at 80% usage, blocks new operations at 90%. The 133GB log file incident (2026-01-29) is why this exists.
Check journalctl -u zenpower-disk-monitor for recent alerts.
5. Lessons Learned
The meta-lessons distilled from 110+ sessions and 163 anti-patterns.
L-01
Verify before trust. Agent claims of completion are hypotheses, not evidence. Verify every claim against a live endpoint,
a filesystem path, or a test result. "Done" without evidence is anti-pattern #65.
L-02
Agents hallucinate code details. Line numbers, variable names, file paths, and header names fabricated by an LLM agent
are plausible-sounding but frequently wrong. Grep the codebase before acting on any agent finding. This is not optional — see the Cipher incident above.
L-03
compose.env overrides docker-compose.yml. Environment variables in compose.env take precedence over defaults in the compose file.
But docker restart does not re-read the env file. Always use docker compose up -d --force-recreate after env changes.
L-04
pgbouncer + asyncpg = statement cache hell. PgBouncer in transaction pooling mode is incompatible with asyncpg's prepared statement cache.
Symptoms: intermittent prepared statement does not exist errors under load. Fix: set statement_cache_size=0 in asyncpg connection args, or switch PgBouncer to session pooling mode.
L-05
The dot in .docker-data is not decorative. rsync glob patterns are case-sensitive and dot-prefix matters.
--exclude='docker-data' does NOT match .docker-data. This fact destroyed the entire production database on 2026-01-23.
Always dry-run destructive commands: rsync -n --delete.
L-06
The pre-commit secret hook is a mandatory gate, not a suggestion. Using --no-verify to bypass it caused a Grafana API key to be pushed to GitHub (2026-01-10). Every pre-commit flag is a signal to fix, not to bypass.
L-07
Read logs first. Redesign last. When wallet auth was broken, the agent proposed redesigning the entire onboarding flow.
The actual fix was one line in compose.env. The root cause was visible in 3 seconds: docker logs zenpower-api-1 | grep "authentication". Diagnose before redesigning.
L-08
Never use raw rsync to sync production. Use git push + git pull as the sync mechanism.
/opt/zenpower/scripts/safe-sync.sh exists for the rare case where a file sync is genuinely needed.
Manual rsync with --delete is banned. No exceptions.
L-09
SQLAlchemy Column(JSON) handles serialization automatically. Never pass json.dumps() to a JSON column.
Pass plain Python dicts. Double-encoding produces a string-of-JSON that makes column.get("key") raise AttributeError: 'str' object has no attribute 'get'.
Fixing this in production requires a PostgreSQL expression: (col#>>'{}')::json.
L-10
Check disk space before installing anything. Bitcoin blockchain requires 500GB+. Ethereum requires 2TB+.
The server has 290GB total.
df -h takes 0.1 seconds, so running it before every storage-intensive install costs effectively nothing.
Not running it cost an overnight system outage (2026-01-17).
L-11
Investigate every test failure. No exceptions. "Optional services" is not a valid reason for a 502 on a public URL.
When ecosystem tests fail, read the failure message, check the logs, and verify the actual endpoint.
Two public URLs returning 502 were missed by treating 8 failures as "optional" (2026-01-20).
L-12
fix-permissions.sh — not chown -R. Manual chown -R zenpower:zenpower /opt/zenpower breaks PostgreSQL volume ownership.
bash /opt/zenpower/scripts/fix-permissions.sh is the only safe way to correct ownership after root operations.
It knows which directories to exclude (postgres data, acme.json, etc.).
L-13
POST /instruct is not idempotent. Every call fires the agent, costs money, logs to the database, and sends a Discord notification.
Save the response to /tmp/{agent}_response.json immediately. Never re-POST to read a result.
Re-reading caused 4 agents to be called twice in session 17, wasting real budget.
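The save-then-read discipline can be enforced with a thin wrapper. This is a minimal sketch: instruct_once is a hypothetical name, and send is an assumed callable that performs the actual POST /instruct request and returns a JSON-serializable response.

```python
import json
import os

def instruct_once(agent, send, cache_dir="/tmp"):
    """Call send(agent) at most once; later calls read the saved response instead."""
    path = os.path.join(cache_dir, f"{agent}_response.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    response = send(agent)  # the only place the non-idempotent request is made
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(response, handle)
    return response
```

A deliberate second call to the agent then requires deleting the cached file first, which makes the cost an explicit decision.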
L-14
Context compaction kills the inbox loop. When root's context compacts, the agent_bridge inbox loop dies.
A P2 message can sit unacknowledged for 53 minutes (session 14) or 36 minutes (session 18).
Check inbox every 60 seconds or between every task step. This rule is in CLAUDE.md specifically because context compaction destroys prompt-level instructions.
L-15
Unbounded log files will fill your disk. fs.appendFileSync() in Node.js produced a 133GB log file (2026-01-29).
Every service that logs must have a logrotate config or a built-in rotation policy.
Check /etc/logrotate.d/zenpower before deploying any new service.
Source: /opt/zenpower/docs/CLAUDE_FAILURES.md (163 anti-patterns, 110+ sessions, first session 2025-11-29). Last updated 2026-02-26. This page is a curated forensic summary; the full failure log is the authoritative reference.