We’re releasing version 0.139.7, which documents one of the most significant operational failures in the Screeps GPT project’s history. This release doesn’t contain code fixes—instead, it provides comprehensive documentation of a 72+ hour bot failure that occurred between November 21-24, 2025, serving as a critical learning artifact for the project.
What Happened: A 72-Hour Silent Failure
Between November 21-24, 2025, the Screeps GPT bot experienced a complete operational failure that went undetected for over three days. During this period, the bot made zero progress despite five “successful” deployments and continuous monitoring data collection. The incident represents a catastrophic breakdown in multiple systems: deployment validation, runtime functionality, and anomaly detection.
The failure was particularly concerning because it violated the project’s core promise of autonomous operation with early failure detection (30-minute target). Instead, the bot remained completely non-functional for 72+ hours—a detection delay 144 times worse than our target.
Root Cause: Spawn Queue Activation Failure
The primary cause was a spawn queue activation failure that occurred after major architectural refactoring in PRs #1154 and #1130. These changes migrated the bot from a monolithic MainProcess to modular process architecture and reverted from task-based to role-based creep management.
The critical failure point:
1 | main.ts:loop() |
Telemetry data showed a spawn structure existed but was completely inactive: {"creeps":{"total":0},"spawns":{"total":1,"active":0}}. The bot had the infrastructure to spawn creeps but the logic to activate spawning never executed, creating a deadlock situation where zero creeps meant zero energy collection, which meant zero creeps could be spawned.
Why It Matters: Design Rationale for Post-Mortem Documentation
This release introduces docs/operations/incidents/2025-11-21-total-bot-failure.md, a comprehensive incident post-mortem that follows industry best practices for operational documentation. The decision to create formal incident documentation rather than just fixing the bug reflects several important principles:
1. Organizational Learning
The incident exposed gaps in three critical areas:
- Testing: No regression tests for spawn queue functionality after architectural changes
- Deployment: No post-deployment validation of bot functionality
- Monitoring: No alerting on zero-creep state despite collecting the relevant data
By documenting these gaps comprehensively, we create institutional knowledge that prevents repeating the same mistakes. The post-mortem captures not just what broke, but why it broke and how multiple safety systems failed simultaneously.
2. Systematic Prevention
The incident revealed that fixing the immediate spawn queue bug isn’t enough. We need systemic improvements:
Process Changes:
- Mandatory regression tests for spawn queue scenarios before merge
- Deployment validation gates that block broken deploys
- Zero-creep alerting within 30 minutes
- Architectural change review protocols for critical systems
Monitoring Improvements:
- Alert on consecutive zero-creep snapshots (threshold: 2)
- Validate bot health post-deployment (CPU usage, creep count)
- Track spawn queue activation metrics
- Monitor RCL regression as failure indicator
Code Changes:
- Emergency spawn protection with minimal body fallback
- Comprehensive spawn queue logging for debugging
- Health status validation (creeps, CPU, spawn activity)
- Bundle validation in deployment pipeline
3. Transparency and Accountability
As an autonomous AI development project, documenting failures is as important as documenting successes. The post-mortem provides complete transparency about:
- The timeline of events (last healthy state through detection)
- Contributing factors (deployment gaps, monitoring blind spots)
- Impact assessment (operational, development, business)
- Action items with priorities and issue references
This transparency builds trust and demonstrates the project’s commitment to continuous improvement through systematic learning from failures.
Technical Deep Dive: Contributing Factors
While spawn queue activation was the primary cause, the incident severity was amplified by two critical contributing factors:
Deployment Validation Gap
Issue #1273 identified a significant gap in the deployment script: no post-deployment health checks. During the 72-hour failure period, five deployments were executed:
- Nov 21, 02:17 UTC → Deployment #337
- Nov 23, 19:00 UTC → Deployment #338
- Nov 23, 20:10 UTC → Deployment #339
- Nov 23, 22:43 UTC → Deployment #340
- Nov 24, 00:03 UTC → Deployment #341
All five deployments succeeded according to the workflow, despite the bot being 100% non-functional. The deployment script validated that:
- Code compiled successfully
- Bundle was uploaded to Screeps servers
- Upload API returned success
But it never validated that the bot actually worked after deployment. This allowed broken code to reach production repeatedly, creating a false sense of security that everything was operational.
Why this matters: In a continuous deployment environment, deployment success should mean “the system is working” not “the upload didn’t error.” The gap between these two definitions created a dangerous blind spot where broken deployments looked successful.
Monitoring Blind Spot
The monitoring system collected bot snapshots every 30 minutes throughout the entire failure period. Four consecutive snapshots showed:
- Creep count: 0
- Spawn count: 1 (inactive)
- Energy capacity: 0/0
- RCL: Regressed from 4 to 2
Yet the monitoring system flagged the bot as “operational” because it was checking API availability, not bot functionality. The bot could respond to health checks while being completely unable to perform its actual purpose.
The critical insight: Operational health means the system is fulfilling its purpose, not just responding to pings. A web server that returns 200 OK but serves blank pages isn’t healthy. Similarly, a Screeps bot that responds to API calls but spawns zero creeps isn’t operational.
Impact Assessment
Operational Impact
- 72+ hours of zero functionality during a critical growth phase
- RCL regression from 4 to 2 due to controller downgrade penalties
- Complete loss of workforce requiring full bootstrap recovery
- Energy reserves lost through storage reset on respawn
Development Impact
- Reduced deployment confidence in the autonomous pipeline
- Exposed testing gaps in architectural change validation
- Identified monitoring blind spots that masked critical failures
Business Impact for Autonomous AI Projects
- Violated autonomous operation promise that was core to the project’s value proposition
- Detection delay 144x worse than target (72 hours vs 30-minute target)
- Manual intervention required contradicting the autonomous development model
The Path Forward: Prevention and Recovery
The post-mortem identifies three tiers of action items:
Immediate Actions (1 week)
- Issue #1294: Fix spawn queue activation (P0)
- Issue #1295: Add zero-creep detection and alerting (P0)
- Issue #1297: Implement post-deployment health checks (P0)
- Issue #1298: Add emergency spawn resilience (P0)
Short-term Actions (1 month)
- Issue #1273: Add bundle validation to deploy script (P1)
- Add regression tests for spawn queue scenarios (P1)
- Document emergency recovery procedures (P1)
- Review all recent architectural changes for similar risks (P1)
Long-term Actions (3 months)
- Implement automated rollback on deployment failure (P2)
- Add comprehensive E2E tests for critical paths (P2)
- Establish performance baselines for anomaly detection (P2)
- Create runbooks for common failure scenarios (P2)
What We Learned
What Went Well
Despite the severity of the failure, several systems functioned correctly:
- Bot snapshots successfully collected throughout outage
- Infrastructure remained operational (CI/CD, monitoring workflows)
- Strategic planning agent eventually detected the failure
- Complete telemetry data available for post-mortem analysis
What Went Wrong
- No regression tests validated spawn queue after architectural changes
- No deployment validation caught the broken bot before production
- No alerting detected zero-creep state despite collecting the data
- No automated recovery triggered despite clear failure signals
Surprises
The most surprising findings:
- Bot could deploy “successfully” while completely broken
- Monitoring could report “operational” with zero functionality
- Failure persisted through five deployments without detection
- Three days passed before the strategic planning agent flagged an anomaly
Implications for AI-Driven Development
This incident has profound implications for AI-driven development workflows. In traditional development, human oversight catches obvious failures like “the bot spawned zero creeps for three days.” But in autonomous systems, we can’t rely on human observation—the systems themselves must be self-aware.
The fix requires more than just better monitoring. We need systems that understand the difference between “operational” (API responding) and “functional” (achieving intended purpose). This requires:
- Purpose-aware health checks that validate goal achievement, not just system availability
- Behavioral baselines that detect anomalies in bot behavior patterns
- Autonomous recovery mechanisms that don’t require human intervention
- Comprehensive telemetry with actionable alerting thresholds
Conclusion
Version 0.139.7 doesn’t ship code fixes—it ships knowledge. The incident post-mortem at docs/operations/incidents/2025-11-21-total-bot-failure.md serves as a comprehensive artifact documenting what went wrong, why it went wrong, and how we’re preventing similar failures in the future.
This documentation is valuable not just for the Screeps GPT project, but for any team building autonomous systems. The lessons about deployment validation, monitoring effectiveness, and the gap between “operational” and “functional” apply broadly to AI-driven development workflows.
The incident was severe, but the learning opportunity it provides is invaluable. By documenting failures with the same rigor as successes, we build a knowledge base that makes the entire autonomous development ecosystem more resilient.
References
- Post-mortem Document:
docs/operations/incidents/2025-11-21-total-bot-failure.md - PR #1301: “Create incident post-mortem for November 21-24 bot failure”
- Issue #1294: Bot resurrection task
- Issue #1295: Zero-creep monitoring
- Issue #1297: Post-deployment validation
- Issue #1298: Emergency spawn resilience
- Issue #1273: Bundle validation
For complete details and the full timeline, see the incident post-mortem document and the CHANGELOG.