# Strategy Testing and Validation

This document describes methodologies for testing AI strategy changes, validating behavior, and measuring effectiveness before deployment.

## Overview

Testing Screeps AI behavior requires a combination of unit tests, simulation tests, and real-world validation. This document outlines best practices for ensuring strategy changes improve performance without introducing regressions.

## Testing Hierarchy
### Level 1: Unit Tests
Purpose: Verify individual components work correctly in isolation
Location: tests/unit/
Tools:
- Vitest test framework
- Mock Screeps globals
- Isolated component testing
Example Test Structure:

```typescript
// tests/unit/behaviorController.test.ts
```
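The test file itself is not reproduced here. As an illustration, the sketch below shows the kind of pure decision logic a unit test at this level isolates, with the Screeps creep reduced to a mocked store; the `decideTask` function and `Task` names are hypothetical, not this codebase's API.

```typescript
// Hypothetical sketch: a pure decision function plus a minimal mock of the
// Screeps store interface, so every branch can be covered without a server.
type Task = "HARVEST" | "DELIVER";

interface MockStore {
  getFreeCapacity(): number;
  getUsedCapacity(): number;
}

// Pure function: trivial to drive to 100% branch coverage in isolation.
function decideTask(current: Task, store: MockStore): Task {
  if (current === "HARVEST" && store.getFreeCapacity() === 0) return "DELIVER";
  if (current === "DELIVER" && store.getUsedCapacity() === 0) return "HARVEST";
  return current;
}

// Mocked "full" creep store, as a unit test would construct it.
const fullStore: MockStore = {
  getFreeCapacity: () => 0,
  getUsedCapacity: () => 50,
};
const next = decideTask("HARVEST", fullStore);
```

Because the function takes plain data in and returns plain data out, the edge cases in the coverage requirements below reduce to table-driven assertions.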
Coverage Requirements:
- Critical decision logic: 100% coverage
- Edge cases: 90%+ coverage
- Happy paths: 100% coverage
Run Command:

```sh
bun run test:unit
```
### Level 2: End-to-End Tests
Purpose: Verify kernel orchestration and component integration
Location: tests/e2e/
Tools:
- Vitest test framework
- Mock entire game environment
- Multi-tick simulation
Example Test Structure:

```typescript
// tests/e2e/kernel.test.ts
```
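As a self-contained illustration of multi-tick kernel testing, the sketch below runs registered processes against a mocked world state over several ticks; the `Kernel` and `World` shapes are hypothetical stand-ins for this project's actual kernel.

```typescript
// Hypothetical sketch: a kernel that runs every registered process once per
// simulated tick, so tests can assert on state after N ticks.
interface World {
  tick: number;
  energy: number;
}

type Process = (world: World) => void;

class Kernel {
  private processes: Process[] = [];

  register(p: Process): void {
    this.processes.push(p);
  }

  // One full tick: every registered process runs exactly once.
  runTick(world: World): void {
    for (const p of this.processes) p(world);
    world.tick += 1;
  }
}

const world: World = { tick: 0, energy: 0 };
const kernel = new Kernel();
kernel.register((w) => { w.energy += 10; }); // stand-in for a harvesting process

// Multi-tick simulation: assert on the state after N ticks, not after one.
for (let i = 0; i < 5; i++) kernel.runTick(world);
```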
Coverage Requirements:
- Full tick execution: 100% coverage
- Multi-tick scenarios: Key scenarios tested
- State transitions: All transitions verified
Run Command:

```sh
bun run test:e2e
```
### Level 3: Mockup Tests
Purpose: High-fidelity simulation using Screeps server mockup
Location: tests/mockup/
Tools:
- screeps-server-mockup (actual Screeps engine)
- Real game rules and mechanics
- Private server environment
Example Test Structure:

```typescript
// tests/mockup/economy.test.ts
```
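The shape of an economy milestone test can be sketched as follows. The real suite would start screeps-server-mockup and drive its tick loop; here a fake server stands in so the example is self-contained, and the 2-energy-per-tick yield is an arbitrary assumption.

```typescript
// Self-contained stand-in for the mockup-server tick loop. A real test would
// boot screeps-server-mockup and advance the actual engine; the assertion
// pattern (run long, then check a milestone) is the same.
interface FakeServer {
  tick(): void;
  storedEnergy(): number;
}

function makeFakeServer(): FakeServer {
  let energy = 0;
  return {
    tick: () => { energy += 2; }, // pretend the bot nets 2 energy per tick
    storedEnergy: () => energy,
  };
}

// Economy milestone: after a 1000-tick simulation, the bot should have
// banked at least a threshold amount of energy.
const server = makeFakeServer();
for (let t = 0; t < 1000; t++) server.tick();
const milestoneReached = server.storedEnergy() >= 1000;
```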
Coverage Requirements:
- Core gameplay loops: Major milestones tested
- RCL progression: Each level validated
- Failure recovery: Respawn and disaster scenarios
Run Command:

```sh
bun run test:mockup
```
### Level 4: Regression Tests
Purpose: Prevent previously fixed bugs from reoccurring
Location: tests/regression/
Structure:
- One test file per bug
- References issue number
- Includes reproduction case
Example Test Structure:

```typescript
// tests/regression/issue-123-harvester-stuck.test.ts
```
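A reproduction case for a stuck-harvester bug might center on a stuck detector like the sketch below; the `StuckDetector` class is hypothetical and only illustrates the "one file, one reproduction" structure described above.

```typescript
// Hypothetical reproduction helper: flags a creep as stuck when it reports
// the same position for a threshold number of consecutive ticks.
interface Pos {
  x: number;
  y: number;
}

class StuckDetector {
  private last: Pos | null = null;
  private stationaryTicks = 0;

  constructor(private threshold: number) {}

  // Call once per tick with the creep's position; true means "stuck".
  update(pos: Pos): boolean {
    if (this.last && this.last.x === pos.x && this.last.y === pos.y) {
      this.stationaryTicks += 1;
    } else {
      this.stationaryTicks = 0;
    }
    this.last = { ...pos };
    return this.stationaryTicks >= this.threshold;
  }
}

// Reproduction: a creep that never moves must be flagged within the threshold.
const detector = new StuckDetector(3);
let flagged = false;
for (let t = 0; t < 5; t++) flagged = detector.update({ x: 10, y: 20 });
```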
Run Command:

```sh
bun run test:regression
```
## Strategy Validation Methodology

### Pre-Deployment Validation

#### Step 1: Define Success Criteria
Before implementing any strategy change, document:
- Goal: What are you trying to improve?
- Metrics: How will you measure success?
- Baseline: What is the current performance?
- Target: What is the desired performance?
Example:

```text
Goal: Improve harvester efficiency by reducing travel time
```
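A fully filled-in criteria block might look like the following; every figure here is hypothetical and only illustrates the Goal/Metrics/Baseline/Target structure.

```text
Goal: Improve harvester efficiency by reducing travel time
Metrics: Average energy delivered per tick; average round-trip ticks per haul
Baseline: 8.2 energy/tick, 14 ticks per round trip (hypothetical)
Target: 9.0+ energy/tick, <=12 ticks per round trip, no CPU regression (hypothetical)
```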
#### Step 2: Implement and Test

- Write unit tests for new logic
- Run the full test suite (`npm test`) and ensure all tests pass
- Review test coverage: `bun run test:coverage`
#### Step 3: Simulate Performance
- Create mockup test for strategy
- Run 1000+ tick simulation
- Collect performance metrics
- Compare against baseline
#### Step 4: Code Review
- Check CPU impact (profile new code)
- Review memory usage changes
- Verify no breaking changes
- Validate edge case handling
### Post-Deployment Validation

#### Step 1: Monitor Initial Performance (First 1000 ticks)
Watch for:
- CPU bucket trends
- Spawn throughput
- Creep population stability
- Controller upgrade rate
Console Monitoring:

```javascript
// Track key metrics
```
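A tracking snippet along these lines can report the watched metrics each tick. In the real console it would read the global `Game` object; here a minimal mock interface stands in so the sketch runs anywhere, and the field names mirror the Screeps API (`Game.time`, `Game.cpu`, `Game.creeps`).

```typescript
// Sketch of per-tick metric tracking with Game reduced to a mockable shape.
interface GameLike {
  time: number;
  cpu: { getUsed(): number; bucket: number };
  creeps: Record<string, unknown>;
}

function trackMetrics(game: GameLike): string {
  const creepCount = Object.keys(game.creeps).length;
  const cpuUsed = game.cpu.getUsed();
  return (
    `tick=${game.time} cpu=${cpuUsed.toFixed(2)} ` +
    `bucket=${game.cpu.bucket} creeps=${creepCount}`
  );
}

// Mocked snapshot of one tick; in-game you would pass the real Game object.
const line = trackMetrics({
  time: 12345,
  cpu: { getUsed: () => 7.5, bucket: 9500 },
  creeps: { h1: {}, h2: {}, u1: {} },
});
```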
#### Step 2: Compare Against Baseline (After 5000 ticks)
Collect metrics:
- Average CPU/tick
- Average energy/tick
- RCL progression rate
- Bucket stability
#### Step 3: Identify Regressions
Check for:
- CPU increase >10%
- Energy decrease >10%
- Spawning delays
- Stuck creeps
- Memory leaks
#### Step 4: Rollback if Necessary
If regressions detected:
- Document failure mode
- Revert to previous version
- Create regression test
- Fix and re-test
## Behavioral Validation Checklist

### Task Switching Validation
Harvester Role:
- Transitions HARVEST → DELIVER when full
- Transitions DELIVER → HARVEST when empty
- Transitions DELIVER → UPGRADE when no targets
- Transitions UPGRADE → HARVEST when empty
- Never gets stuck in invalid state
Upgrader Role:
- Transitions RECHARGE → UPGRADE when full
- Transitions UPGRADE → RECHARGE when empty
- Never idles with empty energy
- Never idles with full energy
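The harvester transitions above can be encoded as an explicit transition table, which makes each checklist item a one-line assertion; the function below is a hypothetical sketch of that encoding (the upgrader table is analogous), with a default branch covering the "never gets stuck in invalid state" item.

```typescript
// Hypothetical transition table mirroring the harvester checklist.
type State = "HARVEST" | "DELIVER" | "UPGRADE" | "RECHARGE";

interface CreepSnapshot {
  full: boolean;
  empty: boolean;
  hasDeliveryTarget: boolean;
}

function nextHarvesterState(state: State, c: CreepSnapshot): State {
  switch (state) {
    case "HARVEST":
      return c.full ? "DELIVER" : "HARVEST";
    case "DELIVER":
      if (c.empty) return "HARVEST";
      return c.hasDeliveryTarget ? "DELIVER" : "UPGRADE";
    case "UPGRADE":
      return c.empty ? "HARVEST" : "UPGRADE";
    default:
      return "HARVEST"; // recover from any invalid state
  }
}

// Checklist item: DELIVER -> UPGRADE when no delivery targets remain.
const s = nextHarvesterState("DELIVER", {
  full: true,
  empty: false,
  hasDeliveryTarget: false,
});
```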
### Spawn Logic Validation
- Spawns harvesters when below minimum
- Spawns upgraders when below minimum
- Respects energy availability
- Handles busy spawns gracefully
- Generates unique creep names
- Initializes memory correctly
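Two of the items above, unique names and memory initialization, can be sketched as below. The helpers are hypothetical; the naming scheme assumes at most one creep of a given role spawns per tick, since it derives uniqueness from the game time.

```typescript
// Hypothetical sketch: role + Game.time yields a collision-free name as long
// as at most one creep per role spawns in a single tick (an assumption).
interface CreepMemoryInit {
  role: string;
  task: string;
}

function creepName(role: string, gameTime: number): string {
  return `${role}-${gameTime}`;
}

// Fresh creeps start with a known-valid task so they never begin in an
// undefined state.
function initialMemory(role: string): CreepMemoryInit {
  return { role, task: "HARVEST" };
}

const name = creepName("harvester", 12345);
const mem = initialMemory("harvester");
```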
### Pathfinding Validation
- Finds valid paths to targets
- Reuses paths for configured ticks
- Handles blocked paths gracefully
- Respects range parameters
- Doesn’t recalculate unnecessarily
### Memory Validation
- Prunes dead creep memories
- Updates role counts accurately
- Persists critical state
- Doesn’t leak memory
- Recovers from corruption
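The first item, pruning dead creep memories, is conventionally a per-tick sweep comparing `Memory.creeps` against `Game.creeps`. The sketch below takes both as plain objects so it is testable outside the game; only the pattern, not this codebase's exact implementation, is shown.

```typescript
// Sketch of dead-memory pruning: remove every memory entry whose creep no
// longer exists, returning how many entries were pruned.
function pruneDeadCreepMemory(
  creepMemory: Record<string, unknown>,
  livingCreeps: Record<string, unknown>,
): number {
  let pruned = 0;
  for (const name of Object.keys(creepMemory)) {
    if (!(name in livingCreeps)) {
      delete creepMemory[name]; // creep died; its memory must not leak
      pruned += 1;
    }
  }
  return pruned;
}

// "ghost" has memory but no living creep, so it must be swept.
const memory: Record<string, unknown> = { h1: {}, h2: {}, ghost: {} };
const living: Record<string, unknown> = { h1: {}, h2: {} };
const removed = pruneDeadCreepMemory(memory, living);
```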
## Performance Benchmarking

### Benchmark Collection

Manual Benchmarking (in console):

```javascript
// Run for 100 ticks, collect metrics
```
Automated Benchmarking (in tests):

```typescript
function benchmark(strategy: () => void, ticks: number) {
  // ...
}
```
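One way to flesh out such a benchmark harness is sketched below. It assumes the strategy is a synchronous per-tick function, and wall-clock `Date.now()` stands in for Screeps CPU accounting (`Game.cpu.getUsed()`) so the sketch runs outside the game.

```typescript
// Sketch of an automated benchmark harness (assumptions: synchronous
// strategies, wall-clock timing instead of Screeps CPU accounting).
interface BenchmarkResult {
  ticks: number;
  totalMs: number;
  avgMsPerTick: number;
  maxMsPerTick: number;
}

function benchmark(strategy: () => void, ticks: number): BenchmarkResult {
  let totalMs = 0;
  let maxMsPerTick = 0;
  for (let t = 0; t < ticks; t++) {
    const start = Date.now();
    strategy(); // one simulated tick of the AI under test
    const elapsed = Date.now() - start;
    totalMs += elapsed;
    if (elapsed > maxMsPerTick) maxMsPerTick = elapsed;
  }
  return { ticks, totalMs, avgMsPerTick: totalMs / ticks, maxMsPerTick };
}

// Compare candidate strategies by benchmarking each over the same tick count.
const result = benchmark(() => { /* per-tick logic under test */ }, 100);
```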
### Key Performance Indicators (KPIs)
Efficiency Metrics:
- Energy per tick (higher = better)
- CPU per creep (lower = better)
- Spawn uptime % (higher = better)
- Idle time % (lower = better)
Stability Metrics:
- CPU bucket trend (stable = better)
- Memory size (stable = better)
- Population variance (lower = better)
- Error rate (lower = better)
Progress Metrics:
- RCL progression rate (higher = better)
- GCL progression rate (higher = better)
- Room expansion rate (context-dependent)
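The efficiency KPIs above can be computed from sampled per-tick data along these lines; the `TickSample` shape is a hypothetical aggregation format, not a structure this codebase necessarily uses.

```typescript
// Sketch of KPI aggregation over a window of per-tick samples.
interface TickSample {
  energyHarvested: number;
  cpuUsed: number;
  creepCount: number;
}

interface Kpis {
  energyPerTick: number; // higher = better
  cpuPerCreep: number;   // lower = better (avg CPU per creep-tick)
}

function computeKpis(samples: TickSample[]): Kpis {
  const ticks = samples.length;
  const energy = samples.reduce((sum, s) => sum + s.energyHarvested, 0);
  const cpu = samples.reduce((sum, s) => sum + s.cpuUsed, 0);
  const creepTicks = samples.reduce((sum, s) => sum + s.creepCount, 0);
  return {
    energyPerTick: ticks > 0 ? energy / ticks : 0,
    cpuPerCreep: creepTicks > 0 ? cpu / creepTicks : 0,
  };
}

const kpis = computeKpis([
  { energyHarvested: 10, cpuUsed: 4, creepCount: 4 },
  { energyHarvested: 14, cpuUsed: 4, creepCount: 4 },
]);
```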
## A/B Testing Strategies

### Parallel Testing (Private Server)
Setup:
- Deploy baseline strategy to Bot A
- Deploy new strategy to Bot B
- Run in identical rooms
- Compare metrics after N ticks
Comparison Points:
- RCL progression (time to level up)
- Final creep count
- Average CPU usage
- Resource efficiency
### Sequential Testing (Live Server)
Setup:
- Collect baseline metrics (1000+ ticks)
- Deploy new strategy
- Collect new metrics (1000+ ticks)
- Compare normalized results
Normalization Required:
- Account for RCL differences
- Normalize for room conditions
- Control for external factors (attacks, etc.)
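One simple form of RCL normalization is to divide each measured metric by an expected-at-RCL baseline, so runs at different controller levels become comparable; the baseline table below is entirely hypothetical.

```typescript
// Sketch of RCL normalization for sequential A/B comparisons. The baseline
// values are made-up placeholders; a real table would come from measurement.
const energyPerTickBaselineByRcl: Record<number, number> = {
  1: 2,
  2: 5,
  3: 10,
  4: 15,
};

function normalizedScore(measured: number, rcl: number): number {
  const baseline = energyPerTickBaselineByRcl[rcl];
  if (baseline === undefined) throw new Error(`no baseline for RCL ${rcl}`);
  return measured / baseline;
}

// A run at RCL 2 producing 6 energy/tick outperforms its baseline (1.2x),
// while a run at RCL 3 producing 9 underperforms (0.9x).
const before = normalizedScore(6, 2);
const after = normalizedScore(9, 3);
```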
## Continuous Integration Testing

### Pre-Merge Validation

GitHub Actions Guard Workflows (`.github/workflows/guard-*.yml`):

Individual guard workflows validate different aspects of PRs:

```yaml
# guard-test-unit.yml
```
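A guard workflow of this shape might look like the following sketch. The job name, triggers, and checkout/setup steps are assumptions for illustration; only the `bun run test:unit` command comes from this document.

```yaml
# guard-test-unit.yml (sketch; step details are assumptions)
name: Guard / Unit Tests
on:
  pull_request:

jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bun run test:unit
```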
Quality Gates:
Results are aggregated by quality-gate-summary.yml:
- All tests must pass (unit, e2e, regression)
- Build succeeds
- No linting errors
- Code is properly formatted
- YAML syntax is valid
### Post-Merge Validation

Deployment Pipeline (`.github/workflows/deploy.yml`):
- Build passes
- Tests pass
- Deploy to private server (optional)
- Monitor for regressions
- Deploy to live server
## Testing Best Practices
DO:
- ✓ Write tests before fixing bugs
- ✓ Test edge cases and failure modes
- ✓ Mock external dependencies
- ✓ Use descriptive test names
- ✓ Keep tests focused and isolated
- ✓ Maintain >85% code coverage
DON’T:
- ✗ Skip tests for “simple” changes
- ✗ Test implementation details
- ✗ Write tests that depend on execution order
- ✗ Use real Screeps server for unit tests
- ✗ Commit failing tests
- ✗ Remove tests without understanding impact
MONITOR:
- ⚠ Test execution time (keep fast)
- ⚠ Flaky tests (fix or remove)
- ⚠ Coverage trends (prevent decay)
- ⚠ Test complexity (keep simple)
## Related Documentation
- Safe Refactoring - How to modify code safely
- Improvement Metrics - Measuring strategy effectiveness
- Creep Roles - Expected behavior for validation
- Performance Monitoring - Runtime metrics collection