Rollback System¶
The rollback system ensures that failed deployments don't impact service availability by automatically reverting to the last known good deployment.
How It Works¶
Detection¶
The system monitors deployment health through:
- Build Exit Codes: Non-zero exit codes trigger rollback
- Health Checks: Failed health endpoints after deployment
- Error Rate Monitoring: Spike in 5xx errors
- Manual Triggers: Operator-initiated rollback
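As an illustration of the health-check path, a post-deploy step can probe the deployed site and exit non-zero on failure, which marks the deploy job as failed and in turn triggers the rollback job shown later on this page. This is a minimal sketch: the hostname (the project's default `*.pages.dev` domain) and the retry/timeout values are assumptions, not values taken from this repository.

```yaml
# Sketch: post-deploy health check step inside the deploy job.
# The URL and retry/timeout values are illustrative assumptions.
- name: Post-deploy health check
  run: |
    # Retry a few times to let the CDN cutover settle, then fail the job
    # (non-zero exit) if the endpoint never returns a 2xx response.
    for i in 1 2 3 4 5; do
      if curl --fail --silent --max-time 10 "https://ai-agent-swarm-showcase.pages.dev/" > /dev/null; then
        echo "Health check passed"
        exit 0
      fi
      echo "Attempt $i failed, retrying in 10s..."
      sleep 10
    done
    echo "Health check failed after 5 attempts"
    exit 1
```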
Execution¶
Rollback process:
```mermaid
graph LR
    A[Deploy Fails] --> B[Trigger Rollback]
    B --> C[Fetch Last Good Deployment]
    C --> D[Restore Previous State]
    D --> E[Verify Health]
    E --> F[Update Status]
```
Verification¶
After rollback:
- Health checks confirm service availability
- Logs are captured for post-mortem
- Alerts notify operations team
- Incident report is generated
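The first two steps (health confirmation and log capture) can be expressed as additional steps in the rollback job. The sketch below assumes the project's default `*.pages.dev` hostname and a `logs/` directory; alerting and incident reporting would typically go through whatever notification tooling the team already uses.

```yaml
# Sketch: verification steps appended to the rollback job.
# Hostname, log path, and artifact name are illustrative assumptions.
- name: Verify service health after rollback
  run: curl --fail --silent --max-time 10 "https://ai-agent-swarm-showcase.pages.dev/" > /dev/null

- name: Capture logs for post-mortem
  if: always()   # keep logs even if the verification step above fails
  uses: actions/upload-artifact@v4
  with:
    name: rollback-logs
    path: logs/
```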
Rollback Types¶
Automatic Rollback¶
Triggered when:
- Deployment job fails in GitHub Actions
- Health checks fail after deployment
- Error rate exceeds threshold (5% 5xx errors)
Configuration:
```yaml
rollback:
  needs: deploy
  if: ${{ failure() }}
  runs-on: ubuntu-latest
  steps:
    - name: Rollback
      run: npx wrangler pages rollback --project-name ai-agent-swarm-showcase
```
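In this job, `needs: deploy` makes the rollback job run after the deploy job, and `if: ${{ failure() }}` restricts it to runs where a required job failed, so a successful deployment never triggers a rollback.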
Manual Rollback¶
Initiated by operators when:
- Bugs discovered post-deployment
- Performance degradation observed
- Security vulnerability identified
Methods:
- GitHub Actions UI: Navigate to workflow, click "Re-run jobs" > "Run rollback job"
- Wrangler CLI: `wrangler pages rollback --project-name <project>`
- Cloudflare Dashboard: Deployments > History > "Rollback" button
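For teams that prefer to run the CLI path from CI rather than a local machine, a manually triggered workflow is another option. The sketch below is an assumption about how such a workflow could look, not part of this repository; the secret names are assumptions, and `wrangler pages deployment list` is included only so the operator can see recent deployments before rolling back.

```yaml
# Sketch: operator-initiated rollback via workflow_dispatch.
# Secret names and the workflow itself are assumptions for illustration.
on:
  workflow_dispatch:

jobs:
  manual-rollback:
    runs-on: ubuntu-latest
    env:
      CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
      CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
    steps:
      - name: List recent deployments (for the operator's reference)
        run: npx wrangler pages deployment list --project-name ai-agent-swarm-showcase
      - name: Rollback
        run: npx wrangler pages rollback --project-name ai-agent-swarm-showcase
```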
Gradual Rollback¶
For canary deployments:
- Gradually shift traffic back to previous version
- Monitor metrics during transition
- Complete rollback if issues persist
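Traffic shifting is platform-specific and not something the rollback job above handles on its own, so the sketch below only outlines the shape of a gradual rollback. The `scripts/set-previous-version-weight.sh` helper, the step sizes, and the monitoring interval are all hypothetical.

```yaml
# Sketch only: gradually shift traffic back to the previous version.
# scripts/set-previous-version-weight.sh is a hypothetical helper that would
# call whatever traffic-splitting mechanism the platform provides.
- name: Shift traffic back in steps
  run: |
    for weight in 25 50 75 100; do
      ./scripts/set-previous-version-weight.sh "$weight"
      echo "Previous version now receiving ${weight}% of traffic; monitoring for 5 minutes"
      sleep 300
      # A real implementation would check error-rate metrics here and
      # pause or alert if they have not recovered before shifting further.
    done
```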
Rollback History¶
Track all rollback events:
| Timestamp | Deployment ID | Reason | Duration | Status |
|---|---|---|---|---|
| 2025-11-02 10:15 | d7a3f2b | Build failure | 45s | Success |
| 2025-11-01 14:30 | 9c2e1a4 | Health check fail | 38s | Success |
| 2025-10-31 09:45 | 6f8b3d1 | Manual trigger | 52s | Success |
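One lightweight way to keep this history current is to record each rollback automatically at the end of the rollback job, for example by opening a GitHub issue. The sketch below assumes the `incident` label already exists; the title and body format are also assumptions.

```yaml
# Sketch: record each rollback event so the history table stays current.
# Issue title/body format and the "incident" label are assumptions.
- name: Record rollback event
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    gh issue create \
      --repo "${{ github.repository }}" \
      --title "Rollback on $(date -u '+%Y-%m-%d %H:%M') UTC" \
      --body "Workflow run: ${{ github.run_id }}, commit: ${{ github.sha }}" \
      --label incident
```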
Safety Guarantees¶
Data Integrity¶
- Stateless deployments: No data loss during rollback
- Immutable builds: Previous versions always available
- Audit trail: Full history of all deployments and rollbacks
Zero-Downtime¶
- Instant cutover: DNS/CDN updates propagate in seconds
- Active connections: Existing sessions remain active
- Graceful degradation: Fallback to cached content if needed
Recovery Time¶
- Target RTO: < 2 minutes
- Average RTO: 47 seconds
- Max observed RTO: 1 minute 23 seconds
Post-Rollback Actions¶
After rollback completion:
- Incident Review: Analyze root cause
- Fix Forward: Prepare corrected deployment
- Testing: Validate fix in staging
- Re-deploy: Deploy corrected version
- Documentation: Update runbooks
Limitations¶
Be aware of:
- State reversion: Application state reverts to previous version
- Database migrations: May need manual rollback
- External dependencies: Third-party integrations may be affected
- User sessions: Active sessions might be disrupted
Best Practices¶
- Test rollback procedure regularly: Don't wait for a production incident (a scheduled drill sketch follows this list)
- Monitor rollback metrics: Track success rate and duration
- Document rollback triggers: Clear criteria for when to rollback
- Automate where possible: Reduce human error and response time
- Communicate status: Keep stakeholders informed during rollback
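As one way to exercise the first practice above, a scheduled workflow can run a rollback drill against a non-production project. Everything in this sketch is an assumption: the staging project name and hostname, the schedule, and the secret names.

```yaml
# Sketch: monthly rollback drill against a staging Pages project.
# Project name, hostname, schedule, and secrets are assumptions.
name: Rollback drill
on:
  schedule:
    - cron: "0 6 1 * *"   # 06:00 UTC on the first day of each month

jobs:
  drill:
    runs-on: ubuntu-latest
    env:
      CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
      CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
    steps:
      - name: Rollback staging deployment
        run: npx wrangler pages rollback --project-name ai-agent-swarm-showcase-staging
      - name: Verify staging health
        run: curl --fail --silent --max-time 10 "https://ai-agent-swarm-showcase-staging.pages.dev/" > /dev/null
```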