Rollback System¶
The rollback system ensures that failed deployments don't impact service availability by automatically reverting to the last known good deployment.
How It Works¶
Detection¶
The system monitors deployment health through:
- Build Exit Codes: Non-zero exit codes trigger rollback
- Health Checks: Failed health endpoints after deployment
- Error Rate Monitoring: Spike in 5xx errors
- Manual Triggers: Operator-initiated rollback
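As an illustration of the health-check path, a post-deploy step can probe the deployed site and exit non-zero on failure, which marks the deploy job as failed and in turn triggers the rollback job shown later on this page. This is a minimal sketch: the hostname (the project's default `*.pages.dev` domain) and the retry/timeout values are assumptions, not values taken from this repository.

```yaml
# Sketch: post-deploy health check step inside the deploy job.
# The URL and retry/timeout values are illustrative assumptions.
- name: Post-deploy health check
  run: |
    # Retry a few times to let the CDN cutover settle, then fail the job
    # (non-zero exit) if the endpoint never returns a 2xx response.
    for i in 1 2 3 4 5; do
      if curl --fail --silent --max-time 10 "https://ai-agent-swarm-showcase.pages.dev/" > /dev/null; then
        echo "Health check passed"
        exit 0
      fi
      echo "Attempt $i failed, retrying in 10s..."
      sleep 10
    done
    echo "Health check failed after 5 attempts"
    exit 1
```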
Execution¶
Rollback process:
```mermaid
graph LR
    A[Deploy Fails] --> B[Trigger Rollback]
    B --> C[Fetch Last Good Deployment]
    C --> D[Restore Previous State]
    D --> E[Verify Health]
    E --> F[Update Status]
```
Verification¶
After rollback:
- Health checks confirm service availability
- Logs are captured for post-mortem
- Alerts notify operations team
- Incident report is generated
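The first two steps (health confirmation and log capture) can be expressed as additional steps in the rollback job. The sketch below assumes the project's default `*.pages.dev` hostname and a `logs/` directory; alerting and incident reporting would typically go through whatever notification tooling the team already uses.

```yaml
# Sketch: verification steps appended to the rollback job.
# Hostname, log path, and artifact name are illustrative assumptions.
- name: Verify service health after rollback
  run: curl --fail --silent --max-time 10 "https://ai-agent-swarm-showcase.pages.dev/" > /dev/null

- name: Capture logs for post-mortem
  if: always()   # keep logs even if the verification step above fails
  uses: actions/upload-artifact@v4
  with:
    name: rollback-logs
    path: logs/
```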
Rollback Types¶
Automatic Rollback¶
Triggered when:
- Deployment job fails in GitHub Actions
- Health checks fail after deployment
- Error rate exceeds threshold (5% 5xx errors)
Configuration:
```yaml
rollback:
  needs: deploy
  if: ${{ failure() }}
  runs-on: ubuntu-latest
  steps:
    - name: Rollback
      run: npx wrangler pages rollback --project-name ai-agent-swarm-showcase
```
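In this job, `needs: deploy` makes the rollback job run after the deploy job, and `if: ${{ failure() }}` restricts it to runs where a required job failed, so a successful deployment never triggers a rollback.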
Manual Rollback¶
Initiated by operators when:
- Bugs discovered post-deployment
- Performance degradation observed
- Security vulnerability identified
Methods:
- GitHub Actions UI: Navigate to workflow, click "Re-run jobs" > "Run rollback job"
- Wrangler CLI: `wrangler pages rollback --project-name <project>`
- Cloudflare Dashboard: Deployments > History > "Rollback" button
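For teams that prefer to run the CLI path from CI rather than a local machine, a manually triggered workflow is another option. The sketch below is an assumption about how such a workflow could look, not part of this repository; the secret names are assumptions, and `wrangler pages deployment list` is included only so the operator can see recent deployments before rolling back.

```yaml
# Sketch: operator-initiated rollback via workflow_dispatch.
# Secret names and the workflow itself are assumptions for illustration.
on:
  workflow_dispatch:

jobs:
  manual-rollback:
    runs-on: ubuntu-latest
    env:
      CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
      CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
    steps:
      - name: List recent deployments (for the operator's reference)
        run: npx wrangler pages deployment list --project-name ai-agent-swarm-showcase
      - name: Rollback
        run: npx wrangler pages rollback --project-name ai-agent-swarm-showcase
```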
Gradual Rollback¶
For canary deployments:
- Gradually shift traffic back to previous version
- Monitor metrics during transition
- Complete rollback if issues persist
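Traffic shifting is platform-specific and not something the rollback job above handles on its own, so the sketch below only outlines the shape of a gradual rollback. The `scripts/set-previous-version-weight.sh` helper, the step sizes, and the monitoring interval are all hypothetical.

```yaml
# Sketch only: gradually shift traffic back to the previous version.
# scripts/set-previous-version-weight.sh is a hypothetical helper that would
# call whatever traffic-splitting mechanism the platform provides.
- name: Shift traffic back in steps
  run: |
    for weight in 25 50 75 100; do
      ./scripts/set-previous-version-weight.sh "$weight"
      echo "Previous version now receiving ${weight}% of traffic; monitoring for 5 minutes"
      sleep 300
      # A real implementation would check error-rate metrics here and
      # pause or alert if they have not recovered before shifting further.
    done
```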
Rollback History¶
Track all rollback events:
| Timestamp | Deployment ID | Reason | Duration | Status |
|---|---|---|---|---|
| 2025-11-02 10:15 | d7a3f2b | Build failure | 45s | Success |
| 2025-11-01 14:30 | 9c2e1a4 | Health check fail | 38s | Success |
| 2025-10-31 09:45 | 6f8b3d1 | Manual trigger | 52s | Success |
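One lightweight way to keep this history current is to record each rollback automatically at the end of the rollback job, for example by opening a GitHub issue. The sketch below assumes the `incident` label already exists; the title and body format are also assumptions.

```yaml
# Sketch: record each rollback event so the history table stays current.
# Issue title/body format and the "incident" label are assumptions.
- name: Record rollback event
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    gh issue create \
      --repo "${{ github.repository }}" \
      --title "Rollback on $(date -u '+%Y-%m-%d %H:%M') UTC" \
      --body "Workflow run: ${{ github.run_id }}, commit: ${{ github.sha }}" \
      --label incident
```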
Safety Guarantees¶
Data Integrity¶
- Stateless deployments: No data loss during rollback
- Immutable builds: Previous versions always available
- Audit trail: Full history of all deployments and rollbacks
Zero-Downtime¶
- Instant cutover: DNS/CDN updates propagate in seconds
- Active connections: Existing sessions remain active
- Graceful degradation: Fallback to cached content if needed
Recovery Time¶
- Target RTO: < 2 minutes
- Average RTO: 47 seconds
- Max observed RTO: 1 minute 23 seconds
Post-Rollback Actions¶
After rollback completion:
- Incident Review: Analyze root cause
- Fix Forward: Prepare corrected deployment
- Testing: Validate fix in staging
- Re-deploy: Deploy corrected version
- Documentation: Update runbooks
Limitations¶
Be aware of:
- State reversion: Application state reverts to previous version
- Database migrations: May need manual rollback
- External dependencies: Third-party integrations may be affected
- User sessions: Active sessions might be disrupted
Best Practices¶
- Test rollback procedure regularly: Don't wait for a production incident (a scheduled drill sketch follows this list)
- Monitor rollback metrics: Track success rate and duration
- Document rollback triggers: Clear criteria for when to rollback
- Automate where possible: Reduce human error and response time
- Communicate status: Keep stakeholders informed during rollback
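As one way to exercise the first practice above, a scheduled workflow can run a rollback drill against a non-production project. Everything in this sketch is an assumption: the staging project name and hostname, the schedule, and the secret names.

```yaml
# Sketch: monthly rollback drill against a staging Pages project.
# Project name, hostname, schedule, and secrets are assumptions.
name: Rollback drill
on:
  schedule:
    - cron: "0 6 1 * *"   # 06:00 UTC on the first day of each month

jobs:
  drill:
    runs-on: ubuntu-latest
    env:
      CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
      CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
    steps:
      - name: Rollback staging deployment
        run: npx wrangler pages rollback --project-name ai-agent-swarm-showcase-staging
      - name: Verify staging health
        run: curl --fail --silent --max-time 10 "https://ai-agent-swarm-showcase-staging.pages.dev/" > /dev/null
```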