Rollback System

The rollback system ensures that failed deployments don't impact service availability by automatically reverting to the last known good deployment.

How It Works

Detection

The system monitors deployment health through:

  1. Build Exit Codes: A non-zero exit code from the build triggers rollback
  2. Health Checks: Health endpoints that fail after deployment trigger rollback
  3. Error Rate Monitoring: A spike in 5xx errors above the 5% threshold triggers rollback
  4. Manual Triggers: An operator initiates rollback
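
The four signals above can be folded into a single decision function. The following is a minimal sketch, not the system's actual implementation: the `DeploymentSignals` type, its field names, and the order in which signals are checked are assumptions made for illustration. The 5% threshold is the one quoted under Automatic Rollback.

```python
from dataclasses import dataclass
from typing import Optional

# 5% of requests returning 5xx, as quoted in this document.
ERROR_RATE_THRESHOLD = 0.05


@dataclass
class DeploymentSignals:
    """Hypothetical snapshot of the signals gathered for one deployment."""
    build_exit_code: int
    health_check_passed: bool
    total_requests: int
    server_errors: int  # number of 5xx responses
    manual_trigger: bool = False


def rollback_reason(signals: DeploymentSignals) -> Optional[str]:
    """Return why a rollback should start, or None if the deployment looks healthy."""
    if signals.manual_trigger:
        return "Manual trigger"
    if signals.build_exit_code != 0:
        return "Build failure"
    if not signals.health_check_passed:
        return "Health check fail"
    # Skip the rate check when there is no traffic yet, to avoid dividing by zero.
    if signals.total_requests > 0:
        error_rate = signals.server_errors / signals.total_requests
        if error_rate > ERROR_RATE_THRESHOLD:
            return "Error rate spike"
    return None
```

Returning a reason string rather than a boolean lets the same value be recorded in the rollback history.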

Execution

Rollback process:

graph LR
    A[Deploy Fails] --> B[Trigger Rollback]
    B --> C[Fetch Last Good Deployment]
    C --> D[Restore Previous State]
    D --> E[Verify Health]
    E --> F[Update Status]
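
The flow in the diagram can be sketched in a few lines. This is an illustrative outline under stated assumptions: the `Deployment` type, the newest-first ordering of the history, the status strings, and the `restore` and `verify` callables are all hypothetical stand-ins for whatever the platform provides.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one deployment."""
    id: str
    healthy: bool  # whether this deployment was known to be good while live


def last_good_deployment(history: List[Deployment], failed_id: str) -> Optional[Deployment]:
    """Return the most recent healthy deployment; history is assumed newest-first."""
    for deployment in history:
        if deployment.id != failed_id and deployment.healthy:
            return deployment
    return None


def run_rollback(
    history: List[Deployment],
    failed_id: str,
    restore: Callable[[Deployment], None],
    verify: Callable[[], bool],
) -> str:
    """Fetch the last good deployment, restore it, verify health, report a status."""
    target = last_good_deployment(history, failed_id)
    if target is None:
        return "no-good-deployment"
    restore(target)
    return "rolled-back" if verify() else "rollback-unhealthy"
```

Passing `restore` and `verify` in as callables keeps the sketch independent of any particular hosting API and makes the flow easy to exercise in a test.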

Verification

After rollback:

  • Health checks confirm service availability
  • Logs are captured for post-mortem
  • Alerts notify operations team
  • Incident report is generated
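
The health check step might look like the following sketch. It assumes a health endpoint that answers HTTP 200 when the service is up; the retry count, the delay, and the `probe` callable are illustrative choices, not values taken from this system.

```python
import time
from typing import Callable


def verify_health(
    probe: Callable[[], int],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as one probe reports HTTP 200, False after all attempts fail."""
    for attempt in range(attempts):
        try:
            if probe() == 200:
                return True
        except OSError:
            # Connection failures (including urllib's URLError) count as a failed probe.
            pass
        if attempt < attempts - 1:
            sleep(delay_s)  # give the restored deployment time to come up
    return False
```

In practice `probe` could wrap something like `urllib.request.urlopen(url, timeout=5).status` against the service's health URL. The `sleep` parameter exists so tests can skip the real delay.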

Rollback Types

Automatic Rollback

Triggered when:

  • Deployment job fails in GitHub Actions
  • Health checks fail after deployment
  • 5xx error rate exceeds the 5% threshold

Configuration:

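rollback:
  # Wait for the deploy job to finish before evaluating the condition
  needs: deploy
  # Run only when an earlier job in the dependency chain has failed
  if: ${{ failure() }}
  runs-on: ubuntu-latest
  steps:
    - name: Rollback
      # Check `npx wrangler pages --help` to confirm this subcommand in your Wrangler version
      run: npx wrangler pages rollback --project-name ai-agent-swarm-showcase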

Manual Rollback

Initiated by operators when:

  • Bugs discovered post-deployment
  • Performance degradation observed
  • Security vulnerability identified

Methods:

  1. GitHub Actions UI: Navigate to workflow, click "Re-run jobs" > "Run rollback job"
  2. Wrangler CLI: wrangler pages rollback --project-name <project>
  3. Cloudflare Dashboard: Deployments > History > "Rollback" button

Gradual Rollback

For canary deployments:

  • Gradually shift traffic back to previous version
  • Monitor metrics during transition
  • Complete rollback if issues persist
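
One way to read the three steps above is as a loop over traffic weights. The sketch below is an interpretation, not a documented algorithm: the step sizes, the `set_previous_weight` and `metrics_ok` callables, and the choice to jump straight to 100% when metrics stay unhealthy are all assumptions.

```python
from typing import Callable, List, Sequence


def gradual_rollback(
    set_previous_weight: Callable[[int], None],
    metrics_ok: Callable[[], bool],
    steps: Sequence[int] = (25, 50, 75, 100),
) -> List[int]:
    """Shift traffic to the previous version step by step.

    `steps` are percentages of traffic sent to the previous version and are
    assumed to end at 100. Returns the weights that were actually applied.
    """
    applied: List[int] = []
    for weight in steps:
        set_previous_weight(weight)
        applied.append(weight)
        # If issues persist after a partial shift, complete the rollback now.
        if weight < 100 and not metrics_ok():
            set_previous_weight(100)
            applied.append(100)
            break
    return applied
```

With healthy metrics the shift proceeds through every step; with persistently bad metrics it completes after the first step.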

Rollback History

Track all rollback events:

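| Timestamp        | Deployment ID | Reason            | Duration | Status  |
|------------------|---------------|-------------------|----------|---------|
| 2025-11-02 10:15 | d7a3f2b       | Build failure     | 45s      | Success |
| 2025-11-01 14:30 | 9c2e1a4       | Health check fail | 38s      | Success |
| 2025-10-31 09:45 | 6f8b3d1       | Manual trigger    | 52s      | Success |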
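
Keeping these events as structured records makes it straightforward to track success rate and duration over time. The sketch below loads the three events listed above and summarizes them; the `RollbackEvent` type and the `summarize` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RollbackEvent:
    timestamp: str
    deployment_id: str
    reason: str
    duration_s: int
    status: str


# The three events from the history listed above.
EVENTS = [
    RollbackEvent("2025-11-02 10:15", "d7a3f2b", "Build failure", 45, "Success"),
    RollbackEvent("2025-11-01 14:30", "9c2e1a4", "Health check fail", 38, "Success"),
    RollbackEvent("2025-10-31 09:45", "6f8b3d1", "Manual trigger", 52, "Success"),
]


def summarize(events: List[RollbackEvent]) -> Dict[str, float]:
    """Compute count, success rate, and duration statistics for rollback events."""
    if not events:
        raise ValueError("at least one rollback event is required")
    durations = [event.duration_s for event in events]
    successes = sum(event.status == "Success" for event in events)
    return {
        "count": len(events),
        "success_rate": successes / len(events),
        "mean_duration_s": mean(durations),
        "max_duration_s": max(durations),
    }
```

For these three events the mean duration is 45 seconds and the success rate is 1.0. A longer history would give more representative figures.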

Safety Guarantees

Data Integrity

  • Stateless deployments: No data loss during rollback
  • Immutable builds: Previous versions always available
  • Audit trail: Full history of all deployments and rollbacks

Zero-Downtime

  • Instant cutover: DNS/CDN updates propagate in seconds
  • Active connections: Existing sessions remain active
  • Graceful degradation: Fallback to cached content if needed

Recovery Time

  • Target RTO (recovery time objective): < 2 minutes
  • Average observed recovery time: 47 seconds
  • Maximum observed recovery time: 1 minute 23 seconds

Post-Rollback Actions

After rollback completion:

  1. Incident Review: Analyze root cause
  2. Fix Forward: Prepare corrected deployment
  3. Testing: Validate fix in staging
  4. Re-deploy: Deploy corrected version
  5. Documentation: Update runbooks

Limitations

Be aware of:

  • State reversion: Application state reverts to previous version
  • Database migrations: May need manual rollback
  • External dependencies: Third-party integrations may be affected
  • User sessions: Existing connections stay open, but sessions that depend on the rolled-back version might be disrupted

Best Practices

  1. Test the rollback procedure regularly: Don't wait for a production incident
  2. Monitor rollback metrics: Track success rate and duration
  3. Document rollback triggers: Set clear criteria for when to roll back
  4. Automate where possible: Reduce human error and response time
  5. Communicate status: Keep stakeholders informed during rollback