Backup and Disaster Recovery

Firefly's Backup and Disaster Recovery (DR) capabilities help you safeguard your cloud infrastructure, recover quickly from failures, and prevent misconfigurations that could lead to outages. This guide explains how to use Firefly to mitigate, diagnose, and recover from infrastructure failures, and how to prevent them proactively.

Overview

Disaster recovery (DR) is the process of restoring your cloud environment to a healthy state after an incident such as accidental deletion, misconfiguration, or infrastructure failure. Firefly offers:

  • Rapid recovery tools for deleted or misconfigured assets.

  • Comprehensive mutation logs and clickops events for root cause analysis.

  • Proactive notifications and insights to prevent disasters.

  • Automated backups and configuration history (coming soon).

Recovering from Infrastructure Failure

When an infrastructure failure occurs, Firefly provides tools to help you diagnose, resolve, and recover. The recovery process depends on whether you know which asset caused the failure.

Recovering Deleted Assets: When the Responsible Asset is Known

If you know which asset was deleted or misconfigured (e.g., a team member accidentally deleted a resource), you can restore it using Firefly's codification and GitOps integration.

Procedure:

  1. Click on Inventory > Deleted.

    • This view lists all assets that have been deleted from your environment.

  2. Filter by Time Range.

    • Use the filter to narrow down the list to the relevant time period.

  3. Select the Deleted Asset and Codify.

    • Click on the asset that was deleted. Use the Codify action to generate the Infrastructure-as-Code (IaC) template for the asset.

  4. Create a Pull Request.

    • Firefly will prompt you to select the appropriate repository and branch for your GitOps workflow. Submit a pull request to restore the asset via code.

  5. Review and Merge.

    • Once reviewed and merged, your CI/CD pipeline will recreate the asset in your cloud environment.

Tip: This process ensures that the restored asset is managed by code, reducing the risk of future drift or manual misconfiguration.
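
The time-range filter in step 2 can be pictured as a simple window over deletion timestamps. The sketch below is illustrative only: the record fields (name, type, deleted_at) and the sample data are assumptions made for this example, not Firefly's actual data model or export format.

```python
from datetime import datetime, timezone

# Hypothetical records: these field names and values are assumptions for
# illustration, not Firefly's actual schema.
deleted_assets = [
    {"name": "orders-db", "type": "aws_db_instance", "deleted_at": "2024-05-01T09:12:00+00:00"},
    {"name": "tmp-bucket", "type": "aws_s3_bucket", "deleted_at": "2024-04-20T17:40:00+00:00"},
    {"name": "api-lb", "type": "aws_lb", "deleted_at": "2024-05-01T09:45:00+00:00"},
]


def deleted_between(assets, start, end):
    """Return assets whose deletion time falls in [start, end], oldest first."""
    in_window = [
        asset for asset in assets
        if start <= datetime.fromisoformat(asset["deleted_at"]) <= end
    ]
    return sorted(in_window, key=lambda asset: datetime.fromisoformat(asset["deleted_at"]))


# Narrow the list to the hour in which the incident was reported.
incident_start = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
incident_end = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
candidates = deleted_between(deleted_assets, incident_start, incident_end)
print([asset["name"] for asset in candidates])
```

With the sample data, only orders-db and api-lb fall inside the incident window, so those are the assets to inspect and codify first.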

Viewing Mutations & ClickOps Events: When the Responsible Asset is Unknown

If you do not know which asset caused the failure, use Firefly's mutation tracking to investigate recent changes and identify the root cause.

Procedure:

  1. Click on Event Center.

    • View all mutations and clickops events in your environment.

  2. Apply Filters.

    • Filter assets by data source, environment, account, and location to narrow your search.

  3. Review the Mutation Log or ClickOps Event.

    • Click an asset name to go to its asset page, then open its mutation log to see a timeline of configuration changes.

  4. Codify Revision.

    • For any suspicious or recent change, select the revision date and use Codify Revision to generate the IaC template for that point in time.

  5. Revert via Pull Request.

    • Restore the asset to a previous configuration by submitting a pull request.

Tip: Mutation logs and clickops events provide a detailed audit trail, including who made each change and what was modified, making root cause analysis straightforward.
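
To make the idea of comparing revisions concrete, the sketch below diffs two configuration snapshots and reports which settings changed. It is a generic illustration, not how Firefly computes its mutation log: the function name diff_revisions and the sample revision data are hypothetical.

```python
def diff_revisions(before, after, prefix=""):
    """Return {dotted.path: (old, new)} for every value that differs.

    Keys missing on one side are reported with None on that side.
    """
    changes = {}
    for key in sorted(set(before) | set(after)):
        path = f"{prefix}{key}"
        old, new = before.get(key), after.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested settings so the path pinpoints the change.
            changes.update(diff_revisions(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes


# Hypothetical snapshots of the same asset at two revision dates.
revision_before = {
    "instance_class": "db.m5.large",
    "multi_az": True,
    "tags": {"env": "prod", "owner": "data"},
}
revision_after = {
    "instance_class": "db.m5.large",
    "multi_az": False,
    "tags": {"env": "prod"},
}

changes = diff_revisions(revision_before, revision_after)
for path, (old, new) in changes.items():
    print(f"{path}: {old!r} -> {new!r}")
```

Here the diff surfaces two changes, multi_az flipping from True to False and the tags.owner tag being removed, which is the kind of detail that helps decide which revision to codify and revert to.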

Preventing Misconfiguration and Reliability Risks

Proactive prevention is key to avoiding disasters. Firefly enables you to set up notifications and subscribe to insights that alert you to risky configurations or changes.

Receiving Notifications on Asset Changes

Stay informed about changes in your infrastructure by subscribing to notifications. These alerts help you:

  • Detect new drifts and clickops events.

  • Monitor IaC deployment failures.

  • Get alerts on policy and guardrail violations.

How to Subscribe:

  • Go to Settings > Notifications in Firefly.

  • Choose your preferred notification channels (Slack, Teams, email, etc.).

  • Select which events or asset changes should trigger notifications (e.g., deletions, drifts, policy violations).

Tip: Fine-tune your notification settings to avoid alert fatigue and focus on critical events.
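
One way to reason about alert fatigue is to treat notification settings as a filter: keep only the event types you subscribed to, and suppress repeats for the same asset. The sketch below illustrates that idea under stated assumptions; the event shape, the type names, and the select_alerts function are hypothetical and do not reflect Firefly's notification engine.

```python
# Hypothetical event types chosen for illustration only.
SUBSCRIBED_TYPES = {"deletion", "policy_violation"}


def select_alerts(events, subscribed_types, seen=None):
    """Keep subscribed event types and drop repeats of the same (type, asset) pair."""
    seen = set() if seen is None else seen
    alerts = []
    for event in events:
        key = (event["type"], event["asset"])
        if event["type"] in subscribed_types and key not in seen:
            seen.add(key)
            alerts.append(event)
    return alerts


# Hypothetical event stream: one unsubscribed drift and one duplicate deletion.
events = [
    {"type": "drift", "asset": "web-sg"},
    {"type": "deletion", "asset": "orders-db"},
    {"type": "deletion", "asset": "orders-db"},
    {"type": "policy_violation", "asset": "api-lb"},
]

alerts = select_alerts(events, SUBSCRIBED_TYPES)
print([(alert["type"], alert["asset"]) for alert in alerts])
```

Of the four sample events, only two produce alerts: the drift is not in the subscribed set, and the second orders-db deletion is a repeat.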

For more information, see the Notifications guide.

Subscribing to Policy Checks for Reliability and Misconfiguration Prevention

Firefly Governance contains policy-driven checks that highlight risky configurations. Subscribing to these policies helps you proactively address issues before they lead to outages.

Five Example Policies to Reduce Disaster Risk:

  1. Reliability: K8s Deployments running containers without a configured CPU limit

    • A container with no CPU limit can consume all spare CPU on its node, starving neighboring workloads and leading to outages.

  2. Reliability: AWS Database Instances in a Single Availability Zone

    • Databases in one zone are vulnerable to zone failures. Multi-AZ deployment is recommended for resilience.

  3. Reliability: AWS RDS Instance Without Deletion Protection

    • Without deletion protection, accidental or automated deletions can cause permanent data loss.

  4. Reliability: AWS DynamoDB Tables Without Point-in-Time Recovery

    • Enable point-in-time recovery to restore a table to any point within the recovery window (up to 35 days) and protect against accidental writes or deletes.

  5. Misconfiguration: AWS ELB/LB Without Access Logs Enabled

    • Access logs are essential for troubleshooting, monitoring, and security analysis. Enable logging to maintain visibility.
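
To show what checks like the first three policies look for, here is a minimal sketch that inspects plain dictionaries. It is not Firefly's policy engine. It assumes a Kubernetes Deployment manifest already loaded as a dict, and an RDS instance description shaped like one entry of the DBInstances list in the AWS DescribeDBInstances response; the function names and sample data are hypothetical.

```python
def containers_without_cpu_limit(deployment):
    """Return names of containers in a Deployment manifest that have no CPU limit."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for container in pod_spec.get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        if "cpu" not in limits:
            missing.append(container["name"])
    return missing


def rds_reliability_findings(instance):
    """Flag single-AZ placement and disabled deletion protection for one RDS instance."""
    findings = []
    if not instance.get("MultiAZ", False):
        findings.append("single availability zone")
    if not instance.get("DeletionProtection", False):
        findings.append("deletion protection disabled")
    return findings


# Hypothetical sample inputs.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "checkout"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
        {"name": "sidecar", "resources": {"requests": {"cpu": "100m"}}},
        {"name": "metrics"},
    ]}}},
}
db_instance = {"DBInstanceIdentifier": "orders-db", "MultiAZ": False, "DeletionProtection": True}

print(containers_without_cpu_limit(deployment))
print(rds_reliability_findings(db_instance))
```

In this sample, the sidecar container sets only a CPU request and the metrics container sets no resources at all, so both are flagged, while the database is flagged only for running in a single availability zone. The sketch ignores init containers and any namespace-level defaults, which a real policy check may also need to consider.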

How to Subscribe:

  • Go to Settings > Governance in Firefly.

  • Subscribe to the above policies and configure notification preferences.

Tip: Regularly review policy recommendations and remediate flagged issues to maintain a resilient infrastructure.

Summary

Firefly's backup and disaster recovery features empower you to:

  • Rapidly recover from accidental deletions or misconfigurations.

  • Investigate and revert problematic changes.

  • Proactively prevent outages with real-time notifications and policy-driven insights.

By integrating these tools and practices into your operations, you can ensure your cloud environment remains resilient, auditable, and secure.
