Backup and Disaster Recovery
Firefly's Backup and Disaster Recovery (DR) capabilities provide robust tools to safeguard your cloud infrastructure, enabling you to recover quickly from failures and prevent misconfigurations that could lead to outages. This guide details how to use Firefly to mitigate, diagnose, and recover from infrastructure failures, as well as how to proactively prevent them.
Overview
Disaster recovery (DR) is the process of restoring your cloud environment to a healthy state after an incident such as accidental deletion, misconfiguration, or infrastructure failure. Firefly offers:
Rapid recovery tools for deleted or misconfigured assets.
Comprehensive mutation logs and ClickOps events for root-cause analysis.
Proactive notifications and insights to prevent disasters.
Automated backups and configuration history (coming soon).
Recovering from Infrastructure Failure
When an infrastructure failure occurs, Firefly provides tools to help you diagnose, resolve, and recover. The recovery process depends on whether you know which asset caused the failure.
Recovering Deleted Assets: When the Responsible Asset is Known
If you know which asset was deleted or misconfigured (e.g., a team member accidentally deleted a resource), you can restore it using Firefly's codification and GitOps integration.
Procedure:
Click on Inventory > Deleted.
This view lists all assets that have been deleted from your environment.
Filter by Time Range.
Use the filter to narrow down the list to the relevant time period.
Select the Deleted Asset and Codify.
Click on the asset that was deleted. Use the Codify action to generate the Infrastructure-as-Code (IaC) template for the asset.
Create a Pull Request.
Firefly will prompt you to select the appropriate repository and branch for your GitOps workflow. Submit a pull request to restore the asset via code.
Review and Merge.
Once reviewed and merged, your CI/CD pipeline will recreate the asset in your cloud environment.
Tip: This process ensures that the restored asset is managed by code, reducing the risk of future drift or manual misconfiguration.
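For illustration, the Codify action emits a standard IaC definition of the deleted asset that you commit through your normal pipeline. Assuming Terraform with the AWS provider, a restored S3 bucket might look roughly like this (resource names, bucket name, and tags are hypothetical):

```hcl
# Hypothetical output of the Codify action for an accidentally
# deleted S3 bucket -- all names and tags are illustrative.
resource "aws_s3_bucket" "app_artifacts" {
  bucket = "acme-app-artifacts"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Versioning is configured as a separate resource, per current
# AWS provider conventions.
resource "aws_s3_bucket_versioning" "app_artifacts" {
  bucket = aws_s3_bucket.app_artifacts.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

Once the pull request containing this template is merged, the CI/CD pipeline recreates the bucket, and the asset stays under code management going forward.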
Viewing Mutations & ClickOps Events: When the Responsible Asset is Unknown
If you do not know which asset caused the failure, use Firefly's mutation tracking to investigate recent changes and identify the root cause.
Procedure:
Click on Event Center
View all mutations and ClickOps events in your environment.
Apply Filters
Filter assets by data source, environment, account, and location to narrow your search.
Review Mutation Log or ClickOps Event
Click on an asset name to navigate to the asset page, then open its mutation log to see a timeline of configuration changes.
Codify Revision
For any suspicious or recent change, select the revision date and use Codify Revision to generate the IaC template for that point in time.
Revert via Pull Request
Restore the asset to a previous configuration by submitting a pull request.
Tip: Mutation logs and ClickOps events provide a detailed audit trail, including who made each change and what was modified, making root-cause analysis straightforward.
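As a sketch of the revert step: suppose the mutation log shows that a database security group's ingress rule was widened by a ClickOps edit. Codify Revision for the earlier revision date would yield the prior definition, which you then submit as a pull request. Assuming Terraform with the AWS provider (all names and CIDRs hypothetical):

```hcl
# Hypothetical template generated by Codify Revision for the state
# *before* the suspicious change: ingress restricted to an internal
# CIDR rather than the 0.0.0.0/0 introduced by the ClickOps edit.
resource "aws_security_group" "db_access" {
  name   = "db-access"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = ["10.20.0.0/16"] # previous, restricted value
  }
}
```

Merging this pull request drives the asset back to its last known-good configuration through the same pipeline used for any other change.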
Preventing Misconfiguration and Reliability Risks
Proactive prevention is key to avoiding disasters. Firefly enables you to set up notifications and subscribe to insights that alert you to risky configurations or changes.
Receiving Notifications on Asset Changes
Stay informed about changes in your infrastructure by subscribing to notifications. These alerts help you:
Detect new drifts and ClickOps events.
Monitor IaC deployment failures.
Get alerts on policy and guardrail violations.
How to Subscribe:
Go to Settings > Notifications in Firefly.
Choose your preferred notification channels (Slack, Teams, email, etc.).
Select which events or asset changes should trigger notifications (e.g., deletions, drifts, policy violations).
Tip: Fine-tune your notification settings to avoid alert fatigue and focus on critical events.
For more information, see the Notifications guide.
Subscribing to Policy Checks for Reliability and Misconfiguration Prevention
Firefly Governance contains policy-driven checks that highlight risky configurations. Subscribing to these policies helps you proactively address issues before they lead to outages.
Top 5 Example Policies to Reduce Disaster Risk:
Reliability: K8s Deployments running containers without a configured CPU limit
Containers without a configured CPU limit can consume excessive node resources, causing resource exhaustion that can lead to outages.
Reliability: AWS Database Instances in a Single Availability Zone
Databases in one zone are vulnerable to zone failures. Multi-AZ deployment is recommended for resilience.
Reliability: AWS RDS Instance Without Deletion Protection
Without deletion protection, accidental or automated deletions can cause permanent data loss.
Reliability: AWS DynamoDB Tables Without Point-in-Time Recovery
Enable point-in-time recovery to restore tables to any previous state and protect against data loss.
Misconfiguration: AWS ELB/LB Without Access Logs Enabled
Access logs are essential for troubleshooting, monitoring, and security analysis. Enable logging to maintain visibility.
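Several of these policies are remediated by a single IaC setting. Assuming Terraform with the AWS provider, representative fixes might look like the following sketch (resource names, sizes, and variables are hypothetical):

```hcl
# RDS: multi-AZ deployment plus deletion protection.
resource "aws_db_instance" "orders" {
  identifier          = "orders-db"
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 50
  username            = var.db_username
  password            = var.db_password
  multi_az            = true # survive a single-AZ failure
  deletion_protection = true # block accidental or automated deletion
}

# DynamoDB: point-in-time recovery for restore to any prior state.
resource "aws_dynamodb_table" "sessions" {
  name         = "sessions"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "session_id"

  attribute {
    name = "session_id"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }
}

# ALB: ship access logs to S3 for troubleshooting and audit.
resource "aws_lb" "public" {
  name               = "public-alb"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids

  access_logs {
    bucket  = var.lb_logs_bucket
    enabled = true
  }
}
```

Applying these settings in code, rather than in the console, means the policy checks stay green across redeployments instead of being fixed one asset at a time.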
How to Subscribe:
Go to Settings > Governance in Firefly.
Subscribe to the above policies and configure notification preferences.
Tip: Regularly review policy recommendations and remediate flagged issues to maintain a resilient infrastructure.
Summary
Firefly's backup and disaster recovery features empower you to:
Rapidly recover from accidental deletions or misconfigurations.
Investigate and revert problematic changes.
Proactively prevent outages with real-time notifications and policy-driven insights.
By integrating these tools and practices into your operations, you can ensure your cloud environment remains resilient, auditable, and secure.