Disaster recovery
Tools to mitigate and recover from infrastructure failure.
To diagnose, resolve, and recover from an infrastructure failure, use the tools below:
If the infrastructure failure is due to a recent change and you can identify the asset responsible, you can restore that asset using code. For example, if a team member accidentally deleted an asset, resurrect it as follows:
Select Inventory > Deleted.
From the filter, select the appropriate time range.
Select the asset that was accidentally deleted, and then select Codify.
To revive the asset, create a pull request. Make sure you selected the appropriate GitOps repository and branch.
If you can't identify the asset causing the issue, follow these steps:
Select Inventory.
Filter your assets according to data source, environment, account, and location.
From the asset flags filter, select Mutations.
Select an asset > Mutation log.
To view the code for a previous revision, select the revision date, and then select Codify Revision.
To revert your asset to that configuration, select Pull request or use the Terraform Import Commands.
Prevent disasters by subscribing to notifications that alert you to changes in the state or infrastructure of your assets. These notifications improve your visibility into single points of failure, data protection, and system operation, which is crucial for identifying issues early, before they lead to service disruptions or disaster scenarios.
To reduce the risk of service disruption or disaster, subscribe to the top five Insight notifications below:
Auto-scaling groups automatically adjust the number of instances in a group based on changing demand. They are usually configured to operate across multiple availability zones for high availability and fault tolerance: if one zone experiences issues or failures, the other zones continue to operate and maintain the desired level of service. An auto-scaling group that runs in a single availability zone therefore poses a risk, because if that zone experiences problems, the result may be service disruption or downtime with no automatic failover to other zones.
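As a rough illustration of the condition this Insight flags, the Python sketch below checks an auto-scaling group for single-zone placement. It assumes the group is a dict shaped like one entry of the `AutoScalingGroups` list returned by the AWS `DescribeAutoScalingGroups` API; the function name `asg_in_single_az` is hypothetical and does not reflect how the Insight itself is implemented.

```python
from typing import Any, Mapping


def asg_in_single_az(group: Mapping[str, Any]) -> bool:
    """Return True if the auto-scaling group spans fewer than two availability zones.

    Hypothetical helper: `group` is assumed to follow the shape of an entry
    in the AutoScalingGroups list from DescribeAutoScalingGroups.
    """
    # Deduplicate zone names; a group needs at least two zones to fail over.
    zones = set(group.get("AvailabilityZones", []))
    return len(zones) < 2


if __name__ == "__main__":
    web = {"AutoScalingGroupName": "web", "AvailabilityZones": ["us-east-1a"]}
    api = {"AutoScalingGroupName": "api", "AvailabilityZones": ["us-east-1a", "us-east-1b"]}
    print(asg_in_single_az(web))  # True: no other zone to fail over to
    print(asg_in_single_az(api))  # False
```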
When a database instance is deployed in a single availability zone, it becomes more vulnerable to potential failures or disruptions in that zone. If the availability zone experiences issues, such as hardware failures, network problems, or scheduled maintenance, the entire database instance could become unavailable until the issue is resolved.
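A similar sketch for database instances follows. It assumes the instance is a dict shaped like an entry of the `DBInstances` list from the AWS `DescribeDBInstances` API, whose boolean `MultiAZ` field indicates whether a standby exists in another zone. The helper name `rds_single_az` is hypothetical.

```python
from typing import Any, Mapping


def rds_single_az(instance: Mapping[str, Any]) -> bool:
    """Return True if the DB instance has no Multi-AZ standby.

    Hypothetical helper: `instance` is assumed to resemble an entry in the
    DBInstances list returned by DescribeDBInstances.
    """
    # A missing MultiAZ key is treated as a single-AZ deployment.
    return not instance.get("MultiAZ", False)


if __name__ == "__main__":
    print(rds_single_az({"DBInstanceIdentifier": "orders", "MultiAZ": False}))  # True
    print(rds_single_az({"DBInstanceIdentifier": "billing", "MultiAZ": True}))  # False
```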
Reliability: AWS RDS instance without deletion protection
When deletion protection is turned off, users or automated processes can delete the database instance without an additional safeguard. If the instance is deleted accidentally, the data and configuration associated with the database can be lost permanently unless a final snapshot or retained backup exists, which disrupts any applications or services that rely on the database.
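The condition this Insight describes can be sketched in Python as a filter over a list of instances. The sketch assumes each instance is a dict resembling a `DescribeDBInstances` entry, where the boolean `DeletionProtection` field reports whether the safeguard is on; `unprotected_instances` is a hypothetical name.

```python
from typing import Any, Iterable, List, Mapping


def unprotected_instances(instances: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return identifiers of DB instances with deletion protection turned off.

    Hypothetical helper; each item is assumed to resemble an entry in the
    DBInstances list returned by DescribeDBInstances.
    """
    # A missing DeletionProtection key is treated as "off".
    return [
        inst["DBInstanceIdentifier"]
        for inst in instances
        if not inst.get("DeletionProtection", False)
    ]


if __name__ == "__main__":
    fleet = [
        {"DBInstanceIdentifier": "orders", "DeletionProtection": True},
        {"DBInstanceIdentifier": "reports", "DeletionProtection": False},
    ]
    print(unprotected_instances(fleet))  # ['reports']
```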
Enabling point-in-time recovery (PITR) on DynamoDB tables is good practice, especially for critical data. It adds a layer of protection against data loss, such as accidental writes or deletes, by letting you restore the table to a previous point in time if needed.
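To make the PITR setting concrete, this sketch reads the status from a dict shaped like the response of the AWS `DescribeContinuousBackups` API, where `PointInTimeRecoveryStatus` is either `ENABLED` or `DISABLED`. The function name `pitr_enabled` is hypothetical.

```python
from typing import Any, Mapping


def pitr_enabled(response: Mapping[str, Any]) -> bool:
    """Return True if point-in-time recovery is enabled for the table.

    Hypothetical helper: `response` is assumed to follow the shape of a
    DescribeContinuousBackups reply.
    """
    description = response.get("ContinuousBackupsDescription", {})
    pitr = description.get("PointInTimeRecoveryDescription", {})
    return pitr.get("PointInTimeRecoveryStatus") == "ENABLED"


if __name__ == "__main__":
    reply = {
        "ContinuousBackupsDescription": {
            "ContinuousBackupsStatus": "ENABLED",
            "PointInTimeRecoveryDescription": {"PointInTimeRecoveryStatus": "DISABLED"},
        }
    }
    print(pitr_enabled(reply))  # False: this table would be flagged
```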
When access logging is disabled, the load balancer doesn't generate access log files, so you lack comprehensive insight into the traffic and requests it processes. Without access logs, it is difficult to troubleshoot issues, monitor traffic patterns, and conduct thorough security analysis.
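For Application Load Balancers, access logging is controlled by the `access_logs.s3.enabled` attribute, which the AWS `DescribeLoadBalancerAttributes` API returns as a list of string key/value pairs. The sketch below checks that attribute; the function name `access_logging_enabled` is hypothetical, and the input is assumed to mirror the API response shape.

```python
from typing import Any, Mapping


def access_logging_enabled(response: Mapping[str, Any]) -> bool:
    """Return True if the load balancer writes access logs to S3.

    Hypothetical helper: `response` is assumed to follow the shape of a
    DescribeLoadBalancerAttributes reply, where values are strings like "true".
    """
    # Flatten the Key/Value list into a dict for lookup.
    attributes = {a["Key"]: a["Value"] for a in response.get("Attributes", [])}
    return attributes.get("access_logs.s3.enabled", "false").lower() == "true"


if __name__ == "__main__":
    reply = {
        "Attributes": [
            {"Key": "access_logs.s3.enabled", "Value": "false"},
            {"Key": "idle_timeout.timeout_seconds", "Value": "60"},
        ]
    }
    print(access_logging_enabled(reply))  # False: no access logs are generated
```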
To mitigate an infrastructure failure, set up your notification subscriptions to Insights.
Insights are policies that improve the configuration of your assets. By subscribing to these Insights, you can proactively avoid situations that could lead to a disastrous outcome for your account.