Disaster recovery

Tools to mitigate and recover from infrastructure failure.

Recovering from infrastructure failure

To diagnose, resolve, and recover from an infrastructure failure, use the tools below:

Recovering deleted assets: When the asset responsible for infrastructure failure is known

Use this procedure when the infrastructure failure is due to a recent change in your infrastructure and you can identify which asset is responsible. For example, if a team member accidentally deleted an asset, you can resurrect it using code.

Procedure

  1. Select Inventory > Deleted.

  2. From the filter, select the appropriate time range.

  3. Select the asset that was accidentally deleted, and then select Codify.

  4. To revive your asset, create a pull request. Verify that you selected the appropriate GitOps repository and branch.

Viewing mutations: When the asset responsible for infrastructure failure is unknown

If the asset causing the issue is unknown, follow the steps below to resolve the issue:

Procedure

  1. Select Inventory.

  2. Filter your assets according to data source, environment, account, and location.

  3. From the asset flags filter, select Mutations.

  4. Select an asset > Mutation log.

  5. To view the revision code, select the revision date and Codify Revision.

  6. To revert your asset to that previous configuration, select Pull request or use the Terraform Import Commands.
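Step 6 mentions Terraform import commands. As a rough illustration of the general shape of such a command (`terraform import ADDRESS ID`), the sketch below assembles one as a shell-safe string. The resource address and ID are hypothetical placeholders, not values generated by the product, and the helper name `terraform_import_command` is an assumption made for this example.

```python
import shlex


def terraform_import_command(address: str, resource_id: str) -> str:
    """Build a `terraform import ADDRESS ID` command line with shell quoting."""
    parts = ["terraform", "import", shlex.quote(address), shlex.quote(resource_id)]
    return " ".join(parts)


# Hypothetical address and ID, for illustration only.
print(terraform_import_command("aws_s3_bucket.assets", "my-example-bucket"))
# Addresses with an index contain brackets, so they are quoted for the shell.
print(terraform_import_command("aws_instance.web[0]", "i-0123456789abcdef0"))
```

Quoting matters mainly for indexed addresses such as `aws_instance.web[0]`, which some shells would otherwise try to expand.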

Preventing misconfiguration and reliability risks

To mitigate an infrastructure failure, set up your notification subscriptions to receive notifications on changes in the status or configuration of your assets.

Receiving notifications on changes in the status or configuration of your assets

Prevent disasters by subscribing to notifications that alert you to changes in the status or configuration of your assets. These notifications increase your awareness of single points of failure, data protection, and system operation visibility, all of which are crucial for early identification of issues that may lead to service disruptions or disaster scenarios.

Insights are policies that improve the configuration of your assets. By subscribing to the following Insights, you can proactively avoid situations that may lead your account to a disastrous outcome.

To reduce the risk of service disruption or disaster, subscribe to the top five Insight notifications below:

Reliability: AWS auto-scaling groups are running with only a single availability zone

Auto-scaling groups automatically adjust the number of instances in a group based on changing demand. They are typically configured to operate across multiple availability zones to ensure high availability and fault tolerance: if one zone experiences issues or failures, the other zones continue to operate and maintain the desired level of service. An auto-scaling group that runs in only a single availability zone loses this protection. If that zone experiences problems, the result may be service disruption or downtime, with no automatic failover to other zones.
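The condition this Insight describes can be sketched as a small check. The example below assumes input shaped like the response of the AWS `DescribeAutoScalingGroups` API (an `AutoScalingGroups` list whose entries carry `AutoScalingGroupName` and `AvailabilityZones`). The sample data and the function name `single_az_groups` are hypothetical, and this is not how the product evaluates the Insight.

```python
from typing import List


def single_az_groups(response: dict) -> List[str]:
    """Return names of auto-scaling groups that span fewer than two availability zones."""
    return [
        group["AutoScalingGroupName"]
        for group in response.get("AutoScalingGroups", [])
        if len(set(group.get("AvailabilityZones", []))) < 2
    ]


# Hypothetical sample data in the shape of a DescribeAutoScalingGroups response.
asg_sample = {
    "AutoScalingGroups": [
        {"AutoScalingGroupName": "web", "AvailabilityZones": ["us-east-1a", "us-east-1b"]},
        {"AutoScalingGroupName": "batch", "AvailabilityZones": ["us-east-1a"]},
    ]
}
print(single_az_groups(asg_sample))
```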

Reliability: AWS database instances are currently deployed in only one availability zone

When a database instance is deployed in a single availability zone, it becomes more vulnerable to potential failures or disruptions in that zone. If the availability zone experiences issues, such as hardware failures, network problems, or scheduled maintenance, the entire database instance could become unavailable until the issue is resolved.
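As a minimal sketch of the same idea in code, the check below assumes input shaped like the response of the AWS RDS `DescribeDBInstances` API, where each entry in `DBInstances` has a `DBInstanceIdentifier` and a boolean `MultiAZ` field. The sample data and the function name `single_az_databases` are hypothetical. Cluster-based engines may report availability differently, so treat this as illustrative only.

```python
from typing import List


def single_az_databases(response: dict) -> List[str]:
    """Return identifiers of database instances that are not deployed as Multi-AZ."""
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        if not db.get("MultiAZ", False)
    ]


# Hypothetical sample data in the shape of a DescribeDBInstances response.
multi_az_sample = {
    "DBInstances": [
        {"DBInstanceIdentifier": "orders-db", "MultiAZ": True},
        {"DBInstanceIdentifier": "reports-db", "MultiAZ": False},
    ]
}
print(single_az_databases(multi_az_sample))
```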

Reliability: AWS RDS instance without deletion protection

When deletion protection is turned off, users or automated processes can delete the database instance without an additional safeguard. If the instance is accidentally deleted, its data and configuration may be lost permanently unless a snapshot or backup exists, which can disrupt the applications or services that rely on that database.
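A check for this condition could look like the sketch below. It assumes input shaped like the response of the AWS RDS `DescribeDBInstances` API, where each instance has a boolean `DeletionProtection` field. The sample data and the function name `unprotected_databases` are hypothetical.

```python
from typing import List


def unprotected_databases(response: dict) -> List[str]:
    """Return identifiers of database instances with deletion protection turned off."""
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        if not db.get("DeletionProtection", False)
    ]


# Hypothetical sample data in the shape of a DescribeDBInstances response.
protection_sample = {
    "DBInstances": [
        {"DBInstanceIdentifier": "orders-db", "DeletionProtection": True},
        {"DBInstanceIdentifier": "staging-db", "DeletionProtection": False},
    ]
}
print(unprotected_databases(protection_sample))
```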

Reliability: AWS DynamoDB tables without point-in-time recovery enabled

Enabling point-in-time recovery on DynamoDB tables is good practice, especially for critical data. It provides an additional layer of protection against data loss from accidental writes or deletes, because you can restore the table to a previous state within the recovery window (up to the preceding 35 days).
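To illustrate how the recovery status can be read, the sketch below assumes input shaped like the response of the AWS DynamoDB `DescribeContinuousBackups` API, where `PointInTimeRecoveryStatus` is either `ENABLED` or `DISABLED`. The sample data and the function name `pitr_enabled` are hypothetical.

```python
def pitr_enabled(response: dict) -> bool:
    """Return True if the response reports point-in-time recovery as ENABLED."""
    description = response.get("ContinuousBackupsDescription", {})
    pitr = description.get("PointInTimeRecoveryDescription", {})
    return pitr.get("PointInTimeRecoveryStatus") == "ENABLED"


# Hypothetical sample data in the shape of a DescribeContinuousBackups response.
pitr_sample = {
    "ContinuousBackupsDescription": {
        "ContinuousBackupsStatus": "ENABLED",
        "PointInTimeRecoveryDescription": {"PointInTimeRecoveryStatus": "DISABLED"},
    }
}
print(pitr_enabled(pitr_sample))
```

A missing description is treated as not enabled, which errs on the side of flagging the table.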

Misconfiguration: AWS ELB/LB without any access logs enabled

When access logging is not enabled, the load balancer does not generate log files, so you lack detailed insight into the traffic and requests it processes. Without access logs, it is difficult to troubleshoot issues, monitor traffic patterns, and conduct thorough security analysis.
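For Application and Network Load Balancers, access logging is controlled by the `access_logs.s3.enabled` attribute. The sketch below assumes input shaped like the `Attributes` list returned by the AWS ELBv2 `DescribeLoadBalancerAttributes` API (key and value pairs, with values as strings). Classic Load Balancers use a different response shape and are not covered here. The sample data and the function name `access_logs_enabled` are hypothetical.

```python
from typing import Dict, List


def access_logs_enabled(attributes: List[Dict[str, str]]) -> bool:
    """Return True if the attribute list has access_logs.s3.enabled set to "true"."""
    for attribute in attributes:
        if attribute.get("Key") == "access_logs.s3.enabled":
            return attribute.get("Value", "").lower() == "true"
    # Attribute not present: treat access logging as disabled.
    return False


# Hypothetical sample data in the shape of a DescribeLoadBalancerAttributes response.
lb_attributes_sample = [
    {"Key": "deletion_protection.enabled", "Value": "true"},
    {"Key": "access_logs.s3.enabled", "Value": "false"},
]
print(access_logs_enabled(lb_attributes_sample))
```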

To watch a visual presentation of disaster recovery, play the video below:
