Disaster Recovery

Disaster Recovery for KAIO Infrastructure.

Overview

Disaster recovery is the process of restoring systems after an event that disrupts their normal operation. Preventive measures taken in advance are essential for a fast and simple restoration. Two metrics measure the success of a disaster recovery operation:

  • Recovery Point Objective (RPO): RPO describes the maximum acceptable window of data loss, measured back in time from the moment of the disaster. For example, if the RPO is one hour, losing the data written during the hour before the disaster is considered acceptable.

  • Recovery Time Objective (RTO): RTO describes the maximum acceptable time for restoring systems. For example, if the RTO is four hours, it is considered acceptable if the systems are restored within four hours of the disaster occurring.
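
The two metrics can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the three timestamps (last recoverable point, disaster time, restoration time) are hypothetical and not part of any KAIO tooling.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recoverable, disaster, restored, rpo, rto):
    """Compare one recovery against its RPO and RTO targets.

    All parameter names are illustrative assumptions.
    """
    data_loss = disaster - last_recoverable  # window of data that was lost
    downtime = restored - disaster           # time the systems were unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical timeline: last backup at 09:30, disaster at 10:00, restored at 13:00.
result = meets_objectives(
    last_recoverable=datetime(2024, 1, 1, 9, 30),
    disaster=datetime(2024, 1, 1, 10, 0),
    restored=datetime(2024, 1, 1, 13, 0),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result["rpo_met"], result["rto_met"])
```

With the one-hour RPO and four-hour RTO used in the examples above, 30 minutes of lost data and three hours of downtime meet both objectives.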

In its current MVP form, KAIO does not have a comprehensive disaster recovery plan in place. However, one is expected to be implemented in the near future, and this page describes how it can be achieved using the current infrastructure. KAIO aims for an RPO of zero with regard to blockchain information: once a transaction is included in a block that is confirmed as added to the chain, that information will not be lost.

Backup

Backing up information and settings regularly is critical to a proper disaster recovery response. The more frequently backups are made, the less data can be lost in the event of a disaster.

AWS

KAIO already has robust data replication because each validator has its own data volume, and each of these volumes acts as a backup with zero RPO. Additionally, periodic snapshots can be taken as extra protection for the extremely unlikely scenario in which all validators lose their data. AWS supports snapshots and backups for its different products, and these can be stored across different regions so that they remain accessible if one region fails. When recovery relies on snapshots, the time between snapshots determines the RPO.
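
The link between snapshot frequency and RPO can be sketched as follows. This is a minimal illustration that assumes snapshots complete instantly and are always restorable; the function names, schedule, and timestamps are hypothetical, and no AWS API is called.

```python
from datetime import datetime, timedelta


def latest_snapshot_before(snapshots, disaster):
    """Return the most recent snapshot taken at or before the disaster, or None."""
    usable = [s for s in snapshots if s <= disaster]
    return max(usable) if usable else None


def interval_meets_rpo(interval, rpo):
    """A disaster just before the next snapshot loses almost one full
    interval of data, so the interval itself must not exceed the RPO."""
    return interval <= rpo


# Hypothetical schedule: one snapshot every 6 hours, starting at midnight.
interval = timedelta(hours=6)
start = datetime(2024, 1, 1, 0, 0)
snapshots = [start + i * interval for i in range(4)]

disaster = datetime(2024, 1, 1, 17, 45)
restore_point = latest_snapshot_before(snapshots, disaster)
print(restore_point)             # the snapshot that would be restored
print(disaster - restore_point)  # data written after it is lost
print(interval_meets_rpo(interval, timedelta(hours=1)))
```

In this hypothetical schedule, a six-hour interval cannot satisfy a one-hour RPO; the snapshots would have to be taken at least hourly.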

MongoDB

MongoDB Atlas supports backups for clusters hosted on AWS, using the native snapshot functionality of the cloud provider. A Backup Compliance Policy can be enabled to protect sensitive data. In addition, Snapshot Encryption can be used to secure the backups. If the backup procedure fails, Atlas attempts a Fallback Snapshot.

Restoration

Restoration plans are essential to decrease system downtime, and their effectiveness ultimately depends on previous backups and how quickly the systems can be restored. Measures should be in place to ensure a fast and effective restoration.

AWS

AWS Disaster Recovery options

As shown in the diagram above, AWS offers different options for recovery. The more advanced options improve both RPO and RTO significantly, at the expense of being more costly and complex. Backup and restore is the most basic option: systems are restored from an existing backup. Pilot light keeps the data replicated and the core settings in place in a recovery environment, so far less time is needed to restore. Warm standby goes one step further by keeping a scaled-down copy of the system running at all times, ready to take over when needed. Finally, Multi-site means running systems across regions to guarantee availability and avoid having to restore the system at all.
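
Choosing between these options is a trade-off between cost and the RPO and RTO targets. The sketch below shows one way to express that choice, assuming the options are listed from least to most costly. The class, the function, and especially the estimated RPO and RTO values are placeholders for illustration, not figures published by AWS or measured on KAIO.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Strategy:
    name: str
    est_rpo: timedelta  # estimated data-loss window (assumption supplied by the caller)
    est_rto: timedelta  # estimated time to restore (assumption supplied by the caller)


def cheapest_adequate(strategies: Sequence[Strategy],
                      target_rpo: timedelta,
                      target_rto: timedelta) -> Optional[Strategy]:
    """Return the first strategy that meets both targets.

    `strategies` must be ordered from least to most costly.
    """
    for s in strategies:
        if s.est_rpo <= target_rpo and s.est_rto <= target_rto:
            return s
    return None


# Placeholder estimates, ordered by increasing cost and complexity.
options = [
    Strategy("Backup and restore", timedelta(hours=12), timedelta(hours=24)),
    Strategy("Pilot light", timedelta(minutes=30), timedelta(hours=1)),
    Strategy("Warm standby", timedelta(minutes=5), timedelta(minutes=15)),
    Strategy("Multi-site", timedelta(0), timedelta(0)),
]

choice = cheapest_adequate(options,
                           target_rpo=timedelta(hours=1),
                           target_rto=timedelta(hours=4))
print(choice.name if choice else "no option meets the targets")
```

Real estimates would have to come from measurements, for example from the disaster drills described under Testing.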

MongoDB

Restoring a MongoDB Atlas database is simple if a suitable backup is available. Operations must be stopped during the restoration period. A fallback snapshot can also be used as the restore source, but it should be considered a last resort, since it may result in inconsistent data across the cluster.

Testing

Testing the disaster recovery plan is crucial to validate its effectiveness. Disaster drills can be used to put the plan in motion in a controlled environment and then evaluate the result against the expected RPO and RTO. Adjustments to the plan can be made accordingly. Routine testing helps ensure that the recovery plan will restore the systems successfully in the case of a real disaster.
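
The evaluation step of a drill can be sketched as a comparison of measured values against the targets. The example below is a minimal illustration: the function name, the drill labels, and all measurements are hypothetical.

```python
from datetime import timedelta


def evaluate_drills(drills, rpo, rto):
    """Summarise drill measurements against the RPO and RTO targets.

    `drills` is a non-empty list of (label, data_loss, downtime) tuples.
    """
    failed = [label for label, loss, down in drills if loss > rpo or down > rto]
    return {
        "worst_data_loss": max(loss for _, loss, _ in drills),
        "worst_downtime": max(down for _, _, down in drills),
        "failed": failed,
    }


# Hypothetical measurements from two drills.
drills = [
    ("drill-1", timedelta(minutes=20), timedelta(hours=5)),
    ("drill-2", timedelta(minutes=10), timedelta(hours=3)),
]

report = evaluate_drills(drills, rpo=timedelta(hours=1), rto=timedelta(hours=4))
print(report["failed"])
```

Here the first drill exceeds the four-hour RTO, which is the kind of result that would prompt an adjustment to the plan.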
