At approximately 1:55 AM CDT on 9/13/2024, AWS initiated automatic maintenance on the read replica of our main RDS database instance. Normally this only causes queries to fall back to the primary database, but it can also propagate an error to subscriptions to our syncing service. In the past, that error has always been accompanied by a disconnect from and reconnect to the service. That did not occur here.
Our on-premise servers rely on a disconnect and reconnect as the trigger to ask for a syncing status update. Because the reconnect never happened, the on-premise servers never asked, and the error stayed in place until a syncing operation occurred, such as a resource update via the administration console.
We fixed the acute issue by simulating this operation for all affected on-premise servers.
The following remediation steps are planned to prevent this from happening in the future:
- We're exploring solutions that require the on-premise server to reconnect to the syncing service while it is in an error state, so that it can self-heal without requiring a sync operation from the service.
- We’re communicating with our GraphQL service vendor to ensure that we fully understand what the expected behavior is regarding disconnects during service outages.
- We will continue to improve “Chaos Testing”, where we test our systems’ abilities to respond to and recover from unexpected behaviors in both internal and external systems.
- We’re improving the stability of our read replica by limiting version upgrades to our maintenance windows and by researching Multi-AZ failover, which should reduce the frequency of subscription errors.
- We’re creating alarms for large-scale on-premise server alarm events, which generally indicate a cloud-side outage, in order to improve response time when such outages occur.
- In a future on-premise server release, we’re going to downgrade the severity of the syncing subscription alarm such that it does not trigger a Fusion server failover event, since the syncing service’s availability does not influence the delivery of notifications.
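
As a rough illustration of the self-healing approach in the first remediation step, the sketch below shows one way an on-premise client could recover from a subscription error without waiting for a sync operation. It is a minimal sketch under stated assumptions, not our actual implementation: the `SyncSubscription` class, its `connect` and `request_status` callbacks, and the `"ok"` status value are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SyncSubscription:
    """Hypothetical on-premise client for the syncing service."""

    connect: Callable[[], None]        # assumed: opens a fresh subscription
    request_status: Callable[[], str]  # assumed: asks for a syncing status update
    in_error: bool = False
    log: List[str] = field(default_factory=list)

    def on_subscription_error(self, reason: str) -> None:
        # An error can arrive without the usual disconnect, so we record
        # the error state instead of waiting for a reconnect to clear it.
        self.in_error = True
        self.log.append(f"error: {reason}")

    def heal_if_needed(self) -> bool:
        """Reconnect and re-request status if the subscription is in error.

        Returns True when the error state was cleared by this call.
        """
        if not self.in_error:
            return False
        self.connect()
        status = self.request_status()
        self.log.append(f"status: {status}")
        if status == "ok":  # "ok" is an assumed status value
            self.in_error = False
        return not self.in_error
```

In this sketch, `heal_if_needed` would be called on a timer while the alarm is active, so that recovery no longer depends on the service disconnecting the client or on a resource update arriving from the administration console.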