InformaCast Fusion servers reporting AL-SYNCSUBUNH state on 2024-09-13

Incident Report for Singlewire Software

Postmortem

At approximately 1:55 AM CDT on 9/13/2024, AWS initiated automatic maintenance of the read replica of our main RDS database instance. While this generally causes most queries to simply fall back to the primary database, it can also cause an error to be propagated to subscriptions to our syncing service. In the past, we've always seen this accompanied by a disconnect and reconnect to the service, but that did not occur here.

Our on-premise servers request a syncing status update only after a disconnect and reconnect. Because no reconnect occurred here, the affected servers never requested one, and the error persisted until a syncing operation, such as a resource update via the administration console, cleared it.
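The failure mode described above can be illustrated with a small model. This is only a sketch for explanation, not the actual on-premise server code: the class name `OnPremSyncClient` and its methods are hypothetical, and the real client's internals are not described in this report.

```python
from dataclasses import dataclass


@dataclass
class OnPremSyncClient:
    """Hypothetical model of an on-premise server's syncing subscription."""

    healthy: bool = True
    status_requests: int = 0

    def on_subscription_error(self) -> None:
        # An error raises the alarm but does not, by itself, request a status update.
        self.healthy = False

    def on_reconnect(self) -> None:
        # Assumed to be the only path that asks the service for a fresh syncing status.
        self.status_requests += 1
        self.healthy = True

    def on_sync_operation(self) -> None:
        # A sync pushed by the service (e.g. a resource update) also clears the alarm.
        self.healthy = True


# Usual case: the error is accompanied by a disconnect and reconnect.
usual = OnPremSyncClient()
usual.on_subscription_error()
usual.on_reconnect()

# This incident: the error arrives with no disconnect, so nothing requests a status.
stuck = OnPremSyncClient()
stuck.on_subscription_error()
```

In this model, `usual` ends up healthy after one status request, while `stuck` stays unhealthy with zero status requests until `on_sync_operation()` is called, which corresponds to the fix described in this postmortem.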

We fixed the acute issue by simulating this operation for all affected on-premise servers.

The following remediation steps are planned to prevent this from happening in the future:

We're exploring solutions that involve requiring the on-premise server to reconnect to the syncing service during an error state, so that it can self-heal from an error without requiring a sync operation from the service.
We’re communicating with our GraphQL service vendor to ensure that we fully understand what the expected behavior is regarding disconnects during service outages.
We will continue to improve “Chaos Testing”, where we test our systems’ abilities to respond to and recover from unexpected behaviors in both internal and external systems.
We’re improving the stability of our read replica by limiting version upgrades to our maintenance windows and by researching Multi-AZ failover, which should reduce the frequency of subscription errors.
We’re creating alarms for large-scale on-premise server alarm events, which generally indicate a cloud-side outage, to improve response time when such outages occur.
In a future on-premise server release, we’re going to downgrade the severity of the syncing subscription alarm such that it does not trigger a Fusion server failover event, since the syncing service’s availability does not influence the delivery of notifications.
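As one possible shape for the first remediation item, the sketch below shows a client that forces a reconnect whenever it sees a subscription error, rather than waiting for a sync operation from the service. It is an illustration under stated assumptions, not the planned implementation: `SelfHealingSyncClient`, the injected `connect` and `sleep` callables, and the delay parameters are all hypothetical. The jittered backoff is an added assumption, included because many servers may hit the same error at once during a cloud-side event.

```python
import random
from typing import Callable


class SelfHealingSyncClient:
    """Hypothetical client that reconnects on error instead of waiting for a sync."""

    def __init__(
        self,
        connect: Callable[[], bool],      # assumed: returns True once reconnected
        sleep: Callable[[float], None],   # injected so the delay can be faked in tests
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        max_attempts: int = 5,
    ) -> None:
        self.connect = connect
        self.sleep = sleep
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_attempts = max_attempts
        self.healthy = True

    def on_subscription_error(self) -> None:
        self.healthy = False
        for attempt in range(self.max_attempts):
            # Exponential backoff with full jitter, so a fleet-wide error
            # does not cause every server to reconnect at the same instant.
            ceiling = min(self.max_delay, self.base_delay * 2 ** attempt)
            self.sleep(random.uniform(0, ceiling))
            if self.connect():
                # In this sketch, a successful reconnect stands in for the
                # status request that clears the error state.
                self.healthy = True
                return
        # All attempts failed: stay unhealthy so the alarm remains visible.
```

A caller would pass its own transport function as `connect` and `time.sleep` as `sleep`. If every attempt fails, the client stays unhealthy, so the existing alarm still surfaces a genuine outage.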

Posted Sep 17, 2024 - 10:29 CDT

Resolved

Affected servers have returned to a healthy state.

Posted Sep 13, 2024 - 09:55 CDT

Monitoring

We have applied a fix and most affected servers have returned to a healthy state.

Posted Sep 13, 2024 - 09:33 CDT

Identified

We have identified the root issue and we are applying a fix to return affected servers to a healthy state.

Posted Sep 13, 2024 - 09:25 CDT

Investigating

InformaCast Fusion servers running 14.22.1 and higher are in a Red system health state on 2024-09-13 due to the AL-SYNCSUBUNH system health alarm.

This may cause configurations with a primary/backup Fusion server failover pair to emit redundant notifications.

We are currently investigating the root cause of this issue.

Please see https://support.singlewire.com/s/article/2024-09-13-AL-SYNCSUBUNH-red-system-health-alarm-state for remediation steps.

Posted Sep 13, 2024 - 08:00 CDT

This incident affected: InformaCast (On-premises Server Syncing Service).