Some Fusion servers reporting as disconnected

Incident Report for Singlewire Software

Postmortem

The service that determines Fusion server health uses a distributed data store to track the state of each Fusion server. That store is meant to be resilient to the loss of a single node in our system. However, this service was configured differently from our other services that use the same technology: it required the data to match on every node before it would process that data.

As part of a normal system maintenance operation, we replaced several nodes in our system on November 27, including the ones holding this Fusion server health data. Because of this misconfiguration, the system temporarily stopped processing Fusion server health data.
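To illustrate the difference between requiring agreement from every node and tolerating the loss of one, here is a minimal sketch. It is not the actual service or its configuration: this report does not name the data store, so the function names (read_requiring_all, read_with_quorum) and the use of None for a node with no data are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

# Hypothetical model: each entry in `replies` is one node's answer for a
# Fusion server's state, or None if that node has no data (for example,
# because it was just replaced). This is an assumption, not the real store.


def read_requiring_all(replies: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return a value only if every node answered and all answers match."""
    if not replies or any(r is None for r in replies):
        return None
    first = replies[0]
    return first if all(r == first for r in replies) else None


def read_with_quorum(replies: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value that a strict majority of all nodes agree on."""
    answered = [r for r in replies if r is not None]
    if not answered:
        return None
    value, count = Counter(answered).most_common(1)[0]
    # The majority is measured against the total node count, not just
    # the nodes that answered.
    return value if count > len(replies) // 2 else None


# Three nodes, one of which holds no data yet.
replies = ["connected", "connected", None]
print(read_requiring_all(replies))  # None: processing stalls
print(read_with_quorum(replies))    # connected: one missing node is tolerated
```

With three nodes, the all-nodes rule cannot produce a result while any one node is missing or out of date, whereas the majority rule still can. This is the kind of single-node resilience the paragraphs above describe as the intended behavior.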

To solve the immediate problem, we restarted the service, which returned it to a good working state. We are also planning a longer-term fix that corrects the configuration so that the future loss of a single node cannot cause the same problem, either in this service or in future services that use the same technology.

Posted Dec 20, 2024 - 09:49 CST

Resolved

After we restarted the affected service, the affected servers are now reporting as connected. We will continue working to identify the root cause and to prevent similar issues in the future.

Posted Nov 27, 2024 - 15:30 CST

Monitoring

After we restarted the affected service, the affected servers are now reporting as connected. We will continue to monitor for any abnormalities.

Posted Nov 27, 2024 - 15:10 CST

Identified

We have identified an issue with the service that reports on server state and are working to resolve it. Notification deliverability is not affected.

Posted Nov 27, 2024 - 14:18 CST

Investigating

We have received reports of Fusion servers reporting as disconnected. We are investigating to determine the cause of the disconnected status and what impact, if any, it might have on notification reliability.

Posted Nov 27, 2024 - 14:03 CST