InformaCast Unplanned Outage

Incident Report for Singlewire Software

Postmortem

Postmortem Summary

On Monday morning (November 18), all Fusion customers experienced slow syncing between on-premises devices and the cloud, which may have slowed the delivery of notifications. The event started around 9:30am and was resolved by 10:30am.

Additionally, there was a short outage in the afternoon that affected notifications going out. We are very sorry for any impact this may have had on your daily operations and we’re striving to fix the root cause.

So what was the root cause?

What went wrong?

On Saturday, Amazon Web Services (AWS), which hosts our Fusion product, experienced degraded performance related to networking and APIs. This caused one node to briefly detach from our RabbitMQ cluster (the part of our system responsible for organizing messages to send). Notifications continued to go out because we use highly available (mirrored) queues. However, communication among the queue nodes never completely recovered. The troubled node only partially rejoined the cluster, and keeping it synchronized placed additional load on the two healthy queue nodes.
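
As a rough illustration of the condition described above (not Singlewire's actual tooling), the sketch below flags mirrored queues whose mirrors have fallen out of sync. It assumes queue records shaped like those the RabbitMQ management API returns from /api/queues for classic mirrored queues, with `slave_nodes` and `synchronised_slave_nodes` fields. The function name, node names, and sample data are hypothetical.

```python
def unsynchronised_mirrors(queues):
    """Map each queue name to the mirror nodes that are not in sync.

    `queues` is a list of dicts shaped like /api/queues entries
    (assumed fields: name, slave_nodes, synchronised_slave_nodes).
    Queues whose mirrors are all synchronized are omitted.
    """
    lagging = {}
    for queue in queues:
        mirrors = set(queue.get("slave_nodes", []))
        in_sync = set(queue.get("synchronised_slave_nodes", []))
        behind = sorted(mirrors - in_sync)
        if behind:
            lagging[queue["name"]] = behind
    return lagging


# Hypothetical sample: node-c has only partially rejoined the cluster.
sample_queues = [
    {
        "name": "notifications",
        "slave_nodes": ["rabbit@node-b", "rabbit@node-c"],
        "synchronised_slave_nodes": ["rabbit@node-b"],
    },
    {
        "name": "device-sync",
        "slave_nodes": ["rabbit@node-b", "rabbit@node-c"],
        "synchronised_slave_nodes": ["rabbit@node-b", "rabbit@node-c"],
    },
]

print(unsynchronised_mirrors(sample_queues))
```

A mirror that stays in this list for a long time after a network blip would be one sign that a node has not fully rejoined.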

On Monday, normal expected use brought higher load and increased message rates, and the degraded message queuing cluster could no longer stay healthy and process that load. For a short time Monday afternoon, we brought down all queue nodes in order to preserve data and bring the cluster back up in a healthy state.

What are we doing about it?

We implemented additional proactive monitoring of our RabbitMQ cluster to detect the early warning signs for this issue. We also updated our processes for how to respond to these alerts so that we can recover prior to a service impact.
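
For readers curious what such an early-warning check might look like, here is a minimal sketch. It is an assumption for illustration, not the monitoring we deployed. It inspects node records shaped like those the RabbitMQ management API returns from /api/nodes, using the `running` and `partitions` fields, and reports nodes that are down or that see a network partition. The function name and sample data are hypothetical.

```python
def cluster_warnings(nodes):
    """Return human-readable warnings for unhealthy cluster nodes.

    `nodes` is a list of dicts shaped like /api/nodes entries
    (assumed fields: name, running, partitions).
    """
    warnings = []
    for node in nodes:
        name = node.get("name", "<unknown>")
        if not node.get("running", False):
            warnings.append(f"{name}: not running")
        partitioned_from = node.get("partitions", [])
        if partitioned_from:
            peers = ", ".join(sorted(partitioned_from))
            warnings.append(f"{name}: partitioned from {peers}")
    return warnings


# Hypothetical sample: one node sees a partition, another is down.
sample_nodes = [
    {"name": "rabbit@node-a", "running": True, "partitions": []},
    {"name": "rabbit@node-b", "running": True, "partitions": ["rabbit@node-c"]},
    {"name": "rabbit@node-c", "running": False, "partitions": []},
]

for warning in cluster_warnings(sample_nodes):
    print(warning)
```

In practice a check like this would run on a schedule against live API data and page an on-call engineer whenever the list is non-empty.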

Posted Nov 20, 2019 - 13:46 CST

Resolved

The InformaCast Unplanned Outage has been resolved. We’re sorry for any inconvenience this may have caused. If you experience any other issues, please reach out to our support team. Thank you!

Posted Nov 18, 2019 - 15:08 CST

Monitoring

We have identified and implemented a fix to our infrastructure. We are continuing to monitor stability of all services. Notification services have been restored.

Posted Nov 18, 2019 - 14:27 CST

Update

We have experienced an unplanned internal outage of our message queuing service. The Singlewire DevOps team is engaged and actively working to resolve this failure as quickly as possible. Notifications will be impacted during this time.

Posted Nov 18, 2019 - 13:59 CST

Investigating

We have received reports of delayed or failing sync operations. We are investigating at this time.

Posted Nov 18, 2019 - 13:53 CST

This incident affected: InformaCast Notification Channels (Android Push Notifications, iOS Push Notifications, Email Notifications, Phone Call Notifications, SMS Notifications, WebEx Teams Notifications).