On Monday morning, all Fusion customers experienced slow syncing between on-premises devices and the cloud, which may have impacted the speed at which notifications went out. The event started around 9:30am and was resolved by 10:30am.
Additionally, there was a short outage in the afternoon that prevented some notifications from going out. We are very sorry for any impact this may have had on your daily operations, and we are working to address the root cause.
So what was the root cause?
On Saturday, Amazon Web Services (AWS), which hosts our Fusion product, experienced degraded performance related to networking and APIs. This caused one node to briefly detach from our RabbitMQ cluster (the part of our system responsible for organizing messages to send). Notifications continued to go out because we use highly available (mirrored) queues. However, communication between the queue nodes never fully recovered: the troubled node only partially rejoined the cluster, and its attempts to resynchronize placed additional load on the two healthy queue nodes.
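For readers curious how mirrored queues keep notifications flowing when a node drops out: in RabbitMQ, classic queue mirroring is typically enabled by a policy rather than per queue. The sketch below shows what such a policy looks like in a RabbitMQ definitions file; the policy name, vhost, and pattern are illustrative, not our exact configuration.

```json
{
  "policies": [
    {
      "vhost": "/",
      "name": "ha-all",
      "pattern": ".*",
      "apply-to": "queues",
      "definition": {
        "ha-mode": "all",
        "ha-sync-mode": "automatic"
      }
    }
  ]
}
```

With `ha-mode: all`, every queue matching the pattern is mirrored to every node, so a single node detaching does not lose messages; the trade-off is that a rejoining node must resynchronize its mirrors, which is exactly the extra load the healthy nodes absorbed here.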
As message rates rose on Monday through normal, expected use, the already-degraded queuing cluster was unable to stay healthy and keep up with the load. For a short time Monday afternoon, we brought all queue nodes down in order to preserve data and bring the cluster back up in a healthy state.
We have implemented additional proactive monitoring of our RabbitMQ cluster to detect the early warning signs of this issue. We have also updated our processes for responding to these alerts so that we can recover before there is any service impact.
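As a rough illustration of what that monitoring looks for: RabbitMQ's management API exposes per-node status (including whether a node is running and whether it reports network partitions), and a check can alert on either condition. This is a minimal sketch, not our production monitoring; the node names and the `sample` snapshot are hypothetical, and the dictionary shape is based on the management API's `/api/nodes` response.

```python
from typing import Dict, List


def find_cluster_warnings(nodes: List[Dict]) -> List[str]:
    """Scan node status (shaped like RabbitMQ's management API
    /api/nodes response) for early warning signs: a node that is
    down, or a node reporting a network partition."""
    warnings = []
    for node in nodes:
        name = node.get("name", "<unknown>")
        if not node.get("running", False):
            warnings.append(f"{name}: not running")
        if node.get("partitions"):
            warnings.append(f"{name}: partitioned from {node['partitions']}")
    return warnings


# Hypothetical snapshot of a three-node cluster where one node has
# partially detached, similar to the failure mode described above.
sample = [
    {"name": "rabbit@node1", "running": True, "partitions": []},
    {"name": "rabbit@node2", "running": True, "partitions": ["rabbit@node3"]},
    {"name": "rabbit@node3", "running": False, "partitions": []},
]

for warning in find_cluster_warnings(sample):
    print(warning)
```

The key design point is alerting on partitions as soon as any node reports one, rather than waiting for queues to back up: in this incident the cluster kept delivering notifications for two days after the partial partition, so the partition itself was the earliest actionable signal.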