On Friday night, we deployed a small infrastructure change that required a reboot of our database. During this reboot, we encountered a bug in our event processor code that caused events not to be processed until service reboot.
To compound this, our monitoring platform had invalidated the alarm we had in place to monitor errors with our event processor. This caused us to remain unaware of the issue until Monday morning, where we promptly fixed it by rebooting the service containing the event processors. At that point, the platform needed to catch up by processing all the events that had been building up over the weekend, which completed on Monday night.
We will be taking a multi-pronged approach to ensuring this doesn’t happen again:
We are augmenting our event processing code in the following ways:
We apologize for service disruptions caused by this incident.