The service that determines Fusion server health uses a distributed data store to track the state of each Fusion server. The data store is meant to be resilient to the loss of a single node. However, this service was configured differently from our other services that use the same technology: it required the data to match on every node before it would process that data.
On November 27, as part of a routine maintenance operation, we replaced several nodes in our system, including the ones holding this Fusion server health data. Because the misconfigured service required matching data on every node, it temporarily stopped processing Fusion server health data.
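To illustrate why a rule that requires matching data on every node is fragile during node replacement, here is a minimal Python sketch. It is not our actual service or configuration. The function names, the three-node cluster, and the idea that a freshly replaced node returns no data are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional


def read_all_must_match(replies: List[Optional[str]]) -> Optional[str]:
    """Strict policy (hypothetical): every node must answer with the same value."""
    if any(reply is None for reply in replies):
        return None  # at least one node has no data, so nothing is processed
    return replies[0] if len(set(replies)) == 1 else None


def read_majority(replies: List[Optional[str]]) -> Optional[str]:
    """Quorum policy (hypothetical): accept a value held by a strict majority of nodes."""
    answered = [reply for reply in replies if reply is not None]
    if not answered:
        return None
    value, count = Counter(answered).most_common(1)[0]
    return value if count > len(replies) // 2 else None


# Assumed scenario: three nodes hold the health state of one Fusion server.
healthy_cluster = ["healthy", "healthy", "healthy"]
# One node has just been replaced and (by assumption) holds no data yet.
after_replacement = ["healthy", "healthy", None]

print(read_all_must_match(healthy_cluster))    # healthy
print(read_all_must_match(after_replacement))  # None
print(read_majority(after_replacement))        # healthy
```

Under these assumptions, the strict policy stops returning a result as soon as one node is missing or out of date, while the majority policy keeps working through the loss of a single node.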
To resolve the immediate problem, we restarted the service, which returned it to a good working state. We are also planning a longer-term fix that corrects the configuration so that the loss of a single node can no longer cause this problem, either in this service or in future services that use the same technology.