Valued AlienVault Customers,
USM Anywhere experienced a service interruption this week. We want to share what happened and explain how we plan to maintain the highest level of service availability going forward.
Beginning around 7 AM CT on Tuesday, November 28, many sensors started dropping events instead of forwarding them to your USM Anywhere control nodes. As a result, events dropped during that period were not captured, and no alerts were generated for them. If your installation was affected, we will contact you directly.
We apologize for this outage. We know you depend on the USM Anywhere service to maintain a strong security posture, and any gap in that service on our part is unacceptable. We are committed to supporting the security and compliance needs of all of our customers.
The cause was a subtle defect in the sensor code in our most recent release (6.0.70), where a class loading race condition could cause the sensor to discard security events. We did not detect this during our testing cycle because we did not encounter the race condition there; it became apparent only once the release reached our larger fleet of customers.
We have extensive monitoring on our production systems, but it took us a while to catch this problem. Sensors stop sending events for many normal, legitimate customer reasons, so a drop in events on a single instance is not necessarily alarming, and we usually investigate only after a set period has passed.
Once we became aware of the issue, around 10 AM CT Wednesday, November 29, we immediately started our incident management process, communicated via our status page at status.alienvault.cloud, and developed and released a hotfix. All sensors were fixed by around 5 PM CT Wednesday, November 29.
We have conducted a comprehensive internal retrospective and plan to do the following.
First, we've corrected the issue and are instituting several code hygiene improvements to make sure the underlying technical issue doesn't recur.
Second, we are adding deeper fleet-level monitoring and exposing more metrics from the sensors, which will help us establish their health more clearly.
Third, we'll improve the speed of our release process so that we can release hotfixes more quickly than we can now, while still maintaining a high level of quality and compliance.
Finally, we will improve our approach to buffering data so that errors like this do not result in a failure to capture events in the future.
Thank you for being a USM Anywhere customer. Again, we apologize to those whose service was interrupted during this period.