USM Anywhere Service Issue
Incident Report for USM Anywhere
Postmortem

Valued AlienVault Customers,

We had an interruption in service of USM Anywhere this week and we wanted to take an opportunity to share more about it with you and explain how we plan to maintain the highest level of service availability going forward.

What Happened

Beginning around 7 AM CT on Tuesday, November 28, many sensors started dropping events instead of forwarding them to your USM Anywhere control nodes. During that time, those events were not captured and no alerts were generated for them. If your installation was affected, you will be contacted directly.

We apologize for this outage; we know you depend on the USM Anywhere service to maintain a strong security posture and any gap in that on our part is not acceptable. We are committed to supporting the security and compliance needs of all of our customers.

How It Happened

The cause was a subtle defect in the sensor code in our most recent release (6.0.70): a class loading race condition could cause the sensor to discard security events. We did not detect this during our testing cycle because the race condition never occurred there, but once the release reached our larger fleet of customers, it became apparent.

We have extensive monitoring on our production systems, but it took a while for us to catch this. Sensors stop sending events for many normal, legitimate customer reasons, so a drop in events on a single instance isn't necessarily alarming, and we usually only look into it after a set period.
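
The difference between a per-sensor signal and a fleet-level one can be sketched in a few lines. This is an illustrative sketch only, not our production monitoring: the function names, the input dictionaries of events per interval, and both thresholds are assumptions chosen for the example. The idea is that one quiet sensor stays below the alarm line, while many sensors going quiet at the same time does not.

```python
def sensor_drop_ratio(baseline, current):
    """Fraction of expected events missing (0.0 = no drop, 1.0 = fully silent)."""
    if baseline <= 0:
        return 0.0
    return max(0.0, 1.0 - current / baseline)


def fleet_drop_alarm(baselines, currents, drop_threshold=0.8, fleet_fraction=0.2):
    """Return True when a large share of active sensors dropped sharply at once.

    baselines, currents: dicts mapping a sensor id to events per interval
    (hypothetical inputs). drop_threshold and fleet_fraction are assumed values.
    """
    # Only sensors that normally send events can meaningfully "drop".
    active = [sid for sid, base in baselines.items() if base > 0]
    if not active:
        return False
    dropped = [
        sid for sid in active
        if sensor_drop_ratio(baselines[sid], currents.get(sid, 0)) >= drop_threshold
    ]
    return len(dropped) / len(active) >= fleet_fraction


baselines = {f"sensor-{i}": 100 for i in range(10)}

# One sensor goes quiet: plausible customer activity, no fleet alarm.
one_quiet = dict(baselines, **{"sensor-0": 0})
single_sensor_alarm = fleet_drop_alarm(baselines, one_quiet)

# Half the fleet goes quiet together: unlikely to be coincidence.
many_quiet = {sid: (0 if i < 5 else 100) for i, sid in enumerate(baselines)}
fleet_wide_alarm = fleet_drop_alarm(baselines, many_quiet)
```

In this sketch a single silent sensor is 10% of the fleet and falls under the assumed 20% line, whereas five silent sensors cross it immediately rather than waiting out a per-instance timer.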

Once we became aware of the issue, around 10 AM CT Wednesday, November 29, we immediately started our incident management process, communicated via our status page at status.alienvault.cloud, and developed and released a hotfix. All sensors were fixed by around 5 PM CT Wednesday, November 29.

How We Will Prevent This In The Future

We have conducted a comprehensive internal retrospective and plan to do the following.

First, we've corrected the issue and are instituting several code hygiene improvements to make sure the underlying technical issue doesn't recur.

Second, we are adding deeper fleet-level monitoring and exposing more metrics from the sensors that will help us establish their health more clearly.

Third, we'll improve the speed of our release process so that we can release hotfixes more quickly than we can now, while still maintaining a high level of quality and compliance.

Finally, we will improve our approach to buffering data to ensure that errors like this will not result in the failure to capture events in the future.
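
As a rough illustration of the buffering idea in the last point, the sketch below holds each event until the forwarding step succeeds, so a failure leaves the event queued for a retry instead of losing it. It is a minimal sketch under stated assumptions, not the sensor's actual design: the class name `BufferedForwarder`, the `send` callable, and the `max_events` cap are all hypothetical.

```python
from collections import deque


class BufferedForwarder:
    """Hold events until they are forwarded successfully (hypothetical sketch)."""

    def __init__(self, send, max_events=10000):
        self._send = send              # assumed callable(event) that raises on failure
        self._pending = deque()
        self._max_events = max_events  # assumed cap so the buffer cannot grow forever
        self.dropped = 0               # counted so that any loss is visible, not silent

    def submit(self, event):
        # When the buffer is full, evict the oldest event and record the loss.
        if len(self._pending) >= self._max_events:
            self._pending.popleft()
            self.dropped += 1
        self._pending.append(event)

    def flush(self):
        """Try to forward pending events in order; return how many were sent."""
        sent = 0
        while self._pending:
            event = self._pending[0]
            try:
                self._send(event)
            except Exception:
                # Keep this event and everything behind it for the next attempt.
                break
            self._pending.popleft()
            sent += 1
        return sent

    def pending(self):
        return len(self._pending)


# Usage: a forwarding step that fails at first, then recovers.
received = []
state = {"failing": True}


def flaky_send(event):
    if state["failing"]:
        raise RuntimeError("forwarding unavailable")
    received.append(event)


forwarder = BufferedForwarder(flaky_send, max_events=3)
forwarder.submit("event-a")
forwarder.submit("event-b")
sent_while_failing = forwarder.flush()   # nothing is sent, nothing is lost
state["failing"] = False
sent_after_recovery = forwarder.flush()  # both events arrive, in order
```

An event is removed from the buffer only after the forwarding call returns without error, which is the property that matters here. The `dropped` counter is the kind of metric that could also feed fleet-level monitoring.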

Thank you for being a USM Anywhere customer and again, we apologize to those whose service was interrupted during this period.

Posted 10 months ago. Dec 01, 2017 - 23:24 UTC

Resolved
We've released a fix and have confirmed that event flow is back to normal levels. If you have a sensor that isn't sending events, please try restarting it. If that doesn't clear the problem, contact AlienVault Support and they will help you resolve it.

We apologize for the disruption and will publish a retrospective here with further details and next steps shortly.
Posted 10 months ago. Nov 29, 2017 - 23:50 UTC
Monitoring
A fix has been pushed out, and sensors will download and apply it over the course of the hour. We will watch carefully to ensure everyone's service is restored. We apologize for this disruption and plan to get all your sensors healthy again soon.
Posted 10 months ago. Nov 29, 2017 - 22:14 UTC
Update
We are preparing a fix for this issue now. If it passes our testing, we will release it immediately, and all sensors should pull it down and fix themselves automatically. We'll update you again when the fix goes out.
Posted 10 months ago. Nov 29, 2017 - 20:18 UTC
Identified
We have identified an issue in the new version of USM Anywhere sensors that was released on November 28, 2017. The issue can cause many incoming events to not be processed. You can check whether you are affected by looking at your dashboard to see if your number of events has dropped off. Events per sensor are shown in the faceted navigation on the left of the dashboard; if a sensor is showing zero or an unusually small number of events but is up and working, you may be affected.

We have identified the cause and are working on pushing out a fix as soon as possible. Contact AlienVault Support if you have any questions.

You can monitor the USM Anywhere status page (status.alienvault.cloud) for further developments. We will continue to send you additional information as we work to resolve the issue.
Posted 10 months ago. Nov 29, 2017 - 17:09 UTC