USM Anywhere Service Issue - Sensors Disconnected
Incident Report for USM Anywhere
Postmortem

We would like to take this opportunity to share more information with our customers regarding the USM Anywhere service disruption on April 18, 2019. This postmortem describes the events and contributing factors that led to the service disruption, as well as the actions we took to respond, recover, and notify customers. It also describes the steps we are taking to improve our processes to prevent these types of issues from happening again.

Incident summary and timeline

This service disruption began at approximately 9:20AM CT on Thursday, April 18, 2019, when the USM Anywhere sensors began to restart systematically as part of a regular software release process. This process completed at approximately 10:30AM. AT&T Cybersecurity engineers, who monitor the update process as part of our normal operations, noticed that some sensors were failing to reconnect to the USM Anywhere service after the update and immediately began to investigate. At approximately 11:00AM, our internal monitoring systems detected and alerted us to an issue with some of the sensors on our “canary deployments,” which had received the release before the rest of the customer base.
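As an illustration only, a check of the kind a canary deployment makes possible can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of our actual release tooling: the names `CanaryResult` and `should_continue_rollout` and the 99% threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CanaryResult:
    """Counts observed on the canary group after an update (hypothetical)."""
    updated: int      # sensors that received the release
    reconnected: int  # sensors that reconnected afterwards


def should_continue_rollout(result: CanaryResult,
                            min_reconnect_ratio: float = 0.99) -> bool:
    """Return True only if enough canary sensors reconnected.

    The 0.99 default is an assumed threshold chosen for illustration.
    """
    if result.updated == 0:
        # No canary data yet: do not proceed on missing evidence.
        return False
    return result.reconnected / result.updated >= min_reconnect_ratio
```

For example, `should_continue_rollout(CanaryResult(updated=200, reconnected=180))` evaluates to False under the assumed threshold, which would hold the release back from the wider customer base.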

Shortly after 11:00AM, we determined this issue to be a severity level 0 (our most severe designation) service incident. Within the hour, we had notified all customers through status.alienvault.cloud.

By approximately 1:30PM, AT&T Cybersecurity engineers had identified the defect that was causing the sensors to fail. The defect was introduced by a code change to two AlienApps (the AlienApp for Carbon Black and the AlienApp for Forensics and Response). The change unexpectedly disrupted the decryption of those apps’ configuration files, which prevented the apps from starting and, in turn, disrupted the initialization and connection of the sensor. The issue affected sensors differently depending on when the AlienApp had initially been configured. Because of this difference, our standard regression tests did not detect the issue earlier in the release process.
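To illustrate the testing gap described above, the following is a minimal sketch of a regression check that decrypts one configuration fixture per "configuration era." Everything in it is an assumption made for illustration: the two stored formats, the function names, and the XOR transform (a toy stand-in for real encryption) are hypothetical and do not reflect the actual AlienApp code.

```python
import base64
from itertools import cycle

KEY = b"demo-key"  # hypothetical key; a toy stand-in, not real key management


def _xor(data: bytes, key: bytes = KEY) -> bytes:
    """Toy reversible transform standing in for real encryption."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def encrypt_v1(plaintext: str) -> str:
    """Hypothetical older format: bare base64 payload with no version tag."""
    return base64.b64encode(_xor(plaintext.encode())).decode()


def encrypt_v2(plaintext: str) -> str:
    """Hypothetical newer format: the same payload behind a 'v2:' tag."""
    return "v2:" + encrypt_v1(plaintext)


def decrypt_config(stored: str) -> str:
    """Decrypt a stored config, accepting both hypothetical formats."""
    payload = stored[3:] if stored.startswith("v2:") else stored
    return _xor(base64.b64decode(payload)).decode()


def run_compat_checks() -> dict:
    """Decrypt one fixture per format and report pass/fail for each."""
    secret = '{"api_key": "example"}'
    fixtures = {
        "configured_before_change": encrypt_v1(secret),
        "configured_after_change": encrypt_v2(secret),
    }
    results = {}
    for name, stored in fixtures.items():
        try:
            results[name] = decrypt_config(stored) == secret
        except Exception:
            # A decryption failure is recorded as a failed check, not raised.
            results[name] = False
    return results
```

The point of the sketch is the fixture set, not the transform: a test suite that only exercises freshly created configurations would pass even if configurations created earlier could no longer be decrypted.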

By 5:45PM, the AT&T Cybersecurity engineering team had built, tested, and approved the best rollout option for the fix and released it into production. The team monitored the rollout closely, and by 7:00PM, our monitoring systems indicated that sensor operations had returned to a normal threshold. At this time, the incident was considered resolved.

Impact to Customers

From approximately 9:20AM CT on Thursday, April 18, 2019, when the sensors began to update, until 7:00PM CT, when most of the sensors had received the fix, any USM Anywhere sensors affected by this reconnection issue were unable to collect, process, or transmit data from the monitored environment to USM Anywhere. It’s important to note that not all USM Anywhere sensors were affected. All other functions of USM Anywhere continued to operate normally during the incident, as did data collection by sensors that were not affected.

How we notified customers

When a service disruption or outage occurs, we take prompt action to notify customers with timely, accurate information. Our first and best source of information is status.alienvault.cloud for USM Anywhere and status-central.alienvault.cloud for USM Central. If you are not already subscribed to receive email and text alerts regarding service disruption events, please take a moment to do so right now. In addition, we notified affected customers through an email communication on the evening of Thursday, April 18, 2019.

Please note that in the event of a service disruption, we do our best to accurately define the scope of affected customers and to coordinate proactive email communication in a rapid manner.

What are we doing to improve?

As described above, a confluence of events led to this incident. As part of our incident response process, we have conducted internal retrospectives to discover what we can learn from this incident and how we can improve our processes to help prevent this type of issue from happening again. We have identified improvements to our scope of testing as it relates to this issue, as well as additional steps in our release process. In addition, we have identified improvements to our communications process that will help us identify and notify affected customers faster in the future.

Summary

Finally, we would like to apologize for the inconvenience this has caused. Our goal is to provide security monitoring at a level of service that you can rely on to monitor your critical assets in support of your security and compliance goals. With USM Anywhere, we are able to reduce a substantial portion of the overhead required to manage a security monitoring product by providing maintenance and updates in an automated fashion. In times like this, we commit to learning from this service disruption to drive improvement across our processes. We encourage you to share your thoughts with us on how we can do better. If you have any questions or concerns, please reach out directly to your AT&T Cybersecurity account manager at any time or contact our support channel at success.alienvault.com.

Sincerely,

AT&T Cybersecurity

Posted Apr 30, 2019 - 15:01 UTC

Resolved
The incident is resolved. Thank you for your patience and we apologize for the disruption.

If your USM Anywhere is not functioning normally at this time, please open a support ticket at success.alienvault.com to discuss the issue with a support representative.
Posted Apr 19, 2019 - 00:55 UTC
Update
We are continuing to monitor for any further issues.
Posted Apr 19, 2019 - 00:14 UTC
Update
At approximately 5:45PM CT, we released a fix for the sensor connectivity issues. We are currently monitoring all customer environments as they update. At this time, most sensors have reconnected and we expect full recovery within the next hour.

We will continue to monitor the situation and will declare the incident resolved when we are satisfied that the service has returned to normal.

If you have any additional questions, please contact support at success.alienvault.com.
Posted Apr 19, 2019 - 00:12 UTC
Monitoring
We released a fix for the sensor connectivity issues at approximately 5:45PM CT. We are currently monitoring all customer environments as their sensors update. At this time, it appears all sensors have reconnected and we expect full recovery within the hour.
Posted Apr 19, 2019 - 00:11 UTC
Identified
We believe we have identified the cause of the issue with USM Anywhere sensors. This issue relates to a code update made during the most recent regularly scheduled sensor update, which began at approximately 9:20AM CT today.

At this time, this issue may be preventing your USM Anywhere sensors from connecting and sending data to the USM Anywhere service. We believe that this issue is limited to USM Anywhere customers who have configured or used the following AlienApps: AlienApp for Forensics and Response; AlienApp for Carbon Black. Affected sensors may show a connection status of “Connected” but not “Ready” under Data Sources > Sensors.

Our technical team is working hard to resolve the issue. You can monitor the USM Anywhere status page (status.alienvault.cloud) for further developments. We will continue to send you additional information as we work to resolve the issue.
Posted Apr 18, 2019 - 18:38 UTC
Investigating
There is an issue with the USM Anywhere service that is preventing sensors from connecting to USM Anywhere.

The issue began around 9:20 AM Central time.

AlienVault technical staff is investigating the cause. You can monitor the USM Anywhere status page (status.alienvault.cloud) for further developments. We will continue to send you additional information as we work to resolve the issue.
Posted Apr 18, 2019 - 17:01 UTC
This incident affected: USM Anywhere Service.