We would like to take this opportunity to share more information with our customers regarding the USM Anywhere service disruption on April 18, 2019. This postmortem describes the events and contributing factors that led to the service disruption, as well as the actions we took to respond, recover, and notify customers. It also describes the steps we are taking to improve our processes to prevent these types of issues from happening again.
This service disruption began at approximately 9:20AM CT on Thursday, April 18, 2019, as the USM Anywhere sensors began to systematically restart as part of a regular software release process. This process completed at approximately 10:30AM. AT&T Cybersecurity engineers, who were monitoring the update process as part of our normal operations, noticed that some sensors were failing to reconnect to the USM Anywhere service after the update and immediately began to investigate. At approximately 11:00AM, our internal monitoring systems detected and alerted us to an issue with some of the sensors on our “canary deployments,” which had received the release before the rest of the customer base.
Shortly after 11:00AM, we determined this issue to be a severity level 0 (our most severe designation) service incident. Within the hour, we had notified all customers through status.alienvault.cloud.
By approximately 1:30PM, AT&T Cybersecurity engineers had identified the defect that was causing the sensors to fail. The defect was introduced by a code change made to two AlienApps (the AlienApp for Carbon Black and the AlienApp for Forensics and Response) that unexpectedly disrupted the decryption of those apps’ configuration files. This prevented the apps from starting, which in turn disrupted the initialization and connection of the sensor. The issue affected sensors differently depending on when the AlienApp had initially been configured. Because of this difference, the standard regression tests we run did not detect the issue earlier in our release process.
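To illustrate why a defect of this kind can slip past regression tests, here is a minimal sketch in Python. It is not our actual code: the two storage formats, the function names, and the use of base64 as a stand-in for real encryption are all assumptions made purely for illustration. The idea it shows is that a test suite should load configuration fixtures created at different points in time, not only freshly created ones.

```python
import base64
import json

# Hypothetical stand-ins: base64 takes the place of real encryption, and
# the "v2:" prefix marks a newer storage format. Neither reflects the
# actual AlienApp implementation.

def encode_legacy(config: dict) -> str:
    """Format an app might have used when it was configured long ago."""
    return base64.b64encode(json.dumps(config).encode()).decode()

def encode_current(config: dict) -> str:
    """Format used for apps configured recently."""
    return "v2:" + encode_legacy(config)

def load_config(blob: str) -> dict:
    """Loader that accepts both the legacy and the current format."""
    if blob.startswith("v2:"):
        blob = blob[len("v2:"):]
    return json.loads(base64.b64decode(blob))

def load_config_regressed(blob: str) -> dict:
    """Illustrative bug: a change that only handles the current format."""
    if not blob.startswith("v2:"):
        raise ValueError("unrecognized config format")
    return json.loads(base64.b64decode(blob[len("v2:"):]))

def failing_fixtures(loader, fixtures: dict) -> list:
    """Return the names of fixtures the loader cannot read back correctly."""
    failures = []
    for name, (blob, expected) in fixtures.items():
        try:
            if loader(blob) != expected:
                failures.append(name)
        except Exception:
            failures.append(name)
    return failures

SAMPLE = {"api_host": "example.invalid", "enabled": True}

# One fixture per "era" of configuration, so both code paths are exercised.
FIXTURES = {
    "configured_before_change": (encode_legacy(SAMPLE), SAMPLE),
    "configured_after_change": (encode_current(SAMPLE), SAMPLE),
}

if __name__ == "__main__":
    print(failing_fixtures(load_config, FIXTURES))
    print(failing_fixtures(load_config_regressed, FIXTURES))
```

In this sketch, a test that used only the recently configured fixture would pass against the regressed loader. The failure appears only when the older fixture is included.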
By 5:45PM, the AT&T Cybersecurity engineering team had built, tested, and approved the best rollout option for the fix and released it into production. The team monitored the rollout closely, and by 7:00PM, our monitoring systems indicated that sensor operations had returned to a normal threshold. At this time, the incident was considered resolved.
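As a rough illustration of what checking sensor operations against a threshold can look like, the Python sketch below computes the fraction of sensors that are connected and compares it to a cutoff. The `SensorStatus` type, the function names, and the 0.95 threshold are hypothetical and chosen only for the example. They do not describe our actual monitoring systems.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SensorStatus:
    sensor_id: str    # hypothetical identifier
    connected: bool   # True if the sensor has reconnected to the service

def connected_fraction(statuses: List[SensorStatus]) -> float:
    """Fraction of sensors currently connected (0.0 for an empty list)."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s.connected) / len(statuses)

def is_healthy(statuses: List[SensorStatus], threshold: float = 0.95) -> bool:
    """True when the connected fraction meets the assumed threshold."""
    return connected_fraction(statuses) >= threshold
```

The same kind of check could be applied to a canary group before a release proceeds to the wider customer base, under the same assumptions.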
From approximately 9:20AM CT on Thursday, April 18, 2019, when the sensors began to update, until approximately 7:00PM CT, when most of the sensors had received the fix, any USM Anywhere sensors affected by this reconnection issue were unable to collect, process, and transmit data from the monitored environment to USM Anywhere. It’s important to note that not all USM Anywhere sensors were affected. All other functions of USM Anywhere continued to operate normally during the incident, and data collection by unaffected sensors was not interrupted.
When a service disruption or outage occurs, we take prompt action to notify customers with timely, accurate information. Our first and best source of information is status.alienvault.cloud for USM Anywhere and status-central.alienvault.cloud for USM Central. If you are not already subscribed to receive email and text alerts regarding service disruption events, please take a moment to do so right now. In addition, we notified affected customers through an email communication on the evening of Thursday, April 18, 2019.
Please note that in the event of a service disruption, we do our best to accurately define the scope of affected customers and to coordinate proactive email communication in a rapid manner.
As described above, a confluence of events led to this incident. As part of our incident response process, we have conducted internal retrospectives to discover what we can learn from this incident and how we can improve our processes to help prevent this type of issue from happening again. We have identified improvements to our scope of testing as it relates to this issue as well as additional steps in our release process. In addition, we have identified improvements to our communications process to help streamline the identification and notification of affected customers to provide more expedient communication in the future.
Finally, we would like to apologize for the inconvenience this has caused. Our goal is to provide security monitoring at a level of service that you can rely on to monitor your critical assets in support of your security and compliance goals. With USM Anywhere, we are able to reduce a substantial portion of the overhead required to manage a security monitoring product by providing maintenance and updates in an automated fashion. In times like this, we commit to learning from this service disruption to drive improvement across our processes. We encourage you to share your thoughts with us on how we can do better. If you have any questions or concerns, please reach out directly to your AT&T Cybersecurity account manager at any time or contact our support channel at success.alienvault.com.