Thank you for your patience while our teams investigated the root cause of the outage that affected all Behavioral Assessment takers from early Saturday morning until Monday morning. Below we detailed what happened and what we are planing to do to avoid this in the future.
*There was an erroneous configuration issue introduced on Friday that didn't immediately present any problems. Over the weekend it ended up causing us to hit limits on a piece of our infrastructure that suddenly caused BA submissions to stop processing. *Unfortunately we never had hit one of these limits and so did not have alerting or monitoring to detect the issue. On Monday we were alerted to a large volume of service cases and quickly identified and resolved the configuration issue.
What Are We Doing
*Add robust alerting around the piece of infrastructure that hit its limits so we can respond immediately if it happens in the future
*Automate the management of those configuration values to prevent the underlying issue from happening again
*Do research into redundancy for the piece of infrastructure so we have a backup if it fails in the future
Please direct any further questions to firstname.lastname@example.org.