Following the incident that affected the Spektrix platform on Thursday 11th July, we’re providing this report on what happened, how we responded, and what we have learnt from it.
We are sincerely sorry for the disruption and we understand the difficulties that this caused you.
- At 10:45am BST / 5:45am EDT, our platform monitoring systems reported that one of our web servers had crashed. These servers are the components of Spektrix that respond to each click you and your customers make. Following our internal procedure, we assembled an incident team to investigate.
- From this point on, web servers continued to crash, recover and crash again throughout the day, which in turn increased the number of requests queuing up in the system.
The combination of these factors caused a number of symptoms:
- Users sometimes saw a ‘page not found’ error when clicking around the system, because the web server at the other end of their request had crashed.
- Responses became very slow. Our servers temporarily store, or ‘cache’, data so that they can respond quickly, but these caches are lost whenever a server crashes. Each crash therefore forced the system to rebuild its caches from scratch, and response times rose significantly; the sketch below illustrates why.
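To make the caching point above concrete, here is a minimal sketch of the widely used cache-aside pattern, written in Python. It is an illustration only, not Spektrix’s actual code, and the function names and timings are assumptions.

```python
import time

# Illustrative cache-aside lookup; not Spektrix's actual code.
cache = {}  # held in memory, so it is lost entirely when the server crashes

def fetch_from_database(key):
    """Stand-in for a comparatively slow database query."""
    time.sleep(0.2)  # assumed query cost, for illustration only
    return f"value-for-{key}"

def get(key):
    if key in cache:
        return cache[key]             # warm path: answered immediately
    value = fetch_from_database(key)  # cold path: pays the full query cost
    cache[key] = value                # repopulate so later requests are fast
    return value
```

After a crash the cache starts empty, so every request takes the slow path until the cache has been rebuilt; with servers repeatedly crashing, the caches never stayed warm for long.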
The impact on system performance was significant and caused major disruption for you and your customers for over 14 hours.
We kept you updated throughout the incident via this status page and direct emails.
We resolved the incident by 1am BST / 8pm EDT; the details of this are below.
Why did this happen?
- We have identified that an existing bug in our software had been triggered.
- Whilst we undertake rigorous software testing, bugs affect all software and can be extremely difficult to spot. On this occasion, a specific and unusual set of circumstances triggered the bug on Thursday, degrading system performance; the sketch below shows the general pattern.
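The sketch below is purely hypothetical and is not the actual bug. It illustrates the general pattern: code that behaves correctly for every workload seen in testing, yet crashes a worker the first time a rare combination of circumstances reaches it.

```python
# Hypothetical illustration only; this is not the actual defect.
def average_wait_time(total_wait_seconds, completed_requests):
    # Correct for every ordinary workload, but if an unusual sequence of
    # events ever produces a snapshot with zero completed requests, this
    # raises ZeroDivisionError. Unhandled, that exception can take the
    # whole worker process down with it.
    return total_wait_seconds / completed_requests

print(average_wait_time(120.0, 40))  # ordinary input: works fine
print(average_wait_time(0.0, 0))     # rare input: crashes the process
```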
Why did this problem take so long to resolve?
- Early on Thursday, we carried out a significant piece of well-tested maintenance work on our system infrastructure. Given that the system’s performance began to suffer shortly after this maintenance was completed, we focused our investigation on that work (the most recent change to a system is usually responsible for any new problems that arise).
- We spent several hours attempting to isolate the problem and identify where in our infrastructure it was coming from, working on the assumption that the root cause was introduced during our maintenance.
- Having exhausted this path, we turned to other, less likely causes. We used new monitoring tools to capture the state of a server at the precise moment it crashed; the sketch after this list shows one common approach.
- Through this process we identified the bug that was causing the issue and addressed it. At around 1:00am BST / 8:00pm EDT, we stopped the problem and the system returned to normal performance.
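As an illustration of this kind of crash-time diagnostic, the Python sketch below registers a hook that records what a process was doing at the moment an unhandled exception killed it. It is a minimal sketch of one common approach, not the specific tooling we used, and the log file name is an assumption.

```python
import sys
import traceback
from datetime import datetime, timezone

# Minimal sketch of crash-time diagnostics, assuming a Python process;
# not the specific tools used during the incident.
def crash_hook(exc_type, exc_value, exc_tb):
    # Record the full stack trace at the precise moment of the crash.
    with open("crash-report.log", "a") as log:
        log.write(f"--- crash at {datetime.now(timezone.utc).isoformat()} ---\n")
        traceback.print_exception(exc_type, exc_value, exc_tb, file=log)
    sys.__excepthook__(exc_type, exc_value, exc_tb)  # then fail as before

sys.excepthook = crash_hook
```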
A common question we have been asked is whether a peak in demand or traffic to the Spektrix system triggered the incident. It did not. We actively monitor the system to ensure that we have enough computing resources for the peaks in demand that ticketing requires, along the lines of the sketch below.
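As a rough illustration of that kind of capacity monitoring, the Python sketch below compares utilisation against a headroom threshold. The names and the 70% threshold are assumptions, not our actual configuration.

```python
# Illustrative capacity check; not Spektrix's monitoring stack.
HEADROOM_THRESHOLD = 0.7  # assumed alert level: 70% utilisation

def check_capacity(active_requests, max_concurrent_requests):
    # Alert well before demand approaches available resources, so
    # on-sale peaks never exhaust the pool of web servers.
    utilisation = active_requests / max_concurrent_requests
    if utilisation > HEADROOM_THRESHOLD:
        alert(f"Capacity at {utilisation:.0%}; consider adding web servers")

def alert(message):
    print(message)  # stand-in for a real paging system

check_capacity(850, 1000)  # 85% utilisation would trigger an alert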
What are the Spektrix team doing to prevent this from happening again?
As a result of the technical lessons learnt from last Thursday’s incident, we will be able to identify this specific type of issue more quickly and systematically.
It is our priority to prevent such incidents from occurring, but when they do happen we take the lessons learnt seriously, with a view to continually improving our approach and our underlying technology.
If you have any questions about this report, or would like to give feedback if you have not already done so, please contact us at firstname.lastname@example.org.