Following the incident that affected the Spektrix platform on Thursday 11th July, we’re providing this report on what happened, how we responded, and what we have learnt from it.
We are sincerely sorry for the disruption and we understand the difficulties that this caused you.
- At 10:45am BST / 5:45am EDT, our platform monitoring systems reported that one of our web servers had crashed. These servers are the components of Spektrix that respond to each click you and your customers make. Following our internal procedure, we assembled an incident team to investigate.
- From this point on, web servers continued to crash, recover and crash again throughout the day, which in turn increased the number of requests queuing up in the system.
The combination of these factors caused a number of symptoms:
- Users sometimes saw a ‘page not found’ error when clicking around the system, because the web server at the other end of their request had crashed.
- Responses became very slow. Our servers temporarily store, or ‘cache’, data so that they can respond quickly, but these caches are lost whenever a server crashes. Each crash therefore forced the system to rebuild its caches from scratch, and response times rose significantly; the sketch below illustrates why.
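To make the caching point above concrete, here is a minimal sketch of the widely used cache-aside pattern, written in Python. It is an illustration only, not Spektrix’s actual code, and the function names and timings are assumptions.

```python
import time

# Illustrative cache-aside lookup; not Spektrix's actual code.
cache = {}  # held in memory, so it is lost entirely when the server crashes

def fetch_from_database(key):
    """Stand-in for a comparatively slow database query."""
    time.sleep(0.2)  # assumed query cost, for illustration only
    return f"value-for-{key}"

def get(key):
    if key in cache:
        return cache[key]             # warm path: answered immediately
    value = fetch_from_database(key)  # cold path: pays the full query cost
    cache[key] = value                # repopulate so later requests are fast
    return value
```

After a crash the cache starts empty, so every request takes the slow path until the cache has been rebuilt; with servers repeatedly crashing, the caches never stayed warm for long.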
The impact on system performance was significant and caused major disruption for you and your customers for over 14 hours.
We kept you updated throughout the incident via this status page and direct emails.
We resolved the incident by 1am BST / 8pm EDT; the details of this are below.
Why did this happen?
- We have identified that an existing bug in our software had been triggered.
- Whilst we undertake rigorous software testing, bugs affect all software and can be extremely difficult to spot. On this occasion, a specific and unusual set of circumstances triggered the bug on Thursday, degrading system performance; the sketch below shows the general pattern.
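The sketch below is purely hypothetical and is not the actual bug. It illustrates the general pattern: code that behaves correctly for every workload seen in testing, yet crashes a worker the first time a rare combination of circumstances reaches it.

```python
# Hypothetical illustration only; this is not the actual defect.
def average_wait_time(total_wait_seconds, completed_requests):
    # Correct for every ordinary workload, but if an unusual sequence of
    # events ever produces a snapshot with zero completed requests, this
    # raises ZeroDivisionError. Unhandled, that exception can take the
    # whole worker process down with it.
    return total_wait_seconds / completed_requests

print(average_wait_time(120.0, 40))  # ordinary input: works fine
print(average_wait_time(0.0, 0))     # rare input: crashes the process
```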
Why did this problem take so long to resolve?
- Early on Thursday, we carried out a significant piece of well-tested maintenance work on our system infrastructure. Given that the system’s performance began to suffer shortly after this maintenance was completed, we focused our investigation on that work (the most recent change to a system is usually responsible for any new problems that arise).
- We spent several hours attempting to isolate the problem and identify where in our infrastructure it was coming from, working on the assumption that the root cause was introduced during our maintenance.
- Having exhausted this path, we turned to other, less likely causes. We used new monitoring tools to capture the state of a server at the precise moment it crashed; the sketch after this list shows one common approach.
- Through this process we identified the bug that was causing the issue and addressed it. At around 1:00am BST / 8:00pm EDT, we stopped the problem and the system returned to normal performance.
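As an illustration of this kind of crash-time diagnostic, the Python sketch below registers a hook that records what a process was doing at the moment an unhandled exception killed it. It is a minimal sketch of one common approach, not the specific tooling we used, and the log file name is an assumption.

```python
import sys
import traceback
from datetime import datetime, timezone

# Minimal sketch of crash-time diagnostics, assuming a Python process;
# not the specific tools used during the incident.
def crash_hook(exc_type, exc_value, exc_tb):
    # Record the full stack trace at the precise moment of the crash.
    with open("crash-report.log", "a") as log:
        log.write(f"--- crash at {datetime.now(timezone.utc).isoformat()} ---\n")
        traceback.print_exception(exc_type, exc_value, exc_tb, file=log)
    sys.__excepthook__(exc_type, exc_value, exc_tb)  # then fail as before

sys.excepthook = crash_hook
```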
A common question we have been asked is whether a peak in demand or traffic to the Spektrix system triggered the incident. It did not. We actively monitor the system to ensure that we have enough computing resources for the peaks in demand that ticketing requires, along the lines of the sketch below.
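As a rough illustration of that kind of capacity monitoring, the Python sketch below compares utilisation against a headroom threshold. The names and the 70% threshold are assumptions, not our actual configuration.

```python
# Illustrative capacity check; not Spektrix's monitoring stack.
HEADROOM_THRESHOLD = 0.7  # assumed alert level: 70% utilisation

def check_capacity(active_requests, max_concurrent_requests):
    # Alert well before demand approaches available resources, so
    # on-sale peaks never exhaust the pool of web servers.
    utilisation = active_requests / max_concurrent_requests
    if utilisation > HEADROOM_THRESHOLD:
        alert(f"Capacity at {utilisation:.0%}; consider adding web servers")

def alert(message):
    print(message)  # stand-in for a real paging system

check_capacity(850, 1000)  # 85% utilisation would trigger an alert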
What are the Spektrix team doing to prevent this from happening again?
As a result of the technical lessons learnt from last Thursday’s incident, we will be able to identify this specific type of issue more quickly and systematically.
It is our priority to prevent such incidents from occurring, but when they do happen we take the lessons learnt seriously, with a view to continually improving our approach and our underlying technology.
If you have any questions about this report, or would like to give feedback if you have not already done so, please contact us at firstname.lastname@example.org.