We have since introduced a number of quick changes that seem to have helped mitigate the issue, and we will continue to apply improvements over the coming days. While we still have work to do, we feel we owe you an explanation now.
Some Background
Shotgun is a powerful tool, and we have traditionally been very open about how users can use it, letting them craft queries and UIs to match their needs. This flexibility is very convenient, but it creates complex challenges for the engineering team when it comes to predicting the impact of a change or a new feature.
Because of this, we are constantly monitoring client usage patterns, optimizing queries, and working with clients to improve their workflows.
Fire Fighting
We had our first incident on Monday. When this kind of thing happens, a couple of people on the team always jump in and try to identify what is putting unreasonable load on the system. We looked at the issue from a very broad angle, searching for an unusual number of requests or new patterns that our system does not yet optimize for. We identified a couple of culprits, but nothing obvious.
Then it happened again on Tuesday. Over the last few months we have put a lot of effort into reducing the contention points in our system, with positive results, so having the issue happen twice in a row raised some flags. We started to suspect that the performance degradation could be related to the 6.3 release, and we split the investigation team in two: one group kept optimizing queries from a broad angle, while the other looked into regressions or changes that could explain the sudden performance degradation. Meanwhile, the core team started getting another database cluster ready, so that if we couldn't find a solution quickly, we would at least be able to lower the load by rebalancing our clusters.
We attacked the slowest and most time-consuming queries in our system by adding indexes, and released a patch on Wednesday night aimed in part at taking stress off the system.
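To give a concrete idea of what such a patch involves, the snippet below adds a composite index matching a hot query's filter. It is purely illustrative: it assumes a PostgreSQL backend, and the table and column names are hypothetical rather than our actual schema.

```python
# Illustrative only: add an index so a frequently run filter no longer scans
# the whole table. Assumes PostgreSQL; table/column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=tracking_example")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

with conn.cursor() as cur:
    # Hypothetical hot query: versions filtered by project and creation date.
    cur.execute("""
        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_versions_project_created
        ON versions (project_id, created_at)
    """)

conn.close()
```

Building the index concurrently keeps the table available for reads and writes while the index is created, which matters when patching a live system.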
We were also able to correlate some of the performance issues with the Shotgun 6.3 release. While the release was not the direct cause, it allowed a poorly optimized workflow to be executed often enough to do damage. In the first three days of the week, because of the new Media App, we served ten times more versions and playlists than usual. We believe this additional load, together with the poorly optimized queries it exposed, was at the root of the performance issues.
What's Next
We still have a couple of improvements coming out in the upcoming days. We are also putting our new database cluster into production, and rebalancing will start today. While this does not directly solve the issue, it will give more breathing room to the clients on the affected cluster.
Further out, we have a number of actions planned to reduce the likelihood of such events. We have already put a lot of effort into isolating clients from one another, and we have more work to do at the database level. More specifically, we are looking into introducing different levels of quality of service for requests, in part to make sure the Web App stays responsive even under heavy load. Some of these features are being developed as we speak.
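As a rough sketch of what "different levels of quality of service" means in practice, the example below drains interactive Web App requests from a queue before bulk API or batch traffic. The tier names, priorities, and handler are assumptions made for illustration only, not our actual implementation.

```python
# Minimal sketch of per-request quality of service: interactive Web App
# traffic is served before bulk API/batch traffic so the UI stays responsive
# under load. Tier names and priorities are hypothetical.
import queue
import threading

PRIORITY = {"web_app": 0, "api": 1, "batch": 2}  # lower number = higher priority

requests = queue.PriorityQueue()

def submit(kind, payload):
    """Enqueue a request tagged with the priority of its traffic class."""
    requests.put((PRIORITY[kind], payload))

def worker():
    """Always pick the highest-priority request waiting in the queue."""
    while True:
        priority, payload = requests.get()
        print(f"handling {payload!r} (priority {priority})")  # stand-in for real work
        requests.task_done()

submit("batch", "nightly usage report")
submit("web_app", "load playlist page")

threading.Thread(target=worker, daemon=True).start()
requests.join()  # the Web App request is handled before the batch one
```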
We will also look into making sure our monitoring can help pinpoint issues before they reach production. Our QA team already invests a lot of effort in replicating client usage patterns to identify regressions, but we want to invest more on the performance regression side. We are also integrating a new reporting tool that will help us optimize our queries more effectively.
Conclusion
Finally, our sincere apologies for this week's issues; please be assured that we are not taking them lightly. We realize that Shotgun is an important part of your pipelines and workflows, and we will keep working hard to improve it in every way.