We have completed the internal post-mortem of the five database connectivity incidents that occurred in March, and we are confident we have identified their root cause. Our technical response working group has a solution in place, along with additional protections that will prevent the set of circumstances that created the incidents from recurring. The working group will now resume its regular responsibilities.
During multiple high-load periods, the database caching layer failed, causing read timeouts and rendering the app unusable. Restoring the system to a usable state took multiple hours across several incidents. The immediate issue was resolved by creating a new database caching cluster with larger capacity.
All users were affected. Clubhouse was unusable on and off during this time. Anyone using the app during these periods had a painful experience: it was clear that any attempt to save or update information would fail. There was no permanent loss of data for any customer.
We exceeded the network capacity of one database caching layer node. This caused peers to time out on
GET from cache, then
FETCH from storage, and
WRITE back to the cache layer. This further increased network load on the cache node beyond its capacity, so the node could never recover: it was stuck in a read/write stampede. It also pushed storage usage above provisioned capacity, because storage reads increased more than 10x.
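For the nerds following along, the GET → FETCH → WRITE cycle above can be simulated in a few lines. This is a minimal, deterministic sketch, not our actual infrastructure: the `cache` dict stands in for the caching-layer node, and the `storage_reads` counter shows the read amplification that overwhelmed storage.

```python
cache = {}          # illustrative stand-in for the caching-layer node
storage_reads = 0   # counts load pushed down to backing storage

def read_from_storage(key):
    """Simulated storage FETCH; the counter makes amplification visible."""
    global storage_reads
    storage_reads += 1
    return f"value-for-{key}"

def simulate_stampede(num_readers, key):
    # Phase 1: under a load spike, every concurrent reader issues its GET
    # before any of them has written the value back -- so all of them miss.
    misses = [cache.get(key) is None for _ in range(num_readers)]
    # Phase 2: each reader that missed FETCHes from storage and WRITEs the
    # result back, multiplying traffic on storage and on the cache node's
    # network link instead of a single read serving everyone.
    for missed in misses:
        if missed:
            cache[key] = read_from_storage(key)
    return storage_reads

print(simulate_stampede(50, "hot-key"))  # 50 storage reads for one key
```

Fifty concurrent readers of one cold key produce fifty storage reads and fifty cache writes, which is exactly the shape of load that kept the node from recovering.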
As of this afternoon, our back-end engineering team has high confidence that it has uncovered the root cause of the database connectivity issues, and it has already put short-term solutions in place to reduce the likelihood of service disruptions. This builds on the over-provisioning we started last week.
We have also put new monitoring systems in place to help us identify similar issues, and we should be able to respond more quickly than we have over the past few weeks. That said, we are not ready to declare victory; we want to monitor performance over the next few days. We plan to provide another update at that point.
Finally, we want you to know that when the issue is fully resolved we're planning to do a full and public root cause analysis and will share even more of the nerdy details than we already have.
The Clubhouse app has experienced five database connectivity issues this month resulting in more than four hours of total downtime.
This kind of performance is unacceptable and we have re-routed significant portions of our engineering teams to determine the root cause and remedy the situation as quickly as possible.
As part of our commitment to transparency, we wanted to share what is happening and what we’re doing to make sure Clubhouse performs as the rock-solid application we’ve all come to expect. (Heck we use Clubhouse to build Clubhouse!)
In the short term, our focus is on over-provisioning our servers and sub-systems to avoid having this situation come up again. As we do this, we’re undertaking work to ensure we handle sub-system capacity spikes more gracefully. We’ve increased the number of servers by 70%, doubled individual server capacity, and significantly increased the memory available to our databases.
We believe that these actions will resolve the symptom (downtime) while we eradicate the root cause. To do this, we've allocated the majority of our back-end engineering team to these issues; they are wholly dedicated to this problem until it is resolved. That means a handful of features we planned to ship will be slightly delayed, but feature delays pale in comparison to application availability, and we know that.
As anyone who builds software understands, it's almost impossible to predict exactly when an issue like this will be fully resolved. What we can offer, however, is frequent and consistent communication.
Until this issue is fully resolved, we promise to update this page with every major status update. Transparency is one of our core values, and that means being transparent not just internally but with you, our customers, partners, and prospects, as well. And just for good measure, even if we have nothing new to say, we'll update this page every few days and say exactly that. So you can be confident that you always have the most up-to-date information.
You can expect the next update on or before 5PM PT Monday March 29th.
Once per week in March so far, Clubhouse has encountered a cascading server failure that resulted in a loss of service for users. During each of these periods, server requests slowed, and many timed out, degrading service to the point that the application became unresponsive and unusable.
All of these events share a common thread: the backend services hit unusually high database disk read volume, followed by cascading sub-service failures. Read throughput spiked and oscillated between 4x and 270x normal for similar load during comparable periods. These read stampedes were triggered by a contributing condition that caused memory cache misses. The end result was that our API server p95+ response times increased dramatically, and a significant portion of requests timed out with 50x errors. Many users experienced this as an extremely slow or unavailable service.
If you have specific questions, concerns, or comments, please reach out. We want to make sure everyone is confident that Clubhouse will be there when you need us.