Database Downtime: User Lookup
Incident Report for Loom
Resolved
We had complete downtime across all systems. The downtime had the following causes:

* A db lock implementation that spun indefinitely against the database and didn't release gracefully
* An unindexed, large user table scan that resulted in long lookup times when logging users in and signing them up
* The two issues compounding each other, which further increased the load on the database
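
To illustrate the second cause, here is a minimal sketch of how an unindexed lookup forces a full table scan and how an index changes the query plan. It uses SQLite from the Python standard library purely as a stand-in, since this report does not name the database involved. The `users` table, the `email` column, and the `idx_users_email` index are hypothetical names chosen for the example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail text of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the plan description.
    return [row[3] for row in rows]


# Hypothetical schema: a user table looked up by email at login and signup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10000)],
)

lookup = "SELECT id FROM users WHERE email = ?"
params = ("user42@example.com",)

# Without an index on email, the planner has to scan the whole table.
plan_before = query_plan(conn, lookup, params)

# With an index, the same lookup becomes an index search.
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, params)

print("before:", plan_before)
print("after: ", plan_after)
```

The plan reported before the index describes a scan of `users`, and the plan after it describes a search that uses `idx_users_email`. The exact wording varies between SQLite versions, and other databases expose the same information through their own `EXPLAIN` output.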

Remediation:

* Replaced our db lock implementation with one that does not spin
* Denormalized the lookup information off the user table and added indexes for faster lookup times
* Implemented this status page
* Upgraded the database to the latest version
* Upgraded the server the database runs on to double its CPU and memory
* Created internal dashboards that give us detailed information about the health of our database queries, so we can diagnose and address query-related issues before they impact production traffic
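
As an illustration of the first remediation item, the sketch below shows one common way to acquire a lock without spinning indefinitely: retry with exponential backoff and jitter, give up at a deadline, and always release on exit. This is not our actual implementation. `bounded_lock`, `LockTimeout`, and the `try_acquire` / `release` callables are hypothetical names, and a `threading.Lock` stands in for a database-backed lock.

```python
import random
import threading
import time
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when the lock cannot be acquired before the deadline."""


@contextmanager
def bounded_lock(try_acquire, release, timeout=5.0, base_delay=0.05, max_delay=1.0):
    """Acquire a lock with backoff and a deadline, and release it on exit.

    try_acquire: callable returning True if the lock was obtained (non-blocking).
    release: callable that releases the lock.
    """
    deadline = time.monotonic() + timeout
    delay = base_delay
    while not try_acquire():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Give up instead of retrying forever against the lock holder.
            raise LockTimeout(f"lock not acquired within {timeout} seconds")
        # Sleep between attempts; jitter keeps waiters from retrying in lockstep.
        time.sleep(min(delay + random.uniform(0, delay), remaining))
        delay = min(delay * 2, max_delay)
    try:
        yield
    finally:
        # Release even if the protected block raises.
        release()


# Usage with an in-process lock standing in for a database lock.
demo_lock = threading.Lock()
with bounded_lock(lambda: demo_lock.acquire(blocking=False), demo_lock.release, timeout=1.0):
    pass  # critical section
```

The sleep between attempts keeps waiting callers from issuing back-to-back requests, and the deadline turns a stuck lock into a visible error rather than unbounded load.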
Posted 3 months ago. Apr 17, 2018 - 12:00 UTC