Highly elevated request timeouts

Incident Report for Healthie

Resolved

Response times and usage all look normal, so we are marking this incident resolved. We will conduct a formal post-mortem internally.

At a higher level, we are incredibly cognizant of the fact that we build healthcare-critical, business-critical infrastructure. What we build needs to work quickly, securely, and reliably 24/7. We've traditionally maintained multiple nines of uptime, and these are unfortunately the worst performance issues we've experienced in the past few years. Our team is continuing to identify, investigate, and implement additional mitigation steps to ensure this does not happen again. Please reach out to hello@gethealthie.com (or to your CSM) with any questions.

Best,
Cavan
Co-founder & CTO, Healthie

Posted Sep 03, 2024 - 15:55 EDT

Update

We are seeing response times fully back to normal. We are continuing to monitor.

Posted Sep 03, 2024 - 15:18 EDT

Update

Due to the level of impact reported from the slowness, we took a step to scale up capacity that had (what we believed to be) a very small chance of causing temporary downtime. That unfortunately happened, but everything should be back up and running now, and we'd expect response times to come fully back to normal in the coming hour.

Posted Sep 03, 2024 - 15:07 EDT

Update

We are seeing a recurrence of the underlying DB host issue and are re-investigating.

Posted Sep 03, 2024 - 15:02 EDT

Update

Response times are about 2-4x slower than normal, which is leading to a slower-than-normal UI in the platform. We believe this is caused by increased load from customer actions that were not able to be [...]. Given that our host is still investigating this morning's issues, we are planning to wait until after the end of the business day to scale infrastructure up and down.

We sincerely apologize for the frustration caused by these slower response times, and are continuing to monitor and investigate options here.

Posted Sep 03, 2024 - 14:25 EDT

Monitoring

Response times look to be dropping back to normal. We will continue to monitor.

Posted Sep 03, 2024 - 11:27 EDT

Update

Aptible has moved our database to a new hardware instance, and things are back online. We're seeing slightly elevated response times as everyone returns to Healthie, and we are continuing to monitor to ensure they return to (and stay at) normal.

Posted Sep 03, 2024 - 11:16 EDT

Update

Aptible expects it to take 10-15 minutes to bring the affected database back online.

Posted Sep 03, 2024 - 11:04 EDT

Identified

Our host has confirmed that this is an issue with the database instance, and their SRE team is actively working on getting the database instance back up and running.

Posted Sep 03, 2024 - 10:59 EDT

Update

Our team is investigating, and metrics look similar to the issue we saw at 9:40. Our team is initiating a database restart (to move the database to a new instance with our host) and we have escalated this as urgent with our host's SRE team.

Posted Sep 03, 2024 - 10:51 EDT

Investigating

We are seeing elevated request timeouts leading to white screens and widespread errors. We are investigating and have reached back out to our host.

Posted Sep 03, 2024 - 10:48 EDT

This incident affected: Healthie (Production).