Resolved -
Webhooks are fully caught up, and response times are back to normal. This incident is resolved.
Oct 20, 16:14 EDT
Update -
We have been able to scale up to, and beyond, our typical mid-day capacity. Response times should be back to normal, and background processes (specifically webhooks) have begun to catch up. We continue to monitor.
Oct 20, 15:18 EDT
Update -
We are seeing some (smaller) successes among otherwise failing scaling attempts and have been able to add more capacity, which should help with response times. Response times are still much slower than normal and capacity is still well below normal; we continue to work to get everything back to normal.
Oct 20, 14:27 EDT
Update -
AWS has shared a positive update: "the internal subsystems of EC2 are now showing early signs of recovering in a few Availability Zones (AZs) in the US-EAST-1 Region. We are applying mitigations to the remaining AZs at which point we expect launch errors and network connectivity issues to subside."
Once we are able to launch new EC2 instances, we would expect response times to return to normal and this incident to be resolved. We're continuing to monitor.
Oct 20, 13:44 EDT
Update -
AWS is "in the process of validating a fix and will deploy to the first AZ as soon as they have confidence they can do so safely."
Oct 20, 13:05 EDT
Update -
AWS reports that they "have identified and are applying next steps to mitigate throttling of new EC2 instance launches." New EC2 instance launches are what need to be restored for Healthie's response times to return to normal, so this is a positive step. We continue to monitor.
Oct 20, 12:17 EDT
Update -
AWS has "narrowed down the source of the network connectivity issues" and identified the root cause. They are actively working on mitigations but are still "throttling requests for new EC2 instances" which blocks us from provisioning needed capacity. We continue to monitor the situation closely.
Oct 20, 11:49 EDT
Update -
AWS continues to experience severe networking issues (they are posting updates at https://health.aws.amazon.com/health/status). We continue to monitor and to retry provisioning additional capacity.
Oct 20, 11:10 EDT
Update -
We've confirmed we're seeing slower response times. We continue to work to provision the needed capacity. We have paused sending webhooks to reduce server traffic while we work to scale capacity. All paused webhooks will be sent once we're able to scale successfully.
Oct 20, 09:26 EDT
Investigating -
AWS continues to have issues with EC2 instances, which is preventing us from automatically scaling to our normal server capacity (https://health.aws.amazon.com/). This is leading to slower-than-normal response times. Our team is monitoring and working with our host to scale up to the needed capacity.
Oct 20, 09:10 EDT