Increased error rate on uncached requests [Mumbai region]

Incident Report for ImageKit.io

Postmortem

Summary:
On July 7th, 2023, between 11:34 AM IST and 12:20 PM IST, our ImageKit CDN endpoint in the Mumbai region experienced an elevated error rate of around 4 to 5 percent.

Root cause:
The errors were traced back to a partial DNS outage within our server cluster. While our processing servers were working fine, our CoreDNS service was overwhelmed by an unusually high volume of DNS requests. This volume exceeded the capacity of the conntrack (connection tracking) table, so connections were dropped and the affected requests were never sent for processing.
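
To illustrate how this kind of conntrack pressure can be observed, here is a minimal Python sketch that reads the standard Linux netfilter counters `nf_conntrack_count` and `nf_conntrack_max` and reports how full the table is. The function names and the 80% alert threshold are assumptions made for this example and do not describe ImageKit's actual tooling.

```python
from pathlib import Path

# Standard Linux netfilter counters; they exist only when the
# nf_conntrack kernel module is loaded on the host.
CONNTRACK_COUNT = Path("/proc/sys/net/netfilter/nf_conntrack_count")
CONNTRACK_MAX = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_utilization(count_path=CONNTRACK_COUNT, max_path=CONNTRACK_MAX):
    """Return the fraction of the conntrack table currently in use."""
    count = int(Path(count_path).read_text().strip())
    limit = int(Path(max_path).read_text().strip())
    if limit <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / limit


def is_near_capacity(utilization, threshold=0.8):
    """True when utilization meets or exceeds the alert threshold.

    The 0.8 default is a hypothetical value chosen for illustration.
    """
    return utilization >= threshold


if __name__ == "__main__":
    try:
        used = conntrack_utilization()
    except FileNotFoundError:
        print("conntrack counters not available on this host")
    else:
        print(f"conntrack table {used:.1%} full; alert={is_near_capacity(used)}")
```

Once the table is full, the kernel drops new connections, which matches the symptom described above, so alerting well before that point leaves time to react.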

Resolution:
The issue was quickly resolved by scaling the CoreDNS service, which distributed the load and mitigated the problem. We have begun closely monitoring DNS performance and will make further changes, if required, after thorough validation.

Lessons learned
What went well:
Our monitoring systems detected the anomaly promptly, before customers started facing the issue. Our team responded swiftly and systematically, minimizing the impact on our services. Despite the elevated error rate, about 95% of total requests continued to be served with ultra-low latency.

What went wrong:
More proactive DNS performance monitoring could have averted this issue. We are now adding DNS to our regular monitoring routine to prevent a recurrence.
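
As one possible shape for such monitoring, the sketch below times DNS lookups and summarizes the failure rate and worst-case latency. It is a minimal illustration under stated assumptions: the function names and the sample hostname are hypothetical, and `socket.getaddrinfo` goes through the host's resolver, so results may reflect local caching rather than CoreDNS alone.

```python
import socket
import time


def time_dns_lookup(hostname, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, ok) for a single lookup of hostname."""
    start = time.perf_counter()
    try:
        resolver(hostname, None)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return time.perf_counter() - start, ok


def summarize(samples):
    """Reduce (elapsed, ok) samples to a failure rate and a max latency."""
    if not samples:
        raise ValueError("no samples")
    failures = sum(1 for _, ok in samples if not ok)
    return {
        "failure_rate": failures / len(samples),
        "max_latency_s": max(elapsed for elapsed, _ in samples),
    }


if __name__ == "__main__":
    # "example.com" is a placeholder; a real probe would target names
    # that the cluster actually resolves.
    samples = [time_dns_lookup("example.com") for _ in range(5)]
    print(summarize(samples))
```

Running a probe like this on a schedule and alerting on a rising failure rate or latency would surface DNS degradation before it shows up as request errors.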

Posted Jul 09, 2023 - 14:06 IST

Resolved

The error rate increased for new transformations in the Mumbai region. The issue started at 11:35 AM IST and lasted until 12:12 PM IST. During this time, the error rate for CDN miss requests increased. The issue was identified by our team and fixed. We are currently monitoring the systems and response times.

Posted Jul 07, 2023 - 11:35 IST