Summary:
On July 7th, between 11:34 AM IST and 12:20 PM IST, our ImageKit CDN endpoint in the Mumbai region saw an elevated error rate of around 4 to 5 percent.
Root cause:
The errors were traced to a partial DNS outage within our server cluster. Our processing servers were healthy, but our CoreDNS service was overwhelmed by an unusually high volume of DNS requests. This flood of requests exceeded the capacity of the conntrack table, so connections were dropped and requests never reached the processing servers.
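To illustrate the failure mode described above, the following is a minimal sketch of a conntrack utilization check, assuming a Linux host that exposes the standard `nf_conntrack_count` and `nf_conntrack_max` files under `/proc/sys/net/netfilter`. The function names and the 80 percent alert threshold are hypothetical choices for illustration and do not describe our actual monitoring setup.

```python
from pathlib import Path

# Standard location of the netfilter conntrack counters on Linux.
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def read_conntrack(base: Path = CONNTRACK_DIR) -> tuple[int, int]:
    """Return (current entries, maximum entries) of the conntrack table."""
    count = int((base / "nf_conntrack_count").read_text().strip())
    limit = int((base / "nf_conntrack_max").read_text().strip())
    return count, limit


def conntrack_utilization(count: int, limit: int) -> float:
    """Fraction of the conntrack table currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return count / limit


def should_alert(count: int, limit: int, threshold: float = 0.8) -> bool:
    """True when utilization reaches the (hypothetical) alert threshold."""
    return conntrack_utilization(count, limit) >= threshold


if __name__ == "__main__":
    try:
        count, limit = read_conntrack()
    except OSError:
        # Non-Linux host, or the conntrack module is not loaded.
        print("conntrack counters are not available on this host")
    else:
        print(f"conntrack: {count}/{limit} entries in use")
        if should_alert(count, limit):
            print("warning: conntrack table is close to capacity")
```

Once the table is full, the kernel drops packets for new connections, so alerting well below 100 percent leaves time to scale before requests are lost.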
Resolution:
The issue was quickly resolved by scaling the CoreDNS service, which distributed the load and mitigated the problem. We are now closely monitoring DNS performance and will make further changes, if required, after thorough validation.
Lessons learned
What went well:
Our monitoring systems detected the anomaly promptly, before customers started facing the issue. Our team responded swiftly and systematically, ensuring minimal impact on our services. Despite the elevated error rate, around 95 percent of requests continued to be served with ultra-low latency.
What went wrong:
A more proactive approach to DNS performance monitoring could have averted this issue. We are now incorporating DNS into our regular monitoring routine to prevent a recurrence.
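As a rough illustration of what a DNS check in a monitoring routine might look like, here is a minimal sketch of a resolution probe built on Python's standard `socket.getaddrinfo`. The helper names `probe_dns` and `failure_rate`, and the hostname in the example, are hypothetical; this is not a description of the monitoring we actually run.

```python
import socket
import time
from typing import Callable, Optional, Sequence


def probe_dns(
    hostname: str,
    resolver: Callable = socket.getaddrinfo,
) -> Optional[float]:
    """Resolve hostname once; return the latency in seconds, or None on failure."""
    start = time.monotonic()
    try:
        resolver(hostname, None)
    except OSError:
        # socket.gaierror (a subclass of OSError) signals a failed lookup.
        return None
    return time.monotonic() - start


def failure_rate(results: Sequence[Optional[float]]) -> float:
    """Fraction of probes that failed to resolve."""
    if not results:
        raise ValueError("no probe results")
    return sum(1 for r in results if r is None) / len(results)


if __name__ == "__main__":
    # "example.com" is a placeholder; a real probe would target internal names.
    samples = [probe_dns("example.com") for _ in range(5)]
    print(f"failure rate: {failure_rate(samples):.0%}")
```

Tracking both the failure rate and the latency of successful lookups over time would surface a saturated DNS service before it starts to affect request processing.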