High Response Times for Instant Learning models in the US region

Incident Report for Nanonets

Postmortem

Overview
One of our secondary databases experienced connection issues following an unexpected spike in load. This sudden surge placed significant pressure on the database engine, causing request latency to increase substantially.

Root Cause
A rapid increase in incoming traffic caused a large number of concurrent connections to accumulate on a secondary database. This overwhelmed the database's connection-handling capacity, resulting in slow responses and delayed processing for dependent services.
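
The report does not name the database engine or client stack. As an illustrative mitigation only, the sketch below caps client-side connections with an application pool (using SQLAlchemy against a hypothetical replica URL), so a traffic burst queues in the application instead of piling connections onto the database:

```python
# Hypothetical sketch: capping client-side connections to a read replica.
# The URL, table, and pool numbers are illustrative, not Nanonets' actual setup.
from sqlalchemy import create_engine, text

REPLICA_URL = "postgresql://app:secret@replica.internal:5432/predictions"  # hypothetical

engine = create_engine(
    REPLICA_URL,
    pool_size=20,      # steady-state connections held open
    max_overflow=10,   # extra burst headroom beyond pool_size
    pool_timeout=5,    # seconds a request waits for a free connection
    pool_recycle=1800, # recycle long-lived connections proactively
)

def fetch_prediction(file_id: str):
    # Connections are checked out from the pool and returned on exit, so
    # callers beyond the 30-connection cap wait (and eventually error)
    # instead of opening new connections against the replica.
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT result FROM predictions WHERE file_id = :fid"),
            {"fid": file_id},
        ).first()
        return row[0] if row else None
```

With a pool cap like this, a surge degrades into bounded queuing and fast failures on the application side rather than an unbounded connection pile-up on the database.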

Resolution
Our engineering team quickly identified the issue, took corrective actions to stabilize the database, and restored normal operation. Once the database recovered, the system began processing the accumulated backlog. This recovery phase took additional time due to the volume of pending requests.
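
The report does not describe how the backlog was drained. A common pattern, shown as a minimal sketch under assumptions (the queue contents, worker count, and process() helper are all hypothetical), is to replay queued requests with bounded concurrency so the freshly recovered database is not overloaded again:

```python
# Hypothetical sketch: draining a request backlog with bounded concurrency
# so the recovered database is not immediately overwhelmed a second time.
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # illustrative cap, tuned to what the database can absorb

def process(request_id: str) -> None:
    ...  # re-run the queued prediction request (placeholder)

def drain_backlog(pending: list[str]) -> None:
    # At most MAX_WORKERS requests hit the database concurrently;
    # the remainder wait in the executor's internal queue.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for request_id in pending:
            pool.submit(process, request_id)
```

Under this pattern, recovery time scales with the backlog size divided by the worker cap, which matches the report's note that clearing pending requests took additional time after the database itself was healthy.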

Impact

  • Affected: Instant Learning models in the US region only
  • Unaffected: All other regions and services continued operating normally throughout the incident
Posted Dec 04, 2025 - 05:59 UTC

Resolved

This incident has been resolved.
Posted Dec 03, 2025 - 21:10 UTC

Monitoring

Our sync prediction API for Instant Learning models is now operating normally. Async results for older files are available for most users; for a few users, a remaining backlog is still being cleared.
Posted Dec 03, 2025 - 20:50 UTC

Update

We have identified the issue and increased our throughput to process requests faster. However, clearing the backlog may take some additional time. Our team is continuously monitoring the situation. We apologize for the inconvenience and appreciate your patience.
Posted Dec 03, 2025 - 17:08 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Dec 03, 2025 - 16:35 UTC

Investigating

We are currently investigating this issue.
Posted Dec 03, 2025 - 15:08 UTC
This incident affected: API.