File processing issue with Instant Learning and Zero Training models

Incident Report for Nanonets

Postmortem

At around 10:46 UTC on 7th Nov, one of our queueing systems came under heavy load, which caused requests for Instant Learning and Zero Training models to queue up and frequently time out. We were alerted to the problem and quickly scaled the system up; by 11:15 UTC the backlog had cleared and the incident was resolved.

We are adding further alerting to this queueing system so that we can catch this type of issue well before the queue backs up.

Posted Nov 08, 2024 - 06:15 UTC

Resolved

This incident has been resolved.

Posted Nov 07, 2024 - 11:15 UTC

Update

We are continuing to monitor for any further issues.

Posted Nov 07, 2024 - 11:09 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 07, 2024 - 11:09 UTC

Investigating

We are currently investigating this issue.

Posted Nov 07, 2024 - 10:46 UTC

This incident affected: API and Web App.