Incident Summary:
On 07-OCT-2024 11:30PM IST, a new feature was deployed, which inadvertently caused a spike in our RDS usage. This led to some APIs timing out, causing 5xx errors, and resulting in random file upload failures for affected users.
Root Cause:
The root cause of the issue was a feature deployed by one of our developers that had an unintended effect on the database. The feature introduced inefficiencies in database queries, leading to a significant increase in RDS usage. The increased load caused API requests to exceed their time limits, resulting in timeouts and random file upload failures.
Resolution:
As soon as the issue was identified, we reverted the change and restored normal operations. The system returned to a stable state shortly after the rollback. We are now implementing additional safeguards and testing procedures to prevent similar incidents in the future, ensuring we do not experience such unanticipated impacts again.
Timeline:
OCT 7th 11:50PM IST - OCT 8th 12:30 AM IST