Incident Summary:
Users of instant learning models experienced elevated response times because file processing in our system slowed down.
Root Cause:
One of our GPU nodes went down. This significantly increased file processing times, which in turn slowed responses for users of instant learning models.
Resolution:
We identified the failed machine and removed it from our pool, which restored normal processing times. As a long-term fix, we are implementing automatic detection and removal of faulty nodes from the pool, with their workload redistributed to healthy nodes, so that a single node or machine going down does not impact file processing times.
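To illustrate the kind of mechanism described above, here is a minimal Python sketch. It is not our production implementation. The names NodePool, is_healthy, and max_failures are hypothetical, and the health probe is assumed to be a function that returns True for a responsive node. A node is evicted after a number of consecutive failed probes, and its queued jobs are reassigned to the least-loaded remaining nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class NodePool:
    """Hypothetical pool tracking worker nodes and the jobs queued on each."""

    is_healthy: Callable[[str], bool]  # assumed probe, e.g. a heartbeat check
    max_failures: int = 3              # consecutive failed probes before eviction
    jobs: Dict[str, List[str]] = field(default_factory=dict)
    failures: Dict[str, int] = field(default_factory=dict)
    pending: List[str] = field(default_factory=list)  # jobs with no node to run on

    def add_node(self, node: str) -> None:
        """Register a node with an empty queue and a clean failure count."""
        self.jobs.setdefault(node, [])
        self.failures.setdefault(node, 0)

    def assign(self, job: str) -> str:
        """Queue a job on the node that currently has the fewest jobs."""
        if not self.jobs:
            raise RuntimeError("no nodes available")
        node = min(self.jobs, key=lambda n: len(self.jobs[n]))
        self.jobs[node].append(job)
        return node

    def check(self) -> List[str]:
        """Probe every node once; evict repeat failures and requeue their jobs."""
        evicted: List[str] = []
        orphaned: List[str] = []
        for node in list(self.jobs):
            if self.is_healthy(node):
                self.failures[node] = 0  # a successful probe resets the count
                continue
            self.failures[node] += 1
            if self.failures[node] >= self.max_failures:
                orphaned.extend(self.jobs.pop(node))
                del self.failures[node]
                evicted.append(node)
        # Redistribute only after the pass, so jobs never land on a node
        # that is evicted later in the same pass.
        for job in orphaned:
            if self.jobs:
                self.assign(job)
            else:
                self.pending.append(job)
        return evicted
```

In practice, check() would run on a timer. Requiring several consecutive failures before eviction is a design choice that avoids removing a node because of a single missed probe.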
We sincerely apologize for the inconvenience this incident may have caused. We understand the importance of reliable and fast service, and we are taking the necessary steps to prevent such issues from occurring in the future. We appreciate your patience and understanding.