At around 10:46 UTC on 7th Nov, one of our queueing systems came under heavy load, which caused requests for Instant Learning and Zero Training models to queue up and frequently time out. We were alerted to the issue and quickly scaled the system up; by 11:15 UTC, the backlog had cleared and the incident was resolved.
We are adding additional alerting to this queueing system so that we can catch this type of issue well before the queue backs up.