Dr. ReadTimeout or: How I learned to stop worrying and love the request matrix

Ruby’s Net::ReadTimeout Error

We recently added a feature to one of the products we own. This feature runs a nightly job that updates thousands of items from an external API owned by a large company. Because of the way the API is structured, we have to query each item separately, which means the job makes thousands of API calls within a short span of time.

Within a few days of deploying the feature, our logging service was showing thousands of new errors occurring around the time of the nightly job. This was, of course, deeply concerning. The vast majority of these errors were Ruby's Net::ReadTimeout error. The gist of Net::ReadTimeout is that Ruby has successfully established a TCP connection with the host and sent the request, but did not receive the full response payload before the timeout value was hit. These particular requests were being made with the HTTParty gem, a fantastic gem for making network calls.
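
Under the hood HTTParty wraps Ruby's Net::HTTP, so the failure mode can be sketched with plain Net::HTTP. This is a minimal sketch, assuming a placeholder host, path, and timeout value:

    require "net/http"

    # Minimal sketch; the host, path, and 5-second timeout are placeholders.
    # Net::ReadTimeout is raised only after the TCP connection is open and the
    # request has been written; the failure is in reading the response in time.
    http = Net::HTTP.new("api.example.com", 443)
    http.use_ssl = true
    http.read_timeout = 5 # seconds to wait for response data

    begin
      http.get("/items/123")
    rescue Net::ReadTimeout
      puts "Connected and sent the request, but no response arrived within 5 seconds"
    end

The key point is that the connection and the request both succeed; only the read of the response times out, which is why there is no status code or error body to go on.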

The Reason for the Error Is Hard to Find

If you have ever run into this error before and looked it up online, you will find that it is unusually (and infuriatingly) hard to get information on its source or ways to solve it.

This is likely for three reasons:

  1. It is a relatively uncommon error
  2. The error can have any number of disparate causes, anything from a misconfigured firewall to a loose cable somewhere, many of them outside the developer's control
  3. In some cases, the problem resolves 'on its own' (not really 'on its own', more like 'whenever the issue that was outside the developer's control happens to be fixed by someone else')

Since the error affected a secondary part of the app and didn't directly impact the user experience, we let it be for a few days, hoping it would resolve on its own. When some time passed and the issue persisted, we thought that maybe the calls were just slow (i.e. still completing, but outside the read timeout period), perhaps lagging because our call volume was pushing up the API's latency. But when we looked into it we found that HTTParty has a default read timeout of 60 seconds. No request should take 60 seconds, so we decided to look a bit deeper. Also, since there was no response from the API (not even an error code), we wanted to rule out that the source of the error was within our code, and held off on contacting the team that owns the API.
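
One quick way to rule out "slow but eventually successful" calls is to drop the read timeout well below the 60-second default and time the request yourself. A rough sketch, with a placeholder URL and an arbitrary 10-second limit:

    require "httparty"

    # Sketch only: lower the read timeout well below the 60-second default and
    # time the call. The URL and the 10-second limit are placeholders.
    started = Time.now
    begin
      response = HTTParty.get("https://api.example.com/items/42", read_timeout: 10)
      puts "Got #{response.code} in #{(Time.now - started).round(2)}s"
    rescue Net::ReadTimeout
      puts "No response after #{(Time.now - started).round(2)}s"
    end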

The Response Matrix

The step that really cracked things open was invoking one of the failing calls in a variety of settings and locations. We set up a matrix of call methods (in this case HTTParty, Net::HTTP, curl, and Postman) and environments (Deployed Production, Deployed Stage, and Local). We made the same call (same variables, headers, and authentication) with each method in each location and noted the result. Making the call with the various methods let us pinpoint whether the issue depended on Ruby or the request framework, and making it from a variety of locations told us whether the fault lay with the environment. We also made requests to an endpoint that should always work (here, google.com) as a control. Some of these calls may seem redundant, but when the source of a problem is unclear, doing things that seem redundant or subtle can be the clue that exposes an outlier issue.
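
The Ruby half of that matrix can be scripted. A rough sketch, with a placeholder API URL and timeout; the curl and Postman rows were filled in by hand, and the environment axis comes from running the same script in each location:

    require "httparty"
    require "net/http"
    require "uri"

    # Rough sketch of the Ruby half of the matrix. The API URL and timeout are
    # placeholders; run the same script in each environment (production, stage,
    # local) and fill in the curl and Postman rows by hand.
    TARGETS = {
      "API endpoint" => "https://api.example.com/items/123",
      "Control"      => "https://www.google.com"
    }.freeze

    def via_httparty(url)
      HTTParty.get(url, read_timeout: 10).code
    rescue Net::ReadTimeout
      "Net::ReadTimeout"
    end

    def via_net_http(url)
      uri = URI(url)
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = uri.scheme == "https"
      http.read_timeout = 10
      http.get(uri.request_uri).code
    rescue Net::ReadTimeout
      "Net::ReadTimeout"
    end

    TARGETS.each do |label, url|
      puts "#{label}: HTTParty=#{via_httparty(url)} Net::HTTP=#{via_net_http(url)}"
    end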

 

That response matrix looked something like this: a grid of request methods against environments (plus the control endpoint), with each cell marked as a success or a failure.

While this matrix looks pretty simple, that simplicity is its strength! It is much easier to pass around a block of 'happy' and 'sad' colors than to convey the list of calls, their various conditions, and their responses to the half dozen or so stakeholders and remote teams we needed to communicate with to solve this problem. Rather than listing out all the conditions of each request and its result, building a matrix like this lets you (and your teammates, people on other teams, project managers, and less technical folks) see at a glance which requests are succeeding and which are failing, and sometimes spot the root of the issue just as quickly.

Success!

In our case, a very clear pattern emerged: any call made from a Ruby agent would hang, while the same calls made locally succeeded without issue. This at least suggested that the issue stemmed from the API's side rather than from our code.

 

Having established some confidence that the problem involved the API, we contacted the API team. After analyzing our traffic, they determined that the source of the issue was a CDN sitting between our application and the API, which had marked Ruby requests as malicious, likely because of our high call volume and rapid request rate.
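
As context for why a CDN would single out Ruby: Net::HTTP sends a bare "User-Agent: Ruby" header by default, and HTTParty, as far as we can tell, passes that through unless you set your own, so thousands of rapid-fire requests with that signature look a lot like a bot. Whitelisting was the fix in our case, but for completeness, here is a sketch of a gentler client; the header value, delay, and item ids are all placeholders:

    require "httparty"

    # Illustrative only: our actual fix was whitelisting. The header value,
    # half-second delay, and item ids below are all placeholders.
    HEADERS = { "User-Agent" => "acme-nightly-sync/1.0 (ops@example.com)" }.freeze

    item_ids = [101, 102, 103] # stand-in for the thousands of real ids

    item_ids.each do |id|
      HTTParty.get("https://api.example.com/items/#{id}",
                   headers: HEADERS, read_timeout: 10)
      sleep 0.5 # pace the calls instead of firing them back to back
    end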

 

We were able to have our application requests whitelisted and the error was resolved.

 
