421, debugging misdirected request

The month is January. Goals and plans are warmer than ever; the weather is just as cold as it can get. I cannot hyperfixate on problems endlessly during the winter: it’s cold, and it’s distracting.

Anyway, the team is on their usual sprint - not to combat the cold, just sprints on Jira. Amidst it all, I get pinged about a peculiar response from nginx: 421 Misdirected Request. It was my first encounter with the status code 421.

Then I searched for the definition of 421 and the conditions that produce it. RFC 9110 says a client that receives a 421 may retry the request over a different connection. So I asked the team to check for changes that could have affected their HTTP client or their payload, and to my surprise there were none.
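The retry that RFC 9110 allows can be sketched in a few lines of Python. This is only an illustration, not code from any of our clients: `get_with_421_retry` and the `open_connection` factory are hypothetical names, and the sketch assumes an `http.client`-style connection object.

```python
import http.client
from typing import Callable, Tuple


def get_with_421_retry(
    open_connection: Callable[[], http.client.HTTPConnection],
    path: str,
    max_attempts: int = 2,
) -> Tuple[int, bytes]:
    """GET `path`, retrying on a fresh connection if the server answers 421."""
    status, body = 0, b""
    for _ in range(max_attempts):
        conn = open_connection()  # a new connection for every attempt
        try:
            conn.request("GET", path)
            response = conn.getresponse()
            status, body = response.status, response.read()
        finally:
            conn.close()
        if status != 421:
            break  # anything other than 421 is returned to the caller as-is
    return status, body
```

A call could look like `get_with_421_retry(lambda: http.client.HTTPSConnection("nepalipatro.com.np"), "/")`. The point is only that the retry goes out on a different connection than the one that produced the 421.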

Nobody made changes

The clients (i.e. Android, iOS and web) made no changes, and neither did backend or infra. The 421 response was rare but not non-existent, occurring on only a few devices, and I was just as confused as everyone else. We couldn’t establish steps to reproduce it.

Converging Paths

I went ahead to look for more details and on MDN, I came across the following statement in one of their examples.

Okay, noted: many of our hosts do reside on the same server. I also remembered that we have two replicated edge servers, and I thought of pinning the requests to one of them, hoping for better luck at reproducing the result. So I did that: I updated my /etc/hosts and set all the hosts to resolve to the same server.

And that still didn’t work. Or did it? It seemed to me that I wasn’t making any progress, but it turns out this step was crucial, which I failed to acknowledge at the time.

Narrowing down

We saw the 421 response across all platforms, i.e. Android, iOS and web. There must be something common between them, right? And yes, there was. On Android and iOS, the error was observed specifically in WebView. I placed a bet: it has to be the browser engine.

Playing for the bet

Now I had a clue, a direction. The only thing left was to prove it: play for the bet I had placed.

Then I decided to load all the hosts/sites present on the same server in a random order. And just like that, I got myself the then-infamous 421 Misdirected Request response.

I noted what I had done to get there, and it was surprisingly simple. First, I loaded payment-portal.nepalipatro.com.np; then, upon attempting to open nepalipatro.com.np, I was presented with 421 Misdirected Request.

On macOS I use Orion Browser, which uses WebKit as the browser engine. I expected Safari to behave the same, and it did. I didn’t know exactly what the issue was, but it was clear to me that it had something to do with WebKit - but wait, Android saw this issue too, and it does not run on WebKit. That’s right, “it cannot be WebKit,” I said to myself.

To test my theory, I switched to Brave Browser, which uses Chromium, and performed the same steps as earlier, except this time I did not get a 421 Misdirected Request response. I did the same thing on both browsers MULTIPLE times to assure myself, but the result was clear: Brave Browser was working fine, while Orion/Safari kept receiving 421.

Programming vs Networking

To be honest, as absurd as it sounds, my programmer instinct kicked in and I thought of reading the source code of Chromium and WebKit to understand the behaviour. For sane reasons, I chose to inspect packets with Wireshark instead.

Peeking inside

Now it was time to dive deep and monitor closely. I ran tcpdump in the background and performed the steps that reproduce 421 Misdirected Request on each browser in turn.

 sudo tcpdump -i en0 dst 157.10.100.57 -w test_01.pcap

Inspecting Requests

On WebKit, only one Client Hello was observed.

[Figure: packet capture of WebKit, showing only one Client Hello]

On Chromium, two Client Hellos were observed, one for each host.

[Figure: packet capture of Chromium, showing two Client Hellos]
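What ties a Client Hello to a host is the server_name (SNI) extension inside it. The following is a rough Python sketch of reading that field from the raw bytes of a single TLS record, for example bytes exported from a capture. `extract_sni` is a hypothetical helper written for illustration; it assumes the Client Hello fits in one unfragmented record and gives up with `None` on anything else.

```python
import struct
from typing import Optional


def extract_sni(record: bytes) -> Optional[str]:
    """Return the SNI host name from one TLS Client Hello record, or None."""
    # TLS record header: type (1), version (2), length (2). 0x16 = handshake.
    if len(record) < 5 or record[0] != 0x16:
        return None
    pos = 5
    # Handshake header: type (1), length (3). 0x01 = Client Hello.
    if len(record) < pos + 4 or record[pos] != 0x01:
        return None
    pos += 4
    pos += 2 + 32  # skip client version and random
    try:
        pos += 1 + record[pos]  # skip session id
        (suites_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + suites_len  # skip cipher suites
        pos += 1 + record[pos]  # skip compression methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(pos + ext_total, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:  # server_name extension
                # Layout: list length (2), name type (1), name length (2), name.
                name_type = record[pos + 2]
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                if name_type != 0:  # 0 = host_name
                    return None
                return record[pos + 5 : pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None  # truncated or malformed record
    return None
```

Run over every Client Hello in a capture, this would give one host name per TLS handshake, which is the same count read off the Wireshark screens by eye.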

Since both hosts reside on the same server, i.e. 157.10.100.57, it seemed that WebKit used the same TLS connection for both hosts, which led to nginx responding with 421.

“It cannot just be WebKit, right? Some of the Android clients that faced the issue must be doing the same.”

It was clear to me that some browsers reuse the same TLS connection for different hosts because they have the same destination. The behaviour was rare, and since we have two edge servers, it led me to believe that each host must have been resolving to a different address most of the time.
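One way to picture that reuse decision is the condition RFC 9113 (section 9.1.1) describes for HTTP/2: an https connection may be reused for another host if that host resolves to the same IP address and the certificate on the connection is valid for it. The Python sketch below models only that rule; it is not taken from WebKit or Chromium, and the certificate names in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class OpenConnection:
    ip: str                     # address the connection was opened to
    cert_names: FrozenSet[str]  # names the server certificate is valid for


def cert_covers(cert_names: Iterable[str], host: str) -> bool:
    """True if any certificate name matches `host`, allowing one wildcard label."""
    for name in cert_names:
        if name == host:
            return True
        # "*.example.com" covers "a.example.com" but not "a.b.example.com".
        if name.startswith("*.") and "." in host and host.split(".", 1)[1] == name[2:]:
            return True
    return False


def may_reuse(conn: OpenConnection, host: str, resolved_ips: Iterable[str]) -> bool:
    """Model of the reuse rule: same IP and a certificate valid for `host`."""
    return conn.ip in set(resolved_ips) and cert_covers(conn.cert_names, host)


# Hypothetical certificate names, for illustration only.
conn = OpenConnection(
    ip="157.10.100.57",
    cert_names=frozenset({"nepalipatro.com.np", "*.nepalipatro.com.np"}),
)
print(may_reuse(conn, "nepalipatro.com.np", ["157.10.100.57"]))  # same edge server
print(may_reuse(conn, "nepalipatro.com.np", ["157.10.100.23"]))  # other edge server
```

Under this model, reuse is only possible when both hosts land on the same edge server, which fits how pinning everything to one address in /etc/hosts made the 421 reproducible.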

$ dig nepalipatro.com.np +short
157.10.100.23
157.10.100.57

$ dig payment-portal.nepalipatro.com.np +short
157.10.100.23
157.10.100.57
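The same comparison can be done programmatically. A small Python sketch, assuming the system resolver returns the same records as dig; `resolve_ipv4` and `shared_addresses` are illustrative helper names.

```python
import socket
from typing import Iterable, List, Set


def resolve_ipv4(host: str) -> List[str]:
    """Return the IPv4 addresses the system resolver gives for `host`."""
    _name, _aliases, addresses = socket.gethostbyname_ex(host)
    return addresses


def shared_addresses(first: Iterable[str], second: Iterable[str]) -> Set[str]:
    """Addresses that both hosts can resolve to."""
    return set(first) & set(second)


# Using the dig answers above instead of a live lookup:
main_site = ["157.10.100.23", "157.10.100.57"]
payment_portal = ["157.10.100.23", "157.10.100.57"]
print(sorted(shared_addresses(main_site, payment_portal)))
```

A non-empty result means a client can end up talking to both hosts on one and the same edge server, which is the precondition for the connection reuse seen in the captures.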

Now what?

The issue was visible: the browser was reusing the same TLS connection for different hosts. Chromium, on the other hand, handled this perfectly by initiating a separate TLS connection for each host, even though they reside on the same server.

I don’t have control over the browser. Even if I were to submit a patch, the behaviour would persist in pre-patch builds.

I sat back, tried to visualize the flow again, view the packets again, read the examples again. And on re-reading about the status code 421, I came across this statement.

Hmmm, so 421 can be triggered if the scheme is different. But we terminate TLS at the same point for both hosts, so it didn’t make sense and I ignored it. It did, however, lead me to check our nginx configuration.

As I took a closer look at the nginx configuration for both hosts, I noticed a difference.

You see it too, right? One has http2 enabled and the other one does not. I enabled http2 on both hosts and no longer received status code 421.

So it turns out WebKit, and some other browser engines, negotiated TLS and transmitted HTTP/2 data. But when that same TLS connection was used to transmit HTTP/1, nginx responded with 421.
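A mismatch like this is easy to miss by eye, so here is a rough Python sketch of checking for it. The sample configuration is made up, with placeholder host names rather than our real ones, and the parsing is deliberately naive: it counts braces and looks for either `http2` on a `listen` line or an `http2 on;` directive, and would be confused by braces inside comments or strings.

```python
import re
from typing import Dict, Iterator

# Hypothetical configuration, for illustration only.
SAMPLE_CONFIG = """
server {
    listen 443 ssl http2;
    server_name a.example.test;
    location / { return 200; }
}
server {
    listen 443 ssl;
    server_name b.example.test;
}
"""


def server_blocks(config: str) -> Iterator[str]:
    """Yield the body of each `server { ... }` block by counting braces."""
    for match in re.finditer(r"\bserver\s*\{", config):
        depth, start = 1, match.end()
        for i in range(start, len(config)):
            if config[i] == "{":
                depth += 1
            elif config[i] == "}":
                depth -= 1
                if depth == 0:
                    yield config[start:i]
                    break


def http2_by_host(config: str) -> Dict[str, bool]:
    """Map each server_name to whether its server block enables HTTP/2."""
    result: Dict[str, bool] = {}
    for body in server_blocks(config):
        names = re.search(r"\bserver_name\s+([^;]+);", body)
        on_listen = re.search(r"\blisten\b[^;]*\bhttp2\b[^;]*;", body)
        directive = re.search(r"\bhttp2\s+on\s*;", body)
        enabled = bool(on_listen or directive)
        for name in (names.group(1).split() if names else []):
            result[name] = enabled
    return result


flags = http2_by_host(SAMPLE_CONFIG)
print(flags)
print("mismatch" if len(set(flags.values())) > 1 else "consistent")
```

If hosts that share an address and a certificate come back with different values, they are candidates for the kind of 421 described here.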

The Fix

Enabling HTTP/2 on both hosts did the job, and we no longer received 421 on any of our clients.

After enabling HTTP/2, I inspected the WebKit packets again and observed two Client Hellos, one for each host.