Lesson 1.3: When Overwhelmed, Make a List of Questions

This lesson tells the story of the first time I felt completely overwhelmed at Neurafilm — when I was asked to solve a critical problem in an area that I had zero experience with. I learned a technique for decomposing a large problem into questions, where each answer moves you closer to your goal.

It Got Late Early

It was a Monday morning in September, a little after 10am. I walked into a conference room and sat on the counter against the wall since all of the good seats were taken. Ethan, the director of our team, usually sent out an agenda for our Monday team meeting on Sunday night, often including documents to read in order to prepare for the discussion. I had zero interest in reading documents on a Sunday, so I wasn’t prepared to discuss anything. But here I was, along with 20 or so of my teammates, most of them typing on their laptops instead of paying attention to the meeting.

Ethan confirmed that Neurafilm would launch in England and Ireland in January. There had been rumors about this for a few weeks — we knew a European launch was happening, but the dates and countries weren’t finalized. Now they were, and we had to get moving.

We needed to get our services running in a second AWS region: eu-west-1. Neurafilm had only run production software in AWS us-east-1 up to that point, so there were a lot of unknowns in front of us. Lamar, my most seasoned colleague, was working on getting our service running in Europe. He messaged the Cloud Platform team, which provided base images for service teams to layer their stuff on top of:

Lamar
Hey, I’d like to start deploying our service to EU. Do you have the base images ready?

Random Platform Engineer
It’s a little early — we haven’t started yet.

It was September, and we needed to launch in January. Does that feel “a little early” to you? To me, it felt like we were already behind schedule. Would it be possible to deploy our entire service infrastructure in Europe within a few short months? At any other company, I’d say no, it’s not happening. But this was Neurafilm — we’re moving forward, whether it’s possible or not.

The Routing Situation

Hey everyone, we’re launching in Europe in January, figure it out.
— Neurafilm executive leadership

Neurafilm didn’t have any architects, let alone an architecture team. The executives decided that we were launching in Europe, but there was nobody in charge of deciding how we would do it from a technical perspective. Instead, an interwoven set of teams collaborated to move us into Europe:

  • Product and device partner teams chose the devices that we’d support

  • Client teams modified the UIs and client software as necessary

  • The aforementioned Cloud Platform team provided foundational libraries, tools, and patterns to enable backend services

There were also other teams handling multi-language support, payment processing, CDN integration, and other areas I wasn’t aware of. As a service owner, it was your responsibility to work within the constraints of other teams in order to get your stuff running in Europe. There was a paved path that was easy to follow, especially for backend teams which largely used the same services and tools under the hood. When in doubt, you could get pretty far by copying whatever the team most similar to yours was doing, conceptually.

My manager, Caleb, asked me to figure out how to route API traffic to our new European cluster. More specifically, I needed to figure out how to route western hemisphere traffic to our AWS us-east-1 cluster, and eastern hemisphere traffic to eu-west-1. I knew nothing about the problem space. Lamar suggested I look into using DNS, which was enough to get me started, so I grabbed a headlamp and crawled into the rabbit hole.

I think you can do some geo stuff with DNS, or something.
— Lamar

Here’s a diagram of our HTTP traffic flow:

Figure 1. API traffic overview

Summary:

  • Client devices used a DNS hostname for API requests which resolved to a load balancer

  • The load balancer forwarded to our API proxy, a third-party app that I’ll discuss later

  • The API proxy forwarded to the API backend, which was the primary service that my team worked on

  • The API backend forwarded requests to multiple backend services and used their responses to assemble a final response to the client
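To make the last bullet concrete, here’s a rough sketch of that fan-out-and-assemble pattern. The service names and the asyncio framing are mine for illustration, not a description of the real backend:

    import asyncio

    async def fetch(service: str, request_id: str) -> dict:
        # Stand-in for an HTTP/RPC call to one backend service.
        await asyncio.sleep(0.01)
        return {"service": service, "request_id": request_id}

    async def handle_request(request_id: str) -> dict:
        # Call several backend services concurrently...
        results = await asyncio.gather(
            fetch("catalog", request_id),
            fetch("recommendations", request_id),
            fetch("user-profile", request_id),
        )
        # ...then assemble their responses into a single reply for the device.
        return {"rows": results}

    print(asyncio.run(handle_request("req-123")))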

I knew that DNS resolved hostnames into IP addresses. I also knew that you could alter the DNS resolution process to create functionality like round-robin DNS for load balancing. But what I wanted was this, in pseudo-code-ish fashion:

for hostname H
resolve users in country C1 to CNAME N1
and users in country C2 to CNAME N2

Where N1 and N2 would point to different AWS regions, and C1 and C2 would be countries that we wanted to route to those regions, respectively.
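As a concrete sketch, here’s roughly what that configuration could look like if the DNS provider were Amazon Route 53. That’s an assumption on my part: our actual provider isn’t named here, and the hosted zone id and hostnames below are made up.

    import boto3

    route53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z0EXAMPLE"  # hypothetical hosted zone
    RECORDS = [
        ("US", "api-us-east-1.neurafilm.com"),  # route US devices to us-east-1
        ("GB", "api-eu-west-1.neurafilm.com"),  # route UK devices to eu-west-1
        ("IE", "api-eu-west-1.neurafilm.com"),  # route Irish devices to eu-west-1
        ("*", "api-us-east-1.neurafilm.com"),   # default when the country is unknown
    ]

    changes = [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.neurafilm.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": f"geo-{country}",
                "GeoLocation": {"CountryCode": country},
                "ResourceRecords": [{"Value": target}],
            },
        }
        for country, target in RECORDS
    ]

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": changes},
    )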

This would create a new global hostname which would route clients to the appropriate region based on their location:

Figure 2. API global hostname

Some brief research revealed that we could configure our DNS provider to resolve different CNAMEs per country (or even zip code) for a given hostname. However, complexities in network routing, IP addresses, and user behavior prevent this from working 100% of the time. Some percentage of customers would be routed to the wrong region, some percentage of the time.

That sounded suboptimal, but maybe it would still work. Did all requests always have to go to the correct region? Unfortunately, yes, they did. The API was the front door to Neurafilm for all content-discovery-related device traffic. It’s easier to define that traffic by what it wasn’t than by what it was, so let’s clarify what it wasn’t:

  • CDN traffic: video assets or box art images

  • Playback related traffic: DRM keys, manifests (or other things I didn’t understand)

  • Website traffic: HTML used to render Neurafilm in your web browser

Our traffic mostly consisted of devices fetching rows of content. We also provided:

  • Detailed metadata for specific subsets of content

  • Search results

  • Authentication

Plus various other things.

Cluster Constraints

The API backend depended on a system named MMS (Movie Metadata Service) for content metadata. We used an MMS client library that stored metadata for the entire content catalog in memory to enable fast "find metadata by id" lookups for all programs.
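Conceptually, the client behaved something like this. The class and method names are mine, not the real library’s:

    from typing import Optional

    class MovieMetadataCache:
        """In-memory map of program id -> metadata, refreshed from MMS."""

        def __init__(self, snapshot: dict):
            self._by_id = snapshot

        def find_by_id(self, program_id: str) -> Optional[dict]:
            # A dictionary read instead of a network call.
            return self._by_id.get(program_id)

    cache = MovieMetadataCache({"tt001": {"title": "Example Program"}})
    print(cache.find_by_id("tt001"))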

Why did we choose to store all content metadata in memory, as opposed to in a separate service? I discuss this in the next lesson.

For now, what’s important is that the MMS cache required significant memory, so much so that we couldn’t store our North American and Latin American datasets together. We had to split them into separate clusters and route by country. Here’s a diagram of how it worked:

Figure 3. API LatAm routing

Couldn’t we just proxy calls to our new EU cluster like we did for LatAm? Technically, we probably could have, but we didn’t want to, due to concerns about security, latency, and resiliency. Since both the NorAm and LatAm clusters were in AWS us-east-1, enabling connectivity from our proxy layer was as simple as configuring security groups. But going from US to EU was cross-region — so we’d have to connect securely over the public internet and ensure that our connection pools had enough capacity to handle the increase in concurrency due to additional cross-region latency.
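The connection pool concern follows from Little’s law: the number of in-flight requests (and therefore the connections you need) is roughly the request rate multiplied by latency, so the same traffic needs a bigger pool when each request takes longer. A toy calculation, with made-up rates and latencies rather than real Neurafilm numbers:

    def connections_needed(requests_per_sec: float, avg_latency_sec: float) -> int:
        # Little's law: concurrency ~= arrival rate x time in system.
        return int(requests_per_sec * avg_latency_sec) + 1

    print(connections_needed(500, 0.02))  # same-region, ~20 ms  -> ~11 connections
    print(connections_needed(500, 0.12))  # cross-region, ~120 ms -> ~61 connections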

Cross-Region Traffic

If you have to send cross-region traffic today, there are more secure options than sending it over the public internet, such as VPC peering connections [1].

Also, our proxy was a piece of third-party software that we all disliked. I’ll discuss more details about this system in Chapter 2, including the process of rewriting it from scratch. However, our immediate goal was a successful European launch, and we were hesitant to add further complexity to our proxy in order to achieve it.

The Devil of DNS Details

DNS and Locality

If you were solving a problem today involving DNS and geolocation, you may be able to use EDNS Client Subnet (ECS) [2].

A lot would depend on how widely your DNS resolvers vary across your customers’ devices and internet service providers. If you couldn’t guarantee that ECS was supported across 100% of your DNS lookups, you’d need an alternate solution.

Let’s take a step back — why doesn’t DNS geo-routing always work perfectly?

DNS lookups don’t have the concept of a client IP — only a resolver IP. So when you do a DNS lookup from your laptop, the authoritative DNS server can only see the IP of your ISP’s DNS resolver, or maybe Google’s resolver if you’ve configured that instead. And since your DNS resolver isn’t necessarily in the same location as your Neurafilm-friendly device, it’s a problem.

This was somewhat illustrated earlier in Figure 2.
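You can see the resolver-versus-client distinction from the outside: the same laptop can get different answers for the same hostname purely by asking different recursive resolvers, because the authoritative server never sees the laptop’s IP. A small sketch using the third-party dnspython library, with the fictional hostname from this chapter:

    import dns.resolver  # third-party package: dnspython

    def resolve_via(hostname: str, resolver_ip: str) -> list:
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [resolver_ip]  # ask this specific recursive resolver
        return [rr.to_text() for rr in r.resolve(hostname, "A")]

    # Two public resolvers may hand back different region-specific answers,
    # regardless of where the querying device actually is.
    print(resolve_via("api.neurafilm.com", "8.8.8.8"))         # Google Public DNS
    print(resolve_via("api.neurafilm.com", "208.67.222.222"))  # OpenDNS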

Whatever country you’re in, Neurafilm can only stream the programs that it has paid to license in that country. Let’s walk through a scenario:

  1. You’re in the US but using a DNS server in Ireland.

  2. Since you’re using a DNS server in Europe, DNS routes you to our EU cluster, which only contains content for European countries.

  3. Whatever server you hit in the EU cluster can see your real client IP once you send an HTTP request.

  4. We’d properly resolve your country as US, then filter our content to include only what is available to you.

  5. No US content is available to you on an EU server, since it only has EU content in its MMS cache.

  6. You’re staring at either an error page or an empty homepage — a broken experience that is unacceptable to serve to any customer.
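A toy model of that failure, with made-up country codes and program ids:

    # The EU cluster's MMS cache only holds content licensed for EU countries.
    EU_CLUSTER_CATALOG = {
        "GB": ["gb-program-1", "gb-program-2"],
        "IE": ["ie-program-1"],
    }

    def homepage_rows(client_country: str) -> list:
        # The server resolves the client's real country from the HTTP connection...
        # ...but can only serve what's in this cluster's cache.
        return EU_CLUSTER_CATALOG.get(client_country, [])

    print(homepage_rows("US"))  # [] -> an empty homepage for a misrouted US customer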

This is why it was so critical to handle routing correctly for 100% of customers, 100% of the time. It wasn’t a situation where an unlucky customer would get an occasional error and receive a satisfactory fallback or send a retry and succeed. The failures were total, not partial, so we needed to prevent them from happening.

Questions and Answers

When asked to make a big decision, it’s normal to feel clueless, afraid, or overwhelmed. But none of these feelings are useful.

If you find yourself in this situation, a good first step is to ask: what is one question that I wish I had the answer to? During the process of finding the answer, you’ll bump into lots of other questions. Now you’re walking the path towards enlightenment.

I had spent a few days pondering the European traffic routing situation — exploring details of HTTP, cross-region traffic, MMS caching, and DNS. But I hadn’t yet stitched everything together. Here’s a series of questions that led me towards a coherent belief system:

How are customer devices routed to AWS regions?

DNS.

How is DNS configured, specifically?

Using our DNS provider’s console, we configure the api.neurafilm.com hostname as a CNAME to the public hostname of an Amazon load balancer in us-east-1.

If we’re running in multiple AWS regions, can we use DNS to route devices to the closest region?

Yes, we can configure our DNS provider to resolve different countries to different CNAMEs, each of which could point to different regions. But it won’t work 100% of the time.

Why won’t it work 100% of the time?

DNS can only see the resolver IP, not the client IP.

Does it need to work 100% of the time?

Yes, we need devices to be routed correctly 100% of the time so they can access the correct country-specific content catalog.

What percentage of requests are misrouted?

We don’t know for sure, but the number tossed around was between 1% and 5%.

What options are there to route devices correctly when DNS doesn’t work?

We can send a redirect to the client and expect them to follow it, or we can proxy the request to the correct region for them.
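Sketched out, the two options look something like this. The region names, hostnames, and country-to-region mapping are illustrative, not the real configuration:

    from dataclasses import dataclass
    from typing import Optional

    REGION_FOR_COUNTRY = {"US": "us-east-1", "GB": "eu-west-1", "IE": "eu-west-1"}
    HOST_FOR_REGION = {
        "us-east-1": "api-us-east-1.neurafilm.com",
        "eu-west-1": "api-eu-west-1.neurafilm.com",
    }

    @dataclass
    class RoutingDecision:
        action: str                   # "serve", "redirect", or "proxy"
        target: Optional[str] = None

    def route(client_country: str, local_region: str, path: str,
              use_redirects: bool) -> RoutingDecision:
        correct_region = REGION_FOR_COUNTRY.get(client_country, "us-east-1")
        if correct_region == local_region:
            return RoutingDecision("serve")
        url = f"https://{HOST_FOR_REGION[correct_region]}{path}"
        # Redirect: hand the client a 3xx pointing at the right region.
        # Proxy: fetch from the other region ourselves and relay the response.
        return RoutingDecision("redirect" if use_redirects else "proxy", url)

    # A misrouted US customer hitting the EU cluster:
    print(route("US", "eu-west-1", "/home/rows", use_redirects=True))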

What are the benefits of proxying?

  • Functionally transparent to clients (redirects, on the other hand, would require client changes)

  • Requires fewer hostnames, and therefore less config

What are the benefits of redirects?

  • Lower latency (details below)

  • Avoids cross-region requests and all of the complexity that comes with them (securing traffic, thinking through cross-region connection pool sizes, handling extra load during cross-region latency spikes, etc.)

Can we visualize what each solution would look like?

Sure. Note that these sequence diagrams omit the API Proxy since it likely wouldn’t be adding any functionality:

Figure 4. API misroutes: proxying

Figure 5. API misroutes: redirects

If we proxied, could we route from Proxy to Proxy?

Yes, but it’s a bad idea. This would require us to configure the API Proxy, which was always unpleasant.

This would also require us to do geo lookups in the API Proxy, which would likely involve integrating our GeoClient library. That sounds like a very bad time.

If we proxied, could we route directly between API Backend nodes?

If we decided to proxy misroutes instead of redirecting them, this felt like a better option. But this would require every API backend instance to be directly addressable and for the cluster in each region to have a list of all of the nodes in the other region. We already had a service named Discovery that provided service-to-node (and vice-versa) mappings, so we could have potentially modified it to also support cross-region lookups. Otherwise, we’d have to stand up a new backend load balancer in each region.

There was also the issue of cross-region requests having higher latency. Plus, in the absence of persistent redirects, every request from a misrouted client would be proxied, so every one of that client’s requests would be slower.

Are we sure that clients are able to implement redirects?

No, we weren’t sure at that point. Each device had its own unique networking stack, sometimes with unexpected DNS and HTTP behaviors. Also, even if redirects were possible, it wasn’t clear that they could be implemented within our tight timeline. Clients would have to parse a redirect (the specific HTTP status codes and other details were still undecided) and follow it. A redirect would point directly to a specific region, so we’d need to create two new DNS hostnames, SSL certs, and load balancers (one for each region). We had not yet learned about SAN certs [3], which allow multiple hostnames per cert.

Ideally, clients could persist redirects so that only the first misroute would receive a redirect, and all subsequent requests would go directly to the correct region. But even in the worst case of every request being redirected, I still believed that clients would see lower latency with redirects than proxying.
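Here’s a sketch of what “persisting a redirect” could look like on a client, assuming the client can follow redirects at all. It uses the third-party requests library as a stand-in for a device’s networking stack, and the hostnames are the fictional ones from this chapter:

    from urllib.parse import urlsplit

    import requests  # stand-in for a device's HTTP stack

    class ApiClient:
        def __init__(self, base_url: str = "https://api.neurafilm.com"):
            self.base_url = base_url  # replaced once we learn our "home" region

        def get(self, path: str) -> requests.Response:
            resp = requests.get(self.base_url + path, allow_redirects=False, timeout=5)
            if resp.status_code in (301, 302, 307, 308):
                # Remember the region-specific host so only this first request
                # pays the extra round trip; subsequent calls go there directly.
                parts = urlsplit(resp.headers["Location"])
                self.base_url = f"{parts.scheme}://{parts.netloc}"
                resp = requests.get(self.base_url + path, timeout=5)
            return resp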

I had a lot of information in front of me. It was now time to make a decision.

A Tentative Path Forward

After pondering our options, my preference was to use redirects to send misrouted clients to the proper region.

The complexity of cross-region traffic felt too high for me to solve within our limited timeline. Plus, lower latency is always a nice bonus.

It was a long journey from being completely clueless to having a desired solution, and there was still a long way to go. But things were looking good — a large, complex problem had decomposed into a few simple options. I was only able to accomplish that by asking a lot of stupid questions.

I’ve used this approach many times since then and it has always moved me in a positive direction. If you find yourself stuck in a similar situation, here are a few rough steps to get started:

  1. Start with a short document, even if it’s just for your own use.

  2. Write down every question that comes to mind.

  3. Group them into categories if you wish.

  4. Often, some questions will appear more critical or time-sensitive than others, so you’ll know which answers to pursue first.

  5. If not, start with a random question, and I guarantee that you’ll end each day in a better place than you started.

Back to our routing situation: Now that I knew what I wanted, I had to meet with our partner teams to ensure that we were highly aligned yet loosely coupled.


1. https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html
2. https://en.wikipedia.org/wiki/EDNS_Client_Subnet
3. https://en.wikipedia.org/wiki/Subject_Alternative_Name