Sunday, February 15, 2026

A Conversation That Changed How I Think About Backend Systems - Part I



Me:

I think our API servers will be the first thing to break if we get a million users.

Him:
Why do you think that?

Me:
Because… more users means more requests. The APIs will get overloaded, right?

He smiled. Not the “you’re wrong” smile, but the “wait and think” one.


“Are the APIs really doing the work?”

Him:
When an API receives a request, what does it actually do most of the time?

I paused.

Me:
It… validates input, runs business logic, calls the database, and returns a response.

Him:
Exactly. Now tell me — which part takes the longest?

That’s when it clicked.

Me:
Waiting on the database.

Him:
Good. So if an API is slow, is it really the API that’s slow?


The first bottleneck appears

In large systems, API servers rarely struggle because of computation.
They struggle because they wait.

They wait for:

  • Database queries

  • Cache misses

  • External services

  • Network I/O

At scale, waiting is more dangerous than working.
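
Here is a toy sketch in Python that makes the imbalance visible. Everything in it is invented for illustration: the sleep stands in for a real query crossing the network, and the “business logic” is deliberately trivial.

    import time

    def fetch_user_from_db():
        # Stand-in for a real database call; the sleep models network + query time.
        time.sleep(0.050)
        return {"name": "Alice"}

    def handle_request():
        t0 = time.perf_counter()
        greeting = {"message": "hello"}   # the "business logic" is essentially free
        t1 = time.perf_counter()
        user = fetch_user_from_db()       # the handler spends almost all its time here, waiting
        t2 = time.perf_counter()
        print(f"compute: {(t1 - t0) * 1000:.3f} ms, waiting on the DB: {(t2 - t1) * 1000:.1f} ms")
        return {**greeting, **user}

    handle_request()

Run it and the compute side barely registers; nearly the entire request is spent waiting.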



“The database is the fragile one”

Me:
So the database is the real bottleneck?

Him:
Usually, yes.

Databases are:

  • Expensive

  • Shared by everything

  • Harder to scale horizontally than stateless API servers

Even fast queries can bring a system down if too many of them run at once.


“Then make the database work less”

That was his next sentence.

Me:
How?

Him:
Why hit the database every time if the data hasn’t changed?

That’s when caching entered the picture.
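
In code, the idea is cache-aside reads: answer from the cache when you can, and fall through to the database only on a miss. A minimal Python sketch, with a plain dict standing in for something like Redis (staleness is ignored here on purpose; that question comes next):

    import time

    cache = {}   # stands in for Redis/Memcached; a plain dict keeps the sketch self-contained

    def load_user_from_db(user_id):
        # The expensive, shared resource we are trying to protect.
        time.sleep(0.050)
        return {"id": user_id, "name": f"user-{user_id}"}

    def get_user(user_id):
        # Cache-aside: hit the database only when the cache cannot answer.
        if user_id in cache:
            return cache[user_id]
        user = load_user_from_db(user_id)
        cache[user_id] = user
        return user

    get_user(42)   # first call goes to the database
    get_user(42)   # second call is served entirely from the cache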



“But what about stale data?”

I pushed back.

Me:
Caching means users might see old data.

Him:
Correct. Now ask the real question.

Me:
Which data is allowed to be wrong… briefly?

User profile data — names, avatars — felt safe.
Payments and permissions did not.

This was the first time I understood that consistency is a business decision, not just a technical one.


Time-based freshness instead of perfection

For low-risk data, we agreed on a simple rule:

  • Cache it

  • Set a TTL

  • Let it refresh naturally

For user profiles, something like 10–30 minutes is often acceptable.
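
Concretely, every cached entry carries an expiry time, and the TTL is chosen per kind of data, not per system. A small sketch, with purely illustrative numbers:

    import time

    # TTLs are a business decision: how long each kind of data is allowed to be wrong.
    # These values are illustrative, not recommendations.
    TTL_SECONDS = {
        "user_profile": 15 * 60,   # names, avatars: fine if they lag by minutes
        "permissions": 0,          # never served from cache in this sketch
    }

    cache = {}   # key -> (value, expires_at)

    def cache_get(key):
        entry = cache.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del cache[key]         # expired: treat it exactly like a miss
            return None
        return value

    def cache_set(key, value, ttl_seconds):
        if ttl_seconds > 0:        # a zero TTL means "do not cache this at all"
            cache[key] = (value, time.time() + ttl_seconds)

An expired entry behaves exactly like a miss, so the data refreshes naturally the next time someone asks for it; real caches such as Redis can expire keys for you.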


Fast system.
Slight staleness.
Happy users.

Or so I thought.


“What happens when everyone comes at once?”

He wasn’t done.

Him:
What happens if a popular user’s cache expires and a million requests arrive at the same second?

I knew the answer before saying it.

Me:
They all hit the database.

Him:
Exactly.

That problem has a name: cache stampede.


Multiply that by a million — and the database collapses.
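
It is easy to reproduce in miniature. In this sketch the hot key has just expired, a hundred threads ask for it in the same instant, and nearly all of them fall through to the “database” because none of them sees anyone else’s refresh in time:

    import threading
    import time

    cache = {}                      # the hot key is missing: its entry has just expired
    db_calls = 0
    counter_lock = threading.Lock()

    def load_user_from_db(user_id):
        global db_calls
        with counter_lock:
            db_calls += 1
        time.sleep(0.050)           # models query latency
        return {"id": user_id}

    def handle_request(user_id):
        if user_id in cache:        # every thread sees a miss...
            return cache[user_id]
        user = load_user_from_db(user_id)   # ...so every thread hits the database
        cache[user_id] = user
        return user

    threads = [threading.Thread(target=handle_request, args=(42,)) for _ in range(100)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(f"database calls for one hot key: {db_calls}")   # typically close to 100, not 1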


“Protect the database, always”

His rule was simple:

The database must survive, even if the system degrades.

One effective safety net is rate-limiting database reads:

  • Not to make the system faster

  • But to avoid total failure

Combined with serving slightly stale data, the system degrades gracefully instead of crashing.
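
Here is a sketch of that safety net: a small token bucket caps how many reads per second are allowed to reach the database, and once the budget is spent, requests are answered from the stale copy instead of being let through. The class, the numbers, and the fallback policy are all illustrative, not a prescription.

    import threading
    import time

    class TokenBucket:
        # A tiny token bucket: at most `rate_per_second` DB reads, plus a small burst allowance.
        def __init__(self, rate_per_second, burst):
            self.rate = rate_per_second
            self.capacity = burst
            self.tokens = burst
            self.updated = time.monotonic()
            self.lock = threading.Lock()

        def allow(self):
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False

    db_read_limiter = TokenBucket(rate_per_second=100, burst=100)   # illustrative budget
    cache = {}   # key -> (value, expires_at); expired entries are kept around as a stale fallback

    def get_user(user_id, load_from_db, ttl_seconds=15 * 60):
        now = time.time()
        entry = cache.get(user_id)
        if entry and now < entry[1]:
            return entry[0]                    # fresh hit: the database is never touched
        if db_read_limiter.allow():
            value = load_from_db(user_id)      # within budget: refresh from the database
            cache[user_id] = (value, now + ttl_seconds)
            return value
        if entry:
            return entry[0]                    # over budget: serve the stale copy instead
        raise RuntimeError("over the DB read budget with nothing cached; shed this request")

The exact numbers matter less than the property they buy: the database sees a bounded number of reads no matter how much traffic piles up above it.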



The real lesson

That conversation changed how I think about backend systems.

Large-scale design is not about:

  • Eliminating all risk

  • Making everything perfectly fresh

It’s about:

  • Knowing where the bottleneck is

  • Protecting the most fragile layer

  • Accepting controlled imperfection

  • Designing for failure, not just success

Much later, I realized something else.

He wasn’t just teaching backend systems.

He was teaching me how architects think.


More conversations like this coming soon.
