Improve Back-of-the-Envelope Calculation Practice #501

@ido777

Augment the Primer with clearer guidance and examples for back-of-the-envelope calculations – those quick, approximate calculations that estimate system capacity, throughput, or latency. Following Jeff Dean’s advice, we’ll integrate more numeric estimation exercises, real-world examples, and handy reference tables (like “numbers every engineer should know”) to strengthen this skill.

Justification: Being able to do rough calculations is a key skill in system design. It helps validate whether a design can meet requirements (e.g. can one server handle 1 million users? how much storage will a year of data consume?). Jeff Dean, a respected Googler, emphasized that engineers should estimate performance of alternatives without always building them, and provided latency numbers as common knowledge. Many interviewers expect candidates to do such estimates during the interview (“let’s approximate how many QPS this service needs to handle”). Currently, the Primer may have some scattered numbers, but a dedicated focus will ensure learners get comfortable with this. Moreover, including realistic calculations (Jeff Dean’s examples, or ones inspired by them) makes the Primer more concrete and applied.
On the other hand, over the last decade the focus has moved from sizing to scaling. In other words, instead of guessing the performance up front, we measure and improve, and we can use AI to help verify validity. The main takeaway from a back-of-the-envelope calculation is the validity check, so it should be integrated into the 4 phases of the interview, though not necessarily in the form of the current Back-of-the-Envelope Calculation.

Implementation Steps (DRAFT, needs refactoring):

Create a “Numbers to Remember” Reference: Add an appendix or a section (possibly in the introduction or as a sidebar in relevant sections) listing common performance and capacity numbers. For example:
Latency benchmarks: L1 cache ~0.5 ns, disk seek ~10 ms, cross-data-center RTT ~150 ms, etc.
Throughput examples: 1 Gbps network ~125 MB/s, typical HDD ~100 IOPS, etc.
These can be presented as a table for quick scanning. Jeff Dean’s list is a great source; we can cite it and perhaps format it in a Markdown table for clarity.
Emphasize orders of magnitude: the table helps readers internalize what is fast vs slow (memory vs disk, etc.).
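As a rough illustration of the reference-table idea, the sketch below builds a small Markdown table from the three latency figures quoted above and reports each one as an order of magnitude relative to an L1 cache reference. The names `LATENCY_NS`, `relative_to`, `order_of_magnitude`, `gbps_to_mb_per_s` and `latency_table` are hypothetical, and real figures vary by hardware.

```python
import math

# Latencies in nanoseconds, taken from the examples quoted above.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Disk seek": 10_000_000,                      # ~10 ms
    "Cross-data-center round trip": 150_000_000,  # ~150 ms
}


def relative_to(base_ns: float, value_ns: float) -> float:
    """Return how many times slower value_ns is than base_ns."""
    return value_ns / base_ns


def order_of_magnitude(x: float) -> int:
    """Return the power of ten of x, e.g. 20_000_000 -> 7."""
    return math.floor(math.log10(x))


def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert Gbit/s to MB/s (decimal units, 8 bits per byte)."""
    return gbps * 1000 / 8


def latency_table() -> str:
    """Render LATENCY_NS as a Markdown table."""
    base = LATENCY_NS["L1 cache reference"]
    rows = ["| Operation | Latency (ns) | vs. L1 cache |", "|---|---:|---:|"]
    for name, ns in LATENCY_NS.items():
        exponent = order_of_magnitude(relative_to(base, ns))
        rows.append(f"| {name} | {ns:,.1f} | ~10^{exponent}x |")
    return "\n".join(rows)


if __name__ == "__main__":
    print(latency_table())
    print(f"1 Gbps is about {gbps_to_mb_per_s(1):.0f} MB/s")
```

Showing the exponent rather than the exact ratio keeps the reader's attention on the gap between memory, disk and network, which is the point of the table.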
Incorporate Calculations in Case Studies: For each major design example in the Primer, include a “back-of-the-envelope” calculation segment. For instance:
When designing a URL Shortener, add: “Estimate: If we have 1 billion URLs and each stored entry is ~100 bytes, that’s ~100 GB of data. If each request is 0.5 KB out and there are 10k requests/sec, that’s ~5 MB/s of egress bandwidth.” These rough numbers indicate whether the design needs a caching layer or database sharding.
For a messaging system design: “Assume each user sends 50 messages a day and we have 10 million users. That’s 500 million messages/day, ~5800 messages/sec on average. Peak might be 5x that, ~29k messages/sec. Our design needs to handle this order of magnitude.”
Present these in a clear, stepwise way (perhaps bullet points or a small table) so the reader can follow the logic. Use rounding and assumptions explicitly to demonstrate how to simplify (e.g. “We assume 1 month ≈ 30 days for ease”).
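The two estimates above can be written out step by step. This is a minimal sketch using only the numbers from the text; the helper names are hypothetical, and decimal units (1 KB = 1,000 bytes) are assumed to keep the arithmetic simple.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def storage_bytes(num_items: int, bytes_per_item: int) -> int:
    """Total storage for num_items entries of a fixed size."""
    return num_items * bytes_per_item


def bandwidth_bytes_per_s(requests_per_s: int, bytes_per_response: int) -> int:
    """Egress bandwidth for a steady request rate."""
    return requests_per_s * bytes_per_response


def average_rate_per_s(events_per_day: int) -> float:
    """Average events per second, assuming an even spread over the day."""
    return events_per_day / SECONDS_PER_DAY


# URL shortener: 1 billion URLs at ~100 bytes each.
url_storage = storage_bytes(1_000_000_000, 100)
print(f"Storage: ~{url_storage / 1e9:.0f} GB")

# 10k requests/sec at ~0.5 KB (500 bytes) per response.
egress = bandwidth_bytes_per_s(10_000, 500)
print(f"Egress: ~{egress / 1e6:.0f} MB/s")

# Messaging: 10 million users, 50 messages per user per day.
messages_per_day = 10_000_000 * 50
avg = average_rate_per_s(messages_per_day)
peak = 5 * avg  # 5x peak-to-average ratio, as assumed in the text
print(f"Average: ~{avg:,.0f} msg/s, peak: ~{peak:,.0f} msg/s")
```

The exact average is about 5,787 messages/sec, which the text rounds to ~5,800; rounding at each step is part of the technique.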
Markdown-Formatted Breakdowns: Format calculations in Markdown for readability:
Use bullet points or numbered steps to show each part of a calc.
Use bold for final results or key numbers.
Possibly use a monospaced font for numbers or align them in tables for clarity.
Example in text: “Storage needed: 1 billion * 100 bytes ≈ 100 billion bytes ≈ 100 GB.”
For more complex comparisons, use a small table. E.g., compare two designs: one reads 30 thumbnails serially vs in parallel – table columns could show “Design A: total time = 30 * (latency+processing), Design B: total time = (latency+processing) + overhead” etc. This makes it easy to see which is better.
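The serial-versus-parallel comparison can be sketched as follows. The two formulas are the ones given in the text; the 10 ms latency reuses the disk-seek figure above, while the 8 ms processing time and 2 ms overhead are assumed values chosen only for illustration.

```python
def serial_time_ms(n: int, latency_ms: float, processing_ms: float) -> float:
    """Design A: fetch n thumbnails one after another."""
    return n * (latency_ms + processing_ms)


def parallel_time_ms(latency_ms: float, processing_ms: float,
                     overhead_ms: float) -> float:
    """Design B: fetch all thumbnails at once (assumes perfect parallelism)."""
    return (latency_ms + processing_ms) + overhead_ms


# Assumed inputs: 30 thumbnails, 10 ms latency, 8 ms processing, 2 ms overhead.
design_a = serial_time_ms(30, latency_ms=10, processing_ms=8)
design_b = parallel_time_ms(latency_ms=10, processing_ms=8, overhead_ms=2)

print("| Design | Total time (ms) |")
print("|---|---:|")
print(f"| A: serial reads | {design_a:.0f} |")
print(f"| B: parallel reads | {design_b:.0f} |")
print(f"Speedup: ~{design_a / design_b:.0f}x")
```

With these inputs Design A takes 540 ms and Design B takes 20 ms, so the table makes the order-of-magnitude gap visible at a glance.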
Inspired by Jeff Dean’s Examples: Jeff Dean’s talk included illustrative calculations like how long to generate an image results page with different approaches. We can create similar scenario-based exercises:
Perhaps add a section “Back-of-the-Envelope Practice” with a few problems and solutions. For example: “Calculate how many servers are needed to handle 1 million concurrent users if each server handles 10k connections” (simple division).
Or “You design a system with 3 layers of caching. Estimate the cache hit rate needed in L1 to reduce downstream traffic by 50%.” Provide solution outlines.
These can be small and optional but will appeal to those who want to test themselves.
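Solution outlines for the two practice problems above could look like the sketch below. The function names are hypothetical, and the cache model assumes each layer's hit rate is independent of the others, which is a simplification.

```python
from typing import Sequence


def servers_needed(concurrent_users: int, connections_per_server: int) -> int:
    """Ceiling division: every partially filled server still counts as one."""
    return -(-concurrent_users // connections_per_server)


def downstream_fraction(hit_rates: Sequence[float]) -> float:
    """Fraction of requests that miss every cache layer and reach the backend."""
    fraction = 1.0
    for rate in hit_rates:
        fraction *= 1.0 - rate
    return fraction


# Problem 1: 1 million concurrent users, 10k connections per server.
print(servers_needed(1_000_000, 10_000))

# Problem 2: a single L1 layer with a 50% hit rate halves downstream traffic.
print(downstream_fraction([0.5]))

# Illustrative extension: three layers at 50% each (assumed values).
print(downstream_fraction([0.5, 0.5, 0.5]))
```

The first problem needs 100 servers, and the second shows that an L1 hit rate of 50% is exactly what halves the traffic reaching the next layer.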
Comparison Tables: Where appropriate, use tables to compare outcomes:
E.g., in an example, if we consider two architectures (say one uses heavy caching, one uses none), present a table of estimated throughput or costs for each, side by side. This visualizes the impact of a design decision.
Another idea: a scaling table: “If 1 server can handle X QPS, then N servers handle N*X QPS (assuming linear scaling). Table: for N=1,10,100, what’s the QPS and what new bottlenecks might appear?” This also teaches readers about non-linear effects.
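The scaling table could be generated rather than typed by hand. This is a minimal sketch assuming perfectly linear scaling; the per-server figure `X_QPS = 1_000` is a hypothetical placeholder, not a measured value, and the bottleneck column is left for the author to fill in.

```python
from typing import Iterable, List, Tuple


def scaling_rows(qps_per_server: int,
                 server_counts: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (servers, total QPS) pairs, assuming perfectly linear scaling."""
    return [(n, n * qps_per_server) for n in server_counts]


def to_markdown(rows: List[Tuple[int, int]]) -> str:
    """Render the rows as a Markdown table."""
    lines = ["| Servers (N) | Ideal total QPS |", "|---:|---:|"]
    for servers, qps in rows:
        lines.append(f"| {servers} | {qps:,} |")
    return "\n".join(lines)


X_QPS = 1_000  # hypothetical per-server capacity
rows = scaling_rows(X_QPS, [1, 10, 100])
print(to_markdown(rows))
```

Labelling the column "ideal" leaves room to discuss where real systems fall short of N*X.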
Integrate with Backlog Feedback: Perhaps some open issues requested clarifications on how to do these estimations. If such issues exist (“please add example calculations for X”), ensure we address those directly by including that example. Then we can close those issues referencing the new content.
Use Tools for Accuracy: While these are rough calculations, ensure they’re sensible. We might double-check with actual computation (maybe using a small Python script if needed for complex ones) to avoid any arithmetic mistakes in examples. However, part of the learning is showing that even order-of-magnitude correctness is fine.
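One possible shape for such a checking script is sketched below: it recomputes each claimed figure and flags only those that are off by a factor of ten or more, in keeping with the order-of-magnitude goal. The function names and the `CHECKS` list are hypothetical; the figures are the ones used in the examples above.

```python
def within_order_of_magnitude(claimed: float, computed: float,
                              max_factor: float = 10.0) -> bool:
    """True if the two positive values differ by less than max_factor."""
    if claimed <= 0 or computed <= 0:
        raise ValueError("both values must be positive")
    ratio = max(claimed, computed) / min(claimed, computed)
    return ratio < max_factor


# (description, value claimed in the text, value recomputed from inputs)
CHECKS = [
    ("URL storage in bytes", 100e9, 1_000_000_000 * 100),
    ("Average messages/sec", 5_800, 10_000_000 * 50 / 86_400),
    ("Peak messages/sec", 29_000, 5 * 10_000_000 * 50 / 86_400),
]


def failed_checks(checks):
    """Return the descriptions of claims that are off by 10x or more."""
    return [name for name, claimed, computed in checks
            if not within_order_of_magnitude(claimed, computed)]


if __name__ == "__main__":
    failures = failed_checks(CHECKS)
    print("All estimates plausible" if not failures else f"Check: {failures}")
```

A tighter `max_factor` (for example 2.0) could be used where an example is meant to be closer than an order of magnitude.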
Highlight Jeff Dean’s Advice: In the Primer’s narrative, mention the importance as advocated by industry veterans:
For instance, a callout quote: “If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations” – Jeff Dean. This drives home the point that understanding system internals (like those covered in the Primer) combined with doing the math is powerful.
Also note that Jeff Dean considers this one of the most important skills, to motivate readers to practice it.
Make it Engaging: We can present some calculations as challenges (like fill-in-the-blank). For example, pose a question in the text and hide the answer behind a Markdown `<details>` tag, which the reader can click to reveal. This way, they can try on their own first.
Collaboration & Accuracy: Encourage contributors to review these numbers. Even though they are approximate, having a second pair of eyes helps catch off-by-one errors or unreasonable assumptions. Perhaps have a specific label or PR tag for “needs math check”. Because these are educational, we don’t need exact precision, but we do need them to be plausible. Also, be open to corrections: if someone from the community points out that an assumption is outdated (e.g., SSDs may have changed some “numbers to know”), we should update accordingly. Use version control to track changes to these reference numbers (for example, an update in a couple of years if network speeds increase significantly across the board).

Trade-offs: One risk is that too many numbers could intimidate less engaged readers. We should integrate them in a non-intrusive way, possibly by clearly separating them, so that someone skimming for conceptual understanding can skip the math while it remains available for those who want to dig deeper. Another trade-off: numbers can become outdated (though basic physical limits tend to hold for a while, e.g. the speed difference between memory and disk). We mitigate this by focusing on orders of magnitude, which stay relevant longer, and by periodically refreshing the table (the community can help if something is glaringly old). The benefit of adding these is high: it trains a critical aspect of system design thinking and can set the Primer apart as not just giving answers but teaching how to validate those answers quantitatively.
