First Try

To get me calibrated, I was given a sample problem to work through in a mock interview: design Parkmobile. Assume all the towers etc. already exist.

I lost the scratchpad I used for that interview. What follows is the best recollection I have of how it went.

I started by identifying functionality. Parkmobile lets people park their cars in designated zones for specific periods of time, up to a cap. We presume there is also an API for law enforcement to check whether a given car is supposed to be in the current zone at the current time.

Then I defined entities and relationships. I made the simplifying assumption that a given user has only one car registered on Parkmobile (which, in my hazy memory, might even be true).

```mermaid
classDiagram
  class User {
    int id
    String first_name
    String last_name
    String license_plate
  }
  class Zone {
    int id
    geojson Perimeter
  }
  class Reservation {
    user User
    zone Zone
    datetime startTime
    int duration_seconds
  }
  User <|-- Reservation
  Zone <|-- Reservation
```
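
For concreteness, here's roughly the same model as Python dataclasses (a sketch; field names are lightly adapted from the diagram, and the one-car-per-user assumption shows up as `license_plate` living directly on `User`):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    id: int
    first_name: str
    last_name: str
    license_plate: str  # simplifying assumption: one car per user

@dataclass
class Zone:
    id: int
    perimeter_geojson: str  # GeoJSON polygon describing the zone boundary

@dataclass
class Reservation:
    user: User
    zone: Zone
    start_time: datetime
    duration_seconds: int
```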
Under prodding, I **identified workflows**:

1. Admins create a bunch of zones
2. A user creates an account
3. The user creates a reservation
4. Law enforcement, passing a car, checks the plate to see whether its reservation is current

This tells us an **access pattern**: we will need to be able to get things by plate, which will be important later.

Then I defined **API access**. I focused heavily on CRUD operations:

```
create_user(...)
get_user(...)
update_user(...)

create_zone(...)
get_zone(...)
update_zone(...)

create_reservation(...)
get_reservation(...)
update_reservation(...)

check_plate(...)
```

Handwaving a bit, I enumerated the attributes of those API calls and called them all <= 500B.

We now start with **sizing and usage assumptions**. Most of the time, people are not using Parkmobile. It's mostly useful in cities, and many people in cities don't use it as part of daily life; they drive from home to work and back. Given that, I think ~1% of the population of the US is using Parkmobile on any given day, usually once or twice. Making a **market sizing assumption** that we are building for the US and Europe, and that the EU is about as big as the US by population, that means about *7 million* new reservations per day, distributed throughout the day because of time zones.

Next: **data-sizing napkin math.** At ~86K seconds in a day, that is about 81 writes per second. Handwaving a write as 500B, that means about 40KB of data written per second. 7 million writes/day at 500B/write is 3.5GB of newly-written data per day, or about 1.3TB/year, or roughly 6TB over a 5-year auditing period. I observed an at-least 5:1 read:write ratio, because users and police would check a reservation multiple times even though it is only written once. This fits in one very large RDS box, but in the actual session I used different assumptions and got a larger number, skewing toward a NoSQL implementation.

I deliberated at length about the appropriate storage of the data. I broke it up into a storage block (primary and replica) per entity, with the understanding that app servers might make multiple requests to stitch the data together.
```mermaid
graph TD
  UP[User Primary] --> UB[User Backup]
  RP[Reservation Primary] --> RB[Reservation Backup]
  ZP[Zone Primary] --> ZB[Zone Backup]
```
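
As a sanity check on the napkin math above, here it is as a few lines of Python; the inputs are the stated assumptions, not measurements:

```python
# Napkin math from the assumptions above: ~7M reservations/day at ~500B each.
reservations_per_day = 7_000_000
bytes_per_write = 500
seconds_per_day = 86_400

writes_per_second = reservations_per_day / seconds_per_day       # ~81 writes/s
bytes_written_per_second = writes_per_second * bytes_per_write   # ~40 KB/s
bytes_written_per_day = reservations_per_day * bytes_per_write   # ~3.5 GB/day
bytes_written_per_year = bytes_written_per_day * 365             # ~1.3 TB/year
bytes_in_audit_window = bytes_written_per_year * 5                # ~6.4 TB over 5 years

print(f"{writes_per_second:.0f} writes/s, "
      f"{bytes_written_per_second / 1e3:.0f} KB/s, "
      f"{bytes_written_per_day / 1e9:.1f} GB/day, "
      f"{bytes_in_audit_window / 1e12:.1f} TB over 5 years")
```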

I observed that consistent hashing on the user's plate could be used to scale the storage blocks as needed, and was unlikely to produce hotspots.
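
To make "consistent hashing on the plate" concrete, here's a toy hash ring (illustrative only; node names and vnode count are made up): each plate hashes to a point on the ring and is stored on the first node clockwise from it, so adding a node only relocates a slice of the keys.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring mapping plates to storage nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` points on the ring to smooth out load.
        self.ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, plate: str) -> str:
        # First ring point clockwise from the plate's hash (wrapping around).
        idx = bisect.bisect(self.keys, self._hash(plate)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["reservations-1", "reservations-2", "reservations-3"])
print(ring.node_for("ABC1234"))  # a given plate always routes to the same node
```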

I added a stateless app tier in front of the storage tier to stitch data together in response to API requests and to handle concerns like rate limiting. I put a load balancer in front of the app tier.

```mermaid
flowchart TD
  LB[Load Balancer]
  subgraph app
    AT1[App Tier Box 1]
    AT2[App Tier Box 2]
  end
  subgraph storage
    UP[User Primary] --> UB[User Backup]
    RP[Reservation Primary] --> RB[Reservation Backup]
    ZP[Zone Primary] --> ZB[Zone Backup]
  end
  app --> storage
  LB --> app
```
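
The rate limiting I waved at would live in that stateless app tier. One common way to do it (illustrative; not something I actually sketched in the session) is a per-caller token bucket:

```python
import time

class TokenBucket:
    """Per-caller token bucket: allow `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per caller, held in memory on each app-tier box.
buckets: dict[str, TokenBucket] = {}

def check_rate_limit(caller_id: str) -> bool:
    bucket = buckets.setdefault(caller_id, TokenBucket(rate=5, capacity=10))
    return bucket.allow()
```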

Under prompting, I added a cache in front of the LB to reduce redundant access.

```mermaid
flowchart TD
  MC[Cache: Plate => Reservation]
  LB[Load Balancer]
  subgraph app
    AT1[App Tier Box 1]
    AT2[App Tier Box 2]
  end
  subgraph storage
    UP[User Primary] --> UB[User Backup]
    RP[Reservation Primary] --> RB[Reservation Backup]
    ZP[Zone Primary] --> ZB[Zone Backup]
  end
  app --> storage
  MC --> LB
  LB --> app
```

And the app tier would need to reach out and invalidate the cache for a plate on any write of a reservation with that plate. Problem observed by the interviewer: sure, we could use least-recently-used, least-frequently-used, time-based, or FIFO eviction strategies, but is this cache even going to get full? If 80% of use comes from 20% of users, they will just keep refreshing the cache with their recent reservations, and it might not even be that large.
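
For concreteness, here's the read-through / invalidate-on-write behavior in miniature, with plain dicts standing in for the cache and the storage tier (names and structure are mine, not from the session):

```python
# Plate -> Reservation cache with invalidate-on-write semantics.
# The two dicts stand in for the cache (e.g. memcached/Redis) and the storage tier.
cache: dict[str, dict] = {}
storage: dict[str, dict] = {}  # keyed by plate, matching the check_plate access pattern

def check_plate(plate: str):
    """Read path: serve from cache, fall back to storage and populate on a miss."""
    if plate in cache:
        return cache[plate]
    reservation = storage.get(plate)
    if reservation is not None:
        cache[plate] = reservation
    return reservation

def write_reservation(reservation: dict) -> None:
    """Write path: persist first, then drop the cache entry so stale data isn't served."""
    plate = reservation["license_plate"]
    storage[plate] = reservation
    cache.pop(plate, None)  # the app tier invalidates on every write for that plate
```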

The mock interview ended. I don’t think, compared to the broad candidate pool, I did a good job. But that is why this is the first calibration practice.

Things to improve on

Questions I have

In theory, there is a three-dimensional parameter space of number of connections, duration of connection, and result size of query that has a surface of maximum-stable-performance for a given RDS box. I don’t know what that frontier is shaped like. How many connections, how much throughput, running for how long before the machine is exhausted? What is this like for a few representative box types? I don’t know, and want to find out.
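
If I were to start mapping that surface empirically, I'd probably begin with something like the sketch below: against a Postgres-flavored RDS instance (the connection string, table, and query here are placeholders), sweep the number of concurrent connections and record sustained queries per second, then repeat with different result sizes and run durations to trace out the other two axes.

```python
import threading
import time

import psycopg2  # assumes a Postgres-flavored RDS endpoint

DSN = "host=... dbname=... user=... password=..."  # fill in for a real test
QUERY = "SELECT * FROM reservations WHERE license_plate = %s"  # placeholder schema
DURATION_S = 60

def worker(results: list, idx: int, stop: threading.Event) -> None:
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    count = 0
    while not stop.is_set():
        cur.execute(QUERY, ("ABC1234",))
        cur.fetchall()
        count += 1
    conn.close()
    results[idx] = count

def measure(num_connections: int) -> float:
    """Sustained queries/second across `num_connections` concurrent connections."""
    results = [0] * num_connections
    stop = threading.Event()
    threads = [
        threading.Thread(target=worker, args=(results, i, stop))
        for i in range(num_connections)
    ]
    for t in threads:
        t.start()
    time.sleep(DURATION_S)
    stop.set()
    for t in threads:
        t.join()
    return sum(results) / DURATION_S

# Sweep connection counts; repeat with different result sizes and run lengths
# to see where throughput flattens or the box falls over.
for n in (1, 4, 16, 64, 256):
    print(n, "connections ->", round(measure(n)), "queries/s")
```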