System Design Questions to ask
Fair warning, this is a long one. Brew a pot of coffee (or tea!) and read on.
This page is about questions to ask in a system design interview, not answers. In general, we should ask these questions. Be careful to ask questions, but don't dive too deep into a topic if you don’t know the answers. Topics can become very deep when tackling distributed systems.
Gathering requirements
At this stage we want to gather requirements so that we have a better understanding of the system we’re designing.
Functional requirements
- Who is it for, and why do we need to build it?
- What are the features we need to build to solve the users’ problem?
- Which problem areas matter most? Identify a few, then home in on 1 or 2
Non-functional requirements
- How many active users? Are there power users that upload most of the content or is it spread evenly?
- How do we partition this data and deal with noisy neighbours?
- Are users distributed across the world?
- Is there ever burst traffic?
- Are there users who upload single files that are gigabytes in size?
- Is the application used during work hours? When is there higher general traffic?
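A quick back-of-envelope calculation can turn the answers to these questions into numbers to design against. The sketch below is only illustrative: every figure in it (5 million daily active users, 10 requests per user per day, a 3x peak factor) is an assumed placeholder, not a number from a real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(daily_active_users: int, requests_per_user_per_day: float) -> float:
    """Average requests per second, assuming traffic is spread evenly over the day."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float) -> float:
    """Rough peak estimate: scale the average by an assumed burst multiplier."""
    return avg_qps * peak_factor


# All inputs below are hypothetical placeholders for the interviewer's answers.
avg = average_qps(daily_active_users=5_000_000, requests_per_user_per_day=10)
peak = peak_qps(avg, peak_factor=3)
print(f"average ~{avg:.0f} req/s, peak ~{peak:.0f} req/s")
```

Swapping in the real answers (work-hours-only traffic, a few power users, burst events) changes the peak factor far more than the average, which is why those questions are worth asking.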
Availability and consistency requirements
- Accuracy is different from consistency.
- e.g. an eventually consistent system is eventually accurate, whereas tolerating inaccuracy means the processed data doesn’t have to be the same as what the users provided.
- Does it need to be highly available?
- Touch on CAP theorem
- Durability - How much data is acceptable to lose?
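When "highly available" comes up, it helps to translate an availability target into a downtime budget. This is a small arithmetic sketch; the targets in the loop are common examples picked for illustration, not requirements taken from any particular system.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Maximum downtime per year that still meets the given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Example targets only; ask the interviewer which one applies.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/year")
```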
Response time and latency constraints
- How long can a response take before it is available to the user?
- Freshness - Actions performed to keep items fresh (jobs, etc.)
Define API
- What is needed to provide each feature?
- What will the following look like:
- Endpoints
- Request body
- HTTP methods used
- Input + outputs
- Dive into the API design
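One lightweight way to write the API down during the interview is as a small table of endpoints. The sketch below assumes an imagined video-sharing feature; the `Endpoint` class, the paths, and the field names are all hypothetical and only show the shape of the outline.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """One row of an API outline: HTTP method, path, inputs, and outputs."""
    method: str
    path: str
    request_body: dict = field(default_factory=dict)
    response_body: dict = field(default_factory=dict)


# Hypothetical endpoints for an imagined video-sharing feature.
API = [
    Endpoint("POST", "/videos", {"title": "str", "file": "bytes"}, {"video_id": "str"}),
    Endpoint("GET", "/videos/{video_id}", {}, {"title": "str", "url": "str"}),
    Endpoint("DELETE", "/videos/{video_id}"),
]

for e in API:
    print(f"{e.method:6} {e.path:22} in={sorted(e.request_body)} out={sorted(e.response_body)}")
```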
High level diagrams
- What clients will use this API? Logical blocks - API gateway, database, app server.
- Repeat per API
- Keep it simple for the diagram at the start
- Walk through the diagram and how it functions
After getting answers to these final questions, this is where we should create a high level diagram. Draw it quickly so that we can move on to diving into the choices behind it. Having a concrete diagram drawn up allows for discussion around our choices and prevents flip-flopping on them. This is a great chance to showcase a very presentable diagram. Don’t try to use vendor-specific images; rely on generic names instead (e.g. a generic DB icon instead of AWS RDS/Aurora).
Data model and schema
Here we define how we are going to store our data in our data store of choice. We make tradeoffs between a fixed schema and a schemaless store. Some good questions to ask are: Do we have all the entities we need here? Is anything missing? How are we going to identify the key properties of each entity? We should identify the relationships between these entities. Talk about how normalized our database should be if we choose a relational data store. The more normalized it is, the less redundancy we have in our data, but reads tend to get slower because they need more joins.
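The normalization tradeoff is easier to discuss with a concrete pair of schemas. Below is a minimal sketch using Python's built-in `sqlite3` module; the `users`, `videos`, and `videos_flat` tables are hypothetical examples. The normalized read needs a join, while the denormalized table reads from one place but has to touch every copy when a name changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each user's name is stored exactly once.
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE videos (id INTEGER PRIMARY KEY,
                         user_id INTEGER NOT NULL REFERENCES users(id),
                         title TEXT NOT NULL);

    -- Denormalized: the uploader's name is copied onto every video row.
    CREATE TABLE videos_flat (id INTEGER PRIMARY KEY,
                              uploader_name TEXT NOT NULL,
                              title TEXT NOT NULL);

    INSERT INTO users VALUES (1, 'ada');
    INSERT INTO videos VALUES (10, 1, 'intro'), (11, 1, 'part 2');
    INSERT INTO videos_flat VALUES (10, 'ada', 'intro'), (11, 'ada', 'part 2');
""")

# Normalized read: a join is needed to recover the uploader's name.
normalized = conn.execute(
    "SELECT v.title, u.name FROM videos v JOIN users u ON u.id = v.user_id ORDER BY v.id"
).fetchall()

# Denormalized read: a single table, no join.
flat = conn.execute("SELECT title, uploader_name FROM videos_flat ORDER BY id").fetchall()

# Renaming the user: one row changes when normalized, every copy when denormalized.
rows_normalized = conn.execute("UPDATE users SET name = 'grace' WHERE id = 1").rowcount
rows_flat = conn.execute(
    "UPDATE videos_flat SET uploader_name = 'grace' WHERE uploader_name = 'ada'"
).rowcount

print(normalized == flat, rows_normalized, rows_flat)
```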
Deep dive design
This is where we scale up the design. We should have numbers already (5 million users, etc.). We want to be asking these questions but also providing answers to them. Any of these questions could be fair game depending on what you mention.
The following are questions you may be asked, and if not you should ask and answer them yourself.
- As we scale, our database(s) will come under extreme load. How do we deal with this?
- Are we making good use of indexes?
- Are we using database partitions? Vertical vs horizontal partitions?
- We can break the tables down into smaller pieces, then attach them to a main table
- Vertical: split off a large column, like a blob, so it can be stored on a separate drive in its own tablespace
- Horizontal: split rows by range or list, like videos_200k, videos_400k, etc. (xxxk refers to the last sequential key in that table)
- What partitioning types are there? If we jump to partitioning, remember to talk about the tradeoffs!
- Range: dates, ids, etc. We could do this for video metadata, as older videos would not be watched as much
- List: discrete values like states or zip codes, e.g. this partition is for everyone in CA
- Hash: a hash function applied to the key picks the partition
- Partition Advantages:
- Improves query performance when accessing a single partition
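- A sequential scan of one small partition can beat a scattered index scan that is slow on a huge table
- Easy bulk loading by attaching a partition (support varies by engine, e.g. ATTACH PARTITION in PostgreSQL, EXCHANGE PARTITION in MySQL)
- Archiving old data that is barely accessed into cheap storage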
- Partition Disadvantages:
- Updates can move rows from one partition to another which can be slow
- Inefficient queries can scan all partitions
- Schema changes can be challenging if not planned for
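To make the three partitioning types concrete, here is a rough sketch of how a key could be routed to a partition under each one. The range bounds follow the videos_200k / videos_400k naming above; the `users_ca`, `users_other`, and `users_p0`-style names and the partition count of 4 are hypothetical. Real engines do this routing internally, so treat this as an illustration of the idea only.

```python
import bisect
import zlib

# Range: each bound is the last sequential id stored in that partition.
RANGE_BOUNDS = [200_000, 400_000, 600_000]


def range_partition(video_id: int) -> str:
    """Pick the first partition whose upper bound is >= the id."""
    i = bisect.bisect_left(RANGE_BOUNDS, video_id)
    if i == len(RANGE_BOUNDS):
        raise ValueError(f"no partition holds id {video_id}")
    return f"videos_{RANGE_BOUNDS[i] // 1000}k"


# List: discrete values map to a named partition (hypothetical names).
LIST_PARTITIONS = {"CA": "users_ca", "NY": "users_ny"}


def list_partition(state: str) -> str:
    """Look the value up; anything unlisted goes to an assumed default partition."""
    return LIST_PARTITIONS.get(state, "users_other")


def hash_partition(key: str, partitions: int = 4) -> str:
    """Hash: a stable hash of the key, modulo the partition count."""
    return f"users_p{zlib.crc32(key.encode()) % partitions}"


print(range_partition(150_000), list_partition("CA"), hash_partition("alice"))
```

A query that filters on the partition key only needs to look at the one partition returned here, which is where the single-partition performance gain comes from. A query that filters on something else has no such shortcut and ends up scanning all partitions.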
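- Are we using database sharding?
- Row-based split of a table across databases, like 200k rows in db 1, 200k in db 2, …
- We need a deterministic hashing scheme to ensure we connect to the right instance that has our data. Hash-then-modulo is the simplest version; consistent hashing goes further by limiting how many keys move when nodes are added or removed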
- Turn the input into a number (input -> binary -> int), then take it modulo the number of nodes. The remainder plus a given base number gives the port # to connect to
- Take in an input and get back the instance
- Hashing input(1) will always go to db 1, input(2) will always go to db 2, …
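The routing steps above can be sketched as follows, next to a minimal consistent-hash ring for comparison. `BASE_PORT`, the `db1`..`db4` node names, and the choice of 100 virtual nodes are assumptions made for the example. The comparison at the end counts how many keys change location when growing from 3 to 4 shards: with a uniform hash, modulo routing is expected to move roughly three quarters of the keys, while the ring moves roughly a quarter, and only onto the new node.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Turn any string into a large integer that is the same on every run."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


BASE_PORT = 5432  # hypothetical: shard i listens on BASE_PORT + i


def modulo_port(key: str, nodes: int) -> int:
    """Hash the key, take it modulo the node count, and offset into a port."""
    return BASE_PORT + stable_hash(key) % nodes


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (a sketch, not production code)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at many pseudo-random points.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[i][1]


keys = [f"user-{n}" for n in range(10_000)]

# Growing from 3 to 4 shards: count how many keys change location.
moved_modulo = sum(modulo_port(k, 3) != modulo_port(k, 4) for k in keys)
before = HashRing(["db1", "db2", "db3"])
after = HashRing(["db1", "db2", "db3", "db4"])
moved_ring = sum(before.node_for(k) != after.node_for(k) for k in keys)

print(f"modulo moved {moved_modulo} keys, ring moved {moved_ring} keys")
```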
- What is the difference between sharding and partitioning?
- Partitioning splits a table into multiple tables in the same database. The table name or schema changes
- Sharding splits a table into multiple tables across multiple database servers. Everything stays the same except for the database server
Addendum:
- Is there concurrency control on our database engine? Most engines have this.
- Is there data that requires two-phase locking?
- Are we using read/write replicas?
- Are we using the proper data store now? Is the database engine the right one for our needs? Why?
- What happens when we have deadlocks?
Creating deadlocks
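-- Assumes test has a unique (e.g. primary key) column, so inserting a value
-- that another open transaction has already inserted has to wait.
-- client 1
begin transaction;
insert into test values (20);  -- step 1: exclusive lock on 20, client 1 hasn't committed yet
-- client 2
begin transaction;
insert into test values (21);  -- step 2: exclusive lock on 21, client 2 hasn't committed yet
-- client 1
insert into test values (21);  -- step 3: waits for client 2
-- client 2
insert into test values (20);  -- step 4: waits for client 1
-- Both are waiting for each other and it creates a deadlock.
-- The engine aborts one of them. In this walkthrough the last one to enter
-- the deadlock is the one to fail; the choice of victim can vary by engine.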
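"Both are waiting for each other" is a cycle in a wait-for graph, and looking for such a cycle is one common way to detect a deadlock. The sketch below is a simplified illustration of that idea, not how any particular database engine implements it; `find_deadlock` and the client names are hypothetical, and it assumes each transaction waits on at most one other.

```python
def find_deadlock(waits_for: dict) -> list:
    """Return one cycle of transactions waiting on each other, or [] if none.

    waits_for maps a transaction to the single transaction it is blocked on.
    """
    for start in waits_for:
        path, current = [], start
        # Follow the chain of waits until it ends or revisits a transaction.
        while current in waits_for and current not in path:
            path.append(current)
            current = waits_for[current]
        if current in path:
            return path[path.index(current):]
    return []


# Client 1 holds 20 and waits for 21; client 2 holds 21 and waits for 20.
print(find_deadlock({"client1": "client2", "client2": "client1"}))

# No cycle: client 2 is not waiting on anyone, so client 1 just waits.
print(find_deadlock({"client1": "client2"}))
```

Once a cycle is found, aborting any one transaction in it breaks the deadlock, and the application should be ready to retry the transaction that failed.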