💡 Big Data Systems: Fan-out Design
The age-old system design behind Twitter's (X) timelines; the fastest way to read a parquet file, and more.
3 ways to read a parquet
Here are 3 different ways to read a parquet file locally.
Pandas: a common Python library for working with data sets.
DuckDB: an in-process SQL OLAP database system.
Polars: a lightning-fast DataFrame library written in Rust.
Install the libraries:
pip install pandas
pip install polars
pip install duckdb
Pandas
import pandas as pd

# Read the parquet file into a DataFrame using the pyarrow engine.
df = pd.read_parquet(path='file_name.parquet', engine='pyarrow')
print(df.head(100))
DuckDB
import duckdb as dd

file = 'file_name.parquet'
# DuckDB can query a parquet file directly by its path.
df = dd.sql(f"SELECT * FROM '{file}' LIMIT 100")
print(df)
Polars
import polars as pl

file = 'file_name.parquet'
# Polars reads the file into its own DataFrame type.
df = pl.read_parquet(source=file)
print(df.head(100))
Which one is faster?
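If you want to check on your own data, a rough timing sketch like the one below can help; file_name.parquet is a placeholder path, and the numbers will vary with file size, schema, hardware, and whether the file is already in the OS cache.

import time

import duckdb
import pandas as pd
import polars as pl

file = 'file_name.parquet'

def timed(label, fn):
    # Time a single read; repeat runs for anything more rigorous.
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

timed('pandas', lambda: pd.read_parquet(file, engine='pyarrow'))
timed('polars', lambda: pl.read_parquet(file))
timed('duckdb', lambda: duckdb.sql(f"SELECT * FROM '{file}'").df())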
Ever heard of the Fan-out Design Pattern?
Fan-out is an electronics engineering term that defines the number of input gates driven by the output of a single logic gate. In Big Data transaction processing systems, it refers to the number of requests that need to be made to other services in order to serve one incoming request.
The classic example of this problem is Twitter (X)'s home timeline, as published in Timelines at Scale (Raffi Krikorian, November 2012).
Twitter (X) has millions of users, and a user may follow thousands of people.
In a naive database design approach:
Each new tweet is inserted into a global collection of tweets.
When a user requests their home timeline, the system looks up all the people the user follows, fetches the tweets from each of them, and presents them ordered by time.
However, this presents a scalability problem: every home timeline query puts a massive load on the database, and reading a home timeline becomes a very expensive request to serve.
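For illustration, here is a minimal sketch of the naive read path, assuming a hypothetical relational schema with tweets and follows tables; every home-timeline request re-runs this join.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE tweets  (sender_id INT, text TEXT, created_at TEXT);
    CREATE TABLE follows (follower_id INT, followee_id INT);
""")

def home_timeline(user_id, limit=100):
    # Find everyone the user follows, pull all their tweets, and sort by time;
    # this expensive join runs on every single timeline read.
    return conn.execute("""
        SELECT t.sender_id, t.text, t.created_at
        FROM tweets t
        JOIN follows f ON f.followee_id = t.sender_id
        WHERE f.follower_id = ?
        ORDER BY t.created_at DESC
        LIMIT ?
    """, (user_id, limit)).fetchall()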
A better approach was to pre-compute and maintain a cache of each user's home timeline: whenever a user posts a tweet, look up everyone who follows that user and insert the new tweet into each follower's home timeline cache. This is the classic fan-out design.
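A minimal in-memory sketch of fan-out on write is shown below; the followers mapping and per-user timeline cache are hypothetical stand-ins for Twitter's real storage, not its actual implementation.

from collections import defaultdict, deque

followers = defaultdict(set)         # user_id -> ids of users who follow them
timeline_cache = defaultdict(deque)  # user_id -> pre-computed home timeline

def post_tweet(user_id, tweet):
    # One write is amplified into one cache insert per follower.
    for follower_id in followers[user_id]:
        timeline_cache[follower_id].appendleft((user_id, tweet))

def home_timeline(user_id, limit=100):
    # Reads become cheap: just slice the pre-computed cache.
    return list(timeline_cache[user_id])[:limit]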
The caching approach shifts a lot of work to write time, since a single new tweet is amplified and written to many places. At the time this data was published, there were about 4.6k tweets per second and an average of about 75 followers per user, which works out to roughly 345k timeline writes per second. The average also hides the fact that some users, such as celebrities and influencers, have millions of followers, so fanning out their tweets would require millions of writes for a single post, which is hard to do in a timely manner.

Twitter therefore used a hybrid approach: most users' tweets were fanned out to home timeline caches, but a small number of users with a very large following were exempted, and their tweets were merged into followers' timelines at read time instead. The hybrid approach delivered consistently good performance for many years.
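Extending the sketch above, the hybrid idea can be approximated as follows; the follower threshold and the read-time merge are illustrative guesses at the shape of the approach, not Twitter's actual cutoffs or code.

CELEBRITY_THRESHOLD = 1_000_000        # illustrative cutoff, not Twitter's real value
celebrity_tweets = defaultdict(deque)  # user_id -> their own recent tweets
following = defaultdict(set)           # user_id -> accounts the user follows

def post_tweet_hybrid(user_id, tweet):
    if len(followers[user_id]) >= CELEBRITY_THRESHOLD:
        celebrity_tweets[user_id].appendleft(tweet)  # skip the fan-out
    else:
        post_tweet(user_id, tweet)                   # normal fan-out on write

def home_timeline_hybrid(user_id, limit=100):
    # Merge the cached timeline with followed celebrities' tweets at read time
    # (ordering by timestamp is glossed over here).
    merged = list(timeline_cache[user_id])
    for followee in following[user_id]:
        if len(followers[followee]) >= CELEBRITY_THRESHOLD:
            merged.extend((followee, t) for t in celebrity_tweets[followee])
    return merged[:limit]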
The distribution of followers per user was a key load parameter for designing a scalable system in the example of Twitter(X).
Present-day Twitter sees about 6,000 tweets per second, an average of about 700 followers per user, and some users with more than 100 million followers, which shows that Twitter's load parameters have shifted significantly.
According to the latest update from Twitter Engineering, the 12-year-old fan-out service has been phased out and replaced by a more powerful in-house recommendation system that picks a handful of top tweets for your timeline.
Learn all about Twitter (X)'s new recommendation system.
Data Engineering Tip of the Day
Find the size of a Databricks Delta table, along with other table properties, using the DESCRIBE DETAIL command, which returns the table's metadata.
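For example, in a Databricks notebook (where spark is pre-defined) something like the following works; the table name is a placeholder.

# sizeInBytes and numFiles are among the columns DESCRIBE DETAIL returns for a Delta table.
details = spark.sql('DESCRIBE DETAIL my_catalog.my_schema.my_table')
details.select('name', 'sizeInBytes', 'numFiles', 'lastModified').show(truncate=False)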