How to build
a SQL-based data warehouse
for a trillion rows in Python
Ville Tuulos, Principal Engineer @ AdRoll
ville.tuulos@adroll.com
What?
DeliRoll @ AdRoll
Low-latency, in-memory,
fully SQL-compliant data warehouse
powers all internal analytics at AdRoll
Core parts in Python
AdRoll data firehose: New data daily
50 TB raw data
80 billion events
Append only, data goes back to 2012
Granular data, hundreds of dimensions
Largest data cube: 4 trillion log lines
Backend
Raw data in S3
⇣
Elastic MapReduce
⇣
Encoded data in S3
⇡
High-mem EC2 instances (several in parallel), each running:
DeliRoll Master
⇣
DeliRoll Workers
Frontend
Tableau
⇣
PostgreSQL
⇣
FDW/Multicorn
⇣
DeliRoll Master
⇣
DeliRoll Workers
⇣
Numba / LLVM
⇣
Encoded data
Why?
Querying TBs of immutable data is easy!
Client
Query ⇣ ⇡ Reduce
One EC2 instance running PostgreSQL per day of data:
Day 1, Day 2, Day 3, Day 4, Day 5, …, Day N
Data volume is an operational issue
Not an engineering challenge
Users care about information
not data
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
3200 bits of data
400 bits of information
Dense vs Sparse Information
Same results, different encodings
Users care about latency
Bottleneck: CPU throughput (GB/s)
Solution: Reduce the amount of data per CPU
Increase the number of CPUs
⇣
Operator headache
Increase information density
⇣
Developer headache
CPU throughput depends on information density
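A toy NumPy sketch of the point above (data is hypothetical): the same values stored in a narrower integer type halve the bytes the CPU has to stream, while the query result stays identical.

```python
import numpy as np

# The same one million values at two information densities.
costs64 = np.arange(1_000_000, dtype=np.int64)   # 8 bytes per value
costs32 = costs64.astype(np.uint32)              # 4 bytes per value, same values

print(costs32.nbytes, costs64.nbytes)            # half the bytes to scan
print(int(costs32.sum()) == int(costs64.sum()))  # same result
```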
Neat, but I'm using Python
C vs Numba
If you can compress your data by 30%, you lose nothing
Focus on information density!
Easiest path to low latency
Pay a higher upfront cost, reap benefits over time
Information density is hard to outsource
Renaissance of Custom Solutions
User: Data has become a key differentiator
Developer: Advanced high-level languages
Operator: Infrastructure as a Service
Buying a 3rd party Big Data solution is like getting a puppy
Cute, but it comes with an operational burden
How?
What's the fastest operation in modern computers?
SIMD vector instructions: Operate on 256 bits at a time
What's the most efficient way to access data?
Plain integers in memory (cache)
That is, a matrix
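A minimal sketch of that idea, using hypothetical encoded events: with plain integers laid out contiguously in a matrix, a WHERE clause becomes an elementwise integer comparison that NumPy (and the CPU's vector units) can process in bulk.

```python
import numpy as np

# Hypothetical encoded events: one row per event, plain integers in memory.
# Columns: [cookie_set, country_id, cost]
m = np.array([[1, 0, 840],
              [0, 1, 23],
              [1, 2, 11]], dtype=np.int32)

# Filtering is a bulk comparison over a contiguous integer column:
mask = m[:, 0] == 1           # rows where cookie_set is true
print(int(m[mask, 2].sum()))  # total cost over matching rows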
Raw Data: Homogeneous Events
{"cookie_set": true, "country": "Sweden", "cost": 840}
{"cookie_set": false, "country": "UK", "cost": 23}
{"cookie_set": true, "country": "China", "cost": 11}
Remove Redundant Field Names
[true, "Sweden", 840]
[false, "UK", 23]
[true, "China", 11]
Encode Values in Binary
[integer, integer, integer]
[integer, integer, integer]
[integer, integer, integer]
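The two steps above can be sketched in a few lines of Python (the events and the `encode_country` helper are hypothetical; a real encoder handles hundreds of dimensions): strings are dictionary-encoded to small integers, field names are dropped, and each event becomes a fixed-width row of plain integers.

```python
# Hypothetical raw events, as on the slides above.
events = [
    {"cookie_set": True,  "country": "Sweden", "cost": 840},
    {"cookie_set": False, "country": "UK",     "cost": 23},
    {"cookie_set": True,  "country": "China",  "cost": 11},
]

country_ids = {}  # string value -> small integer id (dictionary encoding)

def encode_country(value):
    return country_ids.setdefault(value, len(country_ids))

# Field names are dropped; every value becomes a plain integer.
rows = [(int(e["cookie_set"]), encode_country(e["country"]), e["cost"])
        for e in events]
print(rows)  # [(1, 0, 840), (0, 1, 23), (1, 2, 11)]
```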
It's a Sparse Matrix!
Sparse Matrix as a Columnar Database
aggregate rows that match certain criteria
SELECT sum(*) FROM matrix WHERE column=value
⇣
matrix.T.dot(index-vector)
where index-vector has 1 for each row with column=value, 0 otherwise
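The mapping above can be sketched with NumPy on a toy matrix (the encoded columns are hypothetical): building the index vector from the WHERE clause, then aggregating every column at once with a single matrix-vector product.

```python
import numpy as np

# Hypothetical encoded matrix: columns [cookie_set, country_id, cost]
matrix = np.array([[1, 0, 840],
                   [0, 1, 23],
                   [1, 2, 11]], dtype=np.int64)

# WHERE cookie_set = 1  ->  1 for each matching row, 0 otherwise
index_vector = (matrix[:, 0] == 1).astype(np.int64)

# One matrix-vector product sums every column over the matching rows:
sums = matrix.T.dot(index_vector)
print(sums.tolist())  # [count-ish, country-id sum, cost sum] -> [2, 2, 851]
```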