How to build a SQL-based data warehouse for a trillion rows in Python


Ville Tuulos
Principal Engineer @ AdRoll

ville.tuulos@adroll.com

What?

DeliRoll @ AdRoll

Low latency, in-memory

fully SQL-compliant data warehouse

powers all internal analytics at AdRoll

Core parts in Python

AdRoll data firehose: New data daily

50 TB raw data

80 billion events

Append only, data goes back to 2012

Granular data, hundreds of dimensions

Largest data cube: 4 trillion log lines

Backend

Raw data in S3
→ Elastic MapReduce
→ Encoded data in S3
→ Multiple high-mem EC2 instances, each running a DeliRoll Master and DeliRoll Workers

Frontend

Tableau
→ PostgreSQL (FDW / Multicorn)
→ DeliRoll Master
→ DeliRoll Workers (Numba / LLVM)
→ Encoded data

Why?

Querying TBs of immutable data is easy!

Client
Query ⇣ ⇡ Reduce
One EC2 instance running PostgreSQL per day of data (Day 1, Day 2, …, Day N)
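
A minimal sketch of this scatter/gather pattern, assuming one PostgreSQL endpoint per day and a simple additive aggregate; the host names and database name are hypothetical:

from concurrent.futures import ThreadPoolExecutor
import psycopg2

DAILY_HOSTS = ['day1.example.com', 'day2.example.com']  # hypothetical, one per day

def query_one(host, sql):
    # run the same query against one day's PostgreSQL instance
    with psycopg2.connect(host=host, dbname='events') as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchone()[0]

def scatter_gather(sql):
    # fan the query out to every day, then reduce (sum) on the client
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(lambda h: query_one(h, sql), DAILY_HOSTS))

total = scatter_gather("SELECT sum(cost) FROM events")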

Data volume is an operational issue

Not an engineering challenge

Users care about information

not data

{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
    

3200 bits of data

400 bits of information

Dense vs Sparse Information

Same results, different encodings

[Chart: execution time (log scale) for perfect encoding vs. 128-byte and 32 KB per-row overhead]

Users care about latency

Bottleneck: CPU throughput (GB/s)

Solution: Reduce the amount of data per CPU

Increase the number of CPUs → operator headache
Increase information density → developer headache

CPU throughput depends on information density

[Chart: integer operations / µsec vs. vector size in KB (log scale), with L3 cache and total RAM boundaries marked]

Neat, but I'm using Python

C vs Numba

[Chart: execution time (µsec) vs. vector size in KB (log scale), C vs. Numba]
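
A minimal sketch of the kind of kernel behind such a comparison, assuming Numba is installed; absolute numbers will vary by machine:

import numpy as np
from numba import njit

@njit
def sum_positive(vec):
    # tight loop that Numba compiles to native code via LLVM
    total = 0
    for x in vec:
        if x > 0:
            total += x
    return total

vec = np.random.randint(-100, 100, size=1_000_000)
print(sum_positive(vec))  # first call compiles; later calls run at C-like speed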

If you can compress your data by 30%, you lose nothing by choosing Numba over C

Focus on information density!

Easiest path to low latency

Pay a higher upfront cost, reap benefits over time

Information density is hard to outsource

Renaissance of Custom Solutions

User: Data has become a key differentiator

Developer: Advanced high-level languages

Operator: Infrastructure as a Service

Buying a 3rd party Big Data solution is like getting a puppy

Cute, but it comes with an operational burden

How?

What's the fastest operation in modern computers?

SIMD vector instructions: Operate on 256 bits at a time

What's the most efficient way to access data?

Plain integers in memory (cache)

That is, a matrix

Raw Data: Homogeneous Events


{"cookie_set":true,"country":"Sweden","cost":840}
{"cookie_set":false,"country":"UK","cost":23}
{"cookie_set":true,"country":"China","cost":11}

Remove Redundant Field Names


{"cookie_set":true,"country":"Sweden","cost":840}
{"cookie_set":false,"country":"UK","cost":23}
{"cookie_set":true,"country":"China","cost":11}

Encode Values in Binary


{"cookie_set": integer ,"country": integer ,"cost": integer }
{"cookie_set": integer ,"country": integer ,"cost": integer }
{"cookie_set": integer ,"country": integer ,"cost": integer }

It's a Sparse Matrix!
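
A sketch of the encoding pipeline above, assuming dictionary encoding for the string values; a minimal illustration, not the actual DeliRoll encoder:

import numpy as np

events = [
    {"cookie_set": True, "country": "Sweden", "cost": 840},
    {"cookie_set": False, "country": "UK", "cost": 23},
    {"cookie_set": True, "country": "China", "cost": 11},
]

# dictionary-encode countries into small integer ids
countries = sorted({e["country"] for e in events})
country_id = {c: i for i, c in enumerate(countries)}

# one row per event, one column per field: all plain integers
matrix = np.array([[int(e["cookie_set"]), country_id[e["country"]], e["cost"]]
                   for e in events])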

Sparse Matrix as a Columnar Database

aggregate rows that match certain criteria

SELECT sum(*) FROM matrix WHERE column=value

matrix.T.dot(index_vector)

where index_vector has 1 for each row with column=value, 0 otherwise
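
Using the matrix built in the sketch above, the WHERE clause becomes a 0/1 index vector and the aggregation a single dot product:

# SELECT sum(*) FROM matrix WHERE cookie_set = 1
index_vector = (matrix[:, 0] == 1).astype(matrix.dtype)
print(matrix.T.dot(index_vector))  # per-column sums over matching rows -> [2, 1, 851]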

Sparse Matrix as a Columnar Database

include only a subset of columns

SELECT sum(a), sum(b) FROM matrix WHERE ...

matrix.dot(diag(chosen_columns)).T.dot(index_vector)

where chosen_columns has 1 for each chosen column, 0 otherwise
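
Continuing the same sketch, selecting only the cost column for rows where cookie_set = 1:

# SELECT sum(cost) FROM matrix WHERE cookie_set = 1
chosen_columns = np.array([0, 0, 1])  # keep only the cost column
print(matrix.dot(np.diag(chosen_columns)).T.dot(index_vector))  # -> [0, 0, 851]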

Great! Let's ship it!

1 billion rows

100 columns

10% density

32-bit Column ID, 32-bit value

596 GB matrix

Rethink encoding

Variable-length Encoding for Values

[Chart: number of bits per item across rows for optimal, variable-length, and max (fixed-width) encoding]

48% smaller!
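
One common variable-length scheme is a LEB128-style varint, where small values take fewer bytes; a minimal sketch, not the actual DeliRoll encoding:

def encode_varint(n):
    # 7 bits of payload per byte; the high bit marks continuation
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data):
    n = shift = 0
    for byte in data:
        n |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return n
        shift += 7

assert decode_varint(encode_varint(840)) == 840
assert len(encode_varint(23)) == 1  # small values: 1 byte instead of 4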

Regularities in Data

Try also (see the sketch after this list):

Run-length Encoding

Dictionary Encoding

Delta Encoding

Probabilistic Data Structures
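
Rough one-line illustrations of the first three encodings above, assuming a NumPy array of values; probabilistic structures such as HyperLogLog are a separate topic:

import numpy as np
from itertools import groupby

values = np.array([5, 5, 5, 9, 9, 7])

# Run-length encoding: store (value, run length) pairs
rle = [(v, len(list(g))) for v, g in groupby(values)]  # [(5, 3), (9, 2), (7, 1)]

# Dictionary encoding: store small ids plus a lookup table
uniq, ids = np.unique(values, return_inverse=True)     # uniq=[5 7 9], ids=[0 0 0 2 2 1]

# Delta encoding: store differences, useful for sorted/timestamp columns
deltas = np.diff(values, prepend=values[0])            # [0 0 0 4 0 -2]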

Finally, on run-time

JIT-compile a query-specific decoder using Numba
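
A minimal sketch of the idea, assuming dictionary-encoded columns stored as NumPy arrays; the filter value is baked in as a compile-time constant via a closure (hypothetical layout, not the actual DeliRoll decoder):

import numpy as np
from numba import njit

def make_decoder(country_id):
    # generate a decoder specialized for one WHERE predicate;
    # Numba compiles a separate native function per closed-over constant
    @njit
    def sum_cost(countries, costs):
        total = 0
        for i in range(len(countries)):
            if countries[i] == country_id:
                total += costs[i]
        return total
    return sum_cost

countries = np.array([7, 3, 7], dtype=np.int32)
costs = np.array([840, 23, 11], dtype=np.int32)
query = make_decoder(7)          # compile once per distinct query
print(query(countries, costs))   # -> 851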

Thank You!


AdRoll is Hiring


Python, Erlang, Big Data, AWS

1.5 TB new data / engineer / day!