How to build
a SQL-based data warehouse
for a trillion rows in Python
Ville Tuulos, Principal Engineer @ AdRoll
ville.tuulos@adroll.com
What?
DeliRoll @ AdRoll
Low-latency, in-memory,
fully SQL-compliant data warehouse
powers all internal analytics at AdRoll
Core parts in Python
AdRoll data firehose: New data daily
50 TB raw data
80 billion events
Append only, data goes back to 2012
Granular data, hundreds of dimensions
Largest data cube: 4 trillion log lines
Backend
Raw data in S3
⇣
Elastic MapReduce
⇣
Encoded data in S3
⇡
High-mem EC2 instances (several in parallel), each running:
DeliRoll Master
⇣
DeliRoll Workers
Frontend
Tableau
⇣
PostgreSQL
⇣
FDW/Multicorn
⇣
DeliRoll Master
⇣
DeliRoll Workers
⇣
Numba / LLVM
⇣
Encoded data
Why?
Querying TBs of immutable data is easy!
Client
Query ⇣ ⇡ Reduce
One EC2 instance running PostgreSQL per day of data:
Day 1, Day 2, Day 3, Day 4, Day 5, …, Day N
Data volume is an operational issue
Not an engineering challenge
Users care about information
not data
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
{device-id: 'Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X)', campaign-id: 'AB32E0'}
3200 bits of data
400 bits of information
Dense vs Sparse Information
Same results, different encodings
Users care about latency
Bottleneck: CPU throughput (GB/s)
Solution: Reduce the amount of data per CPU
Increase the number of CPUs
⇣
Operator headache
Increase information density
⇣
Developer headache
CPU throughput depends on information density
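A toy NumPy sketch of the point above (data is hypothetical): the same values stored in a narrower integer type halve the bytes the CPU has to stream, while the query result stays identical.

```python
import numpy as np

# The same one million values at two information densities.
costs64 = np.arange(1_000_000, dtype=np.int64)   # 8 bytes per value
costs32 = costs64.astype(np.uint32)              # 4 bytes per value, same values

print(costs32.nbytes, costs64.nbytes)            # half the bytes to scan
print(int(costs32.sum()) == int(costs64.sum()))  # same result
```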
Neat, but I'm using Python
C vs Numba
If you can compress your data by 30%, you lose nothing
Focus on information density!
Easiest path to low latency
Pay a higher upfront cost, reap benefits over time
Information density is hard to outsource
Renaissance of Custom Solutions
User: Data has become a key differentiator
Developer: Advanced high-level languages
Operator: Infrastructure as a Service
Buying a 3rd party Big Data solution is like getting a puppy
Cute, but it comes with an operational burden
How?
What's the fastest operation in modern computers?
SIMD vector instructions: Operate on 256 bits at a time
What's the most efficient way to access data?
Plain integers in memory (cache)
That is, a matrix
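A minimal sketch of that idea, using hypothetical encoded events: with plain integers laid out contiguously in a matrix, a WHERE clause becomes an elementwise integer comparison that NumPy (and the CPU's vector units) can process in bulk.

```python
import numpy as np

# Hypothetical encoded events: one row per event, plain integers in memory.
# Columns: [cookie_set, country_id, cost]
m = np.array([[1, 0, 840],
              [0, 1, 23],
              [1, 2, 11]], dtype=np.int32)

# Filtering is a bulk comparison over a contiguous integer column:
mask = m[:, 0] == 1           # rows where cookie_set is true
print(int(m[mask, 2].sum()))  # total cost over matching rows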
Raw Data: Homogeneous Events
{"cookie_set": true, "country": "Sweden", "cost": 840}
{"cookie_set": false, "country": "UK", "cost": 23}
{"cookie_set": true, "country": "China", "cost": 11}
Remove Redundant Field Names
[true, "Sweden", 840]
[false, "UK", 23]
[true, "China", 11]
Encode Values in Binary
[integer, integer, integer]
[integer, integer, integer]
[integer, integer, integer]
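The two steps above can be sketched in a few lines of Python (the events and the `encode_country` helper are hypothetical; a real encoder handles hundreds of dimensions): strings are dictionary-encoded to small integers, field names are dropped, and each event becomes a fixed-width row of plain integers.

```python
# Hypothetical raw events, as on the slides above.
events = [
    {"cookie_set": True,  "country": "Sweden", "cost": 840},
    {"cookie_set": False, "country": "UK",     "cost": 23},
    {"cookie_set": True,  "country": "China",  "cost": 11},
]

country_ids = {}  # string value -> small integer id (dictionary encoding)

def encode_country(value):
    return country_ids.setdefault(value, len(country_ids))

# Field names are dropped; every value becomes a plain integer.
rows = [(int(e["cookie_set"]), encode_country(e["country"]), e["cost"])
        for e in events]
print(rows)  # [(1, 0, 840), (0, 1, 23), (1, 2, 11)]
```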
It's a Sparse Matrix!
Sparse Matrix as a Columnar Database
aggregate rows that match certain criteria
SELECT sum(*) FROM matrix WHERE column=value
⇣
matrix.T.dot(index-vector)
where index-vector has 1 for each row with column=value, 0 otherwise
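The mapping above can be sketched with NumPy on a toy matrix (the encoded columns are hypothetical): building the index vector from the WHERE clause, then aggregating every column at once with a single matrix-vector product.

```python
import numpy as np

# Hypothetical encoded matrix: columns [cookie_set, country_id, cost]
matrix = np.array([[1, 0, 840],
                   [0, 1, 23],
                   [1, 2, 11]], dtype=np.int64)

# WHERE cookie_set = 1  ->  1 for each matching row, 0 otherwise
index_vector = (matrix[:, 0] == 1).astype(np.int64)

# One matrix-vector product sums every column over the matching rows:
sums = matrix.T.dot(index_vector)
print(sums.tolist())  # [count-ish, country-id sum, cost sum] -> [2, 2, 851]
```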