Understanding Record Shredding storing nested data in columns
Starting with a ton of protobuf messages, I was interested in querying over them using Hive and Parquet. Parquet describes a columnar storage format for data that is not necessarily tabular, for example with nested message schemas. Columnar storage can be a good choice when queries don’t read all columns of the data. How can you store these nested messages in columns, preserving the structure?
The original work for this came from Google. The Dremel paper, published in 2010, describes the storage format and query engine that became the basis for Parquet.
This post is a view over the detail, not what you’d actually need to run it. This is based heavily on the Twitter blog post describing the same, and is only an attempt to describe the process for my own understanding. This isn’t intended to be enough to reproduce the full detail, but should be a start at getting an intuition about the approach.
Intuition and running example
The term record shredding, or striping, comes from the idea that we’re taking apart the records from their per-message encoding. We lose locality per message in favour of locality per field.
The principle intuition is to think of the message format as a tree, rooted at the message type with child nodes for each component.
- leaf nodes are primitive types – actual data to store
- non-leaf nodes are types containing other data
Note that each of these may be
NULL if the node (or any parent) is optional or repeated. Each node in the tree has a path through the tree with a segment per node on the path, for example
With some repeated elements, a path might refer to multiple fields within a single message – in these examples, multiple phone numbers stored in the address book, or multiple forward links from a document.
The unit of columnar storage is these paths. The data stored for each such path is the projection of a message onto this path – it may be
NULL depending on optionality of nodes above it.
Given that fields may be optional or repeated we need to store some additional information to preserve the structure, specifically two extra integer values.
We store two pieces of extra information for each value at each path. These allow the original messages to be reconstructed completely from these partial projections.
- definition level that describes how many optional elements are undefined
- repetition level that describes the structure of repeated elements
How many fields in [path] p that could be undefined (because they are optional or repeated) are actually present in the record?
— definition from the paper, sec 4.1
The definition level of a value at a path handles elements that might be
NULL. It means “how much of the path to the node is defined?”.
For leaves under a node that can be missing (because it’s optional or repeated), the leaf can be undefined from a certain level. The leaf may be undefined, or its parent, or some other node further up the tree. The definition level records which segment along the path is the first to be undefined.
As an optimisation, required nodes don’t need to count towards the definition level, so for counting segments of the path, required nodes are ignored.
It tells us at what repeated field in the field’s path the value has repeated
— definition from the paper, sec 4.1
The repetition level handles repeated elements. This is best viewed from the point of view of re-assembling the records from the columnar store into the original message.
A repeated element means we’re dealing with lists of subtrees rooted at that node. When reconstructing the message, a value of n for the repetition level means “start a new list of repeated elements at level n in the tree”
Storage and re-assembly
In this format, a path in a message is stored simply as the series of values at that path with their extra fields.
Given that a single message may have many values at one path, many records may be stored per column for that path. Reading from this stored format takes the form of reading through the separate columns in a coordinated way, something like having a cursor in the file for each path.
A finite state machine can be constructed for a query, based on the columns that it reads, in which states describe which path from which to consume a value. Transitions are given by the repetition levels, and once you transition back to the root state, you’ve read a complete message.
An FSM state corresponds to a field reader for each selected field. […] we look at the next repetition level to decide what next reader to use. The FSM is traversed […] once for each record.
— definition from the paper, sec 4.3