Concepts in DataFlow basic definitions

Google's 2015 paper on the Dataflow model describes general solutions to general data pipeline processing problems. The terms they use have been helpful to me in understanding patterns in these problems.

Google Cloud Dataflow

Google published a paper in 2015 describing a general model of data pipeline computation, discussing how to balance correctness, latency, cost, scale, boundedness (or otherwise) and ordering (or lack thereof!).

This was the foundation of an internal implementation that eventually surfaced as the Apache Beam APIs, and a commercially-available implementation in Google Cloud Dataflow.

I’ve found the concepts introduced in the paper as a great foundation for thinking about these kinds of problems. Separating the ideas of event-time and processing-time, and thinking explicitly about triggers for processing of late events, has been very instructive. Here I want to summarise some of these concepts.

Time

Event-time is the time at which an event actually occurred. This is a record of the time that the event occurred according to whatever system generated the event.

Processing-time is the time at which an event is observed during processing. There might be many such processing times, as data is processed at stages through a pipeline.

A watermark is the lower bound on event-times that have been processed. It can be interpreted as “we’ve received all events prior to this time”. When events can be received with arbitrary delay this obviously cannot be true, and in practice it is often heuristically established.

The intuition is that the processing-time is later than the event-time, and that the watermark is far enough in the past that we expect no (or at least few) events further back in the past.

Windowing

A window determines where in event-time data are grouped for processing. It’s a finite chunk of a dataset over which to run some processing (see Li 2005 for more here).

There are a few different kinds of windows that come up commonly.

Windows are called unaligned if a single window doesn’t span the entirety of a data source. Session windows per user, for example, don’t include the whole data space for its event-time range.

Triggering

A trigger determines where in processing-time the results of processing over a window are emitted. A trigger stimulates the output of processing of a specific window of event-time.

When events arrive late, and further processing happens after an earlier batch was processed, how do the new results relate to the last ones? There are different approaches.