Where and how in dataflow is late data being handled? How can I configure in whi...

namibj · on Jan 22, 2021

Late data is very deliberately not handled. The reasoning for that is best available at [0]. Now, there are ways [1] to handle bitemporal data, but they have fairly significant issues in ergonomics and performance, due to the additional work needed to allow the bitemporal aggregations.

As for the data persistence, that's something the underlying approach for the aggregations could handle relatively well with LSM trees [2] (back then, `Aggregation` was called `ValueHistory`).

Along with syncing that state to replicated storage, it should not be a big problem to make it recover quickly from a dead node.

[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2020... [1]: https://github.com/frankmcsherry/blog/blob/master/posts/2018... [2]: https://github.com/TimelyDataflow/differential-dataflow/issu...

james_woods · on Jan 22, 2021

Taken from [0] If you wanted to use the information above to make decisions, it could often be wrong. Let's say you want to wait for it to be correct; how long do you wait?

I know how long I want to wait, 30 minutes in one of my cases as I know that I've seen 95% of the important data by then. In the streaming world there is _always_ late data so being able to tell what should happen when the rest (5%) arrives is crucial for me.

This differs from use-case to use-case for me and being able to configure this and handling out-of-order data at scale is key for me when selecting a framework for stream processing. Apache Beam and Apache Flink do this very well.

Taken from [1]: Apache Beam has some other approach where you use both and there is some magical timeout and it only works for windows or something and blah blah blah... If any of you all know the details, drop me a note. It obviously only works when you window your data as it needs to fit in memory. The event-time and system-time concept from Beam and Flink are very similar, also the watermark approach. Thank you for sharing the links, For me it is now clearer where the difference lies between differential-dataflow and stream-processing frameworks (which also offer SQL and even ACID conformity!). I'm using Beam/Flink in production and missing out on one of these mentioned points is a deal-breaker for me.

jamii · on Jan 22, 2021

What do you usually want to happen with late data? In DD you have the option to ignore it at the source but not to update already-emitted results. Is the latter important for you?

namibj · on Jan 23, 2021

In DDflow, you could also use the `Product` timestamp combinator, and track both the time that event came from, as well as the time you ingested it. You can then make use of the data as soon as the frontier says it's current for the relevant ingestion timestamp, and occasionally advance the frontier for the origin timestamp at the input, so that arrangements can compact historic data. An affected example would be a query that counts "distinct within some time window". It only has to keep that window's `distinct on` values around as long as you can still feed events with timestamps in that window. If you are no longer able to, the values of the `distinct on` become irrelevant for this operator, and only the count for that window needs to be retained.

james_woods · on Jan 23, 2021

If I have to report transaction (aka money) then yes. I need to update already emitted results. If it's just a log-based metric for internal use then no.

What I would like to have is a choice - and Apache Beam for example lets you choose this.