Module dedup

Utilities to remove duplicate rows from a sorted batch.

Structs

BatchLastRow 🔒
State of the last row in a batch for dedup.
DedupMetrics 🔒
Metrics for deduplication.
DedupReader 🔒
A reader that deduplicates sorted batches from a source according to the dedup strategy.
LastFieldsBuilder 🔒
Buffer to store fields in the last row to merge.
LastNonNull 🔒
Dedup strategy that keeps the last non-null field for the same key.
LastNonNullIter 🔒
An iterator that deduplicates rows using the LastNonNull strategy. The input iterator must return sorted batches.
LastRow 🔒
Dedup strategy that keeps the row with the latest sequence for each key.
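
The difference between the LastRow and LastNonNull strategies can be illustrated with a minimal sketch. It is not the module's implementation: the `Row` type, the function names, and the row-at-a-time processing are hypothetical simplifications. It also assumes the input is sorted by key, with the newest sequence first within each key.

```rust
// Hypothetical simplified row; not a type from this module.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    key: u32,                 // stands in for the full dedup key
    sequence: u64,            // larger means newer
    fields: Vec<Option<i64>>, // None models a null field
}

/// LastRow-style dedup: keep only the newest row of each key.
/// Assumes `rows` is sorted by key, then by sequence descending.
fn dedup_last_row(rows: &[Row]) -> Vec<Row> {
    let mut out: Vec<Row> = Vec::new();
    for row in rows {
        match out.last() {
            // Same key as the row already kept: this one is older, skip it.
            Some(prev) if prev.key == row.key => {}
            _ => out.push(row.clone()),
        }
    }
    out
}

/// LastNonNull-style dedup: start from the newest row of each key and
/// fill its null fields from older rows with the same key.
fn dedup_last_non_null(rows: &[Row]) -> Vec<Row> {
    let mut out: Vec<Row> = Vec::new();
    for row in rows {
        match out.last_mut() {
            Some(prev) if prev.key == row.key => {
                for (dst, src) in prev.fields.iter_mut().zip(&row.fields) {
                    // Only fill fields that are still null, so newer values win.
                    if dst.is_none() {
                        *dst = *src;
                    }
                }
            }
            _ => out.push(row.clone()),
        }
    }
    out
}
```

With two rows for the same key, the newest holding `[None, Some(30)]` and the older `[Some(20), Some(21)]`, the first function keeps `[None, Some(30)]` while the second yields `[Some(20), Some(30)]`.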

Traits

DedupStrategy 🔒
Strategy to remove duplicate rows from sorted batches.

Functions

filter_deleted_from_batch 🔒
Removes deleted rows from the batch and updates metrics.
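
As a rough illustration of what removing deleted rows while updating metrics might look like, here is a minimal sketch. The `OpType` enum, the `Metrics` struct with its `num_deleted_rows` field, and the `filter_deleted` signature are all hypothetical stand-ins, not the signatures used by this module.

```rust
// Hypothetical operation marker attached to each row.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OpType {
    Put,
    Delete,
}

// Hypothetical metrics holder; the real DedupMetrics may differ.
#[derive(Debug, Default, PartialEq)]
struct Metrics {
    num_deleted_rows: usize,
}

/// Drops rows marked as deleted and records how many were removed.
fn filter_deleted(rows: &mut Vec<(u32, OpType)>, metrics: &mut Metrics) {
    let before = rows.len();
    // Keep every row that is not a delete marker.
    rows.retain(|(_, op)| *op != OpType::Delete);
    metrics.num_deleted_rows += before - rows.len();
}
```

Counting via the length difference keeps the filter a single pass and avoids a separate scan for delete markers.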