pub fn merge_and_dedup(
    schema: &SchemaRef,
    append_mode: bool,
    merge_mode: MergeMode,
    field_column_start: usize,
    input_iters: Vec<BoxedRecordBatchIterator>,
) -> Result<BoxedRecordBatchIterator>
Merges multiple record batch iterators and applies deduplication based on the specified mode.
This function is used during the flush process to combine data from multiple memtable ranges into a single stream while handling duplicate records according to the configured merge strategy.
§Arguments
- `schema` - The Arrow schema reference that defines the structure of the record batches
- `append_mode` - When true, no deduplication is performed and all records are preserved. This is used for append-only workloads where duplicate handling is not required.
- `merge_mode` - The strategy used for deduplication when not in append mode:
  - `MergeMode::LastRow`: Keeps the last record for each primary key
  - `MergeMode::LastNonNull`: Keeps the last non-null values for each field
- `field_column_start` - The starting column index for fields in the record batch. Used when `merge_mode` is `MergeMode::LastNonNull` to identify which columns contain field values versus primary key columns.
- `input_iters` - A vector of record batch iterators to be merged and deduplicated
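The difference between the two merge modes, and the role of `field_column_start`, can be illustrated with a small standalone model. This is a sketch only, not the crate's implementation: `Mode`, `Row`, and `resolve` are hypothetical names, rows are plain vectors of nullable integers instead of Arrow record batches, and the duplicates are assumed to arrive oldest first.

```rust
/// Hypothetical stand-in for `MergeMode`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    LastRow,
    LastNonNull,
}

/// A row modelled as a flat list of nullable columns (assumption of this sketch).
/// Columns before `field_column_start` play the role of key columns.
type Row = Vec<Option<i64>>;

/// Collapses `dups`, a non-empty run of rows sharing one key (oldest first),
/// into a single row according to `mode`.
fn resolve(dups: &[Row], mode: Mode, field_column_start: usize) -> Row {
    // Both modes start from the last row of the run.
    let mut out = dups.last().expect("at least one row").clone();
    if mode == Mode::LastNonNull {
        // Only field columns are back-filled; key columns are left untouched.
        for col in field_column_start..out.len() {
            // Walk from newest to oldest and take the first non-null value.
            out[col] = dups.iter().rev().find_map(|row| row[col]);
        }
    }
    out
}
```

With two duplicate rows where the newer one has a null in its first field, `Mode::LastRow` keeps that null, while `Mode::LastNonNull` fills it from the older row.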
§Returns
Returns a boxed record batch iterator that yields the merged and potentially deduplicated record batches.
§Behavior
- Creates a `FlatMergeIterator` to merge all input iterators in sorted order based on primary key and timestamp
- If `append_mode` is true, returns the merge iterator directly without deduplication
- If `append_mode` is false, wraps the merge iterator with a `FlatDedupIterator` that applies the specified merge mode:
  - `LastRow`: Removes duplicate rows, keeping only the last one
  - `LastNonNull`: Removes duplicates but preserves the last non-null value for each field
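The merge-then-optionally-dedup shape described above can be sketched with the standard library alone. This is a simplified model, not the real `FlatMergeIterator` or `FlatDedupIterator`: `Row`, `BoxedIter`, `MergeIter`, `DedupLastRow`, `boxed`, and `merge_and_dedup_sketch` are hypothetical names, rows are `(key, timestamp, value)` tuples instead of record batches, only the `LastRow` behavior is modelled, and two further assumptions are made: rows with equal `(key, timestamp)` count as duplicates, and ties are ordered by input index so that later inputs win.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::iter::Peekable;

/// Simplified row: (primary key, timestamp, value).
type Row = (u32, i64, &'static str);
type BoxedIter = Box<dyn Iterator<Item = Row>>;

/// Helper that boxes a vector of rows as an input iterator.
fn boxed(rows: Vec<Row>) -> BoxedIter {
    Box::new(rows.into_iter())
}

/// K-way merge of inputs that are each already sorted by (key, timestamp).
struct MergeIter {
    inputs: Vec<BoxedIter>,
    /// Min-heap holding the head of each input as (key, ts, input index, value).
    heap: BinaryHeap<Reverse<(u32, i64, usize, &'static str)>>,
}

impl MergeIter {
    fn new(mut inputs: Vec<BoxedIter>) -> Self {
        let mut heap = BinaryHeap::new();
        for (idx, input) in inputs.iter_mut().enumerate() {
            if let Some((key, ts, value)) = input.next() {
                heap.push(Reverse((key, ts, idx, value)));
            }
        }
        Self { inputs, heap }
    }
}

impl Iterator for MergeIter {
    type Item = Row;

    fn next(&mut self) -> Option<Row> {
        let Reverse((key, ts, idx, value)) = self.heap.pop()?;
        // Refill the heap from the input that just produced a row.
        if let Some((k, t, v)) = self.inputs[idx].next() {
            self.heap.push(Reverse((k, t, idx, v)));
        }
        Some((key, ts, value))
    }
}

/// Keeps only the last row of each run with equal (key, timestamp).
struct DedupLastRow<I: Iterator<Item = Row>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = Row>> Iterator for DedupLastRow<I> {
    type Item = Row;

    fn next(&mut self) -> Option<Row> {
        let mut current = self.inner.next()?;
        // Skip forward while the next row is a duplicate of the current one.
        while let Some(&(key, ts, _)) = self.inner.peek() {
            if (key, ts) != (current.0, current.1) {
                break;
            }
            current = self.inner.next()?;
        }
        Some(current)
    }
}

/// Append mode returns the merged stream as is; otherwise dedup wraps it.
fn merge_and_dedup_sketch(append_mode: bool, inputs: Vec<BoxedIter>) -> BoxedIter {
    let merged = MergeIter::new(inputs);
    if append_mode {
        Box::new(merged)
    } else {
        Box::new(DedupLastRow { inner: merged.peekable() })
    }
}
```

Returning a boxed trait object from both branches is what lets the caller treat the append-mode and dedup paths uniformly, which mirrors the single boxed return type in the signature.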
§Examples
let merged_iter = merge_and_dedup(
    &schema,
    false, // not append mode, apply dedup
    MergeMode::LastRow,
    2, // fields start at column 2 after primary key columns
    vec![iter1, iter2, iter3],
)?;