Struct BatchCoalescer

pub struct BatchCoalescer {
    schema: Arc<Schema>,
    target_batch_size: usize,
    in_progress_arrays: Vec<Box<dyn InProgressArray>>,
    buffered_rows: usize,
    completed: VecDeque<RecordBatch>,
}
Expand description

Concatenate multiple [RecordBatch]es

Implements the common pattern of incrementally creating output [RecordBatch]es of a specific size from an input stream of [RecordBatch]es.

This is useful after operations such as filter and take that produce smaller batches, and we want to coalesce them into larger batches for further processing.

§Motivation

If we use concat_batches to implement the same functionality, there are 2 potential issues:

  1. At least 2x peak memory (holding the input and output of concat)
  2. 2 copies of the data (to create the output of filter and then create the output of concat)

See: https://github.com/apache/arrow-rs/issues/6692 for more discussions about the motivation.

§Example

use arrow_array::record_batch;
use arrow_select::coalesce::{BatchCoalescer};
let batch1 = record_batch!(("a", Int32, [1, 2, 3])).unwrap();
let batch2 = record_batch!(("a", Int32, [4, 5])).unwrap();

// Create a `BatchCoalescer` that will produce batches with at least 4 rows
let target_batch_size = 4;
let mut coalescer = BatchCoalescer::new(batch1.schema(), 4);

// push the batches
coalescer.push_batch(batch1).unwrap();
// only pushed 3 rows (not yet 4, enough to produce a batch)
assert!(coalescer.next_completed_batch().is_none());
coalescer.push_batch(batch2).unwrap();
// now we have 5 rows, so we can produce a batch
let finished = coalescer.next_completed_batch().unwrap();
// 4 rows came out (target batch size is 4)
let expected = record_batch!(("a", Int32, [1, 2, 3, 4])).unwrap();
assert_eq!(finished, expected);

// Have no more input, but still have an in-progress batch
assert!(coalescer.next_completed_batch().is_none());
// We can finish the batch, which will produce the remaining rows
coalescer.finish_buffered_batch().unwrap();
let expected = record_batch!(("a", Int32, [5])).unwrap();
assert_eq!(coalescer.next_completed_batch().unwrap(), expected);

// The coalescer is now empty
assert!(coalescer.next_completed_batch().is_none());

§Background

Generally speaking, larger [RecordBatch]es are more efficient to process than smaller [RecordBatch]es (until the CPU cache is exceeded) because there is fixed processing overhead per batch. This coalescer builds up these larger batches incrementally.

┌────────────────────┐
│    RecordBatch     │
│   num_rows = 100   │
└────────────────────┘                 ┌────────────────────┐
                                       │                    │
┌────────────────────┐     Coalesce    │                    │
│                    │      Batches    │                    │
│    RecordBatch     │                 │                    │
│   num_rows = 200   │  ─ ─ ─ ─ ─ ─ ▶  │                    │
│                    │                 │    RecordBatch     │
│                    │                 │   num_rows = 400   │
└────────────────────┘                 │                    │
                                       │                    │
┌────────────────────┐                 │                    │
│                    │                 │                    │
│    RecordBatch     │                 │                    │
│   num_rows = 100   │                 └────────────────────┘
│                    │
└────────────────────┘

§Notes:

  1. Output rows are produced in the same order as the input rows

  2. The output is a sequence of batches, with all but the last being at exactly target_batch_size rows.

Fields§

§schema: Arc<Schema>§target_batch_size: usize§in_progress_arrays: Vec<Box<dyn InProgressArray>>§buffered_rows: usize§completed: VecDeque<RecordBatch>

Implementations§

§

impl BatchCoalescer

pub fn new(schema: Arc<Schema>, target_batch_size: usize) -> BatchCoalescer

Create a new BatchCoalescer

§Arguments
  • schema - the schema of the output batches
  • target_batch_size - the number of rows in each output batch. Typical values are 4096 or 8192 rows.

pub fn schema(&self) -> Arc<Schema>

Return the schema of the output batches

pub fn push_batch_with_filter( &mut self, batch: RecordBatch, filter: &BooleanArray, ) -> Result<(), ArrowError>

Push a batch into the Coalescer after applying a filter

This is semantically equivalent of calling Self::push_batch with the results from filter_record_batch

§Example
let batch1 = record_batch!(("a", Int32, [1, 2, 3])).unwrap();
let batch2 = record_batch!(("a", Int32, [4, 5, 6])).unwrap();
// Apply a filter to each batch to pick the first and last row
let filter = BooleanArray::from(vec![true, false, true]);
// create a new Coalescer that targets creating 1000 row batches
let mut coalescer = BatchCoalescer::new(batch1.schema(), 1000);
coalescer.push_batch_with_filter(batch1, &filter);
coalescer.push_batch_with_filter(batch2, &filter);
// finsh and retrieve the created batch
coalescer.finish_buffered_batch().unwrap();
let completed_batch = coalescer.next_completed_batch().unwrap();
// filtered out 2 and 5:
let expected_batch = record_batch!(("a", Int32, [1, 3, 4, 6])).unwrap();
assert_eq!(completed_batch, expected_batch);

pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError>

Push all the rows from batch into the Coalescer

When buffered data plus incoming rows reach target_batch_size , completed batches are generated eagerly and can be retrieved via Self::next_completed_batch(). Output batches contain exactly target_batch_size rows, so the tail of the input batch may remain buffered. Remaining partial data either waits for future input batches or can be materialized immediately by calling Self::finish_buffered_batch().

§Example
let batch1 = record_batch!(("a", Int32, [1, 2, 3])).unwrap();
let batch2 = record_batch!(("a", Int32, [4, 5, 6])).unwrap();
// create a new Coalescer that targets creating 1000 row batches
let mut coalescer = BatchCoalescer::new(batch1.schema(), 1000);
coalescer.push_batch(batch1);
coalescer.push_batch(batch2);
// finsh and retrieve the created batch
coalescer.finish_buffered_batch().unwrap();
let completed_batch = coalescer.next_completed_batch().unwrap();
let expected_batch = record_batch!(("a", Int32, [1, 2, 3, 4, 5, 6])).unwrap();
assert_eq!(completed_batch, expected_batch);

pub fn finish_buffered_batch(&mut self) -> Result<(), ArrowError>

Concatenates any buffered batches into a single RecordBatch and clears any output buffers

Normally this is called when the input stream is exhausted, and we want to finalize the last batch of rows.

See Self::next_completed_batch() for the completed batches.

pub fn is_empty(&self) -> bool

Returns true if there is any buffered data

pub fn has_completed_batch(&self) -> bool

Returns true if there are any completed batches

pub fn next_completed_batch(&mut self) -> Option<RecordBatch>

Removes and returns the next completed batch, if any.

Trait Implementations§

§

impl Debug for BatchCoalescer

§

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter. Read more

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
§

impl<T> Conv for T

§

fn conv<T>(self) -> T
where Self: Into<T>,

Converts self into T using Into<T>. Read more
§

impl<T, V> Convert<T> for V
where V: Into<T>,

§

fn convert(value: Self) -> T

§

fn convert_box(value: Box<Self>) -> Box<T>

§

fn convert_vec(value: Vec<Self>) -> Vec<T>

§

fn convert_vec_box(value: Vec<Box<Self>>) -> Vec<Box<T>>

§

fn convert_matrix(value: Vec<Vec<Self>>) -> Vec<Vec<T>>

§

fn convert_option(value: Option<Self>) -> Option<T>

§

fn convert_option_box(value: Option<Box<Self>>) -> Option<Box<T>>

§

fn convert_option_vec(value: Option<Vec<Self>>) -> Option<Vec<T>>

§

impl<T> FmtForward for T

§

fn fmt_binary(self) -> FmtBinary<Self>
where Self: Binary,

Causes self to use its Binary implementation when Debug-formatted.
§

fn fmt_display(self) -> FmtDisplay<Self>
where Self: Display,

Causes self to use its Display implementation when Debug-formatted.
§

fn fmt_lower_exp(self) -> FmtLowerExp<Self>
where Self: LowerExp,

Causes self to use its LowerExp implementation when Debug-formatted.
§

fn fmt_lower_hex(self) -> FmtLowerHex<Self>
where Self: LowerHex,

Causes self to use its LowerHex implementation when Debug-formatted.
§

fn fmt_octal(self) -> FmtOctal<Self>
where Self: Octal,

Causes self to use its Octal implementation when Debug-formatted.
§

fn fmt_pointer(self) -> FmtPointer<Self>
where Self: Pointer,

Causes self to use its Pointer implementation when Debug-formatted.
§

fn fmt_upper_exp(self) -> FmtUpperExp<Self>
where Self: UpperExp,

Causes self to use its UpperExp implementation when Debug-formatted.
§

fn fmt_upper_hex(self) -> FmtUpperHex<Self>
where Self: UpperHex,

Causes self to use its UpperHex implementation when Debug-formatted.
§

fn fmt_list(self) -> FmtList<Self>
where &'a Self: for<'a> IntoIterator,

Formats each item in a sequence. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

§

impl<T> FutureExt for T

§

fn with_context(self, otel_cx: Context) -> WithContext<Self>

Attaches the provided Context to this type, returning a WithContext wrapper. Read more
§

fn with_current_context(self) -> WithContext<Self>

Attaches the current Context to this type, returning a WithContext wrapper. Read more
§

impl<T> Instrument for T

§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided [Span], returning an Instrumented wrapper. Read more
§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
§

impl<T> IntoRequest<T> for T

§

fn into_request(self) -> Request<T>

Wrap the input message T in a tonic::Request
§

impl<L> LayerExt<L> for L

§

fn named_layer<S>(&self, service: S) -> Layered<<L as Layer<S>>::Service, S>
where L: Layer<S>,

Applies the layer to a service and wraps it in [Layered].
§

impl<T> Pipe for T
where T: ?Sized,

§

fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> R
where Self: Sized,

Pipes by value. This is generally the method you want to use. Read more
§

fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> R
where R: 'a,

Borrows self and passes that borrow into the pipe function. Read more
§

fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> R
where R: 'a,

Mutably borrows self and passes that borrow into the pipe function. Read more
§

fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
where Self: Borrow<B>, B: 'a + ?Sized, R: 'a,

Borrows self, then passes self.borrow() into the pipe function. Read more
§

fn pipe_borrow_mut<'a, B, R>( &'a mut self, func: impl FnOnce(&'a mut B) -> R, ) -> R
where Self: BorrowMut<B>, B: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.borrow_mut() into the pipe function. Read more
§

fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
where Self: AsRef<U>, U: 'a + ?Sized, R: 'a,

Borrows self, then passes self.as_ref() into the pipe function.
§

fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
where Self: AsMut<U>, U: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.as_mut() into the pipe function.
§

fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
where Self: Deref<Target = T>, T: 'a + ?Sized, R: 'a,

Borrows self, then passes self.deref() into the pipe function.
§

fn pipe_deref_mut<'a, T, R>( &'a mut self, func: impl FnOnce(&'a mut T) -> R, ) -> R
where Self: DerefMut<Target = T> + Deref, T: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.deref_mut() into the pipe function.
§

impl<T> PolicyExt for T
where T: ?Sized,

§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns [Action::Follow] only if self and other return Action::Follow. Read more
§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns [Action::Follow] if either self or other returns Action::Follow. Read more
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
§

impl<T> ServiceExt for T

§

fn propagate_header(self, header: HeaderName) -> PropagateHeader<Self>
where Self: Sized,

Propagate a header from the request to the response. Read more
§

fn add_extension<T>(self, value: T) -> AddExtension<Self, T>
where Self: Sized,

Add some shareable value to request extensions. Read more
§

fn map_request_body<F>(self, f: F) -> MapRequestBody<Self, F>
where Self: Sized,

Apply a transformation to the request body. Read more
§

fn map_response_body<F>(self, f: F) -> MapResponseBody<Self, F>
where Self: Sized,

Apply a transformation to the response body. Read more
§

fn compression(self) -> Compression<Self>
where Self: Sized,

Compresses response bodies. Read more
§

fn decompression(self) -> Decompression<Self>
where Self: Sized,

Decompress response bodies. Read more
§

fn trace_for_http(self) -> Trace<Self, SharedClassifier<ServerErrorsAsFailures>>
where Self: Sized,

High level tracing that classifies responses using HTTP status codes. Read more
§

fn trace_for_grpc(self) -> Trace<Self, SharedClassifier<GrpcErrorsAsFailures>>
where Self: Sized,

High level tracing that classifies responses using gRPC headers. Read more
§

fn follow_redirects(self) -> FollowRedirect<Self>
where Self: Sized,

Follow redirect resposes using the Standard policy. Read more
§

fn sensitive_headers( self, headers: impl IntoIterator<Item = HeaderName>, ) -> SetSensitiveRequestHeaders<SetSensitiveResponseHeaders<Self>>
where Self: Sized,

Mark headers as sensitive on both requests and responses. Read more
§

fn sensitive_request_headers( self, headers: impl IntoIterator<Item = HeaderName>, ) -> SetSensitiveRequestHeaders<Self>
where Self: Sized,

Mark headers as sensitive on requests. Read more
§

fn sensitive_response_headers( self, headers: impl IntoIterator<Item = HeaderName>, ) -> SetSensitiveResponseHeaders<Self>
where Self: Sized,

Mark headers as sensitive on responses. Read more
§

fn override_request_header<M>( self, header_name: HeaderName, make: M, ) -> SetRequestHeader<Self, M>
where Self: Sized,

Insert a header into the request. Read more
§

fn append_request_header<M>( self, header_name: HeaderName, make: M, ) -> SetRequestHeader<Self, M>
where Self: Sized,

Append a header into the request. Read more
§

fn insert_request_header_if_not_present<M>( self, header_name: HeaderName, make: M, ) -> SetRequestHeader<Self, M>
where Self: Sized,

Insert a header into the request, if the header is not already present. Read more
§

fn override_response_header<M>( self, header_name: HeaderName, make: M, ) -> SetResponseHeader<Self, M>
where Self: Sized,

Insert a header into the response. Read more
§

fn append_response_header<M>( self, header_name: HeaderName, make: M, ) -> SetResponseHeader<Self, M>
where Self: Sized,

Append a header into the response. Read more
§

fn insert_response_header_if_not_present<M>( self, header_name: HeaderName, make: M, ) -> SetResponseHeader<Self, M>
where Self: Sized,

Insert a header into the response, if the header is not already present. Read more
§

fn set_request_id<M>( self, header_name: HeaderName, make_request_id: M, ) -> SetRequestId<Self, M>
where Self: Sized, M: MakeRequestId,

Add request id header and extension. Read more
§

fn set_x_request_id<M>(self, make_request_id: M) -> SetRequestId<Self, M>
where Self: Sized, M: MakeRequestId,

Add request id header and extension, using x-request-id as the header name. Read more
§

fn propagate_request_id( self, header_name: HeaderName, ) -> PropagateRequestId<Self>
where Self: Sized,

Propgate request ids from requests to responses. Read more
§

fn propagate_x_request_id(self) -> PropagateRequestId<Self>
where Self: Sized,

Propgate request ids from requests to responses, using x-request-id as the header name. Read more
§

fn catch_panic(self) -> CatchPanic<Self, DefaultResponseForPanic>
where Self: Sized,

Catch panics and convert them into 500 Internal Server responses. Read more
§

fn request_body_limit(self, limit: usize) -> RequestBodyLimit<Self>
where Self: Sized,

Intercept requests with over-sized payloads and convert them into 413 Payload Too Large responses. Read more
§

fn trim_trailing_slash(self) -> NormalizePath<Self>
where Self: Sized,

Remove trailing slashes from paths. Read more
§

fn append_trailing_slash(self) -> NormalizePath<Self>
where Self: Sized,

Append trailing slash to paths. Read more
§

impl<T> Tap for T

§

fn tap(self, func: impl FnOnce(&Self)) -> Self

Immutable access to a value. Read more
§

fn tap_mut(self, func: impl FnOnce(&mut Self)) -> Self

Mutable access to a value. Read more
§

fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
where Self: Borrow<B>, B: ?Sized,

Immutable access to the Borrow<B> of a value. Read more
§

fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
where Self: BorrowMut<B>, B: ?Sized,

Mutable access to the BorrowMut<B> of a value. Read more
§

fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
where Self: AsRef<R>, R: ?Sized,

Immutable access to the AsRef<R> view of a value. Read more
§

fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
where Self: AsMut<R>, R: ?Sized,

Mutable access to the AsMut<R> view of a value. Read more
§

fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
where Self: Deref<Target = T>, T: ?Sized,

Immutable access to the Deref::Target of a value. Read more
§

fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
where Self: DerefMut<Target = T> + Deref, T: ?Sized,

Mutable access to the Deref::Target of a value. Read more
§

fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self

Calls .tap() only in debug builds, and is erased in release builds.
§

fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self

Calls .tap_mut() only in debug builds, and is erased in release builds.
§

fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
where Self: Borrow<B>, B: ?Sized,

Calls .tap_borrow() only in debug builds, and is erased in release builds.
§

fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
where Self: BorrowMut<B>, B: ?Sized,

Calls .tap_borrow_mut() only in debug builds, and is erased in release builds.
§

fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
where Self: AsRef<R>, R: ?Sized,

Calls .tap_ref() only in debug builds, and is erased in release builds.
§

fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
where Self: AsMut<R>, R: ?Sized,

Calls .tap_ref_mut() only in debug builds, and is erased in release builds.
§

fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
where Self: Deref<Target = T>, T: ?Sized,

Calls .tap_deref() only in debug builds, and is erased in release builds.
§

fn tap_deref_mut_dbg<T>(self, func: impl FnOnce(&mut T)) -> Self
where Self: DerefMut<Target = T> + Deref, T: ?Sized,

Calls .tap_deref_mut() only in debug builds, and is erased in release builds.
§

impl<T> TryConv for T

§

fn try_conv<T>(self) -> Result<T, Self::Error>
where Self: TryInto<T>,

Attempts to convert self into T using TryInto<T>. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

§

fn vzip(self) -> V

§

impl<T> WithSubscriber for T

§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a [WithDispatch] wrapper. Read more
§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a [WithDispatch] wrapper. Read more
§

impl<T> Any for T
where T: Any,

§

impl<T> ErasedDestructor for T
where T: 'static,