Dataset¶
Vortex files implement the Arrow Dataset interface, permitting efficient use of a Vortex file within query engines like DuckDB and Polars. In particular, Vortex reads an amount of data proportional to the number of rows passing a filter condition and the number of columns in a selection. For most Vortex encodings, this property holds even when the filter condition selects a single row.
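For example, a minimal sketch of the integration. The column names and the construction of ds are assumptions; obtaining a VortexDataset for a particular file is not shown in this excerpt.

    import duckdb
    import polars as pl

    ds = ...  # assume ds is an already-constructed vortex.dataset.VortexDataset

    # Polars can lazily scan any PyArrow-Dataset-compatible object, so the
    # filter and projection below are pushed down into the Vortex file.
    frame = (
        pl.scan_pyarrow_dataset(ds)
        .filter(pl.col("age") > 30)
        .select(["name", "age"])
        .collect()
    )

    # DuckDB may also be able to query the dataset through its replacement
    # scan of local Python variables that expose the Arrow dataset interface.
    result = duckdb.sql("SELECT name, age FROM ds WHERE age > 30").arrow()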
- VortexDataset – Read Vortex files with row filter and column selection pushdown.
- VortexScanner – A PyArrow Dataset Scanner that reads from a Vortex Array.
- class vortex.dataset.VortexDataset(dataset)¶
Read Vortex files with row filter and column selection pushdown.
This class implements the pyarrow.dataset.Dataset interface, which enables its use with Polars, DuckDB, Pandas and others.
- count_rows(filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool = None) int¶
Not implemented.
- filter(expression: Expression) VortexDataset¶
Not implemented.
- get_fragments(filter: Expression | None = None) Iterator[Fragment]¶
Not implemented.
- head(num_rows: int, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool = None) Table¶
Load the first num_rows of the dataset.
- Parameters:
num_rows (int) – The number of rows to load.
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool) – Not implemented.
- Returns:
table
- Return type:
pyarrow.Table
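For example, a sketch of head, assuming ds is an already-constructed VortexDataset and that the file contains hypothetical name and age columns:

    import pyarrow.dataset as pa_ds

    # Load at most ten matching rows; rows where the filter is Null are dropped.
    first = ds.head(
        10,
        columns=["name", "age"],
        filter=pa_ds.field("age") > 30,
    )
    print(first.num_rows)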
- join(right_dataset, keys, right_keys=None, join_type=None, left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads: bool | None = None) InMemoryDataset¶
Not implemented.
- join_asof(right_dataset, on, by, tolerance, right_on=None, right_by=None) InMemoryDataset¶
Not implemented.
- scanner(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool = None) Scanner¶
Construct a pyarrow.dataset.Scanner.
- Parameters:
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool) – Not implemented.
- Returns:
scanner
- Return type:
pyarrow.dataset.Scanner
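For example, a sketch that builds a scanner with projection and filter pushdown; ds and the column names are assumptions, not part of this API excerpt:

    import pyarrow.dataset as pa_ds

    scanner = ds.scanner(
        columns=["name", "age"],
        filter=pa_ds.field("age") > 30,
        batch_size=8_192,
    )
    table = scanner.to_table()  # a pyarrow.dataset.Scanner supports to_table()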
- sort_by(sorting, **kwargs) InMemoryDataset¶
Not implemented.
- take(indices: Array | Any, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool = None) Table¶
Load a subset of rows identified by their absolute indices.
- Parameters:
indices (pyarrow.Array) – A numeric array of absolute indices into self indicating which rows to keep.
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool) – Not implemented.
- Returns:
table
- Return type:
pyarrow.Table
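For example, a sketch of take, assuming ds is an existing VortexDataset and "name" is a hypothetical column:

    import pyarrow as pa

    # Fetch three rows by absolute position, reading only the "name" column.
    rows = ds.take(pa.array([0, 42, 1000]), columns=["name"])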
- to_batches(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool = None) Iterator[RecordBatch]¶
Construct an iterator of pyarrow.RecordBatch.
- Parameters:
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool) – Not implemented.
- Returns:
record batches
- Return type:
Iterator[pyarrow.RecordBatch]
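For example, a sketch that streams matching rows batch by batch; ds and the "age" column are assumptions:

    import pyarrow.dataset as pa_ds

    total = 0
    for batch in ds.to_batches(
        columns=["age"],
        filter=pa_ds.field("age") > 30,
        batch_size=4_096,
    ):
        total += batch.num_rows  # each item is a pyarrow.RecordBatch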
- to_record_batch_reader(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool = None) RecordBatchReader¶
Construct a pyarrow.RecordBatchReader.
- Parameters:
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool) – Not implemented.
- Returns:
reader
- Return type:
pyarrow.RecordBatchReader
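For example, a sketch that obtains a streaming reader; ds and the column names are assumptions:

    reader = ds.to_record_batch_reader(columns=["name", "age"])
    # A pyarrow.RecordBatchReader can be consumed incrementally or handed to
    # engines that accept an Arrow stream.
    schema = reader.schema
    for batch in reader:
        pass  # process each pyarrow.RecordBatch here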
- to_table(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool = None) Table¶
Construct an Arrow pyarrow.Table.
- Parameters:
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool) – Not implemented.
- Returns:
table
- Return type:
pyarrow.Table
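For example, a sketch that materializes a filtered projection as an Arrow table; ds and the column names are assumptions:

    import pyarrow.dataset as pa_ds

    table = ds.to_table(
        columns=["name", "age"],
        filter=pa_ds.field("age") > 30,
    )
    df = table.to_pandas()  # a pyarrow.Table converts directly to pandas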
- class vortex.dataset.VortexScanner(dataset: VortexDataset, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool = None)¶
A PyArrow Dataset Scanner that reads from a Vortex Array.
- Parameters:
dataset (VortexDataset) – The dataset to scan.
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool) – Not implemented.
- head(num_rows: int) Table¶
Load the first num_rows of the dataset.
- Parameters:
num_rows (int) – The number of rows to read.
- Returns:
table
- Return type:
pyarrow.Table
- scan_batches() Iterator[TaggedRecordBatch]¶
Not implemented.
- to_batches() Iterator[RecordBatch]¶
Construct an iterator of pyarrow.RecordBatch.
- Returns:
record batches
- Return type:
Iterator[pyarrow.RecordBatch]
- to_reader() RecordBatchReader¶
Construct a pyarrow.RecordBatchReader.
- Returns:
reader
- Return type:
pyarrow.RecordBatchReader
- to_table() Table¶
Construct an Arrow pyarrow.Table.
- Returns:
table
- Return type:
pyarrow.Table
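For example, a sketch that builds a scanner directly and consumes it; ds and the column names are assumptions, not part of this API excerpt:

    import pyarrow.dataset as pa_ds
    from vortex.dataset import VortexScanner

    scanner = VortexScanner(
        ds,                               # an existing VortexDataset
        columns=["name", "age"],          # hypothetical column names
        filter=pa_ds.field("age") > 30,
    )
    preview = scanner.head(5)   # first five matching rows as a pyarrow.Table
    table = scanner.to_table()  # all matching rows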