Dataset

fricon stores datasets in the Arrow IPC format. A basic knowledge of Arrow data structures helps in understanding how fricon works.

Apache Arrow

You may be familiar with pandas, a widely used data manipulation library in Python. Arrow serves a similar purpose but has much stricter data type requirements: each Arrow table comes with a schema that specifies the data type of every column. The following are some key classes in the Python bindings of Arrow:

  • pyarrow.RecordBatch: A record batch is a collection of arrays with the same length. Each record batch is associated with a schema.
  • pyarrow.Array: An array is a sequence of values with the same data type.
  • pyarrow.Scalar: A scalar is a single value with a data type.
  • pyarrow.Schema: A schema is a collection of fields. Each field corresponds to a column in a table.
  • pyarrow.Field: A field is a data type with a name.
  • pyarrow.DataType: The base class of all Arrow data types.
  • pyarrow.Table: A helper type that gives a single record batch and a collection of record batches with the same schema a unified representation.

How are datasets stored?

A dataset is exactly one Arrow table stored in Arrow IPC format. When a dataset is created, the schema of the table must be determined first. In fricon, users can specify a partial schema in DatasetManager.create, and unspecified columns will be inferred from the first row of the dataset.

Type inference

fricon only tries to infer a subset of Arrow data types. The following table lists the mapping between Python types and Arrow data types:

Python type     Arrow data type
bool            pyarrow.bool_
int             pyarrow.int64
float           pyarrow.float64
complex         fricon.complex128
str             pyarrow.string
Sequence        pyarrow.list_
fricon.Trace    fricon.trace_

Note that fricon defines custom data types for complex numbers and traces. Users can convert values of these custom data types back to Python types with utility functions, or process them directly with pyarrow or polars.

To store other data types, users need to construct pyarrow.Scalar values themselves. fricon stores these values as is.