Data frame (1.0)

Overview

The data_frame format provides an on-disk representation of a data frame, based on the DataFrame class from the S4Vectors Bioconductor package.

Object metadata

The OBJECT file should contain a data_frame property, itself a JSON object with the following properties:

  • version, a string specifying the version of the data_frame format. This should be set to "1.0".
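
As a minimal sketch of how a reader might check this metadata, the following assumes the contents of the OBJECT file have already been read into a string. The function name `check_data_frame_metadata` is hypothetical and is not part of takane.

```python
import json

def check_data_frame_metadata(object_text):
    """Return the format version after checking the data_frame property."""
    metadata = json.loads(object_text)
    if not isinstance(metadata, dict):
        raise ValueError("OBJECT file should contain a JSON object")
    props = metadata.get("data_frame")
    if not isinstance(props, dict):
        raise ValueError("expected 'data_frame' to be a JSON object")
    version = props.get("version")
    if not isinstance(version, str):
        raise ValueError("'version' should be a string")
    if version != "1.0":
        raise ValueError("unsupported data_frame version: " + version)
    return version

# Illustrative input containing only the property described above.
example = '{"data_frame": {"version": "1.0"}}'
print(check_data_frame_metadata(example))
```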

Directory structure

The directory should contain a basic_contents.h5 HDF5 file, which in turn should contain a data_frame group. This group should have a row-count attribute, which should be a scalar of any datatype that is representable by a 64-bit unsigned integer. The value of this attribute specifies the number of rows in the data frame.

The data_frame group should contain column_names, a 1-dimensional string dataset containing the column names of the data frame. Column names should not be empty or duplicated. The length of this dataset defines the number of columns of the data frame. The datatype should be representable by a UTF-8 encoded string.
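
The constraints on column names can be sketched as below, assuming the column_names dataset has already been loaded into a Python list of strings. The helper `check_column_names` is a hypothetical name used only for illustration.

```python
def check_column_names(column_names):
    """Check that names are non-empty and unique; return the column count."""
    seen = set()
    for name in column_names:
        if name == "":
            raise ValueError("column names should not be empty")
        if name in seen:
            raise ValueError("duplicated column name: " + repr(name))
        seen.add(name)
    # The length of the dataset defines the number of columns.
    return len(column_names)

print(check_column_names(["gene", "score", "cluster"]))
```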

The data_frame group may optionally contain row_names, a 1-dimensional string dataset containing the row names of the data frame if any are present. The length of this dataset should be equal to the number of rows in the data frame. The datatype should be representable by a UTF-8 encoded string.

The data_frame group should contain the data subgroup, which stores all "basic" columns. Each basic column is represented by an HDF5 group or dataset named after the positional index of the column, e.g., 0 for the first column, 1 for the second column, and so on. See below for the expected representation of each basic column.

The directory may also contain an other_contents subdirectory, which stores all "non-basic" columns. Each non-basic column is represented by a subdirectory within other_contents that is named after the positional index of the column, e.g., 2 for the third column, 3 for the fourth column, and so on. See below for the expected representation of each non-basic column. If no other_contents directory is present, it can be assumed that there are no non-basic columns.

Each column in the data frame should be represented exactly once across basic_contents.h5 and other_contents. Thus, if a dataset named 0 is present inside data_frame/data, there should not be a 0 subdirectory in other_contents.
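
The "exactly once" rule can be illustrated with a small sketch that works on names only. It assumes the child names of data_frame/data and the subdirectory names of other_contents have already been listed as strings; `assign_columns` is a hypothetical helper, not a takane function.

```python
def assign_columns(num_columns, basic_names, other_names):
    """Label each column index as 'basic' or 'other', checking the exactly-once rule."""
    basic = set(basic_names)
    other = set(other_names)
    expected = {str(i) for i in range(num_columns)}

    # A column stored in both places is represented twice.
    both = sorted(expected & basic & other)
    if both:
        raise ValueError("columns present in both locations: " + ", ".join(both))

    # A column stored in neither place is not represented at all.
    missing = sorted(expected - basic - other)
    if missing:
        raise ValueError("columns not represented anywhere: " + ", ".join(missing))

    return ["basic" if str(i) in basic else "other" for i in range(num_columns)]

# Columns 0 and 1 live in basic_contents.h5, column 2 in other_contents.
print(assign_columns(3, ["0", "1"], ["2"]))
```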

The directory may contain an element_annotations subdirectory, which contains a child object that satisfies the DATA_FRAME interface. This child object should have a number of rows equal to the number of columns, i.e., the length of the data_frame/column_names dataset in basic_contents.h5. Each row of this child object corresponds to a column of the data frame and contains additional annotations for that column.

The directory may contain an other_annotations subdirectory, which contains a child object that satisfies the SIMPLE_LIST interface. This holds extra annotations for the entire data frame.

Column representations

Basic, not factor

Consider the basic column X, i.e., the X-th column in the data frame. If this column is not a factor, it is represented by an HDF5 dataset at data_frame/data/X. This dataset should be 1-dimensional and of length equal to the number of rows. It should have a type scalar attribute of any string datatype that is represented by a UTF-8 encoded string. The value of the type attribute specifies the type of the column and the expected HDF5 datatype of the dataset itself:

  • For type = "integer" or "boolean", the datatype should be representable by a 32-bit signed integer. See the HDF5 policy draft (v0.1.0) for more details.
  • For type = "number", the datatype should be representable by a 64-bit float.
  • For type = "string", the datatype should be representable by a UTF-8 encoded string.

For type = "string", the data_frame/data/X dataset may optionally contain a format attribute. If present, this should be a scalar attribute of a datatype that is represented by a UTF-8 encoded string. The attribute itself should contain one of the following values:

  • "none": no constraints on the contents of each string. This is the default behaviour if no format is present.
  • "date": strings should be dates following a YYYY-MM-DD format.
  • "date-time": strings should be Internet Date/Time values following the format described in RFC3339.
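
A minimal sketch of checking the "date" format is shown below. It covers only the YYYY-MM-DD case and does not attempt RFC3339 date-time validation. The function name `is_valid_date` is hypothetical, and the sketch relies on Python's `datetime.date`, which rejects year 0000; whether the format intends to allow that year is not stated here, so treat it as a limitation of the example.

```python
import datetime
import re

# Four-digit year, two-digit month, two-digit day, ASCII digits only.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def is_valid_date(value):
    """Return True if value is a YYYY-MM-DD string naming a real calendar date."""
    if DATE_PATTERN.fullmatch(value) is None:
        return False
    year, month, day = (int(part) for part in value.split("-"))
    try:
        datetime.date(year, month, day)
    except ValueError:
        # Out-of-range components such as month 13 or February 30.
        return False
    return True

print(is_valid_date("2023-02-28"))
print(is_valid_date("2023-02-30"))
```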

Missing values are represented by placeholder values, defined in the missing-value-placeholder attribute of the data_frame/data/X dataset. The attribute should be scalar and have the same datatype as data_frame/data/X (except in the case of strings, where any datatype may be used for the attribute as long as it is compatible with a UTF-8 encoded string). All values in the dataset equal to the placeholder should be treated as missing. See the HDF5 policy draft (v0.1.0) for details.
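
The placeholder convention can be sketched as follows, assuming the dataset values and the attribute have already been loaded into Python objects and that an absent attribute means no values are missing. The helper `apply_placeholder` is hypothetical. It uses plain equality, so it does not cover any special handling of NaN placeholders that the HDF5 policy draft may define.

```python
def apply_placeholder(values, placeholder=None):
    """Replace every value equal to the placeholder with None.

    placeholder=None stands for 'no missing-value-placeholder attribute'.
    """
    if placeholder is None:
        return list(values)
    return [None if value == placeholder else value for value in values]

# Integer column where the smallest 32-bit value marks missing entries
# (the choice of placeholder here is purely illustrative).
print(apply_placeholder([1, -2147483648, 3], placeholder=-2147483648))

# String column with "NA" as the placeholder.
print(apply_placeholder(["a", "NA", "c"], placeholder="NA"))
```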

Basic, factor

If column X is a factor, it is instead represented by an HDF5 group at data_frame/data/X. It should have a type scalar attribute of any string datatype that is represented by a UTF-8 encoded string. This attribute should contain the "factor" string.

The data_frame/data/X group should contain levels, a 1-dimensional string dataset containing the factor levels. The datatype should be representable by a UTF-8 encoded string. All levels should be unique.

The data_frame/data/X group should contain codes, a 1-dimensional dataset containing the 0-indexed factor codes. The length of this dataset should be equal to the number of rows, and the datatype should be representable by a 64-bit unsigned integer. Values of this dataset should either be less than the number of levels or equal to the placeholder value.

Missing factor entries are represented by placeholder values, defined in the missing-value-placeholder attribute of the data_frame/data/X/codes dataset. The attribute should be scalar and have the same datatype as data_frame/data/X/codes. All values in the dataset equal to the placeholder should be treated as missing. See the HDF5 policy draft (v0.1.0) for details.

The data_frame/data/X group may also have an ordered attribute, which should be a scalar of any datatype that fits into a 32-bit signed integer. If present, a non-zero value indicates that the factor levels should be treated as ordered. Otherwise, the levels are treated as unordered.
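
Putting the factor rules together, the sketch below decodes codes into level strings. It assumes codes, levels, and the optional attributes have already been read into Python values; `decode_factor` is a hypothetical helper, and `placeholder=None` stands for an absent missing-value-placeholder attribute.

```python
def decode_factor(codes, levels, placeholder=None, ordered=0):
    """Return (values, is_ordered), with None for missing entries."""
    if len(set(levels)) != len(levels):
        raise ValueError("factor levels should be unique")

    values = []
    for code in codes:
        if placeholder is not None and code == placeholder:
            values.append(None)  # missing entry
        elif 0 <= code < len(levels):
            values.append(levels[code])  # codes are 0-indexed
        else:
            raise ValueError("code out of range: " + str(code))

    # Any non-zero value of the ordered attribute marks the levels as ordered.
    return values, ordered != 0

values, is_ordered = decode_factor(
    codes=[0, 2, 1, 255],
    levels=["low", "medium", "high"],
    placeholder=255,
    ordered=1,
)
print(values, is_ordered)
```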

Non-basic

Consider the non-basic column Y, i.e., the Y-th column in the data frame. This may be any object type that is supported by takane and has a concept of "height". The subdirectory at other_contents/Y holds the on-disk representation of this non-basic column, making it a child object of the enclosing data frame. The height of the child object should be equal to the number of rows in the enclosing data frame. A common use case is that of nested data frames, where one data frame is a column of another data frame and has the same number of rows.

Height

The height of the data frame is defined as the number of rows, as specified in the row-count attribute of the data_frame group in basic_contents.h5.

Dimensions

The dimensions of the data frame are defined as the number of rows (row-count) and the number of columns (the length of column_names).

Interfaces

The data_frame object satisfies the DATA_FRAME interface.