The data_frame
format provides an on-disk representation of a data frame, based on the DataFrame
class from the S4Vectors Bioconductor package.
The OBJECT
file should contain a data_frame
property, itself a JSON object with the following properties:
version
, a string specifying the version of the data_frame
format. This should be set to "1.0".
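The metadata check above can be sketched in a few lines. This is only an illustration that assumes the OBJECT file contents are available as JSON text; the helper name check_object_metadata is hypothetical and not part of any library.

```python
import json

def check_object_metadata(text):
    """Parse OBJECT file contents and return the data_frame format version."""
    obj = json.loads(text)
    if not isinstance(obj, dict):
        raise ValueError("expected the OBJECT file to hold a JSON object")
    meta = obj.get("data_frame")
    if not isinstance(meta, dict):
        raise ValueError("expected a 'data_frame' property holding a JSON object")
    version = meta.get("version")
    # The version must be the string "1.0", not the number 1.0.
    if version != "1.0":
        raise ValueError("expected 'data_frame.version' to be the string \"1.0\"")
    return version

# Hypothetical minimal OBJECT contents, showing only the property described above.
example = '{ "data_frame": { "version": "1.0" } }'
check_object_metadata(example)
```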
The directory should contain a basic_contents.h5
HDF5 file, which in turn should contain a data_frame
group.
This group should have a row-count
attribute, which should be a scalar of any datatype that is representable by a 64-bit unsigned integer.
The value of this attribute specifies the number of rows in the data frame.
The data_frame
group should contain column_names
, a 1-dimensional string dataset containing the column names of the data frame.
Column names should not be empty or duplicated.
The length of this dataset defines the number of columns of the data frame.
The datatype should be representable by a UTF-8 encoded string.
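The constraints on column_names could be checked along these lines. This sketch assumes the dataset has already been read into a list of Python strings; check_column_names is a hypothetical helper.

```python
def check_column_names(names):
    """Raise ValueError if any name is empty or duplicated; return the column count."""
    seen = set()
    for name in names:
        if name == "":
            raise ValueError("column names should not be empty")
        if name in seen:
            raise ValueError(f"duplicated column name '{name}'")
        seen.add(name)
    # The length of the dataset defines the number of columns.
    return len(names)
```

For example, check_column_names(["gene", "score"]) returns 2, while a list containing "gene" twice raises an error.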
The data_frame
group may optionally contain row_names
, a 1-dimensional string dataset containing the row names of the data frame if any are present.
The length of this dataset should be equal to the number of rows in the data frame.
The datatype should be representable by a UTF-8 encoded string.
The data_frame
group should contain the data
subgroup, which stores all "basic" columns.
Each basic column is represented by an HDF5 group or dataset named after the positional index of the column, e.g., 0
for the first column, 1
for the second column, and so on.
See below on the expected representation of each basic column.
The directory may also contain an other_contents
subdirectory, which stores all "non-basic" columns.
Each non-basic column is represented by a subdirectory within other_contents
that is named after the positional index of the column, e.g., 2
for the third column, 3
for the fourth column, and so on.
See below on the expected representation of each non-basic column.
If no other_contents
directory is present, it can be assumed that there are no non-basic columns.
Each column in the data frame should be represented exactly once across basic_contents.h5
and other_contents
.
Thus, if a dataset named 0
is present inside data_frame/data
, there should not be a 0
subdirectory in other_contents
.
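The "exactly once" rule can be illustrated with a small check. This sketch assumes the names of the children of data_frame/data and of the other_contents subdirectory have already been listed as strings; check_column_locations is a hypothetical helper, and it does not say anything about stray entries beyond the number of columns, since the text above does not address them.

```python
def check_column_locations(num_columns, basic_names, other_names):
    """Return a list giving "basic" or "other" for each column index."""
    basic = set(basic_names)
    other = set(other_names)
    locations = []
    for i in range(num_columns):
        key = str(i)  # columns are named after their 0-based positional index
        in_basic = key in basic
        in_other = key in other
        if in_basic and in_other:
            raise ValueError(f"column {key} is present in both locations")
        if not in_basic and not in_other:
            raise ValueError(f"column {key} is not represented anywhere")
        locations.append("basic" if in_basic else "other")
    return locations
```

With three columns, basic names ["0", "1"], and other names ["2"], this yields ["basic", "basic", "other"].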
The directory may contain an element_annotations
subdirectory, which contains a child object that satisfies the DATA_FRAME
interface.
This child object should have a number of rows equal to the number of columns, i.e., the length of the data_frame/column_names
dataset in basic_contents.h5
.
Each row of this child object corresponds to a column of the data frame and contains additional annotations for that column.
The directory may contain an other_annotations
subdirectory, which contains a child object that satisfies the SIMPLE_LIST
interface.
This holds extra annotations for the entire data frame.
Consider the basic column X, i.e., the X-th column in the data frame.
If this column is not a factor, it is represented by an HDF5 dataset at data_frame/data/X
.
This dataset should be 1-dimensional and of length equal to the number of rows.
It should have a type
scalar attribute of any string datatype that is representable by a UTF-8 encoded string.
The value of the type
attribute specifies the type of the column and the expected HDF5 datatype of the dataset itself:
- For type = "integer" or "boolean", the datatype should be representable by a 32-bit signed integer. See the HDF5 policy draft (v0.1.0) for more details.
- For type = "number", the datatype should be representable by a 64-bit float.
- For type = "string", the datatype should be representable by a UTF-8 encoded string.
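One possible reading of "representable by" for integer datatypes is that every value of the stored type fits in the target type. The sketch below encodes that reading for integers only. It is an interpretation rather than the definition in the HDF5 policy draft, which remains the authority, and integer_type_fits is a hypothetical helper.

```python
def integer_type_fits(signed, bits, target_signed, target_bits):
    """Check whether every value of one integer type fits in a target integer type."""
    def limits(is_signed, width):
        if is_signed:
            return -(2 ** (width - 1)), 2 ** (width - 1) - 1
        return 0, 2 ** width - 1

    lo, hi = limits(signed, bits)
    target_lo, target_hi = limits(target_signed, target_bits)
    return target_lo <= lo and hi <= target_hi

# Under this reading, a 16-bit signed dataset fits the 32-bit signed limit
# used for "integer" and "boolean" columns, but a 32-bit unsigned one does not.
integer_type_fits(True, 16, True, 32)
integer_type_fits(False, 32, True, 32)
```

The same check applies to the 64-bit unsigned limit for row-count by passing target_signed=False and target_bits=64.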
For type = "string"
, the data_frame/data/X
dataset may optionally contain a format
attribute.
If present, this should be a scalar attribute of a datatype that is representable by a UTF-8 encoded string.
The attribute itself should contain one of the following values:
- "none": no constraints on the contents of each string. This is the default behaviour if no format is present.
- "date": strings should be dates following a YYYY-MM-DD format.
- "date-time": strings should be Internet Date/Time values following the format described in RFC3339.
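The format values above could be checked roughly as follows. This is a sketch, not a full validator: it only tests the shape of each string, the date-time pattern is a simplified approximation of the RFC3339 grammar, and neither pattern verifies that the date exists in the calendar. The names matches_format, DATE_RE, and DATETIME_RE are hypothetical.

```python
import re

# Shape of a YYYY-MM-DD date.
DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

# Simplified RFC3339 date-time: date, "T", time, optional fraction, then "Z" or an offset.
DATETIME_RE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}[Tt]"
    r"[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?"
    r"([Zz]|[+-][0-9]{2}:[0-9]{2})"
)

def matches_format(value, fmt="none"):
    """Return True if a string has the shape expected for the given format."""
    if fmt == "none":
        return True  # no constraints; also the default when no format is given
    if fmt == "date":
        return DATE_RE.fullmatch(value) is not None
    if fmt == "date-time":
        return DATETIME_RE.fullmatch(value) is not None
    raise ValueError(f"unknown format '{fmt}'")
```

For instance, "2023-01-05" passes the "date" check and "2023-01-05T12:30:00Z" passes the "date-time" check, while "2023-1-5" fails the "date" check.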
Missing values are represented by placeholder values, defined in the missing-value-placeholder
attribute of the data_frame/data/X
dataset.
The attribute should be scalar and have the same datatype as data_frame/data/X
(except in the case of strings, where any datatype may be used for the attribute as long as it is compatible with a UTF-8 encoded string).
All values in the dataset equal to the placeholder should be treated as missing.
See the HDF5 policy draft (v0.1.0) for details.
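Applying a placeholder amounts to an equality comparison against each value. The sketch below assumes the dataset and the missing-value-placeholder attribute have already been read into Python values, and it uses None to stand for "missing"; mask_missing is a hypothetical helper. Floating-point NaN placeholders are not handled here, because NaN never compares equal to itself in Python; the HDF5 policy draft should be consulted for that case.

```python
def mask_missing(values, placeholder):
    """Return a copy of values with entries equal to the placeholder replaced by None."""
    return [None if value == placeholder else value for value in values]

# An integer column using the smallest 32-bit signed integer as its placeholder,
# and a string column using "NA"; both placeholder choices are illustrative.
mask_missing([1, -2147483648, 3], -2147483648)
mask_missing(["a", "NA", "b"], "NA")
```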
If column X is a factor, it is instead represented by an HDF5 group at data_frame/data/X
.
It should have a type
scalar attribute of any string datatype that is representable by a UTF-8 encoded string.
This attribute should contain the "factor"
string.
The data_frame/data/X
group should contain levels
, a 1-dimensional string dataset containing the factor levels.
The datatype should be representable by a UTF-8 encoded string.
All levels should be unique.
The data_frame/data/X
group should contain codes
, a 1-dimensional dataset containing the 0-indexed factor codes.
The length of this dataset should be equal to the number of rows, and the datatype should be representable by a 64-bit unsigned integer.
Values of this dataset should either be less than the number of levels or equal to the placeholder value.
Missing factor entries are represented by placeholder values, defined in the missing-value-placeholder
attribute of the data_frame/data/X/codes
dataset.
The attribute should be scalar and have the same datatype as data_frame/data/X/codes
.
All values in the dataset equal to the placeholder should be treated as missing.
See the HDF5 policy draft (v0.1.0) for details.
The data_frame/data/X
group may also have an ordered
attribute, which should be a scalar of any datatype that fits into a 32-bit signed integer.
If present, a non-zero value indicates that the factor levels should be treated as ordered.
Otherwise, the levels are treated as unordered.
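Putting the factor rules together, decoding could look like the sketch below. It assumes codes, levels, the placeholder, and the ordered attribute have already been read into Python values. Passing placeholder=None to mean "no placeholder supplied" is a convention of this sketch, not something stated by the format, and decode_factor is a hypothetical helper.

```python
def decode_factor(codes, levels, placeholder=None, ordered=0):
    """Map 0-indexed codes to level strings; return (values, is_ordered)."""
    if len(set(levels)) != len(levels):
        raise ValueError("factor levels should be unique")
    decoded = []
    for code in codes:
        if placeholder is not None and code == placeholder:
            decoded.append(None)  # missing factor entry
        elif 0 <= code < len(levels):
            decoded.append(levels[code])
        else:
            raise ValueError(f"code {code} is out of range for {len(levels)} levels")
    # Any non-zero value of the ordered attribute marks the levels as ordered.
    return decoded, ordered != 0
```

For example, decode_factor([0, 2, 1, 255], ["a", "b", "c"], placeholder=255) gives (["a", "c", "b", None], False).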
Consider the non-basic column Y, i.e., the Y-th column in the data frame.
This may be any object type that is supported by takane and has a concept of "height".
The subdirectory at other_contents/Y
holds the on-disk representation of this non-basic column, making it a child object of the enclosing data frame.
The height of the child object should be equal to the number of rows in the enclosing data frame.
A common use case is that of nested data frames where one data frame is a column of another data frame and has the same number of rows.
The height of the data frame is defined as the number of rows, as specified in the row-count
attribute of basic_contents.h5
.
The dimensions of the data frame are defined as the number of rows (row-count
) and columns (the length of column_names
).
The data_frame
object satisfies the DATA_FRAME
interface.