Skip to content

Commit

Permalink
Programmatically generate version-specific specification documents.
Browse files Browse the repository at this point in the history
This is arguably easier to read if we want to understand any given version of
the spec, as we don't have to mentally ignore the parts of the spec related to
other versions. We  use knitr to generate one document per version, only
keeping the clauses relevant to that version via conditional chunks.
  • Loading branch information
LTLA committed Jan 3, 2024
1 parent b1d39e6 commit 44e3a31
Show file tree
Hide file tree
Showing 7 changed files with 221 additions and 85 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/doxygenate.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,28 @@ on:
name: Build documentation

jobs:
build-spec:
runs-on: ubuntu-latest
container: bioconductor/bioconductor_docker:devel

steps:
- name: Checkout repo
uses: actions/checkout@v3

- name: Compile markdown
run: |
cd docs/specifications
R -f build.R
- name: Upload markdown
uses: actions/upload-artifact@v3
with:
name: built-spec
path: docs/specifications/compiled

docs:
runs-on: ubuntu-latest
needs: build-spec
steps:
- uses: actions/checkout@v3

Expand All @@ -16,6 +36,12 @@ jobs:
with:
args: -O docs/doxygen-awesome.css https://raw.githubusercontent.com/jothepro/doxygen-awesome-css/main/doxygen-awesome.css

- name: Download markdown
uses: actions/download-artifact@v3
with:
name: built-spec
path: docs/specifications/compiled

- name: Doxygen Action
uses: mattnotmitt/doxygen-action@v1
with:
Expand Down
3 changes: 2 additions & 1 deletion docs/Doxyfile
Original file line number Diff line number Diff line change
Expand Up @@ -794,7 +794,8 @@ INPUT = ../include/uzuki2/parse_json.hpp \
../include/uzuki2/parse_hdf5.hpp \
../include/uzuki2/interfaces.hpp \
../include/uzuki2/uzuki2.hpp \
../README.md
../README.md \
specifications/compiled

# This tag can be used to specify the character encoding of the source files
# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
Expand Down
1 change: 1 addition & 0 deletions docs/specifications/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
compiled/
14 changes: 14 additions & 0 deletions docs/specifications/build.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
library(knitr)
dir.create("compiled", showWarnings=FALSE)

for (v in c("1.0", "1.1", "1.2", "1.3")) {
.version <- package_version(v)
knitr::knit("hdf5.Rmd", output=file.path("compiled", paste0("hdf5-", v, ".md")))
}

for (v in c("1.0", "1.1", "1.2")) {
.version <- package_version(v)
knitr::knit("json.Rmd", output=file.path("compiled", paste0("json-", v, ".md")))
}

file.copy("misc.md", file.path("compiled", "misc.md"))
166 changes: 114 additions & 52 deletions docs/specifications/hdf5.md → docs/specifications/hdf5.Rmd
Original file line number Diff line number Diff line change
@@ -1,18 +1,51 @@
# HDF5 Specification
```{r, results="hide", echo=FALSE}
knitr::opts_chunk$set(error=FALSE)
if (!exists(".version")) {
.version <- package_version("1.3")
}
```

## General comments
```{r, results="asis", echo=FALSE}
cat("# HDF5 Specification (", as.character(.version), ")", sep="")
```

We use `**/` to represent a variable name of the group representing any of the supported R objects.
It is assumed that `**/` will be replaced by the actual name of the group in implementations,
as defined by users (for the top-level group) or by the specification (e.g., as a nested child of a list).
## Comments

All objects should be nested inside an R list.
### General

Every R object is represented by a HDF5 group.
In the descriptions below, we use `**/` as a placeholder for the name of the group.

All R objects should be nested inside an R list.
In other words, the top-level HDF5 group should represent an R list.

The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses.
This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`.
The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0**.
The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0** for back-compatibility purposes.

```{r, echo=FALSE, results="asis"}
if (.version >= package_version("1.3")) {
cat("### Datatypes
The HDF5 datatype specification used by each R object is based on the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0).
This aims to provide readers with a guaranteed type for faithfully representing the data in memory.
The draft also describes the use of placeholders to represent missing values within HDF5 datasets.")
}
```

### Names

Some R objects may have a `**/names` dataset in their HDF5 group.
If `**/names` is supplied, the contents should always be non-missing, so any `missing-value-placeholder` will not be respected.
Each name is allowed to be any string, including an empty string.

## Lists
It is technically permitted to provide duplicate names in `**/names`, consistent with how R itself supports duplicate names in its lists and vectors.
However, this is not recommended as other frameworks may wish to use representations that assume unique names, e.g., using Python dictionaries to represent named lists.
By providing unique names, users can improve interoperability with native data structures in other frameworks.

## Object types

### Lists

An R list is represented as a HDF5 group (`**/`) with the following attributes:

Expand All @@ -24,15 +57,19 @@ One subgroup should be present for each integer in `[0, N)`, given a list of len
Each list element may be any of the objects described in this specification, including further nested lists.

If the list is named, there will additionally be a 1-dimensional `**/names` string dataset of length equal to the number of elements in `**/data`.
See also the [comments on names](misc.md#comments-on-names).

## Atomic vectors
### Atomic vectors

An atomic vector is represented as a HDF5 group (`**/`) with the following attributes:

- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.
- **(for version 1.0)** this may also be `"date"` or `"date-time"`.
```{r, echo=FALSE, results="asis"}
if (.version == package_version("1.0")) {
cat('- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"`, `"string"`, `"date"` or `"date-time"`.')
} else {
cat('- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.')
}
```

The group should contain an 1-dimensional dataset at `**/data`.
Vectors of length 1 may also be represented as a scalar dataset.
Expand All @@ -41,72 +78,93 @@ The allowed HDF5 datatype depends on `uzuki_type`:

- `"integer"`, `"boolean"`: any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset.
- **(for version < 1.3)** `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
- **(for version >= 1.3)** `"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float.
This implies a limit of 32 bits for any integer datatype.
See also the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.
```{r, echo=FALSE, results="asis"}
if (.version == package_version("1.0")) {
cat('- `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.')
} else {
cat('- `"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float.')
}
```
- `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
- **(for version 1.0)** `"date"`: any type of `H5T_STRING` where the srings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
- **(for version 1.0)** `"date-time"`: any type of `H5T_STRING` where the srings are Internet Date/Time format, or are equal to a missing placeholder value.

For `boolean` type, values in `**/data` should be one of 0 (false) or 1 (true).

**(for versions >= 1.1)**
For the `string` type, the group may optionally contain the `**/format` dataset.
```{r, echo=FALSE, results="asis"}
if (.version == package_version("1.0")) {
cat('- `"date"`: any type of `H5T_STRING` where the srings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
- `"date-time"`: any type of `H5T_STRING` where the srings are Internet Date/Time format, or are equal to a missing placeholder value.')
}
```

For `boolean` type, values in `**/data` should be one of 0 (false) or non-zero (true).

```{r, echo=FALSE, results="asis"}
if (.version >= package_version("1.1")) {
cat('For the `string` type, the group may optionally contain the `**/format` dataset.
This should be a scalar string dataset that specifies constraints to the format of the values in `**/data`:
- `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value.
- `"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value.
- `"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value.')
}
```

The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
If `**/data` is a scalar, `**/names` should have length 1.
See also the [comments on names](misc.md#comments-on-names).

### Representing missing values

**(for version >= 1.1)**
Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
```{r, echo=FALSE, results="asis"}
if (.version >= package_version("1.1")) {
cat('Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
If present, this should be a scalar dataset that specifies the placeholder for missing values.
Any value of `**/data` that is equal to this placeholder should be treated as missing.
If no such attribute is present, it can be assumed that there are no missing values.
If no such attribute is present, it can be assumed that there are no missing values.')
}
**(for version >= 1.2)**
The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
if (.version >= package_version("1.2")) {
cat('The data type of the placeholder attribute should be exactly the same as that of `**/data`, so as to avoid unexpected results upon casting.
The only exception is when `**/data` is a string, in which case the placeholder type may be of any string type;
it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.
it is expected that any comparison between the placeholder and strings in `**/data` will be performed bytewise in the same manner as `strcmp`.')
}
**(for version == 1.1)**
The data type of the placeholder attribute should have the same data type class as `**/data`.
if (.version >= package_version("1.1")) {
cat('The data type of the placeholder attribute should have the same data type class as `**/data`.')
}
**(for version >= 1.3)**
Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s.
if (.version >= package_version("1.3")) {
cat('Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s.
No casting should be performed to a lower-precision type, as this may cause a non-missing value to become equal to the placeholder.
If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload.
See the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.
If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload.')
}
**(for version >= 1.1, < 1.3)**
Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.
if (.version >= package_version("1.1") && .version < package_version("1.3")) {
cat('Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.')
}
**(for version 1.0)**
Integer or boolean values of -2147483648 are treated as missing.
if (.version == package_version("1.0")) {
cat("Integer or boolean values of -2147483648 are treated as missing.
Missing floats are represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98).
For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute.
If present, this should be a scalar string dataset that specifies the placeholder for missing values.
Any value of `**/data` that is equal to this placeholder should be treated as missing.
If no such attribute is present, it can be assumed that there are no missing values.
If no such attribute is present, it can be assumed that there are no missing values.")
}
```

## Factors
### Factors

A factor is represented as a HDF5 group (`**/`) with the following attributes:

- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
- `uzuki_type`, a scalar string dataset containing `"factor"`.
- **(for version 1.0)** `uzuki_type` could also be set to `"ordered"`.
This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
```{r, echo=FALSE, results="asis"}
if (.version == package_version("1.0")) {
cat('- `uzuki_type`, a scalar string dataset containing `"factor"` or `"ordered"`.')
} else {
cat('- `uzuki_type`, a scalar string dataset containing `"factor"`.')
}
```

The group should contain an 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
This should be type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
(Admittedly, this should have been an unsigned integer, but we started with a signed integer and we'll just keep it so for back-compatibility.)
Missing values are represented as described above for atomic vectors.

The group should also contain `**/levels`, a 1-dimensional string dataset that contains the levels for the indices in `**/data`.
Expand All @@ -118,16 +176,20 @@ beyond that count, the levels cannot be indexed by elements of `**/data`.
The group may also contain `**/names`, a 1-dimensional string dataset of length equal to `data`.
See also the [comments on names](misc.md#comments-on-names).

**(for version >= 1.1)** The group may optionally contain `**/ordered`, a scalar integer dataset.
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.
```{r, echo=FALSE, results="asis"}
if (.version == package_version("1.1")) {
cat('The group may optionally contain `**/ordered`, a scalar integer dataset.
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.')
}
```

## Nothing
### Nothing

A "nothing" (a.k.a., "null", "none") value is represented as a HDF5 group with the following attributes:

- `uzuki_object`, a scalar string dataset containing the value `"nothing"`.

## External object
### External object

Each external object is represented as a HDF5 group (`**/`) with the following attributes:

Expand All @@ -136,5 +198,5 @@ Each external object is represented as a HDF5 group (`**/`) with the following a
This should contain an `**/index` scalar dataset, containing an index that identifies this external object uniquely within the entire list.
`**/index` should start at zero and be incremented whenever an external object is encountered.

By indexing this external metadata, we can restore the object in its appropriate location in the list.
By indexing some external metadata with the value of `**/index`, we can restore the external object in its appropriate location in the R list.
The exact mechanism by which this restoration occurs is implementation-defined.
Loading

0 comments on commit 44e3a31

Please sign in to comment.