Added versioning to accommodate factor, date(time) changes. (#5)
The specification now has a place to store the version inside the file, so
validators can do version-specific tasks. This is motivated by the deprecation
of the ordered, date and date-time types in favor of simplifying the type system.
Now, the ordered type is just a factor with ordered=true, while the date(time)s are
just strings with a special format. This should simplify interpretation a lot.
LTLA authored Oct 2, 2023
1 parent f18a55b commit 288954d
Showing 12 changed files with 746 additions and 117 deletions.
67 changes: 55 additions & 12 deletions README.md
@@ -24,6 +24,10 @@ as defined by users (for the top-level group) or by the specification (e.g., as

All objects should be nested inside an R list.

The top-level group may have a `uzuki2_version` attribute, describing the version of the **uzuki2** specification that it uses.
This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`.
If not provided, it is assumed to be "1.0".
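The defaulting rule above can be sketched in a few lines of C++. This is only an illustration: `resolve_version` and `looks_like_version` are hypothetical helpers, not part of the **uzuki2** API, and the rejection of leading zeros is an assumption that mirrors the parser added in this commit rather than something the README states.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: returns the version string to use, given the value of
// the `uzuki2_version` attribute (or std::nullopt if the attribute is absent).
inline std::string resolve_version(const std::optional<std::string>& attr) {
    return attr.value_or("1.0"); // the specification's default
}

// Hypothetical helper: checks that a string has the `X.Y` shape for
// non-negative integers X and Y. Leading zeros are rejected (an assumption).
inline bool looks_like_version(const std::string& v) {
    static const std::regex pattern("(0|[1-9][0-9]*)\\.(0|[1-9][0-9]*)");
    return std::regex_match(v, pattern);
}
```

A reader would typically call `resolve_version` first and then validate the result before doing any version-specific work.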

### Lists

An R list is represented as an HDF5 group (`**/`) with the following attributes:
@@ -41,7 +45,7 @@ If the list is named, there will additionally be a 1-dimensional `**/names` stri
An atomic vector is represented as an HDF5 group (`**/`) with the following attributes:

- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"`, `"string"`, `"date"` or `"date-time"`.
- `uzuki_type`, a scalar string dataset containing one of `"integer"`, `"boolean"`, `"number"` or `"string"`.

The group should contain a 1-dimensional dataset at `**/data`.
Vectors of length 1 may also be represented as a scalar dataset.
@@ -51,7 +55,7 @@ The allowed HDF5 datatype depends on `uzuki_type`:
- `"integer"`, `"boolean"`: any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset.
- `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
- `"string"`, `"date"` or `"date-time"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
- `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
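One way to read the integer constraint above is as a check on the storage type itself: a signed type of up to 32 bits, or an unsigned type of up to 31 bits, can always be represented by a 32-bit signed integer. The sketch below assumes that reading. `fits_in_int32` is a hypothetical helper, and a real reader would obtain the signedness and precision from the HDF5 datatype.

```cpp
#include <cstddef>

// Hypothetical helper: decides whether every value of an integer type with the
// given signedness and precision (in bits) fits into a 32-bit signed integer.
inline bool fits_in_int32(bool is_signed, std::size_t bits) {
    // An unsigned type needs one spare bit because int32 spends a bit on the sign.
    return is_signed ? bits <= 32 : bits <= 31;
}
```

Under this reading, 8-bit and 16-bit types of either signedness pass, while an unsigned 32-bit type does not.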

For some `uzuki_type`, further considerations may be applicable:

@@ -60,22 +64,30 @@ For some `uzuki_type`, further considerations may be applicable:
- `string`: the `**/data` dataset may contain a `"missing-value-placeholder"` attribute.
If present, this should be a string scalar dataset that specifies the placeholder for missing values.
Any value of `**/data` that is equal to this placeholder should be treated as missing.
- `"date"`: like `"string"`, the `**/data` dataset may contain a `missing-value-placeholder` attribute.
The `**/data` dataset should only contain `YYYY-MM-DD` dates or the placeholder value.
- `"date-time"`: like `"string"`, the `**/data` dataset may contain a `missing-value-placeholder` attribute.
The `**/data` dataset should only contain date-times in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value.

For `string` types, the group may optionally contain the `**/format` dataset.
This should be a scalar string dataset that specifies constraints on the format of the values in `**/data`:

- `"date"`: strings should be `YYYY-MM-DD` dates or the placeholder value.
- `"date-time"`: strings should be in the Internet Date/Time format ([RFC 3339, Section 5.6](https://www.rfc-editor.org/rfc/rfc3339#section-5.6)) or the placeholder value.
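As a rough illustration of the `"date"` constraint, the following C++ sketch accepts a string if it equals the missing-value placeholder or has the `YYYY-MM-DD` shape. Both function names are hypothetical. The month and day checks are deliberately coarse (they do not account for month lengths or leap years), so this shows the idea without being a full validator.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: checks the YYYY-MM-DD shape with coarse range checks.
inline bool is_date(const std::string& s) {
    if (s.size() != 10 || s[4] != '-' || s[7] != '-') {
        return false;
    }
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i == 4 || i == 7) {
            continue; // separators were already checked
        }
        if (!std::isdigit(static_cast<unsigned char>(s[i]))) {
            return false;
        }
    }
    int month = (s[5] - '0') * 10 + (s[6] - '0');
    int day = (s[8] - '0') * 10 + (s[9] - '0');
    return month >= 1 && month <= 12 && day >= 1 && day <= 31;
}

// Hypothetical helper: an entry is acceptable if it is the placeholder
// (when one is defined) or a date.
inline bool is_valid_date_entry(const std::string& s, const std::optional<std::string>& placeholder) {
    if (placeholder && s == *placeholder) {
        return true;
    }
    return is_date(s);
}
```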

The atomic vector's group may also contain `**/names`, a 1-dimensional string dataset of length equal to that of `**/data`.
If `**/data` is a scalar, `**/names` should have length 1.

<details>
<summary>Changes from previous versions</summary>
In version 1.0, it was possible to have `uzuki_type` set to `"date"` or `"date-time"`.
This is the same as `uzuki_type` of `"string"` with `**/format` set to `"date"` or `"date-time"`.
</details>

### Factors

A factor is represented as an HDF5 group (`**/`) with the following attributes:

- `uzuki_object`, a scalar string dataset containing the value `"vector"`.
- `uzuki_type`, a scalar string dataset containing one of `"factor"` or `"ordered"`.
- `uzuki_type`, a scalar string dataset containing `"factor"`.

The group should contain an 1-dimensional dataset at `**/data`.
The group should contain a 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
This should be a type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
Missing values are represented by -2147483648.

@@ -85,6 +97,15 @@ Values in `**/data` should be non-negative (missing values excepted) and less th

The group may also contain `**/names`, a 1-dimensional string dataset of length equal to `data`.

The group may optionally contain `**/ordered`, a scalar integer dataset.
This should be interpreted as a boolean where a non-zero value specifies that we should assume that the levels are ordered.
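The constraints on the factor's `**/data` described above (0-based, non-negative, less than the number of levels, with -2147483648 marking missing values) can be sketched as a single pass over the codes. `valid_factor_codes` is a hypothetical name, and the codes are assumed to have already been read into a vector of 32-bit integers.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: checks that every code is either the missing-value
// sentinel or a valid 0-based index into the levels.
inline bool valid_factor_codes(const std::vector<std::int32_t>& codes, std::size_t num_levels) {
    // The smallest 32-bit signed integer, i.e., -2147483648.
    constexpr std::int32_t missing = std::numeric_limits<std::int32_t>::min();
    for (auto c : codes) {
        if (c == missing) {
            continue; // missing values are always allowed
        }
        if (c < 0 || static_cast<std::size_t>(c) >= num_levels) {
            return false;
        }
    }
    return true;
}
```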

<details>
<summary>Changes from previous versions</summary>
In version 1.0, it was possible to have `uzuki_type` set to `"ordered"`.
This is the same as `uzuki_type` of `"factor"` with the `**/ordered` dataset set to a truthy value.
</details>
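A reader that supports both layouts might normalize the version 1.0 types into the newer representation before further processing. The sketch below is one hypothetical way to do this. `NormalizedType` and `normalize_legacy_type` are illustrative names that do not appear in the library.

```cpp
#include <string>

// Hypothetical container for the normalized description of a vector.
struct NormalizedType {
    std::string type;   // e.g., "string" or "factor"
    std::string format; // "date", "date-time", or empty if there is no format
    bool ordered;       // only meaningful when type is "factor"
};

// Hypothetical helper: maps a version 1.0 type onto its newer equivalent.
// Types that were not deprecated are passed through unchanged.
inline NormalizedType normalize_legacy_type(const std::string& type) {
    if (type == "ordered") {
        return {"factor", "", true};
    }
    if (type == "date" || type == "date-time") {
        return {"string", type, false};
    }
    return {type, "", false};
}
```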

### Nothing

A "nothing" (a.k.a., "null", "none") value is represented as an HDF5 group with the following attributes:
@@ -108,6 +129,9 @@ The exact mechanism by which this restoration occurs is implementation-defined.
All R objects are represented by JSON objects with a `type` property.
Every R object should be nested inside an R list.

The top-level object may have a `version` property that contains the **uzuki2** specification version as a `"X.Y"` string for non-negative integers `X` and `Y`.
If missing, the version can be assumed to be "1.0".

### Lists

An R list is represented as a JSON object with the following properties:
@@ -121,19 +145,24 @@ An R list is represented as a JSON object with the following properties:

An atomic vector is represented as a JSON object with the following properties:

- `type`, set to one of `"integer"`, `"boolean"`, `"number"`, `"string"` or `"date"`.
- `type`, set to one of `"integer"`, `"boolean"`, `"number"` or `"string"`.
- `values`, an array of values for the vector (see below).
This may also be a scalar of the same type as the array contents.
- (optional) `"names"`, an array of length equal to `values`, containing the names of the vector elements.
If `values` is a scalar, `names` should have length 1.

The contents of `values` is subject to some constraints:

- `"integer"`: values should be JSON numbers that can fit into a 32-bit signed integer.
- `"number"`: values should be JSON numbers.
Missing values are represented by `null`.
- `"integer"`: values should be JSON numbers that can be represented by a 32-bit signed integer.
Missing values may be represented by `null` or the special value -2147483648.
- `"boolean"`: values should be JSON booleans or `null` (for missing values).
- `string`: values should be JSON strings.
`null` is also allowed and represents a missing value.

For `type` of `"string"`, the object may optionally have a `format` property that constrains the `values`:

- `"date"`: values should be JSON strings following a `YYYY-MM-DD` format.
`null` is also allowed and represents a missing value.
- `"date-time"`: values should be JSON strings following the Internet Date/Time format.
@@ -142,18 +171,32 @@ The contents of `values` is subject to some constraints:
Vectors of length 1 may also be represented as scalars of the appropriate type.
While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.
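The `"integer"` rules for the JSON format (a JSON number that fits into a 32-bit signed integer, with `null` or -2147483648 denoting a missing value) can be sketched as follows. The function name is hypothetical, and the JSON value is assumed to have been decoded already into an optional double, with `std::nullopt` standing in for `null`. Rejecting non-integral numbers is also an assumption of this sketch.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>

// Hypothetical helper: converts a decoded JSON value into an integer.
// Returns std::nullopt for missing values and throws for invalid input.
inline std::optional<std::int32_t> to_integer(const std::optional<double>& json_value) {
    if (!json_value) {
        return std::nullopt; // JSON null
    }
    double v = *json_value;
    // NaN fails the first comparison; infinities fail the range checks.
    if (std::floor(v) != v || v < -2147483648.0 || v > 2147483647.0) {
        throw std::range_error("value cannot be represented by a 32-bit signed integer");
    }
    auto i = static_cast<std::int32_t>(v);
    if (i == std::numeric_limits<std::int32_t>::min()) {
        return std::nullopt; // the special missing value
    }
    return i;
}
```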

<details>
<summary>Changes from previous versions</summary>
In version 1.0, it was possible to have `type` set to `"date"` or `"date-time"`.
This is the same as `"type": "string"` with `format` set to `"date"` or `"date-time"`.
</details>

### Factors

A factor is represented as a JSON object with the following properties:

- `type`, set to one of `"factor"` or `"ordered"`.
- `values`, an array of integer indices for the factor.
- `type`, set to `"factor"`.
- `values`, an array of 0-based integer indices for the factor.
These should be non-negative JSON numbers that can fit into a 32-bit signed integer.
They should also be less than the length of `levels`.
Missing values may be represented by `null` or the special value -2147483648.
- `levels`, an array of unique strings containing the levels for the indices in `values`.
- (optional) `ordered`, a boolean indicating whether to assume that the levels are ordered.
If absent, levels are assumed to be non-ordered.
- (optional) `"names"`, an array of length equal to `values`, containing the names of the factor elements.

<details>
<summary>Changes from previous versions</summary>
In version 1.0, it was possible to have `"type": "ordered"`.
This is the same as `"type": "factor"` with `"ordered": true`.
</details>

### Nothing

A "nothing" (a.k.a., "null", "none") value is represented as a JSON object with the following properties:
82 changes: 82 additions & 0 deletions include/uzuki2/ParsedList.hpp
@@ -0,0 +1,82 @@
#ifndef UZUKI2_PARSED_LIST_HPP
#define UZUKI2_PARSED_LIST_HPP

#include <memory>
#include <utility>

#include "interfaces.hpp"
#include "Version.hpp"

/**
 * @file ParsedList.hpp
 * @brief Class to hold the parsed list.
 */

namespace uzuki2 {

/**
 * @brief Results of parsing a list from file.
 *
 * This wraps a pointer to `Base` and is equivalent to `shared_ptr<Base>` for most applications.
 * It contains some extra metadata to hold the version information.
 */
struct ParsedList {
    /**
     * @cond
     */
    ParsedList(std::shared_ptr<Base> p, Version v) : version(std::move(v)), ptr(std::move(p)) {}
    /**
     * @endcond
     */

    /**
     * @return Pointer to `Base`.
     */
    Base* get() const {
        return ptr.get();
    }

    /**
     * @return Reference to `Base`.
     */
    Base& operator*() const {
        return *ptr;
    }

    /**
     * @return Pointer to `Base`.
     */
    Base* operator->() const {
        return ptr.operator->();
    }

    /**
     * @return Whether this stores a non-null pointer.
     */
    operator bool() const {
        return ptr.operator bool();
    }

    /**
     * Calls the corresponding method for `ptr`.
     * @tparam Args_ Assorted arguments.
     * @param args Arguments forwarded to the corresponding method for `ptr`.
     */
    template<typename ...Args_>
    void reset(Args_&& ... args) {
        // Not const-qualified, as resetting modifies the stored pointer.
        ptr.reset(std::forward<Args_>(args)...);
    }

    /**
     * Version of the **uzuki2** specification.
     */
    Version version;

    /**
     * Pointer to the `Base` object.
     */
    std::shared_ptr<Base> ptr;
};

}

#endif
97 changes: 97 additions & 0 deletions include/uzuki2/Version.hpp
@@ -0,0 +1,97 @@
#ifndef UZUKI2_VERSIONED_BASE_HPP
#define UZUKI2_VERSIONED_BASE_HPP

#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

/**
 * @file Version.hpp
 * @brief Version-related definitions.
 */

namespace uzuki2 {

/**
 * @brief Version number.
 */
struct Version {
    /**
     * @cond
     */
    Version() = default;
    Version(int maj, int min) : major(maj), minor(min) {}
    /**
     * @endcond
     */

    /**
     * Major version number, must be positive.
     */
    int major = 1;

    /**
     * Minor version number, must be non-negative.
     */
    int minor = 0;

    /**
     * @param maj Major version number.
     * @param min Minor version number.
     * @return Whether the version is equal to `<maj>.<min>`.
     */
    bool equals(int maj, int min) const {
        return (major == maj && minor == min);
    }
};

/**
 * @cond
 */
inline Version parse_version_string(const std::string& version_string) {
    int major = 0, minor = 0;
    std::size_t i = 0, end = version_string.size();

    if (version_string.empty()) {
        throw std::runtime_error("version string is empty");
    }
    // The major version must be positive, so a leading '0' is never valid.
    if (version_string[i] == '0') {
        throw std::runtime_error("invalid version string '" + version_string + "' has leading zeros in its major version");
    }
    // A leading period means that no major version was supplied at all.
    if (version_string[i] == '.') {
        throw std::runtime_error("version string '" + version_string + "' is missing a major version");
    }
    while (i < end && version_string[i] != '.') {
        // Cast to unsigned char, as std::isdigit is undefined for negative values.
        if (!std::isdigit(static_cast<unsigned char>(version_string[i]))) {
            throw std::runtime_error("invalid version string '" + version_string + "' contains non-digit characters");
        }
        major *= 10;
        major += version_string[i] - '0';
        ++i;
    }

    if (i == end) {
        throw std::runtime_error("version string '" + version_string + "' is missing a minor version");
    }
    ++i; // get past the period and check again.
    if (i == end) {
        throw std::runtime_error("version string '" + version_string + "' is missing a minor version");
    }

    // A minor version of exactly "0" is fine, but "0" followed by anything else is not.
    if (version_string[i] == '0' && i + 1 < end) {
        throw std::runtime_error("invalid version string '" + version_string + "' has leading zeros in its minor version");
    }
    while (i < end) {
        if (!std::isdigit(static_cast<unsigned char>(version_string[i]))) {
            throw std::runtime_error("invalid version string '" + version_string + "' contains non-digit characters");
        }
        minor *= 10;
        minor += version_string[i] - '0';
        ++i;
    }

    return Version(major, minor);
}
/**
 * @endcond
 */

}

#endif