Various improvements for v2 draft (frictionlessdata#47)
* Fixed security spec navigation

* Updated the guide

* Reference url-or-path from glossary

* Hide empty contributing for now

* Fixed links
roll authored Apr 1, 2024
1 parent a3ac601 commit e39cd47
Showing 11 changed files with 117 additions and 102 deletions.
61 changes: 59 additions & 2 deletions content/docs/guides/using-data-package.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,63 @@ sidebar:
order: 1
---

:::caution
This section is under development
There are many alternatives when it comes to Data Package Standard implementations. We will cover a few of the most popular options, which make a good starting point.

:::tip
Please take a look at the full list of Data Package [Software](../../standard/software/) to find other implementations.
:::

## Open Data Editor

The simplest way to start using the Data Package Standard is by installing [Open Data Editor](https://opendataeditor.okfn.org/) (currently in beta):

[![Open Data Editor](../../../assets/software/ode.png)](https://opendataeditor.okfn.org)

You can use the visual interface as you would in any modern IDE: adding and moving files, validating data, and so on. Under the hood, Open Data Editor creates Data Package descriptors for your datasets (you can also do this explicitly by creating a dataset) and infers metadata and data types. When the data curation work is done, a data package can be validated and published, for example, to CKAN.

Please refer to the [Open Data Editor's documentation](https://opendataeditor.okfn.org) to read about all the features.
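
To make the idea of a descriptor more concrete, here is a minimal sketch in plain Python of what such a descriptor might look like. It is not how Open Data Editor works internally; the `build_descriptor` helper, the dataset name, and the file paths are all hypothetical and chosen only for illustration.

```python
import json
from pathlib import PurePosixPath


def build_descriptor(name, paths):
    """Build a minimal Data Package descriptor (hypothetical helper).

    Each resource gets a `name` derived from its file name and a relative `path`.
    """
    resources = [{"name": PurePosixPath(p).stem, "path": p} for p in paths]
    return {"name": name, "resources": resources}


# Hypothetical dataset name and file paths, for illustration only.
descriptor = build_descriptor("my-dataset", ["data/cities.csv", "data/countries.csv"])

# This JSON text is what would typically be saved as `datapackage.json`.
print(json.dumps(descriptor, indent=2))
```

A real tool would usually add inferred metadata (for example, a schema per resource) on top of this skeleton.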

## frictionless-py

If you prefer a command-line interface, or Python, there is [frictionless-py](https://framework.frictionlessdata.io/), a complete framework for managing data packages. Here are the main commands available in the CLI:

```bash
frictionless describe # to describe your data
frictionless explore # to explore your data
frictionless extract # to extract your data
frictionless index # to index your data
frictionless list # to list your data
frictionless publish # to publish your data
frictionless query # to query your data
frictionless script # to script your data
frictionless validate # to validate your data
frictionless --help # to get the list of commands
frictionless --version # to get the version
```

Please refer to the [frictionless-py's documentation](https://framework.frictionlessdata.io/) to read about all the features.

## frictionless-r

For the R community, there is the [frictionless-r](https://docs.ropensci.org/frictionless/) package, which allows managing data packages in R. For example:

```r
library(frictionless)

# Read the datapackage.json file
# This gives you access to all Data Resources of the Data Package without
# reading them, which is convenient and fast.
package <- read_package("https://zenodo.org/records/10053702/files/datapackage.json")

package

# List resources
resources(package)

# Read data from the resource "gps"
# This will return a single data frame, even though the data are split over
# multiple zipped CSV files.
read_resource(package, "gps")
```

Please refer to the [frictionless-r's documentation](https://docs.ropensci.org/frictionless/) to read about all the features.
6 changes: 3 additions & 3 deletions content/docs/specifications/data-package.md
Original file line number Diff line number Diff line change
Expand Up @@ -173,7 +173,7 @@ Here is an example:
```

- `name`: The `name` `MUST` be an [Open Definition license ID](http://licenses.opendefinition.org/)
- `path`: A [url-or-path](../data-resource/#url-or-path) string, that is a fully qualified HTTP address, or a relative POSIX path.
- `path`: A [URL or Path](../glossary/#url-or-path) string, that is a fully qualified HTTP address, or a relative POSIX path.
- `title`: A human-readable title.

### `profile`
Expand Down Expand Up @@ -210,7 +210,7 @@ A URL for the home on the web that is related to this data package.

An image to use for this data package. For example, when showing the package in a listing.

The value of the image property `MUST` be a string pointing to the location of the image. The string `MUST` be a [url-or-path](../data-resource/#url-or-path), that is a fully qualified HTTP address, or a relative POSIX path.
The value of the image property `MUST` be a string pointing to the location of the image. The string `MUST` be a [URL or Path](../glossary/#url-or-path), that is a fully qualified HTTP address, or a relative POSIX path.

### `version`

Expand Down Expand Up @@ -277,6 +277,6 @@ The raw sources for this data package. It `MUST` be an array of Source objects.
```

- `title`: title of the source (e.g. document or organization name)
- `path`: A [url-or-path][] string, that is a fully qualified HTTP address, or a relative POSIX path (see [the url-or-path definition in Data Resource for details][url-or-path]).
- `path`: A [URL or Path](../glossary/#url-or-path) string, that is a fully qualified HTTP address, or a relative POSIX path.
- `email`: An email address
- `version`: A version of the source
32 changes: 2 additions & 30 deletions content/docs/specifications/data-resource.md
Original file line number Diff line number Diff line change
Expand Up @@ -98,7 +98,7 @@ A resource `MUST` contain a property describing the location of the data associa

#### Single File

If a resource have only a single file then `path` `MUST` be a string that a "url-or-path" as defined in [URL of Path](#url-or-path) section.
If a resource has only a single file then `path` `MUST` be a string that is a "url-or-path" as defined in the [URL or Path](../glossary/#url-or-path) definition.

#### Multiple Files

Expand Down Expand Up @@ -227,34 +227,6 @@ A Data Resource `MAY` have a `schema` property to describe the schema of the res

The value for the `schema` property on a `resource` MUST be an `object` representing the schema OR a `string` that identifies the location of the schema.

If a `string` it must be a [url-or-path as defined above](#url-or-path), that is a fully qualified http URL or a relative POSIX path. The file at the location specified by this url-or-path string `MUST` be a JSON document containing the schema.
If a `string` it must be a [URL or Path](../glossary/#url-or-path), that is a fully qualified http URL or a relative POSIX path. The file at the location specified by this [URL or Path](../glossary/#url-or-path) string `MUST` be a JSON document containing the schema.

NOTE: the Data Package specification places no restrictions on the form of the schema Object. This flexibility enables specific communities to define schemas appropriate for the data they manage. As an example, the [Tabular Data Package](https://specs.frictionlessdata.io/tabular-data-package/) specification requires the schema to conform to [Table Schema](../table-schema/).

## URL or Path

A `url-or-path` is a `string` with the following additional constraints:

- `MUST` either be a URL or a POSIX path
- [URLs](https://en.wikipedia.org/wiki/Uniform_Resource_Locator) `MUST` be fully qualified. `MUST` be using either http or https scheme. (Absence of a scheme indicates `MUST` be a POSIX path)
- [POSIX paths](https://en.wikipedia.org/wiki/Path_%28computing%29#POSIX_pathname_definition) (unix-style with `/` as separator) are supported for referencing local files, with the security restraint that they `MUST` be relative siblings or children of the descriptor. Absolute paths `/`, relative parent paths `../`, hidden folders starting from a dot `.hidden` `MUST` NOT be used.

Example of a fully qualified url:

```json
{
"path": "http://ex.datapackages.org/big-csv/my-big.csv"
}
```

Example of a relative path that this will work both as a relative path on disk and online:

```json
{
"path": "my-data-directory/my-csv.csv"
}
```

:::caution[Security]
`/` (absolute path) and `../` (relative parent path) are forbidden to avoid security vulnerabilities when implementing data package software. These limitations on resource `path` ensure that resource paths only point to files within the data package directory and its subdirectories. This prevents data package software being exploited by a malicious user to gain unintended access to sensitive information. For example, suppose a data package hosting service stores packages on disk and allows access via an API. A malicious user uploads a data package with a resource path like `/etc/passwd`. The user then requests the data for that resource and the server naively opens `/etc/passwd` and returns that data to the caller.
:::
33 changes: 20 additions & 13 deletions content/docs/specifications/glossary.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
---
title: Glossary
sidebar:
hidden: true
order: 6
---

Expand All @@ -20,22 +19,30 @@ The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `S

## Definitions

:::caution
This section is under development
:::

### Metadata Descriptor
### URL or Path

### Metadata Profile
A `URL or Path` is a `string` with the following additional constraints:

### Tabular Data
- `MUST` either be a URL or a POSIX path
- [URLs](https://en.wikipedia.org/wiki/Uniform_Resource_Locator) `MUST` be fully qualified and `MUST` use either the http or https scheme. (If a scheme is absent, the value `MUST` be a POSIX path.)
- [POSIX paths](https://en.wikipedia.org/wiki/Path_%28computing%29#POSIX_pathname_definition) (unix-style with `/` as separator) are supported for referencing local files, with the security restriction that they `MUST` be relative siblings or children of the descriptor. Absolute paths (`/`), relative parent paths (`../`), and hidden folders starting with a dot (`.hidden`) `MUST NOT` be used.

### Physical Level
Example of a fully qualified url:

### Logical Level
```json
{
"path": "http://ex.datapackages.org/big-csv/my-big.csv"
}
```

### Data Consumer
Example of a relative path that will work both on disk and online:

### Data Producer
```json
{
"path": "my-data-directory/my-csv.csv"
}
```

### Implementation
:::caution[Security]
`/` (absolute path) and `../` (relative parent path) are forbidden to avoid security vulnerabilities when implementing data package software. These limitations on resource `path` ensure that resource paths only point to files within the data package directory and its subdirectories. This prevents data package software being exploited by a malicious user to gain unintended access to sensitive information. For example, suppose a data package hosting service stores packages on disk and allows access via an API. A malicious user uploads a data package with a resource path like `/etc/passwd`. The user then requests the data for that resource and the server naively opens `/etc/passwd` and returns that data to the caller.
:::
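
The constraints above can be sketched as a small check in Python. This is only an illustration under stated assumptions, not a reference implementation: the function name `is_url_or_path` is hypothetical, and the sketch takes a strict reading that rejects any directory segment beginning with a dot.

```python
from urllib.parse import urlsplit


def is_url_or_path(value):
    """Return True if `value` looks like a valid URL or Path (hypothetical helper)."""
    if not isinstance(value, str) or not value:
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    if parts.scheme:
        # A scheme means a URL: it must be fully qualified and use http or https.
        return parts.scheme in ("http", "https") and bool(parts.netloc)
    # No scheme: the value must be a relative POSIX path.
    if value.startswith("/"):
        return False  # absolute path
    segments = value.split("/")
    if ".." in segments:
        return False  # relative parent path
    # Assumption: any directory segment starting with a dot counts as hidden.
    return not any(segment.startswith(".") for segment in segments[:-1])


print(is_url_or_path("http://ex.datapackages.org/big-csv/my-big.csv"))
print(is_url_or_path("my-data-directory/my-csv.csv"))
print(is_url_or_path("/etc/passwd"))
```

Checking the value before opening anything is the point of the design: the path is rejected on its textual form alone, so a loader never touches the file system for a forbidden path.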
16 changes: 8 additions & 8 deletions content/docs/specifications/security.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ Security considerations around Data Packages and Data Resources.

The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `SHOULD NOT`, `RECOMMENDED`, `MAY`, and `OPTIONAL` in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt)

## Library users
## Usage Perspective

A Data Package is a container format that allows the creator to specify payload data (Resources) either as JSON
objects/arrays or via pointers. There are two pointer formats:
Expand All @@ -36,7 +36,7 @@ ONLY in a trusted environment (eg. your own computer during development of Data
all kinds of Resource pointers. In every other environment, you MUST keep the various attack scenarios in mind and
filter out potentially dangerous Resource pointer types

## Dangerous Descriptor/Resource pointer combinations
### Dangerous Descriptor/Resource pointer combinations

How to read the table: if your "datapackage.json"-file comes from one of the sources on the left, you should treat
Resources in the format on the top as:
Expand All @@ -47,7 +47,7 @@ Resources in the format on the top as:

![Security Matrix](./assets/security-matrix.png)

### Descriptor source is a URL
#### Descriptor source is a URL

If your descriptor is loaded via URL, and the server to which the URL points is not fully trusted, you
SHOULD NOT allow Data Packages with Resource pointers in
Expand All @@ -64,7 +64,7 @@ each could point to very large CSV files hosted somewhere. The Data Package proc
those CSV files which might overwhelm the user's computer. If an attacker were able to spread such a malicious
Data Package, this could exhaust the resources of a hosting service.

### Descriptor source is a local relative path
#### Descriptor source is a local relative path

If your descriptor is loaded via a local relative path, and the source of the Data Package is not fully trusted, you
SHOULD NOT allow Data Packages with Resource pointers in
Expand All @@ -82,7 +82,7 @@ as well as crafting malicious Data Packages. In the above table, this case is th
If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the
internet, it is NEVER safe to accept Data Packages containing URL-based Resource pointers.

### Descriptor source is a local relative path
#### Descriptor source is a local absolute path

While it is never safe to accept absolute file paths for Resources, it is perfectly safe to accept them for Descriptor
files. If your descriptor is loaded via a local absolute path, and the source of the Data Package is not fully
Expand All @@ -101,7 +101,7 @@ as well as crafting malicious Data Packages. In the above table, this case is th
If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the
internet, it is NEVER safe to accept Data Packages containing URL-based Resource pointers.

### Descriptor source is a JSON object
#### Descriptor source is a JSON object

If the Descriptor is not loaded from file but created in-memory and the source of the Data Package is not fully
trusted, you SHOULD NOT allow Data Packages with Resource pointers in
Expand All @@ -120,13 +120,13 @@ as well as crafting malicious Data Packages. In the above table, this case is th
If Data Package parsing is part of a service offered to computers across subnets on the same LAN or even open to the
internet, it is NEVER safe to accept Data Packages containing URL-based Resource pointers.

### Descriptor source is a self-created JSON object
#### Descriptor source is a self-created JSON object

If the Descriptor is not loaded from file or created via a third-party application but by your software, it is
generally assumed you know what you are doing and, therefore, loading Resources from URLs or files is considered safe. You
still SHOULD NOT use absolute paths as a matter of precaution - and implementing libraries should filter them out.
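
As a rough illustration of the filtering described in this section, the following Python sketch lists the Resource pointers a cautious loader might refuse. It is an assumption-laden example, not part of any existing library: the function name `unsafe_resource_paths`, the `allow_urls` flag, and the sample descriptor are all hypothetical.

```python
def unsafe_resource_paths(descriptor, allow_urls=False):
    """Return resource paths a cautious loader would refuse (hypothetical helper)."""
    rejected = []
    for resource in descriptor.get("resources", []):
        paths = resource.get("path", [])
        if isinstance(paths, str):
            paths = [paths]  # a single file is given as a plain string
        for path in paths:
            is_url = path.startswith(("http://", "https://"))
            is_absolute = path.startswith("/")
            escapes = not is_url and ".." in path.split("/")
            if is_absolute or escapes or (is_url and not allow_urls):
                rejected.append(path)
    return rejected


# Hypothetical descriptor mixing safe and unsafe pointers.
example = {
    "resources": [
        {"name": "a", "path": "data/a.csv"},
        {"name": "b", "path": "/etc/passwd"},
        {"name": "c", "path": ["parts/1.csv", "http://example.com/2.csv"]},
    ]
}

print(unsafe_resource_paths(example))
print(unsafe_resource_paths(example, allow_urls=True))
```

Keeping URL handling behind an explicit flag mirrors the guidance above: URL-based pointers are refused by default and only allowed when the caller states that the environment is trusted.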

## Library creators
## Implementation Perspective

Two kinds of Resource pointers can never be guaranteed to be totally safe:

Expand Down
2 changes: 1 addition & 1 deletion content/docs/specifications/table-schema.md
Original file line number Diff line number Diff line change
Expand Up @@ -349,7 +349,7 @@ See [Field Constraints](#field-constraints)

#### `missingValues`

A list of missing values for this field as per [Missing Values](#missing-values) definition. If this property is defined, it takes precedence over the schema-level property and completely replaces it for the field without combining the values.
A list of missing values for this field as per [Missing Values](#missingvalues) definition. If this property is defined, it takes precedence over the schema-level property and completely replaces it for the field without combining the values.
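
The precedence rule can be sketched in a few lines of Python. This is a minimal illustration, not code from any implementation: the helper name `effective_missing_values` and the sample schema are hypothetical, and the fallback of `[""]` reflects the schema-level default described in Table Schema.

```python
def effective_missing_values(schema, field):
    """Return the missing values that apply to `field` (hypothetical helper)."""
    if "missingValues" in field:
        # The field-level list replaces the schema-level list entirely.
        return list(field["missingValues"])
    # Otherwise fall back to the schema-level list, or its default of [""].
    return list(schema.get("missingValues", [""]))


# Hypothetical schema, for illustration only.
schema = {
    "missingValues": ["", "NA"],
    "fields": [
        {"name": "a", "type": "integer", "missingValues": ["-99"]},
        {"name": "b", "type": "string"},
    ],
}

for field in schema["fields"]:
    print(field["name"], effective_missing_values(schema, field))
```

Note that field `a` does not treat `"NA"` as missing: the two lists are never merged.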

For example, for the Table Schema below:

Expand Down
8 changes: 4 additions & 4 deletions content/docs/standard/changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,13 +34,13 @@ This change allows omitting `title` property for the `contributor` and `source`

##### Added `contributor.given/familyName`

This change adds two new properties to the `contributor` object: `givenName` and `familyName`. Please read more about [`package.contributors`](../../specifications/data-resource/#contributors) property.
This change adds two new properties to the `contributor` object: `givenName` and `familyName`. Please read more about [`package.contributors`](../../specifications/data-package/#contributors) property.

> [Pull Request -- #20](https://github.com/frictionlessdata/datapackage/pull/20)
##### Added `contributor.roles` property

This change adds a new `contributors.roles` property that replaces `contributor.role`. Please read more about [`package.contributors`](../../specifications/data-resource/#contributors) property.
This change adds a new `contributor.roles` property that replaces `contributor.role`. Please read more about [`package.contributors`](../../specifications/data-package/#contributors) property.

> [Pull Request -- #18](https://github.com/frictionlessdata/datapackage/pull/18)
Expand All @@ -54,7 +54,7 @@ This change adds omitted `version` property to the Data Package profiles.

##### Relaxed `resource.name` rules but keep it required and unique

This change relaxes requirements to `resource.name` allowing it to be any string. This property still needs to present and be unique among resources. Please read more about [`resource.name`](../../specifications/data-resource/#name) property.
This change relaxes the requirements for `resource.name`, allowing it to be any string. This property still needs to be present and be unique among resources. Please read more about [`resource.name`](../../specifications/data-resource/#name-required) property.

> [Pull Request -- #27](https://github.com/frictionlessdata/datapackage/pull/27)
Expand Down Expand Up @@ -110,7 +110,7 @@ This change adds a new constraint for the `object` and `array` fields. Please re
##### Support `groupChar` for integer field type

This change adds support for providing integers with group chars. Please read more about [`field.groupChar`](../../specifications/table-schema/#groupchar) property.
This change adds support for providing integers with group chars. Please read more about [`field.groupChar`](../../specifications/table-schema/#integer) property.

> [Pull Request -- #6](https://github.com/frictionlessdata/datapackage/pull/6)
Expand Down
1 change: 1 addition & 0 deletions content/docs/standard/contributing.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
title: Contributing
sidebar:
order: 9
hidden: true
---

:::caution
Expand Down
20 changes: 4 additions & 16 deletions content/docs/standard/extensions.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -18,13 +18,13 @@ One of the key strengths of the Data Package Standard lies in its extensibility.
<LinkCard
title="Tabular Data Package"
description="A data package for tabular data, including spreadsheets and CSV files."
href="/specifications/tabular-data-package"
href="../../extensions/tabular-data-package"
/>

<LinkCard
title="Tabular Data Resource"
description="A resource type within tabular data packages, typically containing structured data tables."
href="/specifications/tabular-data-resource"
href="../../extensions/tabular-data-resource"
/>
</CardGrid>

Expand All @@ -34,24 +34,12 @@ One of the key strengths of the Data Package Standard lies in its extensibility.
<LinkCard
title="Camtrap Data Package"
description="A data package for managing and sharing camera trap data."
href="/specifications/camtrap-data-package"
href="../../extensions/camtrap-data-package"
/>

<LinkCard
title="Fiscal Data Package"
description="A data package for fiscal data, including financial reports and government spending information."
href="/specifications/fiscal-data-package"
/>

<LinkCard
title="Fiscal Data Package - Budget Standard Taxonomy"
description="A standard taxonomy for categorizing budgetary data within fiscal data packages."
href="/specifications/fiscal-data-package-budget-standard"
/>

<LinkCard
title="Fiscal Data Package - Spending Standard Taxonomy"
description="A standard taxonomy for categorizing spending data within fiscal data packages."
href="/specifications/fiscal-data-package-spending-standard"
href="../../extensions/fiscal-data-package"
/>
</CardGrid>
10 changes: 0 additions & 10 deletions content/docs/standard/guides.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -21,13 +21,3 @@ The least formal part of the standard containing various guides on how to get st
href="/guides/using-data-package"
/>
</CardGrid>

## Extensions

<CardGrid>
<LinkCard
title="How to extend Data Package"
description="This guide will walk you through an example of creating a Data Package extension."
href="/guides/extending-data-package"
/>
</CardGrid>