docs: 📝 add start of the creating resources guide #810

322 changes: 322 additions & 0 deletions docs/guide/resources.qmd
@@ -0,0 +1,322 @@
---
title: "Creating and managing data resources"
order: 2
jupyter: python3
execute:
eval: false
---

Each [data package](/docs/design/interface/outputs.qmd) contains [data
resources](/docs/design/interface/outputs.qmd), each of which holds a
conceptually standalone set of data. This page shows you how to create
and manage data resources inside a data package using Sprout. We assume
that a data package has already been [created](packages.qmd).

{{< include _preamble.qmd >}}

::: callout-important
Data resources can only be created from [tidy
data](https://design.seedcase-project.org/data/). Before you can store
it, you need to process it into a tidy format, ideally using Python so
that you have a record of the steps taken to clean and transform the
data.
:::
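
As a rough illustration of what "processing into a tidy format" can look like, the sketch below reshapes a small wide table (one column per visit) into a long one (one row per patient per visit) using only the standard library. The column names and values are made up for this example and are not part of the guide's dataset.

```python
import csv
import io

# Hypothetical wide-format data: one weight column per visit.
wide_csv = """patient_id,weight_visit_1,weight_visit_2
1,80.5,79.0
2,65.2,66.1
"""


def wide_to_tidy(text: str) -> list[dict[str, str]]:
    """Reshape the table so that each row is one patient at one visit."""
    tidy_rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column.startswith("weight_visit_"):
                tidy_rows.append(
                    {
                        "patient_id": row["patient_id"],
                        "visit": column.removeprefix("weight_visit_"),
                        "weight": value,
                    }
                )
    return tidy_rows


tidy_rows = wide_to_tidy(wide_csv)
for tidy_row in tidy_rows:
    print(tidy_row)
```

Keeping this kind of reshaping in a script means the steps from the original file to the tidy file can be re-run and reviewed later.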

```{python setup}
#| include: false
# This `setup` code chunk loads packages and prepares the data.
import seedcase_sprout.core as sp
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# `TemporaryDirectory` is not itself a path, so wrap its name in a `Path`
# before joining it with other path parts.
temp_path = tempfile.TemporaryDirectory()
temp_dir = Path(temp_path.name)
package_path = sp.create_package_properties(
    properties=sp.example_package_properties(),
    path=temp_dir / "diabetes-study"
)
readme = sp.build_readme_text(sp.example_package_properties())
sp.write_text(readme, package_path.parent / "README.md")

# Since the path leads to the datapackage.json file, for later functions we need the folder instead.
package_path = package_path.parent

# TODO: Maybe eventually move this over into Sprout as an example dataset, rather than via a URL.
# Download the example data and save it in the temp path.
url = "https://raw.githubusercontent.com/seedcase-project/data/refs/heads/main/patients/patients.csv"
raw_data_path = temp_dir / "patients.csv"
urlretrieve(
    url,
    raw_data_path
)
```

Making a data resource requires that you actually have data that can be
a resource in the first place. Generated or collected data always starts
out in a bit of a "raw" shape that needs some work. For this guide,
we have a raw (but fake) data file that we've already made tidy and that
looks like:

```{python}
#| echo: false
with open(raw_data_path, "r") as f:
    print(f.read())
```

The path to this data file is stored in a variable called `raw_data_path`:

```{python}
print(raw_data_path)
```

Putting your raw data into a data
package makes it easier for yourself and others to use later on. So
the steps we'll take to get this raw data into the structure offered by
Sprout are:

1.  Create the properties for the resource, using the original raw data
    as a starting point and editing as needed.
2.  Create a folder to store the (processed) data resource in our
    package, as well as a folder for the (tidy) raw data.
3.  Save the properties of and path to the new data resource
    into the `datapackage.json` file.
4.  Re-build the data package's `README.md` file from the updated
    `datapackage.json` file.
5.  If you need to edit the properties at a later point, you can use
    `edit_resource_properties()` and then write the changes back to the
    `datapackage.json` file.

Before we start, we need to import Sprout as well as other
helper packages:

```{python}
import seedcase_sprout.core as sp

# For pretty printing of output
from pprint import pprint

# TODO: This could be a wrapper helper function instead
# To be able to write multiline strings without indentation
from textwrap import dedent
```

## Extract resource properties from raw data

Because many later functions need the resource's properties, let's
create them first. While you can create a resource properties object
manually using `ResourceProperties`, doing so can be time-consuming,
for example if your data has many columns. The easier approach is to
extract as much information as possible from the raw data into an
initial resource properties object with
`extract_resource_properties()`, and then edit the properties as
needed.

Let's start with extracting the resource properties from the raw data.
The function does a reasonable job of detecting and guessing this
information, but it is far from perfect, and it cannot guess things that
are not in the data itself, such as a description of what the data
contains or the units of measurement.

```{python}
resource_properties = sp.extract_resource_properties(
    data_path=raw_data_path
)
pprint(resource_properties)
```
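
To give a feel for what "guessing" from the data involves, here is a minimal standard-library sketch of type inference over CSV columns. It is only an illustration of the general idea, not how `extract_resource_properties()` is implemented, and the sample columns are hypothetical.

```python
import csv
import io


def guess_type(values: list[str]) -> str:
    """Guess a field type from string values, trying the strictest type first."""

    def all_parse(parser) -> bool:
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


# Hypothetical sample data, not the guide's `patients.csv`.
sample_csv = """id,height,name
1,172.5,Ada
2,180,Grace
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
guessed = {
    column: guess_type([row[column] for row in rows])
    for column in rows[0]
}
print(guessed)
```

Note what such a guess cannot recover: nothing in the values says whether `height` is in centimetres or what the column means, which is why descriptions still have to be written by hand.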

You may notice that some things are missing. For instance, the
individual columns (called fields) don't have any descriptions, so we
have to add these manually.
We can run a check on the properties to confirm what is missing:

```{python}
#| error: true
print(sp.check_resource_properties(resource_properties))
```

Let's fill in the description for all
the fields in the resource:

```{python}
# TODO: Need to consider/design how editing can be done in an easier, user-friendly way.
# TODO: Add more detail when we know what can and can't be extracted.
```
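
Until the editing workflow is settled, the idea of filling in field descriptions can be sketched with plain dictionaries that follow the general Data Package layout (`schema` containing a list of `fields`). This is an assumption-laden illustration: it does not use Sprout's own classes, and the field names and descriptions are invented.

```python
from pprint import pprint

# Hypothetical resource properties as a plain dictionary.
resource = {
    "name": "patients",
    "schema": {
        "fields": [
            {"name": "id", "type": "integer"},
            {"name": "age", "type": "integer"},
        ]
    },
}

# Descriptions written by hand, keyed by field name.
descriptions = {
    "id": "Unique identifier of the patient.",
    "age": "Age of the patient in years at enrolment.",
}

for field in resource["schema"]["fields"]:
    field["description"] = descriptions[field["name"]]

# List any fields that still lack a description.
missing = [
    field["name"]
    for field in resource["schema"]["fields"]
    if not field.get("description")
]
pprint(resource)
print(missing)
```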

## Creating a data resource

Now that we have the properties for the resource, we can create the
resource itself within the data package. What this means is that we want
a folder for the specific resource (since we may have more data
resources to add).

Our package has already been created (using the steps from the [package
guide](packages.qmd)), with the path set as the variable `package_path`:

```{python}
print(package_path)
```

We can look inside that path to see the current files and folders:

```{python}
print(list(package_path.glob("*")))
```

This shows that the data package already includes a `datapackage.json` file and a `README.md` file. Now, we will create the resource structure in
this package:

```{python}
resource_paths = sp.create_resource_structure(
    path=package_path
)
print(resource_paths)
```
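
Conceptually, creating the structure amounts to making a numbered folder for the resource with a `raw/` folder inside it. The sketch below shows that idea with `pathlib` alone. The `resources/<id>/raw` layout and the `make_resource_folders()` helper are assumptions made for illustration, so check the paths printed above for the layout Sprout actually uses.

```python
import tempfile
from pathlib import Path


def make_resource_folders(package_dir: Path) -> list[Path]:
    """Create the next numbered resource folder with a `raw/` subfolder."""
    resources_dir = package_dir / "resources"
    resources_dir.mkdir(exist_ok=True)
    existing_ids = [
        int(path.name) for path in resources_dir.iterdir() if path.name.isdigit()
    ]
    next_id = max(existing_ids, default=0) + 1
    resource_dir = resources_dir / str(next_id)
    raw_dir = resource_dir / "raw"
    raw_dir.mkdir(parents=True)
    return [resource_dir, raw_dir]


with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp)
    first = make_resource_folders(package_dir)
    second = make_resource_folders(package_dir)
    relative = [
        path.relative_to(package_dir).as_posix() for path in first + second
    ]
    print(relative)
```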

With the resource folder structure created, we are now ready to fill it
with our raw data! Next, we'll set up the resource properties so that
they are ready to be saved into the `datapackage.json` file. We can use
the `path_resource()` helper function to always get the correct path to
a specific resource's folder. In this case, our resource is the first
one in the package, so we can use `path_resource(1)`.

```{python}
resource_properties = sp.create_resource_properties(
    properties=resource_properties,
    path=package_path / sp.path_resource(1)
)
pprint(resource_properties)
```

::: callout-tip
If you want to see the list of resources available in your data package
via Python code (rather than looking at it directly in the file system),
you can use the `list_resources()` function.

```{python}
#| eval: false
print(sp.list_resources())
```
:::

This has prepared the properties so that they are ready to be added to
the `datapackage.json` file. Next, we write them to that file:

```{python}
sp.write_resource_properties(
    properties=resource_properties,
    path=sp.path_properties()
)
```

We can check the contents of the `datapackage.json` file to see that the
resource properties have been added:

```{python}
pprint(sp.read_properties(package_path / sp.path_properties()))
```
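
Since `datapackage.json` is an ordinary JSON file, you can also inspect it without Sprout. The following sketch writes a tiny stand-in file to a temporary folder and lists the resource names found in it. The file content here is hypothetical and much smaller than a real properties file.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    properties_path = Path(tmp) / "datapackage.json"

    # Hypothetical minimal content standing in for a real properties file.
    example = {
        "name": "diabetes-study",
        "resources": [{"name": "patients", "title": "Patients"}],
    }
    properties_path.write_text(json.dumps(example, indent=2), encoding="utf-8")

    # Read the file back and collect the names of the listed resources.
    properties = json.loads(properties_path.read_text(encoding="utf-8"))
    resource_names = [
        resource["name"] for resource in properties.get("resources", [])
    ]
    print(resource_names)
```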

## Storing a backup of the raw data

Before we start processing the raw data into a Parquet file, it is a
good idea to store a backup of the raw data. This is useful if you need
to re-process the data at a later point, troubleshoot any issues, update
incorrect values, or if you need to compare the stored raw data to your
original raw data.

As we showed above, the data is stored in the path that we've set as
`raw_data_path`. We can store this data in the resource's folder by
using:

```{python}
sp.write_resource_data_to_raw(
    data_path=raw_data_path,
    resource_properties=resource_properties
)
```

This function uses the properties object to determine where to store the
raw data, which is in the `raw/` folder of the resource's folder. We can
check the newly added file by using:

```{python}
print(sp.path_resource_raw_files(1))
```
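
One way to compare a stored backup with the original file is to compare checksums. The sketch below copies a small file into a `raw/` folder and checks that both copies hash to the same value. It uses only the standard library, and the file content and folder name are placeholders rather than what Sprout produces.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder for the original raw data file.
    original = Path(tmp) / "patients.csv"
    original.write_text("id,age\n1,54\n2,61\n", encoding="utf-8")

    # Copy it into a backup folder, keeping file metadata.
    raw_dir = Path(tmp) / "raw"
    raw_dir.mkdir()
    backup = Path(shutil.copy2(original, raw_dir / original.name))

    backup_matches = sha256_of(original) == sha256_of(backup)
    print(backup_matches)
```

If the checksums ever differ, either the original or the backup has changed since the copy was made.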

## Building the Parquet data resource file

Now that we've stored the raw data file, we can build the Parquet file
that will be used as the data resource. This Parquet file is built from
the raw data file that we've stored in the resource's folder.

```{python}
parquet_path = sp.build_resource_parquet(
    raw_files=sp.path_resource_raw_files(1),
    path=sp.path_resource_data(1)
)
print(parquet_path)
```

::: callout-tip
If you add more raw data to the resource later on, you can update this
Parquet file to include all the data in the raw folder by running
`build_resource_parquet()` again, as shown above.
:::

## Re-building the README file

One of the last steps to finish adding a new data resource is to
re-build the `README.md` file for the data package. To allow some
flexibility with what gets added to the README text, this next function
will only *build the text*, but not write it to the file. This allows
you to add additional information to the README text before writing it
to the file.

```{python}
readme_text = sp.build_readme_text(
    properties=sp.read_properties(package_path / sp.path_properties())
)
```

In this case, we don't want to add anything else, so we'll write the
text to the `README.md` file:

```{python}
sp.write_text(
    text=readme_text,
    # TODO: Make a helper function for this path?
    path=package_path / "README.md"
)
```
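
Had we wanted to add extra information, the built text is just a string, so it can be extended before it is written. A minimal sketch, using a stand-in string instead of real `build_readme_text()` output and plain `pathlib` for the writing step:

```python
import tempfile
from pathlib import Path

# Stand-in for the text returned by the README-building step.
readme_text = "# Diabetes study\n\nGenerated overview of the package.\n"

# Hypothetical extra section appended by hand.
extra_section = (
    "\n## Contact\n\nQuestions about the data can be sent to the maintainers.\n"
)
full_text = readme_text + extra_section

with tempfile.TemporaryDirectory() as tmp:
    readme_path = Path(tmp) / "README.md"
    readme_path.write_text(full_text, encoding="utf-8")
    written = readme_path.read_text(encoding="utf-8")

print(written)
```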

## Edit resource properties

After creating a resource, you may need to edit its properties. While
you could technically do this by opening the `datapackage.json` file and
editing it by hand, the `edit_resource_properties()` function makes this
easier and ensures that `datapackage.json` remains valid JSON. You give
the function the path to the current properties along with a new
`ResourceProperties` object containing the changes you want to make.
Anything in the new properties object overwrites the matching fields in
the old properties object. This function does not write anything back
to the file; it only returns the new properties object.

```{python}
resource_properties = sp.edit_resource_properties(
    # Helper function
    path=sp.path_properties(),
    properties=sp.ResourceProperties(
        title="Basic characteristics of patients"
    )
)
pprint(resource_properties)
```

To write back, you use the `write_resource_properties()` function:

```{python}
sp.write_resource_properties(
    properties=resource_properties,
    path=sp.path_properties()
)
```
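
The overwrite behaviour described in this section can be pictured as a dictionary merge in which only the values you set replace the existing ones. The sketch below shows that idea with plain dictionaries. It assumes that unset (`None`) values leave the current value untouched, which is a guess about the intended semantics rather than a description of how `edit_resource_properties()` is implemented.

```python
def merge_properties(current: dict, updates: dict) -> dict:
    """Return a copy of `current` where set values in `updates` take priority."""
    merged = dict(current)
    merged.update(
        {key: value for key, value in updates.items() if value is not None}
    )
    return merged


# Hypothetical current properties and a partial update.
current = {
    "name": "patients",
    "title": "Patients",
    "description": "Basic data.",
}
updates = {"title": "Basic characteristics of patients", "description": None}

merged = merge_properties(current, updates)
print(merged)
```

Because the merge returns a new dictionary, the original properties stay unchanged until the result is explicitly written back.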

```{python}
#| include: false
temp_path.cleanup()
```