Refactor local source references #140

Merged · 2 commits · Nov 5, 2024
docs/sources.md: 82 changes (78 additions, 4 deletions)
@@ -53,13 +53,13 @@ Example dictionary of files:
}
```

- Sources may have both.
+ Sources may have both. In this case, dict entries overwrite list entries of the same name.
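For instance, a source declaring both of the following (hypothetical values) would store `safeguarding.json` pulled from `csv/safeguarding_v2`, because the dict entry wins:

```
"files_list": ["safeguarding"],
"files_dict": {"safeguarding": "csv/safeguarding_v2"}
```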

## Local storage locations

The source config fully determines the storage location of the data in its *storage format*. All data is stored inside `{config.inputpath}`. When *pulling data*, each source gets its own local subfolder `{source.id}`. The list entries (str) and dict keys determine the filenames of the locally stored files.
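For example, assuming `inputpath` is `input` and a source `my_source` has a dict entry named `foo` (all names hypothetical), the pulled file would be stored as:

```
input/my_source/foo.json
```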

- Remark: For Google sheets, the sheet_ids are non-descript. Thus the [configuration] has an (optional) global field `sheet_names` in which a mapping from names to sheet_ids can be provided. When a source references an input file, it first looks up whether it's in the `sheet_names` map and in that case uses the respective values.
+ Remark: The [configuration] has an (optional) global field `sheet_names`, in which a mapping from names to sheet_ids can be provided. When a source references an input file, it first checks whether the reference is in the `sheet_names` map; if so, it uses the respective key as the storage file path (while pulling the file in accordance with the `sheet_names` dict value). This is useful for Google sheets, because their sheet_ids are non-descript, but potentially also for local file references, to abbreviate them and avoid `/`.
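A minimal sketch of this lookup, assuming a plain dict for `sheet_names` (hypothetical helper for illustration; the actual `get_sheet_id` in `pull_data.py` takes the pipeline config and may differ in detail):

```
def resolve_reference(sheet_names, reference):
    # Known short name: store under the key, pull from the mapped
    # sheet_id or file path.
    if reference in sheet_names:
        return reference, sheet_names[reference]
    # Unknown name: the reference is both storage name and location.
    return reference, reference
```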


### `json` and `sheets`
@@ -70,10 +70,84 @@ Within the source's subfolder, for each `(name, filepath)` entry in `{source.fil

For the input format `sheets`, we can additionally use `files_list`.

- - A special case here is if `files_archive` is provided and `source.subformat` is `csv`, then for each `sheet_id` entry in `source.files_list`, we process the folder `sheet_id` as a csv workbook and store the converted result as `{sheet_id}.json`.
- - Otherwise, for each `sheet_id` entry in `source.files_list`, the processed version of `sheet_id` is stored as `{sheet_id}.json`. Note that this currently only works if `source.subformat` is `google_sheets`, because we have not made a decision on how to turn full file paths into filenames.
+ - For each `sheet_name` entry in `source.files_list`, the processed version of `sheet_name` is stored as `{sheet_name}.json`. Note that `sheet_name` may not contain certain special characters, such as `/`.
+ - If the subformat is not `google_sheets`, i.e., we're referencing local files, the file path is relative to the current working directory of the pipeline.
+ - It is possible to provide a `basepath` (relative or absolute) in the source config; then all file paths are relative to the `basepath`.
+ - It is also possible to provide a `files_archive` URL to a zip file. In that case, all file paths are relative to the archive root (see the path resolution sketch below).

+ - Remark: Do we still need `files_archive` (`.zip` archive) support? I'd be keen to deprecate it.
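The precedence of these location rules could be sketched as follows (a hypothetical helper for illustration only; the pipeline resolves paths inside `pull_sheets`):

```
from pathlib import Path

def resolve_local_path(entry, basepath=None, archive_root=None):
    # Illustrates the precedence described in the list above.
    if archive_root is not None:
        return Path(archive_root) / entry  # inside the unpacked zip archive
    if basepath is not None:
        return Path(basepath) / entry  # relative to the source's basepath
    return Path(entry)  # relative to the current working directory
```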

+ Example: Assume that, relative to the current working directory, we have a folder `csv/safeguarding` containing `.csv` files, and a file `excel_files/safeguarding crisis.xlsx`. Then the following config stores three copies of the `csv` data and three copies of the `xlsx` data, each in json format.

+ ```
+ {
+     "meta": {
+         "version": "1.0.0",
+         "pipeline_version": "1.0.0"
+     },
+     "parents": {},
+     "flows_outputbasename": "parenttext_all",
+     "output_split_number": 1,
+     "sheet_names": {
+         "csv_safeguarding": "csv/safeguarding",
+         "xlsx_safeguarding": "excel_files/safeguarding crisis.xlsx"
+     },
+     "sources": {
+         "safeguarding_csv_dict": {
+             "parent_sources": [],
+             "format": "sheets",
+             "subformat": "csv",
+             "files_dict": {
+                 "safeguarding": "csv/safeguarding"
+             }
+         },
+         "safeguarding_csv_list": {
+             "parent_sources": [],
+             "format": "sheets",
+             "subformat": "csv",
+             "files_list": [
+                 "csv_safeguarding"
+             ]
+         },
+         "safeguarding_csv_list_remap": {
+             "parent_sources": [],
+             "format": "sheets",
+             "subformat": "csv",
+             "basepath": "csv",
+             "files_list": [
+                 "safeguarding"
+             ]
+         },
+         "safeguarding_xlsx_dict": {
+             "parent_sources": [],
+             "format": "sheets",
+             "subformat": "xlsx",
+             "files_dict": {
+                 "safeguarding": "excel_files/safeguarding crisis.xlsx"
+             }
+         },
+         "safeguarding_xlsx_list_remap": {
+             "parent_sources": [],
+             "format": "sheets",
+             "subformat": "xlsx",
+             "files_list": [
+                 "xlsx_safeguarding"
+             ]
+         },
+         "safeguarding_xlsx_list": {
+             "parent_sources": [],
+             "basepath": "excel_files",
+             "format": "sheets",
+             "subformat": "xlsx",
+             "files_list": [
+                 "safeguarding crisis.xlsx"
+             ]
+         }
+     },
+     "steps": []
+ }
+ ```
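Assuming `config.inputpath` is `input` (an assumed value; it comes from the configuration), pulling these six sources should yield roughly the following layout (the last name assumes the `.xlsx` suffix is kept verbatim in the storage filename):

```
input/
    safeguarding_csv_dict/safeguarding.json
    safeguarding_csv_list/csv_safeguarding.json
    safeguarding_csv_list_remap/safeguarding.json
    safeguarding_xlsx_dict/safeguarding.json
    safeguarding_xlsx_list_remap/xlsx_safeguarding.json
    safeguarding_xlsx_list/safeguarding crisis.xlsx.json
```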

[configs]: ../src/parenttext_pipeline/configs.py
[configuration]: configuration.md
[steps]: steps.md
src/parenttext_pipeline/configs.py: 3 changes (3 additions, 0 deletions)
@@ -143,6 +143,9 @@ class SheetsSourceConfig(SourceConfig):
    # Path or URL to a zip archive containing folders
    # each with sheets in CSV format (no nesting)
    files_archive: str = None
+    # Path relative to which the paths in files_list/files_dict are resolved,
+    # assuming no files_archive is provided
+    basepath: str = None


@dataclass(kw_only=True)
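By way of illustration, the `safeguarding_xlsx_list` source from the docs example above would map onto this dataclass roughly as follows (a sketch based on the fields shown; inherited fields and construction details may differ):

```
source = SheetsSourceConfig(
    parent_sources=[],
    format="sheets",
    subformat="xlsx",
    basepath="excel_files",
    files_list=["safeguarding crisis.xlsx"],
)
```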
src/parenttext_pipeline/pull_data.py: 41 changes (24 additions, 17 deletions)
@@ -83,6 +83,14 @@ def pull_translations(config, source, source_name):
    )


+def get_json_from_sheet_id(source, temp_dir, sheet_id):
+    # Google sheets are pulled directly by sheet_id; local files are
+    # resolved relative to temp_dir (archive root or basepath).
+    if source.subformat == "google_sheets":
+        return convert_to_json(sheet_id, source.subformat)
+    else:
+        sheet_path = os.path.join(temp_dir, sheet_id)
+        return convert_to_json(sheet_path, source.subformat)


def pull_sheets(config, source, source_name):
    # Download all sheets used for flow creation and edits and store as json
    source_input_path = get_input_subfolder(
@@ -91,27 +99,26 @@

    jsons = {}
    if source.files_archive is not None:
-        if source.subformat != "csv":
-            raise NotImplementedError(
-                "files_archive only supported for sheets of subformat csv."
+        if source.subformat == "google_sheets":
+            raise ValueError(
+                "files_archive not supported for sheets of subformat google_sheets."
            )
        location = source.files_archive
        archive_filepath = download_archive(config.temppath, location)
-        with tempfile.TemporaryDirectory() as temp_dir:
-            shutil.unpack_archive(archive_filepath, temp_dir)
-            for sheet_id in source.files_list:
-                csv_folder = os.path.join(temp_dir, sheet_id)
-                jsons[sheet_id] = convert_to_json([csv_folder], source.subformat)
+        # Keep a handle to the temporary directory so it can be cleaned
+        # up after processing; unpack the archive into its path.
+        temp_dir_obj = tempfile.TemporaryDirectory()
+        temp_dir = temp_dir_obj.name
+        shutil.unpack_archive(archive_filepath, temp_dir)
    else:
-        for sheet_name in source.files_list:
-            if source.subformat != "google_sheets":
-                raise NotImplementedError(
-                    "files_list only supported for sheets of subformat google_sheets."
-                )
-            sheet_id = get_sheet_id(config, sheet_name)
-            jsons[sheet_name] = convert_to_json(sheet_id, source.subformat)
-        for new_name, sheet_id in source.files_dict.items():
-            jsons[new_name] = convert_to_json(sheet_id, source.subformat)
+        temp_dir = Path(source.basepath or ".")
+
+    for sheet_name in source.files_list:
+        sheet_id = get_sheet_id(config, sheet_name)
+        jsons[sheet_name] = get_json_from_sheet_id(source, temp_dir, sheet_id)
+    for new_name, sheet_name in source.files_dict.items():
+        sheet_id = get_sheet_id(config, sheet_name)
+        jsons[new_name] = get_json_from_sheet_id(source, temp_dir, sheet_id)
+
+    if source.files_archive is not None:
+        temp_dir_obj.cleanup()

    for sheet_name, content in jsons.items():
        with open(source_input_path / f"{sheet_name}.json", "w", encoding='utf-8') as export:
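To illustrate the dispatch in `get_json_from_sheet_id` (hypothetical sources and sheet id, for illustration only):

```
# Google sheet: temp_dir is ignored and the sheet_id is passed through.
get_json_from_sheet_id(google_source, Path("."), "1A2b3C4d5E6f")

# Local file: sheet_id is joined onto temp_dir (archive root or basepath).
get_json_from_sheet_id(xlsx_source, Path("excel_files"), "safeguarding crisis.xlsx")
```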