Merge pull request #140 from IDEMSInternational/refactor-local-source-refs

Refactor local source references
geoo89 authored Nov 5, 2024
2 parents a32f094 + 0f9c6d8 commit 8905a39
Showing 3 changed files with 105 additions and 21 deletions.
82 changes: 78 additions & 4 deletions docs/sources.md
@@ -53,13 +53,13 @@ Example dictionary of files:
}
```

Sources may have both.
Sources may have both. In this case, dict entries will overwrite list entries of the same name.
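
A minimal sketch of that precedence (illustrative only, not the pipeline's actual merging code): list entries can be treated as identity mappings, with dict entries winning on name collisions.

```python
# Illustrative sketch of the precedence described above; the pipeline's
# actual implementation may differ. List entries map a name to itself,
# while dict entries map a (new) name to a file path and take priority.
def merge_file_references(files_list, files_dict):
    merged = {name: name for name in files_list}
    merged.update(files_dict)  # dict entries overwrite list entries
    return merged

merge_file_references(["safeguarding"], {"safeguarding": "csv/safeguarding"})
# -> {"safeguarding": "csv/safeguarding"}
```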

## Local storage locations

The source config fully determines the storage location of the data in its *storage format*. All data is stored inside `{config.inputpath}`. When *pulling data*, each source gets its own local subfolder `{source.id}`. The list entries (str) and dict keys determine the filenames of the locally stored files.
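
As a concrete illustration, a pulled file's local path could be computed roughly as follows (a sketch only; the helper name and signature are hypothetical, the real logic lives in `pull_data.py`):

```python
from pathlib import Path

# Hypothetical sketch of the storage layout described above; the pipeline's
# real helper (get_input_subfolder in pull_data.py) may behave differently.
def local_storage_path(inputpath, source_id, name):
    # Each source gets its own subfolder; each entry is stored as "{name}.json".
    return Path(inputpath) / source_id / f"{name}.json"

local_storage_path("input", "safeguarding_csv_dict", "safeguarding")
# -> input/safeguarding_csv_dict/safeguarding.json
```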

Remark: For Google sheets, the sheet_ids are non-descript. Thus the [configuration] has an (optional) global field `sheet_names` in which a mapping from names to sheet_ids can be provided. When a source references an input file, it first looks up whether it's in the `sheet_names` map and in that case uses the respective values.
Remark: The [configuration] has an (optional) global field `sheet_names` in which a mapping from names to sheet_ids can be provided. When a source references an input file, it first checks whether the reference is a key in the `sheet_names` map; if so, that key is used as the storage file path, while the file itself is pulled from the location given by the corresponding value. This is useful for Google sheets, whose sheet_ids are non-descript, but potentially also for local file references, to abbreviate them and avoid `/` in filenames.
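
A minimal sketch of that lookup (the real function is `get_sheet_id` in `pull_data.py`; this simplified version is only illustrative):

```python
# Simplified sketch of the sheet_names lookup described above; the actual
# get_sheet_id in pull_data.py may differ in details.
def get_sheet_id(config, name):
    # If the name is a key in sheet_names, pull from the mapped location;
    # otherwise treat the name itself as the sheet_id / file path.
    return config.sheet_names.get(name, name)
```

The storage file is then named after the referenced name (`{name}.json`), while the data is fetched from the looked-up sheet_id or file path.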


### `json` and `sheets`
@@ -70,10 +70,84 @@ Within the source's subfolder, for each `(name, filepath)` entry in `{source.files_dict}` …

For the input format `sheets`, we can additionally use `files_list`.

- A special case is when `files_archive` is provided and `source.subformat` is `csv`: for each `sheet_id` entry in `source.files_list`, the folder `sheet_id` is processed as a csv workbook and the converted result is stored as `{sheet_id}.json`.
- Otherwise, for each `sheet_id` entry in `source.files_list`, the processed version of `sheet_id` is stored as `{sheet_id}.json`. Note that this currently only works if `source.subformat` is `google_sheets`, because we have not made a decision on how to turn full file paths into filenames.
- For each `sheet_name` entry in `source.files_list`, the processed version of `sheet_name` is stored as `{sheet_name}.json`. Note that the `sheet_name` may not contain certain special characters such as `/`.
- If the subformat is not `google_sheets`, i.e. we're referencing local files, the local file path is relative to the current working directory of the pipeline.
- It is possible to provide a `basepath` (relative or absolute) to the source config; then all file paths are relative to the `basepath`.
- It is also possible to provide a `files_archive` URL to a zip file. In that case, all file paths are relative to the archive root.

- Remark: Do we still need `files_archive` (`.zip` archive) support? I'd be keen to deprecate it.

Example: Assume that, relative to the current working directory, we have a folder `csv/safeguarding` containing `.csv` files, and a file `excel_files/safeguarding crisis.xlsx`. Then the following configuration stores three copies of the `csv` data and three copies of the `xlsx` data, each in json format.

```
{
    "meta": {
        "version": "1.0.0",
        "pipeline_version": "1.0.0"
    },
    "parents": {},
    "flows_outputbasename": "parenttext_all",
    "output_split_number": 1,
    "sheet_names": {
        "csv_safeguarding": "csv/safeguarding",
        "xlsx_safeguarding": "excel_files/safeguarding crisis.xlsx"
    },
    "sources": {
        "safeguarding_csv_dict": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "csv",
            "files_dict": {
                "safeguarding": "csv/safeguarding"
            }
        },
        "safeguarding_csv_list": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "csv",
            "files_list": [
                "csv_safeguarding"
            ]
        },
        "safeguarding_csv_list_remap": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "csv",
            "basepath": "csv",
            "files_list": [
                "safeguarding"
            ]
        },
        "safeguarding_xlsx_dict": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "xlsx",
            "files_dict": {
                "safeguarding": "excel_files/safeguarding crisis.xlsx"
            }
        },
        "safeguarding_xlsx_list_remap": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "xlsx",
            "files_list": [
                "xlsx_safeguarding"
            ]
        },
        "safeguarding_xlsx_list": {
            "parent_sources": [],
            "basepath": "excel_files",
            "format": "sheets",
            "subformat": "xlsx",
            "files_list": [
                "safeguarding crisis.xlsx"
            ]
        }
    },
    "steps": []
}
```
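
Following the naming rules above, the pulled data would presumably end up in a layout along these lines (illustrative only; this assumes `config.inputpath` is `input` and that each source's key in `sources` serves as its `source.id`):

```
input/
├── safeguarding_csv_dict/safeguarding.json
├── safeguarding_csv_list/csv_safeguarding.json
├── safeguarding_csv_list_remap/safeguarding.json
├── safeguarding_xlsx_dict/safeguarding.json
├── safeguarding_xlsx_list_remap/xlsx_safeguarding.json
└── safeguarding_xlsx_list/safeguarding crisis.xlsx.json
```

The last filename follows mechanically from the `{sheet_name}.json` rule.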

[configs]: ../src/parenttext_pipeline/configs.py
[configuration]: configuration.md
[steps]: steps.md
3 changes: 3 additions & 0 deletions src/parenttext_pipeline/configs.py
```diff
@@ -143,6 +143,9 @@ class SheetsSourceConfig(SourceConfig):
     # Path or URL to a zip archive containing folders
     # each with sheets in CSV format (no nesting)
     files_archive: str = None
+    # Path relative to which the paths in files_list/files_dict are resolved,
+    # assuming no files_archive is provided
+    basepath: str = None
 
 
 @dataclass(kw_only=True)
```
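A small sketch of how the new `basepath` field is meant to combine with `files_list`/`files_dict` entries (illustrative only; the actual resolution is shown in the `pull_data.py` changes below):

```python
from pathlib import Path

# Illustrative only: resolve a local sheet reference against an optional
# basepath, mirroring `temp_dir = Path(source.basepath or ".")` below.
def resolve_local_sheet(basepath, sheet_id):
    return Path(basepath or ".") / sheet_id

resolve_local_sheet("csv", "safeguarding")     # -> csv/safeguarding
resolve_local_sheet(None, "csv/safeguarding")  # -> csv/safeguarding
```
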
41 changes: 24 additions & 17 deletions src/parenttext_pipeline/pull_data.py
```diff
@@ -83,6 +83,14 @@ def pull_translations(config, source, source_name):
     )
 
 
+def get_json_from_sheet_id(source, temp_dir, sheet_id):
+    if source.subformat == "google_sheets":
+        return convert_to_json(sheet_id, source.subformat)
+    else:
+        sheet_path = os.path.join(temp_dir, sheet_id)
+        return convert_to_json(sheet_path, source.subformat)
+
+
 def pull_sheets(config, source, source_name):
     # Download all sheets used for flow creation and edits and store as json
     source_input_path = get_input_subfolder(
```
```diff
@@ -91,27 +99,26 @@
 
     jsons = {}
     if source.files_archive is not None:
-        if source.subformat != "csv":
-            raise NotImplementedError(
-                "files_archive only supported for sheets of subformat csv."
+        if source.subformat == "google_sheets":
+            raise ValueError(
+                "files_archive not supported for sheets of subformat google_sheets."
             )
         location = source.archive
         archive_filepath = download_archive(config.temppath, location)
-        with tempfile.TemporaryDirectory() as temp_dir:
-            shutil.unpack_archive(archive_filepath, temp_dir)
-            for sheet_id in source.files_list:
-                csv_folder = os.path.join(temp_dir, sheet_id)
-                jsons[sheet_id] = convert_to_json([csv_folder], source.subformat)
+        temp_dir = tempfile.TemporaryDirectory()
+        shutil.unpack_archive(archive_filepath, temp_dir)
     else:
-        for sheet_name in source.files_list:
-            if source.subformat != "google_sheets":
-                raise NotImplementedError(
-                    "files_list only supported for sheets of subformat google_sheets."
-                )
-            sheet_id = get_sheet_id(config, sheet_name)
-            jsons[sheet_name] = convert_to_json(sheet_id, source.subformat)
-        for new_name, sheet_id in source.files_dict.items():
-            jsons[new_name] = convert_to_json(sheet_id, source.subformat)
+        temp_dir = Path(source.basepath or ".")
+
+    for sheet_name in source.files_list:
+        sheet_id = get_sheet_id(config, sheet_name)
+        jsons[sheet_name] = get_json_from_sheet_id(source, temp_dir, sheet_id)
+    for new_name, sheet_name in source.files_dict.items():
+        sheet_id = get_sheet_id(config, sheet_name)
+        jsons[new_name] = get_json_from_sheet_id(source, temp_dir, sheet_id)
+
+    if source.files_archive is not None:
+        temp_dir.cleanup()
 
     for sheet_name, content in jsons.items():
         with open(source_input_path / f"{sheet_name}.json", "w", encoding='utf-8') as export:
```
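For orientation, a driver loop over the example sources above might look roughly like this (a sketch under the assumption that the loaded config exposes `sources` as a dict of source-config objects keyed by name; only the `pull_sheets` signature is taken from the diff, and the real entry point may differ):

```python
# Hypothetical driver; illustrative only.
for source_name, source in config.sources.items():
    if source.format == "sheets":
        pull_sheets(config, source, source_name)
```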
