Add a "delete_missing" option to CKAN harvester #542

danielcoelhocgu · 2023-11-20T21:14:30Z

In the Brazilian government we have a very decentralized structure in which several entities have their own CKAN instances. We collect all data from these entities through the harvest extension.

We run into a lot of trouble when a dataset is deleted in one of those harvested CKAN portals, because the CKAN harvester does not delete it in our CKAN. As a result, our portal keeps showing many datasets with broken links or out-of-date information.

We propose to add an option to the CKAN harvester called delete_missing (boolean type), which will check for datasets that no longer exist in the harvested CKAN portal and delete them.
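As an illustration of how such a boolean option could be read, here is a minimal sketch. It relies on the fact that a harvest source configuration is stored as JSON text; the helper name `read_delete_missing` is hypothetical, and treating a missing key as "disabled" is an assumption, not something decided in this issue.

```python
import json


def read_delete_missing(config_str):
    """Parse a harvest source config (JSON text) and return the delete_missing flag.

    Hypothetical helper for illustration; not part of ckanext-harvest.
    """
    config = json.loads(config_str) if config_str else {}
    # Assumption: when the option is absent, deletion stays disabled.
    value = config.get("delete_missing", False)
    if not isinstance(value, bool):
        raise ValueError("delete_missing must be boolean")
    return value
```

With this sketch, a source configured as `{"delete_missing": true}` would opt in, and any source without the key would keep the current behaviour.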

A nearly identical request was reported in issue #396 about two years ago. The author of that issue said he had written some custom code to solve it, but he never shared it, so I am opening this new issue with the aim of submitting a pull request.

My idea is to copy the same logic from the DCAT JSON harvester from ckanext-dcat:

Inside gather_stage function:
1.2. List all dataset UIDs that were imported through the current harvest source (by querying the harvest_object table).
1.3. List all remote CKAN datasets, then check for local UIDs that are missing in the remote CKAN list.
1.4. Create harvest objects with delete state for all of those missing datasets.
Inside import_stage function:
2.1. Effectively delete (but not purge) all those missing datasets.
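
The gather-stage steps above amount to a set difference between the IDs known locally and the IDs the remote portal still lists. A minimal sketch, kept independent of CKAN so it can run on its own: the function names and the plain dicts standing in for harvest objects are hypothetical, and the real harvester would create harvest objects in the database instead.

```python
def find_missing_ids(local_ids, remote_ids):
    """Return IDs that were harvested before but are no longer listed remotely."""
    return sorted(set(local_ids) - set(remote_ids))


def build_delete_objects(local_ids, remote_ids):
    """Build one placeholder record, flagged for deletion, per missing dataset.

    The dict layout is illustrative only; it stands in for a harvest object
    carrying a "delete" status that the import stage would act on.
    """
    return [
        {"guid": guid, "status": "delete"}
        for guid in find_missing_ids(local_ids, remote_ids)
    ]
```

For example, with local IDs `a`, `b`, `c` and remote IDs `b`, `c`, `d`, only `a` is flagged for deletion; `d` is a new dataset and is handled by the normal harvest path.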

About step 1.2, I don't know whether it would be better to look into the harvest_object table or to look for datasets whose extra field harvest_source_id matches the harvest source of the job. It seems that the extension normally uses the harvest_object table, but that won't work if we use the clear_history command on the source.

I would appreciate any feedback on this implementation idea, since this is my first contribution to the project.


danielcoelhocgu · 2024-05-09T20:12:17Z

I wrote the proposed code in PR #548.

Regarding my question about step 1.2, I chose to fetch from the harvest_object table, to follow the same logic as the ckanext-dcat harvester.

I also changed one line in the base harvester to force package update whenever the package already exists but is in the deleted state.

This is necessary to address a situation where the remote CKAN instance has technical problems (with Solr) that cause the package_search API call to omit one or more datasets. In this case, when harvesting with the delete_missing option, CKAN would delete those datasets, which is correct. But once the remote CKAN fixes the issue, the datasets appear again in the package_search response, and they would not be updated by the harvest process because the metadata_modified field does not change in this scenario.

It seems a very unlikely situation, but it has already happened in the Brazilian government data portal.

This has the drawback that it would also restore from the trash a dataset that was harvested and then manually deleted. Still, I think this is the right behaviour, since if we purge a harvested dataset, it will be reimported in the next harvest run anyway.
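
The forced update described above can be pictured as one extra condition in the "should this package be updated?" decision. This is a simplified sketch, not the actual change in the PR: the function name `should_update` and its parameters are hypothetical, and timestamps are assumed to be ISO 8601 strings in the same format so that string comparison orders them correctly.

```python
def should_update(existing_state, local_modified, remote_modified):
    """Decide whether an already-harvested package should be updated.

    Hypothetical helper for illustration; names and parameters are assumptions.
    """
    # A locally deleted package that shows up again remotely is always
    # updated, even if metadata_modified did not change.
    if existing_state == "deleted":
        return True
    # Otherwise update only when the remote copy is newer.
    return remote_modified > local_modified
```

In the Solr scenario, the dataset comes back with the same metadata_modified value, so only the first condition triggers the update and restores the dataset.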

danielcoelhocgu added a commit to danielcoelhocgu/ckanext-harvest that referenced this issue May 8, 2024

CKAN Harvester: add delete_missing option ckan#542

5b8da48

danielcoelhocgu added a commit to danielcoelhocgu/ckanext-harvest that referenced this issue May 9, 2024

Harvester Base: force package update if existing package is in delete…

6cdc217

…d state ckan#542

danielcoelhocgu mentioned this issue May 9, 2024

Add delete_missing option to CKAN Harvester #542 #548

Open
