Revisit/reimplement the concept of a "Harvested file". #8629

landreev · 2022-04-20T17:46:22Z

Short version: "Harvested files" are currently stored as DvObject/DataFile/FileMetadata/etc. entities, just like "real" files. I don't think they should be handled that way.

(I seem to remember opening an issue for this, but it looks like I never did.)

History: "Harvested files" are created locally when a harvesting client imports DDI or native JSON dataset metadata records with file entries from other Dataverses (the DC format has no mechanism for encoding files or any other kind of child object). They become DataFiles/DvObjects as a legacy of the old implementation in DVN v2-3. Back then they were treated as actual files that users could download locally: the remote location (URL) was stored in place of the physical file name, and DVN would make an HTTP call to fetch and proxy the content, transparently to the user. We abandoned that scheme as overly complicated (the authentication problem was never fully resolved, among other things). So in the current scheme these "files" are used only for indexing. We still attempt to store a link to the remote object (as the storageidentifier of the DvObject), but in practice it is never used. When search hits for harvested files are displayed, no attempt is made to redirect the user to that specific file; clicking on the card always sends them to the remote location of the dataset to which the file belongs. IMO, this really doesn't justify maintaining the same DvObject hierarchy of entities as for "real" files.

The concept of a "remote file", something that transparently appears as a DataFile to the local user while its byte content is stored elsewhere, is now being revisited (#7324). Once we have that, we may consider, as an optional/configurable harvesting feature, turning harvested files into these "remotely stored" files locally. But when file records are harvested solely for indexing, I believe we should instead introduce some "HarvestedFileMetadata" entity for storing them.

Definition of done:

- Discuss during a tech hour.
- Decide whether to move forward on this.
- If we decide to implement this, create the corresponding issues.

landreev · 2023-01-09T16:15:13Z

This is probably doable in one sprint's worth of time... But let's decide whether we actually want to do this ("revisit" being the key word), and/or whether we want to address other, more urgently needed harvesting issues first.

mreekie · 2023-01-09T16:41:49Z

This will be a spike:

- Discussion during a tech hour, maybe.
- We created a definition of done and added it to the end of the description.
- Assigned it a size of 10, since we have a good idea of what will be done in this step.

mreekie · 2023-04-19T15:27:25Z

Sizing:

- Was discussed in tech hour.
- Decided to do one small but important thing.
- Leonid will update this spike and create a follow-up ticket from that discussion.
- Once that is done, this issue can be closed.

cmbz · 2023-06-01T21:44:28Z

@landreev I'm following up on @mreekie's 19 April note with some questions:

Has discussion occurred somewhere else I can link to?
Has a follow-up ticket already been created so this issue can be closed?

landreev · 2023-06-09T20:37:29Z

@cmbz
Sorry, meant to reply last week...

This was discussed during a tech hour. We concluded that it wasn't worth trying to heavily redesign the current setup, e.g. by introducing a new database object dedicated to representing a harvested file. But we decided to do one small, simple thing: move the column harvestingclient_id from the dataset table in the database to the common dvobject table. This by itself will simplify many operations, make it possible to tell a harvested file from a "real" one in a single step, etc.
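To illustrate the column move, here is a minimal sketch using an in-memory SQLite database. It is not the actual Dataverse schema or migration: the table layouts are heavily simplified, the `dtype` and `owner_id` columns, the sample ids and the helper function names are assumptions made for the example, and the real change would be made against the production database. The sketch only shows why a check that needs a join to the owning dataset becomes a single-row lookup once `harvestingclient_id` lives on `dvobject`.

```python
import sqlite3

# Simplified, hypothetical schema: only the columns needed for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dvobject (
    id INTEGER PRIMARY KEY,
    dtype TEXT,                    -- assumed discriminator: 'Dataset' or 'DataFile'
    owner_id INTEGER,              -- assumed link from a file to its dataset
    harvestingclient_id INTEGER    -- proposed new home of the column
);
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    harvestingclient_id INTEGER    -- current home of the column
);
-- Dataset 1 is harvested (client 7) and owns file 2; dataset 3 is local and owns file 4.
INSERT INTO dvobject (id, dtype, owner_id) VALUES
    (1, 'Dataset', NULL), (2, 'DataFile', 1),
    (3, 'Dataset', NULL), (4, 'DataFile', 3);
INSERT INTO dataset (id, harvestingclient_id) VALUES (1, 7), (3, NULL);
""")


def is_harvested_via_dataset(conn, file_id):
    """Current layout: a file is harvested if its owning dataset has a client id."""
    row = conn.execute(
        "SELECT d.harvestingclient_id FROM dvobject f "
        "JOIN dataset d ON d.id = f.owner_id WHERE f.id = ?",
        (file_id,),
    ).fetchone()
    return row is not None and row[0] is not None


def copy_client_id_to_dvobject(conn):
    """Hypothetical migration step: populate dvobject.harvestingclient_id."""
    # Datasets take the value from their own dataset row.
    conn.execute(
        "UPDATE dvobject SET harvestingclient_id = "
        "(SELECT d.harvestingclient_id FROM dataset d WHERE d.id = dvobject.id) "
        "WHERE dtype = 'Dataset'"
    )
    # Files take the value from the dataset that owns them.
    conn.execute(
        "UPDATE dvobject SET harvestingclient_id = "
        "(SELECT d.harvestingclient_id FROM dataset d WHERE d.id = dvobject.owner_id) "
        "WHERE dtype = 'DataFile'"
    )


def is_harvested_direct(conn, dvobject_id):
    """Proposed layout: one lookup on dvobject, for datasets and files alike."""
    row = conn.execute(
        "SELECT harvestingclient_id FROM dvobject WHERE id = ?",
        (dvobject_id,),
    ).fetchone()
    return row is not None and row[0] is not None


copy_client_id_to_dvobject(conn)
```

After the copy, `is_harvested_direct` gives the same answer for a file as the join-based check, without needing to know which dataset owns it.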

So we can do one of two things: close this issue as a completed spike, and open a quick dev. issue for implementing the change above; or change the title of this issue and use it for scheduling and implementing the change. The former is probably cleaner (?).

cmbz · 2023-06-12T13:20:41Z

@landreev I like your first suggestion: "close this issue as a completed spike, and open a quick dev. issue for implementing the change above". Thank you! :)

cmbz · 2023-06-30T00:59:18Z

Acting on @landreev's recommendation, I created the related issue #9686 and am closing this issue.

landreev added the Feature: Harvesting label Apr 20, 2022

This was referenced Apr 20, 2022

Fix handling of storageidentifiers in dataverse_json harvests #7736

Closed

Spike: Inventory and prioritize all existing Harvesting related issues IQSS/dataverse-pm#24

Closed

jggautier mentioned this issue Apr 21, 2022

Re-harvesting ICPSR datasets IQSS/dataverse.harvard.edu#63

Open

mreekie mentioned this issue Mar 10, 2023

Collection: Keep track of list of issues that we want to address as part of 1.4.1 IQSS/dataverse-pm#25

Closed

20 tasks

mreekie mentioned this issue May 16, 2022

PM.Epic: (Potentially) Modify the design of the harvesting Framework and/or metadata exports #8703

Closed

4 tasks

mreekie added the pm.epic.nih_harvesting_framework label May 16, 2022

pdurbin mentioned this issue Jun 7, 2022

Harvester exporter uses short version of XML #8778

Open

mreekie moved this to NIH (Stefano) in IQSS Dataverse Project Nov 2, 2022

mreekie added this to IQSS Dataverse Project Nov 2, 2022

mreekie added Size: 80 A percentage of a sprint. 56 hours. Size: 10 A percentage of a sprint. 7 hours. and removed Size: 80 A percentage of a sprint. 56 hours. labels Jan 9, 2023

sync-by-unito bot mentioned this issue Mar 3, 2023

4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 IQSS/dataverse-pm#10

Closed

3 tasks

mreekie added pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards labels Mar 20, 2023

cmbz mentioned this issue Jun 2, 2023

NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues IQSS/dataverse-pm#85

Closed

cmbz added the pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues label Jun 2, 2023

cmbz mentioned this issue Jun 30, 2023

Move harvestingclient_id from the dataset to dvobject and use it directly for files #9686

Open

cmbz closed this as completed Jun 30, 2023

github-project-automation bot moved this from NIH bklog items (Stefano) to Clear of the Backlog in IQSS Dataverse Project Jun 30, 2023

cmbz mentioned this issue Jan 29, 2024

GREI 3: HDV Task - Improve OAI-PMH Harvesting IQSS/dataverse-pm#171

Open

59 tasks
