
Adding description info to the fileDscr section in DDI CodeBook. #5051 #10938

Merged
merged 1 commit into from
Nov 13, 2024


landreev
Contributor

@landreev landreev commented Oct 18, 2024

What this PR does / why we need it:

Apparently, users have been asking for this since 2018: for tabular files that have the Description field populated, the description was never exported in the DDI (non-ingested files always had their descriptions exported, in the corresponding <otherMat> sections).
There is no obvious field under <fileDscr> in the DDI Codebook schema for it, which is probably the reason we chose not to export it back in the day (?), but putting it into another dedicated free-text <notes> field seems like a reasonable solution.

The RestAssured export tests are passing, so un-drafting the PR.

I kept the changes minimal to stay under the "3" estimate.

Which issue(s) this PR closes:

Special notes for your reviewer:

Suggestions on how to test this:

Straightforward. Upload some file that's known to be ingestable (Stata, CSV ... doesn't matter). Populate the description field in the file metadata. Publish the dataset. Look at the DDI export; the description should now be showing in the corresponding <fileDscr ...>, like this:

<notes level="file" type="DATAVERSE:FILEDESC" subject="DataFile Description">
   This is a tabular file produced from a Stata .dta file with rich descriptive metadata
</notes>
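
To check this without reading the export by eye, something along these lines could work. This is only a sketch: the sample XML is a hypothetical, heavily trimmed export (the `ddi:codebook:2_5` namespace, the `ID="f42"` value and the file name are assumptions for illustration), and `file_descriptions` is a made-up helper, not part of Dataverse.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down DDI export; a real export carries many more elements.
SAMPLE_DDI = """<codeBook xmlns="ddi:codebook:2_5">
  <fileDscr ID="f42">
    <fileTxt><fileName>athlet2.tab</fileName></fileTxt>
    <notes level="file" type="DATAVERSE:FILEDESC" subject="DataFile Description">This is a tabular file produced from a Stata .dta file</notes>
  </fileDscr>
</codeBook>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def file_descriptions(ddi_xml):
    """Map each <fileDscr> ID to the text of its DATAVERSE:FILEDESC note, or None."""
    root = ET.fromstring(ddi_xml)
    result = {}
    for elem in root.iter():
        if local_name(elem.tag) != "fileDscr":
            continue
        description = None
        # Search the whole subtree so the exact nesting of <notes> does not matter.
        for child in elem.iter():
            if local_name(child.tag) == "notes" and child.get("type") == "DATAVERSE:FILEDESC":
                description = (child.text or "").strip()
        result[elem.get("ID")] = description
    return result


descriptions = file_descriptions(SAMPLE_DDI)
print(descriptions)
```

A `None` value for a file would mean its <fileDscr> has no description note, which is what exports looked like before this change.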

For extra credit, look at the file under Data Explorer and verify that the new <notes> element isn't causing any trouble there (the Explorer relies on the DDI for viewing and, in the latest version, editing).
Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@coveralls

Coverage Status

coverage: 20.867% (-0.001%) from 20.868%
when pulling 84e0fad on 5051-ddi-tabular-file-description
into d039a10 on develop.


📦 Pushed preview images as

ghcr.io/gdcc/dataverse:5051-ddi-tabular-file-description
ghcr.io/gdcc/configbaker:5051-ddi-tabular-file-description

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Member

@pdurbin pdurbin left a comment


Overall, looks fine. I'm trusting that "notes" is the right place to put the file descriptions. I did leave a couple other comments.

@cmbz cmbz added FY25 Sprint 8 (2024-10-09 - 2024-10-23), Size: 3 (A percentage of a sprint. 2.1 hours.), FY25 Sprint 9 (2024-10-23 - 2024-11-06) labels Oct 23, 2024
@stevenwinship stevenwinship self-assigned this Nov 1, 2024
@stevenwinship stevenwinship removed their assignment Nov 1, 2024
@ofahimIQSS ofahimIQSS self-assigned this Nov 4, 2024
@ofahimIQSS
Contributor

ofahimIQSS commented Nov 4, 2024

Hey @landreev, on trying to test this I had an observation.

When I upload a CSV file, I am able to see the correct DDI output (see below):
image

When I tested uploading a Stata file, I noticed that the format is different and notes is appearing twice.
image

Here is the Stata Test File (I compressed it so I could add it here, please unzip it first):
ATHLET2.DTA.zip

@cmbz cmbz added the FY25 Sprint 10 (2024-11-06 - 2024-11-20) label Nov 7, 2024
@landreev
Contributor Author

@ofahimIQSS
Sorry for the delay, I pretty much forgot about this PR.
The 2 files are formatted differently in the DDI output simply because only the Stata file was successfully ingested as a tabular file. I.e., Dataverse was able to parse the file, extract the extra metadata that describes the individual variable columns, and calculate summary statistics (see the guides for more information about ingest). The <fileDscr> section in the DDI is only used for files that we were able to successfully ingest.

We cannot guarantee that we will successfully ingest every file in a potentially "ingestable" file format. This is especially true with CSV. We are going to try to parse and ingest any CSV file uploaded by the user (as long as it is below the ingestable size limit, if defined), but it may or may not succeed. Some of the more common reasons why we may fail to ingest a CSV file: ingest will stop if the first line is not a comma-separated list of what looks like the names of the individual variables; ingest will fail unless every row contains the same number of comma-separated fields.
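
For a quick pre-upload sanity check of the two CSV conditions above, a rough sketch like the following may help. It is not Dataverse's ingest logic and passing it does not guarantee a successful ingest; `csv_shape_problems` is a hypothetical helper that only tests for a non-empty header row and a consistent field count.

```python
import csv
import io


def csv_shape_problems(text):
    """Return a list of human-readable problems; an empty list means none were found."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    # The first row is expected to hold the variable names, so none should be blank.
    if any(not name.strip() for name in header):
        problems.append("header row has an empty variable name")
    # Every following row should have exactly as many fields as the header.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"row {row_no} has {len(row)} fields, expected {len(header)}")
    return problems


good = csv_shape_problems("id,name\n1,a\n2,b\n")
bad = csv_shape_problems("id,name\n1,a,extra\n2\n")
print(good)
print(bad)
```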

The <fileDscr> entry for the ingested Stata file in the screenshot looks correct. (The whole point of the small change in this PR was to add an extra <notes> field with the file description.)
There are 2 <notes> entries because they are used for 2 different things: one is for the UNF (the data signature we calculate for tabular files), the other is the new note for the file description. <notes> is a field in the DDI that can be used for any arbitrary text. We use it to encode any information for which we do not have a specifically reserved, dedicated field in the DDI.
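
To see at a glance which files ended up as <fileDscr> (ingested) versus <otherMat> (not ingested), and which note types each one carries, a sketch like this could be used. The sample XML is hypothetical and trimmed: the `VDC:UNF` type string, the UNF value, the IDs and the <otherMat> children are assumptions for illustration, and `summarize_files` is a made-up helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical export with one ingested file and one non-ingested file.
SAMPLE_DDI = """<codeBook xmlns="ddi:codebook:2_5">
  <fileDscr ID="f1">
    <notes level="file" type="VDC:UNF" subject="Universal Numeric Fingerprint">UNF:6:placeholder==</notes>
    <notes level="file" type="DATAVERSE:FILEDESC" subject="DataFile Description">Stata file with rich metadata</notes>
  </fileDscr>
  <otherMat ID="f2">
    <labl>data.csv</labl>
    <txt>CSV that was not ingested</txt>
  </otherMat>
</codeBook>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_files(ddi_xml):
    """Map each file ID to (section kind, list of <notes> type attributes found in it)."""
    root = ET.fromstring(ddi_xml)
    summary = {}
    for elem in root.iter():
        kind = local_name(elem.tag)
        if kind not in ("fileDscr", "otherMat"):
            continue
        note_types = [n.get("type") for n in elem.iter() if local_name(n.tag) == "notes"]
        summary[elem.get("ID")] = (kind, note_types)
    return summary


summary = summarize_files(SAMPLE_DDI)
print(summary)
```

Two entries in the list for a <fileDscr> are expected: one for the UNF and one for the description.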

@ofahimIQSS
Contributor

Thanks for the clarification @landreev - Merging PR
Testing of 10938.docx

@ofahimIQSS ofahimIQSS merged commit dc1de87 into develop Nov 13, 2024
21 checks passed
@ofahimIQSS ofahimIQSS deleted the 5051-ddi-tabular-file-description branch November 13, 2024 19:00
@ofahimIQSS ofahimIQSS removed their assignment Nov 13, 2024
@pdurbin pdurbin added this to the 6.5 milestone Nov 13, 2024
Projects
Status: Done 🧹
Development

Successfully merging this pull request may close these issues.

File description metadata of ingested files are not in the DDI exported metadata
6 participants