Citation: Remove MD5s, if you have UNF #2192

Closed
sbarbosadataverse opened this issue May 22, 2015 · 19 comments
Labels: Feature: File Upload & Handling; Type: Feature (a feature request); UX & UI: Design (this issue needs input on the design of the UI and from the product owner)

Comments

@sbarbosadataverse

Gary's sent the following:

why are there MD5's? these I think should all be removed. we have UNFs
instead.

@sbarbosadataverse sbarbosadataverse added Type: Feature a feature request UX & UI: Design This issue needs input on the design of the UI and from the product owner Priority: Medium Type: Suggestion an idea Feature: File Upload & Handling Type: Bug a defect labels May 22, 2015
@pdurbin
Member

pdurbin commented May 22, 2015

MD5s are commonly used to verify that files were not corrupted during download. Every Mac and Linux box has the native ability to calculate the MD5 of a file. For Windows it's a supported add-on: https://support.microsoft.com/en-us/kb/841290
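
As a rough illustration of the check described above, this Python sketch computes a file's MD5 and compares it with a published value. The function names and the chunk size are illustrative choices, not anything provided by Dataverse.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """Compare the local digest with the value shown by the repository."""
    return md5_of_file(path) == published_md5.strip().lower()
```

A mismatch only says that the downloaded bytes differ from the bytes the checksum was computed on; it does not say where the difference was introduced.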

@mercecrosas
Member

This issue is not well defined as it stands. @thegaryking does this question apply to subsettable tabular files? Why is it bad to have both UNF and MD5? As @pdurbin comments, MD5s are the checksums most commonly used by preservation groups and libraries, and can be useful in addition to UNF to verify the original deposited file.

@thegaryking
Contributor

we're first and foremost trying to communicate with users, almost none of
whom know about either md5s or unfs. we have taken them another step and
told them about unfs; we have a page describing them, and when we put in
enough effort they get what they are. there's no reason to introduce
another new concept. let's just use another degree of indirection. if the
librarians want something for files for which we don't have unfs, then we
create a broader notion of what a unf is and we always have a unf for every
file. the plan would be quite like the idea of dv to begin with, which is
that the more it knows about the files which are uploaded, the more
services we provide. for unfs, we can do the same thing: if the file is in
a format we know what to do with (R, sas, spss, table, etc.) we can compute
a format-independent UNF. If it is anything else we can create a
format-dependent UNF, which if you want would be exactly a MD5, but would
be displayed as a UNF. we can also add another service if librarians want,
somewhere far out of the way of most users, that lets people type in a unf
and have dv tell them exactly what it is and how it was calculated,
including the full algorithm, an MD5 if it is in there, and anything else.

then from a UI/user understanding point of view, there will be only one
thing to understand; they can ignore the details and trust us if they want;
they can get the details if they like; and we can continue to innovate what
a UNF is since there's a version number embedded in it (and i agree that we
should do the latter; people tell me that videos and photos wouldn't be
hard and we could clearly expand to more forms of data. this, however, is
a separate project that we could perhaps seek funding for and do then).

Gary

Gary King - Albert J. Weatherhead III University Professor - Director,
IQSS http://iq.harvard.edu/ - Harvard University
GaryKing.org - @KingGary https://twitter.com/kinggary -
617-500-7570 - fax 812-8581 - Assistant: 495-9271
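
One way to picture the "one identifier for every file" idea in the comment above is the following Python sketch. It is an assumption-laden illustration, not an existing Dataverse feature: the format list, the function name `fixity_label`, and the `UNF:MD5:` prefix are all made up here, and the format-independent UNF is taken as an input rather than computed.

```python
import hashlib

# Hypothetical set of formats for which a format-independent UNF could be
# computed; a real list would come from the ingest framework.
KNOWN_TABULAR_FORMATS = {"rdata", "sas", "spss", "tab"}


def fixity_label(file_format, raw_bytes, unf_from_ingest=None):
    """Return a single user-facing identifier for any file.

    Known tabular format with a computed UNF: show that UNF.
    Anything else: fall back to an MD5 of the bytes, displayed under the
    same "UNF" umbrella with an invented "MD5" tag.
    """
    if file_format.lower() in KNOWN_TABULAR_FORMATS and unf_from_ingest:
        return unf_from_ingest
    return "UNF:MD5:" + hashlib.md5(raw_bytes).hexdigest()
```

In this sketch the tag embedded in the string plays the role of the version number mentioned in the comment: it tells a lookup service how the value was calculated.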


@mercecrosas
Member

Yes, I agree with the general approach, but we need to do some research to
implement this well. Here are the three main issues:

  • If we only provide a UNF for tabular files (spss, stata, r, etc), and not
    an MD5 for the original format, then we don't have a way to verify the
    original deposited file, which is important in some cases when we need to
    recalculate the UNF or there are some issues or uncertainties with
    reformatting, or simply for standard archival verifications that
    repositories should do. This is a request from some groups that want to
    make sure we support good preservation practices. There are some
    preservation certificates that Dataverse could not get without this.
  • UNF is a fantastic concept, but it has some practical limitations and
    issues in the way that it is currently defined. Given that each format
    treats some data types differently (time, binary and categorical variables,
    rounding), it could turn out that you convert from one format to another,
    to another and then back to the first format, and end up with different
    UNFs (this is similar to the phenomenon of Google translating from one
    language to another, to another, back to the first, and ending up with a
    different sentence). This has been improved considerably from the initial
    UNF version, but it's practically very difficult to include all the
    exceptions.
  • But one of the main issues with UNF is that it doesn't include the
    metadata of the file, that is, the variable name, for example, or data type.
    This means that you might have an spss file with column A that corresponds
    to var1, but this was not correct and for some reason needs to be changed to
    var2, and this is not reflected in the UNF, even though this is a critical
    change in the data file.

I agree it would be great, as you say, to generalize UNF and make it work
well across all cases, so we use and teach only one thing. But we need to
take these three issues (and I might be missing others that Leonid, Kevin
and others in the team might know) into consideration.
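
The third point can be illustrated with a toy example. The Python sketch below is not the UNF algorithm; it uses SHA-256 over `repr` of the values purely to show how a fingerprint computed from data values alone is blind to a variable being renamed, while one that also hashes the names is not. Both function names are hypothetical.

```python
import hashlib


def values_only_digest(columns):
    """Toy fingerprint over data values only; NOT the UNF algorithm."""
    digest = hashlib.sha256()
    for _name, values in columns:  # the variable name is ignored
        for value in values:
            digest.update(repr(value).encode("utf-8") + b"\0")
        digest.update(b"\n")  # column separator
    return digest.hexdigest()


def values_and_names_digest(columns):
    """Same toy fingerprint, but the variable names are hashed as well."""
    digest = hashlib.sha256()
    for name, values in columns:
        digest.update(name.encode("utf-8") + b"\0")
        for value in values:
            digest.update(repr(value).encode("utf-8") + b"\0")
        digest.update(b"\n")
    return digest.hexdigest()


original = [("var1", [1, 2, 3])]
renamed = [("var2", [1, 2, 3])]

# Renaming var1 to var2 leaves the values-only fingerprint unchanged...
print(values_only_digest(original) == values_only_digest(renamed))
# ...but changes the fingerprint that covers the names too.
print(values_and_names_digest(original) == values_and_names_digest(renamed))
```

An MD5 of the original deposited file would also change on such a rename, which is part of why the two kinds of fixity information complement each other.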

Mercè Crosas, Ph.D.
Director of Data Science, IQSS
Harvard University
http://scholar.harvard.edu/mercecrosas


@mercecrosas mercecrosas self-assigned this May 23, 2015
@mercecrosas mercecrosas modified the milestones: In Review, In Design May 23, 2015
@posixeleni
Contributor

I agree with @mcrosas that the MD5 checksum should exist for every file to ensure bit-level preservation. When I presented a preview of Dataverse 4.0 to the Library of Congress' National Digital Stewardship Alliance in the Fall they were particularly impressed that we included MD5s for all our files. Here's a blog post from them discussing the importance of file fixity/data integrity: http://blogs.loc.gov/digitalpreservation/2014/04/protect-your-data-file-fixity-and-data-integrity/

MD5 is a standard that the digital archival community trusts whereas UNF was unknown to them. I don't think we should replace MD5s for all files with UNFs if the community isn't using them outside of Dataverse.

@thegaryking
Contributor

ok, but let's get MD5s out of the file list now. we can stick them in the
metadata when we expand the long list, as an unchangeable item if someone
wants it.

and separately, let's create a google doc or something with specifications
of a UNF that would satisfy everyone. we can even create the specs and
either get the grant ourselves or have a call for PIs to take on this task,
perhaps with us. we know pretty much everything we want, and all the
problems with the current UNF. we just need some bandwidth (or someone
else) to implement it all.

Gary


@pdurbin
Member

pdurbin commented May 24, 2015

Maybe we should remove both UNFs and MD5s from the default listing for files. They add a lot of noise.

I just clicked on a random dataset and saw this for a PowerPoint file:

[Screenshot: file card for a PowerPoint file]

Isn't this a little... noisy... busy... unfriendly?

Who cares that its MD5 is 26a3bb59a1d9a837ea51cc9c160c5b1a? (In addition, who cares that its MIME type is application/vnd.openxmlformats-officedocument.presentationml.presentation?) It's a PowerPoint file! That's all most people need to know. If normal people download it and can't open it they'll throw it in the trash and try again. If they still can't open it, they'll email the dataset contact and say "Hey, I think you uploaded a corrupted PowerPoint file" (which is more likely than the file being corrupted during download). They're not going to calculate the MD5 locally (let alone the UNF) and compare it to the MD5 on the screen. Only geeks like me would even think of doing that. And I probably wouldn't bother. I'd email the dataset contact.

Sure, show that it's 3 megabytes. Show the date it was uploaded. Stuff like MD5 and UNF could be hidden behind a "details" link, perhaps with some definitions of what MD5 and UNF even are.

@mercecrosas
Member

@pdurbin we should not just remove these from the file cards without the appropriate research and consideration of good preservation practices. Users and partners expect to find the fixity information easily, even if it's not used all the time. But you raise good points, and it's worth reviewing whether they can be displayed in another place.

I'm assigning this issue to @mheppler and adding it to the "In Design" milestone, following our process. Once a design is proposed and reviewed (by @thegaryking and the partners who requested the MD5), we'll move it to a Release Candidate milestone.

To summarize, based on @thegaryking comments above:

  • This issue is about removing MD5 from the file card of tabular/subsettable files (but keeping UNF) and finding the appropriate place to display the MD5. Files that don't have a UNF will display the MD5, as is the case now. (As a note, the metadata tab might not be the right place for the MD5 of tabular files since it has dataset metadata but not file-level metadata, although we should still consider it. @mheppler I have some ideas about this, for when you are ready to work on it.)
  • For the larger task of generalizing UNF, I'll create a new issue in GitHub and start a Functional Requirements Document, as we do for new features or components, and invite others to review it.
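
The display rule in the first bullet can be sketched as a small function. This is a minimal Python illustration of the rule as summarized here, not the actual Dataverse UI code; the name `checksums_for_file_card` and the return shape are assumptions.

```python
def checksums_for_file_card(md5, unf=None):
    """Return the (label, value) pairs to show on a file card.

    A file that has a UNF (a tabular/subsettable file) shows only the UNF;
    every other file shows its MD5, as it does now.
    """
    if unf:
        return [("UNF", unf)]
    return [("MD5", md5)]
```

Because the rule keys on whether a UNF is actually present rather than on the file type, a file with no UNF for any reason still shows its MD5 in this sketch.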

@landreev
Contributor

@pdurbin - yes, the long Microsoft MIME types are terrible. But we have a mechanism for dealing with this - it's just a matter of adding the "friendly" version of it (such as "PowerPoint") to the list we maintain (it's a .properties file).
The friendly types for Excel and Word are already there. PowerPoint was left out, probably because it's not as common.

@mheppler
Contributor

Thank you for commenting on that @landreev. I was going to ask you about these "friendly" file types, since I recall going over these with you for the file icons. We should separate out that task of identifying as many of these file types as we can in our current production data, and giving them friendly labels.

@landreev
Contributor

@mheppler
Yes, we could use a dedicated ticket for creating these "friendly" labels for as many types as possible.
The file in question is ./src/main/java/MimeTypeDisplay.properties
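
The lookup that such a list enables can be sketched in a few lines of Python. The labels below are hypothetical placeholders; the actual keys and wording in MimeTypeDisplay.properties may differ, and `friendly_file_type` is an illustrative name.

```python
# Hypothetical friendly labels; the real entries live in
# MimeTypeDisplay.properties and may be worded differently.
FRIENDLY_MIME_LABELS = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "MS Word",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "MS Excel",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "PowerPoint",
}


def friendly_file_type(mime_type):
    """Return a short label for a MIME type, or the raw type if unmapped."""
    return FRIENDLY_MIME_LABELS.get(mime_type.strip().lower(), mime_type)
```

Falling back to the raw MIME type means an unmapped format is still displayed, just less readably, until a label is added for it.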

@mheppler
Contributor

@landreev @pdurbin -- #2202 -- new issue for MIME Type improvements created. Enjoy.

@eaquigley
Contributor

Need to discuss this during a UI/UX team meeting to brainstorm ideas on how to show more file metadata without being overwhelming in the file card on the dataset page. Perhaps having a files metadata section in the metadata tab. @mcrosas @mheppler

@mercecrosas
Member

After reviewing it with @eaquigley and @mheppler we plan to move this to 4.0.3.

@eaquigley
Contributor

Have a section in the metadata tab that is "Files" and displays this extra metadata (MD5 shows here and not on the files card if a UNF is available).

@mheppler
Contributor

mheppler commented Jul 6, 2015

@sbarbosadataverse
Author

I had one question about this: is there a safeguard in place to ensure the MD5 gets added when tabular ingest fails for any reason? We have seen enough failures at the moment to cause me to ask.
Thanks

Sonia Barbosa
Manager of Data Curation, IQSS Dataverse Network
Manager of the Murray Research Archive, IQSS
Data Science
Harvard University



From: Michael Heppler
Sent: Monday, July 06, 2015 10:58 AM
To: IQSS/dataverse
Cc: Barbosa, Sonia
Subject: Re: [dataverse] Citation: Remove MD5s, if you have UNF (#2192)

FRD: https://docs.google.com/document/d/1v-6WuFyClnAAHqyMf1VsWtCdXDTTR-ikuG6Ou8RtDMM/edit

Mockups:

@mheppler
Contributor

  • Removed MD5 from dataverse browse/search card for tabular files when UNF is displayed.
  • Removed MD5 from dataset list table for tabular files when UNF is displayed.
  • Removed MD5 from top section of file landing page for tabular files when UNF is displayed.

Note: With the file landing page being pushed to 4.3, this removes the "Original File MD5" for tabular files completely from the UI.

@mheppler mheppler assigned kcondon and unassigned mheppler Sep 14, 2015
@kcondon
Contributor

kcondon commented Sep 22, 2015

OK looks good, closing.

@kcondon kcondon closed this as completed Sep 22, 2015
9 participants