Citation: Remove MD5s, if you have UNF #2192

sbarbosadataverse · 2015-05-22T15:37:02Z

Gary sent the following:

why are there MD5's? these I think should all be removed. we have UNFs
instead.

pdurbin · 2015-05-22T15:43:40Z

MD5s are commonly used to verify that files were not corrupted during download. Every Mac and Linux box has the native ability to calculate an MD5 of a file. For Windows it's a supported add-on: https://support.microsoft.com/en-us/kb/841290
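For anyone who wants to see what that check looks like in practice, here is a minimal Python sketch. The function names are made up for this example and are not part of Dataverse; it only assumes that the repository shows a hex MD5 string next to the file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """Compare a downloaded file's MD5 with the value shown by the repository."""
    return md5_of_file(path) == published_md5.strip().lower()
```

If `matches_published_md5` returns `False`, the local copy differs from what was deposited, most likely because of a corrupted download.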

mercecrosas · 2015-05-22T18:06:34Z

This issue is not well defined as it is. @thegaryking does this question apply to subsettable tabular files? Why is it bad to have both UNF and MD5? As @pdurbin comments, MD5s are the checksums most commonly used by preservation groups and libraries, and can be useful in addition to UNF to verify the original deposited file.

thegaryking · 2015-05-23T14:27:05Z

we're first and foremost trying to communicate with users, almost none of
which know about either md5 or unfs. we have taken them another step and
told them about unfs; we have a page describing them, and when we put in
enough effort they get what they are. there's no reason to introduce
another new concept. let's just use another degree of indirection. if the
librarians want something for files for which we don't have unfs, then we
create a broader notion of what a unf is and we always have a unf for every
file. the plan would be quite like the idea of dv to begin with, which is
that the more it knows about the files which are uploaded, the more
services we provide. for unfs, we can do the same thing: if the file is in
a format we know what to do with (R, sas, spss, table, etc.) we can compute
a format-independent UNF. If it is anything else we can create a
format-dependent UNF, which if you want would be exactly a MD5, but would
be displayed as a UNF. we can also add another service if librarians want,
somewhere far out of the way of most users, that lets people type in a unf
and have dv tell them exactly what it is and how it was calculated,
including the full algorithm, an MD5 if it is in there, and anything else.

then from a UI/user understanding point of view, there will be only one
thing to understand; they can ignore the details and trust us if they want;
they can get the details if they like; and we can continue to innovate what
a UNF is since there's a version number embedded in it (and i agree that we
should do the latter; people tell me that videos and photos wouldn't be
hard and we could clearly expand to more forms of data. this, however, is
a separate project that we could perhaps seek funding for and do then).
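To make the proposal concrete, here is a rough Python sketch of the "one thing to understand" idea: every file gets a value labelled as a UNF, computed format-independently where possible and as a plain MD5 otherwise. Everything named here (the extension list, the record fields, the `describe` lookup) is an assumption for illustration; the thread does not specify any of it, and the real UNF calculation is treated as an input rather than implemented.

```python
import hashlib
import os

# Hypothetical extensions standing in for "a format we know what to do with".
KNOWN_TABULAR_EXTENSIONS = {".rdata", ".sas7bdat", ".sav", ".tab"}


def fingerprint(filename, content, tabular_unf=None):
    """Return one user-facing "UNF" record for any file.

    `tabular_unf` is assumed to come from an existing UNF calculation that is
    outside this sketch; when it is missing, the MD5 of the bytes is used.
    """
    extension = os.path.splitext(filename)[1].lower()
    if extension in KNOWN_TABULAR_EXTENSIONS and tabular_unf is not None:
        return {"label": "UNF", "kind": "format-independent",
                "algorithm": "UNF", "value": tabular_unf}
    return {"label": "UNF", "kind": "format-dependent",
            "algorithm": "MD5", "value": hashlib.md5(content).hexdigest()}


def describe(record):
    """Spell out how a displayed value was calculated (the lookup service)."""
    return "{kind} fingerprint computed with {algorithm}: {value}".format(**record)
```

The point of the design is that the user interface only ever shows one label, while `describe` keeps the underlying algorithm available to anyone who asks for it.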

mercecrosas · 2015-05-23T14:47:00Z

Yes, I agree with the general approach, but we need to do some research to
implement this well. Here are the three main issues:

1. If we only provide a UNF for tabular files (SPSS, Stata, R, etc.), and not an MD5 for the original format, then we don't have a way to verify the original deposited file. That matters when we need to recalculate the UNF, when there are issues or uncertainties with reformatting, or simply for the standard archival verifications that repositories should do. This is a request from some groups that want to make sure we support good preservation practices. There are some preservation certificates that Dataverse could not get without this.
2. UNF is a fantastic concept, but it has some practical limitations and issues in the way it is currently defined. Given that each format treats some data types differently (time, binary and categorical variables, rounding), it could turn out that you convert from one format to another, to another, and then back to the first format, and end up with different UNFs (this is similar to Google translating from one language to another, to another, and back to the first, and ending up with a different sentence). This has been improved considerably since the initial UNF version, but in practice it's very difficult to cover all the exceptions.
3. But one of the main issues with UNF is that it doesn't include the metadata of the file, such as the variable name or data type. This means that you might have an SPSS file with column A that corresponds to var1, which turns out to be incorrect and needs to be changed to var2, and this is not reflected in the UNF, even though it is a critical change in the data file.

I agree it would be great, as you say, to generalize UNF and make it work
well across all cases, so we use and teach only one thing. But we need to
take these three issues (and I might be missing others that Leonid, Kevin
and others in the team might know) into consideration.
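The third issue can be illustrated with a toy example. The sketch below is not the UNF algorithm; it uses SHA-256 over the cell values as a stand-in, and both function names are hypothetical. It only shows why a fingerprint that ignores variable names cannot detect a rename from var1 to var2, whereas one that also hashes the names can.

```python
import hashlib


def values_only_fingerprint(columns):
    """Toy fingerprint over cell values only. NOT the real UNF algorithm."""
    digest = hashlib.sha256()
    for _name, values in columns:  # the column name is deliberately ignored
        for value in values:
            digest.update(repr(value).encode("utf-8"))
            digest.update(b"\x00")  # separator between cells
        digest.update(b"\x01")      # separator between columns
    return digest.hexdigest()


def values_and_names_fingerprint(columns):
    """Toy fingerprint that also covers the variable names."""
    digest = hashlib.sha256()
    for name, values in columns:
        digest.update(name.encode("utf-8"))
        digest.update(b"\x02")      # separator between name and values
        digest.update(values_only_fingerprint([(name, values)]).encode("ascii"))
    return digest.hexdigest()


before = [("var1", [1, 2, 3]), ("income", [10.5, 20.0, 7.25])]
after = [("var2", [1, 2, 3]), ("income", [10.5, 20.0, 7.25])]

# Renaming var1 to var2 is invisible to the values-only fingerprint.
print(values_only_fingerprint(before) == values_only_fingerprint(after))
# Including the names makes the same rename detectable.
print(values_and_names_fingerprint(before) == values_and_names_fingerprint(after))
```

An MD5 of the original deposited file would also change in this case, which is part of the argument for keeping it alongside the UNF.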

posixeleni · 2015-05-23T15:01:54Z

I agree with @mcrosas that the MD5 checksum should exist for every file to ensure bit-level preservation. When I presented a preview of Dataverse 4.0 to the Library of Congress' National Digital Stewardship Alliance in the Fall they were particularly impressed that we included MD5s for all our files. Here's a blog post from them discussing the importance of file fixity/data integrity: http://blogs.loc.gov/digitalpreservation/2014/04/protect-your-data-file-fixity-and-data-integrity/

MD5 is a standard that the digital archival community trusts whereas UNF was unknown to them. I don't think we should replace MD5s for all files with UNFs if the community isn't using them outside of Dataverse.

thegaryking · 2015-05-23T15:51:52Z

ok, but let's get MD5s out of the file list now. we can stick it in the
metadata when we expand the long list, as an unchangeable item if someone
wants it.

and separately, let's create a google doc or something with specifications
of a UNF that would satisfy everyone. we can even create the specs and
either get the grant ourselves or have a call for PIs to take on this task,
perhaps with us. we know pretty much everything we want, and all the
problems with the current UNF. we just need some bandwidth (or someone
else) to implement it all.

pdurbin · 2015-05-24T14:18:25Z

Maybe we should remove both UNFs and MD5s from the default listing for files. They add a lot of noise.

I just clicked on a random dataset and saw this for a PowerPoint file:

Isn't this a little... noisy... busy... unfriendly?

Who cares that its MD5 is 26a3bb59a1d9a837ea51cc9c160c5b1a? (In addition, who cares that its MIME type is application/vnd.openxmlformats-officedocument.presentationml.presentation?) It's a PowerPoint file! That's all most people need to know. If normal people download it and can't open it they'll throw it in the trash and try again. If they still can't open it, they'll email the dataset contact and say "Hey, I think you uploaded a corrupted PowerPoint file" (which is more likely than the file being corrupted during download). They're not going to calculate the MD5 locally (let alone the UNF) and compare it to the MD5 on the screen. Only geeks like me would even think of doing that. And I probably wouldn't bother. I'd email the dataset contact.

Sure, show that it's 3 megabytes. Show the date it was uploaded. Stuff like MD5 and UNF could be hidden behind a "details" link, perhaps with some definitions of what MD5 and UNF even are.

mercecrosas · 2015-05-24T15:06:17Z

@pdurbin we should not just remove these from the file cards without the appropriate research and consideration of good preservation practices. Users and partners expect to be able to find the fixity information easily, even if it's not used all the time. But you bring up good points, and it's worth reviewing whether they can be displayed in another place.

I'm assigning this issue to @mheppler and adding it to the "In Design" milestone, following our process. Once a design is proposed and reviewed (by @thegaryking and the partners who requested the MD5), we'll move it to a Release Candidate milestone.

To summarize, based on @thegaryking comments above:

- This issue is about removing MD5 from the file card of tabular/subsettable files (but keeping UNF) and finding the appropriate place to display the MD5. Files that don't have a UNF will display the MD5, as is the case now. (As a note, the metadata tab might not be the right place for the MD5 of tabular files since it has dataset metadata but not file-level metadata, although we should still consider it. @mheppler I have some ideas about this, for when you are ready to work on it.)
- For the larger task of generalizing UNF, I'll create a new issue in GitHub and start a Functional Requirements Document, as we do for new features or components, and invite others to review it.
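The display rule summarized above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this thread; the function name and arguments are hypothetical and do not correspond to the actual Dataverse code.

```python
def file_card_fixity(unf=None, md5=None):
    """Return the (label, value) pair to show on the file card, or None.

    Tabular files that have a UNF show only the UNF. Any file without a UNF
    falls back to its MD5, as is the case now.
    """
    if unf:
        return ("UNF", unf)
    if md5:
        return ("MD5", md5)
    return None
```

With this rule a file that has both values shows only its UNF on the card, so the MD5 for such files needs another home, which is the open design question.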

landreev · 2015-05-26T21:09:28Z

@pdurbin - yes, the long Microsoft MIME types are terrible. But we have a mechanism for dealing with this: it's just a matter of adding the "friendly" version (such as "PowerPoint") to the list we maintain (it's a .properties file).
The friendly types for Excel and Word are already there. PowerPoint was left out, probably because it's not as common.
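As a rough picture of how such a lookup works, here is a small Python sketch that reads `key=value` lines and falls back to the raw MIME type when no friendly name is listed. The sample entry and both function names are assumptions for illustration; the actual contents of the properties file are not shown in this thread, and the parser ignores most of the Java properties syntax (escapes, `:` separators, line continuations).

```python
def load_display_names(properties_text):
    """Parse simple `key=value` lines into a dict, skipping comments and blanks."""
    names = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            names[key.strip()] = value.strip()
    return names


def friendly_type(mime_type, names):
    """Return the friendly label for a MIME type, or the MIME type itself."""
    return names.get(mime_type, mime_type)


# Hypothetical entry of the kind that would need to be added for PowerPoint.
SAMPLE = """
# friendly display names
application/vnd.openxmlformats-officedocument.presentationml.presentation=PowerPoint
"""

names = load_display_names(SAMPLE)
print(friendly_type(
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    names,
))
```

Types that are missing from the list still display, just with the long raw string, which is what happened to the PowerPoint file above.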

mheppler · 2015-05-26T21:14:55Z

Thank you for commenting on that @landreev. I was going to ask you about these "friendly" file types, since I recall going over these with you for the file icons. We should separate out that task of identifying as many of these file types as we can in our current production data, and giving them friendly labels.

landreev · 2015-05-26T21:20:05Z

@mheppler
Yes, we could use a dedicated ticket for creating these "friendly" labels for as many types as possible.
The file in question is ./src/main/java/MimeTypeDisplay.properties

mheppler · 2015-05-26T21:45:05Z

@landreev @pdurbin -- #2202 -- new issue for MIME Type improvements created. Enjoy.

eaquigley · 2015-06-25T15:20:20Z

Need to discuss this during a UI/UX team meeting to brainstorm ideas on how to show more file metadata without being overwhelming in the file card on the dataset page. Perhaps having a files metadata section in the metadata tab. @mcrosas @mheppler

mercecrosas · 2015-06-25T15:25:34Z

After reviewing it with @eaquigley and @mheppler we plan to move this to 4.0.3.

eaquigley · 2015-06-29T20:27:59Z

Have a section in the metadata tab that is "Files" and displays this extra metadata (MD5 shows here and not on the files card if a UNF is available).

mheppler · 2015-07-06T14:58:10Z

FRD: https://docs.google.com/document/d/1v-6WuFyClnAAHqyMf1VsWtCdXDTTR-ikuG6Ou8RtDMM/edit

Mockups:

sbarbosadataverse · 2015-07-06T15:35:56Z

I had one question about this: is there a safeguard in place to ensure the MD5 gets added when tabular ingest fails for any reason? We have enough failures at the moment to cause me to ask.
Thanks

mheppler · 2015-09-14T19:35:19Z

Removed MD5 from dataverse browse/search card for tabular files when UNF is displayed.
Removed MD5 from dataset list table for tabular files when UNF is displayed.
Removed MD5 from top section of file landing page for tabular files when UNF is displayed.

Note: With the file landing page being pushed to 4.3, this removes the "Original File MD5" for tabular files completely from the UI.

kcondon · 2015-09-22T22:11:02Z

OK looks good, closing.

sbarbosadataverse added the Type: Feature, UX & UI: Design, Priority: Medium, Type: Suggestion, Feature: File Upload & Handling, and Type: Bug labels May 22, 2015

mercecrosas removed the Priority: High, Type: Bug, and Type: Feature labels May 23, 2015

mercecrosas self-assigned this May 23, 2015

mercecrosas modified the milestones: In Review, In Design May 23, 2015

mercecrosas assigned mheppler and unassigned mercecrosas May 24, 2015

mercecrosas mentioned this issue May 24, 2015

Generalize UNF definition to apply across all Files #2198

Closed

mercecrosas added the Type: Feature label and removed the Type: Suggestion label May 24, 2015

mercecrosas mentioned this issue May 26, 2015

Generalize UNF definition to apply across all Files IQSS/UNF#2

Open

mheppler mentioned this issue May 26, 2015

Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202

Closed

eaquigley modified the milestones: In Design, 4.2 Jul 23, 2015

mheppler added a commit that referenced this issue Sep 14, 2015

Removed MD5 for tabular file displays when the UNF is displayed. [ref #2192 #1665 #2503 #2504 #2465] (e2781a8)

mheppler added the Status: QA label Sep 14, 2015

mheppler assigned kcondon and unassigned mheppler Sep 14, 2015

kcondon closed this as completed Sep 22, 2015
