Crazy idea: make metadata blocks more flexible #6700

poikilotherm · 2020-02-27T11:58:02Z

Motivation

A lot of people at Tromso, my colleagues and so many other great folks I've been talking to about Dataverse highlighted one of the outstanding features: custom metadata blocks.

But when I read things like #6561 or discuss new standards like CodeMeta, I miss two things:

- a dedicated place to run to, informing me about
  - what already exists (apart from the guide)
  - who is interested in what
  - what might already be available as a TSV file, which lives in some dark place of an installation
- more flexibility to change upstream core TSVs like citation.tsv without forking the whole app

I'm very aware of #4451 and #6030, but those are grand challenges, large tasks that won't be solvable in the near future. So let's talk about what we can do today and on a low-effort basis. (Like I did with Solr and custom metadata, see #6142.)

Crazy idea

- Let's create IQSS/dataverse-metadata.
- This repo should contain our main/core TSV files plus any useful scripts and docs, maybe pointing to the guides or being a new, consolidated guide related to metadata (however).
- Remove TSVs from the main repo, maybe add as git submodule
- Change Makefile to include them from their new place

Now what's the benefit?

- Easily create forks of these files, which is much more lightweight than forks of the complete application
- Make changes to these files for your installation in a branch, keep master (or whatever) updateable from upstream
- Create workflows to update your installation branch with upstream changes, with git doing the hard work of keeping track of changes and merging things again. This can also be automated to sync branches and create PRs in the fork repos (needs to be set up by the forking party, e.g. with free GitHub services).
- As long as people do it on GitHub: create a network of these forks dedicated to metadata. That way we can visualize what we want, look at what other people add in their installations, share experiences and re-use metadata blocks.
- Have a place to discuss issues about metadata. I have a feeling this will become more important as we go.
- Create more flexibility by organizing instead of huge refactoring today.

Let's talk about this 😸

Mentions

Mentioning a bunch of people I know being interested in metadata blocks (and some other important people): @4tikhonov @skasberger @djbrooke @scolapasta @pdurbin @TaniaSchlatter @BPeuch @youssefOuahalou @vbernabe @RightInTwo @doigl @qqmyers @mercecrosas @bronger

Please share to anyone you might think is interested. Community power! 💪

mercecrosas · 2020-02-27T14:56:50Z

Thanks for sharing this idea. I think it's very interesting and needs to be explored. We definitely need to do something about the flexibility of metadata blocks, and maybe this could work. I would ask those in the community who use metadata blocks about the advantages and disadvantages. Also, should we explore moving to JSON-LD support for metadata blocks, while maintaining support for TSV files? This is already what Jim Myers is looking into with the current extension of the RDA work, if I'm not mistaken.

-- Mercè Crosas, Ph.D. University Research Data Management Officer Chief Data Science and Technology Officer, Institute for Quantitative Social Science Harvard University [email protected] | @mercecrosas <https://twitter.com/mercecrosas> | scholar.harvard.edu/mercecrosas

qqmyers · 2020-02-27T15:04:12Z

@poikilotherm - having a place to discuss metadata issues sounds useful - the problems and ideas are spread out across many issues and docs. Another google group like that for i18n might be useful. Also, FWIW, Gustavo and I are trying to get a doc together summarizing issues and potential design directions around metadata that we hope to get out for community input in the near future.

W.r.t. another repository, while I think separating the set of schemas you can use from Dataverse itself makes sense in the longer term, I'm not sure that a separate tsv repo is as useful. (To be clear, this is a personal opinion rather than a GDCC position. I expect that if there's consensus this is useful, it could be a GDCC repo and I'd help to make it work.) Right now, the tsv files mix the description of what is included with how that should be displayed in Dataverse. And there are different changes you can make to metadata blocks that have different impacts. I think changes to tsv files that affect how things are displayed work with the current API and they don't have too much impact on preservation. Other changes - switching required fields to optional, changing the mapping to community terms are more problematic w.r.t. harvesting, import/export, preservation, etc. And some changes, such as removing a term, just don't work (unless you edit in the database and account for any prior use of that term). While none of these get worse if tsv files are in another repository, I think it may get harder to coordinate if there are changes like those discussed in #6561, or larger changes to separate out the schema from display issues. If there is a separate repository, I think it would be very important, given the way things work today, to continue to review changes for impacts like those above.
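To make the distinction between kinds of changes more concrete, here is a minimal Python sketch that compares an upstream block with a locally modified copy and sorts the differences into groups along the lines described above: display-only changes, changes that probably need review for harvesting, export or preservation impact, and removals. It works on a simplified, header-first TSV rather than the real multi-section block files. The column names and the choice of which columns count as "display only" (`DISPLAY_COLUMNS`) are assumptions for illustration, not an official classification, and the sample data is made up.

```python
import csv
import io

# Assumption: columns that only influence how a field is shown in the UI.
DISPLAY_COLUMNS = {"title", "description", "watermark", "displayOrder", "displayoncreate"}


def read_fields(tsv_text):
    """Parse a simplified, header-first TSV into {field name: row dict}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["name"]: row for row in reader}


def classify_changes(upstream_tsv, local_tsv):
    """Group differences between two TSVs by their likely impact."""
    upstream = read_fields(upstream_tsv)
    local = read_fields(local_tsv)
    report = {"display_only": [], "needs_review": [], "removed": [], "added": []}
    for name, up_row in upstream.items():
        if name not in local:
            # Removing a field is the case that may require database edits.
            report["removed"].append(name)
            continue
        changed = {col for col in up_row if up_row[col] != local[name].get(col)}
        if not changed:
            continue
        if changed <= DISPLAY_COLUMNS:
            report["display_only"].append(name)
        else:
            report["needs_review"].append(name)
    report["added"] = [name for name in local if name not in upstream]
    return report


# Made-up sample data; the term URIs are placeholders.
UPSTREAM = (
    "name\ttitle\tdisplayOrder\trequired\ttermURI\n"
    "title\tTitle\t0\tTRUE\thttps://example.org/terms/title\n"
    "subtitle\tSubtitle\t1\tFALSE\t\n"
    "author\tAuthor\t2\tTRUE\thttps://example.org/terms/creator\n"
)
LOCAL = (
    "name\ttitle\tdisplayOrder\trequired\ttermURI\n"
    "title\tMain Title\t0\tTRUE\thttps://example.org/terms/title\n"
    "author\tAuthor\t2\tFALSE\thttps://example.org/terms/creator\n"
    "softwareName\tSoftware Name\t3\tFALSE\t\n"
)

if __name__ == "__main__":
    for group, names in classify_changes(UPSTREAM, LOCAL).items():
        print(f"{group}: {names}")
```

A check like this could run in a review step of whatever repository holds the TSV files, so that changes in the "needs review" or "removed" groups get a closer look before they are merged.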

poikilotherm · 2020-02-27T16:24:40Z

Thank you @mercecrosas and @qqmyers for your input. Wonderful to hear that you are already working with Gustavo on a doc.

Is there a foreseeable ETA for such a huge UI change? From my (not so long) experience with folks around here this sounds like end of 2020 at the earliest. Which is perfectly fine - you are all doing a great job, and this shouldn't be rushed. So my idea was about doing something that can be done easily in a very short time, does not involve many code changes, but helps people get things done now until the big, great and fancy solution is ready. The very same idea was behind #6142.

I know that TSV files aren't great and I very much dislike them, too. But it's where we are now. Maybe splitting them out can be a good starting point for moving on to a new format, too, as it's a central place to maintain things. (:wave: @mercecrosas)

Jim, you are absolutely right about the multiverse of things you can do to TSVs. But actually that was the whole point why this should be moved into a separate repo. That way we have a non-cluttered, easy to follow change log for this more or less static data. Everyone can pretty much rely on it, but needs to make sure their changes are backported to any upstream change (the normal fork-problem workflow you need to work on).

So for everyone that is happy with the current schemas, there is no need to fork, just use things. But for everyone else, it gets much easier to maintain a fork when you don't have to mess with the large main app repo.

If you guys feel uncomfortable with removing them from the main repo - updates to the metadata repo from the main repo can be automated. That way you would need even fewer changes, but still create a place to run to. In my personal experience, I'm less of a mailing list guy, but read GitHub issues and IRC frequently and love to discuss here.

Harvesting might become a problem, indeed. But what should we do instead when things like #6561 appear? Just this morning we discussed that we don't want all the author id schemas for Jülich DATA but stick with ORCID only. That's a change only doable by changing the TSV when you don't want to fork the code to implement filters or similar. On the other hand we want to keep the maintenance effort to maintain the metadata schemas as low as possible.
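The ORCID-only case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: that the block TSV has a `#controlledVocabulary` section whose rows carry the field name in the second column and the value in the third, and that the field in question is named `authorIdentifierScheme`. Both should be checked against your own citation.tsv. The helper name `keep_only_values` and the sample rows are hypothetical.

```python
def keep_only_values(tsv_text, field, allowed):
    """Drop controlled-vocabulary rows of `field` whose value is not in `allowed`.

    Assumption: vocabulary rows follow a '#controlledVocabulary' header line and
    are tab-separated as: empty cell, field name, value, further columns.
    """
    kept = []
    in_vocabulary = False
    for line in tsv_text.splitlines():
        if line.startswith("#"):
            # A new section starts; remember whether it is the vocabulary one.
            in_vocabulary = line.startswith("#controlledVocabulary")
            kept.append(line)
            continue
        cells = line.split("\t")
        if in_vocabulary and len(cells) >= 3 and cells[1] == field and cells[2] not in allowed:
            continue  # skip values the installation does not want to offer
        kept.append(line)
    return "\n".join(kept) + "\n"


# Illustrative sample, not copied from a real block file.
SAMPLE = (
    "#datasetField\tname\ttitle\n"
    "\tauthorIdentifierScheme\tIdentifier Scheme\n"
    "#controlledVocabulary\tDatasetField\tValue\tidentifier\tdisplayOrder\n"
    "\tauthorIdentifierScheme\tORCID\t\t0\n"
    "\tauthorIdentifierScheme\tISNI\t\t1\n"
    "\tsubject\tPhysics\tD01\t0\n"
)

if __name__ == "__main__":
    print(keep_only_values(SAMPLE, "authorIdentifierScheme", {"ORCID"}), end="")
```

Keeping such a filter as a script next to the upstream TSV, rather than editing the file by hand, would make it easier to re-apply after upstream changes. Whether reloading the filtered file is enough for values already in use is a separate question, given the caveat about removals raised earlier in the thread.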

Don't get me wrong - if there is no interest in this within the community, this is just a skippable crazy idea. We can simply do this just for us and share scripts with everyone interested. If there is no greater value in creating a place to run to for the community, I'll simply cope with what's present. 😉

skasberger · 2020-02-27T19:56:59Z

Having a master repo with the default metadata block files and some helpful scripts would be a good start. I am creating my own repo for our AUSSDA metadata blocks right now, to make them usable for our Jenkins tests and for the deployment scripts. So it could become a fork or similar in the future.

BPeuch · 2020-02-28T15:16:57Z

Sounds like a great idea to me, @poikilotherm. Decompartmentalization (you don't get to use that word every day but it sounds fitting here) is the way to go, I think. IQSS's wish to maintain a core of metadata is more than understandable: it's good practice documentologically speaking. Without it, say goodbye to interoperability. But with time the number of repositories with specific metadata needs, which come to the fore asking for customization options, only seems to grow… So looking for a workaround like this feels like a great initiative to me.

@qqmyers wrote:

I think changes to tsv files that affect how things are displayed work with the current API and they don't have too much impact on preservation. Other changes - switching required fields to optional, changing the mapping to community terms are more problematic w.r.t. harvesting, import/export, preservation, etc.

That's also a very good point. For instance, while JSON, flexible as ever, immediately incorporates custom metadata blocks in its files, DDI doesn't. I have yet to see if an issue or a topic on the Google Group was already created to mention this.

qqmyers · 2020-02-28T15:51:06Z

Independent of the question of a repository, I just want to make sure that it's clear that there's a difference in the current design between having new metadata blocks and editing existing ones. The former is straightforward and there's no issue with having new ones (e.g. the Darwin Core block discussed in Tromso). Editing an existing block, or different groups using different versions of the same block, is where care is needed to avoid problems, and where guidance is needed about avoiding changes that will require db edits, affect interoperability, etc.

djbrooke · 2020-03-02T20:04:42Z

@poikilotherm, seems there is quite a bit of interest here and we will discuss as @mercecrosas and others mentioned. Quick question, you mentioned a UI change in your comment. Can you provide some more details on what you see as the potential change (or changes)?

4tikhonov · 2020-03-02T22:24:01Z

I don't think we need to separate the metadata schema from the master branch and put it in another GitHub repository, or build another GUI to handle this. The most obvious solution is to create a synchronization tool that can read and update any schema in Dataverse directly from a Google Spreadsheet via the Google API.

Spreadsheets are suitable for collaborative work and can be archived periodically in the master branch. With a bit of effort this tool could also be integrated into the Dataverse dashboard, so an admin can fill in a form with the Google API details and a link to the spreadsheet and get a metadata schema updated.
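A rough Python sketch of the synchronization step, leaving out the Google API part entirely. It assumes the spreadsheet rows have already been fetched as lists of cell values (how that happens is not shown), serializes them to TSV, and prepares a POST request for loading a metadata block. The endpoint path `/api/admin/datasetfield/load`, the content type, and the `base_url` are assumptions to verify against the guides for your Dataverse version; the function names and sample rows are hypothetical.

```python
import urllib.request


def rows_to_tsv(rows):
    """Serialize spreadsheet rows (lists of cell values) into TSV text."""
    lines = []
    for row in rows:
        # Tabs or newlines inside a cell would break the format, so flatten them.
        cells = [str(cell).replace("\t", " ").replace("\n", " ") for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


def build_load_request(base_url, tsv_text):
    """Prepare (but do not send) the upload request for a metadata block TSV."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/admin/datasetfield/load",  # assumed endpoint
        data=tsv_text.encode("utf-8"),
        headers={"Content-Type": "text/tab-separated-values"},
        method="POST",
    )


# Hypothetical rows, as they might come out of a spreadsheet export.
ROWS = [
    ["#metadataBlock", "name", "dataverseAlias", "displayName"],
    ["", "demo", "", "Demo Block"],
]

request = build_load_request("http://localhost:8080", rows_to_tsv(ROWS))
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Building the request separately from sending it keeps the conversion step easy to test and lets an installation review or archive the generated TSV before anything is loaded.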

poikilotherm · 2020-03-04T14:50:59Z

@djbrooke actually I think for this proposal, there is no UI change needed. Everything can stay as is, this is just about reorganizing the metadata blocks files.

The UI changes I mentioned in #6700 (comment) were about the changes that @qqmyers and @scolapasta are investigating and related to an ETA for those changes (my anticipation is that this will take much longer than reorganizing as a temporary workaround).

pdurbin · 2020-07-16T14:33:23Z

Seeing a post referencing this issue at https://groups.google.com/g/dataverse-community/c/RJl4IQcPw30/m/pk1RtA58CgAJ reminds me to mention here that a new "flexible metadata" working group is being formed following the 2020 Dataverse Community Meeting.

The "GDCC working groups" announcement can be found here: https://groups.google.com/g/dataverse-community/c/EY0dduRj3Ac/m/EDcEQHLoAwAJ

Here's a direct link where people can sign up to the flexible metadata group and other groups: https://docs.google.com/document/d/1LTLjLM5sR07SAEqO7u-QgRp-StO327WS2TdbR2KdPxY/edit?usp=sharing

poikilotherm · 2022-08-04T12:26:48Z

This discussion was a dead end. Closing.

poikilotherm added the Type: Suggestion (an idea), Feature: Metadata, Feature: Installer, Feature: Admin Guide labels Feb 27, 2020

pdurbin mentioned this issue Feb 27, 2020

harmonize formats for metadata schema and dataset creation #4451

Closed

poikilotherm mentioned this issue Feb 27, 2020

Allow installation to decide if displayoncreate is true or false for metadata fields #6561

Closed

poikilotherm mentioned this issue May 4, 2020

Classify metadata fields as mandatory, recommended, and optional #6885

Closed

poikilotherm closed this as completed Aug 4, 2022
