Data restricted to Shibboleth groups cannot be explored in TwoRavens or downloaded via curl #3447

Closed
pdurbin opened this issue Nov 2, 2016 · 5 comments

Comments

@pdurbin
Member

pdurbin commented Nov 2, 2016

First mentioned at https://help.hmdc.harvard.edu/Ticket/Display.html?id=243061, we have a case at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T2DQQS where files are restricted and access has been granted to the institution-wide Shibboleth group for Harvard. The files can be downloaded but they cannot be explored in TwoRavens. Here's a screenshot of the poor user experience (no "data pebbles"):

[screenshot: image001]

This morning @landreev @kcondon @scolapasta @djbrooke and I discussed this ticket and it was agreed that I would create this issue and provide background since I'm the one who implemented institution-wide Shibboleth groups in #1401 and documented them at http://guides.dataverse.org/en/4.5.1/installation/shibboleth.html#institution-wide-shibboleth-groups

We came up with a lot of ideas of how we could fix the bug (comments welcome) but I'd like to explain why these shib groups work the way they do.

We are trying to be careful about knowing if a Harvard affiliate, for example, has left Harvard and should no longer have access to restricted data. If you can't log in with HarvardKey anymore, you shouldn't be able to download data that is restricted to Harvard users. Toward that end, institution-wide Shibboleth groups are implemented as runtime groups. This is similar to how IP Groups are implemented. (Perhaps you expect to be able to download data from a Harvard IP address when you're on campus, but you understand that if you're at home with a non-Harvard IP address, you will not be able to download the restricted file.) Dataverse knows you are a Harvard affiliate because you have logged in successfully to HarvardKey and HarvardKey has asserted back to Dataverse information about you including your name, email address, and most importantly for this case, which Identity Provider (IdP) you came from, which for Harvard is Entity ID "https://fed.huit.harvard.edu/idp/shibboleth" as seen at https://incommon.org/federation/info/entity.html?entityID=https%3A%2F%2Ffed.huit.harvard.edu%2Fidp%2Fshibboleth&technical=true and below:

[screenshot: screen shot 2016-11-02 at 12 42 17 pm]

So, once we have "https://fed.huit.harvard.edu/idp/shibboleth" where do we put it? We store it in the persistentuserid column of the authenticateduserlookup table, followed by the pipe character (|), followed by the unique identifier for the user. It looks something like this: https://fed.huit.harvard.edu/idp/shibboleth|[email protected]. This is true not only for HarvardKey users, of course, but for all of the 200+ institutions that can log in.
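To make the format concrete, here is a minimal sketch of splitting such a value back into its two parts. It is an illustration only, not code from Dataverse: the function and class names are hypothetical, the example user identifier is made up, and splitting on the first pipe is an assumption.

```python
from typing import NamedTuple


class PersistentUserId(NamedTuple):
    """Hypothetical container for the two halves of a persistentuserid value."""
    idp_entity_id: str
    user_id: str


def parse_persistent_user_id(value: str) -> PersistentUserId:
    """Split '<IdP entity ID>|<user identifier>' on the first pipe character.

    Assumption: the entity ID itself never contains a pipe.
    """
    idp_entity_id, separator, user_id = value.partition("|")
    if not separator or not idp_entity_id or not user_id:
        raise ValueError(f"not in 'entityId|userId' form: {value!r}")
    return PersistentUserId(idp_entity_id, user_id)


# Made-up user identifier, for illustration only.
example = "https://fed.huit.harvard.edu/idp/shibboleth|someone@example.edu"
parsed = parse_persistent_user_id(example)
print(parsed.idp_entity_id)
print(parsed.user_id)
```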

For most of those 200 institutions, that's the only place where this "Entity ID" is stored. It's basically a no-op, permissions-wise. But for institutions for which we've bothered to set up an institution-wide Shibboleth group (currently Harvard and MIT only, I think; we'd like to automate this in #1403), at runtime we will match the user's IdP with the group (stored in a table called "shibgroup") and consider them part of that group from a permissions point of view. It's that shib group that is given "download permission" or whatever on a particular file or dataset. Again, the important thing here is that this only happens at runtime. In the code, as of 4.5.1 it only happens in the GUI with the line au.setShibIdentityProvider(shibIdp) at https://github.com/IQSS/dataverse/blob/v4.5.1/src/main/java/edu/harvard/iq/dataverse/Shib.java#L338 . Note that the variable is annotated as @Transient. It's not persisted anywhere. Runtime only.
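The runtime-only behavior can be modeled with a small sketch. This is a simplified Python stand-in, not the actual Java implementation: the class names, field names, and group name below are hypothetical. It only illustrates why a user object that did not pass through the browser login has no IdP set and therefore matches no shib group.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class ShibGroup:
    """Stand-in for a row of the "shibgroup" table (field names are assumed)."""
    name: str
    idp_entity_id: str


@dataclass
class AuthenticatedUser:
    identifier: str
    # Set only during a browser login and never stored, like a @Transient field.
    shib_identity_provider: Optional[str] = None


def runtime_shib_groups(user: AuthenticatedUser, groups: Iterable[ShibGroup]) -> Set[str]:
    """Return names of shib groups whose IdP matches the one on this user object."""
    if user.shib_identity_provider is None:
        return set()
    return {g.name for g in groups if g.idp_entity_id == user.shib_identity_provider}


harvard_idp = "https://fed.huit.harvard.edu/idp/shibboleth"
groups = [ShibGroup("harvard-wide", harvard_idp)]

# Browser login: the IdP asserted by Shibboleth is attached to the session user.
browser_user = AuthenticatedUser("someone", shib_identity_provider=harvard_idp)
# API token lookup: the user is loaded from the database, so the IdP is absent.
token_user = AuthenticatedUser("someone")

print(runtime_shib_groups(browser_user, groups))
print(runtime_shib_groups(token_user, groups))
```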

I hope this background helps. I could go on and on but I think this is a good start. In short, we rely on the GUI (the browser) to determine at runtime if you're a Harvard user or not. In the title, I'm mentioning curl because it's another way to test this. You can't download files restricted to a Shib group via curl. The API token alone is currently not enough. There's a related issue here in that if someone leaves Harvard and can't log into HarvardKey anymore, their API token will still work. Not forever, but until it expires, which I think is a year after creation? We'll know more after we dig into #3398.
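For reference, the curl test mentioned above amounts to a token-authenticated request like the one sketched here. The /api/access/datafile path and the X-Dataverse-key header are assumptions based on the Dataverse API guides rather than anything stated in this issue, and the file ID and token are placeholders. The sketch only builds the request; it does not send it.

```python
import urllib.request


def build_download_request(server: str, file_id: int, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a file download request authenticated by API token.

    Assumptions: the Data Access API path and the X-Dataverse-key header name.
    """
    url = f"{server.rstrip('/')}/api/access/datafile/{file_id}"
    return urllib.request.Request(url, headers={"X-Dataverse-key": api_token})


# Placeholder file ID and token, for illustration only.
request = build_download_request("https://dataverse.harvard.edu", 42, "xxxxxxxx-xxxx")
print(request.get_method(), request.full_url)
```

Sending this with `urllib.request.urlopen(request)` is the equivalent of the curl test: no browser session is involved, so no Shibboleth IdP is attached to the user at runtime.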

@pdurbin
Member Author

pdurbin commented Nov 4, 2016

I haven't really talked about workarounds.

The first workaround to try is IP Groups if we're mostly trying to restrict data to Harvard campus IP addresses. I don't believe there are a lot of docs but there's a bit at http://guides.dataverse.org/en/4.5.1/api/native-api.html#ipgroups

Another workaround is to create an "explicit" group and assign all the users to that group. Then this explicit group could be given permission at the dataverse, dataset, or file level.

There's a security wrinkle in here that I don't quite understand that @landreev could explain better. Something about how we could use an IP Group to allow just the IP Group of Rserve server (?) to be able to download the restricted files?

@landreev
Contributor

landreev commented Feb 7, 2017

@pdurbin -
Sorry for replying 3 months later; @djbrooke just pointed out to me that there was an unanswered question in this ticket.
Actually, you do seem to understand the "security wrinkle" in question - the way you describe the potential solution/workaround is all there is to it: "use an IP Group to allow just the IP Group of Rserve server to be able to download the restricted files".
The reason TwoRavens isn't working for shib users is that the part of it that runs on the server (rserve.dataverse.harvard.edu) needs to be able to download the data files.
It authenticates with the user's api token that Dataverse passes to it. And that part doesn't work for shib users. However, if we create an IP group with just the IP address of the R server, with the privileges to download any file (can be done, right?), this should solve it.
The other part of the TwoRavens app is JavaScript that runs in the user's browser. It needs access privileges too, but that should not be a problem, since the requests from their browsers will be coming in with their session id, and their session will have all the right privileges for as long as they are logged in.
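The IP Group idea can be sketched as a runtime membership check on the address a request comes from. This is an illustration of the concept only, not Dataverse code: the function name is hypothetical and the address is from a documentation-only range, standing in for the real address of the R server.

```python
import ipaddress
from typing import Iterable, Tuple


def in_ip_group(remote_addr: str, ranges: Iterable[Tuple[str, str]]) -> bool:
    """Return True if remote_addr falls inside any inclusive (low, high) range."""
    addr = ipaddress.ip_address(remote_addr)
    for low, high in ranges:
        lo, hi = ipaddress.ip_address(low), ipaddress.ip_address(high)
        # Only compare addresses of the same IP version (IPv4 with IPv4, etc.).
        if addr.version == lo.version == hi.version and lo <= addr <= hi:
            return True
    return False


# A group holding a single address: low and high ends of the range are equal.
rserve_only = [("192.0.2.10", "192.0.2.10")]

print(in_ip_group("192.0.2.10", rserve_only))   # request from the R server
print(in_ip_group("198.51.100.7", rserve_only)) # request from anywhere else
```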

@djbrooke
Contributor

djbrooke commented Feb 8, 2017

Thanks @landreev and @pdurbin for the discussion. If there is some workaround that we can set up with IPGroups, that would be great, as this is causing issues for at least one installation.

Can you please work together to confirm whether or not this workaround is possible?

Thanks!

@pdurbin
Member Author

pdurbin commented Feb 8, 2017

#3151 is about how IP Groups shouldn't allow access to level 2 data or above. I guess we'd make an exception in this case?

Could Rserve authenticate to Dataverse as a user rather than an IP Group? I'm asking because one workaround would be to create a dedicated user for Rserve and flip the superuser boolean to true, which would allow downloads of all files. This is a bit risky, of course! Also, I don't believe access this broad is possible with IP Groups. I think you'd need to:

  • Create an IP group for the IP used by rserve.dataverse.harvard.edu
  • Give the IP Group the Member role (or some suitable read-only role that allows downloads) for each existing dataverse
  • Set up a cron job to look for new dataverses as they are created and assign the Member role to those as well.

I think this is the only way to allow an IP Group to download any file because permissions are not inherited from a parent dataverse to a child dataverse per #2447. The way I think of this is that each dataverse is an island of permission. @scolapasta @kcondon @michbarsinai and @landreev can fact check me on this. 😄 #alternativefacts
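The cron job in the steps above could be sketched roughly as follows. This is a planning sketch under stated assumptions, not a tested procedure: the /api/dataverses/{alias}/assignments path, the "&ip/" assignee prefix, and the "member" role alias are assumptions about the native API, and the dataverse and group aliases are made up. It only computes which requests would be needed and sends nothing.

```python
import json
from typing import Iterable, List, Tuple


def plan_role_assignments(
    all_dataverses: Iterable[str],
    already_assigned: Iterable[str],
    group_alias: str,
    role: str = "member",
) -> List[Tuple[str, str]]:
    """Return (path, JSON body) pairs for dataverses that still lack the role.

    Assumptions: the assignments endpoint path, the "&ip/" assignee prefix,
    and the "member" role alias.
    """
    assignee = f"&ip/{group_alias}"
    pending = sorted(set(all_dataverses) - set(already_assigned))
    body = json.dumps({"assignee": assignee, "role": role})
    return [(f"/api/dataverses/{alias}/assignments", body) for alias in pending]


# Made-up aliases: "dv-c" was created since the last run of the job.
for path, body in plan_role_assignments(["dv-a", "dv-b", "dv-c"], ["dv-a", "dv-b"], "rserve"):
    print("POST", path, body)
```

Because each dataverse is its own island of permission, the job has to diff the full list of dataverses against those already covered on every run.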

@pdurbin
Member Author

pdurbin commented Jun 28, 2017

This doesn't seem to be a high priority so I'm closing this issue. If I'm wrong, let's re-open it some day.

@pdurbin pdurbin closed this as completed Jun 28, 2017