Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User-based ObjectStore #4314

Closed
wants to merge 36 commits into from
Closed

Conversation

VJalili
Copy link
Member

@VJalili VJalili commented Jul 14, 2017

This PR is in accordance with the Federated user-based ObjectStore goal. The updates introduced in this PR are mock-up of the Federated user-based ObjectStore, and can generally be grouped as:

  1. User-based ObjectStore. Current ObjectStore reads its configuration (e.g., provider, bucket name, access and secret keys) from config/object_store_conf.xml and applies them Galaxy instance-wide. In other words, the data uploaded/created/processes by any of the users are written-to or read-from a common bucket to which the Galaxy instance has full access (accordingly, implicitly accessible by all the users of the galaxy instance). This PR introduces user-based objectstore, where each user defines his/her own PluggedMedia (e.g., AWS S3 bucket(s), Azure Blob, and etc.) where a dataset is read/written to that PluggedMedia.

  2. Provider agnostic ObjectStore. This PR replaces the S3ObjecStore module with cloud module. The former uses boto and provides access to AWS S3 only; while, the latter leverages on CloudBridge which provides an interface for unified access to various providers (including S3 and Swift).

What is expected in future related PRs:

  • at this PR, only cloud module is user-based; other modules such as DistributedObjectStore are not (user should be able to define how to distribute his/her own data on the PluggedMedias he/she has defined using a user-based DistributedObjectStore).

  • some UI elements will be defined to allow user (a) define a PluggedMedia, (b) choose a PluggedMedia when uploading/creating a new dataset. At this PR, to define a PluggedMedia, one should manually add it to the database (PluggedMedia table). Additionally, the code has two temporary code-snippets (marked with TEMP BLOCK) which choose the first available PluggedMedia of the user.

2. Added user to cloud create signature.
2. Added a function in dataset to retrieve a list of all the users its shared with.
… of the functions.

- ObjectStore.Cloud calls for the configuration of a connection to the cloud-based storage providers at its first call (this has to be extended to other possible call paths).
…sed objectstore.

- Replaced dataset.filename with dataset.get_file_name for what is needed for a user-based objectstore.
- previously was `Job`
- now it's HDA
- previously was `Job`
- now it's HDA
- added a table for it, and extended HDA table accordingly.
- created migration scripts.
- propagated PluggedMedia to the ObjectStore
- added temporary code for defining a pluggedMedia (once the a user is created), and using it (once a dataset is being uploaded). These temporary codes has to be replaced with appropriate fucntions.
…ocess.

2. Fixed some bugs in Objectstore.cloud.
…stances of objectstore could run.

2. cleaned-up cloud config file parser.
3. fixed a bug with contruct_path function.
…tore

# Conflicts:
#	doc/source/lib/galaxy.objectstore.rst
#	doc/source/slideshow/architecture/images/objectstore.plantuml.svg
#	lib/galaxy/jobs/__init__.py
@@ -902,10 +902,16 @@ def get_special( ):
def _create_working_directory( self ):
job = self.get_job()
try:
# TEMP BLOCK --- START
pluggedMedia = None
for pM in job.user.pluggedMedia:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This for loop just to set a variable is a bit awkward. I think it'd be better to add a method to the User class in lib/galaxy/model/__init__.py that defines this.

def current_plugged_media(self):
    return None if not len(self.plugged_media) else self.plugged_media[-1]

and then access like:

job.user.get_current_plugged_media()

It is really unfortunate that plugged_media is also used as both a list and a single instance. I'd consider a name that has a more obvious, distinct singular vs. plural modality (e.g. plugged_volume vs. plugged_volumes) or use plugged_medium for the single instance you are fetching.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As marked, this part of the code is temporary. There will be some UI's asking user which plugged_media to use, or allow user to set a default plugged_media from preferences. However, till then, we're assuming that the user has only one plugged_media and we're retrieving it this way.

regarding naming; indeed job.user.plugged_media is a relation to the plugged_media table.

"""
Object store that stores objects as items in a Swift bucket. A local
cache exists that is used as an intermediate location for files between
Galaxy and Swift.
"""

def _configure_connection(self):
# TODO: Replace with Cloudbridge connection.
log.debug("Configuring Swift Connection")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mark this PR as a WIP until this TODO is addressed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is already addressed in the super class - can you just delete the code and the comment?

@jmchilton
Copy link
Member

This is exciting functionality and a great direction to move in - thanks for the contribution!

That said, I'd really like to see some large structural changes to this PR - in particular I'd like to see this as 3 different PRs so that I can comment on the pieces individually.

The three PRs I'd like to see are:

  • First a very small easy-to-merge PR that addresses all the smaller white space changes and such you make. If those are important to you - we can just merge those quickly and it would really reduce the visual clutter for the other PRs. This is a pattern I frequently follow before making big changes - there is an initial PR to smooth out and clarify the pieces of code I'm going to touch. This would make the other two PRs more atomic as well.

  • Secondly, I'd like to see the swap to cloudbridge as its own PR - that should be doable without the pluggable media stuff right? I mean there is an open PR from you about that right? I thought it required more work on your end - but if you are waiting on committers maybe we can have a small team meeting to discuss getting that merged - I don't think anyone wants to block your progress. That said - based on TODOs and such - this part of the PR seems like it needs some work - restoring swift functionality for instance.

    One thing to consider here is to keep the existing object stores as they are - and just add a new cloud one. One potential criticism of what you have done here is that you break existing object store confs by changing the names and stuff. If you just added your own object store first and even insisted that it is used for the new functionality I think that would be fine. Everyone could be more confident about merging it then if we knew it wasn't breaking old stuff. Longer term that duplication is problematic - but it could be updated in an atomic PR after this hypothetical one to just provide the backward compatibility endpoints to make swift and s3 point at some specialization of the cloud provider.

  • The third PR would be the pluggable media PR. And for that PR I'd like to see it marked as WIP until a full demo is ready to go - even if it is just a set of API calls that the reviewer can imaging a GUI sitting on top of. I don't feel comfortable making these big changes to the object store interaction code and to the database until I definitely see all the pieces fitting together. Maybe everything that is needed to make this usable today is there and I just need it pointed out to me - but if not I'd like to delay if possible and make the PR feel more atomic. I care passionately about the interaction between the job runners and object stores and I think probably others do as well and this adds some complexity there - given that a lot of people care about it - it would be better to make a big change atomically I think so we can be sure this is the right complexity. And separating the two PRs discussed above would give everyone a better shot at commenting intelligently about the work as you go I think.

    This is a lot to ask I think - to keep so many big changes sort of in WIP branches - but I do think it is the best way to go. I started my collections branch with database changes just like this in like December of 2014 and didn't merge the final collections PR until maybe May of 2015 once the API and tool interactions, etc... were just right (or right enough - demonstrable at least). I started a branch in the fall of 2015 for CWL support that included database interactions and I'm still working on producing something atomic that feels correct and complete.

All that said - I'm not a -1 or anything - this isn't something I'm intending to force on you. I'm just letting you know how I would approach it and how I would feel more comfortable reviewing it. Good luck!

@VJalili
Copy link
Member Author

VJalili commented Jul 14, 2017

@bgruening thanks for the reference. Indeed having a section in user preferences where the user can define PluggedMedia (e.g., AWS S3, Azure Blob, and etc.) is required to improve usability of user-based objectstore. Additionally that section should allow user to define hierarchical relation between his/her various plugged medias (similar to object_store_config).

@natefoo natefoo self-requested a review July 17, 2017 03:50
@jgoecks
Copy link
Contributor

jgoecks commented Jul 18, 2017

+1 on @jmchilton's suggestions. These will help in scoping/modularizing/understanding this excellent work and generating digestable PRs.

@VJalili
Copy link
Member Author

VJalili commented Jul 18, 2017

are you suggesting to close this PR, separate cloudbridge, and create a new PR when use-based ObjectStore is fully functional ?

@dannon
Copy link
Member

dannon commented Jul 18, 2017

I didn't want to jump in on this PR right away since I feel like we've talked about this in person quite a bit, but I agree pretty much completely with what @jmchilton suggests, which closely echoes what I feel like we've talked about already.

Small logical units of incremental change are vastly preferable and are much easier to understand and review/approve, or suggest improvements for. I like the three-PR breakdown John suggests -- one stylistic, with no substantial changes, for if you need/want to fix style in places prior to work. Then the cloudbridge part on existing components. Finally, new PluggableMedia changes.

I'll also echo, as I've mentioned several times in person, that we should not just throw away S3ObjectStore without a dead-simple automated upgrade path where existing configurations will continue to work (this is general practice for anything changing in Galaxy configuration). The easiest solution here is probably to, instead of renaming S3ObjectStore, just leave it as-is for now and add the CloudObjectStore separately.

@jxtx
Copy link
Contributor

jxtx commented Jul 18, 2017

I'll also echo, as I've mentioned several times in person, that we should not just throw away S3ObjectStore without a dead-simple automated upgrade path where existing configurations will continue to work (this is general practice for anything changing in Galaxy configuration). The easiest solution here is probably to, instead of renaming S3ObjectStore, just leave it as-is for now and add the CloudObjectStore separately.

I also don't understand the reason for removing it. If we just leave it we have backward compatibility without needing to write any conversion scripts or whatever, which is what we usually strive for.

@VJalili
Copy link
Member Author

VJalili commented Jul 18, 2017

What do you think about the user-based objectstore concept this PR is introducing?

@VJalili VJalili mentioned this pull request Jul 27, 2017
@VJalili
Copy link
Member Author

VJalili commented Jul 27, 2017

Thanks for your comments. Accordingly, I created PRs #135 and #136 on starforge, and PRs #4333, #4342 and 4352 on galaxy.

@nsoranzo
Copy link
Member

Thanks @VJalili, I'm closing this one then.

@nsoranzo nsoranzo closed this Jul 27, 2017
@VJalili
Copy link
Member Author

VJalili commented Jul 28, 2017

@nsoranzo , The other PRs are focused on Cloud ObjectStore, and this PR extends it as a User-based ObjectStore; i.e., adding a user-defined "PluggedMedia" (e.g., AWS S3 bucket, Azure blob). The main difference between the current ObjectStore (e.g., S3ObjectStore) and the User-based ObjectStore (which this PR introduces) is as follows:

  • in the current objectstore, a configured media (e.g., AWS S3) is used for the all the galaxy users. In other words, only one S3 bucket per galaxy instance.

  • in the User-based objectstore, each user defines his/her own media (e.g., his very own AWS S3 bucket, or Google drive). In other words, many S3 buckets per galaxy instance.

The Cloud is submitted as a separated PR following @jmchilton @jgoecks and @dannon suggestion; but those PRs are not adding User-based ObjectStore. Having merged the aforementioned PRs, we need to work on merging this PR. Therefore, please re-open this PR.

@nsoranzo
Copy link
Member

I supposed you wanted to open a clean PR after the others are merged, but I can surely reopen this, but I'll mark it WIP in the mean time.

@VJalili
Copy link
Member Author

VJalili commented Jul 28, 2017

@nsoranzo Thanks.

@VJalili VJalili closed this Jul 31, 2017
@VJalili VJalili mentioned this pull request Aug 23, 2017
@VJalili VJalili deleted the UserBasedObjectStore branch March 5, 2019 04:23
@VJalili VJalili mentioned this pull request Dec 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants