Overhaul Object Store Backends #18117

Closed
wants to merge 40 commits into from

Conversation

jmchilton
Member

@jmchilton jmchilton commented May 9, 2024

  • Add a boto3 object store with tests (they run if you have keys defined) and update documentation. It works well for me both against AWS and against other S3-compatible services via interop (I used GCP, another clear target of interest for the project). This is the first time Galaxy has had test cases for the multithreaded transfer to S3 object stores.
  • I removed a ton of code (mostly duplicated code that was merged to a parent class but also some code that didn't make sense or wasn't used) from all the non-disk object stores. Over 400 lines of code removed each from S3, cloud, and Azure object stores and hundreds more removed from pithos, rucio, and irods.
  • I think the Azure and the Boto3 object stores are now pretty good and clean examples of how to implement a caching object store: pull all the caching logic from the base class and just expose small methods that are very precise and efficient in terms of how they access cloud APIs, respond to errors, and such. S3's legacy is on display in the implementation and I didn't break any of that (in fact there are more tests than ever), but it isn't a great example as a result. On the other hand, the cloudbridge implementation is very broad with exception handling and that isn't great. Also, I think in trying to be so generic you miss a lot of options available through boto3.
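
A minimal sketch of the pattern described in the last bullet, not the actual Galaxy code: a base class owns the cache bookkeeping and a backend only supplies two small transfer methods. Every name here (`CachingStoreSketch`, `InMemoryBackend`, `_download`, `_upload`, `get_filename`, `push`) is hypothetical and chosen for illustration, and an in-memory dict stands in for the cloud bucket.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class CachingStoreSketch(ABC):
    """Hypothetical base class: owns all of the cache bookkeeping."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir

    def _cache_path(self, key):
        return os.path.join(self.cache_dir, key)

    def get_filename(self, key):
        """Return a local path for key, downloading only on a cache miss."""
        path = self._cache_path(key)
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            if not self._download(key, path):
                raise FileNotFoundError(key)
        return path

    def push(self, key):
        """Upload the cached copy of key to the backend."""
        return self._upload(key, self._cache_path(key))

    # Backends implement only these two narrow methods.
    @abstractmethod
    def _download(self, key, local_path):
        """Fetch key into local_path; return False if the key does not exist."""

    @abstractmethod
    def _upload(self, key, local_path):
        """Store local_path under key; return True on success."""


class InMemoryBackend(CachingStoreSketch):
    """Toy backend (an assumption for this sketch): a dict plays the bucket."""

    def __init__(self, cache_dir):
        super().__init__(cache_dir)
        self.remote = {}

    def _download(self, key, local_path):
        data = self.remote.get(key)
        if data is None:
            return False
        with open(local_path, "wb") as fh:
            fh.write(data)
        return True

    def _upload(self, key, local_path):
        with open(local_path, "rb") as fh:
            self.remote[key] = fh.read()
        return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        store = InMemoryBackend(cache_dir)
        store.remote["9/e/b/dataset_1.dat"] = b"payload"
        print(store.get_filename("9/e/b/dataset_1.dat"))
```

In a layout like this, a real backend would replace the two dict operations with narrow cloud API calls and catch only the specific errors those calls are expected to raise, which is the contrast drawn above with very broad exception handling.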

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

jmchilton added 30 commits May 9, 2024 09:52
Works with service account and user accounts keys. Settings -> Interoperability to create HMAC keys.
Eliminate CloudConfigMixin...

This got copied and pasted over and over, and each place just used it once. Really odd reading of my original code but I guess I understand why people did it.
Member

@mvdbeek mvdbeek left a comment


Looks great!

lib/galaxy/objectstore/_caching_base.py Outdated Show resolved Hide resolved
lib/galaxy/objectstore/_caching_base.py Outdated Show resolved Hide resolved
lib/galaxy/objectstore/_caching_base.py Show resolved Hide resolved
@jmchilton jmchilton force-pushed the overhaul_object_stores branch from 9923d29 to 8d060b3 Compare May 9, 2024 17:01
@jmchilton
Member Author

This is breaking a single test still:

integration.test_tool_data_bundles.TestDataBundlesIntegration
    def test_admin_build_data_bundle_by_uri(self):
        original_count = self._testbeta_field_count()
    
        history_id = self.dataset_populator.new_history()
        payload = self.dataset_populator.run_tool_payload(
            tool_id="data_manager",
            inputs={"ignored_value": "moo"},
            data_manager_mode="bundle",
            history_id=history_id,
        )
        create_response = self._post("tools", data=payload)
        create_response.raise_for_status()
        self.dataset_populator.wait_for_history(history_id, assert_ok=True)
        data_manager_dataset = self.dataset_populator.get_history_dataset_details(history_id)
        assert data_manager_dataset["extension"] == "data_manager_json"
    
        post_job_count = self._testbeta_field_count()
        assert original_count == post_job_count
    
        shutil.rmtree(self.object_store_cache_path)
        os.makedirs(self.object_store_cache_path)
    
        content = self.dataset_populator.get_history_dataset_content(
            history_id, to_ext="data_manager_json", type="bytes"
        )
        temp_directory = decompress_bytes_to_directory(content)
>       assert os.path.exists(os.path.join(temp_directory, "newvalue.txt"))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7fdf1957f0d0>('/tmp/tmpnrvoj4c7/tmp9e__od17/newvalue.txt')
E        +    where <function exists at 0x7fdf1957f0d0> = <module 'posixpath' from '/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/posixpath.py'>.exists
E        +      where <module 'posixpath' from '/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/posixpath.py'> = os.path
E        +    and   '/tmp/tmpnrvoj4c7/tmp9e__od17/newvalue.txt' = <function join at 0x7fdf1957fa60>('/tmp/tmpnrvoj4c7/tmp9e__od17', 'newvalue.txt')
E        +      where <function join at 0x7fdf1957fa60> = <module 'posixpath' from '/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/posixpath.py'>.join
E        +        where <module 'posixpath' from '/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/posixpath.py'> = os.path
INFO:     127.0.0.1:53594 - "GET /api/tool_data/testbeta HTTP/1.1" 200 OK
multipart.multipart DEBUG 2024-05-10 05:34:19,535 [pN:main,p:7930,tN:Thread-501] Calling on_field_start with no data
multipart.multipart DEBUG 2024-05-10 05:34:19,535 [pN:main,p:7930,tN:Thread-501] Calling on_field_name with data[0:4]
multipart.multipart DEBUG 2024-05-10 05:34:19,535 [pN:main,p:7930,tN:Thread-501] Calling on_field_data with data[5:21]
multipart.multipart DEBUG 2024-05-10 05:34:19,535 [pN:main,p:7930,tN:Thread-501] Calling on_field_end with no data
multipart.multipart DEBUG 2024-05-10 05:34:19,536 [pN:main,p:7930,tN:Thread-501] Calling on_end with no data
INFO:     127.0.0.1:53600 - "POST /api/histories HTTP/1.1" 200 OK
galaxy.tools INFO 2024-05-10 05:34:19,616 [pN:main,p:7930,tN:WSGI_0] Validated and populated state for tool request (0.161 ms)
galaxy.tools.actions INFO 2024-05-10 05:34:19,625 [pN:main,p:7930,tN:WSGI_0] Handled output named out_file for tool data_manager (1.029 ms)
galaxy.tools.actions INFO 2024-05-10 05:34:19,628 [pN:main,p:7930,tN:WSGI_0] Added output datasets to history (3.609 ms)
galaxy.tools.actions INFO 2024-05-10 05:34:19,630 [pN:main,p:7930,tN:WSGI_0] Setup for job Job[unflushed,tool_id=data_manager] complete, ready to be enqueued (1.515 ms)
galaxy.tools.execute DEBUG 2024-05-10 05:34:19,659 [pN:main,p:7930,tN:WSGI_0] Tool data_manager created job 1 (38.103 ms)
galaxy.web_stack.handlers INFO 2024-05-10 05:34:19,669 [pN:main,p:7930,tN:WSGI_0] (Job[id=1,tool_id=data_manager]) Handler '_default_' assigned using 'HANDLER_ASSIGNMENT_METHODS.DB_SKIP_LOCKED' assignment method
galaxy.tools.execute DEBUG 2024-05-10 05:34:19,672 [pN:main,p:7930,tN:WSGI_0] Created 1 job(s) for tool data_manager request (55.855 ms)
INFO:     127.0.0.1:53610 - "POST /api/tools HTTP/1.1" 200 OK
INFO:     127.0.0.1:53620 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
galaxy.jobs.handler DEBUG 2024-05-10 05:34:19,777 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] Grabbed Job(s): 1
galaxy.jobs.mapper DEBUG 2024-05-10 05:34:19,794 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] (1) Mapped job to destination id: local
galaxy.jobs.handler DEBUG 2024-05-10 05:34:19,809 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] (1) Dispatching to local runner
galaxy.jobs DEBUG 2024-05-10 05:34:19,821 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] (1) Persisting job destination (destination id: local)
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:19,852 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] Pushing cache file '/tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat' of size 0 bytes to key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat'
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:19,856 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] Pushed cache file '/tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat' to key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat' (0 bytes transfered in 0:00:00.003193 sec)
galaxy.jobs DEBUG 2024-05-10 05:34:19,856 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] (1) Working directory for job is: /tmp/tmp0pze_216/job_working_directory_swift/000/1
galaxy.jobs.runners DEBUG 2024-05-10 05:34:19,862 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] Job [1] queued (52.983 ms)
galaxy.jobs.handler INFO 2024-05-10 05:34:19,866 [pN:main,p:7930,tN:JobHandlerQueue.monitor_thread] (1) Job dispatched
galaxy.jobs DEBUG 2024-05-10 05:34:19,918 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Job wrapper for Job [1] prepared (46.260 ms)
INFO:     127.0.0.1:53630 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
galaxy.jobs.command_factory INFO 2024-05-10 05:34:20,061 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Built script [/tmp/tmp0pze_216/job_working_directory_swift/000/1/tool_script.sh] for tool command [mkdir /tmp/tmp0pze_216/job_working_directory_swift/000/1/outputs/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files ; echo "A new value" > '/tmp/tmp0pze_216/job_working_directory_swift/000/1/outputs/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/newvalue.txt'; cp '/tmp/tmp0pze_216/job_working_directory_swift/000/1/configs/tmpvwp933kv' '/tmp/tmp0pze_216/job_working_directory_swift/000/1/outputs/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat'; exit 0]
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:20,107 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Pulling key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat' into cache to /tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:20,108 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Pulled key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat' into cache to /tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat
galaxy.jobs.runners DEBUG 2024-05-10 05:34:20,138 [pN:main,p:7930,tN:LocalRunner.work_thread-1] (1) command is: mkdir -p working outputs configs
if [ -d _working ]; then
    rm -rf working/ outputs/ configs/; cp -R _working working; cp -R _outputs outputs; cp -R _configs configs
else
    cp -R working _working; cp -R outputs _outputs; cp -R configs _configs
fi
cd working; /bin/bash /tmp/tmp0pze_216/job_working_directory_swift/000/1/tool_script.sh > '../outputs/tool_stdout' 2> '../outputs/tool_stderr'; return_code=$?; echo $return_code > /tmp/tmp0pze_216/job_working_directory_swift/000/1/galaxy_1.ec; cd '/tmp/tmp0pze_216/job_working_directory_swift/000/1'; 
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True; python metadata/set.py; sh -c "exit $return_code"
galaxy.jobs.runners.local DEBUG 2024-05-10 05:34:20,282 [pN:main,p:7930,tN:LocalRunner.work_thread-1] (1) executing job script: /tmp/tmp0pze_216/job_working_directory_swift/000/1/galaxy_1.sh
INFO:     127.0.0.1:53636 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53644 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53660 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53662 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53668 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53674 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53690 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53706 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53708 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53716 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53732 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53734 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53748 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53758 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53762 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53774 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53778 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53794 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53798 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53806 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
galaxy.jobs.runners.util.process_groups DEBUG 2024-05-10 05:34:25,835 [pN:main,p:7930,tN:LocalRunner.work_thread-1] check_pg(): No process found in process group 34618
galaxy.jobs.runners.local DEBUG 2024-05-10 05:34:25,835 [pN:main,p:7930,tN:LocalRunner.work_thread-1] execution finished: /tmp/tmp0pze_216/job_working_directory_swift/000/1/galaxy_1.sh
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:25,903 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Pushing cache file '/tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/_gx_data_bundle_index.json' of size 716 bytes to key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/_gx_data_bundle_index.json'
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:25,905 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Pushed cache file '/tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/_gx_data_bundle_index.json' to key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/_gx_data_bundle_index.json' (716 bytes transfered in 0:00:00.002758 sec)
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:25,908 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Pushing cache file '/tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/_gx_data_bundle_index.json' of size 716 bytes to key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/_gx_data_bundle_index.json'
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:25,911 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Pushed cache file '/tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/_gx_data_bundle_index.json' to key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208_files/_gx_data_bundle_index.json' (716 bytes transfered in 0:00:00.003257 sec)
galaxy.jobs INFO 2024-05-10 05:34:25,926 [pN:main,p:7930,tN:LocalRunner.work_thread-1] Collecting metrics for Job 1 in /tmp/tmp0pze_216/job_working_directory_swift/000/1
galaxy.jobs DEBUG 2024-05-10 05:34:25,939 [pN:main,p:7930,tN:LocalRunner.work_thread-1] job_wrapper.finish for job 1 executed (96.751 ms)
INFO:     127.0.0.1:53818 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53830 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:53832 - "GET /api/histories/adb5f5c93f827949/contents HTTP/1.1" 200 OK
INFO:     127.0.0.1:53834 - "GET /api/histories/adb5f5c93f827949/contents/datasets/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:41336 - "GET /api/tool_data/testbeta HTTP/1.1" 200 OK
INFO:     127.0.0.1:41344 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
INFO:     127.0.0.1:41356 - "GET /api/histories/adb5f5c93f827949/contents HTTP/1.1" 200 OK
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:27,298 [pN:main,p:7930,tN:AnyIO worker thread] Pulling key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat' into cache to /tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat
galaxy.objectstore.s3 DEBUG 2024-05-10 05:34:27,299 [pN:main,p:7930,tN:AnyIO worker thread] Pulled key '9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat' into cache to /tmp/tmp0pze_216/object_store_cache/9/e/b/dataset_9eb7b7a7-d0bf-4040-9e41-e75151416208.dat
INFO:     127.0.0.1:41370 - "GET /api/histories/adb5f5c93f827949/contents/adb5f5c93f827949/display?to_ext=data_manager_json HTTP/1.1" 200 OK

@jmchilton
Member Author

Unit test that might replicate what is happening with the test case.

  • Create a dataset with extra files.
  • Wipe and restore the cache.
  • Call get_file_name()
  • Ensure the extra files path is restored.
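
The four steps above can be sketched against a toy store. This is an illustration under stated assumptions, not the unit test added in this PR: `ToyCachingStore`, its `remote` dict, its `push` method, and the `_files/` key layout are hypothetical stand-ins, and this toy `get_file_name` is simply written so that it restores extra files along with the dataset file.

```python
import os
import shutil
import tempfile


class ToyCachingStore:
    """Hypothetical stand-in for a caching object store; `remote` plays the bucket."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.remote = {}

    def push(self, rel_path):
        """Copy a cached file into the fake remote."""
        with open(os.path.join(self.cache_dir, rel_path), "rb") as fh:
            self.remote[rel_path] = fh.read()

    def _pull(self, rel_path):
        local = os.path.join(self.cache_dir, rel_path)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with open(local, "wb") as fh:
            fh.write(self.remote[rel_path])

    def get_file_name(self, dataset_key):
        """Restore the dataset file and everything under its extra files prefix."""
        extra_prefix = dataset_key[: -len(".dat")] + "_files/"
        for key in self.remote:
            if key == dataset_key or key.startswith(extra_prefix):
                if not os.path.exists(os.path.join(self.cache_dir, key)):
                    self._pull(key)
        return os.path.join(self.cache_dir, dataset_key)


def check_extra_files_survive_cache_wipe():
    cache_dir = tempfile.mkdtemp()
    try:
        store = ToyCachingStore(cache_dir)
        # 1. Create a dataset with extra files and push both to the remote.
        files = [
            ("dataset_1.dat", b"{}"),
            ("dataset_1_files/newvalue.txt", b"A new value\n"),
        ]
        for rel_path, data in files:
            local = os.path.join(cache_dir, rel_path)
            os.makedirs(os.path.dirname(local), exist_ok=True)
            with open(local, "wb") as fh:
                fh.write(data)
            store.push(rel_path)
        # 2. Wipe and restore the (now empty) cache directory.
        shutil.rmtree(cache_dir)
        os.makedirs(cache_dir)
        # 3. Call get_file_name().
        path = store.get_file_name("dataset_1.dat")
        # 4. Ensure the extra files path is restored.
        extra = os.path.join(cache_dir, "dataset_1_files", "newvalue.txt")
        return os.path.exists(path) and os.path.exists(extra)
    finally:
        shutil.rmtree(cache_dir, ignore_errors=True)


if __name__ == "__main__":
    assert check_extra_files_survive_cache_wipe()
```

The wipe in step 2 mirrors the `shutil.rmtree` and `os.makedirs` calls in the failing integration test, so the check only passes if the store can rebuild the extra files directory purely from the remote copy.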

@jmchilton jmchilton force-pushed the overhaul_object_stores branch 4 times, most recently from 18453e4 to 5aa3519 Compare May 10, 2024 16:15
jmchilton added 2 commits May 10, 2024 14:07
- Fix s3, which broke during the migration because we stopped adding a / that I thought was silly.
- Use the test case I wrote that mimics the integration test failure to also fix the boto3 and azure_blob object stores for these failures.

Also make note of a lingering issue that is pretty important
@jmchilton jmchilton force-pushed the overhaul_object_stores branch from 62d23b6 to 9d1ba54 Compare May 10, 2024 18:07
@jmchilton jmchilton marked this pull request as ready for review May 10, 2024 18:58
@github-actions github-actions bot added this to the 24.1 milestone May 10, 2024
@jmchilton
Member Author

jmchilton commented May 10, 2024

Tracking that last bug down meant adding a test case and finding many more existing bugs that predate this PR. I think the overall result of all the fixes that came out of those tests is that all of the object stores are now much less buggy when it comes to extra files handling than they were before this refactoring began. The comments in the last few commits make it clear what I found, so it is probably worth reviewing those commits.

jmchilton added a commit to jmchilton/galaxy that referenced this pull request May 11, 2024
jmchilton added a commit to jmchilton/galaxy that referenced this pull request May 14, 2024
jmchilton added a commit to jmchilton/galaxy that referenced this pull request May 14, 2024
@mvdbeek mvdbeek self-requested a review May 14, 2024 14:38
jmchilton added a commit to jmchilton/galaxy that referenced this pull request May 14, 2024
@jmchilton
Member Author

Closing in favor of #18136. That PR has all of this plus some nice test improvements and fixes discovered by those.

@jmchilton jmchilton closed this May 15, 2024
jmchilton added a commit to jmchilton/galaxy that referenced this pull request May 15, 2024
@jmchilton jmchilton mentioned this pull request May 20, 2024
4 tasks
Labels
area/objectstore kind/bug kind/enhancement kind/feature kind/refactoring cleanup or refactoring of existing code, no functional changes

2 participants