Abandon storage.xml in favor of storage.json for stage-in and stage-out (final version) #11816

Merged

Conversation

nhduongvn
Collaborator

@nhduongvn nhduongvn commented Dec 15, 2023

fixes #11703

Description

With the migration to Rucio, new storage descriptions are adopted using the storage description file, storage.json. This file plays a role similar to that of the storage.xml used with PhEDEx during Run 1 and Run 2. A new block in the site configuration file, site-local-config.xml, contains all stage-out choices.

This PR adapts the code to the new stage-out mechanism for Rucio described in issue #11703. A more detailed description of the implementation can be found in this doc. An earlier implementation attempt in PR #11790, based on the stageOutUsingStorageJson branch, ended in workflow test failures. The branch stageOutUsingStorageJson_test_b927 used for this PR was branched from commit b927c95 of the stageOutUsingStorageJson branch, with commit 274609a of that branch cherry-picked on top.

These methods are transferred from the master branch without major modifications (mostly added or changed docstrings and comments).

Related PRs

Debug PR #11790

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 tests deleted
    • 1 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 36 warnings and errors that must be fixed
    • 7 warnings
    • 186 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 283 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14723/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov added the PR: NEVER MERGE This PR is not meant for merging label Dec 19, 2023
@todor-ivanov
Contributor

hi @nhduongvn ,

In this debug PR you are missing a commit:
274609a

I believe things were working up to this point, including the commit I just pointed to.

@nhduongvn
Collaborator Author

nhduongvn commented Dec 19, 2023

Hi @todor-ivanov, ok I can go further to the commit 274609a

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests deleted
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 35 warnings and errors that must be fixed
    • 7 warnings
    • 191 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 310 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14738/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests deleted
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 35 warnings and errors that must be fixed
    • 7 warnings
    • 191 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 312 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14741/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor

todor-ivanov commented Dec 21, 2023

hi @nhduongvn,

I could not clean the whole database as I promised in a private chat; the agent still does not support it. I had to completely wipe out the whole agent and repatch it with the current PR. Here are the two workflows that I have submitted for the current test:

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=tivanov_ReReco_Run2022C_LumiMask_StageOutTest_v7_231221_074926_2776

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=tivanov_ReReco_RunBlockWhite_StageOutTest_v7_231221_074919_3473

@nhduongvn
Collaborator Author

Hi @todor-ivanov , that is not a good sign! I think the tests will fail again with the same location error, since it took a long time to go from running-open to running-closed. In one of the unit tests the WMAGENT_SITE_CONFIG_OVERRIDE changes:
ef0822b
Can it cause problems?

@todor-ivanov
Contributor

Hi @nhduongvn I am more optimistic this time. One of the workflows has already completed.
There is one unit test that is failing which could be related to the current changes:

Runtime_t.RuntimeTest:testB_EmulatorTest changed from success to failure

You may want to take a look at this one. And do not worry about working on any of the files under /WMCore/tests while the rest of the PR is under validation. We use those only for the unit tests and if a unit test itself needs to be modified in order to match the new mechanisms, now is the moment.

@nhduongvn
Collaborator Author

Hi @todor-ivanov ,
Yes, it is encouraging that one of the workflows completed without errors!
I will look into this unit test failure. It has been happening since the first commit, and I do not know whether it also happened before the switch to the new mechanism. It seems to have a problem when trying to get the site name:

File "/build/cmsbld/jenkins/workspace/DMWM-WMCorePy3-PR-unittests/SLICE/5/label/cms-dmwm-cc7/code/test/python/WMCore_t/Misc_t/Runtime_t.py", line 422, in testB_EmulatorTest
    self.assertEqual(report.getSiteName(), {})
  File "/build/cmsbld/jenkins/workspace/DMWM-WMCorePy3-PR-unittests/SLICE/5/label/cms-dmwm-cc7/deploy/2.2.5/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/unittest/case.py", line 912, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/build/cmsbld/jenkins/workspace/DMWM-WMCorePy3-PR-unittests/SLICE/5/label/cms-dmwm-cc7/deploy/2.2.5/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/unittest/case.py", line 905, in _baseAssertEqual
    raise self.failureException(msg)
'T1_US_FNAL' != {}

@nhduongvn
Collaborator Author

nhduongvn commented Dec 22, 2023

Hi @todor-ivanov , first of all, the good news is that the workflow tests completed successfully!
Secondly, about the Runtime_t.py testB_EmulatorTest failure in Jenkins: I observed that this test passes when I run nosetests myself, but only because the site config cannot be loaded inside createInitialReport in Bootstrap, since the SITECONFIG_PATH environment variable is not set:

  1. createInitialReport is called here:
    Bootstrap.createInitialReport(job=job,
  2. Here is where site config is loaded:
    This results in report.getSiteName() == {} so the self.assertEqual(report.getSiteName(), {}) test is OK.
    However, in Jenkins the SITECONFIG_PATH environment variable is somehow available at this stage, and report.getSiteName() returns T1_US_FNAL.
    Now I explicitly set SITECONFIG_PATH in runJobs: https://github.com/nhduongvn/WMCore/blob/e9968d05359d6b14dea3a48731a127819aacb2d1/test/python/WMCore_t/Misc_t/Runtime_t.py#L320
    and changed the assertion to self.assertEqual(report.getSiteName(), 'T1_US_FNAL')
    (Please note that this is the latest Jenkins test report: https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14749/artifact/artifacts/PullRequestReport.html)
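
For illustration, here is a minimal sketch of how a unit test could pin SITECONFIG_PATH so that it behaves the same in every environment. The function load_site_name is a hypothetical stand-in for the site-config lookup done during createInitialReport, and the path is a placeholder; neither is WMCore code.

```python
import os
import unittest
from unittest import mock


def load_site_name():
    """Hypothetical stand-in: a site name is only available if SITECONFIG_PATH is set."""
    if "SITECONFIG_PATH" not in os.environ:
        return {}  # mirrors the empty result seen when no site config is loaded
    return "T1_US_FNAL"  # placeholder for the value parsed from the site config


class SiteNameTest(unittest.TestCase):
    def test_without_site_config(self):
        # clear=True drops every inherited variable, including SITECONFIG_PATH
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertEqual(load_site_name(), {})

    def test_with_site_config(self):
        # Placeholder path; a real test would point at a SITECONF checkout
        with mock.patch.dict(os.environ, {"SITECONFIG_PATH": "/placeholder/siteconf"}):
            self.assertEqual(load_site_name(), "T1_US_FNAL")
```

Patching os.environ inside the test, rather than relying on whatever the runner exports, removes the difference between the two environments described above.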

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 2 tests deleted
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 35 warnings and errors that must be fixed
    • 7 warnings
    • 191 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 312 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14748/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests deleted
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 41 warnings and errors that must be fixed
    • 7 warnings
    • 216 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 319 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14749/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 tests deleted
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 35 warnings and errors that must be fixed
    • 7 warnings
    • 191 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 312 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14747/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor

Thanks @nhduongvn it is indeed good that those workflows are succeeding. There are a few final steps that need to be done in order to get this issue closed.

  • We should decide whether to close this issue through the current PR or through (V1) Abandon storage.xml in favor of storage.json for stage-in and stage-out #11790. I am fine either way. It would have been good to keep track of all the discussions that happened during the resolution, but if you think it would be too much work to rebase on top of the commit we already know was the last good one in #11790, I am OK with merging this PR and marking the other one with the NEVER MERGE label instead.
  • You should provide a full description for the PR, summarizing the changes it provides.
  • We should complete the wiki page @amaltaro requested, providing full documentation of the new stage-out/stage-in mechanisms (this may be completed asynchronously, though; it need not be a blocker for merging the PR).
  • Can you take a final look at the pylint markers in the latest unit tests reports: https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14747/artifact/artifacts/PullRequestReport.html#unittestspy3
    (I am talking about those lines in bold)
  • Finally we should squash all the commits into only two commits:
    • one for the code changes
    • one for the unit tests and pylint fixes.

Most of those steps are basically for us to keep a consistent development process and to follow the already established good practices in the WMCore repository.

@nhduongvn
Collaborator Author

Hi @todor-ivanov , yes, I will address these items. However, I do not know how to do the following, since each commit I made can contain both code changes and tests. Could you give me some ideas on how to do it?

Finally we should squash all the Commits in only two commits:
one for the code changes
one for the unit tests and pylint fixes.

@todor-ivanov
Contributor

todor-ivanov commented Dec 31, 2023

hi @nhduongvn

I know it is a bad idea to reorder commits when a single commit contains overlapping work on both code and unit tests, so I would not suggest doing that. But for most of those (modulo one commit) we can live with squashing them together with the code commits, and only the last 4 should go into the unit tests commit. (The very last one, which merges master into your working branch, should be reverted; this will happen automatically when we merge your PR and won't mess up the branch's history.) So here is the full plan that I suggest:

If you are not around during the holidays, I can do it myself. I'd be happy to merge this PR and close this issue in 2023.

@nhduongvn
Collaborator Author

Hi @todor-ivanov , yes, please go ahead and merge this PR. I am not familiar with these steps and risk messing things up. Thank you.

@todor-ivanov
Contributor

Hi @nhduongvn I cannot merge this the way it is right now. And it turns out I cannot squash those commits myself, because GitHub still does not support transferring PR ownership, so I can take over neither this PR nor the original one. The only way I could implement the plan listed in my previous comment is to fork from this PR and create yet another brand new one (already the 3rd in a row fixing #11703), but I'd like to avoid fixing issues with too many copies of the same PR. Could you give it a try and reorganize those commits yourself, either here or in the original PR #11790?

@nhduongvn
Collaborator Author

Hi @todor-ivanov, yes, I can give it a try, probably tomorrow.

@nhduongvn nhduongvn force-pushed the stageOutUsingStorageJson_test_b927 branch from 7048c17 to 38c4591 Compare January 3, 2024 15:46
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests deleted
    • 1 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 41 warnings and errors that must be fixed
    • 7 warnings
    • 216 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 319 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14750/artifact/artifacts/PullRequestReport.html

@nhduongvn nhduongvn force-pushed the stageOutUsingStorageJson_test_b927 branch from 38c4591 to 7048c17 Compare January 3, 2024 16:18
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests deleted
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 41 warnings and errors that must be fixed
    • 7 warnings
    • 216 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 319 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14751/artifact/artifacts/PullRequestReport.html

@nhduongvn nhduongvn force-pushed the stageOutUsingStorageJson_test_b927 branch from 7048c17 to 4268fd3 Compare January 3, 2024 20:56
@nhduongvn
Collaborator Author

Hi @todor-ivanov , it works like a charm after I followed your suggestion here:
https://github.com/dmwm/WMCore/wiki/Developing-against-WMCore
First, I updated master both locally and on GitHub:
git checkout master
git fetch upstream
git merge upstream/master
git push origin master
After this I rebased my code onto master:
https://github.com/dmwm/WMCore/wiki/Developing-against-WMCore#rebasing
and boom, all my commits were on top of the latest master and I just squashed them as usual.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests deleted
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 41 warnings and errors that must be fixed
    • 7 warnings
    • 216 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 319 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14752/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov changed the title DEBUG (V1) for #11790 (Abandon storage.xml in favor of storage.json for stage-in and stage-out) Abandon storage.xml in favor of storage.json for stage-in and stage-out (final version) Jan 4, 2024
@todor-ivanov todor-ivanov removed the PR: NEVER MERGE This PR is not meant for merging label Jan 4, 2024
@todor-ivanov
Contributor

todor-ivanov commented Jan 4, 2024

Thanks @nhduongvn
Things look much better now. I have renamed the PR and also changed the original one's description in order to properly link to the continued development here. Could you please change the description of this PR to give a short summary of what the changes relate to, plus a few bullets on the main points of change? We rely heavily on those descriptions in our code maintenance process, because they make browsing through many PRs much easier.

@amaltaro This PR is ready to go. The failing unit test seems to be unrelated to the code changes here. Please take a final look (at least at the Workqueue unit test) and, if you agree with my opinion, either you or I can push the merge button.

@nhduongvn
Collaborator Author

nhduongvn commented Jan 4, 2024

Hi @todor-ivanov,
I wrote a description for this PR above, along with the documentation requested by @amaltaro (also linked in the PR description above):
https://docs.google.com/document/d/1YAqaX6cdMi0cnuTY6372FkIJGL4JAkgG-O5ufCxuyVs/edit
It is possible that relics of the old mechanism still exist in other WMCore packages outside of Storage (found by searching for localStageOut), for example:
https://github.com/nhduongvn/WMCore/blob/87ea437f508308b5e1d544a234a80e2a9911d5c1/src/python/WMCore/WMRuntime/Bootstrap.py#L334
https://github.com/nhduongvn/WMCore/blob/87ea437f508308b5e1d544a234a80e2a9911d5c1/src/python/WMCore/WMSpec/Steps/Executors/LogCollect.py#L71

@todor-ivanov
Contributor

Thanks @nhduongvn

@amaltaro I am merging this PR and closing the issue.

@todor-ivanov todor-ivanov merged commit 1bd46ad into dmwm:master Jan 6, 2024
2 of 4 checks passed
@amaltaro
Contributor

amaltaro commented Jan 8, 2024

Hi @nhduongvn @todor-ivanov, thank you very much for following this development through and converging on the final product.

Given how deep and impactful these changes are, I think this deserved a review by a second person before being merged into master. That said, I would like to ask you, @todor-ivanov, to promptly follow this up with a new pre-release of WMAgent and to run a large-scale test at most (or all!) of the T1 and T2 sites that we use for central production. Can you please plan to work on that this week?

I still have to look into the document that Duong shared a few days ago.
I also plan to review this PR by tomorrow. Even though it has already been merged, I suspect there are a few things that will need another iteration.

Contributor

@amaltaro amaltaro left a comment

I left an incomplete review in this PR, as there are tons of changes and I need to switch context. This review is mostly for my own record and for a potential follow up PR.

Depending on how the remaining review goes, we might have to roll back these changes so that we can resume upgrading central services and WMAgent while still refining them.

src/python/WMCore/Storage/DeleteMgr.py
src/python/WMCore/Storage/DeleteMgr.py
src/python/WMCore/Storage/SiteLocalConfig.py
src/python/WMCore/Storage/DeleteMgr.py
src/python/WMCore/Storage/DeleteMgr.py
src/python/WMCore/Storage/DeleteMgr.py
src/python/WMCore/Storage/DeleteMgr.py
src/python/WMCore/Storage/RucioFileCatalog.py
src/python/WMCore/Storage/RucioFileCatalog.py
src/python/WMCore/Storage/RucioFileCatalog.py
Contributor

@amaltaro amaltaro left a comment

And I had another iteration over these changes, leaving questions/comments/suggestions along the code.

For my education, I also have 3 general questions:

  1. What is the "volume" attribute in storage.json? What is it expected to be used for?
  2. What is a "chained" rule/prefix?
  3. What is the rationale for calling it Rucio File Catalog? I assume this catalog is actually decoupled from Rucio, isn't it?

src/python/WMCore/Storage/RucioFileCatalog.py
src/python/WMCore/Storage/RucioFileCatalog.py
src/python/WMCore/Storage/RucioFileCatalog.py
src/python/WMCore/Storage/RucioFileCatalog.py
src/python/WMCore/Storage/RucioFileCatalog.py
src/python/WMCore/Storage/StageInMgr.py
def fallbackStageOut(self, lfn, localPfn, fbParams, checksums):


def stageOut(self, lfn, localPfn, checksums, stageOut_rfc=None):
Contributor

Again for my education, my understanding is that there is no more definition of local and fallback stage out. Is that correct? I guess it's all part of this stageOut method now?

Collaborator Author

Yes, exactly
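
To illustrate the idea of a single stage-out method that walks an ordered list of stage-out definitions, with no separate local and fallback paths, here is a minimal sketch. The names stage_out, StageOutError and transfer, as well as the dictionary keys, are hypothetical and do not reproduce the WMCore implementation.

```python
import logging


class StageOutError(Exception):
    """Hypothetical error type raised when one stage-out attempt fails."""


def stage_out(lfn, local_pfn, stage_outs, transfer):
    """Try each stage-out definition in order and return the first that succeeds.

    stage_outs: ordered list of dicts (assumed shape) describing each option,
                e.g. with 'storageSite', 'volume', 'protocol' and 'command' keys.
    transfer:   callable(lfn, local_pfn, definition) raising StageOutError on failure.
    """
    errors = []
    for definition in stage_outs:
        try:
            transfer(lfn, local_pfn, definition)
            return definition
        except StageOutError as ex:
            # A failed attempt is logged and the next option in the list is tried
            logging.warning("Stage out to volume %s failed: %s", definition.get("volume"), ex)
            errors.append(str(ex))
    raise StageOutError("All %d stage-out attempts failed: %s" % (len(stage_outs), "; ".join(errors)))
```

In this sketch the first entry plays the role the old local stage out had, and every later entry acts as a fallback simply by virtue of its position in the list.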

try:
    delManager.deletePFN(pfn, lfn, command)
except StageOutFailure as ex:
    msg = "Failed to cleanup staged out file after error:"
    msg += " %s\n%s" % (lfn, str(ex))
    logging.error(msg)

def searchTFC(self, lfn):
Contributor

Has this one been superseded by StageOutMgr.searchRFC as well?

Collaborator Author

Yes

src/python/WMCore/Storage/DeleteMgr.py

msg = "\nThere are %s stage out definitions." % len(self.stageOuts)
for stageOut in self.stageOuts:
    for k in ['phedex-node','command','storageSite','volume','protocol']:
Contributor

We should probably define these attributes as constants somewhere, given that they are used in multiple places.

Collaborator Author

These are used three times, in DeleteMgr.py, StageOutMgr.py and StageInMgr.py.
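
One possible shape for the suggestion above is a single shared tuple that all three managers import. This is only a sketch: the name STAGEOUT_ATTRIBUTES and the helper describe_stage_outs are hypothetical, and the attribute list is copied from the loop quoted in this thread.

```python
# Hypothetical shared definition, e.g. in a small constants module next to the managers
STAGEOUT_ATTRIBUTES = ('phedex-node', 'command', 'storageSite', 'volume', 'protocol')


def describe_stage_outs(stage_outs):
    """Build a summary message for a list of stage-out definitions (dicts)."""
    msg = "\nThere are %s stage out definitions." % len(stage_outs)
    for stage_out in stage_outs:
        for key in STAGEOUT_ATTRIBUTES:
            # .get avoids a KeyError when an attribute is missing from a definition
            msg += "\n\t%s: %s" % (key, stage_out.get(key))
    return msg
```

Using .get here is a design choice of the sketch: a definition that lacks one of the attributes is reported with None instead of aborting the loop.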

@amaltaro
Contributor

amaltaro commented Jan 9, 2024

I am reverting these changes so that we can start the usual pre-production validation phase with what is in master, while working in parallel on the changes provided in this PR.

Here is a revert pull request: #11857 and we will be following up on further validation and codebase changes as needed.

@nhduongvn
Collaborator Author

@amaltaro, I will address your comments tomorrow

@stlammel

Yes, volume and protocol are mandatory in the site-local-config.xml stage-out entries. (Site is optional, defaulting to the same/used SITECONF site.)
Not all volumes may have a Rucio Storage Element defined.
Thanks,

  • Stephan

@nhduongvn
Collaborator Author

And I had another iteration over these changes, leaving questions/comments/suggestions along the code.

For my education, I also have 3 general questions:

  1. What is the "volume" attribute in storage.json? What is it expected to be used for?
    Stephan might provide further explanations. In the meantime you could take a look here:
    https://twiki.cern.ch/twiki/bin/viewauth/CMS/StorageDescription
    "Storage inside an entity will be treated by CMS as a unit. It can span across one or more grid Storage Elements (SE). The entity is defined by site name (potentially sub-site name) and volume name"
  2. What is a "chained" rule/prefix?
    In simple terms, "chained" rules (there is no chained prefix) are applied iteratively to obtain the complete translation of an LFN to a PFN. The translation crawls back to the base (root) rule. You can take a look at this:
    Fix chaining in rules and unify prefix and rules of protocols cms-sw/cmssw#42530 (comment)
    esp. Fix chaining in rules and unify prefix and rules of protocols cms-sw/cmssw#42530 (comment)
  3. What is the rationale for calling it Rucio File Catalog? I assume this catalog is actually decoupled from Rucio, isn't it?
    Because this file catalog is used to provide the relevant information from the new storage description for Rucio. It has some methods specific to this new Rucio storage description, such as rseName.
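
As a rough illustration of how a chained rule could crawl back to a base rule, here is a minimal sketch. The protocol table, the host names and the exact chaining semantics are assumptions made for the example; in particular it uses Python-style \1 back-references and is not copied from a real storage.json or from RucioFileCatalog.py.

```python
import re

# Hypothetical protocol table: "davs" has a rule chained to the base protocol "pnfs"
PROTOCOLS = {
    "pnfs": {"rules": [{"lfn": r"/+store/(.*)",
                        "pfn": r"/pnfs/example.org/cms/store/\1"}]},
    "davs": {"rules": [{"lfn": r"/+(.*)",
                        "pfn": r"davs://dcache.example.org:2880/\1",
                        "chain": "pnfs"}]},
}


def lfn_to_pfn(protocol, lfn, protocols=PROTOCOLS, depth=0):
    """Translate an LFN, resolving any chained protocol first (assumed semantics)."""
    if depth > 10:
        raise ValueError("chain too deep, possible loop between protocols")
    for rule in protocols[protocol]["rules"]:
        path = lfn
        if "chain" in rule:
            # Crawl back towards the base rule before applying this one
            path = lfn_to_pfn(rule["chain"], lfn, protocols, depth + 1)
            if path is None:
                continue
        match = re.fullmatch(rule["lfn"], path)
        if match:
            return match.expand(rule["pfn"])
    return None
```

With this table, translating /store/data/file.root with the davs protocol first yields the pnfs path from the base rule, and the davs rule is then applied to that intermediate result.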

@nhduongvn
Collaborator Author

@amaltaro I have finished your first round of review. Do you want to make further comments?

@nhduongvn
Collaborator Author

Hi @nhduongvn @todor-ivanov, thank you very much for following this development through and converging on the final product.

Given how deep and impactful these changes are, I think it deserved a second person review before merging it into master. Said that, I would like to ask you @todor-ivanov to promptly follow this up with a new pre-release of WMAgent and running a large-scale test in most (or all!) of the T1 and T2 sites that we use for central production. Can you please plan to work on that this week?

I still have to look into the document that Duong shared a few days ago. I also plan to review this PR by tomorrow, even though it has already been merged, I suspect there are a few things that we will have to have another iteration.

If setting up tests for all sites is too big a load, we could focus on T2_DE_DESY, which has chained rules in its stage-out protocols
https://gitlab.cern.ch/SITECONF/T2_DE_DESY/-/blob/master/JobConfig/site-local-config.xml?ref_type=heads#L27
and T1_DE_KIT for the sub-site case

@amaltaro
Contributor

@nhduongvn thank you for suggesting these 2 sites for specific tests. We should definitely test both:

  • chained rules (T2_DE_DESY)
  • sub-site (T1_DE_KIT)

in addition to running large-scale tests everywhere used for central production.

Concerning your reply in this comment: #11816 (comment)
and reading the StorageDescription twiki that you linked. I have a question on this sentence:
"""
Data access and stage-out specifications in site-local-config.xml will then point to an ordered list of one or more storage entities (and protocol) in the new storage.json file.
"""
isn't it implying that we will keep using both site-local-config.xml AND storage.json? Apologies if I am adding confusion here, but I would love it if you or @stlammel could clarify.

Perhaps a simple diagram in that twiki on the connection/flow of files for stage out resolution would be extremely helpful.

@nhduongvn
Collaborator Author

nhduongvn commented Jan 11, 2024

@nhduongvn thank you for suggesting these 2 sites for specific tests. We should definitely test both:

  • chained rules (T2_DE_DESY)
  • sub-site (T1_DE_KIT)

in addition to running large-scale tests everywhere used for central production.

Concerning your reply in this comment: #11816 (comment) and reading the StorageDescription twiki that you linked. I have a question on this sentence: """ Data access and stage-out specifications in site-local-config.xml will then point to an ordered list of one or more storage entities (and protocol) in the new storage.json file. """ isn't it implying that we will keep using both site-local-config.xml AND storage.json? Apologies if I am adding confusion here, but I would love it if you or @stlammel could clarify.

Yes, both site-local-config.xml AND storage.json will be used to set up the stage out. This sentence means that the entries in the <stage-out> block inside site-local-config.xml point to the corresponding parameters (or information) in storage.json for stage out (the path translation is the core information). Those are what "storage entities" means in the above sentence. The setup is exactly as in the old mechanism, where <local-stage-out> and <fall-back-stage-out> in site-local-config.xml are used to find the detailed information inside storage.xml for stage out. In short, <stage-out> replaces <local-stage-out> and <fall-back-stage-out>, while storage.json replaces storage.xml.
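
The relation between the two files can be sketched as a lookup: each stage-out entry names a site, volume and protocol, and storage.json supplies the matching protocol definition. Everything below (site name, volume name, host, field names and the prefix-only translation) is a hypothetical, heavily trimmed example and not the real file schema.

```python
import json

# Hypothetical, trimmed storage.json content for a single site
STORAGE_JSON = json.loads("""
[
  {"site": "T9_XX_Example", "volume": "Example_Disk", "rse": "T9_XX_Example",
   "protocols": [{"protocol": "WebDAV", "prefix": "davs://se.example.org:2880/cms"}]}
]
""")

# Hypothetical parsed <stage-out> entries from site-local-config.xml, in order
STAGE_OUTS = [
    {"storageSite": "T9_XX_Example", "volume": "Example_Disk",
     "protocol": "WebDAV", "command": "gfal2"},
]


def find_protocol(stage_out, storage):
    """Return the storage.json protocol block referenced by one stage-out entry."""
    for entity in storage:
        if entity["site"] == stage_out["storageSite"] and entity["volume"] == stage_out["volume"]:
            for proto in entity["protocols"]:
                if proto["protocol"] == stage_out["protocol"]:
                    return proto
    return None


def lfn_to_pfn(lfn, proto):
    """Prefix-style translation only; rule-based protocols are out of scope here."""
    return proto["prefix"] + lfn
```

The command stays with the stage-out entry, while the path translation comes from the protocol block found in the storage description.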

Perhaps a simple diagram in that twiki on the connection/flow of files for stage out resolution would be extremely helpful.

I will add a diagram to my Google doc.

@stlammel

Hallo Alan,
storage.json is organized by volumes of a site. The RSE is just an attribute. To keep the specification in site-local-config.xml the same for reading and writing, we need to use the volume (as not all read volumes are RSEs).
Thanks,
cheers, Stephan

@stlammel

Yes, both site-local-config.xml and storage.json will be used. The first provides the list of stage-out options (site/volume/protocol) and the latter holds the definition/specification of the referenced protocol of the volume and site.
Thanks,

  • Stephan

@nhduongvn
Collaborator Author

nhduongvn commented Jan 12, 2024

I added the data flow drawing to the doc:
https://docs.google.com/document/d/1YAqaX6cdMi0cnuTY6372FkIJGL4JAkgG-O5ufCxuyVs/edit#
As you can see, the stage out needs inputs from both site-local-config.xml (command, option) and storage.json, which is why I use a variable stageOuts_rfcs to combine those inputs. The _rfcs part mainly deals with the path translation, while the stageOuts part provides the rest of the inputs.

@nhduongvn
Collaborator Author

@todor-ivanov , @amaltaro , Hi Todor and Alan, do you have further comments?

@nhduongvn
Collaborator Author

Hi @todor-ivanov and @amaltaro , I made new commits:
https://github.com/nhduongvn/WMCore/tree/stageOutUsingStorageJson_test_b927
but Jenkins does not launch automatically. Is it because of "Pull request closed"?

@amaltaro
Contributor

@nhduongvn Hi Duong, yes, apologies for not making this comment yesterday. Would you mind opening a new pull request? I don't think any of your changes will show in this PR, given that it has already been "merged".

@nhduongvn
Collaborator Author

nhduongvn commented Jan 19, 2024 via email

amaltaro pushed a commit that referenced this pull request Feb 29, 2024
…ut (after reverting merging #11816 to master) (#11869)

* stage out implementation for the new Rucio storage description (storage.json)

* update stage out implementation for the new Rucio storage description (storage.json): chained rules, add more tests

* unit tests for the new stage out implementation (storage.json)

* reply to Alan reviews after unmerged from master, fix bug when there are missing attributes in stage out (happened in this loop: for stageOut in self.stageOuts)

* Kenjy first review

* unit tests: remove bypassImpl

* Alan review after Kenyi test: polish logging, messages etc. , pylint, unit tests
Successfully merging this pull request may close these issues.

Abandon storage.xml in favor of storage.json for stage-in and stage-out
6 participants