-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Abandon storage.xml in favor of storage.json for stage-in and stage-out (final version) #11816
Abandon storage.xml in favor of storage.json for stage-in and stage-out (final version) #11816
Conversation
Jenkins results:
|
hi @nhduongvn , In this debug PR you are missing a commit: I believe things were working up to this point, including the commit I just pointed to. |
Hi @todor-ivanov, ok I can go further to the commit 274609a |
Jenkins results:
|
Jenkins results:
|
hi @nhduongvn, I could not clean the whole database as I promised in a private chat, The agents still does not support it. I had to completely wipe out the whole agent and repatch it with the current PR. Here are the two workflows that I have submitted for the current test: |
Hi @todor-ivanov , it is not a good sign! I think the tests will fail again with the same |
Hi @nhduongvn I am more optimistic this time. one of the workflows already completed.
You may want to take a look at this one. And do not worry about working on any of the files under |
Hi @todor-ivanov ,
|
Hi @todor-ivanov , first of all, good news is that the workflow tests complete successfully!
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
Thanks @nhduongvn it is indeed good we have those workflows succeeding, there are few final steps which need to be done in order get this issue closed.
Most of those steps are basically for us to keep a consistent development process and to follow the already established |
Hi @todor-ivanov , yes, I will addressing these items. However, I do not know how to do this since in each commit I made, it can have both code changes and tests. Could you provide some ideas how? Finally we should squash all the Commits in only two commits: |
hi @nhduongvn I know it is a bad idea to reorder commits when there is overlapping work done on both - code and unit tests in the same commit, so I'd not suggest doing it. but for most of those (modulo one commit) we can live by squashing them together with the code commits and only the last 4 should go into the unit tests commit. (The very last one for merging the master into your working branch should be reverted - this will happen automatically when we merge your PR and won't mess up the branch's history). So here is the full plan, which I suggest:
If you are not around during the holidays, I can do it myself. I'd be happy to merge this PR and close this issue in 2023, |
Hi @todor-ivanov , yes you could go ahead to merge this PR. I am not familiar with these steps and can risk to mess things up. Thank you |
Hi @nhduongvn I cannot merge this the way how it is right now. And it turned out I cannot squash those commits myself because Github still does not support transfer of PR ownership. So I cannot overtake neither this PR nor the original one. The only way I could implement the plan listed in my previous comment is to fork from this PR and create yet another brand new one (already 3rd in the row on fixing #11703), but I'd like to avoid fixing issues with too many copies of the same PR. Could you give it a try and reorganize those commits yourself, either here or in the original PR #11790 ? |
Hi @todor-ivanov, yes I can give it a try probably tomorrow. |
7048c17
to
38c4591
Compare
Jenkins results:
|
38c4591
to
7048c17
Compare
Jenkins results:
|
… (storage.json): chained rules, add more tests
7048c17
to
4268fd3
Compare
Hi @todor-ivanov , it works like a charm after I followed your suggestion here: |
Jenkins results:
|
Thanks @nhduongvn @amaltaro This PR is ready to go. The failing Unit test seems to be not related to the code changes here. Please take a final look (at least on the Workqueue unit test) and if you agree with my opinion, either you or me can push the merge button. |
Hi @todor-ivanov, |
Thanks @nhduongvn @amaltaro I am merging this PR and closing the issue. |
Hi @nhduongvn @todor-ivanov, thank you very much for following this development through and converging on the final product. Given how deep and impactful these changes are, I think it deserved a second person review before merging it into master. Said that, I would like to ask you @todor-ivanov to promptly follow this up with a new pre-release of WMAgent and running a large-scale test in most (or all!) of the T1 and T2 sites that we use for central production. Can you please plan to work on that this week? I still have to look into the document that Duong shared a few days ago. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left an incomplete review in this PR, as there are tons of changes and I need to switch context. This review is mostly for my own record and for a potential follow up PR.
Depending on how the remaining review goes, we might have to fallback these changes such that we can resume upgrading central services and WMAgent while still refining these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I had another iteration over these changes, leaving questions/comments/suggestions along the code.
For my education, I also have 3 general questions:
- What is the "volume" attribute in storage.json? What is it expected to be used for?
- What is a "chained" rule/prefix?
- What is the rationale for calling it Rucio File Catalog? I assume this catalog is actually decoupled from Rucio, isn't it?
def fallbackStageOut(self, lfn, localPfn, fbParams, checksums): | ||
|
||
|
||
def stageOut(self, lfn, localPfn, checksums, stageOut_rfc=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again for my education, my understanding is that there is no more definition of local and fallback stage out. Is that correct? I guess it's all part of this stageOut
method now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly
try: | ||
delManager.deletePFN(pfn, lfn, command) | ||
except StageOutFailure as ex: | ||
msg = "Failed to cleanup staged out file after error:" | ||
msg += " %s\n%s" % (lfn, str(ex)) | ||
logging.error(msg) | ||
|
||
def searchTFC(self, lfn): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this one been superseeded by StageOutMgr.searchRFC
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
|
||
msg = "\nThere are %s stage out definitions." % len(self.stageOuts) | ||
for stageOut in self.stageOuts: | ||
for k in ['phedex-node','command','storageSite','volume','protocol']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably define these attributes as constants somewhere, given that they are used in multiple places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are used three times in DeleteMgr.py, StageOutMrg.py and StageInMgr.py
I am reverting these changes such that we can start the usual pre-production validation phase with what is in master, while working in parallel with the changes provided in this PR. Here is a revert pull request: #11857 and we will be following up on further validation and codebase changes as needed. |
@amaltaro, I will address your comments tomorrow |
Yes, volume and protocol are mandatory in the site-local-config.xml stage-out entries. (Site is optional, defaulting to the same/used SITECONF site.)
|
|
@amaltaro I have finished your first round of review. Do you want to make further comments? |
If setting tests for all sites are big loads, we could focus on T2_DE_DESY which has chained rules in stage out protocols |
@nhduongvn thank you for suggesting these 2 sites for specific tests. We should definitely test both:
in addition to running large-scale tests everywhere used for central production. Concerning your reply in this comment: #11816 (comment) Perhaps a simple diagram in that twiki on the connection/flow of files for stage out resolution would be extremely helpful. |
Yes, both site-local-config.xml AND storage.json will be used to set up the stage out. This sentence means that the entries in
I will add a diagram to my google doc |
Hallo Alan, |
Yes, both site-local-config.xml and storage.json will be used. The first to provide the list of stage-out options (site/volume/protocol) and the later having the definition/specification of the referenced protocol of the volume and site.
|
I added the data flow drawing to the doc |
@todor-ivanov , @amaltaro , Hi Todor and Alan, do you have further comments? |
Hi @todor-ivanov and @amaltaro , I made new commits: |
@nhduongvn Hi Duong, yes, apologies for not making this comment yesterday. Would you mind opening a new pull request? I don't think any of your changes will show in this PR, given that it has already been "merged". |
Yes, I added this to my local copy here. It is too late to update github
now since I already pushed and made a pull request. Will be included in the
next push.
…On Fri, Jan 19, 2024 at 3:36 PM Alan Malta Rodrigues < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/python/WMCore/Storage/DeleteMgr.py
<#11816 (comment)>:
> """
+ if not self.overrideConf:
+ from WMCore.Storage.StageOutMgr import searchRFC
And the previous implementation (master) apparently has a duplication of
that function in 2 different modules, which isn't great as well.
Could you please add a comment like # FIXME saying that there is a
circular import and that's why the import is placed in the middle of the
module? At some point we have to refactor and properly separate it.
—
Reply to this email directly, view it on GitHub
<#11816 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADNQFJDSPM32AMJBWVCLK33YPLKLHAVCNFSM6AAAAABAXD4YTGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTQMZTGY2DGMRXGE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
…ut (after reverting merging #11816 to master) (#11869) * stage out implementation for the new Rucio storage description (storage.json) * update stage out implementation for the new Rucio storage description (storage.json): chained rules, add more tests * unit tests for the new stage out implementation (storage.json) * reply to Alan reviews after unmerged from master, fix bug when there are missing attributes in stage out (happened in this loop: for stageOut in self.stageOuts) * Kenjy first review * unit tests: remove bypassImpl * Alan review after Kenyi test: polish logging, messages etc. , pylint, unit tests
fixes #11703
Description
With the migration to Rucio, new storage descriptions are adopted using the storage description file, storage.json. This file has similar role as the storage.xml used in PhEDEx during Run 1 and 2. A new block in the site configuration, site-local-config.xml contains all stage out choices.
This PR proposes code adaption to the new stage out mechanism for Rucio described in the issue #11703. A more detailed descriptions of the implementation can be found in this doc. There is an attempt of implementation in PR #11790 based on
stageOutUsingStorageJson
branch which ended up to workflow test failures. The branchstageOutUsingStorageJson_test_b927
used for this PR was branched out from commit b927c95 of the branchstageOutUsingStorageJson
and cherry pick of commit 274609a of that branch.These methods are transferred from master branch without major modifications (mostly add or change docstrings and comments)
Related PRs
Debug PR #11790