Eliminate racing conditions in DBS data injection procedure #11106
While talking with Alan and explaining the issue with #11107, I made a proposal for how to temporarily solve this problem. The proposal is simple: to avoid race conditions between independent HTTP requests when we inject multiple blocks, we can make the injection a two-step procedure:

1. The first step ensures that all common parts associated with a new block are either injected or already present in the DB.
2. After this step we can inject as many blocks in parallel as we wish, since all of their common meta-data is already present in the DB.

This is easier to implement in WMCore/WMAgent and can be used as a temporary solution to the problem; a sketch of this two-step flow is shown below.
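A minimal sketch of what this two-step flow could look like from the client side (Python, using `requests`); the base URL, the per-entity endpoints, the payload layout and the status-code handling are all assumptions for illustration, not the actual WMCore/WMAgent code:

```python
import concurrent.futures

import requests

# Placeholder DBS writer URL; adjust for the real instance.
DBS_URL = "https://cmsweb.example.com/dbs/prod/global/DBSWriter"

def ensure_common_metadata(common_calls):
    """Step 1 (serialized): make sure the shared metadata (primary dataset,
    eras, dataset, ...) is either injected or already present before any
    block is sent.  The endpoint names here are hypothetical placeholders."""
    for endpoint, payload in common_calls:
        resp = requests.post(f"{DBS_URL}/{endpoint}", json=payload)
        # An "already exists" style answer is fine here; anything else is fatal.
        if resp.status_code not in (200, 409):
            resp.raise_for_status()

def inject_block(block_payload):
    """Step 2: inject a single block; safe to run in parallel once step 1 is done."""
    resp = requests.post(f"{DBS_URL}/bulkblocks", json=block_payload)
    resp.raise_for_status()
    return block_payload["block"]["block_name"]

def two_step_injection(common_calls, block_payloads):
    ensure_common_metadata(common_calls)      # serialized: no race on shared rows
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        # parallel: each payload now only touches block/file specific rows
        return list(pool.map(inject_block, block_payloads))
```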
Thanks for this summary, Valentin.
Impact of the bug

The current `bulkblocks` API for DBS data injection contains internal race conditions independent of the DBS server implementation (Python or Go-based code). We should revisit and change how we insert data into DBS and replace the `bulkblocks` API with a new way of injecting data.

Describe the bug
The data input (JSON) for the `bulkblocks` DBS API contains both common and independent parts. For instance, if we want to inject two different blocks via parallel HTTP calls, the provided JSON for the `bulkblocks` API contains static data which is the same in both JSON documents, e.g. physics groups, data tiers, dataset access types, etc., and data which is different and independent between them, i.e. the blocks/files parts. Since each HTTP call runs its own internal database transactions, these transactions can compete to insert or obtain the ID of the static data, which leads to the race condition.
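To make the shared-versus-independent split concrete, here is a heavily trimmed, illustrative pair of payloads for two parallel `bulkblocks` calls; the field names are approximate and the values are made up:

```python
# Two parallel bulkblocks payloads for blocks of the same dataset.
# The "common" keys are identical in both requests and compete for the same
# ORACLE rows; only the block/files parts are truly independent.
common_part = {
    "physics_group_name": "Tracker",                 # shared PHYSICS_GROUPS row
    "processing_era": {"processing_version": 1},     # shared processing era
    "acquisition_era": {"acquisition_era_name": "Run2024A"},
    "primds": {"primary_ds_name": "SingleMuon", "primary_ds_type": "data"},
    "dataset": {"dataset": "/SingleMuon/Run2024A-v1/RAW",
                "data_tier_name": "RAW",
                "dataset_access_type": "VALID"},
    "dataset_conf_list": [],                         # trimmed for brevity
}

payload_block_1 = dict(common_part,
                       block={"block_name": "/SingleMuon/Run2024A-v1/RAW#0001"},
                       files=[{"logical_file_name": "/store/data/run1/file1.root"}])
payload_block_2 = dict(common_part,
                       block={"block_name": "/SingleMuon/Run2024A-v1/RAW#0002"},
                       files=[{"logical_file_name": "/store/data/run1/file2.root"}])
```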
How to reproduce it

Send multiple requests to the `bulkblocks` API of the Python or Go server with blocks/files from the same datasets. At a certain rate, we will experience an ORA-00001 error in the DBS server code, which represents a transaction error on identical data injection: we need either to insert or to obtain the ID of common fields, such as data tiers, physics groups, processed datasets, etc. The probability of the error increases with the size of the internal ORACLE table where we need to insert or check for the existence of the common attribute. For instance, the probability of this error is low for sparsely populated tables such as physics groups and data tiers, but much higher for dense tables such as processed datasets.

Currently, both the Python and the Go server have a way to retry on ORA-00001 ORACLE constraint-violation errors.
For instance, the Python codebase has the following code (see lines 467-562).
The same pattern (handling the ORA-00001 error) repeats on the following lines:
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L139,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L316,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L382,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L409,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L510,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L522,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L616,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L683,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L736,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L812,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L816,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L918,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L964,
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/business/DBSBlockInsert.py#L988.
In total, there are 14 occurrences which handle race conditions in different ways.
For some calls, this error is ignored since the code assumes the data is already in the DB; for others (as shown above), there are repeated attempts to re-insert the data; and for others there are even comments about race conditions (worded differently).
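The pattern behind these occurrences is roughly the following insert-or-fetch loop. This is a simplified, generic DB-API sketch of the technique, not the actual DBSBlockInsert.py code:

```python
def insert_or_get_id(cursor, insert_sql, select_sql, params, retries=3):
    """Insert a shared row (e.g. a processed dataset) or, if a concurrent
    request already inserted it (ORA-00001), read back the existing ID.
    This mitigates, but does not remove, the race."""
    for _ in range(retries):
        try:
            cursor.execute(insert_sql, params)
        except Exception as exc:
            # Only swallow the unique-constraint violation; anything else is fatal.
            if "ORA-00001" not in str(exc):
                raise
        cursor.execute(select_sql, params)
        row = cursor.fetchone()
        if row:
            return row[0]   # either our insert or the one that won the race
    raise RuntimeError("could not insert or resolve ID after %d retries" % retries)
```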
In the Go code, I added the following construct to handle the same issue: https://github.com/dmwm/dbs2go/blob/master/dbs/dbs.go#L610-617.
Both code bases are vulnerable to the ORA-00001 issue, which happens due to the design of the `bulkblocks` API. In order to resolve it, we either need to repeat the data insertion in the server code (as is currently done in the Python and Go based servers) or on the client side by re-injecting the data. But fundamentally it is a flaw in the design of this API.

In short, the design of the `bulkblocks` API internally contains race conditions which cannot be 100% resolved on the server side. Even though we have some protection in the code, it remains vulnerable to this issue, which may occur again if we move to a more concurrent data injection model with larger data volumes in the future.
Finally, please note that the issue occurs when there is no entry in the DBS database for the common data used by concurrent blocks/files injections. For instance, when we create a new campaign, or create a new processed dataset, and try to inject all of its blocks/files via the `bulkblocks` API, these independent HTTP calls (used by upstream workflows, e.g. WMCore, T0, CRAB) can lead to race conditions among them.

Expected behavior
The `bulkblocks` API used by the WMCore/T0/CRAB teams should be decomposed into several pieces which fall into two main categories:

- injection of the common data, which should happen before the blocks/files data injection, since the latter requires this common data;
- injection of the blocks/files separately from the common data, which can be done in parallel. Since these data are independent of each other, they can be safely injected through parallel HTTP calls, but this requires that the common data is already present in DBS.

This will require major changes to the WMCore data injection procedure and we should carefully design it, for instance the change of the data structures used by the `bulkblocks` API and the decomposition of the data injection calls. It may require revisiting the data structures of the injected data as well as the WMCore/T0/CRAB workflows. As such, further discussion will be required among all stakeholders. A rough sketch of such a decomposition is shown below.
Additional context and error message
@amaltaro , @todor-ivanov , @klannon , @dciangot , @belforte , @mapellidario , @d-ylee
This issue requires careful design and implementation for the WMCore/T0/CRAB workflows, and further discussion is required. I suggest revisiting this issue once we discuss new data injection policies, changes to the DBS databases, etc.