Add new CMSCouch exception for Request Entity Too Large #11502
Conversation
Self-reminder: I am temporarily applying this patch to vocms0256.
Still no job hitting this error on vocms0256 since March 20th. I see submit5 has these large documents, but I am sure that if I patch the component and restart the JobAccountant, the error will go away (and we won't be able to debug #11516).
Thanks to German, we managed to run a few T0 replays while checking and improving these changes. In addition to handling and reporting this error in a better way, the component will now parse the FJR and reset it whenever a large FJR is found. We also leave a trace in the logs of which jobid it happened for.
Hi @amaltaro, thanks for this PR!
I left a few comments inline. Please take a look.
```diff
@@ -235,8 +246,8 @@ def queue(self, doc, timestamp=False, viewlist=None, callback=None):
         if timestamp:
             self.timestamp(doc, timestamp)
         # TODO: Thread this off so that it's non blocking...
-        if len(self._queue) >= self._queue_size:
+        if self.getQueueSize() >= self._queue_size:
             print('queue larger than %s records, committing' % self._queue_size)
```
I do not see the benefit of calling a dedicated method just to ask for the `len` of an object.
+1
Yes, provided that you get the size of the correct object, and not `self.queue` as one might think. I am reverting it though.
As I look at its usage in the ChangeState module, I actually have to say that having a specific method to retrieve the size of a "private" attribute is a much better implementation, e.g.:
`self.jobsdatabase.getQueueSize()`
instead of the error-prone line:
`len(self.jobsdatabase._queue)`
Unless you have a strong reason against it, I am keeping this code around.
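For context, a minimal sketch of the accessor under discussion, assuming `_queue` is a plain list on the CMSCouch `Database` class (the method body is inferred from the thread, not copied from the diff):

```python
class Database:
    """Sketch of the CMSCouch Database class, reduced to the parts under discussion."""
    def __init__(self, queue_size=100):
        self._queue = []
        self._queue_size = queue_size

    def getQueueSize(self):
        """Return the number of documents currently waiting in the bulk-commit queue."""
        return len(self._queue)
```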
"""Stringify the error""" | ||
errorMsg = "" | ||
if self.type == "CouchError": | ||
errorMsg += f"NEW ERROR TYPE/STATUS! UPDATE CMSCOUCH.PY!: Status: {self.status}\n" |
Sorry if I have missed the context from the code somewhere, but this message is quite non-descriptive. That by itself is a bad sign, because whoever is about to read the logs would get even more confused. We shouldn't rely on knowing the code context in order to interpret an error message while debugging.
This is really a message for developers, so the content should be good enough to say what is supposed to happen. I can make it slightly better though, without adding too much noise/verbosity.
```python
errMsg = f"The 'fwjr' attribute for jobid: {doc.get('jobid')} will "
errMsg += "be empty because it is too large."
logging.warning(errMsg)
doc['fwjr'] = dict()
```
Well, one solution would be to simply drop the document of course, but isn't there any option of trying to compress it somehow, before we decide to completely get rid of it?
That was my original suggestion too: why not apply gzip to the FWJR part entirely for all docs and keep only the metadata in unzipped form? Doing it this way we get another benefit: the size of the database will be much smaller and all docs will be in the same format, e.g. the metadata used by views stays unzipped, while the actual FWJR content is zipped. On the client side, whoever needs the FWJR part will unzip it, etc.
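To illustrate this proposal, a hypothetical sketch (not code from this PR; `compressFWJR` and `decompressFWJR` are made-up helper names) of gzip-compressing the FWJR payload while keeping it representable as a JSON string:

```python
import base64
import gzip
import json

def compressFWJR(doc):
    """Replace the 'fwjr' dict with its gzip-compressed, base64-encoded JSON form."""
    if isinstance(doc.get('fwjr'), dict):
        raw = json.dumps(doc['fwjr']).encode('utf-8')
        # base64 keeps the compressed bytes valid as JSON string content
        doc['fwjr'] = base64.b64encode(gzip.compress(raw)).decode('ascii')
    return doc

def decompressFWJR(doc):
    """Client-side counterpart: restore the original 'fwjr' dictionary."""
    if isinstance(doc.get('fwjr'), str):
        raw = gzip.decompress(base64.b64decode(doc['fwjr']))
        doc['fwjr'] = json.loads(raw)
    return doc
```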
I am not sure who actually consumes this `fwjr` content and whether it is even consumed by any of the CouchDB views. If it is, then we cannot have compressed data stored in CouchDB.
I'd suggest to investigate and make this change through a new GH issue/PR, instead of refactoring things in here.
Alan, the proposed changes do not guarantee that after your assignment `doc['fwjr'] = dict()` the doc size will be below `sizeLimit`. Since I do not know the structure of the doc and whether it has other portions besides `fwjr`, I suggest that you add a new check of the doc after you make this assignment; otherwise the code can still leak docs whose size is larger than `sizeLimit`.
I understand your point and I agree that nothing guarantees that the document will be smaller than 8MB.
Looking into the `recordInCouch` method of the ChangeState module, my main concern goes towards the `mask` attribute.
Nonetheless, we do not have enough information and know-how to properly deal with that document in case other fields are still larger than the configured document size limit in CouchDB. Having said that, I'd rather see an explicit error/exception in that component, such that we are able to properly debug it and come up with the best way to address it.
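A minimal sketch of the re-check being requested here (an illustration only; it assumes `shrinkLargeFJR` walks the database's internal bulk-commit queue, which the diff does not show):

```python
import json
import logging
import sys

def shrinkLargeFJR(database, sizeLimit=8e6):
    """Reset the 'fwjr' field of queued documents exceeding sizeLimit, then re-check."""
    for doc in database._queue:  # assumption: the bulk-commit queue holds plain dicts
        if sys.getsizeof(json.dumps(doc)) > sizeLimit:
            if 'fwjr' in doc:
                logging.warning("Resetting 'fwjr' for jobid: %s because it is too large",
                                doc.get('jobid'))
                doc['fwjr'] = dict()
            # second check: other fields (e.g. 'mask') could still keep the doc over the limit
            if sys.getsizeof(json.dumps(doc)) > sizeLimit:
                raise RuntimeError(f"Document for jobid {doc.get('jobid')} still exceeds "
                                   f"{sizeLimit} bytes")
```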
```python
msg = "Failed to commit bulk of framework job report to CouchDB."
msg += f" Error: {str(exc)}"
logging.warning(msg)
shrinkLargeFJR(self.fwjrdatabase, 8e6)
```
The presence of just `8e6` as an argument here is quite unreadable; a better way would be to pass it as `sizeLimit=8e6`, which is actually the default as well...
Why is it hard-coded anyway? Can we make the size parameter configurable?
Yes, it will be configurable in the next commit.
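A sketch of what that could look like, based on a later diff in this PR (the attribute name `fwjrLimitSize` appears there; everything else is illustrative):

```python
# in the WMAgent configuration (sketch):
config.JobStateMachine.fwjrLimitSize = 8 * 1000**2  # bytes; CouchDB's default max_document_size

# in the component, falling back to the default when the parameter is absent:
fwjrLimitSize = getattr(config.JobStateMachine, 'fwjrLimitSize', 8 * 1000**2)
```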
```python
logging.warning(msg)
shrinkLargeFJR(self.fwjrdatabase, 8e6)
# now all the documents should fit in
self.fwjrdatabase.commit(callback=discardConflictingDocument)
```
Where is the guarantee that the second try to commit the document (inside the exception handling) will actually succeed? Moreover, you do have the option to raise the same exception in cases where the document outgrows the size limit for a reason other than the FWJR... Simply calling the `shrinkLargeFJR` function does not lead to an immediate shrink of the document size in 100% of the cases.
Totally agree here too; we need another check in the code after shrinking the document.
I just answered a similar question above. AFAIK, only the `fwjr` key can be as large as megabytes, hence this is the only key that we shrink in `shrinkLargeFJR`. If it is not, then we will get to know it in the future because docs will keep failing to be inserted into the database.
Having said that, at the moment I see no reason for a second check, and if some exception happens, let it go all the way upstream.
```python
def __init__(self, reason, data, result, status):
    # calculate the size of this JSON serialized object
    docSize = sys.getsizeof(json.dumps(data))
    errorMsg = f"Document has {docSize} bytes and it's too large to be accepted by CouchDB."
```
Here it is insufficient to print that the document has a certain size; you need to specify the CouchDB threshold as well, otherwise you can't clearly define what "too large" means.
CouchDB does not tell me what the configured limit is, so I have no way to say that in the error message either. The best I can do is to say something like: "CouchDB limit is defined by the couchdb.max_document_size setting (default 8M bytes)".
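A sketch of how the amended message could read, building on the snippet above (illustrative wording only):

```python
errorMsg = (f"Document has {docSize} bytes and is too large to be accepted by CouchDB. "
            "The limit is defined by the couchdb.max_document_size setting (default 8MB).")
```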
```python
if sys.getsizeof(json.dumps(doc)) > sizeLimit:
    if 'fwjr' in doc:
        errMsg = f"The 'fwjr' attribute for jobid: {doc.get('jobid')} will "
        errMsg += "be empty because it is too large."
```
I suggest to clearly define what "too large" means, i.e. you must print the CouchDB limit as well, in addition to the document size.
```python
        doc['fwjr'] = dict()
    else:
        errMsg = f"Found a large Couch document for job id: {doc.get('jobid')}, "
        errMsg += "but it does not have a 'fwjr' field. Contact a developer!"
```
What is the point of "Contact a developer", i.e. what can a developer do about it?
This is a red flag for us, saying that something unexpected happened. A job report is meant to have a `fwjr` field!
Anyhow, I don't mind rephrasing it.
```diff
@@ -89,6 +111,8 @@ def __init__(self, config, couchDbName=None):
         self.jsumdatabase = None
         self.statsumdatabase = None

+        # max total number of documents to be committed in the same Couch operation
+        self.maxBulkCommit = 250
```
Why is this parameter hard-coded? Can we put it into the configuration?
Updating.
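For illustration, a sketch of how such a cap could be applied when committing (hypothetical helper; the actual chunking logic is not shown in the diff):

```python
def commitInChunks(database, docs, maxBulkCommit=250):
    """Queue and commit documents in batches of at most maxBulkCommit."""
    for start in range(0, len(docs), maxBulkCommit):
        for doc in docs[start:start + maxBulkCommit]:
            database.queue(doc)
        database.commit()
```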
```python
msg += f" Details: {str(exc)}"
logging.warning(msg)
shrinkLargeFJR(self.fwjrdatabase, 8e6)
# now all the documents should fit in
```
You must add a second check here to confirm that all docs are below the CouchDB max doc limit; otherwise you cannot guarantee that after shrinkage the document will still fit.
I don't understand this comment. The `shrinkLargeFJR` function will actually parse all the documents in the database queue and shrink all those that are larger than the limit. I don't see why a second check would be required here. Please clarify.
See my comment above in this function, which requires a second check of the doc size after the `doc['fwjr'] = dict()` assignment.
```python
logging.warning(msg)
shrinkLargeFJR(self.fwjrdatabase, 8e6)
# now all the documents should fit in
self.fwjrdatabase.commit(callback=discardConflictingDocument)
```
Here you probably need another try/except to produce a clear error message.
Yes, it could be done. But we don't expect to hit that error again, so I would simply not complicate the code and let any exception just propagate upstream.
```python
try:
    self.fwjrdatabase.commit(callback=discardConflictingDocument)
except CouchRequestTooLargeError as exc:
    msg = "Failed to commit bulk of framework job report to CouchDB."
```
I doubt CouchDB has rollback functionality; in other words, the behavior of commit here is unpredictable. For instance, if the large document is the first one to be inserted, the entire chunk will fail; but if it is the last one, I think the first documents will be successfully injected, no?
I just chatted with CouchDB experts and they say that the behavior of `_bulk_docs` is that each document can succeed or fail independently.
In other words, the correct way to know what succeeded or not is by parsing the response body and identifying which documents made it; more information in: https://docs.couchdb.org/en/stable/api/database/bulk-api.html#updating-documents-in-bulk
This goes beyond the scope of this PR though, and this PR does not change the current behavior. We need to address that in a different GH issue.
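For reference, a sketch of such response parsing (illustrative only; per the CouchDB documentation linked above, `_bulk_docs` returns one JSON object per document, with an "error" key on failures):

```python
import logging

def reportBulkFailures(response):
    """Return the entries of a CouchDB _bulk_docs response that failed to insert."""
    failed = [row for row in response if 'error' in row]
    for row in failed:
        logging.warning("Document %s failed: %s (%s)",
                        row.get('id'), row.get('error'), row.get('reason'))
    return failed
```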
Since you already got this info, I suggest adding an appropriate comment to the code base so that we know what we'll need to address in a different PR, something like:

```python
# TODO: CouchDB bulk insert may fail at any given document, and further changes are required
# to test such conditions and act accordingly on document insert failure, for further details see
# https://docs.couchdb.org/en/stable/api/database/bulk-api.html#updating-documents-in-bulk
```
Agreed! It seems to be a widespread lack of checks in this module. Updating in my next commit.
Todor, Valentin, thanks for this review. I am going to go through those soon and update this PR accordingly.
Todor, Valentin, my 5th commit addresses most of your concerns. Others were followed up in the comments themselves.
```diff
@@ -101,6 +125,7 @@ def __init__(self, config, couchDbName=None):
         self.getWorkflowSpecDAO = self.daofactory("Workflow.GetSpecAndNameFromTask")

         self.maxUploadedInputFiles = getattr(self.config.JobStateMachine, 'maxFWJRInputFiles', 1000)
+        self.fwjrLimitSize = getattr(self.config.JobStateMachine, 'fwjrLimitSize', 8 * 1000**2)
```
We do not know how the CouchDB code checks document size, therefore to avoid an edge case I suggest changing the default value to `8 * 1000**2 - 1`.
I understand it to be the same mechanism as we adopted here. From their documentation: "serialized Erlang representation of the JSON document body".
Anyhow, we will have this answer soon, from a T0 replay.
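One side note on the client-side measurement: `sys.getsizeof` on the serialized string includes Python object overhead, so it slightly overestimates the raw payload size and thus errs on the safe side of the CouchDB limit. A quick illustration:

```python
import json
import sys

doc = {'jobid': 12345, 'fwjr': {'steps': ['x' * 1000]}}
serialized = json.dumps(doc)
print(len(serialized.encode('utf-8')))  # raw payload size in bytes
print(sys.getsizeof(serialized))        # slightly larger: includes str object overhead
```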
I just created another issue for parsing the output of a bulk operation in CouchDB.
rename variable
Always provide status code in the error message; remake stringification
Skip jobs with too large (>=8MB) FJR
Shrink fwjr whenever we fail to commit documents
Set object attr before using it
replace Error by Details
apply Todor and Valentin suggestions
move maxBulkCommitDocs attribute under JobStateMachine section
Added ToDo to check response object from Couch commit
test this please
Thanks for the review.
Fixes #11342
Status
ready
Description
The current changes will ensure that the whole document is no longer dumped in the component log when the error is reported.
In addition to that, we JSON serialize it and calculate the total object size, which is then added to the error message together with the result attribute, e.g.:
`result=b'{"error":"document_too_large","reason":"4799572-0"}\n'`
Lastly, whenever it happens, we look into the database queue and reset the `fwjr` field of each of the jobs that exceed the size limit. Right now the limit is ~8MB, which matches the limit of CouchDB itself. It might be a good idea to actually create a component configuration parameter for that.
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
Reference: https://docs.couchdb.org/en/3.2.2-docs/api/basics.html?highlight=Request%20Entity%20Too%20Large#http-status-codes