Skip to content
fstagni edited this page Apr 7, 2015 · 4 revisions

Error propagation

Problem with current system

We have several problems with the current S_ERROR(msg) system :

  1. Error propagation: a low level error is most of the time not propagated to the top, where we can debug it. We lose information.

  2. Error comparison: so far, we do strings comparison. Fine. But these are not standardized, and hence we are sometimes badly handling the error.

  3. Error traceability: because we are so bad at propagation and fail to have a clear log policy, it is not always possible to know how did we get to this error.

As an example:

My-script-do-something:
   Lvl1.doSomething()
     Lvl2.doSomething()
       Lvl3.doSomething()
         lowLvlLib.doSomething()

With the current system, if the lowLvlLib encounter some errors, and gives very technical details, the Lvl3 or Lvl2 will replace this error with something like “Could not perform doSomething”. This is also due to the fact that we don’t have a strict separation and policy between error handling and log printing. But as a matter of a fact, the user/service/agent that calls Lvl1.doSomething just knows that we could not perform the operation, but has no idea why. Very difficult to debug, especially if you are trying to debug the library with the developers of that tool. More over, it is also very possible that Lvl1 has a test that would look like

if res[‘Message’] == ‘could not perform doSomething’:
  callFailoverSolution()

So returning the cryptic technical error message is not even an option.

Proposition

I still rest my case: the proper use of exceptions would solve the problem. Since, the use of exception (and hence object oriented error handling approach) is not accepted, one has to fall back on traditional procedural error handling, and the most spread and successful method for that is error numbers.

When a low level error happens, we return a standard error number, to which is associated a human readable message (e.g. EPERM -> Operation not permitted). It turns out that we can even afford the luxury to return the low level error message itself on top of the number.

Having an error number allows easier and safer comparison (numbers are not case sensitive, less typo prone, many cases are standardized already AND returned by the externals we use, such as gfal2). Because these numbers are standardized at high level, we don’t need to replace technical messages by generic messages providing that we keep propagating the error number. Hence, better and easier debugging.

The difficulty comes from the transition. Obviously, we cannot change all the DIRAC code at once to adopt the new error system, so we need some sort of compatibility layer. The operations we are currently doing on S_ERROR(msg) are

  • ‘blabla’ in res[‘Message’]
  • ‘blabla’ == res[‘Message’]

So if res[‘Message’] is supposed to be replaced by an error number and a technical message, the only way is to have an object that offers that type of interface. I called it DError. An extra feature of the DError object is that it keeps the callstack on creation. If the object is passed to gLogger.debug, the callstack will be printed, which is extremely useful to debug.

Essentially, I re-implemented the exceptions, just that we need to propagate it ourselves with the usual

If not res[‘OK’]:
  return res

What it looks like

Generating the error:

Old : return S_ERROR(“A message, generally high level”)

New : return S_ERROR( DError(errNumber, ‘technical message’) ). The errNumber, since standardized, has an associated message that often looks like what we would return before (e.g. No such file or directory), hence the compatibility.

So, not too difficult change!

Handling the error:

The old method still works against the DError object, and this is a key point. But if we want to move forward, we need to compare it differently, with a utility method that would look like

cmpError(res[‘Message’], errNb)

res[‘Message’] is obviously the error we get, and errNb is the candidate error number that we are testing against.

Under the hood, this function behaves differently depending on the nature of res[‘Message’]:

  • If it is a DError object, then we do a integer comparison, which is the “ideal” case
  • If it is an integer, same, we do direct integer comparison (should not happen, but just in case…)
  • If it is a string
    • The message is compared to the exact standardized one (e.g. ‘No such file or directory’ for ENOENT)
    • The message is compared to ‘compatible’ error messages. Since we failed at standardizing error messages, I give the possibility to have several matching messages for one error (e.g. “File does not exist” for ENOENT)

One objection can be that

 cmpError(res[‘Message’], ENOENT)

is not as self explanatory as

‘No such file or directory’ in res[‘Message’]

I agree. However, readability should not take over functionality, and adding a comment never killed anyone.

Conclusion

This proposition would allow us:

  • To propagate low level error messages at the top
  • To have a more flexible but safest error comparison system, that should naturally drives us to more standardization
  • To have the traceability of errors and understand how they were created
  • To have a transparent transition between the old and the new system

First implementation is visible here : https://github.com/chaen/DIRAC/blob/rel-v6r13_FEAT_errorHandling/Core/Utilities/DIRACError.py

Clone this wiki locally