Improved error handling in SIGMA

Posted by Vijay Mahadevan in MOAB, SIGMA

1. Current Error Handling Model of MOAB

When an error occurs, a MOAB routine can return a value of the enum type ErrorCode (defined in src/moab/Types.hpp) to its caller. To keep track of detailed information about errors, a class Error (defined in src/moab/Error.hpp) is also used to store the corresponding error messages. Usually, some locally defined macros (e.g. ERRORR in src/io/ReadNC.cpp) call Error::set_last_error() to report a new error message before a non-success error code is returned. At any time, the user may call Core::get_last_error() to retrieve the latest error message from the Error class instance held by MOAB.
This model has some known limitations:
1) The Error class can only store the latest error message. When an error originates from a lower level in a call hierarchy, upper level callers might report different error messages. As a result, previously reported error messages get overwritten and are no longer available to the user.
2) There is no useful stack trace for the user to find out where an error first occurred, making MOAB difficult to debug. In contrast, when a PETSc routine reports an error, the user can get a full stack trace without a debugger.
Ticket 138 already raised similar issues.
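Limitation 1) can be illustrated with a minimal sketch. The names below (LastErrorStore, open_file, load_mesh) are hypothetical stand-ins, not MOAB's actual Error class or routines; only the overwrite behavior is the point.

```cpp
#include <cassert>
#include <string>

// Stand-in for MOAB's ErrorCode enum (values here are illustrative only).
enum ErrorCode { MB_SUCCESS = 0, MB_FILE_DOES_NOT_EXIST, MB_FAILURE };

// Hypothetical store that, like the current model, keeps only the latest message.
class LastErrorStore {
public:
  void set_last_error(const std::string& msg) { mLastMsg = msg; }  // overwrites the previous message
  const std::string& get_last_error() const { return mLastMsg; }

private:
  std::string mLastMsg;
};

// Lower level routine: reports the root cause.
ErrorCode open_file(LastErrorStore& err, const std::string& name) {
  assert(!name.empty());
  err.set_last_error("File not found: " + name);
  return MB_FILE_DOES_NOT_EXIST;
}

// Upper level routine: reports its own message, which replaces the root cause.
ErrorCode load_mesh(LastErrorStore& err, const std::string& name) {
  ErrorCode rval = open_file(err, name);
  if (MB_SUCCESS != rval) {
    err.set_last_error("Failed to load mesh");
    return MB_FAILURE;
  }
  return MB_SUCCESS;
}
```

After load_mesh() fails, get_last_error() only returns the upper level message; the file name reported by open_file() is gone.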

2. An Enhanced Error Handling Model

To improve the existing error handling model, Tim Tautges initially proposed the following requirements:

Design and write a new error handler class for MOAB, and have this class replace the ErrorCode return codes in most cases. The class should:
1) Be comparable to the current ErrorCode enumeration, probably using operators like != and ==
2) Store multiple “levels” of errors and strings to go with them, e.g. to return the stack of return codes/strings we’d get by erroring out deep in a call hierarchy
3) Define IO stream operators such that we can use C++ streaming IO calls to build up error messages rather than C sprintf
4) Optionally include a stack trace starting from the first non-success error
5) Be as lightweight as possible so that copying is relatively efficient and objects can be returned by value from functions
6) Have preprocessor-like functionality such that we can do something like:
error = MOAB->get_adjacencies(…)
CHECK_ERR(error, "string to pass back if error != success")
that is, this function checks the error value and returns with the specified string from the current function
7) Have embedded parallel functionality, such that if a non-root processor encounters an error, there’s an option to pass it back to the root (or for the root to gather up all the error objects) and then for the root to print the message; this must be optional, though, to avoid excessive synchronization (maybe use an asynchronous point-to-point message to the root with a known message code, and have another function that checks on the root for any messages of that type?)

Note, these requirements are still subject to change, especially after the design review.

3. Analysis and Design of the New Model

The error handling model of PETSc seems to be close to our needs (especially the nice support of stack trace), so our design will borrow some ideas from PETSc.

3.1 Return Type of MOAB Routines

According to the proposed requirements listed above, we can design a new class (e.g. ErrorInfo) to replace the ErrorCode return codes in most cases. However, it is still under discussion whether or not we should do so.

3.1.1 Return Class Object

A class object can carry more information than an ErrorCode enum value. One concern is the overhead of returning a class object, which the requirements also mention (“be as lightweight as possible so that copying them is relatively efficient”). A common mistake is to assume that instantiating objects is cheap. Of course, we may also be overestimating the cost of instantiating a class; some benchmarking tests are needed.
Another concern is that it would require changing the signatures of most MOAB routines, and might affect legacy client code.

3.1.2 Return Error Code

This approach neither breaks client code nor requires changes to existing function signatures. Regarding function return overhead, some benchmarking tests show that returning a plain ErrorCode enum value is much faster than returning a class object. PETSc also returns an error code, because it is simple and portable between languages. Since the error messages are printed out in the stack trace, there is no need to store them anywhere.
A known issue with returning ErrorCode is that a given code (e.g. MB_TAG_NOT_FOUND) is sometimes an expected, non-error condition and sometimes a real error. Therefore, printing the error message in the lowest-level function might not be the right thing to do. And in those same cases, when it is in fact an error, there may be extra information that can only come from the lower-level function. There is a different opinion on this, though. If the caller knows how to handle the non-error condition based on the context (fail with an error message being set, or return the non-error condition unchanged, to be handled by upper level routines), an ErrorCode would still suffice. Using an object to store any context explicitly is not necessary here.

3.2 ErrorInfo Class

If confirmed, this class will replace the current ErrorCode return codes in most cases. It can be returned by value from functions, and it has few data members so that object copying is relatively efficient. It also defines some useful operators, to compare with the current ErrorCode enumeration or to build error messages with C++ streaming IO calls.

3.2.1 Data Members

ErrorInfo class itself will NOT store multiple “levels” of errors and strings. Instead, a trace back error handler function will be called to print out the stack trace when an error occurs.
Proposed data members are listed below.
ErrorCode mErrorCode; // Stores current error code, can be used to compare with an ErrorCode enumeration
std::ostringstream mErrorMsg; // Stores current error message, which supports stream operators to build the string
Upper level callers might modify the returned ErrorInfo object, such that mErrorCode or mErrorMsg could be different from the original one.

3.2.2 Member Functions

Proposed member functions are listed below.
ErrorInfo(ErrorCode err_code = MB_FAILURE, const char* err_msg = ""); // Default constructor, can take error code and error message as optional parameters
ErrorInfo(const ErrorInfo& err_info); // Copy constructor
Some setter and getter functions (not listed here)

3.2.3 Supported Operators

Some operators will be overloaded for special purposes.
1) = operator: assignment operator. Should check self assignment.
2) == and != operators: use mErrorCode to compare an ErrorInfo object with an ErrorCode enumeration. Normally, we have to overload two operators for each of them, to support both “MB_SUCCESS == err_info” and “err_info == MB_SUCCESS”. An alternative simpler solution is to define a conversion operator for converting ErrorInfo class type to ErrorCode enum type using mErrorCode, which can also keep some legacy user code unaffected, like
ErrorCode rval = MOABRoutine(...); // MOABRoutine now returns ErrorInfo instead of ErrorCode
3) << operator: use mErrorMsg (a C++ string stream object) to build up an error message string. This supports chained operations, such that we can build a string like this
err_info << "Invalid data size " << actual << " specified for tag " << name.c_str() << " of size " << expected;
This can replace the current C-style sprintf calls (where buffer is a preallocated char array), like
sprintf(buffer, "Invalid data size %d specified for tag %s of size %d", actual, name.c_str(), expected);
Note, some benchmark tests show that the string stream member adds overhead when the object is returned from a function. To reduce this overhead, a lightweight alternative is to store a std::string instead, and use a temporary string stream object outside the class to build the desired error messages.
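The pieces proposed in 3.2 can be put together in a short sketch. This is only one possible reading of the design, not the final class: it assumes the lightweight std::string variant mentioned above, uses the conversion operator instead of overloaded == and != operators, and adds a hypothetical message() getter and check_tag_size() routine for illustration. The ErrorCode enum is a stand-in for the one in src/moab/Types.hpp.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stand-in for MOAB's ErrorCode enum (values here are illustrative only).
enum ErrorCode { MB_SUCCESS = 0, MB_TAG_NOT_FOUND, MB_FAILURE };

class ErrorInfo {
public:
  ErrorInfo(ErrorCode err_code = MB_FAILURE, const char* err_msg = "")
    : mErrorCode(err_code), mErrorMsg(err_msg) {}

  // Conversion operator: makes "err_info == MB_SUCCESS", "MB_SUCCESS == err_info"
  // and "ErrorCode rval = MOABRoutine(...)" all work without extra overloads.
  operator ErrorCode() const { return mErrorCode; }

  // Hypothetical getter for the accumulated message.
  const std::string& message() const { return mErrorMsg; }

  // Chained << operator: a temporary string stream formats each value,
  // so the class itself only stores a std::string.
  template <typename T>
  ErrorInfo& operator<<(const T& value) {
    std::ostringstream oss;
    oss << value;
    mErrorMsg += oss.str();
    return *this;
  }

private:
  ErrorCode mErrorCode;
  std::string mErrorMsg;
};

// Hypothetical routine that returns ErrorInfo by value.
ErrorInfo check_tag_size(int actual, int expected, const std::string& name) {
  assert(expected > 0);
  if (actual == expected) return ErrorInfo(MB_SUCCESS);
  ErrorInfo err_info(MB_FAILURE);
  err_info << "Invalid data size " << actual << " specified for tag " << name
           << " of size " << expected;
  return err_info;
}
```

Legacy code such as "ErrorCode rval = check_tag_size(8, 4, name);" still compiles against this sketch, at the price of silently dropping the message.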

3.3 Macro-based Interface

To make the error handling code easy to read and use, some macros are defined as wrappers of error handlers. For now, a few MOAB source files have already defined their own local macros to check and return error codes. A better way is to define a small set of global macros, as PETSc does, so that developers and users can use them throughout the code.

3.3.1 Global Macros to Handle Errors

As the desired return type has not been confirmed yet, we only list macros based on the existing ErrorCode enum type below (macros based on the new ErrorInfo class type will be quite similar). The numbered variants (e.g. CHK_ERR1, CHK_ERR2) exist because C++ macros before C++11 cannot take a variable number of arguments.
SET_ERR(err_code, err_msg): When an error is initially detected, creates a new error with the specified error code and error message, and then returns the error code from the current function.
SET_ERR_STR(err_code, err_msg_str): Similar to SET_ERR, except that the error message is built from err_msg_str with stringstream << operators.
SET_GLB_ERR(err_code, err_msg): Similar to SET_ERR, except that the error is a global fatal error on all processors.
SET_GLB_ERR_STR(err_code, err_msg_str): Similar to SET_ERR_STR, except that the error is a global fatal error on all processors.
CHK_ERR(err_code): Checks the returned error code; if it is not MB_SUCCESS, returns it from the current function.
In addition to CHK_ERR, the macros CHK_ERR1() and CHK_ERR2() allow one to create a new error with a specified error code or error message.
CHK_ERR1(err_code, err_msg_to_set): Checks the returned error code; if it is not MB_SUCCESS, creates a new error with the specified error message.
CHK_ERR2(err_code, err_code_to_set, err_msg_to_set): Checks the returned error code; if it is not MB_SUCCESS, creates a new error with the specified error code and the specified error message.
Note, MB_SUCCESS is NOT always the expected return code (e.g. we might expect to get MB_TAG_NOT_FOUND). For this issue, we can define macros that compare the returned error code with a specified error code.
CHK_EQL(err_code, exp_err_code): Checks the returned error code; if it is not the expected one, creates a new error with default error code MB_FAILURE (or a new enum value, like MB_UNEXPECTED_RETURN_CODE?) and default error message “Returned error code is not expected”.
CHK_EQL1(err_code, exp_err_code, err_msg_to_set): Similar to CHK_EQL, but creates a new error with specified error message.
CHK_EQL2(err_code, exp_err_code, err_code_to_set, err_msg_to_set): Similar to CHK_EQL, but creates a new error with specified error code and specified error message.
The above CHK_XXX macros cannot be used in functions that return void or any type other than ErrorCode. For these functions, we can define CHK_XXX_NO_RET macros which return without an error code.

3.3.2 Macro Definitions

Errors are handled through the routine MBError(), which is declared as below. The enum ErrorType is MB_ERROR_TYPE_NEW_GLOBAL or MB_ERROR_TYPE_NEW_LOCAL on new error messages and MB_ERROR_TYPE_EXISTING on existing error messages. Each new error message is only printed once instead of for all levels of returned errors.
ErrorCode MBError(int line, const char* func, const char* file, const char* dir, ErrorCode err_code, const char* err_msg, ErrorType err_type);
MBError() is called when an error has been detected. Those macros listed above are just wrappers for MBError(). The definitions of SET_ERR, SET_GLB_ERR and CHK_ERR will be something like:
#define SET_ERR(err_code, err_msg) return MBError(__LINE__, __func__, __FILENAME__, __SDIR__, err_code, err_msg, MB_ERROR_TYPE_NEW_LOCAL)
#define SET_GLB_ERR(err_code, err_msg) return MBError(__LINE__, __func__, __FILENAME__, __SDIR__, err_code, err_msg, MB_ERROR_TYPE_NEW_GLOBAL)
#define CHK_ERR(err_code) do { if (MB_SUCCESS != (err_code)) return MBError(__LINE__, __func__, __FILENAME__, __SDIR__, err_code, "", MB_ERROR_TYPE_EXISTING); } while (false)
Note, the do … while(false) is a convention used to allow a macro to be embedded anywhere a statement might be. Other macros are defined in a similar way.
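To see how these wrappers produce a trace, here is a small self-contained sketch. It makes several assumptions: MBError() is mocked and appends lines to a hypothetical trace_log() instead of printing, the standard __FILE__ is used with an empty directory string in place of __FILENAME__ and __SDIR__, and read_header(), read_file() and load() are invented routines.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

enum ErrorCode { MB_SUCCESS = 0, MB_FAILURE };  // stand-in for MOAB's enum
enum ErrorType { MB_ERROR_TYPE_NEW_GLOBAL, MB_ERROR_TYPE_NEW_LOCAL, MB_ERROR_TYPE_EXISTING };

// Hypothetical sink: trace lines are collected here so they are easy to inspect.
std::vector<std::string>& trace_log() {
  static std::vector<std::string> lines;
  return lines;
}

// Mock of MBError(): records the message once (new errors only), then one frame per call.
ErrorCode MBError(int line, const char* func, const char* file, const char* dir,
                  ErrorCode err_code, const char* err_msg, ErrorType err_type) {
  assert(func && file && dir && err_msg);
  if (MB_ERROR_TYPE_EXISTING != err_type)
    trace_log().push_back(std::string("ERROR: ") + err_msg);
  std::ostringstream frame;
  frame << func << "() line " << line << " in " << dir << file;
  trace_log().push_back(frame.str());
  return err_code;
}

#define SET_ERR(err_code, err_msg) \
  return MBError(__LINE__, __func__, __FILE__, "", (err_code), (err_msg), MB_ERROR_TYPE_NEW_LOCAL)

#define CHK_ERR(err_code)                                                                         \
  do {                                                                                            \
    if (MB_SUCCESS != (err_code))                                                                 \
      return MBError(__LINE__, __func__, __FILE__, "", (err_code), "", MB_ERROR_TYPE_EXISTING);   \
  } while (false)

// Three nested routines: the innermost sets the error, the callers only check it.
ErrorCode read_header() { SET_ERR(MB_FAILURE, "Bad header"); }

ErrorCode read_file() {
  ErrorCode rval = read_header();
  CHK_ERR(rval);
  return MB_SUCCESS;
}

ErrorCode load() {
  ErrorCode rval = read_file();
  CHK_ERR(rval);
  return MB_SUCCESS;
}
```

Calling load() leaves four lines in trace_log(): the message from read_header(), followed by one frame each for read_header(), read_file() and load().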

3.3.3 Usages

When an error is initially detected, one should set it by calling SET_ERR (or SET_GLB_ERR if it is a global error across all processors). The user should then check the return codes for MOAB routines (and possibly user-defined routines as well) with
err_code = MBRoutine(…); CHK_ERR(err_code);
or check an expected error code
err_code = MBRoutine(…); CHK_EQL(err_code, exp_err_code);
or set a new error on a particular returned error code
err_code = MBRoutine(…); if (MB_XXXX == err_code) SET_ERR(err_code_to_set, err_msg_to_set);
If this procedure is followed, any error will by default generate a clean trace back of the location of the error.
CAUTION: Sometimes ErrorCode is used to return a non-error condition (some internal code that can be handled, or is even expected, e.g. MB_TAG_NOT_FOUND). Therefore, SET_ERR must be placed carefully so that it reports only true errors to the caller. Before using it, we need to decide whether the returned code is an intentional condition. For example, a lower level MOAB routine that could return MB_TAG_NOT_FOUND should not set it as an error, since the caller might expect to get this code. In this case, the lower level routine just returns MB_TAG_NOT_FOUND as a condition, and no error is set. It is then up to the caller to decide whether it should be a true error:
1) If MB_TAG_NOT_FOUND is expected, use: CHK_EQL(err_code, MB_TAG_NOT_FOUND);
2) If MB_TAG_NOT_FOUND is treated as an error, use something like: if (MB_TAG_NOT_FOUND == err_code) SET_ERR(MB_FAILURE, "Error message on tag not found");
3) If this caller still cannot make a decision, return it to upper level callers
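The three caller-side options can be sketched as follows. Everything here is illustrative: report_error() is a simplified stand-in for MBError(), the tag table is a plain std::map, and find_tag(), create_tag(), read_tag() and lookup_tag() are hypothetical routines, not MOAB API.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>

enum ErrorCode { MB_SUCCESS = 0, MB_TAG_NOT_FOUND, MB_FAILURE };  // stand-in enum

int g_errors_set = 0;  // counts how many true errors were reported

// Simplified stand-in for MBError(): prints one line and returns the code.
ErrorCode report_error(int line, const char* func, ErrorCode err_code, const char* err_msg) {
  assert(func && err_msg);
  ++g_errors_set;
  std::fprintf(stderr, "%s() line %d: %s\n", func, line, err_msg);
  return err_code;
}

#define SET_ERR(err_code, err_msg) \
  return report_error(__LINE__, __func__, (err_code), (err_msg))

#define CHK_EQL(err_code, exp_err_code)                                                       \
  do {                                                                                        \
    if ((exp_err_code) != (err_code))                                                         \
      return report_error(__LINE__, __func__, MB_FAILURE, "Returned error code is not expected"); \
  } while (false)

typedef std::map<std::string, int> TagTable;

// Lower level routine: returns MB_TAG_NOT_FOUND as a condition, sets no error.
ErrorCode find_tag(const TagTable& tags, const std::string& name, int& handle) {
  TagTable::const_iterator it = tags.find(name);
  if (it == tags.end()) return MB_TAG_NOT_FOUND;
  handle = it->second;
  return MB_SUCCESS;
}

// Option 1: the caller expects the tag to be absent before creating it.
ErrorCode create_tag(TagTable& tags, const std::string& name) {
  int handle = 0;
  ErrorCode rval = find_tag(tags, name, handle);
  CHK_EQL(rval, MB_TAG_NOT_FOUND);
  int new_handle = static_cast<int>(tags.size()) + 1;
  tags[name] = new_handle;
  return MB_SUCCESS;
}

// Option 2: the caller needs the tag, so a missing tag becomes a true error.
ErrorCode read_tag(const TagTable& tags, const std::string& name, int& handle) {
  ErrorCode rval = find_tag(tags, name, handle);
  if (MB_TAG_NOT_FOUND == rval) SET_ERR(MB_FAILURE, "Error message on tag not found");
  return rval;
}

// Option 3: the caller cannot decide and passes the condition up unchanged.
ErrorCode lookup_tag(const TagTable& tags, const std::string& name, int& handle) {
  return find_tag(tags, name, handle);
}
```

Only option 2, and a failed expectation in option 1, report an error; the lower level routine and option 3 stay silent.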

3.3.4 Default Error Handler

For now, the error handling routine MBError() only calls one default error handler, MBTraceBackErrorHandler(), which tries to print out a stack trace. The arguments to MBTraceBackErrorHandler() are the line number where the error occurred, the name of the function and the file in which the error was detected, the source directory, the error message, and the error type indicating whether the error message should be printed. Other kinds of error handlers might be supported in the future.
ErrorInfo MBTraceBackErrorHandler(int line, const char* func, const char* file, const char* dir, const char* err_msg, ErrorType err_type);
This handler will print out a line of stack trace, indicating line number, function name, directory and file name. If MB_ERROR_TYPE_NEW_XXXX is passed as the error type, the error message will be printed, otherwise it will be ignored.

3.4 Embedded Parallel Functionality

The proposed requirement is: if a non-root processor encounters an error, pass it back to the root and then for the root to print the message. However, letting the root processor gather error messages is considered a bad idea in recent discussion.
PETSc does not gather errors onto the root processor. Instead, its error handler takes an MPI communicator argument and, depending on its value, decides whether to print from the current processor or just from the root processor. The user code decides whether this is a globally fatal error (only printed from the root processor) or a per-processor relevant error (printed from the current processor) and passes a Comm object accordingly.
We can use an approach similar to that of PETSc. This would be much easier to implement and the code would be quite lightweight. Like PETSc, we can define a global MPI rank with which to prefix the output, as most systems have mechanisms for separating output by rank anyway. Instead of passing an MPI communicator to the error handler as PETSc does, we can pass error type MB_ERROR_TYPE_NEW_GLOBAL for globally fatal errors and MB_ERROR_TYPE_NEW_LOCAL for per-processor relevant errors.
Note, if the error handler uses std::cout to print error messages and stack traces on each processor, the output can become interleaved and messy. This is a known MPI issue with std::cout, and it seems that the existing DebugOutput class has solved this issue with buffered lines. A new class ErrorOutput (implemented similarly to DebugOutput) will be used by the error handler to print each line prefixed with the MPI rank.
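A rough sketch of this idea follows. It is not the planned ErrorOutput implementation: to stay self-contained it takes the rank as a plain int instead of querying MPI, writes to any std::ostream, and uses hypothetical helpers should_print() and report() to show how the error type selects which ranks print.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

enum ErrorType { MB_ERROR_TYPE_NEW_GLOBAL, MB_ERROR_TYPE_NEW_LOCAL, MB_ERROR_TYPE_EXISTING };

// Hypothetical ErrorOutput: prefixes every line with the rank it was printed from.
class ErrorOutput {
public:
  ErrorOutput(std::ostream& out, int rank) : mOut(out), mRank(rank) { assert(rank >= 0); }

  // Build the whole line first and write it in one call, which reduces
  // (but cannot by itself rule out) interleaving between ranks.
  void print_line(const std::string& line) {
    std::ostringstream oss;
    oss << "[" << mRank << "] " << line << "\n";
    mOut << oss.str();
  }

  int rank() const { return mRank; }

private:
  std::ostream& mOut;
  int mRank;
};

// New global errors print their message on the root only, new local errors
// on the current rank; existing errors do not print the message again.
bool should_print(ErrorType err_type, int rank) {
  if (MB_ERROR_TYPE_NEW_GLOBAL == err_type) return 0 == rank;
  return MB_ERROR_TYPE_NEW_LOCAL == err_type;
}

void report(ErrorOutput& out, const char* err_msg, ErrorType err_type) {
  if (should_print(err_type, out.rank())) out.print_line(err_msg);
}
```

In a real parallel run the rank would come from the MPI communicator, and the stream would typically be std::cerr or a per-rank file.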

3.5 Concerns

A few concerns and current decisions are listed below.

3.5.1 How to Get Function Names

Function names are needed in the stack trace. The predefined identifier __func__ (a “magic” string constant) is automatically set by the compiler to the name of the function in which it appears, which makes it very simple to write trace output and debugging statements. C99 and C++11 mandate __func__, but earlier C and C++ standards do not. Even so, most mainstream compilers support __func__. In PETSc, the __FUNCT__ macro appears to be a workaround for compilers that do not support __func__; it can be redefined before each PETSc or user-defined routine to indicate the routine name. This workaround will NOT be used in MOAB for now (unless desired in the future).
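As a quick illustration of how __func__ and __LINE__ feed a trace line, consider the sketch below. The helper describe_location(), the macro TRACE_HERE() and the routine read_header() are hypothetical names. The exact string stored in __func__ is implementation-defined, although common compilers use the plain function name for a free function.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Formats one trace entry from a function name and a line number.
std::string describe_location(const char* func, int line) {
  assert(func != nullptr);
  std::ostringstream oss;
  oss << func << "() line " << line;
  return oss.str();
}

// The macro must expand inside the calling function so that __func__ and
// __LINE__ refer to the caller, not to describe_location() itself.
#define TRACE_HERE() describe_location(__func__, __LINE__)

std::string read_header() { return TRACE_HERE(); }
```

This is why the error handling interface is macro-based: a plain function call could not capture the caller's name and line number.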

3.5.2 Build Call Stack or NOT

PETSc defines a global stack of maximum size 64, to allow the code to build call stack frames as it runs. The first executable line of each PETSc function is the PetscFunctionBegin macro, which pushes the information of the current function (e.g. the function name, the line number of the start of the function) onto the global stack. The last executable line of each PETSc function is the PetscFunctionReturn macro, which pops the frame of the current function from the global stack. This enables PETSc to access a call stack of functions at any time.
However, PETSc does NOT rely on this stack for its default trace back error handler. Instead, when a function detects an error and returns normally, the trace back handler is called with the current line number and function name, just as we proposed above in 3.3.4. Note, this provides the EXACT line number for each function. Only when a signal (e.g. SIGSEGV) or a floating-point trap is caught, so that the lower level function cannot return normally to its callers, does PETSc’s default signal handler call PetscStackView() to access the global call stack and print out the stack frames. Note, in this case, the EXACT line numbers are NOT available; INSTEAD, the line number of the start of each function is given (originally set by PetscFunctionBegin).
Maintaining such a global call stack will be expensive. Unless MOAB decides to handle signals or floating-point traps as PETSc does, this stack does NOT have to be used at this time.
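For reference, the bookkeeping such a stack requires looks roughly like the sketch below. It is not PETSc's implementation: instead of paired begin/return macros it uses a C++ RAII guard (a deliberate substitution), and the names ScopedFrame, FUNCTION_BEGIN, stack_view(), inner() and outer() are hypothetical. Only the bound of 64 frames is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct StackFrame {
  const char* func;  // function name
  int line;          // line where the function registered itself
};

const std::size_t MAX_STACK_SIZE = 64;  // same bound the text cites for PETSc
StackFrame g_stack[MAX_STACK_SIZE];
std::size_t g_stack_depth = 0;

// RAII guard: pushes a frame on construction and pops it on destruction.
class ScopedFrame {
public:
  ScopedFrame(const char* func, int line) : mPushed(g_stack_depth < MAX_STACK_SIZE) {
    if (mPushed) {
      g_stack[g_stack_depth].func = func;
      g_stack[g_stack_depth].line = line;
      ++g_stack_depth;
    }
  }
  ~ScopedFrame() {
    if (mPushed) {
      assert(g_stack_depth > 0);
      --g_stack_depth;
    }
  }
  ScopedFrame(const ScopedFrame&) = delete;
  ScopedFrame& operator=(const ScopedFrame&) = delete;

private:
  bool mPushed;  // false if the stack was already full
};

#define FUNCTION_BEGIN ScopedFrame scoped_frame_(__func__, __LINE__)

// Joins the names of the frames currently on the stack, outermost first.
std::string stack_view() {
  std::string out;
  for (std::size_t i = 0; i < g_stack_depth; ++i) {
    if (i > 0) out += " > ";
    out += g_stack[i].func;
  }
  return out;
}

std::string inner() { FUNCTION_BEGIN; return stack_view(); }
std::string outer() { FUNCTION_BEGIN; return inner(); }
```

Every instrumented function pays for a push and a pop on each call, whether or not an error ever occurs, which is the cost referred to above.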

4. Implementation and Testing

We will implement the proposed new error handling model in one or two places first. After that has been demonstrated to work well, we will have a group exercise to implement it in the rest of MOAB, to make sure we all get used to using it.
A new branch of MOAB (error_handling_enhancement) has been created for initial development and testing. So far, three new examples have been added to this branch. TestErrorHandling.cpp and TestErrorHandlingPar.cpp test MOAB’s trace back error handler on real errors (e.g. MB_FILE_DOES_NOT_EXIST), while ErrorHandlingModel.cpp shows how MOAB’s enhanced error handling model works with some contrived errors. Sample test results are available in TestResults.txt.
