Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-78

Add API error-checking to coding standards

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Adopted
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None
    • Location:
      On this JIRA ticket

      Description

      I suggest that we add:

      When an API is used that provides a means for returning an error indication, the invoking code shall perform a check for errors.

      In particular, when invoking POSIX library functions or system calls, the return value shall always be checked for an error condition, and if an error condition is found, the value of errno should generally be preserved or reported in some form, unless a specific value of errno is expected as part of the normal operation of an algorithm.

      Rationale: In general it is clear that if you invoke a function, you want to know whether it claims to have done what you asked. It is true that in casual coding there are some commonly used library functions that will very rarely return an error, and there are traditions in various communities of not checking in these cases. However, in a large production project it is preferable to make error-checking a blanket rule because even extremely unlikely error conditions have a way of happening at some point, and the error check may provide the only clue for the cause of a rare failure.

      to the programming standards. I am thinking primarily of C++ but the principle applies mutatis mutandi to Python.

        Attachments

          Issue Links

            Activity

            Hide
            gpdf Gregory Dubois-Felsmann added a comment -

            What to actually do when you detect an error is another matter. This caused endless controversy in BaBar, particularly about the question of what code has the right to call abort() or something equivalent that causes immediate process termination.

            The general rule we ultimately developed was to try to pass even the most severe errors outward to "application level", the level that knows what the overall point of running the process was, so that the decision about whether to terminate rests with code that understands the environment.

            As an example of the application-specific nature of that decision: it may very well be reasonable for a sky-tile-processing job in DRP to terminate itself at the slightest sign of something unexpected in its environment that would cause suspicion of the integrity of the application - some kind of "can't happen". (I'm not suggesting it do this in response to bad pixel data, though.) It can count on a supervisory process to be watching for that and to mark its unit of work as needing to be rescheduled. A control-system process, however, can be viewed as having an obligation to do the best it can to remain part of the control system constellation and often should deal with even a very serious error by broadcasting that status to its partners as part of the control system's logic, rather than just bailing out.

            As a corollary: we restricted the use of assert() just to those cases where it tested a predicate that was completely under the control of the unit of code containing the assertion. ("Unit of code" was usually interpreted as a class and all its functions, but it was acceptable for it to include closely cooperating classes authored by the same person as a conceptual unit - e.g., a container class and its associated iterator class.) Assertions, we stated, should never be used for things like argument checking in public interfaces - both because they can be compiled away and because the result of a failed assert() - an abort() - was rarely an appropriate response to a bad argument.

            The LSST standards seem to be silent on this, based on a cursory search.

            Show
            gpdf Gregory Dubois-Felsmann added a comment - What to actually do when you detect an error is another matter. This caused endless controversy in BaBar, particularly about the question of what code has the right to call abort() or something equivalent that causes immediate process termination. The general rule we ultimately developed was to try to pass even the most severe errors outward to "application level", the level that knows what the overall point of running the process was, so that the decision about whether to terminate rests with code that understands the environment. As an example of the application-specific nature of that decision: it may very well be reasonable for a sky-tile-processing job in DRP to terminate itself at the slightest sign of something unexpected in its environment that would cause suspicion of the integrity of the application - some kind of "can't happen". (I'm not suggesting it do this in response to bad pixel data, though.) It can count on a supervisory process to be watching for that and to mark its unit of work as needing to be rescheduled. A control-system process, however, can be viewed as having an obligation to do the best it can to remain part of the control system constellation and often should deal with even a very serious error by broadcasting that status to its partners as part of the control system's logic, rather than just bailing out. As a corollary: we restricted the use of assert() just to those cases where it tested a predicate that was completely under the control of the unit of code containing the assertion. ("Unit of code" was usually interpreted as a class and all its functions, but it was acceptable for it to include closely cooperating classes authored by the same person as a conceptual unit - e.g., a container class and its associated iterator class.) Assertions, we stated, should never be used for things like argument checking in public interfaces - both because they can be compiled away and because the result of a failed assert() - an abort() - was rarely an appropriate response to a bad argument. The LSST standards seem to be silent on this, based on a cursory search.
            Hide
            tjenness Tim Jenness added a comment -

            I'd summarize this as: not knowing there was an error is worse than finding there was an error and then having to think about how to handle it.

            Show
            tjenness Tim Jenness added a comment - I'd summarize this as: not knowing there was an error is worse than finding there was an error and then having to think about how to handle it.
            Hide
            ktl Kian-Tat Lim added a comment - - edited

            I think assert rules were supposed to be added about 5 years ago, based on https://dev.lsstcorp.org/trac/wiki/DM/SAT/M20100528M, but that action does not seem to have been implemented. There's also https://dev.lsstcorp.org/trac/wiki/Winter2012/ExceptionsLoggingRequirements.

            Show
            ktl Kian-Tat Lim added a comment - - edited I think assert rules were supposed to be added about 5 years ago, based on https://dev.lsstcorp.org/trac/wiki/DM/SAT/M20100528M , but that action does not seem to have been implemented. There's also https://dev.lsstcorp.org/trac/wiki/Winter2012/ExceptionsLoggingRequirements .
            Hide
            jhoblitt Joshua Hoblitt added a comment -

            I subscribe to the erlang-ian view that unhanded errors (exceptions, etc.) should be fatal to program execution. In an environment where hundreds or thousands of concurrent jobs are running, its much more expedient for that task to come to an unequivocal halt and be rescheduled then to continue on in a possibly inconsistent state.

            Show
            jhoblitt Joshua Hoblitt added a comment - I subscribe to the erlang-ian view that unhanded errors (exceptions, etc.) should be fatal to program execution. In an environment where hundreds or thousands of concurrent jobs are running, its much more expedient for that task to come to an unequivocal halt and be rescheduled then to continue on in a possibly inconsistent state.
            Hide
            gpdf Gregory Dubois-Felsmann added a comment -

            Re: Joshua Hoblitt's comment: as I tried to indicate above, I think that's exactly right when dealing with a large-scale production data-processing system. No one piece of work is irreplaceably critical to complete on the first try, and data integrity is key.

            Sometimes when you are writing systems that, for instance, control external non-compute hardware, I've found it preferable for a key control system process to tenaciously try to stay alive and maintain some basic state and connections even when weird things happen. This can be a layered strategy; I often paired a complex process with a simpler supervisor (kind of like the modern trend of letting browser subprocesses do the actual rendering) so that it was OK for the complex process to self-terminate at the first sign of trouble.

            Another of thinking about this is that to some extent when you said "unhandled" you were begging the question. The question is what does it really mean for an error to have been handled?

            Show
            gpdf Gregory Dubois-Felsmann added a comment - Re: Joshua Hoblitt 's comment: as I tried to indicate above, I think that's exactly right when dealing with a large-scale production data-processing system. No one piece of work is irreplaceably critical to complete on the first try, and data integrity is key. Sometimes when you are writing systems that, for instance, control external non-compute hardware, I've found it preferable for a key control system process to tenaciously try to stay alive and maintain some basic state and connections even when weird things happen. This can be a layered strategy; I often paired a complex process with a simpler supervisor (kind of like the modern trend of letting browser subprocesses do the actual rendering) so that it was OK for the complex process to self-terminate at the first sign of trouble. Another of thinking about this is that to some extent when you said "unhandled" you were begging the question. The question is what does it really mean for an error to have been handled?
            Hide
            ktl Kian-Tat Lim added a comment -

            Will add to coding standards.

            Show
            ktl Kian-Tat Lim added a comment - Will add to coding standards.
            Hide
            tjenness Tim Jenness added a comment -

            Kian-Tat Lim I think this needs an implementation ticket. I don't see anything obvious in JIRA at the moment and I don't believe the developer guide has been updated. I can make a ticket if there isn't one.

            Show
            tjenness Tim Jenness added a comment - Kian-Tat Lim I think this needs an implementation ticket. I don't see anything obvious in JIRA at the moment and I don't believe the developer guide has been updated. I can make a ticket if there isn't one.
            Hide
            tjenness Tim Jenness added a comment -

            I added DM-9038.

            Show
            tjenness Tim Jenness added a comment - I added DM-9038 .

              People

              Assignee:
              gpdf Gregory Dubois-Felsmann
              Reporter:
              gpdf Gregory Dubois-Felsmann
              Watchers:
              Gregory Dubois-Felsmann, John Swinbank, Jonathan Sick, Joshua Hoblitt, Kian-Tat Lim, Krzysztof Findeisen, Tim Jenness
              Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:
                Planned End:

                  Jenkins

                  No builds found.