Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-593

Interval, sphgeom API consistency, and other enhancements to the geom package

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      Proposal Stages

      The proposals in this RFC are split into three stages:

      • Stage 1 proposals are fully backwards compatible (and essentially already implemented on DM-19392) but make design choices at least partly predicated on additional backwards-incompatible (Stage 2) changes.  I would like at least some scaled down version of the Stage 1 proposal for moderately high-priority DM work (DM-18610).  The rest can be deferred indefinitely, or until I have weekend/20%-time to work on it.
      • Stage 2 proposals are backwards incompatible at a manageable level, either because a deprecation-based migration seems doable or because current (especially external) dependence on these existing APIs or behavior is essentially nonexistent.
      • Stage 3 proposals are not actually being proposed in this RFC; they're either notes for future potential RFCs, or they're they're things I wish we could have done differently but feel we are now essentially locked into. I'd be happy for others to convince me I'm wrong about the latter (especially on future RFCs, or their operations-era equivalent).

      Interval Classes

      (all Stage 1)

      IntervalI and IntervalD are 1-d counterparts to Box2I and Box2D, which can now be constructed from a pair of x and y intervals, and have accessors to return them. Many box methods are now implemented by delegating to interval implementations, and I hope more can in the future (more on this below).

      The APIs and behavior of intervals are not as similar to those of boxes as they could be, however, because they're instead more similar to those of the sphgeom library, and I'm proposing here that we use those instead as a model for how we should (carefully!) evolve the box APIs.

      These differences include:

      • IntervalD (like IntervalI, Box2I, and all box and interval classes in sphgeom) is defined as a closed interval, with both endpoints considered to be contained by the interval, while Box2D is defined to be half-open, with the minimum included and the maximum excluded. A closed interval permits non-empty zero-size intervals (i.e. points) to be represented exactly, and the symmetry simplifies the implementation.
      • In C++, every interval mutator has a factory variant that returns a new object instead, with names differing in the tense of the verb (e.g. "expandTo" and "expandedTo"). The names of the methods themselves come from or are inspired by sphgeom's more precise method names. The current box API has only the mutator forms of these methods, with names that do not lend themselves as well to adding factory variants.
      • In Python, only the factory variants are available, making the interval classes unmodifiable from Python, and hence safe to use as the return value of Python properties. The new getX and getY interval accessors on the box classes thus are accompanied by .x and .y read-only properties. Most interval accessors (which return int or float) also have associated properties.
      • In C++, interval construction primarily goes through static methods (i.e. the named constructor idiom) like fromMinMax, fromMinSize, and fromCenterSize. The box classes have a static method (makeCenteredBox) for one set of arguments and use regular constructors for the rest (though they can get away with this because Point and Extent have different types, while the bounds and size of an interval do not).
      • In Python, interval construction goes through a single constructor that requires all arguments to be passed as keywords, allowing it to take any sane combination of min, max, center, and size.

      Changes to Box and Point/Extent Classes

      Making the interval classes intentionally similar to sphgeom and intentionally different from the current box classes only makes sense if we can update the Box, Point, and Extent classes:

      • We should add the sphgeom/interval mutators and factory methods (erode[d]By, dilate[d]By, shift[ed]By, reflect[ed]About, expand[ed]To, clip[ped]To) to the box classes (Stage 1). This would mean we would have multiple, differently-named ways to perform some operations, as I consider removing the old methods (grow, shift, flipLR, FlipTB, include, clip) - or at least some of them - to be infeasible (i.e., Stage 3), though I would support documentation-level statements that the new methods are preferred in new code.
      • We should add read-only properties for all scalar and interval accessors to Box, Point, and Extent (Stage 1).
      • We should deprecate and ultimately remove all mutators (namely setX and setY) from Point and Extent in Python, eventually allowing them to be unmodifiable from Python and usable in properties (Stage 2).
      • We should redefine Box2D to be a closed geometry (Stage 2), allowing it to delegate more to IntervalD in its implementation (and have better, less surprising behavior, IMO). I do not believe a deprecation period for this change is doable, but I don't think any downstream non-unittest code will notice the change in behavior.
      • We should make box "emptiness" separable (Stage 2), allowing a box to be empty in one dimension without forgetting its extent in the other. Such a box would still be considered empty overall, but it would allow even more delegation from box to interval (in fact, the box implementation could become essentially trivial), and more importantly it permits a box to represent an interval exactly at one limit, just as an interval can represent a point at one limit. I do not believe a deprecation period for this change is doable, but I don't think any downstream non-unittest code will notice the change in behavior.
      • We should rename makeCenteredBox to fromCenterSize (in C++), after a deprecation period, to match Interval's C++ API (Stage 2), and in Python support this operation via keyword arguments to the constructor (Stage 2). Making all box construction in Python require keyword arguments is probably infeasible and unnecessary given the strong typing of Point and Extent, though requiring them would make code easier to read (Stage 3).

      NumPy Integration and Angle

      DM-19392 adds several Python helper methods to several geom classes to make them more usable with NumPy. Most of these are straightforward additions (Stage 1):

      • IntervalI is iterable (over the points within it), and also has an arange method to return those points in a 1-d numpy.ndarray.
      • Box2I similarly has a grid method that returns a pair of 2-d numpy.ndarrays with the x and y coordinates within it (yes, just numpy.meshgrid on Interval.arange).
      • All intervals, boxes, and sphgeom regions now have a vectorized contains method that takes separate arrays of x, y, and (for sphgeom) z and returns a bool array. These should work on arrays of any shape, and return an array of the same shape.
      • LinearTransform and AffineTransform now have vectorized function call operators that take separate arrays of x and y and return separate arrays for transformed x and y.
      • SpherePoint now has a toUnitXYZ static method that transforms arrays of longitude and latitude (with a required AngleUnit argument) to the x, y, z
        unit-vector arrays expected by the sphgeom region classes.

      These additions are worth mentioning in this RFC because they imply that several other vectorized operations should be available, but those are a bit more problematic:

      • The number of input and output dimensions in an afw.geom.Transform may (with a generic endpoint) not be known until runtime, making it at least difficult to provide APIs that take a fixed number of single-coordinate arrays without (possibly breaking) changes to the class (Stage 2, at least for TransformPoint2ToPoint2I).
      • We have two Angle classes - one in sphgeom and one in geom - but neither provides a safe way to represent an array of angles (they are also both mutable in Python, which is also unfortunate). At least some of these problems should probably be solved - perhaps by using astropy objects - before adding methods that accept or return vectorized arrays of angles to SkyWcs or sphgeom regions. (Note that these would differ from the existing vectorized methods in SkyWcs that operate on lists/vectors of SpherePoints; those are still difficult and inefficient if what you have are column arrays from afw.table, astropy.table, or Pandas, as is often the case).

      On Documentation

      Much of this RFC concerns additions to the Python API at the pybind11 layer or other intentional differences between C++ and Python APIs, and at present there's no way for users to find out about them.  Even in a world in which we have tooling to create Sphinx docs from Doxygen content, documenting these Python interfaces precisely will be difficult.

      Because the geom package is so central and ubiquitous, I think it is reasonable to consider writing completely distinct Python docstrings (in the pybind11 source files) in this package, and living with whatever duplication that entails. If others agree, I believe there are no technical obstacles to us starting to do that now (let's call that Stage 2). But when mechanisms to use Doxygen to provide Sphinx docstrings do arrive,  those mechanisms would need to support having distinct Python and C++ documentation for the same class in some cases.

        Attachments

          Issue Links

            Activity

            Hide
            ktl Kian-Tat Lim added a comment -

            Finally getting to this.

            Questions:

            • Are the boundaries of a box (especially a BoxI when used with an image) infinitesimally-narrow lines, or are they single pixels? I can see closed interval behavior for the former, but it could be dangerous for the latter.
            • Would you be averse to an extra-long but still finite deprecation period for the duplicated methods? That helps to mitigate the "multiple methods doing the same thing" issue, since one will start to give an explicit warning.
            • I think it would be highly desirable to make these immutable objects. I don't think the C++ assignment operator itself poses a problem here; if I'm reading this right, the problem would come from exposing member variables to assignment. Is what you're suggesting?
            • I'm a little worried about all this explicit vectorization bypassing the type system, but maybe that's inevitable in a NumPy/pandas world. But can't you get the n-dimensional vectorized transform methods to take 1-d arrays by using *args? Something like TransformPoint2ToPoint2.applyForward(List[float], List[float]) -> Tuple[List[float], List[float]]

            In general, I think most of Krzysztof Findeisen's objections and my similar concerns can be handled by making the proposal more ambitious: fully align interfaces, remove duplication at the cost of backward-compatibility, go for C++ immutability.

            I think copy-and-edit documentation in pybind11 is going to be required for some time. I can't foresee write-once Python and C++ documentation without significant tooling build-out.

            Show
            ktl Kian-Tat Lim added a comment - Finally getting to this. Questions: Are the boundaries of a box (especially a BoxI when used with an image) infinitesimally-narrow lines, or are they single pixels? I can see closed interval behavior for the former, but it could be dangerous for the latter. Would you be averse to an extra-long but still finite deprecation period for the duplicated methods? That helps to mitigate the "multiple methods doing the same thing" issue, since one will start to give an explicit warning. I think it would be highly desirable to make these immutable objects. I don't think the C++ assignment operator itself poses a problem here; if I'm reading this right, the problem would come from exposing member variables to assignment. Is what you're suggesting? I'm a little worried about all this explicit vectorization bypassing the type system, but maybe that's inevitable in a NumPy/pandas world. But can't you get the n-dimensional vectorized transform methods to take 1-d arrays by using *args ? Something like TransformPoint2ToPoint2.applyForward(List [float] , List [float] ) -> Tuple[List [float] , List [float] ] In general, I think most of Krzysztof Findeisen 's objections and my similar concerns can be handled by making the proposal more ambitious: fully align interfaces, remove duplication at the cost of backward-compatibility, go for C++ immutability. I think copy-and-edit documentation in pybind11 is going to be required for some time. I can't foresee write-once Python and C++ documentation without significant tooling build-out.
            Hide
            jbosch Jim Bosch added a comment - - edited

            Are the boundaries of a box (especially a BoxI when used with an image) infinitesimally-narrow lines, or are they single pixels? I can see closed interval behavior for the former, but it could be dangerous for the latter.

            Box2D boundaries are infinitesimally narrow lines, while Box2I's are full pixels.  But Box2I has always been a closed geometry, and the conversions between the classes and their differing semantics for size reflect this.  The only change proposed here is that Box2D's infinitesimally narrow boundary is considered part of the box on all sides, rather than just the lower boundary.

            Would you be averse to an extra-long but still finite deprecation period for the duplicated methods? That helps to mitigate the "multiple methods doing the same thing" issue, since one will start to give an explicit warning.

            I forget whether our deprecation procedure provides a way to suppress warnings in old code.  If so, I'm on board with this - I'd love to have warnings in new code to encourage using the preferred APIs, but I don't want a ton of warnings in old code just because we haven't yet decided it's worthwhile to convert it yet.

            I think it would be highly desirable to make these immutable objects. I don't think the C++ assignment operator itself poses a problem here; if I'm reading this right, the problem would come from exposing member variables to assignment. Is what you're suggesting?

            Having a C++ assignment operator definitely does break immutability; I'm not sure if that's what you meant by "poses a problem".  I am indeed arguing that you can still have a C++ assignment operator (and other mutating methods) in C++ and still avoid the problems associated with Python properties that return mutable objects, as long as you don't provide Python versions of those methods.

            To expand a bit, I think "value-held" objects like those here are quite different from "shared_ptr-held" (often polymorphic) objects, which do gain considerably from C++ immutability.  Value-held objects are rarely aliased in C++ code, so there simply isn't much value to immutability there.  The problem of course is that Python effectively holds everything by shared_ptr, so there's a ton of aliasing there, and if there are ways to mutate these objects from Python that's a problem.  But we don't have to remove C++ mutability to achieve that; instead we can:

            • always copy or move when returning objects through wrapped C++ APIs;
            • never take non-const references to these objects in wrapped C++ APIs;
            • never wrap mutating methods on value-held classes.

            Aside from the last (which this proposal aims to address), these are good practices anyway, and if there are any exceptions to them (for geom classes, at least) in the codebase, they're extremely rare.   The result is that the set of (aliased, effectively immutable) Python instances and the set of (effectively unaliased, but mutable) C++ instances are disjoint, and hence we have no shared state problems from mutable, aliased instances.

            Apologies if I just explained something that you already understood, but I wanted to make sure we were on the same page.  If you are arguing that we should go further and also remove the mutating methods from the C++ APIs while leaving the assignment operator, I'd be okay with that; it would both make C++ and Python code more similar and simplify their implementations, both of which are good things, even if it doesn't affect the kind of immutability that matters for reasoning about shared state.

            I'm a little worried about all this explicit vectorization bypassing the type system, but maybe that's inevitable in a NumPy/pandas world.

            That's my conclusion, too - for better or (probably) worse, using NumPy efficiently really means using arrays of POD in many places where it'd be vastly preferable to use classes/structs in any compiled language.

            But can't you get the n-dimensional vectorized transform methods to take 1-d arrays by using *args? Something like TransformPoint2ToPoint2.applyForward(List[float], List[float]) -> Tuple[List[float], List[float]]

            Good point!  I'm not sure I'd want to add it to the Stage 1 implementation ticket I've basically completed, but this certainly seems quite doable.

            Show
            jbosch Jim Bosch added a comment - - edited Are the boundaries of a box (especially a BoxI when used with an image) infinitesimally-narrow lines, or are they single pixels? I can see closed interval behavior for the former, but it could be dangerous for the latter. Box2D boundaries are infinitesimally narrow lines, while Box2I's are full pixels.  But Box2I has always been a closed geometry, and the conversions between the classes and their differing semantics for size reflect this.  The only change proposed here is that Box2D's infinitesimally narrow boundary is considered part of the box on all sides, rather than just the lower boundary. Would you be averse to an extra-long but still finite deprecation period for the duplicated methods? That helps to mitigate the "multiple methods doing the same thing" issue, since one will start to give an explicit warning. I forget whether our deprecation procedure provides a way to suppress warnings in old code.  If so, I'm on board with this - I'd love to have warnings in new code to encourage using the preferred APIs, but I don't want a ton of warnings in old code just because we haven't yet decided it's worthwhile to convert it yet. I think it would be highly desirable to make these immutable objects. I don't think the C++ assignment operator itself poses a problem here; if I'm reading this right, the problem would come from exposing member variables to assignment. Is what you're suggesting? Having a C++ assignment operator definitely does break immutability; I'm not sure if that's what you meant by "poses a problem".  I am indeed arguing that you can still have a C++ assignment operator (and other mutating methods) in C++ and still avoid the problems associated with Python properties that return mutable objects, as long as you don't provide Python versions of those methods. To expand a bit, I think "value-held" objects like those here are quite different from "shared_ptr-held" (often polymorphic) objects, which do gain considerably from C++ immutability.  Value-held objects are rarely aliased in C++ code, so there simply isn't much value to immutability there.  The problem of course is that Python effectively holds everything by shared_ptr, so there's a ton of aliasing there, and if there are ways to mutate these objects from Python that's a problem.  But we don't have to remove C++ mutability to achieve that; instead we can: always copy or move when returning objects through wrapped C++ APIs; never take non-const references to these objects in wrapped C++ APIs; never wrap mutating methods on value-held classes. Aside from the last (which this proposal aims to address), these are good practices anyway, and if there are any exceptions to them (for geom classes, at least) in the codebase, they're extremely rare.   The result is that the set of (aliased, effectively immutable) Python instances and the set of (effectively unaliased, but mutable) C++ instances are disjoint, and hence we have no shared state problems from mutable, aliased instances. Apologies if I just explained something that you already understood, but I wanted to make sure we were on the same page.  If you are arguing that we should go further and also remove the mutating methods from the C++ APIs while leaving the assignment operator, I'd be okay with that; it would both make C++ and Python code more similar and simplify their implementations, both of which are good things, even if it doesn't affect the kind of immutability that matters for reasoning about shared state. I'm a little worried about all this explicit vectorization bypassing the type system, but maybe that's inevitable in a NumPy/pandas world. That's my conclusion, too - for better or (probably) worse, using NumPy efficiently really means using arrays of POD in many places where it'd be vastly preferable to use classes/structs in any compiled language. But can't you get the n-dimensional vectorized transform methods to take 1-d arrays by using *args ? Something like TransformPoint2ToPoint2.applyForward(List [float] , List [float] ) -> Tuple[List [float] , List [float] ] Good point!  I'm not sure I'd want to add it to the Stage 1 implementation ticket I've basically completed, but this certainly seems quite doable.
            Hide
            ktl Kian-Tat Lim added a comment -

            BTW, I was arguing for removing the mutating C++ methods.

            Show
            ktl Kian-Tat Lim added a comment - BTW, I was arguing for removing the mutating C++ methods.
            Hide
            jbosch Jim Bosch added a comment -

            BTW, I was arguing for removing the mutating C++ methods.

            I'll drop them.

             

            Show
            jbosch Jim Bosch added a comment - BTW, I was arguing for removing the mutating C++ methods. I'll drop them.  
            Hide
            jbosch Jim Bosch added a comment -

            Looks like I marked this as Implemented prematurely, because I only completed one of the two triggered tickets.  But now Jira doesn't seem to give me a way to fix that.

            Show
            jbosch Jim Bosch added a comment - Looks like I marked this as Implemented prematurely, because I only completed one of the two triggered tickets.  But now Jira doesn't seem to give me a way to fix that.

              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Jim Bosch, John Parejko, Jonathan Sick, Kian-Tat Lim, Krzysztof Findeisen, Russell Owen, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:
                Planned End:

                  Jenkins

                  No builds found.