Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-3142

Port HSC optimisations for reading astrometry.net catalog

    Details

    • Type: Improvement
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: meas_astrom
    • Labels:
      None

      Description

      Some astrometry.net catalogs used in production can be quite large, and currently all of the catalog must be read in order to determine bounds for each component. This can make the loading of the catalog quite slow (e.g., 144 sec out of 177 sec to process an HSC image, using an SDSS DR9 catalog). We have HSC code that caches the required information, making the catalog load much faster. The code is from the following HSC issues:

      While there have been some changes to the LSST astrometry code that will mean we can't directly cherry-pick the HSC code, yet I think the main structure remains, so the approach can be copied without much effort.

        Attachments

          Issue Links

            Activity

            Hide
            rowen Russell Owen added a comment - - edited

            astrometry_net.i

            You added function healpixDistance:

                lsst::afw::geom::Angle healpixDistance(int hp, int Nside, lsst::afw::coord::Coord const& coord) {
                    return lsst::afw::geom::Angle(healpix_distance_to_radec(hp, Nside, coord.getLongitude().asDegrees(),
                                                                            coord.getLatitude().asDegrees(), NULL),
                                                  lsst::afw::geom::degrees);
                }
            

            This does not enforce that "coord" is the same coordinate system and epoch as was used to make the index. That might be a bit tricky (though I hope it's solvable) but it doesn't even require that "coord" be an RA/Dec-based coordinate system and that would be trivial to solve. As it stands it can silently give nonsense answers.

            Perhaps we can assume that all index files are ICRS and thus that you can convert the RA/Dec to ICRS?

            For the ra/dec version please use Angle instead of float unless you expect this to be called so frequently that extracting the value in degrees will be onorous. If you choose ICRS to solve the above problem then please add that information to the doc string.

            These functions are only called by isWithinRange, as far as I can see, and I suggest a similar transformation there (below). To the extent that you can make this funciton safe, isWithinRange already becomes safer. And maybe you can just ditch the ra/dec version altogether, which would be nice.

            bin/andCache.py

            The name seems unclear for a function that generates a cache of a.net load data. A clear name is helpful because it will be on the $PATH. I suggest generateANetLoadCache.py.

            multiindex.py

            getIndexPath appears to have many logic errors, and a unit test would be helpful! I have included a possible alternative implementation below. Here are the issues I see:

            • If fn is an absolute path then it returns it without testing if it exists (since os.path.isabs does not test for existence).
            • lsst.utils.getPackageDir raises an exception if the package is not setup.
            • The final return statement appears to be incorrect; I think getIndexPath will fail with "fn2 undefined" if anddir is None.
            • If os.path.exists(fn2) then the function returns non-absolute path

            getConfigFromEnvironment raises an exception if astrometry_net_data is not setup.

            MultiIndexCache:

            • you say that fromFilenameList is not very efficient. If possible I suggest ditching it until we have a demonstrated need for it. I can't find any use of it (not even in tests).
            • If you retain it, please call healpix = set(...) something else, e.g. healpixSet, and similarly for nside.
            • In the interest of catching errors early... in __init__ I suggest you enforce int-ness for healpix and nside, e.g.: self._healpix = int(healpix) and similarly for nside. Also, I suggest you raise an exception if len(filenameList) < 2 (if at least one index file is required, or whatever value makes the most sense); in particular this confirms that filenameList is iterable.
            • isWithinRange: I suggest that ra, dec be a Coord (which the method should convert to IcrsCoord or however you handle healpixDistance, above) and distance be an Angle. That would clean up code I see using it, and that code acts one index at a time, so assuming you don't have many thousands of them, the extra time spent extracting the data as float values in degrees should not be an issue. If you're really worried about it then you could make the list of indices be an additional argument and have this method iterate over them. I realize that makes it less flexible, but it does satisfy the current use cases.
            • writeCache: I echo K-T's remark that using a FITS table seems sub-optimal as opposed to an SQLite database, but I take your point about an existing implementation.
            • _initFromCache: very picky, but would you mind making using present tense for the description (e.g. "Ingets the cache file...and use that...")?
            • generateCache: should the reload loop be inside the try block? That would help if a reload failed and you wanted to be sure to unload any indices that had been reloaded. But that only helps if index.unload() is always safe to call, even if an index has not been [re]loaded.
            • AstrometryNetCatalog:
            • It would help if the description said this behaved like an ordered collection of ...(whatever it is).
            • The argument name "index" for _getitem_ is a bit confusing given what I think is contained are a.net index files. Perhaps "i" would be clearer?

            itertools is not used (as shows by pyflakes linter)

            Any doc string that contains Doxygen markup should have an exclamation mark after the opening triple quotes. I suggest always using the exclamation mark because somebody may add Doxygen markup commands later.

            Here is a possible alternative implementation of getIndexPath:

            def getIndexPath(fn):
                """!Get the path to the specified astrometry.net index file
             
                @param[in] fn  path to index file; if relative, then relative to astrometry_net_data
                    if that product is setup, else relative to the current working directory
                @return the absolute path to the index file, or None if the file was not found
                """
                try:
                    tryPath = os.path.join(lsst.utils.getPackageDir('astrometry_net_data'), fn)
                except Exception:
                    tryPath = fn
                if os.path.exists(tryPath):
                    return os.path.abspath(tryPath)
                return None
            

            Show
            rowen Russell Owen added a comment - - edited astrometry_net.i You added function healpixDistance: lsst::afw::geom::Angle healpixDistance(int hp, int Nside, lsst::afw::coord::Coord const& coord) { return lsst::afw::geom::Angle(healpix_distance_to_radec(hp, Nside, coord.getLongitude().asDegrees(), coord.getLatitude().asDegrees(), NULL), lsst::afw::geom::degrees); } This does not enforce that "coord" is the same coordinate system and epoch as was used to make the index. That might be a bit tricky (though I hope it's solvable) but it doesn't even require that "coord" be an RA/Dec-based coordinate system and that would be trivial to solve. As it stands it can silently give nonsense answers. Perhaps we can assume that all index files are ICRS and thus that you can convert the RA/Dec to ICRS? For the ra/dec version please use Angle instead of float unless you expect this to be called so frequently that extracting the value in degrees will be onorous. If you choose ICRS to solve the above problem then please add that information to the doc string. These functions are only called by isWithinRange, as far as I can see, and I suggest a similar transformation there (below). To the extent that you can make this funciton safe, isWithinRange already becomes safer. And maybe you can just ditch the ra/dec version altogether, which would be nice. bin/andCache.py The name seems unclear for a function that generates a cache of a.net load data. A clear name is helpful because it will be on the $PATH. I suggest generateANetLoadCache.py. multiindex.py getIndexPath appears to have many logic errors, and a unit test would be helpful! I have included a possible alternative implementation below. Here are the issues I see: If fn is an absolute path then it returns it without testing if it exists (since os.path.isabs does not test for existence). lsst.utils.getPackageDir raises an exception if the package is not setup. The final return statement appears to be incorrect; I think getIndexPath will fail with "fn2 undefined" if anddir is None. If os.path.exists(fn2) then the function returns non-absolute path getConfigFromEnvironment raises an exception if astrometry_net_data is not setup. MultiIndexCache: you say that fromFilenameList is not very efficient. If possible I suggest ditching it until we have a demonstrated need for it. I can't find any use of it (not even in tests). If you retain it, please call healpix = set(...) something else, e.g. healpixSet, and similarly for nside. In the interest of catching errors early... in __init__ I suggest you enforce int-ness for healpix and nside, e.g.: self._healpix = int(healpix) and similarly for nside. Also, I suggest you raise an exception if len(filenameList) < 2 (if at least one index file is required, or whatever value makes the most sense); in particular this confirms that filenameList is iterable. isWithinRange: I suggest that ra, dec be a Coord (which the method should convert to IcrsCoord or however you handle healpixDistance, above) and distance be an Angle. That would clean up code I see using it, and that code acts one index at a time, so assuming you don't have many thousands of them, the extra time spent extracting the data as float values in degrees should not be an issue. If you're really worried about it then you could make the list of indices be an additional argument and have this method iterate over them. I realize that makes it less flexible, but it does satisfy the current use cases. writeCache: I echo K-T's remark that using a FITS table seems sub-optimal as opposed to an SQLite database, but I take your point about an existing implementation. _initFromCache: very picky, but would you mind making using present tense for the description (e.g. "Ingets the cache file...and use that...")? generateCache: should the reload loop be inside the try block? That would help if a reload failed and you wanted to be sure to unload any indices that had been reloaded. But that only helps if index.unload() is always safe to call, even if an index has not been [re]loaded. AstrometryNetCatalog: It would help if the description said this behaved like an ordered collection of ...(whatever it is). The argument name "index" for _ getitem _ is a bit confusing given what I think is contained are a.net index files. Perhaps "i" would be clearer? itertools is not used (as shows by pyflakes linter) Any doc string that contains Doxygen markup should have an exclamation mark after the opening triple quotes. I suggest always using the exclamation mark because somebody may add Doxygen markup commands later. Here is a possible alternative implementation of getIndexPath: def getIndexPath(fn): """!Get the path to the specified astrometry.net index file   @param[in] fn path to index file; if relative, then relative to astrometry_net_data if that product is setup, else relative to the current working directory @return the absolute path to the index file, or None if the file was not found """ try: tryPath = os.path.join(lsst.utils.getPackageDir('astrometry_net_data'), fn) except Exception: tryPath = fn if os.path.exists(tryPath): return os.path.abspath(tryPath) return None
            Hide
            price Paul Price added a comment -

            Tim Jenness, I replied to one comment on github. Your other two comments corresponded to comments made by Russell, and I've taken them into account in my revision.

            Thank you, Russell Owen, for your helpful review comments. I've made almost all the changes you requested, and pushed several fixup commits for you to review if you desire. I've not squashed them in yet so that it's easy for you to see what I've changed.

            I'm especially grateful to you for catching my misimplementations of getIndexPath and getConfigFromEnvironment. I think they should now be better, but please let me know what you think.

            I left MultiIndexCache.fromFilenameList because it's used by AstrometryNetCatalog._initFromIndexFiles — that's the only way to create a MultiIndexCache if there's no cache, and you need them to create a cache.

            I didn't move the reload into the try block in generateCache because I think we should fail if we're unable to read everything when generating a cache — otherwise the cache and the original are out of sync, and you'd hit an assertion.

            Show
            price Paul Price added a comment - Tim Jenness , I replied to one comment on github. Your other two comments corresponded to comments made by Russell, and I've taken them into account in my revision. Thank you, Russell Owen , for your helpful review comments. I've made almost all the changes you requested, and pushed several fixup commits for you to review if you desire. I've not squashed them in yet so that it's easy for you to see what I've changed. I'm especially grateful to you for catching my misimplementations of getIndexPath and getConfigFromEnvironment . I think they should now be better, but please let me know what you think. I left MultiIndexCache.fromFilenameList because it's used by AstrometryNetCatalog._initFromIndexFiles — that's the only way to create a MultiIndexCache if there's no cache, and you need them to create a cache. I didn't move the reload into the try block in generateCache because I think we should fail if we're unable to read everything when generating a cache — otherwise the cache and the original are out of sync, and you'd hit an assertion.
            Hide
            rowen Russell Owen added a comment - - edited

            I'll ask again about generateCache (below) but all your other changes look great. Thank you for being so tolerant of all my suggested changes. Leaving your changes as fixup commits so others could look at them was clever. I'll try to remember to do that, rebasing after the final review is done.

            Regarding generateCache: I agree that you should raise an exception if a reload fails, but surely that will happen if you move the reload loop into the try/finally block? If an error on loading occurs then the finally block runs, which unloads the indices, and the error is raised. Naively that seems ideal to me (as long as it is safe to call unload even if an index has never been loaded) because it leaves the cache in a known and self-constent state after a failure.

            Show
            rowen Russell Owen added a comment - - edited I'll ask again about generateCache (below) but all your other changes look great. Thank you for being so tolerant of all my suggested changes. Leaving your changes as fixup commits so others could look at them was clever. I'll try to remember to do that, rebasing after the final review is done. Regarding generateCache: I agree that you should raise an exception if a reload fails, but surely that will happen if you move the reload loop into the try/finally block? If an error on loading occurs then the finally block runs, which unloads the indices, and the error is raised. Naively that seems ideal to me (as long as it is safe to call unload even if an index has never been loaded) because it leaves the cache in a known and self-constent state after a failure.
            Hide
            price Paul Price added a comment -

            Ah, I understand. OK, I'll do that.

            Going to squash now and then merge after a final Jenkins build.

            Show
            price Paul Price added a comment - Ah, I understand. OK, I'll do that. Going to squash now and then merge after a final Jenkins build.
            Hide
            price Paul Price added a comment -

            Squashed, build succeeded and now merged to master.

            Thanks!

            Show
            price Paul Price added a comment - Squashed, build succeeded and now merged to master. Thanks!

              People

              • Assignee:
                price Paul Price
                Reporter:
                price Paul Price
                Reviewers:
                Russell Owen
                Watchers:
                Dominique Boutigny, Kian-Tat Lim, Lauren MacArthur, Paul Price, Russell Owen, Simon Krughoff
              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel