Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-3490

Quick-and-dirty n-way spatial matching

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: afw
    • Labels:
      None
    • Story Points:
      2
    • Sprint:
      Science Pipelines DM-S15-6, Science Pipelines DM-W16-1
    • Team:
      Data Release Production

      Description

      This issue will add limited N-way spatial matching of multiple catalogs with identical schemas, sufficient for measuring FY15 KPMs. It will be a simple wrapper on our existing 2-way matching code in afw, and will not be intended for long term use (as it won't be an efficient algorithm or an ideal interrface).

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment -

            Lauren MacArthur, could you take a look at this? I'm adding a few Python classes to afw.table for spatially matching multiple catalogs. I don't think this is code we'll want to use in the long run, because it's using a very naive algorithm and a lot of performance-critical loops are in Python which probably ought to be lifted to C++. But it works well enough for our little HSC test dataset, and I think that makes it better than anything else we have right now. It's also possible (but not guaranteed) that the interface might survive long-term even if the implementation does not.

            I have not included any unit tests, because I'm in a hurry - but I have added an IPython notebook to examples/ that exercises the code pretty well (I'm using that to compute KPMs). Since I think that will hold as a stopgap measure, I'm inclined to wait for a better implementation before writing tests.

            All changes on tickets/DM-3490 of afw.

            Show
            jbosch Jim Bosch added a comment - Lauren MacArthur , could you take a look at this? I'm adding a few Python classes to afw.table for spatially matching multiple catalogs. I don't think this is code we'll want to use in the long run, because it's using a very naive algorithm and a lot of performance-critical loops are in Python which probably ought to be lifted to C++. But it works well enough for our little HSC test dataset, and I think that makes it better than anything else we have right now. It's also possible (but not guaranteed) that the interface might survive long-term even if the implementation does not. I have not included any unit tests, because I'm in a hurry - but I have added an IPython notebook to examples/ that exercises the code pretty well (I'm using that to compute KPMs). Since I think that will hold as a stopgap measure, I'm inclined to wait for a better implementation before writing tests. All changes on tickets/ DM-3490 of afw.
            Hide
            jbosch Jim Bosch added a comment -

            I just noticed that you'll need obs_subaru tickets/DM-3518 to run the example. I'll hold off merging this one until that ticket is ready to merge too, but it shouldn't be a blocker for the review.

            Show
            jbosch Jim Bosch added a comment - I just noticed that you'll need obs_subaru tickets/ DM-3518 to run the example. I'll hold off merging this one until that ticket is ready to merge too, but it shouldn't be a blocker for the review.
            Hide
            lauren Lauren MacArthur added a comment -

            I'm just finishing up a ticket of my own and then a review for Nate, then I will have a look at this.
            I'll be sure to pull down that branch to run the example.

            Show
            lauren Lauren MacArthur added a comment - I'm just finishing up a ticket of my own and then a review for Nate, then I will have a look at this. I'll be sure to pull down that branch to run the example.
            Hide
            jbosch Jim Bosch added a comment -

            Actually, I think I'm just going to put the obs_subaru changes required here on a tickets/DM-3490 branch; they fit as well there as they do on DM-3518, and DM-3518 may take me a while longer to complete.

            Show
            jbosch Jim Bosch added a comment - Actually, I think I'm just going to put the obs_subaru changes required here on a tickets/ DM-3490 branch; they fit as well there as they do on DM-3518 , and DM-3518 may take me a while longer to complete.
            Hide
            lauren Lauren MacArthur added a comment -

            Copy that...I decided just to pull in those commits on a testing branch. Could you do me a big favor and rebase your afw branch on master? I was going to run some of the same data on tiger and run the notebook there, but my current setup needs updated repos (no worries if not as I can probably switch btw branches for running vs notebook execution).

            Show
            lauren Lauren MacArthur added a comment - Copy that...I decided just to pull in those commits on a testing branch. Could you do me a big favor and rebase your afw branch on master? I was going to run some of the same data on tiger and run the notebook there, but my current setup needs updated repos (no worries if not as I can probably switch btw branches for running vs notebook execution).
            Hide
            jbosch Jim Bosch added a comment -

            The afw branch has been rebased.

            Show
            jbosch Jim Bosch added a comment - The afw branch has been rebased.
            Hide
            lauren Lauren MacArthur added a comment -

            I put a few comments on GitHub. I ran this on a full focal plane run of the same visits. As advertised, it's certainly not a fast code (processing time ~6min), but it does the job we need for now! Review complete.

            Show
            lauren Lauren MacArthur added a comment - I put a few comments on GitHub. I ran this on a full focal plane run of the same visits. As advertised, it's certainly not a fast code (processing time ~6min), but it does the job we need for now! Review complete.
            Hide
            jbosch Jim Bosch added a comment -

            Lauren MacArthur I finally got back to this (sorry I took so long), but I don't actually see any comments on GitHub (or even a Pull Request) for either package.

            Show
            jbosch Jim Bosch added a comment - Lauren MacArthur I finally got back to this (sorry I took so long), but I don't actually see any comments on GitHub (or even a Pull Request) for either package.
            Hide
            lauren Lauren MacArthur added a comment -

            You don't see the comment on GitHub, e.g. https://github.com/lsst/afw/commit/1ed1330865698239294cc465c24db80ce4e9d72a
            ?

            I assume that means you didn't get email notifications either??

            Show
            lauren Lauren MacArthur added a comment - You don't see the comment on GitHub, e.g. https://github.com/lsst/afw/commit/1ed1330865698239294cc465c24db80ce4e9d72a ? I assume that means you didn't get email notifications either??
            Hide
            jbosch Jim Bosch added a comment -

            Looks like I did get the email notifications, but I failed in my attempts to search for them until you and John commented further on the changes earlier today.

            Anyhow, I've figured out what happened: I think you just commented on the Git commits, instead of actually creating a PR, and that's why it didn't get linked or appear in the list of PRs on GitHub. I'll reply there and then paste everything here so the conversation isn't lost when I rebase.

            Show
            jbosch Jim Bosch added a comment - Looks like I did get the email notifications, but I failed in my attempts to search for them until you and John commented further on the changes earlier today. Anyhow, I've figured out what happened: I think you just commented on the Git commits, instead of actually creating a PR, and that's why it didn't get linked or appear in the list of PRs on GitHub. I'll reply there and then paste everything here so the conversation isn't lost when I rebase.
            Hide
            jbosch Jim Bosch added a comment - - edited

            Copying the review here:

            Lauren MacArthur, on the MultiMatch constructor:

            Should it be noted/documented somewhere that the default radius is 0.5 arcsec if None is provided?

            Is 0.5 arcsec a bit generous for matching on astrometrically calibrated (although roughly) positions? Should this be set based on the estimated astrometric scatter? This may be moot as I think your rejection criteria account for any mismatches leaking in due to a large match radius, but restricting the match radius can shrink the initial number of "goodMatches", e.g.
            ========================================================================
            radius = 0.5
            Number of goodMatches = 89155
            Number of safeMatches = 3468
            RMS = 0.019516 magnitudes, with 0.49% (143 of 29362) measurements clipped as outliers.
            ========================================================================
            radius = 0.21 # rougly 3*astrom scatter
            Number of goodMatches = 64889
            Number of safeMatches = 3518
            RMS = 0.019513 magnitudes, with 0.49% (145 of 29811) measurements clipped as outliers.
            ========================================================================

            Do you need to add radius = radius*lsst.afw.geom.arcseconds in the case where the call provides a float/int?

            Define/describe idField and RecordClass in Arguments list. Indicate expected units for radius.

            Is there a reason you didn't use the @param[in] format here?

            Jim Bosch:

            I'll document the default radius and I think I'll make it throw if you pass in a float, not an angle, just to avoid the case where someone passes in a float they think will be interpreted in some other unit.

            Will add the other missing arguments to the docs too.

            In pure Python code, I don't think we're required to use Doxygen markup, and Doxygen will handle it fine without it as long as we don't start the string with a !. I'm also aware that we'll be switching away from Doxygen to Sphinx in the near future (with yet another markup language), so this is probably actually a bit more future proof than using Doxygen markup right now.

            Lauren MacArthur:

            Copy that. But just to be clear, I was really wondering why you didn't use the @param[in] format here, but did use it for many functions below (e.g. build, aggregate, apply).

            Jim Bosch:

            Oh, I guess I didn't realize I was being inconsistent within a particular file. I don't really have a reason; it may have been code that began life as part of an IPython notebook vs. code that began life elsewhere.

            If I'd seen this a moment earlier, I think I'd have fixed it because I do think consistency within a file is moderately important. But given that I've just merged it and it'll all be reformatted into restructed text soon enough, I'm just going to leave it.

            Lauren MacArthur:

            Would it be better to use, e.g., dataType instead of type here (since type is a python internal)?

            Jim Bosch:

            Will do.

            Lauren MacArthur, on the class docs for GroupView:

            Funny sentence construction here: "that provides an convenient operations on the"

            Jim Bosch

            Will remove the "an".

            Show
            jbosch Jim Bosch added a comment - - edited Copying the review here: Lauren MacArthur , on the MultiMatch constructor: Should it be noted/documented somewhere that the default radius is 0.5 arcsec if None is provided? Is 0.5 arcsec a bit generous for matching on astrometrically calibrated (although roughly) positions? Should this be set based on the estimated astrometric scatter? This may be moot as I think your rejection criteria account for any mismatches leaking in due to a large match radius, but restricting the match radius can shrink the initial number of "goodMatches", e.g. ======================================================================== radius = 0.5 Number of goodMatches = 89155 Number of safeMatches = 3468 RMS = 0.019516 magnitudes, with 0.49% (143 of 29362) measurements clipped as outliers. ======================================================================== radius = 0.21 # rougly 3*astrom scatter Number of goodMatches = 64889 Number of safeMatches = 3518 RMS = 0.019513 magnitudes, with 0.49% (145 of 29811) measurements clipped as outliers. ======================================================================== Do you need to add radius = radius*lsst.afw.geom.arcseconds in the case where the call provides a float/int? Define/describe idField and RecordClass in Arguments list. Indicate expected units for radius. Is there a reason you didn't use the @param[in] format here? Jim Bosch : I'll document the default radius and I think I'll make it throw if you pass in a float, not an angle, just to avoid the case where someone passes in a float they think will be interpreted in some other unit. Will add the other missing arguments to the docs too. In pure Python code, I don't think we're required to use Doxygen markup, and Doxygen will handle it fine without it as long as we don't start the string with a !. I'm also aware that we'll be switching away from Doxygen to Sphinx in the near future (with yet another markup language), so this is probably actually a bit more future proof than using Doxygen markup right now. Lauren MacArthur : Copy that. But just to be clear, I was really wondering why you didn't use the @param[in] format here, but did use it for many functions below (e.g. build, aggregate, apply). Jim Bosch : Oh, I guess I didn't realize I was being inconsistent within a particular file. I don't really have a reason; it may have been code that began life as part of an IPython notebook vs. code that began life elsewhere. If I'd seen this a moment earlier, I think I'd have fixed it because I do think consistency within a file is moderately important. But given that I've just merged it and it'll all be reformatted into restructed text soon enough, I'm just going to leave it. Lauren MacArthur : Would it be better to use, e.g., dataType instead of type here (since type is a python internal)? Jim Bosch : Will do. Lauren MacArthur , on the class docs for GroupView : Funny sentence construction here: "that provides an convenient operations on the" Jim Bosch Will remove the "an".
            Hide
            jbosch Jim Bosch added a comment -

            Rebased and merged to master.

            Show
            jbosch Jim Bosch added a comment - Rebased and merged to master.
            Hide
            lauren Lauren MacArthur added a comment -

            Jim Bosch Did you also see the comments for this commit:
            https://github.com/lsst/afw/commit/f9bfe4543b8c75c342cca0ece11c8803a6dc3a81

            Not a huge deal, but I do believe there is a typo in your notebook in box [13] (which likely filtered into Vishal Kasliwal [X]'s script).

            This is the comment:

            In [13] I think you meant get(modelMagKey) in:
            modelMag = numpy.mean(cat.get(psfMagKey))

            I ran this with a full focal plane run of the same visits. With the above fix the sigma went down slightly from 0.223 to 0.195. The difference in the number of safe matches was:
            Old Number of safeMatches = 4024 # (before modelMagKey fix in notebook)
            Number of safeMatches = 3468

            Redundant call in [15]

            Truncate sigma printing in [17]:
            ax2.plot(gaussian, y, 'r-', label="Gaussian with sigma={0:0.4f}".format(sigma))

            Maybe indicate the total number of measurements (before and after clipping) somewhere?

            Show
            lauren Lauren MacArthur added a comment - Jim Bosch Did you also see the comments for this commit: https://github.com/lsst/afw/commit/f9bfe4543b8c75c342cca0ece11c8803a6dc3a81 Not a huge deal, but I do believe there is a typo in your notebook in box [13] (which likely filtered into Vishal Kasliwal [X] 's script). This is the comment: In [13] I think you meant get(modelMagKey) in: modelMag = numpy.mean(cat.get(psfMagKey)) I ran this with a full focal plane run of the same visits. With the above fix the sigma went down slightly from 0.223 to 0.195. The difference in the number of safe matches was: Old Number of safeMatches = 4024 # (before modelMagKey fix in notebook) Number of safeMatches = 3468 Redundant call in [15] Truncate sigma printing in [17]: ax2.plot(gaussian, y, 'r-', label="Gaussian with sigma={0:0.4f}".format(sigma)) Maybe indicate the total number of measurements (before and after clipping) somewhere?
            Hide
            vpk24 Vishal Kasliwal [X] (Inactive) added a comment -

            I saw that typo too and changed it in my version of Jim's script.

            Show
            vpk24 Vishal Kasliwal [X] (Inactive) added a comment - I saw that typo too and changed it in my version of Jim's script.
            Hide
            jbosch Jim Bosch added a comment -

            Lauren MacArthur thanks for the ping - I did not see that. I'll fix that on master on another issue.

            Show
            jbosch Jim Bosch added a comment - Lauren MacArthur thanks for the ping - I did not see that. I'll fix that on master on another issue.

              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Lauren MacArthur
              Watchers:
              Jim Bosch, Lauren MacArthur, Vishal Kasliwal [X] (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.