Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-29936

Enable getting Children without repeatedly checking if the SourceCatalog is sorted

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: Won't Fix
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      While Fred Moolekamp was chasing the errors in DM-29907, it was realized that `afwTable.SourceCatalog` forbids getting children if the catalog is unsorted. In checking if the catalog is sorted, it performs an operation that is O( n), which is as expensive as fetching the children even from an unsorted catalog. Moreover, this check is performed for each parent, thereby making getting the children for all parents an O(n^2) operation. However, it is desirable to check (somewhere, once) if the catalog is sorted, as "unsortedness" could indicate a potentially serious problem elsewhere. This ticket captures potential solutions, and discussions made towards that on Slack.

        Attachments

          Issue Links

            Activity

            No builds found.
            kannawad Arun Kannawadi created issue -
            kannawad Arun Kannawadi made changes -
            Field Original Value New Value
            Link This issue relates to DM-29907 [ DM-29907 ]
            kannawad Arun Kannawadi made changes -
            Link This issue relates to DM-29737 [ DM-29737 ]
            kannawad Arun Kannawadi made changes -
            Description While [~fred3m] was chasing the errors in DM-29907, it was realized that `afwTable.SourceCatalog` forbids getting children if the catalog is unsorted. In checking if the catalog is sorted, it performs an operation that is O(n), which is as expensive as fetching the children even from an unsorted catalog. Moreover, this check is performed for each parent, thereby making getting the children an O(n^2) operation. However, it is desirable to check (somewhere, once) if the catalog is sorted, as "unsortedness" could indicate a potentially serious problem elsewhere. This ticket captures potential solutions, and discussions made towards that on Slack. While [~fred3m] was chasing the errors in DM-29907, it was realized that `afwTable.SourceCatalog` forbids getting children if the catalog is unsorted. In checking if the catalog is sorted, it performs an operation that is O( n), which is as expensive as fetching the children even from an unsorted catalog. Moreover, this check is performed for each parent, thereby making getting the children an O(n^2) operation. However, it is desirable to check (somewhere, once) if the catalog is sorted, as "unsortedness" could indicate a potentially serious problem elsewhere. This ticket captures potential solutions, and discussions made towards that on Slack.
            Hide
            kannawad Arun Kannawadi added a comment -

            Jim Bosch suggested that a context manager interface might be a good way to make progress on this. This is a time-consuming, but a more-principled solution.
            https://lsstc.slack.com/archives/C2JPMCF5X/p1619627754121800?thread_ts=1619626296.119600&cid=C2JPMCF5X

            Show
            kannawad Arun Kannawadi added a comment - Jim Bosch suggested that a context manager interface might be a good way to make progress on this. This is a time-consuming, but a more-principled solution. https://lsstc.slack.com/archives/C2JPMCF5X/p1619627754121800?thread_ts=1619626296.119600&cid=C2JPMCF5X
            Hide
            fred3m Fred Moolekamp added a comment -

            If we want a faster algorithm to check that the children are ordered by parent, can't we just check that the parents are ordered too? In other words in python

            np.all(catalog["id"][1:]-catalog["id"][:-1] > 0)
            np.all(catalog["parent"][1:]-catalog["id"][:-1] >= 0)
            

            That way you could run it one for all of the parents.

            Show
            fred3m Fred Moolekamp added a comment - If we want a faster algorithm to check that the children are ordered by parent, can't we just check that the parents are ordered too? In other words in python np. all (catalog[ "id" ][ 1 :] - catalog[ "id" ][: - 1 ] > 0 ) np. all (catalog[ "parent" ][ 1 :] - catalog[ "id" ][: - 1 ] > = 0 ) That way you could run it one for all of the parents.
            Hide
            kannawad Arun Kannawadi added a comment -

            Checking if the catalog is ordered or not before calling `getChildren` method was my initial thought as well, but then we will have to make that check everywhere before this method is invoked (which was only a handful number of times). But Jim's concern was that there may be other functions that still assume that the catalog is sorted and might fail. Either way, we need to remove this check within the `getChildren` method here, which will force a check given a parent: https://github.com/lsst/afw/blob/f21da11f85345907a2a1784b5f2f45b65e25ac04/python/lsst/afw/table/_source.py#L61

            Show
            kannawad Arun Kannawadi added a comment - Checking if the catalog is ordered or not before calling `getChildren` method was my initial thought as well, but then we will have to make that check everywhere before this method is invoked (which was only a handful number of times). But Jim's concern was that there may be other functions that still assume that the catalog is sorted and might fail. Either way, we need to remove this check within the `getChildren` method here, which will force a check given a parent: https://github.com/lsst/afw/blob/f21da11f85345907a2a1784b5f2f45b65e25ac04/python/lsst/afw/table/_source.py#L61
            Hide
            kannawad Arun Kannawadi added a comment - - edited

            To motivate this case further, I did a timing test on a catalog (sorted) with ~2000 parents. Getting their children by running the `getChildren` method on each of the parent consistently took > 1s, whereas skipping that check runs in ~0.05s. So I think this is an optimization worth making.

            Edit: Adding a one-time check did not make any significant difference by itself.

            Show
            kannawad Arun Kannawadi added a comment - - edited To motivate this case further, I did a timing test on a catalog (sorted) with ~2000 parents. Getting their children by running the `getChildren` method on each of the parent consistently took > 1s, whereas skipping that check runs in ~0.05s. So I think this is an optimization worth making. Edit: Adding a one-time check did not make any significant difference by itself.
            kannawad Arun Kannawadi made changes -
            Description While [~fred3m] was chasing the errors in DM-29907, it was realized that `afwTable.SourceCatalog` forbids getting children if the catalog is unsorted. In checking if the catalog is sorted, it performs an operation that is O( n), which is as expensive as fetching the children even from an unsorted catalog. Moreover, this check is performed for each parent, thereby making getting the children an O(n^2) operation. However, it is desirable to check (somewhere, once) if the catalog is sorted, as "unsortedness" could indicate a potentially serious problem elsewhere. This ticket captures potential solutions, and discussions made towards that on Slack. While [~fred3m] was chasing the errors in DM-29907, it was realized that `afwTable.SourceCatalog` forbids getting children if the catalog is unsorted. In checking if the catalog is sorted, it performs an operation that is O( n), which is as expensive as fetching the children even from an unsorted catalog. Moreover, this check is performed for each parent, thereby making getting the children for all parents an O(n^2) operation. However, it is desirable to check (somewhere, once) if the catalog is sorted, as "unsortedness" could indicate a potentially serious problem elsewhere. This ticket captures potential solutions, and discussions made towards that on Slack.
            kannawad Arun Kannawadi made changes -
            Summary Enable getting Children even if the SourceCatalog is unsorted Enable getting Children without repeatedly checking if the SourceCatalog is sorted
            kannawad Arun Kannawadi made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            kannawad Arun Kannawadi made changes -
            Assignee Arun Kannawadi [ kannawad ]
            Show
            kannawad Arun Kannawadi added a comment - - edited Jenkins run: - https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/34145/pipeline- https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/34154/pipeline
            kannawad Arun Kannawadi made changes -
            Reviewers Fred Moolekamp [ fred3m ]
            Status In Progress [ 3 ] In Review [ 10004 ]
            Hide
            kannawad Arun Kannawadi added a comment -

            There are two PRs for this ticket. Jira is picking up only one.

            afw: https://github.com/lsst/afw/pull/585
            meas_base: https://github.com/lsst/meas_base/pull/174

            Show
            kannawad Arun Kannawadi added a comment - There are two PRs for this ticket. Jira is picking up only one. afw: https://github.com/lsst/afw/pull/585 meas_base: https://github.com/lsst/meas_base/pull/174
            Hide
            fred3m Fred Moolekamp added a comment -

            Thanks for pointing out the missing PR associated with the ticket. Everything looks good, it's a great approach.

            Show
            fred3m Fred Moolekamp added a comment - Thanks for pointing out the missing PR associated with the ticket. Everything looks good, it's a great approach.
            fred3m Fred Moolekamp made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            Hide
            kannawad Arun Kannawadi added a comment -

            Thanks for the review, Fred! I'm glad you like the generator-based approach. For the record, the other approaches I considered were:

            1. having an attribute `_sorted` associated with each catalog, which has to be set every time a record is inserted, or catalogs are merged. Instead of running `isSorted` method, one can check the attribute alone.
            2. Context Manager interface that Jim suggested.

            However, both those methods seemed like over-engineering. The current implementation seemed close to ideal, but not perfect since it still checks if the catalog is sorted once more than necessary - once when fetching the parents and once when fetching the children. That's a redundancy that we can afford to have easily.

            Show
            kannawad Arun Kannawadi added a comment - Thanks for the review, Fred! I'm glad you like the generator-based approach. For the record, the other approaches I considered were: 1. having an attribute `_sorted` associated with each catalog, which has to be set every time a record is inserted, or catalogs are merged. Instead of running `isSorted` method, one can check the attribute alone. 2. Context Manager interface that Jim suggested. However, both those methods seemed like over-engineering. The current implementation seemed close to ideal, but not perfect since it still checks if the catalog is sorted once more than necessary - once when fetching the parents and once when fetching the children. That's a redundancy that we can afford to have easily.
            kannawad Arun Kannawadi made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]
            kannawad Arun Kannawadi made changes -
            Story Points 2
            jbosch Jim Bosch made changes -
            Resolution Done [ 10000 ]
            Status Done [ 10002 ] To Do [ 10001 ]
            jbosch Jim Bosch made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            Hide
            jbosch Jim Bosch added a comment -

            Putting this back In Progress for now; the merges caused ci_hsc failures, and I reverted them Saturday morning (didn't have time to investigate further, and didn't want to ask anyone else to over the weekend).  I imagine we'll want to create a new ticket for the second attempt, move the commits to that branch, and then just mark this one as Invalid, but that can wait for next week.

            Show
            jbosch Jim Bosch added a comment - Putting this back In Progress for now; the merges caused ci_hsc failures, and I reverted them Saturday morning (didn't have time to investigate further, and didn't want to ask anyone else to over the weekend).  I imagine we'll want to create a new ticket for the second attempt, move the commits to that branch, and then just mark this one as Invalid, but that can wait for next week.
            Hide
            kannawad Arun Kannawadi added a comment -

            Apologies! That's the first time that happened to me. I'll have a look at the logs by Monday. I have a rough idea what might have gone wrong.

            Show
            kannawad Arun Kannawadi added a comment - Apologies! That's the first time that happened to me. I'll have a look at the logs by Monday. I have a rough idea what might have gone wrong.
            Hide
            kannawad Arun Kannawadi added a comment -

            I think I've fixed it now. There was a logic bug in the `subset` method `references.py` file that caused children sources that are outside of the bounding box also to be returned. The fix is fairly simple and pushed it in the same branch. Since the fix is simple, I don't know if it can be in the same ticket or has to come from another ticket necessarily because the PR was reverted.

            Show
            kannawad Arun Kannawadi added a comment - I think I've fixed it now. There was a logic bug in the `subset` method `references.py` file that caused children sources that are outside of the bounding box also to be returned. The fix is fairly simple and pushed it in the same branch. Since the fix is simple, I don't know if it can be in the same ticket or has to come from another ticket necessarily because the PR was reverted.
            Hide
            kannawad Arun Kannawadi added a comment -

            Weirdly, ci_hsc_gen3 doesn't have build issues, only ci_hsc_gen2 does. Is this a concern?

            Show
            kannawad Arun Kannawadi added a comment - Weirdly, ci_hsc_gen3 doesn't have build issues, only ci_hsc_gen2 does. Is this a concern?
            Hide
            jbosch Jim Bosch added a comment -

            This happens rarely enough that we don't have a policy (AFAIK) for whether to make it another ticket, but I would recommend another ticket just to avoid various kinds of clashes complicating things.

            There are some things ci_hsc_gen2 does that ci_hsc_gen3 does not, and they take different code paths in much of the I/O.  I wouldn't have expected them to differ on this ticket, but that's not a strong expectation and I wouldn't be able to say more without looking much more closely.

            Show
            jbosch Jim Bosch added a comment - This happens rarely enough that we don't have a policy (AFAIK) for whether to make it another ticket, but I would recommend another ticket just to avoid various kinds of clashes complicating things. There are some things ci_hsc_gen2 does that ci_hsc_gen3 does not, and they take different code paths in much of the I/O.  I wouldn't have expected them to differ on this ticket, but that's not a strong expectation and I wouldn't be able to say more without looking much more closely.
            kannawad Arun Kannawadi made changes -
            Link This issue is duplicated by DM-30105 [ DM-30105 ]
            Hide
            kannawad Arun Kannawadi added a comment -

            Alright. Marking this as Won't Fix now and will create a new PR with the patch included in DM-30105.

            Show
            kannawad Arun Kannawadi added a comment - Alright. Marking this as Won't Fix now and will create a new PR with the patch included in DM-30105 .
            kannawad Arun Kannawadi made changes -
            Status In Progress [ 3 ] To Do [ 10001 ]
            kannawad Arun Kannawadi made changes -
            Resolution Done [ 10000 ]
            Status To Do [ 10001 ] Won't Fix [ 10405 ]
            Hide
            kannawad Arun Kannawadi added a comment -

            To conclude the gen2/gen3 differences in this ticket: in forcedPhotCcd.py, ci_hsc_gen2 gets the `refCat` by calling the `fetchReferences` method from `runDataRef` whereas, ci_hsc_gen3 gets the same by calling `mergeAndFilterReferences` method in `runQuantum`. Only the first method takes the code path that contained the bug, so it is perfectly reasonable that ci_hsc_gen3 build passed but not ci_hsc_gen2.

            Show
            kannawad Arun Kannawadi added a comment - To conclude the gen2/gen3 differences in this ticket: in forcedPhotCcd.py, ci_hsc_gen2 gets the `refCat` by calling the `fetchReferences` method from `runDataRef` whereas, ci_hsc_gen3 gets the same by calling `mergeAndFilterReferences` method in `runQuantum`. Only the first method takes the code path that contained the bug, so it is perfectly reasonable that ci_hsc_gen3 build passed but not ci_hsc_gen2.
            yusra Yusra AlSayyad made changes -
            Story Points 2 0

              People

              Assignee:
              kannawad Arun Kannawadi
              Reporter:
              kannawad Arun Kannawadi
              Reviewers:
              Fred Moolekamp
              Watchers:
              Arun Kannawadi, Fred Moolekamp, Jim Bosch
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.