  1. Data Management
  2. DM-19616

Make IngestIndexReferenceObjectsTask multiprocessing capable

    Details

    • Story Points:
      8
    • Epic Link:
    • Sprint:
      AP S19-6, AP F19-1
    • Team:
      Alert Production

      Description

      In order to be able to generate a Gaia (or PS1-DR2) reference catalog in a reasonable amount of time, we need a way to process files in parallel. The existing IngestIndexReferenceObjectsTask is not capable of running in parallel. One approach that should be fairly simple is to use multiprocessing plus file locking: lock each output catalog before appending.
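      The "multiprocessing plus file locking" idea can be sketched as below. This is a minimal illustration and not the code that was merged: the function names, the CSV row format, and the `shard_N.csv` output layout are all hypothetical stand-ins for the real catalog files. It assumes a POSIX system (it uses `fcntl.flock` and the "fork" start method) and plain `Process` objects rather than a pool, to keep the sketch short.

      ```python
      import fcntl
      import multiprocessing
      import os


      def append_locked(path, lines):
          """Append lines to path while holding an exclusive advisory lock."""
          with open(path, "a") as f:
              # Blocks until no other process holds a lock on this file.
              fcntl.flock(f, fcntl.LOCK_EX)
              try:
                  f.writelines(line + "\n" for line in lines)
                  f.flush()  # write buffered data out before the lock is released
              finally:
                  fcntl.flock(f, fcntl.LOCK_UN)


      def ingest_file(input_path, output_dir, n_shards):
          """Hypothetical worker step: route each input row to an output shard."""
          by_shard = {}
          with open(input_path) as f:
              for line in f:
                  row = line.strip()
                  if row:
                      # Stand-in for computing a sky-pixel index from the row.
                      shard = int(row.split(",")[0]) % n_shards
                      by_shard.setdefault(shard, []).append(row)
          # Lock each output file only for the duration of one append.
          for shard, rows in by_shard.items():
              append_locked(os.path.join(output_dir, f"shard_{shard}.csv"), rows)


      def _worker(paths, output_dir, n_shards):
          """Process one slice of the input list, file by file."""
          for path in paths:
              ingest_file(path, output_dir, n_shards)


      def ingest_parallel(input_paths, output_dir, n_shards=4, n_processes=2):
          """Split the inputs across n_processes workers and wait for all of them."""
          ctx = multiprocessing.get_context("fork")  # POSIX-only, like fcntl
          procs = [
              ctx.Process(target=_worker,
                          args=(input_paths[i::n_processes], output_dir, n_shards))
              for i in range(n_processes)
          ]
          for proc in procs:
              proc.start()
          for proc in procs:
              proc.join()
      ```

      Because each worker groups its rows by output file before appending, a lock is taken once per (input file, output file) pair rather than once per row, which keeps the time spent waiting on locks small.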

        Attachments

          Issue Links

            Activity

            Parejkoj John Parejko added a comment -

            Thanks for the quick review, Christopher Waters.

            Merged and done.

            Parejkoj John Parejko added a comment - edited

            Watching the CPU load on my desktop while generating my Gaia test set, I could see file contention appear as occasional dips, but the load was pretty consistent otherwise. When I made some estimates of speed, it was roughly a factor of 7 faster with 8 processes than with 1 process, so it seems to be doing a decent job.

            czw Christopher Waters added a comment -

            I had to stop and think about the locking, but this seemed like a reasonable solution.  I have a suspicion that a randomized list of inputs would run faster than one that's sequential (as processes in the pool shouldn't have lock collisions), but that probably isn't a big deal in the long run.
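            The randomized-input idea could be tried with a small helper like the one below. This is a sketch only: `shuffled_inputs` is a hypothetical name, and whether shuffling actually reduces lock collisions depends on an untested assumption, namely that neighbouring input files tend to write to the same output catalogs.

            ```python
            import random


            def shuffled_inputs(input_paths, seed=None):
                """Return a shuffled copy of the input list; the original is untouched."""
                paths = list(input_paths)
                # A dedicated Random instance keeps the shuffle reproducible for a
                # given seed without touching the global random state.
                random.Random(seed).shuffle(paths)
                return paths
            ```

            Passing a fixed `seed` makes two runs process the files in the same order, which helps when comparing timings against the sequential ordering.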

            Parejkoj John Parejko added a comment -

            Did my approach to implementing multiprocessing seem reasonable to you?

            czw Christopher Waters added a comment -

            Everything largely looked fine to me, except for some style issues.  For the moved code, I checked the PR versions, and then spot checked that they matched the removed versions (method signatures, etc).


              People

              • Assignee:
                Parejkoj John Parejko
              • Reporter:
                Parejkoj John Parejko
              • Reviewers:
                Christopher Waters
              • Watchers:
                Christopher Waters, Colin Slater, Jim Bosch, John Parejko, John Swinbank, Paul Price, Simon Krughoff
