  1. Data Management
  2. DM-19491

Enumerate options for creating new HTM Indexed refcats

    Details

    • Story Points:
      2
    • Sprint:
      AP S19-5, AP S19-6
    • Team:
      Alert Production

      Description

      After some Slack discussion, Simon Krughoff, Colin Slater, and Eli Rykoff have all given me suggestions about how to make refcats.

      There are the DR2 .csv files on lsst-dev here: /project/shared/data/gaia_dr2/gaia_source/csv

      and a Parquet version here: /project/shared/data/gaia_dr2/gaia_source.parquet

      meas_algorithms has IngestIndexReferenceTask, which I've been told is very slow. Eli Rykoff suggested using fgcmOutputProduct._outputStandardStars from fgcmcal, but it does not handle proper motion, parallax, or many of the other fields. I should run some simple tests to see just how slow the "standard" code is and estimate what it would take to speed it up. I should also compare it with Eli Rykoff's code and decide whether we should just use something like that. As part of this, I should look at reading Parquet vs. CSV.

        Attachments

          Activity

          Parejkoj John Parejko added a comment -

          After some investigation, it looks like cleaning up IngestIndexReferenceTask is probably the best approach. I've already gotten a ~2x speedup by replacing np.genfromtxt with astropy.table.Table.read. Based on that, I'm not sure that it's worth switching to Parquet. There's another big speedup available once I rework it to operate in parallel by replacing _fillRecord with something that can operate on columns as vectors (_setFlux is the next tall pole). The infrastructure to map input column names to output is complicated enough that it doesn't appear worthwhile to adopt the approach in _outputStandardStars.
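As a rough illustration of the per-record versus per-column distinction above (this is not the actual _fillRecord or _setFlux code), the sketch below converts magnitudes to fluxes one record at a time and then as a single array operation. The function names, the AB zero point of 3631 Jy, and the synthetic magnitudes are all assumptions made for the example.

```python
import numpy as np

AB_ZERO_POINT_JY = 3631.0  # assumed AB zero point; illustrative only


def mags_to_flux_per_record(mags):
    """Convert magnitudes to fluxes one record at a time (slow path)."""
    fluxes = []
    for mag in mags:
        fluxes.append(AB_ZERO_POINT_JY * 10 ** (-0.4 * mag))
    return np.array(fluxes)


def mags_to_flux_vectorized(mags):
    """Convert a whole column of magnitudes in one array operation."""
    return AB_ZERO_POINT_JY * 10 ** (-0.4 * np.asarray(mags, dtype=float))


if __name__ == "__main__":
    # Synthetic magnitudes standing in for one input column.
    rng = np.random.default_rng(42)
    mags = rng.uniform(10.0, 21.0, size=100_000)
    slow = mags_to_flux_per_record(mags)
    fast = mags_to_flux_vectorized(mags)
    # Both paths should give the same fluxes; only the speed differs.
    print(np.allclose(slow, fast))
```

Wrapping each function in time.perf_counter() calls would show how much of the cost comes from the Python-level loop; the real gain in the ingest task would depend on how records are written out, which this sketch does not model.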

          I've attached the config and short script file I'm using to profile ingesting a handful of Gaia DR2 .csv.gz files. The script itself can eventually become either a Driver, or a bin/ executable in meas_algorithms.

          Closing this, since the investigatory portion is done. Now to work on the refactoring.

          price Paul Price added a comment -

          John Parejko: They're in the directory /datasets/refcats/htm/v0/ps1_pv3_3pi_20170110/python. Especially see the file ingestIndexReferenceDriverTask.py.

          Parejkoj John Parejko added a comment -

          Useful datapoint: ReadTextCatalogTask uses np.genfromtxt, which is about a factor of 4 slower than astropy.table.Table.read(filename, format='csv', delimiter=',').
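A comparison like this could be timed with something along the lines of the sketch below. It is not the profiling script attached to this ticket: the helper names, the synthetic three-column catalog, and the column names are hypothetical, and the astropy reader is imported lazily so the rest can be used without it.

```python
import os
import tempfile
import time

import numpy as np


def read_with_genfromtxt(path):
    """Read a CSV with a header row into a structured NumPy array."""
    return np.genfromtxt(path, delimiter=",", names=True)


def read_with_astropy(path):
    """Read the same CSV into an astropy Table (requires astropy)."""
    from astropy.table import Table  # imported lazily; optional dependency
    return Table.read(path, format="csv", delimiter=",")


def time_reader(reader, path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - start)
    return best


def write_fake_catalog(path, n_rows=50_000, seed=0):
    """Write a synthetic catalog; the column names are hypothetical."""
    rng = np.random.default_rng(seed)
    data = np.column_stack([
        rng.uniform(0.0, 360.0, n_rows),   # ra
        rng.uniform(-90.0, 90.0, n_rows),  # dec
        rng.uniform(10.0, 21.0, n_rows),   # mag
    ])
    # comments="" keeps the header line free of a leading "# ".
    np.savetxt(path, data, delimiter=",", header="ra,dec,mag", comments="")
```

Writing a file with write_fake_catalog and then calling time_reader(read_with_genfromtxt, path) and time_reader(read_with_astropy, path) on it gives two numbers to compare. The ratio will vary with file size, column types, and compression, so a synthetic file like this is only a stand-in for the real Gaia DR2 .csv.gz inputs.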

          Parejkoj John Parejko added a comment -

          Paul Price: what scripts are you referring to, and did you mean ps1_pv3 (e.g. /datasets/refcats/htm/v0/ps1_pv3_3pi_20170110/)? I don't see any scripts in that directory other than the config.

          price Paul Price added a comment -

          It would be great if we could have a standard script (or base class) for generating the refcats in a fast way.

          Another thing to look at is the set of scripts in the ps1_pv2 catalog, which might be a start in that direction.


            People

            • Assignee:
              Parejkoj John Parejko
              Reporter:
              Parejkoj John Parejko
              Watchers:
              Colin Slater, Eli Rykoff, Jim Bosch, John Parejko, John Swinbank, Paul Price, Simon Krughoff