RFC-576

Propagate input headers to output files

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      Currently, when output files are written from tasks, for example calibration products, the output headers are constructed without anything explicitly coming from the input files. For example the header for a processed bias file looks something like:

      OBSTYPE = 'bias    '                                                            
      HIERARCH CALIB_CREATION_DATE = '2016-03-28'                                     
      HIERARCH CALIB_CREATION_TIME = '16:16:33 EDT'                                   
      HIERARCH CALIB_CREATION_ROOT = '/tigress/HSC/HSC/rerun/dm-5124/calib'           
      HIERARCH CALIB_INPUT_0 = '(904542,)'                                            
      HIERARCH CALIB_INPUT_1 = '(904544,)'                                            
      HIERARCH CALIB_INPUT_2 = '(904546,)'                                            
      HIERARCH CALIB_INPUT_3 = '(904548,)'                                            
      HIERARCH CALIB_INPUT_4 = '(904550,)'                                            
      HIERARCH CALIB_INPUT_5 = '(904552,)'                                            
      HIERARCH CALIB_INPUT_6 = '(904554,)'                                            
      HIERARCH CALIB_INPUT_7 = '(904556,)'                                            
      HIERARCH CALIB_INPUT_8 = '(904558,)'                                            
      HIERARCH CALIB_INPUT_9 = '(904560,)'                                            
      HIERARCH CALIB_INPUT_10 = '(904562,)'                                           
      HIERARCH CALIB_INPUT_11 = '(904564,)'                                           
      HIERARCH CALIB_INPUT_12 = '(904566,)'                                           
      CALIB_ID= 'filter=NONE calibDate=2013-11-03 ccd=50'                             
      HIERARCH MD5_IMAGE = '6185bb72f20de7e81c45e6e6591eb6ad' 
      

      (eliding the WCS headers and the mandatory headers).

      This header contains the provenance information via the input visit numbers and encodes the critical information for how this was formed in the CALIB_ID header. For a butler user this is sufficient to work out what type of data was used (for the gen 2 butler the instrument is fixed for the entire repository) and presumably what configuration was used.

      From a metadata translation perspective I do not have sufficient information in this header to work out what the data are. I can't even tell which telescope it comes from. If this file is copied out of the butler repository it's now very difficult to know anything about it.

      In this RFC I propose that if we are writing headers into FITS files we should write a full set of usable headers and not require that people use provenance to work out basic information.

      Specifically:

      • The output file should be seeded with a header created from all the input files, dropping any headers that have values that are different.
      • We allow specific headers (via configuration) from the earliest file and latest file being processed to be added to the output headers.

      The second point is to allow time-dependent headers to appear in the output files to give the reader of the header an idea of how conditions changed or the range of time covered by the output product. For example, we plan to write DATE-OBS and DATE-END headers to LSST files. Propagating the oldest DATE-OBS and the newest DATE-END will give an idea of how much time elapsed between the first and last observation. This is not to be confused with the validity date.
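
      The two rules above can be sketched as follows. This is only an illustration, not the DM-16292 prototype: headers are treated as plain dictionaries, the function name `merge_headers` and its `first_keys`/`last_keys` parameters are hypothetical stand-ins for the configuration item, and inputs are assumed to carry an ISO 8601 DATE-OBS that can be used for ordering.

```python
from datetime import datetime


def merge_headers(headers, first_keys=(), last_keys=(), date_key="DATE-OBS"):
    """Combine per-input headers (plain dicts) into one output header.

    Hypothetical helper: keeps every key whose value is identical in all
    inputs, then copies the configured keys from the earliest and latest
    input, ordered by ``date_key``.
    """
    if not headers:
        return {}

    # Order the inputs in time so "earliest" and "latest" are well defined.
    ordered = sorted(headers, key=lambda h: datetime.fromisoformat(h[date_key]))
    earliest, latest = ordered[0], ordered[-1]

    # Rule 1: keep only keys present in every input with the same value.
    merged = {
        key: value
        for key, value in earliest.items()
        if all(key in h and h[key] == value for h in ordered[1:])
    }

    # Rule 2: configured time-dependent keys come from the first/last input.
    for key in first_keys:
        if key in earliest:
            merged[key] = earliest[key]
    for key in last_keys:
        if key in latest:
            merged[key] = latest[key]
    return merged
```

      Calling `merge_headers(inputs, first_keys=["DATE-OBS"], last_keys=["DATE-END"])` would keep a shared INSTRUME value while reporting the oldest DATE-OBS and the newest DATE-END.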

      I have a prototype for this working in DM-16292 for calibrations. With these changes an AuxTel reduced bias now looks like:

      HEADVER =                    2 / Version number of header                       
      INSTRUME= 'LSST_ATISS'         / Instrument                                     
      TELESCOP= 'LSST_AuxTel'        / Telescope                                      
      SEQFILE = 'ats_20180511.seq'   / Sequencer file name                            
      CCD_MANU= 'ITL     '           / CCD Manufacturer                               
      CCD_TYPE= '3800C   '           / CCD Model Number                               
      CCD_SERN= '20304   '           / Manufacturers' CCD Serial Number               
      LSST_NUM= 'ITL-3800C-098'      / LSST Assigned CCD Number                       
      DETSIZE = '[1:4072,1:4000]'    / IRAF detector size                             
      EXPTIME =                   0. / Exposure Time in Seconds                       
      TELCODE = 'AT      '           / The "code" for AuxTel                          
      CONTRLLR= 'C       '           / The controller (e.g. O for OCS, C for CCS)     
      DAYOBS  = '20180920'           / The observation day as defined in the image nam
      MJD-OBS =     58382.2355372338 / Modified Julian Date of image acquisition      
      DATE-OBS= '2018-09-21T05:39:10.417' / Date of the observation (image acquisition
      DARKTIME=                   0.                                                  
      ROTTYPE = 'UNKNOWN '           / Type of rotation angle                         
      OBSTYPE = 'bias    '                                                            
      HIERARCH CALIB_CREATION_DATE = '2019-02-15'                                     
      HIERARCH CALIB_CREATION_TIME = '15:13:07 MST'                                   
      DATE-AVG= '2018-09-21T00:00:00.00'                                              
      HIERARCH CALIB_INPUT_0 = '(2018092000028,)'                                     
      HIERARCH CALIB_INPUT_1 = '(2018092000029,)'                                     
      HIERARCH CALIB_INPUT_2 = '(2018092000030,)'                                     
      HIERARCH CALIB_INPUT_3 = '(2018092000031,)'                                     
      HIERARCH CALIB_INPUT_4 = '(2018092000032,)'                                     
      HIERARCH CALIB_INPUT_5 = '(2018092000033,)'                                     
      HIERARCH CALIB_INPUT_6 = '(2018092000034,)'                                     
      HIERARCH CALIB_INPUT_7 = '(2018092000035,)'                                     
      HIERARCH CALIB_INPUT_8 = '(2018092000036,)'                                     
      CALIB_ID= 'detector=0 detectorName=S00 filter=NONE calibDate=2018-09-21'        
      HIERARCH MD5_IMAGE = 'eda3f77dfa7ea894a0f796e24615a42a'                         
      MD5_MASK= 'fb394efc9596d6865b4dc311839da4f9'                                    
      HIERARCH MD5_VARIANCE = '8df0c2327f53ee09cb391743817c3f94'                      
      HIERARCH EXPINFO_V =         0                                                  
      AR_HDU  =                    5 / HDU (1-indexed) containing the archive used to 
      FILTER  = '_unknown_'                                                           
      FLUXMAG0=                   0.                                                  
      HIERARCH FLUXMAG0ERR =      0.                 
      

      This header is immediately understandable as being from an AuxTel observation.

      As an aside, calibrations in pipe_drivers ostensibly report the "average" date in the headers, but it is stored in DATE-OBS rather than DATE-AVG, and it seems to be the average of full days and not the real average.
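
      For comparison, a "real" average of the observation times could be computed along these lines. This is a minimal sketch using only the Python standard library; `mean_datetime` is a hypothetical name and this is not how pipe_drivers currently calculates the value.

```python
from datetime import datetime, timedelta


def mean_datetime(timestamps):
    """Return the arithmetic mean of ISO 8601 timestamps as a datetime."""
    times = [datetime.fromisoformat(t) for t in timestamps]
    reference = min(times)
    # Datetimes cannot be summed directly, so average the offsets from a
    # reference time and add the mean offset back on.
    total = sum((t - reference for t in times), timedelta(0))
    return reference + total / len(times)
```

      Averaging whole days instead would discard the time of day from every input, which is why the result can differ from the value computed here.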

            Activity

            rhl Robert Lupton added a comment -

            While I have no problem with Tim adding any headers that he feels like to the calibration products, I do object to the idea that these files can now be copied out of the LSST system (in particular the calibration registry) with impunity. That would be an export, and at that point we would add any required metadata (e.g. the validity range, which is not and cannot be tracked by Tim's proposed new keywords).

            To put it another way, if we add these keys we should add an explicit rule that no pipeline code may rely on any field in these headers.

            swinbank John Swinbank added a comment -

            First, I should apologise because I've not been following this discussion closely, and what follows may be a dumb question.

            However: I'm confused about the distinction between implementation and specification that's implied by this RFC.

            It seems totally reasonable to me to insist that calibration products be accompanied by appropriate metadata, and that, when those products are exported from the DM system, that metadata should be packaged with them.

            However, it's not obvious why we'd want to legislate that in terms of copying specific headers from one file to another. Why don't we start by writing down what headers are required in the output? Once we've defined that data model, it can then be independent of any particular implementation: whether that metadata is copied from an input file or being pulled from a database or whatever doesn't seem like a very interesting question. And in years to come, when it's all HDF5 or whatever, you won't be scratching your head over obscure rules that are specific to FITS...

            tjenness Tim Jenness added a comment -

            I used FITS headers as examples only because we are currently writing FITS headers.

            The proposal is that output files should always include all the headers from the input files that are the same in all inputs. This is data model independent: If all the files have identical INSTRUME headers then we get an INSTRUME in the output file. Output files will have headers in them even if they are written in HDF5 format.

            Additionally, I would like some information in the output files to indicate the range of conditions and times covering the processed product. My proposal was that the simplest way to do this is to obtain them directly from the input files (oldest and newest) and use a config item to indicate which keys are relevant for the particular instrument. This is a general approach that does not require that each camera share the same headers. In the past I've done this for many time-dependent headers to indicate azimuth/elevation start/end, airmass start/end, and weather conditions at start and end. I wasn't asking for the list of headers on this RFC; I was more interested in seeing whether people objected to the idea in general.

            pipe_drivers currently attempts to use VisitInfo to generate LSST-specific output headers, but it only does that for dark time and exposure time. If pipe_drivers calibration construction were rewritten to calculate these range values from ObservationInfo, and afw.image.setVisitInfoMetadata were replaced with something that understood a collection of ObservationInfo, then the DATE-OBS/DATE-END information could be derived directly without requiring that we get it from the raw data (so no need to specify instrument-specific time-dependent headers). We currently have more time-dependent headers defined in LSE-400 than are present in ObservationInfo (and many more than are in VisitInfo), but that approach might work.

            John Swinbank would you be happier with a modified proposal whereby all the constant value headers are propagated and time-dependent values are handled via ObservationInfo collections in a general way?
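
            As a rough illustration of that modified proposal (not existing pipe_drivers or afw code), the time range could be derived from a collection of translated records. Here `ObsRecord` is a hypothetical stand-in for ObservationInfo that holds plain `datetime` values, and `time_range_headers` is an assumed helper name.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ObsRecord:
    """Hypothetical stand-in for a translated ObservationInfo record."""

    datetime_begin: datetime
    datetime_end: datetime


def time_range_headers(records):
    """Derive DATE-OBS/DATE-END from a collection of records.

    No instrument-specific header names are needed: the earliest start and
    the latest end are taken from the standardized record fields.
    """
    if not records:
        return {}
    start = min(r.datetime_begin for r in records)
    end = max(r.datetime_end for r in records)
    return {"DATE-OBS": start.isoformat(), "DATE-END": end.isoformat()}
```

            The same pattern would extend to other standardized quantities, provided they are available in the translated records.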

            swinbank John Swinbank added a comment -

            Hi Tim Jenness — just to be clear, I'm not actually unhappy with this proposal, just a little unsure of my ground.

            It seems surprising to me that we'd define the metadata for a product by naïvely propagating the metadata of its inputs, rather than by writing down what the required metadata for that type of product is. But... that's “surprising”, rather than “wrong” or “upsetting” — I'm totally prepared to believe this is the right way to do things, either as a matter of safety (ie, being sure you don't miss anything) or of expediency (it's quicker & easier to do this than to have some master document where we document the metadata needed for all of our products).

            Perhaps some of the thinking about what metadata is necessary where has already been done in LSE-400? Unfortunately, I can't find that (it's not at ls.st/lse-400*, https://lse-400.lsst.io/, https://github.com/lsst/lse-400, or https://www.lsst.io/#lse), so it doesn't help me much.

            I also don't object to your proposal to select DATE-OBS and DATE-END headers. It does seem like this is something that should be defined in a document somewhere, though, rather than just as a matter of implementation.

            Speaking of implementation, I entirely defer to you (and Robert, and whoever else feels strongly about it) about the mechanism by which this metadata is propagated; I'm sure you understand that much better than I do.

            tjenness Tim Jenness added a comment -

            Adopting this as:

            1. Any headers that are present in all inputs with identical values will appear in the output file.
            2. I will expand the VisitInfo to Header behavior such that standardized information can come from the earliest and latest input files.

            tjenness Tim Jenness added a comment -

            pipe_drivers has been updated so that header propagation now occurs for calibration products. I will mark this RFC as implemented but note that this RFC also covers generation of tracts/patches and the changes to pipe_drivers must end up in the new calibration products pipeline.


              People

              • Assignee:
                tjenness Tim Jenness
                Reporter:
                tjenness Tim Jenness
                Watchers:
                John Parejko, John Swinbank, Kian-Tat Lim, Robert Lupton, Tim Jenness, Wil O'Mullane