Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-41913

TAP service misidentifies types of some computed columns

    XMLWordPrintable

Details

    • SQuaRE
    • No

    Description

      In developing a response to the user problem report here:
      https://community.lsst.org/t/query-failing-for-dp0-3-when-trying-to-calculate-summary-statistics-over-5-years/8116

      (which is principally about the ability of TAP to handle large query result sets), we happened to notice that the TAP service was misidentifying the type of some values returned from complex aggregate expressions.

      Original query

      The original query was:

              select sss.ssObjectId,
              band,
              count(phaseAngle), 
              min(phaseAngle) as minPhase, 
              max(phaseAngle) as maxPhase, 
              (max(phaseAngle) - min(phaseAngle))/count(sss.ssObjectId) as phaseDensity, 
              SUM(CASE WHEN phaseAngle < 2 THEN 1 ELSE 0 END) AS oppCOV, 
              SUM(CASE WHEN phaseAngle > 2 AND phaseAngle < 15 THEN 1 ELSE 0 END) AS modCOV, 
              SUM(CASE WHEN phaseAngle > 15 THEN 1 ELSE 0 END) AS highCOV,
              AVG(mag) as avgMag,
              STDDEV(mag) as magSTD,
              AVG(magErr) as avgMagErr,
              STDDEV(magErr) as MagErrSTD
              from dp03_catalogs_10yr.SSSource as sss
              left join dp03_catalogs_10yr.DiaSource as dia on sss.diaSourceId = dia.diaSourceId
              where dia.midPointMjdTai BETWEEN 60218.00491 and 62044.25491
              group by sss.ssObjectId, band
      

      This query fails when run in TAP (though it works on the Postgres back end) because of the size of the result set produced: 3.2GB over 15M rows.

      The same query with a restriction to a single band does work and can be used to illustrate this situation. The time range on dia.midPointMjdTai can be further reduced to obtain a smaller result.

               select sss.ssObjectId,
               band,
               count(phaseAngle),
               min(phaseAngle) as minPhase,
               max(phaseAngle) as maxPhase,
               (max(phaseAngle) - min(phaseAngle))/count(sss.ssObjectId) as phaseDensity,
               SUM(CASE WHEN phaseAngle < 2 THEN 1 ELSE 0 END) AS oppCOV,
               SUM(CASE WHEN phaseAngle > 2 AND phaseAngle < 15 THEN 1 ELSE 0 END) AS modCOV,
               SUM(CASE WHEN phaseAngle > 15 THEN 1 ELSE 0 END) AS highCOV,
               AVG(mag) as avgMag,
               STDDEV(mag) as magSTD,
               AVG(magErr) as avgMagErr,
               STDDEV(magErr) as MagErrSTD
               from dp03_catalogs_10yr.SSSource as sss
               left join dp03_catalogs_10yr.DiaSource as dia on sss.diaSourceId = dia.diaSourceId
               where dia.midPointMjdTai BETWEEN 60218.00491 and 62044.25491 AND band = 'g'
               group by sss.ssObjectId, band
      

      Main problem

      The result columns phaseDensity, oppCOV, modCOV, and highCOV are all returned as character string types:

            <FIELD name="phaseDensity" datatype="char" arraysize="*" />
            <FIELD name="oppCOV" datatype="char" arraysize="*" />
            <FIELD name="modCOV" datatype="char" arraysize="*" />
            <FIELD name="highCOV" datatype="char" arraysize="*" />
      

      even though they are clearly numeric:

               (max(phaseAngle) - min(phaseAngle))/count(sss.ssObjectId) as phaseDensity,
               SUM(CASE WHEN phaseAngle < 2 THEN 1 ELSE 0 END) AS oppCOV,
               SUM(CASE WHEN phaseAngle > 2 AND phaseAngle < 15 THEN 1 ELSE 0 END) AS modCOV,
               SUM(CASE WHEN phaseAngle > 15 THEN 1 ELSE 0 END) AS highCOV,
      

      Possible additional issue

      We also noticed that the outputs of the other aggregate functions, while numeric, were a bit inconsistent, and would like to be sure that we understand what is going on:

      The following ADQL (excerpted):

               min(phaseAngle) as minPhase,
               max(phaseAngle) as maxPhase,
               AVG(mag) as avgMag,
               STDDEV(mag) as magSTD,
               AVG(magErr) as avgMagErr,
               STDDEV(magErr) as MagErrSTD
      

      given the inputs phaseAngle, mag, and magErr, which are all reported as float in TAP_SCHEMA, and have been confirmed as real in the underlying Postgres database, produces the following output:

            <FIELD name="minPhase" datatype="float" />
            <FIELD name="maxPhase" datatype="float" />
            <FIELD name="avgMag" datatype="double" />
            <FIELD name="magSTD" datatype="double" />
            <FIELD name="avgMagErr" datatype="double" />
            <FIELD name="MagErrSTD" datatype="double" />
      

      I'd like to understand why the MIN() and MAX() are returning different types from the AVG() and STDDEV().

      Attachments

        Issue Links

          Activity

            ktl contributed to the Slack discussion a note of a way that a JDBC connection can determine the types of the data being returned from the database. If this were possible to invoke from the TAP service, it would avoid the TAP service having to infer return types itself by analyzing the ADQL syntax tree.

            https://docs.oracle.com/javase/8/docs/api/java/sql/ResultSet.html#getMetaData--

            (It's not clear that that would help with a Qserv back end, but in this case we're dealing with Postgres.)

            I'm passing this information on without explicitly requesting it to be used; that'll be for Christine (perhaps with consultation from Pat Dowler, too) to determine.

            gpdf Gregory Dubois-Felsmann added a comment - ktl contributed to the Slack discussion a note of a way that a JDBC connection can determine the types of the data being returned from the database. If this were possible to invoke from the TAP service, it would avoid the TAP service having to infer return types itself by analyzing the ADQL syntax tree. https://docs.oracle.com/javase/8/docs/api/java/sql/ResultSet.html#getMetaData-- (It's not clear that that would help with a Qserv back end, but in this case we're dealing with Postgres.) I'm passing this information on without explicitly requesting it to be used; that'll be for Christine (perhaps with consultation from Pat Dowler, too) to determine.

            It has emerged that the fix in DM-17872 may have inadvertently reverted at some point and/or not ever been deployed on SSOTAP (the DP0.3 TAP Postgres-backed TAP service).

            I have reopened that ticket. If fixing that addresses the current problem as well, that'll be good news.

            gpdf Gregory Dubois-Felsmann added a comment - It has emerged that the fix in DM-17872 may have inadvertently reverted at some point and/or not ever been deployed on SSOTAP (the DP0.3 TAP Postgres-backed TAP service). I have reopened that ticket. If fixing that addresses the current problem as well, that'll be good news.

            As to the MIN/MAX/AVG/STDDEV question:

            Here is where that comes from in the code:

            https://github.com/opencadc/tap/blob/246b05f473add8dd60aa54dc94acb8a785c52c03/cadc-tap-schema/src/main/java/ca/nrc/cadc/tap/schema/TapSchemaDAO.java#L1469

            You can see that MIN/MAX returns whatever the input type is, which makes sense since it's not computing anything, it's just picking one of the elements.

            For AVG and STDDEV these both return a new computed value, which has to be stored as a datatype with decimal precision, hence those functions just say the datatype is a double.

            cbanek Christine Banek added a comment - As to the MIN/MAX/AVG/STDDEV question: Here is where that comes from in the code: https://github.com/opencadc/tap/blob/246b05f473add8dd60aa54dc94acb8a785c52c03/cadc-tap-schema/src/main/java/ca/nrc/cadc/tap/schema/TapSchemaDAO.java#L1469 You can see that MIN/MAX returns whatever the input type is, which makes sense since it's not computing anything, it's just picking one of the elements. For AVG and STDDEV these both return a new computed value, which has to be stored as a datatype with decimal precision, hence those functions just say the datatype is a double.

            People

              cbanek Christine Banek
              gpdf Gregory Dubois-Felsmann
              Christine Banek, Fritz Mueller, Gregory Dubois-Felsmann, Jeremy McCormick, Melissa Graham
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                Jenkins

                  No builds found.