Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-17872

TAP returns incorrect column type when SELECT contains math

    XMLWordPrintable

Details

    • Story
    • Status: To Do
    • Resolution: Unresolved
    • None
    • dax
    • None

    Description

      When I submit the query 

      SELECT ra, decl, w1mag, w1mag - w2mag FROM wise_00.allwise_p3as_psd
      WHERE qserv_areaspec_circle(283.831250,-30.545278,0.20)=1

      through pyvo, the values in the result column "w1mag - w2mag" are all strings instead of floats. Manipulating the results with pandas seems to kind of work anyways, but it becomes excruciatingly slow.

      I can check what the XML looks like with:

      query_string = ("SELECT ra, decl, w1mag, w1mag - w2mag FROM wise_00.allwise_p3as_psd "
      {{ "WHERE qserv_areaspec_circle(283.831250,-30.545278,0.20)=1 ")}}
      query = pyvo.dal.TAPQuery(query=query_string, baseurl='http://lsst-lsp-stable.ncsa.illinois.edu/api/tap')
      res = query.execute_stream()
      res_str = res.read()
      res_str[:1500]

      and I see: 

      <FIELD name="w1mag - w2mag" datatype="char" arraysize="*" />

       

       

      Attachments

        Issue Links

          Activity

            No builds found.
            ctslater Colin Slater created issue -
            cbanek Christine Banek made changes -
            Field Original Value New Value
            Assignee Christine Banek [ cbanek ]

            This is an excellent find.  Keep 'em coming!

            Just doing a quick look at the logs in this, the TAP service seems to find each column in there (the w1mag and w2mag), but it doesn't seem to imply the type of w1mag - w2mag to be of the same type.  It looks like it defaults to string which is what you're seeing.  This obviously isn't great, although I'm not sure how much effort it will be to keep the type information around when parsing the statement (although this seems like a good idea).

            I'll be sure to ask about this more when I'm visiting with the CADC folks, maybe this is one of the issues we can investigate together to learn a bit more about it.

            cbanek Christine Banek added a comment - This is an excellent find.  Keep 'em coming! Just doing a quick look at the logs in this, the TAP service seems to find each column in there (the w1mag and w2mag), but it doesn't seem to imply the type of w1mag - w2mag to be of the same type.  It looks like it defaults to string which is what you're seeing.  This obviously isn't great, although I'm not sure how much effort it will be to keep the type information around when parsing the statement (although this seems like a good idea). I'll be sure to ask about this more when I'm visiting with the CADC folks, maybe this is one of the issues we can investigate together to learn a bit more about it.
            fritzm Fritz Mueller made changes -
            Team Data Access and Database [ 10204 ] SQuaRE [ 10302 ]
            cbanek Christine Banek made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            cbanek Christine Banek made changes -
            Issue Type Bug [ 1 ] Story [ 10001 ]
            cbanek Christine Banek made changes -
            Epic Link DM-22752 [ 428069 ]

            Moving this to the bug fixing epic.  Just doing initial investigation now.

            cbanek Christine Banek added a comment - Moving this to the bug fixing epic.  Just doing initial investigation now.
            cbanek Christine Banek made changes -
            Story Points 2

            From Pat:

            I never tried to tackle any figuring out the result datatype from an expression in cadc-tap-server... There are two things that always scared me off:
             
            1. You may have noticed the lack of support for a subquery in the here clause... to do that right requires mapping columns output from the subquery into essentially a temp table so that columns in the outer select clause will have know name, type, unit, etc. Which leads to:

             
            2. Once you start getting the type of expressions correct it also becomes more or less required to figure out the correct unit to put in the result as well... then you have to decide what to do if it doesn't make sense (like {something in m} + {something in cm}: do you also convert one to the other units? Do you throw an exception/fail? Convert gets super complex and there would be many inconvertible cases that would fail, fail is arguably wrong because users are allowed to do many such "nonsense" things now and in other TAP services... and would you put unit="m" or unit="cm" in the result?
             
            So, it is working as intended and I would not want to "fix" the types without knowing how to handle unit specification. If there is an "unknown" value that is separate from dimensionless in VOUnit I could get behind that (char was really "you take care of this - I know nothing" kind of response).

             
            The subquery part is something that could be fixed, but iirc the adql parsing (SelectListExtractor) doesn't track enough state to handle the recursion correctly. This was something I wanted to fix  (because I find that paradigm useful)... it did/does feel like one would run into technical debt in the parser and I'm trying to procrastinate on addressing that until I can justify creating an ADQL 2.x parser from the PEG grammar (a bigger project with long term value).

             
            If we can figure out a good step that won't leave users feeling worse off then I'd be happy to see a PR

            cbanek Christine Banek added a comment - From Pat: I never tried to tackle any figuring out the result datatype from an expression in cadc-tap-server... There are two things that always scared me off:   1. You may have noticed the lack of support for a subquery in the here clause... to do that right requires mapping columns output from the subquery into essentially a temp table so that columns in the outer select clause will have know name, type, unit, etc. Which leads to:   2. Once you start getting the type of expressions correct it also becomes more or less required to figure out the correct unit to put in the result as well... then you have to decide what to do if it doesn't make sense (like {something in m} + {something in cm}: do you also convert one to the other units? Do you throw an exception/fail? Convert gets super complex and there would be many inconvertible cases that would fail, fail is arguably wrong because users are allowed to do many such "nonsense" things now and in other TAP services... and would you put unit="m" or unit="cm" in the result?   So, it is working as intended and I would not want to "fix" the types without knowing how to handle unit specification. If there is an "unknown" value that is separate from dimensionless in VOUnit I could get behind that (char was really "you take care of this - I know nothing" kind of response).   The subquery part is something that could be fixed, but iirc the adql parsing (SelectListExtractor) doesn't track enough state to handle the recursion correctly. This was something I wanted to fix  (because I find that paradigm useful)... it did/does feel like one would run into technical debt in the parser and I'm trying to procrastinate on addressing that until I can justify creating an ADQL 2.x parser from the PEG grammar (a bigger project with long term value).   If we can figure out a good step that won't leave users feeling worse off then I'd be happy to see a PR
            cbanek Christine Banek made changes -
            Status In Progress [ 3 ] To Do [ 10001 ]
            ctslater Colin Slater added a comment -

            Very helpful explanation from Pat, thanks for including that. I don't think I agree that it's not worth getting data types right without also doing the units. Right now the behavior is less user-friendly than straight mysqlclient in this pretty-common use case (subtracting two bands to get a color is a really basic step), and I think we should endeavor to set "at least as good as mysqlclient" as a floor on what's acceptable for TAP. The Gaia archive supports returning proper types too.

            I don't mean to create a proxy-argument-with-Pat-via-Christine though; I think it's still worth investigating how hard a minimal fix would be (minimal meaning units get set to empty or "unknown", subqueries not supported).

             

            ctslater Colin Slater added a comment - Very helpful explanation from Pat, thanks for including that. I don't think I agree that it's not worth getting data types right without also doing the units. Right now the behavior is less user-friendly than straight mysqlclient in this pretty-common use case (subtracting two bands to get a color is a really basic step), and I think we should endeavor to set "at least as good as mysqlclient" as a floor on what's acceptable for TAP. The Gaia archive supports returning proper types too. I don't mean to create a proxy-argument-with-Pat-via-Christine though; I think it's still worth investigating how hard a minimal fix would be (minimal meaning units get set to empty or "unknown", subqueries not supported).  
            cbanek Christine Banek made changes -
            Epic Link DM-22752 [ 428069 ] DM-22761 [ 428079 ]

            Yeah, I'm not sure what a minimal fix would be, since I don't think that computed columns are really being tracked at all through the parsing.  I get the feeling that the parsing and the data types are completely different levels.  Apparently the TAP service also doesn't support subqueries for this reason.  Of course this hasn't come up for us because qserv also can't do subqueries.

            I guess my workaround feeling would be to just do the query, have the where be the difference, and return both the numbers and do the different in the dataframe.  Maybe this will even take into account units with astropy stuff?

            Another thought is maybe a function DOUBLE or something, that we could do some conversion and say anything out of that is a double, but that would still require changing the query.

            I don't really have a great answer here.

            cbanek Christine Banek added a comment - Yeah, I'm not sure what a minimal fix would be, since I don't think that computed columns are really being tracked at all through the parsing.  I get the feeling that the parsing and the data types are completely different levels.  Apparently the TAP service also doesn't support subqueries for this reason.  Of course this hasn't come up for us because qserv also can't do subqueries. I guess my workaround feeling would be to just do the query, have the where be the difference, and return both the numbers and do the different in the dataframe.  Maybe this will even take into account units with astropy stuff? Another thought is maybe a function DOUBLE or something, that we could do some conversion and say anything out of that is a double, but that would still require changing the query. I don't really have a great answer here.
            frossie Frossie Economou made changes -
            Epic Link DM-22761 [ 428079 ] DM-23708 [ 431296 ]
            frossie Frossie Economou made changes -
            Epic Link DM-23708 [ 431296 ] DM-25255 [ 435598 ]
            cbanek Christine Banek made changes -
            Rank Ranked lower
            frossie Frossie Economou made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            frossie Frossie Economou made changes -
            Reviewers Frossie Economou [ frossie ]
            Status In Progress [ 3 ] In Review [ 10004 ]

            Closing this one to reconcile the story points for accounting purposes. Discussion seems to have stalled for some time, if there is a specific request for work on this issue please open a new ticket and link it to this one.

            frossie Frossie Economou added a comment - Closing this one to reconcile the story points for accounting purposes. Discussion seems to have stalled for some time, if there is a specific request for work on this issue please open a new ticket and link it to this one.
            frossie Frossie Economou made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            frossie Frossie Economou made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]
            frossie Frossie Economou made changes -
            Resolution Done [ 10000 ]
            Status Done [ 10002 ] To Do [ 10001 ]
            frossie Frossie Economou added a comment - Re-opening due to  https://github.com/rubin-dp0/Support/issues/12

            For context:

            The ESAC Gaia TAP service and the IRSA TAP service both currently correctly report a numeric type for the results of mathematical expressions in the SELECT.

            gpdf Gregory Dubois-Felsmann added a comment - For context: The ESAC Gaia TAP service and the IRSA TAP service both currently correctly report a numeric type for the results of mathematical expressions in the SELECT.

            This has resurfaced in the context of DP0.1 at the IDF in relation to https://github.com/rubin-dp0/Support/issues/12.

            Verified that Qserv is, in fact, returning expected (numeric) column types in response to the query provided there. With direct connection from mysql client to czar, we have:

            MariaDB [(none)]> SELECT mag_g,mag_r, mag_g-mag_r as color
                -> FROM dp01_dc2_catalogs.object
                -> WHERE scisql_s2PtInCircle(ra, dec, 55, -35, 1)=1
                -> AND (mag_g <21 AND mag_r <21)
                -> ;
            Field   1:  `mag_g`
            Catalog:    `def`
            Database:   `qservResult`
            Table:      `result_3076`
            Org_table:  `result_3076`
            Type:       DOUBLE
            Collation:  binary (63)
            Length:     22
            Max_length: 10
            Decimals:   31
            Flags:      NUM 
             
            Field   2:  `mag_r`
            Catalog:    `def`
            Database:   `qservResult`
            Table:      `result_3076`
            Org_table:  `result_3076`
            Type:       DOUBLE
            Collation:  binary (63)
            Length:     22
            Max_length: 10
            Decimals:   31
            Flags:      NUM 
             
            Field   3:  `color`
            Catalog:    `def`
            Database:   `qservResult`
            Table:      `result_3076`
            Org_table:  `result_3076`
            Type:       DOUBLE
            Collation:  binary (63)
            Length:     22
            Max_length: 23
            Decimals:   31
            Flags:      NUM 
            

            fritzm Fritz Mueller added a comment - This has resurfaced in the context of DP0.1 at the IDF in relation to https://github.com/rubin-dp0/Support/issues/12 . Verified that Qserv is, in fact, returning expected (numeric) column types in response to the query provided there. With direct connection from mysql client to czar, we have: MariaDB [(none)]> SELECT mag_g,mag_r, mag_g-mag_r as color -> FROM dp01_dc2_catalogs.object -> WHERE scisql_s2PtInCircle(ra, dec, 55, -35, 1)=1 -> AND (mag_g <21 AND mag_r <21) -> ; Field 1: `mag_g` Catalog: `def` Database: `qservResult` Table: `result_3076` Org_table: `result_3076` Type: DOUBLE Collation: binary (63) Length: 22 Max_length: 10 Decimals: 31 Flags: NUM   Field 2: `mag_r` Catalog: `def` Database: `qservResult` Table: `result_3076` Org_table: `result_3076` Type: DOUBLE Collation: binary (63) Length: 22 Max_length: 10 Decimals: 31 Flags: NUM   Field 3: `color` Catalog: `def` Database: `qservResult` Table: `result_3076` Org_table: `result_3076` Type: DOUBLE Collation: binary (63) Length: 22 Max_length: 23 Decimals: 31 Flags: NUM

            Note that in the Github issue above the user used an ADQL "AS" to give the computation (also a difference) a new name ("color"); otherwise the result is the same: the column came back from the RSP TAP service annotated in the VOTable as of string type.

            gpdf Gregory Dubois-Felsmann added a comment - Note that in the Github issue above the user used an ADQL "AS" to give the computation (also a difference) a new name ("color"); otherwise the result is the same: the column came back from the RSP TAP service annotated in the VOTable as of string type.

            Confirmed that the current production CADC TAP service returns string types for math expressions, e.g., from

            SELECT obs_publisher_did,obs_id,s_ra,s_dec,s_ra-s_dec AS funny 
            FROM ivoa.ObsCore 
            WHERE dataproduct_type = 'image' AND CONTAINS(CIRCLE('ICRS', 148.8882208, 69.06529472, 0.2777777777777778), s_region)=1
             

            gpdf Gregory Dubois-Felsmann added a comment - Confirmed that the current production CADC TAP service returns string types for math expressions, e.g., from SELECT obs_publisher_did,obs_id,s_ra,s_dec,s_ra-s_dec AS funny FROM ivoa.ObsCore WHERE dataproduct_type = 'image' AND CONTAINS (CIRCLE( 'ICRS' , 148.8882208, 69.06529472, 0.2777777777777778), s_region)=1
            cbanek Christine Banek made changes -
            Story Points 2 12

            Made a PR to CADC upstream, and it has the full type detection of the system.  This is deployed in a few places on a private build, but we're waiting for the upstream PR to get fixed.

            cbanek Christine Banek added a comment - Made a PR to CADC upstream, and it has the full type detection of the system.  This is deployed in a few places on a private build, but we're waiting for the upstream PR to get fixed.
            cbanek Christine Banek added a comment - Upstream PR:  https://github.com/opencadc/tap/pull/116
            cbanek Christine Banek made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            cbanek Christine Banek made changes -
            Resolution Done [ 10000 ]
            Status In Progress [ 3 ] Done [ 10002 ]
            frossie Frossie Economou made changes -
            Epic Link DM-25255 [ 435598 ] DM-30590 [ 511627 ]
            gpdf Gregory Dubois-Felsmann made changes -
            Attachment screenshot-1.png [ 67121 ]

            Repeating, today, the user's query from https://github.com/rubin-dp0/Support/issues/12:

            SELECT mag_g,mag_r, mag_g-mag_r as color
            FROM dp01_dc2_catalogs.object
            WHERE CONTAINS(POINT('ICRS', ra, dec),CIRCLE('ICRS', 55, -35, 1))=1
            AND (mag_g <21 AND mag_r <21)
            

            which was reported as solved, and confirmed by the user, on 2021-07-31, now fails again:

            ("color" is reported as "char".)

            Also of interest: the PR to CADC was never acted on.

            gpdf Gregory Dubois-Felsmann added a comment - Repeating, today, the user's query from https://github.com/rubin-dp0/Support/issues/12: SELECT mag_g,mag_r, mag_g-mag_r as color FROM dp01_dc2_catalogs.object WHERE CONTAINS (POINT( 'ICRS' , ra, dec ),CIRCLE( 'ICRS' , 55, -35, 1))=1 AND (mag_g <21 AND mag_r <21) which was reported as solved, and confirmed by the user, on 2021-07-31, now fails again: ("color" is reported as "char".) Also of interest: the PR to CADC was never acted on.
            gpdf Gregory Dubois-Felsmann made changes -
            Link This issue relates to DM-41913 [ DM-41913 ]

            Just tested again. color is still being reported as "char".

            gpdf Gregory Dubois-Felsmann added a comment - Just tested again. color is still being reported as "char".
            gpdf Gregory Dubois-Felsmann made changes -
            Resolution Done [ 10000 ]
            Status Done [ 10002 ] To Do [ 10001 ]

            Reopened ticket

            gpdf Gregory Dubois-Felsmann added a comment - Reopened ticket

            People

              cbanek Christine Banek
              ctslater Colin Slater
              Frossie Economou
              Christine Banek, Colin Slater, Fritz Mueller, Frossie Economou, Gregory Dubois-Felsmann
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                Jenkins

                  No builds found.