Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-1674

Data ingest scripts cleanup

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: daf_ingest, datarel
    • Labels:
      None
    • Story Points:
      10
    • Sprint:
      DB_S15_07, DB_S15_08
    • Team:
      Data Access and Database

      Attachments

        Issue Links

          Activity

          No builds found.
          jbecla Jacek Becla created issue -
          jbecla Jacek Becla made changes -
          Field Original Value New Value
          Epic Link DM-210 [ 11471 ]
          jbecla Jacek Becla made changes -
          Rank Ranked higher
          jbecla Jacek Becla made changes -
          Sprint DB_W15_01 [ 112 ] DB_W15_02 [ 115 ]
          Hide
          jbecla Jacek Becla added a comment -

          This RFC: https://jira.lsstcorp.org/browse/RFC-22 is very relevant.

          Show
          jbecla Jacek Becla added a comment - This RFC: https://jira.lsstcorp.org/browse/RFC-22 is very relevant.
          jbecla Jacek Becla made changes -
          Sprint DB_W15_02 [ 115 ] DB_W15_03 [ 136 ]
          fritzm Fritz Mueller made changes -
          Sprint DB_S15_03 [ 136 ] DB_S15_04 [ 137 ]
          jbecla Jacek Becla made changes -
          Rank Ranked higher
          jbecla Jacek Becla made changes -
          Sprint DB_S15_04 [ 137 ] DB_S15_05 [ 138 ]
          tjenness Tim Jenness made changes -
          Link This issue relates to RFC-22 [ RFC-22 ]
          tjenness Tim Jenness made changes -
          Link This issue relates to RFC-57 [ RFC-57 ]
          jbecla Jacek Becla made changes -
          Sprint DB_S15_05 [ 138 ] DB_S15_06 [ 139 ]
          jbecla Jacek Becla made changes -
          Rank Ranked lower
          swinbank John Swinbank made changes -
          Link This issue relates to DM-2928 [ DM-2928 ]
          smonkewitz Serge Monkewitz made changes -
          Sprint DB_S15_06 [ 139 ] DB_S15_07 [ 156 ]
          smonkewitz Serge Monkewitz made changes -
          Status To Do [ 10001 ] In Progress [ 3 ]
          Hide
          smonkewitz Serge Monkewitz added a comment -

          This is the massaged afw table ingest code (previously known as ingestSourcesTask in datarel), updated for the new afw table features. Please don't hesitate to assign the review to someone else if you are too busy - the only thing I'd say is that it'd be useful for whomever you choose to know their way around afw tables and pipeline tasks.

          Thanks!

          Show
          smonkewitz Serge Monkewitz added a comment - This is the massaged afw table ingest code (previously known as ingestSourcesTask in datarel), updated for the new afw table features. Please don't hesitate to assign the review to someone else if you are too busy - the only thing I'd say is that it'd be useful for whomever you choose to know their way around afw tables and pipeline tasks. Thanks!
          smonkewitz Serge Monkewitz made changes -
          Reviewers John Swinbank [ swinbank ]
          Status In Progress [ 3 ] In Review [ 10004 ]
          Hide
          swinbank John Swinbank added a comment -

          Hi Serge – I'm happy (indeed, interested) to look at this, but I probably won't get to it for a few days (maybe not until next week). Does that schedule work for you?

          Show
          swinbank John Swinbank added a comment - Hi Serge – I'm happy (indeed, interested) to look at this, but I probably won't get to it for a few days (maybe not until next week). Does that schedule work for you?
          Hide
          smonkewitz Serge Monkewitz added a comment -

          Yup, works for me. Thanks for agreeing to review this!

          Show
          smonkewitz Serge Monkewitz added a comment - Yup, works for me. Thanks for agreeing to review this!
          Hide
          smonkewitz Serge Monkewitz added a comment -

          A reminder for myself: remember to update https://confluence.lsstcorp.org/display/DM/DM+Stack+Package+History before closing this ticket.

          Show
          smonkewitz Serge Monkewitz added a comment - A reminder for myself: remember to update https://confluence.lsstcorp.org/display/DM/DM+Stack+Package+History before closing this ticket.
          jbecla Jacek Becla made changes -
          Sprint DB_S15_07 [ 156 ] DB_S15_07, DB_S15_08 [ 156, 157 ]
          jbecla Jacek Becla made changes -
          Rank Ranked higher
          jbecla Jacek Becla made changes -
          Rank Ranked higher
          Hide
          swinbank John Swinbank added a comment -

          I've started looking over the GithHub PR. First impressions are pretty positive; I've left a few comments, but not had chance to get through it all yet. I'll return to this shortly.

          Show
          swinbank John Swinbank added a comment - I've started looking over the GithHub PR. First impressions are pretty positive; I've left a few comments, but not had chance to get through it all yet. I'll return to this shortly.
          Hide
          smonkewitz Serge Monkewitz added a comment -

          I've force-pushed my reaction to those initial comments.

          Show
          smonkewitz Serge Monkewitz added a comment - I've force-pushed my reaction to those initial comments.
          Hide
          swinbank John Swinbank added a comment -

          Thanks Serge. I'll continue from where I left off.

          One thing to note: I realise that a bunch of things I commented on were not introduced on this ticket but were carried over from earlier work. I'm trying to follow the overall logic, rather than simply your changes, so I can't really avoid them, but if you want to declare them outside of the scope of this review that's fine (although perhaps you'll agree with me that there are some improvements to make, which would be great).

          Show
          swinbank John Swinbank added a comment - Thanks Serge. I'll continue from where I left off. One thing to note: I realise that a bunch of things I commented on were not introduced on this ticket but were carried over from earlier work. I'm trying to follow the overall logic, rather than simply your changes, so I can't really avoid them, but if you want to declare them outside of the scope of this review that's fine (although perhaps you'll agree with me that there are some improvements to make, which would be great).
          Hide
          smonkewitz Serge Monkewitz added a comment - - edited

          OK. I think those comments are valuable, so please keep them coming! I will only punt on a comment if addressing it involves a large amount of work or I disagree, and I'll attempt to justify punting in the PR comments when I'm inclined to do so. For the current round - I've changed the config class to use field names instead of column names, but I think I'm going to stick to the current field -> column name transformation strategy.

          Show
          smonkewitz Serge Monkewitz added a comment - - edited OK. I think those comments are valuable, so please keep them coming! I will only punt on a comment if addressing it involves a large amount of work or I disagree, and I'll attempt to justify punting in the PR comments when I'm inclined to do so. For the current round - I've changed the config class to use field names instead of column names, but I think I'm going to stick to the current field -> column name transformation strategy.
          Hide
          swinbank John Swinbank added a comment - - edited

          By the way, I understand that Jim provided you with an example table to ingest. Could you let me see it please? I gathered from our conversation earlier that you were worried about recursively defined aliases and the like, and I think I might better understand your concerns (and hence the code) if I could see what you'd been looking at. Thanks!

          Show
          swinbank John Swinbank added a comment - - edited By the way, I understand that Jim provided you with an example table to ingest. Could you let me see it please? I gathered from our conversation earlier that you were worried about recursively defined aliases and the like, and I think I might better understand your concerns (and hence the code) if I could see what you'd been looking at. Thanks!
          Hide
          smonkewitz Serge Monkewitz added a comment -

          Atached is the sample catalog. Here's some code to look at the aliases for its fields:

          import lsst.afw.table as afwTable
          from lsst.daf.ingest.ingestCatalogTask import aliasesFor
           
          cat = afwTable.BaseCatalog.readFits('catalog_example.fits')
          mappings = sorted(cat.schema.getAliasMap().iteritems())
          aliasSets = dict((i.field.getName(), aliasesFor(i.field.getName(), mappings)) for i in cat.schema) 
          print "alias map:\n\t" + ",\n\t".join(s + " -> " + t for s,t in mappings)
           
          for name, aliases in aliasSets.iteritems():
              if aliases:
                  print "aliases for " + name + ":\n\t" + ",\n\t".join(sorted(aliases)) + "\n"
          

          The aliases for e.g. the base_SdssCentroid_flag_almostNoSecondDerivative field demonstrate the problem.

          Show
          smonkewitz Serge Monkewitz added a comment - Atached is the sample catalog. Here's some code to look at the aliases for its fields: import lsst.afw.table as afwTable from lsst.daf.ingest.ingestCatalogTask import aliasesFor   cat = afwTable.BaseCatalog.readFits( 'catalog_example.fits' ) mappings = sorted (cat.schema.getAliasMap().iteritems()) aliasSets = dict ((i.field.getName(), aliasesFor(i.field.getName(), mappings)) for i in cat.schema) print "alias map:\n\t" + ",\n\t" .join(s + " -> " + t for s,t in mappings)   for name, aliases in aliasSets.iteritems(): if aliases: print "aliases for " + name + ":\n\t" + ",\n\t" .join( sorted (aliases)) + "\n" The aliases for e.g. the base_SdssCentroid_flag_almostNoSecondDerivative field demonstrate the problem.
          smonkewitz Serge Monkewitz made changes -
          Attachment catalog_example.fits [ 27045 ]
          Hide
          ktl Kian-Tat Lim added a comment -

          Since the example table is so small, would it make sense to use it as another source of test data or perhaps put it in an examples directory?

          Show
          ktl Kian-Tat Lim added a comment - Since the example table is so small, would it make sense to use it as another source of test data or perhaps put it in an examples directory?
          Hide
          swinbank John Swinbank added a comment -

          I've added some more comments to GitHub. The short summary is this all looks good to me.

          The main concern I have is with the handling of aliases. However, I don't think that's something you can fix here; rather, I think there probably needs to be a bit more thought about this in afw::table. I have no problem with the way you've tackled this being merged, but it's probably worth opening a ticket against afw to revisit this issue.

          I particularly like the test case, which is nice and comprehensive. However... it's unfortunate that it requires talking to a live MySQL instance to run. It seems like you could carry out some quite useful testing in the absence of that by doing some mocking – you know a priori what the stream of bytes you want to send to the database is, so you should be able to use a mock database connection to demonstrate that that really is what gets generated by your task. I'd imagine that setting up the infrastructure for that might take some thought, so I'm not suggesting it should be a requirement for getting this ticket merged, but I wonder if this is something we ought to consider more generally for testing code which interacts with databases.

          In this case, you could also write more specific unit tests rather than just demonstrating that the whole task works. For example, I think I'd have found a test case which just exercises aliasesFor useful when understanding what the code does (and I'd imagine it would have been useful when writing the code). Again, though, I don't think this is essential for completing the ticket – I'm happy that the existing test case does a good job of covering all the functionality.

          A couple of final notes which don't directly related to the code:

          • What's the future of the datarel package? It seems (from reading the "related issues" links) that it should be removed after you've extracted all the still-relevant code from it (which you've done here), but that dropping it entirely is a problem for buildbot (DM-2948). After this ticket is complete, is the only blocker to removing that package the buildbot issue?
          • There are some pages on Confluence (Tour of the software stack, Package & task index) which specifically refer to datarel. Please could you update them to refer to daf_ingest instead/as well (as appropriate)? (They also mention ap – I guess that can be dropped?)
          Show
          swinbank John Swinbank added a comment - I've added some more comments to GitHub. The short summary is this all looks good to me. The main concern I have is with the handling of aliases. However, I don't think that's something you can fix here; rather, I think there probably needs to be a bit more thought about this in afw::table . I have no problem with the way you've tackled this being merged, but it's probably worth opening a ticket against afw to revisit this issue. I particularly like the test case, which is nice and comprehensive. However... it's unfortunate that it requires talking to a live MySQL instance to run. It seems like you could carry out some quite useful testing in the absence of that by doing some mocking – you know a priori what the stream of bytes you want to send to the database is, so you should be able to use a mock database connection to demonstrate that that really is what gets generated by your task. I'd imagine that setting up the infrastructure for that might take some thought, so I'm not suggesting it should be a requirement for getting this ticket merged, but I wonder if this is something we ought to consider more generally for testing code which interacts with databases. In this case, you could also write more specific unit tests rather than just demonstrating that the whole task works. For example, I think I'd have found a test case which just exercises aliasesFor useful when understanding what the code does (and I'd imagine it would have been useful when writing the code). Again, though, I don't think this is essential for completing the ticket – I'm happy that the existing test case does a good job of covering all the functionality. A couple of final notes which don't directly related to the code: What's the future of the datarel package? It seems (from reading the "related issues" links) that it should be removed after you've extracted all the still-relevant code from it (which you've done here), but that dropping it entirely is a problem for buildbot ( DM-2948 ). After this ticket is complete, is the only blocker to removing that package the buildbot issue? There are some pages on Confluence ( Tour of the software stack , Package & task index ) which specifically refer to datarel . Please could you update them to refer to daf_ingest instead/as well (as appropriate)? (They also mention ap – I guess that can be dropped?)
          swinbank John Swinbank made changes -
          Status In Review [ 10004 ] Reviewed [ 10101 ]
          Hide
          smonkewitz Serge Monkewitz added a comment -

          I'm going to wait for the dust to settle on DM-3401 before I proceed any further, and I'll think about a way to run the unit test without a mysql instance, though it seems potentially hard. I'm thinking along the lines of interjecting a proxy between the actual connection (and cursor) objects and clients that can memorize and save calls and their arguments/return values, which you would use record interactions with a live database. You'd then have mocks which can "play back" results in the absence of a live database.

          Regarding your other points:

          • I think datarel needs to stay around, at least for a bit. For one thing, it still contains scripts to ingest metadata for exposures (bin/ingest/ingestProcessed.py). Even after those scripts (or equivalents) make it into daf_ingest, there may still be a place for datarel. The other thing is that ingestCatalog has very minimal facilities for creating indexes, and to make (large) data release tables useable, there are additional steps that will need to happen (creation of spatial indexes for example) - datarel seems like a good place for the scripts that do this sort of thing.
          • Thanks for the pointer to those confluence pages, I'll be sure to update them just before merging. And yes, from what I understand we can drop references to ap.
          Show
          smonkewitz Serge Monkewitz added a comment - I'm going to wait for the dust to settle on DM-3401 before I proceed any further, and I'll think about a way to run the unit test without a mysql instance, though it seems potentially hard. I'm thinking along the lines of interjecting a proxy between the actual connection (and cursor) objects and clients that can memorize and save calls and their arguments/return values, which you would use record interactions with a live database. You'd then have mocks which can "play back" results in the absence of a live database. Regarding your other points: I think datarel needs to stay around, at least for a bit. For one thing, it still contains scripts to ingest metadata for exposures (bin/ingest/ingestProcessed.py). Even after those scripts (or equivalents) make it into daf_ingest, there may still be a place for datarel. The other thing is that ingestCatalog has very minimal facilities for creating indexes, and to make (large) data release tables useable, there are additional steps that will need to happen (creation of spatial indexes for example) - datarel seems like a good place for the scripts that do this sort of thing. Thanks for the pointer to those confluence pages, I'll be sure to update them just before merging. And yes, from what I understand we can drop references to ap.
          smonkewitz Serge Monkewitz made changes -
          Remote Link This issue links to "Page (Confluence)" [ 13011 ]
          Hide
          jbecla Jacek Becla added a comment -

          I can see it is merged Shall we close the story and the epic?

          Show
          jbecla Jacek Becla added a comment - I can see it is merged Shall we close the story and the epic?
          Hide
          smonkewitz Serge Monkewitz added a comment -

          DM-3401 looks to be a ways off, so for now I am leaving the alias logic as is (to match the actual algorithm in afw). Making the unit test do something more useful when there is no live database instance to interact with looks like more than a days worth of work, and I've filed DM-3431 to cover that particular code review comment.

          I've also edited the relevant confluence pages to mention the daf_ingest and the code transfer involved.

          Show
          smonkewitz Serge Monkewitz added a comment - DM-3401 looks to be a ways off, so for now I am leaving the alias logic as is (to match the actual algorithm in afw). Making the unit test do something more useful when there is no live database instance to interact with looks like more than a days worth of work, and I've filed DM-3431 to cover that particular code review comment. I've also edited the relevant confluence pages to mention the daf_ingest and the code transfer involved.
          smonkewitz Serge Monkewitz made changes -
          Resolution Done [ 10000 ]
          Status Reviewed [ 10101 ] Done [ 10002 ]
          Parejkoj John Parejko made changes -
          Remote Link This issue links to "Page (Confluence)" [ 13011 ] This issue links to "Page (Confluence)" [ 13011 ]
          tjenness Tim Jenness made changes -
          Component/s daf_ingest [ 12895 ]
          tjenness Tim Jenness made changes -
          Link This issue is triggered by RFC-57 [ RFC-57 ]
          tjenness Tim Jenness made changes -
          Link This issue relates to RFC-57 [ RFC-57 ]
          fritzm Fritz Mueller made changes -
          Reporter Jacek Becla [X] [ jbecla ] Fritz Mueller [ fritzm ]

            People

            Assignee:
            smonkewitz Serge Monkewitz
            Reporter:
            fritzm Fritz Mueller
            Reviewers:
            John Swinbank
            Watchers:
            Jacek Becla, John Swinbank, Kian-Tat Lim, Serge Monkewitz
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.