Data Management / DM-1674

Data ingest scripts cleanup


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: daf_ingest, datarel
    • Labels:
      None
    • Story Points:
      10
    • Sprint:
      DB_S15_07, DB_S15_08
    • Team:
      Data Access and Database

          Activity

          ktl Kian-Tat Lim added a comment -

          Since the example table is so small, would it make sense to use it as another source of test data or perhaps put it in an examples directory?

          swinbank John Swinbank added a comment -

          I've added some more comments on GitHub. The short summary is that this all looks good to me.

          The main concern I have is with the handling of aliases. However, I don't think that's something you can fix here; rather, I think there probably needs to be a bit more thought about this in afw::table. I have no problem with the way you've tackled this being merged, but it's probably worth opening a ticket against afw to revisit this issue.

          I particularly like the test case, which is nice and comprehensive. However... it's unfortunate that it requires talking to a live MySQL instance to run. It seems like you could carry out some quite useful testing in the absence of that by doing some mocking – you know a priori what the stream of bytes you want to send to the database is, so you should be able to use a mock database connection to demonstrate that that really is what gets generated by your task. I'd imagine that setting up the infrastructure for that might take some thought, so I'm not suggesting it should be a requirement for getting this ticket merged, but I wonder if this is something we ought to consider more generally for testing code which interacts with databases.
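          (For illustration only, here is a minimal sketch of the mocking approach described above, using the standard-library unittest.mock module. The ingest_rows function, its signature, and the table/rows are hypothetical stand-ins rather than the actual IngestCatalogTask API; the point is just that a mock DB-API connection lets a test assert on the exact SQL that would be sent, with no live MySQL instance involved.)

            import unittest
            from unittest import mock


            def ingest_rows(conn, table, rows):
                # Hypothetical stand-in for the task's ingest step: build one
                # INSERT per row and send it over a DB-API connection.
                cursor = conn.cursor()
                for row in rows:
                    placeholders = ", ".join(["%s"] * len(row))
                    cursor.execute(
                        "INSERT INTO {} VALUES ({})".format(table, placeholders), row)
                conn.commit()


            class MockedIngestTest(unittest.TestCase):
                def testSqlSentToDatabase(self):
                    # No real database: the connection and its cursor are mocks.
                    conn = mock.Mock()
                    cursor = conn.cursor.return_value
                    ingest_rows(conn, "Source", [(1, 0.5), (2, 0.75)])
                    # We know a priori what should have been sent to the database.
                    cursor.execute.assert_any_call(
                        "INSERT INTO Source VALUES (%s, %s)", (1, 0.5))
                    self.assertEqual(cursor.execute.call_count, 2)
                    conn.commit.assert_called_once_with()


            if __name__ == "__main__":
                unittest.main()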

          In this case, you could also write more specific unit tests rather than just demonstrating that the whole task works. For example, I think I'd have found a test case which just exercises aliasesFor useful when understanding what the code does (and I'd imagine it would have been useful when writing the code). Again, though, I don't think this is essential for completing the ticket – I'm happy that the existing test case does a good job of covering all the functionality.

          A couple of final notes which don't directly relate to the code:

          • What's the future of the datarel package? It seems (from reading the "related issues" links) that it should be removed after you've extracted all the still-relevant code from it (which you've done here), but that dropping it entirely is a problem for buildbot (DM-2948). After this ticket is complete, is the only blocker to removing that package the buildbot issue?
          • There are some pages on Confluence (Tour of the software stack, Package & task index) which specifically refer to datarel. Please could you update them to refer to daf_ingest instead/as well (as appropriate)? (They also mention ap – I guess that can be dropped?)
          smonkewitz Serge Monkewitz added a comment -

          I'm going to wait for the dust to settle on DM-3401 before I proceed any further, and I'll think about a way to run the unit test without a MySQL instance, though it seems potentially hard. I'm thinking along the lines of interposing a proxy between the actual connection (and cursor) objects and their clients that can memorize and save calls and their arguments/return values, which you would use to record interactions with a live database. You'd then have mocks which can "play back" those results in the absence of a live database.
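          (A minimal sketch of that record/replay proxy idea — the class names and log format are illustrative, not existing daf_ingest code. RecordingCursor wraps a live DB-API cursor and memorizes each call together with its result; ReplayCursor later replays the saved log and checks that the calls match, so the same test can run with no database present.)

            class RecordingCursor(object):
                """Proxy a real DB-API cursor, logging calls, args and results."""

                def __init__(self, cursor, log):
                    self._cursor = cursor
                    self._log = log

                def execute(self, operation, params=None):
                    result = self._cursor.execute(operation, params)
                    self._log.append(("execute", operation, params, result))
                    return result

                def fetchall(self):
                    rows = self._cursor.fetchall()
                    self._log.append(("fetchall", None, None, rows))
                    return rows


            class ReplayCursor(object):
                """Play back a previously recorded log when no database is available."""

                def __init__(self, log):
                    self._log = list(log)

                def execute(self, operation, params=None):
                    kind, op, args, result = self._log.pop(0)
                    assert (kind, op, args) == ("execute", operation, params)
                    return result

                def fetchall(self):
                    kind, _, _, rows = self._log.pop(0)
                    assert kind == "fetchall"
                    return rows

          A run against a live MySQL instance would populate the log (saving it alongside the test data); later runs would load it and substitute a ReplayCursor for the real cursor.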

          Regarding your other points:

          • I think datarel needs to stay around, at least for a bit. For one thing, it still contains scripts to ingest metadata for exposures (bin/ingest/ingestProcessed.py). Even after those scripts (or equivalents) make it into daf_ingest, there may still be a place for datarel. The other thing is that ingestCatalog has very minimal facilities for creating indexes, and to make (large) data release tables usable there are additional steps that will need to happen (creation of spatial indexes, for example; see the sketch after this list) - datarel seems like a good place for the scripts that do this sort of thing.
          • Thanks for the pointer to those Confluence pages; I'll be sure to update them just before merging. And yes, from what I understand we can drop references to ap.
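          (A rough sketch of the kind of post-ingest indexing step meant in the first bullet above — the helper name, the htmId20 column, and the use of the MySQLdb driver are assumptions for illustration, not existing datarel or daf_ingest code.)

            import MySQLdb  # assuming the same DB-API driver used for ingest


            def create_htm_index(host, user, db, table, column="htmId20"):
                # Index an HTM spatial-ID column of a freshly ingested table so
                # that cone/box searches can be turned into ID-range scans.
                conn = MySQLdb.connect(host=host, user=user, db=db)
                try:
                    cursor = conn.cursor()
                    cursor.execute(
                        "ALTER TABLE {0} ADD INDEX IDX_{1} ({1})".format(table, column))
                    conn.commit()
                finally:
                    conn.close()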
          jbecla Jacek Becla added a comment -

          I can see it is merged. Shall we close the story and the epic?

          smonkewitz Serge Monkewitz added a comment -

          DM-3401 looks to be a ways off, so for now I am leaving the alias logic as is (to match the actual algorithm in afw). Making the unit test do something more useful when there is no live database instance to interact with looks like more than a day's worth of work, and I've filed DM-3431 to cover that particular code review comment.

          I've also edited the relevant Confluence pages to mention daf_ingest and the code transfer involved.


            People

            Assignee:
            smonkewitz Serge Monkewitz
            Reporter:
            fritzm Fritz Mueller
            Reviewers:
            John Swinbank
            Watchers:
            Jacek Becla, John Swinbank, Kian-Tat Lim, Serge Monkewitz

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.