Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-26689

Add command-line tool for Butler.pruneDatasets

    XMLWordPrintable

    Details

    • Story Points:
      6
    • Sprint:
      DB_S21_12
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      Add a command-line tool or tools for Butler.pruneDatasets. This method has a lot of optional keyword arguments, and it's worth thinking about whether the right command-line interface actually involves multiple subcommands. I also think a command-line interface for this method makes a tool for Registry.disassociate unnecessary, but if we split it up into multiple subcommands, making one just call Registry.disassociate might make sense.

      This script will need to start with a call to Registry.queryDatasets, so I'll add DM-26685 as a blocker.

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment -

            to both of your last posts - good catch (and sorry the review is taking so long - I built myself a test repo to play with on Tuesday and just haven't had a chance to return to it).

            Show
            jbosch Jim Bosch added a comment - to both of your last posts - good catch (and sorry the review is taking so long - I built myself a test repo to play with on Tuesday and just haven't had a chance to return to it).
            Hide
            npease Nate Pease [X] (Inactive) added a comment -

            No worries on taking time to review. I think I've made all the changes that have been requested so far, and pushed to the ticket branch in GitHub.

            Show
            npease Nate Pease [X] (Inactive) added a comment - No worries on taking time to review. I think I've made all the changes that have been requested so far, and pushed to the ticket branch in GitHub.
            Hide
            jbosch Jim Bosch added a comment -

            I've left a few minor comments on the PR. And I've got a few behavior requests, too:

            If the user doesn't pass any of -purge, disassociate, or unstore, this doesn't actually do anything. -unstore seems like an okay default; and maybe we keep the option around so you can say "disassociate and also unstore"?

            It'd be nice to have the dry-run functionality check for a few obvious ways things could go wrong:

            • if the user passed --purge, is that collection actually a RUN-type collection, and are the DatasetRef.run attributes equal to that for all queried datasets?
            • if the user passed --disassociate, is that collectoni actually a TAGGED-type collection?

            It'd probably be best to perform those checks regardless of the dry-run state, to provide better error messages, but they're all things that would cause lower-level failures when not doing dry-run, so that's where it's most important.

            I'm also a bit worried about possibly surprising behavior with unstore and disassociate, because there the actions are "do something to this dataset if there is something to do", and the dry-run outputs can't really tell us whether there is something to do. Maybe the best solution is to update the dry-run/confirm text according to the action, i.e: instead of "The following datasets will be removed.", we'd say:

            • for disassociate, "The following datasets will be disassociated from <COLLECTION> if they are currently present in it (which is not checked):"
            • for unstore, "The following datasets will be removed from any datastores in which they are present:"

            Maybe someday we could make the query for datasets clever enough to only return datasets that actually would be affected, but we can't right now.

            Anyhow, I may have had a lot to say, but this is actually really close, and I don't think I need to look again (unless there's something you want me to look at of course), so I'm hitting Reviewed and Approve now.

            Show
            jbosch Jim Bosch added a comment - I've left a few minor comments on the PR. And I've got a few behavior requests, too: If the user doesn't pass any of - purge , disassociate , or unstore , this doesn't actually do anything. -unstore seems like an okay default; and maybe we keep the option around so you can say "disassociate and also unstore"? It'd be nice to have the dry-run functionality check for a few obvious ways things could go wrong: if the user passed --purge , is that collection actually a RUN-type collection, and are the DatasetRef.run attributes equal to that for all queried datasets? if the user passed --disassociate , is that collectoni actually a TAGGED-type collection? It'd probably be best to perform those checks regardless of the dry-run state, to provide better error messages, but they're all things that would cause lower-level failures when not doing dry-run, so that's where it's most important. I'm also a bit worried about possibly surprising behavior with unstore and disassociate, because there the actions are "do something to this dataset if there is something to do", and the dry-run outputs can't really tell us whether there is something to do. Maybe the best solution is to update the dry-run/confirm text according to the action, i.e: instead of "The following datasets will be removed.", we'd say: for disassociate, "The following datasets will be disassociated from <COLLECTION> if they are currently present in it (which is not checked):" for unstore, "The following datasets will be removed from any datastores in which they are present:" Maybe someday we could make the query for datasets clever enough to only return datasets that actually would be affected, but we can't right now. Anyhow, I may have had a lot to say, but this is actually really close, and I don't think I need to look again (unless there's something you want me to look at of course), so I'm hitting Reviewed and Approve now.
            Hide
            npease Nate Pease [X] (Inactive) added a comment -

            Jim Bosch for the alternate text for disassociate and unstore: it's currently possible to pass both those options. Does that even make sense? Maybe unstore does everything disassociate would do and more? ...and we should require one or the other but not both?
            If we should allow both then what text should be used if both options are used?

            Show
            npease Nate Pease [X] (Inactive) added a comment - Jim Bosch for the alternate text for disassociate and unstore: it's currently possible to pass both those options. Does that even make sense? Maybe unstore does everything disassociate would do and more? ...and we should require one or the other but not both? If we should allow both then what text should be used if both options are used?
            Hide
            jbosch Jim Bosch added a comment -

            Passing both of those options does make sense, and I'd just concatenate the actions:

            "The following datasets will be disassociated from <COLLECTION> if they are currently present in it (which is not checked), and removed from any datastores in which they are present."

            Above I said we should make unstore the default, but thinking about it further, I think requiring at least one of purge, unstore, or disassociate is probably better.  Here's the full matrix:

            • none of (purge, unstore, disassociate): error, nothing to do
            • purge only: ok
            • unstore only: ok
            • disassociate only: ok
            • purge+unstore: ok, just ignore unstore (purge effectively implies unstore)
            • purge+disassociate: probably best to make this an error instead of ignoring disassociate, because that comes with a collection argument that we can't respect, and that might be confusing(purge will disassociate from all TAGGED collections, not just the one given)
            • unstore+disassociate: ok; these operations are unrelated to each other
            • purge+unstore+disassociate: an error, for the same reason as just purge+disassociate

             

            Show
            jbosch Jim Bosch added a comment - Passing both of those options does make sense, and I'd just concatenate the actions: "The following datasets will be disassociated from <COLLECTION> if they are currently present in it (which is not checked), and removed from any datastores in which they are present." Above I said we should make unstore the default, but thinking about it further, I think requiring at least one of purge, unstore, or disassociate is probably better.  Here's the full matrix: none of (purge, unstore, disassociate): error, nothing to do purge only: ok unstore only: ok disassociate only: ok purge+unstore: ok, just ignore unstore (purge effectively implies unstore) purge+disassociate: probably best to make this an error instead of ignoring disassociate, because that comes with a collection argument that we can't respect, and that might be confusing(purge will disassociate from all TAGGED collections, not just the one given) unstore+disassociate: ok; these operations are unrelated to each other purge+unstore+disassociate: an error, for the same reason as just purge+disassociate  

              People

              Assignee:
              npease Nate Pease [X] (Inactive)
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Jim Bosch
              Watchers:
              Jim Bosch, Nate Pease [X] (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.