Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-16606

ap_pipe should not create DB automatically

    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ap_pipe
    • Labels:
      None
    • Story Points:
      4
    • Sprint:
      AP S19-3, AP S19-4
    • Team:
      Alert Production

      Description

      As discussed in the review for DM-15588 ap_pipe should not assume that it can safely call Ppdb.makeSchema(): the implementation of makeSchema() may break when running on an existing database, or the program might run afoul of (a lack of) table creation permissions.

      The most user-friendly way to get around this limitation is to add two command-line flags to ap_pipe.py:

      • --make-db: ap_pipe will try to create the database it is configured for, assuming it does not already exist. Behavior if the database already exists is undefined. Not guaranteed to be compatible with all values for PpdbConfig; the user is responsible for verifying compatibility.
      • --clobber-db: ap_pipe will delete and replace the database it is configured for, if it already exists. Equivalent to --make-db if the database does not exist. Cannot be combined with --make-db, and should not be combined with --reuse-outputs-from associator or --reuse-outputs-from all.

      (If neither argument is provided, then the database is assumed to already exist.)

      This will be a breaking change to any scripts that call ap_pipe.py directly, since the current behavior is equivalent to always setting --make-db.

      This issue should be done after DM-13887, which will make it easier for ap_verify to pass arbitrary arguments to ap_pipe and make it safe for ap_verify to assume it is starting from scratch.

        Attachments

          Issue Links

            Activity

            Hide
            mrawls Meredith Rawls added a comment -

            One recent issue I've encountered comes up when I run ap_pipe with slurm. As you work to implement this ticket, I would appreciate a suggested workflow for the case where I need to create a database once and have a bunch of parallel and/or sequential operations write to it without crashing into one another. I think the --make_db flag as described would solve this, since it sounds like you propose to check if the DB exists and only make it if it does not.

            Show
            mrawls Meredith Rawls added a comment - One recent issue I've encountered comes up when I run ap_pipe with slurm. As you work to implement this ticket, I would appreciate a suggested workflow for the case where I need to create a database  once  and have a bunch of parallel and/or sequential operations write to it without crashing into one another. I think the --make_db flag as described would solve this, since it sounds like you propose to check if the DB exists and only make it if it does not.
            Hide
            krzys Krzysztof Findeisen added a comment - - edited

            No, I was not proposing that:

            Behavior if the database already exists is undefined.

            IIRC the problem is that, depending on the backend, a Ppdb object can't necessarily tell if the database already exists or not. Though now that makes me wonder how I expected to implement --clobber-db, given that the URI may not represent a file. Maybe I assumed there was never a valid use case where that's not true?

            The need to create/replace the database atomically (which I assume is the issue you've encountered with the current system?) is something that didn't even occur to me.

            This makes me wonder if maybe a separate creator script (which you would run before slurm) was a better option after all. I can already think of one way to get around my objection that the user would either have to specify the DB location twice, or become dependent on having a config file.

            Show
            krzys Krzysztof Findeisen added a comment - - edited No, I was not proposing that: Behavior if the database already exists is undefined. IIRC the problem is that, depending on the backend, a Ppdb object can't necessarily tell if the database already exists or not. Though now that makes me wonder how I expected to implement --clobber-db , given that the URI may not represent a file. Maybe I assumed there was never a valid use case where that's not true? The need to create/replace the database atomically (which I assume is the issue you've encountered with the current system?) is something that didn't even occur to me. This makes me wonder if maybe a separate creator script (which you would run before slurm ) was a better option after all. I can already think of one way to get around my objection that the user would either have to specify the DB location twice, or become dependent on having a config file.
            Hide
            mrawls Meredith Rawls added a comment -

            I see; I was focusing on the part where you said "try to create the database it is configured for, assuming it does not already exist," which is what I want, and glossing over the (critical!) "Behavior if the database already exists is undefined" bit  

            I'm OK if the solution is a separate DB creator script. I guess in that case, if the DB ap_pipe is directed to doesn't already exist, it would fail and log a helpful message about how you could use said script to create an empty DB and try again.

            Show
            mrawls Meredith Rawls added a comment - I see; I was focusing on the part where you said "try to create the database it is configured for, assuming it does not already exist," which is what I want, and glossing over the (critical!) "Behavior if the database already exists is undefined" bit   I'm OK if the solution is a separate DB creator script. I guess in that case, if the DB ap_pipe is directed to doesn't already exist, it would fail and log a helpful message about how you could use said script to create an empty DB and try again.
            Hide
            krzys Krzysztof Findeisen added a comment -

            Ok, a slightly more specific design:

            • ap_pipe.py will not call Ppdb.makeSchema at all. Unfortunately, Meredith Rawls's request for a helpful message will be tricky because a Ppdb object can't tell if the database already exists or not (the same problem as before). We could catch exceptions that get raised because of an improperly set-up DB, but those would only happen at the end of the run, and may cause false positives.
            • a new script, make_ppdb.py, will create a database from a config (this probably can't be a CmdLineTask because it doesn't use repos). The config for it will, like ApPipeConfig, contain a member called ppdb, so that any config overrides for the two scripts can be written exactly identically. This should cover slurm runs as well as ap_verify.

            I don't see a way to avoid having two copies of the database location (or a shared config file) that doesn't create other problems, and I don't remember what I had in mind on Jan 31. So I intend to just make the two changes above, and we can revisit the config problem if it turns out to be a major source of error when running ap_pipe from the command line.

            Show
            krzys Krzysztof Findeisen added a comment - Ok, a slightly more specific design: ap_pipe.py will not call Ppdb.makeSchema at all. Unfortunately, Meredith Rawls 's request for a helpful message will be tricky because a Ppdb object can't tell if the database already exists or not (the same problem as before). We could catch exceptions that get raised because of an improperly set-up DB, but those would only happen at the end of the run, and may cause false positives. a new script, make_ppdb.py , will create a database from a config (this probably can't be a CmdLineTask because it doesn't use repos). The config for it will, like ApPipeConfig , contain a member called ppdb , so that any config overrides for the two scripts can be written exactly identically. This should cover slurm runs as well as ap_verify . I don't see a way to avoid having two copies of the database location (or a shared config file) that doesn't create other problems, and I don't remember what I had in mind on Jan 31. So I intend to just make the two changes above, and we can revisit the config problem if it turns out to be a major source of error when running ap_pipe from the command line.
            Hide
            krzys Krzysztof Findeisen added a comment -

            Hi Andy Salnikov, can you please review these changes? Thanks!

            Show
            krzys Krzysztof Findeisen added a comment - Hi Andy Salnikov , can you please review these changes? Thanks!
            Hide
            salnikov Andy Salnikov added a comment -

            I'm in a different time zone (Europe) now but should be able to look a it when it is tomorrow morning here.

            Show
            salnikov Andy Salnikov added a comment - I'm in a different time zone (Europe) now but should be able to look a it when it is tomorrow morning here.
            Hide
            salnikov Andy Salnikov added a comment -

            Looks OK, few minor comments on PRs.

            Show
            salnikov Andy Salnikov added a comment - Looks OK, few minor comments on PRs.

              People

              • Assignee:
                krzys Krzysztof Findeisen
                Reporter:
                krzys Krzysztof Findeisen
                Reviewers:
                Andy Salnikov
                Watchers:
                Andy Salnikov, Chris Morrison, Eric Bellm, Eric Morganson, Krzysztof Findeisen, Meredith Rawls
              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: