Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-26375

Allow databases other than SQLite when running ap_verify

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ap_verify
    • Labels:
      None
    • Story Points:
      6
    • Sprint:
      AP F20-5 (October), AP F20-6 (November)
    • Team:
      Alert Production

      Description

      Currently, ap_verify is hard-coded to use an SQLite APDB when running the AP pipeline. While this is adequate for CI processing, SQLite does not scale well to large datasets. Provide a way for the user to override the use of SQLite, preferably by finding a more scalable way to handle configuration options that depend on the workspace location.

        Attachments

          Issue Links

            Activity

            Hide
            krzys Krzysztof Findeisen added a comment - - edited

            I think I painted myself into a corner on this one. The problem:

            • ap_verify currently needs to call make_apdb and the AP pipeline (both Gen 2 and Gen 3) through command-line arguments; there's no way to directly pass a config.
            • ap_verify tries to put the SQLite DB into the "workspace" directory it uses for managing a run (anywhere else is potentially confusing), which means the config sent to the programs needs to be set based on where the workspace for a given run is.
            • any user-specified config for databases, if it requires a file location (Postgres doesn't, but SQLite may not be the only one), can potentially be overwritten by code that solves the above point.

            Proposed solution:

            • Add an argument to ap_verify.py that takes an APDB config file. This config file may have a {workspace} placeholder that is replaced with the workspace location; the modified (proper) config is written to workspace/config where it can be used by make_apdb and the AP pipeline.
            • Allow a {workspace} placeholder in the Gen 3 pipeline file, which is processed in the same way (this lets us support other location-dependent configs, like diaPipe:alertPackager.alertWriteLocation). If provided, the separate APDB config will override any settings in the pipeline file, but we can transition to using only the pipeline file once we no longer have to support Gen 2.
            • Users who need a system-specific DB config can create a suitable config file and/or a pipeline that inherits the standard ApVerify pipeline, then call ap_verify.py with it.

            This is a clunky and complicated solution, but I think anything simpler (e.g., a hardcoded doUsePostgres flag) would have just led to even more technical debt down the road.

            Show
            krzys Krzysztof Findeisen added a comment - - edited I think I painted myself into a corner on this one. The problem: ap_verify currently needs to call make_apdb and the AP pipeline (both Gen 2 and Gen 3) through command-line arguments; there's no way to directly pass a config. ap_verify tries to put the SQLite DB into the "workspace" directory it uses for managing a run (anywhere else is potentially confusing), which means the config sent to the programs needs to be set based on where the workspace for a given run is. any user-specified config for databases, if it requires a file location (Postgres doesn't, but SQLite may not be the only one), can potentially be overwritten by code that solves the above point. Proposed solution: Add an argument to ap_verify.py that takes an APDB config file. This config file may have a {workspace} placeholder that is replaced with the workspace location; the modified (proper) config is written to workspace/config where it can be used by make_apdb and the AP pipeline. Allow a {workspace} placeholder in the Gen 3 pipeline file, which is processed in the same way (this lets us support other location-dependent configs, like diaPipe:alertPackager.alertWriteLocation ). If provided, the separate APDB config will override any settings in the pipeline file, but we can transition to using only the pipeline file once we no longer have to support Gen 2. Users who need a system-specific DB config can create a suitable config file and/or a pipeline that inherits the standard ApVerify pipeline, then call ap_verify.py with it. This is a clunky and complicated solution, but I think anything simpler (e.g., a hardcoded doUsePostgres flag) would have just led to even more technical debt down the road.
            Hide
            mrawls Meredith Rawls added a comment -

            I'm just catching up on this. Could we have ap_verify.py require a db_url argument/config? Since ap_verify calls make_apdb early on, if that fails because the db_url passed along is invalid for whatever reason, it could presumably fail quickly with a useful error message.

            Show
            mrawls Meredith Rawls added a comment - I'm just catching up on this. Could we have ap_verify.py require a db_url argument/config? Since ap_verify calls make_apdb early on, if that fails because the db_url passed along is invalid for whatever reason, it could presumably fail quickly with a useful error message.
            Hide
            krzys Krzysztof Findeisen added a comment -

            I would hope that failure mode would be handled adequately by make_apdb itself...

            I'm worried that just db_url might be too inflexible; e.g., SQLite also requires that you set the isolation level. Or are you saying provide a db_url argument in addition to a more general config?

            Show
            krzys Krzysztof Findeisen added a comment - I would hope that failure mode would be handled adequately by make_apdb itself... I'm worried that just db_url might be too inflexible; e.g., SQLite also requires that you set the isolation level. Or are you saying provide a db_url argument in addition to a more general config?
            Hide
            mrawls Meredith Rawls added a comment -

            I don't want more configs, no, haha... I guess whatever arguments make_apdb requires to find/connect to the APDB, ap_verify should also require and then pass along, is my broad probably-not-revolutionary idea here.

            Show
            mrawls Meredith Rawls added a comment - I don't want more configs, no, haha... I guess whatever arguments make_apdb requires to find/connect to the APDB, ap_verify should also require and then pass along, is my broad probably-not-revolutionary idea here.
            Hide
            krzys Krzysztof Findeisen added a comment -

            I discussed with Meredith Rawls offline, and the proposal above is probably more flexibility than we'll actually need. For now, I'll just have users pass a db_url argument, and special-case the isolation_level for SQLite (it is also special-cased in ApdbConfig itself).

            Show
            krzys Krzysztof Findeisen added a comment - I discussed with Meredith Rawls offline, and the proposal above is probably more flexibility than we'll actually need. For now, I'll just have users pass a db_url argument, and special-case the isolation_level for SQLite (it is also special-cased in ApdbConfig itself).
            Hide
            sullivan Ian Sullivan added a comment -

            Looks good, though I wasn't able to test the changes on real data.

            Show
            sullivan Ian Sullivan added a comment - Looks good, though I wasn't able to test the changes on real data.

              People

              Assignee:
              krzys Krzysztof Findeisen
              Reporter:
              krzys Krzysztof Findeisen
              Reviewers:
              Ian Sullivan
              Watchers:
              Eric Bellm, Ian Sullivan, Krzysztof Findeisen, Meredith Rawls
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.