Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: daf_butler
- Labels:
- Story Points: 6
- Epic Link:
- Sprint: DB_F19_07, DB_F19_10
- Team: Data Access and Database
Description
Using SQLAlchemy with SQLite, Oracle, and PostgreSQL backends, write a small Python snippet that bulk-inserts into a table with a primary key or other unique constraints and, depending on an option provided by the caller, either:
- aborts when the constraint is violated (as usual)
- ignores new records that conflict with old records
- replaces old records with new records when conflicts appear.
This operation needs to be safe in the presence of concurrent inserts into the same table (though in that case we don't care who wins if one or more writers are attempting to replace).
I am fairly confident all three RDBMSs provide syntax for this, but it is not the same across backends, and SQLAlchemy support is uneven, so this will in general involve asking SQLAlchemy's connection object for the lower-level RDBMS-specific connection object (or equivalent); a sketch of one possible approach follows.
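A minimal sketch of what such a helper might look like, assuming SQLAlchemy 1.4 or newer (which exposes ON CONFLICT constructs for both the sqlite and postgresql dialects). The function name bulk_insert, the mode option, and the "replace overwrites all non-key columns" policy are assumptions for illustration, not existing daf_butler API; Oracle has no ON CONFLICT clause, so it is left unimplemented here.

```python
from sqlalchemy.dialects import postgresql, sqlite


def bulk_insert(connection, table, rows, mode="abort"):
    """Insert ``rows`` (a sequence of dicts) into ``table``.

    ``mode`` is one of "abort", "ignore", or "replace".
    """
    pk = [c.name for c in table.primary_key.columns]
    dialect = connection.dialect.name
    if mode == "abort":
        # Plain INSERT: any constraint violation raises IntegrityError.
        query = table.insert()
    elif dialect in ("sqlite", "postgresql"):
        make_insert = sqlite.insert if dialect == "sqlite" else postgresql.insert
        query = make_insert(table)
        if mode == "ignore":
            # Silently drop new rows that conflict with existing ones.
            query = query.on_conflict_do_nothing(index_elements=pk)
        else:  # mode == "replace"
            # Overwrite every non-key column of the existing row with the
            # value from the new ("excluded") row.
            query = query.on_conflict_do_update(
                index_elements=pk,
                set_={c.name: query.excluded[c.name]
                      for c in table.columns if c.name not in pk},
            )
    else:
        # Oracle provides no ON CONFLICT clause; the equivalent would be a
        # hand-written MERGE statement (or the IGNORE_ROW_ON_DUPKEY_INDEX
        # hint for the "ignore" case), issued as raw SQL.
        raise NotImplementedError(f"mode={mode!r} not implemented for {dialect!r}")
    connection.execute(query, rows)
```

On SQLite and PostgreSQL this compiles to INSERT ... ON CONFLICT DO NOTHING / DO UPDATE; the Oracle branch is exactly the place where one would have to drop down to backend-specific SQL, as the description anticipates.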
Attachments
Issue Links
- blocks
  - DM-21231 Refactor Registry handling of dataset and associated tables (Done)
- contains
  - DM-17419 Resolve upsert order in SQLite Registry (Invalid)
- is triggering
  - DM-21748 oracle ci_hsc_gen3 sqlalchemy.exc.ObjectNotExectuableError (Done)
- relates to
  - DM-21301 pipetask --register-dataset-types doesn't work with Postgres registry on AWS (Done)
Concurrency issues.
In single-transaction mode there are no concurrency issues: every statement executes against the latest snapshot, and COMMIT is not supposed to generate any errors. With concurrent transactions the situation becomes more complicated, in part because different backends may implement transaction isolation slightly differently. If two or more clients read or update the same data simultaneously, the exact behavior depends on a number of factors, such as the transaction isolation level, the locking implementation, the optimization strategy, etc. Even with the conflict-resolution options outlined above, this can result in a situation where the INSERT statement succeeds but the final COMMIT fails (or one of the updaters commits while the others fail).
I think proper handling of failures in concurrent updates should also include exception handling for the COMMIT phase. We could catch the exception, roll back, and repeat the operation if the commit fails with an IntegrityError (only in the ignore/replace modes). There may be other backend-specific strategies (e.g. explicit locking), but they would be a lot messier to use. Proving that something works correctly is generally hard in concurrent systems, and it will be very hard to build a reasonable test for all existing and future backend types, so I think "abort and retry" is probably the easiest way to convince ourselves that we are doing things right; a sketch of that retry loop follows.
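A rough illustration of the "abort and retry" strategy. The bulk_insert helper from the description sketch, the insert_with_retry name, and the retry budget are assumptions, not existing API; whether the conflict surfaces as a SQLAlchemy IntegrityError at COMMIT time may vary by backend and driver.

```python
from sqlalchemy.exc import IntegrityError

MAX_ATTEMPTS = 3  # arbitrary retry budget; tune for expected contention


def insert_with_retry(engine, table, rows, mode="ignore"):
    """Run bulk_insert in its own transaction, retrying when a concurrent
    writer causes the transaction to fail at COMMIT time."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            # engine.begin() commits on normal exit and rolls back if the
            # block (including the COMMIT itself) raises.
            with engine.begin() as connection:
                bulk_insert(connection, table, rows, mode=mode)
            return
        except IntegrityError:
            # A concurrent transaction committed conflicting rows between
            # our INSERT and our COMMIT; replay the whole operation.
            if attempt == MAX_ATTEMPTS - 1:
                raise
```

Because the whole operation is replayed from a fresh snapshot, the retry sees whatever the competing writer committed, which matches the "we don't care who wins" requirement for concurrent replace.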