Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-21201

Research cross-database approach to inserts with custom conflict resolution

    XMLWordPrintable

    Details

    • Story Points:
      6
    • Sprint:
      DB_F19_07, DB_F19_10
    • Team:
      Data Access and Database

      Description

      Using SQLAlchemy with SQLite, Oracle, and PostgreSQL backends, write a small Python snippet that bulk-inserts into a table with a primary key or other unique constraints and, depending on an option provided by the caller, either:

      • aborts when the constraint is violated (as usual)
      • ignores new records that conflict with old records
      • replaces old records with new records when conflicts appear.

      This operation needs to be safe in the presence of concurrent inserts into the same table (though in that case we don't care who wins if one or more writers are attempting to replace).

      I am fairly confident all three RDBMSs provide a syntax for this, but it's not the same, and SQLAlchemy support is uneven, so this will in general involve asking SQLAlchemy's connection object for the lower-level RDBMS-specific connection object (or equivalent).

        Attachments

          Issue Links

            Activity

            Hide
            salnikov Andy Salnikov added a comment - - edited

            Concurrency issues.

            In a single-transaction mode there are no concurrency issues, every statement is executed against latest snapshot and COMMIT is not supposed to generate any errors. With concurrent transactions situation becomes more complicated, in part also because different backends may implement transaction isolation logic slightly differently. If two or more clients want to read/update the same data simultaneously the exact behavior depends on a number of factors such as transaction isolation level, implementation of locking mechanism, optimization strategy, etc. Even with the conflict resolution options outlined above it may result in a situation when INSERT statement succeeds but final COMMIT fails (or one of the updaters will commit but others will fail).

            I think proper handling of the failures in concurrent updates should also include exception handling for COMMIT phase. We could catch an exception, rollback, and repeat an operation once again if commit fails due to IntegrityError (only if we do ignore/replace). There may be other backend-specific strategies (e.g. explicit locking) but it would be a lot messier to try to use something like that. Proving that something works correctly is generally hard in concurrent systems, and it will be very hard to build a reasonable test for all existing and future backend types, so I think that "abort and retry" is probably easiest way to convince ourselves that we do things right.

            Show
            salnikov Andy Salnikov added a comment - - edited Concurrency issues. In a single-transaction mode there are no concurrency issues, every statement is executed against latest snapshot and COMMIT is not supposed to generate any errors. With concurrent transactions situation becomes more complicated, in part also because different backends may implement transaction isolation logic slightly differently. If two or more clients want to read/update the same data simultaneously the exact behavior depends on a number of factors such as transaction isolation level, implementation of locking mechanism, optimization strategy, etc. Even with the conflict resolution options outlined above it may result in a situation when INSERT statement succeeds but final COMMIT fails (or one of the updaters will commit but others will fail). I think proper handling of the failures in concurrent updates should also include exception handling for COMMIT phase. We could catch an exception, rollback, and repeat an operation once again if commit fails due to IntegrityError (only if we do ignore/replace). There may be other backend-specific strategies (e.g. explicit locking) but it would be a lot messier to try to use something like that. Proving that something works correctly is generally hard in concurrent systems, and it will be very hard to build a reasonable test for all existing and future backend types, so I think that "abort and retry" is probably easiest way to convince ourselves that we do things right.
            Hide
            salnikov Andy Salnikov added a comment -

            Some details about PostgreSQL "ON CONFLICT" implementation. I am testing it with a test instance (version 10) to make sure that I understand all issues.
            Simple table is created:

            postgres=# create table test (pk int primary key, val1 int, val2 int);
            CREATE TABLE
            postgres=# insert into test (pk, val1, val2) VALUES (1, 10, 100);
            INSERT 0 1
            postgres=# insert into test (pk, val1, val2) VALUES (2, 20, 200);
            INSERT 0 1
            postgres=# select * from test;
             pk | val1 | val2 
            ----+------+------
              1 |   10 |  100
              2 |   20 |  200
            (2 rows)
            

            Here are the points that I learned:

            • ON CONFLICT DO NOTHING works as expected:

              postgres=# insert into test (pk, val1, val2) VALUES (1, 10, 100) ON CONFLICT DO NOTHING;
              INSERT 0 0
              

            • ON CONFLICT DO UPDATE SET ... seems to require conflicting column names or index name:

              postgres=# insert into test (pk, val1, val2) VALUES (1, 10, 100) ON CONFLICT DO UPDATE SET val1=11;
              ERROR:  ON CONFLICT DO UPDATE requires inference specification or constraint name
              LINE 1: ...rt into test (pk, val1, val2) VALUES (1, 10, 100) ON CONFLIC...
               
              HINT:  For example, ON CONFLICT (column_name).
              

            • In DO UPDATE SET ... one can also "update" conflicting column value

              postgres=# insert into test (pk, val1, val2) VALUES (1, 10, 100) ON CONFLICT (pk) DO UPDATE SET pk=excluded.pk, val1=excluded.val1, val2=excluded.val2;
              INSERT 0 1
              

            • Using "excluded" table with column name that was not in the INSERT list is OK, but its value is set to NULL:

              postgres=# insert into test (pk, val1) VALUES (1, 10) ON CONFLICT (pk) DO UPDATE SET pk=excluded.pk, val1=excluded.val1, val2=excluded.val2;
              INSERT 0 1
              postgres=# select * from test;
               pk | val1 | val2 
              ----+------+------
                2 |   20 |  200
                1 |   10 |     
              (2 rows)
              

            Show
            salnikov Andy Salnikov added a comment - Some details about PostgreSQL "ON CONFLICT" implementation. I am testing it with a test instance (version 10) to make sure that I understand all issues. Simple table is created: postgres=# create table test (pk int primary key , val1 int , val2 int ); CREATE TABLE postgres=# insert into test (pk, val1, val2) VALUES (1, 10, 100); INSERT 0 1 postgres=# insert into test (pk, val1, val2) VALUES (2, 20, 200); INSERT 0 1 postgres=# select * from test; pk | val1 | val2 ----+------+------ 1 | 10 | 100 2 | 20 | 200 (2 rows ) Here are the points that I learned: ON CONFLICT DO NOTHING works as expected: postgres=# insert into test (pk, val1, val2) VALUES (1, 10, 100) ON CONFLICT DO NOTHING; INSERT 0 0 ON CONFLICT DO UPDATE SET ... seems to require conflicting column names or index name: postgres=# insert into test (pk, val1, val2) VALUES (1, 10, 100) ON CONFLICT DO UPDATE SET val1=11; ERROR: ON CONFLICT DO UPDATE requires inference specification or constraint name LINE 1: ...rt into test (pk, val1, val2) VALUES (1, 10, 100) ON CONFLIC...   HINT: For example, ON CONFLICT (column_name). In DO UPDATE SET ... one can also "update" conflicting column value postgres=# insert into test (pk, val1, val2) VALUES (1, 10, 100) ON CONFLICT (pk) DO UPDATE SET pk=excluded.pk, val1=excluded.val1, val2=excluded.val2; INSERT 0 1 Using "excluded" table with column name that was not in the INSERT list is OK, but its value is set to NULL: postgres=# insert into test (pk, val1) VALUES (1, 10) ON CONFLICT (pk) DO UPDATE SET pk=excluded.pk, val1=excluded.val1, val2=excluded.val2; INSERT 0 1 postgres=# select * from test; pk | val1 | val2 ----+------+------ 2 | 20 | 200 1 | 10 | (2 rows)
            Hide
            salnikov Andy Salnikov added a comment - - edited

            After reading a bit more on PostgreSQL syntax it is clear that ON CONFLICT DO UPDATE can only work with a single constraint and it is not possible to specify more than one constraint.

            I think for our use case this may be OK, in dataset_collections table insert we could just depend on PK uniqueness to achive the same behavior as we have now:

            • When inserting we could do "ON CONFLICT (dataset_id, collection) DO NOTHING"
            • When same (dataset_id, collection) is already there it means that we already OK, there is no need to update dataset_ref_hash (it is the same by construction)
            • If there a record with different (dataset_id, collection) but the same (dataset_ref_hash, collection) then it will cause an error (as it is not handled by "ON CONFLICT")
            Show
            salnikov Andy Salnikov added a comment - - edited After reading a bit more on PostgreSQL syntax it is clear that ON CONFLICT DO UPDATE can only work with a single constraint and it is not possible to specify more than one constraint. I think for our use case this may be OK, in dataset_collections table insert we could just depend on PK uniqueness to achive the same behavior as we have now: When inserting we could do "ON CONFLICT (dataset_id, collection) DO NOTHING" When same (dataset_id, collection) is already there it means that we already OK, there is no need to update dataset_ref_hash (it is the same by construction) If there a record with different (dataset_id, collection) but the same (dataset_ref_hash, collection) then it will cause an error (as it is not handled by "ON CONFLICT")
            Hide
            salnikov Andy Salnikov added a comment -

            Jim Bosch, I have something that I think is in a reasonable shape, but probably not final. The implementation is now in an internal method in SqlRegistry class, if you want it to be in a separate module somewhere it can be done too, I just need some input on that. Anyways, I'd be hapy to get comments on how it is implemented now, took me some time to figure out all details, but the amount of code looks reasonable. Bigger question is of course whether it does what you want.

            Show
            salnikov Andy Salnikov added a comment - Jim Bosch , I have something that I think is in a reasonable shape, but probably not final. The implementation is now in an internal method in SqlRegistry class, if you want it to be in a separate module somewhere it can be done too, I just need some input on that. Anyways, I'd be hapy to get comments on how it is implemented now, took me some time to figure out all details, but the amount of code looks reasonable. Bigger question is of course whether it does what you want.
            Hide
            jbosch Jim Bosch added a comment -

            Comments on the PR.

            Show
            jbosch Jim Bosch added a comment - Comments on the PR.

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.