  Data Management / DM-21203

Research cross-database approach to bulk inserts returning autoincrement values


    Details

      Description

      Using SQLAlchemy with SQLite, Oracle, and PostgreSQL backends, write a Python snippet that bulk-inserts into a table with an autoincrement field and returns to Python the values of that field.  This operation must be safe in the presence of concurrent inserts into the same table.

      My understanding is that:

      • PostgreSQL can do this via INSERT ... RETURNING, and SQLAlchemy already supports that syntax.
      • Oracle can do this via INSERT ... RETURNING, but the way those results are returned is different from how the PostgreSQL native Python driver does it, and SQLAlchemy does not support that syntax, at least for bulk inserts.
      • SQLite cannot do this for bulk inserts at all, unless you either:
        • rely on exclusive locking and do arithmetic on the last inserted ID and the number of rows (scary, to say the least - and I'd really like to stop using exclusive locking)
        • select from the sequence in advance and then use those values explicitly in the insert (at least, I think this ought to work - but it definitely needs to be tested).

      I don't think this operation needs to also include the conflict-resolution options of DM-21201, but it's possible we'll identify a need for that in the future.

      I do think we'll want a guarantee that the order of returned IDs matches the order of the rows provided, or some other way to connect them if that's not true.
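
      As a concrete starting point for the PostgreSQL bullet above, a minimal sketch (SQLAlchemy 1.3 with psycopg2; the table and connection details are illustrative) of a multi-VALUES INSERT ... RETURNING. Whether the returned order is guaranteed to match the input order is exactly what this ticket should settle:

      import sqlalchemy

      engine = sqlalchemy.create_engine("postgresql://user:pass@host/dbname")
      metadata = sqlalchemy.MetaData()
      demo = sqlalchemy.Table(
          "demo", metadata,
          sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True, autoincrement=True),
          sqlalchemy.Column("val", sqlalchemy.Integer),
      )
      metadata.create_all(engine)

      with engine.connect() as conn:
          # One multi-VALUES statement; PostgreSQL returns one row per inserted row.
          stmt = demo.insert().returning(demo.c.id).values([{"val": 1}, {"val": 2}, {"val": 3}])
          ids = [row[0] for row in conn.execute(stmt)]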

       



            Activity

            jbosch Jim Bosch created issue -
            swinbank John Swinbank made changes -
            Field Original Value New Value
            Team Data Access and Database [ 10204 ]
            jbosch Jim Bosch made changes -
            Link This issue blocks DM-21231 [ DM-21231 ]
            salnikov Andy Salnikov made changes -
            Assignee Andy Salnikov [ salnikov ]
            salnikov Andy Salnikov made changes -
            Epic Link DM-20082 [ 318933 ]
            jbosch Jim Bosch made changes -
            Link This issue blocks DM-21770 [ DM-21770 ]
            salnikov Andy Salnikov made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            salnikov Andy Salnikov added a comment -

            General comment on the autoincrement feature: its implementation is backend-specific, and SQLAlchemy handles it very differently for each backend:

            • Oracle has no support for autoincrement; one must always use a SEQUENCE to generate unique numbers. At the SQLAlchemy level one needs to pass a Sequence instance to the Column definition (see the sketch after this list): https://docs.sqlalchemy.org/en/13/dialects/oracle.html#auto-increment-behavior
            • Postgres has the special column types SERIAL/BIGSERIAL, which are just shorthand for creating a special per-table sequence and associating that sequence with the column: https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL
            • SQLite has two variations of the auto-increment feature. One is a plain "INTEGER PRIMARY KEY" column, equivalent to ROWID, which generates unique but reusable numbers (the ID of a deleted row can be reused later). The other is the AUTOINCREMENT flag, which generates truly unique, non-reusable ROWIDs: https://www.sqlite.org/autoinc.html. SQLite has no support for sequences, though one can be simulated with a special table with an AUTOINCREMENT column.
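
            (A minimal sketch of a portable column definition; names are illustrative, and with optional=True the Sequence is only used on backends that cannot autoincrement without it, i.e. Oracle:)

            import sqlalchemy

            metadata = sqlalchemy.MetaData()
            table = sqlalchemy.Table(
                "test_table", metadata,
                sqlalchemy.Column(
                    "id", sqlalchemy.Integer,
                    sqlalchemy.Sequence("test_table_id_seq", optional=True),
                    primary_key=True, autoincrement=True,
                ),
                sqlalchemy.Column("val", sqlalchemy.Integer),
            )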

            Relying on autoincrement with bulk inserts is somewhat tricky - one has to retrieve the IDs of the inserted rows and associate them with the client-side data. Oracle and Postgres support the INSERT ... RETURNING syntax, so it is possible to specify what should be returned from the INSERT. SQLAlchemy also provides inserted_primary_key and lastrowid attributes on the ResultProxy, which work for a single-row insert (see the sketch below; that should be OK for us, as we do not really use "true" bulk insert syntax). Combining multi-row insert and conflict resolution means that the number of rows actually inserted can differ from the number of rows supplied, and associating the two may not be trivial at all. I think the only way out of that is to insert one row at a time, check whether it was inserted OK, and find its primary key.
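
            (The single-row path, which behaves the same on all three backends - a sketch reusing the table defined above and an engine for any backend:)

            with engine.connect() as conn:
                ids = []
                for record in [{"val": 1}, {"val": 2}]:
                    result = conn.execute(table.insert(), record)
                    # inserted_primary_key is a list, one value per PK column.
                    ids.append(result.inserted_primary_key[0])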

            Just to mention another approach - we could use explicit sequence calls to obtain a set of IDs and then insert the new rows with the known IDs (for SQLite we'd have to simulate a sequence). The drawback is that we'd need multiple calls to the sequence to generate multiple IDs, and each call is basically a separate query and a separate round trip to the server, so effectively it means doubling the number of queries sent to the server.
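
            (A sketch of that approach for Oracle/Postgres, assuming the Sequence defined above; each sequence call is its own round trip:)

            seq = sqlalchemy.Sequence("test_table_id_seq")
            with engine.connect() as conn:
                records = [{"val": 1}, {"val": 2}]
                for record in records:
                    # One extra query per row to fetch the next sequence value.
                    record["id"] = conn.execute(seq)
                conn.execute(table.insert(), records)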

            I want to experiment a little bit with all three backends to check for any non-trivial issues.

            jbosch Jim Bosch added a comment - - edited

            Some thoughts on boundary conditions:

            • I don't think we need to support both returning autoincrement IDs and conflict resolution in any single table.  But we will want to do both on different tables within the same transaction.
            • I would like to start using bulk insert syntax (though maybe we mean different things by that), so lastrowid / inserted_primary_key aren't on their own viable solutions.
            • If we can come up with good solutions for Oracle and PostgreSQL via INSERT...RETURNING, but SQLite is much harder, maybe we should benchmark single-row inserts for SQLite in a loop before trying too hard.  Right now we have both SAVEPOINT and INSERT for each row in a loop, and it may be enough to just move the SAVEPOINT calls out of the loop.

             

            salnikov Andy Salnikov added a comment - - edited

            Regarding bulk insert, indeed it needs some clarification, as there are at least two different ways to do it: one is via multiple VALUES in one query (INSERT ... (col1, col2) VALUES (v1, v2), (v3, v4), ...), and the other is executemany()-style, where one passes a list of records to insert to a single connection.execute() call (see the sketch below). The VALUES case (which requires calling the values() method with a list of records) is non-standard, cannot be used with Oracle, and has limitations on other backends; I think we should not use it. So execute() with a list of records is, I think, the only reasonable option for bulk insert. It can be implemented as an actual bulk insert in one query by some backends (I think cx_Oracle does that) or by just iterating over the records and doing execute() on each record separately (some optimizations are possible even there, e.g. pre-compiling a statement once on the server side).
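
            (A sketch of the executemany()-style form, reusing the table from above; the driver decides whether this becomes one round trip or many:)

            with engine.connect() as conn:
                conn.execute(
                    table.insert(),
                    [{"val": 1}, {"val": 2}, {"val": 3}],
                )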

            Bulk insertion and RETURNING may not work together for Oracle; the SQLAlchemy docs say explicitly that RETURNING is only for single-row inserts. I'm playing with a test, and it indeed looks like that: there is no way to get anything out of Oracle for bulk-inserted rows. I'll post a summary of what I have found, but I want to do more testing with SQLite and Postgres before that.

            If we want to do conflict resolution and bulk insert in different tables but in one transaction, we'll need to think about transaction management; my current implementation of conflict resolution does its own transactions, so it needs to be extended.

            jbosch Jim Bosch added a comment -

            I think at one point Christopher Stephens [X] convinced me (perhaps even with an example) that RETURNING with some kind of bulk insert in Oracle was possible, but my (superficial) follow-up identified that the problem was that SQLAlchemy's Python syntax for RETURNING in PostgreSQL wasn't supported by its Oracle driver for some reason.

            cs2018 Christopher Stephens [X] (Inactive) added a comment -

            I'm unsure whether SQLAlchemy can generate the appropriate DDL at table creation for autoincrement in Oracle, but that functionality has existed since 12.1 (we are on 19.4).

            https://oracle-base.com/articles/12c/identity-columns-in-oracle-12cr1

            Jim Bosch, good memory! Here is the code example I gave Jim to demonstrate the ability to return IDs from a bulk insert:

            Hey Jim, just a quick code sample for returning ID's when doing bulk inserts for future reference w/ cx_Oracle:

            --setup
            DROP TABLE INS_RET_TEST;

            CREATE TABLE INS_RET_TEST
              (ID NUMBER NOT NULL ENABLE,
               COL1 NUMBER NOT NULL ENABLE,
               CONSTRAINT INS_RET_TEST_PK PRIMARY KEY (ID)
               USING INDEX);

            DROP SEQUENCE INS_RET_TEST_SEQ;

            CREATE SEQUENCE  INS_RET_TEST_SEQ  MINVALUE 1 MAXVALUE 999999999
            INCREMENT BY 1 START WITH 1;

            CREATE OR REPLACE TRIGGER INS_RET_TEST_TRG
            BEFORE INSERT ON INS_RET_TEST
            FOR EACH ROW
            BEGIN
              <<COLUMN_SEQUENCES>>
              BEGIN
                IF INSERTING AND :NEW.ID IS NULL THEN
                  SELECT INS_RET_TEST_SEQ.NEXTVAL INTO :NEW.ID FROM SYS.DUAL;
                END IF;
              END COLUMN_SEQUENCES;
            END;
            /

            ALTER TRIGGER INS_RET_TEST_TRG ENABLE;

            --demonstration

            #!/usr/bin/env python

            import cx_Oracle

            db = cx_Oracle.connect(dsn="sec_lsst1db_cs2018")
            cur = db.cursor()

            data = [[1], [2], [3], [4], [5]]

            # One output slot per inserted row for the RETURNING ... INTO bind.
            IdArr = cur.var(int, arraysize=len(data))
            cur.setinputsizes(None, IdArr)
            cur.executemany("insert into ins_ret_test(col1) values (:1) returning id into :2", data)

            for ix, datum in enumerate(data):
                print("Data: ", datum[0], "ID: ", IdArr.getvalue(ix))

            salnikov Andy Salnikov added a comment -

            Christopher Stephens [X], thanks for the example! From what I see, SQLAlchemy has no problem supporting the autoincrement feature, though it does not use triggers or special DDL for that but relies on an explicit "nextval" from a user-specified sequence. SQLAlchemy's problem is with returning multiple generated keys: it explicitly says that this is not supported, and that is what I see in my tests. It's good to know that cx_Oracle allows it, but my worry is that if we switch to low-level DBAPI calls we are going to lose all the SQLAlchemy specializations that we use (like special type decorators), and using a backend-specific DBAPI makes it harder to test/debug stuff.

            salnikov Andy Salnikov added a comment - - edited

            Here is a summary of what I have seen so far with the standard SQLAlchemy execute() method. I tried several variations: single-row vs. multi-row inserts, and using RETURNING with the engines that support it. I tested on a simple table with a few columns and an integer primary-key column. The script is here: https://gist.github.com/andy-slac/7e1e473dd49ced02dd844fda7c835e8e

            Both Oracle and Postgres support the implicit_returning option (enabled by default), which adds RETURNING to single-row INSERTs to return the value of the generated primary key. It does nothing for multi-row inserts.
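
            (The "explicit RETURNING" cases below are built roughly like this sketch, shown for a single row; table and connection names are schematic. Per the results below, on both backends the result then holds one row with the new ID:)

            result = conn.execute(table.insert().returning(table.c.id), {"val1": 1, "val2": 10})
            new_id = result.scalar()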

            Oracle backend

            To support autoincrement, Oracle needs a separate SEQUENCE defined, which must be passed to the PK Column constructor: https://docs.sqlalchemy.org/en/13/dialects/oracle.html#auto-increment-behavior

            1. Implicit RETURNING, single-row insert:
              • generated SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2) RETURNING test_table_ret.id INTO :ret_0
              • params: {'val1': 1, 'val2': 10, 'ret_0': <cx_Oracle.STRING with value [None, ..., None]>}
              • result.rowcount: 1
              • result.inserted_primary_key: list of one integer, inserted ID
              • no rows returned in the result
            2. Explicit RETURNING, single row insert:
              • SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2) RETURNING test_table_ret.id INTO :ret_0
              • params: {'val1': 1, 'val2': 10, 'ret_0': <cx_Oracle.STRING with value [None, ..., None]>}
              • result.rowcount: 1
              • result.inserted_primary_key: Not defined
              • result has single row with the value of ID inserted
            3. Implicit RETURNING, multi-row insert:
              • SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2)
              • params: [{'val1': 1, 'val2': 10}, {'val1': 2, 'val2': 20}, {'val1': 3, 'val2': 30}]
              • result.rowcount: 3
              • result.inserted_primary_key: Not defined
              • no rows returned in the result
            4. Explicit RETURNING, multi-row insert:
              • SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2) RETURNING test_table_ret.id INTO :ret_0
              • params: [{'val1': 1, 'val2': 10, 'ret_0': None}, {'val1': 2, 'val2': 20, 'ret_0': None}, {'val1': 3, 'val2': 30, 'ret_0': None}]
              • result.rowcount: 3
              • result.inserted_primary_key: Not defined
              • no rows returned in the result

            Postgres backend

            Like Oracle, Postgres needs a sequence to generate auto-increment values; Postgres can define the sequence itself if the PK column has a special type (SERIAL/BIGSERIAL), but using an explicit sequence a la Oracle works too: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#sequences-serial-identity

            1. Implicit RETURNING, single-row insert:
              • generated SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (nextval('test_table_ret_id_seq'), %(val1)s, %(val2)s) RETURNING test_table_ret.id
              • params: {'val1': 1, 'val2': 10}
              • result.rowcount: 1
              • result.inserted_primary_key: list of one integer, inserted ID
              • result.lastrowid: 0 (probably meaningless)
              • no rows returned in the result
            2. Explicit RETURNING, single row insert:
              • generated SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (nextval('test_table_ret_id_seq'), %(val1)s, %(val2)s) RETURNING test_table_ret.id
              • params: {'val1': 1, 'val2': 10}
              • result.rowcount: 1
              • result.inserted_primary_key: Not defined
              • result.lastrowid: 0 (probably meaningless)
              • result has single row with the value of ID inserted
            3. Implicit RETURNING, multi-row insert:
              • SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (nextval('test_table_ret_id_seq'), %(val1)s, %(val2)s)
              • params: ({'val1': 1, 'val2': 10}, {'val1': 2, 'val2': 20}, {'val1': 3, 'val2': 30})
              • result.rowcount: 3
              • result.inserted_primary_key: Not defined
              • result.lastrowid: 0 (probably meaningless)
              • no rows returned in the result
            4. Explicit RETURNING, multi-row insert:
              • SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (nextval('test_table_ret_id_seq'), %(val1)s, %(val2)s) RETURNING test_table_ret.id
              • params: ({'val1': 1, 'val2': 10}, {'val1': 2, 'val2': 20}, {'val1': 3, 'val2': 30})
              • result.rowcount: 3
              • result.inserted_primary_key: Not defined
              • result.lastrowid: 0 (probably meaningless)
              • no rows returned in the result

            SQLite backend

            SQLite has no support for RETURNING.

            1. single-row insert:
              • generated SQL: INSERT INTO test_table_ret (val1, val2) VALUES (?, ?)
              • params: (1, 10)
              • result.rowcount: 1
              • result.inserted_primary_key: list of one integer, inserted ID
              • result.lastrowid: number (same as in inserted_primary_key)
              • no rows returned in the result
            2. multi-row insert:
              • SQL: INSERT INTO test_table_ret (val1, val2) VALUES (?, ?)
              • params: ((1, 10), (2, 20), (3, 30))
              • result.rowcount: 3
              • result.inserted_primary_key: Not defined
              • result.lastrowid: None
              • no rows returned in the result

            The summary, I guess, is that:

            • supporting the auto-increment feature is not hard; we just need a Sequence defined for Oracle and optionally for Postgres (I think it can even be shared between multiple tables)
            • returning the generated PK value works by default only for single-row inserts; nothing special is needed, result.inserted_primary_key has the value
            • multi-row insert has no standard way of returning values to the client on any of the backends; there may be backend-specific ways to do that, though - I will look at that next.
            jbosch Jim Bosch made changes -
            Labels gen3-middleware gen2-deprecation-blocker gen3-middleware
            salnikov Andy Salnikov added a comment - - edited

            I tried to find a workaround for the Oracle backend without going full cx_Oracle. With a simple script (a modified version of Chris' script using keyword params similar to what we do in SQLAlchemy; see also the cx_Oracle docs) I looked at how it works and where the state is kept. The key to everything is the cursor.setinputsizes() method, which binds an output variable (INTO :var) to a Python-side object. I think it's clear from that example that bound output variables are stored in cursor.bindvars, which can be used to find the values of the outputs even if the original bound object is not accessible.
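
            (A sketch reconstructing that keyword-param variant - not the exact script, but the mechanics that produce the output below; the table, sequence, and DSN names follow the test setup:)

            import cx_Oracle

            db = cx_Oracle.connect(dsn="gen3_cred_salnikov")
            cur = db.cursor()

            data = [{"val1": 1, "val2": 10}, {"val1": 2, "val2": 20}, {"val1": 3, "val2": 30}]

            # Output array for RETURNING ... INTO; registering it under its bind
            # name also makes it show up in cursor.bindvars.
            ret = cur.var(int, arraysize=len(data))
            cur.setinputsizes(ret_0=ret)
            print("pre-execute cursor.bindvars = ", cur.bindvars)
            cur.executemany(
                "insert into test_table_ret (id, val1, val2) "
                "values (test_table_ret_id_seq.nextval, :val1, :val2) "
                "returning test_table_ret.id into :ret_0",
                data,
            )
            print("post-execute cursor.bindvars = ", cur.bindvars)
            for ix, datum in enumerate(data):
                print("Data: ", datum, "ID: ", ret.getvalue(ix))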

            For reference, here is the output from the cx_Oracle script:

            IdArr = <cx_Oracle.NUMBER with value [None, None, None]> @140719028050832
            pre-execute cursor.bindvars =  {'ret_0': <cx_Oracle.NUMBER with value [None, None, None]>}
            post-execute cursor.bindvars =  {'ret_0': <cx_Oracle.NUMBER with value [[223], [224], [225]]>, 'val1': <cx_Oracle.NUMBER with value [1, 2, 3]>, 'val2': <cx_Oracle.NUMBER with value [10, 20, 30]>}
            ret_0 = <cx_Oracle.NUMBER with value [[223], [224], [225]]> @140719028050832
            Data:  {'val1': 1, 'val2': 10} ID:  [223]
            Data:  {'val1': 2, 'val2': 20} ID:  [224]
            Data:  {'val1': 3, 'val2': 30} ID:  [225]
            

            (initially 'ret_0' in bindvars is set to [None, None, None], and after query execution it is updated to have actual values).

            I tried to use this approach with SQLAlchemy, looking at the low-level cursor object and checking its bindvars attribute; unfortunately, with SQLAlchemy it does not seem to work - the corresponding bound variable never gets updated. I tried to dump the contents of cursor.bindvars right after cursor.executemany(), but even at that point it only contains None. It looks like there is some extra option set by SQLAlchemy that prevents cx_Oracle from updating the bound variable. One suspicion is that SQLAlchemy passes a value for the "ret_0" parameter (as None) when it calls executemany(), but this is only a guess. The code that handles execution is spread across multiple classes, so it's hard to understand where the issue might be.

            (and here is the output of my test script for reference too):

            $ ./test_returning.py -r -m oracle
            [... skip irrelevant output ...]
            2019-10-24 17:33:59,316 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
            param: BindParameter('val1', symbol('REQUIRED'), type_=Integer()) <class 'int'> @94163688039968
            param: BindParameter('val2', symbol('REQUIRED'), type_=Integer()) <class 'int'> @94163688039968
            param: BindParameter('ret_0', None, type_=Integer()) <cx_Oracle.NUMBER with value [None, None, None]> @139887790588680
            event_setinputsizes cursor.bindvars =  None
            2019-10-24 17:33:59,317 INFO sqlalchemy.engine.base.Engine INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2) RETURNING test_table_ret.id INTO :ret_0
            2019-10-24 17:33:59,318 INFO sqlalchemy.engine.base.Engine [{'val1': 1, 'val2': 10, 'ret_0': None}, {'val1': 2, 'val2': 20, 'ret_0': None}, {'val1': 3, 'val2': 30, 'ret_0': None}]
            after_cursor_execute:
              cursor = <cx_Oracle.Cursor on <cx_Oracle.Connection to @gen3_cred_salnikov>> @139887791399760
              cursor.bindvars =  {'val1': <cx_Oracle.NUMBER with value [1, 2, 3]>, 'val2': <cx_Oracle.NUMBER with value [10, 20, 30]>, 'ret_0': <cx_Oracle.NUMBER with value [None, None, None]>}
              values = <cx_Oracle.NUMBER with value [None, None, None]> @139887790588680
            after conn.execute:
              cursor = <cx_Oracle.Cursor on <cx_Oracle.Connection to @gen3_cred_salnikov>> @139887791399760
              cursor.bindvars =  {'val1': <cx_Oracle.NUMBER with value [1, 2, 3]>, 'val2': <cx_Oracle.NUMBER with value [10, 20, 30]>, 'ret_0': <cx_Oracle.NUMBER with value [None, None, None]>}
              values = <cx_Oracle.NUMBER with value [None, None, None]> @139887790588680
            rowcount =  3
            inserted_primary_key =  N/A
            2019-10-24 17:33:59,321 INFO sqlalchemy.engine.base.Engine COMMIT
            

            I think I've reached a point where I'm ready to give up for now. As we discussed at the meeting today, we can probably survive with single-row inserts if we get rid of savepoints (I think savepoints are only useful in the context of conflict handling, and we already have a solution for that). Not sure if I should close this ticket or keep it open.

            salnikov Andy Salnikov made changes -
            Story Points 4
            jbosch Jim Bosch added a comment -

            Given that the title of this ticket starts with "Research...", I think it's fair to close this as Done.  The comments make it clear where it's landed, and provide good information for another ticket to pick up if necessary.

            salnikov Andy Salnikov added a comment - - edited

            Just to make it more complete, I also had a quick look at the Postgres situation; it may be somewhat better than Oracle, and I think it may be possible to make it work with minimal hacking:

            • The psycopg2 backend does not return a result set from executemany(), so by default there is no way to extract the data.
            • psycopg2 has a couple of non-standard optimizations for executemany: execute_batch() and execute_values(). Both are supported in some form by SQLAlchemy by passing the executemany_mode="batch" or executemany_mode="values" keyword to create_engine().
            • "batch" mode works by first "paging" the arguments into shorter sequences and then executing multiple semicolon-separated statements in one query (with 1000 rows to insert and page_size=100 there will be 10 calls to execute(), each with a long query string containing 100 statements). With this, if we do INSERT ... RETURNING, only the result of the last statement can be returned, so the result will have just a single row (not useful for us).
            • "values" mode again uses pagination to split the initial parameters into shorter sequences, and then for each sequence it execute()s one multi-VALUES insert statement. With this approach the returned result set includes the rows from the last call to execute(), so if there is only one page we get back all the inserted values. Pagination can be controlled by passing parameters to the backend, so potentially we could implement our own pagination to make sure only one statement is executed per call, and then merge the returned rows ourselves (see the sketch after this list).
            salnikov Andy Salnikov added a comment -

            And just to mention it - with SQLite I don't think any kind of hack is possible at all; we will be forced to use single-row inserts if we want to get every autoincremented ID back.

            Closing this, nothing more to say.

            salnikov Andy Salnikov made changes -
            Resolution Done [ 10000 ]
            Status In Progress [ 3 ] Done [ 10002 ]

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Andy Salnikov, Christopher Stephens [X] (Inactive), Jim Bosch

