Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: daf_butler
- Story Points: 4
- Epic Link:
- Team: Data Access and Database
Description
Using SQLAlchemy with the SQLite, Oracle, and PostgreSQL backends, write a Python snippet that bulk-inserts into a table with an autoincrement field and returns to Python the values of the autoincrement field. This operation must be safe in the presence of concurrent inserts into the same table.
My understanding is that:
- PostgreSQL can do this via INSERT ... RETURNING, and SQLAlchemy already supports that syntax.
- Oracle can do this via INSERT ... RETURNING, but the way those results are returned is different from how the PostgreSQL native Python driver does it, and SQLAlchemy does not support that syntax, at least for bulk inserts.
- SQLite cannot do this for bulk inserts at all, unless you either:
  - rely on exclusive locking and do arithmetic on the last inserted ID and the number of rows (scary, to say the least, and I'd really like to stop using exclusive locking), or
  - select from the sequence in advance and then use those values explicitly in the insert (at least, I think this ought to work, but it definitely needs to be tested).
I don't think this operation needs to also include the conflict-resolution options of DM-21201, but it's possible we'll identify a need for that in the future.
I do think we'll want a guarantee that the order of returned IDs matches the order of the rows provided, or some other way to connect them if that's not true.
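For reference, the single-row RETURNING form that SQLAlchemy supports today looks roughly like this (a sketch; the table, column, and connection URL are placeholders, and making something equivalent work for bulk inserts is the open question here):

from sqlalchemy import Column, Integer, MetaData, Table, create_engine

engine = create_engine("postgresql://user:pass@host/db")  # placeholder URL
metadata = MetaData()
demo = Table(
    "demo", metadata,
    Column("id", Integer, primary_key=True),  # rendered as SERIAL on PostgreSQL
    Column("val1", Integer),
)
metadata.create_all(engine)

# Single-row INSERT ... RETURNING: the generated id comes back as a row.
with engine.connect() as connection:
    result = connection.execute(demo.insert().returning(demo.c.id), {"val1": 1})
    inserted_id = result.scalar()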
Activity
Some thoughts on boundary conditions:
- I don't think we need to support both returning autoincrement IDs and conflict resolution in any single table. But we will want to do both on different tables within the same transaction.
- I would like to start using bulk-insert syntax (though maybe we mean different things by that), so lastrowid / inserted_primary_key aren't viable solutions on their own.
- If we can come up with good solutions for Oracle and PostgreSQL via INSERT...RETURNING but SQLite is much harder, maybe we should benchmark single-row inserts for SQLite in a loop before trying too hard. Right now we have both a SAVEPOINT and an INSERT for each row in a loop, and it may be enough to just move the SAVEPOINT calls out of the loop; see the sketch after this list.
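A rough sketch of that last idea (assuming a SQLAlchemy connection, a Core table object, and a list of row dicts named rows):

# begin_nested() emits a single SAVEPOINT around the whole loop;
# each row is still its own INSERT, so inserted_primary_key is available.
with connection.begin_nested():
    for row in rows:
        connection.execute(table.insert(), row)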
Regarding bulk insert, it indeed needs some clarification, as there are at least two different ways to do it: one is via multiple VALUES in one query (INSERT ... (col1, col2) VALUES (v1, v2), (v3, v4), ...) and the other is executemany()-style, where one passes a list of records to insert to a single connection.execute() call. The VALUES case (which needs the values() method called with a list of records) is non-standard: it cannot be used with Oracle and has limitations on other backends, so I think we should not use it. That leaves execute() with a list of records as, I think, the only reasonable option for bulk insert. It can be implemented as an actual bulk insert in one query by some backends (I think cx_oracle does that) or by just iterating over the records and doing execute() on each record separately (some optimizations are possible even there, e.g. pre-compiling a statement once on the server side).
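To make the distinction concrete, the two styles in SQLAlchemy Core look like this (a sketch, assuming a Core table object and an open connection):

# 1. Multi-VALUES: a single statement with several VALUES tuples
#    (non-standard; not usable with Oracle).
connection.execute(
    table.insert().values([{"val1": 1, "val2": 10}, {"val1": 2, "val2": 20}])
)

# 2. executemany()-style: one statement plus a list of parameter sets;
#    the backend may do a true bulk insert or loop internally.
connection.execute(
    table.insert(),
    [{"val1": 1, "val2": 10}, {"val1": 2, "val2": 20}],
)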
Bulk insertion and RETURNING may not work together for Oracle; the SQLAlchemy docs say explicitly that RETURNING is only for single-row inserts. I'm playing with a test and it indeed looks like that: there is no way to get anything out of Oracle for bulk-inserted rows. I'll post a summary of what I have found, but I want to do more testing with SQLite and Postgres before that.
If we want to do conflict resolution and bulk insert in different tables but in one transaction, we'll need to think about transaction management; my current implementation of conflict resolution does its own transactions, so it needs to be extended.
I think at one point Christopher Stephens [X] convinced me (perhaps even with an example) that RETURNING with some kind of bulk insert in Oracle was possible, but my (superficial) follow-up identified that the problem was that the SQLAlchemy Python syntax for RETURNING that works in PostgreSQL wasn't supported by its Oracle dialect for some reason.
I'm unsure whether SQLAlchemy can generate the appropriate DDL at table creation for autoincrement in Oracle, but identity columns have existed in Oracle since 12.1 (we are on 19.4):
https://oracle-base.com/articles/12c/identity-columns-in-oracle-12cr1
Jim Bosch, good memory! Here is the code example I gave Jim to demonstrate the ability to return IDs from a bulk insert:
Hey Jim, just a quick code sample for returning IDs when doing bulk inserts, for future reference, with cx_Oracle:
--setup
DROP TABLE INS_RET_TEST;
CREATE TABLE INS_RET_TEST (
  ID NUMBER NOT NULL ENABLE,
  COL1 NUMBER NOT NULL ENABLE,
  CONSTRAINT INS_RET_TEST_PK PRIMARY KEY (ID) USING INDEX
);

DROP SEQUENCE INS_RET_TEST_SEQ;
CREATE SEQUENCE INS_RET_TEST_SEQ MINVALUE 1 MAXVALUE 999999999
  INCREMENT BY 1 START WITH 1;

CREATE OR REPLACE TRIGGER INS_RET_TEST_TRG
BEFORE INSERT ON INS_RET_TEST
FOR EACH ROW
BEGIN
  <<COLUMN_SEQUENCES>>
  BEGIN
    IF INSERTING AND :NEW.ID IS NULL THEN
      SELECT INS_RET_TEST_SEQ.NEXTVAL INTO :NEW.ID FROM SYS.DUAL;
    END IF;
  END COLUMN_SEQUENCES;
END;
/
ALTER TRIGGER INS_RET_TEST_TRG ENABLE;
--demonstration
#!/usr/bin/env python
import cx_Oracle

db = cx_Oracle.connect(dsn="sec_lsst1db_cs2018")
cur = db.cursor()
data = [[1], [2], [3], [4], [5]]
# Output array that cx_Oracle fills with the RETURNING values.
IdArr = cur.var(int, arraysize=len(data))
cur.setinputsizes(None, IdArr)
cur.executemany("insert into ins_ret_test(col1) values (:1) returning id into :2", data)
for ix, datum in enumerate(data):
    print("Data: ", datum[0], "ID: ", IdArr.getvalue(ix))
Christopher Stephens [X], thanks for the example! From what I see, SQLAlchemy has no problem supporting the autoincrement feature, though it does not use triggers or special DDL for that; it relies on an explicit "nextval" from a user-specified sequence. SQLAlchemy's problem is with returning multiple generated keys: it explicitly says this is not supported, and that is what I see in my tests. It's good to know that cx_Oracle allows it, but my worry is that if we switch to low-level DBAPI calls we are going to lose all the SQLAlchemy specializations that we use (like special type decorators), and using a backend-specific DBAPI makes it harder to test/debug things.
Here is a summary of what I have seen so far with the standard SQLAlchemy `execute()` method. I tried several variations: single-row vs. multi-row inserts, and using RETURNING with the engines that support it. I tested it on a simple table with a few columns and an integer primary-key column. The script is here: https://gist.github.com/andy-slac/7e1e473dd49ced02dd844fda7c835e8e
Both Oracle and Postgres support the implicit_returning option (enabled by default), which adds RETURNING to single-row INSERTs to return the value of the generated primary key. It does nothing for multi-row inserts.
Oracle backend
To support autoincrement, Oracle needs a separate SEQUENCE defined, which must be passed to the PK Column constructor: https://docs.sqlalchemy.org/en/13/dialects/oracle.html#auto-increment-behavior
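In SQLAlchemy terms that looks roughly like this (a sketch using the table and sequence names from my test script):

from sqlalchemy import Column, Integer, MetaData, Sequence, Table

# The Sequence attached to the primary-key Column is what makes the
# Oracle backend render test_table_ret_id_seq.nextval in the INSERT.
metadata = MetaData()
test_table_ret = Table(
    "test_table_ret", metadata,
    Column("id", Integer, Sequence("test_table_ret_id_seq"), primary_key=True),
    Column("val1", Integer),
    Column("val2", Integer),
)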
- Implicit RETURNING, single-row insert:
  - generated SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2) RETURNING test_table_ret.id INTO :ret_0
  - params: {'val1': 1, 'val2': 10, 'ret_0': <cx_Oracle.STRING with value [None, ..., None]>}
  - result.rowcount: 1
  - result.inserted_primary_key: list of one integer, the inserted ID
  - no rows returned in the result
- Explicit RETURNING, single-row insert:
  - SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2) RETURNING test_table_ret.id INTO :ret_0
  - params: {'val1': 1, 'val2': 10, 'ret_0': <cx_Oracle.STRING with value [None, ..., None]>}
  - result.rowcount: 1
  - result.inserted_primary_key: Not defined
  - result has a single row with the value of the inserted ID
- Implicit RETURNING, multi-row insert:
  - SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2)
  - params: [{'val1': 1, 'val2': 10}, {'val1': 2, 'val2': 20}, {'val1': 3, 'val2': 30}]
  - result.rowcount: 3
  - result.inserted_primary_key: Not defined
  - no rows returned in the result
- Explicit RETURNING, multi-row insert:
  - SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2) RETURNING test_table_ret.id INTO :ret_0
  - params: [{'val1': 1, 'val2': 10, 'ret_0': None}, {'val1': 2, 'val2': 20, 'ret_0': None}, {'val1': 3, 'val2': 30, 'ret_0': None}]
  - result.rowcount: 3
  - result.inserted_primary_key: Not defined
  - no rows returned in the result
Postgres backend
Like Oracle, Postgres needs a sequence to generate auto-increment values. Postgres can define the sequence itself if the PK column has a special type (SERIAL/BIGSERIAL), but using an explicit sequence à la Oracle works too: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#sequences-serial-identity
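A sketch of the two options (t1/t2 are illustrative names):

from sqlalchemy import Column, Integer, MetaData, Sequence, Table

metadata = MetaData()
# Plain integer PK: on Postgres this is rendered as SERIAL, which
# creates and owns its sequence implicitly.
t1 = Table("t1", metadata, Column("id", Integer, primary_key=True))
# Explicit sequence, same spelling as the Oracle setup above.
t2 = Table(
    "t2", metadata,
    Column("id", Integer, Sequence("t2_id_seq"), primary_key=True),
)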
- Implicit RETURNING, single-row insert:
  - generated SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (nextval('test_table_ret_id_seq'), %(val1)s, %(val2)s) RETURNING test_table_ret.id
  - params: {'val1': 1, 'val2': 10}
  - result.rowcount: 1
  - result.inserted_primary_key: list of one integer, the inserted ID
  - result.lastrowid: 0 (probably meaningless)
  - no rows returned in the result
- Explicit RETURNING, single-row insert:
  - generated SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (nextval('test_table_ret_id_seq'), %(val1)s, %(val2)s) RETURNING test_table_ret.id
  - params: {'val1': 1, 'val2': 10}
  - result.rowcount: 1
  - result.inserted_primary_key: Not defined
  - result.lastrowid: 0 (probably meaningless)
  - result has a single row with the value of the inserted ID
- Implicit RETURNING, multi-row insert:
  - SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (nextval('test_table_ret_id_seq'), %(val1)s, %(val2)s)
  - params: ({'val1': 1, 'val2': 10}, {'val1': 2, 'val2': 20}, {'val1': 3, 'val2': 30})
  - result.rowcount: 3
  - result.inserted_primary_key: Not defined
  - result.lastrowid: 0 (probably meaningless)
  - no rows returned in the result
- Explicit RETURNING, multi-row insert:
  - SQL: INSERT INTO test_table_ret (id, val1, val2) VALUES (nextval('test_table_ret_id_seq'), %(val1)s, %(val2)s) RETURNING test_table_ret.id
  - params: ({'val1': 1, 'val2': 10}, {'val1': 2, 'val2': 20}, {'val1': 3, 'val2': 30})
  - result.rowcount: 3
  - result.inserted_primary_key: Not defined
  - result.lastrowid: 0 (probably meaningless)
  - no rows returned in the result
SQLite backend
SQLite has no support for RETURNING.
- single-row insert:
  - generated SQL: INSERT INTO test_table_ret (val1, val2) VALUES (?, ?)
  - params: (1, 10)
  - result.rowcount: 1
  - result.inserted_primary_key: list of one integer, the inserted ID
  - result.lastrowid: a number (same as in inserted_primary_key)
  - no rows returned in the result
- multi-row insert:
  - SQL: INSERT INTO test_table_ret (val1, val2) VALUES (?, ?)
  - params: ((1, 10), (2, 20), (3, 30))
  - result.rowcount: 3
  - result.inserted_primary_key: Not defined
  - result.lastrowid: None
  - no rows returned in the result
The summary, I guess, is that:
- supporting the auto-increment feature is not hard; we just need a Sequence defined for Oracle, and optionally for Postgres (I think it can even be shared between multiple tables)
- returning the generated PK value by default only works for a single-row insert; nothing special is needed, result.inserted_primary_key has the value
- multi-row insert has no standard way of returning values to the client for any of the backends; there may be backend-specific ways to do that, though, which I will look at next.
I tried to find a workaround for the Oracle backend without going full cx_Oracle. With a simple script (a modified version of Chris' script using keyword params, similar to what we do in SQLAlchemy; see also the cx_Oracle docs) I looked at how it works and where the state is kept. The key to everything is the cursor.setinputsizes() method, which binds an output variable (INTO :var) to a Python-side object. I think it's clear from that example that bound output variables are stored in cursor.bindvars, which can be used to find the values of the outputs even if the original bound object is not accessible.
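The modified script is essentially this (a sketch; the DSN is a placeholder):

import cx_Oracle

db = cx_Oracle.connect(dsn="...")
cur = db.cursor()
data = [{"val1": 1, "val2": 10}, {"val1": 2, "val2": 20}, {"val1": 3, "val2": 30}]
# Bind an output array for the RETURNING target; cx_Oracle fills it in
# during executemany() and also records it in cur.bindvars["ret_0"].
IdArr = cur.var(int, arraysize=len(data))
cur.setinputsizes(ret_0=IdArr)
cur.executemany(
    "insert into test_table_ret (id, val1, val2) "
    "values (test_table_ret_id_seq.nextval, :val1, :val2) "
    "returning test_table_ret.id into :ret_0",
    data,
)
for ix, datum in enumerate(data):
    print("Data:", datum, "ID:", IdArr.getvalue(ix))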
For reference, here is the output from the cx_Oracle script:
IdArr = <cx_Oracle.NUMBER with value [None, None, None]> @140719028050832
pre-execute cursor.bindvars = {'ret_0': <cx_Oracle.NUMBER with value [None, None, None]>}
post-execute cursor.bindvars = {'ret_0': <cx_Oracle.NUMBER with value [[223], [224], [225]]>, 'val1': <cx_Oracle.NUMBER with value [1, 2, 3]>, 'val2': <cx_Oracle.NUMBER with value [10, 20, 30]>}
ret_0 = <cx_Oracle.NUMBER with value [[223], [224], [225]]> @140719028050832
Data: {'val1': 1, 'val2': 10} ID: [223]
Data: {'val1': 2, 'val2': 20} ID: [224]
Data: {'val1': 3, 'val2': 30} ID: [225]
(Initially 'ret_0' in bindvars is set to [None, None, None]; after query execution it is updated with the actual values.)
I tried to use this approach with SQLAlchemy, looking at the low-level cursor object and checking its bindvars attribute. Unfortunately, with SQLAlchemy it does not seem to work: the corresponding bound variable never gets updated. I tried to dump the contents of cursor.bindvars right after cursor.executemany(), but even at that point it only contains None. It looks like there is some extra option set by SQLAlchemy that prevents cx_Oracle from updating the bound variable. One suspicion is that SQLAlchemy passes the value of the "ret_0" parameter (as None) when it calls executemany(), but this is only a guess. The code that handles execution is spread across multiple classes, so it's hard to understand where the issue might be.
(and here is the output of my test script for reference too):
$ ./test_returning.py -r -m oracle
[... skip irrelevant output ...]
2019-10-24 17:33:59,316 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
param: BindParameter('val1', symbol('REQUIRED'), type_=Integer()) <class 'int'> @94163688039968
param: BindParameter('val2', symbol('REQUIRED'), type_=Integer()) <class 'int'> @94163688039968
param: BindParameter('ret_0', None, type_=Integer()) <cx_Oracle.NUMBER with value [None, None, None]> @139887790588680
event_setinputsizes cursor.bindvars = None
2019-10-24 17:33:59,317 INFO sqlalchemy.engine.base.Engine INSERT INTO test_table_ret (id, val1, val2) VALUES (test_table_ret_id_seq.nextval, :val1, :val2) RETURNING test_table_ret.id INTO :ret_0
2019-10-24 17:33:59,318 INFO sqlalchemy.engine.base.Engine [{'val1': 1, 'val2': 10, 'ret_0': None}, {'val1': 2, 'val2': 20, 'ret_0': None}, {'val1': 3, 'val2': 30, 'ret_0': None}]
after_cursor_execute:
cursor = <cx_Oracle.Cursor on <cx_Oracle.Connection to @gen3_cred_salnikov>> @139887791399760
cursor.bindvars = {'val1': <cx_Oracle.NUMBER with value [1, 2, 3]>, 'val2': <cx_Oracle.NUMBER with value [10, 20, 30]>, 'ret_0': <cx_Oracle.NUMBER with value [None, None, None]>}
values = <cx_Oracle.NUMBER with value [None, None, None]> @139887790588680
after conn.execute:
cursor = <cx_Oracle.Cursor on <cx_Oracle.Connection to @gen3_cred_salnikov>> @139887791399760
cursor.bindvars = {'val1': <cx_Oracle.NUMBER with value [1, 2, 3]>, 'val2': <cx_Oracle.NUMBER with value [10, 20, 30]>, 'ret_0': <cx_Oracle.NUMBER with value [None, None, None]>}
values = <cx_Oracle.NUMBER with value [None, None, None]> @139887790588680
rowcount = 3
inserted_primary_key = N/A
2019-10-24 17:33:59,321 INFO sqlalchemy.engine.base.Engine COMMIT
I think I've reached a point where I'm ready to give up for now. As we discussed at the meeting today, we can probably survive with single-row inserts if we get rid of savepoints (I think savepoints are only useful in the context of conflict handling, and we already have a solution for that). Not sure if I should close this ticket or keep it open.
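For the record, the single-row fallback is just this (a sketch, assuming a connection and a Core table object as in the earlier examples), with no SAVEPOINT per row:

# One INSERT per row; inserted_primary_key works for single-row inserts
# on every backend, so the returned IDs stay aligned with the input rows.
ids = []
for row in [{"val1": 1, "val2": 10}, {"val1": 2, "val2": 20}]:
    result = connection.execute(table.insert(), row)
    ids.append(result.inserted_primary_key[0])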
Given that the title of this ticket starts with "Research...", I think it's fair to close this as Done. The comments make it clear where it's landed, and provide good information for another ticket to pick up if necessary.
Just to make it more complete, I also had a quick look at the Postgres situation; it may be somewhat better than Oracle, and I think it may be possible to make it work with minimal hacking:
- the psycopg2 backend does not return a result set from executemany(), so by default there is no way to extract the data
- psycopg2 has a couple of non-standard optimizations for executemany: execute_batch() and execute_values(); both are supported in some form by SQLAlchemy by passing the executemany_mode="batch" or executemany_mode="values" keyword to create_engine()
- "batch" mode works by first "paging" the arguments into shorter sequences and then executing multiple semicolon-separated statements in one query (with 1000 rows to insert and page_size=100 there will be 10 calls to execute(), each with a long query string containing 100 statements). With this, INSERT ... RETURNING can only return the result of the last statement, so the returned result will have just a single row (which is not useful for us).
- "values" mode again uses pagination to split the initial parameters into shorter sequences, and then for each sequence it execute()s one multi-VALUES insert statement. With this approach the returned result set includes the rows from the last call to execute(), so if there is only one page we get back all inserted values. Pagination can be controlled by passing parameters to the backend, so potentially we could implement our own pagination to make sure that only one statement is executed per call, and then merge the returned rows ourselves; see the sketch below.
And just to mention it - with SQLite I don't think any kind of hack is possible at all, we will be forced to use single-row inserts if we want to get every autoincremented ID back.
Closing this, nothing more to say.
General comment on the autoincrement feature: its implementation is backend-specific and is done very differently in SQLAlchemy for each backend.
Relying on autoincrement with bulk inserts is somewhat tricky: one has to retrieve the IDs of the inserted rows and associate them with the client-side data. Oracle and Postgres support the INSERT ... RETURNING syntax, so it is possible to specify what has to be returned from an INSERT. SQLAlchemy also provides inserted_primary_key and lastrowid attributes on the result, which work for a single-row insert (that should be OK for us, as we do not really use "true" bulk-insert syntax). Combining multi-row insert and conflict resolution means that the number of rows actually inserted can differ from the number of rows supplied, and associating the two may not be trivial at all. I think the only way out of that is to insert one row at a time, check whether it was inserted OK, and find its primary key.
Just to mention another approach: we could use explicit sequence calls to obtain a set of IDs and then insert the new rows with the known IDs (for SQLite we'd have to simulate a sequence). The drawback here is that we would need multiple calls to the sequence to generate multiple IDs, and each call is basically a separate query and a separate round trip to the server, so effectively it means doubling the number of queries sent to the server.
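A sketch of that approach with SQLAlchemy (assuming an engine and the test_table_ret table/sequence from the sketches above):

from sqlalchemy import Sequence

# Pre-fetch one ID per row from the sequence (one round trip each),
# then do a plain executemany()-style insert with explicit primary keys.
seq = Sequence("test_table_ret_id_seq")
rows = [{"val1": 1, "val2": 10}, {"val1": 2, "val2": 20}]
with engine.connect() as connection:
    for row in rows:
        row["id"] = connection.execute(seq)  # one extra query per row
    connection.execute(test_table_ret.insert(), rows)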
I want to experiment a little bit with all three backends to check for any non-trivial issues.