# Use UUIDs as dataset_ids in registry


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points: 12
• Sprint: DB_S21_12
• Team: Data Access and Database
• Urgent?: No

#### Description

Syncing registries from different sources (such as the lightweight registry from workflows) will be simplified significantly if we switch from autoincrementing integers to UUIDs for our datasets.

Some things to consider:

• Where are UUIDs calculated? By Registry?
• Should we allow UUIDs to be calculated from a datasetRef (dataset type / dataId and run name)? This could be an alternative implementation of DM-21794, allowing raw data to have predictable IDs.
• Should we allow an external source for UUID?
• Is there a case for datastore to allocate IDs itself which can then be used by registry?

#### Activity

Tim Jenness created issue -
Tim Jenness added a comment -

Regarding datastore issuing UUIDs. I was more thinking that pipe_base would generate all the dataset UUIDs when it forms the lightweight registry. Then datastore would use whatever ID it finds in the DatasetRef that it's told to store. Do we really have a need for datastore to be used without an associated Butler?

Michelle Gower added a comment -

dmtn-099 has the original investigation into unique IDs.

Jim Bosch added a comment -

I have not thought about this in much depth, and don't plan to until Andy Salnikov has had a chance to look at it, but here are some quick thoughts on Tim Jenness's questions:

• Where are UUIDs calculated? By Registry?
• Should we allow an external source for UUID?
• Is there a case for datastore to allocate IDs itself which can then be used by registry?

I think a big advantage of UUIDs (if they work the way I think they might) is that they could come from anywhere, including external sources, and potentially datastore (someday), so I think we want interfaces that leave room for that (e.g. all ingest/import operations should be able to accept and use an existing UUID, with the caller guaranteeing that those IDs meet whatever specs for UUIDs we come up with). But I don't yet see a need for anything other than a Registry implementation for generating them in the near future.

Should we allow UUIDs to be calculated from a datasetRef (dataset type / dataId and run name)? This could be an alternative implementation of DM-21794, allowing raw data to have predictable IDs.

I think that's a good generalization for meeting the raw-data use case, and there may be a few other cases where we'd want to use that type of ID, because the definition of the dataset really can be globally consistent. But we definitely don't want this to be the only way to generate UUIDs, because I think we really want UUID equality across different repos to mean "these datasets really are identical", and the dataset type and run name (in particular) have to be very special for them to be sufficient for global uniqueness.

Regarding datastore issuing UUIDs. I was more thinking that pipe_base would generate all the dataset UUIDs when it forms the lightweight registry. Then datastore would use whatever ID it finds in the DatasetRef that it's told to store. Do we really have a need for datastore to be used without an associated Butler?

This sounds totally reasonable for now (and maybe forever), though I'm not sure I understood what you meant by the last question.

But I do feel like there's one bit of information that falls through the cracks a bit in our staging-repo processing plans, both with and without UUIDs: the datastore-internal records, like the formatter and read storage class information. That's mostly okay, because we can mostly rely on the workflow system to make sure the configuration that determines those values is the same in the job repo that actually puts the datasets and the shared repo we access them from later. But it means taking maintenance of what should be Datastore's internal invariants out of its hands and giving it to much higher-level workflow code where many more things can go wrong. I feel like it might be a better long-term plan to write the datastore records to a scalable NoSQL database at the same time that we write the file itself to the object store, and never try to put them into the PostgreSQL database later.

That also ties in with my still-vague thoughts about getting rid of dataset_location and reworking our notion of what the source of truth is for dataset existence and deletion - basically, if datastore records (as well as files) can exist first, with their own UUID-based dataset_id values, totally independent of a Registry (but with datastore records and files always written in concert), then I think we get to the kind of really decoupled system we hoped for originally, where Registry associates UUIDs with descriptive metadata and organizational structure, Datastore associates UUIDs with files and the information needed to read them, and Registry and Datastore don't have to talk to each other at all.

But that depends on UUIDs being really magical about avoiding collisions, and I don't claim to understand the constraints they impose in order to pull that off, so I have no idea if that all hangs together (or if it raises new problems that make it not a good tradeoff). Hopefully Andy Salnikov will be able to tell us.

Andy Salnikov added a comment -

First a short summary of UUID support by databases that we use.

PostgreSQL has a special data type for UUIDs (a 128-bit integer) and functions to generate them:

• since release 13 it has a built-in function gen_random_uuid() which returns a UUID4-style number.
• other functions are available in the uuid-ossp extension, which can also be used with earlier Postgres versions.

SQLite has no built-in support for UUIDs at all, but of course it happily allows any data type to be used as a primary key, so UUIDs can be stored as binary data or strings if needed.

• there is an extension module for UUID generation, but it is probably pointless as it will be easier to generate UUIDs in Python
• using strings instead of numbers for primary keys is a potential performance hit

SQLAlchemy provides an example of defining a TypeDecorator for the Python UUID type that converts them to the Postgres UUID type or CHAR(32) for other backends (though it does this by formatting the integer value, not using the standard UUID representation): https://docs.sqlalchemy.org/en/14/core/custom_types.html#backend-agnostic-guid-type
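The backend-agnostic type from that documentation page looks roughly like the sketch below. This is a paraphrase of the SQLAlchemy example, not daf_butler's actual implementation; the table name and round-trip at the end are just for illustration.

```python
import uuid

from sqlalchemy import CHAR, Column, MetaData, Table, create_engine, select
from sqlalchemy.dialects.postgresql import UUID as PG_UUID
from sqlalchemy.types import TypeDecorator


class GUID(TypeDecorator):
    """Store uuid.UUID values as the native uuid type on PostgreSQL,
    and as CHAR(32) hex strings on other backends."""

    impl = CHAR
    cache_ok = True

    def load_dialect_impl(self, dialect):
        if dialect.name == "postgresql":
            return dialect.type_descriptor(PG_UUID())
        return dialect.type_descriptor(CHAR(32))

    def process_bind_param(self, value, dialect):
        if value is None:
            return value
        if dialect.name == "postgresql":
            return str(value)
        # 32-char hex with no dashes, as in the SQLAlchemy example.
        return "%.32x" % uuid.UUID(str(value)).int

    def process_result_value(self, value, dialect):
        if value is None:
            return value
        return uuid.UUID(value)


# Round-trip a UUID through an in-memory SQLite database.
metadata = MetaData()
dataset = Table("dataset", metadata, Column("id", GUID(), primary_key=True))
engine = create_engine("sqlite://")
metadata.create_all(engine)

original = uuid.uuid4()
with engine.begin() as conn:
    conn.execute(dataset.insert().values(id=original))
with engine.connect() as conn:
    fetched = conn.execute(select(dataset.c.id)).scalar_one()
```

On SQLite the value is stored as a 32-character hex string but comes back out as a uuid.UUID, so application code never sees the backend difference.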

A general concern with storing UUIDs in a database is their randomness, which obviously impacts indexing performance. As Chris showed in DMTN-099, the performance hit can be significant for UUID4. OTOH the randomness of UUID4 is probably what we want for our use case to generate more unique UUIDs; the concern is that UUID1 is less random and could result in a higher collision chance. It may be interesting to repeat the same studies with Postgres to quantify the effect.

Jim Bosch added a comment - - edited

It occurs to me that we could also take an intermediate approach, and continue to use autoincrement integer keys for the primary key and foreign keys within each Registry (now as an implementation detail, with no expectation that they would ever be the same across Registries even for raws), while using UUIDs as the public-facing thing we put on DatasetRef, include in export/import files, and use for communication between Registry and Datastore. We would still need to index the UUIDs, but we'd use those indexes in fewer places.

Andy Salnikov added a comment -

Few more trivial thoughts on this topic.

Having a globally unique ID (in any form, either siteID+autoincrementID or UUID) as PK is indeed very beneficial for merging registries; there is no need to re-map PK/FK when importing data into a database. Managing siteID is not quite a trivial topic, but it can be solved by using a separate service which makes unique siteIDs for each site (so some features of a distributed database system need to be implemented even if we want to avoid calling it a distributed database). Random UUIDs may look better in that respect due to randomness, but there is always a tiny chance of collisions (non-uniqueness); the probability is low of course and many people do not care about it, but it would be nice to have some way to detect possible collisions.

Generating non-random UUIDs from a set of unique values (dataset type / dataId / collection) is of course possible by hashing all that info (e.g. using the UUID3/5 mechanism), and it may be useful if we want to allow multiple sources to fill the same collection, or if we want to detect accidental collisions in merged data (though we already have a constraint at the collection level, so this is not critical). And like with any UUID type the chance of collision is tiny but non-zero. Collisions for non-random UUIDs may be more difficult to handle than for random ones: a random UUID can be regenerated to produce a different one, while for non-random I guess the run collection name would need to be changed.

The source of random UUIDs can be anything: registry code, the database, or some other source. For non-random UUIDs the location where the UUID is generated needs to know all the identifying information; the registry itself is probably a good place for this, using the uuid Python module.
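The deterministic scheme described above can be sketched with the stdlib uuid module. The namespace UUID and the way the name string is assembled here are illustrative assumptions, not a proposed standard:

```python
import uuid

# Hypothetical fixed namespace; any constant UUID would do.
NS_BUTLER_DATASET = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/butler/dataset")


def deterministic_dataset_id(dataset_type, data_id, run):
    """Hash the identifying information into a UUID5, so that ingesting
    the same dataset at any site yields the same ID."""
    # Sort the dataId so dict ordering cannot change the hash.
    parts = [dataset_type, run] + [f"{k}={v}" for k, v in sorted(data_id.items())]
    return uuid.uuid5(NS_BUTLER_DATASET, ";".join(parts))
```

Regenerating after a collision then amounts to changing the run collection name, exactly as noted above: any change to the inputs produces a different UUID5.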

Jim's idea of having the UUID separate from a (not globally unique) integer PK can probably work, but it certainly needs some extra work during import, and I'm not sure what other implications that could have.

I think we want some performance tests to see how the different options work with Postgres.

Andy Salnikov added a comment - - edited

I ran a simple standalone test of insert time into a table which looks like a dataset table, with different kinds of primary keys: either a site_id two-column PK or a UUID of various versions (1, 4, and 5).

Summary of the setup:

• all inserts are done by a single client
• a total of 10,000,000 records inserted in each test, starting with an empty table
• a single transaction included 100 inserts, 100,000 transactions in total
• the sqlite database resided on a GPFS filesystem in the /project area; a new database file was created for each test
• the postgres backend ran against a server at NCSA using my personal account; the table was dropped/re-created in each test
• for UUID storage I used the TypeDecorator from the sqlalchemy example; it uses the native uuid type with postgres and CHAR(32) with sqlite
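A much-reduced stdlib sketch of this kind of harness (in-memory SQLite, UUID4 stored as CHAR(32)); the table layout is a stand-in and the numbers will not match the GPFS/NCSA setup above:

```python
import sqlite3
import time
import uuid


def time_insert_batches(n_txn, batch):
    """Return the wall-clock time of each transaction of `batch` inserts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataset (id CHAR(32) PRIMARY KEY, payload TEXT)")
    times = []
    for _ in range(n_txn):
        # UUID4 formatted as a 32-char hex string, matching the CHAR(32) storage.
        rows = [("%.32x" % uuid.uuid4().int, "x") for _ in range(batch)]
        start = time.perf_counter()
        with conn:  # one transaction per batch, mirroring the 100-insert batches
            conn.executemany("INSERT INTO dataset VALUES (?, ?)", rows)
        times.append(time.perf_counter() - start)
    return times
```

The real test used SQLAlchemy against both backends and far larger counts; this sketch only shows the batching structure being timed.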

Here is a plot for insertion time as a function of time for each of site_id, uuid1, uuid4, and uuid5 cases:

Notes: site_id is fastest and stable over the whole period at ~12ms per transaction; uuid1 is similar to site_id; uuid4 and uuid5 look similar to each other, are slower than uuid1, and vary with time; the important thing is that they don't grow uncontrollably.

Matching plots for postgres:

Notes: the postgres numbers are slower than sqlite; site_id is at the level of ~28ms per transaction; the uuid cases are slightly higher at ~33-35ms. There are also observable spikes on the uuid4/5 plots which happen every 15 minutes; this is probably due to some other activity happening concurrently.

I think the conclusion from this is that uuid causes some expected slowdown but it does not look critical. It will be interesting to see how it affects select queries. For that I think it's better to have a more realistic use case; I'll see if I can instrument registry code to dump performance metrics for execution time.

EDIT: I updated plots, previous version did not include commit time.

Tim Jenness added a comment -

These numbers imply to me that we should use uuid4+5 everywhere. The sqlite behavior is presumably not much of a concern since we aren't going to have a sqlite registry with millions of rows.

The impact on SELECT seems to be the real question then. The tech note implied that things slow down because the index is twice the size with a 128-bit number vs a small site integer and an incrementing int.

Jim Bosch added a comment -

You mean UUID4 for regular datasets, and UUID5 for raws (or any other There Can Only Be One datasets we encounter)?

Tim Jenness added a comment -

Yes, I mean UUID4 by default and allowing UUID5 for raw-like entities coming from ingest rather than put.

Andy Salnikov added a comment -

I made a quick change to the schema to use uuid instead of an integer ID for the dataset table and all FKs referring to it; this is only to test that query performance is not impacted. The only way for me to test it right now is with ci_hsc_gen3, and that is not a very realistic test due to its small scale. Anyway, as expected for this tiny database the timing does not change; here are two plots showing execution time before/after the schema change for queries that deal with datasets:

And here is for reference the diagram with dataset tables directly from Postgres:

The complicated issue now is how to do schema migration, as we cannot make this switch without also producing a tool to migrate the existing schema.

Tim Jenness added a comment -

Is the complication here that the dataset_id is fundamental, and so we can't simply switch a manager class in a config file to let new releases use uuid whilst old registries use incrementing integers?

Andy Salnikov added a comment -

I did not think about that option. I guess we could do it, though Datastore would need to be made smarter to guess what type it needs to use for its foreign key.

How do we know when to use UUID5? Is it a property of the dataset type, the collection, or something else?

Tim Jenness added a comment - - edited

Datastore has no real idea – it just uses the ID from the DatasetRef. Would the plan be to have a DatasetRef that has an id and uuid property? Datastore could easily look for both and prefer uuid.

Or maybe the real problem is the definition of the table?

Tim Jenness added a comment - - edited

External ingest is the only possible time when a UUID5 would be used. I think for now it would be something that only butler ingest-raws can trigger (i.e. RawIngestTask). Could we do it by having RawIngestTask create a resolved DatasetRef with a UUID5 attached and have butler ingest accept that it has to use that UUID? (presumably butler import would be doing something similar since it would be telling butler to reuse the UUID[4/5] from the exported yaml file)

Andy Salnikov added a comment -

Datastore table schema needs to use data type consistent with PK in dataset table.

And I did not think about the DatasetRef ID. If we want to support both integers and UUIDs via different managers, that would mean that the DatasetRef ID type will depend on registry manager configuration. I'm not sure how confusing that could be and what sort of complications it would cause.

I'm OK with the ingest tool providing UUIDs; that looks like the correct location for the decision of which UUID version to use.

Tim Jenness added a comment -

Maybe it is easier for DatasetRef.id to stay the same but the thing in it to change.

Andy Salnikov added a comment -

There is some progress with the implementation, but right now I am stuck on the unit tests, which are written with the assumption that DatasetRef IDs are auto-incrementing integers and break for random UUIDs, e.g. test_cliCmdQueryDatasets.py. Not sure what to do about that; maybe the right thing is to ignore the "id" column entirely, which means I'll need to drop it from the astropy table.
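One way to make such comparisons ID-agnostic, sketched here with plain row dicts rather than the actual astropy tables used by the tests:

```python
def strip_ids(rows, id_column="id"):
    """Drop the unpredictable ID column before comparing query output."""
    return [{k: v for k, v in row.items() if k != id_column} for row in rows]


# Expected output written for integer IDs vs actual output with a random UUID:
expected = [{"id": 1, "type": "calexp", "run": "demo"}]
actual = [{"id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "type": "calexp", "run": "demo"}]
```

After stripping the ID column, the two rows compare equal, so the test no longer depends on how IDs are generated.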

There are likely a bunch of other places that assume an integer ID which we are not testing in the unit tests; these could break later.

Tim Jenness added a comment -

I think any test assuming incrementing integer IDs is broken.

Tim Jenness added a comment -

Is your problem that the integer is the default for sorting the astropy table? We need to be able to sort DatasetRef somehow.

I think the ID in the output of the query commands is useful since it lets you explicitly get that dataset later.

Andy Salnikov added a comment -

Sorting is not a problem (yet). It breaks even for a single record because the expected output has ID "1" and the actual output is a random UUID string. I agree that the ID in the output is useful, but we need to be able to test that with random IDs.

Tim Jenness added a comment -

Okay. Yes, the test should not be assuming specific dataset_ids.

Andy Salnikov added a comment -

Tim Jenness, I am thinking about the idea of doing pre-calculated UUIDs for raw ingest. I'd like to avoid exposing a specific ID type (UUID/int) at the high level of the ingest task, and would prefer more control over generating those to avoid potential issues (that process needs to be super-stable). Additionally, our current Registry API does not support passing a DatasetRef to the insert method, which only accepts DataIds. What if instead of passing a UUID we pass a special flag to Butler.ingest and the registry methods which tells Registry what kind of ID to generate: unique or deterministic? That still needs an API change, but I think it's a smaller change and we can provide a reasonable default to keep the current behavior.
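Such a flag might look like the sketch below. The enum name, the namespace UUID, and the helper are illustrative assumptions; the eventual daf_butler API may well differ:

```python
import enum
import uuid

# Hypothetical fixed namespace for deterministic IDs.
NS_DATASET = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/butler/dataset")


class DatasetIdGenMode(enum.Enum):
    UNIQUE = enum.auto()         # random UUID4, the default
    DETERMINISTIC = enum.auto()  # UUID5 from identifying info, e.g. for raws


def make_dataset_id(mode, dataset_type, data_id, run):
    """Registry-side ID generation dispatched on the requested mode."""
    if mode is DatasetIdGenMode.UNIQUE:
        return uuid.uuid4()
    # Deterministic: hash dataset type, run, and a sorted dataId.
    parts = [dataset_type, run] + [f"{k}={v}" for k, v in sorted(data_id.items())]
    return uuid.uuid5(NS_DATASET, ";".join(parts))
```

The caller (Butler.ingest, butler import) then only chooses a mode, and Registry keeps full control over how the IDs are actually produced.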

Tim Jenness added a comment -

How are we handling butler import then? Surely for butler import we want to reuse the UUID that is found in the exported yaml file. Am I misunderstanding what you are saying above about this not working?

butler.ingest takes FileDataset as argument which has DatasetRef in it so I was expecting something to populate the id in those before the insert happened. Whether that's done as part of ingest code in obs_base or some special code in daf_butler (maybe based entirely on the dataId of the thing being ingested – modulo some thought about whether collection name is involved) I hadn't thought about.

I had always assumed that because import required us to force the ID onto the database and ingest needed to do that, then the insert code would have to take whatever UUID it was told.

Andy Salnikov added a comment -

I have not looked at import closely yet so I don't have a complete picture for that part; I'm looking at the ingest side now and there I think we can avoid exposing UUIDs or whatever other IDs we could have. In principle import does not have to use exactly the same API as ingest, but I can see that it may be beneficial to have fewer registry methods to deal with. Let me look at import and we can talk tomorrow about strategy.

Andy Salnikov added a comment -

Some info and thoughts on ingest/import.

• Ingest is a call to Butler.ingest() while import goes through YamlRepoImportBackend.load()
• Both of those call Registry.insertDatasets() which takes a bunch of DataId instances and does not provide a way to specify dataset_id, and it returns a list of DatasetRef
• registry.insertDatasets() delegates actual work to DatasetRecordStorage.insert() which today uses auto-increment PK to generate dataset IDs for returned DatasetRefs

In the UUID future we want to produce deterministic IDs for a subset of ingested data and unique/random IDs for everything else; the idea is that on import non-unique IDs will be skipped if they are already in the datasets table. OTOH we probably do not want high-level ingest code to generate those IDs; it would be better to just pass some flag to Registry.insertDatasets() and Butler.ingest() that tells the datasets manager to generate UUID5 for those datasets and UUID4 for all others.
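The UUID5-for-deterministic / UUID4-for-unique split can be sketched with the standard library alone. Everything here is illustrative: the namespace constant, the key format, and the function name are assumptions for the sketch, not the scheme daf_butler actually uses.

```python
import uuid

# Hypothetical namespace; a real implementation would fix its own constant
# once and never change it, since the generated IDs depend on it.
NS_DATASET = uuid.UUID("7d3a6a42-0000-4000-8000-000000000000")

def make_dataset_id(dataset_type: str, data_id: dict, run: str,
                    deterministic: bool) -> uuid.UUID:
    """Return a deterministic UUID5 for 'stable' datasets (e.g. raws),
    or a random UUID4 otherwise."""
    if not deterministic:
        return uuid.uuid4()
    # Canonical key built from dataset type, run, and sorted dataId entries,
    # so the same dataset always hashes to the same UUID.
    key = ";".join([dataset_type, run] +
                   [f"{k}={v}" for k, v in sorted(data_id.items())])
    return uuid.uuid5(NS_DATASET, key)
```

Because the key is built from sorted dataId entries, two repos ingesting the same raw independently compute the same ID, which is what makes import-time de-duplication possible.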

The import picture is a bit more complicated if we want to support imports across different ID types, e.g. import from auto-increment storage into UUID storage. Some possible cases that I could think of:

• Import from UUID into UUID - this is the simplest case; we should be able to just ignore matching UUIDs (assuming that they are deterministic). This means that we need to be able to pass dataset IDs into the registry.insertDatasets() method or a method that is going to replace it.
• Import from auto-increment into UUID - this should be possible if we know which datasets need deterministic IDs and which need random ones. That could be done for example based on dataset type name and/or run collection name, and this should belong to the configuration of the import tool. In this case there is no need to pass existing IDs to registry.insertDatasets(); it should be sufficient to use the same sort of flag that is used in ingest.
• Import from UUID into auto-increment - the question is mostly whether we need to support deterministic IDs in auto-increment storage, or whether we want to deprecate it and ignore all problems associated with import into auto-increment storage.

One related issue - my current guess is that the use case for deterministic IDs only matters for importing from one repo into another, but within a single repo we should not try to ingest the same dataset twice. If we do want to support that, then what should happen if the same dataset is ingested twice - should registry.insertDatasets() just return the existing DatasetRefs or raise an error? And if it returns DatasetRefs, is datastore going to ingest the data?
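To make the duplicate-insert question concrete, here is a minimal in-memory sketch of the "return the existing ref" behavior versus raising. The class and method names are hypothetical; only the conflict semantics are the point.

```python
import uuid
from typing import Dict, Tuple

class MiniDatasetStore:
    """Toy stand-in for the datasets table, keyed by dataset ID."""

    def __init__(self) -> None:
        self._rows: Dict[uuid.UUID, Tuple[str, str]] = {}

    def insert(self, dataset_id: uuid.UUID, dataset_type: str, run: str,
               on_conflict_return: bool = True) -> Tuple[uuid.UUID, bool]:
        """Insert a row; on a duplicate ID either return the existing row
        (import-style idempotency) or raise (strict single-repo ingest).

        Returns the ID and a flag saying whether a new row was inserted.
        """
        if dataset_id in self._rows:
            if on_conflict_return:
                return dataset_id, False  # existing row, nothing inserted
            raise ValueError(f"dataset {dataset_id} already exists")
        self._rows[dataset_id] = (dataset_type, run)
        return dataset_id, True
```

The second question in the comment above (does datastore re-ingest the file?) is exactly why the boolean flag matters: a caller that gets back `False` knows it should skip the datastore step.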

I think it is clear that we need a method like registry.insertDatasets() which also accepts dataset IDs, either in the form of DatasetRefs or maybe just a separate list of IDs. Maybe we should start with abusing the same method but passing two extra parameters to it:

• optional list of dataset IDs which is only populated during import
• a flag which tells it what kind of IDs to generate if the above list is not given, or if it is given but contains incompatible type of ID
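A rough sketch of how those two parameters could interact inside the datasets manager. DatasetIdGenEnum appears in the proposed signatures below in this thread, but its members and the helper function here are guesses for illustration, not the actual daf_butler implementation:

```python
import enum
import uuid
from typing import List, Optional

class DatasetIdGenEnum(enum.Enum):
    UNIQUE = "unique"            # random UUID4 per dataset
    DATAID_TYPE_RUN = "dataid"   # deterministic UUID5 from dataset identity

def resolve_ids(keys: List[str],
                dataset_ids: Optional[List[uuid.UUID]] = None,
                mode: DatasetIdGenEnum = DatasetIdGenEnum.UNIQUE,
                ) -> List[uuid.UUID]:
    """Use caller-provided IDs when given (the import path), otherwise
    generate IDs according to the requested mode (the ingest path)."""
    if dataset_ids is not None:
        # Import: trust the IDs from the exported repo.
        return list(dataset_ids)
    if mode is DatasetIdGenEnum.UNIQUE:
        return [uuid.uuid4() for _ in keys]
    # Deterministic: the same key always maps to the same UUID5.
    return [uuid.uuid5(uuid.NAMESPACE_URL, key) for key in keys]
```

The precedence is the key design point: an explicit ID list wins, and the flag only governs generation when no IDs (or incompatible IDs) were supplied.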
Andy Salnikov added a comment -

I am currently looking at two options for re-defining registry.insertDatasets() interface, any input would be welcome:

First option is to keep existing parameters but to add two extra parameters:

    # DatasetId = Union[int, uuid.UUID]

    def insertDatasets(self,
                       datasetType: Union[DatasetType, str],
                       dataIds: Iterable[DataId],
                       run: Optional[str] = None,
                       datasetIds: Optional[Iterable[DatasetId]] = None,
                       idGenerationMode: DatasetIdGenEnum = DatasetIdGenEnum.UNIQUE,
                       ) -> List[DatasetRef]:
        """Insert one or more datasets into the Registry.

        This always adds new datasets; to associate existing datasets with
        a new collection, use associate.

        Parameters
        ----------
        datasetType : DatasetType or str
            A DatasetType or the name of one.
        dataIds : ~collections.abc.Iterable of dict or DataCoordinate
            Dimension-based identifiers for the new datasets.
        run : str, optional
            The name of the run that produced the datasets. Defaults to
            self.defaults.run.
        datasetIds : iterable of DatasetId, optional
            If provided, has to contain a dataset ID for each inserted
            dataset. All dataset IDs must have the same type (int or
            uuid.UUID); if the type of the dataset IDs does not match the
            configured backend then the IDs will be ignored and new IDs
            will be generated by the backend.
        idGenerationMode : DatasetIdGenEnum, optional
            Specifies the option for generating dataset IDs when IDs are
            not provided or their type does not match the backend type.
            By default unique IDs are generated for each inserted dataset.

        Returns
        -------
        refs : list of DatasetRef
            Resolved DatasetRef instances for all given data IDs (in the
            same order).

        Notes
        -----
        If datasetIds includes an ID which already exists in the database
        then it will not be inserted or updated, but a resolved DatasetRef
        will be returned for it in any case.
        """

Second option is to wrap dataId, datasetType, and run into DatasetRef:

    def insertDatasets(self,
                       datasets: Iterable[DatasetRef],
                       idGenerationMode: DatasetIdGenEnum = DatasetIdGenEnum.UNIQUE,
                       ) -> List[DatasetRef]:
        """Insert one or more datasets into the Registry.

        This always adds new datasets; to associate existing datasets with
        a new collection, use associate.

        Parameters
        ----------
        datasets : ~collections.abc.Iterable of DatasetRef
            Datasets to be inserted. All DatasetRef instances must have
            identical datasetType and run attributes. The run attribute
            can be None and defaults to self.defaults.run. Datasets can
            specify an id attribute which will be used for inserted
            datasets. All dataset IDs must have the same type (int or
            uuid.UUID); if the type of the dataset IDs does not match the
            configured backend then the IDs will be ignored and new IDs
            will be generated by the backend.
        idGenerationMode : DatasetIdGenEnum, optional
            Specifies the option for generating dataset IDs when IDs are
            not provided or their type does not match the backend type.
            By default unique IDs are generated for each inserted dataset.

        Returns
        -------
        refs : list of DatasetRef
            Resolved DatasetRef instances for all given data IDs (in the
            same order).

        Notes
        -----
        If any of the datasets has an ID which already exists in the
        database then it will not be inserted or updated, but a resolved
        DatasetRef will be returned for it in any case.
        """

I sort of like second option better as it is less verbose but it also has some drawbacks:

• duplication of datasetType and run in every DatasetRef (but it already happens at import level so should not be a big concern)
• does not allow a simple dict where the dataId version allowed one, so lots of tests need to be updated (though the DatasetRef constructor should probably take care of it)
• DatasetRef currently does not allow non-None run with None id and vice versa, we could change that requirement but I don't know what the implications are

Any preferences, or should we combine both of these into one method or two separate methods?

Tim Jenness added a comment -

I'm definitely edging towards option two. It makes the API simpler and symmetric – you give it some DatasetRef and it returns a DatasetRef for each one given, possibly with an updated id field, possibly the same thing it was given.

Jim Bosch added a comment - - edited

I agree the second option is better as well, though it's pushing us even farther from my dream of only having resolved DatasetRefs

But I'd prefer that the run still come only from a separate argument, and just check that any DatasetRef.run that isn't None is consistent with that. We already know we don't need to support different run values in a single call, and I do think it really helps to limit the number of states DatasetRef can be in.

On that note, I would love for the DatasetRef constructor to reject simple dict data IDs (or require that it be allowed to conform them to DataCoordinate), and it almost does. But the problem with going further is not just unit tests - it's the lookup on storageClass/datasetType/datasetRef logic in (mostly?) Datastores, which use DatasetRef objects with just an instrument value for their data ID but a dataset type with more dimensions. I don't know how to solve that problem other than to make the object used there not a real DatasetRef.

Andy Salnikov added a comment -

Jim Bosch, I think it is ready for review. After debating it with myself I decided to keep insertDatasets() as it is but add importDatasets() as a separate method; I believe this reduces complexity, though at the price of some code duplication. I added unit tests for the new manager class and also switched the default configuration to use UUIDs. Ingest code needs to be modified to use the new deterministic UUIDs; I think that deserves a new ticket.

Tim Jenness, I know you are going to look at it anyway, but could you look more closely at the YAML and JSON serialization? I'm not sure I got that perfectly right or how it should be structured.

Jenkins is still running, but I have tested it already with ci_hsc_gen3 with both SQLite and Postgres.

PR: https://github.com/lsst/daf_butler/pull/500
 Reviewers Jim Bosch [ jbosch ] Status In Progress [ 3 ] In Review [ 10004 ]
Tim Jenness added a comment -

 Link This issue is triggering DM-29593 [ DM-29593 ]
Hide
Jim Bosch added a comment -

Most of my comments are minor, but I've got two bigger ones:

• I think the documentation lies when it claims that duplicate inserts of datasets with the same (deterministic) UUIDs are ignored, and fixing that is pretty hard, so we should just change the docs.  Good news is that the fact that this is broken also means that we'll at least know (via a conflict->rollback) if one of those super-rare random UUID collisions does happen.
• I'd like to see integer dataset_id implementation raise more when it can't do things only the UUID implementation can do, instead of silently doing something else.  Having gotten all of the way through the review, I think the simplest way to "resolve" this for importDatasets may be to make that a private method, for use only by Butler.import and its helper classes, thus limiting the surprising behavior to callers that can be more reasonably expected to read the docs carefully and understand what they're getting.  But I'm hoping we can just raise when the integer dataset_id implementation is asked to generate deterministic IDs without causing too much trouble downstream.

More details on both of these in various PR comments.

 Status In Review [ 10004 ] Reviewed [ 10101 ]
Andy Salnikov added a comment - - edited

I added ci_hsc_gen3 PR (https://github.com/lsst/ci_hsc_gen3/pull/48) which provides default value for --butler-config SCons option pointing to a registry.yaml with UUID manager.

And I dropped UUID as default from a default config.

Andy Salnikov added a comment - - edited

Jim Bosch, could you look at the new commit which adds support for re-use of integer dataset IDs while I'm working on rebasing the whole branch? There is not much code, but I had to pass parameters from the highest level to the very bottom so a bunch of files have been touched.

Andy Salnikov added a comment -

Preparing a rebase I'm trying to see how to make pydantic handle UUID. Something is very odd, not sure how it can behave that way:

    >>> from pydantic import *
    >>> from uuid import *
    >>> from typing import *
    >>> class SerializedDatasetRef(BaseModel):
    ...     id: Optional[Union[int, UUID]] = None
    ...     run: str = ""
    ...

    # this looks good
    >>> SerializedDatasetRef(id=1, run="RUN")
    SerializedDatasetRef(id=1, run='RUN')

    # this is odd
    >>> SerializedDatasetRef(id=uuid4(), run="RUN")
    SerializedDatasetRef(id=167854762555046396793552098261208053632, run='RUN')
    >>> SerializedDatasetRef(id=uuid4(), run="RUN").json()
    '{"id": 180405824369238067234879699192746726284, "run": "RUN"}'

    # this works but WTH
    >>> SerializedDatasetRef(id=str(uuid4()), run="RUN")
    SerializedDatasetRef(id=UUID('ffa1f977-0d31-44a2-b281-17d0a30151d0'), run='RUN')

It looks like it decides for some reason to convert UUID to int
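A plausible explanation, checkable with the standard library alone: uuid.UUID defines __int__, so a Union validator that tries the int member first (int comes first in Union[int, UUID]) can legally convert a UUID to its 128-bit integer form. Whether pydantic's int validator literally calls int() is an implementation detail of pydantic v1, but the conversion itself exists and is lossless:

```python
import uuid

# uuid.UUID supports int() conversion, which is what lets a validator
# that tries int before UUID swallow a UUID value silently.
u = uuid.uuid4()
as_int = int(u)                       # the full 128-bit value
assert as_int == u.int

# The round-trip is lossless, which is also how one could recover a
# UUID from an integer that leaked through serialization:
assert uuid.UUID(int=as_int) == u
```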

Andy Salnikov added a comment - - edited

Looks like there are about 10 bug reports filed for this "feature", e.g. https://github.com/samuelcolvin/pydantic/issues/2135.

Tim Jenness added a comment -

Does StrictUnion help?

Andy Salnikov added a comment -

I don't see StrictUnion in our pydantic version (which is 1.8.1); it could appear in 1.9. It would probably help, but the current workaround is to re-order the types in the Union: Union[uuid.UUID, int]. I'm still testing, but it looks like this fixes this particular issue.

Andy Salnikov added a comment -

Thanks for all comments, answers and reviews! Finally merged both daf_butler and ci_hsc_gen3. Jenkins is happy and I did lots of tests of ci_hsc_gen3 with sqlite/postgres and both UUIDs and integers.

 Resolution Done [ 10000 ] Status Reviewed [ 10101 ] Done [ 10002 ]
 Link This issue relates to DM-29765 [ DM-29765 ]

#### People

Assignee:
Andy Salnikov
Reporter:
Tim Jenness
Reviewers:
Jim Bosch
Watchers:
Andy Salnikov, Brian Yanny, Jim Bosch, Michelle Gower, Tim Jenness