# queryDatasets produces lots of duplicate outputs

## Details

• Type: Story
• Status: Won't Fix
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points: 2
• Team: Data Release Production

## Description

Running a query like

 list(butler.registry.queryDatasets("calexp", collections=["shared/ci_hsc_output"],  skymap="discrete/ci_hsc", tract=0, patch=70)) 

produces many duplicate records for unclear reasons (they're not from having multiple collections, so deduplicate doesn't help).  Try to fix this, or at least document when duplicate results must be expected.
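
One way to quantify the problem is to count how often each dataset ID appears in the returned list. The sketch below is illustrative only: it assumes each returned reference exposes an `id` attribute, and `duplicate_counts` and the `SimpleNamespace` stand-ins are hypothetical names, not part of daf_butler.

```python
from collections import Counter
from types import SimpleNamespace


def duplicate_counts(refs):
    """Map each id that occurs more than once to its number of occurrences.

    Assumes every element of ``refs`` has an ``id`` attribute.
    """
    counts = Counter(ref.id for ref in refs)
    return {ref_id: n for ref_id, n in counts.items() if n > 1}


# Hypothetical stand-ins for query results; a real check would pass the
# list returned by the queryDatasets call shown above.
results = [SimpleNamespace(id=i) for i in (10, 10, 10, 11, 12, 12)]
print(duplicate_counts(results))  # {10: 3, 12: 2}
```

An empty dictionary would mean the query returned no repeated records.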

## Activity

Jim Bosch created issue -
Tim Jenness added a comment -

Since we know that all the DatasetRefs being returned must be coming from the same butler repository, the ref.id of each one must be unique. A short-term fix is therefore to deduplicate the results by checking for duplicated ref.id values; that's a couple of lines of code using a dict.

Is it worth doing this quickly as a separate ticket and reserving this ticket for understanding why the query itself is producing duplicates?
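
The dict-based workaround suggested in this comment could look roughly like the sketch below. It is not the change that was eventually made in daf_butler; `unique_by_id` and `FakeRef` are hypothetical names used for illustration, and the only assumption about the real references is that they carry a hashable `id` attribute.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


def unique_by_id(refs: Iterable[Any]) -> List[Any]:
    """Drop references whose ``id`` was already seen, keeping first-seen order."""
    seen = {}
    for ref in refs:
        # setdefault stores the ref only the first time its id appears.
        seen.setdefault(ref.id, ref)
    return list(seen.values())


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for a DatasetRef; only ``id`` matters here."""
    id: int
    label: str


refs = [FakeRef(1, "a"), FakeRef(2, "b"), FakeRef(1, "a"), FakeRef(3, "c")]
print([ref.id for ref in unique_by_id(refs)])  # [1, 2, 3]
```

Because Python dicts preserve insertion order, the result keeps the order in which each ID was first returned. The same helper could wrap the result of the `queryDatasets` call from the description, at the cost of holding all unique references in memory.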

Jim Bosch added a comment -

> Is it worth doing this quickly as a separate ticket and reserving this ticket for understanding why the query itself is producing duplicates?

That sounds reasonable, though I may still not get to it super quickly. Others are welcome (Arun Kannawadi expressed interest in another daf_butler issue that turns out to be trickier than this one; maybe he'd be interested?)

Arun Kannawadi added a comment -

Yes, I can work on it to remove the duplicates for now. Should I create another ticket? And is this a Gen3 butler?

Tim Jenness added a comment -

This is gen3. Yes, I think a new ticket.

 Link This issue is duplicated by DM-21448 [ DM-21448 ]
 Link This issue duplicates DM-21448 [ DM-21448 ]
 Link This issue relates to DM-21448 [ DM-21448 ]
 Link This issue relates to DM-22286 [ DM-22286 ]
Tim Jenness added a comment -

DM-22286 was not closed quickly. Is this ticket still an issue?

Jim Bosch added a comment -

Closing this along with DM-22286 as Won't Fix.  DM-24938 will provide a way to get unique results, but I'm not making it the default.

Resolution set to Done; Status changed from To Do to Won't Fix.

## People

• Assignee: Jim Bosch
• Reporter: Jim Bosch
• Watchers: Arun Kannawadi, Jim Bosch, Tim Jenness