# Remove duplicates from the output of queryDatasets

#### Details

• Type: Story
• Status: Won't Fix
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
• Story Points: 1
• Sprint: DRP F19-6 (Nov)
• Team: Data Release Production

#### Description

This is a child of DM-22178. It provides a quick fix that removes duplicate records from the output of queryDatasets.

#### Activity

Jim Bosch added a comment -

DM-21448 is now in review; that adds hashability to DatasetRef, and hence may make this ticket even easier.
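
Once the objects are hashable, duplicates can be dropped with a set of already-seen items while keeping the original order. The following is only a minimal sketch of that idea: `FakeRef` is a hypothetical stand-in for a hashable DatasetRef, and `unique_in_order` is an illustrative helper, not part of any LSST package.

```python
from dataclasses import dataclass
from typing import Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T", bound=Hashable)


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for a hashable DatasetRef (not the real class)."""

    dataset_type: str
    visit: int


def unique_in_order(items: Iterable[T]) -> Iterator[T]:
    """Yield each item the first time it appears, skipping later repeats."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# Assumed example data: the third entry repeats the first.
refs = [FakeRef("calexp", 1), FakeRef("calexp", 2), FakeRef("calexp", 1)]
deduped = list(unique_in_order(refs))
```

A frozen dataclass gets `__eq__` and `__hash__` generated from its fields, which is what lets the set membership check treat two equal references as the same item.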


Arun Kannawadi added a comment -

Thanks for the update, Jim Bosch. Besides rejecting duplicates, I tried to identify their cause and traced them to the execute method of the Query class. So this issue might well reach beyond DatasetRef, perhaps?
Jim Bosch added a comment -

Yes, definitely; in particular, I'm pretty sure it also affects queryDataIds.  The root problem goes further back - it's in the definition of the Query before we call execute (which pretty much just blindly executes the SQL we've given it).  That code is very complex, and I don't recommend trying to trace the problem all the way back through that, at least not directly.  I think the way to debug it would be to see what kinds of query options (and database content) do and don't yield duplicates, but that's still a pretty big phase space to explore, and it's not something I'd recommend as part of what's supposed to be a quick ticket.
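
When exploring which query options yield duplicates, it may help to have a small check that reports which results repeat and how often. This is a rough sketch under assumptions: `find_duplicates` is a hypothetical helper, and the tuples below are made-up placeholders for hashable query results, not real output.

```python
from collections import Counter


def find_duplicates(results):
    """Return a dict mapping each repeated item to its number of occurrences."""
    counts = Counter(results)
    return {item: n for item, n in counts.items() if n > 1}


# Placeholder data standing in for the results of one query configuration.
sample = [("calexp", 1), ("calexp", 2), ("calexp", 1), ("calexp", 1)]
dupes = find_duplicates(sample)
```

Running the same check over several query configurations and comparing which ones return a non-empty dict is one way to narrow down the options that trigger the problem.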

Tim Jenness added a comment -

Jim Bosch, has this now been fixed by other work?

Jim Bosch added a comment -

DM-24938 won't fix this automatically, but it clarifies that fixing it directly is probably too expensive in general to make the default, and it provides an easy way to get a unique version of the results if desired (queryDatasets(...).subset(unique=True)).


#### People

Assignee:
Reporter:
Watchers:
Arun Kannawadi, Jim Bosch, Tim Jenness