# Tasks that create Parquet tables should fail on missing/all-NaN columns.

XMLWordPrintable

#### Details

• Type: Story
• Status: To Do
• Resolution: Unresolved
• Fix Version/s: None
• Component/s:
• Labels:
• Team:
Data Release Production
• Urgent?:
No

#### Description

While debugging a ci_hsc failure on DM-35060, I found that one of the plotting tasks in analysis_drp was failing due to lack of valid data. This suggests that something should have failed much earlier (likely in the consolidation tasks), as one of our "implied required" columns (slot.psfShape) being all zeros/NaNs should fail early, instead of trickling down to unclear plotting failures. This is related to RFC-808, that defined sentinel values to fill empty columns with, but didn't define what to do when all such values are "sentineled out".

Eli Rykoff suggested at least making ci_hsc's test_validate_outputs.py read in the parquet source and object tables and check for all-NaN columns, which might help catch changes to the default DRP pipeline that trigger this problem, but wouldn't catch other kinds of failures in production.

Slack thread for the discussion that prompted this: https://lsstc.slack.com/archives/C2JPMCF5X/p1657329375041579

#### Attachments

1. ci_hsc_build.log.txt
3.53 MB

#### Activity

Hide
John Parejko added a comment -

I've attached the ci_hsc build log from a failed Jenkins run that demonstrates this bug. If there is useful information there earlier than the first plot_e2PSF_scatter_visit ERROR log, I couldn't find it.

Show
John Parejko added a comment - I've attached the ci_hsc build log from a failed Jenkins run that demonstrates this bug. If there is useful information there earlier than the first plot_e2PSF_scatter_visit ERROR log, I couldn't find it.
Hide
Lauren MacArthur added a comment -

Huh, I have just gotten back to working on DM-30927 and note that, as per Eli Rykoff's suggestion/request, am deliberately setting entire columns to NaN where there was a WCS failure (see ticket for details).  These get set on source catalogs at the detector level in the astrometry fit during calibrate and, of course will get propagated to the parquet tables. One of the trickiest part of this ticket is in making sure all downstream tasks know how to handle this for their purposes, but this is a deliberate setting-to-NaN. I'm not sure this would be compatible with a test failing on any given column having all NaNs, although perhaps this something you (John Parejko) were trying to accommodate by referring to "our "implied required" columns"? I'm not sure what this list comprises, but I would worry that coord_ra [ra] & coord_dec [decl] would seem a natural addition to that list.

Show
Lauren MacArthur added a comment - Huh, I have just gotten back to working on DM-30927 and note that, as per Eli Rykoff 's suggestion/request, am deliberately setting entire columns to NaN where there was a WCS failure (see ticket for details).  These get set on source catalogs at the detector level in the astrometry fit during  calibrate and, of course will get propagated to the parquet tables. One of the trickiest part of this ticket is in making sure all downstream tasks know how to handle this for their purposes, but this is a deliberate setting-to-NaN. I'm not sure this would be compatible with a test failing on any given column having all NaNs, although perhaps this something you ( John Parejko ) were trying to accommodate by referring to "our "implied required" columns"? I'm not sure what this list comprises, but I would worry that coord_ra [ra] & coord_dec [decl] would seem a natural addition to that list.
Hide
John Parejko added a comment -

In the case that triggered this, the slot field was mis-assigned:

 >>> src['slot_PsfShape'] KeyError: "Field with name 'ext_shapeHSM_HsmSourceMoments' not found" 

Thus, functors.py:493 or functors.py:347 triggers, completely squashing the exception and assigning all values to NaN. That seems like a terrible failure mode: we should be catching a specific list of exceptions there, and raising the rest; KeyError should just cause SourceTableTask to fail, not leave us with hidden bad data.

This is DM-14310 and DM-16537 all over again. Catching all exceptions and dropping the exception message is dangerous for exactly this reason. I'll see if I can come up with a small test to demonstrate it.

Show
John Parejko added a comment - In the case that triggered this, the slot field was mis-assigned: >>> src['slot_PsfShape'] KeyError: "Field with name 'ext_shapeHSM_HsmSourceMoments' not found" Thus, functors.py:493 or functors.py:347 triggers, completely squashing the exception and assigning all values to NaN. That seems like a terrible failure mode: we should be catching a specific list of exceptions there, and raising the rest; KeyError should just cause SourceTableTask to fail, not leave us with hidden bad data. This is DM-14310 and DM-16537 all over again. Catching all exceptions and dropping the exception message is dangerous for exactly this reason. I'll see if I can come up with a small test to demonstrate it.
Hide
John Parejko added a comment -

I've added additional logging on DM-35060, which helped me track down my problem (I'd mis-configured piff), which in the end was not related to parquet translation (but I didn't know that until I'd put in said logging). That partially alleviates the problem condition here (errors in parquet translation disappear), but I think it's worth considering where we should actually raise on such errors. Eric filed DM-35645 after we talked about this at standup (so AP can bail on catastrophically bad detectors, however defined).

Show
John Parejko added a comment - I've added additional logging on DM-35060 , which helped me track down my problem (I'd mis-configured piff), which in the end was not related to parquet translation (but I didn't know that until I'd put in said logging). That partially alleviates the problem condition here (errors in parquet translation disappear), but I think it's worth considering where we should actually raise on such errors. Eric filed DM-35645 after we talked about this at standup (so AP can bail on catastrophically bad detectors, however defined).

#### People

Assignee:
Unassigned
Reporter:
John Parejko
Watchers:
Colin Slater, Eli Rykoff, John Parejko, Lauren MacArthur, Nate Lust, Sophie Reed, Yusra AlSayyad