XMLWordPrintable

#### Details

• Type: Story
• Status: To Do
• Resolution: Unresolved
• Fix Version/s: None
• Component/s:
• Labels:

#### Description

As part of my attempts at speeding up reading data from all visits in a tract, I've releasing the gil in afw.catalog.readFits and use a ThreadPool to try reading multiple files at once (as opposed to a ProcessPool, which requires pickle/unpickle and our Catalog pickle code is slow and probably difficult to speed up). This causes segfaults. Krzysztof Findeisen and I spent a couple of hours yesterday digging into it, but didn't find an obvious source of static class members or global variables, beyond the daf.base.Citizen object, but none of the tracebacks I've seen reference Citizen that I can tell.

I'll add an sample test script to this ticket, and push a branch, if anyone else wants to poke around. Debugging concurrency issues is "fun".

I've tagged a bunch of the people with C++ expertise that I could think of: if you're uninterested, please remove yourself; if you think someone else might have expertise to add, please add them!

#### Attachments

3 kB
2. ptest_run_2019-04-09.txt
8 kB
3. ptest_standalone.tar.bz2
5 kB

#### Activity

Hide
Krzysztof Findeisen added a comment -

For the record, we also didn't find any objects being explicitly shared among tasks/threads by the top-level script.

Show
Krzysztof Findeisen added a comment - For the record, we also didn't find any objects being explicitly shared among tasks/threads by the top-level script.
Hide
John Parejko added a comment - - edited

I've pushed a cfitsio branch that adds the --enable-reentrant flag, and an afw branch that adds gil_scoped_release to the pybind11 readFits, plus an attempt at removing one possible source of shared state (a default InputArchive pointer).

Show
John Parejko added a comment - - edited I've pushed a cfitsio branch that adds the --enable-reentrant flag , and an afw branch that adds gil_scoped_release to the pybind11 readFits , plus an attempt at removing one possible source of shared state (a default InputArchive pointer).
Hide
Gabor Kovacs [X] (Inactive) added a comment -

Ticket branch got rebased which solved the following compile error in afw:

[2019-04-02T22:08:38.806673Z] scons: *** [src/cameraGeom/Orientation.os] Error 1
[2019-04-02T22:08:39.001822Z] In file included from include/lsst/afw/image/ImageBase.h:40:0,
[2019-04-02T22:08:39.001845Z] from include/lsst/afw/image/Image.h:42,
[2019-04-02T22:08:39.001849Z] from include/lsst/afw/detection/Footprint.h:33,
[2019-04-02T22:08:39.001851Z] from src/detection/Footprint.cc:24:
[2019-04-02T22:08:39.001853Z] include/lsst/afw/image/lsstGil.h:119:42: error: macro "GIL_DEFINE_BASE_TYPEDEFS" passed 3 arguments, but takes just 2
[2019-04-02T22:08:39.001855Z] GIL_DEFINE_BASE_TYPEDEFS(64, bits64, gray)

Show
Gabor Kovacs [X] (Inactive) added a comment - Ticket branch got rebased which solved the following compile error in afw: [2019-04-02T22:08:38.806673Z] scons: *** [src/cameraGeom/Orientation.os] Error 1 [2019-04-02T22:08:39.001822Z] In file included from include/lsst/afw/image/ImageBase.h:40:0, [2019-04-02T22:08:39.001845Z] from include/lsst/afw/image/Image.h:42, [2019-04-02T22:08:39.001847Z] from include/lsst/afw/image/MaskedImage.h:43, [2019-04-02T22:08:39.001849Z] from include/lsst/afw/detection/Footprint.h:33, [2019-04-02T22:08:39.001851Z] from src/detection/Footprint.cc:24: [2019-04-02T22:08:39.001853Z] include/lsst/afw/image/lsstGil.h:119:42: error: macro "GIL_DEFINE_BASE_TYPEDEFS" passed 3 arguments, but takes just 2 [2019-04-02T22:08:39.001855Z] GIL_DEFINE_BASE_TYPEDEFS(64, bits64, gray)
Hide
Gabor Kovacs [X] (Inactive) added a comment - - edited

Investigating this issue, a short C++ test program was created ptest. It is available in diffimTests. This is scons compilable in a properly set up lsstsw stack (up to afw). It has a list of hard wired fits file names from package validation_data_hsc, update them as necessary. My main conclusion is that the problem lies directly in the compiled codes and so far we can assume that it has no connection to the Python bindings at all.

Debugging notes:

• According to John Parejko, static variables (a global counter) is implemented in daf_base::Citizen only.
• daf_base::Citizen seems to be thread safe, implementing its own low level (pre- C++14) locking for accessing its static members.
-pthread is generally required for all compilation to generate thread safe version of standard library calls (memory allocation?). This may include all compiled code afw SourceCatalog uses. It is platform specific in general, on x86 it is said to be equivalent of "-D_REENTRANT" and "-lpthread".
• I’ve built the whole lsst stack up to afw (and daf_butler, obs_subaru) with “-Og” and “-pthread” options and cfitsio from tickets/DM-18695 by John Parejko. This ensures that cfitsio has the correct compilation flags. Cfitsio supposed to be thread safe at least for reading different files. Also daf_base was added a buildOpts.py, but see below for building notes.
• Non-multithread compatible code could still potentially be built in non-scons configured targets (can think of boost?, gsl? or other libs).
• Of course, if any code accesses shared memory it needs more than a -pthread flag at compilation.
• As I understand, std::shared_ptr is thread safe in the sense that its own internal counter manipulations are atomic. This means nothing in respect of accessing the referenced object. Is this dependent on the -pthread flag? Maybe.

Building notes:

• scons, in general, ignores environment variables on purpose. Use command line options or set variables in SConstruct.
• sconsUtils defines the “opt” and “archflags” scons variables that will affect CPPFLAGS. Modifying CPPFLAGS in the SConstruct file works but will result in duplicate flags.
• sconsUtils reads the buildOpts.py file for command line options.
• eupspkg build however picks up environment variables and passes on to scons as cmd-line arguments. Therefore “opt=g” in buildOpts.py is disrespected if using rebuild. Use EUPSPKG_SCONSFLAGS="opt=g archflags=-pthread" for rebuild.
• Also there are extra command line options e.g. --verbose; see lsst/sconsUtils/state/_initOptions(). These options are not listed for -h, -H.
Show
Gabor Kovacs [X] (Inactive) added a comment - - edited Investigating this issue, a short C++ test program was created ptest. It is available in diffimTests . This is scons compilable in a properly set up lsstsw stack (up to afw). It has a list of hard wired fits file names from package validation_data_hsc, update them as necessary. My main conclusion is that the problem lies directly in the compiled codes and so far we can assume that it has no connection to the Python bindings at all. Debugging notes: According to John Parejko , static variables (a global counter) is implemented in daf_base::Citizen only . daf_base::Citizen seems to be thread safe, implementing its own low level (pre- C++14) locking for accessing its static members. -pthread is generally required for all compilation to generate thread safe version of standard library calls (memory allocation?). This may include all compiled code afw SourceCatalog uses. It is platform specific in general, on x86 it is said to be equivalent of "-D_REENTRANT" and "-lpthread". I’ve built the whole lsst stack up to afw (and daf_butler, obs_subaru) with “-Og” and “-pthread” options and cfitsio from tickets/ DM-18695 by John Parejko . This ensures that cfitsio has the correct compilation flags. Cfitsio supposed to be thread safe at least for reading different files. Also daf_base was added a buildOpts.py, but see below for building notes. Non-multithread compatible code could still potentially be built in non-scons configured targets (can think of boost?, gsl? or other libs). Of course, if any code accesses shared memory it needs more than a -pthread flag at compilation. As I understand, std::shared_ptr is thread safe in the sense that its own internal counter manipulations are atomic. This means nothing in respect of accessing the referenced object. Is this dependent on the -pthread flag? Maybe. Building notes: scons, in general, ignores environment variables on purpose. Use command line options or set variables in SConstruct. sconsUtils defines the “opt” and “archflags” scons variables that will affect CPPFLAGS . Modifying CPPFLAGS in the SConstruct file works but will result in duplicate flags. sconsUtils reads the buildOpts.py file for command line options. eupspkg build however picks up environment variables and passes on to scons as cmd-line arguments. Therefore “opt=g” in buildOpts.py is disrespected if using rebuild. Use EUPSPKG_SCONSFLAGS="opt=g archflags=-pthread" for rebuild. Also there are extra command line options e.g. --verbose; see lsst/sconsUtils/state/_initOptions() . These options are not listed for -h, -H.
Hide
Paul Price added a comment -

Because debugging concurrency issues is extremely painful, I would encourage you to advance on alternative fronts at the same time. I think you should adopt an internal data format that is slimmer than a full Catalog (containing just the columns you need, and the subset of sources that you need); that would pickle much faster. Moreover, it would give you a product that you could persist and read in whenever you want to short-circuit the parallel read altogether, greatly speeding up the debug loop.

Show
Paul Price added a comment - Because debugging concurrency issues is extremely painful, I would encourage you to advance on alternative fronts at the same time. I think you should adopt an internal data format that is slimmer than a full Catalog (containing just the columns you need, and the subset of sources that you need); that would pickle much faster. Moreover, it would give you a product that you could persist and read in whenever you want to short-circuit the parallel read altogether, greatly speeding up the debug loop.
Hide
John Parejko added a comment -

Gabor Kovacs [X]: thank you for sharing your notes.

Paul Price: the purpose of this is not just for jointcal, but for any action that deals with multiple sensors and/or visits at a time, e.g. validate_drp or any full-focalplane aggregate plots. Those each can require different components of the catalogs, so there is not one "bundled" format that would work for all of the above.

Dumping all of the catalogs to a format like Parquet might enable parallel reads, but I have not experimented with that option.

Show
John Parejko added a comment - Gabor Kovacs [X] : thank you for sharing your notes. Paul Price : the purpose of this is not just for jointcal, but for any action that deals with multiple sensors and/or visits at a time, e.g. validate_drp or any full-focalplane aggregate plots. Those each can require different components of the catalogs, so there is not one "bundled" format that would work for all of the above. Dumping all of the catalogs to a format like Parquet might enable parallel reads, but I have not experimented with that option.
Hide
Jim Bosch added a comment -

I doubt Citizen is in play here, but if you want to rule it out, feel free to remove it from any code you see (and merge that to master).  There's no reason we have to do it all at once, and it's of course much harder to remove everywhere than it is to remove it from a handful of classes.

It looks from the log you posted like the segfault happens after all the files have been successfully read.  If so, do you know whether it happens at process close, or earlier?  If it's at process close, that raises the possibility of a whole new class of problems, and it actually reminds me that Bob Armstrong tried to build some multithreaded code against our codebase quite a while ago and saw segfaults at process close.  If this is the scenario we're in, I think the right tack is to just trying to reduce what's in the minimal test case (especially in how much stack code it depends on) until it goes away.

Show
Jim Bosch added a comment - I doubt Citizen is in play here, but if you want to rule it out, feel free to remove it from any code you see (and merge that to master).  There's no reason we have to do it all at once, and it's of course much harder to remove everywhere than it is to remove it from a handful of classes. It looks from the log you posted like the segfault happens after all the files have been successfully read.  If so, do you know whether it happens at process close, or earlier?  If it's at process close, that raises the possibility of a whole new class of problems, and it actually reminds me that Bob Armstrong tried to build some multithreaded code against our codebase quite a while ago and saw segfaults at process close.  If this is the scenario we're in, I think the right tack is to just trying to reduce what's in the minimal test case (especially in how much stack code it depends on) until it goes away.
Hide
Gabor Kovacs [X] (Inactive) added a comment - - edited

Moved out ptest to DM-19212. There is a standalone dummy package version attached here and the branch afw/tickets/DM-19212 supposed to build automatically under afw/examples/threading.

Show
Gabor Kovacs [X] (Inactive) added a comment - - edited Moved out ptest to DM-19212 . There is a standalone dummy package version attached here and the branch afw/tickets/ DM-19212 supposed to build automatically under afw/examples/threading.

#### People

Assignee:
Unassigned
Reporter:
John Parejko
Watchers:
Christopher Waters, Colin Slater, Fritz Mueller, Gabor Kovacs [X] (Inactive), Jim Bosch, John Parejko, John Swinbank, Kian-Tat Lim, Krzysztof Findeisen, Paul Price, Yusra AlSayyad