Details
Type: Story
Status: Done
Resolution: Done
Fix Version/s: None
Component/s: None
Labels: None
Story Points: 1
Epic Link:
Sprint: AP S18-4
Team: Alert Production
Description
While processing some COSMOS data on lsst-dev, it became necessary to run the data sequentially with the multibandDriver to diagnose a problem. This can be done with the --batch-type=None option.
Most of the processing proceeded fine; however, I noticed that OpenMP was not being disabled. This was most evident when the deblender was using several hundred percent CPU. Occasionally a processing job would seem to get stuck (it would not get past the deblending stage, yet was still using 100% CPU) even when left for 24 hours. I attached to the process with gdb and found that a numpy routine was making OpenBLAS calls. These calls would get into a state where libopenblas, which is compiled into numpy, was issuing calls to sleep a thread, which would then turn over control to the kernel. The kernel would complete the sleep and immediately return control to OpenBLAS, which would then issue another thread-sleep call. This kept the CPU at 100% utilization, but no work ever got done and the loop could not exit.
Killing the process and restarting could get the processing past the particular patch/tract that was stuck, but sometimes a different tract/patch would then get stuck. The behavior was not deterministic, but it happened several times while processing all the patches in a tract.
Some relevant frames from the backtrace while the program was in the sleep loop are below:
#0  0x00007fa6496d0e47 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fa63ddbf365 in exec_blas_async_wait () from /software/lsstsw/stack/Linux64/miniconda2/3.19.0.lsst4/bin/../lib/libopenblas.so
#2  0x00007fa63ddbfa02 in exec_blas () from /software/lsstsw/stack/Linux64/miniconda2/3.19.0.lsst4/bin/../lib/libopenblas.so
#3  0x00007fa63ddbfd9d in blas_level1_thread () from /software/lsstsw/stack/Linux64/miniconda2/3.19.0.lsst4/bin/../lib/libopenblas.so
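The top frame shows why the process burns 100% CPU while making no progress: sched_yield() hands control to the kernel, which immediately reschedules the thread. A bounded Python sketch of that pattern (illustrative only; this is not OpenBLAS code, and the loop here is capped so it terminates):

```python
import os

# Illustration of the yield loop seen in the backtrace. Each
# os.sched_yield() returns control to the kernel, which immediately
# reschedules this thread, so the loop spins at full CPU while doing
# no useful work. In the hung jobs the exit condition never became
# true, so the real loop never terminated; here it is bounded.
spins = 0
while spins < 100_000:  # real hang: this condition never ends the loop
    os.sched_yield()
    spins += 1

print(f"spun {spins} times without doing useful work")
```

A profiler or top shows such a process as fully CPU-bound even though it is effectively deadlocked, which matches the observed symptom.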
An obvious workaround is simply to disable multiple threads even when running only a single process. This seems like the right thing to do anyway, since we might expect contention between multiple users running pipeline jobs, or a single user running multiple pipeline jobs, even if each of those jobs is only a single process. Paul Price, this seems like something you probably know & care about.
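A minimal standalone sketch of that workaround (how the pipeline driver code would actually apply it is not shown here): OpenBLAS and OpenMP read their thread-count environment variables at library load time, so they must be set before numpy is first imported.

```python
import os

# Force single-threaded execution before numpy (and hence the
# OpenBLAS compiled into it) is loaded; these variables are only
# read at library initialization, so setting them later has no
# effect. OPENBLAS_NUM_THREADS covers OpenBLAS builds with their
# own thread pool; OMP_NUM_THREADS covers OpenMP-based builds.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

# A matrix product that a threaded BLAS would normally fan out
# across cores; with the settings above it stays on one thread.
a = np.random.rand(200, 200)
b = a @ a
print(b.shape)
```

Doing this unconditionally in single-process runs would also avoid thread-count contention when several such jobs share a node.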