Data Management / DM-12354

Running the deblender with multiple threads livelocks


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      While processing some COSMOS data on lsst-dev it became necessary to run the data sequentially using the multibandDriver to diagnose a problem. This can be done with the --batch-type=None option.

      Most of the processing proceeded fine; however, I noticed that OpenMP was not being disabled. This was most evident when the deblender was using several hundred percent CPU. Occasionally a processing job would seem to get stuck (it would not get past the deblending stage, yet still used 100% CPU) even when left for 24 hours. I attached to the process with gdb and found that a numpy routine was making OpenBLAS calls. These calls got into a state where libopenblas, which is compiled into numpy, was issuing calls to yield the thread (sched_yield), turning control over to the kernel. The kernel would complete the yield and immediately return control to OpenBLAS, which would then issue another yield. This kept the process at 100% CPU, but no work ever got done and the loop could not exit.

      Killing the process and restarting could get the processing past the particular tract/patch that was stuck, but sometimes a different tract/patch would then get stuck. The behavior was not deterministic, but it happened several times while processing all the patches in a tract.

      Some relevant lines from the traceback when the program was in the sleep loop are below:

      #0  0x00007fa6496d0e47 in sched_yield ()
          at ../sysdeps/unix/syscall-template.S:81
      #1  0x00007fa63ddbf365 in exec_blas_async_wait ()
         from /software/lsstsw/stack/Linux64/miniconda2/3.19.0.lsst4/bin/../lib/libopenblas.so
      #2  0x00007fa63ddbfa02 in exec_blas ()
         from /software/lsstsw/stack/Linux64/miniconda2/3.19.0.lsst4/bin/../lib/libopenblas.so
      #3  0x00007fa63ddbfd9d in blas_level1_thread ()
         from /software/lsstsw/stack/Linux64/miniconda2/3.19.0.lsst4/bin/../lib/libopenblas.so
      
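      One way to avoid the threaded OpenBLAS code paths entirely is to cap the thread pools via environment variables before numpy is first imported. This is only a sketch of the workaround discussed below, not the fix that was merged for this ticket; which variables are honored depends on how numpy's BLAS was built:

      ```python
      import os

      # These must be set before numpy (and hence OpenBLAS) is imported,
      # because OpenBLAS reads them once at library initialization.
      os.environ["OPENBLAS_NUM_THREADS"] = "1"  # cap OpenBLAS's own thread pool
      os.environ["OMP_NUM_THREADS"] = "1"       # cap OpenMP-based builds

      import numpy as np

      # With a single-threaded BLAS, this matrix product runs on one core,
      # so the exec_blas_async_wait / sched_yield path is never entered.
      a = np.ones((512, 512))
      b = a @ a
      print(b[0, 0])  # each entry is the sum of 512 ones -> 512.0
      ```

      Setting the variables inside the driver process (rather than in the shell) ensures they take effect even when the job is launched through a batch wrapper.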


          Activity

          swinbank John Swinbank added a comment -

          An obvious workaround for this is to simply disable multiple threads even when running only in a single process. This seems like it might be the right thing to do anyway, since we might expect contention between multiple users running pipeline jobs, or a single user running multiple pipeline jobs, even if each of those jobs is only a single process. Paul Price, this seems like something you probably know & care about.

          price Paul Price added a comment -

          I agree we should disable multiple threads for --batch-type=none.

          swinbank John Swinbank added a comment -

          Paul Price — I think you've already basically signed off on the changes here, but then I was distracted by moving house and never followed through. Are you happy for me to go ahead and merge what you've already approved on GitHub?

          price Paul Price added a comment -

          Yes, thanks.

          swinbank John Swinbank added a comment -

          Thanks Paul!

          Jenkins #27720 is happy. Merged and done.


            People

            Assignee:
            swinbank John Swinbank
            Reporter:
            nlust Nate Lust
            Reviewers:
            Paul Price
            Watchers:
            John Swinbank, Nate Lust, Paul Price

