Data Management / DM-14181

Many socket timed out failures at slurm job launch


    Details

    • Type: Improvement
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Team: Data Facility

      Description

      Over the last weekend (late afternoon 2018-04-20 through the night of 2018-04-22) on the verification cluster, many of my slurm jobs failed to launch, seemingly more often than usual (~30% of this kind of job submission). The logs show the following (with different workers):

      srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
      srun: error: Task launch for 136725.0 failed on node lsst-verify-worker03: Socket timed out on send/recv operation
      srun: error: Application launch failed: Socket timed out on send/recv operation
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      slurmstepd: error: *** STEP 136725.0 ON lsst-verify-worker02 CANCELLED AT 2018-04-22T20:53:48 ***
      srun: error: lsst-verify-worker02: task 0: Killed
      srun: error: lsst-verify-worker04: task 2: Killed
      [mpiexec@lsst-verify-worker02] control_cb (pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
      [mpiexec@lsst-verify-worker02] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
      [mpiexec@lsst-verify-worker02] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
      [mpiexec@lsst-verify-worker02] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
      

      Below are the slurm job IDs of these failures:

      136119,136159,136057,136071,136082,136087,136096,136104,136118,136122,136134,136145,136147,136152,136059,136076,136102,136116,136130,136136,136150,136053,136091,136111,136124,136140,136051,136061,136067,136073,136100,136113,136127,136132,136265,136289,136315,136317,136320,136313,136198,136202,136206,136228,136197,136231,136215,136223,136201,136234,136239,136271,136277,136279,136281,136334,136725,136335
      

      Some were indeed run while the cluster was busy, but the failures also seemed to happen when the cluster wasn't very busy.
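
      For anyone triaging a similar batch of failures, slurm's accounting database can confirm the final state of each job ID listed above. The following is a minimal Python sketch, not part of any script used here; it assumes sacct is on the PATH and slurm accounting is enabled, and the job-ID list is truncated for brevity.

      #!/usr/bin/env python3
      """Minimal sketch: report the final state of the slurm jobs listed above.

      Illustrative only; assumes `sacct` is available and accounting is enabled.
      """
      import subprocess

      JOB_IDS = "136119,136159,136057"  # truncated; paste the full list from this ticket

      def job_states(job_ids: str) -> dict:
          """Return {job_id: state} for top-level job allocations only (-X)."""
          out = subprocess.run(
              ["sacct", "-n", "-X", "-P", "-j", job_ids, "--format=JobID,State"],
              check=True, capture_output=True, text=True,
          ).stdout
          return dict(line.split("|", 1) for line in out.splitlines() if line)

      if __name__ == "__main__":
          for job_id, state in sorted(job_states(JOB_IDS).items()):
              if state != "COMPLETED":
                  print(f"{job_id}: {state}")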


            Activity

            mtlong2 Matthew Thomas Long [X] (Inactive) added a comment -

            After some digging, my current thinking is that this is an sssd problem. Below are logs from three different jobs (two failures and one success). Note the sssd shutdown/startup in between, which happens very quickly on the affected nodes.

            Apr 22 19:28:49 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: _run_prolog: run job script took usec=7
            Apr 22 19:28:49 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: _run_prolog: prolog with lock for job 136637 ran for 0 seconds
            Apr 22 19:28:49 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: Launching batch job 136637 for UID 34076
            Apr 22 19:29:17 lsst-verify-worker33.ncsa.illinois.edu sssd[be[ldap.ncsa.illinois.edu]]: Shutting down
            Apr 22 19:29:17 lsst-verify-worker33.ncsa.illinois.edu sssd[be[ldap.ncsa.illinois.edu]]: Starting up
            Apr 22 19:29:18 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: Warning: slurmstepd startup took 29 sec, possible file system problem or full memory
            Apr 22 19:29:19 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: launch task 136637.0 request from 34076.202@141.142.181.96 (port 30874)
            Apr 22 19:29:29 lsst-verify-worker33.ncsa.illinois.edu slurmstepd[92853]: done with job
            

            Apr 22 17:19:00 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: _run_prolog: run job script took usec=6
            Apr 22 17:19:00 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: _run_prolog: prolog with lock for job 136335 ran for 0 seconds
            Apr 22 17:19:00 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: Launching batch job 136335 for UID 34076
            Apr 22 17:19:07 lsst-verify-worker32.ncsa.illinois.edu sssd[be[ldap.ncsa.illinois.edu]]: Shutting down
            Apr 22 17:19:07 lsst-verify-worker32.ncsa.illinois.edu sssd[be[ldap.ncsa.illinois.edu]]: Starting up
            Apr 22 17:19:08 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: Warning: slurmstepd startup took 8 sec, possible file system problem or full memory
            Apr 22 17:19:08 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: launch task 136335.0 request from 34076.202@141.142.181.95 (port 27860)
            Apr 22 17:19:29 lsst-verify-worker32.ncsa.illinois.edu slurmstepd[71711]: error: *** STEP 136335.0 ON lsst-verify-worker32 CANCELLED AT 2018-04-22T17:19:29 ***
            Apr 22 17:19:29 lsst-verify-worker32.ncsa.illinois.edu slurmstepd[71648]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 65280
            Apr 22 17:19:29 lsst-verify-worker32.ncsa.illinois.edu slurmstepd[71648]: done with job
            

            Apr 22 17:23:32 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: _run_prolog: prolog with lock for job 136336 ran for 0 seconds
            Apr 22 17:23:32 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: Launching batch job 136336 for UID 34076
            Apr 22 17:23:32 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: launch task 136336.0 request from 34076.202@141.142.181.95 (port 46292)
            

            I don't see any additional logs for job 136336 in /var/log/messages, but other sources show the job did complete, so I suspect the task launch worked in this case and any additional output went elsewhere.

            My current explanation for this issue is that the MPI process tries to start as the owner of the job. When sssd restarts while the task is being launched, the user lookup/authentication fails, the task launch eventually times out, and the job is killed.

            Ex:
            launch task 136335.0 request from 34076.202@141.142.181.95 (port 27860)

            In this case '34076.202' is the uid/gid of the job owner. Although the log shows that sssd has already started up again, it likely has no cache to pull information from. I noticed that running 'id' on a user took a substantial amount of time to return on a node affected by this problem (likely longer than MPI waits before timing out).
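
            A rough way to quantify that observation is to time a UID lookup through NSS (which goes through sssd on these nodes) right after an sssd restart. The sketch below is illustrative only; the UID comes from the log lines above, while the 10-second threshold is an arbitrary assumption, not a configured value.

            #!/usr/bin/env python3
            """Minimal sketch: time a UID lookup via NSS/sssd, analogous to running `id`.

            Illustrative only; the threshold below is an assumption, not a configured value.
            """
            import pwd
            import time

            def timed_uid_lookup(uid: int) -> float:
                """Return seconds taken to resolve a UID via getpwuid (NSS -> sssd)."""
                start = time.monotonic()
                try:
                    pwd.getpwuid(uid)
                except KeyError:
                    pass  # lookup failed outright, e.g. sssd down with an empty cache
                return time.monotonic() - start

            if __name__ == "__main__":
                elapsed = timed_uid_lookup(34076)  # UID seen in the launch-task log lines above
                if elapsed > 10.0:  # arbitrary illustrative threshold
                    print(f"UID lookup took {elapsed:.1f}s -- sssd cache is probably cold")
                else:
                    print(f"UID lookup took {elapsed:.3f}s")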

            I will try doing some more digging to see what the cause of this sssd problem is, and how to fix it. Hopefully resolving that will also correct these frequent job failures.

            Thanks,

            Matt

            mtlong2 Matthew Thomas Long [X] (Inactive) added a comment -

            We have updated the sssd timeout window, which should resolve these job startup issues. I'll leave the ticket open for a few days to monitor jobs and ensure that the changes we made are sufficient.
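
            The ticket does not record exactly which sssd setting was adjusted. For readers hitting the same symptom elsewhere, a hypothetical /etc/sssd/sssd.conf fragment of the kind involved might look like the following; the parameter choices and values are illustrative assumptions, not the actual change made on the verification cluster.

            # Hypothetical sssd.conf fragment -- illustrative only; the actual change
            # made on the verification cluster is not recorded in this ticket.
            [domain/ldap.ncsa.illinois.edu]
            cache_credentials = true      # allow authentication from cached credentials if the backend is unreachable
            entry_cache_timeout = 14400   # seconds a cached user/group entry stays valid before a backend refresh
            ldap_network_timeout = 30     # seconds to wait on the LDAP connection before treating the server as down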

            Please report any additional job failures here until the ticket is closed.

            Thanks,

            Matt

            mtlong2 Matthew Thomas Long [X] (Inactive) added a comment -

            Awaiting verification that the problem is resolved.

            hchiang2 Hsin-Fang Chiang added a comment -

            Since the update yesterday morning, I've submitted ~30 jobs that use multiple nodes and none have failed due to this same timeout issue. It looks good, so I'll mark this as Reviewed now. Thanks a lot for figuring this out and fixing it!

            mtlong2 Matthew Thomas Long [X] (Inactive) added a comment -

            Have there been any issues following the last update?

            If not, can this ticket be closed?

            hchiang2 Hsin-Fang Chiang added a comment -

            I haven't encountered this issue since the update.


              People

              Assignee:
              mtlong2 Matthew Thomas Long [X] (Inactive)
              Reporter:
              hchiang2 Hsin-Fang Chiang
              Reviewers:
              Greg Daues, Hsin-Fang Chiang, Matthew Thomas Long [X] (Inactive)
              Watchers:
              Greg Daues, Hsin-Fang Chiang, Jacob Rundall, Matthew Thomas Long [X] (Inactive), Michelle Butler [X] (Inactive)

