# Many socket timed out failures at slurm job launch


#### Details

• Type: Improvement
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels: None
• Team: Data Facility

#### Description

Over the last weekend (2018-04-20 late afternoon through 2018-04-22 night) on the verification cluster, many of my slurm jobs failed to launch, and it seemed to happen more frequently than usual (~30% of this kind of job submission). The logs show the following (with different workers):

```
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 136725.0 failed on node lsst-verify-worker03: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 136725.0 ON lsst-verify-worker02 CANCELLED AT 2018-04-22T20:53:48 ***
srun: error: lsst-verify-worker02: task 0: Killed
srun: error: lsst-verify-worker04: task 2: Killed
[mpiexec@lsst-verify-worker02] control_cb (pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
[mpiexec@lsst-verify-worker02] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@lsst-verify-worker02] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@lsst-verify-worker02] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
```

Below are the slurm job IDs of these failures:

```
136119,136159,136057,136071,136082,136087,136096,136104,136118,136122,136134,136145,136147,136152,
136059,136076,136102,136116,136130,136136,136150,136053,136091,136111,136124,136140,136051,136061,
136067,136073,136100,136113,136127,136132,136265,136289,136315,136317,136320,136313,136198,136202,
136206,136228,136197,136231,136215,136223,136201,136234,136239,136271,136277,136279,136281,136334,
136725,136335
```

Some of these were indeed run when the cluster was busy, but failures also occurred when the cluster wasn't very busy.

#### Activity

Matthew Thomas Long [X] (Inactive) added a comment -

After some digging, my current thinking is that this is an sssd problem. Below are logs from three different jobs (two failures and one success). Note the sssd stop/start pairs in between, which happen very quickly on the affected nodes.

```
Apr 22 19:28:49 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: _run_prolog: run job script took usec=7
Apr 22 19:28:49 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: _run_prolog: prolog with lock for job 136637 ran for 0 seconds
Apr 22 19:28:49 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: Launching batch job 136637 for UID 34076
Apr 22 19:29:17 lsst-verify-worker33.ncsa.illinois.edu sssd[be[ldap.ncsa.illinois.edu]]: Shutting down
Apr 22 19:29:17 lsst-verify-worker33.ncsa.illinois.edu sssd[be[ldap.ncsa.illinois.edu]]: Starting up
Apr 22 19:29:18 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: Warning: slurmstepd startup took 29 sec, possible file system problem or full memory
Apr 22 19:29:19 lsst-verify-worker33.ncsa.illinois.edu slurmd[3805]: launch task 136637.0 request from 34076.202@141.142.181.96 (port 30874)
Apr 22 19:29:29 lsst-verify-worker33.ncsa.illinois.edu slurmstepd[92853]: done with job
```

```
Apr 22 17:19:00 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: _run_prolog: run job script took usec=6
Apr 22 17:19:00 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: _run_prolog: prolog with lock for job 136335 ran for 0 seconds
Apr 22 17:19:00 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: Launching batch job 136335 for UID 34076
Apr 22 17:19:07 lsst-verify-worker32.ncsa.illinois.edu sssd[be[ldap.ncsa.illinois.edu]]: Shutting down
Apr 22 17:19:07 lsst-verify-worker32.ncsa.illinois.edu sssd[be[ldap.ncsa.illinois.edu]]: Starting up
Apr 22 17:19:08 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: Warning: slurmstepd startup took 8 sec, possible file system problem or full memory
Apr 22 17:19:08 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: launch task 136335.0 request from 34076.202@141.142.181.95 (port 27860)
Apr 22 17:19:29 lsst-verify-worker32.ncsa.illinois.edu slurmstepd[71711]: error: *** STEP 136335.0 ON lsst-verify-worker32 CANCELLED AT 2018-04-22T17:19:29 ***
Apr 22 17:19:29 lsst-verify-worker32.ncsa.illinois.edu slurmstepd[71648]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 65280
Apr 22 17:19:29 lsst-verify-worker32.ncsa.illinois.edu slurmstepd[71648]: done with job
```

```
Apr 22 17:23:32 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: _run_prolog: prolog with lock for job 136336 ran for 0 seconds
Apr 22 17:23:32 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: Launching batch job 136336 for UID 34076
Apr 22 17:23:32 lsst-verify-worker32.ncsa.illinois.edu slurmd[3754]: launch task 136336.0 request from 34076.202@141.142.181.95 (port 46292)
```

I don't see any additional logs for job 136336 in /var/log/messages, but other evidence shows the job did complete, so I suspect the task launch worked in this case and any additional output simply went elsewhere.

My current explanation for this issue is that the MPI process starts as the owner of the job. If sssd restarts while the task is launching, the user can't be authenticated, and the task launch eventually times out, killing the job.

Example:

```
launch task 136335.0 request from 34076.202@141.142.181.95 (port 27860)
```

In this case, '34076.202' is the uid/gid of the job owner. While the log shows that sssd had already started up again, it likely had no cache to pull information from yet. I noticed that running 'id' on a user took a substantial amount of time on a node afflicted with this problem (likely longer than MPI waits before timing out).
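The slow-lookup symptom is easy to check by hand. A minimal diagnostic sketch (commands are my assumption of how to reproduce the observation, not taken from the ticket):

```shell
# Time a user lookup. On a healthy node with a warm sssd cache this
# returns in milliseconds; on an afflicted node where sssd has just
# restarted with a cold cache, it can take many seconds.
time id root

# Look for rapid sssd stop/start pairs like the ones in the logs above.
# (Log path assumed for a RHEL/CentOS node; suppress errors if absent.)
grep -E "sssd.*(Shutting down|Starting up)" /var/log/messages 2>/dev/null | tail -n 20 || true
```

If the `id` call stalls for longer than the MPI launch timeout, the task launch would fail exactly as seen in the srun output.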

I will do some more digging to find the cause of this sssd problem and how to fix it. Hopefully resolving that will also eliminate these frequent job failures.

Thanks,

Matt

Matthew Thomas Long [X] (Inactive) added a comment -

We have updated the sssd timeout window, which should resolve these job startup issues. I'll leave the ticket open for a few days to monitor jobs and ensure that the changes we made are sufficient.
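The ticket doesn't record which parameter was adjusted, so the fragment below is only a hypothetical sketch of the kind of sssd.conf tuning involved (option names are real sssd.conf(5) options; the values are illustrative):

```ini
[domain/ldap.ncsa.illinois.edu]
# Keep resolved user/group entries cached longer, so a brief sssd
# restart doesn't force every lookup back to LDAP during job launch.
entry_cache_timeout = 14400
# Tolerate slower LDAP responses before an operation is abandoned.
ldap_search_timeout = 60
ldap_network_timeout = 60
```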

Please report any additional job failures here until the ticket is closed.

Thanks,

Matt

Matthew Thomas Long [X] (Inactive) added a comment -

Awaiting verification that problem is resolved.

Hsin-Fang Chiang added a comment -

Since the update yesterday morning, I've submitted ~30 jobs that use multiple nodes and none failed due to this same timeout issue. It seems good so I'll mark this as Reviewed now. Thanks a lot for figuring this out and fixing it!

Matthew Thomas Long [X] (Inactive) added a comment -

Have there been any issues following the last update?

If not, can this ticket be closed?

Hsin-Fang Chiang added a comment -

I haven't encountered this issue since the update.


#### People

Assignee:
Matthew Thomas Long [X] (Inactive)
Reporter:
Hsin-Fang Chiang
Reviewers:
Greg Daues, Hsin-Fang Chiang, Matthew Thomas Long [X] (Inactive)
Watchers:
Greg Daues, Hsin-Fang Chiang, Jacob Rundall, Matthew Thomas Long [X] (Inactive), Michelle Butler [X] (Inactive)