Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: Continuous Integration
- Labels: None
- Story Points: 5
- Epic Link:
- Team: SQuaRE
Description
Jenkins jobs that use docker are frequently hanging, and dockerd / xfs errors are appearing in the dmesg output:
[279961.166094] INFO: task dockerd:902 blocked for more than 120 seconds.
[279961.169742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279961.173508] dockerd D ffff8800a188e010 0 902 1 0x00000080
[279961.176932] ffff880033883d78 0000000000000086 ffff8803bb2e8fd0 ffff880033883fd8
[279961.180601] ffff880033883fd8 ffff880033883fd8 ffff8803bb2e8fd0 ffff8800a188e000
[279961.184682] ffff8800a188e040 ffff88036f27c140 ffff8800a188e068 ffff8800a188e010
[279961.188403] Call Trace:
[279961.190270] [<ffffffff816a94e9>] schedule+0x29/0x70
[279961.193155] [<ffffffffc01732ea>] xfs_ail_push_all_sync+0xba/0x110 [xfs]
[279961.196618] [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
[279961.199554] [<ffffffffc015c2e1>] xfs_unmountfs+0x71/0x1c0 [xfs]
[279961.202526] [<ffffffffc015cded>] ? xfs_mru_cache_destroy+0x6d/0xa0 [xfs]
[279961.205747] [<ffffffffc015ee92>] xfs_fs_put_super+0x32/0x90 [xfs]
[279961.209261] [<ffffffff81203722>] generic_shutdown_super+0x72/0x100
[279961.212640] [<ffffffff81203b67>] kill_block_super+0x27/0x70
[279961.215738] [<ffffffff81203ea9>] deactivate_locked_super+0x49/0x60
[279961.218801] [<ffffffff81204616>] deactivate_super+0x46/0x60
[279961.221628] [<ffffffff8122184f>] cleanup_mnt+0x3f/0x80
[279961.224301] [<ffffffff812218e2>] __cleanup_mnt+0x12/0x20
[279961.227135] [<ffffffff810ad247>] task_work_run+0xa7/0xf0
[279961.230449] [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
[279961.233768] [<ffffffff816b52bd>] int_signal+0x12/0x17
These problems seem to go away, at least for a while, after the node is restarted (affected nodes will not soft reboot). It isn't yet known whether this is a kernel bug or some kind of random EBS I/O timeout. As it seems to be triggered only by dockerd, a kernel issue seems much more likely. What is unusual is that this doesn't seem to have been an issue until about a week ago, and the onset doesn't correlate with a kernel or dockerd version change.
The kernel update is currently being rolled out as part of DM-13347.
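For triage, the hung-task reports quoted in the dmesg excerpt above could be picked out of a node's kernel log with a small script. The following is only a sketch under stated assumptions: the function names `find_hung_tasks` and `involves_xfs` are hypothetical (they are not part of any existing tooling mentioned in this ticket), and the regular expression assumes the message format seen in the excerpt.

```python
import re

# Matches the kernel hung-task header line as it appears in the excerpt, e.g.
# "[279961.166094] INFO: task dockerd:902 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return one dict per hung-task report found in dmesg-style text.

    Hypothetical helper for illustration only.
    """
    reports = []
    for line in dmesg_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            reports.append({
                "timestamp": float(m.group("ts")),
                "command": m.group("comm"),
                "pid": int(m.group("pid")),
                "blocked_secs": int(m.group("secs")),
            })
    return reports


def involves_xfs(dmesg_text):
    """Return True if any line ends with the "[xfs]" module tag.

    Trailing spaces and "|" characters are ignored so that copy-pasted
    excerpts with table residue still match. Hypothetical helper.
    """
    return any(
        line.rstrip(" |").endswith("[xfs]")
        for line in dmesg_text.splitlines()
    )
```

The text passed in could come from a saved log file or from the output of `dmesg` captured on the node; a report whose command is `dockerd` together with `involves_xfs` returning true would correspond to the symptom described here.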