Data Management / DM-12854

Ongoing Jenkins dockerd/xfs errors

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Continuous Integration
    • Labels:
      None
    • Story Points:
      5
    • Team:
      SQuaRE

      Description

      Jenkins jobs that use Docker are frequently hanging, and dockerd/xfs hung-task errors are appearing in the dmesg output:

      [279961.166094] INFO: task dockerd:902 blocked for more than 120 seconds.
      [279961.169742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [279961.173508] dockerd         D ffff8800a188e010     0   902      1 0x00000080
      [279961.176932]  ffff880033883d78 0000000000000086 ffff8803bb2e8fd0 ffff880033883fd8
      [279961.180601]  ffff880033883fd8 ffff880033883fd8 ffff8803bb2e8fd0 ffff8800a188e000
      [279961.184682]  ffff8800a188e040 ffff88036f27c140 ffff8800a188e068 ffff8800a188e010
      [279961.188403] Call Trace:
      [279961.190270]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [279961.193155]  [<ffffffffc01732ea>] xfs_ail_push_all_sync+0xba/0x110 [xfs]
      [279961.196618]  [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
      [279961.199554]  [<ffffffffc015c2e1>] xfs_unmountfs+0x71/0x1c0 [xfs]
      [279961.202526]  [<ffffffffc015cded>] ? xfs_mru_cache_destroy+0x6d/0xa0 [xfs]
      [279961.205747]  [<ffffffffc015ee92>] xfs_fs_put_super+0x32/0x90 [xfs]
      [279961.209261]  [<ffffffff81203722>] generic_shutdown_super+0x72/0x100
      [279961.212640]  [<ffffffff81203b67>] kill_block_super+0x27/0x70
      [279961.215738]  [<ffffffff81203ea9>] deactivate_locked_super+0x49/0x60
      [279961.218801]  [<ffffffff81204616>] deactivate_super+0x46/0x60
      [279961.221628]  [<ffffffff8122184f>] cleanup_mnt+0x3f/0x80
      [279961.224301]  [<ffffffff812218e2>] __cleanup_mnt+0x12/0x20
      [279961.227135]  [<ffffffff810ad247>] task_work_run+0xa7/0xf0
      [279961.230449]  [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
      [279961.233768]  [<ffffffff816b52bd>] int_signal+0x12/0x17
      

      These problems seem to go away, at least for a while, after the node is restarted (affected nodes will not soft reboot). It isn't yet known whether this is a kernel bug or some sort of random EBS I/O timeout. Because it seems to be triggered only by dockerd, a kernel issue seems much more likely. What is unusual is that this does not seem to have been a problem until about a week ago, and its onset does not correlate with a kernel or dockerd version change.
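      To make it easier to check whether a given node is affected, the hung-task reports can be pulled out of saved dmesg text programmatically. The following is only a minimal sketch, assuming the log lines look like the trace quoted above; the function names `find_hung_tasks` and `involves_module` are hypothetical helpers invented for this illustration and are not part of Jenkins, Docker, or any existing tooling on these nodes.

```python
import re

# Matches the kernel's hung-task warning line, for example:
# "[279961.166094] INFO: task dockerd:902 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Matches one call-trace frame, for example:
# "[<ffffffffc015c2e1>] xfs_unmountfs+0x71/0x1c0 [xfs]"
# A leading "?" (a frame the kernel could not verify) is accepted and ignored.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def find_hung_tasks(dmesg_text):
    """Return one dict per hung-task report found in dmesg_text.

    Hypothetical helper: each dict holds the timestamp, task name, pid,
    blocked-for threshold in seconds, and a list of (symbol, module)
    frames, where module is None for frames in the core kernel.
    """
    reports = []
    current = None
    for line in dmesg_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "timestamp": float(hung["ts"]),
                "task": hung["comm"],
                "pid": int(hung["pid"]),
                "seconds": int(hung["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append((frame["sym"], frame["mod"]))
    return reports


def involves_module(report, module):
    """True if any frame of the report belongs to the named kernel module."""
    return any(mod == module for _, mod in report["frames"])
```

      Fed the trace from the description, this sketch would yield a single report for task dockerd (pid 902) for which `involves_module(report, "xfs")` is true. It only classifies what is already in the log; it cannot by itself distinguish a kernel bug from an EBS I/O timeout.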

              People

              • Assignee:
                jhoblitt Joshua Hoblitt
                Reporter:
                jhoblitt Joshua Hoblitt
                Reviewers:
                Joshua Hoblitt
                Watchers:
                Joshua Hoblitt