Data Management / DM-12854

ongoing jenkins dockerd/xfs errors

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Continuous Integration
    • Labels:
      None

      Description

      Jenkins jobs that use docker are frequently hanging, and dockerd / xfs errors are appearing in dmesg:

      [279961.166094] INFO: task dockerd:902 blocked for more than 120 seconds.
      [279961.169742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [279961.173508] dockerd         D ffff8800a188e010     0   902      1 0x00000080
      [279961.176932]  ffff880033883d78 0000000000000086 ffff8803bb2e8fd0 ffff880033883fd8
      [279961.180601]  ffff880033883fd8 ffff880033883fd8 ffff8803bb2e8fd0 ffff8800a188e000
      [279961.184682]  ffff8800a188e040 ffff88036f27c140 ffff8800a188e068 ffff8800a188e010
      [279961.188403] Call Trace:
      [279961.190270]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [279961.193155]  [<ffffffffc01732ea>] xfs_ail_push_all_sync+0xba/0x110 [xfs]
      [279961.196618]  [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
      [279961.199554]  [<ffffffffc015c2e1>] xfs_unmountfs+0x71/0x1c0 [xfs]
      [279961.202526]  [<ffffffffc015cded>] ? xfs_mru_cache_destroy+0x6d/0xa0 [xfs]
      [279961.205747]  [<ffffffffc015ee92>] xfs_fs_put_super+0x32/0x90 [xfs]
      [279961.209261]  [<ffffffff81203722>] generic_shutdown_super+0x72/0x100
      [279961.212640]  [<ffffffff81203b67>] kill_block_super+0x27/0x70
      [279961.215738]  [<ffffffff81203ea9>] deactivate_locked_super+0x49/0x60
      [279961.218801]  [<ffffffff81204616>] deactivate_super+0x46/0x60
      [279961.221628]  [<ffffffff8122184f>] cleanup_mnt+0x3f/0x80
      [279961.224301]  [<ffffffff812218e2>] __cleanup_mnt+0x12/0x20
      [279961.227135]  [<ffffffff810ad247>] task_work_run+0xa7/0xf0
      [279961.230449]  [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
      [279961.233768]  [<ffffffff816b52bd>] int_signal+0x12/0x17
      

      These problems seem to go away, at least for a while, after the node is restarted (the affected nodes will not soft reboot). It isn't yet known whether this is a kernel bug or some sort of random EBS I/O timeout. As it seems to be triggered only by dockerd, a kernel issue seems much more likely. What is unusual is that this doesn't seem to have been an issue until about a week ago, and its onset doesn't correlate with a kernel or dockerd version change.
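
To check whether other nodes show the same signature, the hung-task reports can be pulled out of dmesg output mechanically. The following is only a rough sketch: the names `parse_hung_tasks` and `HungTask` are made up for illustration, and the regular expressions assume the el7-style line format seen in the trace above (other kernels print stack frames differently).

```python
import re
from dataclasses import dataclass, field

# "[279961.166094] INFO: task dockerd:902 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"^\s*\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)
# "[279961.193155]  [<ffffffffc01732ea>] xfs_ail_push_all_sync+0xba/0x110 [xfs]"
FRAME_RE = re.compile(
    r"^\s*\[\s*\d+\.\d+\]\s+\[<[0-9a-f]+>\]\s+(?:\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)


@dataclass
class HungTask:
    timestamp: float
    comm: str
    pid: int
    seconds: int
    frames: list = field(default_factory=list)  # (symbol, module or None)

    def in_module(self, module):
        """True if any stack frame belongs to the given kernel module."""
        return any(mod == module for _, mod in self.frames)


def parse_hung_tasks(dmesg_text):
    """Collect hung-task reports and the stack frames that follow each one.

    Limitation: frames are attached to the most recent report, so an
    unrelated trace printed later would be misattributed.
    """
    reports = []
    current = None
    for line in dmesg_text.splitlines():
        hung = HUNG_RE.match(line)
        if hung:
            current = HungTask(
                float(hung["ts"]), hung["comm"], int(hung["pid"]), int(hung["secs"])
            )
            reports.append(current)
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            current.frames.append((frame["sym"], frame["mod"]))
    return reports
```

Feeding it the text of `dmesg` from a node and filtering on `report.comm == "dockerd" and report.in_module("xfs")` would flag reports that match the one in this ticket.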

            Activity

            jhoblitt Joshua Hoblitt added a comment -

            The kernel update is currently being rolled out as part of DM-13347.

            jhoblitt Joshua Hoblitt added a comment -

            Taking a wait-and-see stance after the kernel update rolled out in DM-13347.

            jhoblitt Joshua Hoblitt added a comment -

            There hasn't been evidence of this problem for a month now, so I'm going to dare to call it resolved (again).

            jhoblitt Joshua Hoblitt added a comment -

            and it's back on el7-5:

            [4153923.794084] INFO: task dockerd:1020 blocked for more than 120 seconds.
            [4153923.797007] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [4153923.800267] dockerd         D ffff880035399fa0     0  1020      1 0x00000080
            [4153923.803389] Call Trace:
            [4153923.804911]  [<ffffffff810c63a3>] ? try_to_wake_up+0x183/0x340
            [4153923.807600]  [<ffffffff816ab8a9>] schedule+0x29/0x70
            [4153923.810097]  [<ffffffffc01b631a>] xfs_ail_push_all_sync+0xba/0x110 [xfs]
            [4153923.813242]  [<ffffffff810b3690>] ? wake_up_atomic_t+0x30/0x30
            [4153923.815979]  [<ffffffffc019f311>] xfs_unmountfs+0x71/0x1c0 [xfs]
            [4153923.818788]  [<ffffffffc019fe1d>] ? xfs_mru_cache_destroy+0x6d/0xa0 [xfs]
            [4153923.821790]  [<ffffffffc01a1ec2>] xfs_fs_put_super+0x32/0x90 [xfs]
            [4153923.824586]  [<ffffffff812056d2>] generic_shutdown_super+0x72/0x100
            [4153923.827547]  [<ffffffff81205b17>] kill_block_super+0x27/0x70
            [4153923.830239]  [<ffffffff81205e59>] deactivate_locked_super+0x49/0x60
            [4153923.833078]  [<ffffffff812065c6>] deactivate_super+0x46/0x60
            [4153923.835744]  [<ffffffff8122395f>] cleanup_mnt+0x3f/0x80
            [4153923.838274]  [<ffffffff812239f2>] __cleanup_mnt+0x12/0x20
            [4153923.840858]  [<ffffffff810aefc7>] task_work_run+0xa7/0xf0
            [4153923.843507]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
            [4153923.846125]  [<ffffffff816b8d37>] int_signal+0x12/0x17
            

            jhoblitt Joshua Hoblitt added a comment -

            I'm closing this yet again... The production jenkins env has been migrated from the devicemapper+loopback storage driver to overlay2 (except for the jenkins master node, which has an xfs root filesystem that was formatted without d_type). Hopefully, this will be a definitive resolution...
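
For context on the master node exception: overlay2 needs d_type support from the backing filesystem, and on xfs that corresponds to the `ftype=1` setting chosen at mkfs time, which `xfs_info` reports in its `naming` line. A small sketch of checking that field follows; the function name `xfs_supports_d_type` is hypothetical, and the sample text in the usage note is illustrative rather than captured from any of the nodes in this ticket.

```python
import re


def xfs_supports_d_type(xfs_info_output):
    """Interpret the ftype= field from `xfs_info` output.

    Returns True for ftype=1, False for ftype=0, and None if the field
    is missing (for example, if the text is not xfs_info output).
    """
    match = re.search(r"\bftype=(\d)\b", xfs_info_output)
    if match is None:
        return None
    return match.group(1) == "1"
```

As a usage example, passing an illustrative line such as `naming   =version 2              bsize=4096   ascii-ci=0 ftype=0` returns `False`, which is the situation described for the master node's root filesystem. In practice the input would be the captured output of `xfs_info` on the mount point that holds docker's data directory.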


              People

              • Assignee:
                jhoblitt Joshua Hoblitt
                Reporter:
                jhoblitt Joshua Hoblitt
                Reviewers:
                Joshua Hoblitt
                Watchers:
                Joshua Hoblitt