  1. Data Management
  2. DM-12810

weekly release w_2017_47 failed

    Details

      Description

      The _47 weekly release failed trying to tag git repos on all 3 attempts. It seems that every attempt was being scheduled on jenkins-el7-4 and there was some sort of jenkins remote communication error:

      Cannot contact jenkins-el7-4: java.io.IOException: Remote call on JNLP4-connect connection from ip-192-168-123-181.ec2.internal/192.168.123.181:48368 failed
      

      This node (like all el7 nodes) was restarted last Wednesday as part of DM-12782, which may somehow have triggered this new issue.

      I noticed something funky was going on over the weekend and took that node offline at Nov 25, 2017 8:04:56 AM. However, the weekly release still needs to be salvaged and the problem with that build node diagnosed.

            Activity

            jhoblitt Joshua Hoblitt added a comment - - edited

            The failures on el7-4 appear to be a repeat of DM-12782.

            [163718.369401] XFS (dm-3): Unmounting Filesystem
            [163920.482114] INFO: task dockerd:21922 blocked for more than 120 seconds.
            [163920.486775] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [163920.491734] dockerd         D ffff8800182cba10     0 21922      1 0x00000080
            [163920.496572]  ffff880279c53d78 0000000000000082 ffff8803be903f40 ffff880279c53fd8
            [163920.502162]  ffff880279c53fd8 ffff880279c53fd8 ffff8803be903f40 ffff8800182cba00
            [163920.507285]  ffff8800182cba40 ffff880388245d20 ffff8800182cba68 ffff8800182cba10
            [163920.513018] Call Trace:
            [163920.515310]  [<ffffffff816a94e9>] schedule+0x29/0x70
            [163920.518904]  [<ffffffffc01c42ea>] xfs_ail_push_all_sync+0xba/0x110 [xfs]
            [163920.523430]  [<ffffffff810b1910>] ? wake_up_atomic_t+0x30/0x30
            [163920.527697]  [<ffffffffc01ad2e1>] xfs_unmountfs+0x71/0x1c0 [xfs]
            [163920.532160]  [<ffffffffc01added>] ? xfs_mru_cache_destroy+0x6d/0xa0 [xfs]
            [163920.536747]  [<ffffffffc01afe92>] xfs_fs_put_super+0x32/0x90 [xfs]
            [163920.540766]  [<ffffffff81203722>] generic_shutdown_super+0x72/0x100
            [163920.545167]  [<ffffffff81203b67>] kill_block_super+0x27/0x70
            [163920.549002]  [<ffffffff81203ea9>] deactivate_locked_super+0x49/0x60
            [163920.553098]  [<ffffffff81204616>] deactivate_super+0x46/0x60
            [163920.556692]  [<ffffffff8122184f>] cleanup_mnt+0x3f/0x80
            [163920.560303]  [<ffffffff812218e2>] __cleanup_mnt+0x12/0x20
            [163920.564096]  [<ffffffff810ad247>] task_work_run+0xa7/0xf0
            [163920.567582]  [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
            [163920.571218]  [<ffffffff816b52bd>] int_signal+0x12/0x17
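
A minimal sketch of how hung-task warnings like the one above could be pulled out of saved dmesg output. The function name `find_hung_tasks` and the regular expression are illustrative assumptions; they were not part of the original investigation, and the pattern only covers the message format shown in this ticket.

```python
import re

# Matches the kernel's hung-task warning as it appears in this ticket, e.g.
# "[163920.482114] INFO: task dockerd:21922 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return (timestamp, command, pid, seconds) for each hung-task warning."""
    hits = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            hits.append((
                float(match.group("ts")),
                match.group("comm"),
                int(match.group("pid")),
                int(match.group("secs")),
            ))
    return hits


# Three lines copied from the dmesg excerpt above.
sample = """\
[163718.369401] XFS (dm-3): Unmounting Filesystem
[163920.482114] INFO: task dockerd:21922 blocked for more than 120 seconds.
[163920.513018] Call Trace:
"""

hung = find_hung_tasks(sample)
for ts, comm, pid, secs in hung:
    print(f"{comm} (pid {pid}) blocked > {secs}s at uptime {ts}")
```

Run against the full log of each build node, something like this would show whether dockerd is the only task affected.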
            

            dockerd is not responding, is listed as <defunct>, and cannot be killed. dm-3 is a loopback filesystem:

            [vagrant@jenkins-el7-4 ~]$ cat /proc/partitions 
            major minor  #blocks  name
             
             202        0 1572864000 xvda
             202        1 1572858866 xvda1
               7        0  104857600 loop0
               7        1    2097152 loop1
             253        0  104857600 dm-0
             253        1   10485760 dm-1
             253        2   10485760 dm-2
             253        3   10485760 dm-3
            

            However, it doesn't appear to have a loopback filesystem attached to it.

            [vagrant@jenkins-el7-4 ~]$ losetup -a
            /dev/loop0: []: (/var/lib/docker/devicemapper/devicemapper/data)
            /dev/loop1: []: (/var/lib/docker/devicemapper/devicemapper/metadata)
            

            Which leaves us once again with either kernel corruption due to resource leakage or EBS I/O problems that have caused deadlocks.
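
For reference, the `/proc/partitions` listing above can be read more easily once the block counts are converted to sizes. The sketch below is only an illustration (the helper name `parse_partitions` is hypothetical); it relies on `/proc/partitions` reporting sizes in 1 KiB blocks.

```python
def parse_partitions(text):
    """Parse /proc/partitions-style text into {name: (major, minor, size_gib)}."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and blank lines; data rows are "major minor #blocks name".
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        major, minor, blocks, name = fields
        # /proc/partitions reports sizes in 1 KiB blocks, so divide by 1024**2 for GiB.
        devices[name] = (int(major), int(minor), int(blocks) / (1024 * 1024))
    return devices


# The listing from jenkins-el7-4, copied from this ticket.
listing = """\
major minor  #blocks  name

 202        0 1572864000 xvda
 202        1 1572858866 xvda1
   7        0  104857600 loop0
   7        1    2097152 loop1
 253        0  104857600 dm-0
 253        1   10485760 dm-1
 253        2   10485760 dm-2
 253        3   10485760 dm-3
"""

devices = parse_partitions(listing)
for name, (major, minor, size_gib) in devices.items():
    print(f"{name:6s} {major}:{minor}  {size_gib:8.1f} GiB")
```

With the sizes converted, loop0 and dm-0 are both 100 GiB, loop1 is 2 GiB, and dm-1 through dm-3 are 10 GiB each. That is consistent with dm-0 being built on the loopback data file and dm-3 being one of the devices carved out of it, although the listing alone does not prove that relationship.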

            el7-4 is running kernel 3.10.0-693.2.2.el7, while the latest is 3.10.0-693.5.2.el7. There don't seem to be a lot of options, so I'll try the newer kernel in a test env.

            jhoblitt Joshua Hoblitt added a comment -

            I've implemented a puppet class to both ensure the correct kernel is installed and to reboot the node if the currently running kernel version doesn't match. However, the changelog between the latest kernel and what el7-4 was running doesn't have an obvious fix for the symptoms we've seen. The only patches that might be in the ballpark would be the mm/cgroup fixes.

            * Thu Oct 19 2017 CentOS Sources <bugs@centos.org> - 3.10.0-693.5.2.el7
            - Apply debranding changes
             
            * Fri Oct 13 2017 Alexander Gordeev <agordeev@redhat.com> [3.10.0-693.5.2.el7]
            - [mm] page_cgroup: Fix Kernel bug during boot with memory cgroups enabled (Larry Woodman) [1491970 1483747]
            - Revert: [mm] Fix Kernel bug during boot with memory cgroups enabled (Larry Woodman) [1491970 1483747]
             
            * Sat Sep 16 2017 Alexander Gordeev <agordeev@redhat.com> [3.10.0-693.5.1.el7]
            - [netdrv] i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq (Stefan Assmann) [1491972 1484232]
            - [netdrv] i40e: avoid NVM acquire deadlock during NVM update (Stefan Assmann) [1491972 1484232]
            - [mm] Fix Kernel bug during boot with memory cgroups enabled (Larry Woodman) [1491970 1483747]
            - [fs] nfsv4: Ensure we don't re-test revoked and freed stateids (Dave Wysochanski) [1491969 1459733]
            - [netdrv] bonding: commit link status change after propose (Jarod Wilson) [1491121 1469790]
            - [mm] page_alloc: ratelimit PFNs busy info message (Jonathan Toppins) [1491120 1383179]
            - [netdrv] cxgb4: avoid crash on PCI error recovery path (Gustavo Duarte) [1489872 1456990]
            - [scsi] Add STARGET_CREATED_REMOVE state to scsi_target_state (Ewan Milne) [1489814 1468727]
            - [net] tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0 (Davide Caratti) [1488341 1487061] {CVE-2017-14106}
            - [net] tcp: fix 0 divide in __tcp_select_window() (Davide Caratti) [1488341 1487061] {CVE-2017-14106}
            - [net] sctp: Avoid out-of-bounds reads from address storage (Stefano Brivio) [1484356 1484355] {CVE-2017-7558}
            - [net] udp: consistently apply ufo or fragmentation (Davide Caratti) [1481530 1481535] {CVE-2017-1000112}
            - [net] udp: account for current skb length when deciding about UFO (Davide Caratti) [1481530 1481535] {CVE-2017-1000112}
            - [net] ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output (Davide Caratti) [1481530 1481535] {CVE-2017-1000112}
            - [net] udp: avoid ufo handling on IP payload compression packets (Stefano Brivio) [1490263 1464161]
            - [pci] hv: Use vPCI protocol version 1.2 (Vitaly Kuznetsov) [1478256 1459202]
            - [pci] hv: Add vPCI version protocol negotiation (Vitaly Kuznetsov) [1478256 1459202]
            - [pci] hv: Use page allocation for hbus structure (Vitaly Kuznetsov) [1478256 1459202]
            - [pci] hv: Fix comment formatting and use proper integer fields (Vitaly Kuznetsov) [1478256 1459202]
            - [net] ipv6: accept 64k - 1 packet length in ip6_find_1stfragopt() (Stefano Brivio) [1477007 1477010] {CVE-2017-7542}
            - [net] ipv6: avoid overflow of offset in ip6_find_1stfragopt (Sabrina Dubroca) [1477007 1477010] {CVE-2017-7542}
            - [net] xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder (Hannes Frederic Sowa) [1435672 1435670] {CVE-2017-7184}
            - [net] xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window (Hannes Frederic Sowa) [1435672 1435670] {CVE-2017-7184}
            - [net] l2cap: prevent stack overflow on incoming bluetooth packet (Neil Horman) [1489788 1489789] {CVE-2017-1000251}
             
            * Fri Sep 08 2017 Alexander Gordeev <agordeev@redhat.com> [3.10.0-693.4.1.el7]
            - [fs] nfsv4: Add missing nfs_put_lock_context() (Benjamin Coddington) [1487271 1476826]
            - [fs] nfs: discard nfs_lockowner structure (Benjamin Coddington) [1487271 1476826]
            - [fs] nfsv4: enhance nfs4_copy_lock_stateid to use a flock stateid if there is one (Benjamin Coddington) [1487271 1476826]
            - [fs] nfsv4: change nfs4_select_rw_stateid to take a lock_context inplace of lock_owner (Benjamin Coddington) [1487271 1476826]
            - [fs] nfsv4: change nfs4_do_setattr to take an open_context instead of a nfs4_state (Benjamin Coddington) [1487271 1476826]
            - [fs] nfsv4: add flock_owner to open context (Benjamin Coddington) [1487271 1476826]
            - [fs] nfs: remove l_pid field from nfs_lockowner (Benjamin Coddington) [1487271 1476826]
            - [x86] platform/uv/bau: Disable BAU on single hub configurations (Frank Ramsay) [1487159 1487160 1472455 1473353]
            - [x86] platform/uv/bau: Fix congested_response_us not taking effect (Frank Ramsay) [1487159 1472455]
            - [fs] cifs: Disable encryption capability for RHEL 7.4 kernel (Sachin Prabhu) [1485445 1485445]
            - [fs] sunrpc: Handle EADDRNOTAVAIL on connection failures (Dave Wysochanski) [1484269 1479043]
            - [fs] include/linux/printk.h: include pr_fmt in pr_debug_ratelimited (Sachin Prabhu) [1484267 1472823]
            - [fs] printk: pr_debug_ratelimited: check state first to reduce "callbacks suppressed" messages (Sachin Prabhu) [1484267 1472823]
            - [net] packet: fix tp_reserve race in packet_set_ring (Stefano Brivio) [1481938 1481940] {CVE-2017-1000111}
            - [fs] proc: revert /proc/<pid>/maps [stack:TID] annotation (Waiman Long) [1481724 1448534]
            - [net] ping: check minimum size on ICMP header length (Matteo Croce) [1481578 1481573] {CVE-2016-8399}
            - [ipc] mqueue: fix a use-after-free in sys_mq_notify() (Davide Caratti) [1476128 1476126] {CVE-2017-11176}
            - [netdrv] brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx() (Stanislaw Gruszka) [1474778 1474784] {CVE-2017-7541}
             
            * Mon Sep 04 2017 Alexander Gordeev <agordeev@redhat.com> [3.10.0-693.3.1.el7]
            - [block] blk-mq-tag: fix wakeup hang after tag resize (Ming Lei) [1487281 1472434]
            

            el7-4 had to be forcibly stopped and restarted. It is now booted with the latest kernel. If this issue re-occurs, I think the next best thing to try is to terminate the instance and re-provision from the base image. That would change very little but it would rule out xfs corruption in the root filesystem (the dmesg entries point the finger at xfs in the docker loopback though).
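
The "reboot if the running kernel doesn't match" check described in this comment can be illustrated with a short sketch. This is not the puppet class itself; the function names and the list of architecture suffixes are assumptions made for the example, and only the two kernel versions come from this ticket.

```python
import platform


def strip_arch(release):
    """Drop a trailing architecture suffix such as '.x86_64' if present."""
    for arch in (".x86_64", ".aarch64", ".i686"):
        if release.endswith(arch):
            return release[: -len(arch)]
    return release


def needs_reboot(running, expected):
    """True when the running kernel release differs from the expected one."""
    return strip_arch(running) != strip_arch(expected)


# Expected version taken from this ticket (the latest el7 kernel at the time).
EXPECTED = "3.10.0-693.5.2.el7"

# The release el7-4 was running when the failures were seen, as `uname -r`
# would typically report it on an x86_64 host.
old_kernel_mismatch = needs_reboot("3.10.0-693.2.2.el7.x86_64", EXPECTED)
print("el7-4 before the upgrade needs reboot:", old_kernel_mismatch)

# The same check against whatever host this sketch happens to run on.
print("this host needs reboot:", needs_reboot(platform.release(), EXPECTED))
```

Comparing against the running release rather than the installed package is the point of the check: a node can have the new kernel installed and still be running the old one until it is rebooted.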

            jhoblitt Joshua Hoblitt added a comment -

            There is still a deep backlog from the weekend. I'm going to leave the 2 additional nodes I brought online this morning running until tomorrow. I will probably start running parts of the weekly in the early evening.

            I've added absolute (in queue or running) timeouts to:

            • stack-os-matrix (24 hours – I'm guessing that users will give up / abandon a job after a day but this value can be adjusted as necessary)

            I've also added absolute (in queue or running) timeouts to, and disabled concurrent building of, these periodically triggered jobs in order to prevent multiple builds from accumulating in the backlog (some other periodic jobs already have absolute timeouts in place):

            • sqre/infrastructure/update_cmirror
            • sqre/infrastructure/jenkins-node-cleanup
            • science-pipelines/lsst_distrib
            • qserv/dax_webserv
            • qserv/qserv_distrib
            • sims/lsst_sims

            I've also adjusted the absolute timeouts on the nightly/weekly release jobs.

            • release/nightly-release (23 hours – to prevent adding to an existing backlog)
            • release/weekly-release (48 hours – in case there is already a deep backlog)
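
To make the timeout terminology concrete: an absolute timeout starts counting when a build enters the queue, so it also covers builds that never get a node. The sketch below only illustrates that rule; the `Build` class and `should_abort` function are hypothetical and do not reflect how Jenkins or these job definitions implement it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Build:
    queued_at: datetime
    started_at: Optional[datetime] = None  # None while still waiting in the queue


def should_abort(build, now, absolute_limit=None, runtime_limit=None):
    """Decide whether a build has exceeded either kind of timeout."""
    # Absolute timeout: measured from the moment the build was queued.
    if absolute_limit is not None and now - build.queued_at > absolute_limit:
        return True
    # Runtime timeout: only applies once the build is actually executing.
    if (
        runtime_limit is not None
        and build.started_at is not None
        and now - build.started_at > runtime_limit
    ):
        return True
    return False


now = datetime(2017, 11, 27, 12, 0)
stuck_in_queue = Build(queued_at=now - timedelta(hours=24))

# With the 23 hour absolute limit used for nightly-release, a build that has
# waited 24 hours in the queue is dropped even though it never started.
nightly_dropped = should_abort(stuck_in_queue, now, absolute_limit=timedelta(hours=23))
# A runtime-only limit would leave the same build sitting in the backlog.
runtime_only = should_abort(stuck_in_queue, now, runtime_limit=timedelta(hours=23))
print(nightly_dropped, runtime_only)
```

This is why the nightly limit is just under a day: a queued nightly is discarded before the next one is triggered instead of adding to the backlog.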
            jhoblitt Joshua Hoblitt added a comment -

            I discovered that a build of what should have been a trivial job had hung for several days (tying up a node), so I bit the bullet and implemented runtime timeouts for most jobs and added explicit labels to node blocks for trivial actions. The latter should help complex pipelines finish faster rather than being blocked waiting on a fast node to do a simple action such as log archiving.

            jhoblitt Joshua Hoblitt added a comment - - edited

            The _47 weekly release finished last night. The git tag, eups tag, eupspkg, tarballs, and docker image have all been published.

            It appears there were more dockerd / xfs errors last night on nodes other than el7-4. Further work on debugging that problem will be on DM-12854.

            Summary of work:

            • The build failures on el7-4 were investigated and a puppet profile was written to enforce both the installed and running kernel versions – the current latest kernel version has been configured.
            • The el7-4 ec2 instance had to be forcibly stopped and restarted to return it to service; other nodes still need to be restarted to update the kernel version – this should be part of DM-12854.
            • Two additional jenkins nodes were provisioned to help reduce the build backlog that piled up over the long weekend in order to get the weekly release out in a timely manner.
            • Absolute (in queue or running) and/or runtime timeouts were added to virtually all jenkins jobs in an attempt to reduce backlog "snowballing" in the future and to keep hung builds from tying up execution resources until they are noticed and manually killed by an operator.
            • Trivial pipeline operations, across several jobs, were changed to not require the use of a high-end build node – this was intended to help builds finish faster (in terms of wall clock time) when there is a backlog.
            • The w_2017_47 weekly-release build was manually "replayed" and edited to preserve the eups distrib tag that was originally generated.

              People

              • Assignee:
                jhoblitt Joshua Hoblitt
                Reporter:
                jhoblitt Joshua Hoblitt
                Watchers:
                John Swinbank, Joshua Hoblitt, Kian-Tat Lim, Tim Jenness