  Data Management / DM-16459

TSSW Jenkins Test servers not responsive - 502 Bad Gateway error


    Details

      Description

      The TSSW Jenkins service, https://ts-ci.lsst.codes/blue/organizations/jenkins, is once again non-responsive. This is most likely related to DM-16180.

        Attachments

          Issue Links

            Activity

            jhoblitt Joshua Hoblitt added a comment -

            The jenkins daemon is still running, with an RSS of 2.1 GiB, but it is unresponsive via HTTP; the swarm agent logs show the agents are also disconnected.
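
            For reference, a minimal sketch of that kind of check, assuming Python 3 and shell access on the master; the URL path and PID below are illustrative placeholders (the PID would come from pgrep/ps), not values recorded in this ticket:

            import urllib.request
            import urllib.error

            JENKINS_URL = "https://ts-ci.lsst.codes/login"   # any lightweight page will do
            JENKINS_PID = 18689                              # placeholder; confirm with pgrep java

            def http_alive(url, timeout=10):
                """Return the HTTP status code, or None if the request is refused/times out."""
                try:
                    with urllib.request.urlopen(url, timeout=timeout) as resp:
                        return resp.status
                except urllib.error.HTTPError as exc:
                    # A 502 from nginx still means the proxy answered; only the backend is down.
                    return exc.code
                except OSError:
                    # URLError, ConnectionRefusedError and socket timeouts all subclass OSError.
                    return None

            def rss_gib(pid):
                """Read VmRSS (kB) from /proc/<pid>/status and convert to GiB."""
                with open(f"/proc/{pid}/status") as fh:
                    for line in fh:
                        if line.startswith("VmRSS:"):
                            return int(line.split()[1]) / (1024 * 1024)
                return None

            print("HTTP status:", http_alive(JENKINS_URL))
            print("RSS (GiB):  ", rss_gib(JENKINS_PID))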

            jenkins log:

            Nov 07, 2018 11:20:55 AM org.jenkinsci.plugins.github.webhook.subscriber.DefaultPushGHEventSubscriber onEvent
            INFO: Received PushEvent for https://github.com/lsst-ts/robotframework_ts_xml from 192.30.252.35 ⇒ https://ts-ci.lsst.codes:8080/github-webhook/
            Nov 07, 2018 11:20:55 AM org.jenkinsci.plugins.github.webhook.subscriber.DefaultPushGHEventSubscriber$1 run
            INFO: Poked ts_xml
            Nov 07, 2018 11:20:56 AM com.cloudbees.jenkins.GitHubPushTrigger$1 run
            INFO: SCM changes detected in ts_xml. Triggering #881
            Nov 07, 2018 11:23:30 AM hudson.model.Run execute
            INFO: ts_xml #881 main build action completed: SUCCESS
            Nov 07, 2018 11:31:11 AM org.jenkinsci.plugins.github.webhook.subscriber.DefaultPushGHEventSubscriber onEvent
            INFO: Received PushEvent for https://github.com/lsst-ts/robotframework_ts_xml from 192.30.252.34 ⇒ https://ts-ci.lsst.codes:8080/github-webhook/
            Nov 07, 2018 11:31:11 AM org.jenkinsci.plugins.github.webhook.subscriber.DefaultPushGHEventSubscriber$1 run
            INFO: Poked ts_xml
            Nov 07, 2018 11:31:11 AM com.cloudbees.jenkins.GitHubPushTrigger$1 run
            INFO: SCM changes detected in ts_xml. Triggering #882
            Nov 07, 2018 11:33:46 AM hudson.model.Run execute
            INFO: ts_xml #882 main build action completed: SUCCESS
            Nov 07, 2018 11:47:27 AM hudson.model.AsyncPeriodicWork$1 run
            INFO: Started Download metadata
            

            Nothing looks too suspicious in the recent log entries. In particular, the JVM heap errors haven't occurred since the 16th.

            Logs from one of the swarm clients, with timestamps presumably in UTC, suggest that the jenkins master went down very recently.

            2018-11-07 19:47:30.391+0000 [id=39183] INFO    h.remoting.jnlp.Main$CuiListener#status: Terminated
            2018-11-07 19:47:30.396+0000 [id=1]     INFO    hudson.plugins.swarm.Client#run: Retrying in 10 seconds
            2018-11-07 19:47:40.399+0000 [id=1]     SEVERE  hudson.plugins.swarm.Client#run: IOexception occurred
            java.net.ConnectException: Connection refused (Connection refused)
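
            The "Connection refused" above is the same symptom a bare TCP probe would show from the agent side. A minimal sketch, assuming Python 3; 8080 is the web/webhook port visible in the master log above, while the JNLP port the swarm clients actually dial is configured separately and is not recorded here:

            import socket

            def port_open(host, port, timeout=5):
                """Return True if a TCP connect succeeds, False on refusal or timeout."""
                try:
                    with socket.create_connection((host, port), timeout=timeout):
                        return True
                except OSError:
                    return False

            # 8080 comes from the webhook URL in the master log; add the JNLP agent
            # port here once it is confirmed from the Jenkins global security config.
            for port in (8080,):
                state = "open" if port_open("ts-ci.lsst.codes", port) else "refused/unreachable"
                print(port, state)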
            

            The GC logs do show some strange messages:

            ts-jenkins-gc.tar.xz

            172817.706: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: humongous allocation request failed, allocation request: 2422040 bytes]
             172817.709: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 6194987008 bytes, allocation request: 2420592 bytes, threshold: 2817943110 bytes (45.00 %), source: concurrent humongous allocation]
             172817.709: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: humongous allocation request failed, allocation request: 2420592 bytes]
             172817.712: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 6199181312 bytes, allocation request: 2420472 bytes, threshold: 2819830545 bytes (45.00 %), source: concurrent humongous allocation]
             172817.712: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: humongous allocation request failed, allocation request: 2420472 bytes]
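
            For scale, these events can be tallied straight from the unpacked GC logs. A minimal sketch, assuming Python 3 and that the uncompressed file is named gc.log (a guess; the real files are in ts-jenkins-gc.tar.xz above), with the regex keyed to the G1Ergonomics lines quoted here:

            import re
            from collections import Counter

            GC_LOG = "gc.log"   # placeholder name for a file extracted from ts-jenkins-gc.tar.xz

            # Matches e.g. "172817.706: [G1Ergonomics (Heap Sizing) attempt heap expansion,
            # reason: humongous allocation request failed, allocation request: 2422040 bytes]"
            PAT = re.compile(
                r"(?P<ts>\d+\.\d+): \[G1Ergonomics \(Heap Sizing\) attempt heap expansion, "
                r"reason: humongous allocation request failed, "
                r"allocation request: (?P<size>\d+) bytes\]"
            )

            def summarize(path):
                """Count failed humongous-allocation expansions per hour of JVM uptime."""
                per_hour = Counter()
                sizes = []
                with open(path) as fh:
                    for line in fh:
                        m = PAT.search(line)
                        if m:
                            per_hour[int(float(m["ts"]) // 3600)] += 1
                            sizes.append(int(m["size"]))
                return per_hour, sizes

            per_hour, sizes = summarize(GC_LOG)
            print("failed expansions per JVM-uptime hour:", dict(per_hour))
            if sizes:
                print("largest request: %.1f MiB" % (max(sizes) / 2**20))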
            

            GCeasy.io says that the peak usage over 21 days was 4.75 GiB. The kernel OOM killer was triggered on Oct 23rd but not recently:

            [476680.163297] yum-cron[26535]: segfault at 24 ip 00007f1aca2930da sp 00007ffd8192f970 error 6 in libpython2.7.so.1.0[7f1aca209000+17e000]
            [594426.648251] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
            [594426.650797] java cpuset=/ mems_allowed=0
            [594426.652120] CPU: 3 PID: 18711 Comm: java Kdump: loaded Not tainted 3.10.0-862.11.6.el7.x86_64 #1
            [594426.655009] Hardware name: Amazon EC2 c5.xlarge/, BIOS 1.0 10/16/2017
            [594426.657015] Call Trace:
            [594426.657757]  [<ffffffff8a5135d4>] dump_stack+0x19/0x1b
            [594426.659263]  [<ffffffff8a50e79f>] dump_header+0x90/0x229
            [594426.660770]  [<ffffffff8a0dc63b>] ? cred_has_capability+0x6b/0x120
            [594426.662531]  [<ffffffff89f9ac64>] oom_kill_process+0x254/0x3d0
            [594426.664167]  [<ffffffff8a0dc70c>] ? selinux_capable+0x1c/0x40
            [594426.665796]  [<ffffffff89f9b4a6>] out_of_memory+0x4b6/0x4f0
            [594426.667525]  [<ffffffff8a50f2a3>] __alloc_pages_slowpath+0x5d6/0x724
            [594426.669462]  [<ffffffff89fa17f5>] __alloc_pages_nodemask+0x405/0x420
            [594426.671165]  [<ffffffff89febf98>] alloc_pages_current+0x98/0x110
            [594426.672984]  [<ffffffff89f97057>] __page_cache_alloc+0x97/0xb0
            [594426.676366]  [<ffffffff89f99758>] filemap_fault+0x298/0x490
            [594426.679640]  [<ffffffffc022485f>] xfs_filemap_fault+0x5f/0xe0 [xfs]
            [594426.683250]  [<ffffffff89fc352a>] __do_fault.isra.58+0x8a/0x100
            [594426.686798]  [<ffffffff89fc3adc>] do_read_fault.isra.60+0x4c/0x1b0
            [594426.690433]  [<ffffffff89fc8484>] handle_pte_fault+0x2f4/0xd10
            [594426.693824]  [<ffffffff89f06db5>] ? futex_wake_op+0x4a5/0x610
            [594426.697230]  [<ffffffff89fcae3d>] handle_mm_fault+0x39d/0x9b0
            [594426.700807]  [<ffffffff8a520557>] __do_page_fault+0x197/0x4f0
            [594426.704260]  [<ffffffff8a520996>] trace_do_page_fault+0x56/0x150
            [594426.707726]  [<ffffffff8a51ff22>] do_async_page_fault+0x22/0xf0
            [594426.711300]  [<ffffffff8a51c788>] async_page_fault+0x28/0x30
            [594426.714734] Mem-Info:
            [594426.717267] active_anon:1830371 inactive_anon:10422 isolated_anon:0
             active_file:183 inactive_file:1736 isolated_file:0
             unevictable:0 dirty:3 writeback:0 unstable:0
             slab_reclaimable:6290 slab_unreclaimable:9216
             mapped:7935 shmem:10461 pagetables:5281 bounce:0
             free:27841 free_pcp:494 free_cma:0
            [594426.740948] Node 0 DMA free:15908kB min:140kB low:172kB high:208kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
            [594426.767444] lowmem_reserve[]: 0 2830 7454 7454
            [594426.771436] Node 0 DMA32 free:48172kB min:25624kB low:32028kB high:38436kB active_anon:2789688kB inactive_anon:9872kB active_file:244kB inactive_file:3504kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129320kB managed:2898888kB mlocked:0kB dirty:4kB writeback:0kB mapped:6788kB shmem:9924kB slab_reclaimable:8436kB slab_unreclaimable:13880kB kernel_stack:1216kB pagetables:8056kB unstable:0kB bounce:0kB free_pcp:872kB local_pcp:228kB free_cma:0kB writeback_tmp:0kB pages_scanned:120 all_unreclaimable? yes
            [594426.797479] lowmem_reserve[]: 0 0 4623 4623
            [594426.801086] Node 0 Normal free:41676kB min:41816kB low:52268kB high:62724kB active_anon:4531944kB inactive_anon:31816kB active_file:488kB inactive_file:3392kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4878336kB managed:4734284kB mlocked:0kB dirty:8kB writeback:0kB mapped:25208kB shmem:31920kB slab_reclaimable:17684kB slab_unreclaimable:27068kB kernel_stack:3184kB pagetables:13068kB unstable:0kB bounce:0kB free_pcp:1376kB local_pcp:268kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
            [594426.828216] lowmem_reserve[]: 0 0 0 0
            [594426.831913] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
            [594426.841541] Node 0 DMA32: 138*4kB (UEM) 21*8kB (UEM) 335*16kB (UEM) 225*32kB (UEM) 95*64kB (UEM) 44*128kB (UEM) 32*256kB (UEM) 12*512kB (E) 5*1024kB (EM) 0*2048kB 0*4096kB = 44448kB
            [594426.853382] Node 0 Normal: 152*4kB (M) 376*8kB (UEM) 537*16kB (UEM) 306*32kB (UEM) 147*64kB (UEM) 53*128kB (UEM) 11*256kB (EM) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 42032kB
            [594426.865166] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
            [594426.871371] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
            [594426.877597] 12588 total pagecache pages
            [594426.880567] 0 pages in swap cache
            [594426.883444] Swap cache stats: add 0, delete 0, find 0/0
            [594426.886771] Free swap  = 0kB
            [594426.889533] Total swap = 0kB
            [594426.892276] 2005912 pages RAM
            [594426.895012] 0 pages HighMem/MovableOnly
            [594426.898001] 93642 pages reserved
            [594426.900796] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
            [594426.906771] [  375]     0   375    39519    11090      54        0             0 systemd-journal
            [594426.912933] [  420]     0   420    11779      471      24        0         -1000 systemd-udevd
            [594426.919081] [  459]     0   459    13877      115      27        0         -1000 auditd
            [594426.925097] [  524]     0   524     6594       81      19        0             0 systemd-logind
            [594426.931278] [  526]     0   526     5381       60      14        0             0 irqbalance
            [594426.937228] [  527]    81   527    16588      175      32        0          -900 dbus-daemon
            [594426.943327] [  530]     0   530    50325      134      38        0             0 gssproxy
            [594426.948953] [  541]   999   541   134609     1627      58        0             0 polkitd
            [594426.954539] [  785]     0   785    26849      500      50        0             0 dhclient
            [594426.960142] [  844]   994   844    28293       49      12        0             0 jenkins-slave-r
            [594426.965943] [  845]     0   845   143454     2764      96        0             0 tuned
            [594426.972799] [  846]   995   846    74361      413      88        0             0 gmetad
            [594426.978739] [  848]   994   848  1411673    64356     276        0             0 java
            [594426.984740] [  850]     0   850    71155     1073      88        0             0 php-fpm
            [594426.990780] [  976]     0   976    68731     7144      65        0             0 rsyslogd
            [594426.996858] [  983]     0   983    28203      258      59        0         -1000 sshd
            [594427.002775] [  987]     0   987    31571      161      19        0             0 crond
            [594427.008747] [  988]     0   988    27522       34       9        0             0 agetty
            [594427.014756] [  990]     0   990    27522       33      10        0             0 agetty
            [594427.020744] [ 1036]     0  1036    24949      274      26        0             0 nginx
            [594427.026403] [ 1037]   992  1037    25316      578      29        0             0 nginx
            [594427.031907] [ 1440]    99  1440    57650      324      32        0             0 gmond
            [594427.037896] [23185]    38 23185     7485      155      20        0             0 ntpd
            [594427.043938] [23361]     0 23361   156811     8361      92        0          -500 dockerd
            [594427.050129] [23368]     0 23368   112478     3128      41        0          -500 docker-containe
            [594427.056838] [23560]    48 23560    71155     1023      84        0             0 php-fpm
            [594427.062942] [23561]    48 23561    71155     1024      84        0             0 php-fpm
            [594427.069067] [23562]    48 23562    71155     1024      84        0             0 php-fpm
            [594427.075054] [23563]    48 23563    71155     1024      84        0             0 php-fpm
            [594427.080954] [23564]    48 23564    71155     1024      84        0             0 php-fpm
            [594427.086972] [18689]   993 18689  2594730  1741923    3559        0             0 java
            [594427.092931] [32181]   993 32181    29124       40      12        0             0 git
            [594427.099031] [32185]   993 32185    29124       41      13        0             0 git
            [594427.104797] [32189]   993 32189    39448      832      50        0             0 git-remote-http
            [594427.110666] [32190]   993 32190    39448      768      48        0             0 git-remote-http
            [594427.116486] Out of memory: Kill process 18689 (java) score 912 or sacrifice child
            [594427.122473] Killed process 32185 (git) total-vm:116496kB, anon-rss:164kB, file-rss:0kB, shmem-rss:0kB
            [1169178.256959] yum-cron[10408]: segfault at 24 ip 00007f6f8776d0da sp 00007ffe4d58db80 error 6 in libpython2.7.so.1.0[7f6f876e3000+17e000]
            

            We've never seen this type of deadlock with the DM jenkins instance, so I'm suspicious that this is plugin related.
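
            If the hang recurs, a thread dump taken while the JVM is still alive should show whether a plugin thread is holding a lock. A minimal sketch, assuming Python 3 and a JDK (for jstack) on the master; the PID is the one from the dmesg output above and would need to be re-confirmed with pgrep:

            import os
            import signal
            import subprocess
            import sys
            import time

            JENKINS_PID = 18689   # PID seen in the dmesg excerpt; re-check before use

            def thread_dump_via_jstack(pid, out="jenkins-threads.txt"):
                """Capture a thread dump with jstack; -l adds java.util.concurrent
                lock details, which is what exposes deadlocks."""
                with open(out, "w") as fh:
                    subprocess.run(["jstack", "-l", str(pid)], stdout=fh, check=True)
                return out

            def thread_dump_via_sigquit(pid, count=3, interval=10):
                """Fallback when jstack cannot attach: SIGQUIT makes the JVM print a
                thread dump to its own stdout (the jenkins log) without stopping it.
                Several dumps, spaced out, show whether the same threads stay stuck."""
                for _ in range(count):
                    os.kill(pid, signal.SIGQUIT)
                    time.sleep(interval)

            try:
                print("wrote", thread_dump_via_jstack(JENKINS_PID))
            except (FileNotFoundError, subprocess.CalledProcessError) as exc:
                print("jstack failed (%s); falling back to SIGQUIT" % exc, file=sys.stderr)
                thread_dump_via_sigquit(JENKINS_PID)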

            The TMS_Simulator job had a label restricting it to running on jenkins-master, which I have removed.

            The Jenkins master has been restarted and appears to be running normally. As there isn't a known triggering event to reproduce the problem, I'm not sure if there's any other action that can be taken at this time.

            rbovill Rob Bovill added a comment -

            Jenkins is back up and running. I am concerned that we don't know why it goes down.

            Also, I would like access to restart the Jenkins master myself, so I don't always have to rely on DM to handle these situations.
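
            For what it's worth, Jenkins itself exposes a POST /safeRestart endpoint for accounts with the Administer permission, which would avoid needing shell access at all. A minimal sketch, assuming Python 3 and an API token; the username and token are placeholders, and the CSRF-crumb fetch is included on the assumption that this instance enforces crumbs (it is harmless if it doesn't):

            import base64
            import json
            import urllib.request

            BASE = "https://ts-ci.lsst.codes"
            USER, TOKEN = "rbovill", "xxxxxxxx"   # placeholders, not real credentials

            def _auth():
                return "Basic " + base64.b64encode(f"{USER}:{TOKEN}".encode()).decode()

            def safe_restart():
                """POST /safeRestart: lets running builds finish, then restarts the master."""
                crumb_req = urllib.request.Request(
                    f"{BASE}/crumbIssuer/api/json",
                    headers={"Authorization": _auth()})
                with urllib.request.urlopen(crumb_req, timeout=30) as resp:
                    crumb = json.load(resp)
                restart_req = urllib.request.Request(
                    f"{BASE}/safeRestart",
                    method="POST",
                    headers={"Authorization": _auth(),
                             crumb["crumbRequestField"]: crumb["crumb"]})
                with urllib.request.urlopen(restart_req, timeout=30) as resp:
                    print("restart requested, HTTP", resp.status)

            safe_restart()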

            rbovill Rob Bovill added a comment -

            Adding Wil O'Mullane and Kevin Robison [X] as watchers.


              People

              Assignee:
              jhoblitt Joshua Hoblitt
              Reporter:
              rbovill Rob Bovill
              Reviewers:
              Rob Bovill
              Watchers:
              Andy Clements, Joshua Hoblitt, Kevin Robison [X] (Inactive), Rob Bovill, Simon Krughoff, Wil O'Mullane

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.