Data Management / DM-13578

Plot the node utilization for RC1 Reprocessed Jobs


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Create node utilization vs. time plots for the following reprocessing weeks and Slurm job IDs (a sketch of how such a plot can be derived from sacct data follows the list):

      w_2017_25: 75450, 75452, 75456, 75451, 75461, 75453, 74961, 74963, 74962, 74965, 74964, 75611, 75686, 75704, 74936, 74938, 74941, 74937, 74940, 74939, 74931, 74934, 74932, 74935, 74933, 74945, 74946, 74948, 74949, 74951, 74952, 74953, 74955, 74956, 74958, 74959, 74960, 75042, 75431, 75432, 75433, 74944

      w_2017_27: 78843, 78841, 78845, 78375, 78847, 78377, 78877, 78844, 78380, 78855, 78848, 78376, 78370, 78846, 78379, 78372, 78838, 78378, 78371, 78840, 78374, 78839, 78373, 78842 (cannot find logs from mosaic)

      w_2017_28: 77637, 77871, 78079, 78072, 77872, 78099, 78080, 77813, 77815, 77814, 77817, 77816, 78300, 77881, 77631, 77633, 77636, 77632, 77635, 77634, 77626, 77629, 77627, 77630, 77628, 77801, 77803, 77802, 77805, 77804, 77806, 77808, 77807, 77810, 77809, 77795, 77797, 77796, 77798, 77799, 77800

      w_2017_30: 79563, 79561, 79565, 79432, 79567, 79434, 80036, 79564, 79437, 79742, 79568, 79433, 79427, 79566, 79436, 79429, 79558, 79435, 79428, 79560, 79431, 79559, 79430, 79562 (cannot find logs from mosaic) 

      w_2017_32: 81995, 82071, 82073, 82075, 82072, 82076, 82074, 82066, 82068, 82067, 82070, 82069, 82078, 82077, 82042, 82044, 82047, 82043, 82046, 82045, 82037, 82040, 82038, 82041, 82039, 82054, 82056, 82055, 82058, 82057, 82059, 82061, 82060, 82063, 82062, 82048, 82050, 82049, 82051, 82052, 82053

      w_2017_34: 84984, 84986, 84985, 84988, 84987, 84989, 84991, 84990, 84993, 84992, 84978, 84980, 84979, 85011, 84982, 84983, 85110, 85112, 85114, 85111, 85215, 85113, 85105, 85107, 85106, 85109, 85108, 85542, 85687, 85541, 84808, 84810, 84813, 84809, 84812, 84811, 84803, 84806, 84804, 84807, 84805, 84814

      w_2017_36: 87154, 87661, 87663, 87662, 87665, 87664, 87666, 87668, 87667, 87670, 87669, 87655, 87657, 87656, 87658, 87659, 87660, 87689, 87691, 87693, 87690, 87694, 87692, 87684, 87686, 87685, 87688, 87687, 87777, 88034, 87776, 87142, 87144, 87147, 87143, 87146, 87145, 87137, 87140, 87138, 87141, 87139

      w_2017_38: 90037, 90144, 90146, 90145, 90401, 90147, 90149, 90151, 90150, 90400, 90152, 90138, 90140, 90139, 90399, 90142, 90143, 90410, 90412, 90417, 90411, 90424, 90416, 90196, 90198, 90197, 90409, 90199, 90428, 90453, 90427, 90029, 90031, 90034, 90030, 90379, 90032, 90024, 90027, 90035, 90378, 90026

      w_2017_40: 92456, 92981, 92983, 92985, 92982, 92986, 92984, 92976, 92978, 92977, 92980, 92979, 93000, 93029, 92999, 92449, 92451, 92454, 92450, 92453, 92452, 92444, 92447, 92445, 92448, 92446, 92963, 92965, 92964, 92967, 92966, 92968, 92970, 92969, 92972, 92971, 92957, 92959, 92958, 92960, 92961, 92962

      w_2017_42: 94858, 94909, 94911, 94910, 94913, 94912, 94914, 94916, 94915, 94918, 94917, 94903, 94905, 94904, 94906, 94907, 94908, 95124, 95126, 95128, 95125, 95129, 95127, 95119, 95121, 95120, 95123, 95122, 95211, 95210, 94852, 94854, 94857, 94853, 94856, 94855, 94847, 94850, 94848, 94851, 94849

      w_2017_44: 99690, 99741, 99743, 99745, 99742, 99746, 99744, 99736, 99738, 99737, 99740, 99739, 100062, 99752, 99751, 99696, 99698, 99701, 99697, 99700, 99699, 99691, 99694, 99692, 99695, 99693, 99725, 99727, 99726, 99729, 99728, 99730, 99732, 99731, 99734, 99733, 99719, 99721, 99720, 99722, 99723, 99724

      w_2017_46: 102428, 102541, 102543, 102542, 102545, 102544, 102546, 102548, 102547, 102550, 102549, 102535, 102537, 102536, 102538, 102539, 102540, 103850, 103852, 103854, 103851, 103855, 103853, 103839, 103841, 103840, 103843, 103842, 104331, 104149, 102422, 102424, 102427, 102423, 102426, 102425, 102411, 102414, 102412, 102415, 102413

      w_2017_48: 105786, 105903, 105905, 105907, 105904, 105908, 105906, 105909, 105911, 105910, 105913, 105912, 106226, 106225, 105827, 105782, 105785, 105781, 105784, 105783, 105853, 105771, 105769, 105779, 105770, 105892, 105894, 105893, 105896, 105895, 105897, 105899, 105898, 105901, 105900, 105854, 105856, 105855, 105857, 105858, 105859

      w_2017_50: 106683, 106691, 106693, 106692, 106695, 106694, 106696, 106698, 106697, 106700, 106699, 106685, 106687, 106686, 106688, 106689, 106690, 106708, 106710, 106712, 106709, 106713, 106711, 106703, 106705, 106704, 106707, 106706, 106861, 106855, 106621, 106623, 106626, 106622, 106625, 106624, 106616, 106619, 106617, 106620, 106618

      w_2017_52: DM-12982

      w_2018_02: 107589, 107591, 107593, 107590, 107596, 107592, 107584, 107586, 107585, 107588, 107587, 107598, 107597, 107558, 107560, 107563, 107559, 107562, 107561, 107553, 107556, 107554, 107557, 107555, 107574, 107576, 107575, 107578, 107577, 107579, 107581, 107580, 107583, 107582, 107568, 107570, 107569, 107571, 107572, 107573

      w_2018_03: DM-13463
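
      For reference, a minimal sketch (not the attached usage.py) of how a node-count vs. time curve can be derived from Slurm's accounting data. The sacct flags are real; the plotting details and the two sample job IDs at the bottom are only illustrative:

      #!/usr/bin/env python
      """Sketch: plot total allocated nodes vs. time for a set of Slurm job IDs.

      This is NOT the attached usage.py; it is a minimal illustration that
      assumes sacct is on the PATH and that Start/End/NNodes are populated.
      """
      import subprocess
      from datetime import datetime

      import matplotlib.pyplot as plt

      def fetch_jobs(job_ids):
          """Query sacct for the allocation records (-X) of the given jobs."""
          out = subprocess.check_output(
              ["sacct", "-n", "-P", "-X",
               "-j", ",".join(str(j) for j in job_ids),
               "--format=JobID,NNodes,Start,End"],
              text=True)
          fmt = "%Y-%m-%dT%H:%M:%S"
          jobs = []
          for line in out.splitlines():
              _jobid, nnodes, start, end = line.split("|")
              if "Unknown" in (start, end):
                  continue  # job never started or is still running
              jobs.append((datetime.strptime(start, fmt),
                           datetime.strptime(end, fmt),
                           int(nnodes)))
          return jobs

      def node_usage(jobs):
          """Turn (start, end, nnodes) intervals into a step function of total nodes."""
          events = []
          for start, end, nnodes in jobs:
              events.append((start, nnodes))   # nodes come into use at start
              events.append((end, -nnodes))    # and are released at end
          events.sort(key=lambda e: e[0])
          times, totals, running = [], [], 0
          for t, delta in events:
              running += delta
              times.append(t)
              totals.append(running)
          return times, totals

      if __name__ == "__main__":
          times, totals = node_usage(fetch_jobs([75450, 75452]))  # two w_2017_25 IDs
          plt.step(times, totals, where="post")
          plt.xlabel("Time")
          plt.ylabel("Nodes in use")
          plt.savefig("usage_sketch.png")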

       

        Attachments

        1. usage_v2.py (5 kB)
        2. usage_w_2017_25.png (36 kB)
        3. usage_w_2017_27.png (35 kB)
        4. usage_w_2017_28.png (33 kB)
        5. usage_w_2017_30.png (36 kB)
        6. usage_w_2017_32.png (34 kB)
        7. usage_w_2017_34.png (38 kB)
        8. usage_w_2017_36.png (33 kB)
        9. usage_w_2017_38.png (35 kB)
        10. usage_w_2017_40.png (36 kB)
        11. usage_w_2017_42.png (34 kB)
        12. usage_w_2017_44.png (37 kB)
        13. usage_w_2017_46.png (36 kB)
        14. usage_w_2017_48.png (33 kB)
        15. usage_w_2017_50.png (35 kB)
        16. usage_w_2017_52.png (37 kB)
        17. usage_w_2018_02.png (36 kB)
        18. usage_w_2018_03.png (33 kB)
        19. usage_w42_err.out (0.3 kB)


            Activity

            sthrush Samantha Thrush added a comment -

            I've attached all of the plots requested above. Overall, the plots look as expected: node usage increases over time (running multiBandDriver on the Cosmos data set now takes more nodes than it did in the past, among other contributing factors). The increase is clearest when comparing the results for w_2017_25 and w_2018_03 (usage_w_2017_25.png and usage_w_2018_03.png).

            However, when I tried to run usage.py on the w_2017_42 job set, I noticed that sacct gave the following errors (also included in usage_w42_err.out):

            JobID JobName NNodes Elapsed State ExitCode
            ------------ ---------- -------- ---------- ---------- --------
            Conflicting JOB_STEP record for jobstep 95210.0 at line 263284 -- ignoring it
            Conflicting JOB_STEP record for jobstep 95211.0 at line 263288 -- ignoring it
            Conflicting JOB_TERMINATED record (COMPLETED) for job 95210 at line 263355 -- ignoring it
            Conflicting JOB_TERMINATED record (COMPLETED) for job 95211 at line 263359 -- ignoring it
            95210 mtWide 3 00:04:28 NODE_FAIL 127:0
            95210.0 hydra_pmi+ 3 00:04:27 FAILED 7:0
            95210.1 hydra_pmi+ 3 07:44:48 COMPLETED 0:0
            95211 mtCosmos 4 00:04:04 NODE_FAIL 127:0
            95211.0 hydra_pmi+ 4 00:04:04 FAILED 7:0
            95211.1 hydra_pmi+ 4 10:10:13 COMPLETED 0:0
            

            Initially, this caused usage.py to halt, but after removing those two jobs from the jobID set and rerunning, I was able to create the plot. Note that I did not encounter these errors with any of the other jobID sets listed above.

            However, this doesn't solve the core problem: the mtWide and mtCosmos jobs actually completed successfully, yet the errors above kept them out of the usage.py data (and dropping them gives an unrealistically low view of node usage in this case). To resolve this, I'm modifying usage.py slightly so that it can grab the needed information for those two jobIDs despite the error messages. Once that's done, I'll upload the new w_2017_42 plot as well as the modified usage.py.
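
            For illustration, one way a script like usage.py could tolerate these records (the actual fix is in the attached usage_v2.py; this is only a sketch, assuming the same pipe-delimited six-field sacct query shown above):

            def well_formed_rows(raw_lines):
                """Keep only rows that split into the expected six fields.
                sacct's 'Conflicting ... record' warnings are free text and
                will not, so they are silently dropped here."""
                for line in raw_lines:
                    parts = line.split("|")
                    if len(parts) == 6:  # JobID|JobName|NNodes|Elapsed|State|ExitCode
                        yield parts

            def pick_record(rows_for_job):
                """Given all rows for one job ID (e.g. 95210, 95210.0, 95210.1),
                prefer the allocation record; if it ended in NODE_FAIL, fall
                back to the last COMPLETED step (e.g. 95210.1)."""
                alloc = next(r for r in rows_for_job if "." not in r[0])
                if alloc[4] != "NODE_FAIL":
                    return alloc
                completed = [r for r in rows_for_job if r[4] == "COMPLETED"]
                return completed[-1] if completed else alloc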

            sthrush Samantha Thrush added a comment - edited

            After some fiddling, I have made the new (correct) plot for w_2017_42 (attached as usage_w_2017_42.png).

            As I had suspected, the mtWide and mtCosmos runs were marked as NODE_FAIL but then went on to COMPLETED because those jobs were originally submitted on 24 Oct 2017, when lsst-dev01 was experiencing the GPFS outage. The outage apparently delayed the jobs and left them marked "NODE_FAIL", but once lsst-dev01 was back online at 8pm that night, the jobs ran to completion.

            I've attached the modified version of usage.py (usage_v2.py), but be warned: while it does exactly what I want in this case (as long as I input the 95210.1 and 95211.1 job steps instead of just 95210 and 95211), it is still a bit sloppy and should not be run on other failed jobs unless they failed in exactly the same way as above. I'll keep tinkering with it to make sure it handles only the cases where the jobs ultimately completed correctly.

            sthrush Samantha Thrush added a comment -

            As Hsin-Fang mentioned, the plots might be clearer as step-type plots. Modified step-type versions of the plots have been added.
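
            For reference, the step rendering corresponds to matplotlib's plt.step; a minimal sketch with made-up numbers (the actual plotting code is in the attached usage_v2.py):

            import matplotlib.pyplot as plt

            # Hypothetical data: total nodes in use after each start/end event.
            times = [0, 1, 2, 5, 7]   # hours since the first job started
            nodes = [3, 7, 4, 4, 0]

            # where="post" holds each value until the next event, matching how a
            # job's node allocation stays constant between its start and end.
            plt.step(times, nodes, where="post")
            plt.xlabel("Time (hours)")
            plt.ylabel("Nodes in use")
            plt.savefig("step_example.png")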


              People

              Assignee:
              sthrush Samantha Thrush
              Reporter:
              sthrush Samantha Thrush
              Watchers:
              Hsin-Fang Chiang, Samantha Thrush

