Data Management / DM-9541

Bug related to MPI pickling when running coaddDriver


Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Component: ctrl_pool

    Description

      As noted in DM-9447, the following error was encountered when running coaddDriver.py on the COSMOS HSC-Y subset of the HSC RC dataset (DM-6816):

      SystemError on tiger-r6c3n9:11271 in reduce: Negative size passed to PyString_FromStringAndSize
      Traceback (most recent call last):
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1-7-gb57f33e+8/python/lsst/ctrl/pool/pool.py", line 113, in wrapper
          return func(*args, **kwargs)
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1-7-gb57f33e+8/python/lsst/ctrl/pool/pool.py", line 237, in wrapper
          return func(*args, **kwargs)
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1-7-gb57f33e+8/python/lsst/ctrl/pool/pool.py", line 747, in reduce
          results = self.comm.gather(None, root=self.root)
        File "MPI/Comm.pyx", line 1281, in mpi4py.MPI.Comm.gather (src/mpi4py.MPI.c:108949)
        File "MPI/msgpickle.pxi", line 664, in mpi4py.MPI.PyMPI_gather (src/mpi4py.MPI.c:47643)
        File "MPI/msgpickle.pxi", line 179, in mpi4py.MPI.Pickle.allocv (src/mpi4py.MPI.c:41800)
        File "MPI/msgpickle.pxi", line 127, in mpi4py.MPI.Pickle.alloc (src/mpi4py.MPI.c:40945)
      SystemError: Negative size passed to PyString_FromStringAndSize
      application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      

      This same processing ran without error on the weekly stack from the first week of January 2017 (I'm not sure of the exact version that was used).

          Activity

            lauren Lauren MacArthur created issue -
            price Paul Price added a comment -

            One possibility is that the size of the pickle has grown too large. Could you please post the command that produced this error? Are there any more clues (from the log, perhaps) about exactly what the data is that is being passed and causing the error?


            lauren Lauren MacArthur added a comment -

            The command I ran was:

            coaddDriver.py /tigress/HSC/HSC --rerun lauren/LSST/$ticket/cosmos_noJunk --job $ticket-cosmos_noJunk-y-coaddDriver --time 600 --cores 96 --batch-type=slurm --mpiexec='-bind-to socket' --id tract=0 filter=HSC-Y --selectId ccd=0..103 visit=$cosmosVisitsY -C /tigress/HSC/users/lauren/DM-6816/noJunk_coadd.py --clobber-versions -c assembleCoadd.doApplyUberCal=False makeCoaddTempExp.doApplyUberCal=False --clobber-config

            where

            $ echo $ticket
            DM-9447

            $ echo $cosmosVisitsY
            274..302:2^306..334:2^342..370:2^1858..1862:2^1868..1882:2^11718..11742:2^22602..22608:2^22626..22632:2^22642..22648:2^22658..22664:2

            $ more /tigress/HSC/users/lauren/DM-6816/noJunk_coadd.py
            config.detectCoaddSources.detection.doTempLocalBackground=False
            config.assembleCoadd.clipDetection.doTempLocalBackground=False

            I don't see anything else related in the logs. You can have a look yourself at:
            /tigress/HSC/users/lauren/DM-9447/DM-9447-cosmos_noJunk-y-coaddDriver.o2979771
            and
            /tigress/HSC/users/lauren/DM-9447/job_DM-9447-cosmos_noJunk-y-coaddDriver.log

            price Paul Price added a comment -

            I believe this is due to the size of the pickle. A workaround for now is to break the tract into patches.
            Fixing this properly probably involves refactoring the coaddDriver.


            lauren Lauren MacArthur added a comment -

            Just FYI, to get around this for my latest RC dataset run, I rewound to c02cc82 to process the HSC-Y COSMOS dataset, so I'm not blocked...but I imagine this will need some kind of resolution by the time of the stack freeze/release for the next major HSC data processing run (and I'm sure you know best what fix will be acceptable for that!)

            price Paul Price added a comment -

            Oh, so the changes I made in DM-5989 caused it!? OK, let me think some more...

            price Paul Price added a comment -

            I think the reason is that instead of transferring individual results back to the master, we're now transferring all results at once. That means the pickle size is larger, and apparently it's become too large.
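
The size argument can be illustrated with a minimal sketch. This is not code from ctrl_pool or from the ticket branch: `pickled_sizes` and the small payloads are hypothetical, chosen so the example runs anywhere. It compares the size of one pickle holding every result with the largest pickle needed when results are sent one at a time.

```python
import pickle


def pickled_sizes(results):
    """Return (size of one pickle holding all results, largest single-result pickle)."""
    all_at_once = len(pickle.dumps(results))
    one_at_a_time = max(len(pickle.dumps(r)) for r in results)
    return all_at_once, one_at_a_time


# Hypothetical stand-in data: ten distinct 1 KiB results instead of multi-gigabyte ones.
results = [bytes([65 + i]) * 1024 for i in range(10)]
all_at_once, one_at_a_time = pickled_sizes(results)
print("all at once: %d bytes, largest single result: %d bytes"
      % (all_at_once, one_at_a_time))
```

The combined pickle grows with the number of results a process holds, while the per-result pickle does not, so batching can push a message past a size limit that individual transfers stay under.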

            price Paul Price made changes -
            Assignee: set to Paul Price [ price ]
            price Paul Price added a comment -

            I have a fix on tickets/DM-9541 of ctrl_pool. lauren, would you mind giving it a try and reviewing the patch? I have demonstrated that the Pool produces fewer problems with this patch when transferring large pickles (see below). The failure mode for my test isn't exactly the same as what you saw (OverflowError: integer 4831838248 does not fit in 'int' instead of SystemError: Negative size passed to PyString_FromStringAndSize) but I believe the cause of these two failure modes is the same (the difference being due to the size of the pickle: whether it's very big or very very big) and I'm therefore hopeful that it should fix the problem you identified as well.

            pprice@lsst-dev01:~/LSST/ctrl/pool[master] $ cat ~/test_dm-9541.py 
            #!/usr/bin/env python
             
            from lsst.ctrl.pool.pool import Debugger, Pool, startPool, NODE
             
            Debugger().enabled = True
             
            SIZE = 2**30 + 2**29
            NUM = 10
             
            def func(index):
                    print "Processing %d on %s" % (index, NODE)
                    return "X"*SIZE
             
            def main():
                    indices = list(range(NUM))
                    pool = Pool(None)
                    results = pool.map(func, indices)
                    print len(results), [len(rr) for rr in results]
                    pool.exit()
             
             
            if __name__ == "__main__":
                    startPool()
                    main()
             
            pprice@lsst-dev01:~/LSST/ctrl/pool[tickets/DM-9541] $ srun -N 2 --ntasks-per-node=2 -I --pty bash
            pprice@lsst-verify-worker04:~/LSST/ctrl/pool[tickets/DM-9541] $ mpiexec python ~/test_dm-9541.py 
            Master: command reduce
            Slave 3: waiting for command from 0
            Slave 1: waiting for command from 0
            Master: instruct
            Slave 1: command reduce
            Slave 2: waiting for command from 0
            Slave 1: waiting for instruction
            Slave 3: command reduce
            Slave 2: command reduceSlave 1: waiting for job
             
            Slave 3: waiting for instruction
            Slave 2: waiting for instruction
            Master: scatter initial jobs
            Slave 3: waiting for job
            Slave 2: waiting for job
            Slave 2: running job
            Processing 1 on lsst-verify-worker05:56007
            Processing 0 on lsst-verify-worker04:72783Slave 1: running job
             
            Slave 3: running job
            Processing 2 on lsst-verify-worker05:56008
            Slave 2: waiting for job
            Master: gather from slave 2
            Master: send job to slave 3 2
            Processing 3 on lsst-verify-worker05:56007
            Slave 2: running job
            Slave 1: waiting for job
            Master: gather from slave 1
            Master: send job to slave 4 1
            Processing 4 on lsst-verify-worker04:72783
            Slave 1: running job
            Slave 3: waiting for job
            Master: gather from slave 3
            Master: send job to slave 5 3
            Slave 3: running job
            Processing 5 on lsst-verify-worker05:56008
            Slave 2: waiting for job
            Master: gather from slave 2
            Master: send job to slave 6 2
            Processing 6 on lsst-verify-worker05:56007
            Slave 2: running job
            Slave 1: waiting for job
            Master: gather from slave 1
            Master: send job to slave 7 1
            Processing 7 on lsst-verify-worker04:72783
            Slave 1: running job
            Slave 3: waiting for job
            Master: gather from slave 3
            Master: send job to slave 8 3
            Slave 3: running job
            Processing 8 on lsst-verify-worker05:56008
            Slave 2: waiting for job
            Master: gather from slave 2
            Master: send job to slave 9 2
            Processing 9 on lsst-verify-worker05:56007
            Slave 2: running job
            Slave 1: waiting for job
            Master: gather from slave 1
            Slave 1: done
            Slave 1: waiting for command from 0
            Slave 3: waiting for job
            Master: gather from slave 3
            Slave 3: done
            Slave 3: waiting for command from 0
            Slave 2: waiting for job
            Master: gather from slave 2
            Master: done
            10 [1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736]
            Master: command exit
            Slave 1: command exit
            Slave 1: exiting
            Slave 2: done
            Slave 2: waiting for command from 0
            Slave 2: command exit
            Slave 2: exiting
            Slave 3: command exit
            Slave 3: exiting
            pprice@lsst-verify-worker04:~/LSST/ctrl/pool[tickets/DM-9541] $ git co master
            Switched to branch 'master'
            Your branch is up-to-date with 'origin/master'.
            pprice@lsst-verify-worker04:~/LSST/ctrl/pool[master] $ mpiexec python ~/test_dm-9541.py 
            Slave 2: waiting for command from 0
            Slave 1: waiting for command from 0
            Slave 3: waiting for command from 0
            Master: command reduce
            Master: instruct
            Slave 2: command reduce
            Slave 1: command reduce
            Slave 1: waiting for instruction
            Slave 3: command reduce
            Slave 1: waiting for job
            Slave 2: waiting for instruction
            Master: scatter initial jobsSlave 3: waiting for instruction
             
            Slave 2: waiting for job
            Slave 3: waiting for job
            Processing 0 on lsst-verify-worker04:72948
            Processing 1 on lsst-verify-worker05:56118
            Slave 1: running job
            Slave 2: running job
            Processing 2 on lsst-verify-worker05:56119
            Slave 3: running job
            Slave 2: waiting for job
            Master: gather from slave 2
            Master: send job to slave 3 2
            Slave 1: waiting for job
            Processing 3 on lsst-verify-worker05:56118
            Slave 2: running job
            Master: gather from slave 1
            Master: send job to slave 4 1
            Slave 1: running job
            Processing 4 on lsst-verify-worker04:72948
            Slave 3: waiting for job
            Master: gather from slave 3
            Master: send job to slave 5 3
            Slave 3: running job
            Processing 5 on lsst-verify-worker05:56119
            Slave 2: waiting for job
            Master: gather from slave 2
            Master: send job to slave 6 2
            Processing 6 on lsst-verify-worker05:56118
            Slave 2: running job
            Slave 1: waiting for job
            Slave 3: waiting for job
            Master: gather from slave 3
            Master: send job to slave 7 3
            Master: gather from slave 1
            Master: send job to slave 8 1
            Processing 8 on lsst-verify-worker04:72948
            Slave 1: running job
            Processing 7 on lsst-verify-worker05:56119
            Slave 3: running job
            Slave 2: waiting for job
            Master: gather from slave 2
            Master: send job to slave 9 2
            Processing 9 on lsst-verify-worker05:56118
            Slave 2: running job
            Slave 3: waiting for job
            Slave 1: waiting for job
            Master: gather from slave 3
            Master: gather from slave 1
            Slave 2: waiting for job
            Master: gather from slave 2
            OverflowError on lsst-verify-worker04:72948 in run: integer 4831838248 does not fit in 'int'
            Traceback (most recent call last):
              File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 113, in wrapper
                return func(*args, **kwargs)
              File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 1071, in run
                while not menu[command]():
              File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 237, in wrapper
                return func(*args, **kwargs)
              File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 1098, in reduce
                self.comm.gather(out, root=self.root)
              File "MPI/Comm.pyx", line 1281, in mpi4py.MPI.Comm.gather (src/mpi4py.MPI.c:108949)
              File "MPI/msgpickle.pxi", line 659, in mpi4py.MPI.PyMPI_gather (src/mpi4py.MPI.c:47570)
              File "MPI/msgpickle.pxi", line 119, in mpi4py.MPI.Pickle.dump (src/mpi4py.MPI.c:40840)
              File "MPI/msgbuffer.pxi", line 35, in mpi4py.MPI.downcast (src/mpi4py.MPI.c:29070)
            OverflowError: integer 4831838248 does not fit in 'int'
            application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
            

            price Paul Price made changes -
            Reviewers: set to Lauren MacArthur [ lauren ]
            Status: To Do [ 10001 ] -> In Review [ 10004 ]
            price Paul Price added a comment -

            Ah, I got one with the same failure mode that you saw, and my ticket fixes it too:

            pprice@lsst-dev01:~/LSST/ctrl/pool[master] $ mpiexec -n 3 python ~/test_dm-9541.py 
            Processing 0 on lsst-dev01.ncsa.illinois.edu:1293245
            Processing 1 on lsst-dev01.ncsa.illinois.edu:1293246
            Processing 2 on lsst-dev01.ncsa.illinois.edu:1293245
            Processing 3 on lsst-dev01.ncsa.illinois.edu:1293246
            Processing 5 on lsst-dev01.ncsa.illinois.edu:1293245
            Processing 4 on lsst-dev01.ncsa.illinois.edu:1293246
            Processing 6 on lsst-dev01.ncsa.illinois.edu:1293245
            Processing 7 on lsst-dev01.ncsa.illinois.edu:1293246
            Processing 9 on lsst-dev01.ncsa.illinois.edu:1293245
            Processing 8 on lsst-dev01.ncsa.illinois.edu:1293246
            SystemError on lsst-dev01.ncsa.illinois.edu:1293244 in reduce: Negative size passed to PyString_FromStringAndSize
            Traceback (most recent call last):
              File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 113, in wrapper
                return func(*args, **kwargs)
              File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 237, in wrapper
                return func(*args, **kwargs)
              File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 747, in reduce
                results = self.comm.gather(None, root=self.root)
              File "MPI/Comm.pyx", line 1281, in mpi4py.MPI.Comm.gather (src/mpi4py.MPI.c:108949)
              File "MPI/msgpickle.pxi", line 664, in mpi4py.MPI.PyMPI_gather (src/mpi4py.MPI.c:47643)
              File "MPI/msgpickle.pxi", line 179, in mpi4py.MPI.Pickle.allocv (src/mpi4py.MPI.c:41800)
              File "MPI/msgpickle.pxi", line 127, in mpi4py.MPI.Pickle.alloc (src/mpi4py.MPI.c:40945)
            SystemError: Negative size passed to PyString_FromStringAndSize
            application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
             
            pprice@lsst-dev01:~/LSST/ctrl/pool[master] $ git co tickets/DM-9541
            Switched to branch 'tickets/DM-9541'
            Your branch is up-to-date with 'origin/tickets/DM-9541'.
            pprice@lsst-dev01:~/LSST/ctrl/pool[tickets/DM-9541] $ mpiexec -n 3 python ~/test_dm-9541.py 
            Processing 0 on lsst-dev01.ncsa.illinois.edu:1293942
            Processing 1 on lsst-dev01.ncsa.illinois.edu:1293943
            Processing 2 on lsst-dev01.ncsa.illinois.edu:1293943
            Processing 3 on lsst-dev01.ncsa.illinois.edu:1293942
            Processing 4 on lsst-dev01.ncsa.illinois.edu:1293943
            Processing 5 on lsst-dev01.ncsa.illinois.edu:1293942
            Processing 6 on lsst-dev01.ncsa.illinois.edu:1293943
            Processing 7 on lsst-dev01.ncsa.illinois.edu:1293942
            Processing 8 on lsst-dev01.ncsa.illinois.edu:1293943
            Processing 9 on lsst-dev01.ncsa.illinois.edu:1293942
            10 [536870912, 536870912, 536870912, 536870912, 536870912, 536870912, 536870912, 536870912, 536870912, 536870912]
            

            The difference in the script is that:

            SIZE = 2**29
            NUM = 10
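
The numbers in the two test runs are consistent with a signed 32-bit size limit, as the arithmetic sketch below shows. It is an illustration only: `fits_in_c_int` and `wrap_int32` are hypothetical helpers, and the ticket does not show how mpi4py actually arrives at a negative size, so the wrap-around is just one plausible reading.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit C int


def fits_in_c_int(nbytes):
    """True if nbytes can be stored in a signed 32-bit int."""
    return -2**31 <= nbytes <= INT_MAX


def wrap_int32(nbytes):
    """Value left after truncating nbytes to a signed 32-bit int (two's complement)."""
    return (nbytes + 2**31) % 2**32 - 2**31


# First run (SIZE = 2**30 + 2**29): the failing process, lsst-verify-worker04:72948,
# logged jobs 0, 4 and 8, so it held three results when it called gather.
reported = 4831838248                  # integer quoted in the OverflowError
three_results = 3 * (2**30 + 2**29)
print(reported - three_results)        # remainder; presumably pickle overhead (assumption)
print(fits_in_c_int(reported))

# Second run (SIZE = 2**29, NUM = 10, mpiexec -n 3): each of the two slaves
# logged five jobs, so each held five results.
five_results = 5 * 2**29
print(fits_in_c_int(five_results))
print(wrap_int32(five_results))        # negative once truncated to 32 bits
```

In both runs the amount of data held by one slave exceeds `INT_MAX`, which matches the suggestion above that the two errors share a cause and differ only in how far past the limit the pickle is.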
            

            price Paul Price added a comment -

            lauren asked if someone more familiar with ctrl_pool might look at this. hchiang2, would you mind? Lauren will verify that this fixes her particular case.

            price Paul Price made changes -
            Reviewers: Lauren MacArthur [ lauren ] -> Hsin-Fang Chiang [ hchiang2 ]
            Status: In Review [ 10004 ] -> In Review [ 10004 ]

            hchiang2 Hsin-Fang Chiang added a comment -

            The code changes look fine to me, although I don't fully understand how pool works. I didn't run any tests.

            hchiang2 Hsin-Fang Chiang made changes -
            Status: In Review [ 10004 ] -> Reviewed [ 10101 ]

            lauren Lauren MacArthur added a comment -

            I am running the test now (the original command where I bumped into this). All looks good so far. I'll post again once it has run to completion.

            swinbank John Swinbank made changes -
            Epic Link: set to DM-8299 [ 27821 ]
            price Paul Price added a comment -

            lauren, did the test work?

            price Paul Price made changes -
            Link: This issue blocks DM-9870 [ DM-9870 ]

            lauren Lauren MacArthur added a comment -

            Yes

            price Paul Price added a comment -

            Awesome!

            Merged to master.

            Thanks, all!

            price Paul Price made changes -
            Resolution: set to Done [ 10000 ]
            Status: Reviewed [ 10101 ] -> Done [ 10002 ]

            People

              Assignee: Paul Price [ price ]
              Reporter: Lauren MacArthur [ lauren ]
              Reviewers: Hsin-Fang Chiang
              Watchers (5): Hsin-Fang Chiang, John Swinbank, Lauren MacArthur, Paul Price, Pim Schellart [X] (Inactive)
              Votes: 0
