Details

Type: Bug
Status: Done
Resolution: Done
Fix Version/s: None
Component/s: ctrl_pool
Labels: None
Epic Link: None
Team: Data Release Production
Description

As noted in DM-9447, the following error was encountered when running coaddDriver.py on the COSMOS HSC-Y subset of the HSC RC dataset (DM-6816):

SystemError on tiger-r6c3n9:11271 in reduce: Negative size passed to PyString_FromStringAndSize
Traceback (most recent call last):
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1-7-gb57f33e+8/python/lsst/ctrl/pool/pool.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1-7-gb57f33e+8/python/lsst/ctrl/pool/pool.py", line 237, in wrapper
    return func(*args, **kwargs)
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1-7-gb57f33e+8/python/lsst/ctrl/pool/pool.py", line 747, in reduce
    results = self.comm.gather(None, root=self.root)
  File "MPI/Comm.pyx", line 1281, in mpi4py.MPI.Comm.gather (src/mpi4py.MPI.c:108949)
  File "MPI/msgpickle.pxi", line 664, in mpi4py.MPI.PyMPI_gather (src/mpi4py.MPI.c:47643)
  File "MPI/msgpickle.pxi", line 179, in mpi4py.MPI.Pickle.allocv (src/mpi4py.MPI.c:41800)
  File "MPI/msgpickle.pxi", line 127, in mpi4py.MPI.Pickle.alloc (src/mpi4py.MPI.c:40945)
SystemError: Negative size passed to PyString_FromStringAndSize
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
This same processing ran through without error on the weekly stack from the first week of January 2017 (I'm not sure of the exact version that was used).
Attachments

Issue Links

blocks DM-9870 Release hscPipe 5.0-beta1 (Done)
Activity

One possibility is that the size of the pickle has grown too large. Could you please post the command that produced this error? Are there any more clues (from the log, perhaps) about exactly what data is being passed that is causing the error?
The command I ran was:
coaddDriver.py /tigress/HSC/HSC --rerun lauren/LSST/$ticket/cosmos_noJunk --job $ticket-cosmos_noJunk-y-coaddDriver --time 600 --cores 96 --batch-type=slurm --mpiexec='-bind-to socket' --id tract=0 filter=HSC-Y --selectId ccd=0..103 visit=$cosmosVisitsY -C /tigress/HSC/users/lauren/DM-6816/noJunk_coadd.py --clobber-versions -c assembleCoadd.doApplyUberCal=False makeCoaddTempExp.doApplyUberCal=False --clobber-config
where

$ echo $ticket
DM-9447

$ echo $cosmosVisitsY
274..302:2^306..334:2^342..370:2^1858..1862:2^1868..1882:2^11718..11742:2^22602..22608:2^22626..22632:2^22642..22648:2^22658..22664:2

$ more /tigress/HSC/users/lauren/DM-6816/noJunk_coadd.py
config.detectCoaddSources.detection.doTempLocalBackground=False
config.assembleCoadd.clipDetection.doTempLocalBackground=False
I don't see anything else related in the logs. You can have a look yourself at:
/tigress/HSC/users/lauren/DM-9447/DM-9447-cosmos_noJunk-y-coaddDriver.o2979771
and
/tigress/HSC/users/lauren/DM-9447/job_DM-9447-cosmos_noJunk-y-coaddDriver.log
I believe this is due to the size of the pickle. A workaround for now is to break the tract into patches.
Fixing this properly probably involves refactoring the coaddDriver.
Just FYI, to get around this for my latest RC dataset run, I rewound to c02cc82 to process the HSC-Y COSMOS dataset, so I'm not blocked...but I imagine this will need some kind of resolution by the time of the stack freeze/release for the next major HSC data processing run (and I'm sure you know best what fix will be acceptable for that!)
Oh, so the changes I made in DM-5989 caused it!? OK, let me think some more...
I think the reason is that instead of transferring individual results back to the master, we're now transferring all results at once. That means the pickle size is larger, and apparently it's become too large.
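To make that concrete, here is a minimal mpi4py sketch contrasting the two transfer patterns. This is not the actual ctrl_pool code; RESULT and N_JOBS_PER_SLAVE are made-up names for illustration:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
RESULT = b"X" * (64 * 1024 * 1024)  # stand-in for one job's result
N_JOBS_PER_SLAVE = 3

# Old pattern: each result travels back to the master individually,
# so no single pickled message is larger than one result.
if rank == 0:
    results = []
    for worker in range(1, comm.Get_size()):
        for _ in range(N_JOBS_PER_SLAVE):
            results.append(comm.recv(source=worker))
else:
    for _ in range(N_JOBS_PER_SLAVE):
        comm.send(RESULT, dest=0)

# New pattern: each slave accumulates all of its results and returns
# them in a single gather, so the pickled message now scales with the
# number of jobs per slave and can outgrow the byte count that fits
# in MPI's C int. The root contributes None to the gather.
out = None if rank == 0 else [RESULT] * N_JOBS_PER_SLAVE
gathered = comm.gather(out, root=0)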
I have a fix on tickets/DM-9541 of ctrl_pool. Lauren MacArthur, would you mind giving it a try and reviewing the patch? I have demonstrated that the Pool handles large pickles much better with this patch (see below). The failure mode in my test isn't exactly the same as what you saw (OverflowError: integer 4831838248 does not fit in 'int' instead of SystemError: Negative size passed to PyString_FromStringAndSize), but I believe the two failures have the same cause, the difference being the size of the pickle (very big versus very, very big), so I'm hopeful the patch will fix the problem you identified as well.
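The patch itself is on the ticket branch; purely as an illustration of the kind of dodge that works, one generic way around the C int limit is to split the pickled payload across several messages. The helpers below (send_large / recv_large) are hypothetical names, not functions from ctrl_pool or mpi4py:

import pickle

from mpi4py import MPI

CHUNK = 512 * 1024 * 1024  # keep each message well under 2**31 - 1 bytes

def send_large(comm, obj, dest, tag=0):
    # Pickle once, then ship the bytes in int-sized chunks.
    data = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    comm.send(len(chunks), dest=dest, tag=tag)
    for chunk in chunks:
        comm.Send([chunk, MPI.BYTE], dest=dest, tag=tag)

def recv_large(comm, source, tag=0):
    # Reassemble the chunks and unpickle.
    numChunks = comm.recv(source=source, tag=tag)
    pieces = []
    for _ in range(numChunks):
        status = MPI.Status()
        comm.Probe(source=source, tag=tag, status=status)
        buf = bytearray(status.Get_count(MPI.BYTE))
        comm.Recv([buf, MPI.BYTE], source=source, tag=tag)
        pieces.append(bytes(buf))
    return pickle.loads(b"".join(pieces))

Whether or not the actual patch does exactly this, the principle is the same: no single MPI message should carry more bytes than fit in a C int.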
pprice@lsst-dev01:~/LSST/ctrl/pool[master] $ cat ~/test_dm-9541.py
#!/usr/bin/env python
from lsst.ctrl.pool.pool import Debugger, Pool, startPool, NODE

Debugger().enabled = True

SIZE = 2**30 + 2**29
NUM = 10


def func(index):
    print "Processing %d on %s" % (index, NODE)
    return "X"*SIZE


def main():
    indices = list(range(NUM))
    pool = Pool(None)
    results = pool.map(func, indices)
    print len(results), [len(rr) for rr in results]
    pool.exit()


if __name__ == "__main__":
    startPool()
    main()
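With SIZE = 2**30 + 2**29, each result is a single string of 1610612736 bytes (1.5 GiB); those are exactly the lengths reported at the end of the successful run on the ticket branch below.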
pprice@lsst-dev01:~/LSST/ctrl/pool[tickets/DM-9541] $ srun -N 2 --ntasks-per-node=2 -I --pty bash
pprice@lsst-verify-worker04:~/LSST/ctrl/pool[tickets/DM-9541] $ mpiexec python ~/test_dm-9541.py
Master: command reduce
Slave 3: waiting for command from 0
Slave 1: waiting for command from 0
Master: instruct
Slave 1: command reduce
Slave 2: waiting for command from 0
Slave 1: waiting for instruction
Slave 3: command reduce
Slave 2: command reduce
Slave 1: waiting for job
Slave 3: waiting for instruction
Slave 2: waiting for instruction
Master: scatter initial jobs
Slave 3: waiting for job
Slave 2: waiting for job
Slave 2: running job
Processing 1 on lsst-verify-worker05:56007
Processing 0 on lsst-verify-worker04:72783
Slave 1: running job
Slave 3: running job
Processing 2 on lsst-verify-worker05:56008
Slave 2: waiting for job
Master: gather from slave 2
Master: send job to slave 3 2
Processing 3 on lsst-verify-worker05:56007
Slave 2: running job
Slave 1: waiting for job
Master: gather from slave 1
Master: send job to slave 4 1
Processing 4 on lsst-verify-worker04:72783
Slave 1: running job
Slave 3: waiting for job
Master: gather from slave 3
Master: send job to slave 5 3
Slave 3: running job
Processing 5 on lsst-verify-worker05:56008
Slave 2: waiting for job
Master: gather from slave 2
Master: send job to slave 6 2
Processing 6 on lsst-verify-worker05:56007
Slave 2: running job
Slave 1: waiting for job
Master: gather from slave 1
Master: send job to slave 7 1
Processing 7 on lsst-verify-worker04:72783
Slave 1: running job
Slave 3: waiting for job
Master: gather from slave 3
Master: send job to slave 8 3
Slave 3: running job
Processing 8 on lsst-verify-worker05:56008
Slave 2: waiting for job
Master: gather from slave 2
Master: send job to slave 9 2
Processing 9 on lsst-verify-worker05:56007
Slave 2: running job
Slave 1: waiting for job
Master: gather from slave 1
Slave 1: done
Slave 1: waiting for command from 0
Slave 3: waiting for job
Master: gather from slave 3
Slave 3: done
Slave 3: waiting for command from 0
Slave 2: waiting for job
Master: gather from slave 2
Master: done
10 [1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736, 1610612736]
Master: command exit
Slave 1: command exit
Slave 1: exiting
Slave 2: done
Slave 2: waiting for command from 0
Slave 2: command exit
Slave 2: exiting
Slave 3: command exit
Slave 3: exiting
pprice@lsst-verify-worker04:~/LSST/ctrl/pool[tickets/DM-9541] $ git co master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
pprice@lsst-verify-worker04:~/LSST/ctrl/pool[master] $ mpiexec python ~/test_dm-9541.py
Slave 2: waiting for command from 0
Slave 1: waiting for command from 0
Slave 3: waiting for command from 0
Master: command reduce
Master: instruct
Slave 2: command reduce
Slave 1: command reduce
Slave 1: waiting for instruction
Slave 3: command reduce
Slave 1: waiting for job
Slave 2: waiting for instruction
Master: scatter initial jobs
Slave 3: waiting for instruction
Slave 2: waiting for job
Slave 3: waiting for job
Processing 0 on lsst-verify-worker04:72948
Processing 1 on lsst-verify-worker05:56118
Slave 1: running job
Slave 2: running job
Processing 2 on lsst-verify-worker05:56119
Slave 3: running job
Slave 2: waiting for job
Master: gather from slave 2
Master: send job to slave 3 2
Slave 1: waiting for job
Processing 3 on lsst-verify-worker05:56118
Slave 2: running job
Master: gather from slave 1
Master: send job to slave 4 1
Slave 1: running job
Processing 4 on lsst-verify-worker04:72948
Slave 3: waiting for job
Master: gather from slave 3
Master: send job to slave 5 3
Slave 3: running job
Processing 5 on lsst-verify-worker05:56119
Slave 2: waiting for job
Master: gather from slave 2
Master: send job to slave 6 2
Processing 6 on lsst-verify-worker05:56118
Slave 2: running job
Slave 1: waiting for job
Slave 3: waiting for job
Master: gather from slave 3
Master: send job to slave 7 3
Master: gather from slave 1
Master: send job to slave 8 1
Processing 8 on lsst-verify-worker04:72948
Slave 1: running job
Processing 7 on lsst-verify-worker05:56119
Slave 3: running job
Slave 2: waiting for job
Master: gather from slave 2
Master: send job to slave 9 2
Processing 9 on lsst-verify-worker05:56118
Slave 2: running job
Slave 3: waiting for job
Slave 1: waiting for job
Master: gather from slave 3
Master: gather from slave 1
Slave 2: waiting for job
Master: gather from slave 2
OverflowError on lsst-verify-worker04:72948 in run: integer 4831838248 does not fit in 'int'
Traceback (most recent call last):
  File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 1071, in run
    while not menu[command]():
  File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 237, in wrapper
    return func(*args, **kwargs)
  File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 1098, in reduce
    self.comm.gather(out, root=self.root)
  File "MPI/Comm.pyx", line 1281, in mpi4py.MPI.Comm.gather (src/mpi4py.MPI.c:108949)
  File "MPI/msgpickle.pxi", line 659, in mpi4py.MPI.PyMPI_gather (src/mpi4py.MPI.c:47570)
  File "MPI/msgpickle.pxi", line 119, in mpi4py.MPI.Pickle.dump (src/mpi4py.MPI.c:40840)
  File "MPI/msgbuffer.pxi", line 35, in mpi4py.MPI.downcast (src/mpi4py.MPI.c:29070)
OverflowError: integer 4831838248 does not fit in 'int'
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
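Note that 4831838248 = 3 × 1610612736 + 40: slave 1 (process 1, lsst-verify-worker04:72948) had accumulated three 1.5 GiB results (jobs 0, 4 and 8 in the log above), and that pickle (the extra 40 bytes presumably being pickle framing) no longer fits in the 2**31 - 1 byte count of MPI's C int, hence the OverflowError on the sending side.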
Ah, I got one with the same failure mode that you saw, and my ticket fixes it too:
pprice@lsst-dev01:~/LSST/ctrl/pool[master] $ mpiexec -n 3 python ~/test_dm-9541.py
Processing 0 on lsst-dev01.ncsa.illinois.edu:1293245
Processing 1 on lsst-dev01.ncsa.illinois.edu:1293246
Processing 2 on lsst-dev01.ncsa.illinois.edu:1293245
Processing 3 on lsst-dev01.ncsa.illinois.edu:1293246
Processing 5 on lsst-dev01.ncsa.illinois.edu:1293245
Processing 4 on lsst-dev01.ncsa.illinois.edu:1293246
Processing 6 on lsst-dev01.ncsa.illinois.edu:1293245
Processing 7 on lsst-dev01.ncsa.illinois.edu:1293246
Processing 9 on lsst-dev01.ncsa.illinois.edu:1293245
Processing 8 on lsst-dev01.ncsa.illinois.edu:1293246
SystemError on lsst-dev01.ncsa.illinois.edu:1293244 in reduce: Negative size passed to PyString_FromStringAndSize
Traceback (most recent call last):
  File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 237, in wrapper
    return func(*args, **kwargs)
  File "/home/pprice/LSST/ctrl/pool/python/lsst/ctrl/pool/pool.py", line 747, in reduce
    results = self.comm.gather(None, root=self.root)
  File "MPI/Comm.pyx", line 1281, in mpi4py.MPI.Comm.gather (src/mpi4py.MPI.c:108949)
  File "MPI/msgpickle.pxi", line 664, in mpi4py.MPI.PyMPI_gather (src/mpi4py.MPI.c:47643)
  File "MPI/msgpickle.pxi", line 179, in mpi4py.MPI.Pickle.allocv (src/mpi4py.MPI.c:41800)
  File "MPI/msgpickle.pxi", line 127, in mpi4py.MPI.Pickle.alloc (src/mpi4py.MPI.c:40945)
SystemError: Negative size passed to PyString_FromStringAndSize
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

pprice@lsst-dev01:~/LSST/ctrl/pool[master] $ git co tickets/DM-9541
Switched to branch 'tickets/DM-9541'
Your branch is up-to-date with 'origin/tickets/DM-9541'.
pprice@lsst-dev01:~/LSST/ctrl/pool[tickets/DM-9541] $ mpiexec -n 3 python ~/test_dm-9541.py
Processing 0 on lsst-dev01.ncsa.illinois.edu:1293942
Processing 1 on lsst-dev01.ncsa.illinois.edu:1293943
Processing 2 on lsst-dev01.ncsa.illinois.edu:1293943
Processing 3 on lsst-dev01.ncsa.illinois.edu:1293942
Processing 4 on lsst-dev01.ncsa.illinois.edu:1293943
Processing 5 on lsst-dev01.ncsa.illinois.edu:1293942
Processing 6 on lsst-dev01.ncsa.illinois.edu:1293943
Processing 7 on lsst-dev01.ncsa.illinois.edu:1293942
Processing 8 on lsst-dev01.ncsa.illinois.edu:1293943
Processing 9 on lsst-dev01.ncsa.illinois.edu:1293942
10 [536870912, 536870912, 536870912, 536870912, 536870912, 536870912, 536870912, 536870912, 536870912, 536870912]
The difference in the script is that:

SIZE = 2**29
NUM = 10
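In this case each of the two slaves in the -n 3 run accumulates five results of 536870912 bytes (2**29), i.e. 2684354560 bytes plus pickle overhead per slave. That still exceeds 2**31 - 1, but it is small enough that, truncated to a signed 32-bit int, it wraps to a negative number; presumably that is the negative size handed to PyString_FromStringAndSize on the master when it allocates the receive buffer.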
Lauren MacArthur asked if someone more familiar with ctrl_pool might look at this. Hsin-Fang Chiang, would you mind? Lauren will verify that this fixes her particular case.
The code changes look fine to me, although I don't fully understand how pool works. I didn't run any tests.
I am running the test now (the original command where I bumped into this). All looks good so far. I'll post again once it has run to completion.
Lauren MacArthur, did the test work?
Yes
Awesome!
Merged to master.
Thanks, all!