Details
- Type: Bug
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: ctrl_pool, pipe_drivers
- Labels: None
- Story Points: 2
- Epic Link:
- Team: Data Release Production
Description
Lauren MacArthur is running:
coaddDriver.py /tigress/HSC/HSC --rerun lauren/LSST/DM-6816/cosmos --job DM-6816-cosmos-y-coaddDriver --time 100 --cores 96 --batch-type=slurm --mpiexec='-bind-to socket' --id tract=0 filter=HSC-Y --selectId ccd=0..103 filter=HSC-Y visit=274..302:2^306..334:2^342..370:2^1858..1862:2^1868..1882:2^11718..11742:2^22602..22608:2^22626..22632:2^22642..22648:2^22658..22664:2 --batch-submit '--mem-per-cpu 8000'
and it is producing:
OverflowError on tiger-r8c1n12:19889 in map: integer 2155421250 does not fit in 'int'
Traceback (most recent call last):
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 99, in wrapper
    return func(*args, **kwargs)
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 218, in wrapper
    return func(*args, **kwargs)
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 554, in map
    self.comm.scatter(initial, root=self.rank)
  File "MPI/Comm.pyx", line 1286, in mpi4py.MPI.Comm.scatter (src/mpi4py.MPI.c:109079)
  File "MPI/msgpickle.pxi", line 707, in mpi4py.MPI.PyMPI_scatter (src/mpi4py.MPI.c:48114)
  File "MPI/msgpickle.pxi", line 168, in mpi4py.MPI.Pickle.dumpv (src/mpi4py.MPI.c:41672)
  File "MPI/msgbuffer.pxi", line 35, in mpi4py.MPI.downcast (src/mpi4py.MPI.c:29070)
OverflowError: integer 2155421250 does not fit in 'int'
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
We need to fix or work around this problem.
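The number in the error message looks like a byte count that is larger than a signed 32-bit C int can hold, which is what the `downcast` frame in the traceback complains about. A minimal sketch of that arithmetic, assuming the limit is `2**31 - 1`; the helper name `fits_in_c_int` is illustrative and is not part of ctrl_pool or mpi4py:

```python
# Largest value representable in a signed 32-bit C int (assumed to be the
# limit on the message length that is being downcast in the traceback).
INT_MAX = 2**31 - 1


def fits_in_c_int(nbytes):
    """Return True if a message of nbytes bytes can be described by a C int."""
    return 0 <= nbytes <= INT_MAX


reported = 2155421250  # byte count taken from the OverflowError above
print(fits_in_c_int(reported))  # False
print(reported - INT_MAX)       # number of bytes over the assumed limit
```

Under that assumption the payload is only a few megabytes over the limit, so trimming the data sent per message may be enough to avoid the error.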
Activity
Field | Original Value | New Value |
---|---|---|
Watchers | John Swinbank, Lauren MacArthur, Paul Price [ John Swinbank, Lauren MacArthur, Paul Price ] | Fritz Mueller, John Swinbank, Lauren MacArthur, Paul Price [ Fritz Mueller, John Swinbank, Lauren MacArthur, Paul Price ] |
Reviewers | Fritz Mueller [ fritzm ] | |
Status | To Do [ 10001 ] | In Review [ 10004 ] |
Story Points | 5 | 2 |
Reviewers | Fritz Mueller [ fritzm ] | Nate Pease [ npease ] |
Status | In Review [ 10004 ] | In Review [ 10004 ] |
Status | In Review [ 10004 ] | Reviewed [ 10101 ] |
Resolution | Done [ 10000 ] | |
Status | Reviewed [ 10101 ] | Done [ 10002 ] |
Epic Link | | |
Remote Link | This issue links to "Page (Confluence)" [ 14355 ] |
This appears to be a fundamental limitation in mpi4py and Python. Pickling a string of 2**31 bytes fails outright:
$ python -c 'import cPickle as pickle; print len(pickle.dumps("X"*2**31, -1))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
SystemError: error return without exception set
I can reproduce the error with some simple code:
import mpi4py.MPI as mpi
import cPickle as pickle

def main():
    comm = mpi.COMM_WORLD
    rank = comm.Get_rank()
    print rank, "OK"
    if rank == 0:
        size = int(2**31 - 1)  # largest payload whose length still fits in a C int
        string = "X"*size
        print "MASTER:", len(pickle.dumps(string, -1))  # pickled size exceeds 2**31 - 1
    else:
        string = None
    string = comm.bcast(string, root=0)  # fails on the root when the pickle is too large
    print rank, len(string)

if __name__ == "__main__":
    main()
Running this under MPI:
$ mpiexec -n 2 python test.py
0 OK
1 OK
MASTER: 2147483657
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    main()
  File "test.py", line 14, in main
    string = comm.bcast(string, root=0)
  File "MPI/Comm.pyx", line 1276, in mpi4py.MPI.Comm.bcast (src/mpi4py.MPI.c:108819)
  File "MPI/msgpickle.pxi", line 612, in mpi4py.MPI.PyMPI_bcast (src/mpi4py.MPI.c:47005)
  File "MPI/msgpickle.pxi", line 119, in mpi4py.MPI.Pickle.dump (src/mpi4py.MPI.c:40840)
  File "MPI/msgbuffer.pxi", line 35, in mpi4py.MPI.downcast (src/mpi4py.MPI.c:29070)
OverflowError: integer 2147483657 does not fit in 'int'
^C[mpiexec@tiger-sumire] Sending Ctrl-C to processes as requested
[mpiexec@tiger-sumire] Press Ctrl-C again to force abort
I'm going to see if I can work around this and/or reduce the amount of data that's being transferred.
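One possible shape for such a workaround is to pickle the object once on the root and broadcast it in several pieces that each stay well under the limit. The sketch below is only an illustration of that idea and is not necessarily the fix adopted for this ticket. `bcast_large`, `split_bytes` and `DEFAULT_CHUNK` are hypothetical names, and `comm` is assumed to behave like an mpi4py communicator (providing `Get_rank()` and `bcast(obj, root=...)`):

```python
import pickle

# Assumed per-message cap, kept well below 2**31 - 1 so that each chunk
# (payload plus pickle framing) still fits in a C int.
DEFAULT_CHUNK = 2**30


def split_bytes(data, chunk_size):
    """Split a bytes object into pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def bcast_large(comm, obj, root=0, chunk_size=DEFAULT_CHUNK):
    """Broadcast obj as several small messages instead of one large one.

    comm is expected to provide Get_rank() and bcast(obj, root=...),
    as an mpi4py communicator does.
    """
    is_root = comm.Get_rank() == root
    # Only the root holds the data; it pickles once and cuts the result up.
    chunks = split_bytes(pickle.dumps(obj, -1), chunk_size) if is_root else None
    # Tell every rank how many pieces to expect.
    num = comm.bcast(len(chunks) if is_root else None, root=root)
    received = []
    for i in range(num):
        piece = comm.bcast(chunks[i] if is_root else None, root=root)
        received.append(piece)
    # Every rank (including the root) reassembles and unpickles the payload.
    return pickle.loads(b"".join(received))
```

In the reproduction script this would replace the `comm.bcast(string, root=0)` call with `bcast_large(comm, string, root=0)`. It roughly doubles peak memory on each rank while the chunks are joined, and it does not help when the pickling step itself fails, so reducing the amount of data transferred remains the more robust option.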