Details
- Type: Bug
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: ctrl_pool, pipe_drivers
- Labels: None
- Story Points: 2
- Epic Link:
- Team: Data Release Production
Description
Lauren MacArthur is running:
coaddDriver.py /tigress/HSC/HSC --rerun lauren/LSST/DM-6816/cosmos --job DM-6816-cosmos-y-coaddDriver --time 100 --cores 96 --batch-type=slurm --mpiexec='-bind-to socket' --id tract=0 filter=HSC-Y --selectId ccd=0..103 filter=HSC-Y visit=274..302:2^306..334:2^342..370:2^1858..1862:2^1868..1882:2^11718..11742:2^22602..22608:2^22626..22632:2^22642..22648:2^22658..22664:2 --batch-submit '--mem-per-cpu 8000'
and it is producing:

OverflowError on tiger-r8c1n12:19889 in map: integer 2155421250 does not fit in 'int'

Traceback (most recent call last):
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 99, in wrapper
    return func(*args, **kwargs)
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 218, in wrapper
    return func(*args, **kwargs)
  File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 554, in map
    self.comm.scatter(initial, root=self.rank)
  File "MPI/Comm.pyx", line 1286, in mpi4py.MPI.Comm.scatter (src/mpi4py.MPI.c:109079)
  File "MPI/msgpickle.pxi", line 707, in mpi4py.MPI.PyMPI_scatter (src/mpi4py.MPI.c:48114)
  File "MPI/msgpickle.pxi", line 168, in mpi4py.MPI.Pickle.dumpv (src/mpi4py.MPI.c:41672)
  File "MPI/msgbuffer.pxi", line 35, in mpi4py.MPI.downcast (src/mpi4py.MPI.c:29070)
OverflowError: integer 2155421250 does not fit in 'int'
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
We need to fix or work around this problem.
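For context, the overflowing value is larger than a signed 32-bit int (2155421250 > 2**31 - 1 = 2147483647), which is the type mpi4py must downcast the pickled message size to. A rough way to confirm that the scattered payload is the culprit is to measure the pickle size of the list passed to scatter before the call; the snippet below is only a diagnostic sketch, not part of the pipeline (the name dataList stands in for whatever is handed to Comm.scatter, and building the full pickle can itself be slow and memory-hungry):

import pickle

INT_MAX = 2**31 - 1  # largest message size that fits in a C int

def scatter_payload_fits(dataList):
    """Check whether pickling the entire dataList stays under the C int limit.

    Protocol 2 is used here because it is the highest pickle protocol
    available under Python 2.
    """
    size = len(pickle.dumps(dataList, 2))
    print("pickled scatter payload: %d bytes (limit %d)" % (size, INT_MAX))
    return size <= INT_MAX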
Fritz Mueller, since you're interested, would you be willing to review this fix?
price@price-laptop:~/LSST/ctrl/pool (tickets/DM-8021=) $ git sub-patch
commit 9df40a36febd5bbcfd3842356097aaa3ba79860f
Author: Paul Price <price@astro.princeton.edu>
Date: Tue Oct 18 20:19:00 2016 -0400
Comm: add alternate version of scatter
The default version apparently pickles the entire 'dataList', which
can cause errors if the pickle size grows over 2^31 bytes due to
fundamental problems with pickle in python 2 [1][2] (causing, e.g.,
"OverflowError: integer 2155421250 does not fit in 'int'"). Instead,
we send the data to each slave node in turn; this reduces the pickle
size.
[1] http://bugs.python.org/issue11564
[2] https://www.python.org/dev/peps/pep-3154/
diff --git a/python/lsst/ctrl/pool/pool.py b/python/lsst/ctrl/pool/pool.py
index 59e1277..ce3d663 100644
--- a/python/lsst/ctrl/pool/pool.py
+++ b/python/lsst/ctrl/pool/pool.py
@@ -300,6 +300,30 @@ class Comm(mpi.Intracomm):
with PickleHolder(value):
return super(Comm, self).bcast(value, root=root)
+ def scatter(self, dataList, root=0, tag=0):
+ """Scatter data across the nodes
+
+ The default version apparently pickles the entire 'dataList',
+ which can cause errors if the pickle size grows over 2^31 bytes
+ due to fundamental problems with pickle in python 2. Instead,
+ we send the data to each slave node in turn; this reduces the
+ pickle size.
+
+ @param dataList List of data to distribute; one per node
+ (including root)
+ @param root Index of root node
+ @param tag Message tag (integer)
+ @return Data for this node
+ """
+ if self.Get_rank() == root:
+ for rank, data in enumerate(dataList):
+ if rank == root:
+ continue
+ self.send(data, rank, tag=tag)
+ return dataList[root]
+ else:
+ return self.recv(source=root, tag=tag)
+
def Free(self):
if self._barrierComm is not None:
self._barrierComm.Free()
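For reference, the same point-to-point pattern can be exercised standalone with plain mpi4py; this is only an illustrative sketch (the names scatter_by_send, payload, and demo.py are made up here, and the script assumes it is launched under MPI, e.g. "mpiexec -n 4 python demo.py"):

from mpi4py import MPI

def scatter_by_send(comm, dataList, root=0, tag=0):
    """Send each rank its element individually rather than pickling the
    whole dataList in one collective, keeping each message well under
    the 2**31-byte limit."""
    if comm.Get_rank() == root:
        for rank, data in enumerate(dataList):
            if rank == root:
                continue
            comm.send(data, dest=rank, tag=tag)
        return dataList[root]
    return comm.recv(source=root, tag=tag)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # Only the root rank needs to build the full list of per-node payloads.
    if comm.Get_rank() == 0:
        payload = [list(range(r + 1)) for r in range(comm.Get_size())]
    else:
        payload = None
    mine = scatter_by_send(comm, payload)
    print("rank %d received %d elements" % (comm.Get_rank(), len(mine)))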