Details

    • Type: Bug
    • Status: Done
    • Priority: Major
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ctrl_pool, pipe_drivers
    • Labels: None
    • Templates:
    • Story Points: 2
    • Epic Link:
    • Team: Data Release Production

      Description

      Lauren MacArthur is running:

      coaddDriver.py /tigress/HSC/HSC --rerun lauren/LSST/DM-6816/cosmos --job DM-6816-cosmos-y-coaddDriver --time 100 --cores 96 --batch-type=slurm --mpiexec='-bind-to socket' --id tract=0 filter=HSC-Y --selectId ccd=0..103 filter=HSC-Y visit=274..302:2^306..334:2^342..370:2^1858..1862:2^1868..1882:2^11718..11742:2^22602..22608:2^22626..22632:2^22642..22648:2^22658..22664:2 --batch-submit '--mem-per-cpu 8000'
      

      and it is producing:

      OverflowError on tiger-r8c1n12:19889 in map: integer 2155421250 does not fit in 'int'
      Traceback (most recent call last):
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 99, in wrapper
          return func(*args, **kwargs)
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 218, in wrapper
          return func(*args, **kwargs)
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 554, in map
          self.comm.scatter(initial, root=self.rank)
        File "MPI/Comm.pyx", line 1286, in mpi4py.MPI.Comm.scatter (src/mpi4py.MPI.c:109079)
        File "MPI/msgpickle.pxi", line 707, in mpi4py.MPI.PyMPI_scatter (src/mpi4py.MPI.c:48114)
        File "MPI/msgpickle.pxi", line 168, in mpi4py.MPI.Pickle.dumpv (src/mpi4py.MPI.c:41672)
        File "MPI/msgbuffer.pxi", line 35, in mpi4py.MPI.downcast (src/mpi4py.MPI.c:29070)
      OverflowError: integer 2155421250 does not fit in 'int'
      application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      

      We need to fix or work around this problem.
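
      To illustrate the failure mode, here is a minimal sketch (not part of the ticket) of the pattern that triggers it, assuming mpi4py is installed and the script is launched under mpirun. The payload below is deliberately tiny; in the real run the aggregate pickle of the scattered list exceeded 2^31 bytes, which Python 2's int-sized pickle length fields cannot represent:

      # sketch.py -- run with e.g. "mpirun -n 4 python sketch.py"
      # Hypothetical reproduction sketch; payload sizes are scaled down for illustration.
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()
      size = comm.Get_size()

      # The root builds one entry per rank. comm.scatter serialises the whole
      # list on the root, so the total pickle size grows with the number of
      # ranks times the per-entry payload; once it passes 2**31 bytes under
      # Python 2, the OverflowError shown above is raised.
      dataList = [b"x" * 1024 for _ in range(size)] if rank == 0 else None

      mine = comm.scatter(dataList, root=0)
      print("rank %d received %d bytes" % (rank, len(mine)))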

        Issue Links

          Activity

          Paul Price added a comment -

          Fritz Mueller, since you're interested, would you be willing to review this fix?

          price@price-laptop:~/LSST/ctrl/pool (tickets/DM-8021=) $ git sub-patch
          commit 9df40a36febd5bbcfd3842356097aaa3ba79860f
          Author: Paul Price <price@astro.princeton.edu>
          Date:   Tue Oct 18 20:19:00 2016 -0400
           
              Comm: add alternate version of scatter
              
              The default version apparently pickles the entire 'dataList', which
              can cause errors if the pickle size grows over 2^31 bytes due to
              fundamental problems with pickle in python 2 [1][2] (causing, e.g.,
              "OverflowError: integer 2155421250 does not fit in 'int'"). Instead,
              we send the data to each slave node in turn; this reduces the pickle
              size.
              
              [1] http://bugs.python.org/issue11564
              [2] https://www.python.org/dev/peps/pep-3154/
           
          diff --git a/python/lsst/ctrl/pool/pool.py b/python/lsst/ctrl/pool/pool.py
          index 59e1277..ce3d663 100644
          --- a/python/lsst/ctrl/pool/pool.py
          +++ b/python/lsst/ctrl/pool/pool.py
          @@ -300,6 +300,30 @@ class Comm(mpi.Intracomm):
                   with PickleHolder(value):
                       return super(Comm, self).bcast(value, root=root)
           
          +    def scatter(self, dataList, root=0, tag=0):
          +        """Scatter data across the nodes
          +
          +        The default version apparently pickles the entire 'dataList',
          +        which can cause errors if the pickle size grows over 2^31 bytes
          +        due to fundamental problems with pickle in python 2. Instead,
          +        we send the data to each slave node in turn; this reduces the
          +        pickle size.
          +
          +        @param dataList  List of data to distribute; one per node
          +            (including root)
          +        @param root  Index of root node
          +        @param tag  Message tag (integer)
          +        @return  Data for this node
          +        """
          +        if self.Get_rank() == root:
          +            for rank, data in enumerate(dataList):
          +                if rank == root:
          +                    continue
          +                self.send(data, rank, tag=tag)
          +            return dataList[root]
          +        else:
          +            return self.recv(source=root, tag=tag)
          +
               def Free(self):
                   if self._barrierComm is not None:
                       self._barrierComm.Free()
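
           For reference, the point-to-point approach in the patch can be exercised outside ctrl_pool with a small standalone sketch; the scatter_p2p function below is hypothetical and simply mirrors the per-rank send/recv loop added to Comm.scatter, assuming mpi4py is available:

           # p2p_scatter.py -- run with e.g. "mpirun -n 4 python p2p_scatter.py"
           # Hypothetical standalone mirror of the patched Comm.scatter, for illustration only.
           from mpi4py import MPI

           def scatter_p2p(comm, dataList, root=0, tag=0):
               """Send one element of dataList to each rank via point-to-point
               messages, so only one element is pickled per message instead of
               the whole list at once."""
               if comm.Get_rank() == root:
                   for rank, data in enumerate(dataList):
                       if rank == root:
                           continue
                       comm.send(data, dest=rank, tag=tag)
                   return dataList[root]
               return comm.recv(source=root, tag=tag)

           if __name__ == "__main__":
               comm = MPI.COMM_WORLD
               payload = None
               if comm.Get_rank() == 0:
                   payload = [{"rank": r} for r in range(comm.Get_size())]
               print("rank %d got %r" % (comm.Get_rank(), scatter_p2p(comm, payload)))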
          

          Lauren MacArthur added a comment -

          I can confirm that running with the current ticket branch solves the problem. Thanks Paul!

          Paul Price added a comment -

          Fritz Mueller indicated that Nate Pease might be a better reviewer.

          Nate, would you mind taking a look at this, please?

          Nate Pease added a comment -

          It looks good to me.

          Paul Price added a comment -

          Thanks Nate!

          Merged to master.


            People

            • Assignee: Paul Price
            • Reporter: Paul Price
            • Reviewers: Nate Pease
            • Watchers: Fritz Mueller, John Swinbank, Lauren MacArthur, Nate Pease, Paul Price
