Data Management / DM-8021

Deal with large pickles


    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ctrl_pool, pipe_drivers
    • Labels: None

      Description

      Lauren MacArthur is running:

      coaddDriver.py /tigress/HSC/HSC --rerun lauren/LSST/DM-6816/cosmos --job DM-6816-cosmos-y-coaddDriver --time 100 --cores 96 --batch-type=slurm --mpiexec='-bind-to socket' --id tract=0 filter=HSC-Y --selectId ccd=0..103 filter=HSC-Y visit=274..302:2^306..334:2^342..370:2^1858..1862:2^1868..1882:2^11718..11742:2^22602..22608:2^22626..22632:2^22642..22648:2^22658..22664:2 --batch-submit '--mem-per-cpu 8000'
      

      and it is producing:

      OverflowError on tiger-r8c1n12:19889 in map: integer 2155421250 does not fit in 'int'
      Traceback (most recent call last):
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 99, in wrapper
          return func(*args, **kwargs)
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 218, in wrapper
          return func(*args, **kwargs)
        File "/tigress/HSC/LSST/stack_20160915/Linux64/ctrl_pool/12.1+5/python/lsst/ctrl/pool/pool.py", line 554, in map
          self.comm.scatter(initial, root=self.rank)
        File "MPI/Comm.pyx", line 1286, in mpi4py.MPI.Comm.scatter (src/mpi4py.MPI.c:109079)
        File "MPI/msgpickle.pxi", line 707, in mpi4py.MPI.PyMPI_scatter (src/mpi4py.MPI.c:48114)
        File "MPI/msgpickle.pxi", line 168, in mpi4py.MPI.Pickle.dumpv (src/mpi4py.MPI.c:41672)
        File "MPI/msgbuffer.pxi", line 35, in mpi4py.MPI.downcast (src/mpi4py.MPI.c:29070)
      OverflowError: integer 2155421250 does not fit in 'int'
      application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
      

      We need to fix or work around this problem.
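
      For scale, here is a small sketch (not from this ticket; the per-node payload size is hypothetical, the 96 matches --cores 96 in the command above) of why the collective scatter hits a 32-bit limit: the root pickles the entire dataList into a single buffer, and mpi4py passes that buffer's byte count through a C 'int' (the downcast in the traceback), so the total is capped at 2^31 - 1 = 2147483647 bytes; the 2155421250 in the error is just over that, even though each per-node element is modest.

      import pickle

      n_nodes = 96                                    # e.g. --cores 96 in the command above
      per_node_payload = b"x" * (32 * 1024 * 1024)    # hypothetical ~32 MiB per node

      one_element = len(pickle.dumps(per_node_payload, protocol=2))
      whole_list = one_element * n_nodes              # rough size of pickling the full list at once

      print("per-element pickle bytes: %d" % one_element)
      print("estimated scatter buffer bytes: %d" % whole_list)
      print("fits in a C int: %s" % (whole_list < 2**31))   # False: this is the buffer that overflows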


            Activity

            Paul Price added a comment -

            Fritz Mueller, since you're interested, would you be willing to review this fix?

            price@price-laptop:~/LSST/ctrl/pool (tickets/DM-8021=) $ git sub-patch
            commit 9df40a36febd5bbcfd3842356097aaa3ba79860f
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Tue Oct 18 20:19:00 2016 -0400
             
                Comm: add alternate version of scatter
                
                The default version apparently pickles the entire 'dataList', which
                can cause errors if the pickle size grows over 2^31 bytes due to
                fundamental problems with pickle in python 2 [1][2] (causing, e.g.,
                "OverflowError: integer 2155421250 does not fit in 'int'"). Instead,
                we send the data to each slave node in turn; this reduces the pickle
                size.
                
                [1] http://bugs.python.org/issue11564
                [2] https://www.python.org/dev/peps/pep-3154/
             
            diff --git a/python/lsst/ctrl/pool/pool.py b/python/lsst/ctrl/pool/pool.py
            index 59e1277..ce3d663 100644
            --- a/python/lsst/ctrl/pool/pool.py
            +++ b/python/lsst/ctrl/pool/pool.py
            @@ -300,6 +300,30 @@ class Comm(mpi.Intracomm):
                     with PickleHolder(value):
                         return super(Comm, self).bcast(value, root=root)
             
            +    def scatter(self, dataList, root=0, tag=0):
            +        """Scatter data across the nodes
            +
            +        The default version apparently pickles the entire 'dataList',
            +        which can cause errors if the pickle size grows over 2^31 bytes
            +        due to fundamental problems with pickle in python 2. Instead,
            +        we send the data to each slave node in turn; this reduces the
            +        pickle size.
            +
            +        @param dataList  List of data to distribute; one per node
            +            (including root)
            +        @param root  Index of root node
            +        @param tag  Message tag (integer)
            +        @return  Data for this node
            +        """
            +        if self.Get_rank() == root:
            +            for rank, data in enumerate(dataList):
            +                if rank == root:
            +                    continue
            +                self.send(data, rank, tag=tag)
            +            return dataList[root]
            +        else:
            +            return self.recv(source=root, tag=tag)
            +
                 def Free(self):
                     if self._barrierComm is not None:
                         self._barrierComm.Free()
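
            For anyone wanting to exercise the same code path outside the driver, the point-to-point pattern can be sketched with plain mpi4py (illustrative only, not the ctrl_pool code; the script name and example data are made up):

            # p2p_scatter_demo.py -- run e.g.: mpiexec -n 4 python p2p_scatter_demo.py
            # Each element is pickled and sent on its own, so no single message
            # carries the pickle of the entire list.
            from mpi4py import MPI

            comm = MPI.COMM_WORLD
            root = 0

            if comm.Get_rank() == root:
                dataList = [{"rank": r, "payload": list(range(r))} for r in range(comm.Get_size())]
                for rank, data in enumerate(dataList):
                    if rank == root:
                        continue
                    comm.send(data, dest=rank, tag=0)    # one pickle per destination
                mine = dataList[root]
            else:
                mine = comm.recv(source=root, tag=0)

            print("rank %d received %r" % (comm.Get_rank(), mine))

            The cost is that the root serializes its sends, but no single buffer approaches the 2 GiB limit as long as no individual element's pickle is that large.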
            

            Lauren MacArthur added a comment -

            I can confirm that running with the current ticket branch solves the problem. Thanks, Paul!

            Paul Price added a comment -

            Fritz Mueller indicated that Nate Pease [X] might be a better reviewer.

            Nate, would you mind taking a look at this, please?

            Nate Pease [X] (Inactive) added a comment -

            It looks good to me.

            Paul Price added a comment -

            Thanks Nate!

            Merged to master.


              People

              Assignee:
              Paul Price
              Reporter:
              Paul Price
              Reviewers:
              Nate Pease [X] (Inactive)
              Watchers:
              Fritz Mueller, John Swinbank, Lauren MacArthur, Nate Pease [X] (Inactive), Paul Price

