  Data Management / DM-13645

Figure out how to start a dask cluster at lsst-dev using shared stack

    Details

    • Type: Story
    • Status: Won't Fix
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Story Points: 2
    • Epic Link: None
    • Sprint: DRP S18-4, DRP S18-5, DRP S18-6, DRP F18-1, DRP F18-2, DRP F18-3, DRP F18-4, DRP F18-5, DRP F18-6, DRP S19-1
    • Team: Data Release Production

      Description

      For interactive QA plotting, I have occasionally been using a dask cluster running on the head node of lsst-dev. However, I have been running this from my own personal conda environment. The command I have been executing is the following:

      mpiexec -np 12 dask-mpi --scheduler-file scheduler.json  --nthreads 1
      

      from /scratch/tmorton/dask, and this starts up a dask cluster that I can connect to when I'm dealing with large datasets. However, since using a personal conda environment is clearly not sustainable, this needs to be possible in the context of the shared stack. (Additionally, it should ideally be possible to start a dask cluster on the actual compute cluster via Slurm, but that is probably an issue for another ticket.)
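
      For reference, a minimal sketch of connecting to a cluster started this way, assuming the scheduler file is the one written by the command above in /scratch/tmorton/dask:

      # Minimal sketch; the scheduler-file path is an assumption based on the
      # command above and should be adjusted to wherever dask-mpi wrote it.
      from dask.distributed import Client

      client = Client(scheduler_file='/scratch/tmorton/dask/scheduler.json')
      print(client)  # reports the scheduler address and the number of workers

      # Example use: with mpiexec -np 12, rank 0 runs the scheduler and the
      # remaining ranks run workers, so work spreads across 11 worker processes.
      futures = client.map(lambda x: x ** 2, range(100))
      print(sum(client.gather(futures)))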

      When I try this with the current stack (currently being interactive-QA-enabled by John Swinbank in DM-13632), I get the following sorts of errors:

      [tmorton@lsst-dev01 dask]$ mpiexec -np 12 dask-mpi --scheduler-file scheduler.json  --nthreads 1
      distributed.scheduler - INFO - Clear task state
      distributed.scheduler - INFO -   Scheduler at: tcp://141.142.237.49:8786
      distributed.scheduler - INFO -       bokeh at:                     :8787
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:22996'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:31512'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:23697'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:26871'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:31404'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:2175'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:13582'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:28105'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:22621'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:22647'
      distributed.nanny - INFO -         Start Nanny at: 'tcp://141.142.237.49:14319'
      [cli_8]: write_line error; fd=37 buf=:cmd=init pmi_version=1 pmi_subversion=1
      :
      system msg for write_line failure : Bad file descriptor
      [cli_8]: Unable to write to PMI_fd
      [cli_8]: write_line error; fd=37 buf=:cmd=get_appnum
      :
      system msg for write_line failure : Bad file descriptor
      Fatal error in PMPI_Init_thread: Other MPI error, error stack:
      MPIR_Init_thread(474):
      MPID_Init(152).......: channel initialization failed
      MPID_Init(426).......: PMI_Get_appnum returned -1
      [cli_8]: write_line error; fd=37 buf=:cmd=abort exitcode=1094159
      :
      system msg for write_line failure : Bad file descriptor
      distributed.nanny - WARNING - Worker process 178401 was killed by unknown signal
      distributed.nanny - WARNING - Restarting worker
      [cli_8]: write_line error; fd=37 buf=:cmd=init pmi_version=1 pmi_subversion=1
      :
      system msg for write_line failure : Bad file descriptor
      [cli_8]: Unable to write to PMI_fd
      [cli_8]: write_line error; fd=37 buf=:cmd=get_appnum
      

      etc.

      I'm not sure who would be able to debug this, but if there is anyone at NCSA who is familiar with dask and able to support me with this (and with the future dask@slurm issue), that would be great.
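
      As a rough sketch of what the dask@slurm piece might eventually look like, the dask-jobqueue package provides a SLURMCluster class. The queue name, core count, memory, and walltime below are placeholders, not actual lsst-dev/Slurm values:

      # Hedged sketch only; dask-jobqueue is not currently part of the shared stack,
      # and every resource value here is a placeholder.
      from dask.distributed import Client
      from dask_jobqueue import SLURMCluster

      cluster = SLURMCluster(
          queue='normal',        # placeholder Slurm partition name
          cores=24,              # cores requested per Slurm job
          processes=1,           # one worker process per job (24 threads each)
          memory='128GB',        # memory requested per Slurm job
          walltime='04:00:00',
      )
      cluster.scale(4)           # scale to four workers (one Slurm job each here)
      client = Client(cluster)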

            Activity

            tmorton Tim Morton added a comment -

            By the way, I'm fairly sure I was relying on the stack's mpiexec before, but I could only confirm that by time-travelling back to a week or so ago.

            tmorton Tim Morton added a comment -

            Oh, the mashup must come from the fact that my environment had cloned the /ssd one, so that test is a mess to try...

            swinbank John Swinbank added a comment -

            How do I set up an old version of the stack (from before your recent additions)? This would help me go back to my old env, so I can test.

            I'm not aware of any way to roll back a conda environment, so I don't think this is possible.

            tmorton Tim Morton added a comment -

            OK, then I think I'll have to pretend this was never working to begin with.

            tmorton Tim Morton added a comment -

            This need is now obsolete: it seems there is no longer any reason for an individual to start up a personal dask cluster on lsst-dev, as I was doing while testing things out.


              People

              • Assignee: tmorton Tim Morton
              • Reporter: tmorton Tim Morton
              • Watchers: John Swinbank, Tim Morton, Yusra AlSayyad
