Figure out how to start a dask cluster at lsst-dev using shared stack


Details

• Type: Story
• Status: Won't Fix
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
None
• Story Points:
2
• Sprint:
DRP S18-4, DRP S18-5, DRP S18-6, DRP F18-1, DRP F18-2, DRP F18-3, DRP F18-4, DRP F18-5, DRP F18-6, DRP S19-1
• Team:
Data Release Production

Description

For the interactive qa plotting, I have been occasionally using a dask cluster running on the head node of lsst-dev. However, I have been running this from my own special conda environment. The command I have been executing is the following:

 mpiexec -np 12 dask-mpi --scheduler-file scheduler.json --nthreads 1 

from /scratch/tmorton/dask, and this has been starting up a dask cluster I can connect to when I'm dealing with large datasets. However, since using a personal conda environment is clearly not sustainable, this needs to work in the context of the shared stack. (Additionally, it should ideally be possible to start a dask cluster on the actual cluster, via slurm, but that is probably an issue for another ticket.)
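For reference, the workflow above relies on dask-mpi writing its scheduler address to scheduler.json, which a client session then reads to connect. A minimal sketch of that handshake (the file structure shown is an assumption for illustration; the address is the one that appears in the log below):

```python
import json

# dask-mpi writes a scheduler file containing (at least) the scheduler
# address; the structure here is assumed for illustration.
scheduler_info = {"address": "tcp://141.142.237.49:8786"}
with open("scheduler.json", "w") as f:
    json.dump(scheduler_info, f)

# A separate session on lsst-dev would then connect with:
#   from dask.distributed import Client
#   client = Client(scheduler_file="scheduler.json")

# The scheduler address can be recovered directly from the file:
with open("scheduler.json") as f:
    print(json.load(f)["address"])
```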

When I try this with the current stack (currently being interactive-QA-enabled by John Swinbank in DM-13632), I get the following sorts of errors:

 [tmorton@lsst-dev01 dask]$ mpiexec -np 12 dask-mpi --scheduler-file scheduler.json --nthreads 1
 distributed.scheduler - INFO - Clear task state
 distributed.scheduler - INFO - Scheduler at: tcp://141.142.237.49:8786
 distributed.scheduler - INFO - bokeh at: :8787
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:22996'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:31512'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:23697'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:26871'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:31404'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:2175'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:13582'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:28105'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:22621'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:22647'
 distributed.nanny - INFO - Start Nanny at: 'tcp://141.142.237.49:14319'
 [cli_8]: write_line error; fd=37 buf=:cmd=init pmi_version=1 pmi_subversion=1 :
 system msg for write_line failure : Bad file descriptor
 [cli_8]: Unable to write to PMI_fd
 [cli_8]: write_line error; fd=37 buf=:cmd=get_appnum :
 system msg for write_line failure : Bad file descriptor
 Fatal error in PMPI_Init_thread: Other MPI error, error stack:
 MPIR_Init_thread(474): MPID_Init(152).......: channel initialization failed
 MPID_Init(426).......: PMI_Get_appnum returned -1
 [cli_8]: write_line error; fd=37 buf=:cmd=abort exitcode=1094159 :
 system msg for write_line failure : Bad file descriptor
 distributed.nanny - WARNING - Worker process 178401 was killed by unknown signal
 distributed.nanny - WARNING - Restarting worker
 [cli_8]: write_line error; fd=37 buf=:cmd=init pmi_version=1 pmi_subversion=1 :
 system msg for write_line failure : Bad file descriptor
 [cli_8]: Unable to write to PMI_fd
 [cli_8]: write_line error; fd=37
 buf=:cmd=get_appnum

etc.
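This PMI_Get_appnum / write_line failure pattern is consistent with launching under one MPI implementation's mpiexec while the Python processes use an mpi4py built against a different MPI library; this is an assumption about the cause, not a confirmed diagnosis. A diagnostic sketch for comparing the two:

```python
import shutil
import subprocess

# Assumption being tested: the mpiexec on PATH and the MPI library that
# mpi4py is linked against come from different MPI implementations.

# 1. Which mpiexec is first on PATH?
print("mpiexec on PATH:", shutil.which("mpiexec"))

# 2. Which MPI library is mpi4py linked against? Run this in the same
#    environment dask-mpi runs in (requires mpi4py to be importable):
probe = "from mpi4py import MPI; print(MPI.Get_library_version())"
result = subprocess.run(["python", "-c", probe],
                        capture_output=True, text=True)
print(result.stdout or result.stderr)
```

If the library version reported by mpi4py (e.g. MPICH vs. Open MPI) does not match the implementation providing mpiexec, launching with that implementation's own mpiexec would be the first thing to try.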

I'm not sure who would be able to debug this, but if there's anyone at NCSA who's familiar with dask and able to support me with this (and with the future dask@slurm issue), that would be great.

Activity

Tim Morton added a comment -

By the way, I'm fairly sure I was relying on the stack's mpiexec before; if I could time-travel back a week or so, I could confirm that.

Tim Morton added a comment -

Oh, the mashup must come from the fact that my environment had cloned the /ssd one, so that test is a mess to try...

John Swinbank added a comment -

> How do I set up an old version of the stack (from before your recent additions)? This would help me go back to my old env, so I can test.

I'm not aware of any way to roll back a conda environment, so I don't think this is possible.

Tim Morton added a comment -

OK, then I think I'll have to pretend this was never working to begin with.

Tim Morton added a comment -

Closing as obsolete: there no longer seems to be any need for an individual to start up a personal dask cluster on lsst-dev, as I was doing while testing things out.


People

• Assignee:
Tim Morton
Reporter:
Tim Morton
Watchers:
John Swinbank, Tim Morton, Yusra AlSayyad