  1. Data Management
  2. DM-29530

Config Paths hardcoded during graph building

    Details

    • Story Points:
      1
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      Hello, 

      An attempt to run a workflow generated in an environment different from the edge node setup leads to this error (https://ai-idds-01.cern.ch:25443/cache/DOMA_Harvester_1009468.out):

      botocore.hooks DEBUG: Event after-call.s3.HeadObject: calling handler <bound method RetryQuotaChecker.release_retry_quota of <botocore.retries.standard.RetryQuotaChecker object at 0x7fce58f962b0>>
       transformSourceTable INFO: from /home/spadolski/wrk/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/obs_subaru/21.0.0-27-g1176d449+70a9c181f9/policy/Source.yaml
       Traceback (most recent call last):
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/ctrl_mpexec/21.0.0-24-g0c1e3ff+d4d6b51e8d//bin/pipetask", line 29, in <module>
       sys.exit(main())
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/ctrl_mpexec/21.0.0-24-g0c1e3ff+d4d6b51e8d/python/lsst/ctrl/mpexec/cli/pipetask.py", line 43, in main
       return cli()
       File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-0.4.3/lib/python3.8/site-packages/click/core.py", line 829, in __call__
       return self.main(*args, **kwargs)
       File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-0.4.3/lib/python3.8/site-packages/click/core.py", line 782, in main
       rv = self.invoke(ctx)
       File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-0.4.3/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
       return _process_result(sub_ctx.command.invoke(sub_ctx))
       File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-0.4.3/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
       return ctx.invoke(self.callback, **ctx.params)
       File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-0.4.3/lib/python3.8/site-packages/click/core.py", line 610, in invoke
       return callback(*args, **kwargs)
       File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-0.4.3/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
       return f(get_current_context(), *args, **kwargs)
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/ctrl_mpexec/21.0.0-24-g0c1e3ff+d4d6b51e8d/python/lsst/ctrl/mpexec/cli/cmd/commands.py", line 103, in run
       script.run(qgraphObj=qgraph, **kwargs)
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/ctrl_mpexec/21.0.0-24-g0c1e3ff+d4d6b51e8d/python/lsst/ctrl/mpexec/cli/script/run.py", line 167, in run
       f.runPipeline(qgraphObj, taskFactory, args)
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/ctrl_mpexec/21.0.0-24-g0c1e3ff+d4d6b51e8d/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 597, in runPipeline
       preExecInit.initialize(graph,
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/ctrl_mpexec/21.0.0-24-g0c1e3ff+d4d6b51e8d/python/lsst/ctrl/mpexec/preExecInit.py", line 89, in initialize
       self.saveInitOutputs(graph)
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/ctrl_mpexec/21.0.0-24-g0c1e3ff+d4d6b51e8d/python/lsst/ctrl/mpexec/preExecInit.py", line 182, in saveInitOutputs
       task = self.taskFactory.makeTask(taskDef.taskClass, taskDef.config, None, self.butler)
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/ctrl_mpexec/21.0.0-24-g0c1e3ff+d4d6b51e8d/python/lsst/ctrl/mpexec/taskFactory.py", line 93, in makeTask
       task = taskClass(config=config, initInputs=initInputs)
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/pipe_tasks/21.0.0-58-g436064c8+549a115573/python/lsst/pipe/tasks/postprocess.py", line 571, in __init__
       self.funcs = CompositeFunctor.from_file(self.config.functorFile)
       File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/pipe_tasks/21.0.0-58-g436064c8+549a115573/python/lsst/pipe/tasks/functors.py", line 510, in from_file
       with open(filename) as f:
       FileNotFoundError: [Errno 2] No such file or directory: '/home/spadolski/wrk/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/obs_subaru/21.0.0-27-g1176d449+70a9c181f9/policy/Source.yaml'

       
      It is obvious that the path is wrong: this is the path on the submitting machine, not the one in the edge node container.
      I assume that the problem arises here: https://github.com/lsst/obs_subaru/blob/master/config/transformSourceTable.py#L5 or, more specifically, here: https://github.com/lsst/utils/blob/master/src/packaging.cc#L39
      In the container:
      (lsst-scipipe-0.4.3) [lsst@b65d486014ba stack]$ env | grep OBS_SUBARU_DIR
      OBS_SUBARU_DIR=/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/obs_subaru/21.0.0-27-g1176d449+70a9c181f9
      Can this be fixed?

       

            Activity

            tjenness Tim Jenness added a comment -

            Yes, so this is seemingly a general problem with the way we handle pex configs.

            • The config has Python code in it that executes and assigns the result to a config item.
            • The graph builder serializes that config (dropping all of that Python code).
            • The execution system reads the config, which now has the derived values burned in.
            • Everything breaks.

            We are essentially requiring that the graph builder and the executor node have the LSST software in exactly the same place. The fix would be to use environment variable strings directly in the file names and have whatever reads the path do the environment variable expansion. I'm not sure how amenable the pipeline developers would be to that kind of change. Even if they were, it would take a while to implement, so you'd need to change the submission code now to use the same location anyhow.
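
The burn-in sequence described in this comment can be reproduced without any LSST code. The following is only a minimal sketch under stated assumptions: `build_config` is a hypothetical stand-in for a config override that derives a path from the package directory, JSON stands in for the real config serialization, and the two install locations are shortened, illustrative paths.

```python
import json
import posixpath


def build_config(package_dir):
    """Mimic a config override that runs Python at graph-build time."""
    # Hypothetical stand-in for "package directory + relative path".
    return {"functorFile": posixpath.join(package_dir, "policy", "Source.yaml")}


# Illustrative install locations (shortened, not the real paths).
SUBMIT_DIR = "/home/user/stack/obs_subaru"
WORKER_DIR = "/opt/lsst/software/stack/obs_subaru"

# Graph building on the submit node: only the computed value is serialized;
# the code that derived it from the package directory is gone.
frozen = json.dumps(build_config(SUBMIT_DIR))

# Execution on the worker node: the restored config still points at the
# submit node's install, regardless of where the package lives locally.
restored = json.loads(frozen)
print(restored["functorFile"])
print(restored["functorFile"].startswith(WORKER_DIR))
```

The restored value is whatever was computed on the submit node, which is why the worker ends up opening a path that does not exist in the container.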

            cc/ Kian-Tat Lim, Yusra AlSayyad

            ktl Kian-Tat Lim added a comment -

            Seems to me the fundamental problem is https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/functors.py#L510 and its non-Butler I/O.

            As an opening bid, I'd rather the original config Python code burn the entire YAML into the config rather than the path to the file.
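
One way to picture this suggestion is a config that stores the file's contents rather than its location. This is a rough sketch, not the pipe_tasks implementation: `build_config`, the `functorYaml` key, and the YAML text are made-up placeholders, and JSON again stands in for the real config serialization.

```python
import json
import os
import tempfile


def build_config(yaml_path):
    """Hypothetical config override: store the file's contents, not its path."""
    with open(yaml_path) as f:
        return {"functorYaml": f.read()}


# Graph building: the YAML file exists only on the submit node.
with tempfile.TemporaryDirectory() as submit_dir:
    yaml_path = os.path.join(submit_dir, "Source.yaml")
    with open(yaml_path, "w") as f:
        f.write("funcs:\n  example:\n    functor: Column\n")
    frozen = json.dumps(build_config(yaml_path))

# Execution: the submit node's directory is gone, but the frozen config
# carries the full text, so no file lookup is needed.
restored = json.loads(frozen)
file_still_exists = os.path.exists(yaml_path)
print(file_still_exists)
print(restored["functorYaml"])
```

With this shape the frozen config is self-contained, at the cost of a larger serialized config.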

            tjenness Tim Jenness added a comment -

            That is the specific problem stated above. I was simply worried that it's not the only case where we use getPackageDir to locate a file to reference in a config. Maybe I'm too pessimistic.

            ktl Kian-Tat Lim added a comment -

            Almost all other getPackageDir() references are for loading configs, which is fine. Out of the 22 results for GitHub search for org:lsst extension:py path:/config getPackageDir, I see only datasetIngest.py in some obs packages, which is ap_verify-specific, as having a similar problem.

            ktl Kian-Tat Lim added a comment - - edited

            Maybe I should be clearer. The reason to have frozen configs is so that they are portable and well-defined. If they contain environment variables whose values change depending on where they are used, they are not frozen and might change on re-execution. This is the same as the scons problem. We should not be enabling people to do dangerous things.

            tjenness Tim Jenness added a comment -

            If this is a one-off, I'm fine with fixing it. That way we don't embed environment variables in the configs and we also don't embed paths to software installations.

            Blame suggests Yusra AlSayyad added this functor code relatively recently.

            yusra Yusra AlSayyad added a comment -

            Yeah, we knew that hardcoding the functor file would bite us eventually, and a replacement is in progress (see https://jira.lsstcorp.org/browse/DM-29031), but I'm not sure it'll be done on the timescale needed to unblock this. (The earliest we could do it is pair coding with Nate Lust next Wednesday.) What's your workaround?

            nlust Nate Lust added a comment - - edited

            We could probably get something up and going by early next week if we need to (if this is quite high priority), as we were planning on replacing the functors anyway (but that might involve changing the structure of some tables, so we might need to consult with some others).

            tjenness Tim Jenness added a comment -

            I think in the short term Sergey Padolski is going to change the PanDA submit node so that the software is in the same place as on the execution nodes.

            podolsky Sergey Padolski added a comment -

            Thanks a lot for looking into this. In the short term I am unblocked because I've aligned the paths on the submission machine to match the container structure running on the edge nodes.

            tjenness Tim Jenness added a comment - - edited

            This is still breaking PanDA testing. The discussion above said this would all be fixed mid-April but that doesn't seem to be happening.

            Can I do my own quick fix? I think if I change the config entry to use the string $OBS_SUBARU_DIR and then change CompositeFunctor.from_file to call os.path.expandvars, that would fix it in a handful of lines. Do people see a problem with doing that?
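
The proposed quick fix can be sketched in a few lines. This is an illustration under assumptions rather than the actual patch: `from_file` here is a hypothetical stand-in for `CompositeFunctor.from_file` that simply returns the file text, and the temporary directory plays the role of a local obs_subaru install. The standard library function is spelled `os.path.expandvars`.

```python
import os
import tempfile


def from_file(filename):
    """Hypothetical stand-in for CompositeFunctor.from_file."""
    # Expand $VARS when the file is read, i.e. on the node that reads it.
    with open(os.path.expandvars(filename)) as f:
        return f.read()


# The frozen config would hold this literal string, not a resolved path.
functor_file = "$OBS_SUBARU_DIR/policy/Source.yaml"

with tempfile.TemporaryDirectory() as package_dir:
    # Fake a local obs_subaru install with a placeholder policy file.
    os.makedirs(os.path.join(package_dir, "policy"))
    with open(os.path.join(package_dir, "policy", "Source.yaml"), "w") as f:
        f.write("funcs: {}\n")

    # Each node sets OBS_SUBARU_DIR to its own install location.
    os.environ["OBS_SUBARU_DIR"] = package_dir
    content = from_file(functor_file)

print(content)
```

The trade-off raised earlier in the thread still applies: a config that stores `$OBS_SUBARU_DIR` resolves to a different file depending on the environment in which it is read.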

            yusra Yusra AlSayyad added a comment -

            Yes please

            tjenness Tim Jenness added a comment -

            Yusra AlSayyad can you please review this? It's a couple of lines in pipe_tasks and then tweaks to the configs in obs_lsst and obs_subaru. Jenkins is running at the moment.

            yusra Yusra AlSayyad added a comment -

            Thanks for doing this.

            Looks good. Would feel better if you kicked off a ci_imsim too.

            tjenness Tim Jenness added a comment -

            Sergey Padolski hopefully weekly 27 will fix your PanDA problem.

            podolsky Sergey Padolski added a comment -

            Thanks a lot, Tim Jenness!


              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              podolsky Sergey Padolski
              Reviewers:
              Yusra AlSayyad
              Watchers:
              Kian-Tat Lim, Nate Lust, Sergey Padolski, Tim Jenness, Yusra AlSayyad