Data Management: DM-18033

ts_scriptqueue unit tests failing on some machines

    Details

    • Story Points:
      1
    • Sprint:
      TSSW Sprint - Feb 18 - Mar 02, TSSW Sprint - Mar 04 - Mar 16
    • Team:
      Telescope and Site

      Description

      One or the other of the nopytest_x unit tests is not reliable on some machines. In particular, nopytest_script_queue.py sometimes fails for Tiago Ribeiro, and test_stop_scripts in nopytest_queue_model.py usually times out on my work iMac (but runs reliably on my identical home iMac).

      Unfortunately, the test failures I see go away if I run only the failing test, so the tests are clearly not running completely independently. Three things I would like to try:

      • Measure the time it takes to start each script and the time it takes to configure it and report the max time, standard deviation and mean time for each of these when running all tests.
      • For nopytest_queue_model.py use a unique SAL index for each script, in case that reduces interference between tests (though I don't see why it should).
      • Put a pause at the start of the failing test to see if giving SAL a chance to clean things up helps.
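
The timing measurement in the first item could be sketched roughly as follows. This is only an illustration, not code from ts_scriptqueue: the names DurationStats and timed are hypothetical, and it assumes the start and configure steps can be wrapped as ordinary callables.

```python
import statistics
import time


class DurationStats:
    """Collect durations (in seconds) for one kind of step and summarize them.

    Hypothetical helper for illustration; not part of ts_scriptqueue.
    """

    def __init__(self, name):
        self.name = name
        self.durations = []

    def add(self, duration):
        self.durations.append(duration)

    def summary(self):
        """Return (max, mean, standard deviation) of the recorded durations."""
        if not self.durations:
            raise ValueError(f"no durations recorded for {self.name}")
        # The sample standard deviation needs at least two data points.
        if len(self.durations) > 1:
            stdev = statistics.stdev(self.durations)
        else:
            stdev = 0.0
        return max(self.durations), statistics.mean(self.durations), stdev


def timed(stats, func, *args, **kwargs):
    """Call func, record how long the call took in stats, return its result."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        stats.add(time.monotonic() - start)
```

One DurationStats instance per step (one for starting a script, one for configuring it) would be filled in while all tests run, and summary() reported at the end.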

        Attachments

          Issue Links

            Activity

            rowen Russell Owen added a comment (edited)

            I did modify the tests to use a unique SAL index for each script. This didn't help the problem but it does make it easier to see what's going on so I left in the new code.

            Adding a delay before running test_stop_scripts did nothing so I did not keep that.

            The "fix" was to add a 0.1 second sleep before sending the configure command to the script. Without this pause, test_stop_scripts in nopytest_queue_model.py consistently fails for me: the first script added just sits there waiting for configuration. The configure command is sent (I added print statements to prove that) but it never executes. SAL is clearly losing the configure command.
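
The shape of that pause, as a minimal asyncio sketch. The names configure_after_pause and send_configure are hypothetical stand-ins; the actual change is in the pull request linked below and may look different.

```python
import asyncio

# 0.1 seconds is the delay reported above as sufficient for a normal test run.
CONFIGURE_DELAY = 0.1


async def configure_after_pause(send_configure, delay=CONFIGURE_DELAY):
    """Wait briefly, then send the configure command.

    send_configure is a hypothetical zero-argument coroutine function that
    stands in for whatever actually issues the command to the script.
    """
    await asyncio.sleep(delay)
    return await send_configure()
```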

            This is clearly just a workaround for a badly understood problem, probably TSS-3279. I'm not very happy about it, but it won't hurt use of the script queue and if it does the job I suggest we live with it. Please try running the tests yourself a few times, preferably using:

            • scons --clean # to make sure all tests run in the next step
            • scons

            I also tried some other delay values:

            • 0 seconds wasn't enough (that just frees the event loop in case something else wants it).
            • 1 second was enough to make nopytest_queue_model.py run in multiprocessing mode (with pytest -n 4) but 0.1 and 0.5 seconds were not.
            • Even 1 second was not enough to allow nopytest_script_queue.py to run in multiprocessing mode. I didn't try longer values.
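
As a small illustration of the 0 second case: asyncio.sleep(0) only hands control back to the event loop for one pass, so tasks that are already scheduled get a turn, but no real time elapses. The task names here are made up for the example.

```python
import asyncio


async def main():
    events = []

    async def other_task():
        events.append("other task ran")

    # Schedule another task; it cannot run until main() yields.
    task = asyncio.create_task(other_task())
    events.append("before sleep(0)")
    await asyncio.sleep(0)  # yield to the event loop once, without waiting
    events.append("after sleep(0)")
    await task
    return events


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The other task runs during the sleep(0), which is all that a zero delay buys; anything that needs wall-clock time to settle is unaffected.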

            I also overhauled the unit test to produce more useful output:

            • Each script has a unique SAL index, making it a bit easier to keep track of which script belongs to which test.
            • The queue and script state callback printouts include elapsed time for the current test.
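
A rough sketch of those two ideas, not the code from the pull request: next_index and ElapsedPrinter are hypothetical names, and the starting index of 1000 is arbitrary.

```python
import itertools
import time

# Module-level counter so every script created by any test gets its own index.
# The starting value is arbitrary (an assumption for this sketch).
_index_gen = itertools.count(1000)


def next_index():
    """Return an index that no earlier call in this process has returned."""
    return next(_index_gen)


class ElapsedPrinter:
    """Prefix messages with the time elapsed since the instance was created."""

    def __init__(self):
        self.start = time.monotonic()

    def format(self, message):
        elapsed = time.monotonic() - self.start
        return f"[{elapsed:7.2f}s] {message}"
```

Creating one ElapsedPrinter in each test's setup and routing callback printouts through format() gives elapsed time relative to the start of the current test.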

            ts_scriptqueue pull request: https://github.com/lsst-ts/ts_scriptqueue/pull/19

            I didn't update the served docs because I don't think any of these changes affect them.

            tribeiro Tiago Ribeiro added a comment -

            Reviewed the code on GitHub. The tests are still not 100% stable; most of the issues I saw could be traced to bugs/features of the current SAL version. Hopefully they will be fixed in future SAL releases.

            rowen Russell Owen added a comment -

            Merged to develop and master and tagged as v1.2.1

            Documentation updated at http://staff.washington.edu/rowen/ts_scriptqueue/index.html


              People

              Assignee:
              rowen Russell Owen
              Reporter:
              rowen Russell Owen
              Reviewers:
              Tiago Ribeiro
              Watchers:
              Dave Mills, Russell Owen, Tiago Ribeiro
