  1. Data Management
  2. DM-7254

Provide a global full-text search tool for the LSST code base

    Details

    • Team:
      SQuaRE

      Description

      Please provide a tool, accessible both from the command line (within a configured LSST development environment) and via a web interface, that offers full-text search of the LSST code base. It would be desirable for it to offer the following options:

      • Search a named release
      • Search recent nightly / weekly builds
      • Starting from a CI or QA run in the database of such runs, search the code used for that run
      • (command-line version only) Search the code base implied by an eups setup

      For extra credit, or in a later version, offer semantic search (e.g., "find all uses of a function named 'foo'", rather than "find all strings containing 'foo'").
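
      The difference between the two kinds of search can be illustrated with a minimal sketch. It assumes Python sources only and uses the standard-library `ast` module; the function names `text_matches` and `call_sites` and the `SAMPLE` snippet are hypothetical, chosen for illustration, and C++ code would need a different parser.

```python
import ast


def text_matches(source: str, needle: str) -> list[int]:
    """Full-text search: line numbers of every line containing the substring."""
    return [n for n, line in enumerate(source.splitlines(), start=1) if needle in line]


def call_sites(source: str, func_name: str) -> list[int]:
    """Semantic search: line numbers where something named func_name is called."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            target = node.func
            # Plain calls (foo(...)) are ast.Name; method calls (obj.foo(...)) are ast.Attribute.
            if isinstance(target, ast.Name):
                name = target.id
            else:
                name = getattr(target, "attr", None)
            if name == func_name:
                lines.append(node.lineno)
    return sorted(lines)


# Hypothetical sample: 'foo' appears in a definition, a string, and two calls.
SAMPLE = '''\
def foo(x):
    return x + 1

food = "foo is only a string here"
y = foo(2)
z = helper.foo(3)
'''
```

      On this sample the full-text search reports the definition and the string as well as the calls, while the syntax-aware search reports only the two call sites.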

          Activity

          gpdf Gregory Dubois-Felsmann added a comment -

          As an example: the BaBar project ran a full-text glimpse index of every release build and every nightly QA build, and made searches on these indexes available. The command line interface was:

          srtglimpse -H 12.3.4 foo

          to perform the search in the specified release.

          The tool searched both the input source code and, optionally, any auto-generated code produced as part of the build (e.g., C++ stubs for CORBA .idl objects).
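
          As a rough illustration of keeping one index per release, here is a minimal sketch of a token-to-location index keyed by release name. It is not how glimpse works internally; `build_index`, `search`, and the file contents in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Identifier-like tokens; a real indexer would use a richer tokenizer.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_index(files: dict[str, str]) -> dict[str, set[tuple[str, int]]]:
    """Map each token to the set of (path, line number) pairs where it occurs."""
    index: dict[str, set[tuple[str, int]]] = defaultdict(set)
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for token in TOKEN.findall(line):
                index[token].add((path, lineno))
    return dict(index)


def search(indexes: dict[str, dict], release: str, term: str) -> list[tuple[str, int]]:
    """Look up a whole-token term in the index built for one release."""
    return sorted(indexes[release].get(term, ()))
```

          For example, `indexes = {"12.3.4": build_index({"a.cc": "int foo();"})}` followed by `search(indexes, "12.3.4", "foo")` looks up `foo` only in the files indexed for that release. Unlike a substring search, this sketch matches whole tokens only.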

          jsick Jonathan Sick added a comment -

          It's not well specified yet, but full code search, at least from a web context, is planned to be part of the LSST Documentation Hub (universal search): https://sqr-011.lsst.io/#a-documentation-index

          swinbank John Swinbank added a comment -

          Nate Lust built some variation on the system described here for "fun" last year. I don't know to what extent it meets all of the requests, and I don't know whether it's currently operational. Nate, could you fill us in, please?

          jsick Jonathan Sick added a comment -

          After talking to Nate Lust it seems that he can deploy his existing Whoosh-based search service on a Nebula instance. That will give us search by ticket branch (and maybe tag?) now for only the cost of deployment configuration.

          GitHub provides code search, but it's not 100% usable here because it doesn't seem to work across an organization and only tracks master.

          The documentation hub that I mentioned in SQR-011 might be a good long-term solution for this (I'll update that technote to specifically mention code search and this ticket). The DocHub will be Elasticsearch (i.e., robust, web-scale search) fronted by an API. This will allow search to happen from many contexts (e.g., from the DocHub landing page, or from the SQUASH dashboard, code documentation sites, etc.).

          Integrating code search in DocHub will be ideal in terms of having shared infrastructure, and having one fewer webpage for folks to know about.

          The DocHub will be able to naturally deal with versioning as a search facet since this concept is also used in our documentation.

          I'm not sure about tooling to provide syntax analysis in Elasticsearch. GitHub doesn't do this in their Elasticsearch, for example. Keep in mind that code search is related to documentation search, and the documentation (API reference) does clearly distinguish functions, classes, etc. The API reference is also potentially tagged with each EUPS and CI 'build'. So perhaps there's some useful synergy between code search and doc search we can use here.

          As for a command-line search service, we could certainly have a command line client to the web search RESTful API.

          The tool for searching the currently 'setup' local stack would be quite distinct from this. But is there a need for a custom tool compared to, say, grepping the checked-out lsstsw directory? I see that last use case as the one that would benefit least from DM-specific tooling. I could be wrong on this, though.
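
          For comparison, the grep-style approach needs very little code. The sketch below assumes that each set-up product is exposed through an environment variable whose name ends in `_DIR` (other variables with that suffix would be picked up too); `product_dirs`, `grep_tree`, and the default file suffixes are hypothetical choices.

```python
import re
from collections.abc import Iterator, Mapping
from pathlib import Path


def product_dirs(environ: Mapping[str, str]) -> list[Path]:
    """Directories named by *_DIR variables (assumed to mark set-up products)."""
    dirs = []
    for name, value in sorted(environ.items()):
        if name.endswith("_DIR") and Path(value).is_dir():
            dirs.append(Path(value))
    return dirs


def grep_tree(root: Path, pattern: str,
              suffixes: tuple[str, ...] = (".py", ".cc", ".h")) -> Iterator[tuple[Path, int, str]]:
    """Yield (path, line number, stripped line) for each regex match under root."""
    regex = re.compile(pattern)
    for path in sorted(root.rglob("*")):
        if not (path.is_file() and path.suffix in suffixes):
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # Skip unreadable files rather than abort the search.
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path, lineno, line.strip()
```

          Calling `product_dirs(os.environ)` after an eups setup and passing each directory to `grep_tree` would search the set-up stack; this is roughly what a recursive `grep` does, which supports the point that little custom tooling is needed for this case.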

          nlust Nate Lust added a comment -

          I just want to chime in and say I will spin my tool up on a Nebula instance sometime next week to be available as a web interface. However, the tool also runs on someone's laptop and provides a command line interface. I can also provide the tools to anyone who wants to run it. The only gotcha is that each rebuild of the index takes a little time, and the index occupies a few gigabytes. The time factor may be solved with a Nebula instance that maintains a frequently rebuilt index in a git lfs repo, which others can just download as needed.

          nlust Nate Lust added a comment -

          As a status update I will not get to this for a few days, as we are working to finish up the pytest/py3 work. I have not forgotten about it.

          swinbank John Swinbank added a comment -

          Removing DMLT label, and providing a gentle reminder to Frossie Economou that people still care about this.


            People

            • Assignee:
              Unassigned
              Reporter:
              gpdf Gregory Dubois-Felsmann
              Watchers:
              Gregory Dubois-Felsmann, John Swinbank, Jonathan Sick, Nate Lust