Data Management / DM-25917

Diagnose and fix Ook ingest failures

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: www_lsst_io
    • Labels: None

      Description

      These documents are examples of technotes not being correctly ingested:

       

      In this ticket we'll analyze these ingests and devise fixes to ensure that these documents appear in the Algolia index. A large part of this work may be creating new diagnostic tools for the Ook ingest pipeline.

      While working on the ingest code, let's also see if it's possible to "bin" LaTeX paragraphs together so that each document produces fewer records.
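
One way to picture the paragraph binning idea is a greedy merge of consecutive paragraphs up to a size limit. The sketch below is only an illustration, not Ook's implementation: the function name `bin_paragraphs` and the `max_chars` limit are hypothetical, and a real limit would depend on the record size the index accepts.

```python
from typing import Iterable, List


def bin_paragraphs(paragraphs: Iterable[str], max_chars: int = 1000) -> List[str]:
    """Greedily merge consecutive paragraphs into bins of at most max_chars.

    Hypothetical helper for illustration. A single paragraph longer than
    max_chars is kept as its own bin rather than being split.
    """
    bins: List[str] = []
    current: List[str] = []
    current_len = 0
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue  # skip empty paragraphs
        # +2 accounts for the blank-line separator added when joining.
        added = len(text) + (2 if current else 0)
        if current and current_len + added > max_chars:
            # The current bin is full: emit it and start a new one.
            bins.append("\n\n".join(current))
            current = [text]
            current_len = len(text)
        else:
            current.append(text)
            current_len += added
    if current:
        bins.append("\n\n".join(current))
    return bins
```

Each returned bin would then become one search record instead of one record per paragraph.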

        Attachments

          Activity

          jsick Jonathan Sick added a comment -

          We fixed many ingest issues:

          • Add a heuristic to reject TeX markup that pandoc lets through
          • Handle AASTeX technotes without Lander support
          • Fall back to the first content chunk if there is no abstract
          • Support reStructuredText technote content that appears before subsections
          • Support Lander docs without content
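
As a rough illustration of the first fix, a heuristic for spotting leftover TeX could count how many tokens in a chunk still contain a TeX control sequence. This is a minimal sketch under assumptions: `looks_like_tex`, `drop_tex_chunks`, and the 5% threshold are hypothetical and are not taken from the Ook code base.

```python
import re
from typing import Iterable, List

# Matches TeX control sequences such as \begin, \cite, or \alpha.
TEX_COMMAND = re.compile(r"\\[A-Za-z]+")


def looks_like_tex(text: str, max_ratio: float = 0.05) -> bool:
    """Return True if the share of whitespace-separated tokens that
    contain a TeX command exceeds max_ratio (a hypothetical threshold)."""
    tokens = text.split()
    if not tokens:
        return False
    tex_tokens = sum(1 for token in tokens if TEX_COMMAND.search(token))
    return tex_tokens / len(tokens) > max_ratio


def drop_tex_chunks(chunks: Iterable[str]) -> List[str]:
    """Keep only the chunks that do not look like unconverted TeX."""
    return [chunk for chunk in chunks if not looks_like_tex(chunk)]
```

A ratio-based check tolerates an occasional stray command in otherwise clean prose while still rejecting chunks that are mostly markup.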

          Some documents are still posing issues:

          • DM-128 is a Beamer document, and pandoc (via Lander) failed to extract its plain-text content
          • DMTN-010, DMTN-020, DMTN-023, and DMTN-047 seem to be failing due to very long sections (need to add intra-section chunking to the reStructuredText document ingest)
          • Unknown errors related to ITTN-012 (dominated by tables)
          • DMTR-161
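
The intra-section chunking mentioned above could look something like the following sketch, which splits a long section on paragraph boundaries and falls back to splitting on whitespace when a single paragraph is too long. The names `chunk_section` and `_split_words` and the `max_chars` default are hypothetical; the actual limit and splitting strategy in Ook may differ.

```python
import re
from typing import List


def _split_words(paragraph: str, max_chars: int) -> List[str]:
    """Split one paragraph on whitespace into pieces of at most max_chars.

    A single word longer than max_chars is emitted unsplit.
    """
    pieces: List[str] = []
    current = ""
    for word in paragraph.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_section(text: str, max_chars: int = 2000) -> List[str]:
    """Split section text into chunks, keeping whole paragraphs together
    when they fit and splitting oversized paragraphs on whitespace."""
    chunks: List[str] = []
    current = ""
    # Paragraphs are assumed to be separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", text):
        for piece in _split_words(paragraph, max_chars):
            candidate = f"{current}\n\n{piece}" if current else piece
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Every word of the section ends up in exactly one chunk, so no content is dropped when a section exceeds the limit.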
          jsick Jonathan Sick added a comment -

          Deployed Ook 0.3.0


            People

            Assignee:
            jsick Jonathan Sick
            Reporter:
            jsick Jonathan Sick
            Watchers:
            Jonathan Sick
            Votes:
            0

              Dates

              Created:
              Updated:
              Resolved:
