Data Management / DM-10928

Use pandoc in metadata extraction pipeline from TeX documents for Lander & metasrc

    Details

    • Templates:
    • Story Points:
      10.9
    • Team:
      SQuaRE

      Description

      Lander (https://github.com/lsst-sqre/lander) uses metasrc (https://github.com/lsst-sqre/metasrc) to discover document metadata from the document source itself. Normalizing LaTeX into plain Unicode text is a non-trivial challenge.

      An approach that Tim Jenness suggested is to use pandoc to convert the document into HTML first, and then extract metadata from the HTML. HTML is generally an easier format to extract information from since it's standards-based.
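As a rough illustration of the second half of that idea, the sketch below pulls a title and author list out of an HTML string using only the Python standard library. The names `MetadataParser`, `extract_metadata`, and `SAMPLE_HTML` are hypothetical, and the sample markup is an assumption about what standalone HTML converted from a document might look like; the tags that pandoc actually emits for an lsstdoc document would need to be checked.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}  # meta name -> list of content values
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content is not None:
                self.meta.setdefault(name, []).append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only accumulate text while inside <title>.
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return the title and authors found in an HTML document string."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "authors": parser.meta.get("author", []),
    }


# Assumed sample input, not real output from an lsstdoc document.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8" />
  <meta name="author" content="A. Author" />
  <meta name="author" content="B. Author" />
  <title>Example Title</title>
</head>
<body><p>Body text.</p></body>
</html>"""

metadata = extract_metadata(SAMPLE_HTML)
```

Keying on `<title>` and `<meta name="author">` keeps the extraction independent of LaTeX macro handling, which is the appeal of going through HTML. Fields that do not map onto standard HTML head elements would need a different hook.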

      https://pypi.python.org/pypi/pypandoc might be useful since it's a package that includes pandoc.
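A minimal sketch of how the conversion step might look with pypandoc, assuming pypandoc and a pandoc binary are available at run time. The function name `latex_to_html` and its `converter` parameter are hypothetical, introduced here only so the function can be exercised without pandoc installed. Whether pandoc's LaTeX reader copes with lsstdoc-specific macros is not addressed here.

```python
def latex_to_html(tex_source, converter=None):
    """Convert a LaTeX source string to standalone HTML.

    `converter` is a hypothetical injection point; by default the
    function falls back to pypandoc.convert_text.
    """
    if converter is None:
        import pypandoc  # third-party; imported lazily so it is optional

        converter = pypandoc.convert_text
    # --standalone asks pandoc for a full document with a <head>,
    # rather than an HTML fragment.
    return converter(
        tex_source,
        to="html5",
        format="latex",
        extra_args=["--standalone"],
    )
```

Calling `latex_to_html(source)` with the contents of a `.tex` file would then return an HTML string that a metadata extractor could consume.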

      We might need to install pandoc in the lsst-texmf Docker container so that pandoc can run inside a real LaTeX environment with the lsstdoc class pre-installed.

              People

              • Assignee:
                jsick Jonathan Sick
                Reporter:
                jsick Jonathan Sick
                Reviewers:
                Tim Jenness
                Watchers:
                Jonathan Sick, Tim Jenness
