Data Management / DM-10928

Use pandoc in metadata extraction pipeline from TeX documents for Lander & metasrc

    Details

      Description

      Lander (https://github.com/lsst-sqre/lander) uses metasrc (https://github.com/lsst-sqre/metasrc) to discover document metadata from the document source itself. Normalizing LaTeX into plain unicode text is a non-trivial challenge.

      An approach that Tim Jenness suggested is to use pandoc to convert the document into HTML first, and then extract metadata from the HTML. HTML is generally an easier format to extract information from since it's standards-based.

      https://pypi.python.org/pypi/pypandoc might be useful since it's a package that includes pandoc.

      We might need to install pandoc in the lsst-texmf docker container so that pandoc can run inside a real latex environment with the lsstdoc class pre-installed.

            Activity

            tjenness Tim Jenness added a comment -

            I think I was proposing that you do a first pass to look for the Abstract and Title in the preamble using python, then pass the abstract and title snippets to pandoc. Pandoc can be given a snippet such as:

            This is a $\alpha$ test$^5$
            

            and can quickly generate:

            <p>This is a <span class="math inline"><em>α</em></span> test<span class="math inline"><em></em><sup>5</sup></span></p>
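
As a rough illustration of the snippet conversion described in this comment, the sketch below pipes a LaTeX fragment through the pandoc command-line tool and reads back the HTML. It assumes a `pandoc` executable is on the PATH and calls it directly with `subprocess` rather than through pypandoc. The function names `pandoc_command` and `latex_snippet_to_html` are made up for this example and are not part of metasrc or Lander.

```python
import shutil
import subprocess


def pandoc_command(source_format="latex", target_format="html"):
    """Build the argument list for a stdin-to-stdout pandoc conversion."""
    return ["pandoc", "--from", source_format, "--to", target_format]


def latex_snippet_to_html(snippet):
    """Convert a LaTeX snippet to HTML, or return None if pandoc is missing."""
    # Assumption: pandoc is installed separately and discoverable on PATH.
    if shutil.which("pandoc") is None:
        return None
    result = subprocess.run(
        pandoc_command(),
        input=snippet,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pandoc reports a failure
    )
    return result.stdout.strip()


html = latex_snippet_to_html(r"This is a $\alpha$ test$^5$")
print(html if html is not None else "pandoc not found on PATH")
```

Because only a short snippet is passed in, no LaTeX installation or document class is needed for this step. The exact HTML markup may differ between pandoc versions.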
            

            tjenness Tim Jenness added a comment -

            I will also state that we can assume we are in control of the upstream tex and can fix it to make our life easier rather than trying to handle every possible eventuality.

            jsick Jonathan Sick added a comment -

            This works for standard latex commands but I've already seen examples in the wild where custom commands are being defined elsewhere and used in the title and abstract. I'll need to respect that.

            E.g. https://github.com/lsst/LDM-503/blob/c3692954a0ae23a0154bc8d0749884eb3e716a86/LDM-503.tex#L11

            tjenness Tim Jenness added a comment -

            We still have control of both ends. If something like this is required then we define an explicit set that will be supported. We can easily extract the \product for example so long as we know to expect it as part of the defined LSST latex standard.

            I'm wary of trying to support arbitrary latex macros and adding complexity to the system when it's not warranted. To be concrete, you can't throw the .tex source file at pandoc and expect the abstract to appear somewhere because pandoc doesn't know what \setDocAbstract means. The python code therefore will have to already go through the tex and prepare it for pandoc. It seems easier to scan the preamble for the title, abstract, authors and product and then pass those snippets to pandoc (maybe including the product definition at the top of the snippet, copied directly from the tex).
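
The preamble scan described above might look roughly like the following sketch. It finds the brace-delimited argument of a metadata command by counting braces, then prepends the product definition line to the snippet that would be handed to pandoc. The sample document, the `\newcommand{\product}` form of the definition, and the helper names `extract_command_argument` and `find_definition_line` are all assumptions made for illustration. The sketch also ignores `%` comments and only handles a simple optional `[...]` argument.

```python
def extract_command_argument(tex, command):
    """Return the {...} argument of the first \\command in tex, or None."""
    marker = "\\" + command
    start = tex.find(marker)
    while start != -1:
        pos = start + len(marker)
        # Skip longer command names, e.g. \titlepage when looking for \title.
        if pos < len(tex) and tex[pos].isalpha():
            start = tex.find(marker, pos)
            continue
        while pos < len(tex) and tex[pos].isspace():
            pos += 1
        # Skip a simple optional argument such as \title[short]{long}.
        if pos < len(tex) and tex[pos] == "[":
            close = tex.find("]", pos)
            if close == -1:
                return None
            pos = close + 1
        if pos >= len(tex) or tex[pos] != "{":
            start = tex.find(marker, pos)
            continue
        depth = 0
        i = pos
        while i < len(tex):
            if tex[i] == "\\":
                i += 2  # skip the escaped character, e.g. \{ or \}
                continue
            if tex[i] == "{":
                depth += 1
            elif tex[i] == "}":
                depth -= 1
                if depth == 0:
                    return tex[pos + 1:i]
            i += 1
        return None  # unbalanced braces
    return None


def find_definition_line(tex, macro):
    """Return the line that defines \\macro via \\newcommand, or None."""
    prefix = "\\newcommand{\\" + macro + "}"
    for line in tex.splitlines():
        if line.strip().startswith(prefix):
            return line.strip()
    return None


# Hypothetical document source, not copied from any real LSST document.
SAMPLE_TEX = r"""
\documentclass[DM]{lsstdoc}
\newcommand{\product}{Data Management}
\title{\product{} Test Plan}
\setDocAbstract{
This covers the \textbf{\product} tests.
}
\begin{document}
\maketitle
\end{document}
"""

title = extract_command_argument(SAMPLE_TEX, "title")
abstract = extract_command_argument(SAMPLE_TEX, "setDocAbstract").strip()
definition = find_definition_line(SAMPLE_TEX, "product")

# Snippet to pass to pandoc: the macro definition first, then the content.
title_snippet = definition + "\n" + title
print(title_snippet)
```

Keeping the definition line at the top of the snippet is meant to let pandoc expand `\product` itself, so the Python side does not need its own macro expansion.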

            jsick Jonathan Sick added a comment -

            That's fair. I like your idea of essentially white-listing what non-standard LaTeX macros can be used in these metadata commands.
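
One way to picture the white-listing idea is a check that reports any macro in a metadata snippet that is not on an agreed list. The sketch below is only illustrative. The contents of `ALLOWED_MACROS` are a made-up example, not the actual set supported by lsstdoc, metasrc, or pandoc, and `unsupported_macros` is a hypothetical helper name.

```python
import re

# Hypothetical allow-list: a few standard commands assumed to be handled by
# pandoc, plus the project-specific \product macro discussed in this thread.
ALLOWED_MACROS = {"product", "textbf", "textit", "emph", "cite", "citep", "citet"}

# Matches control words such as \product or \emph (letters only).
MACRO_PATTERN = re.compile(r"\\([A-Za-z]+)")


def unsupported_macros(snippet, allowed=ALLOWED_MACROS):
    """Return the sorted names of macros in snippet that are not allowed."""
    return sorted(set(MACRO_PATTERN.findall(snippet)) - set(allowed))


problems = unsupported_macros(r"The \product{} uses \emph{new} \foo and \bar{x}")
print(problems)
```

A check like this could run before the pandoc step so that an unexpected macro is flagged and fixed in the upstream TeX instead of being silently mangled.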

            jsick Jonathan Sick added a comment -

            These two PRs for metasrc and lander now leverage pandoc for getting clean HTML extractions of abstracts, titles, author lists, and so on. I'm also using pybtex to render citations in the extracted LaTeX snippets.

            tjenness Tim Jenness added a comment -

            Thanks for updating following my comments.

            jsick Jonathan Sick added a comment -

            Thanks Tim Jenness for all the comments.

            Lander 0.1.7 is shipped, and should be used by future PDF builds automatically.


              People

              • Assignee:
                jsick Jonathan Sick
                Reporter:
                jsick Jonathan Sick
                Reviewers:
                Tim Jenness
                Watchers:
                Jonathan Sick, Tim Jenness