# Use pandoc in metadata extraction pipeline from TeX documents for Lander & metasrc


## Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points: 10.9
• Epic Link:
• Team: SQuaRE

## Description

Lander (https://github.com/lsst-sqre/lander) uses metasrc (https://github.com/lsst-sqre/metasrc) to discover document metadata from the document source itself. Normalizing LaTeX into plain unicode text is a non-trivial challenge.

An approach that Tim Jenness suggested is to use pandoc to convert the document into HTML first, and then extract metadata from the HTML. HTML is generally an easier format to extract information from since it's standards-based.

https://pypi.python.org/pypi/pypandoc might be useful since it's a package that includes pandoc.

We might need to install pandoc in the lsst-texmf docker container so that pandoc can run inside a real latex environment with the lsstdoc class pre-installed.

## Activity

Tim Jenness added a comment -

I think I was proposing that you do a first pass to look for the Abstract and Title in the preamble using python, then pass the abstract and title snippets to pandoc. Pandoc can be given a snippet such as:

 This is a $\alpha$ test$^5$ 

and can quickly generate:

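 <p>This is a <span class="math inline"><em>α</em></span> test<span class="math inline"><em></em><sup>5</sup></span></p>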
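
A minimal sketch of that conversion from Python, assuming the `pandoc` executable is available on `PATH`. The helper names `pandoc_command` and `latex_snippet_to_html` are illustrative and are not part of metasrc or Lander; the same conversion could presumably be done through pypandoc instead of `subprocess`.

```python
import shutil
import subprocess


def pandoc_command(to_format="html"):
    """Build the pandoc argument list for reading LaTeX from stdin."""
    return ["pandoc", "--from=latex", "--to=" + to_format]


def latex_snippet_to_html(snippet):
    """Convert a LaTeX snippet to HTML by piping it through pandoc.

    Raises RuntimeError if the pandoc executable cannot be found.
    """
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        pandoc_command(),
        input=snippet,
        capture_output=True,
        encoding="utf-8",  # pandoc reads and writes UTF-8
        check=True,        # raise CalledProcessError if pandoc fails
    )
    return result.stdout.strip()
```

Calling `latex_snippet_to_html(r"This is a $\alpha$ test$^5$")` should give HTML that renders like the output above, although the exact markup may differ between pandoc versions.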



Tim Jenness added a comment -

I will also state that we can assume we are in control of the upstream tex and can fix it to make our life easier rather than trying to handle every possible eventuality.

Jonathan Sick added a comment -

This works for standard LaTeX commands, but I've already seen examples in the wild where custom commands are defined elsewhere and used in the title and abstract. I'll need to respect that. E.g. https://github.com/lsst/LDM-503/blob/c3692954a0ae23a0154bc8d0749884eb3e716a86/LDM-503.tex#L11

Tim Jenness added a comment -

We still have control of both ends. If something like this is required then we define an explicit set that will be supported. We can easily extract the \product for example so long as we know to expect it as part of the defined LSST latex standard.

I'm wary of trying to support arbitrary latex macros and adding complexity to the system when it's not warranted. To be concrete, you can't throw the .tex source file at pandoc and expect the abstract to appear somewhere because pandoc doesn't know what \setDocAbstract means. The python code therefore will have to already go through the tex and prepare it for pandoc. It seems easier to scan the preamble for the title, abstract, authors and product and then pass those snippets to pandoc (maybe including the product definition at the top of the snippet, copied directly from the tex).
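
The preamble scan described above might look roughly like the following sketch. It is not the metasrc implementation: the function names are hypothetical, and it assumes the metadata is set with `\title{...}`, `\author{...}` and `\setDocAbstract{...}`, with `\product` defined by a plain `\newcommand`. Brace matching is done by hand because a regular expression alone cannot handle nested braces.

```python
import re


def extract_braced(source, start):
    """Return the text inside the brace group that opens at source[start].

    Handles nested braces and skips backslash-escaped characters.
    """
    if source[start] != "{":
        raise ValueError("expected '{' at position %d" % start)
    depth = 0
    i = start
    while i < len(source):
        char = source[i]
        if char == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \%
            continue
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return source[start + 1:i]
        i += 1
    raise ValueError("unbalanced braces")


def find_command_argument(source, name):
    """Return the first mandatory argument of \\name{...}, or None."""
    pattern = re.compile(r"\\" + re.escape(name) + r"\s*(?:\[[^\]]*\]\s*)?\{")
    match = pattern.search(source)
    if match is None:
        return None
    return extract_braced(source, match.end() - 1)


def find_newcommand(source, name):
    """Return the full \\newcommand definition of \\name, or None."""
    pattern = re.compile(
        r"\\newcommand\s*\{?\s*\\" + re.escape(name) + r"\s*\}?\s*\{"
    )
    match = pattern.search(source)
    if match is None:
        return None
    body = extract_braced(source, match.end() - 1)
    return source[match.start():match.end()] + body + "}"


def build_snippets(source, commands=("title", "setDocAbstract", "author")):
    """Map each metadata command to a snippet that is ready for pandoc.

    The \\product definition, if present, is copied to the top of every
    snippet so that pandoc can expand it.
    """
    product = find_newcommand(source, "product")
    prefix = product + "\n" if product else ""
    snippets = {}
    for name in commands:
        value = find_command_argument(source, name)
        if value is not None:
            snippets[name] = prefix + value.strip()
    return snippets
```

Each value in the returned dictionary is a small, self-contained LaTeX fragment that can be handed to pandoc on its own, so pandoc never needs to understand `\setDocAbstract` itself.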

Jonathan Sick added a comment -

That's fair. I like your idea of essentially white-listing what non-standard LaTeX macros can be used in these metadata commands.
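
As a rough illustration of the white-listing idea, the sketch below reports any macro in a metadata snippet that is neither a supported custom macro nor in a small set of standard commands. Both sets are placeholders invented for this example; the real list would come from whatever the LSST LaTeX standard ends up defining.

```python
import re

# Hypothetical whitelist of non-standard macros allowed in metadata commands.
ALLOWED_CUSTOM_MACROS = {"product"}

# A few standard commands treated as known for this example (illustrative only).
STANDARD_MACROS = {"alpha", "emph", "textbf", "textit", "cite"}

# Matches a backslash followed by a run of letters, capturing the macro name.
MACRO_PATTERN = re.compile(r"\\([A-Za-z]+)")


def unsupported_macros(snippet):
    """Return the sorted macro names in snippet that are not whitelisted."""
    used = set(MACRO_PATTERN.findall(snippet))
    return sorted(used - ALLOWED_CUSTOM_MACROS - STANDARD_MACROS)
```

A check like this could run before the snippet is sent to pandoc, so that an unsupported macro is flagged explicitly rather than silently dropped from the output.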

Jonathan Sick added a comment -

These two PRs for metasrc and lander now leverage pandoc for getting clean HTML extractions of abstracts, titles, author lists, and so on. I'm also using pybtex to render citations in the extracted LaTeX snippets.

Tim Jenness added a comment -

Thanks for updating following my comments.

Jonathan Sick added a comment -

Thanks Tim Jenness for all the comments.

Lander 0.1.7 is shipped, and should be used by future PDF builds automatically.


## People

• Assignee: Jonathan Sick
• Reporter: Jonathan Sick
• Reviewers: Tim Jenness
• Watchers: Jonathan Sick, Tim Jenness
• Votes: 0

## Dates

• Created:
Updated:
Resolved: