Use figure index rather than xml:id attribute, which is not always present (#51)

* Use figure index rather than xml:id attribute, which is not always present, particularly when loading papers parsed from PDFs

* Preserve xml:id where present
elshimone authored Dec 3, 2023
1 parent b0e5aa9 commit 1a09f28
Showing 1 changed file with 4 additions and 3 deletions.
7 changes: 4 additions & 3 deletions src/python/paperetl/file/tei.py
@@ -242,9 +242,10 @@ def text(soup, title):
                 sections.extend([(name, x) for x in sent_tokenize(text)])

         # Extract text from tables
-        for figure in soup.find("text").find_all("figure"):
-            # Use XML Id as figure name to ensure figures are uniquely named
-            name = figure.get("xml:id").upper()
+        for i, figure in enumerate(soup.find("text").find_all("figure")):
+            # Use XML Id (if available) as figure name to ensure figures are uniquely named
+            name = figure.get("xml:id")
+            name = name.upper() if name else f"FIGURE_{i}"

             # Search for table
             table = figure.find("table")
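
The naming fallback in this change can be tried in isolation. The following is a minimal sketch, not the paperetl code itself: it assumes the standard library's xml.etree.ElementTree in place of the BeautifulSoup-style `soup` object used in tei.py, and the `SAMPLE` fragment and `figure_names` helper are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace
# (assumption: ElementTree is used here instead of the parser in tei.py)
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI-like fragment: the second figure has no xml:id,
# as can happen when a paper is parsed from a PDF
SAMPLE = """
<text>
  <figure xml:id="fig_0"><head>Figure 1</head></figure>
  <figure><table><row><cell>a</cell></row></table></figure>
  <figure xml:id="tab_1"><table/></figure>
</text>
"""


def figure_names(xml):
    """Return one name per <figure>: upper-cased xml:id, else FIGURE_<index>."""
    root = ET.fromstring(xml.strip())
    names = []
    for i, figure in enumerate(root.iter("figure")):
        # get() returns None when the attribute is missing, so calling
        # .upper() unconditionally would raise AttributeError
        name = figure.get(XML_ID)
        names.append(name.upper() if name else f"FIGURE_{i}")
    return names


print(figure_names(SAMPLE))  # ['FIG_0', 'FIGURE_1', 'TAB_1']
```

Note that the index counts every figure, including those that do have an xml:id, so fallback names are not necessarily contiguous.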
