Use figure index rather than xml:id attribute, which is not always present (#51)

* Use the figure index rather than the xml:id attribute, which is not always present, particularly when loading papers parsed from PDFs
* Preserve xml:id where present
neuml · Dec 3, 2023 · 1a09f28
1 parent b0e5aa9
commit 1a09f28
Showing 1 changed file with 4 additions and 3 deletions.
diff --git a/src/python/paperetl/file/tei.py b/src/python/paperetl/file/tei.py
@@ -242,9 +242,10 @@ def text(soup, title):
             sections.extend([(name, x) for x in sent_tokenize(text)])
 
         # Extract text from tables
-        for figure in soup.find("text").find_all("figure"):
-            # Use XML Id as figure name to ensure figures are uniquely named
-            name = figure.get("xml:id").upper()
+        for i, figure in enumerate(soup.find("text").find_all("figure")):
+            # Use XML Id (if available) as figure name to ensure figures are uniquely named
+            name = figure.get("xml:id")
+            name = name.upper() if name else f"FIGURE_{i}"
 
             # Search for table
             table = figure.find("table")
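
The naming rule introduced by this change can be sketched in isolation. This is a minimal illustration, not the project's code: `figure_name` is a hypothetical helper, and plain dicts stand in for the attributes of parsed `<figure>` elements so the example runs without an XML parser.

```python
from typing import Optional


def figure_name(xml_id: Optional[str], index: int) -> str:
    """Hypothetical helper that mirrors the naming rule in the diff above."""
    # Prefer the xml:id when the parser supplied one, upper-cased as before
    if xml_id:
        return xml_id.upper()
    # Otherwise fall back to the figure's position in document order
    return f"FIGURE_{index}"


# Plain dicts stand in for parsed <figure> attributes (illustrative data only)
figures = [{"xml:id": "fig_0"}, {}, {"xml:id": "tab_1"}, {}]
names = [figure_name(attrs.get("xml:id"), i) for i, attrs in enumerate(figures)]
print(names)  # ['FIG_0', 'FIGURE_1', 'TAB_1', 'FIGURE_3']
```

Because the index comes from `enumerate` over every figure, the fallback names stay unique among figures without an id. A collision would still be possible if a document carried an `xml:id` that itself upper-cases to a string such as `FIGURE_1`.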