Scripts for creating, analyzing and evaluating an LLM-labelled micrograph dataset from materials science preprints. A dataset of 842 micrographs is available in micrographs/, with LLM-generated labels describing what material was imaged and with which technique in micrographs/labels.csv.
NB: assumes a UNIX environment (macOS/Ubuntu) and that Python 3.10 is installed on the system.
From the root directory run (interactive mode needed for venv)
sudo bash -i build_scripts/build.sh
Alternative using conda:
sudo bash -i build_scripts/build_conda.sh
Note that on macOS you will need the Homebrew package manager, and it may be more convenient to use conda. Tested on 64-bit Ubuntu 22.04.4 LTS and macOS.
If you cannot install sbt or build the project, install the Java JDK/JRE using your system package manager (brew, apt, etc.) and set the use_jar=True flag in extract.py. This will use the standalone .jar file of pdffigures2 and the Java runtime instead of building the Scala project. Note that the pdffigures2.jar file must be in the pdffigures2 folder.
From the root directory, with your virtual environment activated and the OPENAI_API_KEY environment variable set, run
python tests.py
If you are on macOS (or using the .jar version), run
python tests.py -jar
This will run a test of the (sub-)figure & caption extraction alongside some basic caption analysis using string matching or GPT-4 on an example paper (DOI 10.1149/1945-7111/ac7a68).
- Find and store the metadata (title, abstract, authors, DOI, date and download URL) of all papers that match the search query "microscopy" on a given preprint server (arXiv, ChemRxiv). We chose a random sample of 500 ChemRxiv papers as a demonstration (see the sketch after this list).
- Download each paper to a temporary folder, extracting all figure/caption pairs and storing them alongside the metadata.
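As a rough illustration of the search step, here is a minimal sketch against arXiv's public Atom API; the function name, the fields stored, and the use of the feedparser library are illustrative assumptions, not the repo's actual implementation.

import feedparser

ARXIV_API = "http://export.arxiv.org/api/query"

def search_arxiv(query="microscopy", max_results=100):
    # arXiv serves search results as an Atom feed; feedparser parses it
    url = f"{ARXIV_API}?search_query=all:{query}&max_results={max_results}"
    feed = feedparser.parse(url)
    papers = []
    for entry in feed.entries:
        papers.append({
            "title": entry.title,
            "abstract": entry.summary,
            "authors": [a.name for a in entry.authors],
            "date": entry.published,
            "doi": entry.get("arxiv_doi"),  # not every entry has a DOI
            "pdf_url": next(
                (l.href for l in entry.links
                 if l.get("type") == "application/pdf"),
                None,
            ),
        })
    return papers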
pdffigures2: a Scala program for extracting figures and their associated captions from PDFs. This command runs the batch extractor on a single paper:
sbt "runMain org.allenai.pdffigures2.FigureExtractorBatchCli /absolute/file/path/file.pdf -m /absolute/file/path/out_imgs/ -d /absolute/file/path/out_captions/ -i 200"
- Threshold-segment the whole image into pixels that are white and pixels that are not
- Perform a binary opening to remove small gaps
- Run connected-component analysis on the non-white regions and compute a bounding box per component (see the sketch below)
NB: this assumes sub-figures are separated by white gutters of roughly >2 px. This is generally true, but not always - some figures have no whitespace (e.g., timeseries), which is relevant when performing the VLM analysis later.
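A minimal sketch of this segmentation, assuming scikit-image; the threshold, the opening footprint and the minimum-area filter are illustrative choices, not the repo's exact parameters.

import numpy as np
from skimage import io, measure, morphology

def find_subfigure_boxes(image_path, white_thresh=250, min_area=2500):
    img = io.imread(image_path, as_gray=True)
    if img.max() <= 1.0:  # as_gray often rescales intensities to [0, 1]
        img = (img * 255).astype(np.uint8)
    # 1. Threshold: anything darker than near-white is figure content
    content = img < white_thresh
    # 2. Binary opening removes small specks so thin white gutters stay open
    content = morphology.binary_opening(content, morphology.square(3))
    # 3. Connected components of non-white regions -> one box per sub-figure
    labels = measure.label(content, connectivity=2)
    return [r.bbox for r in measure.regionprops(labels) if r.area >= min_area]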
GPT-3.5/4 was prompted with the figure's caption and the paper's abstract and asked to identify whether the associated figure contained a micrograph and, if so, what technique (AFM, TEM, SEM, etc.) was used to image it and what material was being imaged, as well as any additional comments on, say, processing conditions. The .json structure for a figure that contained a micrograph was as follows:
{
"isMicrograph": "true",
"instrument": "Technique",
"material": "Description",
"comments": ["comment1", "comment2", "comment3"]
}
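A hedged sketch of what the caption-analysis call could look like with the openai>=1.0 Python client; the prompt wording, model choice and function name are illustrative, not the repo's exact prompt.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_figure(caption: str, abstract: str) -> dict:
    prompt = (
        "Given the figure caption and paper abstract below, decide whether "
        "the figure contains a micrograph. Reply only with JSON of the form "
        '{"isMicrograph": "true" or "false", "instrument": "...", '
        '"material": "...", "comments": ["..."]}.\n\n'
        f"Caption: {caption}\n\nAbstract: {abstract}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)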
We tested giving the LLM both the caption and the abstract, and the caption alone - we found that providing both worked best (in terms of balancing sensitivity and specificity). The good performance is a function of how well-structured scientific paper captions tend to be. In many ways, this is an easy task that only an LLM can do.
GPT-3.5/4 is effective at detecting whether a figure contains a micrograph, but it cannot tell you which sub-figure(s) are micrographs. To do this, we analyzed each figure that GPT-3.5/4 labelled as containing a micrograph, together with its extracted sub-figures. We fed the figure caption, the paper abstract, the specific sub-figure and the whole figure to GPT-4V and asked whether the specific sub-figure was a single micrograph (i.e., not a timeseries or an unextracted multi-panel figure).
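A sketch of this sub-figure check, again with illustrative prompt wording and helper names; images are passed to the vision model as base64 data URLs.

import base64
from openai import OpenAI

client = OpenAI()

def _data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def is_single_micrograph(subfig_path, figure_path, caption, abstract) -> bool:
    question = (
        "The first image is a sub-figure cropped from the second (full) "
        "figure. Given the caption and abstract below, is the sub-figure a "
        "single micrograph (not a timeseries or an unextracted multi-panel "
        "figure)? Answer yes or no.\n\n"
        f"Caption: {caption}\n\nAbstract: {abstract}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": _data_url(subfig_path)}},
                {"type": "image_url",
                 "image_url": {"url": _data_url(figure_path)}},
            ],
        }],
        max_tokens=10,
    )
    return "yes" in resp.choices[0].message.content.lower()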
Any single micrographs were placed into the dataset in micrographs/, with their labels, DOI, etc. in micrographs/labels.csv.
We compared the accuracy of LLM micrograph detection to two string-matching/regex approaches. The 'simple' regex flagged a micrograph if the caption contained the substring "image" or "micrograph" anywhere. The 'greedy' regex flagged a micrograph if any instrument name ("SEM", "Scanning Electron Microscopy", "TEM", etc.) was detected, or if the caption contained "image" or "micrograph" as before.
The 'simple' regex scheme was remarkably effective (if slightly conservative), probably because scientific captions are well-structured. The 'greedy' regex produced too many false positives, as expected. Illustrative versions of both baselines are sketched below.
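The exact patterns in the repo may differ; these illustrate the two schemes as described above.

import re

SIMPLE = re.compile(r"image|micrograph", re.IGNORECASE)
INSTRUMENTS = re.compile(
    r"\b(SEM|TEM|AFM|STEM"
    r"|scanning electron microscop\w*"
    r"|transmission electron microscop\w*"
    r"|atomic force microscop\w*)\b",
    re.IGNORECASE,
)

def simple_detect(caption: str) -> bool:
    # flag if "image" or "micrograph" appears anywhere in the caption
    return bool(SIMPLE.search(caption))

def greedy_detect(caption: str) -> bool:
    # flag on any instrument name, or on the 'simple' match
    return simple_detect(caption) or bool(INSTRUMENTS.search(caption))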
We wrote a custom labelling app (in labelling_app/) that allows humans to label the extracted figures and sub-figures, and to compare the LLM/regex labels to the human labels for evaluation.