This file provides the rationale for the project: the reasons behind the decisions made, the project's goals and objectives, and the justification for its implementation. It gives the context needed to understand the architecture and pipeline.
The PDFParser module is responsible for acting as a bridge between ScholarVista and the Grobid Client for Python. It processes a set of PDFs in a specified directory.
This module uses the Grobid Client for Python to analyze the PDFs.
Create an instance of the parser:
import scholarvista as sv
parser = sv.PDFParser()
Process the PDFs contained in the desired directory and specify an output directory for the TEI XML files:
parser.process_pdfs(pdf_dir='/path/to/pdf/dir', output_dir='./output')
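Once the PDFs have been processed, the generated TEI XML files need to be located before they can be parsed. The following is a minimal sketch of that step using only the standard library. The helper name `list_tei_files` is hypothetical (it is not part of ScholarVista), and it assumes the generated files end in `.xml`:

```python
from pathlib import Path


def list_tei_files(output_dir):
    """Return the XML files found directly inside output_dir, sorted by name."""
    directory = Path(output_dir)
    if not directory.is_dir():
        # Nothing has been written yet, or the path is wrong.
        return []
    # Assumption: the generated files end in ".xml" (e.g. "paper.tei.xml").
    return sorted(
        path for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() == ".xml"
    )
```

Each returned path could then be converted with `str(path)` and passed as the `file_path` argument of the TEI XML parser.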
The TeiXmlParser module is responsible for extracting the desired information from a TEI XML file output by Grobid. It can extract the following information about the papers:
- Title
- Abstract
- Body
- Links
- Number of Figures
This module searches for specific tags in the TEI XML file and iterates through their children, collecting the text within them.
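The tag search described above can be illustrated with Python's standard `xml.etree.ElementTree`. This is only a sketch under stated assumptions, not ScholarVista's actual implementation: the sample document is hand-written, the helper `element_text` is hypothetical, and the tag paths are assumptions about how a TEI file is laid out.

```python
import xml.etree.ElementTree as ET

# TEI documents declare this default namespace on their root element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample, much smaller than a real Grobid output file.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Paper</title></titleStmt>
    </fileDesc>
    <profileDesc>
      <abstract><div><p>First sentence.</p><p>Second sentence.</p></div></abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div><p>Body text.</p></div>
      <figure><head>Figure 1</head></figure>
      <figure><head>Figure 2</head></figure>
    </body>
  </text>
</TEI>"""


def element_text(element):
    """Join the text of an element and all of its descendants."""
    if element is None:
        return ""
    return " ".join(chunk.strip() for chunk in element.itertext() if chunk.strip())


root = ET.fromstring(SAMPLE_TEI)
# Find a specific tag, then walk its children to collect their text.
title = element_text(root.find(".//tei:titleStmt/tei:title", TEI_NS))
abstract = element_text(root.find(".//tei:abstract", TEI_NS))
# Counting is a matter of finding every occurrence of the tag.
figures_count = len(root.findall(".//tei:figure", TEI_NS))
```

Passing the namespace mapping to `find` and `findall` keeps the paths readable; without it, every tag would need the full `{http://www.tei-c.org/ns/1.0}` prefix.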
Create an instance of the parser:
import scholarvista as sv
parser = sv.TEIXMLParser(file_path='/path/to/tei/xml/file')
Extract the desired information:
abstract = parser.get_abstract()
body = parser.get_body()
figures_count = parser.get_figures_count()
links = parser.get_links()
title = parser.get_title()
The KeywordCloud module is responsible for generating and displaying a keyword cloud from an input text.
This module makes use of the WordCloud package.
Create an instance of KeywordCloud with one article's abstract extracted with TeiXmlParser:
import scholarvista as sv
parser = sv.TEIXMLParser(file_path='/path/to/tei/xml/file')
text = parser.get_abstract()
title = 'Abstract'
keyword_cloud = sv.KeywordCloud(text=text, title=title)
Generate and display the figure:
keyword_cloud.generate().display()
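A keyword cloud draws more frequent words at a larger size, so the underlying idea is word-frequency counting. The sketch below shows only that counting step with the standard library. It is not how the WordCloud package or KeywordCloud is implemented, and the function name `keyword_frequencies` and the short stop-word list are assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real packages ship much longer ones.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "we", "with"}


def keyword_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop-words as (word, count) pairs."""
    # Lowercase the text and keep only alphabetic runs as words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(top_n)
```

Applied to an abstract, the highest-ranked pairs correspond to the words that would appear largest in the cloud.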
The Plotter module is responsible for generating and displaying a histogram from two lists of values passed as arguments.
This module makes use of the Matplotlib package.
Generate a Plotter that will show the number of figures obtained by the TeiXmlParser:
import scholarvista as sv
xml_files = ['/path/to/tei/xml/file1', '/path/to/tei/xml/file2']
parsed_data = {}
# Parse each TEI XML file and index the extracted data by article title
for xml_file in xml_files:
    parser = sv.TEIXMLParser(file_path=xml_file)
    parsed_data[parser.get_title()] = {
        'abstract': parser.get_abstract(),
        'figures_count': parser.get_figures_count(),
        'links': parser.get_links()
    }
# Collect the number of figures of every parsed article
figures_counts = [data['figures_count'] for data in parsed_data.values()]
# One bar per article: x is the article index, y is its figure count
figures_per_article_histogram = sv.Plotter(title='Figures per Article',
                                           x_label='Article',
                                           x_data=range(len(figures_counts)),
                                           y_label='Figures',
                                           y_data=figures_counts)
Display the figure:
figures_per_article_histogram.generate().display()
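For readers unfamiliar with Matplotlib, a chart built from two lists of values can be sketched directly as shown below. This is a minimal illustration and not the Plotter implementation: the function name `bar_chart` is hypothetical, its parameters simply mirror the arguments used above, and the sample counts are made up.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt


def bar_chart(title, x_label, x_data, y_label, y_data):
    """Build a chart with one bar per (x, y) pair and return the figure."""
    fig, ax = plt.subplots()
    ax.bar(list(x_data), list(y_data))
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    return fig


figures_counts = [3, 0, 5]  # made-up sample values
fig = bar_chart("Figures per Article", "Article",
                range(len(figures_counts)), "Figures", figures_counts)
```

The returned figure can be written to disk with `fig.savefig(...)`, or shown on screen with `plt.show()` if the backend line is removed.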