A recommendation and categorization application to suggest related research papers, given an abstract.
- Installation
- How to Run?
- Files
- Overview & Motivation
- Problem Statement
- Metrics
- Analysis
- Methodology
- Results
- Conclusion
- Acknowledgements
## Installation
Python version required is >=3.7. Below are the libraries used:
- pandas
- scikit-learn
- dask
- nltk
- flask
Install all libraries using this command: `pip install -r requirements.txt`
## How to Run?
- Clone the repository.
- Create the directories `data` and `models`, if not already present.
- Download the dataset from here and move it inside the `data` directory, naming it `dataset.json`.
- Run the application: `flask run`
- Go to `http://localhost:5000/setup` to set up the data models.
- After running setup, the classification report is available to analyze in the console.
- Now, navigate to `http://localhost:5000/` and the app is ready to serve!
## Files
The following is the list of files hosted in this repository:
- `app.py`: Flask application module containing the routes
- `templates`: directory containing the HTML files
- `etl.py`: data processing module
- `requirements.txt`: project requirements
- `models`: directory containing the processed data and models
- `data`: directory containing the dataset JSON file
## Overview & Motivation
A research paper is a piece of academic writing containing original research results or an interpretation of existing results. Papers, even just their abstracts, are often long and complex to understand at first glance, which makes reading them a time-intensive process. While there is software that can auto-summarize articles, it still takes considerable effort to go through a vast number of articles to find the ones related to the paper in question.
This project attempts to solve the problem of sifting through a myriad of articles to find the ones relevant to a study, by exposing a web application in which the user can provide an abstract of an article to get a list of related articles and the categories associated with it. The application uses the arXiv research paper dataset to perform knowledge-based recommendations.
To provide a brief summary of the underlying approach taken in this project, the solution applies the TF-IDF (term frequency-inverse document frequency) transformation to the tokens, reflecting the importance of each word in the abstract with respect to the entire collection of abstracts. A one-vs-rest multilabel support vector classifier is then used to output the categories associated with the abstract. For the recommendations, cosine similarity, i.e. the cosine of the angle between two documents' TF-IDF vectors, is used to measure document similarity, where a smaller angle (and hence a higher cosine value) means greater similarity.
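As a rough illustration of the idea (a minimal sketch with made-up toy abstracts, not the project's actual code), TF-IDF vectors and their cosine similarity can be computed with scikit-learn as follows:

```python
# Minimal sketch: TF-IDF vectors for two toy abstracts and their
# cosine similarity; the abstracts are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "We study deep neural networks for image classification.",
    "Convolutional neural networks are applied to image recognition tasks.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(abstracts)  # sparse matrix, one row per abstract

# values lie in [0, 1] for TF-IDF vectors; higher means more similar
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"similarity: {score:.3f}")
```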
## Problem Statement
The goal of the project is to create a web application that helps users find similar articles and the associated categories for a given piece of text, using TF-IDF transformations, a linear-kernel support vector classifier and cosine similarity. The application is expected to be useful for conducting literature reviews, finding correlations between past research and the user's own research, etc.
## Metrics
Since negative examples vastly outnumber positive ones for each category, there is a class imbalance, for which accuracy is not a desirable metric. Precision and recall are relatively insensitive to such imbalances. Further, the F1 score is analyzed to seek a balance between precision and recall. Hence, the project evaluates the model on precision, recall and F1 score, where
- Precision = True Positives / (True Positives + False Positives)
- Recall = True Positives / (True Positives + False Negatives)
- F1-score = (2 * Precision * Recall) / (Precision + Recall)
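These are the same metrics scikit-learn computes out of the box; a small sketch with made-up binary labels for one category:

```python
# Illustrative only: precision, recall and F1 for one category's
# binary labels (y_true / y_pred are made up).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```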
## Analysis
The arXiv dataset contains a repository of ~1.9 million articles, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more. The analysis limits itself to 100,000 papers published after the year 2019.
Each item has the following keys:
- id: ArXiv ID
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info, such as number of pages and figures
- journal-ref: Information about the journal the paper was published in
- doi: Digital Object Identifier
- abstract: The abstract of the paper
- categories: Categories / tags in the ArXiv system
- versions: A version history
The articles are tagged with 155 unique categories and 20 unique 'general categories' (here, the general category is the category extracted before the first dot separator). The general category is not included in the data, hence it will be generated during preprocessing.
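A minimal sketch of that derivation, assuming the dataset file is newline-delimited JSON (one paper per line), as the Kaggle arXiv snapshot is distributed:

```python
# Read the first record and derive its general category.
import json

with open("data/dataset.json") as f:
    paper = json.loads(next(f))  # one JSON object per line

# 'categories' is a space-separated string such as "cs.LG stat.ML";
# the general category is the part before the first dot.
first_category = paper["categories"].split()[0]   # e.g. "cs.LG"
general_category = first_category.split(".")[0]   # e.g. "cs"
print(paper["title"], "->", general_category)
```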
Fig. 1 A sample list of items after the extraction and transformation phase
Fig. 2 An item in detail view
It can be seen in Fig. 2 that the abstract contains markup used to display mathematical formulas via the MathJax script; for the purposes of this analysis, these characters are removed as part of the tokenization process, as sketched below.
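An illustrative, simplified version of that cleanup (not the project's exact rules) could look like this:

```python
# Strip inline $...$ math and LaTeX-style commands before tokenizing.
import re

abstract = "We bound the error by $O(\\log n)$ using a novel \\emph{greedy} scheme."

cleaned = re.sub(r"\$[^$]*\$", " ", abstract)   # drop $...$ math spans
cleaned = re.sub(r"\\[a-zA-Z]+", " ", cleaned)  # drop \commands such as \emph
tokens = re.findall(r"[a-z]+", cleaned.lower()) # keep alphabetic tokens only
print(tokens)
# ['we', 'bound', 'the', 'error', 'by', 'using', 'a', 'novel', 'greedy', 'scheme']
```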
Fig. 3 A plot showing the number of submissions by submission date.
The above figure indicates that submissions are skewed towards recent dates.
Fig. 4 A plot showing the percentage of papers by general category, as observed in a sample of 10,000 articles
By far the most papers are tagged with Computer Science and Mathematics, followed by Condensed Matter and Physics, as shown in Fig. 4.
## Methodology
In the setup phase of the application, the following steps are executed for the recommendations:
- Load dataset
- Filter the data limiting to articles published after the year 2019 (at most 100,000 papers)
- Transform the data to add in the general category
- Use the general category to add dummy columns
- Compute and save the TF-IDF vectorizer and matrix built using the abstracts (a minimal sketch follows this list)
- Save the dataset
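A minimal sketch of the "compute and save" steps above, with illustrative file names (`models/vectorizer.pkl` and friends are assumptions, not necessarily the app's actual paths):

```python
# Fit a TF-IDF vectorizer on the abstracts and persist everything
# needed at query time; the two-row DataFrame stands in for the
# filtered dataset built in the earlier steps.
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "title": ["Toy paper on graphs", "Toy paper on groups"],
    "categories": ["cs.DM", "math.GR"],
    "abstract": ["A toy abstract about graphs.", "Another toy abstract about groups."],
})

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(df["abstract"])

with open("models/vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("models/tfidf_matrix.pkl", "wb") as f:
    pickle.dump(tfidf_matrix, f)
df.to_pickle("models/dataset.pkl")
```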
Next, to predict categories, the following steps are executed:
- Split dataset into training and test sets
- Build a LinearSVC classifier pipeline to predict categories using a OneVsRest classifier
- Use the grid search CV technique to fit the model on the best parameters (a sketch follows this list)
- Save the model
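Using the pipeline step names referenced by the parameter grid below (`vect`/`tfidf`/`clf`), a rough sketch of this classifier with made-up toy data:

```python
# OneVsRest LinearSVC pipeline tuned with GridSearchCV; the toy
# abstracts and the two dummy label columns (cs/math) are made up.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

abstracts = ["graph theory and combinatorics",
             "deep learning for computer vision",
             "quantum field theory results",
             "bayesian statistics methods"] * 6
labels = np.tile([[0, 1], [1, 0], [0, 1], [1, 0]], (6, 1))  # dummy cs/math columns

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])

X_train, X_test, y_train, y_test = train_test_split(
    abstracts, labels, test_size=0.2, random_state=0)

search = GridSearchCV(pipeline, {"clf__estimator__C": [1, 10, 100]}, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```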
This completes the preprocessing of the data. Finally, the index route can be loaded in the browser to test the application out. Provide an abstract in the input field and click on "Search" to get the similar article recommendations and the predicted categories of the text. This is done sequentially as follows:
- Load the saved dataset, vectorizer and the TF-IDF matrix
- Build another vectorizer using the saved vectorizer's vocabulary
- Compute TF-IDF matrix for the abstract
- Using the cosine similarity method, build a list of indices sorted by best match
- Fetch metadata from saved dataset and send back the list to the web server
- Load the saved model
- Make predictions
- Send back the list of categories associated with the abstract
The response then contains the similar-articles list and the categories list, which are compiled using JavaScript and displayed on the web page; a sketch of this query flow follows.
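A sketch of the query flow, reusing the illustrative file names from the setup sketch above (the query abstract and top-N value are also made up):

```python
# Load the saved artifacts, vectorize the query abstract with the
# saved vocabulary, and rank articles by cosine similarity.
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with open("models/vectorizer.pkl", "rb") as f:
    saved_vectorizer = pickle.load(f)
with open("models/tfidf_matrix.pkl", "rb") as f:
    tfidf_matrix = pickle.load(f)
df = pd.read_pickle("models/dataset.pkl")

# build a new vectorizer from the saved vectorizer's vocabulary
query_vectorizer = TfidfVectorizer(stop_words="english",
                                   vocabulary=saved_vectorizer.vocabulary_)
query_vec = query_vectorizer.fit_transform(["An abstract typed by the user."])

scores = cosine_similarity(query_vec, tfidf_matrix).ravel()
top_n = scores.argsort()[::-1][:10]  # indices of the best matches
print(df.iloc[top_n][["title", "categories"]])  # metadata sent to the web page
```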
GridSearchCV is used for hyperparameter tuning to build the model with the best parameters. The parameters used are as follows:
- `tfidf__use_idf`: whether to use inverse-document-frequency reweighting in the TF-IDF transformation
- `vect__ngram_range`: whether words are treated as unigrams, bigrams, etc., or a hybrid mix of n-grams
- `vect__max_df`: while building the vocabulary, terms with a document frequency higher than this threshold are ignored
- `vect__max_features`: caps the number of features
- `clf__estimator__C`: regularization parameter of the SVC
The following parameter grid is used:

```python
{
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'clf__estimator__C': [1, 10, 100, 1000]
}
```
To categorize the abstract, a Linear SVC model is used, where the grid search CV technique found the following best parameters:
- Use the TF-IDF transformer's inverse-document-frequency reweighting (param `use_idf` = True).
- Set the Linear SVC classifier's regularization parameter `C` to 100.
- Ignore terms with a document frequency above 0.5 (param `max_df` = 0.5).
- Set parameter `ngram_range` to (1, 2), considering both unigrams and bigrams.
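Put together, a pipeline configured with these best parameters would look roughly like the following sketch:

```python
# The grid search's reported best parameters, hard-coded into the
# same pipeline structure as the sketch above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

best_pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2), max_df=0.5)),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", OneVsRestClassifier(LinearSVC(C=100))),
])
```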
## Results
The dataset is split into training and test sets. The latter is then used to validate the model, computing the precision, recall and f1-score. The metrics associated with some categories are as follows:
1. Category: cs

```
              precision    recall  f1-score   support

           0       0.95      0.92      0.93      1212
           1       0.88      0.92      0.90       788

    accuracy                           0.92      2000
   macro avg       0.91      0.92      0.92      2000
weighted avg       0.92      0.92      0.92      2000
```

2. Category: math

```
              precision    recall  f1-score   support

           0       0.94      0.95      0.95      1451
           1       0.87      0.83      0.85       549

    accuracy                           0.92      2000
   macro avg       0.91      0.89      0.90      2000
weighted avg       0.92      0.92      0.92      2000
```

3. Category: physics

```
              precision    recall  f1-score   support

           0       0.94      0.98      0.96      1791
           1       0.68      0.45      0.54       209

    accuracy                           0.92      2000
   macro avg       0.81      0.71      0.75      2000
weighted avg       0.91      0.92      0.91      2000
```
Fig. 5 Classification reports for categories cs, math and physics.
The percentages of articles tagged per category in the actual test set and in the predicted values have very similar shapes, as shown in Fig. 6 and Fig. 7.
Fig. 6 Bar graph showing percentage of articles tagged with the 'actual' categories
Fig. 7 Bar graph showing percentage of articles tagged with the 'predicted' categories
## Conclusion
The project successfully implements text mining using TF-IDF and a Linear SVC to categorize across multiple categories. While the weighted-average precision, recall and f1-score for each category were greater than 90%, there is still room for improvement.
It is also evident that the application is useful for finding similar articles in a clean and compact list view. However, it is limited to arXiv papers and a smaller dataset size, and since it cannot parse the MathJax markup, important mathematical content in abstracts is ignored.
All in all, the project proves useful in a limited domain, but better infrastructure and upgraded techniques, along with an improved web view, would go a long way towards making it a really good tool for finding interesting papers.
Steps followed to find the similar articles:
- Load dataset
- Filter dataset to include the articles published after 2019 (at most 100,000)
- Compute the general category of each article
- Compute TF-IDF matrix
- Save model and matrix
- When a new abstract is provided, the saved vectorizer and matrix are loaded
- Using the saved vocabulary, a new vectorizer is created
- The new vectorizer generates a TF-IDF vector for the given abstract
- Cosine similarity technique generates the similarity score between each article and the abstract
- Top N articles are then filtered from it
Steps followed to predict the categories:
- Load dataset
- Split it into train-test sets
- Build a Linear SVC classifier
- Fit the grid search CV model on training set with the set of parameters
- Save the resulting model
- Load the saved model when a new abstract is provided
- Predict using the model
Learning how TF-IDF works has been the most interesting part of this project, giving me insight into how many NLP applications are built upon it. While it was mostly fun to bring this project to life, the hardest part was understanding how to save the preprocessed model and matrix and then apply them when the user invokes a search query.
Looking back, there are a lot of improvements that can be done in this project, namely:
- Extending this project to more than arXiv papers
- Generating a graph to show the connection lengths based on their similarity scores
- Enhancing the web application to take in user ratings on the search results
- If arXiv and other publications are used, considering an LDA scheme for topic modelling
- Considering the title of the article along with the abstract
- Using the categories as another dimension to sort the generated similar articles
- Adding advanced filtering and sorting mechanisms to the web application
- Finding similar articles by inputting the link or the paper ID
Also, more memory and better processing power would help process articles dating back to the 20th century, surfacing many older studies that might be relevant to the user.
## Acknowledgements
Thanks to Kaggle, Cornell University and arXiv for providing the dataset, and to Udacity for delivering the data science course materials.