Clean and package pypi
vishalmhjn committed Oct 18, 2023
1 parent 78da80c commit 0da841d
Showing 16 changed files with 160 additions and 496 deletions.
9 changes: 3 additions & 6 deletions Makefile
@@ -1,12 +1,9 @@
install:
pip install --upgrade pip &&\
pip install -r requirements.txt

lint:
pylint --disable=R,C src/call_scopus.py
pylint --disable=R,C scopus_caller/call_scopus.py
pylint --disable=R,C scopus_caller/call_semanticscholar.py

format:
black *.py

test:
python -m pytest -vv src/test_call_scopus.py
python tests/test_call_scopus.py
84 changes: 25 additions & 59 deletions README.md
@@ -20,9 +20,9 @@ access level of the article and authorized API, the article's **abstract-text**
### Semantic Scholar API

Semantic Scholar also provides an API to retrieve the article's meta-data. It is possible to obtain abstracts by
specifying the DOI of the article.
specifying the DOI of the article. Note that Semantic Scholar does not provide abstracts for every article in the SCOPUS database.
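For instance, a direct call to the v1 paper endpoint that `call_semanticscholar.py` targets looks roughly like this (a sketch; the DOI is an arbitrary placeholder):

```python
import requests

# Placeholder DOI, purely for illustration.
doi = "10.1000/xyz123"

# Query the Semantic Scholar v1 paper endpoint by DOI.
response = requests.get(f"http://api.semanticscholar.org/v1/paper/{doi}", timeout=20)
paper = response.json()

# "abstract" may be None when Semantic Scholar holds no abstract for the article.
print(paper.get("abstract"))
```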

## Install the dependencies
## Installation

1. Create a virtual environment to install all packages in and activate the environment:
_(Make sure you are in the parent folder of this project)_
@@ -34,90 +34,56 @@ specifying the DOI of the article.
source ~/.scopus-caller/bin/activate
```

2. Now install all the necessary requirements for this project using one of the following two options:
2. Install the package

```sh
pip install -r requirements.txt
pip install -i https://test.pypi.org/simple/ scopus-caller==0.1
```

OR
## Obtain the API Key

```sh
make install
```

## Add the API_KEY

1. Create a new file for the api key:

```sh
touch input/.API
```

2. If you haven't created an account on [SCOPUS](https://dev.elsevier.com) yet, go to
1. If you haven't created an account on [SCOPUS](https://dev.elsevier.com) yet, go to
[SCOPUS](https://www.elsevier.com/solutions/scopus) and create a private account or one via your university.
3. After being logged in, create a new API key [here](https://dev.elsevier.com/apikey/manage), name the label to your
2. After being logged in, create a new API key [here](https://dev.elsevier.com/apikey/manage), name the label to your
liking and leave the website input field empty _(it is not important)_.
Carefully read and understand the "API
SERVICE AGREEMENT" and "Text and Data Mining (TDM) Provisions" before using the API and the retrieved data. These
will be presented while generating the API key.
4. Paste your newly generated `api_key` to the created `.API` file in the `input` folder _(input/.API)_.
3. Copy your API key and store it in a text file.

## Unrestricted search using CLI
## Usage

First make sure you are in the `scopus_caller/src` folder, then run:
Import the library and paste the API key.

```sh
python call_scopus.py [--year YEAR] [--api API_KEY] [SEARCH_TERMS]
# import the module
import scopus_caller as sc

# paste the api here
api_key = ""
```

**Parameters**:
**Parameters of the function _get_titles_**:

- `--year` (Optional):
The upper bound of publication year for searching. If not specified, the current year will be used.
- `--api` (Optional):
The API key to use. If not specified, the API key in the `input/.API` file will be used.
- `SEARCH_TERMS`: The search terms to use.
Separate multiple search terms with spaces.
❗ When a search term has a space (e.g., "machine learning"), use **double quotations** to enclose it.
Parameters:

- api_key (str): Your Elsevier API key for authentication.
- keywords (list of str): Keywords to search for in article titles and abstracts.
- year (int, optional): The publication year to filter the articles. Default is 2023.

**Example**:

The following command will search for articles with the search terms `transportation`, `road safety` and `machine learning` published in 2023 or earlier.

```sh
python call_scopus.py --year 2023 transportation "road safety" "machine learning"
```

## Abstracts

For abstracts, you need to specify the output of the previous step as input and then run the following:

```sh
python call_semanticscholar.py path/to/scopus/results.csv
```

The results of the query are stored in the `scopus_caller/data` folder as a CSV file with the prefix **abstracts_**, followed by the same name as the input file.

Note that Semantic Scholar does not provide abstracts for every article in the SCOPUS database.
# Obtain the articles
df = sc.get_titles(api_key, ["transportation", "road safety", "transfer learning"], 2023)

## Using Keywords
# Obtain the abstracts of the above articles by passing the output of the previous step as input

Here we read a set of keywords from a dataframe with two columns and then search exhaustively using combinations of the words from the first column with the words from the second column. This helps reduce the manual effort in case you have many words to search with. Currently, it is hard-coded for a dataframe with two columns, but it could be made flexible. Please open a PR if you are interested in doing this.

In `input/keywords.csv`, add your two search terms and replace the placeholders.
First make sure you are in the `scopus_caller/src` folder, then run:

```sh
python keyword_scrapper.py ../data/keywords.csv
df = sc.get_abstracts(df)
```
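Since the package now returns plain DataFrames instead of writing into a `data/` folder, saving the results is left to the caller; for example (the file name is an arbitrary choice):

```python
# Persist the retrieved titles and abstracts for later use.
df.to_csv("abstracts_results.csv", index=False)
```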

The terms in each column should be unique keywords and need not be repeated. There can be a different number of keywords in each column. This code will iterate over column 1 (outer loop) and then iterate over column 2 (inner loop).

## Other settings

You can change the specifics of the search in `call_scopus.py`, such as joining the search terms with `OR` instead of `AND`.
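Concretely, the connector lives in the keyword join inside `get_titles`; swapping it is a one-line change, sketched below rather than exposed as a packaged option:

```python
keywords = ["transportation", "road safety"]

# Current behavior: every keyword must match (AND).
strict = " AND ".join(f'"{w}"' for w in keywords)

# Looser alternative: any keyword may match (OR).
loose = " OR ".join(f'"{w}"' for w in keywords)

# The query string is then assembled exactly as in call_scopus.py.
query = f"?query=TITLE-ABS-KEY({loose})"
```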

## Citing

This is based on the script [Scopus-Query](https://github.com/nsanthanakrishnan/Scopus-Query), so kindly cite:
Empty file removed data/.gitkeep
Empty file removed input/.gitkeep
3 changes: 0 additions & 3 deletions input/keywords.csv

This file was deleted.

3 changes: 3 additions & 0 deletions pyproject.toml
@@ -0,0 +1,3 @@
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
19 changes: 10 additions & 9 deletions requirements.txt
@@ -1,9 +1,10 @@
aiohttp
numpy
pandas
requests
pytest
click
pylint
black
pytest-cov
aiohttp>=3.8.5
numpy>=1.24.0
pandas>=2.0.2
requests>=2.31.0
pytest>=7.4.2
click>=8.1.5
pylint>=3.0.1
black>=23.10.0
nest-asyncio>=1.5.8
pytest-cov>=4.1.0
2 changes: 2 additions & 0 deletions scopuscaller/__init__.py
@@ -0,0 +1,2 @@
from .call_scopus import get_titles
from .call_semanticscholar import get_abstracts
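These re-exports make both helpers importable from the package root; a minimal sketch, assuming the installed package resolves under the `scopus_caller` name used in the README:

```python
# Package-root imports enabled by the two re-exports above.
from scopus_caller import get_titles, get_abstracts
```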
114 changes: 40 additions & 74 deletions src/call_scopus.py → scopuscaller/call_scopus.py
@@ -21,10 +21,6 @@

import pandas as pd
import requests
import argparse
from datetime import datetime

API_FILE = "../input/.API"


def create_article_dataframe(allentries):
@@ -87,84 +83,54 @@ def create_article_dataframe(allentries):
return articles


def get_arguments():
parser = argparse.ArgumentParser()
parser.add_argument(
"--year",
default=-1,
type=int,
help="Year to search for in Scopus (default: current year)",
)
parser.add_argument(
"--api",
default="",
type=str,
help="API key to use for Scopus (default: read from file)",
)
parser.add_argument("keywords", nargs="+", help="Keywords to search for in Scopus")
args = parser.parse_args()

# Get year
if args.year > 0:
year = args.year
else:
year = datetime.now().year

# Get API key
if args.api != "":
api_key = args.api
else:
api_key = open(API_FILE, "rb").readline().rstrip()
return year, api_key, args.keywords

def get_titles(api_key, keywords, year=2023):
"""
Retrieve academic articles from Scopus based on specified keywords and publication year.
Parameters:
- api_key (str): Your Elsevier API key for authentication.
- keywords (list of str): Keywords to search for in article titles and abstracts.
- year (int, optional): The publication year to filter the articles. Default is 2023.
Returns:
- pd.DataFrame: A DataFrame containing the retrieved academic articles.
"""

def wrapper(api_key, keywords, year):
url = "https://api.elsevier.com/content/search/scopus"
# Define the base URL and headers
base_url = "https://api.elsevier.com/content/search/scopus"
headers = {"X-ELS-APIKey": api_key}

# Construct the search query
search_keywords = " AND ".join(f'"{w}"' for w in keywords)
print(search_keywords)
query = f"?query=TITLE-ABS-KEY({search_keywords})"
query += f"&date=1950-{year}"
query += "&sort=relevance"
query += "&start=0"
r = requests.get(url + query, headers=headers, timeout=20)
result_len = int(r.json()["search-results"]["opensearch:totalResults"])
print(result_len)
query = f"?query=TITLE-ABS-KEY({search_keywords})&date=1950-{year}&sort=relevance&start=0"

# Send the initial request to get the total result count
response = requests.get(base_url + query, headers=headers, timeout=20)
result_len = int(response.json()["search-results"]["opensearch:totalResults"])

# Initialize a list to store all entries
all_entries = []

for start in range(0, result_len, 25):
if start < 5000: # Scopus throws an error above this value
entries = []
# query = '?query={'+first_term+'}+AND+{'+second_term+'}' #Enter the keyword inside the braces for exact phrase match
# Enter the keyword inside the double quotations for approximate phrase match
query = f"?query=TITLE-ABS-KEY({search_keywords})"
query += f"&date=1950-{year}&sort=relevance"
# query += '&subj=ENGI' # This is commented because many results might not be covered under ENGI
query += "&start=%d" % (start)
# query += '&count=%d' % (count)

r = requests.get(url + query, headers=headers, timeout=30)
if "entry" in r.json()["search-results"]:
if "error" in r.json()["search-results"]["entry"][0]:
continue
else:
entries += r.json()["search-results"]["entry"]
if len(entries) != 0:
all_entries.extend(entries)
if start >= 5000: # Scopus throws an error above this value
break

# Construct the query with pagination
query = f"?query=TITLE-ABS-KEY({search_keywords})&date=1950-{year}&sort=relevance&start={start}"

# Send the request for the current page
response = requests.get(base_url + query, headers=headers, timeout=30)

if "entry" in response.json()["search-results"]:
if "error" in response.json()["search-results"]["entry"][0]:
continue
else:
break
articles_loaded = pd.DataFrame()
articles_loaded = create_article_dataframe(all_entries)
return articles_loaded
all_entries.extend(response.json()["search-results"]["entry"])
else:
break

# Create a DataFrame from the collected entries
articles_loaded = create_article_dataframe(all_entries)

if __name__ == "__main__":
YEAR, API_KEY, KEYWORDS = get_arguments()
print(f"Current year is set to {YEAR}")
file_name = "_".join(KEYWORDS)
articles_extracted = wrapper(API_KEY, KEYWORDS, YEAR)
articles_extracted.to_csv(
f"../data/Results_{file_name}.csv", sep=",", encoding="utf-8"
)
print(f"Extraction for {KEYWORDS} completed")
print(f"Extraction for {keywords} completed")
return articles_loaded
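With the `argparse` entry point gone, the old CLI behavior (default to the current year, write results to a keyword-derived CSV) moves into caller code; a minimal sketch, assuming the `scopus_caller` import name:

```python
from datetime import datetime

import scopus_caller as sc

API_KEY = "your-elsevier-api-key"  # placeholder
KEYWORDS = ["transportation", "road safety"]

# Mirror the removed __main__ block: current year, then dump to CSV.
articles = sc.get_titles(API_KEY, KEYWORDS, datetime.now().year)
articles.to_csv(f"Results_{'_'.join(KEYWORDS)}.csv", sep=",", encoding="utf-8")
```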
src/call_semanticscholar.py → scopuscaller/call_semanticscholar.py
@@ -1,8 +1,9 @@
import aiohttp
import asyncio
import sys
import pandas as pd
from random import choice
import nest_asyncio

nest_asyncio.apply()

desktop_agents = [""]
BASE_API_URL = "http://api.semanticscholar.org/v1/paper/"
@@ -46,19 +47,35 @@ async def fetch_articles_async(df):
return list_abstracts, list_topics


if __name__ == "__main__":
df = pd.read_csv(sys.argv[1])
def get_abstracts(df):
"""
Retrieve abstracts and topics for academic articles in a DataFrame.
Parameters:
- df (pd.DataFrame): The DataFrame containing academic articles.
Returns:
- pd.DataFrame: A DataFrame with abstracts and topics added.
"""

# Print the total number of articles in the DataFrame
print(f"Total articles: {len(df)}")

# Filter out articles with no DOI
df = df[df.doi != "No Doi"]

# Print the number of articles that have a DOI
print(f"Articles with a DOI: {len(df)}")

# Run the asyncio event loop to fetch abstracts and topics asynchronously
loop = asyncio.get_event_loop()
list_abstracts, list_topics = loop.run_until_complete(fetch_articles_async(df))

# Add abstracts and topics to the DataFrame
df["abstract"] = list_abstracts
df["topics"] = list_topics

output_file = "../data/abstracts_" + sys.argv[1].split("/")[-1][:-4] + ".csv"
df.to_csv(output_file, index=None)
# Print a message indicating that the process is complete
print("Done")

return df
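Likewise, the removed `__main__` block here read a Scopus results CSV from `sys.argv` and wrote an `abstracts_`-prefixed copy; the equivalent flow through the new function, sketched with a hypothetical input path:

```python
import pandas as pd
import scopus_caller as sc

# Hypothetical path standing in for the old sys.argv[1] argument.
input_file = "Results_transportation_road_safety.csv"

df = pd.read_csv(input_file)
df = sc.get_abstracts(df)
df.to_csv("abstracts_" + input_file, index=None)
```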