Clean and package pypi
vishalmhjn committed Oct 18, 2023
1 parent 78da80c commit 0da841d
Showing 16 changed files with 160 additions and 496 deletions.
9 changes: 3 additions & 6 deletions Makefile
@@ -1,12 +1,9 @@
install:
pip install --upgrade pip &&\
pip install -r requirements.txt

lint:
pylint --disable=R,C src/call_scopus.py
pylint --disable=R,C scopus_caller/call_scopus.py
pylint --disable=R,C scopus_caller/call_semanticscholar.py

format:
black *.py

test:
python -m pytest -vv src/test_call_scopus.py
python tests/test_call_scopus.py
84 changes: 25 additions & 59 deletions README.md
@@ -20,9 +20,9 @@ access level of the article and authorized API, the article's **abstract-text**
### Semantic Scholar API

Semantic Scholar also provides an API to retrieve the article's meta-data. It is possible to obtain abstracts by
specifying the DOI of the article.
specifying the DOI of the article. Note that Semantic Scholar does not provide abstracts for every article in the SCOPUS database.
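For instance, a direct call to the v1 paper endpoint that `call_semanticscholar.py` targets looks roughly like this (a sketch; the DOI is an arbitrary placeholder):

```python
import requests

# Placeholder DOI, purely for illustration.
doi = "10.1000/xyz123"

# Query the Semantic Scholar v1 paper endpoint by DOI.
response = requests.get(f"http://api.semanticscholar.org/v1/paper/{doi}", timeout=20)
paper = response.json()

# "abstract" may be None when Semantic Scholar holds no abstract for the article.
print(paper.get("abstract"))
```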

## Install the dependencies
## Installation

1. Create a virtual environment to install all packages in and activate the environment:
_(Make sure you are in the parent folder of this project)_
@@ -34,90 +34,56 @@ specifying the DOI of the article.
source ~/.scopus-caller/bin/activate
```

2. Now install all the necessary requirements for this project using one of the following two options:
2. Install the package

```sh
pip install -r requirements.txt
pip install -i https://test.pypi.org/simple/ scopus-caller==0.1
```

OR
## Obtain the API Key

```sh
make install
```

## Add the API_KEY

1. Create a new file for the api key:

```sh
touch input/.API
```

2. If you haven't created an account on [SCOPUS](https://dev.elsevier.com) yet, go to
1. If you haven't created an account on [SCOPUS](https://dev.elsevier.com) yet, go to
[SCOPUS](https://www.elsevier.com/solutions/scopus) and create a private account or one via your university.
3. After being logged in, create a new API key [here](https://dev.elsevier.com/apikey/manage), name the label to your
2. After being logged in, create a new API key [here](https://dev.elsevier.com/apikey/manage), name the label to your
liking and leave the website input field empty _(it is not important)_.
Carefully read and understand the "API
SERVICE AGREEMENT" and "Text and Data Mining (TDM) Provisions" before using the API and the retrieved data. These
will be presented while generating the API key.
4. Paste your newly generated `api_key` to the created `.API` file in the `input` folder _(input/.API)_.
3. Copy your API key and store it in a text file.

## Unrestricted search using CLI
## Usage

First make sure you are in the `scopus_caller/src` folder, then run:
Import the library and paste the API key.

```sh
python call_scopus.py [--year YEAR] [--api API_KEY] [SEARCH_TERMS]
# import the module
import scopus_caller as sc

# paste the api here
api_key = ""
```

**Parameters**:
**Parameters of the function _get_titles_**:

- `--year` (Optional):
The upper bound of publication year for searching. If not specified, the current year will be used.
- `--api` (Optional):
The API key to use. If not specified, the API key in the `input/.API` file will be used.
- `SEARCH_TERMS`: The search terms to use.
Separate multiple search terms with spaces.
❗ When a search term has a space (e.g., "machine learning"), use **double quotations** to enclose it.
Parameters:

- api_key (str): Your Elsevier API key for authentication.
- keywords (list of str): Keywords to search for in article titles and abstracts.
- year (int, optional): The publication year to filter the articles. Default is 2023.

**Example**:

The following command will search for articles with the search terms `transportation`, `road safety` and `machine learning` published in 2023 or earlier.

```sh
python call_scopus.py --year 2023 transportation "road safety" "machine learning"
```

## Abstracts

For abstracts, you need to specify the output of the previous step as input and then run the following:

```sh
python call_semanticscholar.py path/to/scopus/results.csv
```

The results of the query are stored in the `scopus_caller/data` folder as a CSV file with the prefix **abstracts_**, followed by the same name as the input file.

Note that Semantic Scholar does not provide abstracts for every article in the SCOPUS database.
# Obtain the articles
df = sc.get_titles(api_key, ["transportation", "road safety", "transfer learning"], 2023)

## Using Keywords
# Obtain the abstracts of the above articles by passing the output of the previous step as input

Here we read a set of keywords from a dataframe with two columns and then search exhaustively using combinations of the words from the first column with the words from the second column. This helps reduce the manual effort in case you have many words to search with. Currently, it is hard-coded for a dataframe with two columns, but it could be made flexible. Please open a PR if you are interested in doing this.

In `input/keywords.csv`, add your two search terms and replace the placeholders.
First make sure you are in the `scopus_caller/src` folder, then run:

```sh
python keyword_scrapper.py ../data/keywords.csv
df = sc.get_abstracts(df)
```
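Since the package now returns plain DataFrames instead of writing into a `data/` folder, saving the results is left to the caller; for example (the file name is an arbitrary choice):

```python
# Persist the retrieved titles and abstracts for later use.
df.to_csv("abstracts_results.csv", index=False)
```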

The terms in each column should be unique keywords and need not be repeated. There can be a different number of keywords in each column. This code will iterate over column 1 (outer loop) and then iterate over column 2 (inner loop).

## Other settings

You can change the specifics of the search in `call_scopus.py`, such as joining the search terms with `OR` instead of `AND`.
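Concretely, the connector lives in the keyword join inside `get_titles`; swapping it is a one-line change, sketched below rather than exposed as a packaged option:

```python
keywords = ["transportation", "road safety"]

# Current behavior: every keyword must match (AND).
strict = " AND ".join(f'"{w}"' for w in keywords)

# Looser alternative: any keyword may match (OR).
loose = " OR ".join(f'"{w}"' for w in keywords)

# The query string is then assembled exactly as in call_scopus.py.
query = f"?query=TITLE-ABS-KEY({loose})"
```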

## Citing

This is based on the script [Scopus-Query](https://github.com/nsanthanakrishnan/Scopus-Query), so kindly cite:
Empty file removed data/.gitkeep
Empty file removed input/.gitkeep
3 changes: 0 additions & 3 deletions input/keywords.csv

This file was deleted.

3 changes: 3 additions & 0 deletions pyproject.toml
@@ -0,0 +1,3 @@
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
19 changes: 10 additions & 9 deletions requirements.txt
@@ -1,9 +1,10 @@
aiohttp
numpy
pandas
requests
pytest
click
pylint
black
pytest-cov
aiohttp>=3.8.5
numpy>=1.24.0
pandas>=2.0.2
requests>=2.31.0
pytest>=7.4.2
click>=8.1.5
pylint>=3.0.1
black>=23.10.0
nest-asyncio>=1.5.8
pytest-cov>=4.1.0
2 changes: 2 additions & 0 deletions scopuscaller/__init__.py
@@ -0,0 +1,2 @@
from .call_scopus import get_titles
from .call_semanticscholar import get_abstracts
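These re-exports make both helpers importable from the package root; a minimal sketch, assuming the installed package resolves under the `scopus_caller` name used in the README:

```python
# Package-root imports enabled by the two re-exports above.
from scopus_caller import get_titles, get_abstracts
```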
114 changes: 40 additions & 74 deletions src/call_scopus.py → scopuscaller/call_scopus.py
@@ -21,10 +21,6 @@

import pandas as pd
import requests
import argparse
from datetime import datetime

API_FILE = "../input/.API"


def create_article_dataframe(allentries):
@@ -87,84 +83,54 @@ def create_article_dataframe(allentries):
return articles


def get_arguments():
parser = argparse.ArgumentParser()
parser.add_argument(
"--year",
default=-1,
type=int,
help="Year to search for in Scopus (default: current year)",
)
parser.add_argument(
"--api",
default="",
type=str,
help="API key to use for Scopus (default: read from file)",
)
parser.add_argument("keywords", nargs="+", help="Keywords to search for in Scopus")
args = parser.parse_args()

# Get year
if args.year > 0:
year = args.year
else:
year = datetime.now().year

# Get API key
if args.api != "":
api_key = args.api
else:
api_key = open(API_FILE, "rb").readline().rstrip()
return year, api_key, args.keywords

def get_titles(api_key, keywords, year=2023):
"""
Retrieve academic articles from Scopus based on specified keywords and publication year.
Parameters:
- api_key (str): Your Elsevier API key for authentication.
- keywords (list of str): Keywords to search for in article titles and abstracts.
- year (int, optional): The publication year to filter the articles. Default is 2023.
Returns:
- pd.DataFrame: A DataFrame containing the retrieved academic articles.
"""

def wrapper(api_key, keywords, year):
url = "https://api.elsevier.com/content/search/scopus"
# Define the base URL and headers
base_url = "https://api.elsevier.com/content/search/scopus"
headers = {"X-ELS-APIKey": api_key}

# Construct the search query
search_keywords = " AND ".join(f'"{w}"' for w in keywords)
print(search_keywords)
query = f"?query=TITLE-ABS-KEY({search_keywords})"
query += f"&date=1950-{year}"
query += "&sort=relevance"
query += "&start=0"
r = requests.get(url + query, headers=headers, timeout=20)
result_len = int(r.json()["search-results"]["opensearch:totalResults"])
print(result_len)
query = f"?query=TITLE-ABS-KEY({search_keywords})&date=1950-{year}&sort=relevance&start=0"

# Send the initial request to get the total result count
response = requests.get(base_url + query, headers=headers, timeout=20)
result_len = int(response.json()["search-results"]["opensearch:totalResults"])

# Initialize a list to store all entries
all_entries = []

for start in range(0, result_len, 25):
if start < 5000: # Scopus throws an error above this value
entries = []
# query = '?query={'+first_term+'}+AND+{'+second_term+'}' #Enter the keyword inside the braces for exact phrase match
# Enter the keyword inside the double quotations for approximate phrase match
query = f"?query=TITLE-ABS-KEY({search_keywords})"
query += f"&date=1950-{year}&sort=relevance"
# query += '&subj=ENGI' # This is commented because many results might not be covered under ENGI
query += "&start=%d" % (start)
# query += '&count=%d' % (count)

r = requests.get(url + query, headers=headers, timeout=30)
if "entry" in r.json()["search-results"]:
if "error" in r.json()["search-results"]["entry"][0]:
continue
else:
entries += r.json()["search-results"]["entry"]
if len(entries) != 0:
all_entries.extend(entries)
if start >= 5000: # Scopus throws an error above this value
break

# Construct the query with pagination
query = f"?query=TITLE-ABS-KEY({search_keywords})&date=1950-{year}&sort=relevance&start={start}"

# Send the request for the current page
response = requests.get(base_url + query, headers=headers, timeout=30)

if "entry" in response.json()["search-results"]:
if "error" in response.json()["search-results"]["entry"][0]:
continue
else:
break
articles_loaded = pd.DataFrame()
articles_loaded = create_article_dataframe(all_entries)
return articles_loaded
all_entries.extend(response.json()["search-results"]["entry"])
else:
break

# Create a DataFrame from the collected entries
articles_loaded = create_article_dataframe(all_entries)

if __name__ == "__main__":
YEAR, API_KEY, KEYWORDS = get_arguments()
print(f"Current year is set to {YEAR}")
file_name = "_".join(KEYWORDS)
articles_extracted = wrapper(API_KEY, KEYWORDS, YEAR)
articles_extracted.to_csv(
f"../data/Results_{file_name}.csv", sep=",", encoding="utf-8"
)
print(f"Extraction for {KEYWORDS} completed")
print(f"Extraction for {keywords} completed")
return articles_loaded
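With the `argparse` entry point gone, the old CLI behavior (default to the current year, write results to a keyword-derived CSV) moves into caller code; a minimal sketch, assuming the `scopus_caller` import name:

```python
from datetime import datetime

import scopus_caller as sc

API_KEY = "your-elsevier-api-key"  # placeholder
KEYWORDS = ["transportation", "road safety"]

# Mirror the removed __main__ block: current year, then dump to CSV.
articles = sc.get_titles(API_KEY, KEYWORDS, datetime.now().year)
articles.to_csv(f"Results_{'_'.join(KEYWORDS)}.csv", sep=",", encoding="utf-8")
```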
src/call_semanticscholar.py → scopuscaller/call_semanticscholar.py
@@ -1,8 +1,9 @@
import aiohttp
import asyncio
import sys
import pandas as pd
from random import choice
import nest_asyncio

nest_asyncio.apply()

desktop_agents = [""]
BASE_API_URL = "http://api.semanticscholar.org/v1/paper/"
@@ -46,19 +47,35 @@ async def fetch_articles_async(df):
return list_abstracts, list_topics


if __name__ == "__main__":
df = pd.read_csv(sys.argv[1])
def get_abstracts(df):
"""
Retrieve abstracts and topics for academic articles in a DataFrame.
Parameters:
- df (pd.DataFrame): The DataFrame containing academic articles.
Returns:
- pd.DataFrame: A DataFrame with abstracts and topics added.
"""

# Print the total number of articles in the DataFrame
print(f"Total articles: {len(df)}")

# Filter out articles with no DOI
df = df[df.doi != "No Doi"]

# Print the number of articles that have a DOI
print(f"Articles with a DOI: {len(df)}")

# Run the asyncio event loop to fetch abstracts and topics asynchronously
loop = asyncio.get_event_loop()
list_abstracts, list_topics = loop.run_until_complete(fetch_articles_async(df))

# Add abstracts and topics to the DataFrame
df["abstract"] = list_abstracts
df["topics"] = list_topics

output_file = "../data/abstracts_" + sys.argv[1].split("/")[-1][:-4] + ".csv"
df.to_csv(output_file, index=None)
# Print a message indicating that the process is complete
print("Done")

return df
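Likewise, the removed `__main__` block here read a Scopus results CSV from `sys.argv` and wrote an `abstracts_`-prefixed copy; the equivalent flow through the new function, sketched with a hypothetical input path:

```python
import pandas as pd
import scopus_caller as sc

# Hypothetical path standing in for the old sys.argv[1] argument.
input_file = "Results_transportation_road_safety.csv"

df = pd.read_csv(input_file)
df = sc.get_abstracts(df)
df.to_csv("abstracts_" + input_file, index=None)
```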