The open_connectedpapers project aims to replicate the functionality of Connected Papers, a website for exploring scholarly papers. It crawls data on scholarly articles from Google Scholar and visualizes attributes of these papers, such as citation counts, publication years, and inter-citation relationships.
- Web Scraping: Utilizes web scraping techniques to gather paper data from Google Scholar.
- Data Parsing: Parses HTML content using BeautifulSoup to extract paper metadata like title, URL, publication year, and citation count.
- Citation Graph: Constructs a citation graph to visualize the relationships between papers.
- Visualization: Employs Pyecharts to visualize the citation graph with nodes representing papers and edges representing citations.
- Interactive Interface: Users can explore the citation graph interactively.
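The citation graph feature can be illustrated with a minimal sketch. It assumes each crawled paper is a record with a title, year, citation count, and a list of the papers that cite it; the field names (`cited_by`, `citations`) and the sample data are hypothetical and may not match the project's actual JSON schema. The output is a pair of lists of dicts keyed by `name` and by `source`/`target`, which is the node and link shape used by ECharts graph series and accepted by the Pyecharts `Graph` chart.

```python
from typing import Dict, List, Tuple

# Hypothetical records; the project's real data layout may differ.
papers: Dict[str, dict] = {
    "A": {"title": "Paper A", "year": 2015, "citations": 120, "cited_by": ["B", "C"]},
    "B": {"title": "Paper B", "year": 2018, "citations": 40, "cited_by": ["C"]},
    "C": {"title": "Paper C", "year": 2021, "citations": 5, "cited_by": []},
}


def build_nodes_and_links(papers: Dict[str, dict]) -> Tuple[List[dict], List[dict]]:
    """Turn paper records into node and link lists for a graph chart."""
    # One node per paper; "value" carries the citation count for tooltips.
    nodes = [{"name": p["title"], "value": p["citations"]} for p in papers.values()]

    links = []
    for paper in papers.values():
        for citing_id in paper["cited_by"]:
            # Skip citing papers that were not crawled, so every link
            # points at a node that exists in the graph.
            if citing_id in papers:
                links.append(
                    {"source": papers[citing_id]["title"], "target": paper["title"]}
                )
    return nodes, links


nodes, links = build_nodes_and_links(papers)
```

Each edge here points from the citing paper to the cited paper; reversing `source` and `target` would flip the arrow direction if the opposite convention is preferred.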
- Python 3.x
- Selenium: For web automation.
- BeautifulSoup: For HTML parsing.
- Pyecharts: For data visualization.
- Matplotlib: For color mapping.
Ensure you have Python installed on your system. Then, install the required packages using pip:
pip install -r requirements.txt
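Before running the code, make sure a WebDriver for your browser is available to Selenium. Selenium 4.6 and later can usually fetch a matching driver automatically through Selenium Manager; otherwise, download the driver for your browser (e.g., Microsoft Edge WebDriver for Edge) from the browser vendor, following the links in the Selenium documentation.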
Execute the provided Python script open_connectedpapers.py:
python open_connectedpapers.py
The code consists of several functions to perform different tasks:
- build_graph(): Placeholder function to build the citation graph (yet to be implemented).
- parse_paper_div(paper_div, papers_data): Extracts paper information from HTML div elements.
- get_all_papers_info(paper_url): Gathers paper data by crawling Google Scholar.
- get_paper_data(): Retrieves paper data, including citation information, and saves it in JSON format.
- visualize(): Visualizes the citation graph using Pyecharts.
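As an illustration of the kind of extraction parse_paper_div performs, the sketch below pulls a citation count and a publication year out of plain text strings. It is not the project's implementation: the helper names are hypothetical, and it assumes the text has already been read from the result's HTML (for example with BeautifulSoup), that the citation link reads "Cited by N", and that the byline contains a four-digit year.

```python
import re
from typing import Optional


def parse_citation_count(text: str) -> int:
    """Return the number after 'Cited by', or 0 if the phrase is absent."""
    match = re.search(r"Cited by\s+(\d+)", text)
    return int(match.group(1)) if match else 0


def parse_year(text: str) -> Optional[int]:
    """Return the last four-digit year (1900-2099) in the text, if any."""
    matches = re.findall(r"\b(19\d{2}|20\d{2})\b", text)
    return int(matches[-1]) if matches else None


# Sample strings made up for illustration.
byline = "J Smith, A Lee - Nature, 2019 - nature.com"
record = {
    "year": parse_year(byline),
    "citations": parse_citation_count("Cited by 123"),
}
```

Returning 0 or None for missing fields keeps a single malformed result from aborting a whole crawl; the caller can decide whether to drop or keep such records.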
You can customize the code according to your requirements:
- Adjust web scraping parameters such as URLs, search queries, or pagination settings.
- Modify visualization settings like node size, color scheme, or graph layout.
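One way to keep these customization points in a single place is a small settings object, sketched below. Everything here is an assumption made for illustration: the `CrawlSettings` class, the `page_urls` and `node_size` helpers, the default query, and the use of `q` and `start` as the search and pagination parameters are not taken from the project's code.

```python
import math
from dataclasses import dataclass
from typing import List
from urllib.parse import urlencode


@dataclass
class CrawlSettings:
    """Hypothetical scraping settings; adjust to match the actual script."""
    base_url: str = "https://scholar.google.com/scholar"
    query: str = "graph neural networks"
    pages: int = 3
    results_per_page: int = 10


def page_urls(settings: CrawlSettings) -> List[str]:
    """Build one search URL per result page, offset by results_per_page."""
    urls = []
    for page in range(settings.pages):
        params = urlencode(
            {"q": settings.query, "start": page * settings.results_per_page}
        )
        urls.append(f"{settings.base_url}?{params}")
    return urls


def node_size(citations: int, min_size: float = 8.0, scale: float = 6.0) -> float:
    """Map a citation count to a node size on a logarithmic scale."""
    # log10(1 + n) keeps highly cited papers from dwarfing the rest.
    return min_size + scale * math.log10(1 + max(citations, 0))


urls = page_urls(CrawlSettings())
```

The logarithmic scale is a design choice rather than a requirement: citation counts tend to span several orders of magnitude, so a linear mapping would make most nodes nearly invisible next to a few very large ones.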
The open_connectedpapers project provides a foundation for exploring scholarly paper citation networks. By leveraging web scraping and visualization techniques, it enables users to analyze and visualize citation patterns, facilitating research and academic exploration.
Please use this tool responsibly and respect the terms of service of the websites from which data is being scraped. Unauthorized or excessive web scraping may violate these terms and may have legal implications.