A universal scraper that grabs text from multiple types of webpages.

caimeng2/UniScraper

UniScraper

Description

UniScraper is a universal scraper that collects text from multiple types of webpages. It currently supports HTML (including dynamic webpages that use JavaScript), online PDFs, Word documents, presentation slides, and spreadsheets.
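To illustrate the idea of handling several document types behind one entry point, here is a minimal sketch that guesses the document kind from a URL's file extension. It is not UniScraper's actual implementation: the function name `guess_document_kind`, the extension table, and the fallback to HTML are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to document kind.
# The extensions listed here are an assumption, not taken from UniScraper.
_KIND_BY_SUFFIX = {
    ".pdf": "pdf",
    ".doc": "word",
    ".docx": "word",
    ".ppt": "slides",
    ".pptx": "slides",
    ".xls": "spreadsheet",
    ".xlsx": "spreadsheet",
}


def guess_document_kind(url: str) -> str:
    """Return a coarse document kind for a URL, defaulting to 'html'."""
    # Use only the path component so query strings and fragments are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return _KIND_BY_SUFFIX.get(suffix, "html")
```

A real scraper would probably also inspect the Content-Type response header, since many URLs serve documents without a telling extension.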

Installation instructions

Clone the git repo:

git clone https://github.com/caimeng2/UniScraper.git

Create and activate a conda environment by running the following commands:

conda env create --prefix ./envs --file environment.yml

conda activate ./envs

Dependencies

bs4 webdriver_manager pandas selenium requests python-docx python-pptx pdfminer
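After setting up the environment, a quick check like the following sketch can confirm that the dependencies are importable. The helper `missing_modules` is hypothetical and not part of UniScraper. Note that python-docx and python-pptx are imported as `docx` and `pptx`.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above
# (python-docx installs as "docx", python-pptx as "pptx").
REQUIRED_MODULES = [
    "bs4",
    "webdriver_manager",
    "pandas",
    "selenium",
    "requests",
    "docx",
    "pptx",
    "pdfminer",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

Running the script inside the activated conda environment lists any module that still needs to be installed.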

Example usage

Run example.ipynb to see example usage.
