$ pip install GoogleNewsScraper
from GoogleNewsScraper import GoogleNewsScraper
GoogleNewsScraper(driver)
Constructor Parameters
Name | Type | Required |
---|---|---|
driver | web driver | no |
Possible values:
- 'chrome': The driver will default to use this package's chrome driver
- A path to some driver (FireFox, for instance) stored on the user's system
This method is publicly accessible, though it is really intended to be used only by the class itself
locate_html_element(self, driver, element, selector, wait_seconds)
Name | Type | Required | Description |
---|---|---|---|
driver | web driver | yes | A web driver (Chrome, FireFox, etc) |
element | string | yes | Id or class selector of an HTML element |
selector | selenium By constant | yes | Locator strategy used to interpret 'element' (see below) |
wait_seconds | int | no | Waits a certain number of seconds in order to locate an HTML element |
To configure the 'selector' param:
First install selenium
$ pip install selenium
Then import By
from selenium.webdriver.common.by import By
Possible values:
By.ID
By.CLASS_NAME
By.CSS_SELECTOR
By.LINK_TEXT
By.NAME
By.PARTIAL_LINK_TEXT
By.TAG_NAME
By.XPATH
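As a rough illustration of how 'element', 'selector' and 'wait_seconds' fit together, here is a minimal sketch of a retry-until-timeout lookup. It is not the package's implementation: `locate_with_wait` and its `poll` parameter are hypothetical names, and the hand-written polling loop stands in for Selenium's own explicit waits. The only Selenium call it relies on is `driver.find_element(by, value)`.

```python
import time


def locate_with_wait(driver, element, selector, wait_seconds=5, poll=0.25):
    """Hypothetical helper: retry driver.find_element until it succeeds or time runs out."""
    deadline = time.monotonic() + wait_seconds
    while True:
        try:
            # Selenium drivers take (by, value), e.g. (By.ID, "search-box")
            return driver.find_element(selector, element)
        except Exception:
            # Broad catch keeps this sketch dependency-free; with Selenium
            # installed you would catch NoSuchElementException instead.
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)
```

With a real driver, a call might look like `locate_with_wait(driver, "search-box", By.ID, wait_seconds=10)`, where "search-box" is a made-up element id.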
GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb) -> list or None
Name | Type | Required | Description |
---|---|---|---|
search_text | str | yes | One or more words that will be entered into the Google search engine |
date_range | str | no | Filters articles by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives |
pages | str or int | no | Number of pages that should be scraped (defaults to 'max') |
pagination_pause_per_page | int | no | Number of seconds to wait before a new page is scraped (defaults to 2). This may have to be increased if Google prevents you from scraping all pages. |
cb | function | no | Callback that is called for every page scraped and receives all article data on that page (defaults to False) |
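
The parameters above could be passed as keyword arguments along these lines. This is only a sketch: `run_search` is a hypothetical wrapper, `scraper` is assumed to be an already constructed GoogleNewsScraper instance, and the specific values (2 pages, a 4 second pause) are arbitrary choices for illustration.

```python
def run_search(scraper, search_text):
    """Hypothetical wrapper around the documented search() parameters."""
    return scraper.search(
        search_text=search_text,
        date_range="Past week",        # one of the values listed in the table
        pages=2,                       # scrape two pages instead of 'max'
        pagination_pause_per_page=4,   # pause longer than the default of 2
    )
```

Because 'cb' is left out here, the call is expected to return the article data rather than hand it to a callback.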
- Example using 'cb' parameter:
def handle_page_data(page_data: list):
    # Do something with page_data
    pass
GoogleNewsScraper(...args).search(...args, cb=handle_page_data)
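A slightly fuller callback might accumulate results as each page arrives. The sketch below assumes that `page_data` is a list of dict-like article objects; the `collected` list is a hypothetical name introduced for illustration.

```python
collected = []  # hypothetical module-level store for every article seen so far


def handle_page_data(page_data: list):
    # Assumption: page_data holds one article object per news article on the page
    collected.extend(page_data)
    print(f"received {len(page_data)} articles, {len(collected)} in total")
```

It would be passed in the same way as above, with `cb=handle_page_data`.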
NOTE:
- If no argument is provided for 'cb', the search method will return a two-dimensional list, with one inner list per page scraped
- Each inner list will contain an object of news article data for every news article on that page
Example of the data that every article object will contain:
- 'id': A unique id for every article data object
- 'description': The preview description of the news article
- 'title': The title of the news article
- 'source': The source of the news article (New York Times, for instance)
- 'image_url': The URL of the preview news article image
- 'url': A link to the news article
- 'date_time': A datetime string that represents when the article was published
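To show how the two-dimensional result might be consumed, here is a small sketch that flattens the pages and groups titles by source. It assumes the article objects behave like dicts with the keys listed above; `flatten_articles`, `titles_by_source` and the sample data are hypothetical and not part of the package.

```python
from itertools import chain

# Made-up sample shaped like the documented return value: one inner list per page
sample_pages = [
    [
        {"id": "1", "title": "First headline", "source": "Example Times",
         "url": "https://example.com/a"},
        {"id": "2", "title": "Second headline", "source": "Example Post",
         "url": "https://example.com/b"},
    ],
    [
        {"id": "3", "title": "Third headline", "source": "Example Times",
         "url": "https://example.com/c"},
    ],
]


def flatten_articles(pages):
    """Turn the list of pages into one flat list of article objects."""
    return list(chain.from_iterable(pages))


def titles_by_source(pages):
    """Group article titles under the source that published them."""
    grouped = {}
    for article in flatten_articles(pages):
        grouped.setdefault(article["source"], []).append(article["title"])
    return grouped
```

The 'date_time' field is left untouched here because its exact string format is not specified above.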