Scraped data is available in folders inside the respective newspaper directory, within the 'articleData' directory.
Each .json file in the articleData directory stores one article's data in valid JSON format. Each JSON object has 5 keys:
- 'Title' : Heading of the article
- 'Content' : Body of the article
- 'Date' : Date the article was published
- 'Author' : Author(s) of the article
- 'Link' : URL of that article
Note: Some articles will have "NULL" in their 'Author' key; this is because those articles are op-eds or opinion pieces that don't necessarily have a named author (e.g. letters to the editor).
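As an illustration, a single article record can be parsed like this. The values below are hypothetical placeholders, not contents of the actual dataset:

```python
import json

# Hypothetical example of one article record (all values are placeholders,
# not taken from the real dataset).
sampleLine = json.dumps({
    "Title": "Example headline",
    "Content": "First paragraph. , Second paragraph",
    "Date": "2019-01-01",
    "Author": "NULL",  # op-eds / letters to the editor have "NULL" here
    "Link": "https://example.com/article",
})

articleData = json.loads(sampleLine)
print(sorted(articleData.keys()))
```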
import json
import os

articleDataDirectoryPath = ""  # whatever the path of the articleData directory is
fileNameList = os.listdir(articleDataDirectoryPath)  # gets a list of file names
for fileName in fileNameList:
    absFilePath = os.path.join(articleDataDirectoryPath, fileName)
    with open(absFilePath) as f:
        for line in f:
            articleData = json.loads(line)
            # use the data:
            title = articleData['Title']
            content = articleData['Content']
            date = articleData['Date']
            url = articleData['Link']
            author = articleData['Author']  # use a check for "NULL" author if you wish
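A minimal sketch of the "NULL" author check mentioned above. The record here is hypothetical; a real one would come from a file in the articleData directory:

```python
import json

# Hypothetical article record (placeholder values only).
articleData = json.loads(
    '{"Title": "Example", "Content": "Body", "Date": "2019-01-01", '
    '"Author": "NULL", "Link": "https://example.com"}'
)

# Treat "NULL" authors (op-eds, letters to the editor) separately.
if articleData["Author"] == "NULL":
    author = None  # no named author for this article
else:
    author = articleData["Author"]

print(author)
```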
To split the content into paragraphs, use the Python split() method on the 'Content' value as shown below:
with open(jsonFilePath) as f:
    for line in f:
        articleData = json.loads(line)
        articleContent = articleData["Content"]
        articleParagraphs = articleContent.split(". , ")  # delimiter for a new paragraph
        paragraphCounter = 1
        for paragraph in articleParagraphs:
            print("Paragraph " + str(paragraphCounter) + " >> " + paragraph)
            paragraphCounter += 1
        print('\n')
        paragraphCounter = 1
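For example, on a hypothetical 'Content' string (not taken from the dataset), the ". , " delimiter splits the text as follows:

```python
# Hypothetical content string; real articles use the same ". , " paragraph delimiter.
content = "First sentence of paragraph one. , Paragraph two starts here"
paragraphs = content.split(". , ")
print(paragraphs)  # ['First sentence of paragraph one', 'Paragraph two starts here']
```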
Newspapers scraped:
- LA Times
- Seattle Times
- Houston Chronicle
- Chicago Tribune
- Philly