Scraped data is available in folders inside the respective newspaper directory, within the 'articleData' directory.
Each .json file in the articleData directory stores one article's data in valid JSON format. Each JSON object has 5 keys:
- 'Title' : Heading of the article
- 'Content' : Body of the article
- 'Date' : Date the article was published
- 'Author' : Author(s) of the article
- 'Link' : URL of that article
Note: Some articles will have "NULL" in their 'Author' key; this is because those articles are op-eds or opinion pieces that don't necessarily have a named author (e.g. letters to the editor).
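As an illustration, a single article record can be parsed like this. The values below are hypothetical placeholders, not contents of the actual dataset:

```python
import json

# Hypothetical example of one article record (all values are placeholders,
# not taken from the real dataset).
sampleLine = json.dumps({
    "Title": "Example headline",
    "Content": "First paragraph. , Second paragraph",
    "Date": "2019-01-01",
    "Author": "NULL",  # op-eds / letters to the editor have "NULL" here
    "Link": "https://example.com/article",
})

articleData = json.loads(sampleLine)
print(sorted(articleData.keys()))
```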
import json
import os

articleDataDirectoryPath = ""  # whatever the path of the articleData directory is
fileNameList = os.listdir(articleDataDirectoryPath)  # gets a list of file names
for fileName in fileNameList:
    absFilePath = os.path.join(articleDataDirectoryPath, fileName)
    with open(absFilePath) as f:
        for line in f:
            articleData = json.loads(line)
            # use the data:
            title = articleData['Title']
            content = articleData['Content']
            date = articleData['Date']
            url = articleData['Link']
            author = articleData['Author']  # use a check for "NULL" author if you wish
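A minimal sketch of the "NULL" author check mentioned above. The record here is hypothetical; a real one would come from a file in the articleData directory:

```python
import json

# Hypothetical article record (placeholder values only).
articleData = json.loads(
    '{"Title": "Example", "Content": "Body", "Date": "2019-01-01", '
    '"Author": "NULL", "Link": "https://example.com"}'
)

# Treat "NULL" authors (op-eds, letters to the editor) separately.
if articleData["Author"] == "NULL":
    author = None  # no named author for this article
else:
    author = articleData["Author"]

print(author)
```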
To split the content into paragraphs, use the Python split() method on the 'Content' value as shown below:
with open(jsonFilePath) as f:
    for line in f:
        articleData = json.loads(line)
        articleContent = articleData["Content"]
        articleParagraphs = articleContent.split(". , ")  # delimiter for a new paragraph
        paragraphCounter = 1
        for paragraph in articleParagraphs:
            print("Paragraph " + str(paragraphCounter) + " >> " + paragraph)
            paragraphCounter += 1
        print('\n')
        paragraphCounter = 1
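For example, on a hypothetical 'Content' string (not taken from the dataset), the ". , " delimiter splits the text as follows:

```python
# Hypothetical content string; real articles use the same ". , " paragraph delimiter.
content = "First sentence of paragraph one. , Paragraph two starts here"
paragraphs = content.split(". , ")
print(paragraphs)  # ['First sentence of paragraph one', 'Paragraph two starts here']
```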
Newspapers scraped:
- LA Times
- Seattle Times
- Houston Chronicle
- Chicago Tribune
- Philly