This project classifies news articles as real or fake using a machine learning algorithm. The algorithm used is a Passive Aggressive Classifier, an online learning algorithm that is applied here to binary classification.
The following packages are required to run this project:
- newsapi-python
- pandas
- scikit-learn
To install these packages, run the following command:
pip install newsapi-python pandas scikit-learn
The training data used to train the classifier is taken from Kaggle.
The following modules are required to interact with the News API and create the model:
- NewsApiClient from newsapi
- random
To interact with the News API, an API key is required. The API key can be obtained by registering at https://newsapi.org/. After obtaining the API key, follow the steps below:
- Create a client by calling NewsApiClient() and passing the API key to it.
- Create a method to get the news data from the API using the client's get_everything() method.
- Pass the required parameters to the get_everything() method:
  - sources
  - domains
  - from_param
  - to
  - language
  - sort_by
  - page
- After getting the results from the API, add the results to an array and return that array.
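The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name get_news, its arguments, the choice of language and sort_by values, and the decision to keep only each article's title and content are all assumptions. The client is passed in as an argument so that the function works with any object exposing get_everything().

```python
def get_news(client, source_id, page=1):
    """Fetch one page of articles for a source and return (title, text) pairs.

    `client` is expected to be a NewsApiClient instance (or any object with a
    compatible get_everything() method). The parameter values used here are
    illustrative assumptions.
    """
    response = client.get_everything(
        sources=source_id,    # comma-separated source IDs, e.g. "bbc-news"
        language="en",        # assumed: English articles only
        sort_by="relevancy",  # assumed ordering
        page=page,
    )

    results = []
    # The response is a dict; the articles live under the "articles" key.
    for article in response.get("articles", []):
        results.append((article.get("title"), article.get("content")))
    return results


# Typical usage (requires the newsapi-python package and a real API key):
#
#   from newsapi import NewsApiClient
#   newsapi = NewsApiClient(api_key="YOUR_API_KEY")
#   articles = get_news(newsapi, "bbc-news")
```

The other parameters listed above (domains, from_param, to) can be passed to get_everything() in the same way when a narrower query is needed.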
The News API has over 3000 authenticated news sources. To get news from these sources, follow the steps below:
- Get all the sources from the News API using the get_sources() method.
- Add the ID of each source to a list.
- Truncate the list to a size of 10 using the del keyword, and get news from those sources.
- Use a loop to iterate through all the sources in sourceList and use the getNews() method to get news from each source.
- Add all the returned news to a list.
- Use the from_records() method from pandas.DataFrame to create a new DataFrame from the list.
- Add new column headings to the DataFrame using the dataframe.columns attribute.
- Load the training data from the news.csv file in the same directory using the read_csv() method from pandas.
- Add the column headings to the DataFrame.
- Use the concat() method from pandas to concatenate both DataFrames.
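As a rough sketch of the steps above, assuming the column headings are title, text, and label (the source does not name them) and that each fetched article has already been turned into a (title, text, label) tuple. The helper names collect_source_ids, gather_news, and build_dataset are hypothetical, and the fetch argument stands in for the getNews() method.

```python
import pandas as pd

# Assumed column headings; the real project may use different names.
COLUMNS = ["title", "text", "label"]


def collect_source_ids(sources_response, limit=10):
    """Extract source IDs from a get_sources() response and keep the first `limit`."""
    source_list = [source["id"] for source in sources_response["sources"]]
    del source_list[limit:]  # truncate the list in place with the del keyword
    return source_list


def gather_news(source_list, fetch):
    """Call `fetch(source_id)` for every source and collect all returned rows."""
    all_news = []
    for source_id in source_list:
        all_news.extend(fetch(source_id))
    return all_news


def build_dataset(api_rows, training_df):
    """Combine rows fetched from the API with the training DataFrame."""
    api_df = pd.DataFrame.from_records(api_rows)
    api_df.columns = COLUMNS  # assumes api_rows is non-empty with three fields each

    training_df = training_df.copy()
    training_df.columns = COLUMNS

    # Stack both frames and renumber the index from 0.
    return pd.concat([api_df, training_df], ignore_index=True)


# Typical usage, assuming news.csv holds exactly these three columns:
#
#   training_df = pd.read_csv("news.csv")
#   dataset = build_dataset(api_rows, training_df)
```

Assigning to the columns attribute requires the DataFrame to have exactly as many columns as there are headings, so any extra columns in news.csv would need to be dropped first.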
To create the training model, the following modules are required:
- train_test_split from sklearn.model_selection
- CountVectorizer from sklearn.feature_extraction.text
- PassiveAggressiveClassifier from sklearn.linear_model
- accuracy_score from sklearn.metrics
Follow the steps below to train the model:
- Split the DataFrame into training and testing data using the train_test_split() function.
- Use 70% of the data for training and 30% for testing.
- Pass the combination of title, text, and news labels to the *arrays parameter of the train_test_split() function.
- Use CountVectorizer to create a matrix of token counts from the text documents.
- Create a PassiveAggressiveClassifier model to classify real news from fake news.
- Test the model using the test data and calculate the model's accuracy using the accuracy_score function.
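A minimal sketch of the training steps above, under a few assumptions: the title and text of each article are joined into a single string before splitting (one possible reading of "the combination of title, text"), the classifier is left at its default settings apart from a fixed random seed, and the function name train_and_score is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_score(titles, texts, labels, seed=0):
    """Train a PassiveAggressiveClassifier and return (model, vectorizer, accuracy)."""
    # Assumption: title and text are merged into one document per article.
    documents = [f"{title} {text}" for title, text in zip(titles, texts)]

    # 70% of the data for training, 30% for testing.
    x_train, x_test, y_train, y_test = train_test_split(
        documents, labels, test_size=0.3, random_state=seed
    )

    # Learn the vocabulary on the training documents only, then reuse it
    # to turn the test documents into token-count vectors.
    vectorizer = CountVectorizer()
    train_counts = vectorizer.fit_transform(x_train)
    test_counts = vectorizer.transform(x_test)

    model = PassiveAggressiveClassifier(random_state=seed)
    model.fit(train_counts, y_train)

    predictions = model.predict(test_counts)
    accuracy = accuracy_score(y_test, predictions)
    return model, vectorizer, accuracy
```

With a combined DataFrame whose columns are named title, text, and label (assumed names), this could be called as train_and_score(df["title"], df["text"], df["label"]). The returned vectorizer has to be reused to transform any new article before passing it to model.predict().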
By following these steps, we can create a machine learning model that distinguishes fake news from real news.
This project was created by OpenAI and is based on the tutorial available on DataCamp.