Project Description
Dataset
Prerequisites
Training Example
Github
Classifying semantic content is one of the critical problems in natural language processing. In many cases only a small number of words is available to interpret the meaning or intent, for example in keyword searches. The performance of short text classification is therefore limited by the shortness of the sentences, which leads to sparse vector representations when sentences are represented by word occurrences, and by the lack of surrounding context. News titles, on the other hand, though consisting of short sentences, convey rich semantic content in a concise way. Because of this property, we believe that news title classification is a good starting point for our sentence classification task. The sparsity issue is illustrated by the small sketch below.
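As a rough illustration of the sparsity problem, the following minimal sketch uses scikit-learn with made-up titles; it is only for intuition and is not the feature code used in this repository.

# Illustration: a short title activates only a handful of vocabulary dimensions,
# so word-occurrence vectors are mostly zeros. Titles below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "Fed raises interest rates",
    "New smartphone unveiled at tech conference",
    "Scientists discover distant exoplanet",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
# (3, 14) with 14 non-zero entries here; with a realistic vocabulary of tens of
# thousands of words, each short title still fills only a few dimensions.
print(X.shape, X.nnz)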
News Aggregator Dataset: references to news web pages collected from an online aggregator between March 10 and August 10, 2014. The resources are grouped into clusters that represent pages discussing the same news story. The dataset also includes references to web pages that point to (link to) one of the news pages in the collection.
It is collected through the Event Registry API, which provides much more up-to-date news. The downside is that Event Registry uses its own scraper and classifier, which may introduce more noise into the data compared to manual labeling. Some example code is in [news_api].
TagMyNews Datasets is a collection of datasets of short text fragments used for topic-based text classification. It has been used in several other papers and is more difficult than the News Aggregator Dataset, given the smaller amount of data and the larger number of categories.
The pretrained word embedding GoogleNews-vectors-negative300.bin from Google word2vec is required. The Keras deep learning library and a recent version of the Theano backend should also be installed; both can be installed with pip.
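For example: pip install keras theano. The sketch below shows one common way to load the pretrained Google News embeddings using gensim; gensim is an assumption here (install with pip install gensim), and the repository's own loading code may read the binary file differently.

# Minimal sketch: load the pretrained Google News vectors with gensim.
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
print(w2v["news"].shape)  # (300,) -- each word maps to a 300-dimensional vector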
We provide example commands in run.sh.
To train SVM and logistic regression on count-of-words (bag-of-words) features, use the following command:
python bow_main.py (bow_config_file) (train_data) (test_data)
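For reference, the underlying approach is roughly the following minimal scikit-learn sketch with hypothetical titles and labels; bow_main.py's actual features, hyperparameters, and config handling are read from the config file and will differ.

# Sketch: count-of-words features fed to a linear SVM and logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

train_titles = ["Fed raises interest rates", "New smartphone unveiled"]
train_labels = ["business", "sci_tech"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_titles)

svm = LinearSVC().fit(X_train, train_labels)
logreg = LogisticRegression().fit(X_train, train_labels)

X_test = vectorizer.transform(["Markets rally after rate decision"])
print(svm.predict(X_test), logreg.predict(X_test))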
To train SVM and logistic regression on word2vec features, use the following command:
python w2v_main.py (w2v_config_file) (google_w2v_pretrain_model) (train_data) (test_data)
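Conceptually, the word2vec pipeline builds a fixed-length sentence vector from the pretrained embeddings and trains the same linear classifiers. The sketch below averages the word vectors, which is an assumption for illustration; w2v_main.py may combine the word vectors differently. It reuses the gensim-loaded w2v model from the prerequisite sketch above.

# Sketch: average the pretrained 300-d vectors of a title's words, then classify.
import numpy as np
from sklearn.svm import LinearSVC

def title_vector(title, w2v, dim=300):
    vecs = [w2v[w] for w in title.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

train_titles = ["Fed raises interest rates", "New smartphone unveiled"]
train_labels = ["business", "sci_tech"]
X_train = np.vstack([title_vector(t, w2v) for t in train_titles])
svm = LinearSVC().fit(X_train, train_labels)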
To train the CNN with bag-of-words features, use the following command:
python cnn_main.py bow (bow_config_file) (train_data) (test_data)
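For example, with hypothetical file names: python cnn_main.py bow configs/cnn_bow.config data/train.tsv data/test.tsv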
To train the CNN with word2vec embeddings, use the following command:
python cnn_main.py w2v (w2v_config_file) (train_data) (test_data) (google_w2v_pretrain_model)
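The CNN variant follows the common pattern of convolving over a sequence of word embeddings. Below is a minimal Keras 2-style sketch with hypothetical sizes; cnn_main.py's architecture and hyperparameters come from the config file and will differ, and older Keras/Theano versions use slightly different layer names (e.g. Convolution1D).

# Sketch: 1-D CNN over word-embedding sequences for title classification.
# vocab_size, max_len and num_classes are hypothetical placeholders;
# embedding_matrix would be filled from the pretrained GoogleNews vectors.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, max_len, num_classes = 10000, 20, 7
embedding_matrix = np.random.rand(vocab_size, 300)  # replace with word2vec rows

model = Sequential([
    Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=max_len),
    Conv1D(filters=128, kernel_size=3, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()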