wayne1199111810/sentence_classification_for_news_titles

Sentence Classification for News Titles

Table of Contents

Project Description
Dataset
Dependency
Implementation
Github

Project Description

Classifying semantic content is one of the critical problems in natural language processing. In many cases, such as keyword searches, only a small number of words is available to interpret meaning or intent. However, the performance of short text classification is limited by the shortness of the sentences, which leads to sparse vector representations when sentences are represented by word occurrence, and by the lack of context. News titles, on the other hand, are short sentences that still convey rich information about the semantic content in a concise way. Because of this property, we believe that news title classification is a good starting point for our sentence classification task.

Dataset

News Aggregator Dataset: references to news web pages collected from an online aggregator between March 10 and August 10, 2014. The resources are grouped into clusters that represent pages discussing the same news story. The dataset also includes references to web pages that link to one of the news pages in the collection.

Another dataset is collected from the Event Registry API, which contains much more up-to-date news. The downside is that Event Registry uses its own scraper and classifier, which may introduce more noise into the data than manual labeling would. Some example code is in [news_api].

TagMyNews Datasets is a collection of datasets of short text fragments used for topic-based text classification. It has been used in several other papers and is more difficult than the News Aggregator Dataset because it has less data and more categories.

Dependency

The pretrained word embedding GoogleNews-vectors-negative300.bin from Google word2vec is required. The Keras deep learning library and the most recent Theano backend should also be installed; both can be installed with pip.
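Before running the scripts, it can help to confirm that the required libraries are visible to the Python interpreter. The snippet below is a minimal sketch and not part of this repository: the helper name `missing_packages` is hypothetical, and it only checks whether the packages can be found, not which versions are installed.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The package names follow the dependencies listed above.
missing = missing_packages(["keras", "theano"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("Keras and Theano are importable.")
```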

Implementation

We provide example commands in run.sh.

Machine Learning Models on Different Sentence Representations

Training SVM and logistic regression on word counts (bag of words)

python bow_main.py (bow_config_file) (train_data) (test_data)
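To illustrate what a word-count representation looks like for short titles, here is a minimal sketch using only the standard library. It is not the code in bow_main.py; the function names and the example titles are made up for illustration, and tokenization is assumed to be simple lowercase whitespace splitting.

```python
from collections import Counter


def build_vocabulary(titles):
    """Map each distinct lowercase token to a column index."""
    vocab = {}
    for title in titles:
        for token in title.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def count_vector(title, vocab):
    """Return the word-count vector of one title over the vocabulary."""
    counts = Counter(title.lower().split())
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = n
    return vector


# Hypothetical example titles, not taken from any of the datasets.
titles = [
    "stocks rally as markets rebound",
    "new phone launch draws crowds",
    "team wins final after late goal",
]
vocab = build_vocabulary(titles)
vectors = [count_vector(t, vocab) for t in titles]

# Count how many entries of the title-by-vocabulary matrix are non-zero.
nonzero = sum(1 for v in vectors for x in v if x)
total = len(vectors) * len(vocab)
print(f"vocabulary size: {len(vocab)}")
print(f"non-zero entries: {nonzero} of {total}")
```

Each title fills only a handful of columns, and the vocabulary grows with every new title, so most entries are zero. This is the sparsity that the project description refers to.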

Training SVM and logistic regression on word2vec (w2v) features

python w2v_main.py (w2v_config_file) (google_w2v_pretrain_model) (train_data) (test_data)
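One common way to turn word vectors into a fixed-length sentence feature for an SVM or logistic regression is to average the vectors of the words in the title. The sketch below assumes that approach; it is not confirmed to be what w2v_main.py does. The three-dimensional `toy_embeddings` table is a made-up stand-in for the 300-dimensional pretrained vectors, so the example runs without the GoogleNews file.

```python
def average_embedding(title, embeddings, dim):
    """Average the vectors of the in-vocabulary tokens of a title."""
    vectors = [embeddings[t] for t in title.split() if t in embeddings]
    if not vectors:
        # No known word: fall back to a zero vector of the right size.
        return [0.0] * dim
    # zip(*vectors) iterates over the vectors dimension by dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical toy embedding table (assumption, for illustration only).
toy_embeddings = {
    "stocks": [1.0, 0.0, 0.0],
    "rally": [0.0, 1.0, 0.0],
    "goal": [0.0, 0.0, 1.0],
}

features = average_embedding("stocks rally today", toy_embeddings, dim=3)
print(features)
```

Unlike word counts, the resulting feature vector is dense and its length does not depend on the vocabulary size.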

CNN

To train the CNN with bag-of-words input, use the following command:

python cnn_main.py bow (bow_config_file) (train_data) (test_data)

To train the CNN with word2vec input, use the following command:

python cnn_main.py w2v (w2v_config_file) (train_data) (test_data) (google_w2v_pretrain_model)
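A CNN over word embeddings usually expects every input to have the same length, so titles are often converted to integer word indices and padded or truncated to a fixed size before the embedding lookup. The sketch below shows that preprocessing step under those assumptions. It is not taken from cnn_main.py, and the names `build_index`, `encode`, `PAD` and `UNK` are hypothetical.

```python
PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary tokens


def build_index(titles):
    """Assign an integer id (starting at 2) to each distinct lowercase token."""
    index = {}
    for title in titles:
        for token in title.lower().split():
            index.setdefault(token, len(index) + 2)
    return index


def encode(title, index, max_len):
    """Convert a title to exactly max_len ids, truncating or padding as needed."""
    ids = [index.get(t, UNK) for t in title.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Hypothetical example titles.
titles = ["markets rebound", "team wins final after late goal"]
index = build_index(titles)

short_ids = encode("markets rebound sharply", index, max_len=5)
long_ids = encode(titles[1], index, max_len=5)
print(short_ids)
print(long_ids)
```

In this sketch, the ids would then be used to look up rows of the pretrained embedding matrix, giving one fixed-size matrix per title for the convolution layers.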
