Run tagger.py with the training files (-d), the test file (-t), and the name of your output file (-o):
py tagger.py -d [training files] -t [test file] -o [output file]
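For example, assuming -d accepts one or more training files (the file names here are hypothetical):

py tagger.py -d training1.txt training2.txt -t test1.txt -o output.txt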
The output will be a sequence of tagged words.
To evaluate accuracy, you need a solution file that contains the correct tag for each word in the sequence; evaluator.py compares it against the output of tagger.py:
py evaluator.py -o [output file] -s [solution file]
This will output results.txt, which contains a list of missed tags as well as an accuracy percentage.
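For example, continuing the hypothetical file names from above:

py evaluator.py -o output.txt -s solution.txt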
Part-of-speech (POS) tagging (also known as grammatical tagging) assigns grammatical tags to a sequence of words and punctuation symbols. It is used as a tool in Natural Language Processing to help discern the intended meaning of a word that has multiple possible meanings. This program and the following examples use the tags present in the British National Corpus.
For example:
- The word John is a proper noun, and should be assigned the tag NP0.
- The word book can be used as a singular common noun or as a finite or infinitive form of a verb, and can be assigned the tag NN1, VVB, or VVI, respectively.
There are many possible strategies for determining the appropriate tag for a word in a sequence. This program uses Hidden Markov Models to achieve that goal.
Hidden Markov Models (HMMs)
Hidden Markov Models consist of two sequences of states: hidden states and observed states.
For each observed state, there exists an underlying hidden state. Given a sequence of untagged words, the observed states are the words in the sequence, and the hidden states are the POS tags for those words. We use probabilities to calculate the most likely hidden state for each observed state. More specifically, we use three probability tables:
- Initial Probabilities: how likely it is for the first hidden state in a sequence to be each possible value
- Transition Probabilities: how likely it is for each hidden state to follow every other hidden state (including itself)
- Emission Probabilities: how likely each observed state (word) is, given its hidden state (tag)
These tables are calculated through training on existing sequences of tagged words.
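The real training files have their own format, but as a minimal sketch, assuming each training sentence has already been parsed into a list of (word, tag) pairs (a hypothetical input format), the three tables can be built by counting occurrences and normalizing the counts:

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Estimate initial, transition, and emission tables by counting.

    tagged_sentences: a list of sentences, each a list of (word, tag)
    pairs. This input format is an assumption for illustration; the
    real training files may be structured differently.
    """
    initial = Counter()                # counts of sentence-initial tags
    transition = defaultdict(Counter)  # counts of tag -> next tag
    emission = defaultdict(Counter)    # counts of tag -> emitted word

    for sentence in tagged_sentences:
        prev_tag = None
        for word, tag in sentence:
            if prev_tag is None:
                initial[tag] += 1
            else:
                transition[prev_tag][tag] += 1
            emission[tag][word] += 1
            prev_tag = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (normalize(initial),
            {t: normalize(c) for t, c in transition.items()},
            {t: normalize(c) for t, c in emission.items()})

# Example: one tiny training sentence using BNC-style tags.
init_p, trans_p, emit_p = train([[("John", "NP0"), ("reads", "VVZ"),
                                  ("a", "AT0"), ("book", "NN1")]])
```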
Using these tables and a variation on the Viterbi Algorithm, we are able to determine the most likely POS tag for each word, and tag it accordingly.
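The program's own variation is not reproduced here, but the core of the standard Viterbi Algorithm looks roughly like the sketch below, reusing the tables from the training sketch above. The `floor` parameter is a hypothetical stand-in for whatever smoothing the real tagger applies to unseen words and transitions:

```python
def viterbi(words, initial, transition, emission, tags, floor=1e-10):
    """Return the most likely tag sequence for `words`."""
    # prob[i][t] is the probability of the best tag path for words[:i+1]
    # that ends in tag t; back[i][t] is the previous tag on that path.
    prob = [{t: initial.get(t, floor) * emission.get(t, {}).get(words[0], floor)
             for t in tags}]
    back = [{}]

    for i in range(1, len(words)):
        prob.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that maximizes the path probability.
            best_prev = max(
                tags,
                key=lambda p: prob[i - 1][p] * transition.get(p, {}).get(t, floor),
            )
            prob[i][t] = (prob[i - 1][best_prev]
                          * transition.get(best_prev, {}).get(t, floor)
                          * emission.get(t, {}).get(words[i], floor))
            back[i][t] = best_prev

    # Walk backwards from the most likely final tag to recover the path.
    best = max(tags, key=lambda t: prob[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        best = back[i][best]
        path.append(best)
    path.reverse()
    return path

# Continuing the toy example from the training sketch:
tags = ["NP0", "VVZ", "AT0", "NN1"]
print(viterbi(["John", "reads", "a", "book"], init_p, trans_p, emit_p, tags))
```

A production implementation would normally work with log probabilities instead, since multiplying many small probabilities underflows on long sentences.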