
Python program that assigns part-of-speech tags to words in a sequence. Trained and executed using Hidden Markov Models and a variation of the Viterbi Algorithm.


rohan-sahgal/POS-Tagging-with-HMMs


Instructions

Run tagger.py with the training files (-d), the test file (-t), and the name of your output file (-o):

py tagger.py -d [training files] -t [test file] -o [output file]
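
For example, with hypothetical file names:

py tagger.py -d training1.txt training2.txt -t test1.txt -o output.txt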

The output will be a sequence of tagged words.

To evaluate accuracy:

You need a solution file that contains the correct tags for each sequence; evaluator.py compares it with the output of tagger.py:

py evaluator.py -o [output file] -s [solution file]

This will output results.txt, which contains a list of missed tags as well as an accuracy percentage.
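
For illustration, here is a minimal sketch of how such an evaluation could work, assuming both files contain one tagged entry per line (the exact file format and the internals of evaluator.py are assumptions, not confirmed by the repository):

# Hypothetical sketch: compare an output file against a solution file,
# assuming one tagged entry per line in both files.
def evaluate(output_path, solution_path, results_path="results.txt"):
    with open(output_path) as out_f, open(solution_path) as sol_f:
        predicted = [line.strip() for line in out_f if line.strip()]
        expected = [line.strip() for line in sol_f if line.strip()]

    # Record every position where the prediction disagrees with the solution.
    missed = [f"line {i + 1}: got {p!r}, expected {e!r}"
              for i, (p, e) in enumerate(zip(predicted, expected)) if p != e]

    accuracy = 100.0 * (len(predicted) - len(missed)) / len(predicted)
    with open(results_path, "w") as f:
        f.write("\n".join(missed))
        f.write(f"\nAccuracy: {accuracy:.2f}%\n")
    return accuracy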

Background

POS Tagging

Part-of-speech tagging (also known as grammatical tagging) assigns grammatical tags to a sequence of words and punctuation symbols. It is used as a tool in Natural Language Processing, helping to disambiguate words that have multiple possible meanings. This program and the following examples use the tags present in the British National Corpus.

 

For example:

The word John is a proper noun, and should be assigned the tag NP0.

The word book can be used as either a singular common noun or a finite/infinitive form of a verb, and can accordingly be assigned the tag NN1, VVB, or VVI.

 

There are many possible strategies for determining the appropriate tag for a word in a sequence. This program uses Hidden Markov Models to achieve that goal.


Hidden Markov Models (HMMs)

Hidden Markov Models consist of two sequences of states: hidden states and observed states.

[Figure: diagram of a Hidden Markov Model, showing a chain of hidden states with an observed state emitted from each]

For each observed state, there exists an underlying hidden state. Given a sequence of untagged words, the observed states would be the sequence of words, and the hidden states would be the POS tags for each word. We use probabilities to calculate what the most likely hidden state is for each given observed state. More specifically, we use three probability tables:

  • Initial Probabilities: how likely it is for the first hidden state in a sequence to be each possible value
  • Transition Probabilities: how likely it is for each hidden state to follow every other hidden state (including itself)
  • Emission Probabilities: how likely each observed state is, given its underlying hidden state

These tables are calculated through training on existing sequences of tagged words.
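
As an illustration, all three tables can be estimated by simple counting over the training data. A minimal sketch, assuming each training sentence is given as a list of (word, tag) pairs (not necessarily the repository's actual data format):

from collections import Counter, defaultdict

def train(tagged_sentences):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    initial = Counter()                # counts of the first tag in each sentence
    transition = defaultdict(Counter)  # tag -> counts of the tags that follow it
    emission = defaultdict(Counter)    # tag -> counts of the words it emits

    for sentence in tagged_sentences:
        initial[sentence[0][1]] += 1
        for word, tag in sentence:
            emission[tag][word] += 1
        for (_, prev_tag), (_, tag) in zip(sentence, sentence[1:]):
            transition[prev_tag][tag] += 1

    # Turn raw counts into probabilities by dividing by the row totals.
    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (normalize(initial),
            {tag: normalize(nexts) for tag, nexts in transition.items()},
            {tag: normalize(words) for tag, words in emission.items()})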

Using these tables and a variation of the Viterbi Algorithm, we can determine the most likely POS tag for each word and tag it accordingly.
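
For reference, here is a minimal sketch of the standard Viterbi recursion over those three tables. The repository uses its own variation, so the smoothing and data structures below are assumptions:

import math

def viterbi(words, tags, initial, transition, emission):
    # Work in log space so long sentences do not underflow to zero;
    # unseen events get a tiny floor probability (a simplification,
    # not the repository's actual smoothing strategy).
    def log_p(table, key):
        return math.log(table.get(key, 1e-10))

    # prob[i][t]: best log-probability of any tag path ending in tag t at word i
    # back[i][t]: the predecessor tag on that best path
    prob = [{t: log_p(initial, t) + log_p(emission.get(t, {}), words[0]) for t in tags}]
    back = [{}]

    for i in range(1, len(words)):
        prob.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: prob[i - 1][p] + log_p(transition.get(p, {}), t))
            prob[i][t] = (prob[i - 1][best_prev]
                          + log_p(transition.get(best_prev, {}), t)
                          + log_p(emission.get(t, {}), words[i]))
            back[i][t] = best_prev

    # Trace back from the best final tag to recover the full sequence.
    last = max(tags, key=lambda t: prob[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

Working in log space is the standard way to keep products of many small probabilities numerically stable.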
