Question about active features #91

Open
kite1988 opened this issue Aug 30, 2017 · 6 comments

Comments

@kite1988

I used CRFsuite to train a model for a named entity recognition task. I set feature.minfreq to 0 (no feature cutoff), but the number of active features (17450) is much smaller than the total number of features (79075). Below is a snippet of the log:

Number of active features: 17450 (79075)
Number of active attributes: 5310 (64323)
Number of active labels: 21 (21)

Does anyone know how the active features are selected? Also, what is the difference between active features and active attributes? Thanks very much!

@chokkan
Owner

chokkan commented Aug 31, 2017

CRFsuite removes features whose weight is zero once training has finished. In your case, 79075 features were used during training, but the training algorithm assigned non-zero weights to only 17450 of them. The remaining 79075 - 17450 = 61625 features were therefore removed from the model.

Roughly speaking, state features are pairs of attributes and labels. When a feature is removed from a model, the attribute associated with it may no longer be referenced by any other feature, in which case the attribute can be pruned as well. In your case, 5310 attributes are associated with at least one feature with a non-zero weight; the rest appear only in zero-weight features. For this reason, CRFsuite removed 64323 - 5310 = 59013 attributes from the model.

I guess you used L1 regularization to train the model. L1 regularization drives many weights to exactly zero, so it has an effect similar to setting a frequency cutoff.
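
To make the pruning described above concrete, here is a minimal sketch in Python. It is not CRFsuite's actual code: the function name `prune_zero_weights`, the toy attribute names, and the weights are all hypothetical, and it covers only state features (attribute/label pairs), not transition features.

```python
def prune_zero_weights(weights):
    """Keep only features with a non-zero weight.

    `weights` maps (attribute, label) pairs to floats. This layout is an
    assumption made for illustration, not CRFsuite's internal format.
    """
    # A feature is "active" if training left it with a non-zero weight.
    active_features = {key: w for key, w in weights.items() if w != 0.0}
    # An attribute is "active" if at least one active feature refers to it.
    active_attributes = {attr for attr, _label in active_features}
    return active_features, active_attributes


# Hypothetical toy weights, such as L1 regularization might leave behind.
weights = {
    ("word=Paris", "B-LOC"): 1.7,
    ("word=Paris", "O"): 0.0,
    ("word=the", "O"): 0.4,
    ("word=the", "B-LOC"): 0.0,
    ("suffix=xyz", "O"): 0.0,      # every feature of this attribute is zero,
    ("suffix=xyz", "B-LOC"): 0.0,  # so the attribute itself gets pruned
}

active_features, active_attributes = prune_zero_weights(weights)
all_attributes = {attr for attr, _label in weights}

print(f"Number of active features: {len(active_features)} ({len(weights)})")
print(f"Number of active attributes: {len(active_attributes)} ({len(all_attributes)})")
```

In this toy example, `suffix=xyz` has no feature with a non-zero weight, so it drops out of the attribute set, while `word=Paris` and `word=the` each keep one active feature.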

@arvinarvi

Is tagging done using only the active features? How are the potentials computed for tokens in the evaluation set whose features do not appear in the model file?
Thanks for your reply.

@usptact

usptact commented Jan 13, 2018

@arvinarvi Features that appear at tagging time but are not in the model get a weight of zero.
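
As a rough illustration of that rule, the sketch below computes per-label state scores for one token and treats any (attribute, label) pair missing from the model as having weight zero. The function `state_scores`, the dictionary layout, and the toy weights are assumptions made for this example. It does not read a real CRFsuite model file and it ignores transition features.

```python
def state_scores(token_attributes, labels, state_weights):
    """Sum the weights of the state features that fire for one token.

    token_attributes: dict of attribute name -> value (1.0 for binary attributes)
    labels:           iterable of label names
    state_weights:    dict of (attribute, label) -> weight, holding only the
                      active features; this layout is hypothetical
    """
    scores = {}
    for label in labels:
        scores[label] = sum(
            # Pairs that are absent from the model contribute a weight of 0.0.
            state_weights.get((attr, label), 0.0) * value
            for attr, value in token_attributes.items()
        )
    return scores


# Hypothetical active features left in a pruned model.
state_weights = {
    ("word=Paris", "B-LOC"): 1.7,
    ("word=the", "O"): 0.4,
    ("is_capitalized", "B-LOC"): 0.5,
}

# "word=Berlin" was never seen in training, so it has no entry in the model.
token = {"word=Berlin": 1.0, "is_capitalized": 1.0}
scores = state_scores(token, ["B-LOC", "O"], state_weights)
print(scores)
```

Here the unseen attribute `word=Berlin` adds nothing to either label, so the score for `B-LOC` comes only from `is_capitalized`, and the score for `O` is zero.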

@arvinarvi

@usptact Thank you for your reply.

@arvinarvi

arvinarvi commented Jan 19, 2018

I am working on a sequence labeling implementation that extracts the learned potentials from a CRFsuite model file and applies a different inference algorithm. I am finding it difficult to generalize the extraction of state-feature potentials from the saved model file for a particular token (in the evaluation set, if it is present in the model file), since only the active features are logged. Can anyone help me figure out the problem? (If more information is required, I can be more specific about my problem.) Thanks.

@marctorsoc

I don't understand @chokkan's answer. If every feature is an attribute+label pair, then there should be at least as many features as attributes, and in theory many more, since an attribute might appear with different labels. Can someone explain, please?

5 participants