
Feature interactions

Alexander Trufanov edited this page Jun 24, 2015 · 15 revisions

Since VW v7.10.2, a number of changes have been introduced in the mechanism that generates feature interactions (-q and --cubic):

  • Support of interactions of arbitrary length.
  • Better hashing.
  • Filtering out unnecessary feature interactions.
  • Filtering out unnecessary namespace interactions.

These changes may result in fewer generated features and in changes to their values and hashes.

Support of interactions of arbitrary length

VW's command line parameters -q ab or --cubic abc (where a, b and c are namespaces) are commonly used to generate pairs or triples of features from the specified namespaces.

VW now provides an additional --interaction parameter that works like -q or --cubic but accepts an argument of arbitrary length, e.g. --interaction ab, --interaction abcd, --interaction abcdef etc. Moreover, -q and --cubic arguments are internally converted to --interaction values. Both -q and --cubic are still supported for backward compatibility.

Although VW supports interactions of any length, keep in mind that long interactions may generate a huge number of features. Generating interactions longer than 3 also incurs some performance overhead because they are processed in a non-recursive loop.
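To see how quickly the number of generated features grows with interaction length, here is a minimal Python sketch. It is an illustration only, not VW's code: generate_interaction is a hypothetical helper, the example is modelled as a dict from namespace character to a list of feature names, and all namespaces in the interaction are assumed to be distinct.

```python
from itertools import product


def generate_interaction(example, interaction):
    """Return one combined feature per tuple of features, taking one
    feature from each namespace named in the interaction string."""
    feature_lists = [example[ns] for ns in interaction]
    return ["*".join(combo) for combo in product(*feature_lists)]


# Hypothetical example with four namespaces: a, b, c and d.
example = {
    "a": ["a1", "a2"],
    "b": ["b1", "b2", "b3"],
    "c": ["c1", "c2"],
    "d": ["d1", "d2"],
}

for interaction in ("ab", "abc", "abcd"):
    features = generate_interaction(example, interaction)
    print(interaction, len(features))
```

The count is the product of the namespace sizes (6, 12 and 24 here), so every extra namespace multiplies the number of generated features.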

Better hashing

VW currently uses the murmur3 hash to encode single features. Features generated by interactions are stored in the same address space and therefore need a hash too. But because there may be a huge number of generated features, it makes sense to use a slightly worse but much faster hash for them. So instead of applying murmur3 to a generated feature, VW encodes it by hashing the murmur3 hashes of the single features that compose it.

For example, two features a and b will have hash values murmur3(a) and murmur3(b). But the new feature a*b generated by the -q argument will have hash value hash(murmur3(a), murmur3(b)) instead of murmur3(ab). This is faster because murmur3(a) and murmur3(b) are already known at the moment their interaction a*b is encoded.

Since version 7.10.2 VW uses a 32-bit FNV hash for hashing interactions of any length. It generates fewer collisions than the hashes used previously and has comparable performance.
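The idea of hashing already-computed hashes can be sketched in Python as follows. This is a sketch under stated assumptions, not VW's implementation: zlib.crc32 stands in for murmur3 (Python's standard library has no murmur3), and combine_hashes is a hypothetical helper that folds 32-bit hashes together FNV-1 style (multiply by the 32-bit FNV prime, then XOR). The exact combining step inside VW may differ.

```python
import zlib

FNV_PRIME_32 = 16777619  # the standard 32-bit FNV prime
MASK_32 = 0xFFFFFFFF


def single_feature_hash(name):
    """Stand-in for murmur3: any 32-bit hash of the feature name."""
    return zlib.crc32(name.encode("utf-8")) & MASK_32


def combine_hashes(hashes):
    """Fold precomputed 32-bit hashes into one 32-bit value:
    multiply the accumulator by the FNV prime, then XOR the next hash."""
    acc = 0
    for h in hashes:
        acc = ((acc * FNV_PRIME_32) & MASK_32) ^ h
    return acc


h_a = single_feature_hash("a")
h_b = single_feature_hash("b")
# The hash of a*b is derived from h_a and h_b; the string "ab" is never hashed.
print(combine_hashes([h_a, h_b]))
```

The combined value depends on the order of its inputs, which is why reordering the namespaces of an interaction changes the hash of the generated feature.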

Filtering out unnecessary feature interactions

In previous versions, VW generated permutations of features for self-interacting namespaces. This means that -q ff for f| a b c produced the following new features:
a*a, a*b, a*c, b*a, b*b, b*c, c*a, c*b, c*c
Two groups of these generated features won't improve your model:

  1. b*a, c*a, c*b will have the same values after training as a*b, a*c, b*c, so processing them is a waste of time. Although they have different hashes, they do not appear to improve the model's results by making it more robust to hash collisions. Removing such features may significantly reduce training time and slightly improve the model's predictive power.
  2. Features a*a, b*b, c*c will have the same values as the plain a, b, c unless their weights are != 1.0. This exception also applies to parts of longer interactions. For example, --cubic fff where f| a:0.5 b c will result in a*a*a, a*a*b, a*a*c, a*b*c.

Since 7.10.2, VW does not generate unnecessary features for self-interacting namespaces. In other words, instead of permutations VW generates simple combinations, with an exception for features with weight != 1.0.
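The difference between the old and new rules can be illustrated with a short Python sketch. It is not VW's algorithm, only a model of the rules described above: the namespace is assumed to be a dict from feature name to weight, and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement, product


def old_permutations(weights, length):
    """Old behaviour: every ordered tuple of features from the namespace."""
    return ["*".join(t) for t in product(weights, repeat=length)]


def new_combinations(weights, length):
    """New behaviour: unordered combinations, where a feature may repeat
    inside a combination only if its weight differs from 1.0."""
    result = []
    for combo in combinations_with_replacement(weights, length):
        counts = Counter(combo)
        if all(n == 1 or weights[f] != 1.0 for f, n in counts.items()):
            result.append("*".join(combo))
    return result


# -q ff with f| a b c
unit = {"a": 1.0, "b": 1.0, "c": 1.0}
print(len(old_permutations(unit, 2)))  # 9
print(new_combinations(unit, 2))       # ['a*b', 'a*c', 'b*c']

# --cubic fff with f| a:0.5 b c
print(new_combinations({"a": 0.5, "b": 1.0, "c": 1.0}, 3))
# ['a*a*a', 'a*a*b', 'a*a*c', 'a*b*c']
```

For three unit-weight features the pair count drops from 9 to 3, and the cubic example reproduces the four features listed above.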

These new rules of feature generation for interacting namespaces are enabled by default but can be switched off by passing the --permutations flag on the VW command line. The exception for features with weight != 1.0 can be disabled by switching off const bool feature_self_interactions_for_weight_other_than_1 in interactions.h and rebuilding the project.

Note: because of how the simple-combinations algorithm is implemented, the namespaces in an interaction string are sorted in ascending order. This groups identical namespaces together and makes it efficient to detect self-interacting namespaces. Sorting may change the order of features in an interaction and thus its hash value, so VW prints a warning message if such changes have been made. For example:

$ vw --cubic bab --cubic aaa
creating cubic features for triples: bba aaa 
WARNING: some interactions contain duplicate characters and their characters order has been changed.
Interactions affected: 1.

In the example above, VW will continue to work with the interactions aaa and abb.
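The reordering can be modelled with a few lines of Python. This is a sketch of the described behaviour, not VW's code: normalize_interactions is a hypothetical helper, and it assumes that only interactions containing duplicate namespace characters are sorted, as the warning text suggests.

```python
def normalize_interactions(interactions):
    """Sort the namespace characters of every interaction that contains
    a repeated namespace; return the new list and the number changed."""
    normalized = []
    affected = 0
    for interaction in interactions:
        if len(set(interaction)) < len(interaction):  # has a duplicate namespace
            reordered = "".join(sorted(interaction))
            if reordered != interaction:
                affected += 1
            normalized.append(reordered)
        else:
            normalized.append(interaction)
    return normalized, affected


print(normalize_interactions(["bab", "aaa"]))  # (['abb', 'aaa'], 1)
```

Here bab becomes abb while aaa is already in order, which agrees with the single affected interaction reported in the warning.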

Filtering out unnecessary namespace interactions

By the same argument as in the previous section, unnecessary features may be generated not only when a namespace interacts with itself but also in cases like -q ab -q ba. Although the interactions generated by -q ba have different hashes from those generated by -q ab, they won't improve the model's results.
Such duplicate interactions may be generated unintentionally with wildcards like -q :: --cubic ::: or even --interaction ::::.

Thus, since 7.10.2, VW automatically removes such interactions and prints a warning message:

$ vw --cubic :::
creating cubic features for triples: ::: 
WARNING: duplicate namespace interactions were found. Removed: 665942.
You can use --leave_duplicate_interactions to disable this behaviour.

This behaviour can be disabled with the --leave_duplicate_interactions flag.
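A minimal Python sketch of duplicate removal, assuming that two interactions are duplicates when they contain the same namespaces in any order. remove_duplicate_interactions is a hypothetical helper, and the three-namespace wildcard expansion below is a toy stand-in for what ::: expands to in VW.

```python
from itertools import product


def remove_duplicate_interactions(interactions):
    """Keep the first interaction for each multiset of namespaces and
    count how many later duplicates were dropped."""
    seen = set()
    kept = []
    removed = 0
    for interaction in interactions:
        key = "".join(sorted(interaction))  # order-independent identity
        if key in seen:
            removed += 1
        else:
            seen.add(key)
            kept.append(interaction)
    return kept, removed


print(remove_duplicate_interactions(["ab", "ba"]))  # (['ab'], 1)

# Toy wildcard: every triple over just three namespaces a, b and c.
triples = ["".join(t) for t in product("abc", repeat=3)]
kept, removed = remove_duplicate_interactions(triples)
print(len(triples), len(kept), removed)  # 27 10 17
```

Even with three namespaces, 17 of the 27 wildcard triples are duplicates, which gives a sense of why the count in the warning above is so large.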
