Pandas to vw text format #2426

Merged

Conversation

etiennekintzler
Contributor

@etiennekintzler etiennekintzler commented May 1, 2020

1. Overview

The goal of this PR is to fix the issue #2308.

The PR introduces a new class DFtoVW in vowpalwabbit.pyvw that takes as input a pandas.DataFrame and special types (SimpleLabel, Feature, Namespace) that specify the desired VW conversion.

These classes make extensive use of a class Col that refers to a given column in the user-specified dataframe.

A simpler interface, DFtoVW.from_colnames, can also be used for simple use cases. Its main benefit is that the user does not need the specific types.


Below are some usages of this class. They all rely on the following pandas.DataFrame called df :

  house_id  need_new_roof  price  sqft   age  year_built
0      id1              0   0.23  0.25  0.05        2006
1      id2              1   0.18  0.15  0.35        1976
2      id3              0   0.53  0.32  0.87        1924

2. Simple usage using DFtoVW.from_colnames

Let's say we want to build a VW dataset with the target need_new_roof and the features age and year_built:

from vowpalwabbit.pyvw import DFtoVW
conv = DFtoVW.from_colnames(y="need_new_roof", x=["age", "year_built"], df=df)

Then we can use the method process_df:

conv.process_df()

that outputs the following list:

['0 | 0.05 2006', '1 | 0.35 1976', '0 | 0.87 1924']

This list can then directly be consumed by the method pyvw.model.learn.
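
As a rough illustration of where such a list comes from, the same strings can be produced with plain pandas string concatenation. This is only a sketch and not the code from the PR: the helper name simple_vw_lines is hypothetical, and it assumes a single label column and unnamed features.

```python
import pandas as pd


def simple_vw_lines(df, y, x):
    """Build 'label | f1 f2 ...' strings, one per row (hypothetical helper)."""
    # Convert each column to strings once, then concatenate column by column.
    features = df[x[0]].astype(str)
    for colname in x[1:]:
        features = features + " " + df[colname].astype(str)
    return (df[y].astype(str) + " | " + features).tolist()


df = pd.DataFrame(
    {
        "need_new_roof": [0, 1, 0],
        "age": [0.05, 0.35, 0.87],
        "year_built": [2006, 1976, 1924],
    }
)
lines = simple_vw_lines(df, y="need_new_roof", x=["age", "year_built"])
```

Working column by column keeps the string building vectorized instead of looping over rows.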

3. Advanced usages using default constructor

The class DFtoVW also allows the following patterns in its default constructor:

  • tag
  • (named) namespaces, with scaling factor
  • (named) features, with constant feature possible

To use these more complex patterns we need to import them using:

from vowpalwabbit.pyvw import SimpleLabel, Namespace, Feature, Col

3.1. Named namespace with scaling, and named feature

Let's create a VW dataset that includes a named namespace (with scaling) and a named feature:

conv = DFtoVW(
        df=df,
        label=SimpleLabel(Col("need_new_roof")),
        namespaces=Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm"))
        )
conv.process_df()

which yields:

['0 |Imperial:0.092 sqm:0.25',
 '1 |Imperial:0.092 sqm:0.15',
 '0 |Imperial:0.092 sqm:0.32']

3.2. Multiple namespaces, multiple features, and tag

Let's create a more complex example with a tag and multiple namespaces with multiple features.

conv = DFtoVW(
        df=df, 
        label=SimpleLabel(Col("need_new_roof")),
        tag=Col("house_id"),
        namespaces=[
                Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm")),
                Namespace(name="DoubleIt", value=2, features=[Feature(value=Col("price")), Feature(Col("age"))])
                ]
        )
conv.process_df()

which yields:

['0 id1|Imperial:0.092 sqm:0.25 |DoubleIt:2 0.23 0.05',
 '1 id2|Imperial:0.092 sqm:0.15 |DoubleIt:2 0.18 0.35',
 '0 id3|Imperial:0.092 sqm:0.32 |DoubleIt:2 0.53 0.87']
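
To make the layout of these lines explicit (label, tag glued to the first "|", then one block per namespace), here is a minimal sketch that rebuilds them with pandas only. The helper namespace_block and its (feature name, column name) pairs are assumptions made for illustration; they are not part of the PR's API.

```python
import pandas as pd


def namespace_block(df, name, scale, features):
    """Return a Series of '|name:scale feat ...' strings (hypothetical helper).

    `features` is a list of (feature_name_or_None, column_name) pairs.
    """
    parts = []
    for feat_name, colname in features:
        values = df[colname].astype(str)
        # Named features are written name:value, unnamed ones as the bare value.
        parts.append(values if feat_name is None else feat_name + ":" + values)
    block = parts[0]
    for part in parts[1:]:
        block = block + " " + part
    return "|" + name + ":" + str(scale) + " " + block


df = pd.DataFrame(
    {
        "house_id": ["id1", "id2", "id3"],
        "need_new_roof": [0, 1, 0],
        "price": [0.23, 0.18, 0.53],
        "sqft": [0.25, 0.15, 0.32],
        "age": [0.05, 0.35, 0.87],
    }
)
imperial = namespace_block(df, "Imperial", 0.092, [("sqm", "sqft")])
double_it = namespace_block(df, "DoubleIt", 2, [(None, "price"), (None, "age")])
# Label, a space, the tag, then the namespaces separated by a space.
lines = (
    df["need_new_roof"].astype(str) + " " + df["house_id"] + imperial + " " + double_it
).tolist()
```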

4. Implementation details

  • The class DFtoVW and the specific types are located in vowpalwabbit/pyvw.py. The class only depends on the pandas module.
  • the code includes docstrings
  • 8 tests are included in tests/test_pyvw.py

5. Extensions

  • This PR does not yet handle multiline examples or more complex label types.
  • To convert a very large dataset that cannot fit in RAM, one can use the pandas import option chunksize and process one chunk at a time. I could also implement this functionality directly in the class using a generator. The generator would then be consumed by a VW learning interface or written to an external file (for conversion purposes only).
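
The chunked approach mentioned in the last bullet could look roughly like the following. It is a sketch under assumptions: iter_vw_lines is a hypothetical generator, the data comes from a CSV source read with pandas.read_csv and its chunksize option, and only the simple "label | features" layout is handled.

```python
import io

import pandas as pd


def iter_vw_lines(csv_source, y, x, chunksize):
    """Yield VW lines chunk by chunk so the full file is never held in memory."""
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        features = chunk[x[0]].astype(str)
        for colname in x[1:]:
            features = features + " " + chunk[colname].astype(str)
        # Hand the lines of this chunk to the consumer one at a time.
        yield from (chunk[y].astype(str) + " | " + features)


csv_text = "need_new_roof,age,year_built\n0,0.05,2006\n1,0.35,1976\n0,0.87,1924\n"
lines = list(
    iter_vw_lines(io.StringIO(csv_text), "need_new_roof", ["age", "year_built"], chunksize=2)
)
```

In practice the generator would be consumed lazily (fed to a learner or written to a file) rather than collected with list as done here for the demonstration.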

@jackgerrits
Member

This is really cool! Thanks for working on this.

So it looks to me that you define how to use each column of the DataFrame and then process each row accordingly. I wonder if it makes sense to make it clearer what are column names and what are other strings in this formula?

For instance it looks like namespaces and feature names aren't substituted, but feature values are. Maybe it could be like a format string where {column_name} is used in the formula in a generic way.

It seems that additionally there are limitations on how the label is constructed. Some labels have the form of action:cost:prob. So splitting by space will not allow this to be handled.

Have you put any thought into how multi line examples may be handled in this sort of scheme?

@etiennekintzler
Contributor Author

So it looks to me that you define how to use each column of the DataFrame and then process each row accordingly.

Yes, the general pattern is specified in the formula. I then proceed column by column (to take advantage of vectorization) to obtain a single column in which each element is a string that defines one line. That column is then converted to a list to get the list of lines.

I wonder if it makes sense to make it clearer what are column names and what are other strings in this formula?

For instance it looks like namespaces and feature names aren't substituted, but feature values are. Maybe it could be like a format string where {column_name} is used in the formula in a generic way.

Yes! I too hesitated between the current formulation and using {}; the latter avoids ambiguity but is longer to write. I agree it is much clearer and will change the function accordingly. Hence:
"need_new_roof | price:price sqft:sqft age:age year_built" will become
"{need_new_roof} | price:{price} sqft:{sqft} age:{age} {year_built}"

It seems that additionally there are limitations on how the label is constructed. Some labels have the form of action:cost:prob. So splitting by space will not allow this to be handled.

Have you put any thought into how multi line examples may be handled in this sort of scheme?

I did not know such a case existed (I used the Input format page from the wiki as a reference). Could you please tell me more about this pattern (or link me to a page where I can see examples)? More specifically, in action:cost:prob, what is substituted and what is not?

Also, I thought of adding a method that checks the correctness of the formula using a regex (in the same fashion as https://hunch.net/~vw/validate.html). What do you think? Does this regex pattern already exist somewhere in the project (so that I can reuse it and slightly adapt it for the {})?

Thanks 😃 !
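
For reference, the {column_name} substitution agreed on in this comment can be sketched in a few lines. This is an assumption about how such a formula might be applied, not the PR's implementation; apply_formula is a hypothetical name.

```python
import re

import pandas as pd


def apply_formula(df, formula):
    """Replace each {column} placeholder by that column's values, row by row."""
    # With one capture group, re.split returns literals at even positions
    # and the captured column names at odd positions.
    tokens = re.split(r"\{([^{}]+)\}", formula)
    out = pd.Series([""] * len(df), index=df.index)
    for position, token in enumerate(tokens):
        if position % 2 == 1:
            out = out + df[token].astype(str)
        else:
            out = out + token
    return out.tolist()


df = pd.DataFrame(
    {
        "need_new_roof": [0, 1, 0],
        "price": [0.23, 0.18, 0.53],
        "sqft": [0.25, 0.15, 0.32],
        "age": [0.05, 0.35, 0.87],
        "year_built": [2006, 1976, 1924],
    }
)
lines = apply_formula(
    df, "{need_new_roof} | price:{price} sqft:{sqft} age:{age} {year_built}"
)
```

Anything outside the braces, such as the feature name price, is copied as is, which removes the ambiguity between literals and column names.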

@jackgerrits
Member

Here is an example of the label I mentioned: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Logged-Contextual-Bandit-Example

There are several different label types available, the best list at the moment is the polylabel union. However, if you use the {} approach for substitution this will allow users to structure the labels as needed and you won't need to worry about handling all of these.

That validate page just handles the one label type I believe, but it is a good start.

@etiennekintzler
Contributor Author

Ok done !

I simplified the class to allow a more flexible schema for the formula.
At the same time I also added a check that raises an error if the formula is not well built. The definition of well built is quite broad. Almost everything is allowed, subject to the following rules:

  • no consecutive "|". For example " a | | b" will raise an error
  • no space allowed at left/right of ":" (or "*" as I saw this character in a previous version). For example, "a :b" will raise an error while "a:b" is of course OK
  • alphanumeric for feature name.
  • the column name specified in curly braces cannot be the empty string nor contain the characters '{' or '}'

Do not hesitate to tell me if this formula is too restrictive (or too flexible 😄 )

@jackgerrits
Member

no consecutive "|". For example " a | | b" will raise an error

Sounds good

no space allowed at left/right of ":" (or "*" as I saw this character in a previous version). For example, "a :b" will raise an error while "a:b" is of course OK

This is actually permitted. If you supply something like | :1 then it means it is a single feature with a value of 1.0.

alphanumeric for feature name.

It is less restrictive than that, the only characters not permitted in a feature name are : and |.

the column name specified in curly braces cannot be the empty string nor contain the characters '{' or '}'

Sounds good.
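
The two rules that were accepted in this exchange (no consecutive "|", and well-formed, non-empty column placeholders) can be checked with a small function. This is a hedged sketch: check_formula is a hypothetical name and the regular expressions are my own, not the ones used in the PR.

```python
import re


def check_formula(formula):
    """Raise ValueError if the formula breaks one of the two agreed rules."""
    # Rule 1: two '|' separated only by whitespace are not allowed.
    if re.search(r"\|\s*\|", formula):
        raise ValueError("consecutive '|' found in formula")
    # Rule 2: remove every well-formed {name}; any brace left over means an
    # empty placeholder or a name containing '{' or '}'.
    leftover = re.sub(r"\{[^{}]+\}", "", formula)
    if "{" in leftover or "}" in leftover:
        raise ValueError("empty or malformed column placeholder")
    return True


ok = check_formula("{y} | price:{price} |ns {a}")
```

The rules on ":" and on alphanumeric feature names are left out on purpose, since the review above points out that they are more restrictive than the actual format.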

)
)

def __init__(self, df, formula):
Collaborator

just as a thought, is there a way we can deduce or use types to drive this string formula? maybe using reflection

Member

It may be worth pursuing a sort of code based approach to defining the formula:

"{y} {idx}|test_ns col_x:{x}"

Might become:

formula = Formula(label=SimpleLabel(ColumnBinding("y")), tag=ColumnBinding("idx"), namespaces=[Namespace(name="test_ns", features=[Feature(name="col_x", value=ColumnBinding("x"))])])

It is significantly more text to write out, but it may be easier to construct for a newcomer that has an IDE at their disposal. Thoughts?

@jackgerrits
Member

The regex does not accept:

{y} |FirstNameSpace {a}:3 |DoubleIt:2 {b}

The {a}:3 is not matched. This is valid and means the feature_name is a and has a value of 3.
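
A small sketch of what that formula would produce, assuming {a} is replaced by the row's value in column a so that this value serves as the feature name and 3 as its value. The concatenation below is illustrative only and does not go through the class under review.

```python
import pandas as pd

df = pd.DataFrame({"y": [1], "a": [2], "b": [3]})
# "{y} |FirstNameSpace {a}:3 |DoubleIt:2 {b}" written out by hand:
# the value found in column "a" is the feature name, 3 is the feature value.
lines = (
    df["y"].astype(str)
    + " |FirstNameSpace "
    + df["a"].astype(str)
    + ":3 |DoubleIt:2 "
    + df["b"].astype(str)
).tolist()
```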


def test_oneline_with_multiple_namespaces():
    df = pd.DataFrame({"y": [1], "a": [2], "b": [3]})
    conv = DataFrameToVW(df, "{y} |FirstNameSpace {a} |DoubleIt:2 {b}")
Collaborator

We could also consider an api for DataFrameToVW that takes in some sort of spec of what we want to output. Just a sketch not actual code:

[
    Namespace(name="FirstNamespace", features=[Feature(name=at("a"))]),
    Namespace(name="DoubleIt", value=2, features=[Feature(name=at("b"))])
]

Contributor Author

@etiennekintzler etiennekintzler May 6, 2020

Yes @lalo, I think this is good too !
To sum up, for the following formula (which is a bit more complex, to cover more cases)
"{y} {tag_col}|FirstNameSpace {a} {b}:2 |DoubleIt:2 ColC:{c} |Constant :1"

We could have something like :

DataFrameToVW(
    df,
    # left of the first "|": label and tag
    Targets=[Label(col("y")), Tag(col("tag_col"))],
    # right of the first "|": one entry per namespace
    Features=[
        Namespace(name="FirstNamespace",
                  features=[Feature(value=col("a")), Feature(name=col("b"), value=2)]),
        Namespace(name="DoubleIt", value=2,
                  features=[Feature(name="ColC", value=col("c"))]),
        Namespace(name="Constant",
                  features=[Feature(value=1)]),
    ])

In my opinion, the approach based on explicit arguments is clearer and does not require complex regex checking. However, it is more tedious to write, which can be a drawback if you have a lot of features.

Regarding the approach based on a string formula, the main benefit is that it is fast to write and familiar to people from a statistical background who are used to R-like formulas (as in R or the Python package statsmodels).

What do you think @jackgerrits, @lalo ?

Collaborator

We could also add convenience methods for example have some sort of default mapping that assumes no namespaces or values, anything more complicated than that someone has to define the explicit mapping. That and a filtered one where you send only the columns from that frame that convert into features.

Member

Yeah I agree convenience methods for common ways to map it can be useful.

I find the Targets value a little confusing as in the text format everything to the left of the first | is the label (minus the tag). Therefore I think there should be a top level property which takes a Label object of a specific label type. We would need to specify all of those, but we can easily just start with SimpleLabel type.

You could also have the string formula generate this programmatic formula. That could be another nice convenience extension, maybe? Maintaining the regex may not be desirable though.

Collaborator

Again with pseudocode to sketch out the api:

we could have something like

DataFrameToVW(
    df,
    VWMappingDefault(df.columns))

or

DataFrameToVW(
    df,
    VWMappingDefault(filter(df.columns, ["a", "b"])))  # in case df has more columns we don't care about

Then we would have to make sure that VWMappingDefault just generates the structure of having no namespace, no values, etc just grabbing whatever value from the column into a feature.

Also a benefit on leaning on extra types is that if we change or update the input format we don't break all the users of the api (i.e. changing the '|' to '#', not that this will ever happen). We would be able to transform their types into the new input format cleanly.

Contributor Author

@etiennekintzler etiennekintzler May 6, 2020

We could also add convenience methods for example have some sort of default mapping that assumes no namespaces or values, anything more complicated than that someone has to define the explicit mapping.

You mean no feature name ? OK for the default mapping !

I find the Targets value a little confusing as in the text format everything to the left of the first | is the label (minus the tag). Therefore I think there should be a top level property which takes a Label object of a specific label type. We would need to specify all of those, but we can easily just start with SimpleLabel type.

In the VW input format wiki, the pattern is [Label] [Importance] [Base] [Tag]|Namespace Features |Namespace Features ... |Namespace Features. So to the left of the first | there can also be Importance and Base. Also, as you mentioned earlier, there is the polylabel union (for CB).

We could define, for the LHS of the first "|", the following properties/types: Label, Importance, Base, Tag and UnionLabel/PolyLabel (for the CB case), and for the RHS of the first "|": Namespace, Feature. What do you think?

Again with pseudocode to sketch out the api:

we could have something like

DataFrameToVW(
    df,
    VWMappingDefault(df.columns))

or

DataFrameToVW(
    df,
    VWMappingDefault(filter(df.columns, ["a", "b"])))  # in case df has more columns we don't care about

Then we would have to make sure that VWMappingDefault just generates the structure of having no namespace, no values, etc just grabbing whatever value from the column into a feature.

Sounds good! Then by default the first column of the DataFrame would be the target? Alternatively, we could ask the user to supply the column name of the target and the list of feature names, as in VWMappingDefault(y="y", x=["a", "b"]).

Also a benefit on leaning on extra types is that if we change or update the input format we don't break all the users of the api (i.e. changing the '|' to '#', not that this will ever happen). We would be able to transform their types into the new input format cleanly.

Yes, I totally agree !

Member

[Label] [Importance] [Base] [Tag]| is only true for simple_label. Unfortunately, this is a case of the wiki not quite being up to date.

LHS of the first | will depend on the label type, and therefore it must be specific per label type. This is why I suggest providing a rich label object and the tag. But RHS of the first | will be the same for every label type.

Contributor Author

[Label] [Importance] [Base] [Tag]| is only true for simple_label. Unfortunately, this is a case of the wiki not quite being up to date.

Another case is the pattern for CB (action:cost:probability | features). Are there other cases?

LHS of the first | will depend on the label type, and therefore it must be specific per label type. This is why I suggest providing a rich label object and the tag. But RHS of the first | will be the same for every label type.

I have been thinking about the design, and I could do the following: an abstract class FeatureHandler with an abstract method process and a concrete method get_col_or_value. The following 2 classes would inherit from this abstract class:

  • SimpleLabel, which has an attribute name and implements process.
  • Feature, which has attributes name and value and implements process.

In both classes, the attributes name (and value) can receive either an object of type col (which specifies the column to extract, as in col("a")) or a value that will be used as is. The concrete method get_col_or_value extracts the column from the dataframe (if given a col) or builds a column that repeats the value (if given a plain value).

The implementation of the process method builds the appropriate column of strings (pandas.Series) according to the type of the object (SimpleLabel, Feature or PolyLabel).

A third class PolyLabel with attributes action, cost, probability can easily be added.
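
To make the proposal concrete, here is a minimal sketch of the design as described in this comment. It only follows the description above; the class bodies are my own guess, not the code that was eventually merged, and Col is written with a capital letter as in the PR description.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Col:
    """Marks a string as a column name rather than a literal value."""

    def __init__(self, colname):
        self.colname = colname


class FeatureHandler(ABC):
    def get_col_or_value(self, x, df):
        # A Col is looked up in the dataframe; anything else is repeated on every row.
        if isinstance(x, Col):
            return df[x.colname].astype(str)
        return pd.Series([str(x)] * len(df), index=df.index)

    @abstractmethod
    def process(self, df):
        """Return a pandas.Series of strings, one per row."""


class SimpleLabel(FeatureHandler):
    def __init__(self, name):
        self.name = name

    def process(self, df):
        return self.get_col_or_value(self.name, df)


class Feature(FeatureHandler):
    def __init__(self, value, name=None):
        self.value = value
        self.name = name

    def process(self, df):
        value = self.get_col_or_value(self.value, df)
        if self.name is None:
            return value
        return self.get_col_or_value(self.name, df) + ":" + value


df = pd.DataFrame({"y": [-1, 1], "a": [1, 2]})
label = SimpleLabel(Col("y")).process(df)
named = Feature(name="ColA", value=Col("a")).process(df)
constant = Feature(value=1).process(df)
lines = (label + " | " + named).tolist()
```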

What do you think of it ?

Contributor Author

@etiennekintzler etiennekintzler May 8, 2020

I thought of it and simplified the classes (no abstract class, and no subtypes Label/Importance/Base/Tag).

Here is the UML design of the classes I wrote :

[image: UML diagram of the classes]

Usages

Let's build a toy dataset

import numpy as np
import pandas as pd

df_toy = pd.DataFrame(
    {
        "y": [-1, -1, 1],
        "p": np.random.uniform(size=3),
        "c": [4.5, 6.7, 9.6],
        "a": [1, 2, 3],
        "b": np.random.normal(size=3)
    }
)
# out
   y         p    c  a         b
0 -1  0.525440  4.5  1  0.616254
1 -1  0.262586  6.7  2 -1.133934
2  1  0.705830  9.6  3 -0.452018

1. Using just colnames to specify target and features
Same idea as @lalo 's VWMappingDefault. We call the method DFtoVW.from_colnames :

DFtoVW.from_colnames(y="y", X =["a"], df=df_toy).process_df()
# out
['-1 | 1', '-1 | 2', '1 | 3']


DFtoVW.from_colnames(y="y", X = set(df_toy) - set(["y", "p"]), df=df_toy).process_df()
# out
['-1 | 0.6162539137336357 1 4.5',
 '-1 | -1.1339344282053312 2 6.7',
 '1 | -0.45201750087182024 3 9.6']


DFtoVW.from_colnames(y=["a", 'c', "p"], X =["b"], df=df_toy, poly_label=True).process_df()
# out
['1:4.5:0.6121907426717043 | -0.43104977544310435',
 '2:6.7:0.6773696976137632 | 2.45363493233382',
 '3:9.6:0.5955350558877885 | 1.2190748658325201']

2. Using the interface that we talked about

DFtoVW(label=SimpleLabel(Col("y")), 
       namespaces=Namespace(features=[Feature(value=Col("a"))]),
       df=df_toy).process_df()
# out
['-1 | 1', '-1 | 2', '1 | 3']


complex_namespaces = [
    Namespace(name="FirstNamespace", features=[Feature(name="ColA", value=Col("a"))]),
    Namespace(name="DoubleIt", value=2, features=[Feature(value=Col("b"))])
]
DFtoVW(label=SimpleLabel(Col("y")), 
       namespaces=complex_namespaces, 
       df=df_toy).process_df()
# out
['-1 |FirstNamespace ColA:1 |DoubleIt:2 0.6162539137336357',
 '-1 |FirstNamespace ColA:2 |DoubleIt:2 -1.1339344282053312',
 '1 |FirstNamespace ColA:3 |DoubleIt:2 -0.45201750087182024']


DFtoVW(label=PolyLabel(action=Col("a"), cost=Col("c"), proba=Col("p")),
       namespaces=Namespace(features=[Feature(value=Col("b"))]),
       df=df_toy).process_df()
# out
['1:4.5:0.6121907426717043 | -0.43104977544310435',
 '2:6.7:0.6773696976137632 | 2.45363493233382',
 '3:9.6:0.5955350558877885 | 1.2190748658325201']

Is that class design okay for you ?

@jackgerrits
Member

In general I really like this! Especially scheme 2.

Another case is the pattern for CB (action:cost:probability | features). Are there other cases?

Yes, in general. Any label can have entirely its own form. The list of possible labels is here:

Each of these types has an associated parse function.

So I think for 1, the default being y="y" should work. But there must be the ability to express these more complex labels and poly_label=True isn't quite enough to do it.

One thing that we really need to consider while looking at this design is multiline examples. I am not sure if you are familiar with them or not. So for contextual bandit examples with action dependent features for example there will be one line which is called the shared example and then there will be a line for every action that can be taken. Then when you learn from this in VW you actually pass a list of examples to VW.

One idea to handle this is that you define a column in the dataframe as a grouping id, and for every row that has the same grouping id they form this list of examples. Then you need to be able to have a different formula for shared vs action examples, so you could define a column which specifies which formula to use for this row.

This bit of extra work on multiline examples turns this from very useful to extremely useful :)

@etiennekintzler
Contributor Author

etiennekintzler commented May 8, 2020

Another case is the pattern for CB (action:cost:probability | features). Are there other cases?

Yes, in general. Any label can have entirely its own form. The list of possible labels is here:

Each of these types has an associated parse function.
So I think for 1, the default being y="y" should work. But there must be the ability to express these more complex labels and poly_label=True isn't quite enough to do it.

Unfortunately I don't know C, so it is a bit hard for me to read the types in example.h. From what I understood, polylabel is the union of all the label formats available in VW, which is not what I had in mind when I created the class PolyLabel. I checked the headers of the files included in example.h; from what I understood:

  • the class PolyLabel seems to correspond to the CB::label type
  • the class SimpleLabel seems to fit the no_label type
  • the class Feature looks like the label_data type (without the initial attribute).

Is that right ?

One thing that we really need to consider while looking at this design is multiline examples. I am not sure if you are familiar with them or not. So for contextual bandit examples with action dependent features for example there will be one line which is called the shared example and then there will be a line for every action that can be taken. Then when you learn from this in VW you actually pass a list of examples to VW.

Nope, I was not familiar with them. I looked in the wiki for the format and found the following resources:

Are there additional resources I should know about to handle this multiline case? I will read them carefully to understand what it is about.

Edited: OK, I found the tutorials on vowpalwabbit.org, which explain it quite well

One idea to handle this is that you define a column in the dataframe as a grouping id, and for every row that has the same grouping id they form this list of examples. Then you need to be able to have a different formula for shared vs action examples, so you could define a column which specifies which formula to use for this row.

Ok! Will try this approach once I better understand what's behind multiline examples.

This bit of extra work on multiline examples turns this from very useful to extremely useful :)

Ok great ! 😄

@jackgerrits
Member

Totally understandable if you aren't familiar with C/C++.

the class PolyLabel seems to correspond to the CB::label type

Yep I agree.

the class SimpleLabel seems to fit the no_label type

Not quite, it corresponds to label_data simple;

the class Feature looks like the label_data type (without the initial attribute).

Features is a different part of the example, so it doesn't correspond to a label.

Here's what I think might make sense. I think what you've proposed is really solid and works (right now) for the SimpleLabel, single line example case. A little bit more work is needed to support further labels and multiline examples. But I think we should merge an initial implementation that is just supporting SimpleLabel and then we can work on adding to it.

What do you think?

@etiennekintzler
Contributor Author

etiennekintzler commented May 8, 2020

the class SimpleLabel seems to fit the no_label type

Not quite, it corresponds to label_data simple;

ok !

the class Feature looks like the label_data type (without the initial attribute).

Features is a different part of the example, so it doesn't correspond to a label.

ok !

Here's what I think might make sense. I think what you've proposed is really solid and works (right now) for the SimpleLabel, single line example case. A little bit more work is needed to support further labels and multiline examples. But I think we should merge an initial implementation that is just supporting SimpleLabel and then we can work on adding to it.

What do you think?

Yes sure we can merge the initial implementation with the SimpleLabel, single line example case. I need to rewrite the tests and verify the error handling process before we can merge :)

I am currently working on the multiline examples and they are becoming a bit clearer to me. However, I would be glad if you could give me some pointers about the dataframe that would generate the multiline examples. If I take the multiline examples from the wiki:

| a:1 b:0.5
0:0.1:0.75 | a:0.5 b:1 c:2

shared | s_1 s_2
0:1.0:0.5 | a:1 b:1 c:1
| a:0.5 b:2 c:1

The part I am a bit confused about is the following: is the fact that the 2nd multiline example has a shared features line and labels for each action that are different from the 1st multiline example an intentional choice by the user, or the result of the availability of the data?
To me the latter would make more sense (otherwise we would have to specify the features for each index x action, which is prohibitive if there are more than 5-10 different indexes). If that is the case, the input dataframe would look like this:

       action  cost  proba    a    b    c shared1 shared2
index                                                    
1           1   NaN    NaN  1.0  0.5  NaN     NaN     NaN
1           2   0.1   0.75  0.5  1.0  2.0     NaN     NaN
2           1   1.0   0.50  1.0  1.0  1.0     s_1     s_2
2           2   NaN    NaN  0.5  2.0  1.0     s_1     s_2

Am I missing something here ?

Thanks in advance for your explanation !

@jackgerrits
Member

The part I am a bit confused about is the following: is the fact that the 2nd multiline example has a shared features line and labels for each action that are different from the 1st multiline example an intentional choice by the user, or the result of the availability of the data?

Intentional choice by the user. In a contextual bandit scenario the shared example describes the features common to all actions, or in other words, the world's features.

In my mind the example you gave may come from something like this:

index  shared  cost  proba  a    b    c    shared_1  shared_2
1      false                1.0  0.5
1      false   0.1   0.75   0.5  1.0  2.0
2      true                                s_1       s_2
2      false   1.0   0.50   1.0  1.0  1.0
2      false                0.5  2.0  1.0

The key thing that makes this work in my mind though is that you can select between two different formulae based on the value of shared. As you can see it has a different shape to the actions. (and with other labels this is even more the case, for example CCB). This is in addition to the fact that it can determine when index changes to the next in the sequence, which denotes that this multiline example list is complete.

@etiennekintzler
Contributor Author

etiennekintzler commented May 8, 2020

The part I am a bit confused about is the following: is the fact that the 2nd multiline example has a shared features line and labels for each action that are different from the 1st multiline example an intentional choice by the user, or the result of the availability of the data?

Intentional choice by the user. In a contextual bandit scenario the shared example describes the features common to all actions, or in other words, the world's features.

Yes

In my mind the example you gave may come from something like this:

index  shared  cost  proba  a    b    c    shared_1  shared_2
1      false                1.0  0.5
1      false   0.1   0.75   0.5  1.0  2.0
2      true                                s_1       s_2
2      false   1.0   0.50   1.0  1.0  1.0
2      false                0.5  2.0  1.0

The key thing that makes this work in my mind though is that you can select between two different formulae based on the value of shared. As you can see it has a different shape to the actions. (and with other labels this is even more the case, for example CCB).

This part is still not clear. I thought the shared example line was just shown to explain what's possible, not that it helps choose between two different formulas.

To be more concrete, let's use the analogy with the news recommender system of the tutorial on contextual bandit. Let's define :

  • shared label : user id and time of the day
  • actions : the articles categories (politics, sports, music, food)
  • not shared feature for each action (and user id) : URL categories and header size.

We would have as many different indexes as unique user connections. That's a lot of different indexes for which to define a specific formula. Hence, how can the choice of using shared parameters or choosing given features be made for each different index, since there are so many index/user connections?

Thanks again for your explanation !

@jackgerrits
Member

jackgerrits commented May 11, 2020

Shared and action convey different things to VW and both are required. (As labels get more complex we need to support more structures such as this)

The two different formulae we're talking about here have the form:
"shared" -> shared | s_1 s_2
"action" -> 0:1.0:0.5 | a:1 b:1 c:1

We would have as many different indexes as unique user connections. That's a lot of different indexes for which to define a specific formula.

No there's only two, shared and action.

Hence, how can the choice of using shared parameters or choosing given features be made for each different index, since there are so many index/user connections?

The choice is made with whether the "shared" column is true or false in this case.

Index is merely used to know when to move onto the next block of examples (multiex). That part is a bit cumbersome, I'm open to ideas here.
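
A minimal sketch of this scheme, assuming a dataframe shaped like the table earlier in the thread: the shared column selects one of the two formulae and the index column delimits each block of examples. The function names and the fixed feature lists are hypothetical, and the leading 0 in the action label simply copies the wiki lines quoted above.

```python
import pandas as pd

ACTION_FEATURES = ["a", "b", "c"]
SHARED_FEATURES = ["shared_1", "shared_2"]


def row_to_line(row):
    """Pick the 'shared' or the 'action' formula for one row (a dict)."""
    if row["shared"]:
        return "shared | " + " ".join(str(row[c]) for c in SHARED_FEATURES)
    feats = " ".join(f"{c}:{row[c]:g}" for c in ACTION_FEATURES if pd.notna(row[c]))
    if pd.notna(row["cost"]):
        # Labelled action: action:cost:probability, then the features.
        return f"0:{row['cost']:g}:{row['proba']:g} | {feats}"
    return "| " + feats


def to_multiline_examples(df, group_col="index"):
    """Return one list of lines per value of the grouping column."""
    return [
        [row_to_line(row) for row in group.to_dict("records")]
        for _, group in df.groupby(group_col, sort=False)
    ]


df = pd.DataFrame(
    {
        "index": [1, 1, 2, 2, 2],
        "shared": [False, False, True, False, False],
        "cost": [None, 0.1, None, 1.0, None],
        "proba": [None, 0.75, None, 0.5, None],
        "a": [1.0, 0.5, None, 1.0, 0.5],
        "b": [0.5, 1.0, None, 1.0, 2.0],
        "c": [None, 2.0, None, 1.0, 1.0],
        "shared_1": [None, None, "s_1", None, None],
        "shared_2": [None, None, "s_2", None, None],
    }
)
examples = to_multiline_examples(df)
```

Each inner list corresponds to one multiline example, i.e. the list of lines that would be handed to VW together.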

@etiennekintzler
Contributor Author

Thanks !

Shared and action convey different things to VW and both are required. (As labels get more complex we need to support more structures such as this)

The two different formulae we're talking about here have the form:
"shared" -> shared | s_1 s_2
"action" -> 0:1.0:0.5 | a:1 b:1 c:1

We would have as many different indexes as unique user connections. That's a lot of different indexes for which to define a specific formula.

No there's only two, shared and action.

Ok, I think there was a misunderstanding on my part. From this exchange:

The part I am a bit confused is the following : Is the fact that the 2nd multilines example has a shared features line and labels for each action that are different than the 1st multilines a intentional choice by the user or the result of the availability of the data ?

Intentional choice by the user. In a contextual bandit scenario the shared example describes the features common to all actions, or in other words, the world's features.

I understood that, for each index, the feature choices (which I incorrectly called labels) were also an intentional choice by the user. From what I now understand, only the choice of including shared parameters is up to the user. So in this example table (from one of your previous messages), the choice to retain features a and b for index 1 only is due to the availability of the data, right?

index shared cost proba a b c shared_1 shared_2
1 false 1.0 0.5
1 false 0.1 0.75 0.5 1.0 2.0
2 true s_1 s_2
2 false 1.0 0.50 1.0 1.0 1.0
2 false 0.5 2.0 1.0

By the way, I think the data that users have is more likely to be formatted as in the table below (as in log files). If the shared features are in a different table, it's easier to left join the shared features table to the original table (with action labels/features) than to insert a new line below each index that has shared parameters.

index shared cost proba a b c shared_1 shared_2
1 false 1.0 0.5
1 false 0.1 0.75 0.5 1.0 2.0
2 true 1.0 0.50 1.0 1.0 1.0 s_1 s_2
2 true 0.5 2.0 1.0 s_1 s_2

What do you think ?
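
To make the left join concrete, here is a minimal pandas sketch. The two dataframes, their values and the derived shared flag are made up for illustration; only the column names follow the tables above.

```python
import pandas as pd

# Hypothetical action-level log: one row per (index, action).
actions = pd.DataFrame({
    "index": [1, 1, 2, 2],
    "cost": [1.0, 0.1, 1.0, 0.5],
    "proba": [0.5, 0.75, 0.5, 0.25],
})

# Hypothetical shared-features table: one row per index that has shared features.
shared = pd.DataFrame({"index": [2], "shared_1": ["s_1"], "shared_2": ["s_2"]})

# Left join keeps every action row; indicator=True records where a match was found.
merged = actions.merge(shared, on="index", how="left", indicator=True)
merged["shared"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")

print(merged)
```

Rows of index 1 end up with missing shared_1/shared_2 and shared set to False, while both rows of index 2 carry s_1 and s_2.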

So how can the choice of using shared parameters, or of selecting given features, be made for each index when there are so many index/user connections?

The choice is made with whether the "shared" column is true or false in this case.

Index is merely used to know when to move onto the next block of examples (multiex). That part is a bit cumbersome, I'm open to ideas here.

Ok ! So the formula could be:

DFtoVW(namespaces=Namespace([Feature(Col("a")), Feature(Col("b")), Feature(Col("c"))]),
       label=CBLabel(cost=Col("cost"), proba=Col("proba")),
       multilines=MultiLines(
           id=Col("index"),
           shared=Share(id=Col("shared"), features=[Feature(Col("shared_1")), Feature(Col("shared_2"))])
       )
)

(Share would be built in a similar way to Namespace as far as the features attribute is concerned.)

What do you think ?

@jackgerrits
Member

So in this example table (from one of your previous messages), the choice to retain features a and b for index 1 only is due to the availability of the data, right?

Yes, it is sparse so it doesn't matter if features are or aren't supplied for each individual example.

It looks like in the second table you provided the id's don't match to what . You wouldn't have two shared examples with the same ID. I would expect this table to read instead as:

index shared cost proba a b c shared_1 shared_2
1 false 1.0 0.5
2 false 0.1 0.75 0.5 1.0 2.0
1 true 1.0 0.50 1.0 1.0 1.0 s_1 s_2
2 true 0.5 2.0 1.0 s_1 s_2

Okay this is really interesting. So, if you think about multiline example formula definitions then I see three things as required:

  • List of formulae to pick along with their name
  • The column used to pick the formula
  • The id used to chunk the multilines

So if I convert the example you gave to follow this in a generic manner then I came up with:

DFtoVW(
    multilines=True,
    formulae={
        "action": Formula(
            namespaces=Namespace([Feature(Col("a")),
                                  Feature(Col("b")),
                                  Feature(Col("c"))]),
            label=CBLabel(cost=Col("cost"), proba=Col("proba"))),
        "shared": Formula(
            features=[Feature(Col("shared_1")),
                      Feature(Col("shared_2"))],
            label=CBLabel(shared=True)),
    },
    id=Col("index"),
    typeMap=Col("type"))

Do note though that now you need a column called "type" that has either "action" or "shared" in it. Also, wow we're getting pretty complex here, but I think it may be necessary to allow full expressiveness?
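
As a rough illustration of the three requirements listed above (named formulae, a column that picks the formula, an id that chunks the multilines), here is a plain pandas sketch. It is not the proposed DFtoVW API: the formatter functions, the FORMULAE dict and the sample data are all hypothetical, and the emitted text simply mirrors the two formulae quoted earlier in this thread.

```python
import pandas as pd

# Hypothetical per-type formatters, one per named formula.
def format_shared(row):
    return f"shared | {row['shared_1']} {row['shared_2']}"

def format_action(row):
    # Sparse features: only emit the ones that are present for this row.
    feats = " ".join(
        f"{name}:{float(row[name])}" for name in ("a", "b", "c") if pd.notna(row[name])
    )
    return f"0:{float(row['cost'])}:{float(row['proba'])} | {feats}".rstrip()

FORMULAE = {"shared": format_shared, "action": format_action}

def to_multiline_blocks(df, id_col="index", type_col="type"):
    """Chunk rows by id_col and format each row with the formula named in type_col."""
    blocks = []
    for _, group in df.groupby(id_col, sort=False):
        blocks.append([FORMULAE[row[type_col]](row) for row in group.to_dict("records")])
    return blocks

nan = float("nan")
df = pd.DataFrame({
    "index": [1, 1, 2, 2, 2],
    "type": ["action", "action", "shared", "action", "action"],
    "cost": [1.0, 0.1, nan, 1.0, 0.5],
    "proba": [0.5, 0.75, nan, 0.5, 0.25],
    "a": [nan, 0.5, nan, 1.0, 2.0],
    "b": [nan, 1.0, nan, 1.0, nan],
    "c": [nan, 2.0, nan, 1.0, 1.0],
    "shared_1": [None, None, "s_1", None, None],
    "shared_2": [None, None, "s_2", None, None],
})

blocks = to_multiline_blocks(df)
# Multiline examples are separated by a blank line in the VW text format.
text = "\n\n".join("\n".join(block) for block in blocks)
print(text)
```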

@etiennekintzler
Contributor Author

So in this example table (from one of your previous messages), the choice to retain features a and b for index 1 only is due to the availability of the data, right?

Yes, it is sparse so it doesn't matter if features are or aren't supplied for each individual example.

Ok, perfect :)

It looks like in the second table you provided the id's don't match to what . You wouldn't have two shared examples with the same ID. I would expect this table to read instead as:

What I called index is a unique identifier of a multiline example. For instance, if we take the analogy of user connections on a website, it would be something like the datetime of a connection. The way you reindexed it makes it seem more like the action number, which I didn't consider since I assumed that the actions were ordered within a given index. An action parameter could easily be added if the actions are not ordered.

Okay this is really interesting. So, if you think about multiline example formula definitions then I see three things as required:

  • List of formulae to pick along with their name
  • The column used to pick the formula
  • The id used to chunk the multilines

So if I convert the example you gave to follow this in a generic manner then I came up with:

DFtoVW(
    multilines=True,
    formulae={
        "action": Formula(
            namespaces=Namespace([Feature(Col("a")),
                                  Feature(Col("b")),
                                  Feature(Col("c"))]),
            label=CBLabel(cost=Col("cost"), proba=Col("proba"))),
        "shared": Formula(
            features=[Feature(Col("shared_1")),
                      Feature(Col("shared_2"))],
            label=CBLabel(shared=True)),
    },
    id=Col("index"),
    typeMap=Col("type"))

Hm, I get that you'd like a clear separation between 'action' and 'shared', but I see the following drawbacks:

  • it mixes dictionary usage, which we haven't used so far in the class signature, with the class types (Feature/Namespace/CBLabel)
  • multilines=True seems somewhat redundant. We can infer it from the fact that an argument is passed to id, for instance (or to MultiLines in my previous example).

More importantly, it seems quite different from the previous interface. I think the progression toward greater complexity should appear as seamless as possible to the user. For instance, from really simple to hell:

  • Simple regression
DFtoVW(label=SimpleLabel(Col("y")),
       namespaces=Namespace(Feature(Col("x"))))
  • Simple CB (single line)
DFtoVW(label=CBLabel(action=Col("a"), prob=Col("p"), cost=Col("c")),
       namespaces=Namespace([Feature(Col("feat1")), Feature(Col("feat2"))]))
  • CB with multilines as our current example without the shared component
DFtoVW(label=CBLabel(action=Col("a"), prob=Col("p"), cost=Col("c")),
       namespaces=Namespace([Feature(Col("a")), Feature(Col("b")), Feature(Col("c"))]),
       multilines=MultiLines(id=Col("index")))
  • CB with multilines and the shared component :
DFtoVW(label=CBLabel(cost=Col("cost"), proba=Col("proba")),
       namespaces=Namespace([Feature(Col("a")), Feature(Col("b")), Feature(Col("c"))]),
       multilines=MultiLines(
           id=Col("index"),
           shared=Share(id=Col("shared"), features=[Feature(Col("shared_1")), Feature(Col("shared_2"))])
       )
)

What do you think ?

Do note though that now you need a column called "type" that has either "action" or "shared" in it.

I do not get why it is needed if shared has its dedicated columns (shared_1 and shared_2)

Also, wow we're getting pretty complex here, but I think it may be necessary to allow full expressiveness?

haha yes 😄 but since the expected format is complex too, better this than frustrate the user (for the time spent to understand the format and do formatting + the risk of ill-formatted file) !

@lalo
Collaborator

lalo commented May 12, 2020

Let's forget about multiline for now. It seems natural for users to already have a representation that is compatible with the simple label, but it seems like a stretch that someone would already have a multiline-compatible structure in their dataframes. Don't get me wrong, the work you've done is great, but we can't seem to justify the need to convert multiline for now. Let's keep the scope of this PR limited and iterate before growing out the design. Does that sound good @etiennekintzler?

@etiennekintzler
Contributor Author

I've included the requested modifications!

In my opinion the only grey areas that remain are the use of SimpleLabel and the type checking. I will try to clarify these two points below.

  1. SimpleLabel

SimpleLabel has a single attribute, name (unlike Feature, which has name and value attributes), that can be either a string constant or a Col value. Note that it is not tied to a specific position (such as the position of the label or of the tag). Hence it can be used for the tag as well as for the label.

What could be done instead is to remove SimpleLabel and just use Col (or a string if the value is constant). Hence DFtoVW(df=df, label=SimpleLabel(Col("y")), namespaces=...) would become just DFtoVW(df=df, label=Col("y"), namespaces=...)

  2. Type checking

Also, the type checking done in SimpleLabel and Feature only covers the arguments provided to the constructor. I do not check the type of the column that the argument could refer to (if Col is used).
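
To show what "constructor-only" checking means, here is a small sketch. The Col and SimpleLabel classes below are simplified stand-ins written for this comment, not the code in the PR; the accepted types are an assumption based on the description above.

```python
class Col:
    """Simplified stand-in: a reference to a dataframe column by name."""

    def __init__(self, colname):
        if not isinstance(colname, str):
            raise TypeError("colname must be a str")
        self.colname = colname


class SimpleLabel:
    """Simplified stand-in: only the constructor argument is type-checked."""

    def __init__(self, name):
        if not isinstance(name, (Col, str)):
            raise TypeError("name must be a Col or a str")
        self.name = name


# Accepted: the argument is a Col. Nothing is verified about the contents
# of the column "y" itself, since no dataframe is inspected here.
label = SimpleLabel(Col("y"))

# Rejected: the argument itself has the wrong type.
try:
    SimpleLabel(3.5)
    rejected = False
except TypeError:
    rejected = True
```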

@jackgerrits
Member

jackgerrits commented May 21, 2020

  1. SimpleLabel

SimpleLabel is a specific kind of label (in the ML sense), and so it should not be reused for tag. This will be confusing to users.

SimpleLabel should be used in only one place: the argument to label in DFtoVW. It is a coincidence that there is only one parameter, which makes it seem generic. We need a type here so that in the future we can switch based on the type of label provided.

  2. Type checking

I think that's okay for now. We can add more strict type checking for this at a later point.

@etiennekintzler
Contributor Author

  1. SimpleLabel

SimpleLabel is a specific kind of label (in the ML sense), and so it should not be reused for tag. This will be confusing to users.

SimpleLabel should be used in only one place: the argument to label in DFtoVW. It is a coincidence that there is only one parameter, which makes it seem generic. We need a type here so that in the future we can switch based on the type of label provided.

Ok ! So I will create a Tag class to differentiate from the SimpleLabel. Is that ok for you ?

I think that's okay for now. We can add more strict type checking for this at a later point.

Yes, I think so too.


Thinking about the overall design (so maybe not for this PR), maybe a reference to a dataframe column should be the default behavior, with constant values supplied using Const. It would simplify the formula, since there tend to be more column values than constants.

@jackgerrits
Member

Since tag is just a string I think either a Col("") or a string can be provided directly?

Yeah potentially about Const vs Col. Making the binding behavior explicit makes sense to me but I can also see your point

@etiennekintzler
Contributor Author

Since tag is just a string I think either a Col("") or a string can be provided directly?

If the tag is just a string, it would be the same tag for all examples, which is wrong, no? Can't it just be a Col?

Yeah potentially about Const vs Col. Making the binding behavior explicit makes sense to me but I can also see your point

ok

@jackgerrits
Member

As a user I kind of expect that in each of the places where I can use Col(), I can also use just a string. It may not always make sense, but it seems to make for the least surprising API.
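
A sketch of what that could look like: one helper that accepts either a Col or a plain string and returns one value per row. The Col class and the resolve function here are hypothetical stand-ins, not the helpers in the PR.

```python
import pandas as pd


class Col:
    """Hypothetical stand-in for a reference to a dataframe column."""

    def __init__(self, colname):
        self.colname = colname


def resolve(value, df):
    """Return one string per row: the column values for a Col, a repeated constant otherwise."""
    if isinstance(value, Col):
        return df[value.colname].astype(str)
    return pd.Series(str(value), index=df.index)


df = pd.DataFrame({"idx": ["id1", "id2"], "y": [0, 1]})

per_row_tag = resolve(Col("idx"), df).tolist()   # a different tag on each example
constant_tag = resolve("batch_1", df).tolist()   # the same tag on every example
```

With a plain string every example gets the same tag, which may not be useful, but the call is at least accepted everywhere a Col is.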

@lgtm-com

lgtm-com bot commented May 22, 2020

This pull request introduces 1 alert when merging 8fff168 into 388d551 - view on LGTM.com

new alerts:

  • 1 for Unused import

@jackgerrits
Member

Thanks so much for all of the work you've done here! Just need to fix CI and the warning and then I think we are good to get this in!

@etiennekintzler
Contributor Author

etiennekintzler commented May 23, 2020

Thanks so much for all of the work you've done here! Just need to fix CI and the warning and then I think we are good to get this in!

Glad to contribute! I fixed the problem mentioned in the previous CI job and removed the unused import.

However, the current Linux CI job started 18h ago and is still in progress (usually it finishes after 10-20 minutes). The tests seem to have passed though: https://dev.azure.com/vowpalwabbit/Vowpal%20Wabbit/_build/results?buildId=9662&view=results

@jackgerrits
Member

Looks like the status check never called back to say it was done. Can you please push an empty commit to retrigger CI?

@etiennekintzler
Contributor Author

Done !

conv = DFtoVW(
label=SimpleLabel(Col("y")),
tag=Col("idx"),
namespaces=Namespace([Feature(name="", value=2)]),
Collaborator

Are empty feature names allowed on VW's input at all? cc @jackgerrits

Contributor Author

I understood that it is, according to @jackgerrits's answer:

no space allowed to the left/right of ":" (or "*", as I saw this character in a previous version). For example: "a :b" will raise an error while "a:b" is of course ok

This is actually permitted. If you supply something like | :1 then it means it is a single feature with a value of 1.0.
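
For reference, a tiny sketch of how a feature with an empty name comes out as text. The format_feature helper is hypothetical (it is not a function from the PR); it only encodes the rule discussed here, name:value with no whitespace around the colon.

```python
def format_feature(name, value):
    """Format one feature as name:value, with no whitespace around the colon."""
    name = str(name)
    # Whitespace, ':' and '|' are separators in the VW text format,
    # so they cannot appear inside a feature name.
    if any(ch.isspace() for ch in name) or ":" in name or "|" in name:
        raise ValueError(f"invalid feature name: {name!r}")
    return f"{name}:{value}"


# An empty name is accepted and yields a single feature carrying only a value.
line = "1 | " + format_feature("", 2)
print(line)
```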

Member

Yes

@jackgerrits
Member

@etiennekintzler would you be able to update the PR description to match where we landed with the discussion?

Collaborator

@lalo lalo left a comment

Thanks for the contribution, discussion and awesome work you've done here @etiennekintzler

@etiennekintzler
Contributor Author

etiennekintzler commented May 26, 2020

@etiennekintzler would you be able to update the PR description to match where we landed with the discussion?

Yep I am on it !

Just fixing one last anomaly that occurs when the method process_df is called multiple times, and then I'll push again.

@etiennekintzler
Contributor Author

@etiennekintzler would you be able to update the PR description to match where we landed with the discussion?

Done ! Let me know if you want more detailed explanation !

Thanks for the contribution, discussion and awesome work you've done here @etiennekintzler

Thanks I really appreciate it !

@jackgerrits
Member

This is awesome @etiennekintzler! Thanks for all of your hard work here and congrats on your first PR into VW 😄

@jackgerrits jackgerrits merged commit 8077112 into VowpalWabbit:master May 27, 2020
@etiennekintzler etiennekintzler deleted the pandas_to_vw_text_format branch May 27, 2020 14:58
@etiennekintzler
Contributor Author

etiennekintzler commented May 27, 2020

This is awesome @etiennekintzler! Thanks for all of your hard work here and congrats on your first PR into VW 😄

Thanks a lot,

I do believe that easier integration with python and its ecosystem (pandas, sklearn) will widen the user base !

NB: The class name I used, DFtoVW, does not strictly respect camel case; you can rename it if you want.

Also, the class Col and the functions _get_col_or_value, _get_all_cols and _check_type are duplicated in the source file.

4 participants