ignore_column seems obsolete #6699

Open
kainkad opened this issue Oct 25, 2024 · 1 comment

kainkad commented Oct 25, 2024

Description

All of the Dataset parameters seem to work correctly except for ignore_column (https://lightgbm.readthedocs.io/en/latest/Parameters.html#ignore_column).
I have seen earlier issues on this topic, such as #1061, #2376, and #4657, which are old and closed. From those I understand why the parameter would not apply to a binary file that is already a constructed dataset, but why would it not work for a CSV file? There are presumably other ways of excluding or dropping columns than the LightGBM Dataset parameters, but since the parameter exists and is documented, I would like to confirm whether it works as expected or whether my reproducible example is incorrect.

Reproducible example

import lightgbm as lgb
import pandas as pd

# Create a DataFrame with sample data
sample_data = pd.DataFrame({
    'label': [0, 1, 0, 1, 0, 1],
    'feature2': [5, 7, 2, 6, 1, 8],
    'feature3': [3, 8, 3, 7, 2, 9],
    'feature4': [4, 9, 4, 8, 3, 10],
    'group_id': [1, 2, 2, 3, 3, 3]
})
# Save the DataFrame to a CSV file
sample_file = 'sample_data.csv'
sample_data.to_csv(sample_file, index=False)

# Load the dataset with LightGBM (label is by default the first column but making it explicit)
dataset = lgb.Dataset(
    sample_file,
    params={
        'header': True,
        'label_column': 'name:label',
        'ignore_column': 'name:feature2,feature3',
        'query': 'name:group_id',
    },
)

# Construct the dataset
dataset.construct()

# [LightGBM] [Info] Using column label as label
# [LightGBM] [Info] Using column group_id as group/query id
# [LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
# [LightGBM] [Info] Construct bin mappers from text data time 0.00 seconds

# Get the number of features and data points
num_feature = dataset.num_feature()
print(f'Number of features: {num_feature}') # shows 4 incorrectly?
num_data = dataset.num_data() # shows 6 correctly
print(f'Number of data points: {num_data}')
# the two below work correctly (label is by default the first column)
get_label = dataset.get_label()
print(f'Label: {get_label}')
get_group = dataset.get_group()
print(f'Group: {get_group}')

Environment info

Operating System: Windows 11

CPU/GPU model: CPU

C++/Python/R version: Python

LightGBM version or commit hash: 4.5.0 (same issue with v3)

Given the sample data above, I would expect feature4 to be the only feature: label and group_id should not be treated as features, and feature2 and feature3 should be ignored/dropped. However, the number of features returned is 4 (which, as I understand it, means only the label column is excluded and the remaining 4 columns are loaded). I also tried passing column indices instead of the name: prefix, and using a CSV written as plain text rather than one generated from a DataFrame. I also wondered whether num_feature() has to be called before constructing the dataset, but that is not possible: calling it before construction returns a clear error message.
I also wondered whether this parameter only works in the CLI, as suggested in one of the older issues, but the documentation does not state that.
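
For reference, one of the "other ways of dropping columns" mentioned in the description can be sketched without relying on ignore_column at all: write a filtered copy of the CSV and hand that file to lgb.Dataset instead. The helper name drop_columns and the filtered file name are hypothetical, and the sketch uses only the standard library so that it does not depend on any LightGBM behaviour.

```python
import csv
import os
import tempfile


def drop_columns(src_path, dst_path, columns_to_drop):
    """Copy a CSV that has a header row, leaving out the named columns.

    Hypothetical helper for illustration; it is not part of LightGBM.
    Returns the header of the filtered file.
    """
    drop = set(columns_to_drop)
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        missing = drop - set(header)
        if missing:
            # Fail loudly instead of silently ignoring unknown names.
            raise ValueError(f"columns not found in header: {sorted(missing)}")
        keep = [i for i, name in enumerate(header) if name not in drop]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])
    return [header[i] for i in keep]


# Same data as the reproducible example, written out by hand.
rows = [
    ["label", "feature2", "feature3", "feature4", "group_id"],
    [0, 5, 3, 4, 1],
    [1, 7, 8, 9, 2],
    [0, 2, 3, 4, 2],
    [1, 6, 7, 8, 3],
    [0, 1, 2, 3, 3],
    [1, 8, 9, 10, 3],
]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample_data.csv")
    dst = os.path.join(tmp, "sample_data_filtered.csv")
    with open(src, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    kept = drop_columns(src, dst, ["feature2", "feature3"])
    with open(dst, newline="") as f:
        filtered = list(csv.reader(f))

print(kept)
```

The filtered file would then be loaded with the same label_column and query parameters as in the example, with ignore_column left out.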

kainkad commented Oct 28, 2024

Adding to the above: if I set ignore_column to indices or names that do not exist in the data, the number of features still stays the same, e.g.

dataset = lgb.Dataset(sample_file, params={'header': True, 'label_column': 'name:label', 'ignore_column': '0,1,2,3,4,5,6,7,8,9,10', 'query': 'name:group_id'})
dataset.construct()
num_feature = dataset.num_feature() # gives 4 as well
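
Since a spec with non-existent indices or names is accepted silently here, a small pre-check on the caller's side can at least catch typos before the file is loaded. The sketch below is a hypothetical helper (check_column_spec is not a LightGBM function). It assumes the convention described in the parameter documentation, namely that a name: prefix means column names and that integer indices start at 0 and do not count the label column. How LightGBM itself treats invalid entries is not something this sketch claims to reproduce.

```python
def check_column_spec(spec, header, label="label"):
    """Return a list of problems found in an ignore_column-style string.

    Hypothetical helper for illustration; not part of LightGBM.
    `header` is the list of column names in the CSV, `label` the label column.
    """
    problems = []
    if spec.startswith("name:"):
        # Name form: "name:col_a,col_b"
        for name in spec[len("name:"):].split(","):
            name = name.strip()
            if name not in header:
                problems.append(f"unknown column name: {name!r}")
    else:
        # Index form: "0,1,2". Assumption taken from the parameter docs:
        # indices start at 0 and the label column is not counted.
        n_non_label = len([h for h in header if h != label])
        for token in spec.split(","):
            token = token.strip()
            if not token.isdigit():
                problems.append(f"not an integer index: {token!r}")
            elif int(token) >= n_non_label:
                problems.append(f"index out of range: {token}")
    return problems


header = ["label", "feature2", "feature3", "feature4", "group_id"]

ok = check_column_spec("name:feature2,feature3", header)
bad_name = check_column_spec("name:feature2,nope", header)
bad_index = check_column_spec("0,1,2,3,4,5,6,7,8,9,10", header)

print(ok)
print(bad_name)
print(len(bad_index))
```

With the five-column header from the example, the index form from the comment above has four valid positions (0 to 3) under this assumption, so the remaining entries are reported as out of range.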
