Description
All of the Dataset parameters seem to work correctly, except for ignore_column (https://lightgbm.readthedocs.io/en/latest/Parameters.html#ignore_column).
I have seen similar earlier issues on this topic, all old and closed, such as #1061, #2376, and #4657, so I understand why it would not work for a binary file that is already a constructed dataset. But why would it not work for a CSV file? I suppose there are other ways of excluding or dropping columns from the dataset besides the LightGBM Dataset parameters, but since the parameter exists and appears to be available, I would like to confirm whether it is working as expected, or whether my reproducible example is incorrect.
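One such alternative, sketched here under the assumption that the data fits in memory, is to drop the unwanted columns with pandas and pass the frame to lgb.Dataset directly. The helper names split_columns and build_dataset are hypothetical, and the group sizes are derived with a groupby on group_id, which assumes that the rows of each query are contiguous in the file.

```python
import pandas as pd


def split_columns(df, label, group, ignore):
    """Return (features, labels, group_sizes) with unwanted columns dropped.

    Hypothetical helper: not part of LightGBM.
    """
    features = df.drop(columns=[label, group, *ignore])
    labels = df[label]
    # For in-memory data LightGBM expects group *sizes*, not per-row ids.
    # Assumes rows of one query are contiguous; sort=False keeps file order.
    group_sizes = df.groupby(group, sort=False).size().tolist()
    return features, labels, group_sizes


def build_dataset(features, labels, group_sizes):
    """Build and construct an lgb.Dataset from the prepared pieces."""
    import lightgbm as lgb  # imported lazily so split_columns needs only pandas
    return lgb.Dataset(features, label=labels, group=group_sizes).construct()


frame = pd.DataFrame({
    'label': [0, 1, 0, 1, 0, 1],
    'feature2': [5, 7, 2, 6, 1, 8],
    'feature3': [3, 8, 3, 7, 2, 9],
    'feature4': [4, 9, 4, 8, 3, 10],
    'group_id': [1, 2, 2, 3, 3, 3],
})
features, labels, group_sizes = split_columns(
    frame, 'label', 'group_id', ['feature2', 'feature3']
)
print(list(features.columns))  # ['feature4']
print(group_sizes)             # [1, 2, 3]
```

Calling build_dataset(features, labels, group_sizes).num_feature() should then count only the columns that were kept, since the ignored columns never reach LightGBM.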
Reproducible example
import lightgbm as lgb
import pandas as pd
# Create a DataFrame with sample data
sample_data = pd.DataFrame({
'label': [0, 1, 0, 1, 0, 1],
'feature2': [5, 7, 2, 6, 1, 8],
'feature3': [3, 8, 3, 7, 2, 9],
'feature4': [4, 9, 4, 8, 3, 10],
'group_id': [1, 2, 2, 3, 3, 3]
})
# Save the DataFrame to a CSV file
sample_file = 'sample_data.csv'
sample_data.to_csv(sample_file, index=False)
# Load the dataset with LightGBM (label is by default the first column but making it explicit)
dataset = lgb.Dataset(sample_file, params={'header': True, 'label_column': 'name:label', 'ignore_column': 'name:feature2,feature3', 'query': 'name:group_id'})
# Construct the dataset
dataset.construct()
# [LightGBM] [Info] Using column label as label
# [LightGBM] [Info] Using column group_id as group/query id
# [LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
# [LightGBM] [Info] Construct bin mappers from text data time 0.00 seconds
# Get the number of features and data points
num_feature = dataset.num_feature()
print(f'Number of features: {num_feature}') # shows 4 incorrectly?
num_data = dataset.num_data() # shows 6 correctly
print(f'Number of data points: {num_data}')
# the two below work correctly (label is by default the first column)
get_label = dataset.get_label()
print(f'Label: {get_label}')
get_group = dataset.get_group()
print(f'Group: {get_group}')
Environment info
Operating System: Windows 11
CPU/GPU model: CPU
C++/Python/R version: Python
LightGBM version or commit hash: 4.5.0 (same issue with v3)
Given the above sample data, I would expect feature4 to be the only feature: the label and group_id columns should not be treated as features, and feature2 and feature3 should be ignored/dropped. However, the number of features returned is 4 (which I understand to mean that only the label column is excluded and the remaining 4 columns are loaded). I also tried using column indices instead of the name: prefix, and using a plain-text CSV file written by hand instead of one exported from a DataFrame. I also thought that maybe num_feature() reflects ignore_column only before the dataset is constructed, but the number of features cannot be obtained before construction; a clear error message is returned in that case.
I also wondered whether this parameter only works in the CLI, as one of the older issues suggested, but the documentation does not say so.
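The expectation above can be made concrete with a small pure-Python sketch of the column selection. It does not call LightGBM and is not a description of how LightGBM implements these parameters; the function name expected_features is hypothetical.

```python
def expected_features(header, label, query, ignore):
    """Columns expected to remain as features, selected by name."""
    excluded = {label, query, *ignore}
    return [name for name in header if name not in excluded]


header = ['label', 'feature2', 'feature3', 'feature4', 'group_id']
remaining = expected_features(
    header, 'label', 'group_id', ['feature2', 'feature3']
)
print(remaining)       # ['feature4']
print(len(remaining))  # 1, versus the 4 reported by num_feature()
```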
Adding to the above: if I set the ignore_column parameter to indices or names that don't exist in the data, the number of features still remains the same, e.g.
dataset = lgb.Dataset(sample_file, params={'header': True, 'label_column': 'name:label', 'ignore_column': '0,1,2,3,4,5,6,7,8,9,10', 'query': 'name:group_id'})
dataset.construct()
num_feature = dataset.num_feature() # gives 4 as well