ignore_column seems obsolete #6699

kainkad · 2024-10-25T13:58:29Z

Description

All of the Dataset parameters seem to work correctly except ignore_column (https://lightgbm.readthedocs.io/en/latest/Parameters.html#ignore_column).
I have seen similar earlier issues on this topic, all old and closed (#1061, #2376, #4657), so I understand why it would not work for a binary file that is already a constructed dataset. But why would it not work for a CSV file? I suppose there are other ways of excluding or dropping columns than the LightGBM Dataset parameters, but since the parameter exists and appears to be available, I would like to confirm whether it works as expected or whether my reproducible example is incorrect.

Reproducible example

import lightgbm as lgb
import pandas as pd

# Create a DataFrame with sample data
sample_data = pd.DataFrame({
    'label': [0, 1, 0, 1, 0, 1],
    'feature2': [5, 7, 2, 6, 1, 8],
    'feature3': [3, 8, 3, 7, 2, 9],
    'feature4': [4, 9, 4, 8, 3, 10],
    'group_id': [1, 2, 2, 3, 3, 3]
})
# Save the DataFrame to a CSV file
sample_file = 'sample_data.csv'
sample_data.to_csv(sample_file, index=False)

# Load the dataset with LightGBM (the label is the first column by default; it is set explicitly here)
params = {
    'header': True,
    'label_column': 'name:label',
    'ignore_column': 'name:feature2,feature3',
    'query': 'name:group_id',
}
dataset = lgb.Dataset(sample_file, params=params)

# Construct the dataset
dataset.construct()

# [LightGBM] [Info] Using column label as label
# [LightGBM] [Info] Using column group_id as group/query id
# [LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
# [LightGBM] [Info] Construct bin mappers from text data time 0.00 seconds

# Get the number of features and data points
num_feature = dataset.num_feature()
print(f'Number of features: {num_feature}') # shows 4 incorrectly?
num_data = dataset.num_data() # shows 6 correctly
print(f'Number of data points: {num_data}')
# the two below work correctly (label is by default the first column)
get_label = dataset.get_label()
print(f'Label: {get_label}')
get_group = dataset.get_group()
print(f'Group: {get_group}')

Environment info

Operating System: Windows 11

CPU/GPU model: CPU

C++/Python/R version: Python

LightGBM version or commit hash: 4.5.0 (same issue with v3)

Given the above sample data, I would expect only feature4 to count as a feature: the label and group_id columns should not be treated as features, and feature2 and feature3 should be ignored/dropped. However, the number of features returned is 4 (which, as I understand it, means only the label column is excluded and the remaining 4 columns are loaded). I also tried column indices instead of the name: prefix, and a plain-text CSV file instead of one written from a DataFrame. I also thought that maybe .num_feature() has to be called before constructing the dataset, but that is not possible: calling it before construction returns a clear error message.
I also thought that maybe this parameter only works in the CLI, as suggested in one of the older issues, but the documentation does not say so.
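
One possible workaround for the "other ways of dropping columns" mentioned above is to strip the unwanted columns from the CSV before LightGBM ever reads it. The sketch below uses only the standard library; `write_csv_without_columns` is a hypothetical helper written for this illustration, not part of LightGBM, and it assumes the file has a header row and that columns are selected by name.

```python
import csv
import os
import tempfile


def write_csv_without_columns(src_path, dst_path, drop_columns):
    """Copy src_path to dst_path, omitting the named columns.

    Hypothetical helper for illustration; assumes the first row is a header.
    Returns the list of column names that were kept.
    """
    drop = set(drop_columns)
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        # Fail loudly on unknown names instead of silently ignoring them.
        missing = drop - set(header)
        if missing:
            raise ValueError(f"columns not found in header: {sorted(missing)}")
        keep = [i for i, name in enumerate(header) if name not in drop]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])
    return [header[i] for i in keep]


# Recreate the sample data from the report in a temporary directory.
tmp_dir = tempfile.mkdtemp()
sample_file = os.path.join(tmp_dir, "sample_data.csv")
rows = [
    ["label", "feature2", "feature3", "feature4", "group_id"],
    [0, 5, 3, 4, 1],
    [1, 7, 8, 9, 2],
    [0, 2, 3, 4, 2],
    [1, 6, 7, 8, 3],
    [0, 1, 2, 3, 3],
    [1, 8, 9, 10, 3],
]
with open(sample_file, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Drop feature2 and feature3 up front instead of relying on ignore_column.
filtered_file = os.path.join(tmp_dir, "sample_data_filtered.csv")
kept = write_csv_without_columns(sample_file, filtered_file, ["feature2", "feature3"])
print(kept)
```

The filtered file could then be passed to lgb.Dataset with the same header, label_column and query parameters and without ignore_column. This sidesteps the parameter rather than explaining its behaviour, and what num_feature() reports for the filtered file has not been verified here.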


kainkad · 2024-10-28T10:43:18Z

Adding to the above: if I set the ignore_column parameter to indices or names that do not exist in the data, the number of features still remains the same, e.g.

dataset = lgb.Dataset(sample_file, params={'header': True, 'label_column': 'name:label', 'ignore_column': '0,1,2,3,4,5,6,7,8,9,10', 'query': 'name:group_id'})
dataset.construct()
num_feature = dataset.num_feature() # gives 4 as well

jameslamb added the question label Oct 25, 2024
