[ci] load_boston has been removed in sklearn 1.2 #5634

Closed
shiyu1994 opened this issue Dec 13, 2022 · 3 comments

@shiyu1994
Collaborator

Description

The Boston dataset has been removed in sklearn 1.2 due to ethical issues. However, many of our test cases still use this dataset, so the removal causes our CI jobs to fail.

Reproducible example

(https://github.com/microsoft/LightGBM/actions/runs/3654388214/jobs/6174767879)

>           raise ImportError(msg)
E           ImportError:
E           `load_boston` has been removed from scikit-learn since version 1.2.
E           
E           The Boston housing prices dataset has an ethical problem: as
E           investigated in [1], the authors of this dataset engineered a
E           non-invertible variable "B" assuming that racial self-segregation had a
E           positive impact on house prices [2]. Furthermore the goal of the
E           research that led to the creation of this dataset was to study the
E           impact of air quality but it did not give adequate demonstration of the
E           validity of this assumption.
E           
E           The scikit-learn maintainers therefore strongly discourage the use of
E           this dataset unless the purpose of the code is to study and educate
E           about ethical issues in data science and machine learning.
E           
E           In this special case, you can fetch the dataset from the original
E           source::
E           
E               import pandas as pd
E               import numpy as np
E           
E               data_url = "http://lib.stat.cmu.edu/datasets/boston"
E               raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
E               data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
E               target = raw_df.values[1::2, 2]
E           
E           Alternative datasets include the California housing dataset and the
E           Ames housing dataset. You can load the datasets as follows::
E           
E               from sklearn.datasets import fetch_california_housing
E               housing = fetch_california_housing()
E           
E           for the California housing dataset and::
E           
E               from sklearn.datasets import fetch_openml
E               housing = fetch_openml(name="house_prices", as_frame=True)
E           
E           for the Ames housing dataset.
E           
E           [1] M Carlisle.
E           "Racist data destruction?"
E           <https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>
E           
E           [2] Harrison Jr, David, and Daniel L. Rubinfeld.
E           "Hedonic housing prices and the demand for clean air."
E           Journal of environmental economics and management 5.1 (1978): 81-102.
E           <https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>

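
For anyone hitting the same failure in their own test suite: besides the two alternative datasets named in the error message, one option that needs no download is to generate synthetic regression data. Below is a minimal sketch, assuming the tests only need a small numeric regression problem and not the Boston data specifically. The helper name `load_regression_fixture` and the chosen sizes (506 rows, 13 features, matching the shape of the old Boston data) are illustrative assumptions, and this is not necessarily the approach taken in #5581.

```python
from sklearn.datasets import make_regression


def load_regression_fixture(n_samples=506, n_features=13, random_state=42):
    """Return a deterministic synthetic regression dataset (X, y).

    Hypothetical stand-in for `load_boston(return_X_y=True)` in tests;
    the default shape mirrors the old Boston data only for convenience.
    """
    X, y = make_regression(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_features,  # every feature contributes to the target
        noise=10.0,                # some noise so the problem is not trivially linear
        random_state=random_state, # fixed seed keeps CI runs reproducible
    )
    return X, y


X, y = load_regression_fixture()
```

Because the seed is fixed, repeated calls return identical arrays, so score thresholds in existing tests can be re-tuned once and then stay stable across CI runs.
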
@jameslamb
Collaborator

Thanks @shiyu1994, but we already have an issue for this (#4793) and a PR to fix it (#5581).

@shiyu1994
Collaborator Author

@jameslamb Thanks. Sorry for missing that...

@jameslamb
Collaborator

No problem! Appreciate you reporting this, and sorry we didn't get the fix merged sooner.

microsoft locked as resolved and limited conversation to collaborators on Dec 23, 2022