Text Processing #1300

Louquinze · 2021-11-12T12:55:20Z

Pull Request for Hackathon on Wednesday 17.11.

Rebased the forked branch and merged all current changes into it.
The last test might still fail.

codecov · 2021-11-12T13:39:11Z

Codecov Report

Merging #1300 (bc6e883) into development (a9fbd5c) will decrease coverage by 0.43%.
The diff coverage is 100.00%.

❗ Current head bc6e883 differs from pull request most recent head ce1c0d1. Consider uploading reports for the commit ce1c0d1 to get more accurate results

@@               Coverage Diff               @@
##           development    #1300      +/-   ##
===============================================
- Coverage        88.07%   87.63%   -0.44%     
===============================================
  Files              140      146       +6     
  Lines            10993    11285     +292     
===============================================
+ Hits              9682     9890     +208     
- Misses            1311     1395      +84

Impacted Files	Coverage Δ
...osklearn/metalearning/metafeatures/metafeatures.py	`94.47% <ø> (-0.12%)`	⬇️
...a_preprocessing/feature_reduction/truncated_svd.py	`68.75% <ø> (ø)`
...line/components/data_preprocessing/feature_type.py	`89.21% <ø> (+1.09%)`	⬆️
...components/data_preprocessing/feature_type_text.py	`90.00% <ø> (ø)`
...nents/data_preprocessing/text_encoding/__init__.py	`84.37% <ø> (ø)`
...reprocessing/text_encoding/bag_of_word_encoding.py	`70.00% <ø> (ø)`
...ing/text_encoding/bag_of_word_encoding_distinct.py	`58.33% <ø> (ø)`
...data_preprocessing/text_encoding/tfidf_encoding.py	`71.15% <ø> (ø)`
autosklearn/data/feature_validator.py	`96.63% <100.00%> (-0.87%)`	⬇️
...ponents/feature_preprocessing/select_percentile.py	`84.61% <0.00%> (-7.70%)`	⬇️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a9fbd5c...ce1c0d1.

autosklearn/data/feature_validator.py

# Conflicts: # autosklearn/metalearning/files/accuracy_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/accuracy_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/accuracy_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/accuracy_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/average_precision_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/average_precision_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/average_precision_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/average_precision_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/balanced_accuracy_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/balanced_accuracy_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/balanced_accuracy_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/balanced_accuracy_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_macro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_macro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_macro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_macro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_micro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_micro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_micro_multiclass.classification_dense/configurations.csv # 
autosklearn/metalearning/files/f1_micro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_samples_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_samples_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_samples_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_samples_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_weighted_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_weighted_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_weighted_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_weighted_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/log_loss_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/log_loss_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/log_loss_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/log_loss_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/mean_absolute_error_regression_dense/configurations.csv # autosklearn/metalearning/files/mean_absolute_error_regression_sparse/configurations.csv # autosklearn/metalearning/files/mean_squared_error_regression_dense/configurations.csv # autosklearn/metalearning/files/mean_squared_error_regression_sparse/configurations.csv # autosklearn/metalearning/files/mean_squared_log_error_regression_dense/configurations.csv # autosklearn/metalearning/files/mean_squared_log_error_regression_sparse/configurations.csv # 
autosklearn/metalearning/files/median_absolute_error_regression_dense/configurations.csv # autosklearn/metalearning/files/median_absolute_error_regression_sparse/configurations.csv # autosklearn/metalearning/files/precision_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_macro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_macro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_macro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_macro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_micro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_micro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_micro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_micro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_samples_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_samples_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_samples_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_samples_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_weighted_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_weighted_binary.classification_sparse/configurations.csv # 
autosklearn/metalearning/files/precision_weighted_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_weighted_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/r2_regression_dense/configurations.csv # autosklearn/metalearning/files/r2_regression_sparse/configurations.csv # autosklearn/metalearning/files/recall_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_macro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_macro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_macro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_macro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_micro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_micro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_micro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_micro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_samples_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_samples_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_samples_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_samples_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_weighted_binary.classification_dense/configurations.csv # 
autosklearn/metalearning/files/recall_weighted_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_weighted_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_weighted_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/roc_auc_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/roc_auc_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/roc_auc_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/roc_auc_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/root_mean_squared_error_regression_dense/configurations.csv # autosklearn/metalearning/files/root_mean_squared_error_regression_sparse/configurations.csv

mfeurer

I'm good with this. @eddiebergman could you please have a look at your remaining comments?

eddiebergman · 2022-01-28T11:45:08Z

...klearn/pipeline/components/data_preprocessing/text_encoding/bag_of_word_encoding_distinct.py

+                for feature in X.columns:
+                    vectorizer = CountVectorizer(min_df=self.min_df_absolute,
+                                                 ngram_range=(1, self.ngram_range)).fit(X[feature])
+                    self.preprocessor[feature] = vectorizer


This still needs a dropna: X[feature] could contain NaNs, which CountVectorizer can't handle.
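A minimal sketch of the suggestion above, assuming a pandas DataFrame whose columns hold text. The function name `fit_per_column_vectorizers` and its parameters `min_df` and `max_ngram` are hypothetical stand-ins for the component's `self.min_df_absolute` and `self.ngram_range`; this is not the code that was merged.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def fit_per_column_vectorizers(X, min_df=1, max_ngram=1):
    """Fit one CountVectorizer per text column, skipping missing values."""
    vectorizers = {}
    for feature in X.columns:
        # Drop NaN/None entries so the vectorizer only ever sees strings.
        texts = X[feature].dropna()
        vectorizers[feature] = CountVectorizer(
            min_df=min_df, ngram_range=(1, max_ngram)
        ).fit(texts)
    return vectorizers


# Hypothetical example data with a missing value in each column.
X = pd.DataFrame({
    "title": ["red apple", None, "green apple"],
    "body": ["fast car", "slow car", None],
})
vectorizers = fit_per_column_vectorizers(X)
```

Dropping rows is only safe while fitting. At transform time the row count has to be preserved, so missing values would need a different treatment there, for example replacing them with empty strings (an assumption, not something stated in this thread).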

eddiebergman · 2022-02-01T12:07:59Z

Just a note on the sum thing @mfeurer was concerned about: it does exactly what was done previously and is now reimplemented. I don't really mind how it's implemented, but it changes nothing except making the code shorter.

Louquinze · 2022-02-01T13:00:50Z

I think it is because sum(*) is a Python built-in, so it is not certain how it behaves. But as far as I know, "+=" is a numpy operation if both sides are numpy arrays. I can also change it to np.sum if that is better.

eddiebergman · 2022-02-01T13:25:23Z

~~sum actually just calls the underlying __add__ operation, i.e. it calls np.ndarray.__add__. This is similar to how len just calls the underlying object's __len__ method.~~

~~sum is basically shorthand for~~

ans = sum(items)
ans = items[0] + items[1] + ... + items[n]

Edit: Turns out I'm completely wrong on this. Use np.sum(iterable) for np arrays.
https://stackoverflow.com/a/49908528

Essentially there was talk of making it work as I described but they didn't go ahead with it.

Louquinze · 2022-02-01T13:50:50Z

I get the following message:
Calling np.sum(generator) is deprecated, and in the future will give a different result. Use np.sum(np.fromiter(generator)) or the python sum builtin instead.
return np.sum(self.preprocessor.transform(X[feature]) for feature in X.columns)
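For illustration only, a small sketch of two ways to avoid handing a generator to np.sum, assuming the per-column outputs are dense NumPy arrays of equal shape. The `encoded` list is a hypothetical stand-in for the transform results; sparse outputs would need separate handling.

```python
import numpy as np

# Hypothetical stand-ins for the per-column transform outputs.
encoded = [
    np.array([[1, 0], [0, 2]]),
    np.array([[0, 3], [1, 1]]),
]

# A list (not a generator) is converted to one stacked array,
# which is then reduced along the first axis.
total_np = np.sum([block for block in encoded], axis=0)

# The built-in sum starts from 0 and applies + to each item in turn.
total_builtin = sum(block for block in encoded)
```

Both variants give the same element-wise total for this input.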

eddiebergman · 2022-02-01T14:08:06Z

> i get the following message Calling np.sum(generator) is deprecated, and in the future will give a different result. Use np.sum(np.fromiter(generator)) or the python sum builtin instead. return np.sum(self.preprocessor.transform(X[feature]) for feature in X.columns)

Just do it manually as is, it's fine. Can deal with performance later if needed.

mfeurer · 2022-02-01T15:35:39Z

> Just do it manually as is, it's fine. Can deal with performance later if needed.

I agree with that. Actually, I think the current way of doing it is the best one because it is agnostic to the type of array implementation. It would work for numpy arrays, pandas dataframes and scipy sparse matrices. Other solutions such as the built-in sum() are more complicated to understand, and np.sum() explicitly invokes numpy.

Anyway, I'd be happy to merge this version. Anything left from your side @eddiebergman ?

eddiebergman · 2022-02-01T15:59:07Z

All good from my side :)

... side note: to keep things general with addition (and how I thought sum actually worked)

reduce(operator.add, xs)
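A short sketch of that side note next to the manual accumulation it generalizes. The helper names `add_all` and `add_all_loop` are hypothetical. Unlike the built-in sum, neither starts from an implicit 0, so they work for any type that defines `+`.

```python
import operator
from functools import reduce

import numpy as np


def add_all(xs):
    """Fold the items with +, using the first item as the start value."""
    return reduce(operator.add, xs)


def add_all_loop(xs):
    """Same result as add_all, written as an explicit loop."""
    items = iter(xs)
    total = next(items)
    for x in items:
        # total + x (not +=) so the first item is never modified in place.
        total = total + x
    return total


xs = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
result = add_all(xs)
result_loop = add_all_loop(xs)
```

Because only `+` is used, the same helpers also concatenate strings, which the built-in sum refuses to do.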

* commit meta learning data bases
* commit changed files
* commit new files
* fixed experimental settings
* implemented last comments on old PR
* adapted metalearning to last commit
* add a text preprocessing example
* intigrated feedback
* new changes on *.csv files
* reset changes
* add changes for merging
* add changes for merging
* add changes for merging
* try to merge
* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)
* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)
* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)
* init
* init
* commit changes for text preprocessing
* text prepreprocessing commit
* fix metalearning
* fix metalearning
* adapted test to new text feature
* fix style guide issues
* integrate PR comments
* integrate PR comments
* implemented the comments to the last PR
* fitted operation is not in place therefore we have to assgin the fitted self.preprocessor again to it self
* add first text processing tests
* add first text processing tests
* including comments from 01.25.
* including comments from 01.28.
* including comments from 01.28.
* including comments from 01.28.
* including comments from 01.31.

Louquinze and others added 5 commits November 9, 2021 11:51

commit meta learning data bases

4450d86

commit changed files

e821eaf

commit new files

ae4f59f

fixed experimental settings

d0a10ab

Merge branch 'automl:development' into development

65271a9

mfeurer reviewed Nov 17, 2021

autosklearn/data/feature_validator.py (outdated, resolved)

eddiebergman changed the title ~~Development~~ Text Processing Nov 17, 2021

eddiebergman mentioned this pull request Nov 17, 2021

How to apply a custom preprocessor to only specified features #1110

Open

implemented last comments on old PR

55e87e2

eddiebergman mentioned this pull request Nov 17, 2021

Automatic Feature Type Discovery #469

Open

Louquinze added 7 commits November 17, 2021 14:42

adapted metalearning to last commit

590387d

add a text preprocessing example

2809c46

intigrated feedback

ffe8ccf

new changes on *.csv files

8094eb5

reset changes

1a27144

add changes for merging

1a2f66d

mfeurer reviewed Nov 17, 2021

Louquinze added 3 commits November 17, 2021 18:30

add changes for merging

107e854

add changes for merging

88aa101

try to merge

11f092f

mfeurer closed this Nov 17, 2021

mfeurer reopened this Nov 17, 2021

eddiebergman reviewed Nov 17, 2021

eddiebergman added the PR: In progress label Nov 27, 2021

Louquinze added 3 commits December 7, 2021 12:40

fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

d5a03d6

fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

220807e

fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

38ffd06

eddiebergman added the PR: Metadata label Dec 13, 2021

integrate PR comments

e85eb2e

mfeurer approved these changes Jan 19, 2022

mfeurer mentioned this pull request Jan 19, 2022

Text preprocessing V2 TODOs #1373

Open

11 tasks

eddiebergman requested changes Jan 19, 2022

eddiebergman reviewed Jan 19, 2022

...klearn/pipeline/components/data_preprocessing/text_encoding/bag_of_word_encoding_distinct.py (outdated, resolved)

implemented the comments to the last PR

cafb1d4

Louquinze requested a review from eddiebergman January 23, 2022 13:16

Louquinze added 2 commits January 23, 2022 14:48

fitted operation is not in place therefore we have to assgin the fitted self.preprocessor again to it self

b9da42d

add first text processing tests

d2d5a24

mfeurer reviewed Jan 24, 2022

add first text processing tests

ac40ff9

eddiebergman reviewed Jan 24, 2022

Louquinze added 2 commits January 25, 2022 08:59

including comments from 01.25.

38be7c3

including comments from 01.28.

5f6d6a7

Louquinze requested a review from eddiebergman January 28, 2022 11:27

eddiebergman requested changes Jan 28, 2022

Louquinze added 2 commits January 28, 2022 17:58

including comments from 01.28.

94b9c27

including comments from 01.28.

bc6e883

mfeurer reviewed Jan 31, 2022

autosklearn/pipeline/components/data_preprocessing/text_encoding/tfidf_encoding.py (outdated, resolved)

mfeurer reviewed Jan 31, 2022

test/test_pipeline/components/data_preprocessing/test_data_preprocessing_text.py (outdated, resolved)

including comments from 01.31.

ce1c0d1

mfeurer merged commit 4b21321 into automl:development Feb 3, 2022
