Text Processing #1300

Merged
merged 40 commits into from
Feb 3, 2022

Conversation

Louquinze
Collaborator

Pull Request for Hackathon on Wednesday 17.11.

  • rebased the forked branch and merged all current changes into it
  • last test might still fail

@codecov

codecov bot commented Nov 12, 2021

Codecov Report

Merging #1300 (bc6e883) into development (a9fbd5c) will decrease coverage by 0.43%.
The diff coverage is 100.00%.

❗ Current head bc6e883 differs from pull request most recent head ce1c0d1. Consider uploading reports for the commit ce1c0d1 to get more accurate results

@@               Coverage Diff               @@
##           development    #1300      +/-   ##
===============================================
- Coverage        88.07%   87.63%   -0.44%     
===============================================
  Files              140      146       +6     
  Lines            10993    11285     +292     
===============================================
+ Hits              9682     9890     +208     
- Misses            1311     1395      +84     
Impacted Files | Coverage Δ
...osklearn/metalearning/metafeatures/metafeatures.py | 94.47% <ø> (-0.12%) ⬇️
...a_preprocessing/feature_reduction/truncated_svd.py | 68.75% <ø> (ø)
...line/components/data_preprocessing/feature_type.py | 89.21% <ø> (+1.09%) ⬆️
...components/data_preprocessing/feature_type_text.py | 90.00% <ø> (ø)
...nents/data_preprocessing/text_encoding/__init__.py | 84.37% <ø> (ø)
...reprocessing/text_encoding/bag_of_word_encoding.py | 70.00% <ø> (ø)
...ing/text_encoding/bag_of_word_encoding_distinct.py | 58.33% <ø> (ø)
...data_preprocessing/text_encoding/tfidf_encoding.py | 71.15% <ø> (ø)
autosklearn/data/feature_validator.py | 96.63% <100.00%> (-0.87%) ⬇️
...ponents/feature_preprocessing/select_percentile.py | 84.61% <0.00%> (-7.70%) ⬇️
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a9fbd5c...ce1c0d1.

@eddiebergman eddiebergman changed the title from "Development" to "Text Processing" Nov 17, 2021
# Conflicts:
#	autosklearn/metalearning/files/accuracy_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/accuracy_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/accuracy_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/accuracy_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/average_precision_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/average_precision_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/average_precision_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/average_precision_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/balanced_accuracy_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/balanced_accuracy_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/balanced_accuracy_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/balanced_accuracy_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_macro_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_macro_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_macro_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_macro_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_micro_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_micro_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_micro_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_micro_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_samples_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_samples_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_samples_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_samples_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_weighted_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_weighted_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/f1_weighted_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/f1_weighted_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/log_loss_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/log_loss_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/log_loss_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/log_loss_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/mean_absolute_error_regression_dense/configurations.csv
#	autosklearn/metalearning/files/mean_absolute_error_regression_sparse/configurations.csv
#	autosklearn/metalearning/files/mean_squared_error_regression_dense/configurations.csv
#	autosklearn/metalearning/files/mean_squared_error_regression_sparse/configurations.csv
#	autosklearn/metalearning/files/mean_squared_log_error_regression_dense/configurations.csv
#	autosklearn/metalearning/files/mean_squared_log_error_regression_sparse/configurations.csv
#	autosklearn/metalearning/files/median_absolute_error_regression_dense/configurations.csv
#	autosklearn/metalearning/files/median_absolute_error_regression_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_macro_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_macro_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_macro_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_macro_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_micro_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_micro_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_micro_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_micro_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_samples_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_samples_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_samples_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_samples_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_weighted_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_weighted_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/precision_weighted_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/precision_weighted_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/r2_regression_dense/configurations.csv
#	autosklearn/metalearning/files/r2_regression_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_macro_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_macro_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_macro_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_macro_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_micro_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_micro_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_micro_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_micro_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_samples_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_samples_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_samples_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_samples_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_weighted_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_weighted_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/recall_weighted_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/recall_weighted_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/roc_auc_binary.classification_dense/configurations.csv
#	autosklearn/metalearning/files/roc_auc_binary.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/roc_auc_multiclass.classification_dense/configurations.csv
#	autosklearn/metalearning/files/roc_auc_multiclass.classification_sparse/configurations.csv
#	autosklearn/metalearning/files/root_mean_squared_error_regression_dense/configurations.csv
#	autosklearn/metalearning/files/root_mean_squared_error_regression_sparse/configurations.csv
@mfeurer mfeurer closed this Nov 17, 2021
@mfeurer mfeurer reopened this Nov 17, 2021
Contributor

@mfeurer mfeurer left a comment

I'm good with this. @eddiebergman could you please have a look at your remaining comments?

@mfeurer mfeurer mentioned this pull request Jan 19, 2022
11 tasks
for feature in X.columns:
    vectorizer = CountVectorizer(min_df=self.min_df_absolute,
                                 ngram_range=(1, self.ngram_range)).fit(X[feature])
    self.preprocessor[feature] = vectorizer
Contributor

Still needs dropna to be done, X[feature] could have na's in them which CountVectorizer can't handle.
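
One way to address the missing-value concern can be sketched as follows. This is a minimal illustration, not the code from this PR: the toy DataFrame, the column name `review`, and the plain `preprocessor` dict are assumptions made for the example, and `min_df=1` with `ngram_range=(1, 1)` stands in for the component's hyperparameters.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical text column containing one missing entry.
X = pd.DataFrame({"review": ["good movie", None, "bad movie", "good plot"]})

preprocessor = {}
for feature in X.columns:
    # Drop missing entries before fitting; only the vocabulary is learned here.
    docs = X[feature].dropna()
    preprocessor[feature] = CountVectorizer(min_df=1, ngram_range=(1, 1)).fit(docs)

vocabulary = sorted(preprocessor["review"].vocabulary_)

# At transform time every row must be kept, so missing entries are replaced
# with an empty string instead of being dropped.
counts = preprocessor["review"].transform(X["review"].fillna(""))
```

Dropping rows only makes sense while fitting. When transforming, the row count has to match the input, which is why the sketch switches to `fillna("")` there.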

@eddiebergman
Contributor

eddiebergman commented Feb 1, 2022

Just a note on the sum thing @mfeurer was concerned about: it does exactly what was done previously and is now reimplemented. I don't really mind how it's implemented, but it changes nothing except making the code shorter.

@Louquinze
Collaborator Author

I think it is because sum(*) is a Python built-in, so it is not safe to rely on how it behaves. But as far as I know, "+=" is a numpy operation if both sides are numpy arrays. I can also change it to np.sum if that is better.

@eddiebergman
Contributor

eddiebergman commented Feb 1, 2022

sum actually just calls the underlying __add__ operation, i.e. it calls np.ndarray.__add__. This is similar to how len just calls the underlying object's __len__ method.

sum is basically shorthand for

ans = sum(items)
ans = items[0] + items[1] + ... + items[n]

Edit: Turns out I'm completely wrong on this. Use np.sum(iterable) for np arrays.
https://stackoverflow.com/a/49908528

Essentially there was talk of making it work as I described but they didn't go ahead with it.

@Louquinze
Collaborator Author

I get the following message:
Calling np.sum(generator) is deprecated, and in the future will give a different result. Use np.sum(np.fromiter(generator)) or the python sum builtin instead.
return np.sum(self.preprocessor.transform(X[feature]) for feature in X.columns)

@eddiebergman
Contributor

i get the following message Calling np.sum(generator) is deprecated, and in the future will give a different result. Use np.sum(np.fromiter(generator)) or the python sum builtin instead. return np.sum(self.preprocessor.transform(X[feature]) for feature in X.columns)

Just do it manually as is, it's fine. Can deal with performance later if needed.

@mfeurer
Contributor

mfeurer commented Feb 1, 2022

Just do it manually as is, it's fine. Can deal with performance later if needed.

I agree on that. Actually, I think the current way to do it is the best way because it is agnostic to the type of array implementation. It would work for numpy arrays, pandas dataframes and scipy sparse matrices. Other solutions such as the built-in sum() are more complicated to understand, and np.sum() explicitly invokes numpy.
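
The manual accumulation discussed here can be sketched roughly as below. The PR's actual loop is not shown in this conversation, so the helper name `add_up` and the toy count matrices are assumptions. The point is only that the loop relies on nothing but the `+` operator of whatever array type it is given.

```python
import numpy as np


def add_up(parts):
    """Sum same-shaped array-likes using only the `+` operator."""
    total = None
    for part in parts:
        # `total + part` (rather than `+=`) avoids mutating the first input.
        total = part if total is None else total + part
    return total


# Hypothetical per-column bag-of-words counts sharing one vocabulary.
per_column_counts = (
    np.array([[1, 0, 2], [0, 1, 0]]),
    np.array([[0, 3, 1], [1, 0, 0]]),
)

# A generator is fine here because the helper never hands it to np.sum.
combined = add_up(c for c in per_column_counts)
```

Because the helper never calls a numpy function itself, the same loop should apply to any type that defines `+` for two operands of the same shape.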

Anyway, I'd be happy to merge this version. Anything left from your side @eddiebergman ?

@eddiebergman
Contributor

eddiebergman commented Feb 1, 2022

All good from my side :)

... side note: to keep things general with addition (and how I thought sum actually worked)

reduce(operator.add, xs)
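
The difference between the built-in sum and the reduce form can be seen without numpy. Plain lists are used here as a stand-in, an assumption for illustration only, because they define `+` between themselves but cannot be added to an integer.

```python
from functools import reduce
import operator

xs = [[1, 2], [3], [4, 5]]

# reduce folds the items with `+` only: xs[0] + xs[1] + xs[2].
folded = reduce(operator.add, xs)

# The built-in sum starts from 0 by default, so it first evaluates 0 + xs[0],
# which lists do not support.
try:
    sum(xs)
    sum_error = None
except TypeError as exc:
    sum_error = exc

# Supplying a start value of the right type makes sum behave like the fold.
with_start = sum(xs, [])
```

One caveat of the reduce form: without an initial value it raises a TypeError on an empty sequence, so a caller would need to guard against zero columns.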

@mfeurer mfeurer merged commit 4b21321 into automl:development Feb 3, 2022
eddiebergman pushed a commit that referenced this pull request Aug 18, 2022
* commit meta learning data bases

* commit changed files

* commit new files

* fixed experimental settings

* implemented last comments on old PR

* adapted metalearning to last commit

* add a text preprocessing example

* intigrated feedback

* new changes on *.csv files

* reset changes

* add changes for merging

* add changes for merging

* add changes for merging

* try to merge

* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

* init

* init

* commit changes for text preprocessing

* text prepreprocessing commit

* fix metalearning

* fix metalearning

* adapted test to new text feature

* fix style guide issues

* integrate PR comments

* integrate PR comments

* implemented the comments to the last PR

* fitted operation is not in place therefore we have to assgin the fitted self.preprocessor again to it self

* add first text processing tests

* add first text processing tests

* including comments from 01.25.

* including comments from 01.28.

* including comments from 01.28.

* including comments from 01.28.

* including comments from 01.31.
3 participants