Text Processing #1300
Codecov Report
@@ Coverage Diff @@
## development #1300 +/- ##
===============================================
- Coverage 88.07% 87.63% -0.44%
===============================================
Files 140 146 +6
Lines 10993 11285 +292
===============================================
+ Hits 9682 9890 +208
- Misses 1311 1395 +84
# Conflicts: # autosklearn/metalearning/files/accuracy_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/accuracy_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/accuracy_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/accuracy_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/average_precision_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/average_precision_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/average_precision_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/average_precision_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/balanced_accuracy_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/balanced_accuracy_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/balanced_accuracy_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/balanced_accuracy_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_macro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_macro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_macro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_macro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_micro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_micro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_micro_multiclass.classification_dense/configurations.csv # 
autosklearn/metalearning/files/f1_micro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_samples_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_samples_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_samples_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_samples_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_weighted_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_weighted_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/f1_weighted_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/f1_weighted_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/log_loss_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/log_loss_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/log_loss_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/log_loss_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/mean_absolute_error_regression_dense/configurations.csv # autosklearn/metalearning/files/mean_absolute_error_regression_sparse/configurations.csv # autosklearn/metalearning/files/mean_squared_error_regression_dense/configurations.csv # autosklearn/metalearning/files/mean_squared_error_regression_sparse/configurations.csv # autosklearn/metalearning/files/mean_squared_log_error_regression_dense/configurations.csv # autosklearn/metalearning/files/mean_squared_log_error_regression_sparse/configurations.csv # 
autosklearn/metalearning/files/median_absolute_error_regression_dense/configurations.csv # autosklearn/metalearning/files/median_absolute_error_regression_sparse/configurations.csv # autosklearn/metalearning/files/precision_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_macro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_macro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_macro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_macro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_micro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_micro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_micro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_micro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_samples_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_samples_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_samples_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_samples_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/precision_weighted_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_weighted_binary.classification_sparse/configurations.csv # 
autosklearn/metalearning/files/precision_weighted_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/precision_weighted_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/r2_regression_dense/configurations.csv # autosklearn/metalearning/files/r2_regression_sparse/configurations.csv # autosklearn/metalearning/files/recall_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_macro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_macro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_macro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_macro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_micro_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_micro_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_micro_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_micro_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_samples_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_samples_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_samples_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_samples_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_weighted_binary.classification_dense/configurations.csv # 
autosklearn/metalearning/files/recall_weighted_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/recall_weighted_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/recall_weighted_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/roc_auc_binary.classification_dense/configurations.csv # autosklearn/metalearning/files/roc_auc_binary.classification_sparse/configurations.csv # autosklearn/metalearning/files/roc_auc_multiclass.classification_dense/configurations.csv # autosklearn/metalearning/files/roc_auc_multiclass.classification_sparse/configurations.csv # autosklearn/metalearning/files/root_mean_squared_error_regression_dense/configurations.csv # autosklearn/metalearning/files/root_mean_squared_error_regression_sparse/configurations.csv
autosklearn/pipeline/components/data_preprocessing/feature_reduction/truncated_svd.py
autosklearn/pipeline/components/data_preprocessing/text_encoding/bag_of_word_encoding.py
autosklearn/pipeline/components/data_preprocessing/feature_type_text.py
…aybe this needs to be fixed in a bigger scale)
I'm good with this. @eddiebergman could you please have a look at your remaining comments?
autosklearn/pipeline/components/data_preprocessing/text_encoding/__init__.py
...klearn/pipeline/components/data_preprocessing/text_encoding/bag_of_word_encoding_distinct.py
autosklearn/pipeline/components/data_preprocessing/text_encoding/tfidf_encoding.py
…ed self.preprocessor again to it self
test/test_pipeline/components/data_preprocessing/test_data_preprocessing_text.py
    for feature in X.columns:
        vectorizer = CountVectorizer(min_df=self.min_df_absolute,
                                     ngram_range=(1, self.ngram_range)).fit(X[feature])
        self.preprocessor[feature] = vectorizer
Still needs dropna to be done: X[feature] could have NAs in it, which CountVectorizer can't handle.
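The point in this comment can be sketched roughly as follows. This is only an illustration, not the PR's code: the toy DataFrame, its column names, and the choice to use fillna("") at transform time are all assumptions made for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (assumed for illustration): each text column has a missing value.
X = pd.DataFrame({
    "title": ["red apple", None, "green apple"],
    "body": ["sweet fruit", "sour fruit", float("nan")],
})

preprocessor = {}
for feature in X.columns:
    # Drop missing entries before fitting, so the vectorizer only sees strings.
    column = X[feature].dropna()
    preprocessor[feature] = CountVectorizer(min_df=1, ngram_range=(1, 1)).fit(column)

# At transform time, rows must stay aligned with X, so dropping is not an option.
# One possible choice (an assumption here) is to replace missing values with "".
encoded = preprocessor["title"].transform(X["title"].fillna(""))
```

Dropping rows is only safe during fit, where the goal is to learn a vocabulary; during transform the number of output rows has to match the input, which is why the sketch substitutes an empty string there instead.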
Just a note on the
I think it is because sum(*) is a Python built-in, so it is not certain how it behaves here. As far as I know, "+=" is a NumPy operation if both sides are NumPy arrays. But I can also change it to np.sum if that is better.
    ans = sum(items)
    ans = item[0] + item[1] + ... + item[n]
Edit: Turns out I'm completely wrong on this. Use [...] Essentially, there was talk of making it work as I described, but they didn't go ahead with it.
I get the following message
Just do it manually as is, it's fine. Can deal with performance later if needed.
I agree on that. Actually, I think the current way to do it is the best way because it is agnostic to the type of array implementation. It would work for numpy arrays, pandas dataframes and scipy sparse matrices. Other solutions such as the built-in sum() are more complicated to understand, and np.sum() explicitly invokes numpy. Anyway, I'd be happy to merge this version. Anything left from your side @eddiebergman?
All good from my side :) ... side note: to keep things general with addition (and how I thought reduce(operator.add, xs)
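For readers following the side note, here is a small sketch of how reduce(operator.add, xs) compares with manual accumulation and the built-in sum(). The Vec class is a hypothetical stand-in for any array-like type that defines `+`; it is not part of auto-sklearn, and it deliberately omits `__radd__` to show where sum() differs.

```python
from dataclasses import dataclass
from functools import reduce
import operator


@dataclass(frozen=True)
class Vec:
    """Hypothetical stand-in for an array type that only defines `+`."""
    values: tuple

    def __add__(self, other):
        # Element-wise addition of two Vec objects.
        return Vec(tuple(a + b for a, b in zip(self.values, other.values)))


xs = [Vec((1, 2)), Vec((3, 4)), Vec((5, 6))]

# Manual accumulation: start from the first item and add the rest.
total_manual = xs[0]
for x in xs[1:]:
    total_manual = total_manual + x

# reduce applies `+` pairwise and needs no start value.
total_reduce = reduce(operator.add, xs)

# The built-in sum() starts from the integer 0, so it needs `0 + Vec` to work.
try:
    sum(xs)
    builtin_sum_ok = True
except TypeError:
    builtin_sum_ok = False
```

Both the manual loop and reduce rely only on the type's own `+`, which is why they stay agnostic to the array implementation; whether sum() works depends on how the type handles addition with the integer 0.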
* commit meta learning data bases
* commit changed files
* commit new files
* fixed experimental settings
* implemented last comments on old PR
* adapted metalearning to last commit
* add a text preprocessing example
* intigrated feedback
* new changes on *.csv files
* reset changes
* add changes for merging
* add changes for merging
* add changes for merging
* try to merge
* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)
* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)
* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)
* init
* init
* commit changes for text preprocessing
* text prepreprocessing commit
* fix metalearning
* fix metalearning
* adapted test to new text feature
* fix style guide issues
* integrate PR comments
* integrate PR comments
* implemented the comments to the last PR
* fitted operation is not in place therefore we have to assgin the fitted self.preprocessor again to it self
* add first text processing tests
* add first text processing tests
* including comments from 01.25.
* including comments from 01.28.
* including comments from 01.28.
* including comments from 01.28.
* including comments from 01.31.
Pull Request for Hackathon on Wednesday 17.11.