create_vectorized_features error #103

Open
MLFlexer opened this issue Feb 22, 2023 · 8 comments · May be fixed by #108

Comments

@MLFlexer

I have problems running the following commands in Python:

import ember
ember.create_vectorized_features("/data/ember2018/")

I have installed the dependencies and tried on Docker with LIEF versions 0.9.0 and 0.10.1, and I still get the same failure:

ember.create_vectorized_features("./ember/")
Vectorizing training set
  0%|                                                                                    | 0/900000 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/__init__.py", line 44, in vectorize_unpack
    return vectorize(*args)
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/__init__.py", line 31, in vectorize
    feature_vector = extractor.process_raw_features(raw_features)
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/features.py", line 552, in process_raw_features
    feature_vectors = [fe.process_raw_features(raw_obj[fe.name]) for fe in self.features]
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/features.py", line 552, in <listcomp>
    feature_vectors = [fe.process_raw_features(raw_obj[fe.name]) for fe in self.features]
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/features.py", line 192, in process_raw_features
    entry_name_hashed = FeatureHasher(50, input_type="string").transform([raw_obj['entry']]).toarray()[0]
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/feature_extraction/_hash.py", line 170, in transform
    raise ValueError(
ValueError: Samples can not be a single string. The input must be an iterable over iterables of strings.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/__init__.py", line 75, in create_vectorized_features
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/__init__.py", line 60, in vectorize_subset
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
ValueError: Samples can not be a single string. The input must be an iterable over iterables of strings.
>>>

It seems from the error message that the input is not in the format the vectorizer expects.
Is there a fix for this?

@birkj

birkj commented Feb 22, 2023

I have the same problem. @mrphilroth is this a common problem?

@AhlemRn

AhlemRn commented Apr 2, 2023

I have the same problem. If you have fixed it, please tell me how.

@MLFlexer
Author

MLFlexer commented Apr 3, 2023

I have the same problem. If you have fixed it, please tell me how.

I have not been able to find a fix for this yet, although I have not spent much time on it.

@keremgirenes

I had the same issue. Downgrading Python to 3.6 in my environment worked like a charm.

@gparrella12

A way to fix it is to replace:
entry_name_hashed = FeatureHasher(50, input_type="string").transform([raw_obj['entry']]).toarray()[0]
with:
entry_name_hashed = FeatureHasher(50, input_type="string").transform([ [raw_obj['entry']] ]).toarray()[0]

in features.py at line 192. In this way, an iterable over iterables of raw features is obtained, as the transform() method requires.
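The fixed call can be tried in isolation with the minimal sketch below. It assumes scikit-learn and NumPy are installed. The helper name hash_entry_name is hypothetical and not part of ember; it only wraps the same FeatureHasher call with the extra pair of brackets.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher


def hash_entry_name(entry_name, n_features=50):
    """Hash one entry-point section name into a fixed-length vector.

    Hypothetical helper for illustration only; it is not part of ember.
    """
    hasher = FeatureHasher(n_features, input_type="string")
    # Outer list: one sample. Inner list: that sample's tokens (a single name).
    return hasher.transform([[entry_name]]).toarray()[0]


entry_vector = hash_entry_name(".text")
print(entry_vector.shape)              # length of the hashed vector
print(np.count_nonzero(entry_vector))  # number of populated bins
```

Because the sample contains a single token, the resulting vector has exactly one populated bin.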

maciejskorski linked a pull request Jul 19, 2023 that will close this issue
@maciejskorski

Same problem. I started a fork to curate this repo. Also, my PR #108 fixes the issue.

@KSroido

KSroido commented Mar 17, 2024

Downgrading to Python 3.6 will easily solve this.

pturnah referenced this issue Apr 7, 2024
Fixed:
ValueError: Samples can not be a single string. The input must be an iterable over iterables of strings.

By replacing:
entry_name_hashed = FeatureHasher(50, input_type="string").transform([raw_obj['entry']]).toarray()[0]
with:
entry_name_hashed = FeatureHasher(50, input_type="string").transform([ [raw_obj['entry']] ]).toarray()[0]

at line 192.

In this way, an iterable over iterables of raw features is obtained, as the transform() method requires.
@mdaument

A way to fix it is to replace:
entry_name_hashed = FeatureHasher(50, input_type="string").transform([raw_obj['entry']]).toarray()[0]
with:
entry_name_hashed = FeatureHasher(50, input_type="string").transform([ [raw_obj['entry']] ]).toarray()[0]

in features.py at line 192. In this way, an iterable over iterables of raw features is obtained, as the transform() method requires.

Can anyone provide any insight on what the intended output for the entry name hash table is supposed to be?

Using the code as written with Python 3.6 or earlier, the FeatureHasher hashes each character in the entry string. For example, if .text is the entry point, there are 4 bins populated in the returned hash table.

Using the fixed version, the FeatureHasher hashes the entire string, so an entry point string of .text will return a hash table with only one bin populated.

In the grand scheme of the model, I don't know if either way has much of an impact, but it would be good to know if the authors intended the hash table to be one way or the other.
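The difference between the two behaviors comes from how a sample is iterated. With input_type="string", each sample is treated as an iterable of tokens, and iterating a Python string yields its characters. The sketch below is a toy hashing trick in pure Python that shows this effect. It is an illustration under stated assumptions, not scikit-learn's implementation: hashed_bins and populated are hypothetical helpers, and MD5 is only a stand-in hash, so the bin positions will not match FeatureHasher's.

```python
import hashlib


def hashed_bins(tokens, n_bins=50):
    """Toy hashing trick: add a signed count per token into one of n_bins.

    Hypothetical helper; MD5 is a stand-in for the hash scikit-learn uses.
    """
    bins = [0] * n_bins
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_bins
        sign = 1 if digest[4] % 2 == 0 else -1
        bins[index] += sign
    return bins


def populated(bins):
    """Count the bins that hold a non-zero value."""
    return sum(1 for value in bins if value != 0)


entry = ".text"
per_character = hashed_bins(entry)   # iterating a str yields '.', 't', 'e', 'x', 't'
whole_string = hashed_bins([entry])  # a one-element list yields the full name

print(populated(per_character), populated(whole_string))
```

The whole-string version always populates exactly one bin. The per-character version populates at most four bins for .text, one per distinct character, and fewer if two characters happen to land in the same bin.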
