File Classifier by Type (v2) #5362

ZanSara · 2023-07-14T13:59:47Z

Very simple component that takes a list of paths and splits it into several output lists, one for each file type.

It could look like the following:

from pathlib import Path
from typing import List, Union

# `component` is the Haystack v2 component decorator (import omitted in this sketch)
@component
class FileTypeClassifier:

    @component.input
    def input(self):
        class Input:
            paths: List[Union[str, Path]]
        return Input

    @component.output
    def output(self):
        class Output:
            # ... as many extensions as defined in the __init__, like: "txt: List[Union[str, Path]]" ...
            ...
        return Output

    def __init__(self, extensions: List[str]):
        ...

    def run(self, data):
        # ... categorizes the files ...
        # one keyword argument per extension defined in the __init__
        return self.output(txt=txt_files, pdf=pdf_files, wav=wav_files)

Note: while implementing the above, we realized that extracting the MIME type of a file is a non-trivial task to carry out cross-platform in Python. The standard-library mimetypes module relies on the file extension. The "correct" library to use, python-magic, is non-trivial to install on Windows because libmagic has to be shipped with it. There might be Python-based alternatives, such as polyfile or filetype, but we would need to check their respective accuracy.
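To illustrate the point about the standard library, here is a minimal sketch showing that mimetypes.guess_type only looks at the file name and never at the file content. The helper name guess_by_name and the file name report.txt are made up for this example.

```python
import mimetypes
import tempfile
from pathlib import Path


def guess_by_name(path):
    """Return the MIME type that mimetypes infers from the file name alone."""
    mime, _encoding = mimetypes.guess_type(str(path))
    return mime


# Write PDF magic bytes into a file that carries a misleading .txt extension.
with tempfile.TemporaryDirectory() as tmp:
    disguised = Path(tmp) / "report.txt"
    disguised.write_bytes(b"%PDF-1.4\n")
    guessed = guess_by_name(disguised)

# The guess follows the .txt suffix; the PDF bytes are never read.
print(guessed)
```

A content-based library such as python-magic would be needed to detect that the file above is really a PDF.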

Therefore we decided to still implement a simple FileExtensionClassifier that uses the file extension: it's very easy to use, very quick, and does not try to read the file, which can be a plus in some scenarios. We'll then also implement a "real" FileTypeClassifier using one of the libraries mentioned for more accurate results.
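The extension-based approach could be sketched roughly as below, in plain Python with no Haystack dependency. This is not the code merged in #5514: the function name classify_by_extension and the separate list for unmatched paths are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List, Tuple, Union

PathLike = Union[str, Path]


def classify_by_extension(
    paths: List[PathLike], extensions: List[str]
) -> Tuple[Dict[str, List[PathLike]], List[PathLike]]:
    """Group paths by file extension without opening any file.

    Returns one list per configured extension, plus a list of paths
    whose extension was not configured (an assumption of this sketch).
    """
    # Normalize configured extensions so "TXT" and ".txt" both become "txt".
    wanted = [ext.lower().lstrip(".") for ext in extensions]
    buckets: Dict[str, List[PathLike]] = {ext: [] for ext in wanted}
    unclassified: List[PathLike] = []

    for path in paths:
        # Path.suffix returns only the last suffix, e.g. ".gz" for "a.tar.gz".
        suffix = Path(path).suffix.lower().lstrip(".")
        if suffix in buckets:
            buckets[suffix].append(path)
        else:
            unclassified.append(path)
    return buckets, unclassified


by_type, rest = classify_by_extension(
    ["a.txt", Path("b.PDF"), "c.wav", "notes"], extensions=["txt", "pdf"]
)
print(by_type)
print(rest)
```

Matching is case-insensitive here, so b.PDF lands in the pdf list; whether the real component should behave the same way is a design choice left open by this sketch.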

Tasks

FileExtensionClassifier: feat: Add FileExtensionClassifier to previews #5514
FileTypeClassifier based on either python-magic or polyfile or filetype

vblagoje · 2023-08-03T08:54:44Z

What's the best package for this component in preview, @ZanSara? I put it in io, but I'm not sure how we want to structure packages under components. Suggestions welcome.

ZanSara · 2023-08-07T10:44:07Z

We don't have specific rules yet, but for this component we may make a package called classifiers. It may end up including all types of classifiers (query classifiers, document classifiers, etc.) and imho that's ok. If it gets too big, we'll figure out later how to split it.

vblagoje · 2023-10-24T15:12:45Z

This issue was completed with #5514, wasn't it, @ZanSara?

ZanSara mentioned this issue Jul 14, 2023

Migrate Components to Pipeline v2 #5265

Closed

ZanSara changed the title ~~File Type Classifier component (v2)~~ FileTypeClassifier component (v2) Jul 14, 2023

ZanSara changed the title ~~FileTypeClassifier component (v2)~~ FileTypeClassifier (v2) Jul 14, 2023

ZanSara mentioned this issue Jul 14, 2023

FileToDocument components (2.x) #5367

Closed

masci assigned vblagoje Jul 19, 2023

vblagoje mentioned this issue Aug 7, 2023

feat: Add FileExtensionClassifier to previews #5514

Merged

ZanSara changed the title ~~FileTypeClassifier (v2)~~ File Classifier by Type (v2) Aug 15, 2023

ZanSara unassigned vblagoje Aug 16, 2023

ZanSara added the 2.x Related to Haystack v2.0 label Aug 25, 2023

masci added the good first issue Good for newcomers label Sep 13, 2023

bilgeyucel added the hacktoberfest label Sep 26, 2023

masci assigned vblagoje Oct 3, 2023

masci closed this as completed Oct 30, 2023
