Note: while implementing this component we realized that extracting the MIME type of a file is non-trivial to do cross-platform in Python. The standard-library mimetypes module relies only on the file extension. The "correct" library to use, python-magic, is hard to install on Windows because libmagic has to be shipped with it. There may be Python-based alternatives such as polyfile or filetype, but we would need to check their respective accuracy.
Therefore we decided to still implement a simple FileExtensionClassifier that uses the file extension: it's easy to use, fast, and does not read the file, which can be a plus in some scenarios. We'll then also implement a "real" FileTypeClassifier using one of the libraries mentioned above for more accurate results.
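To illustrate the trade-off, here is a small sketch that uses only the standard library. The helper names guess_by_extension and looks_like_pdf are hypothetical and exist only for this example. The magic-bytes check is a hand-rolled stand-in for what a content-based library would do, not the API of python-magic, polyfile, or filetype.

```python
import mimetypes
import tempfile
from pathlib import Path


def guess_by_extension(path):
    """Hypothetical helper: guess the MIME type from the file name only."""
    # mimetypes.guess_type never opens the file; it only inspects the name.
    mime, _encoding = mimetypes.guess_type(Path(path).name)
    return mime


def looks_like_pdf(path):
    """Hypothetical helper: check the leading bytes for the PDF signature."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


with tempfile.TemporaryDirectory() as tmp:
    # A plain-text file that merely carries a .pdf extension.
    fake_pdf = Path(tmp) / "notes.pdf"
    fake_pdf.write_text("just plain text")

    by_extension = guess_by_extension(fake_pdf)  # trusts the extension
    by_content = looks_like_pdf(fake_pdf)        # reads the file itself
```

The extension-based guess reports a PDF even though the content is plain text, while the content check does not. That is the accuracy gap a content-based FileTypeClassifier would be meant to close, at the cost of reading each file.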
What's the best folder package for this component in preview @ZanSara? I put it in io but not sure how we want to structure packages under components. Suggestions welcome.
So far we don't have specific rules yet, but for this component we may make a package called classifiers. It may end up including all types of classifiers (query classifiers, document classifiers, etc etc) and imho that's ok. If it gets too big we'll think later how to split it.
A very simple component that takes in a list of paths and splits it into several output lists, one for each file type.
It could look like the following:
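A minimal sketch of the idea, assuming classification by file extension only. The function name classify_by_extension, its parameters, and the "unclassified" bucket are assumptions made for this example, not the component's actual interface.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Union


def classify_by_extension(
    paths: Iterable[Union[str, Path]],
    known_extensions: Iterable[str],
) -> Dict[str, List[Path]]:
    """Split paths into one list per known extension (hypothetical helper).

    Paths whose extension is not in known_extensions are collected
    under the "unclassified" key.
    """
    known = {ext.lower() for ext in known_extensions}
    groups: Dict[str, List[Path]] = defaultdict(list)
    for raw in paths:
        path = Path(raw)
        ext = path.suffix.lower()  # case-insensitive match, e.g. ".PDF" -> ".pdf"
        key = ext if ext in known else "unclassified"
        groups[key].append(path)  # the file itself is never opened
    return dict(groups)


groups = classify_by_extension(
    ["a.txt", "b.PDF", "c.txt", "d.xyz"],
    known_extensions=[".txt", ".pdf"],
)
```

Each key in the result would map to one output of the component, so downstream converters receive only the paths they can handle.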