Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add file type validation and mime type extraction #1893

Merged
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
80c50c6
feat: add file type validation and mime type extraction
jasonlai1218 Oct 15, 2023
2cf4d57
test: refactor test functions and add new tests
jasonlai1218 Oct 15, 2023
d327dee
Based on the file summaries provided, the best label for the commit w…
jasonlai1218 Oct 16, 2023
e669cbe
chore: update dev requirements and setup.py dependencies
jasonlai1218 Oct 16, 2023
3246d37
chore: update dependencies and remove unnecessary dev requirement
jasonlai1218 Oct 16, 2023
53aef24
chore: import `magic` module in `file.py`
jasonlai1218 Oct 16, 2023
38b684e
docs: fix typo in error message for incorrect file type
jasonlai1218 Oct 16, 2023
3034803
build: add libmagic1 package to Dockerfile.dev
jasonlai1218 Oct 17, 2023
65c8918
feat: refactor file type validation in FlyteFilePathTransformer class
jasonlai1218 Oct 18, 2023
50419e2
test: add test for invalid file type validation
jasonlai1218 Oct 18, 2023
ed980d6
refactor: update type hint for `validate_file_type` method
jasonlai1218 Oct 19, 2023
b64f06d
refactor: handle missing `magic` module gracefully
jasonlai1218 Oct 19, 2023
1fbffac
test: refactor file type tests to use `can_import_magic` fixture
jasonlai1218 Oct 19, 2023
cfa9801
Based on the file summaries provided, the best label for this code ch…
jasonlai1218 Oct 19, 2023
35ddf82
feat: refactor file handling and introduce hash calculation
jasonlai1218 Oct 20, 2023
c5b7408
The label that best describes this change is "refactor". This is beca…
jasonlai1218 Oct 21, 2023
2dfb7aa
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 21, 2023
946bb9c
Based on the file summaries provided, the best label for the commit w…
jasonlai1218 Oct 21, 2023
75177e4
refactor: change MIME types for `hdf5`, `ipynb`, and `onnx` files
jasonlai1218 Oct 22, 2023
73be136
refactor: update MIME types for specific file types
jasonlai1218 Oct 22, 2023
3dc3525
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 24, 2023
0577b3e
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 26, 2023
c07cffc
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 27, 2023
0b8bf4b
The label best describing this change is "test".: update file.py and …
jasonlai1218 Oct 27, 2023
6a09db0
feat: refactor file handling and input parameter in test cases
jasonlai1218 Oct 27, 2023
0beb145
chore: remove write_csv_format_file.py
jasonlai1218 Oct 27, 2023
85338ca
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 30, 2023
00d3452
feat: add `python-magic` library for file type validation and logging…
jasonlai1218 Oct 31, 2023
b176572
feat: add validation method for file types in FlyteFilePathTransformer
jasonlai1218 Oct 31, 2023
2d2e7d8
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 31, 2023
7ac4c8f
feat: refactor file handling and add new tests
jasonlai1218 Oct 31, 2023
2b8cd4f
The label best describing this change is "refactor".: remove unnecess…
jasonlai1218 Oct 31, 2023
0b6c26c
test: add skipif marks to test functions for dataclass and enum in da…
jasonlai1218 Oct 31, 2023
c6b96ab
test: add import statement and skip test in unit test file
jasonlai1218 Oct 31, 2023
c1e2774
test: remove unnecessary tests in core and type_hints modules
jasonlai1218 Nov 2, 2023
d1d0476
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 2, 2023
143f1db
chore: remove custom type files
jasonlai1218 Nov 2, 2023
78ba16b
refactor: remove unnecessary validation in file transformations
jasonlai1218 Nov 2, 2023
06bd4c4
refactor: refactor file validation and test updates
jasonlai1218 Nov 3, 2023
ef04855
refactor: refactor file imports and remove unnecessary code
jasonlai1218 Nov 3, 2023
b4caa3a
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 6, 2023
f63dd51
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 7, 2023
ec732bd
feat: refactor file type handling
jasonlai1218 Nov 7, 2023
4b837a8
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 8, 2023
aea3a47
build: update requirements for installing `python-magic` library
jasonlai1218 Nov 8, 2023
bd4232a
chore: update `python-magic` usage comments for Windows compatibility
jasonlai1218 Nov 9, 2023
63e634a
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 9, 2023
d277be0
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 10, 2023
ee38477
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 11, 2023
7dc39e6
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 15, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Dockerfile.dev
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ WORKDIR /root

ARG VERSION

RUN apt-get update && apt-get install build-essential vim -y
RUN apt-get update && apt-get install build-essential vim libmagic1 -y

COPY . /flytekit

Expand Down
28 changes: 28 additions & 0 deletions flytekit/types/file/file.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
from contextlib import contextmanager
from dataclasses import dataclass, field

import magic
eapolinario marked this conversation as resolved.
Show resolved Hide resolved
from dataclasses_json import config
from marshmallow import fields
from mashumaro.mixins.json import DataClassJSONMixin
Expand Down Expand Up @@ -320,6 +321,31 @@
def get_literal_type(self, t: typing.Union[typing.Type[FlyteFile], os.PathLike]) -> LiteralType:
return LiteralType(blob=self._blob_type(format=FlyteFilePathTransformer.get_format(t)))

def get_mime_type_from_python_type(self, extension: str) -> str:
eapolinario marked this conversation as resolved.
Show resolved Hide resolved
extension_to_mime_type = {

Check warning on line 325 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L325

Added line #L325 was not covered by tests
"html": "text/html",
"jpeg": "image/jpeg",
"png": "image/png",
"hdf5": "application/x-hdf",
"joblib": "application/octet-stream",
"pdf": "application/pdf",
"python_pickle": "application/octet-stream",
"ipynb": "application/x-ipynb+json",
"svg": "image/svg+xml",
"csv": "text/csv",
"onnx": "application/octet-stream",
"tfrecord": "application/octet-stream",
"txt": "text/plain",
}
return extension_to_mime_type[extension]

Check warning on line 340 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L340

Added line #L340 was not covered by tests

def validate_file_type(self, python_type, source_path):
eapolinario marked this conversation as resolved.
Show resolved Hide resolved
if FlyteFilePathTransformer.get_format(python_type):
pingsutw marked this conversation as resolved.
Show resolved Hide resolved
real_type = magic.from_file(source_path, mime=True)
expected_type = self.get_mime_type_from_python_type(FlyteFilePathTransformer.get_format(python_type))

Check warning on line 345 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L344-L345

Added lines #L344 - L345 were not covered by tests
if real_type != expected_type:
raise ValueError(f"Incorrect file type, expected {expected_type}, got {real_type}")

Check warning on line 347 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L347

Added line #L347 was not covered by tests

def to_literal(
self,
ctx: FlyteContext,
Expand All @@ -344,6 +370,7 @@

if isinstance(python_val, FlyteFile):
source_path = python_val.path
self.validate_file_type(python_type, source_path)

Check warning on line 373 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L373

Added line #L373 was not covered by tests

# If the object has a remote source, then we just convert it back. This means that if someone is just
# going back and forth between a FlyteFile Python value and a Blob Flyte IDL value, we don't do anything.
Expand All @@ -369,6 +396,7 @@
elif isinstance(python_val, pathlib.Path) or isinstance(python_val, str):
source_path = str(python_val)
if issubclass(python_type, FlyteFile):
self.validate_file_type(python_type, source_path)

Check warning on line 399 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L399

Added line #L399 was not covered by tests
if ctx.file_access.is_remote(source_path):
should_upload = False
else:
Expand Down
1 change: 1 addition & 0 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,7 @@
"rich",
"rich_click",
"jsonpickle",
"python-magic",
eapolinario marked this conversation as resolved.
Show resolved Hide resolved
],
extras_require=extras_require,
scripts=[
Expand Down
77 changes: 75 additions & 2 deletions tests/flytekit/unit/core/test_flyte_file.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
import pathlib
import tempfile
import typing
from unittest.mock import MagicMock
from unittest.mock import MagicMock, patch

import pytest
from typing_extensions import Annotated
Expand All @@ -19,7 +19,7 @@
from flytekit.core.workflow import workflow
from flytekit.models.core.types import BlobType
from flytekit.models.literals import LiteralMap
from flytekit.types.file.file import FlyteFile
from flytekit.types.file.file import FlyteFile, FlyteFilePathTransformer


# Fixture that ensures a dummy local file
Expand Down Expand Up @@ -52,6 +52,79 @@ def my_wf() -> FlyteFile[typing.TypeVar("txt")]:
assert fh.read() == "Hello World\n"


def test_matching_file_types_in_workflow():
pingsutw marked this conversation as resolved.
Show resolved Hide resolved
# TXT
@task
def t1() -> FlyteFile[typing.TypeVar("txt")]:
fname = "/tmp/flytekit_test.txt"
with open(fname, "w") as fh:
fh.write("Hello World\n")
return fname

@workflow
def my_wf() -> FlyteFile[typing.TypeVar("txt")]:
f = t1()
return f

res = my_wf()
print(type(res))
with open(res, "r") as fh:
assert fh.read() == "Hello World\n"


def test_mismatching_file_types():
@task
def t1() -> FlyteFile[typing.TypeVar("jpeg")]:
fname = "/tmp/flytekit_test.txt"
with open(fname, "w") as fh:
fh.write("Hello World\n")
return fname

@workflow
def my_wf() -> FlyteFile[typing.TypeVar("jpeg")]:
f = t1()
return f

with pytest.raises(TypeError) as excinfo:
my_wf()
assert "Incorrect type, expected image/jpeg, got text/plain" in str(excinfo.value)


def test_get_mime_type_from_python_type_success():
eapolinario marked this conversation as resolved.
Show resolved Hide resolved
transformer = TypeEngine.get_transformer(FlyteFile)
assert transformer.get_mime_type_from_python_type("html") == "text/html"
assert transformer.get_mime_type_from_python_type("jpeg") == "image/jpeg"
assert transformer.get_mime_type_from_python_type("png") == "image/png"
assert transformer.get_mime_type_from_python_type("hdf5") == "application/x-hdf"
assert transformer.get_mime_type_from_python_type("joblib") == "application/octet-stream"
assert transformer.get_mime_type_from_python_type("pdf") == "application/pdf"
assert transformer.get_mime_type_from_python_type("python_pickle") == "application/octet-stream"
assert transformer.get_mime_type_from_python_type("ipynb") == "application/x-ipynb+json"
assert transformer.get_mime_type_from_python_type("svg") == "image/svg+xml"
assert transformer.get_mime_type_from_python_type("csv") == "text/csv"
assert transformer.get_mime_type_from_python_type("onnx") == "application/octet-stream"
assert transformer.get_mime_type_from_python_type("tfrecord") == "application/octet-stream"
assert transformer.get_mime_type_from_python_type("txt") == "text/plain"


def test_get_mime_type_from_python_type_failure():
pingsutw marked this conversation as resolved.
Show resolved Hide resolved
transformer = TypeEngine.get_transformer(FlyteFile)
with pytest.raises(KeyError):
transformer.get_mime_type_from_python_type("unknown_extension")


def test_validate_file_type_incorrect():
transformer = TypeEngine.get_transformer(FlyteFile)
source_path = "/tmp/flytekit_test.png"
source_file_mime_type = "image/png"
user_defined_format = "jpeg"

FlyteFilePathTransformer.get_format = MagicMock(return_value=user_defined_format)
with patch("magic.from_file", return_value=source_file_mime_type):
with pytest.raises(ValueError, match=f"Incorrect file type, expected image/jpeg, got {source_file_mime_type}"):
transformer.validate_file_type(user_defined_format, source_path)


eapolinario marked this conversation as resolved.
Show resolved Hide resolved
def test_file_handling_remote_default_wf_input():
SAMPLE_DATA = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

Expand Down
Loading