enhancement: apply tar filters when using python 3.12 or above #3124

Merged
merged 9 commits into from
Jun 5, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

### Enhancements

* **Filtering for tar extraction** Adds tar filtering to the compression module for connectors to avoid decompressing malicious content in `.tar.gz` files. Extraction filters were added to the Python `tarfile` library in Python 3.12, so the change only applies when using Python 3.12 and above.
* **Add support for Pinecone serverless** Adds Pinecone serverless to the connector tests. Pinecone
  serverless works with versions >=0.14.2, but hadn't been tested until now.

15 changes: 15 additions & 0 deletions test_unstructured/ingest/utils/test_compression.py
@@ -0,0 +1,15 @@
import os
import tarfile

from unstructured.ingest.utils.compression import uncompress_tar_file


def test_uncompress_tar_file(tmpdir):
    tar_filename = os.path.join(tmpdir, "test.tar")
    filename = "example-docs/fake-text.txt"

    with tarfile.open(tar_filename, "w:gz") as tar:
        tar.add(filename, arcname=os.path.basename(filename))

    path = uncompress_tar_file(tar_filename, path=tmpdir.dirname)
    assert path == tmpdir.dirname
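
The test above only checks the returned path. As a rough sketch of what the filter is meant to prevent, the snippet below builds an archive with a member named `../escaped.txt` and checks that `tarfile.tar_filter` refuses to extract it. The helper names `make_traversal_tar` and `traversal_is_blocked` are hypothetical and not part of this PR, and the sketch assumes an interpreter where `tarfile.tar_filter` exists.

```python
import io
import os
import tarfile
import tempfile


def make_traversal_tar(tar_filename: str) -> None:
    # Hypothetical helper: write one member whose name climbs out of the destination.
    payload = b"escaped"
    info = tarfile.TarInfo(name="../escaped.txt")
    info.size = len(payload)
    with tarfile.open(tar_filename, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def traversal_is_blocked() -> bool:
    # True when tar_filter rejects the member and nothing lands outside `dest`.
    # Assumes tarfile.tar_filter is available on the running interpreter.
    with tempfile.TemporaryDirectory() as root:
        tar_filename = os.path.join(root, "evil.tar.gz")
        dest = os.path.join(root, "out", "dest")
        os.makedirs(dest)
        make_traversal_tar(tar_filename)
        with tarfile.open(tar_filename, "r:gz") as tfile:
            tfile.extraction_filter = tarfile.tar_filter
            try:
                tfile.extractall(path=dest)
            except tarfile.FilterError:
                # The filter raised before the member was written anywhere.
                return not os.path.exists(os.path.join(root, "out", "escaped.txt"))
        return False
```

A pytest version could wrap `traversal_is_blocked()` in an assertion and skip when `tarfile.tar_filter` is missing.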
12 changes: 12 additions & 0 deletions unstructured/ingest/utils/compression.py
@@ -1,5 +1,6 @@
import copy
import os
import sys
import tarfile
import zipfile
from dataclasses import dataclass
@@ -63,6 +64,17 @@ def uncompress_tar_file(tar_filename: str, path: Optional[str] = None) -> str:
    path = path if path else os.path.join(head, f"{tail}-tar-uncompressed")
    logger.info(f"extracting tar {tar_filename} -> {path}")
    with tarfile.open(tar_filename, "r:gz") as tfile:
        # NOTE(robinson): Mitigate against malicious content being extracted from the tar file.
        # This was added in Python 3.12
        # Ref: https://docs.python.org/3/library/tarfile.html#extraction-filters
        if sys.version_info >= (3, 12):
            tfile.extraction_filter = tarfile.tar_filter
        else:
            logger.warning(
                "Extraction filtering for tar files is available for Python 3.12 and above. "
                "Consider upgrading your Python version to improve security. "
                "See https://docs.python.org/3/library/tarfile.html#extraction-filters"
            )
        tfile.extractall(path=path)
    return path
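
The `tarfile` documentation notes that extraction filters may be backported to older Python versions as security updates, and suggests checking for the feature with `hasattr` rather than comparing the Python version. A minimal sketch of that alternative follows; `extract_tar_gz` is a hypothetical stand-in for `uncompress_tar_file`, not the code merged in this PR.

```python
import logging
import os
import tarfile
import tempfile

logger = logging.getLogger(__name__)


def extract_tar_gz(tar_filename: str, path: str) -> str:
    """Hypothetical variant that feature-detects extraction filters."""
    with tarfile.open(tar_filename, "r:gz") as tfile:
        if hasattr(tarfile, "tar_filter"):
            # Available on 3.12+ and on older interpreters that received the backport.
            tfile.extraction_filter = tarfile.tar_filter
        else:
            logger.warning(
                "tarfile extraction filters are unavailable on this interpreter; "
                "extracting without filtering."
            )
        tfile.extractall(path=path)
    return path
```

The trade-off is small: the version check in the PR is explicit about which release introduced the feature, while the `hasattr` check would also enable filtering on patched older interpreters.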
