
Fix parquet data-transfer bug for remote storage #1213

Merged
2 commits merged into NVIDIA-Merlin:main from the fix-gcs-list branch on Oct 27, 2021

Conversation

rjzamora
Collaborator

Closes #1155

The optimized fsspec logic (added in #1119), does not properly recognize list or struct-column names in the parquet metadata. This is because a column name like "a" may be encoded as "a.list.element" or "a.field" in the parquet metadata. The proposed fix here is to split the encoded name by "." when checking if the corresponding data is in the list of required column names. This means we may transfer data that is not desired by the user if they happen to use "." in their column names (with common prefixes). However, I expect this "corner case" to be uncommon.

@rjzamora self-assigned this on Oct 26, 2021
@nvidia-merlin-bot
Contributor

CI Results
GitHub pull request #1213 of commit 59eb50b98c61dbc567c015675d275e7324d4947f, no merge conflicts.
Running as SYSTEM
Setting status of 59eb50b98c61dbc567c015675d275e7324d4947f to PENDING with url http://10.20.13.93:8080/job/nvtabular_tests/3682/ and message: 'Pending'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1213/*:refs/remotes/origin/pr/1213/* # timeout=10
 > git rev-parse 59eb50b98c61dbc567c015675d275e7324d4947f^{commit} # timeout=10
Checking out Revision 59eb50b98c61dbc567c015675d275e7324d4947f (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 59eb50b98c61dbc567c015675d275e7324d4947f # timeout=10
Commit message: "add comment explaining change"
 > git rev-list --no-walk bf11b804c9ef5417ee835cf24b15b7e8719e1566 # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins8855208352084664797.sh
Installing NVTabular
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: pip in /var/jenkins_home/.local/lib/python3.8/site-packages (21.3.1)
Requirement already satisfied: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (58.3.0)
Requirement already satisfied: wheel in /var/jenkins_home/.local/lib/python3.8/site-packages (0.37.0)
Requirement already satisfied: pybind11 in /var/jenkins_home/.local/lib/python3.8/site-packages (2.8.0)
running develop
running egg_info
creating nvtabular.egg-info
writing nvtabular.egg-info/PKG-INFO
writing dependency_links to nvtabular.egg-info/dependency_links.txt
writing requirements to nvtabular.egg-info/requires.txt
writing top-level names to nvtabular.egg-info/top_level.txt
writing manifest file 'nvtabular.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
warning: no files found matching '*.h' under directory 'cpp'
warning: no files found matching '*.cu' under directory 'cpp'
warning: no files found matching '*.cuh' under directory 'cpp'
adding license file 'LICENSE'
writing manifest file 'nvtabular.egg-info/SOURCES.txt'
running build_ext
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.8 -c flagcheck.cpp -o flagcheck.o -std=c++17
building 'nvtabular_cpp' extension
creating build
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/cpp
creating build/temp.linux-x86_64-3.8/cpp/nvtabular
creating build/temp.linux-x86_64-3.8/cpp/nvtabular/inference
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DVERSION_INFO=0.7.0+27.g59eb50b -I./cpp/ -I/var/jenkins_home/.local/lib/python3.8/site-packages/pybind11/include -I/usr/include/python3.8 -c cpp/nvtabular/__init__.cc -o build/temp.linux-x86_64-3.8/cpp/nvtabular/__init__.o -std=c++17 -fvisibility=hidden -g0
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DVERSION_INFO=0.7.0+27.g59eb50b -I./cpp/ -I/var/jenkins_home/.local/lib/python3.8/site-packages/pybind11/include -I/usr/include/python3.8 -c cpp/nvtabular/inference/__init__.cc -o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/__init__.o -std=c++17 -fvisibility=hidden -g0
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DVERSION_INFO=0.7.0+27.g59eb50b -I./cpp/ -I/var/jenkins_home/.local/lib/python3.8/site-packages/pybind11/include -I/usr/include/python3.8 -c cpp/nvtabular/inference/categorify.cc -o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/categorify.o -std=c++17 -fvisibility=hidden -g0
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DVERSION_INFO=0.7.0+27.g59eb50b -I./cpp/ -I/var/jenkins_home/.local/lib/python3.8/site-packages/pybind11/include -I/usr/include/python3.8 -c cpp/nvtabular/inference/fill.cc -o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/fill.o -std=c++17 -fvisibility=hidden -g0
creating build/lib.linux-x86_64-3.8
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/cpp/nvtabular/__init__.o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/__init__.o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/categorify.o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/fill.o -o build/lib.linux-x86_64-3.8/nvtabular_cpp.cpython-38-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-3.8/nvtabular_cpp.cpython-38-x86_64-linux-gnu.so -> 
Generating nvtabular/inference/triton/model_config_pb2.py from nvtabular/inference/triton/model_config.proto
Creating /var/jenkins_home/.local/lib/python3.8/site-packages/nvtabular.egg-link (link to .)
nvtabular 0.7.0+27.g59eb50b is already the active version in easy-install.pth

Installed /var/jenkins_home/workspace/nvtabular_tests/nvtabular
Processing dependencies for nvtabular==0.7.0+27.g59eb50b
Searching for protobuf==3.18.0
Best match: protobuf 3.18.0
Adding protobuf 3.18.0 to easy-install.pth file

Using /usr/local/lib/python3.8/dist-packages
Searching for tensorflow-metadata==1.2.0
Best match: tensorflow-metadata 1.2.0
Processing tensorflow_metadata-1.2.0-py3.8.egg
tensorflow-metadata 1.2.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/tensorflow_metadata-1.2.0-py3.8.egg
Searching for pyarrow==4.0.1
Best match: pyarrow 4.0.1
Adding pyarrow 4.0.1 to easy-install.pth file
Installing plasma_store script to /var/jenkins_home/.local/bin

Using /usr/local/lib/python3.8/dist-packages
Searching for tqdm==4.61.2
Best match: tqdm 4.61.2
Processing tqdm-4.61.2-py3.8.egg
tqdm 4.61.2 is already the active version in easy-install.pth
Installing tqdm script to /var/jenkins_home/.local/bin

Using /var/jenkins_home/.local/lib/python3.8/site-packages/tqdm-4.61.2-py3.8.egg
Searching for numba==0.54.0
Best match: numba 0.54.0
Processing numba-0.54.0-py3.8-linux-x86_64.egg
numba 0.54.0 is already the active version in easy-install.pth
Installing pycc script to /var/jenkins_home/.local/bin
Installing numba script to /var/jenkins_home/.local/bin

Using /var/jenkins_home/.local/lib/python3.8/site-packages/numba-0.54.0-py3.8-linux-x86_64.egg
Searching for pandas==1.3.3
Best match: pandas 1.3.3
Processing pandas-1.3.3-py3.8-linux-x86_64.egg
pandas 1.3.3 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/pandas-1.3.3-py3.8-linux-x86_64.egg
Searching for distributed==2021.7.1
Best match: distributed 2021.7.1
Processing distributed-2021.7.1-py3.8.egg
distributed 2021.7.1 is already the active version in easy-install.pth
Installing dask-ssh script to /var/jenkins_home/.local/bin
Installing dask-scheduler script to /var/jenkins_home/.local/bin
Installing dask-worker script to /var/jenkins_home/.local/bin

Using /var/jenkins_home/.local/lib/python3.8/site-packages/distributed-2021.7.1-py3.8.egg
Searching for dask==2021.7.1
Best match: dask 2021.7.1
Adding dask 2021.7.1 to easy-install.pth file

Using /usr/local/lib/python3.8/dist-packages
Searching for googleapis-common-protos==1.53.0
Best match: googleapis-common-protos 1.53.0
Processing googleapis_common_protos-1.53.0-py3.8.egg
googleapis-common-protos 1.53.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/googleapis_common_protos-1.53.0-py3.8.egg
Searching for absl-py==0.12.0
Best match: absl-py 0.12.0
Processing absl_py-0.12.0-py3.8.egg
absl-py 0.12.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/absl_py-0.12.0-py3.8.egg
Searching for numpy==1.20.2
Best match: numpy 1.20.2
Adding numpy 1.20.2 to easy-install.pth file
Installing f2py script to /var/jenkins_home/.local/bin
Installing f2py3 script to /var/jenkins_home/.local/bin
Installing f2py3.8 script to /var/jenkins_home/.local/bin

Using /usr/local/lib/python3.8/dist-packages
Searching for setuptools==58.3.0
Best match: setuptools 58.3.0
Adding setuptools 58.3.0 to easy-install.pth file

Using /var/jenkins_home/.local/lib/python3.8/site-packages
Searching for llvmlite==0.37.0
Best match: llvmlite 0.37.0
Processing llvmlite-0.37.0-py3.8-linux-x86_64.egg
llvmlite 0.37.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/llvmlite-0.37.0-py3.8-linux-x86_64.egg
Searching for pytz==2021.1
Best match: pytz 2021.1
Adding pytz 2021.1 to easy-install.pth file

Using /usr/local/lib/python3.8/dist-packages
Searching for python-dateutil==2.8.2
Best match: python-dateutil 2.8.2
Adding python-dateutil 2.8.2 to easy-install.pth file

Using /usr/local/lib/python3.8/dist-packages
Searching for zict==2.0.0
Best match: zict 2.0.0
Processing zict-2.0.0-py3.8.egg
zict 2.0.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/zict-2.0.0-py3.8.egg
Searching for tornado==6.1
Best match: tornado 6.1
Processing tornado-6.1-py3.8-linux-x86_64.egg
tornado 6.1 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/tornado-6.1-py3.8-linux-x86_64.egg
Searching for toolz==0.11.1
Best match: toolz 0.11.1
Processing toolz-0.11.1-py3.8.egg
toolz 0.11.1 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/toolz-0.11.1-py3.8.egg
Searching for tblib==1.7.0
Best match: tblib 1.7.0
Processing tblib-1.7.0-py3.8.egg
tblib 1.7.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/tblib-1.7.0-py3.8.egg
Searching for sortedcontainers==2.4.0
Best match: sortedcontainers 2.4.0
Processing sortedcontainers-2.4.0-py3.8.egg
sortedcontainers 2.4.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/sortedcontainers-2.4.0-py3.8.egg
Searching for PyYAML==5.4.1
Best match: PyYAML 5.4.1
Processing PyYAML-5.4.1-py3.8-linux-x86_64.egg
PyYAML 5.4.1 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/PyYAML-5.4.1-py3.8-linux-x86_64.egg
Searching for psutil==5.8.0
Best match: psutil 5.8.0
Processing psutil-5.8.0-py3.8-linux-x86_64.egg
psutil 5.8.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/psutil-5.8.0-py3.8-linux-x86_64.egg
Searching for msgpack==1.0.2
Best match: msgpack 1.0.2
Processing msgpack-1.0.2-py3.8-linux-x86_64.egg
msgpack 1.0.2 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/msgpack-1.0.2-py3.8-linux-x86_64.egg
Searching for cloudpickle==1.6.0
Best match: cloudpickle 1.6.0
Processing cloudpickle-1.6.0-py3.8.egg
cloudpickle 1.6.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/cloudpickle-1.6.0-py3.8.egg
Searching for click==8.0.1
Best match: click 8.0.1
Processing click-8.0.1-py3.8.egg
click 8.0.1 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/click-8.0.1-py3.8.egg
Searching for packaging==21.0
Best match: packaging 21.0
Adding packaging 21.0 to easy-install.pth file

Using /usr/local/lib/python3.8/dist-packages
Searching for partd==1.2.0
Best match: partd 1.2.0
Processing partd-1.2.0-py3.8.egg
partd 1.2.0 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/partd-1.2.0-py3.8.egg
Searching for fsspec==2021.10.1
Best match: fsspec 2021.10.1
Adding fsspec 2021.10.1 to easy-install.pth file

Using /usr/local/lib/python3.8/dist-packages
Searching for six==1.15.0
Best match: six 1.15.0
Adding six 1.15.0 to easy-install.pth file

Using /usr/local/lib/python3.8/dist-packages
Searching for HeapDict==1.0.1
Best match: HeapDict 1.0.1
Processing HeapDict-1.0.1-py3.8.egg
HeapDict 1.0.1 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/HeapDict-1.0.1-py3.8.egg
Searching for pyparsing==2.4.7
Best match: pyparsing 2.4.7
Adding pyparsing 2.4.7 to easy-install.pth file

Using /usr/local/lib/python3.8/dist-packages
Searching for locket==0.2.1
Best match: locket 0.2.1
Processing locket-0.2.1-py3.8.egg
locket 0.2.1 is already the active version in easy-install.pth

Using /var/jenkins_home/.local/lib/python3.8/site-packages/locket-0.2.1-py3.8.egg
Finished processing dependencies for nvtabular==0.7.0+27.g59eb50b
Running black --check
All done! ✨ 🍰 ✨
138 files would be left unchanged.
Running flake8
Running isort
Skipped 2 files
Running bandit
Running pylint
************* Module nvtabular.ops.categorify
nvtabular/ops/categorify.py:497:15: I1101: Module 'nvtabular_cpp' has no 'inference' member, but source is unavailable. Consider adding this module to extension-pkg-allow-list if you want to perform analysis based on run-time introspection of living objects. (c-extension-no-member)
************* Module nvtabular.ops.fill
nvtabular/ops/fill.py:67:15: I1101: Module 'nvtabular_cpp' has no 'inference' member, but source is unavailable. Consider adding this module to extension-pkg-allow-list if you want to perform analysis based on run-time introspection of living objects. (c-extension-no-member)


Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

Running flake8-nb
Building docs
make: Entering directory '/var/jenkins_home/workspace/nvtabular_tests/nvtabular/docs'
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/local/lib/python3.8/dist-packages/recommonmark/parser.py:75: UserWarning: Container node skipped: type=document
warn("Container node skipped: type={0}".format(mdnode.t))
/usr/local/lib/python3.8/dist-packages/recommonmark/parser.py:75: UserWarning: Container node skipped: type=document
warn("Container node skipped: type={0}".format(mdnode.t))
/usr/local/lib/python3.8/dist-packages/recommonmark/parser.py:75: UserWarning: Container node skipped: type=document
warn("Container node skipped: type={0}".format(mdnode.t))
make: Leaving directory '/var/jenkins_home/workspace/nvtabular_tests/nvtabular/docs'
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: forked-1.3.0, cov-3.0.0, xdist-2.4.0
collected 1564 items / 1 skipped / 1563 selected

tests/unit/test_dask_nvt.py ............................................ [ 2%]
....................................................................... [ 7%]
tests/unit/test_io.py .................................................. [ 10%]
........................................................................ [ 15%]
.................ssssssss............................................... [ 19%]
........ [ 20%]
tests/unit/test_notebooks.py ...... [ 20%]
tests/unit/test_tf4rec.py . [ 20%]
tests/unit/test_tools.py ...................... [ 22%]
tests/unit/test_triton_inference.py ............................... [ 24%]
tests/unit/columns/test_column_schemas.py .............................. [ 26%]
.................................................... [ 29%]
tests/unit/columns/test_column_selector.py .................... [ 30%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 30%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 32%]
................................................... [ 35%]
tests/unit/framework_utils/test_torch_layers.py . [ 35%]
tests/unit/loader/test_dataloader_backend.py .... [ 35%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 38%]
........................................s.. [ 40%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 42%]
........................................................ [ 46%]
tests/unit/ops/test_categorify.py ...................................... [ 48%]
............................................................... [ 52%]
tests/unit/ops/test_column_similarity.py ........................ [ 54%]
tests/unit/ops/test_fill.py ............................................ [ 57%]
........ [ 57%]
tests/unit/ops/test_hash_bucket.py ......................... [ 59%]
tests/unit/ops/test_join.py ............................................ [ 61%]
........................................................................ [ 66%]
................................ [ 68%]
tests/unit/ops/test_lambda.py .... [ 68%]
tests/unit/ops/test_normalize.py ....................................... [ 71%]
.. [ 71%]
tests/unit/ops/test_ops.py ............................................. [ 74%]
........................ [ 75%]
tests/unit/ops/test_ops_schema.py ...................................... [ 78%]
........................................................................ [ 82%]
........................................................................ [ 87%]
....................................... [ 90%]
tests/unit/ops/test_target_encode.py ..................... [ 91%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 91%]
tests/unit/workflow/test_workflow.py ................................... [ 93%]
.......................................................... [ 97%]
tests/unit/workflow/test_workflow_node.py ........... [ 98%]
tests/unit/workflow/test_workflow_ops.py .. [ 98%]
tests/unit/workflow/test_workflow_schemas.py ....................... [100%]

=============================== warnings summary ===============================
tests/unit/test_dask_nvt.py: 3 warnings
tests/unit/test_io.py: 24 warnings
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 2 warnings
tests/unit/test_triton_inference.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 50 warnings
tests/unit/loader/test_torch_dataloader.py: 16 warnings
tests/unit/ops/test_categorify.py: 2 warnings
tests/unit/ops/test_column_similarity.py: 7 warnings
tests/unit/ops/test_fill.py: 24 warnings
tests/unit/ops/test_normalize.py: 28 warnings
tests/unit/ops/test_ops.py: 3 warnings
tests/unit/ops/test_target_encode.py: 21 warnings
tests/unit/workflow/test_workflow.py: 30 warnings
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/var/jenkins_home/.local/lib/python3.8/site-packages/numba-0.54.0-py3.8-linux-x86_64.egg/numba/cuda/compiler.py:865: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (112) will likely result in GPU under utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))

tests/unit/test_dask_nvt.py: 2 warnings
tests/unit/test_io.py: 36 warnings
tests/unit/workflow/test_workflow.py: 44 warnings
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/workflow/workflow.py:89: UserWarning: A global dask.distributed client has been detected, but the single-threaded scheduler will be used for execution. Please use the client argument to initialize a Workflow object with distributed-execution enabled.
warnings.warn(

tests/unit/test_dask_nvt.py: 2 warnings
tests/unit/test_io.py: 52 warnings
tests/unit/workflow/test_workflow.py: 35 warnings
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/io/dask.py:375: UserWarning: A global dask.distributed client has been detected, but the single-threaded scheduler will be used for this write operation. Please use the client argument to initialize a Dataset and/or Workflow object with distributed-execution enabled.
warnings.warn(

tests/unit/test_io.py: 96 warnings
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/__init__.py:38: DeprecationWarning: ColumnGroup is deprecated, use ColumnSelector instead
warnings.warn("ColumnGroup is deprecated, use ColumnSelector instead", DeprecationWarning)

tests/unit/test_io.py: 24 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
tests/unit/workflow/test_workflow_node.py: 1 warning
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/workflow/node.py:47: FutureWarning: The ["a", "b", "c"] >> ops.Operator syntax for creating a ColumnGroup has been deprecated in NVTabular 21.09 and will be removed in a future version.
warnings.warn(

tests/unit/test_io.py: 20 warnings
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/io/dataset.py:516: UserWarning: A global dask.distributed client has been detected, but the single-threaded scheduler is being used for this shuffle operation. Please use the client argument to initialize a Dataset and/or Workflow object with distributed-execution enabled.
warnings.warn(

tests/unit/ops/test_categorify.py::test_categorify_counts[True-True]
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/categorify.py:1002: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_0[name_count] = null_size

tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-parquet-0.01]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-parquet-0.1]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-csv-0.01]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-csv-0.1]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-csv-no-header-0.01]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-csv-no-header-0.1]
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/fill.py:125: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[f"{col}_filled"] = df[col].isna()

tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-parquet-0.01]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-parquet-0.1]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-csv-0.01]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-csv-0.1]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-csv-no-header-0.01]
tests/unit/ops/test_fill.py::test_fill_median[True-True-op_columns1-csv-no-header-0.1]
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/fill.py:126: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[col] = df[col].fillna(self.medians[col])

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/var/jenkins_home/.local/lib/python3.8/site-packages/pandas-1.3.3-py3.8-linux-x86_64.egg/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/fill.py:54: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[f"{col}_filled"] = df[col].isna()

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/fill.py:55: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[col] = df[col].fillna(self.fill_val)

tests/unit/ops/test_join.py: 80 warnings
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/join_external.py:191: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[tmp] = _arange(len(df), like_df=df, dtype="int32")

tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-False]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-False]
tests/unit/ops/test_ops.py::test_groupby_op[id-True]
tests/unit/ops/test_ops.py::test_groupby_op[id-False]
/usr/local/lib/python3.8/dist-packages/dask/dataframe/core.py:6778: UserWarning: Insufficient elements for head. 1 elements requested, only 0 elements available. Try passing larger npartitions to head.
warnings.warn(msg.format(n, len(r)))

tests/unit/workflow/test_cpu_workflow.py: 14 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/pandas-1.3.3-py3.8-linux-x86_64.egg/pandas/core/frame.py:3641: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[k1] = value[k2]

-- Docs: https://docs.pytest.org/en/stable/warnings.html

---------- coverage: platform linux, python 3.8.10-final-0 -----------
Name Stmts Miss Branch BrPart Cover Missing

examples/multi-gpu-movielens/torch_trainer.py 65 0 6 1 99% 32->36
examples/multi-gpu-movielens/torch_trainer_dist.py 63 0 2 0 100%
nvtabular/__init__.py 18 0 0 0 100%
nvtabular/columns/__init__.py 2 0 0 0 100%
nvtabular/columns/schema.py 216 14 105 17 90% 46->61, 52-55, 97->113, 105, 109, 155, 182, 268->275, 271->273, 282, 299->304, 302->304, 315, 339, 346, 355, 358, 363->362
nvtabular/columns/selector.py 74 1 34 0 99% 121
nvtabular/dispatch.py 294 57 144 23 79% 37-39, 42-46, 51-53, 59-69, 76-77, 120-122, 127-130, 134-139, 146, 165, 176, 182, 187->189, 200, 223-226, 257->259, 266, 269, 275, 300, 307, 338->343, 341, 344, 347->351, 384, 395-398, 424-427, 457, 461, 502, 526, 528, 535
nvtabular/framework_utils/__init__.py 0 0 0 0 100%
nvtabular/framework_utils/tensorflow/__init__.py 1 0 0 0 100%
nvtabular/framework_utils/tensorflow/feature_column_utils.py 134 78 90 15 39% 30, 99, 103, 114-130, 140, 143-158, 162, 166-167, 173-198, 207-217, 220-227, 229->233, 234, 239-279, 282
nvtabular/framework_utils/tensorflow/layers/__init__.py 4 0 0 0 100%
nvtabular/framework_utils/tensorflow/layers/embedding.py 153 12 85 6 91% 60, 68->49, 122, 179, 231-239, 335->343, 357->360, 363-364, 367
nvtabular/framework_utils/tensorflow/layers/interaction.py 47 25 20 1 43% 49, 74-103, 106-110, 113
nvtabular/framework_utils/tensorflow/layers/outer_product.py 30 24 10 0 15% 37-38, 41-60, 71-84, 87
nvtabular/framework_utils/tensorflow/tfrecords_to_parquet.py 58 58 30 0 0% 16-111
nvtabular/framework_utils/torch/__init__.py 0 0 0 0 100%
nvtabular/framework_utils/torch/layers/__init__.py 2 0 0 0 100%
nvtabular/framework_utils/torch/layers/embeddings.py 32 2 14 2 91% 50, 91
nvtabular/framework_utils/torch/models.py 45 1 28 4 93% 57->61, 87->89, 93->96, 103
nvtabular/framework_utils/torch/utils.py 75 5 30 5 90% 51->53, 64, 71->76, 75, 118-120
nvtabular/inference/__init__.py 0 0 0 0 100%
nvtabular/inference/triton/__init__.py 388 210 184 14 45% 82-86, 141-174, 195-218, 263-307, 338, 364-372, 380-387, 406, 428-444, 485-489, 527-537, 583-623, 629-645, 649-716, 723->726, 726->722, 743->742, 765-775, 784, 794, 815, 821-847, 853-879, 886, 892->895, 896
nvtabular/inference/triton/benchmarking_tools.py 52 52 10 0 0% 2-103
nvtabular/inference/triton/data_conversions.py 87 3 58 4 95% 32-33, 84
nvtabular/inference/triton/model.py 176 176 98 0 0% 27-332
nvtabular/inference/triton/model_config_pb2.py 299 0 2 0 100%
nvtabular/inference/triton/model_pt.py 101 101 40 0 0% 27-220
nvtabular/io/__init__.py 4 0 0 0 100%
nvtabular/io/avro.py 88 88 30 0 0% 16-189
nvtabular/io/csv.py 57 6 20 5 86% 22-23, 99, 103->107, 108, 110, 124
nvtabular/io/dask.py 183 8 72 11 93% 111, 114, 150, 401, 411, 428->431, 439, 443->445, 445->441, 450, 452
nvtabular/io/dataframe_engine.py 61 5 28 6 88% 19-20, 50, 69, 88->92, 92->97, 94->97, 97->116, 125
nvtabular/io/dataset.py 361 44 172 29 85% 47-48, 264, 266, 279, 304-318, 441->515, 446-449, 454->464, 471->469, 472->476, 489->493, 504, 515->524, 575-576, 577->581, 629, 752, 754, 756, 762, 766-768, 770, 830-831, 858, 865-866, 872, 878, 975-976, 1094-1099, 1105, 1117-1118, 1206
nvtabular/io/dataset_engine.py 31 1 4 0 97% 48
nvtabular/io/fsspec_utils.py 118 104 64 0 8% 26-27, 42-98, 103-135, 151-198, 220-270, 275-291, 295-297, 311-322
nvtabular/io/hugectr.py 45 2 24 2 91% 34, 74->97, 101
nvtabular/io/parquet.py 589 48 204 29 88% 34-35, 58, 80->156, 87, 101, 113-127, 140-153, 176, 205-206, 223->248, 234->248, 285-293, 313, 319, 337->339, 353, 371->381, 374, 423->435, 427, 549-554, 592-597, 713->720, 781->786, 787-788, 908, 912, 916, 922, 954, 971, 975, 982->984, 1092->exit, 1102->1107, 1112->1122, 1127, 1149, 1176
nvtabular/io/shuffle.py 31 7 16 4 72% 42, 44-45, 49, 62-64
nvtabular/io/writer.py 185 13 74 5 92% 25-26, 52, 80, 126, 129, 213, 222, 225, 268, 300-302
nvtabular/io/writer_factory.py 18 2 8 2 85% 35, 60
nvtabular/loader/__init__.py 0 0 0 0 100%
nvtabular/loader/backend.py 362 13 146 11 95% 131, 147-148, 276->278, 288-292, 339-340, 379->383, 380->379, 455, 459-460, 490, 597, 605
nvtabular/loader/tensorflow.py 164 22 52 7 86% 57, 65-68, 83, 97, 307, 343, 358-360, 389-391, 401-409, 412-415
nvtabular/loader/tf_utils.py 55 10 20 5 80% 29->32, 32->34, 39->41, 43, 50-51, 58-60, 66-70
nvtabular/loader/torch.py 81 13 16 2 78% 25-27, 30-36, 111, 149-150
nvtabular/ops/__init__.py 23 0 0 0 100%
nvtabular/ops/add_metadata.py 17 2 2 1 84% 32, 46
nvtabular/ops/bucketize.py 37 10 18 3 69% 53-55, 59->exit, 62-65, 84-87, 94
nvtabular/ops/categorify.py 628 66 334 47 86% 244, 246, 263, 267, 275, 283, 285, 312, 331-332, 359, 370->374, 378-385, 467-468, 493-494, 611, 705, 722, 758, 835-836, 851-855, 856->820, 874, 882, 889->exit, 913, 916->919, 974, 979, 1000->1005, 1007->961, 1013-1016, 1028, 1032, 1034, 1041, 1046-1049, 1128, 1130, 1200->1223, 1206->1223, 1224-1229, 1266, 1285->1290, 1289, 1299->1296, 1304->1296, 1311, 1314, 1322-1332
nvtabular/ops/clip.py 18 2 6 3 79% 44, 52->54, 55
nvtabular/ops/column_similarity.py 118 25 38 5 74% 19-20, 78->exit, 108, 134, 198-199, 208-210, 218-234, 251->254, 255, 265
nvtabular/ops/data_stats.py 56 2 22 3 94% 91->93, 95, 97->87, 102
nvtabular/ops/difference_lag.py 31 1 8 1 95% 69->71, 94
nvtabular/ops/dropna.py 8 0 0 0 100%
nvtabular/ops/fill.py 91 12 36 3 82% 63-67, 93, 121, 147, 151, 162-165
nvtabular/ops/filter.py 20 1 6 1 92% 49
nvtabular/ops/groupby.py 129 4 78 5 96% 73, 84, 94->96, 106->111, 141, 145
nvtabular/ops/hash_bucket.py 41 2 20 2 93% 72, 106->112, 118
nvtabular/ops/hashed_cross.py 36 4 15 3 86% 53, 66, 81, 91
nvtabular/ops/internal/__init__.py 3 0 0 0 100%
nvtabular/ops/internal/concat_columns.py 11 0 0 0 100%
nvtabular/ops/internal/identity.py 6 1 0 0 83% 42
nvtabular/ops/internal/subset_columns.py 13 1 0 0 92% 53
nvtabular/ops/join_external.py 92 18 36 7 76% 20-21, 114, 116, 118, 135-161, 177->179, 216->227, 221
nvtabular/ops/join_groupby.py 101 7 36 4 92% 108, 115, 124, 131->130, 215-216, 219-220
nvtabular/ops/lambdaop.py 39 6 18 6 79% 59, 63, 77, 89, 94, 103
nvtabular/ops/list_slice.py 66 24 26 1 58% 21-22, 53-54, 104-118, 126-137
nvtabular/ops/logop.py 19 0 4 0 100%
nvtabular/ops/moments.py 69 0 24 0 100%
nvtabular/ops/normalize.py 88 10 18 1 88% 87, 95-96, 102, 135-136, 158-159, 163, 174
nvtabular/ops/operator.py 66 0 14 0 100%
nvtabular/ops/rename.py 41 3 22 3 90% 47, 88-90
nvtabular/ops/stat_operator.py 8 0 0 0 100%
nvtabular/ops/target_encoding.py 154 11 66 4 91% 168->172, 176->185, 233-234, 237-238, 250-256, 347->350, 363
nvtabular/ops/value_counts.py 31 1 6 2 92% 41->39, 49
nvtabular/tags.py 16 0 0 0 100%
nvtabular/tools/__init__.py 0 0 0 0 100%
nvtabular/tools/data_gen.py 236 1 62 1 99% 322
nvtabular/tools/dataset_inspector.py 50 7 18 1 79% 32-39
nvtabular/tools/inspector_script.py 46 46 0 0 0% 17-168
nvtabular/utils.py 106 43 48 8 54% 31-32, 36-37, 50, 61-62, 64-66, 69, 72, 78, 84, 90-126, 145, 149->153
nvtabular/worker.py 82 5 38 7 90% 24-25, 82->99, 91, 92->99, 99->102, 108, 110, 111->113
nvtabular/workflow/__init__.py 2 0 0 0 100%
nvtabular/workflow/node.py 240 18 116 10 89% 55, 93->98, 146, 248->252, 288, 302, 311, 329-334, 339, 388-389, 400->395, 453-458
nvtabular/workflow/workflow.py 221 15 112 7 93% 28-29, 47, 139, 195, 222-224, 332, 347-348, 366-367, 503, 515

TOTAL 7908 1553 3187 349 78%
Coverage XML written to file coverage.xml

Required test coverage of 70% reached. Total coverage: 77.62%
=========================== short test summary info ============================
SKIPPED [1] ../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:16: could not import 's3fs': No module named 's3fs'
SKIPPED [8] tests/unit/test_io.py:581: could not import 'uavro': No module named 'uavro'
SKIPPED [1] tests/unit/loader/test_tf_dataloader.py:522: not working correctly in ci environment
========= 1555 passed, 10 skipped, 703 warnings in 1705.69s (0:28:25) ==========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins2337720675058294050.sh

@benfred
Member

thanks for figuring this out!

@benfred merged commit 810ca8e into NVIDIA-Merlin:main on Oct 27, 2021
@rjzamora deleted the fix-gcs-list branch on October 27, 2021, 13:25
mikemckiernan pushed a commit that referenced this pull request Nov 24, 2022
* account for list/struct schema names in parquet metdata

* add comment explaining change
Development

Successfully merging this pull request may close these issues.

[BUG] Reading data from GCS creates issue