-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parquet data-transfer bug for remote storage #1213
Conversation
Click to view CI ResultsGitHub pull request #1213 of commit 59eb50b98c61dbc567c015675d275e7324d4947f, no merge conflicts. Running as SYSTEM Setting status of 59eb50b98c61dbc567c015675d275e7324d4947f to PENDING with url http://10.20.13.93:8080/job/nvtabular_tests/3682/ and message: 'Pending' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1213/*:refs/remotes/origin/pr/1213/* # timeout=10 > git rev-parse 59eb50b98c61dbc567c015675d275e7324d4947f^{commit} # timeout=10 Checking out Revision 59eb50b98c61dbc567c015675d275e7324d4947f (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 59eb50b98c61dbc567c015675d275e7324d4947f # timeout=10 Commit message: "add comment explaining change" > git rev-list --no-walk bf11b804c9ef5417ee835cf24b15b7e8719e1566 # timeout=10 First time build. Skipping changelog. [nvtabular_tests] $ /bin/bash /tmp/jenkins8855208352084664797.sh Installing NVTabular Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Requirement already satisfied: pip in /var/jenkins_home/.local/lib/python3.8/site-packages (21.3.1) Requirement already satisfied: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (58.3.0) Requirement already satisfied: wheel in /var/jenkins_home/.local/lib/python3.8/site-packages (0.37.0) Requirement already satisfied: pybind11 in /var/jenkins_home/.local/lib/python3.8/site-packages (2.8.0) running develop running egg_info creating nvtabular.egg-info writing nvtabular.egg-info/PKG-INFO writing dependency_links to nvtabular.egg-info/dependency_links.txt writing requirements to nvtabular.egg-info/requires.txt writing top-level names to nvtabular.egg-info/top_level.txt writing manifest file 'nvtabular.egg-info/SOURCES.txt' reading manifest template 'MANIFEST.in' /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools. warnings.warn( /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools. warnings.warn( warning: no files found matching '*.h' under directory 'cpp' warning: no files found matching '*.cu' under directory 'cpp' warning: no files found matching '*.cuh' under directory 'cpp' adding license file 'LICENSE' writing manifest file 'nvtabular.egg-info/SOURCES.txt' running build_ext x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.8 -c flagcheck.cpp -o flagcheck.o -std=c++17 building 'nvtabular_cpp' extension creating build creating build/temp.linux-x86_64-3.8 creating build/temp.linux-x86_64-3.8/cpp creating build/temp.linux-x86_64-3.8/cpp/nvtabular creating build/temp.linux-x86_64-3.8/cpp/nvtabular/inference x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DVERSION_INFO=0.7.0+27.g59eb50b -I./cpp/ -I/var/jenkins_home/.local/lib/python3.8/site-packages/pybind11/include -I/usr/include/python3.8 -c cpp/nvtabular/__init__.cc -o build/temp.linux-x86_64-3.8/cpp/nvtabular/__init__.o -std=c++17 -fvisibility=hidden -g0 x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DVERSION_INFO=0.7.0+27.g59eb50b -I./cpp/ -I/var/jenkins_home/.local/lib/python3.8/site-packages/pybind11/include -I/usr/include/python3.8 -c cpp/nvtabular/inference/__init__.cc -o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/__init__.o -std=c++17 -fvisibility=hidden -g0 x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DVERSION_INFO=0.7.0+27.g59eb50b -I./cpp/ -I/var/jenkins_home/.local/lib/python3.8/site-packages/pybind11/include -I/usr/include/python3.8 -c cpp/nvtabular/inference/categorify.cc -o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/categorify.o -std=c++17 -fvisibility=hidden -g0 x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DVERSION_INFO=0.7.0+27.g59eb50b -I./cpp/ -I/var/jenkins_home/.local/lib/python3.8/site-packages/pybind11/include -I/usr/include/python3.8 -c cpp/nvtabular/inference/fill.cc -o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/fill.o -std=c++17 -fvisibility=hidden -g0 creating build/lib.linux-x86_64-3.8 x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/cpp/nvtabular/__init__.o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/__init__.o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/categorify.o build/temp.linux-x86_64-3.8/cpp/nvtabular/inference/fill.o -o build/lib.linux-x86_64-3.8/nvtabular_cpp.cpython-38-x86_64-linux-gnu.so copying build/lib.linux-x86_64-3.8/nvtabular_cpp.cpython-38-x86_64-linux-gnu.so -> Generating nvtabular/inference/triton/model_config_pb2.py from nvtabular/inference/triton/model_config.proto Creating /var/jenkins_home/.local/lib/python3.8/site-packages/nvtabular.egg-link (link to .) nvtabular 0.7.0+27.g59eb50b is already the active version in easy-install.pth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for figuring this out!
* account for list/struct schema names in parquet metdata * add comment explaining change
Closes #1155
The optimized fsspec logic (added in #1119), does not properly recognize list or struct-column names in the parquet metadata. This is because a column name like
"a"
may be encoded as"a.list.element"
or"a.field"
in the parquet metadata. The proposed fix here is to split the encoded name by"."
when checking if the corresponding data is in the list of required column names. This means we may transfer data that is not desired by the user if they happen to use"."
in their column names (with common prefixes). However, I expect this "corner case" to be uncommon.