Skip to content

Commit

Permalink
Merge branch 'branch-0.15' into bug-disallow-sum-timestamp
Browse files Browse the repository at this point in the history
  • Loading branch information
karthikeyann authored Jul 24, 2020
2 parents eccf0c9 + c387edc commit 2026363
Show file tree
Hide file tree
Showing 491 changed files with 39,108 additions and 7,199 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ DartConfiguration.tcl
*.manifest
*.spec
.nfs*
.clangd

## Python build directories & artifacts
dask-worker-space/
Expand Down
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
repos:
- repo: https://github.com/timothycrosley/isort
rev: 4.3.21
rev: 5.0.4
hooks:
- id: isort
- repo: https://github.com/ambv/black
Expand Down
129 changes: 120 additions & 9 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,25 +5,56 @@
- PR #5292 Add unsigned int type columns to libcudf
- PR #5287 Add `index.join` support
- PR #5222 Adding clip feature support to DataFrame and Series
- PR #5318 Support/leverage DataFrame.shuffle in dask_cudf
- PR #4546 Support pandas 1.0+
- PR #5331 Add `cudf::drop_nans`
- PR #5327 Add `cudf::cross_join` feature
- PR #5204 Concatenate strings columns using row separator as strings column
- PR #5342 Add support for `StringMethods.__getitem__`
- PR #3504 Add External Kafka Datasource
- PR #5356 Use `size_type` instead of `scalar` in `cudf::repeat`.
- PR #5397 Add internal implementation of nested loop equijoins.
- PR #5303 Add slice_strings functionality using delimiter string
- PR #5394 Enable cast and binops with duration types (builds on PR 5359)
- PR #5301 Add Java bindings for `zfill`
- PR #5411 Enable metadata collection for chunked parquet writer
- PR #5359 Add duration types
- PR #5364 Validate array interface during buffer construction
- PR #5418 Add support for `DataFrame.info`
- PR #5425 Add Python `Groupby.rolling()`
- PR #5359 Add duration types
- PR #5434 Add nvtext function generate_character_grams
- PR #5442 Add support for `cudf.isclose`
- PR #5444 Remove usage of deprecated RMM APIs and headers.
- PR #5463 Add `.str.byte_count` python api and cython(bindings)
- PR #5488 Add plumbings for `.str.replace_tokens`
- PR #5502 Add Unsigned int types support in dlpack
- PR #5497 Add `.str.isinteger` & `.str.isfloat`
- PR #5511 Port of clx subword tokenizer to cudf
- PR #5528 Add unsigned int reading and writing support to parquet
- PR #5510 Add support for `cudf.Index` to create Indexes
- PR #5668 Adding support for `cudf.testing`
- PR #5460 Add support to write to remote filesystems
- PR #5454 Add support for `DataFrame.append`, `Index.append`, `Index.difference` and `Index.empty`
- PR #5536 Parquet reader - add support for multiple sources
- PR #5654 Adding support for `cudf.DataFrame.sample` and `cudf.Series.sample`
- PR #5607 Add Java bindings for duration types
- PR #5612 Add `is_hex` strings API
- PR #5659 Added support for rapids-compose for Java bindings and other enhancements
- PR #5637 Parameterize Null comparator behaviour in Joins
- PR #5623 Add `is_ipv4` strings API
- PR #5674 Support JIT backend on PowerPC64
- PR #5629 Add `ListColumn` and `ListDtype`
- PR #5658 Add `filter_tokens` nvtext API
- PR #5666 Add `filter_characters_of_type` strings API
- PR #5673 Always build and test with per-thread default stream enabled in the GPU CI build
- PR #5572 Add `cudf::encode` API.

## Improvements

- PR #5605 Automatically flush RMM allocate/free logs in JNI
- PR #5632 Switch JNI code to use `pool_memory_resource` instead of CNMeM
- PR #5486 Link Boost libraries statically in the Java build
- PR #5479 Link Arrow libraries statically
- PR #5414 Use new release of Thrust/CUB in the JNI build
- PR #5403 Update required CMake version to 3.14 in contribution guide
Expand Down Expand Up @@ -57,19 +88,59 @@
- PR #5405 Add Error message to `StringColumn.unary_operator`
- PR #5424 Add python plumbing for `.str.character_tokenize`
- PR #5420 Aligning signature of `Series.value_counts` to Pandas
- PR #5535 Update document for XGBoost usage with dask-cuda
- PR #5431 Adding support for unsigned int
- PR #5426 Refactor strings code to minimize calls to regex
- PR #5433 Add support for column inputs in `strings::starts_with` and `strings::ends_with`
- PR #5427 Add Java bindings for unsigned data types
- PR #5429 Improve text wrapping in libcudf documentation
- PR #5443 Remove unused `is_simple` trait
- PR #5441 Update Java HostMemoryBuffer to only load native libs when necessary
- PR #5452 Add support for strings conversion using negative timestamps
- PR #5437 Improve libcudf join documentation
- PR #5458 Install meta packages for dependencies
- PR #5467 Move doc customization scripts to Jenkins
- PR #5468 Add cudf::unique_count(table_view)
- PR #5482 Use rmm::device_uvector in place of rmm::device_vector in copy_if
- PR #5483 Add NVTX range calls to dictionary APIs
- PR #5477 Add `is_index_type` trait
- PR #5487 Use sorted lists instead of sets for pytest parameterization
- PR #5491 allow build libcudf in custom dir
- PR #5501 Adding only unsigned types support for categorical column codes
- PR #5570 Add Index APIs such as `Int64Index`, `UInt64Index` and others
- PR #5503 Change `unique_count` to `distinct_count`
- PR #5514 `convert_datetime.cu` Small Cleanup
- PR #5496 Rename .cu tests (zero cuda kernels) to .cpp files
- PR #5518 split iterator and gather tests to speedup build tests
- PR #5526 Change `type_id` to enum class
- PR #5559 Java APIs for missing date/time operators
- PR #5582 Add support for axis and other parameters to `DataFrame.sort_index` and fix other bunch of issues.
- PR #5562 Add missing join type for java
- PR #5584 Refactor `CompactProtocolReader::InitSchema`
- PR #5591 Add `__arrow_array__` protocol and raise a descriptive error message
- PR #5635 Ad cuIO reader benchmarks for CSV, ORC and Parquet
- PR #5601 Instantiate Table instances in `Frame._concat` to avoid `DF.insert()` overhead
- PR #5602 Add support for concatenation of `Series` & `DataFrame` in `cudf.concat` when `axis=0`
- PR #5603 Refactor JIT `parser.cpp`
- PR #5643 Update `isort` to 5.0.4
- PR #5662 Make Java ColumnVector(long nativePointer) constructor public
- PR #5679 Use `pickle5` to test older Python versions
- PR #5684 Use `pickle5` in `Serializable` (when available)
- PR #5419 Support rolling, groupby_rolling for durations
- PR #5687 Change strings::split_record to return a lists column
- PR #5708 Add support for `dummy_na` in `get_dummies`
- PR #5709 Update java build to help cu-spacial with java bindings
- PR #5713 Remove old NVTX utilities
- PR #5726 Replace use of `assert_frame_equal` in tests with `assert_eq`
- PR #5720 Replace owning raw pointers with std::unique_ptr
- PR #5702 Add inherited methods to python docs and other docs fixes
- PR #5733 Add support for `size` property in `DataFrame`/ `Series` / `Index`/ `MultiIndex`
- PR #5743 Reduce number of test cases in concatenate benchmark
- PR #5752 Add cuDF internals documentation (ColumnAccessor)

## Bug Fixes

- PR #5525 Make sure to allocate bitmasks of string columns only once
- PR #5336 Initialize conversion tables on a per-context basis
- PR #5283 Fix strings::ipv4_to_integers overflow to negative
- PR #5269 Explicitly require NumPy
Expand All @@ -80,7 +151,7 @@
- PR #5334 Fix pickling sizeof test
- PR #5337 Fix broken alias from DataFrame.{at,iat} to {loc, iloc}
- PR #5347 Fix APPLY_BOOLEAN_MASK_BENCH segfault
- PR #5368 Fix loc indexing issue with `datetime` type index
- PR #5368 Fix loc indexing issue with `datetime` type index
- PR #5367 Fix API for `cudf::repeat` in `cudf::cross_join`
- PR #5377 Handle array of cupy scalars in to_column
- PR #5326 Fix `DataFrame.__init__` for list of scalar inputs and related dask issue
Expand All @@ -95,13 +166,50 @@
- PR #5399 Fix cpp compiler warnings of unreachable code
- PR #5439 Fix nvtext ngrams_tokenize performance for multi-byte UTF8
- PR #5446 Fix compile error caused by out-of-date PR merge (4990)
- PR #5423 Fix any() reduction ignore nulls
- PR #5459 Fix str.translate to convert table characters to UTF-8
- PR #5480 Fix merge sort docs
- PR #5465 Fix benchmark out of memory errors due to multiple initialization
- PR #5319 Disallow SUM and specialize MEAN of timestamp types
- PR #5473 Fix RLEv2 patched base in ORC reader
- PR #5472 Fix str concat issue with indexed series
- PR #5478 Fix `loc` and `iloc` doc
- PR #5484 Ensure flat index after groupby if nlevels == 1
- PR #5489 Fix drop_nulls/boolean_mask corruption for large columns
- PR #5504 Remove some java assertions that are not needed
- PR #5516 Update gpuCI image in local build script
- PR #5529 Fix issue with negative timestamp in orc writer
- PR #5523 Handle `dtype` of `Buffer` objects when not passed explicitly
- PR #5534 Fix the java build around type_id
- PR #5564 Fix CudfEngine.read_metadata API in dask_cudf
- PR #5537 Fix issue related to using `set_index` on a string series
- PR #5561 Fix `copy_bitmask` issue with offset
- PR #5609 Fix loc and iloc issue with column like input
- PR #5578 Fix getattr logic in GroupBy
- PR #5490 Fix python column view
- PR #5613 Fix assigning an equal length object into a masked out Series
- PR #5608 Fix issue related to string types being represented as binary types
- PR #5619 Fix issue related to typecasting when using a `CategoricalDtype`
- PR #5649 Fix issue when empty Dataframe with index are passed to `cudf.concat`
- PR #5644 Fix issue related to Dataframe init when passing in `columns`
- PR #5340 Disable iteration in cudf objects and add support for `DataFrame` initialization with list of `Series`
- PR #5663 Move Duration types under Timestamps in doxygen Modules page
- PR #5664 Update conda upload versions for new supported CUDA/Python
- PR #5656 Fix issue with incorrect docker image being used in local build script
- PR #5671 Fix chunksize issue with `DataFrame.to_csv`
- PR #5672 Fix crash in parquet writer while writing large string data
- PR #5675 Allow lists_column_wrappers to be constructed from incomplete hierarchies.
- PR #5691 Raise error on incompatible mixed-type input for a column
- PR #5692 Fix compilation issue with gcc 7.4.0 and CUDA 10.1
- PR #5693 Add fix missing from PR 5656 to update local docker image to py3.7
- PR #5703 Small fix for dataframe constructor with cuda array interface objects that don't have `descr` field
- PR #5719 Fix Frame._concat() with categorical columns
- PR #5736 Disable unsigned type in ORC writer benchmarks
- PR #5745 Update JNI cast for inability to cast timestamp and integer types
- PR #5750 Add RMM_ROOT/include to the spdlog search path in JNI build
- PR #5319 Disallow SUM and specialize MEAN of timestamp types


# cuDF 0.14.0 (Date TBD)
# cuDF 0.14.0 (03 Jun 2020)

## New Features

Expand All @@ -120,6 +228,7 @@
- PR #4923 Add Java and JNI bindings for string split
- PR #4972 Add list_view (cudf::LIST) type
- PR #4990 Add lists_column_view, list_column_wrapper, lists support for concatenate
- PR #5073 gather support for cudf::LIST columns
- PR #5004 Added a null considering min/max binary op
- PR #4992 Add Java bindings for converting nans to nulls
- PR #4975 Add Java bindings for first and last aggregate expressions based on nth
Expand Down Expand Up @@ -208,7 +317,7 @@
- PR #4841 Remove unused `single_lane_block_popc_reduce` function
- PR #4842 Added Java bindings for titlizing a String column
- PR #4847 Replace legacy NVTX calls with "standalone" NVTX bindings calls
- PR #4851 Performance improvements relating to `concat`
- PR #4851 Performance improvements relating to `concat`
- PR #4852 Add NVTX range calls to strings and nvtext APIs
- PR #4849 Update Java bindings to use new NVTX API
- PR #4845 Add CUDF_FUNC_RANGE to top-level cuIO function APIs
Expand Down Expand Up @@ -236,7 +345,7 @@
- PR #4912 Drop old `valid` check in `element_indexing`
- PR #4924 Properly handle npartition argument in rearrange_by_hash
- PR #4918 Adding support for `cupy.ndarray` in `series.loc`
- PR #4909 Added ability to transform a column using cuda method in Java bindings
- PR #4909 Added ability to transform a column using cuda method in Java bindings
- PR #3259 Add .clang-format file & format all files
- PR #4943 Fix-up error handling in GPU detection
- PR #4917 Add support for casting unsupported `dtypes` of same kind
Expand Down Expand Up @@ -350,7 +459,7 @@
- PR #4749 Setting `nan_as_null=True` while creating a column in DataFrame creation
- PR #4761 Fix issues with `nan_as_null` in certain case
- PR #4650 Fix type mismatch & result format issue in `searchsorted`
- PR #4755 Fix Java build to deal with new quantiles API
- PR #4755 Fix Java build to deal with new quantiles API
- PR #4720 Fix issue related to `dtype` param not being adhered incase of cuda arrays
- PR #4756 Fix regex error checking for valid quantifier condition
- PR #4777 Fix data pointer for column slices of zero length
Expand All @@ -374,6 +483,7 @@
- PR #4883 Fix series get/set to match pandas
- PR #4861 Fix to_integers illegal-memory-access with all-empty strings column
- PR #4860 Fix issues in HostMemoryBufferTest, and testNormalizeNANsAndZeros
- PR #4879 Fix output for `cudf.concat` with `axis=1` for pandas parity
- PR #4838 Fix to support empty inputs to `replace` method
- PR #4859 JSON reader: fix data type inference for string columns
- PR #4868 Temporary fix to skip validation on Dask related runs
Expand Down Expand Up @@ -411,11 +521,12 @@
- PR #5070 Fix libcudf++ csv reader support for hex dtypes, doublequotes and empty columns
- PR #5057 Fix metadata_out parameter not reaching parquet `write_all`
- PR #5076 Fix JNI code for null_policy enum change
- PR #5031 grouped_time_range_rolling_window assumes ASC sort order
- PR #5031 grouped_time_range_rolling_window assumes ASC sort order
- PR #5032 grouped_time_range_rolling_window should permit invocation without specifying grouping_keys
- PR #5103 Fix `read_csv` issue with names and header
- PR #5090 Fix losing nulls while creating DataFrame from dictionary
- PR #5089 Return false for sign-only string in libcudf is_float and is_integer
- PR #5124 `DataFrame.rename` support for renaming indexes w/ default for `index`
- PR #5108 Fix float-to-string convert for -0.0
- PR #5111 Fix header not being included in legacy jit transform.
- PR #5115 Fix hex-to-integer logic when string has prefix '0x'
Expand Down Expand Up @@ -523,7 +634,7 @@
- PR #4028 Port json.pyx to use new libcudf APIs
- PR #4014 ORC/Parquet: add count parameter to stripe/rowgroup-based reader API
- PR #3880 Add aggregation infrastructure support for cudf::reduce
- PR #4059 Add aggregation infrastructure support for cudf::scan
- PR #4059 Add aggregation infrastructure support for cudf::scan
- PR #4021 Change quantiles signature for clarity.
- PR #4057 Handle offsets in cython Column class
- PR #4045 Reorganize `libxx` directory
Expand Down Expand Up @@ -1965,7 +2076,7 @@

- PR #821 Fix flake8 issues revealed by flake8 update
- PR #808 Resolved renamed `d_columns_valids` variable name
- PR #820 CSV Reader: fix the issue where reader adds additional rows when file uses
- PR #820 CSV Reader: fix the issue where reader adds additional rows when file uses
as a line terminator
- PR #780 CSV Reader: Fix scientific notation parsing and null values for empty quotes
- PR #815 CSV Reader: Fix data parsing when tabs are present in the input CSV file
Expand Down
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@ that committed code follows our standards. You can use the tools to
automatically format your python code by running:

```bash
isort --recursive --atomic --apply python
isort --atomic python/**/*.py
black python
```

Expand Down
68 changes: 47 additions & 21 deletions build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@
# This script is used to build the component(s) in this repo from
# source, and can be called with various options to customize the
# build as needed (see the help output for details)

# Abort script on first error
set -e

Expand All @@ -18,27 +17,30 @@ ARGS=$*
# script, and that this script resides in the repo dir!
REPODIR=$(cd $(dirname $0); pwd)

VALIDARGS="clean libcudf cudf dask_cudf benchmarks tests -v -g -n --allgpuarch --disable_nvtx --show_depr_warn -h"
HELP="$0 [clean] [libcudf] [cudf] [dask_cudf] [benchmarks] [tests] [-v] [-g] [-n] [-h]
clean - remove all existing build artifacts and configuration (start
over)
libcudf - build the cudf C++ code only
cudf - build the cudf Python package
dask_cudf - build the dask_cudf Python package
benchmarks - build benchmarks
tests - build tests
-v - verbose build mode
-g - build for debug
-n - no install step
--allgpuarch - build for all supported GPU architectures
--disable_nvtx - disable inserting NVTX profiling ranges
--show_depr_warn - show cmake deprecation warnings
-h - print this text
VALIDARGS="clean libcudf cudf dask_cudf benchmarks tests libcudf_kafka -v -g -n -l --allgpuarch --disable_nvtx --show_depr_warn --ptds -h"
HELP="$0 [clean] [libcudf] [cudf] [dask_cudf] [benchmarks] [tests] [libcudf_kafka] [-v] [-g] [-n] [-h] [-l]
clean - remove all existing build artifacts and configuration (start
over)
libcudf - build the cudf C++ code only
cudf - build the cudf Python package
dask_cudf - build the dask_cudf Python package
benchmarks - build benchmarks
tests - build tests
libcudf_kafka - build the libcudf_kafka C++ code only
-v - verbose build mode
-g - build for debug
-n - no install step
-l - build legacy tests
--allgpuarch - build for all supported GPU architectures
--disable_nvtx - disable inserting NVTX profiling ranges
--show_depr_warn - show cmake deprecation warnings
--ptds - enable per-thread default stream
-h - print this text
default action (no args) is to build and install 'libcudf' then 'cudf'
then 'dask_cudf' targets
"
LIB_BUILD_DIR=${REPODIR}/cpp/build
LIB_BUILD_DIR=${LIB_BUILD_DIR:=${REPODIR}/cpp/build}
CUDF_BUILD_DIR=${REPODIR}/python/cudf/build
DASK_CUDF_BUILD_DIR=${REPODIR}/python/dask_cudf/build
BUILD_DIRS="${LIB_BUILD_DIR} ${CUDF_BUILD_DIR} ${DASK_CUDF_BUILD_DIR}"
Expand All @@ -52,6 +54,8 @@ BUILD_ALL_GPU_ARCH=0
BUILD_NVTX=ON
BUILD_TESTS=OFF
BUILD_DISABLE_DEPRECATION_WARNING=ON
BUILD_PER_THREAD_DEFAULT_STREAM=OFF
BUILD_LIBCUDF_KAFKA=OFF

# Set defaults for vars that may not have been defined externally
# FIXME: if INSTALL_PREFIX is not set, check PREFIX, then check
Expand Down Expand Up @@ -108,6 +112,12 @@ fi
if hasArg --show_depr_warn; then
BUILD_DISABLE_DEPRECATION_WARNING=OFF
fi
if hasArg --ptds; then
BUILD_PER_THREAD_DEFAULT_STREAM=ON
fi
if hasArg libcudf_kafka; then
BUILD_LIBCUDF_KAFKA=ON
fi

# If clean given, run it prior to any other steps
if hasArg clean; then
Expand All @@ -134,8 +144,7 @@ fi
################################################################################
# Configure, build, and install libcudf

if buildAll || hasArg libcudf; then

if buildAll || hasArg libcudf || hasArg libcudf_kafka; then
mkdir -p ${LIB_BUILD_DIR}
cd ${LIB_BUILD_DIR}
cmake -DCMAKE_INSTALL_PREFIX=${INSTALL_PREFIX} \
Expand All @@ -144,7 +153,9 @@ if buildAll || hasArg libcudf; then
-DUSE_NVTX=${BUILD_NVTX} \
-DBUILD_BENCHMARKS=${BUILD_BENCHMARKS} \
-DDISABLE_DEPRECATION_WARNING=${BUILD_DISABLE_DEPRECATION_WARNING} \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} ..
-DPER_THREAD_DEFAULT_STREAM=${BUILD_PER_THREAD_DEFAULT_STREAM} \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
-DBUILD_CUDF_KAFKA=${BUILD_LIBCUDF_KAFKA} $REPODIR/cpp
fi

if buildAll || hasArg libcudf; then
Expand Down Expand Up @@ -187,3 +198,18 @@ if buildAll || hasArg dask_cudf; then
python setup.py build_ext --inplace
fi
fi

# Do not build libcudf_kafka with 'buildAll'
if hasArg libcudf_kafka; then

cd ${LIB_BUILD_DIR}
if [[ ${INSTALL_TARGET} != "" ]]; then
make -j${PARALLEL_LEVEL} install_libcudf_kafka VERBOSE=${VERBOSE}
else
make -j${PARALLEL_LEVEL} libcudf_kafka VERBOSE=${VERBOSE}
fi

if [[ ${BUILD_TESTS} == "ON" ]]; then
make -j${PARALLEL_LEVEL} build_tests_libcudf_kafka VERBOSE=${VERBOSE}
fi
fi
Loading

0 comments on commit 2026363

Please sign in to comment.