diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6d4bdfb8d98..dda2e02f593 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,8 +3,244 @@
 Please see https://github.com/rapidsai/cudf/releases/tag/v22.04.00a for the latest changes to this development branch.
 
-# cuDF 22.02.00 (Date TBD)
+# cuDF 22.02.00 (2 Feb 2022)
+
+## 🚨 Breaking Changes
+
+- ORC writer API changes for granular statistics ([#10058](https://github.com/rapidsai/cudf/pull/10058)) [@mythrocks](https://github.com/mythrocks)
+- `decimal128` Support for `to/from_arrow` ([#9986](https://github.com/rapidsai/cudf/pull/9986)) [@codereport](https://github.com/codereport)
+- Remove deprecated method `one_hot_encoding` ([#9977](https://github.com/rapidsai/cudf/pull/9977)) [@isVoid](https://github.com/isVoid)
+- Remove str.subword_tokenize ([#9968](https://github.com/rapidsai/cudf/pull/9968)) [@VibhuJawa](https://github.com/VibhuJawa)
+- Remove deprecated `method` parameter from `merge` and `join`. ([#9944](https://github.com/rapidsai/cudf/pull/9944)) [@bdice](https://github.com/bdice)
+- Remove deprecated method DataFrame.hash_columns. ([#9943](https://github.com/rapidsai/cudf/pull/9943)) [@bdice](https://github.com/bdice)
+- Remove deprecated method Series.hash_encode. ([#9942](https://github.com/rapidsai/cudf/pull/9942)) [@bdice](https://github.com/bdice)
+- Refactoring ceil/round/floor code for datetime64 types ([#9926](https://github.com/rapidsai/cudf/pull/9926)) [@mayankanand007](https://github.com/mayankanand007)
+- Introduce `nan_as_null` parameter for `cudf.Index` ([#9893](https://github.com/rapidsai/cudf/pull/9893)) [@galipremsagar](https://github.com/galipremsagar)
+- Add regex_flags parameter to strings replace_re functions ([#9878](https://github.com/rapidsai/cudf/pull/9878)) [@davidwendt](https://github.com/davidwendt)
+- Break tie for `top` categorical columns in `Series.describe` ([#9867](https://github.com/rapidsai/cudf/pull/9867)) [@isVoid](https://github.com/isVoid)
+- Add partitioning support in parquet writer ([#9810](https://github.com/rapidsai/cudf/pull/9810)) [@devavret](https://github.com/devavret)
+- Move `drop_duplicates`, `drop_na`, `_gather`, `take` to IndexedFrame and create their `_base_index` counterparts ([#9807](https://github.com/rapidsai/cudf/pull/9807)) [@isVoid](https://github.com/isVoid)
+- Raise temporary error for `decimal128` types in parquet reader ([#9804](https://github.com/rapidsai/cudf/pull/9804)) [@galipremsagar](https://github.com/galipremsagar)
+- Change default `dtype` of all nulls column from `float` to `object` ([#9803](https://github.com/rapidsai/cudf/pull/9803)) [@galipremsagar](https://github.com/galipremsagar)
+- Remove unused masked udf cython/c++ code ([#9792](https://github.com/rapidsai/cudf/pull/9792)) [@brandon-b-miller](https://github.com/brandon-b-miller)
+- Pick smallest decimal type with required precision in ORC reader ([#9775](https://github.com/rapidsai/cudf/pull/9775)) [@vuule](https://github.com/vuule)
+- Add decimal128 support to Parquet reader and writer ([#9765](https://github.com/rapidsai/cudf/pull/9765)) [@vuule](https://github.com/vuule)
+- Refactor TableTest assertion methods to a separate utility class ([#9762](https://github.com/rapidsai/cudf/pull/9762)) [@jlowe](https://github.com/jlowe)
+- Use cuFile direct device reads/writes by default in cuIO ([#9722](https://github.com/rapidsai/cudf/pull/9722)) [@vuule](https://github.com/vuule)
+- Match pandas scalar result types in reductions ([#9717](https://github.com/rapidsai/cudf/pull/9717)) [@brandon-b-miller](https://github.com/brandon-b-miller)
+- Add parameters to control row group size in Parquet writer ([#9677](https://github.com/rapidsai/cudf/pull/9677)) [@vuule](https://github.com/vuule)
+- Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. ([#9588](https://github.com/rapidsai/cudf/pull/9588)) [@bdice](https://github.com/bdice)
+- Add support for `decimal128` in cudf python ([#9533](https://github.com/rapidsai/cudf/pull/9533)) [@galipremsagar](https://github.com/galipremsagar)
+- Implement `lists::index_of()` to find positions in list rows ([#9510](https://github.com/rapidsai/cudf/pull/9510)) [@mythrocks](https://github.com/mythrocks)
+- Rewriting row/column conversions for Spark <-> cudf data conversions ([#8444](https://github.com/rapidsai/cudf/pull/8444)) [@hyperbolic2346](https://github.com/hyperbolic2346)
-Please see https://github.com/rapidsai/cudf/releases/tag/v22.02.00a for the latest changes to this development branch.
+## 🐛 Bug Fixes + +- Add check fo negative stipe index in ORC eade ([#10074](https://github.com/rapidsai/cudf/pull/10074)) [@vuule](https://github.com/vuule) +- Update Java tests to expect DECIMAL128 fom Aow ([#10073](https://github.com/rapidsai/cudf/pull/10073)) [@jlowe](https://github.com/jlowe) +- Avoid index mateialization when `DataFame` is ceated with un-named `Seies` objects ([#10071](https://github.com/rapidsai/cudf/pull/10071)) [@galipemsaga](https://github.com/galipemsaga) +- fix gcc 11 compilation eos ([#10067](https://github.com/rapidsai/cudf/pull/10067)) [@ongou](https://github.com/ongou) +- Fix `columns` odeing issue in paquet eade ([#10066](https://github.com/rapidsai/cudf/pull/10066)) [@galipemsaga](https://github.com/galipemsaga) +- Fix datafame setitem with `ndaay` types ([#10056](https://github.com/rapidsai/cudf/pull/10056)) [@galipemsaga](https://github.com/galipemsaga) +- Remove implicit copy due to convesion fom cudf::size_type and size_t ([#10045](https://github.com/rapidsai/cudf/pull/10045)) [@obetmaynad](https://github.com/obetmaynad) +- Include <optional> in heades that use std::optional ([#10044](https://github.com/rapidsai/cudf/pull/10044)) [@obetmaynad](https://github.com/obetmaynad) +- Fix ep and concat of `StuctColumn` ([#10042](https://github.com/rapidsai/cudf/pull/10042)) [@galipemsaga](https://github.com/galipemsaga) +- Include ow goup level stats when witing ORC files ([#10041](https://github.com/rapidsai/cudf/pull/10041)) [@vuule](https://github.com/vuule) +- build.sh espects the `--build_metics` and `--incl_cache_stats` flags ([#10035](https://github.com/rapidsai/cudf/pull/10035)) [@obetmaynad](https://github.com/obetmaynad) +- Fix memoy leaks in JNI native code. 
([#10029](https://github.com/rapidsai/cudf/pull/10029)) [@mythocks](https://github.com/mythocks) +- Update JNI to use new aena m constucto ([#10027](https://github.com/rapidsai/cudf/pull/10027)) [@ongou](https://github.com/ongou) +- Fix null check when compaing stucts in `ag_min` opeation of eduction/goupby ([#10026](https://github.com/rapidsai/cudf/pull/10026)) [@ttnghia](https://github.com/ttnghia) +- Wap CI scipt shell vaiables in quotes to fix local testing. ([#10018](https://github.com/rapidsai/cudf/pull/10018)) [@bdice](https://github.com/bdice) +- cudftestutil no longe popagates compile flags to extenal uses ([#10017](https://github.com/rapidsai/cudf/pull/10017)) [@obetmaynad](https://github.com/obetmaynad) +- Remove `CUDA_DEVICE_CALLABLE` maco usage ([#10015](https://github.com/rapidsai/cudf/pull/10015)) [@hypebolic2346](https://github.com/hypebolic2346) +- Add missing list filling heade in meta.yaml ([#10007](https://github.com/rapidsai/cudf/pull/10007)) [@devavet](https://github.com/devavet) +- Fix `conda` ecipes fo `custeamz` & `cudf_kafka` ([#10003](https://github.com/rapidsai/cudf/pull/10003)) [@ajschmidt8](https://github.com/ajschmidt8) +- Fix matching egex wod-bounday () in stings eplace ([#9997](https://github.com/rapidsai/cudf/pull/9997)) [@davidwendt](https://github.com/davidwendt) +- Fix null check when compaing stucts in `min` and `max` eduction/goupby opeations ([#9994](https://github.com/rapidsai/cudf/pull/9994)) [@ttnghia](https://github.com/ttnghia) +- Fix octal patten matching in egex sting ([#9993](https://github.com/rapidsai/cudf/pull/9993)) [@davidwendt](https://github.com/davidwendt) +- `decimal128` Suppot fo `to/fom_aow` ([#9986](https://github.com/rapidsai/cudf/pull/9986)) [@codeepot](https://github.com/codeepot) +- Fix goupby shift/diff/fill 
afte selecting fom a `GoupBy` ([#9984](https://github.com/rapidsai/cudf/pull/9984)) [@shwina](https://github.com/shwina) +- Fix the oveflow poblem of decimal escale ([#9966](https://github.com/rapidsai/cudf/pull/9966)) [@spelingxx](https://github.com/spelingxx) +- Use default value fo decimal pecision in paquet wite when not specified ([#9963](https://github.com/rapidsai/cudf/pull/9963)) [@devavet](https://github.com/devavet) +- Fix cudf java build eo. ([#9958](https://github.com/rapidsai/cudf/pull/9958)) [@fiestaman](https://github.com/fiestaman) +- Use gpuci_mamba_ety to install local atifacts. ([#9951](https://github.com/rapidsai/cudf/pull/9951)) [@bdice](https://github.com/bdice) +- Fix egession HostColumnVectoCoe equiing native libs ([#9948](https://github.com/rapidsai/cudf/pull/9948)) [@jlowe](https://github.com/jlowe) +- Rename aggegate_metadata in wite to fix name collision ([#9938](https://github.com/rapidsai/cudf/pull/9938)) [@devavet](https://github.com/devavet) +- Fixed issue with pecentile_appox whee output tdigests could have uninitialized data at the end. ([#9931](https://github.com/rapidsai/cudf/pull/9931)) [@nvdbaanec](https://github.com/nvdbaanec) +- Resolve acecheck eos in ORC kenels ([#9916](https://github.com/rapidsai/cudf/pull/9916)) [@vuule](https://github.com/vuule) +- Fix the java build afte paquet patitioning suppot ([#9908](https://github.com/rapidsai/cudf/pull/9908)) [@evans2](https://github.com/evans2) +- Fix compilation of benchmak fo paquet wite. 
([#9905](https://github.com/rapidsai/cudf/pull/9905)) [@bdice](https://github.com/bdice) +- Fix a memcheck eo in ORC wite ([#9896](https://github.com/rapidsai/cudf/pull/9896)) [@vuule](https://github.com/vuule) +- Intoduce `nan_as_null` paamete fo `cudf.Index` ([#9893](https://github.com/rapidsai/cudf/pull/9893)) [@galipemsaga](https://github.com/galipemsaga) +- Fix fallback to sot aggegation fo gouping only hash aggegate ([#9891](https://github.com/rapidsai/cudf/pull/9891)) [@abellina](https://github.com/abellina) +- Add zlib to cudfjni link when using static libcudf libay dependency ([#9890](https://github.com/rapidsai/cudf/pull/9890)) [@jlowe](https://github.com/jlowe) +- TimedeltaIndex constucto aises an AttibuteEo. ([#9884](https://github.com/rapidsai/cudf/pull/9884)) [@skiui-souce](https://github.com/skiui-souce) +- Fix cudf.Scala sting datetime constuction ([#9875](https://github.com/rapidsai/cudf/pull/9875)) [@bandon-b-mille](https://github.com/bandon-b-mille) +- Load libcufile.so with RTLD_NODELETE flag ([#9872](https://github.com/rapidsai/cudf/pull/9872)) [@vuule](https://github.com/vuule) +- Beak tie fo `top` categoical columns in `Seies.descibe` ([#9867](https://github.com/rapidsai/cudf/pull/9867)) [@isVoid](https://github.com/isVoid) +- Fix null handling fo stucts `min` and `ag_min` in goupby, goupby scan, eduction, and inclusive_scan ([#9864](https://github.com/rapidsai/cudf/pull/9864)) [@ttnghia](https://github.com/ttnghia) +- Add one-level list encoding suppot in paquet eade ([#9848](https://github.com/rapidsai/cudf/pull/9848)) [@PointKenel](https://github.com/PointKenel) +- Fix an out-of-bounds ead in validity copying in contiguous_split. 
([#9842](https://github.com/rapidsai/cudf/pull/9842)) [@nvdbaanec](https://github.com/nvdbaanec) +- Fix join of MultiIndex to Index with one column and ovelapping name. ([#9830](https://github.com/rapidsai/cudf/pull/9830)) [@vyas](https://github.com/vyas) +- Fix caching in `Seies.applymap` ([#9821](https://github.com/rapidsai/cudf/pull/9821)) [@bandon-b-mille](https://github.com/bandon-b-mille) +- Enfoce boolean `ascending` fo dask-cudf `sot_values` ([#9814](https://github.com/rapidsai/cudf/pull/9814)) [@chalesbluca](https://github.com/chalesbluca) +- Fix ORC wite cash with empty input columns ([#9808](https://github.com/rapidsai/cudf/pull/9808)) [@vuule](https://github.com/vuule) +- Change default `dtype` of all nulls column fom `float` to `object` ([#9803](https://github.com/rapidsai/cudf/pull/9803)) [@galipemsaga](https://github.com/galipemsaga) +- Load native dependencies when Java ColumnView is loaded ([#9800](https://github.com/rapidsai/cudf/pull/9800)) [@jlowe](https://github.com/jlowe) +- Fix dtype-agument bug in dask_cudf ead_csv ([#9796](https://github.com/rapidsai/cudf/pull/9796)) [@jzamoa](https://github.com/jzamoa) +- Fix oveflow fo min calculation in stings::fom_timestamps ([#9793](https://github.com/rapidsai/cudf/pull/9793)) [@evans2](https://github.com/evans2) +- Fix memoy eo due to lambda etun type deduction limitation ([#9778](https://github.com/rapidsai/cudf/pull/9778)) [@kathikeyann](https://github.com/kathikeyann) +- Revet egex $/EOL end-of-sting new-line special case handling ([#9774](https://github.com/rapidsai/cudf/pull/9774)) [@davidwendt](https://github.com/davidwendt) +- Fix missing steams ([#9767](https://github.com/rapidsai/cudf/pull/9767)) [@kathikeyann](https://github.com/kathikeyann) +- Fix make_empty_scala_like on list_type 
([#9759](https://github.com/rapidsai/cudf/pull/9759)) [@spelingxx](https://github.com/spelingxx) +- Update cmake and conda to 22.02 ([#9746](https://github.com/rapidsai/cudf/pull/9746)) [@devavet](https://github.com/devavet) +- Fix out-of-bounds memoy wite in decimal128-to-sting convesion ([#9740](https://github.com/rapidsai/cudf/pull/9740)) [@davidwendt](https://github.com/davidwendt) +- Match pandas scala esult types in eductions ([#9717](https://github.com/rapidsai/cudf/pull/9717)) [@bandon-b-mille](https://github.com/bandon-b-mille) +- Fix egex non-multiline EOL/$ matching stings ending with a new-line ([#9715](https://github.com/rapidsai/cudf/pull/9715)) [@davidwendt](https://github.com/davidwendt) +- Fixed build by adding moe checks fo int8, int16 ([#9707](https://github.com/rapidsai/cudf/pull/9707)) [@azajafi](https://github.com/azajafi) +- Fix `null` handling when `boolean` dtype is passed ([#9691](https://github.com/rapidsai/cudf/pull/9691)) [@galipemsaga](https://github.com/galipemsaga) +- Fix steam usage in `segmented_gathe()` ([#9679](https://github.com/rapidsai/cudf/pull/9679)) [@mythocks](https://github.com/mythocks) + +## 📖 Documentation + +- Update `decimal` dtypes elated docs enties ([#10072](https://github.com/rapidsai/cudf/pull/10072)) [@galipemsaga](https://github.com/galipemsaga) +- Fix egex doc descibing hexadecimal escape chaactes ([#10009](https://github.com/rapidsai/cudf/pull/10009)) [@davidwendt](https://github.com/davidwendt) +- Fix cudf compilation instuctions. 
([#9956](https://github.com/rapidsai/cudf/pull/9956)) [@esoha-nvidia](https://github.com/esoha-nvidia) +- Fix see also links fo IO APIs ([#9895](https://github.com/rapidsai/cudf/pull/9895)) [@galipemsaga](https://github.com/galipemsaga) +- Fix build instuctions fo libcudf doxygen ([#9837](https://github.com/rapidsai/cudf/pull/9837)) [@davidwendt](https://github.com/davidwendt) +- Fix some doxygen wanings and add missing documentation ([#9770](https://github.com/rapidsai/cudf/pull/9770)) [@kathikeyann](https://github.com/kathikeyann) +- update cuda vesion in local build ([#9736](https://github.com/rapidsai/cudf/pull/9736)) [@kathikeyann](https://github.com/kathikeyann) +- Fix doxygen fo enum types in libcudf ([#9724](https://github.com/rapidsai/cudf/pull/9724)) [@davidwendt](https://github.com/davidwendt) +- Spell check fixes ([#9682](https://github.com/rapidsai/cudf/pull/9682)) [@kathikeyann](https://github.com/kathikeyann) +- Fix links in C++ Develope Guide. 
([#9675](https://github.com/rapidsai/cudf/pull/9675)) [@bdice](https://github.com/bdice) + +## 🚀 New Featues + +- Remove libcudacxx patch needed fo nvcc 11.4 ([#10057](https://github.com/rapidsai/cudf/pull/10057)) [@obetmaynad](https://github.com/obetmaynad) +- Allow CuPy 10 ([#10048](https://github.com/rapidsai/cudf/pull/10048)) [@jakikham](https://github.com/jakikham) +- Add in suppot fo NULL_LOGICAL_AND and NULL_LOGICAL_OR binops ([#10016](https://github.com/rapidsai/cudf/pull/10016)) [@evans2](https://github.com/evans2) +- Add `goupby.tansfom` (only suppot fo aggegations) ([#10005](https://github.com/rapidsai/cudf/pull/10005)) [@shwina](https://github.com/shwina) +- Add patitioning suppot to Paquet chunked wite ([#10000](https://github.com/rapidsai/cudf/pull/10000)) [@devavet](https://github.com/devavet) +- Add jni fo sequences ([#9972](https://github.com/rapidsai/cudf/pull/9972)) [@wbo4958](https://github.com/wbo4958) +- Java bindings fo mixed left, inne, and full joins ([#9941](https://github.com/rapidsai/cudf/pull/9941)) [@jlowe](https://github.com/jlowe) +- Java bindings fo JSON eade suppot ([#9940](https://github.com/rapidsai/cudf/pull/9940)) [@wbo4958](https://github.com/wbo4958) +- Enable tanspose fo sting columns in cudf python ([#9937](https://github.com/rapidsai/cudf/pull/9937)) [@galipemsaga](https://github.com/galipemsaga) +- Suppot stucts fo `cudf::contains` with column/scala input ([#9929](https://github.com/rapidsai/cudf/pull/9929)) [@ttnghia](https://github.com/ttnghia) +- Implement mixed equality/conditional joins ([#9917](https://github.com/rapidsai/cudf/pull/9917)) [@vyas](https://github.com/vyas) +- Add cudf::stings::extact_all API ([#9909](https://github.com/rapidsai/cudf/pull/9909)) [@davidwendt](https://github.com/davidwendt) +- 
Implement JNI fo `cudf::scatte` APIs ([#9903](https://github.com/rapidsai/cudf/pull/9903)) [@ttnghia](https://github.com/ttnghia) +- JNI: Function to copy and set validity fom bool column. ([#9901](https://github.com/rapidsai/cudf/pull/9901)) [@mythocks](https://github.com/mythocks) +- Add dictionay suppot to cudf::copy_if_else ([#9887](https://github.com/rapidsai/cudf/pull/9887)) [@davidwendt](https://github.com/davidwendt) +- add un_benchmaks taget fo unning benchmaks with json output ([#9879](https://github.com/rapidsai/cudf/pull/9879)) [@kathikeyann](https://github.com/kathikeyann) +- Add egex_flags paamete to stings eplace_e functions ([#9878](https://github.com/rapidsai/cudf/pull/9878)) [@davidwendt](https://github.com/davidwendt) +- Add_suffix and add_pefix fo DataFames and Seies ([#9846](https://github.com/rapidsai/cudf/pull/9846)) [@mayankanand007](https://github.com/mayankanand007) +- Add JNI fo `cudf::dop_duplicates` ([#9841](https://github.com/rapidsai/cudf/pull/9841)) [@ttnghia](https://github.com/ttnghia) +- Implement pe-list sequence ([#9839](https://github.com/rapidsai/cudf/pull/9839)) [@ttnghia](https://github.com/ttnghia) +- adding `seies.tanspose` ([#9835](https://github.com/rapidsai/cudf/pull/9835)) [@mayankanand007](https://github.com/mayankanand007) +- Adding suppot fo `Seies.autoco` ([#9833](https://github.com/rapidsai/cudf/pull/9833)) [@mayankanand007](https://github.com/mayankanand007) +- Suppot ound opeation on datetime64 datatypes ([#9820](https://github.com/rapidsai/cudf/pull/9820)) [@mayankanand007](https://github.com/mayankanand007) +- Add patitioning suppot in paquet wite ([#9810](https://github.com/rapidsai/cudf/pull/9810)) [@devavet](https://github.com/devavet) +- Raise tempoay eo fo `decimal128` types in paquet eade 
([#9804](https://github.com/rapidsai/cudf/pull/9804)) [@galipemsaga](https://github.com/galipemsaga) +- Add decimal128 suppot to Paquet eade and wite ([#9765](https://github.com/rapidsai/cudf/pull/9765)) [@vuule](https://github.com/vuule) +- Optimize `goupby::scan` ([#9754](https://github.com/rapidsai/cudf/pull/9754)) [@PointKenel](https://github.com/PointKenel) +- Add sample JNI API ([#9728](https://github.com/rapidsai/cudf/pull/9728)) [@es-life](https://github.com/es-life) +- Suppot `min` and `max` in inclusive scan fo stucts ([#9725](https://github.com/rapidsai/cudf/pull/9725)) [@ttnghia](https://github.com/ttnghia) +- Add `fist` and `last` method to `IndexedFame` ([#9710](https://github.com/rapidsai/cudf/pull/9710)) [@isVoid](https://github.com/isVoid) +- Suppot `min` and `max` eduction fo stucts ([#9697](https://github.com/rapidsai/cudf/pull/9697)) [@ttnghia](https://github.com/ttnghia) +- Add paametes to contol ow goup size in Paquet wite ([#9677](https://github.com/rapidsai/cudf/pull/9677)) [@vuule](https://github.com/vuule) +- Run compute-sanitize in nightly build ([#9641](https://github.com/rapidsai/cudf/pull/9641)) [@kathikeyann](https://github.com/kathikeyann) +- Implement Seies.datetime.floo ([#9571](https://github.com/rapidsai/cudf/pull/9571)) [@skiui-souce](https://github.com/skiui-souce) +- ceil/floo fo `DatetimeIndex` ([#9554](https://github.com/rapidsai/cudf/pull/9554)) [@mayankanand007](https://github.com/mayankanand007) +- Add suppot fo `decimal128` in cudf python ([#9533](https://github.com/rapidsai/cudf/pull/9533)) [@galipemsaga](https://github.com/galipemsaga) +- Implement `lists::index_of()` to find positions in list ows ([#9510](https://github.com/rapidsai/cudf/pull/9510)) [@mythocks](https://github.com/mythocks) +- custeamz oauth 
callback fo kafka (libdkafka) ([#9486](https://github.com/rapidsai/cudf/pull/9486)) [@jdye64](https://github.com/jdye64) +- Add Peason coelation fo sot goupby (python) ([#9166](https://github.com/rapidsai/cudf/pull/9166)) [@skiui-souce](https://github.com/skiui-souce) +- Intechange datafame potocol ([#9071](https://github.com/rapidsai/cudf/pull/9071)) [@iskode](https://github.com/iskode) +- Rewiting ow/column convesions fo Spak <-> cudf data convesions ([#8444](https://github.com/rapidsai/cudf/pull/8444)) [@hypebolic2346](https://github.com/hypebolic2346) + +## 🛠️ Impovements + +- Pepae upload scipts fo Python 3.7 emoval ([#10092](https://github.com/rapidsai/cudf/pull/10092)) [@Ethyling](https://github.com/Ethyling) +- Simplify custeamz and cudf_kafka ecipes files ([#10065](https://github.com/rapidsai/cudf/pull/10065)) [@Ethyling](https://github.com/Ethyling) +- ORC wite API changes fo ganula statistics ([#10058](https://github.com/rapidsai/cudf/pull/10058)) [@mythocks](https://github.com/mythocks) +- Remove python constaints in cuteamz and cudf_kafka ecipes ([#10052](https://github.com/rapidsai/cudf/pull/10052)) [@Ethyling](https://github.com/Ethyling) +- Unpin `dask` and `distibuted` in CI ([#10028](https://github.com/rapidsai/cudf/pull/10028)) [@galipemsaga](https://github.com/galipemsaga) +- Add `_fom_column_like_self` factoy ([#10022](https://github.com/rapidsai/cudf/pull/10022)) [@isVoid](https://github.com/isVoid) +- Replace custom CUDA bindings peviously povided by RMM with official CUDA Python bindings ([#10008](https://github.com/rapidsai/cudf/pull/10008)) [@shwina](https://github.com/shwina) +- Use `cuda::std::is_aithmetic` in `cudf::is_numeic` tait. 
([#9996](https://github.com/rapidsai/cudf/pull/9996)) [@bdice](https://github.com/bdice) +- Clean up CUDA steam use in cuIO ([#9991](https://github.com/rapidsai/cudf/pull/9991)) [@vuule](https://github.com/vuule) +- Use addessed-odeed fist fit fo the pinned memoy pool ([#9989](https://github.com/rapidsai/cudf/pull/9989)) [@ongou](https://github.com/ongou) +- Add stings tests to tanspose_test.cpp ([#9985](https://github.com/rapidsai/cudf/pull/9985)) [@davidwendt](https://github.com/davidwendt) +- Use gpuci_mamba_ety on Java CI. ([#9983](https://github.com/rapidsai/cudf/pull/9983)) [@bdice](https://github.com/bdice) +- Remove depecated method `one_hot_encoding` ([#9977](https://github.com/rapidsai/cudf/pull/9977)) [@isVoid](https://github.com/isVoid) +- Mino cleanup of unused Python functions ([#9974](https://github.com/rapidsai/cudf/pull/9974)) [@vyas](https://github.com/vyas) +- Use new efficient patitioned paquet witing in cuDF ([#9971](https://github.com/rapidsai/cudf/pull/9971)) [@devavet](https://github.com/devavet) +- Remove st.subwod_tokenize ([#9968](https://github.com/rapidsai/cudf/pull/9968)) [@VibhuJawa](https://github.com/VibhuJawa) +- Fowad-mege banch-21.12 to banch-22.02 ([#9947](https://github.com/rapidsai/cudf/pull/9947)) [@bdice](https://github.com/bdice) +- Remove depecated `method` paamete fom `mege` and `join`. ([#9944](https://github.com/rapidsai/cudf/pull/9944)) [@bdice](https://github.com/bdice) +- Remove depecated method DataFame.hash_columns. ([#9943](https://github.com/rapidsai/cudf/pull/9943)) [@bdice](https://github.com/bdice) +- Remove depecated method Seies.hash_encode. 
([#9942](https://github.com/rapidsai/cudf/pull/9942)) [@bdice](https://github.com/bdice) +- use ninja in java ci build ([#9933](https://github.com/rapidsai/cudf/pull/9933)) [@ongou](https://github.com/ongou) +- Add build-time publish step to cpu build scipt ([#9927](https://github.com/rapidsai/cudf/pull/9927)) [@davidwendt](https://github.com/davidwendt) +- Refactoing ceil/ound/floo code fo datetime64 types ([#9926](https://github.com/rapidsai/cudf/pull/9926)) [@mayankanand007](https://github.com/mayankanand007) +- Remove vaious unused functions ([#9922](https://github.com/rapidsai/cudf/pull/9922)) [@vyas](https://github.com/vyas) +- Raise in `quey` if dtype is not suppoted ([#9921](https://github.com/rapidsai/cudf/pull/9921)) [@bandon-b-mille](https://github.com/bandon-b-mille) +- Add missing impots tests ([#9920](https://github.com/rapidsai/cudf/pull/9920)) [@Ethyling](https://github.com/Ethyling) +- Spak Decimal128 hashing ([#9919](https://github.com/rapidsai/cudf/pull/9919)) [@wlee](https://github.com/wlee) +- Replace `thust/std::get` with stuctued bindings ([#9915](https://github.com/rapidsai/cudf/pull/9915)) [@codeepot](https://github.com/codeepot) +- Upgade thust vesion to 1.15 ([#9912](https://github.com/rapidsai/cudf/pull/9912)) [@obetmaynad](https://github.com/obetmaynad) +- Remove conda envs fo CUDA 11.0 and 11.2. ([#9910](https://github.com/rapidsai/cudf/pull/9910)) [@bdice](https://github.com/bdice) +- Retun count of set bits fom inplace_bitmask_and. 
([#9904](https://github.com/rapidsai/cudf/pull/9904)) [@bdice](https://github.com/bdice) +- Use dynamic nullate fo join hashe and equality compaato ([#9902](https://github.com/rapidsai/cudf/pull/9902)) [@davidwendt](https://github.com/davidwendt) +- Update ucx-py vesion on elease using vc ([#9897](https://github.com/rapidsai/cudf/pull/9897)) [@Ethyling](https://github.com/Ethyling) +- Remove `IncludeCategoies` fom `.clang-fomat` ([#9876](https://github.com/rapidsai/cudf/pull/9876)) [@codeepot](https://github.com/codeepot) +- Suppot statically linking CUDA untime fo Java bindings ([#9873](https://github.com/rapidsai/cudf/pull/9873)) [@jlowe](https://github.com/jlowe) +- Add `clang-tidy` to libcudf ([#9860](https://github.com/rapidsai/cudf/pull/9860)) [@codeepot](https://github.com/codeepot) +- Remove depecated methods fom Java Table class ([#9853](https://github.com/rapidsai/cudf/pull/9853)) [@jlowe](https://github.com/jlowe) +- Add test fo map column metadata handling in ORC wite ([#9852](https://github.com/rapidsai/cudf/pull/9852)) [@vuule](https://github.com/vuule) +- Use pandas `to_offset` to pase fequency sting in `date_ange` ([#9843](https://github.com/rapidsai/cudf/pull/9843)) [@isVoid](https://github.com/isVoid) +- add templated benchmak with fixtue ([#9838](https://github.com/rapidsai/cudf/pull/9838)) [@kathikeyann](https://github.com/kathikeyann) +- Use list of column inputs fo `apply_boolean_mask` ([#9832](https://github.com/rapidsai/cudf/pull/9832)) [@isVoid](https://github.com/isVoid) +- Added a few moe tests fo Decimal to Sting cast ([#9818](https://github.com/rapidsai/cudf/pull/9818)) [@azajafi](https://github.com/azajafi) +- Run doctests. 
([#9815](https://github.com/rapidsai/cudf/pull/9815)) [@bdice](https://github.com/bdice) +- Avoid oveflow fo fixed_point ound ([#9809](https://github.com/rapidsai/cudf/pull/9809)) [@spelingxx](https://github.com/spelingxx) +- Move `dop_duplicates`, `dop_na`, `_gathe`, `take` to IndexFame and ceate thei `_base_index` countepats ([#9807](https://github.com/rapidsai/cudf/pull/9807)) [@isVoid](https://github.com/isVoid) +- Use vecto factoies fo host-device copies. ([#9806](https://github.com/rapidsai/cudf/pull/9806)) [@bdice](https://github.com/bdice) +- Refacto host device macos ([#9797](https://github.com/rapidsai/cudf/pull/9797)) [@vyas](https://github.com/vyas) +- Remove unused masked udf cython/c++ code ([#9792](https://github.com/rapidsai/cudf/pull/9792)) [@bandon-b-mille](https://github.com/bandon-b-mille) +- Allow custom sot functions fo dask-cudf `sot_values` ([#9789](https://github.com/rapidsai/cudf/pull/9789)) [@chalesbluca](https://github.com/chalesbluca) +- Impove build time of libcudf iteato tests ([#9788](https://github.com/rapidsai/cudf/pull/9788)) [@davidwendt](https://github.com/davidwendt) +- Copy Java native dependencies diectly into classpath ([#9787](https://github.com/rapidsai/cudf/pull/9787)) [@jlowe](https://github.com/jlowe) +- Add decimal types to cuIO benchmaks ([#9776](https://github.com/rapidsai/cudf/pull/9776)) [@vuule](https://github.com/vuule) +- Pick smallest decimal type with equied pecision in ORC eade ([#9775](https://github.com/rapidsai/cudf/pull/9775)) [@vuule](https://github.com/vuule) +- Avoid oveflow fo `fixed_point` `cudf::cast` and pefomance optimization ([#9772](https://github.com/rapidsai/cudf/pull/9772)) [@codeepot](https://github.com/codeepot) +- Use CTAD with Thust function objects 
([#9768](https://github.com/rapidsai/cudf/pull/9768)) [@codereport](https://github.com/codereport) +- Refactor TableTest assertion methods to a separate utility class ([#9762](https://github.com/rapidsai/cudf/pull/9762)) [@jlowe](https://github.com/jlowe) +- Use Java classloader to find test resources ([#9760](https://github.com/rapidsai/cudf/pull/9760)) [@jlowe](https://github.com/jlowe) +- Allow cast decimal128 to string and add tests ([#9756](https://github.com/rapidsai/cudf/pull/9756)) [@razajafi](https://github.com/razajafi) +- Load balance optimization for contiguous_split ([#9755](https://github.com/rapidsai/cudf/pull/9755)) [@nvdbaranec](https://github.com/nvdbaranec) +- Consolidate and improve `reset_index` ([#9750](https://github.com/rapidsai/cudf/pull/9750)) [@isVoid](https://github.com/isVoid) +- Update to UCX-Py 0.24 ([#9748](https://github.com/rapidsai/cudf/pull/9748)) [@pentschev](https://github.com/pentschev) +- Skip cufile tests in JNI build script ([#9744](https://github.com/rapidsai/cudf/pull/9744)) [@pxLi](https://github.com/pxLi) +- Enable string to decimal 128 cast ([#9742](https://github.com/rapidsai/cudf/pull/9742)) [@razajafi](https://github.com/razajafi) +- Use stop instead of stop_.
([#9735](https://github.com/rapidsai/cudf/pull/9735)) [@bdice](https://github.com/bdice) +- Forward-merge branch-21.12 to branch-22.02 ([#9730](https://github.com/rapidsai/cudf/pull/9730)) [@bdice](https://github.com/bdice) +- Improve cmake format script ([#9723](https://github.com/rapidsai/cudf/pull/9723)) [@vyasr](https://github.com/vyasr) +- Use cuFile direct device reads/writes by default in cuIO ([#9722](https://github.com/rapidsai/cudf/pull/9722)) [@vuule](https://github.com/vuule) +- Add directory-partitioned data support to cudf.read_parquet ([#9720](https://github.com/rapidsai/cudf/pull/9720)) [@rjzamora](https://github.com/rjzamora) +- Use stream allocator adaptor for hash join table ([#9704](https://github.com/rapidsai/cudf/pull/9704)) [@PointKernel](https://github.com/PointKernel) +- Update check for inf/nan strings in libcudf float conversion to ignore case ([#9694](https://github.com/rapidsai/cudf/pull/9694)) [@davidwendt](https://github.com/davidwendt) +- Update cudf JNI to 22.02.0-SNAPSHOT ([#9681](https://github.com/rapidsai/cudf/pull/9681)) [@pxLi](https://github.com/pxLi) +- Replace cudf's concurrent_ordered_map with cuco::static_map in semi/anti joins ([#9666](https://github.com/rapidsai/cudf/pull/9666)) [@vyasr](https://github.com/vyasr) +- Some improvements to `parse_decimal` function and bindings for `is_fixed_point` ([#9658](https://github.com/rapidsai/cudf/pull/9658)) [@razajafi](https://github.com/razajafi) +- Add utility to format ninja-log build times ([#9631](https://github.com/rapidsai/cudf/pull/9631)) [@davidwendt](https://github.com/davidwendt) +- Allow runtime has_nulls parameter for row operators ([#9623](https://github.com/rapidsai/cudf/pull/9623)) [@davidwendt](https://github.com/davidwendt) +- Use fsspec.parquet for improved read_parquet performance from remote storage
([#9589](https://github.com/rapidsai/cudf/pull/9589)) [@rjzamora](https://github.com/rjzamora) +- Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. ([#9588](https://github.com/rapidsai/cudf/pull/9588)) [@bdice](https://github.com/bdice) +- Use List of Columns as Input for `drop_nulls`, `gather` and `drop_duplicates` ([#9558](https://github.com/rapidsai/cudf/pull/9558)) [@isVoid](https://github.com/isVoid) +- Simplify merge internals and reduce overhead ([#9516](https://github.com/rapidsai/cudf/pull/9516)) [@vyasr](https://github.com/vyasr) +- Add `struct` generation support in datagenerator & fuzz tests ([#9180](https://github.com/rapidsai/cudf/pull/9180)) [@galipremsagar](https://github.com/galipremsagar) +- Simplify write_csv by removing unnecessary writer/impl classes ([#9089](https://github.com/rapidsai/cudf/pull/9089)) [@cwharris](https://github.com/cwharris) # cuDF 21.12.00 (9 Dec 2021) diff --git a/build.sh b/build.sh index c2eba134c35..8b3add1dddd 100755 --- a/build.sh +++ b/build.sh @@ -185,12 +185,9 @@ if buildAll || hasArg libcudf; then fi # get the current count before the compile starts - FILES_IN_CCACHE="" - if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v ccache)" ]]; then - FILES_IN_CCACHE=$(ccache -s | grep "files in cache") - echo "$FILES_IN_CCACHE" - # zero the ccache statistics - ccache -z + if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v sccache)" ]]; then + # zero the sccache statistics + sccache --zero-stats fi cmake -S $REPODIR/cpp -B ${LIB_BUILD_DIR} \ @@ -216,11 +213,12 @@ if buildAll || hasArg libcudf; then echo "Formatting build metrics" python ${REPODIR}/cpp/scripts/sort_ninja_log.py ${LIB_BUILD_DIR}/.ninja_log --fmt xml > ${LIB_BUILD_DIR}/ninja_log.xml MSG="<br/><br/>" - # get some ccache stats after the compile - if [[ "$BUILD_REPORT_INCL_CACHE_STATS"=="ON" && -x "$(command -v ccache)" ]]; then - MSG="${MSG}<br/>$FILES_IN_CCACHE" - HIT_RATE=$(ccache -s | grep "cache hit rate") - MSG="${MSG}<br/>${HIT_RATE}" + # get some sccache stats after the compile + if [[ "$BUILD_REPORT_INCL_CACHE_STATS" == "ON" && -x "$(command -v sccache)" ]]; then + COMPILE_REQUESTS=$(sccache -s | grep "Compile requests \+ [0-9]\+$" | awk '{ print $NF }') + CACHE_HITS=$(sccache -s | grep "Cache hits \+ [0-9]\+$" | awk '{ print $NF }') + HIT_RATE=$(echo - | awk "{printf \"%.2f\n\", $CACHE_HITS / $COMPILE_REQUESTS * 100}") + MSG="${MSG}<br/>cache hit rate ${HIT_RATE} %" fi MSG="${MSG}<br/>parallel setting: $PARALLEL_LEVEL" MSG="${MSG}<br/>parallel build time: $compile_total seconds" diff --git a/ci/cpu/build.sh b/ci/cpu/build.sh index 6f19f174da0..574a55d26b6 100755 --- a/ci/cpu/build.sh +++ b/ci/cpu/build.sh @@ -31,6 +31,10 @@ if [[ "$BUILD_MODE" = "branch" && "$SOURCE_BRANCH" = branch-* ]] ; then export VERSION_SUFFIX=`date +%y%m%d` fi +export CMAKE_CUDA_COMPILER_LAUNCHER="sccache" +export CMAKE_CXX_COMPILER_LAUNCHER="sccache" +export CMAKE_C_COMPILER_LAUNCHER="sccache" + ################################################################################ # SETUP - Check environment ################################################################################ @@ -77,6 +81,8 @@ if [ "$BUILD_LIBCUDF" == '1' ]; then gpuci_conda_retry build --no-build-id --croot ${CONDA_BLD_DIR} conda/recipes/libcudf $CONDA_BUILD_ARGS mkdir -p ${CONDA_BLD_DIR}/libcudf/work cp -r ${CONDA_BLD_DIR}/work/* ${CONDA_BLD_DIR}/libcudf/work + gpuci_logger "sccache stats" + sccache --show-stats # Copy libcudf build metrics results LIBCUDF_BUILD_DIR=$CONDA_BLD_DIR/libcudf/work/cpp/build diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh index d5fb7451769..6a5c28faeff 100755 --- a/ci/gpu/build.sh +++ b/ci/gpu/build.sh @@ -36,6 +36,10 @@ export DASK_DISTRIBUTED_GIT_TAG='2022.01.0' # ucx-py version export UCX_PY_VERSION='0.25.*' +export CMAKE_CUDA_COMPILER_LAUNCHER="sccache" +export CMAKE_CXX_COMPILER_LAUNCHER="sccache" +export CMAKE_C_COMPILER_LAUNCHER="sccache" + ################################################################################ # TRAP - Setup trap for removing jitify cache ################################################################################ diff --git a/ci/utils/nbtestlog2junitxml.py b/ci/utils/nbtestlog2junitxml.py index 15b362e4b70..6a421279112 100644 --- a/ci/utils/nbtestlog2junitxml.py +++ b/ci/utils/nbtestlog2junitxml.py @@ -7,11 +7,11 @@ from enum import Enum -startingPatt = re.compile("^STARTING: ([\w\.\-]+)$") -skippingPatt = re.compile("^SKIPPING: ([\w\.\-]+)\s*(\(([\w\.\-\ \,]+)\))?\s*$")
-exitCodePatt = re.compile("^EXIT CODE: (\d+)$") -folderPatt = re.compile("^FOLDER: ([\w\.\-]+)$") -timePatt = re.compile("^real\s+([\d\.ms]+)$") +startingPatt = re.compile(r"^STARTING: ([\w\.\-]+)$") +skippingPatt = re.compile(r"^SKIPPING: ([\w\.\-]+)\s*(\(([\w\.\-\ \,]+)\))?\s*$") +exitCodePatt = re.compile(r"^EXIT CODE: (\d+)$") +folderPatt = re.compile(r"^FOLDER: ([\w\.\-]+)$") +timePatt = re.compile(r"^real\s+([\d\.ms]+)$") linePatt = re.compile("^" + ("-" * 80) + "$") diff --git a/conda/recipes/libcudf/meta.yaml b/conda/recipes/libcudf/meta.yaml index 2cbe5173de0..70c020d4abd 100644 --- a/conda/recipes/libcudf/meta.yaml +++ b/conda/recipes/libcudf/meta.yaml @@ -22,13 +22,15 @@ build: - PARALLEL_LEVEL - VERSION_SUFFIX - PROJECT_FLASH - - CCACHE_DIR - - CCACHE_NOHASHDIR - - CCACHE_COMPILERCHECK - CMAKE_GENERATOR - CMAKE_C_COMPILER_LAUNCHER - CMAKE_CXX_COMPILER_LAUNCHER - CMAKE_CUDA_COMPILER_LAUNCHER + - SCCACHE_S3_KEY_PREFIX=libcudf-aarch64 # [aarch64] + - SCCACHE_S3_KEY_PREFIX=libcudf-linux64 # [linux64] + - SCCACHE_BUCKET=rapids-sccache + - SCCACHE_REGION=us-west-2 + - SCCACHE_IDLE_TIMEOUT=32768 run_exports: - {{ pin_subpackage("libcudf", max_pin="x.x") }} diff --git a/cpp/include/cudf/binaryop.hpp b/cpp/include/cudf/binaryop.hpp index daf55c0befe..177fd904b0b 100644 --- a/cpp/include/cudf/binaryop.hpp +++ b/cpp/include/cudf/binaryop.hpp @@ -45,7 +45,7 @@ enum class binary_operator : int32_t { PMOD, ///< positive modulo operator ///< If remainder is negative, this returns (remainder + divisor) % divisor ///< else, it returns (dividend % divisor) - PYMOD, ///< operator % but following python's sign rules for negatives + PYMOD, ///< operator % but following Python's sign rules for negatives POW, ///< lhs ^ rhs LOG_BASE, ///< logarithm to the base ATAN2, ///< 2-argument arctangent diff --git a/cpp/include/cudf/fixed_point/fixed_point.hpp b/cpp/include/cudf/fixed_point/fixed_point.hpp index a7112ae415d..f027e2783b1 100644 --- 
a/cpp/include/cudf/fixed_point/fixed_point.hpp +++ b/cpp/include/cudf/fixed_point/fixed_point.hpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2020-2021, NVIDIA CORPORATION. + * Copyright (c) 2020-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -440,6 +440,21 @@ class fixed_point { CUDF_HOST_DEVICE inline friend fixed_point<Rep1, Rad1> operator/( fixed_point<Rep1, Rad1> const& lhs, fixed_point<Rep1, Rad1> const& rhs); + /** + * @brief operator % (for computing the modulo operation of two `fixed_point` numbers) + * + * If `_scale`s are equal, the modulus is computed directly. + * If `_scale`s are not equal, the number with larger `_scale` is shifted to the + * smaller `_scale`, and then the modulus is computed. + * + * @tparam Rep1 Representation type of number being modulo-ed to `this` + * @tparam Rad1 Radix (base) type of number being modulo-ed to `this` + * @return The resulting `fixed_point` number + */ + template <typename Rep1, Radix Rad1> + CUDF_HOST_DEVICE inline friend fixed_point<Rep1, Rad1> operator%( + fixed_point<Rep1, Rad1> const& lhs, fixed_point<Rep1, Rad1> const& rhs); + /** * @brief operator == (for comparing two `fixed_point` numbers) * @@ -750,6 +765,16 @@ CUDF_HOST_DEVICE inline bool operator>(fixed_point<Rep1, Rad1> const& lhs, return lhs.rescaled(scale)._value > rhs.rescaled(scale)._value; } +// MODULO OPERATION +template <typename Rep1, Radix Rad1> +CUDF_HOST_DEVICE inline fixed_point<Rep1, Rad1> operator%(fixed_point<Rep1, Rad1> const& lhs, + fixed_point<Rep1, Rad1> const& rhs) +{ + auto const scale = std::min(lhs._scale, rhs._scale); + auto const remainder = lhs.rescaled(scale)._value % rhs.rescaled(scale)._value; + return fixed_point<Rep1, Rad1>{scaled_integer<Rep1>{remainder, scale}}; +} + using decimal32 = fixed_point<int32_t, Radix::BASE_10>; using decimal64 = fixed_point<int64_t, Radix::BASE_10>; using decimal128 = fixed_point<__int128_t, Radix::BASE_10>; diff --git a/cpp/scripts/run-clang-format.py b/cpp/scripts/run-clang-format.py index a7c83da22c5..3d462d65fb8 100755 --- a/cpp/scripts/run-clang-format.py +++ b/cpp/scripts/run-clang-format.py @@ -13,7 +13,6 @@ #
limitations under the License. # -from __future__ import print_function import argparse import os @@ -124,9 +123,9 @@ def run_clang_format(src, dst, exe, verbose, inplace): os.makedirs(dstdir) # run the clang format command itself if src == dst: - cmd = "%s -i %s" % (exe, src) + cmd = f"{exe} -i {src}" else: - cmd = "%s %s > %s" % (exe, src, dst) + cmd = f"{exe} {src} > {dst}" try: subprocess.check_call(cmd, shell=True) except subprocess.CalledProcessError: @@ -134,9 +133,9 @@ def run_clang_format(src, dst, exe, verbose, inplace): raise # run the diff to check if there are any formatting issues if inplace: - cmd = "diff -q %s %s >/dev/null" % (src, dst) + cmd = f"diff -q {src} {dst} >/dev/null" else: - cmd = "diff %s %s" % (src, dst) + cmd = f"diff {src} {dst}" try: subprocess.check_call(cmd, shell=True) diff --git a/cpp/scripts/run-clang-tidy.py b/cpp/scripts/run-clang-tidy.py index 3a1a663e231..30e937d7f4d 100644 --- a/cpp/scripts/run-clang-tidy.py +++ b/cpp/scripts/run-clang-tidy.py @@ -13,7 +13,6 @@ # limitations under the License. 
# -from __future__ import print_function import re import os import subprocess @@ -67,7 +66,7 @@ def parse_args(): def get_all_commands(cdb): - with open(cdb, "r") as fp: + with open(cdb) as fp: return json.load(fp) @@ -195,10 +194,10 @@ def collect_result(result): def print_result(passed, stdout, file): status_str = "PASSED" if passed else "FAILED" - print("%s File:%s %s %s" % (SEPARATOR, file, status_str, SEPARATOR)) + print(f"{SEPARATOR} File:{file} {status_str} {SEPARATOR}") if stdout: print(stdout) - print("%s File:%s ENDS %s" % (SEPARATOR, file, SEPARATOR)) + print(f"{SEPARATOR} File:{file} ENDS {SEPARATOR}") def print_results(): diff --git a/cpp/scripts/sort_ninja_log.py b/cpp/scripts/sort_ninja_log.py index 33c369b254f..85eb800879a 100755 --- a/cpp/scripts/sort_ninja_log.py +++ b/cpp/scripts/sort_ninja_log.py @@ -33,7 +33,7 @@ # build a map of the log entries entries = {} -with open(log_file, "r") as log: +with open(log_file) as log: last = 0 files = {} for line in log: diff --git a/cpp/src/binaryop/binaryop.cpp b/cpp/src/binaryop/binaryop.cpp index 5f9ff2574e3..dfa7896c37a 100644 --- a/cpp/src/binaryop/binaryop.cpp +++ b/cpp/src/binaryop/binaryop.cpp @@ -88,7 +88,10 @@ bool is_basic_arithmetic_binop(binary_operator op) op == binary_operator::MUL or // operator * op == binary_operator::DIV or // operator / using common type of lhs and rhs op == binary_operator::NULL_MIN or // 2 null = null, 1 null = value, else min - op == binary_operator::NULL_MAX; // 2 null = null, 1 null = value, else max + op == binary_operator::NULL_MAX or // 2 null = null, 1 null = value, else max + op == binary_operator::MOD or // operator % + op == binary_operator::PMOD or // positive modulo operator + op == binary_operator::PYMOD; // operator % but following Python's negative sign rules } /** diff --git a/cpp/src/binaryop/compiled/operation.cuh b/cpp/src/binaryop/compiled/operation.cuh index 4b5f78dc400..de9d46b6280 100644 --- a/cpp/src/binaryop/compiled/operation.cuh +++ 
b/cpp/src/binaryop/compiled/operation.cuh @@ -162,12 +162,24 @@ struct PMod { if (rem < 0) rem = std::fmod(rem + yconv, yconv); return rem; } + + template <typename TypeLhs, + typename TypeRhs, + std::enable_if_t<(cudf::is_fixed_point<TypeLhs>() and + std::is_same_v<TypeLhs, TypeRhs>)>* = nullptr> + __device__ inline auto operator()(TypeLhs x, TypeRhs y) + { + auto const remainder = x % y; + return remainder.value() < 0 ? (remainder + y) % y : remainder; + } }; struct PyMod { template <typename TypeLhs, typename TypeRhs, - std::enable_if_t<(std::is_integral_v<std::common_type_t<TypeLhs, TypeRhs>>)>* = nullptr> + std::enable_if_t<(std::is_integral_v<std::common_type_t<TypeLhs, TypeRhs>> or + (cudf::is_fixed_point<TypeLhs>() and + std::is_same_v<TypeLhs, TypeRhs>))>* = nullptr> __device__ inline auto operator()(TypeLhs x, TypeRhs y) -> decltype(((x % y) + y) % y) { return ((x % y) + y) % y; diff --git a/cpp/src/binaryop/compiled/util.cpp b/cpp/src/binaryop/compiled/util.cpp index 9481c236142..d8f1eb03a16 100644 --- a/cpp/src/binaryop/compiled/util.cpp +++ b/cpp/src/binaryop/compiled/util.cpp @@ -45,7 +45,11 @@ struct common_type_functor { // Eg. d=t-t return data_type{type_to_id()}; } - return {}; + + // A compiler bug may cause a compilation error when using empty initializer list to construct + // an std::optional object containing no `data_type` value. Therefore, we should explicitly + // return `std::nullopt` instead. + return std::nullopt; } }; template diff --git a/cpp/tests/binaryop/binop-compiled-fixed_point-test.cpp b/cpp/tests/binaryop/binop-compiled-fixed_point-test.cpp index 29905171907..335de93c976 100644 --- a/cpp/tests/binaryop/binop-compiled-fixed_point-test.cpp +++ b/cpp/tests/binaryop/binop-compiled-fixed_point-test.cpp @@ -1,5 +1,5 @@ /* - * Copyright (c) 2021, NVIDIA CORPORATION. + * Copyright (c) 2021-2022, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License.
@@ -33,14 +33,14 @@ namespace cudf::test::binop { template -struct FixedPointCompiledTestBothReps : public cudf::test::BaseFixture { +struct FixedPointCompiledTest : public cudf::test::BaseFixture { }; template using wrapper = cudf::test::fixed_width_column_wrapper; -TYPED_TEST_SUITE(FixedPointCompiledTestBothReps, cudf::test::FixedPointTypes); +TYPED_TEST_SUITE(FixedPointCompiledTest, cudf::test::FixedPointTypes); -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpAdd) { using namespace numeric; using decimalXX = TypeParam; @@ -73,7 +73,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_col, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpMultiply) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpMultiply) { using namespace numeric; using decimalXX = TypeParam; @@ -109,7 +109,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpMultiply) template using fp_wrapper = cudf::test::fixed_point_column_wrapper; -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpMultiply2) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpMultiply2) { using namespace numeric; using decimalXX = TypeParam; @@ -128,7 +128,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpMultiply2) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpDiv) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpDiv) { using namespace numeric; using decimalXX = TypeParam; @@ -147,7 +147,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpDiv) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpDiv2) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpDiv2) { using namespace numeric; using decimalXX = TypeParam; @@ -166,7 +166,7 @@ 
TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpDiv2) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpDiv3) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpDiv3) { using namespace numeric; using decimalXX = TypeParam; @@ -183,7 +183,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpDiv3) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpDiv4) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpDiv4) { using namespace numeric; using decimalXX = TypeParam; @@ -203,7 +203,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpDiv4) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd2) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpAdd2) { using namespace numeric; using decimalXX = TypeParam; @@ -222,7 +222,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd2) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd3) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpAdd3) { using namespace numeric; using decimalXX = TypeParam; @@ -241,7 +241,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd3) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd4) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpAdd4) { using namespace numeric; using decimalXX = TypeParam; @@ -258,7 +258,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd4) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd5) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpAdd5) { using namespace numeric; using decimalXX = TypeParam; @@ -275,7 +275,7 @@ 
TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd5) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd6) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpAdd6) { using namespace numeric; using decimalXX = TypeParam; @@ -294,7 +294,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpAdd6) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected1, result1->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointCast) +TYPED_TEST(FixedPointCompiledTest, FixedPointCast) { using namespace numeric; using decimalXX = TypeParam; @@ -308,7 +308,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointCast) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpMultiplyScalar) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpMultiplyScalar) { using namespace numeric; using decimalXX = TypeParam; @@ -325,7 +325,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpMultiplyScalar) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpSimplePlus) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpSimplePlus) { using namespace numeric; using decimalXX = TypeParam; @@ -344,7 +344,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpSimplePlus) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualSimple) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpEqualSimple) { using namespace numeric; using decimalXX = TypeParam; @@ -361,7 +361,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualSimple) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualSimpleScale0) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpEqualSimpleScale0) { using 
namespace numeric; using decimalXX = TypeParam; @@ -377,7 +377,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualSimpleScale0) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualSimpleScale0Null) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpEqualSimpleScale0Null) { using namespace numeric; using decimalXX = TypeParam; @@ -393,7 +393,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualSimpleScale0Nu CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualSimpleScale2Null) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpEqualSimpleScale2Null) { using namespace numeric; using decimalXX = TypeParam; @@ -409,7 +409,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualSimpleScale2Nu CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualLessGreater) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpEqualLessGreater) { using namespace numeric; using decimalXX = TypeParam; @@ -453,7 +453,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpEqualLessGreater) CUDF_TEST_EXPECT_COLUMNS_EQUAL(true_col, greater_result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpNullMaxSimple) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpNullMaxSimple) { using namespace numeric; using decimalXX = TypeParam; @@ -473,7 +473,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpNullMaxSimple) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpNullMinSimple) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpNullMinSimple) { using namespace numeric; using decimalXX = TypeParam; @@ -493,7 +493,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpNullMinSimple) 
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpNullEqualsSimple) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpNullEqualsSimple) { using namespace numeric; using decimalXX = TypeParam; @@ -510,7 +510,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpNullEqualsSimple) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div) { using namespace numeric; using decimalXX = TypeParam; @@ -526,7 +526,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div2) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div2) { using namespace numeric; using decimalXX = TypeParam; @@ -542,7 +542,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div2) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div3) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div3) { using namespace numeric; using decimalXX = TypeParam; @@ -558,7 +558,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div3) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div4) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div4) { using namespace numeric; using decimalXX = TypeParam; @@ -574,7 +574,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div4) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div6) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div6) { using namespace numeric; using decimalXX = TypeParam; @@ -591,7 +591,7 @@ 
TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div6) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div7) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div7) { using namespace numeric; using decimalXX = TypeParam; @@ -608,7 +608,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div7) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div8) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div8) { using namespace numeric; using decimalXX = TypeParam; @@ -624,7 +624,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div8) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div9) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div9) { using namespace numeric; using decimalXX = TypeParam; @@ -640,7 +640,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div9) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div10) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div10) { using namespace numeric; using decimalXX = TypeParam; @@ -656,7 +656,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div10) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div11) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOp_Div11) { using namespace numeric; using decimalXX = TypeParam; @@ -672,7 +672,7 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOp_Div11) CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); } -TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpThrows) +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpThrows) { using namespace numeric; using decimalXX = TypeParam; @@ 
-684,6 +684,132 @@ TYPED_TEST(FixedPointCompiledTestBothReps, FixedPointBinaryOpThrows) cudf::logic_error); } +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpModSimple) +{ + using namespace numeric; + using decimalXX = TypeParam; + using RepType = device_storage_type_t<decimalXX>; + + auto const lhs = fp_wrapper<RepType>{{-33, -22, -11, 11, 22, 33, 44, 55}, scale_type{-1}}; + auto const rhs = fp_wrapper<RepType>{{10, 10, 10, 10, 10, 10, 10, 10}, scale_type{-1}}; + auto const expected = fp_wrapper<RepType>{{-3, -2, -1, 1, 2, 3, 4, 5}, scale_type{-1}}; + + auto const type = + cudf::binary_operation_fixed_point_output_type(cudf::binary_operator::MOD, + static_cast<cudf::column_view>(lhs).type(), + static_cast<cudf::column_view>(rhs).type()); + auto const result = cudf::binary_operation(lhs, rhs, cudf::binary_operator::MOD, type); + + CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); +} + +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpPModSimple) +{ + using namespace numeric; + using decimalXX = TypeParam; + using RepType = device_storage_type_t<decimalXX>; + + auto const lhs = fp_wrapper<RepType>{{-33, -22, -11, 11, 22, 33, 44, 55}, scale_type{-1}}; + auto const rhs = fp_wrapper<RepType>{{10, 10, 10, 10, 10, 10, 10, 10}, scale_type{-1}}; + auto const expected = fp_wrapper<RepType>{{7, 8, 9, 1, 2, 3, 4, 5}, scale_type{-1}}; + + for (auto const op : {cudf::binary_operator::PMOD, cudf::binary_operator::PYMOD}) { + auto const type = cudf::binary_operation_fixed_point_output_type( + op, static_cast<cudf::column_view>(lhs).type(), static_cast<cudf::column_view>(rhs).type()); + auto const result = cudf::binary_operation(lhs, rhs, op, type); + + CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); + } +} + +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpModSimple2) +{ + using namespace numeric; + using decimalXX = TypeParam; + using RepType = device_storage_type_t<decimalXX>; + + auto const lhs = fp_wrapper<RepType>{{-33, -22, -11, 11, 22, 33, 44, 55}, scale_type{-1}}; + auto const rhs = make_fixed_point_scalar<decimalXX>(10, scale_type{-1}); + auto const expected = fp_wrapper<RepType>{{-3, -2, -1, 1, 2, 3, 4, 5}, scale_type{-1}}; + 
+ auto const type = cudf::binary_operation_fixed_point_output_type( + cudf::binary_operator::MOD, static_cast<cudf::column_view>(lhs).type(), rhs->type()); + auto const result = cudf::binary_operation(lhs, *rhs, cudf::binary_operator::MOD, type); + + CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); +} + +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpPModAndPyModSimple2) +{ + using namespace numeric; + using decimalXX = TypeParam; + using RepType = device_storage_type_t<decimalXX>; + + auto const lhs = fp_wrapper<RepType>{{-33, -22, -11, 11, 22, 33, 44, 55}, scale_type{-1}}; + auto const rhs = make_fixed_point_scalar<decimalXX>(10, scale_type{-1}); + auto const expected = fp_wrapper<RepType>{{7, 8, 9, 1, 2, 3, 4, 5}, scale_type{-1}}; + + for (auto const op : {cudf::binary_operator::PMOD, cudf::binary_operator::PYMOD}) { + auto const type = cudf::binary_operation_fixed_point_output_type( + op, static_cast<cudf::column_view>(lhs).type(), rhs->type()); + auto const result = cudf::binary_operation(lhs, *rhs, op, type); + + CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view()); + } +} + +TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpMod) +{ + using namespace numeric; + using decimalXX = TypeParam; + using RepType = device_storage_type_t<decimalXX>; + auto constexpr N = 1000; + + for (auto scale : {-1, -2, -3}) { + auto const iota = thrust::make_counting_iterator(-500); + auto const lhs = fp_wrapper<RepType>{iota, iota + N, scale_type{-1}}; + auto const rhs = make_fixed_point_scalar<decimalXX>(7, scale_type{scale}); + + auto const factor = static_cast<RepType>(std::pow(10, -1 - scale)); + auto const f = [factor](auto i) { return (i * factor) % 7; }; + auto const exp_iter = cudf::detail::make_counting_transform_iterator(-500, f); + auto const expected = fp_wrapper<RepType>{exp_iter, exp_iter + N, scale_type{scale}}; + + auto const type = cudf::binary_operation_fixed_point_output_type( + cudf::binary_operator::MOD, static_cast<cudf::column_view>(lhs).type(), rhs->type()); + auto const result = cudf::binary_operation(lhs, *rhs, cudf::binary_operator::MOD, type); + 
+    CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view());
+  }
+}
+
+TYPED_TEST(FixedPointCompiledTest, FixedPointBinaryOpPModAndPyMod)
+{
+  using namespace numeric;
+  using decimalXX = TypeParam;
+  using RepType   = device_storage_type_t<decimalXX>;
+  auto constexpr N = 1000;
+
+  for (auto const scale : {-1, -2, -3}) {
+    auto const iota = thrust::make_counting_iterator(-500);
+    auto const lhs  = fp_wrapper<RepType>{iota, iota + N, scale_type{-1}};
+    auto const rhs  = make_fixed_point_scalar<decimalXX>(7, scale_type{scale});
+
+    auto const factor   = static_cast<RepType>(std::pow(10, -1 - scale));
+    auto const f        = [factor](auto i) { return (((i * factor) % 7) + 7) % 7; };
+    auto const exp_iter = cudf::detail::make_counting_transform_iterator(-500, f);
+    auto const expected = fp_wrapper<RepType>{exp_iter, exp_iter + N, scale_type{scale}};
+
+    for (auto const op : {cudf::binary_operator::PMOD, cudf::binary_operator::PYMOD}) {
+      auto const type = cudf::binary_operation_fixed_point_output_type(
+        op, static_cast<cudf::column_view>(lhs).type(), rhs->type());
+      auto const result = cudf::binary_operation(lhs, *rhs, op, type);
+
+      CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view());
+    }
+  }
+}
+
 template <typename T>
 struct FixedPointTest_64_128_Reps : public cudf::test::BaseFixture {
 };
diff --git a/docs/cudf/source/conf.py b/docs/cudf/source/conf.py
index 3d6d3ceb399..60704f3e6ae 100644
--- a/docs/cudf/source/conf.py
+++ b/docs/cudf/source/conf.py
@@ -1,6 +1,4 @@
 #!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-#
 # Copyright (c) 2018-2021, NVIDIA CORPORATION.
# # cudf documentation build configuration file, created by @@ -118,17 +116,6 @@ html_theme = "pydata_sphinx_theme" html_logo = "_static/RAPIDS-logo-purple.png" -# on_rtd is whether we are on readthedocs.org -on_rtd = os.environ.get("READTHEDOCS", None) == "True" - -if not on_rtd: - # only import and set the theme if we're building docs locally - # otherwise, readthedocs.org uses their theme by default, - # so no need to specify it - import pydata_sphinx_theme - - html_theme = "pydata_sphinx_theme" - html_theme_path = pydata_sphinx_theme.get_html_theme_path() # Theme options are theme-specific and customize the look and feel of a theme diff --git a/java/src/main/java/ai/rapids/cudf/Aggregation128Utils.java b/java/src/main/java/ai/rapids/cudf/Aggregation128Utils.java new file mode 100644 index 00000000000..9a0ac709e3e --- /dev/null +++ b/java/src/main/java/ai/rapids/cudf/Aggregation128Utils.java @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package ai.rapids.cudf; + +/** + * Utility methods for breaking apart and reassembling 128-bit values during aggregations + * to enable hash-based aggregations and detect overflows. + */ +public class Aggregation128Utils { + static { + NativeDepsLoader.loadNativeDeps(); + } + + /** + * Extract a 32-bit chunk from a 128-bit value. 
+ * @param col column of 128-bit values (e.g.: DECIMAL128) + * @param outType integer type to use for the output column (e.g.: UINT32 or INT32) + * @param chunkIdx index of the 32-bit chunk to extract where 0 is the least significant chunk + * and 3 is the most significant chunk + * @return column containing the specified 32-bit chunk of the input column values. A null input + * row will result in a corresponding null output row. + */ + public static ColumnVector extractInt32Chunk(ColumnView col, DType outType, int chunkIdx) { + return new ColumnVector(extractInt32Chunk(col.getNativeView(), + outType.getTypeId().getNativeId(), chunkIdx)); + } + + /** + * Reassemble a column of 128-bit values from a table of four 64-bit integer columns and check + * for overflow. The 128-bit value is reconstructed by overlapping the 64-bit values by 32-bits. + * The least significant 32-bits of the least significant 64-bit value are used directly as the + * least significant 32-bits of the final 128-bit value, and the remaining 32-bits are added to + * the next most significant 64-bit value. The lower 32-bits of that sum become the next most + * significant 32-bits in the final 128-bit value, and the remaining 32-bits are added to the + * next most significant 64-bit input value, and so on. + * + * @param chunks table of four 64-bit integer columns with the columns ordered from least + * significant to most significant. The last column must be of type INT64. + * @param type the type to use for the resulting 128-bit value column + * @return table containing a boolean column and a 128-bit value column of the requested type. + * The boolean value will be true if an overflow was detected for that row's value when + * it was reassembled. A null input row will result in a corresponding null output row. 
+ */ + public static Table combineInt64SumChunks(Table chunks, DType type) { + return new Table(combineInt64SumChunks(chunks.getNativeView(), + type.getTypeId().getNativeId(), + type.getScale())); + } + + private static native long extractInt32Chunk(long columnView, int outTypeId, int chunkIdx); + + private static native long[] combineInt64SumChunks(long chunksTableView, int dtype, int scale); +} diff --git a/java/src/main/native/CMakeLists.txt b/java/src/main/native/CMakeLists.txt index 00747efff27..ffbeeb155e0 100755 --- a/java/src/main/native/CMakeLists.txt +++ b/java/src/main/native/CMakeLists.txt @@ -1,5 +1,5 @@ # ============================================================================= -# Copyright (c) 2019-2021, NVIDIA CORPORATION. +# Copyright (c) 2019-2022, NVIDIA CORPORATION. # # Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except # in compliance with the License. You may obtain a copy of the License at @@ -219,7 +219,7 @@ endif() add_library( cudfjni SHARED - src/row_conversion.cu + src/Aggregation128UtilsJni.cpp src/AggregationJni.cpp src/CudfJni.cpp src/CudaJni.cpp @@ -236,7 +236,9 @@ add_library( src/RmmJni.cpp src/ScalarJni.cpp src/TableJni.cpp + src/aggregation128_utils.cu src/map_lookup.cu + src/row_conversion.cu src/check_nvcomp_output_sizes.cu ) diff --git a/java/src/main/native/src/Aggregation128UtilsJni.cpp b/java/src/main/native/src/Aggregation128UtilsJni.cpp new file mode 100644 index 00000000000..71c36cb724a --- /dev/null +++ b/java/src/main/native/src/Aggregation128UtilsJni.cpp @@ -0,0 +1,47 @@ +/* + * Copyright (c) 2022, NVIDIA CORPORATION. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "aggregation128_utils.hpp"
+#include "cudf_jni_apis.hpp"
+#include "dtype_utils.hpp"
+
+extern "C" {
+
+JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_Aggregation128Utils_extractInt32Chunk(
+    JNIEnv *env, jclass, jlong j_column_view, jint j_out_dtype, jint j_chunk_idx) {
+  JNI_NULL_CHECK(env, j_column_view, "column is null", 0);
+  try {
+    cudf::jni::auto_set_device(env);
+    auto cview = reinterpret_cast<cudf::column_view const *>(j_column_view);
+    auto dtype = cudf::jni::make_data_type(j_out_dtype, 0);
+    return cudf::jni::release_as_jlong(cudf::jni::extract_chunk32(*cview, dtype, j_chunk_idx));
+  }
+  CATCH_STD(env, 0);
+}
+
+JNIEXPORT jlongArray JNICALL Java_ai_rapids_cudf_Aggregation128Utils_combineInt64SumChunks(
+    JNIEnv *env, jclass, jlong j_table_view, jint j_dtype, jint j_scale) {
+  JNI_NULL_CHECK(env, j_table_view, "table is null", 0);
+  try {
+    cudf::jni::auto_set_device(env);
+    auto tview = reinterpret_cast<cudf::table_view const *>(j_table_view);
+    std::unique_ptr<cudf::table> result =
+        cudf::jni::assemble128_from_sum(*tview, cudf::jni::make_data_type(j_dtype, j_scale));
+    return cudf::jni::convert_table_for_return(env, result);
+  }
+  CATCH_STD(env, 0);
+}
+}
diff --git a/java/src/main/native/src/ColumnVectorJni.cpp b/java/src/main/native/src/ColumnVectorJni.cpp
index 0e559ad0403..f01d832eb19 100644
--- a/java/src/main/native/src/ColumnVectorJni.cpp
+++ b/java/src/main/native/src/ColumnVectorJni.cpp
@@ -252,8 +252,8 @@ JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnVector_makeListFromOffsets(
   JNI_NULL_CHECK(env, offsets_handle, "offsets_handle is null", 0)
   try {
     cudf::jni::auto_set_device(env);
-    auto const *child_cv = reinterpret_cast<cudf::column_view const *>(child_handle);
-    auto const *offsets_cv = reinterpret_cast<cudf::column_view const *>(offsets_handle);
+    auto const child_cv = reinterpret_cast<cudf::column_view const *>(child_handle);
+    auto const offsets_cv = reinterpret_cast<cudf::column_view const *>(offsets_handle);
     CUDF_EXPECTS(offsets_cv->type().id() == cudf::type_id::INT32,
                  "Input offsets does not have type INT32.");
diff --git a/java/src/main/native/src/ColumnViewJni.cpp b/java/src/main/native/src/ColumnViewJni.cpp
index 63247eb0066..eec4a78a457 100644
--- a/java/src/main/native/src/ColumnViewJni.cpp
+++ b/java/src/main/native/src/ColumnViewJni.cpp
@@ -408,7 +408,7 @@ JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_dropListDuplicatesWithKey
   JNI_NULL_CHECK(env, keys_vals_handle, "keys_vals_handle is null", 0);
   try {
     cudf::jni::auto_set_device(env);
-    auto const *input_cv = reinterpret_cast<cudf::column_view const *>(keys_vals_handle);
+    auto const input_cv = reinterpret_cast<cudf::column_view const *>(keys_vals_handle);
     CUDF_EXPECTS(input_cv->offset() == 0, "Input column has non-zero offset.");
     CUDF_EXPECTS(input_cv->type().id() == cudf::type_id::LIST,
                  "Input column is not a lists column.");
@@ -460,7 +460,8 @@ JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_dropListDuplicatesWithKey
   auto out_structs =
       cudf::make_structs_column(out_child_size, std::move(out_structs_members), 0, {});
   return release_as_jlong(cudf::make_lists_column(input_cv->size(), std::move(out_offsets),
-                                                  std::move(out_structs), 0, {}));
+                                                  std::move(out_structs), input_cv->null_count(),
+                                                  cudf::copy_bitmask(*input_cv)));
   }
   CATCH_STD(env, 0);
 }
diff --git a/java/src/main/native/src/aggregation128_utils.cu b/java/src/main/native/src/aggregation128_utils.cu
new file mode 100644
index 00000000000..865f607ff7d
--- /dev/null
+++ b/java/src/main/native/src/aggregation128_utils.cu
@@ -0,0 +1,127 @@
+/*
+ * Copyright (c) 2022, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cstddef>
+#include <memory>
+#include <vector>
+
+#include <cudf/column/column_factories.hpp>
+#include <cudf/copying.hpp>
+#include <cudf/table/table.hpp>
+#include <cudf/types.hpp>
+#include <cudf/utilities/error.hpp>
+#include <rmm/cuda_stream_view.hpp>
+#include <rmm/exec_policy.hpp>
+
+#include "aggregation128_utils.hpp"
+
+namespace {
+
+// Functor to reassemble a 128-bit value from four 64-bit chunks with overflow detection.
+class chunk_assembler : public thrust::unary_function<cudf::size_type, __int128_t> {
+public:
+  chunk_assembler(bool *overflows, uint64_t const *chunks0, uint64_t const *chunks1,
+                  uint64_t const *chunks2, int64_t const *chunks3)
+      : overflows(overflows), chunks0(chunks0), chunks1(chunks1), chunks2(chunks2),
+        chunks3(chunks3) {}
+
+  __device__ __int128_t operator()(cudf::size_type i) const {
+    // Starting with the least significant input and moving to the most significant, propagate the
+    // upper 32-bits of the previous column into the next column, i.e.: propagate the "carry" bits
+    // of each 64-bit chunk into the next chunk.
+    uint64_t const c0 = chunks0[i];
+    uint64_t const c1 = chunks1[i] + (c0 >> 32);
+    uint64_t const c2 = chunks2[i] + (c1 >> 32);
+    int64_t const c3 = chunks3[i] + (c2 >> 32);
+    uint64_t const lower64 = (c1 << 32) | static_cast<uint32_t>(c0);
+    int64_t const upper64 = (c3 << 32) | static_cast<uint32_t>(c2);
+
+    // check for overflow by ensuring the sign bit matches the top carry bits
+    int32_t const replicated_sign_bit = static_cast<int32_t>(c3) >> 31;
+    int32_t const top_carry_bits = static_cast<int32_t>(c3 >> 32);
+    overflows[i] = (replicated_sign_bit != top_carry_bits);
+
+    return (static_cast<__int128_t>(upper64) << 64) | lower64;
+  }
+
+private:
+  // output column for overflow detected
+  bool *const overflows;
+
+  // input columns for the four 64-bit values
+  uint64_t const *const chunks0;
+  uint64_t const *const chunks1;
+  uint64_t const *const chunks2;
+  int64_t const *const chunks3;
+};
+
+} // anonymous namespace
+
+namespace cudf::jni {
+
+// Extract a 32-bit chunk from a 128-bit value.
+std::unique_ptr<cudf::column> extract_chunk32(cudf::column_view const &in_col,
+                                              cudf::data_type type, int chunk_idx,
+                                              rmm::cuda_stream_view stream) {
+  CUDF_EXPECTS(in_col.type().id() == cudf::type_id::DECIMAL128, "not a 128-bit type");
+  CUDF_EXPECTS(chunk_idx >= 0 && chunk_idx < 4, "invalid chunk index");
+  CUDF_EXPECTS(type.id() == cudf::type_id::INT32 || type.id() == cudf::type_id::UINT32,
+               "not a 32-bit integer type");
+  auto const num_rows = in_col.size();
+  auto out_col = cudf::make_fixed_width_column(type, num_rows, copy_bitmask(in_col));
+  auto out_view = out_col->mutable_view();
+  auto const in_begin = in_col.begin<int32_t>();
+
+  // Build an iterator for every fourth 32-bit value, i.e.: one "chunk" of a __int128_t value
+  thrust::transform_iterator transform_iter{thrust::counting_iterator{0},
+                                            [] __device__(auto i) { return i * 4; }};
+  thrust::permutation_iterator stride_iter{in_begin + chunk_idx, transform_iter};
+
+  thrust::copy(rmm::exec_policy(stream), stride_iter, stride_iter + num_rows,
+               out_view.data<int32_t>());
+
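As an aside on the strided copy above: chunk `k` of a 128-bit value is simply bits `[32*k, 32*k + 32)` of its two's-complement pattern, least significant chunk first, which is exactly what reading every fourth 32-bit word yields on a little-endian layout. An illustrative host-side Python sketch (not part of the patch):

```python
# Illustrative host-side model of extract_chunk32: pull 32-bit chunk k out of
# each (signed) 128-bit value, with chunk 0 the least significant.
def extract_chunk32(values, chunk_idx):
    assert 0 <= chunk_idx < 4
    mask128 = (1 << 128) - 1  # two's-complement bit pattern of the value
    return [((v & mask128) >> (32 * chunk_idx)) & 0xFFFFFFFF for v in values]

chunks = [extract_chunk32([0x00000001_00000002_00000003_00000004], k)[0] for k in range(4)]
# chunks == [4, 3, 2, 1]
```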
return out_col; +} + +// Reassemble a column of 128-bit values from four 64-bit integer columns with overflow detection. +std::unique_ptr assemble128_from_sum(cudf::table_view const &chunks_table, + cudf::data_type output_type, + rmm::cuda_stream_view stream) { + CUDF_EXPECTS(output_type.id() == cudf::type_id::DECIMAL128, "not a 128-bit type"); + CUDF_EXPECTS(chunks_table.num_columns() == 4, "must be 4 column table"); + auto const num_rows = chunks_table.num_rows(); + auto const chunks0 = chunks_table.column(0); + auto const chunks1 = chunks_table.column(1); + auto const chunks2 = chunks_table.column(2); + auto const chunks3 = chunks_table.column(3); + CUDF_EXPECTS(cudf::size_of(chunks0.type()) == 8 && cudf::size_of(chunks1.type()) == 8 && + cudf::size_of(chunks2.type()) == 8 && + chunks3.type().id() == cudf::type_id::INT64, + "chunks type mismatch"); + std::vector> columns; + columns.push_back(cudf::make_fixed_width_column(cudf::data_type{cudf::type_id::BOOL8}, num_rows, + copy_bitmask(chunks0))); + columns.push_back(cudf::make_fixed_width_column(output_type, num_rows, copy_bitmask(chunks0))); + auto overflows_view = columns[0]->mutable_view(); + auto assembled_view = columns[1]->mutable_view(); + thrust::transform(rmm::exec_policy(stream), thrust::make_counting_iterator(0), + thrust::make_counting_iterator(num_rows), + assembled_view.begin<__int128_t>(), + chunk_assembler(overflows_view.begin(), chunks0.begin(), + chunks1.begin(), chunks2.begin(), + chunks3.begin())); + return std::make_unique(std::move(columns)); +} + +} // namespace cudf::jni diff --git a/java/src/main/native/src/aggregation128_utils.hpp b/java/src/main/native/src/aggregation128_utils.hpp new file mode 100644 index 00000000000..30c1032b795 --- /dev/null +++ b/java/src/main/native/src/aggregation128_utils.hpp @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2022, NVIDIA CORPORATION. 
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <memory>
+
+#include <cudf/column/column_view.hpp>
+#include <cudf/table/table_view.hpp>
+#include <rmm/cuda_stream_view.hpp>
+
+namespace cudf::jni {
+
+/**
+ * @brief Extract a 32-bit integer column from a column of 128-bit values.
+ *
+ * Given a 128-bit input column, a 32-bit integer column is returned corresponding to
+ * the index of which 32-bit chunk of the original 128-bit values to extract.
+ * 0 corresponds to the least significant chunk, and 3 corresponds to the most
+ * significant chunk.
+ *
+ * A null input row will result in a corresponding null output row.
+ *
+ * @param col Column of 128-bit values
+ * @param dtype Integer type to use for the output column (e.g.: UINT32 or INT32)
+ * @param chunk_idx Index of the 32-bit chunk to extract
+ * @param stream CUDA stream to use
+ * @return A column containing the extracted 32-bit integer values
+ */
+std::unique_ptr<cudf::column>
+extract_chunk32(cudf::column_view const &col, cudf::data_type dtype, int chunk_idx,
+                rmm::cuda_stream_view stream = rmm::cuda_stream_default);
+
+/**
+ * @brief Reassemble a 128-bit column from four 64-bit integer columns with overflow detection.
+ *
+ * The 128-bit value is reconstructed by overlapping the 64-bit values by 32-bits. The least
+ * significant 32-bits of the least significant 64-bit value are used directly as the least
+ * significant 32-bits of the final 128-bit value, and the remaining 32-bits are added to the next
+ * most significant 64-bit value. The lower 32-bits of that sum become the next most significant
+ * 32-bits in the final 128-bit value, and the remaining 32-bits are added to the next most
+ * significant 64-bit input value, and so on.
+ *
+ * A null input row will result in a corresponding null output row.
+ *
+ * @param chunks_table Table of four 64-bit integer columns with the columns ordered from least
+ *                     significant to most significant. The last column must be an INT64 column.
+ * @param output_type The type to use for the resulting 128-bit value column
+ * @param stream CUDA stream to use
+ * @return Table containing a boolean column and a 128-bit value column of the
+ *         requested type. The boolean value will be true if an overflow was detected
+ *         for that row's value.
+ */
+std::unique_ptr<cudf::table>
+assemble128_from_sum(cudf::table_view const &chunks_table, cudf::data_type output_type,
+                     rmm::cuda_stream_view stream = rmm::cuda_stream_default);
+
+} // namespace cudf::jni
diff --git a/java/src/test/java/ai/rapids/cudf/Aggregation128UtilsTest.java b/java/src/test/java/ai/rapids/cudf/Aggregation128UtilsTest.java
new file mode 100644
index 00000000000..11e2aff7259
--- /dev/null
+++ b/java/src/test/java/ai/rapids/cudf/Aggregation128UtilsTest.java
@@ -0,0 +1,80 @@
+/*
+ * Copyright (c) 2022, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +package ai.rapids.cudf; + +import org.junit.jupiter.api.Test; + +import java.math.BigInteger; + +public class Aggregation128UtilsTest extends CudfTestBase { + @Test + public void testExtractInt32Chunks() { + BigInteger[] intvals = new BigInteger[] { + null, + new BigInteger("123456789abcdef0f0debc9a78563412", 16), + new BigInteger("123456789abcdef0f0debc9a78563412", 16), + new BigInteger("123456789abcdef0f0debc9a78563412", 16), + null + }; + try (ColumnVector cv = ColumnVector.decimalFromBigInt(-38, intvals); + ColumnVector chunk1 = Aggregation128Utils.extractInt32Chunk(cv, DType.UINT32, 0); + ColumnVector chunk2 = Aggregation128Utils.extractInt32Chunk(cv, DType.UINT32, 1); + ColumnVector chunk3 = Aggregation128Utils.extractInt32Chunk(cv, DType.UINT32, 2); + ColumnVector chunk4 = Aggregation128Utils.extractInt32Chunk(cv, DType.INT32, 3); + Table actualChunks = new Table(chunk1, chunk2, chunk3, chunk4); + ColumnVector expectedChunk1 = ColumnVector.fromBoxedUnsignedInts( + null, 0x78563412, 0x78563412, 0x78563412, null); + ColumnVector expectedChunk2 = ColumnVector.fromBoxedUnsignedInts( + null, -0x0f214366, -0x0f214366, -0x0f214366, null); + ColumnVector expectedChunk3 = ColumnVector.fromBoxedUnsignedInts( + null, -0x65432110, -0x65432110, -0x65432110, null); + ColumnVector expectedChunk4 = ColumnVector.fromBoxedInts( + null, 0x12345678, 0x12345678, 0x12345678, null); + Table expectedChunks = new Table(expectedChunk1, expectedChunk2, expectedChunk3, expectedChunk4)) { + AssertUtils.assertTablesAreEqual(expectedChunks, actualChunks); + } + } + + @Test + public void testCombineInt64SumChunks() { + try (ColumnVector chunks0 = ColumnVector.fromBoxedUnsignedLongs( + null, 0L, 1L, 0L, 0L, 0x12345678L, 0x123456789L, 0x1234567812345678L, 0xfedcba9876543210L); + ColumnVector chunks1 = ColumnVector.fromBoxedUnsignedLongs( + null, 0L, 2L, 0L, 0L, 0x9abcdef0L, 0x9abcdef01L, 0x1122334455667788L, 0xaceaceaceaceaceaL); + ColumnVector chunks2 = 
ColumnVector.fromBoxedUnsignedLongs( + null, 0L, 3L, 0L, 0L, 0x11223344L, 0x556677889L, 0x99aabbccddeeff00L, 0xbdfbdfbdfbdfbdfbL); + ColumnVector chunks3 = ColumnVector.fromBoxedLongs( + null, 0L, -1L, 0x100000000L, 0x80000000L, 0x55667788L, 0x01234567L, 0x66554434L, -0x42042043L); + Table chunksTable = new Table(chunks0, chunks1, chunks2, chunks3); + Table actual = Aggregation128Utils.combineInt64SumChunks(chunksTable, DType.create(DType.DTypeEnum.DECIMAL128, -20)); + ColumnVector expectedOverflows = ColumnVector.fromBoxedBooleans( + null, false, false, true, true, false, false, true, false); + ColumnVector expectedValues = ColumnVector.decimalFromBigInt(-20, + null, + new BigInteger("0", 16), + new BigInteger("-fffffffcfffffffdffffffff", 16), + new BigInteger("0", 16), + new BigInteger("-80000000000000000000000000000000", 16), + new BigInteger("55667788112233449abcdef012345678", 16), + new BigInteger("123456c56677892abcdef0223456789", 16), + new BigInteger("ef113244679ace0012345678", 16), + new BigInteger("7bf7bf7ba8ca8ca8e9ab678276543210", 16)); + Table expected = new Table(expectedOverflows, expectedValues)) { + AssertUtils.assertTablesAreEqual(expected, actual); + } + } +} diff --git a/java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java b/java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java index 8f39c3c51ce..f9c8029ed84 100644 --- a/java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java +++ b/java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java @@ -4380,12 +4380,14 @@ void testDropListDuplicatesWithKeysValues() { 3, 4, 5, // list2 null, 0, 6, 6, 0, // list3 null, 6, 7, null, 7 // list 4 + // list5 (empty) ); ColumnVector inputChildVals = ColumnVector.fromBoxedInts( 10, 20, // list1 30, 40, 50, // list2 60, 70, 80, 90, 100, // list3 110, 120, 130, 140, 150 // list4 + // list5 (empty) ); ColumnVector inputStructsKeysVals = ColumnVector.makeStruct(inputChildKeys, inputChildVals); ColumnVector inputOffsets = ColumnVector.fromInts(0, 2, 5, 10, 15, 15); 
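The expected values in `testCombineInt64SumChunks` above can be cross-checked on the host. A minimal Python sketch of the same carry-propagating reassembly and overflow check (illustrative only, not part of the patch):

```python
# Illustrative host-side model of combineInt64SumChunks: propagate the carry
# (upper 32 bits) of each 64-bit partial sum into the next chunk, overlap the
# low 32 bits of each, and flag overflow when the bits above bit 127 disagree
# with the sign bit of the reassembled value.
M64 = (1 << 64) - 1

def to_i64(x):
    # Reinterpret a 64-bit pattern as a signed integer.
    x &= M64
    return x - (1 << 64) if x >> 63 else x

def combine_chunks(c0, c1, c2, c3):
    c1 = (c1 + (c0 >> 32)) & M64
    c2 = (c2 + (c1 >> 32)) & M64
    c3 = to_i64(c3 + (c2 >> 32))            # final chunk is signed, wraps at 64 bits
    bits = ((c3 & 0xFFFFFFFF) << 96) | ((c2 & 0xFFFFFFFF) << 64) \
           | ((c1 & 0xFFFFFFFF) << 32) | (c0 & 0xFFFFFFFF)
    value = bits - (1 << 128) if bits >> 127 else bits
    sign = -1 if (c3 >> 31) & 1 else 0      # replicate bit 31 of the final chunk
    overflow = (c3 >> 32) != sign           # compare against the top carry bits
    return value, overflow

# Rows mirrored from testCombineInt64SumChunks:
assert combine_chunks(1, 2, 3, -1) == (-0xfffffffcfffffffdffffffff, False)
assert combine_chunks(0, 0, 0, 0x100000000) == (0, True)
assert combine_chunks(0x12345678, 0x9abcdef0, 0x11223344, 0x55667788) == \
       (0x55667788112233449abcdef012345678, False)
```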
@@ -4402,7 +4404,8 @@ void testDropListDuplicatesWithKeysValues() { 10, 20, 30, 40, 50, 100, 90, 60, - 120, 150, 140); + 120, 150, 140 + ); ColumnVector expectedStructsKeysVals = ColumnVector.makeStruct(expectedChildKeys, expectedChildVals); ColumnVector expectedOffsets = ColumnVector.fromInts(0, 2, 5, 8, 11, 11); @@ -4416,6 +4419,60 @@ void testDropListDuplicatesWithKeysValues() { } } + @Test + void testDropListDuplicatesWithKeysValuesNullable() { + try(ColumnVector inputChildKeys = ColumnVector.fromBoxedInts( + 1, 2, // list1 + // list2 (null) + 3, 4, 5, // list3 + null, 0, 6, 6, 0, // list4 + null, 6, 7, null, 7 // list 5 + // list6 (null) + ); + ColumnVector inputChildVals = ColumnVector.fromBoxedInts( + 10, 20, // list1 + // list2 (null) + 30, 40, 50, // list3 + 60, 70, 80, 90, 100, // list4 + 110, 120, 130, 140, 150 // list5 + // list6 (null) + ); + ColumnVector inputStructsKeysVals = ColumnVector.makeStruct(inputChildKeys, inputChildVals); + ColumnVector inputOffsets = ColumnVector.fromInts(0, 2, 2, 5, 10, 15, 15); + ColumnVector tmpInputListsKeysVals = inputStructsKeysVals.makeListFromOffsets(6,inputOffsets); + ColumnVector templateBitmask = ColumnVector.fromBoxedInts(1, null, 1, 1, 1, null); + ColumnVector inputListsKeysVals = tmpInputListsKeysVals.mergeAndSetValidity(BinaryOp.BITWISE_AND, templateBitmask); + + ColumnVector expectedChildKeys = ColumnVector.fromBoxedInts( + 1, 2, // list1 + // list2 (null) + 3, 4, 5, // list3 + 0, 6, null, // list4 + 6, 7, null // list5 + // list6 (null) + ); + ColumnVector expectedChildVals = ColumnVector.fromBoxedInts( + 10, 20, // list1 + // list2 (null) + 30, 40, 50, // list3 + 100, 90, 60, // list4 + 120, 150, 140 // list5 + // list6 (null) + ); + ColumnVector expectedStructsKeysVals = ColumnVector.makeStruct(expectedChildKeys, + expectedChildVals); + ColumnVector expectedOffsets = ColumnVector.fromInts(0, 2, 2, 5, 8, 11, 11); + ColumnVector tmpExpectedListsKeysVals = expectedStructsKeysVals.makeListFromOffsets(6, + 
expectedOffsets); + ColumnVector expectedListsKeysVals = tmpExpectedListsKeysVals.mergeAndSetValidity(BinaryOp.BITWISE_AND, templateBitmask); + + ColumnVector output = inputListsKeysVals.dropListDuplicatesWithKeysValues(); + ColumnVector sortedOutput = output.listSortRows(false, false); + ) { + assertColumnsAreEqual(expectedListsKeysVals, sortedOutput); + } + } + @SafeVarargs private static ColumnVector makeListsColumn(DType childDType, List... rows) { HostColumnVector.DataType childType = new HostColumnVector.BasicType(true, childDType); @@ -4716,7 +4773,7 @@ void testStringSplit() { Table resultSplitOnce = v.stringSplit(pattern, 1); Table resultSplitAll = v.stringSplit(pattern)) { assertTablesAreEqual(expectedSplitOnce, resultSplitOnce); - assertTablesAreEqual(expectedSplitAll, resultSplitAll); + assertTablesAreEqual(expectedSplitAll, resultSplitAll); } } @@ -6068,7 +6125,7 @@ void testCopyWithBooleanColumnAsValidity() { } // Negative case: Mismatch in row count. - Exception x = assertThrows(CudfException.class, () -> { + Exception x = assertThrows(CudfException.class, () -> { try (ColumnVector exemplar = ColumnVector.fromBoxedInts(1, 2, 3, 4, 5, 6, 7, 8, 9, 10); ColumnVector validity = ColumnVector.fromBoxedBooleans(F, T, F, T); ColumnVector result = exemplar.copyWithBooleanColumnAsValidity(validity)) { diff --git a/python/cudf/cudf/_fuzz_testing/fuzzer.py b/python/cudf/cudf/_fuzz_testing/fuzzer.py index 484b3fb26f4..a51a5073510 100644 --- a/python/cudf/cudf/_fuzz_testing/fuzzer.py +++ b/python/cudf/cudf/_fuzz_testing/fuzzer.py @@ -14,7 +14,7 @@ ) -class Fuzzer(object): +class Fuzzer: def __init__( self, target, diff --git a/python/cudf/cudf/_fuzz_testing/io.py b/python/cudf/cudf/_fuzz_testing/io.py index 193fb4c7f7f..dfc59a1f18d 100644 --- a/python/cudf/cudf/_fuzz_testing/io.py +++ b/python/cudf/cudf/_fuzz_testing/io.py @@ -16,7 +16,7 @@ ) -class IOFuzz(object): +class IOFuzz: def __init__( self, dirs=None, @@ -59,7 +59,7 @@ def __init__( self._current_buffer = 
None def _load_params(self, path): - with open(path, "r") as f: + with open(path) as f: params = json.load(f) self._inputs.append(params) diff --git a/python/cudf/cudf/_fuzz_testing/main.py b/python/cudf/cudf/_fuzz_testing/main.py index 7b28a4c4970..6b536fc3e2e 100644 --- a/python/cudf/cudf/_fuzz_testing/main.py +++ b/python/cudf/cudf/_fuzz_testing/main.py @@ -3,7 +3,7 @@ from cudf._fuzz_testing import fuzzer -class PythonFuzz(object): +class PythonFuzz: def __init__(self, func, params=None, data_handle=None, **kwargs): self.function = func self.data_handler_class = data_handle diff --git a/python/cudf/cudf/_version.py b/python/cudf/cudf/_version.py index a511ab98acf..c6281349c50 100644 --- a/python/cudf/cudf/_version.py +++ b/python/cudf/cudf/_version.py @@ -86,7 +86,7 @@ def run_command( stderr=(subprocess.PIPE if hide_stderr else None), ) break - except EnvironmentError: + except OSError: e = sys.exc_info()[1] if e.errno == errno.ENOENT: continue @@ -96,7 +96,7 @@ def run_command( return None, None else: if verbose: - print("unable to find command, tried %s" % (commands,)) + print(f"unable to find command, tried {commands}") return None, None stdout = p.communicate()[0].strip() if sys.version_info[0] >= 3: @@ -149,7 +149,7 @@ def git_get_keywords(versionfile_abs): # _version.py. 
keywords = {} try: - f = open(versionfile_abs, "r") + f = open(versionfile_abs) for line in f.readlines(): if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) @@ -164,7 +164,7 @@ def git_get_keywords(versionfile_abs): if mo: keywords["date"] = mo.group(1) f.close() - except EnvironmentError: + except OSError: pass return keywords @@ -188,11 +188,11 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") - refs = set([r.strip() for r in refnames.strip("()").split(",")]) + refs = {r.strip() for r in refnames.strip("()").split(",")} # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " - tags = set([r[len(TAG) :] for r in refs if r.startswith(TAG)]) + tags = {r[len(TAG) :] for r in refs if r.startswith(TAG)} if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. The old git %d @@ -201,7 +201,7 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". 
- tags = set([r for r in refs if re.search(r"\d", r)]) + tags = {r for r in refs if re.search(r"\d", r)} if verbose: print("discarding '%s', no digits" % ",".join(refs - tags)) if verbose: @@ -308,10 +308,9 @@ def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) - pieces["error"] = "tag '%s' doesn't start with prefix '%s'" % ( - full_tag, - tag_prefix, - ) + pieces[ + "error" + ] = f"tag '{full_tag}' doesn't start with prefix '{tag_prefix}'" return pieces pieces["closest-tag"] = full_tag[len(tag_prefix) :] diff --git a/python/cudf/cudf/comm/gpuarrow.py b/python/cudf/cudf/comm/gpuarrow.py index b6089b65aa5..7879261139d 100644 --- a/python/cudf/cudf/comm/gpuarrow.py +++ b/python/cudf/cudf/comm/gpuarrow.py @@ -58,7 +58,7 @@ def to_dict(self): return dc -class GpuArrowNodeReader(object): +class GpuArrowNodeReader: def __init__(self, table, index): self._table = table self._field = table.schema[index] diff --git a/python/cudf/cudf/core/_base_index.py b/python/cudf/cudf/core/_base_index.py index 6569184e90b..2e6f138d2e3 100644 --- a/python/cudf/cudf/core/_base_index.py +++ b/python/cudf/cudf/core/_base_index.py @@ -1,9 +1,8 @@ # Copyright (c) 2021, NVIDIA CORPORATION. -from __future__ import annotations, division, print_function +from __future__ import annotations import pickle -import warnings from typing import Any, Set import pandas as pd @@ -1350,28 +1349,6 @@ def isin(self, values): return self._values.isin(values).values - def memory_usage(self, deep=False): - """ - Memory usage of the values. - - Parameters - ---------- - deep : bool - Introspect the data deeply, - interrogate `object` dtypes for system-level - memory consumption. - - Returns - ------- - bytes used - """ - if deep: - warnings.warn( - "The deep parameter is ignored and is only included " - "for pandas compatibility." 
- ) - return self._values.memory_usage() - @classmethod def from_pandas(cls, index, nan_as_null=None): """ diff --git a/python/cudf/cudf/core/column/column.py b/python/cudf/cudf/core/column/column.py index 19313dd3fe2..2c3951c0e5e 100644 --- a/python/cudf/cudf/core/column/column.py +++ b/python/cudf/cudf/core/column/column.py @@ -77,12 +77,12 @@ pandas_dtypes_alias_to_cudf_alias, pandas_dtypes_to_np_dtypes, ) -from cudf.utils.utils import mask_dtype +from cudf.utils.utils import NotIterable, mask_dtype T = TypeVar("T", bound="ColumnBase") -class ColumnBase(Column, Serializable): +class ColumnBase(Column, Serializable, NotIterable): def as_frame(self) -> "cudf.core.frame.Frame": """ Converts a Column to Frame @@ -130,9 +130,6 @@ def to_pandas(self, index: pd.Index = None, **kwargs) -> "pd.Series": pd_series.index = index return pd_series - def __iter__(self): - cudf.utils.utils.raise_iteration_error(obj=self) - @property def values_host(self) -> "np.ndarray": """ diff --git a/python/cudf/cudf/core/column/string.py b/python/cudf/cudf/core/column/string.py index 6467fd39ddd..22b7a0f9d2c 100644 --- a/python/cudf/cudf/core/column/string.py +++ b/python/cudf/cudf/core/column/string.py @@ -5083,7 +5083,7 @@ def to_arrow(self) -> pa.Array: """ if self.null_count == len(self): return pa.NullArray.from_buffers( - pa.null(), len(self), [pa.py_buffer((b""))] + pa.null(), len(self), [pa.py_buffer(b"")] ) else: return super().to_arrow() diff --git a/python/cudf/cudf/core/dataframe.py b/python/cudf/cudf/core/dataframe.py index 3735a949277..9d179994174 100644 --- a/python/cudf/cudf/core/dataframe.py +++ b/python/cudf/cudf/core/dataframe.py @@ -1,6 +1,6 @@ # Copyright (c) 2018-2022, NVIDIA CORPORATION. 
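Several hunks in this patch (`ColumnBase` here, `MultiIndex` and `SingleColumnFrame` later) drop per-class `__iter__` overrides that called `cudf.utils.utils.raise_iteration_error` in favor of a shared `NotIterable` mixin imported from `cudf.utils.utils`. The mixin itself is not shown in the diff; a minimal sketch of the pattern, with the error message borrowed from the comment removed in `single_column_frame.py` (the exact cudf implementation may differ), might look like:

```python
class NotIterable:
    """Mixin making GPU-backed classes explicitly non-iterable.

    Iterating over a GPU object is not efficient, so instead of silently
    supporting it, point users at explicit host-transfer APIs.
    """

    def __iter__(self):
        raise TypeError(
            f"{type(self).__name__} object is not iterable. Consider using "
            "`.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to "
            "iterate over the values."
        )


# Hypothetical subclass standing in for ColumnBase / MultiIndex.
class FakeColumn(NotIterable):
    pass


try:
    list(FakeColumn())
except TypeError as exc:
    message = str(exc)

assert "not iterable" in message
```

Centralizing the override this way means each class gains the behavior by inheritance instead of repeating the same `__iter__` body.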
-from __future__ import annotations, division +from __future__ import annotations import functools import inspect @@ -1242,66 +1242,9 @@ def _slice(self: T, arg: slice) -> T: return result def memory_usage(self, index=True, deep=False): - """ - Return the memory usage of each column in bytes. - The memory usage can optionally include the contribution of - the index and elements of `object` dtype. - - Parameters - ---------- - index : bool, default True - Specifies whether to include the memory usage of the DataFrame's - index in returned Series. If ``index=True``, the memory usage of - the index is the first item in the output. - deep : bool, default False - If True, introspect the data deeply by interrogating - `object` dtypes for system-level memory consumption, and include - it in the returned values. - - Returns - ------- - Series - A Series whose index is the original column names and whose values - is the memory usage of each column in bytes. - - Examples - -------- - >>> dtypes = ['int64', 'float64', 'object', 'bool'] - >>> data = dict([(t, np.ones(shape=5000).astype(t)) - ... for t in dtypes]) - >>> df = cudf.DataFrame(data) - >>> df.head() - int64 float64 object bool - 0 1 1.0 1.0 True - 1 1 1.0 1.0 True - 2 1 1.0 1.0 True - 3 1 1.0 1.0 True - 4 1 1.0 1.0 True - >>> df.memory_usage(index=False) - int64 40000 - float64 40000 - object 40000 - bool 5000 - dtype: int64 - - Use a Categorical for efficient storage of an object-dtype column with - many repeated values. - - >>> df['object'].astype('category').memory_usage(deep=True) - 5008 - """ - if deep: - warnings.warn( - "The deep parameter is ignored and is only included " - "for pandas compatibility." 
- ) - ind = list(self.columns) - sizes = [col.memory_usage() for col in self._data.columns] - if index: - ind.append("Index") - ind = cudf.Index(ind, dtype="str") - sizes.append(self.index.memory_usage()) - return Series(sizes, index=ind) + return Series( + {str(k): v for k, v in super().memory_usage(index, deep).items()} + ) def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): if method == "__call__" and hasattr(cudf, ufunc.__name__): @@ -2547,11 +2490,6 @@ def reset_index( inplace=inplace, ) - def take(self, indices, axis=0): - out = super().take(indices) - out.columns = self.columns - return out - @annotate("INSERT", color="green", domain="cudf_python") def insert(self, loc, name, value, nan_as_null=None): """Add a column to DataFrame at the index specified by loc. @@ -4229,7 +4167,7 @@ def _verbose_repr(): dtype = self.dtypes.iloc[i] col = pprint_thing(col) - line_no = _put_str(" {num}".format(num=i), space_num) + line_no = _put_str(f" {i}", space_num) count = "" if show_counts: count = counts[i] @@ -5576,9 +5514,7 @@ def select_dtypes(self, include=None, exclude=None): if issubclass(dtype.type, e_dtype): exclude_subtypes.add(dtype.type) - include_all = set( - [cudf_dtype_from_pydata_dtype(d) for d in self.dtypes] - ) + include_all = {cudf_dtype_from_pydata_dtype(d) for d in self.dtypes} if include: inclusion = include_all & include_subtypes @@ -6329,8 +6265,8 @@ def _align_indices(lhs, rhs): lhs_out = DataFrame(index=df.index) rhs_out = DataFrame(index=df.index) common = set(lhs.columns) & set(rhs.columns) - common_x = set(["{}_x".format(x) for x in common]) - common_y = set(["{}_y".format(x) for x in common]) + common_x = {f"{x}_x" for x in common} + common_y = {f"{x}_y" for x in common} for col in df.columns: if col in common_x: lhs_out[col[:-2]] = df[col] diff --git a/python/cudf/cudf/core/frame.py b/python/cudf/cudf/core/frame.py index 2e01a29b961..6b83f927727 100644 --- a/python/cudf/cudf/core/frame.py +++ b/python/cudf/cudf/core/frame.py @@ 
-337,6 +337,26 @@ def empty(self): """ return self.size == 0 + def memory_usage(self, deep=False): + """Return the memory usage of an object. + + Parameters + ---------- + deep : bool + The deep parameter is ignored and is only included for pandas + compatibility. + + Returns + ------- + The total bytes used. + """ + if deep: + warnings.warn( + "The deep parameter is ignored and is only included " + "for pandas compatibility." + ) + return {name: col.memory_usage() for name, col in self._data.items()} + def __len__(self): return self._num_rows diff --git a/python/cudf/cudf/core/groupby/groupby.py b/python/cudf/cudf/core/groupby/groupby.py index a393d8e9457..ff700144bed 100644 --- a/python/cudf/cudf/core/groupby/groupby.py +++ b/python/cudf/cudf/core/groupby/groupby.py @@ -1461,7 +1461,7 @@ def apply(self, func): # TODO: should we define this as a dataclass instead? -class Grouper(object): +class Grouper: def __init__( self, key=None, level=None, freq=None, closed=None, label=None ): diff --git a/python/cudf/cudf/core/index.py b/python/cudf/cudf/core/index.py index fc59d15e264..f71f930a21c 100644 --- a/python/cudf/cudf/core/index.py +++ b/python/cudf/cudf/core/index.py @@ -1,6 +1,6 @@ # Copyright (c) 2018-2021, NVIDIA CORPORATION. 
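The new `Frame.memory_usage` above centralizes per-column accounting: it returns a plain `{column_name: bytes}` mapping and leaves aggregation to subclasses. A self-contained stand-in for that base behavior (`FakeColumn` and `FrameLike` are illustrative names, not cudf's):

```python
class FakeColumn:
    def __init__(self, nbytes):
        self._nbytes = nbytes

    def memory_usage(self):
        return self._nbytes


class FrameLike:
    """Stand-in for cudf's Frame; _data maps column names to columns."""

    def __init__(self, data):
        self._data = data

    def memory_usage(self, deep=False):
        # `deep` is accepted only for pandas compatibility (the real
        # method warns when it is set); report usage per column and let
        # subclasses decide how to aggregate.
        return {name: col.memory_usage() for name, col in self._data.items()}


f = FrameLike({"a": FakeColumn(40000), "b": FakeColumn(5000)})
assert f.memory_usage() == {"a": 40000, "b": 5000}
```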
-from __future__ import annotations, division, print_function +from __future__ import annotations import math import pickle @@ -826,6 +826,9 @@ def _concat(cls, objs): result.name = name return result + def memory_usage(self, deep=False): + return sum(super().memory_usage(deep=deep).values()) + @annotate("INDEX_EQUALS", color="green", domain="cudf_python") def equals(self, other, **kwargs): """ diff --git a/python/cudf/cudf/core/indexed_frame.py b/python/cudf/cudf/core/indexed_frame.py index 8ecab2c7c65..fab5d75f62b 100644 --- a/python/cudf/cudf/core/indexed_frame.py +++ b/python/cudf/cudf/core/indexed_frame.py @@ -473,6 +473,68 @@ def sort_index( out = out.reset_index(drop=True) return self._mimic_inplace(out, inplace=inplace) + def memory_usage(self, index=True, deep=False): + """Return the memory usage of an object. + + Parameters + ---------- + index : bool, default True + Specifies whether to include the memory usage of the index. + deep : bool, default False + The deep parameter is ignored and is only included for pandas + compatibility. + + Returns + ------- + Series or scalar + For DataFrame, a Series whose index is the original column names + and whose values is the memory usage of each column in bytes. For a + Series the total memory usage. + + Examples + -------- + **DataFrame** + + >>> dtypes = ['int64', 'float64', 'object', 'bool'] + >>> data = dict([(t, np.ones(shape=5000).astype(t)) + ... for t in dtypes]) + >>> df = cudf.DataFrame(data) + >>> df.head() + int64 float64 object bool + 0 1 1.0 1.0 True + 1 1 1.0 1.0 True + 2 1 1.0 1.0 True + 3 1 1.0 1.0 True + 4 1 1.0 1.0 True + >>> df.memory_usage(index=False) + int64 40000 + float64 40000 + object 40000 + bool 5000 + dtype: int64 + + Use a Categorical for efficient storage of an object-dtype column with + many repeated values. 
+ + >>> df['object'].astype('category').memory_usage(deep=True) + 5008 + + **Series** + >>> s = cudf.Series(range(3), index=['a','b','c']) + >>> s.memory_usage() + 43 + + Not including the index gives the size of the rest of the data, which + is necessarily smaller: + + >>> s.memory_usage(index=False) + 24 + """ + usage = super().memory_usage(deep=deep) + if index: + usage["Index"] = self.index.memory_usage() + return usage + def hash_values(self, method="murmur3"): """Compute the hash of values in this column. diff --git a/python/cudf/cudf/core/join/join.py b/python/cudf/cudf/core/join/join.py index 704274815f6..39ff4718550 100644 --- a/python/cudf/cudf/core/join/join.py +++ b/python/cudf/cudf/core/join/join.py @@ -169,13 +169,11 @@ def __init__( if on else set() if (self._using_left_index or self._using_right_index) - else set( - [ - lkey.name - for lkey, rkey in zip(self._left_keys, self._right_keys) - if lkey.name == rkey.name - ] - ) + else { + lkey.name + for lkey, rkey in zip(self._left_keys, self._right_keys) + if lkey.name == rkey.name + } ) def perform_merge(self) -> Frame: diff --git a/python/cudf/cudf/core/multiindex.py b/python/cudf/cudf/core/multiindex.py index adce3c24a83..8581b97c217 100644 --- a/python/cudf/cudf/core/multiindex.py +++ b/python/cudf/cudf/core/multiindex.py @@ -5,7 +5,6 @@ import itertools import numbers import pickle -import warnings from collections.abc import Sequence from numbers import Integral from typing import Any, List, MutableMapping, Optional, Tuple, Union @@ -23,10 +22,14 @@ from cudf.core._compat import PANDAS_GE_120 from cudf.core.frame import Frame from cudf.core.index import BaseIndex, _lexsorted_equal_range, as_index -from cudf.utils.utils import _maybe_indices_to_slice, cached_property +from cudf.utils.utils import ( + NotIterable, + _maybe_indices_to_slice, + cached_property, +) -class MultiIndex(Frame, BaseIndex): +class MultiIndex(Frame, BaseIndex, NotIterable): """A multi-level or hierarchical index. 
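The subclasses then aggregate the shared dict in different ways: `IndexedFrame.memory_usage` above appends an `"Index"` entry when `index=True`, `Series.memory_usage` (later hunk) sums the values, and `DataFrame.memory_usage` (earlier hunk) stringifies the keys so mixed-type column labels can back a string index. A sketch with a plain dict standing in for the intermediate result (values are illustrative):

```python
# Per-column usage as returned by the shared helper, including the
# "Index" entry added when index=True; the integer key 0 mimics a
# non-string column label.
usage = {0: 40000, "float64": 40000, "Index": 40}

# DataFrame path: stringify keys before building the result Series.
dataframe_result = {str(k): v for k, v in usage.items()}

# Series path: collapse to a single byte count.
series_result = sum(usage.values())

assert dataframe_result == {"0": 40000, "float64": 40000, "Index": 40}
assert series_result == 80040
```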
Provides N-Dimensional indexing into Series and DataFrame objects. @@ -115,7 +118,7 @@ def __init__( "MultiIndex has unequal number of levels and " "codes and is inconsistent!" ) - if len(set(c.size for c in codes._data.columns)) != 1: + if len({c.size for c in codes._data.columns}) != 1: raise ValueError( "MultiIndex length of codes does not match " "and is inconsistent!" @@ -367,9 +370,6 @@ def copy( return mi - def __iter__(self): - cudf.utils.utils.raise_iteration_error(obj=self) - def __repr__(self): max_seq_items = get_option("display.max_seq_items") or len(self) @@ -752,7 +752,7 @@ def _index_and_downcast(self, result, index, index_key): # Pandas returns an empty Series with a tuple as name # the one expected result column result = cudf.Series._from_data( - {}, name=tuple((col[0] for col in index._data.columns)) + {}, name=tuple(col[0] for col in index._data.columns) ) elif out_index._num_columns == 1: # If there's only one column remaining in the output index, convert @@ -1202,7 +1202,7 @@ def _poplevels(self, level): if not pd.api.types.is_list_like(level): level = (level,) - ilevels = sorted([self._level_index_from_level(lev) for lev in level]) + ilevels = sorted(self._level_index_from_level(lev) for lev in level) if not ilevels: return None @@ -1412,22 +1412,14 @@ def _clean_nulls_from_index(self): ) def memory_usage(self, deep=False): - if deep: - warnings.warn( - "The deep parameter is ignored and is only included " - "for pandas compatibility." 
- ) - - n = 0 - for col in self._data.columns: - n += col.memory_usage() + usage = sum(super().memory_usage(deep=deep).values()) if self.levels: for level in self.levels: - n += level.memory_usage(deep=deep) + usage += level.memory_usage(deep=deep) if self.codes: for col in self.codes._data.columns: - n += col.memory_usage() - return n + usage += col.memory_usage() + return usage def difference(self, other, sort=None): if hasattr(other, "to_pandas"): diff --git a/python/cudf/cudf/core/scalar.py b/python/cudf/cudf/core/scalar.py index b0770b71ca6..134b94bf0f2 100644 --- a/python/cudf/cudf/core/scalar.py +++ b/python/cudf/cudf/core/scalar.py @@ -17,7 +17,7 @@ ) -class Scalar(object): +class Scalar: """ A GPU-backed scalar object with NumPy scalar like properties May be used in binary operations against other scalars, cuDF diff --git a/python/cudf/cudf/core/series.py b/python/cudf/cudf/core/series.py index 12a2538b776..5823ea18d1b 100644 --- a/python/cudf/cudf/core/series.py +++ b/python/cudf/cudf/core/series.py @@ -167,7 +167,7 @@ def __getitem__(self, arg: Any) -> Union[ScalarLike, DataFrameOrSeries]: if ( isinstance(arg, tuple) and len(arg) == self._frame._index.nlevels - and not any((isinstance(x, slice) for x in arg)) + and not any(isinstance(x, slice) for x in arg) ): result = result.iloc[0] return result @@ -953,52 +953,7 @@ def to_frame(self, name=None): return cudf.DataFrame({col: self._column}, index=self.index) def memory_usage(self, index=True, deep=False): - """ - Return the memory usage of the Series. - - The memory usage can optionally include the contribution of - the index and of elements of `object` dtype. - - Parameters - ---------- - index : bool, default True - Specifies whether to include the memory usage of the Series index. - deep : bool, default False - If True, introspect the data deeply by interrogating - `object` dtypes for system-level memory consumption, and include - it in the returned value. 
- - Returns - ------- - int - Bytes of memory consumed. - - See Also - -------- - cudf.DataFrame.memory_usage : Bytes consumed by - a DataFrame. - - Examples - -------- - >>> s = cudf.Series(range(3), index=['a','b','c']) - >>> s.memory_usage() - 43 - - Not including the index gives the size of the rest of the data, which - is necessarily smaller: - - >>> s.memory_usage(index=False) - 24 - """ - if deep: - warnings.warn( - "The deep parameter is ignored and is only included " - "for pandas compatibility." - ) - n = self._column.memory_usage() - if index: - n += self._index.memory_usage() - return n + return sum(super().memory_usage(index, deep).values()) def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): if method == "__call__": @@ -2722,42 +2677,6 @@ def unique(self): res = self._column.unique() return Series(res, name=self.name) - def nunique(self, method="sort", dropna=True): - """Returns the number of unique values of the Series: approximate version, - and exact version to be moved to libcudf - - Excludes NA values by default. - - Parameters - ---------- - dropna : bool, default True - Don't include NA values in the count. 
- - Returns - ------- - int - - Examples - -------- - >>> import cudf - >>> s = cudf.Series([1, 3, 5, 7, 7]) - >>> s - 0 1 - 1 3 - 2 5 - 3 7 - 4 7 - dtype: int64 - >>> s.nunique() - 4 - """ - if method != "sort": - msg = "non sort based distinct_count() not implemented yet" - raise NotImplementedError(msg) - if self.null_count == len(self): - return 0 - return super().nunique(method, dropna) - def value_counts( self, normalize=False, @@ -2969,7 +2888,7 @@ def _prepare_percentiles(percentiles): return percentiles def _format_percentile_names(percentiles): - return ["{0}%".format(int(x * 100)) for x in percentiles] + return [f"{int(x * 100)}%" for x in percentiles] def _format_stats_values(stats_data): return map(lambda x: round(x, 6), stats_data) @@ -3071,7 +2990,7 @@ def _describe_timestamp(self): .to_numpy(na_value=np.nan), ) ), - "max": str(pd.Timestamp((self.max()))), + "max": str(pd.Timestamp(self.max())), } return Series( @@ -3327,6 +3246,11 @@ def merge( method="hash", suffixes=("_x", "_y"), ): + warnings.warn( + "Series.merge is deprecated and will be removed in a future " + "release. Use cudf.merge instead.", + FutureWarning, + ) if left_on not in (self.name, None): raise ValueError( "Series to other merge uses series name as key implicitly" @@ -3550,7 +3474,7 @@ def wrapper(self, other, level=None, fill_value=None, axis=0): setattr(Series, binop, make_binop_func(binop)) -class DatetimeProperties(object): +class DatetimeProperties: """ Accessor object for datetimelike properties of the Series values. @@ -4492,7 +4416,7 @@ def strftime(self, date_format, *args, **kwargs): ) -class TimedeltaProperties(object): +class TimedeltaProperties: """ Accessor object for timedeltalike properties of the Series values. 
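The `_format_percentile_names` hunk above (and many test-suite hunks later in the patch) replaces `str.format` with f-strings; both render identically, the f-string just keeps the expression inline at the call site:

```python
percentiles = [0.25, 0.5, 0.75]

# Older idiom, as removed by the patch.
old = ["{0}%".format(int(x * 100)) for x in percentiles]

# f-string form, as added.
new = [f"{int(x * 100)}%" for x in percentiles]

assert old == new == ["25%", "50%", "75%"]
```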
diff --git a/python/cudf/cudf/core/single_column_frame.py b/python/cudf/cudf/core/single_column_frame.py index ef479f19363..bf867923b57 100644 --- a/python/cudf/cudf/core/single_column_frame.py +++ b/python/cudf/cudf/core/single_column_frame.py @@ -15,11 +15,12 @@ from cudf.api.types import _is_scalar_or_zero_d_array from cudf.core.column import ColumnBase, as_column from cudf.core.frame import Frame +from cudf.utils.utils import NotIterable T = TypeVar("T", bound="Frame") -class SingleColumnFrame(Frame): +class SingleColumnFrame(Frame, NotIterable): """A one-dimensional frame. Frames with only a single column share certain logic that is encoded in @@ -85,12 +86,6 @@ def shape(self): """Get a tuple representing the dimensionality of the Index.""" return (len(self),) - def __iter__(self): - # Iterating over a GPU object is not efficient and hence not supported. - # Consider using ``.to_arrow()``, ``.to_pandas()`` or ``.values_host`` - # if you wish to iterate over the values. - cudf.utils.utils.raise_iteration_error(obj=self) - def __bool__(self): raise TypeError( f"The truth value of a {type(self)} is ambiguous. Use " @@ -343,4 +338,6 @@ def nunique(self, method: builtins.str = "sort", dropna: bool = True): int Number of unique values in the column. 
""" + if self._column.null_count == len(self): + return 0 return self._column.distinct_count(method=method, dropna=dropna) diff --git a/python/cudf/cudf/core/udf/typing.py b/python/cudf/cudf/core/udf/typing.py index da7ff4c0e32..56e8bec74dc 100644 --- a/python/cudf/cudf/core/udf/typing.py +++ b/python/cudf/cudf/core/udf/typing.py @@ -133,8 +133,8 @@ def typeof_masked(val, c): class MaskedConstructor(ConcreteTemplate): key = api.Masked units = ["ns", "ms", "us", "s"] - datetime_cases = set(types.NPDatetime(u) for u in units) - timedelta_cases = set(types.NPTimedelta(u) for u in units) + datetime_cases = {types.NPDatetime(u) for u in units} + timedelta_cases = {types.NPTimedelta(u) for u in units} cases = [ nb_signature(MaskedType(t), t, types.boolean) for t in ( diff --git a/python/cudf/cudf/datasets.py b/python/cudf/cudf/datasets.py index 2341a5c23b9..d7a2fedef59 100644 --- a/python/cudf/cudf/datasets.py +++ b/python/cudf/cudf/datasets.py @@ -57,9 +57,7 @@ def timeseries( pd.date_range(start, end, freq=freq, name="timestamp") ) state = np.random.RandomState(seed) - columns = dict( - (k, make[dt](len(index), state)) for k, dt in dtypes.items() - ) + columns = {k: make[dt](len(index), state) for k, dt in dtypes.items()} df = pd.DataFrame(columns, index=index, columns=sorted(columns)) if df.index[-1] == end: df = df.iloc[:-1] @@ -110,7 +108,7 @@ def randomdata(nrows=10, dtypes=None, seed=None): if dtypes is None: dtypes = {"id": int, "x": float, "y": float} state = np.random.RandomState(seed) - columns = dict((k, make[dt](nrows, state)) for k, dt in dtypes.items()) + columns = {k: make[dt](nrows, state) for k, dt in dtypes.items()} df = pd.DataFrame(columns, columns=sorted(columns)) return cudf.from_pandas(df) diff --git a/python/cudf/cudf/tests/test_api_types.py b/python/cudf/cudf/tests/test_api_types.py index 4d104c122d1..e7cf113f604 100644 --- a/python/cudf/cudf/tests/test_api_types.py +++ b/python/cudf/cudf/tests/test_api_types.py @@ -17,9 +17,7 @@ (int(), False), 
(float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, False), @@ -128,9 +126,7 @@ def test_is_categorical_dtype(obj, expect): (int(), False), (float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, True), @@ -235,9 +231,7 @@ def test_is_numeric_dtype(obj, expect): (int(), False), (float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, False), @@ -342,9 +336,7 @@ def test_is_integer_dtype(obj, expect): (int(), True), (float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, False), @@ -450,9 +442,7 @@ def test_is_integer(obj, expect): (int(), False), (float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, False), @@ -557,9 +547,7 @@ def test_is_string_dtype(obj, expect): (int(), False), (float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, False), @@ -664,9 +652,7 @@ def test_is_datetime_dtype(obj, expect): (int(), False), (float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, False), @@ -771,9 +757,7 @@ def test_is_list_dtype(obj, expect): (int(), False), (float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, False), @@ -881,9 +865,7 @@ def test_is_struct_dtype(obj, expect): (int(), False), (float(), False), (complex(), False), - (str(), False), ("", False), - (r"", False), (object(), False), # Base Python types. (bool, False), @@ -988,9 +970,7 @@ def test_is_decimal_dtype(obj, expect): int(), float(), complex(), - str(), "", - r"", object(), # Base Python types. 
bool, @@ -1070,9 +1050,7 @@ def test_pandas_agreement(obj): int(), float(), complex(), - str(), "", - r"", object(), # Base Python types. bool, diff --git a/python/cudf/cudf/tests/test_binops.py b/python/cudf/cudf/tests/test_binops.py index 921f2de38c2..76add8b9c5d 100644 --- a/python/cudf/cudf/tests/test_binops.py +++ b/python/cudf/cudf/tests/test_binops.py @@ -1,6 +1,5 @@ # Copyright (c) 2018-2022, NVIDIA CORPORATION. -from __future__ import division import decimal import operator diff --git a/python/cudf/cudf/tests/test_copying.py b/python/cudf/cudf/tests/test_copying.py index 21a6a9172db..0d0ba579f22 100644 --- a/python/cudf/cudf/tests/test_copying.py +++ b/python/cudf/cudf/tests/test_copying.py @@ -1,5 +1,3 @@ -from __future__ import division, print_function - import numpy as np import pandas as pd import pytest diff --git a/python/cudf/cudf/tests/test_cuda_apply.py b/python/cudf/cudf/tests/test_cuda_apply.py index a00dbbba5f0..e8bd64b5061 100644 --- a/python/cudf/cudf/tests/test_cuda_apply.py +++ b/python/cudf/cudf/tests/test_cuda_apply.py @@ -98,7 +98,7 @@ def kernel(in1, in2, in3, out1, out2, extra1, extra2): expect_out1 = extra2 * in1 - extra1 * in2 + in3 expect_out2 = np.hstack( - np.arange((e - s)) for s, e in zip(chunks, chunks[1:] + [len(df)]) + np.arange(e - s) for s, e in zip(chunks, chunks[1:] + [len(df)]) ) outdf = df.apply_chunks( @@ -141,8 +141,7 @@ def kernel(in1, in2, in3, out1, out2, extra1, extra2): expect_out1 = extra2 * in1 - extra1 * in2 + in3 expect_out2 = np.hstack( - tpb * np.arange((e - s)) - for s, e in zip(chunks, chunks[1:] + [len(df)]) + tpb * np.arange(e - s) for s, e in zip(chunks, chunks[1:] + [len(df)]) ) outdf = df.apply_chunks( diff --git a/python/cudf/cudf/tests/test_dataframe.py b/python/cudf/cudf/tests/test_dataframe.py index ba2caf7c6c8..5022f1a675b 100644 --- a/python/cudf/cudf/tests/test_dataframe.py +++ b/python/cudf/cudf/tests/test_dataframe.py @@ -246,17 +246,15 @@ def test_series_init_none(): sr1 = cudf.Series() got 
= sr1.to_string() - expect = sr1.to_pandas().__repr__() - # values should match despite whitespace difference - assert got.split() == expect.split() + expect = repr(sr1.to_pandas()) + assert got == expect # 2: Using `None` as an initializer sr2 = cudf.Series(None) got = sr2.to_string() - expect = sr2.to_pandas().__repr__() - # values should match despite whitespace difference - assert got.split() == expect.split() + expect = repr(sr2.to_pandas()) + assert got == expect def test_dataframe_basic(): @@ -843,21 +841,20 @@ def test_dataframe_to_string_with_masked_data(): def test_dataframe_to_string_wide(monkeypatch): monkeypatch.setenv("COLUMNS", "79") # Test basic - df = cudf.DataFrame() - for i in range(100): - df["a{}".format(i)] = list(range(3)) - pd.options.display.max_columns = 0 - got = df.to_string() + df = cudf.DataFrame({f"a{i}": [0, 1, 2] for i in range(100)}) + with pd.option_context("display.max_columns", 0): + got = df.to_string() - expect = """ - a0 a1 a2 a3 a4 a5 a6 a7 ... a92 a93 a94 a95 a96 a97 a98 a99 -0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 -1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 -2 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 -[3 rows x 100 columns] -""" - # values should match despite whitespace difference - assert got.split() == expect.split() + expect = textwrap.dedent( + """\ + a0 a1 a2 a3 a4 a5 a6 a7 ... a92 a93 a94 a95 a96 a97 a98 a99 + 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 + 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 + 2 2 2 2 2 2 2 2 2 ... 
2 2 2 2 2 2 2 2 + + [3 rows x 100 columns]""" # noqa: E501 + ) + assert got == expect def test_dataframe_empty_to_string(): @@ -865,9 +862,8 @@ def test_dataframe_empty_to_string(): df = cudf.DataFrame() got = df.to_string() - expect = "Empty DataFrame\nColumns: []\nIndex: []\n" - # values should match despite whitespace difference - assert got.split() == expect.split() + expect = "Empty DataFrame\nColumns: []\nIndex: []" + assert got == expect def test_dataframe_emptycolumns_to_string(): @@ -877,9 +873,8 @@ def test_dataframe_emptycolumns_to_string(): df["b"] = [] got = df.to_string() - expect = "Empty DataFrame\nColumns: [a, b]\nIndex: []\n" - # values should match despite whitespace difference - assert got.split() == expect.split() + expect = "Empty DataFrame\nColumns: [a, b]\nIndex: []" + assert got == expect def test_dataframe_copy(): @@ -890,14 +885,14 @@ def test_dataframe_copy(): df2["b"] = [4, 5, 6] got = df.to_string() - expect = """ - a -0 1 -1 2 -2 3 -""" - # values should match despite whitespace difference - assert got.split() == expect.split() + expect = textwrap.dedent( + """\ + a + 0 1 + 1 2 + 2 3""" + ) + assert got == expect def test_dataframe_copy_shallow(): @@ -908,14 +903,14 @@ def test_dataframe_copy_shallow(): df2["b"] = [4, 2, 3] got = df.to_string() - expect = """ - a -0 1 -1 2 -2 3 -""" - # values should match despite whitespace difference - assert got.split() == expect.split() + expect = textwrap.dedent( + """\ + a + 0 1 + 1 2 + 2 3""" + ) + assert got == expect def test_dataframe_dtypes(): @@ -1163,7 +1158,7 @@ def test_dataframe_hash_partition(nrows, nparts, nkeys): gdf = cudf.DataFrame() keycols = [] for i in range(nkeys): - keyname = "key{}".format(i) + keyname = f"key{i}" gdf[keyname] = np.random.randint(0, 7 - i, nrows) keycols.append(keyname) gdf["val1"] = np.random.randint(0, nrows * 2, nrows) diff --git a/python/cudf/cudf/tests/test_factorize.py b/python/cudf/cudf/tests/test_factorize.py index 1f16686a6a6..3081b7c4a6e 100644 --- 
a/python/cudf/cudf/tests/test_factorize.py +++ b/python/cudf/cudf/tests/test_factorize.py @@ -23,7 +23,7 @@ def test_factorize_series_obj(ncats, nelem): assert isinstance(uvals, cp.ndarray) assert isinstance(labels, Index) - encoder = dict((labels[idx], idx) for idx in range(len(labels))) + encoder = {labels[idx]: idx for idx in range(len(labels))} handcoded = [encoder[v] for v in arr] np.testing.assert_array_equal(uvals.get(), handcoded) @@ -42,7 +42,7 @@ def test_factorize_index_obj(ncats, nelem): assert isinstance(uvals, cp.ndarray) assert isinstance(labels, Index) - encoder = dict((labels[idx], idx) for idx in range(len(labels))) + encoder = {labels[idx]: idx for idx in range(len(labels))} handcoded = [encoder[v] for v in arr] np.testing.assert_array_equal(uvals.get(), handcoded) diff --git a/python/cudf/cudf/tests/test_gcs.py b/python/cudf/cudf/tests/test_gcs.py index db53529b22f..307232b1305 100644 --- a/python/cudf/cudf/tests/test_gcs.py +++ b/python/cudf/cudf/tests/test_gcs.py @@ -48,14 +48,14 @@ def mock_size(*args): # use_python_file_object=True, because the pyarrow # `open_input_file` command will fail (since it doesn't # use the monkey-patched `open` definition) - got = cudf.read_csv("gcs://{}".format(fpath), use_python_file_object=False) + got = cudf.read_csv(f"gcs://{fpath}", use_python_file_object=False) assert_eq(pdf, got) # AbstractBufferedFile -> PythonFile conversion # will work fine with the monkey-patched FS if we # pass in an fsspec file object fs = gcsfs.core.GCSFileSystem() - with fs.open("gcs://{}".format(fpath)) as f: + with fs.open(f"gcs://{fpath}") as f: got = cudf.read_csv(f) assert_eq(pdf, got) @@ -69,7 +69,7 @@ def mock_open(*args, **kwargs): return open(local_filepath, "wb") monkeypatch.setattr(gcsfs.core.GCSFileSystem, "open", mock_open) - gdf.to_orc("gcs://{}".format(gcs_fname)) + gdf.to_orc(f"gcs://{gcs_fname}") got = pa.orc.ORCFile(local_filepath).read().to_pandas() assert_eq(pdf, got) diff --git 
a/python/cudf/cudf/tests/test_groupby.py b/python/cudf/cudf/tests/test_groupby.py index f5decd62ea9..61c7d1958a0 100644 --- a/python/cudf/cudf/tests/test_groupby.py +++ b/python/cudf/cudf/tests/test_groupby.py @@ -84,11 +84,6 @@ def make_frame( return df -def get_nelem(): - for elem in [2, 3, 1000]: - yield elem - - @pytest.fixture def gdf(): return DataFrame({"x": [1, 2, 3], "y": [0, 1, 1]}) @@ -1096,7 +1091,7 @@ def test_groupby_cumcount(): ) -@pytest.mark.parametrize("nelem", get_nelem()) +@pytest.mark.parametrize("nelem", [2, 3, 1000]) @pytest.mark.parametrize("as_index", [True, False]) @pytest.mark.parametrize( "agg", ["min", "max", "idxmin", "idxmax", "mean", "count"] diff --git a/python/cudf/cudf/tests/test_hdfs.py b/python/cudf/cudf/tests/test_hdfs.py index 24554f113bb..2d61d6693cb 100644 --- a/python/cudf/cudf/tests/test_hdfs.py +++ b/python/cudf/cudf/tests/test_hdfs.py @@ -62,7 +62,7 @@ def test_read_csv(tmpdir, pdf, hdfs, test_url): host, port, basedir ) else: - hd_fpath = "hdfs://{}/test_csv_reader.csv".format(basedir) + hd_fpath = f"hdfs://{basedir}/test_csv_reader.csv" got = cudf.read_csv(hd_fpath) @@ -81,7 +81,7 @@ def test_write_csv(pdf, hdfs, test_url): host, port, basedir ) else: - hd_fpath = "hdfs://{}/test_csv_writer.csv".format(basedir) + hd_fpath = f"hdfs://{basedir}/test_csv_writer.csv" gdf.to_csv(hd_fpath, index=False) @@ -107,7 +107,7 @@ def test_read_parquet(tmpdir, pdf, hdfs, test_url): host, port, basedir ) else: - hd_fpath = "hdfs://{}/test_parquet_reader.parquet".format(basedir) + hd_fpath = f"hdfs://{basedir}/test_parquet_reader.parquet" got = cudf.read_parquet(hd_fpath) @@ -126,7 +126,7 @@ def test_write_parquet(pdf, hdfs, test_url): host, port, basedir ) else: - hd_fpath = "hdfs://{}/test_parquet_writer.parquet".format(basedir) + hd_fpath = f"hdfs://{basedir}/test_parquet_writer.parquet" gdf.to_parquet(hd_fpath) @@ -153,7 +153,7 @@ def test_write_parquet_partitioned(tmpdir, pdf, hdfs, test_url): host, port, basedir ) else: - 
hd_fpath = "hdfs://{}/test_parquet_partitioned.parquet".format(basedir) + hd_fpath = f"hdfs://{basedir}/test_parquet_partitioned.parquet" # Clear data written from previous runs hdfs.rm(f"{basedir}/test_parquet_partitioned.parquet", recursive=True) gdf.to_parquet( @@ -186,7 +186,7 @@ def test_read_json(tmpdir, pdf, hdfs, test_url): host, port, basedir ) else: - hd_fpath = "hdfs://{}/test_json_reader.json".format(basedir) + hd_fpath = f"hdfs://{basedir}/test_json_reader.json" got = cudf.read_json(hd_fpath, engine="cudf", orient="records", lines=True) @@ -207,9 +207,9 @@ def test_read_orc(datadir, hdfs, test_url): hdfs.upload(basedir + "/file.orc", buffer) if test_url: - hd_fpath = "hdfs://{}:{}{}/file.orc".format(host, port, basedir) + hd_fpath = f"hdfs://{host}:{port}{basedir}/file.orc" else: - hd_fpath = "hdfs://{}/file.orc".format(basedir) + hd_fpath = f"hdfs://{basedir}/file.orc" got = cudf.read_orc(hd_fpath) expect = orc.ORCFile(buffer).read().to_pandas() @@ -226,7 +226,7 @@ def test_write_orc(pdf, hdfs, test_url): host, port, basedir ) else: - hd_fpath = "hdfs://{}/test_orc_writer.orc".format(basedir) + hd_fpath = f"hdfs://{basedir}/test_orc_writer.orc" gdf.to_orc(hd_fpath) @@ -247,9 +247,9 @@ def test_read_avro(datadir, hdfs, test_url): hdfs.upload(basedir + "/file.avro", buffer) if test_url: - hd_fpath = "hdfs://{}:{}{}/file.avro".format(host, port, basedir) + hd_fpath = f"hdfs://{host}:{port}{basedir}/file.avro" else: - hd_fpath = "hdfs://{}/file.avro".format(basedir) + hd_fpath = f"hdfs://{basedir}/file.avro" got = cudf.read_avro(hd_fpath) with open(fname, mode="rb") as f: @@ -270,7 +270,7 @@ def test_storage_options(tmpdir, pdf, hdfs): # Write to hdfs hdfs.upload(basedir + "/file.csv", buffer) - hd_fpath = "hdfs://{}/file.csv".format(basedir) + hd_fpath = f"hdfs://{basedir}/file.csv" storage_options = {"host": host, "port": port} @@ -293,7 +293,7 @@ def test_storage_options_error(tmpdir, pdf, hdfs): # Write to hdfs hdfs.upload(basedir + "/file.csv", 
buffer) - hd_fpath = "hdfs://{}:{}{}/file.avro".format(host, port, basedir) + hd_fpath = f"hdfs://{host}:{port}{basedir}/file.avro" storage_options = {"host": host, "port": port} diff --git a/python/cudf/cudf/tests/test_query.py b/python/cudf/cudf/tests/test_query.py index 3de38b2cf6f..09129a43f07 100644 --- a/python/cudf/cudf/tests/test_query.py +++ b/python/cudf/cudf/tests/test_query.py @@ -1,6 +1,5 @@ # Copyright (c) 2018, NVIDIA CORPORATION. -from __future__ import division, print_function import datetime import inspect diff --git a/python/cudf/cudf/tests/test_reductions.py b/python/cudf/cudf/tests/test_reductions.py index 40add502309..7106ab54686 100644 --- a/python/cudf/cudf/tests/test_reductions.py +++ b/python/cudf/cudf/tests/test_reductions.py @@ -1,6 +1,5 @@ # Copyright (c) 2020-2022, NVIDIA CORPORATION. -from __future__ import division, print_function import re from decimal import Decimal diff --git a/python/cudf/cudf/tests/test_s3.py b/python/cudf/cudf/tests/test_s3.py index da1ffc1fc16..4807879a730 100644 --- a/python/cudf/cudf/tests/test_s3.py +++ b/python/cudf/cudf/tests/test_s3.py @@ -147,7 +147,7 @@ def test_read_csv(s3_base, s3so, pdf, bytes_per_thread): # Use fsspec file object with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): got = cudf.read_csv( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", storage_options=s3so, bytes_per_thread=bytes_per_thread, use_python_file_object=False, @@ -157,7 +157,7 @@ def test_read_csv(s3_base, s3so, pdf, bytes_per_thread): # Use Arrow PythonFile object with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): got = cudf.read_csv( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", storage_options=s3so, bytes_per_thread=bytes_per_thread, use_python_file_object=True, @@ -174,7 +174,7 @@ def test_read_csv_arrow_nativefile(s3_base, s3so, pdf): fs = pa_fs.S3FileSystem( endpoint_override=s3so["client_kwargs"]["endpoint_url"], ) - with 
fs.open_input_file("{}/{}".format(bname, fname)) as fil: + with fs.open_input_file(f"{bname}/{fname}") as fil: got = cudf.read_csv(fil) assert_eq(pdf, got) @@ -193,7 +193,7 @@ def test_read_csv_byte_range( # Use fsspec file object with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): got = cudf.read_csv( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", storage_options=s3so, byte_range=(74, 73), bytes_per_thread=bytes_per_thread, @@ -213,15 +213,15 @@ def test_write_csv(s3_base, s3so, pdf, chunksize): gdf = cudf.from_pandas(pdf) with s3_context(s3_base=s3_base, bucket=bname) as s3fs: gdf.to_csv( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", index=False, chunksize=chunksize, storage_options=s3so, ) - assert s3fs.exists("s3://{}/{}".format(bname, fname)) + assert s3fs.exists(f"s3://{bname}/{fname}") # TODO: Update to use `storage_options` from pandas v1.2.0 - got = pd.read_csv(s3fs.open("s3://{}/{}".format(bname, fname))) + got = pd.read_csv(s3fs.open(f"s3://{bname}/{fname}")) assert_eq(pdf, got) @@ -248,7 +248,7 @@ def test_read_parquet( buffer.seek(0) with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): got1 = cudf.read_parquet( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", open_file_options=( {"precache_options": {"method": precache}} if use_python_file_object @@ -265,10 +265,10 @@ def test_read_parquet( # Check fsspec file-object handling buffer.seek(0) with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): - fs = get_fs_token_paths( - "s3://{}/{}".format(bname, fname), storage_options=s3so - )[0] - with fs.open("s3://{}/{}".format(bname, fname), mode="rb") as f: + fs = get_fs_token_paths(f"s3://{bname}/{fname}", storage_options=s3so)[ + 0 + ] + with fs.open(f"s3://{bname}/{fname}", mode="rb") as f: got2 = cudf.read_parquet( f, bytes_per_thread=bytes_per_thread, @@ -297,7 +297,7 @@ def test_read_parquet_ext( buffer.seek(0) with s3_context(s3_base=s3_base, 
bucket=bname, files={fname: buffer}): got1 = cudf.read_parquet( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", storage_options=s3so, bytes_per_thread=bytes_per_thread, footer_sample_size=3200, @@ -326,7 +326,7 @@ def test_read_parquet_arrow_nativefile(s3_base, s3so, pdf, columns): fs = pa_fs.S3FileSystem( endpoint_override=s3so["client_kwargs"]["endpoint_url"], ) - with fs.open_input_file("{}/{}".format(bname, fname)) as fil: + with fs.open_input_file(f"{bname}/{fname}") as fil: got = cudf.read_parquet(fil, columns=columns) expect = pdf[columns] if columns else pdf @@ -343,7 +343,7 @@ def test_read_parquet_filters(s3_base, s3so, pdf_ext, precache): filters = [("String", "==", "Omega")] with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): got = cudf.read_parquet( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", storage_options=s3so, filters=filters, open_file_options={"precache_options": {"method": precache}}, @@ -360,13 +360,13 @@ def test_write_parquet(s3_base, s3so, pdf, partition_cols): gdf = cudf.from_pandas(pdf) with s3_context(s3_base=s3_base, bucket=bname) as s3fs: gdf.to_parquet( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", partition_cols=partition_cols, storage_options=s3so, ) - assert s3fs.exists("s3://{}/{}".format(bname, fname)) + assert s3fs.exists(f"s3://{bname}/{fname}") - got = pd.read_parquet(s3fs.open("s3://{}/{}".format(bname, fname))) + got = pd.read_parquet(s3fs.open(f"s3://{bname}/{fname}")) assert_eq(pdf, got) @@ -383,7 +383,7 @@ def test_read_json(s3_base, s3so): with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): got = cudf.read_json( - "s3://{}/{}".format(bname, fname), + f"s3://{bname}/{fname}", engine="cudf", orient="records", lines=True, @@ -407,7 +407,7 @@ def test_read_orc(s3_base, s3so, datadir, use_python_file_object, columns): with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): got = cudf.read_orc( - "s3://{}/{}".format(bname, 
fname), + f"s3://{bname}/{fname}", columns=columns, storage_options=s3so, use_python_file_object=use_python_file_object, @@ -432,7 +432,7 @@ def test_read_orc_arrow_nativefile(s3_base, s3so, datadir, columns): fs = pa_fs.S3FileSystem( endpoint_override=s3so["client_kwargs"]["endpoint_url"], ) - with fs.open_input_file("{}/{}".format(bname, fname)) as fil: + with fs.open_input_file(f"{bname}/{fname}") as fil: got = cudf.read_orc(fil, columns=columns) if columns: @@ -445,10 +445,10 @@ def test_write_orc(s3_base, s3so, pdf): bname = "orc" gdf = cudf.from_pandas(pdf) with s3_context(s3_base=s3_base, bucket=bname) as s3fs: - gdf.to_orc("s3://{}/{}".format(bname, fname), storage_options=s3so) - assert s3fs.exists("s3://{}/{}".format(bname, fname)) + gdf.to_orc(f"s3://{bname}/{fname}", storage_options=s3so) + assert s3fs.exists(f"s3://{bname}/{fname}") - with s3fs.open("s3://{}/{}".format(bname, fname)) as f: + with s3fs.open(f"s3://{bname}/{fname}") as f: got = pa.orc.ORCFile(f).read().to_pandas() assert_eq(pdf, got) diff --git a/python/cudf/cudf/tests/test_sorting.py b/python/cudf/cudf/tests/test_sorting.py index 00cd31e7539..10c3689fcd7 100644 --- a/python/cudf/cudf/tests/test_sorting.py +++ b/python/cudf/cudf/tests/test_sorting.py @@ -105,7 +105,7 @@ def test_series_argsort(nelem, dtype, asc): ) def test_series_sort_index(nelem, asc): np.random.seed(0) - sr = Series((100 * np.random.random(nelem))) + sr = Series(100 * np.random.random(nelem)) psr = sr.to_pandas() expected = psr.sort_index(ascending=asc) diff --git a/python/cudf/cudf/tests/test_text.py b/python/cudf/cudf/tests/test_text.py index a447a60c709..5ff66fc750f 100644 --- a/python/cudf/cudf/tests/test_text.py +++ b/python/cudf/cudf/tests/test_text.py @@ -763,7 +763,7 @@ def test_read_text(datadir): chess_file = str(datadir) + "/chess.pgn" delimiter = "1." 
- with open(chess_file, "r") as f: + with open(chess_file) as f: content = f.read().split(delimiter) # Since Python split removes the delimiter and read_text does diff --git a/python/cudf/cudf/tests/test_transform.py b/python/cudf/cudf/tests/test_transform.py index 021c4052759..bd7ee45fbf8 100644 --- a/python/cudf/cudf/tests/test_transform.py +++ b/python/cudf/cudf/tests/test_transform.py @@ -1,6 +1,5 @@ # Copyright (c) 2018-2020, NVIDIA CORPORATION. -from __future__ import division import numpy as np import pytest diff --git a/python/cudf/cudf/tests/test_udf_binops.py b/python/cudf/cudf/tests/test_udf_binops.py index c5cd8f8b717..173515509cd 100644 --- a/python/cudf/cudf/tests/test_udf_binops.py +++ b/python/cudf/cudf/tests/test_udf_binops.py @@ -1,5 +1,4 @@ # Copyright (c) 2018, NVIDIA CORPORATION. -from __future__ import division import numpy as np import pytest diff --git a/python/cudf/cudf/tests/test_unaops.py b/python/cudf/cudf/tests/test_unaops.py index e79b74e3aab..2e8da615e3e 100644 --- a/python/cudf/cudf/tests/test_unaops.py +++ b/python/cudf/cudf/tests/test_unaops.py @@ -1,5 +1,3 @@ -from __future__ import division - import itertools import operator import re diff --git a/python/cudf/cudf/utils/applyutils.py b/python/cudf/cudf/utils/applyutils.py index 3cbbc1e1ce7..593965046e6 100644 --- a/python/cudf/cudf/utils/applyutils.py +++ b/python/cudf/cudf/utils/applyutils.py @@ -125,7 +125,7 @@ def make_aggregate_nullmask(df, columns=None, op="and"): return out_mask -class ApplyKernelCompilerBase(object): +class ApplyKernelCompilerBase: def __init__( self, func, incols, outcols, kwargs, pessimistic_nulls, cache_key ): @@ -253,7 +253,7 @@ def row_wise_kernel({args}): srcidx.format(a=a, start=start, stop=stop, stride=stride) ) - body.append("inner({})".format(args)) + body.append(f"inner({args})") indented = ["{}{}".format(" " * 4, ln) for ln in body] # Finalize source @@ -309,7 +309,7 @@ def chunk_wise_kernel(nrows, chunks, {args}): slicedargs = {} for a in 
argnames: if a not in extras: - slicedargs[a] = "{}[start:stop]".format(a) + slicedargs[a] = f"{a}[start:stop]" else: slicedargs[a] = str(a) body.append( @@ -361,4 +361,4 @@ def _load_cache_or_make_chunk_wise_kernel(func, *args, **kwargs): def _mangle_user(name): """Mangle user variable name""" - return "__user_{}".format(name) + return f"__user_{name}" diff --git a/python/cudf/cudf/utils/cudautils.py b/python/cudf/cudf/utils/cudautils.py index f0533dcaa72..742c747ab69 100755 --- a/python/cudf/cudf/utils/cudautils.py +++ b/python/cudf/cudf/utils/cudautils.py @@ -218,7 +218,7 @@ def make_cache_key(udf, sig): codebytes = udf.__code__.co_code constants = udf.__code__.co_consts if udf.__closure__ is not None: - cvars = tuple([x.cell_contents for x in udf.__closure__]) + cvars = tuple(x.cell_contents for x in udf.__closure__) cvarbytes = dumps(cvars) else: cvarbytes = b"" diff --git a/python/cudf/cudf/utils/dtypes.py b/python/cudf/cudf/utils/dtypes.py index 44bbb1b493d..4cd1738996f 100644 --- a/python/cudf/cudf/utils/dtypes.py +++ b/python/cudf/cudf/utils/dtypes.py @@ -160,8 +160,8 @@ def numeric_normalize_types(*args): def _find_common_type_decimal(dtypes): # Find the largest scale and the largest difference between # precision and scale of the columns to be concatenated - s = max([dtype.scale for dtype in dtypes]) - lhs = max([dtype.precision - dtype.scale for dtype in dtypes]) + s = max(dtype.scale for dtype in dtypes) + lhs = max(dtype.precision - dtype.scale for dtype in dtypes) # Combine to get the necessary precision and clip at the maximum # precision p = s + lhs @@ -525,7 +525,7 @@ def find_common_type(dtypes): ) for dtype in dtypes ): - if len(set(dtype._categories.dtype for dtype in dtypes)) == 1: + if len({dtype._categories.dtype for dtype in dtypes}) == 1: return cudf.CategoricalDtype( cudf.core.column.concat_columns( [dtype._categories for dtype in dtypes] diff --git a/python/cudf/cudf/utils/hash_vocab_utils.py b/python/cudf/cudf/utils/hash_vocab_utils.py 
index 45004c5f107..11029cbfe5e 100644 --- a/python/cudf/cudf/utils/hash_vocab_utils.py +++ b/python/cudf/cudf/utils/hash_vocab_utils.py @@ -79,10 +79,8 @@ def _pick_initial_a_b(data, max_constant, init_bins): longest = _new_bin_length(_longest_bin_length(bins)) if score <= max_constant and longest <= MAX_SIZE_FOR_INITIAL_BIN: - print( - "Attempting to build table using {:.6f}n space".format(score) - ) - print("Longest bin was {}".format(longest)) + print(f"Attempting to build table using {score:.6f}n space") + print(f"Longest bin was {longest}") break return bins, a, b @@ -170,7 +168,7 @@ def _pack_keys_and_values(flattened_hash_table, original_dict): def _load_vocab_dict(path): vocab = {} - with open(path, mode="r", encoding="utf-8") as f: + with open(path, encoding="utf-8") as f: counter = 0 for line in f: vocab[line.strip()] = counter @@ -193,17 +191,17 @@ def _store_func( ): with open(out_name, mode="w+") as f: - f.write("{}\n".format(outer_a)) - f.write("{}\n".format(outer_b)) - f.write("{}\n".format(num_outer_bins)) + f.write(f"{outer_a}\n") + f.write(f"{outer_b}\n") + f.write(f"{num_outer_bins}\n") f.writelines( - "{} {}\n".format(coeff, offset) + f"{coeff} {offset}\n" for coeff, offset in zip(inner_table_coeffs, offsets_into_ht) ) - f.write("{}\n".format(len(hash_table))) - f.writelines("{}\n".format(kv) for kv in hash_table) + f.write(f"{len(hash_table)}\n") + f.writelines(f"{kv}\n" for kv in hash_table) f.writelines( - "{}\n".format(tok_id) + f"{tok_id}\n" for tok_id in [unk_tok_id, first_token_id, sep_token_id] ) @@ -295,6 +293,6 @@ def hash_vocab( ) assert ( val == value - ), "Incorrect value found. Got {} expected {}".format(val, value) + ), f"Incorrect value found. 
Got {val} expected {value}" print("All present tokens return correct value.") diff --git a/python/cudf/cudf/utils/queryutils.py b/python/cudf/cudf/utils/queryutils.py index d9153c2b1d2..64218ddf46a 100644 --- a/python/cudf/cudf/utils/queryutils.py +++ b/python/cudf/cudf/utils/queryutils.py @@ -136,7 +136,7 @@ def query_compile(expr): key "args" is a sequence of name of the arguments. """ - funcid = "queryexpr_{:x}".format(np.uintp(hash(expr))) + funcid = f"queryexpr_{np.uintp(hash(expr)):x}" # Load cache compiled = _cache.get(funcid) # Cache not found @@ -147,7 +147,7 @@ def query_compile(expr): # compile devicefn = cuda.jit(device=True)(fn) - kernelid = "kernel_{}".format(funcid) + kernelid = f"kernel_{funcid}" kernel = _wrap_query_expr(kernelid, devicefn, args) compiled = info.copy() @@ -173,10 +173,10 @@ def _add_idx(arg): if arg.startswith(ENVREF_PREFIX): return arg else: - return "{}[idx]".format(arg) + return f"{arg}[idx]" def _add_prefix(arg): - return "_args_{}".format(arg) + return f"_args_{arg}" glbls = {"queryfn": fn, "cuda": cuda} kernargs = map(_add_prefix, args) diff --git a/python/cudf/cudf/utils/utils.py b/python/cudf/cudf/utils/utils.py index add4ecd8f01..65a803d6768 100644 --- a/python/cudf/cudf/utils/utils.py +++ b/python/cudf/cudf/utils/utils.py @@ -204,12 +204,13 @@ def __getattr__(self, key): ) -def raise_iteration_error(obj): - raise TypeError( - f"{obj.__class__.__name__} object is not iterable. " - f"Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` " - f"if you wish to iterate over the values." - ) +class NotIterable: + def __iter__(self): + raise TypeError( + f"{self.__class__.__name__} object is not iterable. " + f"Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` " + f"if you wish to iterate over the values." 
+ ) def pa_mask_buffer_to_mask(mask_buf, size): diff --git a/python/cudf/setup.py b/python/cudf/setup.py index a8e14504469..e4e43bc1595 100644 --- a/python/cudf/setup.py +++ b/python/cudf/setup.py @@ -63,9 +63,7 @@ def get_cuda_version_from_header(cuda_include_dir, delimeter=""): cuda_version = None - with open( - os.path.join(cuda_include_dir, "cuda.h"), "r", encoding="utf-8" - ) as f: + with open(os.path.join(cuda_include_dir, "cuda.h"), encoding="utf-8") as f: for line in f.readlines(): if re.search(r"#define CUDA_VERSION ", line) is not None: cuda_version = line diff --git a/python/cudf_kafka/cudf_kafka/_version.py b/python/cudf_kafka/cudf_kafka/_version.py index 5ab5c72e457..6cd10cc10bf 100644 --- a/python/cudf_kafka/cudf_kafka/_version.py +++ b/python/cudf_kafka/cudf_kafka/_version.py @@ -86,7 +86,7 @@ def run_command( stderr=(subprocess.PIPE if hide_stderr else None), ) break - except EnvironmentError: + except OSError: e = sys.exc_info()[1] if e.errno == errno.ENOENT: continue @@ -96,7 +96,7 @@ def run_command( return None, None else: if verbose: - print("unable to find command, tried %s" % (commands,)) + print(f"unable to find command, tried {commands}") return None, None stdout = p.communicate()[0].strip() if sys.version_info[0] >= 3: @@ -149,7 +149,7 @@ def git_get_keywords(versionfile_abs): # _version.py. 
keywords = {} try: - f = open(versionfile_abs, "r") + f = open(versionfile_abs) for line in f.readlines(): if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) @@ -164,7 +164,7 @@ def git_get_keywords(versionfile_abs): if mo: keywords["date"] = mo.group(1) f.close() - except EnvironmentError: + except OSError: pass return keywords @@ -188,11 +188,11 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") - refs = set([r.strip() for r in refnames.strip("()").split(",")]) + refs = {r.strip() for r in refnames.strip("()").split(",")} # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " - tags = set([r[len(TAG) :] for r in refs if r.startswith(TAG)]) + tags = {r[len(TAG) :] for r in refs if r.startswith(TAG)} if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. The old git %d @@ -201,7 +201,7 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". 
- tags = set([r for r in refs if re.search(r"\d", r)]) + tags = {r for r in refs if re.search(r"\d", r)} if verbose: print("discarding '%s', no digits" % ",".join(refs - tags)) if verbose: @@ -308,10 +308,9 @@ def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) - pieces["error"] = "tag '%s' doesn't start with prefix '%s'" % ( - full_tag, - tag_prefix, - ) + pieces[ + "error" + ] = f"tag '{full_tag}' doesn't start with prefix '{tag_prefix}'" return pieces pieces["closest-tag"] = full_tag[len(tag_prefix) :] diff --git a/python/cudf_kafka/versioneer.py b/python/cudf_kafka/versioneer.py index 2260d5c2dcf..c7dbfd76734 100644 --- a/python/cudf_kafka/versioneer.py +++ b/python/cudf_kafka/versioneer.py @@ -275,7 +275,6 @@ """ -from __future__ import print_function import errno import json @@ -345,7 +344,7 @@ def get_config_from_root(root): # the top of versioneer.py for instructions on writing your setup.cfg . setup_cfg = os.path.join(root, "setup.cfg") parser = configparser.SafeConfigParser() - with open(setup_cfg, "r") as f: + with open(setup_cfg) as f: parser.readfp(f) VCS = parser.get("versioneer", "VCS") # mandatory @@ -407,7 +406,7 @@ def run_command( stderr=(subprocess.PIPE if hide_stderr else None), ) break - except EnvironmentError: + except OSError: e = sys.exc_info()[1] if e.errno == errno.ENOENT: continue @@ -417,7 +416,7 @@ def run_command( return None, None else: if verbose: - print("unable to find command, tried %s" % (commands,)) + print(f"unable to find command, tried {commands}") return None, None stdout = p.communicate()[0].strip() if sys.version_info[0] >= 3: @@ -964,7 +963,7 @@ def git_get_keywords(versionfile_abs): # _version.py. 
keywords = {} try: - f = open(versionfile_abs, "r") + f = open(versionfile_abs) for line in f.readlines(): if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) @@ -979,7 +978,7 @@ def git_get_keywords(versionfile_abs): if mo: keywords["date"] = mo.group(1) f.close() - except EnvironmentError: + except OSError: pass return keywords @@ -1003,11 +1002,11 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") - refs = set([r.strip() for r in refnames.strip("()").split(",")]) + refs = {r.strip() for r in refnames.strip("()").split(",")} # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " - tags = set([r[len(TAG) :] for r in refs if r.startswith(TAG)]) + tags = {r[len(TAG) :] for r in refs if r.startswith(TAG)} if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. The old git %d @@ -1016,7 +1015,7 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". 
- tags = set([r for r in refs if re.search(r"\d", r)]) + tags = {r for r in refs if re.search(r"\d", r)} if verbose: print("discarding '%s', no digits" % ",".join(refs - tags)) if verbose: @@ -1123,9 +1122,8 @@ def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) - pieces["error"] = "tag '%s' doesn't start with prefix '%s'" % ( - full_tag, - tag_prefix, + pieces["error"] = "tag '{}' doesn't start with prefix '{}'".format( + full_tag, tag_prefix, ) return pieces pieces["closest-tag"] = full_tag[len(tag_prefix) :] @@ -1175,13 +1173,13 @@ def do_vcs_install(manifest_in, versionfile_source, ipy): files.append(versioneer_file) present = False try: - f = open(".gitattributes", "r") + f = open(".gitattributes") for line in f.readlines(): if line.strip().startswith(versionfile_source): if "export-subst" in line.strip().split()[1:]: present = True f.close() - except EnvironmentError: + except OSError: pass if not present: f = open(".gitattributes", "a+") @@ -1245,7 +1243,7 @@ def versions_from_file(filename): try: with open(filename) as f: contents = f.read() - except EnvironmentError: + except OSError: raise NotThisMethod("unable to read _version.py") mo = re.search( r"version_json = '''\n(.*)''' # END VERSION_JSON", @@ -1272,7 +1270,7 @@ def write_to_version_file(filename, versions): with open(filename, "w") as f: f.write(SHORT_VERSION_PY % contents) - print("set %s to '%s'" % (filename, versions["version"])) + print("set {} to '{}'".format(filename, versions["version"])) def plus_or_dot(pieces): @@ -1497,7 +1495,7 @@ def get_versions(verbose=False): try: ver = versions_from_file(versionfile_abs) if verbose: - print("got version from file %s %s" % (versionfile_abs, ver)) + print(f"got version from file {versionfile_abs} {ver}") return ver except NotThisMethod: pass @@ -1773,7 +1771,7 @@ def do_setup(): try: cfg = get_config_from_root(root) except ( - 
EnvironmentError, + OSError, configparser.NoSectionError, configparser.NoOptionError, ) as e: @@ -1803,9 +1801,9 @@ def do_setup(): ipy = os.path.join(os.path.dirname(cfg.versionfile_source), "__init__.py") if os.path.exists(ipy): try: - with open(ipy, "r") as f: + with open(ipy) as f: old = f.read() - except EnvironmentError: + except OSError: old = "" if INIT_PY_SNIPPET not in old: print(" appending to %s" % ipy) @@ -1824,12 +1822,12 @@ def do_setup(): manifest_in = os.path.join(root, "MANIFEST.in") simple_includes = set() try: - with open(manifest_in, "r") as f: + with open(manifest_in) as f: for line in f: if line.startswith("include "): for include in line.split()[1:]: simple_includes.add(include) - except EnvironmentError: + except OSError: pass # That doesn't cover everything MANIFEST.in can do # (http://docs.python.org/2/distutils/sourcedist.html#commands), so @@ -1863,7 +1861,7 @@ def scan_setup_py(): found = set() setters = False errors = 0 - with open("setup.py", "r") as f: + with open("setup.py") as f: for line in f.readlines(): if "import versioneer" in line: found.add("import") diff --git a/python/custreamz/custreamz/_version.py b/python/custreamz/custreamz/_version.py index a3409a06953..106fc3524f9 100644 --- a/python/custreamz/custreamz/_version.py +++ b/python/custreamz/custreamz/_version.py @@ -86,7 +86,7 @@ def run_command( stderr=(subprocess.PIPE if hide_stderr else None), ) break - except EnvironmentError: + except OSError: e = sys.exc_info()[1] if e.errno == errno.ENOENT: continue @@ -96,7 +96,7 @@ def run_command( return None, None else: if verbose: - print("unable to find command, tried %s" % (commands,)) + print(f"unable to find command, tried {commands}") return None, None stdout = p.communicate()[0].strip() if sys.version_info[0] >= 3: @@ -149,7 +149,7 @@ def git_get_keywords(versionfile_abs): # _version.py. 
keywords = {} try: - f = open(versionfile_abs, "r") + f = open(versionfile_abs) for line in f.readlines(): if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) @@ -164,7 +164,7 @@ def git_get_keywords(versionfile_abs): if mo: keywords["date"] = mo.group(1) f.close() - except EnvironmentError: + except OSError: pass return keywords @@ -188,11 +188,11 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") - refs = set([r.strip() for r in refnames.strip("()").split(",")]) + refs = {r.strip() for r in refnames.strip("()").split(",")} # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " - tags = set([r[len(TAG) :] for r in refs if r.startswith(TAG)]) + tags = {r[len(TAG) :] for r in refs if r.startswith(TAG)} if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. The old git %d @@ -201,7 +201,7 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". 
- tags = set([r for r in refs if re.search(r"\d", r)]) + tags = {r for r in refs if re.search(r"\d", r)} if verbose: print("discarding '%s', no digits" % ",".join(refs - tags)) if verbose: @@ -308,10 +308,9 @@ def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) - pieces["error"] = "tag '%s' doesn't start with prefix '%s'" % ( - full_tag, - tag_prefix, - ) + pieces[ + "error" + ] = f"tag '{full_tag}' doesn't start with prefix '{tag_prefix}'" return pieces pieces["closest-tag"] = full_tag[len(tag_prefix) :] diff --git a/python/custreamz/custreamz/tests/test_dataframes.py b/python/custreamz/custreamz/tests/test_dataframes.py index 24f6e46f6c5..a7378408c24 100644 --- a/python/custreamz/custreamz/tests/test_dataframes.py +++ b/python/custreamz/custreamz/tests/test_dataframes.py @@ -4,7 +4,6 @@ Tests for Streamz Dataframes (SDFs) built on top of cuDF DataFrames. *** Borrowed from streamz.dataframe.tests | License at thirdparty/LICENSE *** """ -from __future__ import division, print_function import json import operator diff --git a/python/dask_cudf/dask_cudf/_version.py b/python/dask_cudf/dask_cudf/_version.py index 8ca2cf98381..104879fce36 100644 --- a/python/dask_cudf/dask_cudf/_version.py +++ b/python/dask_cudf/dask_cudf/_version.py @@ -86,7 +86,7 @@ def run_command( stderr=(subprocess.PIPE if hide_stderr else None), ) break - except EnvironmentError: + except OSError: e = sys.exc_info()[1] if e.errno == errno.ENOENT: continue @@ -96,7 +96,7 @@ def run_command( return None, None else: if verbose: - print("unable to find command, tried %s" % (commands,)) + print(f"unable to find command, tried {commands}") return None, None stdout = p.communicate()[0].strip() if sys.version_info[0] >= 3: @@ -149,7 +149,7 @@ def git_get_keywords(versionfile_abs): # _version.py. 
keywords = {} try: - f = open(versionfile_abs, "r") + f = open(versionfile_abs) for line in f.readlines(): if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) @@ -164,7 +164,7 @@ def git_get_keywords(versionfile_abs): if mo: keywords["date"] = mo.group(1) f.close() - except EnvironmentError: + except OSError: pass return keywords @@ -188,11 +188,11 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") - refs = set([r.strip() for r in refnames.strip("()").split(",")]) + refs = {r.strip() for r in refnames.strip("()").split(",")} # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " - tags = set([r[len(TAG) :] for r in refs if r.startswith(TAG)]) + tags = {r[len(TAG) :] for r in refs if r.startswith(TAG)} if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. The old git %d @@ -201,7 +201,7 @@ def git_versions_from_keywords(keywords, tag_prefix, verbose): # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". 
- tags = set([r for r in refs if re.search(r"\d", r)]) + tags = {r for r in refs if re.search(r"\d", r)} if verbose: print("discarding '%s', no digits" % ",".join(refs - tags)) if verbose: @@ -308,10 +308,9 @@ def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) - pieces["error"] = "tag '%s' doesn't start with prefix '%s'" % ( - full_tag, - tag_prefix, - ) + pieces[ + "error" + ] = f"tag '{full_tag}' doesn't start with prefix '{tag_prefix}'" return pieces pieces["closest-tag"] = full_tag[len(tag_prefix) :] diff --git a/python/dask_cudf/dask_cudf/core.py b/python/dask_cudf/dask_cudf/core.py index e191873f82b..729db6c232d 100644 --- a/python/dask_cudf/dask_cudf/core.py +++ b/python/dask_cudf/dask_cudf/core.py @@ -516,7 +516,7 @@ def _extract_meta(x): elif isinstance(x, list): return [_extract_meta(_x) for _x in x] elif isinstance(x, tuple): - return tuple([_extract_meta(_x) for _x in x]) + return tuple(_extract_meta(_x) for _x in x) elif isinstance(x, dict): return {k: _extract_meta(v) for k, v in x.items()} return x @@ -611,9 +611,7 @@ def reduction( if not isinstance(args, (tuple, list)): args = [args] - npartitions = set( - arg.npartitions for arg in args if isinstance(arg, _Frame) - ) + npartitions = {arg.npartitions for arg in args if isinstance(arg, _Frame)} if len(npartitions) > 1: raise ValueError("All arguments must have same number of partitions") npartitions = npartitions.pop() @@ -636,7 +634,7 @@ def reduction( ) # Chunk - a = "{0}-chunk-{1}".format(token or funcname(chunk), token_key) + a = f"{token or funcname(chunk)}-chunk-{token_key}" if len(args) == 1 and isinstance(args[0], _Frame) and not chunk_kwargs: dsk = { (a, 0, i): (chunk, key) @@ -654,7 +652,7 @@ def reduction( } # Combine - b = "{0}-combine-{1}".format(token or funcname(combine), token_key) + b = f"{token or funcname(combine)}-combine-{token_key}" k = npartitions depth = 0 
while k > split_every: @@ -670,7 +668,7 @@ depth += 1 # Aggregate - b = "{0}-agg-{1}".format(token or funcname(aggregate), token_key) + b = f"{token or funcname(aggregate)}-agg-{token_key}" conc = (list, [(a, depth, i) for i in range(k)]) if aggregate_kwargs: dsk[(b, 0)] = (apply, aggregate, [conc], aggregate_kwargs) diff --git a/python/dask_cudf/dask_cudf/io/orc.py b/python/dask_cudf/dask_cudf/io/orc.py index 00fc197da9b..2d326e41c3e 100644 --- a/python/dask_cudf/dask_cudf/io/orc.py +++ b/python/dask_cudf/dask_cudf/io/orc.py @@ -79,7 +79,7 @@ def read_orc(path, columns=None, filters=None, storage_options=None, **kwargs): ex = set(columns) - set(schema) if ex: raise ValueError( - "Requested columns (%s) not in schema (%s)" % (ex, set(schema)) + f"Requested columns ({ex}) not in schema ({set(schema)})" ) else: columns = list(schema) diff --git a/python/dask_cudf/dask_cudf/io/tests/test_parquet.py b/python/dask_cudf/dask_cudf/io/tests/test_parquet.py index 706b0e272ea..f5c1e53258e 100644 --- a/python/dask_cudf/dask_cudf/io/tests/test_parquet.py +++ b/python/dask_cudf/dask_cudf/io/tests/test_parquet.py @@ -40,12 +40,7 @@ def test_roundtrip_from_dask(tmpdir, stats): tmpdir = str(tmpdir) ddf.to_parquet(tmpdir, engine="pyarrow") files = sorted( - [ - os.path.join(tmpdir, f) - for f in os.listdir(tmpdir) - # TODO: Allow "_metadata" in list after dask#6047 - if not f.endswith("_metadata") - ], + (os.path.join(tmpdir, f) for f in os.listdir(tmpdir)), key=natural_sort_key, ) diff --git a/python/dask_cudf/setup.py b/python/dask_cudf/setup.py index 39491a45e7e..635f21fd906 100644 --- a/python/dask_cudf/setup.py +++ b/python/dask_cudf/setup.py @@ -33,9 +33,7 @@ def get_cuda_version_from_header(cuda_include_dir, delimeter=""): cuda_version = None - with open( - os.path.join(cuda_include_dir, "cuda.h"), "r", encoding="utf-8" - ) as f: + with open(os.path.join(cuda_include_dir, "cuda.h"), encoding="utf-8") as f: for line in f.readlines(): if re.search(r"#define 
CUDA_VERSION ", line) is not None: cuda_version = line
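One of the changes above (in `cudf/utils/utils.py`) replaces the module-level `raise_iteration_error` helper with a `NotIterable` mixin class. A minimal sketch of how that mixin behaves; the mixin body is taken from the diff, while `FakeSeries` is a hypothetical consumer used here only for illustration (cudf's real `Series`/`Index` classes inherit it alongside their other bases):

```python
class NotIterable:
    """Mixin that makes iteration fail loudly instead of silently iterating."""

    def __iter__(self):
        raise TypeError(
            f"{self.__class__.__name__} object is not iterable. "
            f"Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` "
            f"if you wish to iterate over the values."
        )


# Hypothetical consumer: any class mixing in NotIterable raises on `for`/`iter`.
class FakeSeries(NotIterable):
    pass


try:
    list(FakeSeries())
except TypeError as exc:
    print(exc)
```

Because `__iter__` raises, the error names the concrete subclass (`FakeSeries` here), so each inheriting class gets an accurate message without duplicating the helper.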