
Releases: dathere/qsv

3.1.1

24 Feb 16:41

[3.1.1] - 2025-02-24

Highlights:

  • sample: is now a "smart" command that uses the stats cache to validate and make sampling faster.
  • With the QSV_STATSCACHE_MODE env var, you can now control the stats cache behavior suite-wide, making sure "smart" commands use it when appropriate.
  • luau command's capabilities have been significantly expanded with:
    • New accumulate helper function for aggregating values across rows
    • Optional naming for cumulative helper functions
    • More robust error handling and improved docstrings
    • Enhanced scripting performance with fast-float parsing
    • new Wiki section with examples of using its helper functions
  • schema: now does type-aware sorting of enum lists, making JSON Schema enum list customization easier when fine-tuning it for JSON Schema validation with validate.
  • lens: adds --freeze-columns option with a default of 1, improving navigation of wide CSVs
  • stats: adds --dataset-stats option to explicitly compute dataset-level statistics. Starting with qsv 2.0.0, it was computed automatically to support Datapusher+ and the DRUF workflow, but it was causing confusion with some command-line users.
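To illustrate why type-aware sorting of enum lists matters, here is a minimal sketch, not qsv's actual implementation. The function name `sort_enum` and the type labels "Integer" and "Float" are assumptions made for this example.

```python
def sort_enum(values, col_type):
    """Sort enum values by the column's inferred type instead of as plain strings.

    `col_type` is a hypothetical label for the inferred column type.
    """
    if col_type == "Integer":
        return sorted(values, key=int)
    if col_type == "Float":
        return sorted(values, key=float)
    # Fall back to lexicographic order for everything else.
    return sorted(values)


codes = ["10", "9", "100", "1"]
print(sorted(codes))                # ['1', '10', '100', '9']
print(sort_enum(codes, "Integer"))  # ['1', '9', '10', '100']
```

With a plain string sort, "9" lands after "100", which makes a hand-edited enum list harder to scan.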

Added

Changed

  • frequency: made error handling more robust b195519
  • luau: refactored all cumulative helper functions (cum_) to take name as an optional argument #2540
  • schema: refactored to use QSV_STATSCACHE_MODE env var 5771ff4
  • select: refactored select helper bfbe64c
  • stats: optimized memory layout of central Stats struct 52f697e
  • stats: optimized record_count functionality 0e3114a 18791da
  • contrib(completions): update qsv completions for qsv 3.1 by @rzmk in #2556
  • deps: bump arrow and tempfile 4cc2679
  • deps: bump cached and redis crates e622d14
  • deps: bump csvlens from 0.11 to 0.12 b2fd985
  • deps: use our patched fork of csvlens with ability to freeze columns d66ec6d
  • deps: bump polars to 0.46.0 at py-1.23.0 tag 6072aa2
  • deps: bump flate2 from 1.0.35 to 1.1.0 eed471a
  • deps: bump gzp from 0.11 to 1.0.0 43c8a4a
  • build(deps): bump jaq-json from 1.1.0 to 1.1.1 by @dependabot in #2547
  • build(deps): bump jaq-core from 2.1.0 to 2.1.1 by @dependabot in #2546
  • build(deps): bump log from 0.4.25 to 0.4.26 by @dependabot in #2545
  • build(deps): bump tempfile from 3.16.0 to 3.17.0 by @dependabot in #2532
  • build(deps): bump tempfile from 3.17.0 to 3.17.1 by @dependabot in #2535
  • build(deps): bump serde_json from 1.0.138 to 1.0.139 by @dependabot in #2541
  • build(deps): bump serde from 1.0.217 to 1.0.218 by @dependabot in #2542
  • build(deps): bump smallvec from 1.13.2 to 1.14.0 by @dependabot in #2528
  • build(deps): bump strum from 0.27.0 to 0.27.1 by @dependabot in #2533
  • build(deps): bump strum_macros from 0.27.0 to 0.27.1 by @dependabot in #2534
  • build(deps): bump uuid from 1.13.1 to 1.13.2 by @dependabot in #2538
  • build(deps): bump uuid from 1.13.2 to 1.14.0 by @dependabot in #2544
  • chore: we now have ~1,800 tests! f5d09ed
  • applied select clippy lint suggestions
  • bumped indirect dependencies to latest versions
  • bumped MSRV to latest Rust stable - v1.85

Fixed

  • count: refactored to fall back to "regular" CSV reader when Polars counting returns a zero count fd39bcb
  • schema: fixed off-by-one error 60de090
  • ensured get_stats_record helper returns field/stats correctly ad86a37
  • Fixed RUSTSEC-2025-0007: ring is unmaintained #2548
  • stats: only add qsv__value column when --dataset-stats is enabled 64267d3
  • skip format check when path starts with temp dir (indicating it's a file streamed from STDIN) or is a snappy file ff8957e

Removed

  • frequency: removed --stats-mode option now that we have a suite-wide QSV_STATSCACHE_MODE env var ba75f08 416abb7
  • chore: removed simdutf8 conditional directive for aarch64 architecture, now that it's no longer needed ec1e16c
  • removed publish-linux-qsvpy-glibc-231-musl-123.yml workflow as it was getting cross compilation errors and we have another musl workflow that works 7c08617

Full Changelog: 3.0.0...3.1.1

3.0.0

13 Feb 17:14

[3.0.0] - 2025-02-13

Highlights:

  • sample: Five new sampling methods! In addition to reservoir & indexed - added bernoulli, systematic, stratified, weighted & cluster sampling. And they're all memory efficient so you should be able to sample arbitrarily large datasets!
  • stats: Added "sortiness" [-1 (Descending) to 1 (Ascending)] & "uniqueness_ratio" [0 (many repeated values) to 1 (All unique values)] stats (more info).
    The qsv-stats engine was also optimized to squeeze out more performance, with stats now 2.6x faster while using less memory despite the addition of these new stats.
  • diff: is now a "smart" command, so that it uses the stats cache to short-circuit diffs if files are identical per their fingerprint hashes, and to validate that the diff key column is all unique.
  • The stats cache has been refactored, improving performance for "smart" commands:
    • frequency is not only 3.3x faster, it uses far less memory as it now doesn't need to maintain hashmaps for columns with all unique values.
    • tojsonl is 2.25x faster
    • schema is 1.4x faster
  • luau got a major performance boost with the v0.660 engine upgrade, taking advantage of several compiler optimizations. luau is now up to 3.1x faster!
  • validate had a major performance regression, going from 3.295 seconds in v2.1.0 to 13.159 seconds in v2.2.1 in the benchmarks. 4x slower! With the jsonschema 0.29 crate update, validate now clocks in at 3.022 seconds!
  • template also got a big boost and is now 2.9x faster with the minijinja 2.7 crate update.
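The two new stats can be pictured with a small sketch. The definitions below are assumptions chosen to match the stated ranges (adjacent-pair comparison for sortiness, distinct count over total count for uniqueness_ratio); the qsv-stats engine may compute them differently.

```python
def sortiness(values):
    """Return a score in [-1, 1]: 1 if strictly ascending, -1 if strictly descending.

    Assumed definition: (ascending pairs - descending pairs) / adjacent pairs.
    """
    pairs = list(zip(values, values[1:]))
    if not pairs:
        return 0.0
    asc = sum(1 for a, b in pairs if a < b)
    desc = sum(1 for a, b in pairs if a > b)
    return (asc - desc) / len(pairs)


def uniqueness_ratio(values):
    """Return distinct values / total values (assumed definition)."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


print(sortiness([1, 2, 3, 4]))                 # 1.0
print(sortiness([4, 3, 2, 1]))                 # -1.0
print(uniqueness_ratio(["a", "a", "b", "b"]))  # 0.5
```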

Added

  • joinp: additional joinp asof join sort and match options #2486
  • stats: add "sortiness" statistic #2499
  • stats: add uniqueness_ratio #2521
  • stats & frequency: add --vis-whitespace option. Fulfills #2501 #2503
  • sample: add more sampling methods (in addition to indexed and reservoir - added bernoulli, systematic, stratified, weighted & cluster sampling) and made them all memory efficient so we can sample arbitrarily large datasets: #2507 & #2511
  • diff: make diff a "smart" command. Fulfills #2493 and #2509 #2518
  • benchmarks: added new benchmarks for sample for new sampling methods d758c54
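As a rough illustration of how streaming sampling stays memory efficient, the sketch below shows textbook versions of three of the methods named above. It is not qsv's code; the function names and parameters are assumptions for this example.

```python
import random


def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p; O(1) memory."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < p:
            yield row


def systematic_sample(rows, step, start=0):
    """Keep every `step`-th row beginning at index `start`; O(1) memory."""
    for i, row in enumerate(rows):
        if i >= start and (i - start) % step == 0:
            yield row


def reservoir_sample(rows, k, seed=None):
    """Uniform sample of k rows from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


print(list(systematic_sample(range(10), 3)))  # [0, 3, 6, 9]
```

Each function reads the input once and holds at most k rows, which is why the input size does not bound what can be sampled.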

Changed

  • luau: bump from 0.653 to 0.660 and optimize for performance 4402df6 de429b4 07ff8b8 3211f5c
  • stats: compute string len stats only for string columns #2495
  • contrib(completions): update qsv completions for qsv 2.2.1 by @rzmk in #2494
  • deps: bump polars to latest upstream after its py-1.22.0 release
  • deps: backported csv-core 0.1.12 fix to our qsv-optimized csv-core fork dathere/rust-csv@5d0916e
  • build(deps): bump actions/setup-python from 5.3.0 to 5.4.0 by @dependabot in #2488
  • build(deps): bump bytes from 1.9.0 to 1.10.0 by @dependabot in #2497
  • build(deps): bump data-encoding from 2.7.0 to 2.8.0 by @dependabot in #2512
  • build(deps): bump geosuggest-core from 0.6.5 to 0.6.6 by @dependabot in #2520
  • build(deps): bump geosuggest-utils from 0.6.5 to 0.6.6 by @dependabot in #2519
  • build(deps): bump jsonschema from 0.28.3 to 0.29.0 by @dependabot in #2510
  • build(deps): bump minijinja from 2.6.0 to 2.7.0 by @dependabot in #2489
  • build(deps): bump mlua from 0.10.2 to 0.10.3 by @dependabot in #2485
  • build(deps): bump qsv-stats from 0.27.0 to 0.28.0 by @dependabot in #2496
  • build(deps): bump qsv-stats from 0.28.0 to 0.29.0 by @dependabot in #2498
  • build(deps): bump qsv-stats from 0.29.0 to 0.30.0 by @dependabot in #2505
  • chore: Bump rand to 0.9 #2504
  • build(deps): bump simple-home-dir from 0.4.6 to 0.4.7 by @dependabot in #2515
  • build(deps): bump uuid from 1.12.1 to 1.13.1 by @dependabot in #2500
  • bumped numerous indirect dependencies to latest versions
  • applied select clippy lint suggestions
  • bumped MSRV to latest Rust stable - v1.84.1

Fixed

  • docs: QSV_AUTOINDEX => QSV_AUTOINDEX_SIZE typo. Fixes #2479 #2484
  • fix: search & searchset off by 1 when using --flag option. Fixes #2508 #2513

Full Changelog: 2.2.1...3.0.0

2.2.1

27 Jan 02:03
bea7973

[2.2.1] - 2025-01-27

Changed

  • deps: bumped polars to 0.46.0. This will allow us to publish qsv to crates.io as qsv was using features that were not enabled in polars 0.45.1 275b2b8

Fixed

  • stats: fix cache json processing bug. Fixes #2476 #2477
  • benchmarks: v6.1.0 - ensured all stats cache benchmarks actually used the stats cache even if the default --cache-threshold is 5 seconds - too high to trigger stats cache creation ac33010

Full Changelog: 2.2.0...2.2.1

2.2.0

26 Jan 15:12
8b394ff

[2.2.0] - 2025-01-26

Highlights:

  • stats - the ❤️ of qsv, got a little tune-up:
    • It got a tad faster now that we only compute string length stats for string types. Previously, we were also computing length for numbers, thinking it'll be useful for storage sizing purposes (as everything is stored as string with CSV). But as performance is goal number 1, we're no longer doing so. Besides, this sizing info can be derived using other stats.
    • Fixed the problem with the stats cache being deleted/ignored even when not necessary.
      This bug snuck in while implementing the --cache-threshold cache suppression option. With stats getting its cache mojo back - expect near-instant cache-backed response not only for stats but also other "automagical" smart commands 🪄.
  • diff - @janriemer squashed some bugs without sacrificing diff's ludicrous speed! 😉
  • validate: added dynamicEnum custom JSON Schema keyword column specifier support.
    You can now specify which column to validate against (by name or by 0-based column index), instead of always using the first column. This works for local & remote lookup files using the http/s://, ckan:// and dathere:// URL schemes.
  • extdedup now actually uses a proper on-disk hash table backed by a memory-mapped file.
    Previously, it was only deduping in-memory as the odht crate was not properly wired to a memory-mapped file 🤦 (I took the name of the odht crate literally and thought it was handling it 🤷). Thanks for the detailed bug report @Svenskunganka!
  • JSON query parsing overhaul.
    The fetch, fetchpost & json commands now use the latest jaq engine, making for faster performance especially now that we're precompiling and caching the jaq filter.
  • Polars engine upgraded. 🐻‍❄️
    By two versions! py-polars 1.20.0 and 1.21.0 - giving the sqlp, joinp, pivotp & count commands a little boost. 🚀

NOTE: qsv v2.2.0 is not available on crates.io as it does not allow enabling unreleased features as we await a new version of Polars. As soon as Polars 0.46.0 is published, a new qsv patch release will be published to crates.io.
This means that installation option 3 using cargo install will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.2.0 still work.


Added

  • diff: add --delimiter "convenience" option. Fulfills #2447 #2464
  • slice: add stdin and snappy compressed file support ab34a62
  • validate: add dynamicEnum column specifier support. Fulfills #2470 #2472

Changed

  • fetch, fetchpost & json: jaq dependency upgrade - from jaq-interpret & jaq-parse to jaq-core/jaq-json/jaq-std #2458
  • fetch & fetchpost: cache compiled jaq filter #2467
  • joinp: adjust asofby test to reflect Polars py-1.20.0 behavior 853a266
  • stats: compute string length stats for string type only #2471
  • sqlp: wordsmith fastpath explanation 4e3f853
  • refactor: standardize -q and -Q shortcut options. Fulfills #2466 #2468
  • deps: bump polars to 0.45.1 at py-polars-1.20.0 tag #2448
  • deps: bump polars to 0.45.1 at py-polars-1.21.0 tag 4525d00
  • deps: Bump csv-diff to 0.1.1 by @janriemer in #2456
  • deps: Bump csvlens to latest upstream 27a723e
  • deps: use latest strum upstream 2ca1b0d
  • build(deps): bump base62 from 2.2.0 to 2.2.1 by @dependabot in #2440
  • build(deps): bump chrono-tz from 0.10.0 to 0.10.1 by @dependabot in #2449
  • build(deps): bump data-encoding from 2.6.0 to 2.7.0 by @dependabot in #2444
  • build(deps): bump indexmap from 2.7.0 to 2.7.1 by @dependabot in #2461
  • build(deps): bump jsonschema from 0.28.1 to 0.28.2 by @dependabot in #2469
  • build(deps): bump jsonschema from 0.28.2 to 0.28.3 by @dependabot in #2473
  • build(deps): bump log from 0.4.22 to 0.4.25 by @dependabot in #2439
  • build(deps): bump semver from 1.0.24 to 1.0.25 by @dependabot in #2459
  • build(deps): bump serde_json from 1.0.135 to 1.0.136 by @dependabot in #2455
  • build(deps): bump serde_json from 1.0.136 to 1.0.137 by @dependabot in #2460
  • build(deps): bump simple-home-dir from 0.4.5 to 0.4.6 by @dependabot in #2445
  • build(deps): bump uuid from 1.11.1 to 1.12.0 by @dependabot in #2441
  • build(deps): bump uuid from 1.12.0 to 1.12.1 by @dependabot in #2465
  • tests: enabled Windows CI caching for faster CI tests
  • bumped numerous indirect dependencies to latest versions
  • applied select clippy lint suggestions

Fixed

  • count: Sometimes, polars count returns zero even if there are rows. Fixed by doing a regular csv reader count when polars count returns zero abcd365
  • diff: Fix name to index conversion by @janriemer. Fixes #2443 #2457
  • extdedup: refactor/fix to actually have on-disk hash table backed by a mem-mapped file. Fixes #2462 #2475
  • stats: fix stats caching as it was inadvertently deleting the stats cache even when not necessary 96e6d28

Removed

  • foreach: refactored to remove unmaintained local-encoding dependency #2454
  • remove polars feature from qsvdp binary variant. We'll use py-polars from DP+ directly.

Full Changelog: 2.1.0...2.2.0

2.1.0

13 Jan 04:06

[2.1.0] - 2025-01-12

Highlights:

  • join & joinp fine-tuning continues, with several join key transformation options (--ignore-leading-zeros & --norm-unicode); join fixes for --right-anti and --right-semi joins; and reverting a join performance regression with 2.0.0.
  • pivotp uses more summary statistics for even smarter aggregation suggestions

NOTE: qsv v2.1.0 is not available on crates.io. This was caused by qsv's use of a brand new string_normalize Polars feature that is not yet available on the latest release of Polars - v0.45.1. Once a new version of Polars is published with this feature, a new qsv patch release will be published to crates.io.
This means that installation option 3 using cargo install will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.1.0 still work.


Added

  • join: add --ignore-leading-zeros option #2430
  • joinp: add --norm-unicode option to unicode normalize join keys #2436
  • pivotp: added more smart aggregation suggestions #2428
  • template: added to qsvdp binary variant 9df85e6
  • benchmarks: added pivotp benchmark 92e4c51
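The join key transformations can be pictured as a normalization step applied to both sides before matching. The sketch below is an illustration only: `normalize_key` and its parameters are hypothetical, and the choice of normalization form is an assumption, not a statement of what join or joinp do internally.

```python
import unicodedata


def normalize_key(key, ignore_leading_zeros=False, norm_unicode=None):
    """Normalize a join key before comparison.

    norm_unicode: one of "NFC", "NFD", "NFKC", "NFKD", or None to skip.
    """
    if norm_unicode:
        key = unicodedata.normalize(norm_unicode, key)
    if ignore_leading_zeros:
        stripped = key.lstrip("0")
        # Keep a single "0" for all-zero keys; leave empty keys empty.
        key = stripped or ("0" if key else "")
    return key


print(normalize_key("00123", ignore_leading_zeros=True))  # 123
# "e" + combining acute accent matches precomposed "é" after NFC.
print(normalize_key("e\u0301", norm_unicode="NFC") == "\u00e9")  # True
```

Without this step, "00123" and "123", or two visually identical strings with different code point sequences, would fail to join.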

Changed

  • joinp: refactored --ignore-leading-zeros handling #2433
  • Migrate from unmaintained dynfmt to dynfmt2 #2421
  • deps: bump csvlens to latest upstream 52c766d
  • deps: bump to latest csv qsv-optimized fork 58ac650
  • deps: bumped MiniJinja to 2.6.0 8176368
  • deps: bump to latest Polars upstream
  • deps: bump qsv-stats to 0.26.0
  • build(deps): bump azure/trusted-signing-action from 0.5.0 to 0.5.1 by @dependabot in #2420
  • build(deps): bump base62 from 2.0.3 to 2.1.0 by @dependabot in #2419
  • build(deps): bump base62 from 2.1.0 to 2.2.0 by @dependabot in #2426
  • build(deps): bump phf from 0.11.2 to 0.11.3 by @dependabot in #2417
  • build(deps): bump pyo3 from 0.23.3 to 0.23.4 by @dependabot in #2431
  • build(deps): bump serde_json from 1.0.134 to 1.0.135 by @dependabot in #2416
  • build(deps): bump tokio from 1.42.0 to 1.43.0 by @dependabot in #2423
  • build(deps): bump uuid from 1.11.0 to 1.11.1 by @dependabot in #2427
  • apply several clippy suggestions
  • bumped numerous indirect dependencies to latest versions
  • bumped Rust nightly from 2024-12-19 to 2025-01-05 (same version used by Polars)
  • bump MSRV to latest Rust stable - v1.84.0

Fixed

  • join: revert optimization that actually resulted in a performance regression e42af2b
  • join: --right-anti and --right-semi joins didn't swap headers properly #2435
  • count: polars-powered count didn't use the right data type for SQL count(*) d8c1524

Full Changelog: 2.0.0...2.1.0

2.0.0

06 Jan 12:54

qsv v2.0.0 is here! 🎉

It took 193 releases to get to v1.0.0, and we're already at v2.0.0 a month later!?!

Yes! We wanted a running start for 2025, and qsv 2.0.0 marks qsv's biggest release yet!

  • It fully enables the "Data Resource Upload First (DRUF)" workflow, allowing Datapusher+ to infer "automagical metadata" from the data itself. It exposes two Domain Specific Language (DSL) options - Luau and MiniJinja - to enable powerful data transformation and validation capabilities. This allows data stewards to upload data first, then use qsv's DSL capabilities inside DP+ to automatically generate rich metadata - including data dictionaries, field descriptions, data quality rules, and data validation schemas. This "automagical metadata" approach dramatically reduces the friction in compiling high-quality, high-resolution metadata (using the DCAT-US 3.0 specification as a reference) that would otherwise be a manual, laborious, and error-prone process.
    Under the hood, the fetchpost, template, stats, validate and luau commands now have the necessary scaffolding to fully support this workflow inside Datapusher+ and ckanext-scheming.
  • It adds a new "smart" pivotp command, powered by Polars, to enable fast pivot operations on large datasets. It's "smart" as it uses the stats cache to automatically suggest an aggregation based on a column's data type and summary statistics. You can now pivot your data in seconds by simply specifying the columns to pivot on while blowing past Excel's PivotTable limitations.
  • stats now computes geometric mean and harmonic mean and adds string length stats, all while getting a performance boost.
  • join and joinp got a lot of love in this release, with several new options:
    • joinp: non-equi join support! 🎉💯🥳
      See "Lightning Fast and Space Efficient Inequality Joins" paper and this Polars non-equi join tracking issue.
    • join & joinp: --right-anti and --right-semi joins
    • joinp: --ignore-leading-zeros option for join keys
    • joinp: --maintain-order option to maintain the order of either the left or right dataset in the output
    • joinp: expanded --cache-schema options to make joinp smarter/faster by leveraging the stats cache
    • join: --keys-output option to write successfully joined keys to a separate output file.
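For readers unfamiliar with the two new means, here is a short sketch using the standard definitions. It assumes strictly positive inputs and is not taken from the stats implementation.

```python
import math


def geometric_mean(xs):
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean(xs):
    """Reciprocal of the mean of reciprocals."""
    return len(xs) / sum(1.0 / x for x in xs)


print(geometric_mean([1, 4, 16]))  # approximately 4.0
print(harmonic_mean([1, 2, 4]))    # approximately 1.714 (12/7)
```

The geometric mean suits growth rates and ratios, while the harmonic mean suits averages of rates such as speeds.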

This release lays the groundwork for the outliers "smart" command to quickly identify outliers using stats/frequency info.

It also sets the stage for an initial implementation of our "Data Concierge" that leverages all the high-quality, high-res metadata we automagically compile with DRUF to enable Metadata Gardening Agents to proactively link seemingly unrelated data and glean insights as it constantly grooms the Data Catalog - effectively making it a FAIR Data Factory.


Added

  • fetchpost: add --globals-json option #2357
  • fixlengths: add --remove-empty option; refactored for performance. Fulfills #2391. #2411
  • join: add --keys-output option. Fulfills #2407. #2408
  • join: add --right-anti and --right-semi options. Fulfills #2379. #2380
  • joinp: add non-equi join support! 🎉💯🥳 #2409
  • joinp: add --ignore-leading-zeros option. Fulfills #2398. #2400
  • joinp: add --maintain-order option #2338
  • joinp: add --right-anti and --right-semi options. Fulfills #2377. #2378
  • luau: addl helper functions. Fulfills #1782. #2362
  • luau: add qsv_writejson helper #2375
  • pivotp: new Polars-powered command. Fulfills #799. #2364
  • pivotp: "smart" pivotp. #2367
  • stats: add geometric mean and harmonic mean. Fulfills #2227. #2342
  • stats: add string length stats to set stage for upcoming outliers "smart" command to quickly identify outliers using stats/frequency info #2390
  • template: add --globals-json option #2356
  • tojsonl: add --quiet option. Fulfills #2335. #2336
  • validate: add --validate-schema option to check if the JSON Schema itself is valid #2393
  • contrib(completions): add joinp --ignore-case and slice --invert by @rzmk in #2322
  • contrib(completions): add --quiet to tojsonl by @rzmk in #2337
  • ci: add qsv_glibc_2.31-headless to action by @rzmk in #2330
  • Add license to MSI installer by @rzmk in #2321

Changed

  • lens: optimized csvlens library usage, dropping clap dependency #2403
  • pivotp: an even smarter pivotp #2368
  • stats: performance boost 51349ba
  • Update deb package by @tino097 in #2226
  • ci: attempt using files-folder instead of files by @rzmk in #2320
  • Setting QSV_FREEMEMORY_HEADROOM_PCT to 0 disables memory availability check #2353
  • build(deps): bump actix-governor from 0.7.0 to 0.8.0 by @dependabot in #2351
  • build(deps): bump bytemuck from 1.20.0 to 1.21.0 by @dependabot in #2361
  • build(deps): bump chrono from 0.4.38 to 0.4.39 by @dependabot in #2345
  • build(deps): bump crossbeam-channel from 0.5.13 to 0.5.14 by @dependabot in #2354
  • build(deps): bump flexi_logger from 0.29.6 to 0.29.7 by @dependabot in #2348
  • build(deps): bump governor from 0.7.0 to 0.8.0 by @dependabot in #2347
  • build(deps): bump itertools from 0.13.0 to 0.14.0 by @dependabot in #2413
  • build(deps): bump jsonschema from 0.26.1 to 0.26.2 by @dependabot in #2355
  • build(deps): bump jsonschema from 0.26.2 to 0.27.0 by @dependabot in #2371
  • build(deps): bump jsonschema from 0.27.1 to 0.28.0 by @dependabot in #2389
  • build(deps): bump jsonschema from 0.28.0 to 0.28.1 by @dependabot in #2396
  • bump polars from 0.44.2 to 0.45 #2340
  • build(deps): bump polars from 0.45.0 to 0.45.1 by @dependabot in #2344
  • bump pyo3 from 0.22 to 0.23 now that Polars supports it #2352
  • build(deps): bump redis from 0.27.5 to 0.27.6 by @dependabot in #2331
  • build(deps): bump reqwest from 0.12.9 to 0.12.11 by @dependabot in #2385
  • build(deps): bump reqwest from 0.12.11 to 0.12.12 by @dependabot in #2395
  • build(deps): bump rfd from 0.15.1 to 0.15.2 by @dependabot in #2404
  • build(deps): bump serde from 1.0.215 to 1.0.216 by @dependabot in #2349
  • build(deps): bump serde from 1.0.216 to 1.0.217 by @dependabot in #2384
  • build(deps): bump serde_json from 1.0.133 to 1.0.134 by @dependabot in #2365
  • build(deps): bump sysinfo from 0.32.1 to 0.33.0 by @dependabot in #2334
  • build(deps): bump sysinfo from 0.33.0 to 0.33.1 by @dependabot in #2383
  • deps: bump tabwriter to 1.4.1 bbcbeba
  • build(deps): bump tokio from 1.41.1 to 1.42.0 by @dependabot in #2333
  • build(deps): bump xxhash-rust from 0.8.12 to 0.8.13 by @dependabot in #2359
  • build(deps): bump xxhash-rust from 0.8.13 to 0.8.14 by @dependabot in #2372
  • build(deps): bump xxhash-rust from 0.8.14 to 0.8.15 by @dependabot in #2392
  • apply several clippy suggestions
  • bumped numerous indirect dependencies to latest versions
  • bumped Rust nightly from 2024-11-28 to 2024-12-19 (same version used by Polars)

Fixed


1.0.0

02 Dec 13:27

qsv v1.0.0 is here! 🎉

After over 3 years of development, nearly 200 releases, and 11,000+ commits, qsv has finally reached v1.0.0!

What started as a hobby project to learn Rust during COVID has evolved into a powerful data wrangling tool used in multiple datHere products, open source projects, and even in several mission-critical production environments!

To mark this major milestone, this larger than usual release includes major performance improvements, new features, and various optimizations!


Added

  • joinp: add --ignore-case option #2287
  • py: add ability to load python expression from file #2295
  • replace: add --not-one flag (resolves #2305) by @rzmk in #2307
  • slice: add --invert option #2298
  • stats: add dataset-level stats #2297
  • sqlp: auto-decompression of gzip, zstd & zlib compressed csv files with read_csv table function (implements suggestion from @wardi in #2301) #2315
  • template: add lookup support #2313
  • added ui feature to make it easier to make a headless build of qsv #2289
  • added better panic handling #2304
  • added new benchmark for template command cd7e480
  • added 📚 lookup support legend b46de73

Changed

  • move qsv from personal GitHub repo to datHere GitHub org #2317
  • template: parallelized template rendering for significant speedups #2273
  • simplify input format check #2309
  • bump embedded luau from 0.650 to 0.653 986a1d3
  • deps: Switch back to simple-home-dir from simple-expand-tilde #2319
  • deps: Add minijinja contrib #2276
  • deps: bump pyo3 down to 0.21.2 because polars-mem-engine is not compatible with pyo3 0.23.x yet 7f9fc8a
  • build(deps): bump base62 from 2.0.2 to 2.0.3 by @dependabot in #2281
  • build(deps): bump bytemuck from 1.19.0 to 1.20.0 by @dependabot in #2299
  • build(deps): bump bytes from 1.8.0 to 1.9.0 by @dependabot in #2314
  • build(deps): bump file-format from 0.25.0 to 0.26.0 by @dependabot in #2277
  • build(deps): bump hashbrown from 0.15.1 to 0.15.2 by @dependabot in #2310
  • build(deps): bump itoa from 1.0.11 to 1.0.12 by @dependabot in #2300
  • build(deps): bump itoa from 1.0.12 to 1.0.13 by @dependabot in #2302
  • build(deps): bump itoa from 1.0.13 to 1.0.14 by @dependabot in #2311
  • build(deps): bump mlua from 0.10.0 to 0.10.1 by @dependabot in #2280
  • build(deps): bump mlua from 0.10.1 to 0.10.2 by @dependabot in #2316
  • build(deps): bump serial_test from 3.1.1 to 3.2.0 by @dependabot in #2279
  • build(deps): bump minijinja from 2.4.0 to 2.5.0 by @dependabot in #2284
  • build(deps): bump minijinja-contrib from 2.3.1 to 2.5.0 by @dependabot in #2283
  • build(deps): bump rfd from 0.15.0 to 0.15.1 by @dependabot in #2291
  • build(deps): bump sanitize-filename from 0.5.0 to 0.6.0 by @dependabot in #2275
  • build(deps): bump serde from 1.0.214 to 1.0.215 by @dependabot in #2286
  • build(deps): bump serde_json from 1.0.132 to 1.0.133 by @dependabot in #2292
  • build(deps): bump tempfile from 3.13.0 to 3.14.0 by @dependabot in #2278
  • build(deps): bump tokio from 1.41.0 to 1.41.1 by @dependabot in #2274
  • build(deps): bump url from 2.5.3 to 2.5.4 by @dependabot in #2306
  • applied several clippy suggestions
  • bumped numerous indirect dependencies to latest versions
  • bumped MSRV to latest Rust stable (1.83.0)
  • bumped Rust nightly from 2024-11-01 to 2024-11-28, the same version used by Polars

Fixed

  • fix get_stats_records() helper to handle input files with embedded spaces (fixes #2294) #2296
  • added better panic handling (fixes #2301) #2304
  • implement simple format check for input files (fixes #2301) #2308

Removed

  • removed simple-expand-tilde dependency in favor of simple-home-dir #2318
  • removed patched fork of indicatif now that 0.17.9 is released, fixing GH unmaintained advisory for instant 33fa54a
  • removed clipboard command from qsvlite binary variant 9c663d8

Full Changelog: 0.138.0...1.0.0

0.138.0

06 Nov 03:23
6dd67c1

Highlights:

  • ⭐ New template command for rendering templates with CSV data.
    Generate complex documents from CSVs (Form letters, HTML, JSON, XML files, etc.) with the powerful MiniJinja template engine (Example template).

  • ⭐ New lookup module for fetching reference data from remote and local files.
    In addition to the typical http/https schemes for remote files, qsv adds two additional schemes - CKAN:// and datHere://, fetching lookup data from a CKAN site or datHere maintained reference data respectively. The lookup module has simple file-based caching as well to minimize repeated fetching of typically static reference data (default cache age: 600 seconds).
    The lookup module is now being used by the luau (for its qsv_register_lookup helper) and validate (for its dynamicEnum custom JSON Schema keyword) commands. More commands will take advantage of this module over time (e.g. apply, geocode, template, sqlp, etc.) to do extended lookups (e.g. lookup Census information given spatiotemporal data - like demographic info of a Census tract).

  • ✨ Enhanced fetchpost with MiniJinja templating for payload construction.
    Previously, fetchpost was limited to posting url-encoded HTML Form data with content type application/x-www-form-urlencoded. Now with the new --payload-tpl and --content-type options, users can post request bodies rendered with MiniJinja and specify other content types (typically application/json, text/plain, multipart/form-data) as well.

  • ✨ Improved Polars integration with automatic schema detection
    The joinp and sqlp commands now use qsv's stats cache to automatically determine column data types, rather than having Polars scan a sample of rows. This provides two key benefits:

    1. Faster execution by skipping Polars' schema inference step
    2. GUARANTEED data type inferencing since the stats cache analyzes the entire dataset, not just a sample
  • 🏃 fast-float2 crate for faster float parsing
    Casting string/bytes to float is now much faster (2 to 8x faster than Rust's standard library) with fast-float2.

  • 💪 Major dependency updates including Polars 0.44.2, Luau 0.650, mlua 0.10.0 and jsonschema 0.26.1
    These core crates underpin qsv's advanced commands. Using the latest versions of these crates allows qsv to stay true to its goal of being the fastest and most comprehensive data-wrangling toolkit.
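The file-based caching described for the lookup module can be sketched as an age check on the cached file. This is a minimal illustration under assumptions: `is_cache_fresh` and `cached_lookup` are hypothetical names, the cache is keyed by a single local path, and only the 600-second default comes from the notes above.

```python
import os
import time


def is_cache_fresh(path, max_age_secs=600, now=None):
    """True if the cached file exists and is no older than max_age_secs."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age_secs


def cached_lookup(cache_path, fetch, max_age_secs=600):
    """Return cached text if fresh; otherwise call fetch() and refresh the cache.

    `fetch` is any zero-argument callable returning the lookup data as text,
    for example a function that downloads a reference CSV.
    """
    if is_cache_fresh(cache_path, max_age_secs):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    data = fetch()
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(data)
    return data
```

Because reference data is typically static, a second call within the cache age reads the local file and skips the fetch entirely.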


Added

  • added lookup module - enabling fetching and caching of reference data from remote and local files #2262
  • fetchpost: add --payload-tpl <file> and --content-type options to construct payload using MiniJinja with the appropriate content-type #2268 5921498
  • joinp: derive polars schema from stats cache 86fe22e
  • sqlp: derive polars schema from stats cache #2256
  • template: new command to render MiniJinja templates with CSV data #2267
  • validate: add dynamicEnum lookup support #2265
  • contrib(completions): add template command and update fetchpost by @rzmk in #2269
  • add fast-float2 dependency for faster bytes to float conversion 7590e4e 3ca30aa
  • added more benchmarks for new/updated commands f8a1d4f cd7e480

Changed

  • luau: adapt to mlua 0.10 API changes 268cb45
  • luau: refactored stage management 31ef58a
  • luau: now uses the lookup module 2f4be34
  • stats: minor perf refactoring 6cdd6ea
  • build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #2243
  • build(deps): bump azure/trusted-signing-action from 0.4.0 to 0.5.0 by @dependabot in #2239
  • build(deps): bump bytes from 1.7.2 to 1.8.0 by @dependabot in #2231
  • build(deps): bump cached from 0.53.1 to 0.54.0 by @dependabot in #2272
  • build(deps): bump flexi_logger from 0.29.3 to 0.29.4 by @dependabot in #2229
  • build(deps): bump flexi_logger from 0.29.4 to 0.29.5 by @dependabot in #2261
  • build(deps): bump flexi_logger from 0.29.5 to 0.29.6 by @dependabot in #2266
  • build(deps): bump hashbrown from 0.15.0 to 0.15.1 by @dependabot in #2270
  • build(deps): bump jsonschema from 0.24.0 to 0.24.1 by @dependabot in #2234
  • build(deps): bump jsonschema from 0.24.1 to 0.24.2 by @dependabot in #2238
  • build(deps): bump jsonschema from 0.24.2 to 0.24.3 by @dependabot in #2240
  • build(deps): bump jsonschema from 0.25.0 to 0.25.1 by @dependabot in #2244
  • build(deps): bump jsonschema from 0.26.0 to 0.26.1 by @dependabot in #2260
  • build(deps): bump regex from 1.11.0 to 1.11.1 by @dependabot in #2242
  • build(deps): bump reqwest from 0.12.8 to 0.12.9 by @dependabot in #2258
  • build(deps): bump serde from 1.0.210 to 1.0.211 by @dependabot in #2232
  • build(deps): bump serde from 1.0.211 to 1.0.213 by @dependabot in #2236
  • build(deps): bump serde from 1.0.213 to 1.0.214 by @dependabot in #2259
  • build(deps): bump simd-json from 0.14.1 to 0.14.2 by @dependabot in #2235
  • build(deps): bump tokio from 1.40.0 to 1.41.0 by @dependabot in #2237
  • deps: updated our fork of the csv crate with more perf optimizations eae7d76
  • deps: use calamine upstream with unreleased fixes 4cc7f37
  • deps: use our csvlens fork until PR removing unneeded arboard features is merged bb32322
  • deps: bump jsonschema from 0.25 to 0.26 #2251
  • deps: bump embedded Luau from 0.640 to 0.650 8c54b87 aca30b0
  • deps: bump mlua from 0.9 to 0.10 #2249
  • deps: bump Polars from 0.43.1 at py-1.11.0 tag to latest 0.44.2 upstream #2255 0e40a44
  • apply select clippy lint suggestions
  • updated indirect dependencies
  • aligned Rust nightly to Polars nightly - 2024-10-28 - 245bcb5

Fixed

Removed

  • removed need to set RAYON_NUM_THREADS env var and just call the Rayon API directly aa6ef89
  • removed unneeded create_dir_all_threadsafe helper now that std::fs::create_dir_all is threadsafe d0af83b

Full Changelog: 0.137.0...0.138.0

0.137.0

21 Oct 03:57
75dbaba

Highlights:

  • extdedup & extsort now support two modes - LINE mode and CSV mode. Previously, both commands only sorted on a line-by-line basis (LINE mode).
    With the addition of CSV mode, you can now deduplicate or sort CSV files on a column-by-column basis, with the powerful --select option to specify which columns to deduplicate or sort on.
    This is especially useful for large CSV files with many columns, where you only want to deduplicate or sort on a subset of columns. And since both commands use disk-backed algorithms (an on-disk hash table for extdedup, and an external merge sort for extsort) - they can handle files larger than memory.
  • sqlp now has a --cache-schema option that caches the inferred schema of the input CSV file, which can significantly speed up subsequent queries on the same file, as the initial schema inferencing step is skipped.
  • fetch and fetchpost have been updated to use the jaq crate instead of the jql crate. This change was made to improve performance and to make the commands consistent with the json command which also uses jaq. Furthermore, jaq is a clone of jq - a widely used JSON parsing tool, so it should be more familiar to users.
  • stats is a tad faster as we keep squeezing more performance from this central command.
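
To make the difference between LINE mode and CSV mode concrete, here is a minimal Python sketch of deduplicating on a subset of columns. It is not how qsv implements extdedup: an in-memory `set` stands in for the on-disk hash table, and `dedup_on_columns` is a hypothetical helper whose `select` argument only mimics the spirit of the --select option.

```python
import csv
import io


def dedup_on_columns(csv_text, select):
    """Keep the first row seen for each distinct combination of the selected columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()  # in-memory stand-in for a disk-backed hash table
    kept = []
    for row in reader:
        key = tuple(row[col] for col in select)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


data = "id,name,city\n1,Ann,Oslo\n2,Ann,Oslo\n3,Bob,Oslo\n"

# LINE mode analogue: whole lines are compared, so nothing is a duplicate here.
line_mode = list(dict.fromkeys(data.splitlines()[1:]))

# CSV mode analogue: only name and city are compared, so row 2 is dropped.
csv_mode = dedup_on_columns(data, ["name", "city"])

print(len(line_mode), [row["id"] for row in csv_mode])
```

Because the `id` column differs on every line, line-by-line comparison keeps all three rows, while comparing only `name` and `city` keeps two.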

Added

  • extdedup: now supports two modes - LINE mode and CSV mode #2208
  • extsort: now also has two modes - CSV mode and LINE mode #2210
  • sqlp: add --cache-schema option #2224
  • added sqlp --cache-schema benchmarks
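
Why caching a schema speeds up repeated queries can be shown with a small Python sketch. Everything here is an assumption made for illustration: the naive type inference, the `.schema.json` sidecar name, and the `infer_schema` and `load_schema` helpers are hypothetical and do not describe how sqlp stores its cached schema.

```python
import csv
import json
import os
import tempfile


def infer_schema(csv_path):
    """Naive inference: Int64 if every value parses as an int, otherwise String."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        schema = {name: "Int64" for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                if schema[name] == "Int64":
                    try:
                        int(value)
                    except (TypeError, ValueError):
                        schema[name] = "String"
    return schema


def load_schema(csv_path):
    """Return (schema, from_cache); the full scan only happens on a cache miss."""
    cache_path = csv_path + ".schema.json"  # hypothetical sidecar file name
    fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(csv_path)
    )
    if fresh:
        with open(cache_path) as f:
            return json.load(f), True
    schema = infer_schema(csv_path)
    with open(cache_path, "w") as f:
        json.dump(schema, f)
    return schema, False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w", newline="") as f:
        f.write("id,name\n1,Ann\n2,Bob\n")
    first = load_schema(path)   # scans the file and writes the sidecar
    second = load_schema(path)  # reads the sidecar, skipping the scan

print(first, second)
```

The modification-time check is one simple way to avoid reusing a stale schema after the CSV changes.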

Changed

  • apply & applydp: use smallvec for operations vector & other minor performance optimizations #2219 & bc837ae
  • apply & applydp: specify min_length for parallel iterators 7d6ce5e
  • fetch & fetchpost: replace jql with jaq #2222
  • stats: performance optimizations f205809 e26c27f 4579c1b
  • validate: specify min_length for parallel iterators a5b8185
  • deps: updated polars to 0.43.1 at the py-1.10.0 tag.
  • build(deps): bump calamine from 0.26.0 to 0.26.1 by @dependabot in #2204
  • build(deps): bump csvs_convert from 0.8.14 to 0.9.0 by @dependabot in #2215
  • build(deps): bump flexi_logger from 0.29.2 to 0.29.3 by @dependabot in #2209
  • build(deps): bump jsonschema from 0.23.0 to 0.24.0 by @dependabot in #2223
  • build(deps): bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #2207
  • build(deps): bump pyo3 from 0.22.4 to 0.22.5 by @dependabot in #2212
  • build(deps): bump redis from 0.27.3 to 0.27.4 by @dependabot in #2202
  • build(deps): bump redis from 0.27.4 to 0.27.5 by @dependabot in #2217
  • build(deps): bump serde_json from 1.0.129 to 1.0.130 by @dependabot in #2218
  • build(deps): bump serde_json from 1.0.131 to 1.0.132 by @dependabot in #2220
  • build(deps): bump uuid from 1.10.0 to 1.11.0 by @dependabot in #2213
  • apply select clippy lints
  • bumped indirect dependencies
  • bumped MSRV to 1.82

Fixed

  • fix performance regression in batched commands by refactoring optimal_batch_size to require indexed CSV files #2206

Removed

  • fetch & fetchpost: removed jql options; replaced with jaq #2222

Full Changelog: 0.136.0...0.137.0

0.136.0

08 Oct 19:41
82b7611

🎉 qsv pro is now available in the Microsoft Store! 🎉

It's Data Wrangling Democratized on the Desktop, featuring:

  • 📊 Familiar Spreadsheet Interface
    tap the power of qsv to query, analyze, enrich, scrub and transform huge Excel files and multi-gigabyte CSV files in seconds, without having to deal with the command-line.
  • CKAN desktop client
    designed to make data publishing easier for portal operators and data stewards using the CKAN platform.
  • 📥 Flow
    allows you to build custom node-based flows and data pipelines using a visual interface.
  • 🔧 Toolbox
    features an ever-expanding library of reusable scripts for common data-wrangling use cases.
  • ⭐ and more!
    Natural Language Interface (RAG), Polars SQL query support, an API, Python/Luau support, automatic Data Dictionaries, DCAT 3 metadata profile inferencing, along with a retinue of other cloud-based services (e.g. customizable street-level geocoding, data feeds, reference data lookups, geo-ip lookups, cloud storage support, .qsv file format, etc.) that will be unveiled in future versions.

Like qsv, we're iterating rapidly with qsv pro, so your feedback is essential. Give it a try!

Get it from https://qsvpro.dathere.com or the Microsoft Store.

Other highlights:

  • excel: new --table option for XLSX files; new --header-row option; expanded --range option, adding support for Named Ranges and absolute ranges (e.g. Sheet2!$A$1:$J$10); and expanded metadata export now including Named Ranges and Tables (for XLSX files)
  • Improved performance for several commands (apply, datefmt, tojsonl and validate) through automatic batch size optimization
  • validate: dynamicEnum custom JSON Schema keyword in validate command (renamed from dynenum) and enhanced email validation
  • schema: automatic JSON Schema const inferencing for columns with just one value
  • Significant dependency updates, including latest upstream versions of Polars, jsonschema, and serde_json with unreleased performance upgrades, new features and fixes
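
As an illustration of what an absolute range such as Sheet2!$A$1:$J$10 encodes, the following Python sketch splits it into a sheet name and zero-based cell coordinates. It is an assumption-laden example, not qsv's parser: `parse_range` and `col_to_index` are hypothetical helpers, and Named Ranges are not handled.

```python
import re

# Optional "Sheet!" prefix, then two cell references with optional "$" anchors.
RANGE_RE = re.compile(
    r"^(?:(?P<sheet>[^!]+)!)?"
    r"\$?(?P<c1>[A-Za-z]+)\$?(?P<r1>\d+)"
    r":\$?(?P<c2>[A-Za-z]+)\$?(?P<r2>\d+)$"
)


def col_to_index(letters):
    """Convert spreadsheet column letters to a zero-based index (A -> 0, AA -> 26)."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1


def parse_range(text):
    """Return the sheet name (or None) and zero-based (row, col) start/end cells."""
    m = RANGE_RE.match(text)
    if m is None:
        raise ValueError(f"not a cell range: {text!r}")
    return {
        "sheet": m.group("sheet"),
        "start": (int(m.group("r1")) - 1, col_to_index(m.group("c1"))),
        "end": (int(m.group("r2")) - 1, col_to_index(m.group("c2"))),
    }


parsed = parse_range("Sheet2!$A$1:$J$10")
print(parsed)
```

The `$` anchors mark a reference as absolute in spreadsheet formulas; for extraction purposes they can simply be ignored, which is what the optional `\$?` in the pattern does.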

NOTE: You can see qsv & qsv pro in action in our "The Problem with Data Portals" webinar on Wed, Oct 23, 2024, 1-2pm EDT.


Added

  • 🎉 qsv pro is now in the Microsoft Store!!! 🎉
  • apply, datefmt, tojsonl, validate: added logic to automatically determine optimal batch size for better parallelization #2178
  • enum: added --new-column support for all enum modes, not just --increment #2173
  • excel: new --table option for XLSX files #2194
  • excel: new --header-row option 458f79a
  • excel: expanded range and metadata options #2195
  • schema: added JSON Schema automatic const inferencing #2180
  • Add signing step to qsv MSI installer GitHub Action by @rzmk in #2182
  • contrib(completions): add --table option to qsv excel by @rzmk in #2197
  • completions: add --header-row option to qsv excel e8794d5
  • added new apply operations sentiment benchmark b745e64
  • docs: added indexing section to PERFORMANCE.md 804145a

Changed

Fixed

  • schema: fix enum so it only adds a list when the number of unique values > --enum-threshold #2180
  • Upload artifact fix for Debian package publishing by @tino097 in #2168
  • fixed typos configuration 627de89
  • fixed various GitHub Actions publishing workflow issues

Full Changelog: 0.135.0...0.136.0