24 Feb 16:41

jqnatividad

e21275d

3.1.1 Latest

[3.1.1] - 2025-02-24

Highlights:

sample: is now a "smart" command that uses the stats cache to validate and make sampling faster.
With the QSV_STATSCACHE_MODE env var, you can now control the stats cache behavior suite-wide, making sure "smart" commands use it when appropriate.
luau command's capabilities have been significantly expanded with:
- New accumulate helper function for aggregating values across rows
- Optional naming for cumulative helper functions
- More robust error handling and improved docstrings
- Enhanced scripting performance with fast-float parsing
- New Wiki section with examples of using its helper functions
schema: now does type-aware sorting of enum lists, making JSON Schema enum list customization easier when fine-tuning it for JSON Schema validation with validate.
lens: adds --freeze-columns option with a default of 1, improving navigation of wide CSVs
stats: adds --dataset-stats option to explicitly compute dataset-level statistics. Starting with qsv 2.0.0, it was computed automatically to support Datapusher+ and the DRUF workflow, but it was causing confusion with some command-line users.

Added

lens: added --freeze-columns option #2552
luau: added accumulate helper function #2537 #2539
luau: added a new section in the Wiki with examples of using the new helper functions https://github.com/dathere/qsv/wiki/Luau-Helper-Functions-Examples
sample: is now "smart" - using the stats cache to validate and make sampling faster #2529 #2530 71ec7ed
schema: added type-aware sort of JSON Schema enum list #2551
stats: added --dataset-stats option #2555
python: added precompiled qsvpy binary for Python 3.13 c408778
added QSV_STATSCACHE_MODE env var to control stats cache suite-wide 4afb98d 2adc313 ba75f08
docs: updated PERFORMANCE docs and added a TLDR version 77ed167 c61c249 db0bb3f
chore: added *.tab & *.ssv to typos config 5236675

Changed

frequency: made error handling more robust b195519
luau: refactored all cumulative helper functions (cum_) so that name is now an optional argument #2540
schema: refactored to use QSV_STATSCACHE_MODE env var 5771ff4
select: refactored select helper bfbe64c
stats: optimized memory layout of central Stats struct 52f697e
stats: optimized record_count functionality 0e3114a 18791da
contrib(completions): update qsv completions for qsv 3.1 by @rzmk in #2556
deps: bump arrow and tempfile 4cc2679
deps: bump cached and redis crates e622d14
deps: bump csvlens from 0.11 to 0.12 b2fd985
deps: use our patched fork of csvlens with ability to freeze columns d66ec6d
deps: bump polars to 0.46.0 at py-1.23.0 tag 6072aa2
deps: bump flate2 from 1.0.35 to 1.1.0 eed471a
deps: bump gzp from 0.11 to 1.0.0 43c8a4a
build(deps): bump jaq-json from 1.1.0 to 1.1.1 by @dependabot in #2547
build(deps): bump jaq-core from 2.1.0 to 2.1.1 by @dependabot in #2546
build(deps): bump log from 0.4.25 to 0.4.26 by @dependabot in #2545
build(deps): bump tempfile from 3.16.0 to 3.17.0 by @dependabot in #2532
build(deps): bump tempfile from 3.17.0 to 3.17.1 by @dependabot in #2535
build(deps): bump serde_json from 1.0.138 to 1.0.139 by @dependabot in #2541
build(deps): bump serde from 1.0.217 to 1.0.218 by @dependabot in #2542
build(deps): bump smallvec from 1.13.2 to 1.14.0 by @dependabot in #2528
build(deps): bump strum from 0.27.0 to 0.27.1 by @dependabot in #2533
build(deps): bump strum_macros from 0.27.0 to 0.27.1 by @dependabot in #2534
build(deps): bump uuid from 1.13.1 to 1.13.2 by @dependabot in #2538
build(deps): bump uuid from 1.13.2 to 1.14.0 by @dependabot in #2544
chore: we now have ~1,800 tests! f5d09ed
applied select clippy lint suggestions
bumped indirect dependencies to latest versions
bumped MSRV to latest Rust stable - v1.85

Fixed

count: refactored to fall back to "regular" CSV reader when Polars counting returns a zero count fd39bcb
schema: fixed off-by-one error 60de090
ensured get_stats_record helper returns field/stats correctly ad86a37
Fixed RUSTSEC-2025-0007: ring is unmaintained #2548
stats: only add qsv__value column when --dataset-stats is enabled 64267d3
skip format check when path starts with temp dir (indicating it's a file streamed from STDIN) or is a snappy file ff8957e

Removed

frequency: removed --stats-mode option now that we have a suite-wide QSV_STATSCACHE_MODE env var ba75f08 416abb7
chore: removed simdutf8 conditional directive for aarch64 architecture, now that it's no longer needed ec1e16c
removed publish-linux-qsvpy-glibc-231-musl-123.yml workflow as it was getting cross compilation errors and we have another musl workflow that works 7c08617

Full Changelog: 3.0.0...3.1.1

Contributors

dependabot and rzmk

Assets 13

qsv-3.1.1-aarch64-apple-darwin.zip (183 MB, 2025-02-24T20:08:00Z)
qsv-3.1.1-aarch64-unknown-linux-gnu.zip (38.2 MB, 2025-02-24T19:28:40Z)
qsv-3.1.1-geocode-index.bincode (14.7 MB, 2025-02-24T16:41:17Z)
qsv-3.1.1-geocode-index.bincode.cities15000 (14.7 MB, 2025-02-24T16:41:16Z)
qsv-3.1.1-geocode-index.bincode.cities15000.sz (5.8 MB, 2025-02-24T16:41:15Z)
qsv-3.1.1-x86_64-apple-darwin.zip (204 MB, 2025-02-24T20:12:45Z)
qsv-3.1.1-x86_64-pc-windows-gnu.zip (79.3 MB, 2025-02-24T20:07:25Z)
qsv-3.1.1-x86_64-pc-windows-msvc.zip (259 MB, 2025-02-24T20:07:26Z)
qsv-3.1.1-x86_64-unknown-linux-gnu.zip (278 MB, 2025-02-24T19:42:52Z)
qsv-3.1.1-x86_64-unknown-linux-musl.zip (95.8 MB, 2025-02-24T19:26:26Z)
Source code (zip) (2025-02-24T16:40:55Z)
Source code (tar.gz) (2025-02-24T16:40:55Z)

13 Feb 17:14

jqnatividad

3.0.0

7881e22

[3.0.0] - 2025-02-13

Highlights:

sample: Five new sampling methods! In addition to reservoir & indexed - added bernoulli, systematic, stratified, weighted & cluster sampling. And they're all memory efficient so you should be able to sample arbitrarily large datasets!
stats: Added "sortiness" [-1 (Descending) to 1 (Ascending)] & "uniqueness_ratio" [0 (many repeated values) to 1 (All unique values)] stats (more info).
The qsv-stats engine was also optimized to squeeze out more performance, with stats now 2.6x faster while using less memory despite the addition of these new stats.
diff: is now a "smart" command, so that it uses the stats cache to short-circuit diffs if files are identical per their fingerprint hashes, and to validate that the diff key column is all unique.
The stats cache has been refactored, improving performance for "smart" commands:
- frequency is not only 3.3x faster, it uses far less memory as it now doesn't need to maintain hashmaps for columns with all unique values.
- tojsonl is 2.25x faster
- schema is 1.4x faster
luau got a major performance boost with the v0.660 engine upgrade, taking advantage of several compiler optimizations. luau is now up to 3.1x faster!
validate had a major performance regression - slowing from 3.295 seconds in v2.1.0 to 13.159 seconds in v2.2.1 in the benchmarks. 4x slower! With the jsonschema 0.29 crate update, validate now clocks in at 3.022 seconds!
template also got a big boost and is now 2.9x faster with the minijinja 2.7 crate update.
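
To illustrate why the sampling methods mentioned above can be memory efficient, here is a minimal Python sketch of three of them (Bernoulli, systematic and reservoir). This is not qsv's implementation: the function names, parameters and seed handling are assumptions made for illustration. Each function makes a single pass over the input and never holds more than the sample itself in memory.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Iterable[T], p: float, seed: int = 42) -> Iterator[T]:
    """Keep each row independently with probability p (one pass, O(1) memory)."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < p:
            yield row


def systematic_sample(rows: Iterable[T], interval: int, start: int = 0) -> Iterator[T]:
    """Keep every interval-th row, beginning at offset start (one pass, O(1) memory)."""
    for i, row in enumerate(rows):
        if i >= start and (i - start) % interval == 0:
            yield row


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Classic reservoir sampling (Algorithm R): a uniform sample of k rows, O(k) memory."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir
```

Because all three consume a plain iterable, they can be fed rows streamed from a file of any size.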

Added

joinp: additional joinp asof join sort and match options #2486
stats: add "sortiness" statistic #2499
stats: add uniqueness_ratio #2521
stats & frequency: add --vis-whitespace option. Fulfills #2501 #2503
sample: add more sampling methods (in addition to indexed and reservoir - added bernoulli, systematic, stratified, weighted & cluster sampling) and made them all memory efficient so we can sample arbitrarily large datasets: #2507 & #2511
diff: make diff a "smart" command. Fulfills #2493 and #2509 #2518
benchmarks: added new benchmarks for sample for new sampling methods d758c54
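
As a rough illustration of the "sortiness" and "uniqueness_ratio" statistics listed above, the Python sketch below shows one plausible way to compute values in the ranges given in the Highlights. The formulas are assumptions for illustration only and may differ from what qsv-stats actually computes, in particular in how ties and empty columns are handled.

```python
from typing import Hashable, Sequence


def sortiness(values: Sequence) -> float:
    """Assumed definition: (ascending pairs - descending pairs) / (n - 1).

    Returns 1.0 for strictly ascending data, -1.0 for strictly descending data,
    and values in between for partially sorted data. Ties count as neither.
    """
    if len(values) < 2:
        return 0.0
    asc = desc = 0
    for prev, cur in zip(values, values[1:]):
        if cur > prev:
            asc += 1
        elif cur < prev:
            desc += 1
    return (asc - desc) / (len(values) - 1)


def uniqueness_ratio(values: Sequence[Hashable]) -> float:
    """Assumed definition: number of distinct values divided by number of values."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)
```

Under these assumed definitions, a column of all-unique values has a uniqueness ratio of 1, and the ratio approaches 0 as more values repeat.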

Changed

luau: bump from 0.653 to 0.660 and optimize for performance 4402df6 de429b4 07ff8b8 3211f5c
stats: compute string len stats only for string columns #2495
contrib(completions): update qsv completions for qsv 2.2.1 by @rzmk in #2494
deps: bump polars to latest upstream after its py-1.22.0 release
deps: backported csv-core 0.1.12 fix to our qsv-optimized csv-core fork dathere/rust-csv@5d0916e
build(deps): bump actions/setup-python from 5.3.0 to 5.4.0 by @dependabot in #2488
build(deps): bump bytes from 1.9.0 to 1.10.0 by @dependabot in #2497
build(deps): bump data-encoding from 2.7.0 to 2.8.0 by @dependabot in #2512
build(deps): bump geosuggest-core from 0.6.5 to 0.6.6 by @dependabot in #2520
build(deps): bump geosuggest-utils from 0.6.5 to 0.6.6 by @dependabot in #2519
build(deps): bump jsonschema from 0.28.3 to 0.29.0 by @dependabot in #2510
build(deps): bump minijinja from 2.6.0 to 2.7.0 by @dependabot in #2489
build(deps): bump mlua from 0.10.2 to 0.10.3 by @dependabot in #2485
build(deps): bump qsv-stats from 0.27.0 to 0.28.0 by @dependabot in #2496
build(deps): bump qsv-stats from 0.28.0 to 0.29.0 by @dependabot in #2498
build(deps): bump qsv-stats from 0.29.0 to 0.30.0 by @dependabot in #2505
chore: Bump rand to 0.9 #2504
build(deps): bump simple-home-dir from 0.4.6 to 0.4.7 by @dependabot in #2515
build(deps): bump uuid from 1.12.1 to 1.13.1 by @dependabot in #2500
bumped numerous indirect dependencies to latest versions
applied select clippy lint suggestions
bumped MSRV to latest Rust stable - v1.84.1

Fixed

docs: QSV_AUTOINDEX => QSV_AUTOINDEX_SIZE typo. Fixes #2479 #2484
fix: search & searchset off by 1 when using --flag option. Fixes #2508 #2513

Full Changelog: 2.2.1...3.0.0

Contributors

dependabot and rzmk

27 Jan 02:03

jqnatividad

2.2.1

bea7973

[2.2.1] - 2025-01-27

Changed

deps: bumped polars to 0.46.0. This will allow us to publish qsv to crates.io as qsv was using features that were not enabled in polars 0.45.1 275b2b8

Fixed

stats: fix cache json processing bug. Fixes #2476 #2477
benchmarks: v6.1.0 - ensured all stats cache benchmarks actually use the stats cache, as the default --cache-threshold of 5 seconds is too high to trigger stats cache creation ac33010

Full Changelog: 2.2.0...2.2.1

26 Jan 15:12

jqnatividad

2.2.0

8b394ff

[2.2.0] - 2025-01-26

Highlights:

stats - the ❤️ of qsv, got a little tune-up:
- It got a tad faster now that we only compute string length stats for string types. Previously, we were also computing length for numbers, thinking it'll be useful for storage sizing purposes (as everything is stored as string with CSV). But as performance is goal number 1, we're no longer doing so. Besides, this sizing info can be derived using other stats.
- Fixed the problem with the stats cache being deleted/ignored even when not necessary.
  This bug snuck in while implementing the --cache-threshold cache suppression option. With stats getting its cache mojo back - expect near-instant cache-backed response not only for stats but also other "automagical" smart commands 🪄.
diff - @janriemer squashed some bugs without sacrificing diff's ludicrous speed! 😉
validate: added dynamicEnum custom JSON Schema keyword column specifier support.
You can now specify which column to validate against (by name or by 0-based column index), instead of always using the first column. This works for local & remote lookup files using the http/s://, ckan:// and dathere:// URL schemes.
extdedup now actually uses a proper on-disk hash table backed by a memory-mapped file.
Previously, it was only deduping in-memory as the odht crate was not properly wired to a memory mapped file 🤦 (I took the name of the odht crate literally and thought it was handling it 🤷). Thanks for the detailed bug report @Svenskunganka!
JSON query parsing overhaul.
The fetch, fetchpost & json commands now use the latest jaq engine, making for faster performance especially now that we're precompiling and caching the jaq filter.
Polars engine upgraded. 🐻‍❄️
By two versions! py-polars 1.20.0 and 1.21.0 - giving the sqlp, joinp, pivotp & count commands a little boost. 🚀
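
The dynamicEnum column specifier described in the highlights above (by name or by 0-based column index, defaulting to the first column) can be pictured with the following Python sketch. It is not qsv code: load_allowed_values is a hypothetical helper, and the lookup data is passed in as text rather than fetched from a local file or URL.

```python
import csv
import io
from typing import Set, Union


def load_allowed_values(csv_text: str, column: Union[str, int] = 0) -> Set[str]:
    """Collect the set of allowed values from one column of a lookup CSV.

    column may be a header name or a 0-based index; the default of 0 mirrors
    the previous behavior of always using the first column.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Resolve a column name to its position; an unknown name raises ValueError.
    idx = column if isinstance(column, int) else header.index(column)
    return {row[idx] for row in reader if len(row) > idx}


lookup = "code,name\nCA,California\nNY,New York\n"
print(load_allowed_values(lookup))          # values of the first column
print(load_allowed_values(lookup, "name"))  # values of the column named "name"
```

A validator would then check each input value for membership in the returned set.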

NOTE: qsv v2.2.0 is not available on crates.io as it does not allow enabling unreleased features as we await a new version of Polars. As soon as Polars 0.46.0 is published, a new qsv patch release will be published to crates.io.
This means that installation option 3 using cargo install will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.2.0 still work.

Added

diff: add --delimiter "convenience" option. Fulfills #2447 #2464
slice: add stdin and snappy compressed file support ab34a62
validate: add dynamicEnum column specifier support. Fulfills #2470 #2472

Changed

fetch, fetchpost & json: jaq dependency upgrade - from jaq-interpret & jaq-parse to jaq-core/jaq-json/jaq-std #2458
fetch & fetchpost: cache compiled jaq filter #2467
joinp: adjust asofby test to reflect Polars py-1.20.0 behavior 853a266
stats: compute string length stats for string type only #2471
sqlp: wordsmith fastpath explanation 4e3f853
refactor: standardize -q and -Q shortcut options. Fulfills #2466 #2468
deps: bump polars to 0.45.1 at py-polars-1.20.0 tag #2448
deps: bump polars to 0.45.1 at py-polars-1.21.0 tag 4525d00
deps: Bump csv-diff to 0.1.1 by @janriemer in #2456
deps: Bump csvlens to latest upstream 27a723e
deps: use latest strum upstream 2ca1b0d
build(deps): bump base62 from 2.2.0 to 2.2.1 by @dependabot in #2440
build(deps): bump chrono-tz from 0.10.0 to 0.10.1 by @dependabot in #2449
build(deps): bump data-encoding from 2.6.0 to 2.7.0 by @dependabot in #2444
build(deps): bump indexmap from 2.7.0 to 2.7.1 by @dependabot in #2461
build(deps): bump jsonschema from 0.28.1 to 0.28.2 by @dependabot in #2469
build(deps): bump jsonschema from 0.28.2 to 0.28.3 by @dependabot in #2473
build(deps): bump log from 0.4.22 to 0.4.25 by @dependabot in #2439
build(deps): bump semver from 1.0.24 to 1.0.25 by @dependabot in #2459
build(deps): bump serde_json from 1.0.135 to 1.0.136 by @dependabot in #2455
build(deps): bump serde_json from 1.0.136 to 1.0.137 by @dependabot in #2460
build(deps): bump simple-home-dir from 0.4.5 to 0.4.6 by @dependabot in #2445
build(deps): bump uuid from 1.11.1 to 1.12.0 by @dependabot in #2441
build(deps): bump uuid from 1.12.0 to 1.12.1 by @dependabot in #2465
tests: enabled Windows CI caching for faster CI tests
bumped numerous indirect dependencies to latest versions
applied select clippy lint suggestions

Fixed

count: Sometimes, polars count returns zero even if there are rows. Fixed by doing a regular csv reader count when polars count returns zero abcd365
diff: Fix name to index conversion by @janriemer. Fixes #2443 #2457
extdedup: refactor/fix to actually have on-disk hash table backed by a mem-mapped file. Fixes #2462 #2475
stats: fix stats caching as it was inadvertently deleting the stats cache even when not necessary 96e6d28
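
The count fix above follows a common fallback pattern: trust the fast path unless it reports zero rows, then recount with the slower but dependable reader. A minimal Python sketch of that idea, where fast_count is a hypothetical stand-in for the Polars-powered count and the csv module stands in for the regular CSV reader:

```python
import csv
import io
from typing import Callable


def count_rows(csv_text: str, fast_count: Callable[[str], int]) -> int:
    """Return the number of data rows, falling back to a plain CSV scan on zero."""
    n = fast_count(csv_text)
    if n == 0:
        # The fast path may report zero even when rows exist, so recount.
        reader = csv.reader(io.StringIO(csv_text))
        next(reader, None)  # skip the header row, if any
        n = sum(1 for _ in reader)
    return n
```

The fallback only costs extra time when the fast path returns zero, so genuinely empty files are the only case that is scanned twice.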

Removed

foreach: refactored to remove unmaintained local-encoding dependency #2454
remove polars feature from qsvdp binary variant. We'll use py-polars from DP+ directly.

Full Changelog: 2.1.0...2.2.0

Contributors

Svenskunganka, janriemer, and dependabot

13 Jan 04:06

jqnatividad

2.1.0

2878b1e

[2.1.0] - 2025-01-12

Highlights:

join & joinp fine-tuning continues, with several join key transformation options (--ignore-leading-zeros & --norm-unicode); join fixes for --right-anti and --right-semi joins; and reverting a join performance regression with 2.0.0.
pivotp uses more summary statistics for even smarter aggregation suggestions

NOTE: qsv v2.1.0 is not available on crates.io. This was caused by qsv's use of a brand new string_normalize Polars feature that is not yet available on the latest release of Polars - v0.45.1. Once a new version of Polars is published with this feature, a new qsv patch release will be published to crates.io.
This means that installation option 3 using cargo install will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.1.0 still work.

Added

join: add --ignore-leading-zeros option #2430
joinp: add --norm-unicode option to Unicode-normalize join keys #2436
pivotp: added more smart aggregation suggestions #2428
template: added to qsvdp binary variant 9df85e6
benchmarks: added pivotp benchmark 92e4c51

Changed

joinp: refactored --ignore-leading-zeros handling #2433
Migrate from unmaintained dynfmt to dynfmt2 #2421
deps: bump csvlens to latest upstream 52c766d
deps: bump to latest csv qsv-optimized fork 58ac650
deps: bumped MiniJinja to 2.6.0 8176368
deps: bump to latest Polars upstream
deps: bump qsv-stats to 0.26.0
build(deps): bump azure/trusted-signing-action from 0.5.0 to 0.5.1 by @dependabot in #2420
build(deps): bump base62 from 2.0.3 to 2.1.0 by @dependabot in #2419
build(deps): bump base62 from 2.1.0 to 2.2.0 by @dependabot in #2426
build(deps): bump phf from 0.11.2 to 0.11.3 by @dependabot in #2417
build(deps): bump pyo3 from 0.23.3 to 0.23.4 by @dependabot in #2431
build(deps): bump serde_json from 1.0.134 to 1.0.135 by @dependabot in #2416
build(deps): bump tokio from 1.42.0 to 1.43.0 by @dependabot in #2423
build(deps): bump uuid from 1.11.0 to 1.11.1 by @dependabot in #2427
apply several clippy suggestions
bumped numerous indirect dependencies to latest versions
bumped Rust nightly from 2024-12-19 to 2025-01-05 (same version used by Polars)
bump MSRV to latest Rust stable - v1.84.0

Fixed

join: revert optimization that actually resulted in a performance regression e42af2b
join: --right-anti and --right-semi joins didn't swap headers properly #2435
count: polars-powered count didn't use the right data type for SQL count(*) d8c1524

Full Changelog: 2.0.0...2.1.0

Contributors

dependabot

06 Jan 12:54

jqnatividad

2.0.0

0f4cf64

qsv v2.0.0 is here! 🎉

It took 193 releases to get to v1.0.0, and we're already at v2.0.0 a month later!?!

Yes! We wanted a running start for 2025, and qsv 2.0.0 marks qsv's biggest release yet!

It fully enables the "Data Resource Upload First (DRUF)" workflow, allowing Datapusher+ to infer "automagical metadata" from the data itself. It exposes two Domain Specific Language (DSL) options - Luau and MiniJinja - to enable powerful data transformation and validation capabilities. This allows data stewards to upload data first, then use qsv's DSL capabilities inside DP+ to automatically generate rich metadata - including data dictionaries, field descriptions, data quality rules, and data validation schemas. This "automagical metadata" approach dramatically reduces the friction in compiling high-quality, high-resolution metadata (using the DCAT-US 3.0 specification as a reference) that would otherwise be a manual, laborious, and error-prone process.
Under the hood, the fetchpost, template, stats, validate and luau commands now have the necessary scaffolding to fully support this workflow inside Datapusher+ and ckanext-scheming.
It adds a new "smart" pivotp command, powered by Polars, to enable fast pivot operations on large datasets. It's "smart" as it uses the stats cache to automatically suggest an aggregation based on a column's data type and summary statistics. You can now pivot your data in seconds by simply specifying the columns to pivot on while blowing past Excel's PivotTable limitations.
stats now computes geometric mean and harmonic mean and adds string length stats, all while getting a performance boost.
join and joinp got a lot of love in this release, with several new options:
- joinp: non-equi join support! 🎉💯🥳
  See "Lightning Fast and Space Efficient Inequality Joins" paper and this Polars non-equi join tracking issue.
- join & joinp: --right-anti and --right-semi joins
- joinp: --ignore-leading-zeros option for join keys
- joinp: --maintain-order option to maintain the order of either the left or right dataset in the output
- joinp: expanded --cache-schema options to make joinp smarter/faster by leveraging the stats cache
- join: --keys-output option to write successfully joined keys to a separate output file.
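
For readers unfamiliar with the join variants listed above, the Python sketch below shows the usual meaning of right-semi and right-anti joins, plus one way of ignoring leading zeros in join keys. It works on small in-memory lists of dicts and is only an illustration under those assumptions: the function names are hypothetical, and qsv's own handling of keys may differ in detail.

```python
from typing import Dict, List

Row = Dict[str, str]


def normalize_key(key: str, ignore_leading_zeros: bool = False) -> str:
    """Optionally strip leading zeros so that "007" and "7" compare equal."""
    if ignore_leading_zeros:
        return key.lstrip("0") or "0"  # keep a single "0" for all-zero keys
    return key


def right_semi_join(left: List[Row], right: List[Row], key: str,
                    ignore_leading_zeros: bool = False) -> List[Row]:
    """Rows of the right dataset that have a matching key in the left dataset."""
    left_keys = {normalize_key(r[key], ignore_leading_zeros) for r in left}
    return [r for r in right
            if normalize_key(r[key], ignore_leading_zeros) in left_keys]


def right_anti_join(left: List[Row], right: List[Row], key: str,
                    ignore_leading_zeros: bool = False) -> List[Row]:
    """Rows of the right dataset that have no matching key in the left dataset."""
    left_keys = {normalize_key(r[key], ignore_leading_zeros) for r in left}
    return [r for r in right
            if normalize_key(r[key], ignore_leading_zeros) not in left_keys]
```

In both cases only columns from the right dataset appear in the output; the left dataset is used purely as a filter.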

This release lays the groundwork for the outliers "smart" command to quickly identify outliers using stats/frequency info.

It also sets the stage for an initial implementation of our "Data Concierge" that leverages all the high-quality, high-res metadata we automagically compile with DRUF to enable Metadata Gardening Agents to proactively link seemingly unrelated data and glean insights as it constantly grooms the Data Catalog - effectively making it a FAIR Data Factory.

Added

fetchpost: add --globals-json option #2357
fixlengths: add --remove-empty option; refactored for performance. Fulfills #2391. #2411
join: add --keys-output option. Fulfills #2407. #2408
join: add --right-anti and --right-semi options. Fulfills #2379. #2380
joinp: add non-equi join support! 🎉💯🥳 #2409
joinp: add --ignore-leading-zeros option. Fulfills #2398. #2400
joinp: add --maintain-order option #2338
joinp: add --right-anti and --right-semi options. Fulfills #2377. #2378
luau: additional helper functions. Fulfills #1782. #2362
luau: add qsv_writejson helper #2375
pivotp: new Polars-powered command. Fulfills #799. #2364
pivotp: "smart" pivotp. #2367
stats: add geometric mean and harmonic mean. Fulfills #2227. #2342
stats: add string length stats to set stage for upcoming outliers "smart" command to quickly identify outliers using stats/frequency info #2390
template: add --globals-json option #2356
tojsonl: add --quiet option. Fulfills #2335. #2336
validate: add --validate-schema option to check if the JSON Schema itself is valid #2393
contrib(completions): add joinp --ignore-case and slice --invert by @rzmk in #2322
contrib(completions): add --quiet to tojsonl by @rzmk in #2337
ci: add qsv_glibc_2.31-headless to action by @rzmk in #2330
Add license to MSI installer by @rzmk in #2321
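
As a reminder of what the geometric mean and harmonic mean added to stats in this release measure, here is a short Python sketch of the standard textbook formulas, cross-checked against Python's statistics module. It is independent of qsv's implementation and assumes strictly positive inputs.

```python
import math
from statistics import geometric_mean, harmonic_mean
from typing import Sequence


def geometric_mean_manual(xs: Sequence[float]) -> float:
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean_manual(xs: Sequence[float]) -> float:
    """n divided by the sum of reciprocals."""
    return len(xs) / sum(1.0 / x for x in xs)


values = [1.0, 2.0, 4.0]
# Geometric mean: cube root of 1 * 2 * 4 = 8, i.e. 2.
assert math.isclose(geometric_mean_manual(values), geometric_mean(values))
# Harmonic mean: 3 / (1 + 1/2 + 1/4) = 12/7.
assert math.isclose(harmonic_mean_manual(values), harmonic_mean(values))
```

The geometric mean suits multiplicative data such as growth rates, while the harmonic mean suits averages of rates such as speeds.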

Changed

lens: optimized csvlens library usage, dropping clap dependency #2403
pivotp: an even smarter pivotp #2368
stats: performance boost 51349ba
Update deb package by @tino097 in #2226
ci: attempt using files-folder instead of files by @rzmk in #2320
Setting QSV_FREEMEMORY_HEADROOM_PCT to 0 disables memory availability check #2353
build(deps): bump actix-governor from 0.7.0 to 0.8.0 by @dependabot in #2351
build(deps): bump bytemuck from 1.20.0 to 1.21.0 by @dependabot in #2361
build(deps): bump chrono from 0.4.38 to 0.4.39 by @dependabot in #2345
build(deps): bump crossbeam-channel from 0.5.13 to 0.5.14 by @dependabot in #2354
build(deps): bump flexi_logger from 0.29.6 to 0.29.7 by @dependabot in #2348
build(deps): bump governor from 0.7.0 to 0.8.0 by @dependabot in #2347
build(deps): bump itertools from 0.13.0 to 0.14.0 by @dependabot in #2413
build(deps): bump jsonschema from 0.26.1 to 0.26.2 by @dependabot in #2355
build(deps): bump jsonschema from 0.26.2 to 0.27.0 by @dependabot in #2371
build(deps): bump jsonschema from 0.27.1 to 0.28.0 by @dependabot in #2389
build(deps): bump jsonschema from 0.28.0 to 0.28.1 by @dependabot in #2396
bump polars from 0.44.2 to 0.45 #2340
build(deps): bump polars from 0.45.0 to 0.45.1 by @dependabot in #2344
bump pyo3 from 0.22 to 0.23 now that Polars supports it #2352
build(deps): bump redis from 0.27.5 to 0.27.6 by @dependabot in #2331
build(deps): bump reqwest from 0.12.9 to 0.12.11 by @dependabot in #2385
build(deps): bump reqwest from 0.12.11 to 0.12.12 by @dependabot in #2395
build(deps): bump rfd from 0.15.1 to 0.15.2 by @dependabot in #2404
build(deps): bump serde from 1.0.215 to 1.0.216 by @dependabot in #2349
build(deps): bump serde from 1.0.216 to 1.0.217 by @dependabot in #2384
build(deps): bump serde_json from 1.0.133 to 1.0.134 by @dependabot in #2365
build(deps): bump sysinfo from 0.32.1 to 0.33.0 by @dependabot in #2334
build(deps): bump sysinfo from 0.33.0 to 0.33.1 by @dependabot in #2383
deps: bump tabwriter to 1.4.1 bbcbeba
build(deps): bump tokio from 1.41.1 to 1.42.0 by @dependabot in #2333
build(deps): bump xxhash-rust from 0.8.12 to 0.8.13 by @dependabot in #2359
build(deps): bump xxhash-rust from 0.8.13 to 0.8.14 by @dependabot in #2372
build(deps): bump xxhash-rust from 0.8.14 to 0.8.15 by @dependabot in #2392
apply several clippy suggestions
bumped numerous indirect dependencies to latest versions
bumped Rust nightly from 2024-11-28 to 2024-12-19 (same version used by Polars)

Fixed

joinp: refactor --cache-schema option. Resolves #2369. #2370
extsort underflow in CSV mode. Resolves #2391. #2412
instantiate logger properly 9c0c1a7
fix util::get_stats_records() to no longer infer boolean in StatsMode::PolarsSchema. Resolves #2369. https://github.com/da...

Contributors

tino097, dependabot, and rzmk

02 Dec 13:27

jqnatividad

1.0.0

cb43ba6

qsv v1.0.0 is here! 🎉

After over 3 years of development, nearly 200 releases, and 11,000+ commits, qsv has finally reached v1.0.0!

What started as a hobby project to learn Rust during COVID has evolved into a powerful data wrangling tool used in multiple datHere products, open source projects, and even in several mission-critical production environments!

To mark this major milestone, this larger than usual release includes major performance improvements, new features, and various optimizations!

Added

joinp: add --ignore-case option #2287
py: add ability to load python expression from file #2295
replace: add --not-one flag (resolves #2305) by @rzmk in #2307
slice: add --invert option #2298
stats: add dataset-level stats #2297
sqlp: auto-decompression of gzip, zstd & zlib compressed csv files with read_csv table function (implements suggestion from @wardi in #2301) #2315
template: add lookup support #2313
added ui feature to make it easier to make a headless build of qsv #2289
added better panic handling #2304
added new benchmark for template command cd7e480
added 📚 lookup support legend b46de73

Changed

move qsv from personal GitHub repo to datHere GitHub org #2317
template: parallelized template rendering for significant speedups #2273
simplify input format check #2309
bump embedded luau from 0.650 to 0.653 986a1d3
deps: Switch back to simple-home-dir from simple-expand-tilde #2319
deps: Add minijinja contrib #2276
deps: bump pyo3 down to 0.21.2 because polars-mem-engine is not compatible with pyo3 0.23.x yet 7f9fc8a
build(deps): bump base62 from 2.0.2 to 2.0.3 by @dependabot in #2281
build(deps): bump bytemuck from 1.19.0 to 1.20.0 by @dependabot in #2299
build(deps): bump bytes from 1.8.0 to 1.9.0 by @dependabot in #2314
build(deps): bump file-format from 0.25.0 to 0.26.0 by @dependabot in #2277
build(deps): bump hashbrown from 0.15.1 to 0.15.2 by @dependabot in #2310
build(deps): bump itoa from 1.0.11 to 1.0.12 by @dependabot in #2300
build(deps): bump itoa from 1.0.12 to 1.0.13 by @dependabot in #2302
build(deps): bump itoa from 1.0.13 to 1.0.14 by @dependabot in #2311
build(deps): bump mlua from 0.10.0 to 0.10.1 by @dependabot in #2280
build(deps): bump mlua from 0.10.1 to 0.10.2 by @dependabot in #2316
build(deps): bump serial_test from 3.1.1 to 3.2.0 by @dependabot in #2279
build(deps): bump minijinja from 2.4.0 to 2.5.0 by @dependabot in #2284
build(deps): bump minijinja-contrib from 2.3.1 to 2.5.0 by @dependabot in #2283
build(deps): bump rfd from 0.15.0 to 0.15.1 by @dependabot in #2291
build(deps): bump sanitize-filename from 0.5.0 to 0.6.0 by @dependabot in #2275
build(deps): bump serde from 1.0.214 to 1.0.215 by @dependabot in #2286
build(deps): bump serde_json from 1.0.132 to 1.0.133 by @dependabot in #2292
build(deps): bump tempfile from 3.13.0 to 3.14.0 by @dependabot in #2278
build(deps): bump tokio from 1.41.0 to 1.41.1 by @dependabot in #2274
build(deps): bump url from 2.5.3 to 2.5.4 by @dependabot in #2306
applied several clippy suggestions
bumped numerous indirect dependencies to latest versions
bumped MSRV to latest Rust stable (1.83.0)
bumped Rust nightly from 2024-11-01 to 2024-11-28, the same version used by Polars

Fixed

fix get_stats_records() helper to handle input files with embedded spaces (fixes #2294) #2296
added better panic handling (fixes #2301) #2304
implement simple format check for input files (fixes #2301) #2308

Removed

removed simple-expand-tilde dependency in favor of simple-home-dir #2318
removed patched fork of indicatif now that 0.17.9 is released, fixing GH unmaintained advisory for instant 33fa54a
removed clipboard command from qsvlite binary variant 9c663d8

Full Changelog: 0.138.0...1.0.0

Contributors

wardi, dependabot, and rzmk

06 Nov 03:23

jqnatividad

0.138.0

6dd67c1

Highlights:

⭐ New template command for rendering templates with CSV data.
Generate complex documents from CSVs (Form letters, HTML, JSON, XML files, etc.) with the powerful MiniJinja template engine (Example template).
⭐ New lookup module for fetching reference data from remote and local files.
In addition to the typical http/https schemes for remote files, qsv adds two additional schemes - CKAN:// and datHere://, fetching lookup data from a CKAN site or datHere maintained reference data respectively. The lookup module has simple file-based caching as well to minimize repeated fetching of typically static reference data (default cache age: 600 seconds).
The lookup module is now being used by the luau (for its qsv_register_lookup helper) and validate (for its dynamicEnum custom JSON Schema keyword) commands. More commands will take advantage of this module over time (e.g. apply, geocode, template, sqlp, etc.) to do extended lookups (e.g. lookup Census information given spatiotemporal data - like demographic info of a Census tract).
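
The simple file-based caching described above, with its default cache age of 600 seconds, can be pictured with this Python sketch. It is not the lookup module's code: cached_fetch, its parameters and the use of the file modification time as the cache timestamp are all assumptions made for illustration.

```python
import os
import time
from typing import Callable, Optional


def cached_fetch(cache_path: str, fetch: Callable[[], str],
                 max_age_secs: float = 600,
                 now: Optional[float] = None) -> str:
    """Return cached data if the cache file is fresh, otherwise refetch and store it."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(cache_path)
        if age <= max_age_secs:
            with open(cache_path, encoding="utf-8") as f:
                return f.read()
    except OSError:
        pass  # no cache file yet (or unreadable): fall through and fetch

    data = fetch()
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(data)
    return data
```

Here fetch would wrap whatever retrieves the reference data, for example an HTTP request, so repeated runs within the cache age never touch the network.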
✨ Enhanced fetchpost with MiniJinja templating for payload construction.
Previously, fetchpost was limited to posting url-encoded HTML Form data with content type application/x-www-form-urlencoded. Now with the new --payload-tpl and --content-type options, users can post request bodies rendered with MiniJinja and specify other content types (typically application/json, text/plain, multipart/form-data) as well.
✨ Improved Polars integration with automatic schema detection
The joinp and sqlp commands now use qsv's stats cache to automatically determine column data types, rather than having Polars scan a sample of rows. This provides two key benefits:
1. Faster execution by skipping Polars' schema inference step
2. GUARANTEED data type inferencing since the stats cache analyzes the entire dataset, not just a sample
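
The second benefit can be seen in a toy Python example: inferring a column's type from only the first few rows can give a different answer than inferring it from every row. The type names and helper functions below are illustrative assumptions and do not reflect how qsv or Polars represent schemas.

```python
from typing import Callable, Sequence


def _parses(cast: Callable[[str], object], value: str) -> bool:
    """True if cast(value) succeeds."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


def infer_type(values: Sequence[str]) -> str:
    """Pick the narrowest of Integer, Float or String that fits every value."""
    if not values:
        return "String"
    if all(_parses(int, v) for v in values):
        return "Integer"
    if all(_parses(float, v) for v in values):
        return "Float"
    return "String"


def infer_from_sample(values: Sequence[str], n: int) -> str:
    """Infer the type from only the first n values, as a sampling scanner would."""
    return infer_type(values[:n])


column = ["1", "2", "3", "N/A"]
print(infer_from_sample(column, 3))  # Integer: the sample never sees "N/A"
print(infer_type(column))            # String: the full scan does
```

A schema derived from statistics over the whole dataset avoids this kind of late surprise.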
🏃 fast-float2 crate for faster float parsing
Casting string/bytes to float is now much faster (2 to 8x faster than Rust's standard library) with fast-float2.
💪 Major dependency updates including Polars 0.44.2, Luau 0.650, mlua 0.10.0 and jsonschema 0.26.1
These core crates underpin qsv's advanced commands. Using the latest version of these crates allows qsv to stay true to its goal of being the fastest and most comprehensive data-wrangling toolkit.

Added

added lookup module - enabling fetching and caching of reference data from remote and local files #2262
fetchpost: add --payload-tpl <file> and --content-type options to construct payload using MiniJinja with the appropriate content-type #2268 5921498
joinp: derive polars schema from stats cache 86fe22e
sqlp: derive polars schema from stats cache #2256
template: new command to render MiniJinja templates with CSV data #2267
validate: add dynamicEnum lookup support #2265
contrib(completions): add template command and update fetchpost by @rzmk in #2269
add fast-float2 dependency for faster bytes to float conversion 7590e4e 3ca30aa
added more benchmarks for new/updated commands f8a1d4f cd7e480

Changed

luau: adapt to mlua 0.10 API changes 268cb45
luau: refactored stage management 31ef58a
luau: now uses the lookup module 2f4be34
stats: minor perf refactoring 6cdd6ea
build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #2243
build(deps): bump azure/trusted-signing-action from 0.4.0 to 0.5.0 by @dependabot in #2239
build(deps): bump bytes from 1.7.2 to 1.8.0 by @dependabot in #2231
build(deps): bump cached from 0.53.1 to 0.54.0 by @dependabot in #2272
build(deps): bump flexi_logger from 0.29.3 to 0.29.4 by @dependabot in #2229
build(deps): bump flexi_logger from 0.29.4 to 0.29.5 by @dependabot in #2261
build(deps): bump flexi_logger from 0.29.5 to 0.29.6 by @dependabot in #2266
build(deps): bump hashbrown from 0.15.0 to 0.15.1 by @dependabot in #2270
build(deps): bump jsonschema from 0.24.0 to 0.24.1 by @dependabot in #2234
build(deps): bump jsonschema from 0.24.1 to 0.24.2 by @dependabot in #2238
build(deps): bump jsonschema from 0.24.2 to 0.24.3 by @dependabot in #2240
build(deps): bump jsonschema from 0.25.0 to 0.25.1 by @dependabot in #2244
build(deps): bump jsonschema from 0.26.0 to 0.26.1 by @dependabot in #2260
build(deps): bump regex from 1.11.0 to 1.11.1 by @dependabot in #2242
build(deps): bump reqwest from 0.12.8 to 0.12.9 by @dependabot in #2258
build(deps): bump serde from 1.0.210 to 1.0.211 by @dependabot in #2232
build(deps): bump serde from 1.0.211 to 1.0.213 by @dependabot in #2236
build(deps): bump serde from 1.0.213 to 1.0.214 by @dependabot in #2259
build(deps): bump simd-json from 0.14.1 to 0.14.2 by @dependabot in #2235
build(deps): bump tokio from 1.40.0 to 1.41.0 by @dependabot in #2237
deps: updated our fork of the csv crate with more perf optimizations eae7d76
deps: use calamine upstream with unreleased fixes 4cc7f37
deps: use our csvlens fork until PR removing unneeded arboard features is merged bb32322
deps: bump jsonschema from 0.25 to 0.26 #2251
deps: bump embedded Luau from 0.640 to 0.650 8c54b87 aca30b0
deps: bump mlua from 0.9 to 0.10 #2249
deps: bump Polars from 0.43.1 at py-1.11.0 tag to latest 0.44.2 upstream #2255 0e40a44
apply select clippy lint suggestions
updated indirect dependencies
aligned Rust nightly to Polars nightly - 2024-10-28 - 245bcb5

Fixed

fix documentation typo: it's → its by @tmtmtmtm in #2254

Removed

removed need to set RAYON_NUM_THREADS env var and just call the Rayon API directly aa6ef89
removed unneeded create_dir_all_threadsafe helper now that std::create_dir_all is threadsafe d0af83b

Full Changelog: 0.137.0...0.138.0

Contributors

tmtmtmtm, dependabot, and rzmk

21 Oct 03:57

jqnatividad

0.137.0

75dbaba

0.137.0

Highlights:

extdedup & extsort now support two modes - LINE mode and CSV mode. Previously, both commands only sorted on a line-by-line basis (LINE mode).
With the addition of CSV mode, you can now deduplicate or sort CSV files on a column-by-column basis, with the powerful --select option to specify which columns to deduplicate or sort on.
This is especially useful for large CSV files with many columns, where you only want to deduplicate or sort on a subset of columns. And since both commands use disk-backed algorithms (an on-disk hash table for extdedup, and an external merge sort for extsort) - they can handle files larger than memory.
sqlp now has a --cache-schema option that caches the inferred schema of the input CSV file, which can significantly speed up subsequent queries on the same file, as the initial schema inferencing step is skipped.
fetch and fetchpost have been updated to use the jaq crate instead of the jql crate. This change was made to improve performance and to make the commands consistent with the json command which also uses jaq. Furthermore, jaq is a clone of jq - a widely used JSON parsing tool, so it should be more familiar to users.
stats is a tad faster as we keep squeezing more performance from this central command.
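The difference between LINE mode and CSV mode deduplication can be sketched in Python as follows. This is only an illustration: the in-memory `set` stands in for the on-disk hash table mentioned above, and the function names, the `select` parameter and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical data: rows 1 and 3 are identical, rows 1 and 2 share an email.
DATA = "id,email,city\n1,a@x.org,Austin\n2,a@x.org,Boston\n1,a@x.org,Austin\n"


def dedup_lines(text):
    """LINE mode: a line is a duplicate only if the whole line was seen before."""
    seen, kept = set(), []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


def dedup_csv(text, select):
    """CSV mode: a row is a duplicate if its selected columns were seen before."""
    seen, kept = set(), []
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row[name] for name in select)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


by_line = dedup_lines(DATA)             # header plus two distinct lines
by_email = dedup_csv(DATA, ["email"])   # only the first row survives
```

Selecting a subset of columns makes the duplicate key smaller than the full line, so rows that differ elsewhere can still collapse into one.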

Added

extdedup: now supports two modes - LINE mode and CSV mode #2208
extsort: now also has two modes - CSV mode and LINE mode #2210
sqlp: add --cache-schema option #2224
added sqlp --cache-schema benchmarks
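The idea behind caching an inferred schema can be sketched in Python as below. This is a conceptual illustration only, not sqlp's implementation: the sidecar file name, the freshness check based on modification times, and every function name are assumptions made for the example.

```python
import json
import os
import tempfile


def load_or_infer_schema(csv_path, infer):
    """Return (schema, cache_hit), reusing a sidecar file when it is fresh."""
    cache_path = csv_path + ".schema-cache.json"  # hypothetical file name
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(csv_path)
    )
    if cache_is_fresh:
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f), True
    schema = infer(csv_path)  # the expensive step that caching skips next time
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(schema, f)
    return schema, False


def demo():
    """Query the same file twice and report cache hits and inference calls."""
    calls = []

    def slow_infer(path):
        calls.append(path)  # stands in for an expensive scan of the file
        return {"id": "Int64", "name": "String"}

    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "data.csv")
        with open(csv_path, "w", encoding="utf-8") as f:
            f.write("id,name\n1,Ada\n")
        _, first_hit = load_or_infer_schema(csv_path, slow_infer)
        _, second_hit = load_or_infer_schema(csv_path, slow_infer)
    return first_hit, second_hit, len(calls)
```

In the demo, the second call is served from the sidecar file, so inference runs only once.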

Changed

apply & applydp: use smallvec for operations vector & other minor performance optimizations #2219 & bc837ae
apply & applydp: specify min_length for parallel iterators 7d6ce5e
fetch & fetchpost: replace jql with jaq #2222
stats: performance optimizations f205809 e26c27f 4579c1b
validate: specify min_length for parallel iterators a5b8185
deps: updated polars to 0.43.1 at the py-1.10.0 tag.
build(deps): bump calamine from 0.26.0 to 0.26.1 by @dependabot in #2204
build(deps): bump csvs_convert from 0.8.14 to 0.9.0 by @dependabot in #2215
build(deps): bump flexi_logger from 0.29.2 to 0.29.3 by @dependabot in #2209
build(deps): bump jsonschema from 0.23.0 to 0.24.0 by @dependabot in #2223
build(deps): bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #2207
build(deps): bump pyo3 from 0.22.4 to 0.22.5 by @dependabot in #2212
build(deps): bump redis from 0.27.3 to 0.27.4 by @dependabot in #2202
build(deps): bump redis from 0.27.4 to 0.27.5 by @dependabot in #2217
build(deps): bump serde_json from 1.0.129 to 1.0.130 by @dependabot in #2218
build(deps): bump serde_json from 1.0.131 to 1.0.132 by @dependabot in #2220
build(deps): bump uuid from 1.10.0 to 1.11.0 by @dependabot in #2213
apply select clippy lints
bumped indirect dependencies
bumped MSRV to 1.82

Fixed

fix performance regression in batched commands by refactoring optimal_batch_size to require indexed CSV files #2206
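One way to see why a batch-size heuristic might depend on an index is the Python sketch below. It assumes, hypothetically, that the row count is only cheaply available for indexed files and that rows are split evenly across worker threads; `pick_batch_size` and its parameters are illustrative and do not describe qsv's actual `optimal_batch_size` logic.

```python
import math


def pick_batch_size(row_count, requested, jobs):
    """Choose a batch size; row_count is None when no index is available."""
    if row_count is None:
        # Without an index, counting rows would need a full pass over the
        # file, so this sketch just keeps the requested size.
        return requested
    # With a known row count, give each worker an equal share of the rows.
    return max(1, math.ceil(row_count / jobs))
```

For example, `pick_batch_size(1_000_000, 50_000, 4)` yields one batch of 250,000 rows per worker, while an unindexed file keeps the requested 50,000.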

Removed

fetch & fetchpost: removed jql options; replaced with jaq #2222

Full Changelog: 0.136.0...0.137.0

Contributors

dependabot

08 Oct 19:41

jqnatividad

0.136.0

82b7611

0.136.0

🎉 qsv pro is now available in the Microsoft Store! 🎉

It's Data Wrangling Democratized on the Desktop, featuring:

📊 Familiar Spreadsheet Interface
tap the power of qsv to query, analyze, enrich, scrub and transform huge Excel files and multi-gigabyte CSV files in seconds, without having to deal with the command-line.
CKAN desktop client
designed to make data publishing easier for portal operators and data stewards using the CKAN platform.
📥 Flow
allows you to build custom node-based flows and data pipelines using a visual interface.
🔧 Toolbox
features an ever-expanding library of reusable scripts for common data-wrangling use cases.
⭐ and more!
Natural Language Interface (RAG), Polars SQL query support, an API, Python/Luau support, automatic Data Dictionaries, DCAT 3 metadata profile inferencing, along with a retinue of other cloud-based services (e.g. customizable street-level geocoding, data feeds, reference data lookups, geo-ip lookups, cloud storage support, .qsv file format, etc.) that will be unveiled in future versions.

Like qsv, we're iterating rapidly with qsv pro, so your feedback is essential. Give it a try!

Get it from https://qsvpro.dathere.com or the Microsoft Store.

Other highlights:

excel: new --table option for XLSX files; new --header-row option; expanded --range option, adding support for Named Ranges and absolute ranges (e.g. Sheet2!$A$1:$J$10); and expanded metadata export now including Named Ranges and Tables (for XLSX files)
Improved performance for several commands (apply, datefmt, tojsonl and validate) through automatic batch size optimization
validate: the dynamicEnum custom JSON Schema keyword (renamed from dynenum) and enhanced email validation
schema: automatic JSON Schema const inferencing for columns with just one value
Significant dependency updates, including latest upstream versions of Polars, jsonschema, and serde_json with unreleased performance upgrades, new features and fixes
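To make the absolute range syntax in the excel highlight concrete (e.g. Sheet2!$A$1:$J$10), here is a small Python sketch that splits such a string into a sheet name and 1-based (row, column) corners. The regular expression and function names are hypothetical and only cover simple `Sheet!A1:B2` style ranges; they are not the parser the excel command uses.

```python
import re

# Optional "Sheet!" prefix, then two cell references with optional "$" anchors.
RANGE_RE = re.compile(
    r"^(?:(?P<sheet>[^!]+)!)?"
    r"\$?(?P<col1>[A-Za-z]+)\$?(?P<row1>\d+)"
    r":\$?(?P<col2>[A-Za-z]+)\$?(?P<row2>\d+)$"
)


def column_index(letters):
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_range(text):
    """Return (sheet, (row, col) of start, (row, col) of end) for a cell range."""
    match = RANGE_RE.match(text)
    if match is None:
        raise ValueError(f"not a cell range: {text!r}")
    start = (int(match["row1"]), column_index(match["col1"]))
    end = (int(match["row2"]), column_index(match["col2"]))
    return match["sheet"], start, end
```

With this sketch, `parse_range("Sheet2!$A$1:$J$10")` gives the sheet name `Sheet2` and the corners (1, 1) and (10, 10); a range without a sheet prefix returns `None` for the sheet.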

NOTE: You can see qsv & qsv pro in action in our "The Problem with Data Portals" webinar on Wed, Oct 23, 2024, 1-2pm EDT.

Added

🎉 qsv pro is now in the Microsoft Store!!! 🎉
apply, datefmt, tojsonl, validate: added logic to automatically determine optimal batch size for better parallelization #2178
enum: added --new-column support for all enum modes, not just --increment #2173
excel: new --table option for XLSX files #2194
excel: new --header-row option 458f79a
excel: expanded range and metadata options #2195
schema: added JSON Schema automatic const inferencing #2180
Add signing step to qsv MSI installer GitHub Action by @rzmk in #2182
contrib(completions): add --table option to qsv excel by @rzmk in #2197
completions: add --header-row option to qsv excel e8794d5
added new apply operations sentiment benchmark b745e64
docs: added indexing section to PERFORMANCE.md 804145a
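The automatic const inferencing added to schema can be pictured with the following Python sketch. It assumes a column qualifies when it holds exactly one distinct value, as the highlight describes; the function name is hypothetical, every column is typed as a string for brevity, and this is not the schema command's actual code.

```python
def infer_property(values):
    """Build a JSON Schema property, adding const when only one value occurs."""
    distinct = set(values)
    prop = {"type": "string"}  # simplification: all values treated as strings
    if len(distinct) == 1:
        prop["const"] = next(iter(distinct))
    return prop


currency = infer_property(["USD", "USD", "USD"])  # single value, so const is set
country = infer_property(["US", "CA"])            # two values, so no const
```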

Changed

stats: various minor micro-optimizations 62d95fc 2c2862a
validate: renamed custom keyword dynenum to dynamicEnum to be more consistent with JSON schema naming conventions 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
validate: optimizations for increased performance; replace serde_json with simd_json 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
apply new clippy::ref_option lint to Config::new API #2192
Update debian package readme by @tino097 in #2187
deps: bump calamine from 0.25 to 0.26 b42279a
deps: jsonschema use latest 0.22.3 upstream with unreleased features/fixes
deps: polars use latest 0.43.1 upstream with unreleased features/fixes
deps: created our own fork of unmaintained vader_sentiment crate b426761
deps: use serde_json upstream with unreleased perf improvement/fixes https://github.com/jqnatividad/qsv/blob/1c1174b3b8b65d9dfd9c841597366fb09d0a047c/Cargo.toml#L221
build(deps): bump flate2 from 1.0.33 to 1.0.34 by @dependabot in #2171
build(deps): bump flexi_logger from 0.29.0 to 0.29.1 by @dependabot in #2189
build(deps): bump flexi_logger from 0.29.1 to 0.29.2 by @dependabot in #2196
build(deps): bump hashbrown from 0.14.5 to 0.15.0 by @dependabot in #2186
build(deps): bump jsonschema from 0.20.0 to 0.21.0 by @dependabot in #2177
build(deps): bump jsonschema from 0.22.1 to 0.22.2 by @dependabot in #2191
build(deps): bump regex from 1.10.6 to 1.11.0 by @dependabot in #2176
build(deps): bump reqwest from 0.12.7 to 0.12.8 by @dependabot in #2183
build(deps): bump simd-json from 0.14.0 to 0.14.1 #2199
build(deps): bump simple-expand-tilde from 0.4.2 to 0.4.3 by @dependabot in #2190
build(deps): bump sysinfo from 0.31.4 to 0.32.0 by @dependabot in #2193
build(deps): bump tempfile from 3.12.0 to 3.13.0 by @dependabot in #2175
apply select clippy lints
bumped indirect dependencies
aligned Rust nightly to Polars nightly - 2024-09-29 7cd2de1

Fixed

schema: fix enum so it only adds a list when the number of unique values > --enum-threshold #2180
Upload artifact fix for Debian package publishing by @tino097 in #2168
fixed typos configuration 627de89
fixed various GitHub Actions publishing workflow issues

Full Changelog: 0.135.0...0.136.0

Contributors

tino097, dependabot, and rzmk
