Releases: dathere/qsv
3.1.1
[3.1.1] - 2025-02-24
Highlights:
- sample: is now a "smart" command that uses the stats cache to validate and make sampling faster.
- With the QSV_STATSCACHE_MODE env var, you can now control the stats cache behavior suite-wide, making sure "smart" commands use it when appropriate.
- luau command's capabilities have been significantly expanded with:
  - New accumulate helper function for aggregating values across rows
  - Optional naming for cumulative helper functions
  - More robust error handling and improved docstrings
  - Enhanced scripting performance with fast-float parsing
  - New Wiki section with examples of using its helper functions
- schema: now does type-aware sorting of enum lists, making JSON Schema enum list customization easier when fine-tuning it for JSON Schema validation with validate.
- lens: adds --freeze-columns option with a default of 1, improving navigation of wide CSVs
- stats: adds --dataset-stats option to explicitly compute dataset-level statistics. Starting with qsv 2.0.0, it was computed automatically to support Datapusher+ and the DRUF workflow, but it was causing confusion with some command-line users.
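As an illustration of what "type-aware sorting of enum lists" can mean, here is a minimal Python sketch. It is not qsv's implementation: the function name `type_aware_sort` and the type labels `"Integer"`, `"Float"` and `"String"` are assumptions made for this example. The idea is that enum values read from a CSV are strings, so a plain sort orders `"10"` before `"9"`, while a sort keyed on the column's inferred type does not.

```python
def type_aware_sort(values, inferred_type):
    """Sort enum values (strings read from a CSV) using the column's inferred type.

    `inferred_type` is a hypothetical label for this sketch, not a qsv API value.
    """
    if inferred_type == "Integer":
        return sorted(values, key=int)    # numeric order for integer columns
    if inferred_type == "Float":
        return sorted(values, key=float)  # numeric order for float columns
    return sorted(values)                 # plain lexicographic order otherwise


print(type_aware_sort(["10", "9", "2"], "Integer"))  # ['2', '9', '10']
print(type_aware_sort(["10", "9", "2"], "String"))   # ['10', '2', '9']
```

A numerically ordered enum list is easier to scan and edit by hand when fine-tuning a generated JSON Schema.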
Added
- lens: added --freeze-columns option #2552
- luau: added accumulate helper function #2537 #2539
- luau: added a new section in the Wiki with examples of using the new helper functions https://github.com/dathere/qsv/wiki/Luau-Helper-Functions-Examples
- sample: is now "smart" - using the stats cache to validate and make sampling faster #2529 #2530 71ec7ed
- schema: added type-aware sort of JSON Schema enum list #2551
- stats: added --dataset-stats option #2555
- python: added precompiled qsvpy binary for Python 3.13 c408778
- added QSV_STATSCACHE_MODE env var to control stats cache suite-wide 4afb98d 2adc313 ba75f08
- docs: updated PERFORMANCE docs and added a TLDR version 77ed167 c61c249 db0bb3f
- chore: added *.tab & *.ssv to typos config 5236675
Changed
- frequency: made error handling more robust b195519
- luau: refactored all cumulative helper functions (cum_); they now have name as an optional argument #2540
- schema: refactored to use QSV_STATSCACHE_MODE env var 5771ff4
- select: refactored select helper bfbe64c
- stats: optimized memory layout of central Stats struct 52f697e
- stats: optimized record_count functionality 0e3114a 18791da
- contrib(completions): update qsv completions for qsv 3.1 by @rzmk in #2556
- deps: bump arrow and tempfile 4cc2679
- deps: bump cached and redis crates e622d14
- deps: bump csvlens from 0.11 to 0.12 b2fd985
- deps: use our patched fork of csvlens with ability to freeze columns d66ec6d
- deps: bump polars to 0.46.0 at py-1.23.0 tag 6072aa2
- deps: bump flate2 from 1.0.35 to 1.1.0 eed471a
- deps: bump gzp from 0.11 to 1.0.0 43c8a4a
- build(deps): bump jaq-json from 1.1.0 to 1.1.1 by @dependabot in #2547
- build(deps): bump jaq-core from 2.1.0 to 2.1.1 by @dependabot in #2546
- build(deps): bump log from 0.4.25 to 0.4.26 by @dependabot in #2545
- build(deps): bump tempfile from 3.16.0 to 3.17.0 by @dependabot in #2532
- build(deps): bump tempfile from 3.17.0 to 3.17.1 by @dependabot in #2535
- build(deps): bump serde_json from 1.0.138 to 1.0.139 by @dependabot in #2541
- build(deps): bump serde from 1.0.217 to 1.0.218 by @dependabot in #2542
- build(deps): bump smallvec from 1.13.2 to 1.14.0 by @dependabot in #2528
- build(deps): bump strum from 0.27.0 to 0.27.1 by @dependabot in #2533
- build(deps): bump strum_macros from 0.27.0 to 0.27.1 by @dependabot in #2534
- build(deps): bump uuid from 1.13.1 to 1.13.2 by @dependabot in #2538
- build(deps): bump uuid from 1.13.2 to 1.14.0 by @dependabot in #2544
- chore: we now have ~1,800 tests! f5d09ed
- applied select clippy lint suggestions
- bumped indirect dependencies to latest versions
- bumped MSRV to latest Rust stable - v1.85
Fixed
- count: refactored to fall back to "regular" CSV reader when Polars counting returns a zero count fd39bcb
- schema: fixed off-by-one error 60de090
- ensured get_stats_record helper returns field/stats correctly ad86a37
- Fixed RUSTSEC-2025-0007: ring is unmaintained #2548
- stats: only add qsv__value column when --dataset-stats is enabled 64267d3
- skip format check when path starts with temp dir (indicating it's a file streamed from STDIN) or is a snappy file ff8957e
Removed
- frequency: removed --stats-mode option now that we have a suite-wide QSV_STATSCACHE_MODE env var ba75f08 416abb7
- chore: removed simdutf8 conditional directive for aarch64 architecture, now that it's no longer needed ec1e16c
- removed publish-linux-qsvpy-glibc-231-musl-123.yml workflow as it was getting cross compilation errors and we have another musl workflow that works 7c08617
Full Changelog: 3.0.0...3.1.1
3.0.0
[3.0.0] - 2025-02-13
Highlights:
- sample: Five new sampling methods! In addition to reservoir & indexed - added bernoulli, systematic, stratified, weighted & cluster sampling. And they're all memory efficient so you should be able to sample arbitrarily large datasets!
- stats: Added "sortiness" [-1 (Descending) to 1 (Ascending)] & "uniqueness_ratio" [0 (many repeated values) to 1 (All unique values)] stats.
  The qsv-stats engine was also optimized to squeeze out more performance, with stats now 2.6x faster while using less memory despite the addition of these new stats.
- diff: is now a "smart" command, so that it uses the stats cache to short-circuit diffs if files are identical per their fingerprint hashes, and to validate that the diff key column is all unique.
- The stats cache has been refactored, improving performance for "smart" commands:
  - frequency is not only 3.3x faster, it uses far less memory as it now doesn't need to maintain hashmaps for columns with all unique values.
  - tojsonl is 2.25x faster
  - schema is 1.4x faster
- luau got a major performance boost with the v0.660 engine upgrade, taking advantage of several compiler optimizations. luau is now up to 3.1x faster!
- validate had a major performance regression - slowing from 3.295 seconds in v2.1.0 to 13.159 seconds in v2.2.1 in the benchmarks. 4x slower! With the jsonschema 0.29 crate update, validate now clocks in at 3.022 seconds!
- template also got a big boost and is now 2.9x faster with the minijinja 2.7 crate update.
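To give a feel for the two new statistics, here is a minimal Python sketch of one way to compute scores with the ranges described above. The formulas are assumptions for illustration and may differ from what qsv-stats actually computes: "sortiness" is taken as (ascending adjacent pairs minus descending adjacent pairs) divided by the number of adjacent pairs, and "uniqueness_ratio" as distinct values divided by total values.

```python
def sortiness(values):
    """Score in [-1, 1]: 1.0 if every adjacent pair increases, -1.0 if every pair decreases.

    Assumed formula for this sketch, not necessarily the one used by qsv-stats.
    """
    if len(values) < 2:
        return 0.0
    pairs = list(zip(values, values[1:]))
    ascending = sum(1 for a, b in pairs if a < b)
    descending = sum(1 for a, b in pairs if a > b)
    return (ascending - descending) / len(pairs)


def uniqueness_ratio(values):
    """Distinct values divided by total values; 1.0 means all values are unique."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


print(sortiness([1, 2, 3, 4]))                   # 1.0
print(sortiness([4, 3, 2, 1]))                   # -1.0
print(uniqueness_ratio(["a", "a", "a", "a"]))    # 0.25
```

A column with a uniqueness ratio of 1 is a candidate key, which is the kind of fact a "smart" command can read from the stats cache instead of rescanning the file.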
Added
- joinp: additional joinp asof join sort and match options #2486
- stats: add "sortiness" statistic #2499
- stats: add uniqueness_ratio #2521
- stats & frequency: add --vis-whitespace option. Fulfills #2501 #2503
- sample: add more sampling methods (in addition to indexed and reservoir - added bernoulli, systematic, stratified, weighted & cluster sampling) and made them all memory efficient so we can sample arbitrarily large datasets: #2507 & #2511
- diff: make diff a "smart" command. Fulfills #2493 and #2509 #2518
- benchmarks: added new benchmarks for sample for new sampling methods d758c54
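Two of the new sampling methods are simple enough to sketch in a few lines. The Python below shows textbook Bernoulli and systematic sampling over a stream of rows. It is an illustration under assumed names (`bernoulli_sample`, `systematic_sample`, and their parameters are hypothetical), not qsv's code or its command-line options. Both functions are generators that look at one row at a time, which is why this style of sampling needs only constant memory regardless of input size.

```python
import random


def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p (0.0 <= p <= 1.0)."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < p:
            yield row


def systematic_sample(rows, k, start=0):
    """Keep every k-th row, beginning at row index `start`."""
    for i, row in enumerate(rows):
        if i >= start and (i - start) % k == 0:
            yield row


print(list(systematic_sample(range(10), 3)))           # [0, 3, 6, 9]
print(list(systematic_sample(range(10), 3, start=1)))  # [1, 4, 7]
```

Bernoulli sampling returns a variable number of rows (about p times the input size), while systematic sampling returns a predictable count but can be biased if the data has a periodic pattern that lines up with k.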
Changed
- luau: bump from 0.653 to 0.660 and optimize for performance 4402df6 de429b4 07ff8b8 3211f5c
- stats: compute string len stats only for string columns #2495
- contrib(completions): update qsv completions for qsv 2.2.1 by @rzmk in #2494
- deps: bump polars to latest upstream after its py-1.22.0 release
- deps: backported csv-core 0.1.12 fix to our qsv-optimized csv-core fork dathere/rust-csv@5d0916e
- build(deps): bump actions/setup-python from 5.3.0 to 5.4.0 by @dependabot in #2488
- build(deps): bump bytes from 1.9.0 to 1.10.0 by @dependabot in #2497
- build(deps): bump data-encoding from 2.7.0 to 2.8.0 by @dependabot in #2512
- build(deps): bump geosuggest-core from 0.6.5 to 0.6.6 by @dependabot in #2520
- build(deps): bump geosuggest-utils from 0.6.5 to 0.6.6 by @dependabot in #2519
- build(deps): bump jsonschema from 0.28.3 to 0.29.0 by @dependabot in #2510
- build(deps): bump minijinja from 2.6.0 to 2.7.0 by @dependabot in #2489
- build(deps): bump mlua from 0.10.2 to 0.10.3 by @dependabot in #2485
- build(deps): bump qsv-stats from 0.27.0 to 0.28.0 by @dependabot in #2496
- build(deps): bump qsv-stats from 0.28.0 to 0.29.0 by @dependabot in #2498
- build(deps): bump qsv-stats from 0.29.0 to 0.30.0 by @dependabot in #2505
- chore: Bump rand to 0.9 #2504
- build(deps): bump simple-home-dir from 0.4.6 to 0.4.7 by @dependabot in #2515
- build(deps): bump uuid from 1.12.1 to 1.13.1 by @dependabot in #2500
- bumped numerous indirect dependencies to latest versions
- applied select clippy lint suggestions
- bumped MSRV to latest Rust stable - v1.84.1
Fixed
- docs: QSV_AUTOINDEX => QSV_AUTOINDEX_SIZE typo. Fixes #2479 #2484
- fix: search & searchset off by 1 when using --flag option. Fixes #2508 #2513
Full Changelog: 2.2.1...3.0.0
2.2.1
[2.2.1] - 2025-01-27
Changed
- deps: bumped polars to 0.46.0. This will allow us to publish qsv to crates.io as qsv was using features that were not enabled in polars 0.45.1 275b2b8
Fixed
- stats: fix cache json processing bug. Fixes #2476 #2477
- benchmarks: v6.1.0 - ensured all stats cache benchmarks actually used the stats cache even if the default --cache-threshold is 5 seconds - too high to trigger stats cache creation ac33010
Full Changelog: 2.2.0...2.2.1
2.2.0
[2.2.0] - 2025-01-26
Highlights:
- stats - the ❤️ of qsv, got a little tune-up:
  - It got a tad faster now that we only compute string length stats for string types. Previously, we were also computing length for numbers, thinking it'll be useful for storage sizing purposes (as everything is stored as string with CSV). But as performance is goal number 1, we're no longer doing so. Besides, this sizing info can be derived using other stats.
  - Fixed the problem with the stats cache being deleted/ignored even when not necessary.
    This bug snuck in while implementing the --cache-threshold cache suppression option. With stats getting its cache mojo back - expect near-instant cache-backed response not only for stats but also other "automagical" smart commands.
- diff - @janriemer squashed some bugs without sacrificing diff's ludicrous speed!
- validate: added dynamicEnum custom JSON Schema keyword column specifier support.
  You can now specify which column to validate against (by name or by 0-based column index), instead of always using the first column. This works for local & remote lookup files using the http/s://, ckan:// and dathere:// URL schemes.
- extdedup now actually uses a proper on-disk hash table backed by a memory-mapped file.
  Previously, it was only deduping in-memory as the odht crate was not properly wired to a memory mapped file 🤦 (I took the name of the odht crate literally and thought it was handling it 🤷). Thanks for the detailed bug report @Svenskunganka!
- JSON query parsing overhaul.
  The fetch, fetchpost & json commands now use the latest jaq engine, making for faster performance especially now that we're precompiling and caching the jaq filter.
- Polars engine upgraded.
  By two versions! py-polars 1.20.0 and 1.21.0 - giving the sqlp, joinp, pivotp & count commands a little boost.
NOTE: qsv v2.2.0 is not available on crates.io as it does not allow enabling unreleased features as we await a new version of Polars. As soon as Polars 0.46.0 is published, a new qsv patch release will be published to crates.io.
This means that installation option 3 using cargo install will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.2.0 still work.
Added
- diff: add --delimiter "convenience" option. Fulfills #2447 #2464
- slice: add stdin and snappy compressed file support ab34a62
- validate: add dynamicEnum column specifier support. Fulfills #2470 #2472
Changed
- fetch, fetchpost & json: jaq dependency upgrade - from jaq-interpret & jaq-parse to jaq-core/jaq-json/jaq-std #2458
- fetch & fetchpost: cache compiled jaq filter #2467
- joinp: adjust asofby test to reflect Polars py-1.20.0 behavior 853a266
- stats: compute string length stats for string type only #2471
- sqlp: wordsmith fastpath explanation 4e3f853
- refactor: standardize -q and -Q shortcut options. Fulfills #2466 #2468
- deps: bump polars to 0.45.1 at py-polars-1.20.0 tag #2448
- deps: bump polars to 0.45.1 at py-polars-1.21.0 tag 4525d00
- deps: Bump csv-diff to 0.1.1 by @janriemer in #2456
- deps: Bump csvlens to latest upstream 27a723e
- deps: use latest strum upstream 2ca1b0d
- build(deps): bump base62 from 2.2.0 to 2.2.1 by @dependabot in #2440
- build(deps): bump chrono-tz from 0.10.0 to 0.10.1 by @dependabot in #2449
- build(deps): bump data-encoding from 2.6.0 to 2.7.0 by @dependabot in #2444
- build(deps): bump indexmap from 2.7.0 to 2.7.1 by @dependabot in #2461
- build(deps): bump jsonschema from 0.28.1 to 0.28.2 by @dependabot in #2469
- build(deps): bump jsonschema from 0.28.2 to 0.28.3 by @dependabot in #2473
- build(deps): bump log from 0.4.22 to 0.4.25 by @dependabot in #2439
- build(deps): bump semver from 1.0.24 to 1.0.25 by @dependabot in #2459
- build(deps): bump serde_json from 1.0.135 to 1.0.136 by @dependabot in #2455
- build(deps): bump serde_json from 1.0.136 to 1.0.137 by @dependabot in #2460
- build(deps): bump simple-home-dir from 0.4.5 to 0.4.6 by @dependabot in #2445
- build(deps): bump uuid from 1.11.1 to 1.12.0 by @dependabot in #2441
- build(deps): bump uuid from 1.12.0 to 1.12.1 by @dependabot in #2465
- tests: enabled Windows CI caching for faster CI tests
- bumped numerous indirect dependencies to latest versions
- applied select clippy lint suggestions
Fixed
- count: Sometimes, polars count returns zero even if there are rows. Fixed by doing a regular csv reader count when polars count returns zero abcd365
- diff: Fix name to index conversion by @janriemer. Fixes #2443 #2457
- extdedup: refactor/fix to actually have on-disk hash table backed by a mem-mapped file. Fixes #2462 #2475
- stats: fix stats caching as it was inadvertently deleting the stats cache even when not necessary 96e6d28
Removed
- foreach: refactored to remove unmaintained local-encoding dependency #2454
- remove polars feature from qsvdp binary variant. We'll use py-polars from DP+ directly.
Full Changelog: 2.1.0...2.2.0
2.1.0
[2.1.0] - 2025-01-12
Highlights:
- join & joinp fine-tuning continues, with several join key transformation options (--ignore-leading-zeros & --norm-unicode); join fixes for --right-anti and --right-semi joins; and reverting a join performance regression with 2.0.0.
- pivotp uses more summary statistics for even smarter aggregation suggestions
NOTE: qsv v2.1.0 is not available on crates.io. This was caused by qsv's use of a brand new string_normalize Polars feature that is not yet available on the latest release of Polars - v0.45.1. Once a new version of Polars is published with this feature, a new qsv patch release will be published to crates.io.
This means that installation option 3 using cargo install will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.1.0 still work.
Added
- join: add --ignore-leading-zeros option #2430
- joinp: add --norm-unicode option to unicode normalize join keys #2436
- pivotp: added more smart aggregation suggestions #2428
- template: added to qsvdp binary variant 9df85e6
- benchmarks: added pivotp benchmark 92e4c51
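The two join key transformations can be pictured as a normalization step applied to each key before comparison. The Python sketch below is only an illustration of that idea: `normalize_key` and its parameters are hypothetical names, and the choice of Unicode normalization form and the handling of all-zero keys are assumptions of this example rather than documented qsv behavior.

```python
import unicodedata


def normalize_key(key, ignore_leading_zeros=False, unicode_form=None):
    """Return a join key transformed for comparison purposes.

    unicode_form: one of "NFC", "NFD", "NFKC", "NFKD", or None to skip.
    """
    if unicode_form is not None:
        key = unicodedata.normalize(unicode_form, key)
    if ignore_leading_zeros:
        # Assumption of this sketch: an all-zero key collapses to a single "0".
        key = key.lstrip("0") or "0"
    return key


# "00123" and "123" now match as join keys.
print(normalize_key("00123", ignore_leading_zeros=True))  # 123

# "e" followed by a combining acute accent matches the precomposed "é" after NFC.
print(normalize_key("cafe\u0301", unicode_form="NFC") == "caf\u00e9")  # True
```

Normalizing keys at join time avoids having to rewrite either input file just to make identifiers such as zero-padded codes line up.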
Changed
- joinp: refactored --ignore-leading-zeros handling #2433
- Migrate from unmaintained dynfmt to dynfmt2 #2421
- deps: bump csvlens to latest upstream 52c766d
- deps: bump to latest csv qsv-optimized fork 58ac650
- deps: bumped MiniJinja to 2.6.0 8176368
- deps: bump to latest Polars upstream
- deps: bump qsv-stats to 0.26.0
- build(deps): bump azure/trusted-signing-action from 0.5.0 to 0.5.1 by @dependabot in #2420
- build(deps): bump base62 from 2.0.3 to 2.1.0 by @dependabot in #2419
- build(deps): bump base62 from 2.1.0 to 2.2.0 by @dependabot in #2426
- build(deps): bump phf from 0.11.2 to 0.11.3 by @dependabot in #2417
- build(deps): bump pyo3 from 0.23.3 to 0.23.4 by @dependabot in #2431
- build(deps): bump serde_json from 1.0.134 to 1.0.135 by @dependabot in #2416
- build(deps): bump tokio from 1.42.0 to 1.43.0 by @dependabot in #2423
- build(deps): bump uuid from 1.11.0 to 1.11.1 by @dependabot in #2427
- apply several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped Rust nightly from 2024-12-19 to 2025-01-05 (same version used by Polars)
- bump MSRV to latest Rust stable - v1.84.0
Fixed
- join: revert optimization that actually resulted in a performance regression e42af2b
- join: --right-anti and --right-semi joins didn't swap headers properly #2435
- count: polars-powered count didn't use the right data type SQL count(*) d8c1524
Full Changelog: 2.0.0...2.1.0
2.0.0
qsv v2.0.0 is here!
It took 193 releases to get to v1.0.0, and we're already at v2.0.0 a month later!?!
Yes! We wanted a running start for 2025, and qsv 2.0.0 marks qsv's biggest release yet!
- It fully enables the "Data Resource Upload First (DRUF)" workflow, allowing Datapusher+ to infer "automagical metadata" from the data itself. It exposes two Domain Specific Language (DSL) options - Luau and MiniJinja - to enable powerful data transformation and validation capabilities. This allows data stewards to upload data first, then use qsv's DSL capabilities inside DP+ to automatically generate rich metadata - including data dictionaries, field descriptions, data quality rules, and data validation schemas. This "automagical metadata" approach dramatically reduces the friction in compiling high-quality, high-resolution metadata (using the DCAT-US 3.0 specification as a reference) that would otherwise be a manual, laborious, and error-prone process.
  Under the hood, the fetchpost, template, stats, validate and luau commands now have the necessary scaffolding to fully support this workflow inside Datapusher+ and ckanext-scheming.
- It adds a new "smart" pivotp command, powered by Polars, to enable fast pivot operations on large datasets. It's "smart" as it uses the stats cache to automatically suggest an aggregation based on a column's data type and summary statistics. You can now pivot your data in seconds by simply specifying the columns to pivot on while blowing past Excel's PivotTable limitations.
- stats now computes geometric mean and harmonic mean and adds string length stats, all while getting a performance boost.
- join and joinp got a lot of love in this release, with several new options:
  - joinp: non-equi join support! 💯🥳
    See "Lightning Fast and Space Efficient Inequality Joins" paper and this Polars non-equi join tracking issue.
  - join & joinp: --right-anti and --right-semi joins
  - joinp: --ignore-leading-zeros option for join keys
  - joinp: --maintain-order option to maintain the order of either the left or right dataset in the output
  - joinp: expanded --cache-schema options to make joinp smarter/faster by leveraging the stats cache
  - join: --keys-output option to write successfully joined keys to a separate output file.
This release lays the groundwork for the outliers "smart" command to quickly identify outliers using stats/frequency info.
It also sets the stage for an initial implementation of our "Data Concierge" that leverages all the high-quality, high-res metadata we automagically compile with DRUF to enable Metadata Gardening Agents to proactively link seemingly unrelated data and glean insights as it constantly grooms the Data Catalog - effectively making it a FAIR Data Factory.
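For readers unfamiliar with the two new means, the standard definitions can be sketched in a few lines of Python. This is a plain restatement of the textbook formulas for positive values, not qsv's implementation; the function names are chosen for this example.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logs for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """n divided by the sum of reciprocals of n positive values."""
    return len(values) / sum(1.0 / v for v in values)


print(round(geometric_mean([1, 4, 16]), 6))  # 4.0
print(round(harmonic_mean([1, 2, 4]), 6))    # 1.714286
```

The geometric mean suits multiplicative quantities such as growth rates, while the harmonic mean suits rates such as speeds or throughput; both are never larger than the arithmetic mean for positive inputs.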
Added
- fetchpost: add --globals-json option #2357
- fixlengths: add --remove-empty option; refactored for performance. Fulfills #2391. #2411
- join: add --keys-output option. Fulfills #2407. #2408
- join: add --right-anti and --right-semi options. Fulfills #2379. #2380
- joinp: add non-equi join support! 💯🥳 #2409
- joinp: add --ignore-leading-zeros option. Fulfills #2398. #2400
- joinp: add --maintain-order option #2338
- joinp: add --right-anti and --right-semi options. Fulfills #2377. #2378
- luau: additional helper functions. Fulfills #1782. #2362
- luau: add qsv_writejson helper #2375
- pivotp: new polars-powered command. Fulfills #799. #2364
- pivotp: "smart" pivotp. #2367
- stats: add geometric mean and harmonic mean. Fulfills #2227. #2342
- stats: add string length stats to set stage for upcoming outliers "smart" command to quickly identify outliers using stats/frequency info #2390
- template: add --globals-json option #2356
- tojsonl: add --quiet option. Fulfills #2335. #2336
- validate: add --validate-schema option to check if the JSON Schema itself is valid #2393
- contrib(completions): add joinp --ignore-case and slice --invert by @rzmk in #2322
- contrib(completions): add --quiet to tojsonl by @rzmk in #2337
- ci: add qsv_glibc_2.31-headless to action by @rzmk in #2330
- Add license to MSI installer by @rzmk in #2321
Changed
- lens: optimized csvlens library usage, dropping clap dependency #2403
- pivotp: an even smarter pivotp #2368
- stats: performance boost 51349ba
- Update deb package by @tino097 in #2226
- ci: attempt using files-folder instead of files by @rzmk in #2320
- Setting QSV_FREEMEMORY_HEADROOM_PCT to 0 disables memory availability check #2353
- build(deps): bump actix-governor from 0.7.0 to 0.8.0 by @dependabot in #2351
- build(deps): bump bytemuck from 1.20.0 to 1.21.0 by @dependabot in #2361
- build(deps): bump chrono from 0.4.38 to 0.4.39 by @dependabot in #2345
- build(deps): bump crossbeam-channel from 0.5.13 to 0.5.14 by @dependabot in #2354
- build(deps): bump flexi_logger from 0.29.6 to 0.29.7 by @dependabot in #2348
- build(deps): bump governor from 0.7.0 to 0.8.0 by @dependabot in #2347
- build(deps): bump itertools from 0.13.0 to 0.14.0 by @dependabot in #2413
- build(deps): bump jsonschema from 0.26.1 to 0.26.2 by @dependabot in #2355
- build(deps): bump jsonschema from 0.26.2 to 0.27.0 by @dependabot in #2371
- build(deps): bump jsonschema from 0.27.1 to 0.28.0 by @dependabot in #2389
- build(deps): bump jsonschema from 0.28.0 to 0.28.1 by @dependabot in #2396
- bump polars from 0.44.2 to 0.45 #2340
- build(deps): bump polars from 0.45.0 to 0.45.1 by @dependabot in #2344
- bump pyo3 from 0.22 to 0.23 now that Polars supports it #2352
- build(deps): bump redis from 0.27.5 to 0.27.6 by @dependabot in #2331
- build(deps): bump reqwest from 0.12.9 to 0.12.11 by @dependabot in #2385
- build(deps): bump reqwest from 0.12.11 to 0.12.12 by @dependabot in #2395
- build(deps): bump rfd from 0.15.1 to 0.15.2 by @dependabot in #2404
- build(deps): bump serde from 1.0.215 to 1.0.216 by @dependabot in #2349
- build(deps): bump serde from 1.0.216 to 1.0.217 by @dependabot in #2384
- build(deps): bump serde_json from 1.0.133 to 1.0.134 by @dependabot in #2365
- build(deps): bump sysinfo from 0.32.1 to 0.33.0 by @dependabot in #2334
- build(deps): bump sysinfo from 0.33.0 to 0.33.1 by @dependabot in #2383
- deps: bump tabwriter to 1.4.1 bbcbeba
- build(deps): bump tokio from 1.41.1 to 1.42.0 by @dependabot in #2333
- build(deps): bump xxhash-rust from 0.8.12 to 0.8.13 by @dependabot in #2359
- build(deps): bump xxhash-rust from 0.8.13 to 0.8.14 by @dependabot in #2372
- build(deps): bump xxhash-rust from 0.8.14 to 0.8.15 by @dependabot in #2392
- apply several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped Rust nightly from 2024-11-28 to 2024-12-19 (same version used by Polars)
Fixed
1.0.0
qsv v1.0.0 is here!
After over 3 years of development, nearly 200 releases, and 11,000+ commits, qsv has finally reached v1.0.0!
What started as a hobby project to learn Rust during COVID has evolved into a powerful data wrangling tool used in multiple datHere products, open source projects, and even in several mission-critical production environments!
To mark this major milestone, this larger than usual release includes major performance improvements, new features, and various optimizations!
Added
- joinp: add --ignore-case option #2287
- py: add ability to load python expression from file #2295
- replace: add --not-one flag (resolves #2305) by @rzmk in #2307
- slice: add --invert option #2298
- stats: add dataset-level stats #2297
- sqlp: auto-decompression of gzip, zstd & zlib compressed csv files with read_csv table function (implements suggestion from @wardi in #2301) #2315
- template: add lookup support #2313
- added ui feature to make it easier to make a headless build of qsv #2289
- added better panic handling #2304
- added new benchmark for template command cd7e480
- added lookup support legend b46de73
Changed
- move qsv from personal GitHub repo to datHere GitHub org #2317
- template: parallelized template rendering for significant speedups #2273
- simplify input format check #2309
- bump embedded luau from 0.650 to 0.653 986a1d3
- deps: Switch back to simple-home-dir from simple-expand-tilde #2319
- deps: Add minijinja contrib #2276
- deps: bump pyo3 down to 0.21.2 because polars-mem-engine is not compatible with pyo3 0.23.x yet 7f9fc8a
- build(deps): bump base62 from 2.0.2 to 2.0.3 by @dependabot in #2281
- build(deps): bump bytemuck from 1.19.0 to 1.20.0 by @dependabot in #2299
- build(deps): bump bytes from 1.8.0 to 1.9.0 by @dependabot in #2314
- build(deps): bump file-format from 0.25.0 to 0.26.0 by @dependabot in #2277
- build(deps): bump hashbrown from 0.15.1 to 0.15.2 by @dependabot in #2310
- build(deps): bump itoa from 1.0.11 to 1.0.12 by @dependabot in #2300
- build(deps): bump itoa from 1.0.12 to 1.0.13 by @dependabot in #2302
- build(deps): bump itoa from 1.0.13 to 1.0.14 by @dependabot in #2311
- build(deps): bump mlua from 0.10.0 to 0.10.1 by @dependabot in #2280
- build(deps): bump mlua from 0.10.1 to 0.10.2 by @dependabot in #2316
- build(deps): bump serial_test from 3.1.1 to 3.2.0 by @dependabot in #2279
- build(deps): bump minijinja from 2.4.0 to 2.5.0 by @dependabot in #2284
- build(deps): bump minijinja-contrib from 2.3.1 to 2.5.0 by @dependabot in #2283
- build(deps): bump rfd from 0.15.0 to 0.15.1 by @dependabot in #2291
- build(deps): bump sanitize-filename from 0.5.0 to 0.6.0 by @dependabot in #2275
- build(deps): bump serde from 1.0.214 to 1.0.215 by @dependabot in #2286
- build(deps): bump serde_json from 1.0.132 to 1.0.133 by @dependabot in #2292
- build(deps): bump tempfile from 3.13.0 to 3.14.0 by @dependabot in #2278
- build(deps): bump tokio from 1.41.0 to 1.41.1 by @dependabot in #2274
- build(deps): bump url from 2.5.3 to 2.5.4 by @dependabot in #2306
- applied several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped MSRV to latest Rust stable (1.83.0)
- bumped Rust nightly from 2024-11-01 to 2024-11-28, the same version used by Polars
Fixed
- fix get_stats_records() helper to handle input files with embedded spaces (fixes #2294) #2296
- added better panic handling (fixes #2301) #2304
- implement simple format check for input files (fixes #2301) #2308
Removed
- removed simple-expand-tilde dependency in favor of simple-home-dir #2318
- removed patched fork of indicatif now that 0.17.9 is released, fixing GH unmaintained advisory for instant 33fa54a
- removed clipboard command from qsvlite binary variant 9c663d8
Full Changelog: 0.138.0...1.0.0
0.138.0
Highlights:
- New template command for rendering templates with CSV data.
  Generate complex documents from CSVs (Form letters, HTML, JSON, XML files, etc.) with the powerful MiniJinja template engine.
- New lookup module for fetching reference data from remote and local files.
  In addition to the typical http/https schemes for remote files, qsv adds two additional schemes - CKAN:// and datHere://, fetching lookup data from a CKAN site or datHere maintained reference data respectively. The lookup module has simple file-based caching as well to minimize repeated fetching of typically static reference data (default cache age: 600 seconds).
  The lookup module is now being used by the luau (for its qsv_register_lookup helper) and validate (for its dynamicEnum custom JSON Schema keyword) commands. More commands will take advantage of this module over time (e.g. apply, geocode, template, sqlp, etc.) to do extended lookups (e.g. lookup Census information given spatiotemporal data - like demographic info of a Census tract).
- ✨ Enhanced fetchpost with MiniJinja templating for payload construction.
  Previously, fetchpost was limited to posting url-encoded HTML Form data with content type application/x-www-form-urlencoded. Now with the new --payload-tpl and --content-type options, users can post request bodies rendered with MiniJinja and specify other content types (typically application/json, text/plain, multipart/form-data) as well.
- ✨ Improved Polars integration with automatic schema detection
  The joinp and sqlp commands now use qsv's stats cache to automatically determine column data types, rather than having Polars scan a sample of rows. This provides two key benefits:
  - Faster execution by skipping Polars' schema inference step
  - GUARANTEED data type inferencing since the stats cache analyzes the entire dataset, not just a sample
- fast-float2 crate for faster float parsing
  Casting string/bytes to float is now much faster (2 to 8x faster than Rust's standard library) with fast-float2.
- Major dependency updates including Polars 0.44.2, Luau 0.650, mlua 0.10.0 and jsonschema 0.26.1
  These core crates underpin qsv's advanced commands. Using the latest version of these crates allows qsv to stay true to its goal of being the fastest and most comprehensive data-wrangling toolkit.
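The file-based caching described for the lookup module comes down to an age check on the cached file. The Python sketch below shows that general idea only: `is_cache_fresh` is a hypothetical helper, and using the file's modification time against a 600-second default is an assumption modeled on the "default cache age: 600 seconds" note, not qsv's actual logic.

```python
import os
import time


def is_cache_fresh(path, max_age_secs=600, now=None):
    """Return True if `path` exists and was modified within the last `max_age_secs` seconds.

    `now` can be supplied (as a Unix timestamp) to make the check deterministic in tests.
    """
    if not os.path.exists(path):
        return False
    current = time.time() if now is None else now
    return (current - os.path.getmtime(path)) <= max_age_secs


# A caller would re-download the reference data only when this returns False.
print(is_cache_fresh("no-such-lookup-file.csv"))
```

Because reference data is typically static, even a short cache age removes most repeated network fetches when the same lookup is used across many rows or runs.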
Added
- added lookup module - enabling fetching and caching of reference data from remote and local files #2262
- fetchpost: add --payload-tpl <file> and --content-type options to construct payload using MiniJinja with the appropriate content-type #2268 5921498
- joinp: derive polars schema from stats cache 86fe22e
- sqlp: derive polars schema from stats cache #2256
- template: new command to render MiniJinja templates with CSV data #2267
- validate: add dynamicEnum lookup support #2265
- contrib(completions): add template command and update fetchpost by @rzmk in #2269
- add fast-float2 dependency for faster bytes to float conversion 7590e4e 3ca30aa
- added more benchmarks for new/updated commands f8a1d4f cd7e480
Changed
- luau: adapt to mlua 0.10 API changes 268cb45
- luau: refactored stage management 31ef58a
- luau: now uses the lookup module 2f4be34
- stats: minor perf refactoring 6cdd6ea
- build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #2243
- build(deps): bump azure/trusted-signing-action from 0.4.0 to 0.5.0 by @dependabot in #2239
- build(deps): bump bytes from 1.7.2 to 1.8.0 by @dependabot in #2231
- build(deps): bump cached from 0.53.1 to 0.54.0 by @dependabot in #2272
- build(deps): bump flexi_logger from 0.29.3 to 0.29.4 by @dependabot in #2229
- build(deps): bump flexi_logger from 0.29.4 to 0.29.5 by @dependabot in #2261
- build(deps): bump flexi_logger from 0.29.5 to 0.29.6 by @dependabot in #2266
- build(deps): bump hashbrown from 0.15.0 to 0.15.1 by @dependabot in #2270
- build(deps): bump jsonschema from 0.24.0 to 0.24.1 by @dependabot in #2234
- build(deps): bump jsonschema from 0.24.1 to 0.24.2 by @dependabot in #2238
- build(deps): bump jsonschema from 0.24.2 to 0.24.3 by @dependabot in #2240
- build(deps): bump jsonschema from 0.25.0 to 0.25.1 by @dependabot in #2244
- build(deps): bump jsonschema from 0.26.0 to 0.26.1 by @dependabot in #2260
- build(deps): bump regex from 1.11.0 to 1.11.1 by @dependabot in #2242
- build(deps): bump reqwest from 0.12.8 to 0.12.9 by @dependabot in #2258
- build(deps): bump serde from 1.0.210 to 1.0.211 by @dependabot in #2232
- build(deps): bump serde from 1.0.211 to 1.0.213 by @dependabot in #2236
- build(deps): bump serde from 1.0.213 to 1.0.214 by @dependabot in #2259
- build(deps): bump simd-json from 0.14.1 to 0.14.2 by @dependabot in #2235
- build(deps): bump tokio from 1.40.0 to 1.41.0 by @dependabot in #2237
- deps: updated our fork of the csv crate with more perf optimizations eae7d76
- deps: use calamine upstream with unreleased fixes 4cc7f37
- deps: use our csvlens fork until PR removing unneeded arboard features is merged bb32322
- deps: bump jsonschema from 0.25 to 0.26 #2251
- deps: bump embedded Luau from 0.640 to 0.650 8c54b87 aca30b0
- deps: bump mlua from 0.9 to 0.10 #2249
- deps: bump Polars from 0.43.1 at py-1.11.0 tag to latest 0.44.2 upstream #2255 0e40a44
- apply select clippy lint suggestions
- updated indirect dependencies
- aligned Rust nightly to Polars nightly - 2024-10-28 - 245bcb5
Fixed
Removed
- removed need to set RAYON_NUM_THREADS env var and just call the Rayon API directly aa6ef89
- removed unneeded create_dir_all_threadsafe helper now that std::create_dir_all is threadsafe d0af83b
Full Changelog: 0.137.0...0.138.0
0.137.0
Highlights:
- extdedup & extsort now support two modes - LINE mode and CSV mode. Previously, both commands only operated on a line-by-line basis (LINE mode).
  With the addition of CSV mode, you can now deduplicate or sort CSV files on a column-by-column basis, with the powerful --select option to specify which columns to deduplicate or sort on.
  This is especially useful for large CSV files with many columns, where you only want to deduplicate or sort on a subset of columns. And since both commands use disk-backed algorithms (an on-disk hash table for extdedup, and an external merge sort for extsort), they can handle files larger than memory.
- sqlp now has a --cache-schema option that caches the inferred schema of the input CSV file, which can significantly speed up subsequent queries on the same file, as the initial schema inferencing step is skipped.
- fetch and fetchpost have been updated to use the jaq crate instead of the jql crate. This change was made to improve performance and to make the commands consistent with the json command, which also uses jaq. Furthermore, jaq is a clone of jq - a widely used JSON parsing tool, so it should be more familiar to users.
- stats is a tad faster as we keep squeezing more performance from this central command.
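The difference between LINE mode and CSV mode can be illustrated with a small Python sketch. This is not qsv's implementation: the function names and sample data are hypothetical, and an in-memory set stands in for the on-disk hash table mentioned above.

```python
import csv
import io


def dedup_line_mode(text):
    """Drop lines that exactly repeat an earlier line (whole-line comparison)."""
    seen = set()
    out = []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out


def dedup_csv_mode(text, key_columns):
    """Drop rows whose values in key_columns repeat an earlier row."""
    seen = set()  # in-memory stand-in for a disk-backed hash table
    out = []
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


# Hypothetical sample: the last line repeats the first data line exactly.
SAMPLE = "id,city,note\n1,Boston,first\n2,Boston,second\n1,Boston,first\n"

line_result = dedup_line_mode(SAMPLE)          # header + 2 distinct lines
csv_result = dedup_csv_mode(SAMPLE, ["city"])  # only the first Boston row
```

Keying on a subset of columns is what lets the second function treat rows 1 and 2 as duplicates even though the full lines differ.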
Added
- extdedup: now supports two modes - LINE mode and CSV mode #2208
- extsort: now also has two modes - CSV mode and LINE mode #2210
- sqlp: add --cache-schema option #2224
- added sqlp --cache-schema benchmarks
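For readers unfamiliar with external merge sorting, the general technique behind a column-keyed sort of larger-than-memory data can be sketched in Python as follows. The function name, the `chunk_size` parameter and the sample rows are assumptions made for illustration; this is not how extsort is written.

```python
import csv
import heapq
import os
import tempfile


def external_sort(rows, key_index, chunk_size):
    """Sort rows by one column using sorted runs spilled to temporary files."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it out as one sorted run.
        chunk.sort(key=lambda r: r[key_index])
        fd, path = tempfile.mkstemp(suffix=".csv")
        with os.fdopen(fd, "w", newline="") as f:
            csv.writer(f).writerows(chunk)
        run_paths.append(path)
        chunk.clear()

    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    files = [open(p, newline="") for p in run_paths]
    try:
        readers = [csv.reader(f) for f in files]
        # heapq.merge streams the k-way merge; it is collected into a list
        # here only for the demo, a real tool would stream to its output.
        merged = list(heapq.merge(*readers, key=lambda r: r[key_index]))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
    return merged


# Values are compared as strings, as they would be when read from a CSV.
sorted_rows = external_sort(
    [["3", "c"], ["1", "a"], ["2", "b"], ["0", "z"]], key_index=0, chunk_size=2
)
```

Only `chunk_size` rows are held in memory during the run-building phase, which is what makes the approach suitable for inputs larger than RAM.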
Changed
- apply & applydp: use smallvec for operations vector & other minor performance optimizations #2219 & bc837ae
- apply & applydp: specify min_length for parallel iterators 7d6ce5e
- fetch & fetchpost: replace jql with jaq #2222
- stats: performance optimizations f205809 e26c27f 4579c1b
- validate: specify min_length for parallel iterators a5b8185
- deps: updated polars to 0.43.1 at the py-1.10.0 tag.
- build(deps): bump calamine from 0.26.0 to 0.26.1 by @dependabot in #2204
- build(deps): bump csvs_convert from 0.8.14 to 0.9.0 by @dependabot in #2215
- build(deps): bump flexi_logger from 0.29.2 to 0.29.3 by @dependabot in #2209
- build(deps): bump jsonschema from 0.23.0 to 0.24.0 by @dependabot in #2223
- build(deps): bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #2207
- build(deps): bump pyo3 from 0.22.4 to 0.22.5 by @dependabot in #2212
- build(deps): bump redis from 0.27.3 to 0.27.4 by @dependabot in #2202
- build(deps): bump redis from 0.27.4 to 0.27.5 by @dependabot in #2217
- build(deps): bump serde_json from 1.0.129 to 1.0.130 by @dependabot in #2218
- build(deps): bump serde_json from 1.0.131 to 1.0.132 by @dependabot in #2220
- build(deps): bump uuid from 1.10.0 to 1.11.0 by @dependabot in #2213
- apply select clippy lints
- bumped indirect dependencies
- bumped MSRV to 1.82
Fixed:
- fix performance regression in batched commands by refactoring optimal_batch_size to require indexed CSV files #2206
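One plausible reading of this fix is that knowing the row count up front is cheap only when an index exists, so batch sizing should fall back to the requested size otherwise. The Python sketch below illustrates that idea under stated assumptions: the signature, the `row_count=None` convention for unindexed files and the heuristic itself are hypothetical, not the actual optimal_batch_size logic.

```python
import math


def optimal_batch_size(requested, row_count, num_jobs):
    """Hypothetical heuristic for choosing a batch size.

    row_count is None when the file has no index, i.e. the row count
    is not known without scanning the whole file.
    """
    if row_count is None:
        # No index: keep the requested size instead of counting rows.
        return requested
    # Indexed: spread rows evenly across workers, capped at the request.
    per_job = math.ceil(row_count / num_jobs)
    return max(1, min(requested, per_job))


unindexed = optimal_batch_size(50_000, None, 4)
small_file = optimal_batch_size(50_000, 1_000, 4)
large_file = optimal_batch_size(50_000, 10_000_000, 8)
```

With this heuristic a small indexed file is split so that every worker gets a share, while a large file keeps the requested batch size to bound memory use.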
Removed:
- fetch & fetchpost: removed jql options; replaced with jaq #2222
Full Changelog: 0.136.0...0.137.0
0.136.0
qsv pro is now available in the Microsoft Store!
It's Data Wrangling Democratized on the Desktop, featuring:
- Familiar Spreadsheet Interface: tap the power of qsv to query, analyze, enrich, scrub and transform huge Excel files and multi-gigabyte CSV files in seconds, without having to deal with the command-line.
- CKAN desktop client: designed to make data publishing easier for portal operators and data stewards using the CKAN platform.
- Flow: allows you to build custom node-based flows and data pipelines using a visual interface.
- Toolbox: features an ever-expanding library of reusable scripts for common data-wrangling use cases.
- and more! Natural Language Interface (RAG), Polars SQL query support, an API, Python/Luau support, automatic Data Dictionaries, DCAT 3 metadata profile inferencing, along with a retinue of other cloud-based services (e.g. customizable street-level geocoding, data feeds, reference data lookups, geo-ip lookups, cloud storage support, .qsv file format, etc.) that will be unveiled in future versions.
Like qsv, we're iterating rapidly with qsv pro, so your feedback is essential. Give it a try!
Other highlights:
- excel: new --table option for XLSX files; new --header-row option; expanded --range option, adding support for Named Ranges and absolute ranges (e.g. Sheet2!$A$1:$J$10); and expanded metadata export now including Named Ranges and Tables (for XLSX files)
- Improved performance for several commands (apply, datefmt, tojsonl and validate) through automatic batch size optimization
- validate: dynamicEnum custom JSON Schema keyword in validate command (renamed from dynenum) and enhanced email validation
- schema: automatic JSON Schema const inferencing for columns with just one value
- Significant dependency updates, including latest upstream versions of Polars, jsonschema, and serde_json with unreleased performance upgrades, new features and fixes
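To make the absolute range syntax concrete, here is a small Python sketch that splits a range such as Sheet2!$A$1:$J$10 into a sheet name and zero-based cell coordinates. The regular expression and function names are illustrative assumptions and do not reflect how the excel command parses ranges.

```python
import re

# Optional "Sheet!" prefix, then two cell references with optional "$" anchors.
RANGE_RE = re.compile(
    r"^(?:(?P<sheet>[^!]+)!)?"
    r"\$?(?P<c1>[A-Za-z]+)\$?(?P<r1>\d+)"
    r":\$?(?P<c2>[A-Za-z]+)\$?(?P<r2>\d+)$"
)


def col_to_index(letters):
    """Convert spreadsheet column letters to a zero-based index (A -> 0, AA -> 26)."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1


def parse_range(text):
    """Return (sheet, (row, col) of top-left, (row, col) of bottom-right), zero-based."""
    m = RANGE_RE.match(text)
    if m is None:
        raise ValueError(f"not a cell range: {text!r}")
    start = (int(m["r1"]) - 1, col_to_index(m["c1"]))
    end = (int(m["r2"]) - 1, col_to_index(m["c2"]))
    return m["sheet"], start, end


absolute = parse_range("Sheet2!$A$1:$J$10")
relative = parse_range("a1:b2")
```

The sheet component is `None` when the range carries no sheet prefix, leaving the caller to decide which sheet applies.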
NOTE: You can see qsv & qsv pro in action in our "The Problem with Data Portals" webinar Wed, Oct 23, 2024. 1-2pm EDT
Added
- qsv pro is now in the Microsoft Store!!!
- apply, datefmt, tojsonl, validate: added logic to automatically determine optimal batch size for better parallelization #2178
- enum: added --new-column support for all enum modes, not just --increment #2173
- excel: new --table option for XLSX files #2194
- excel: new --header-row option 458f79a
- excel: expanded range and metadata options #2195
- schema: added JSON Schema automatic const inferencing #2180
- Add signing step to qsv MSI installer GitHub Action by @rzmk in #2182
- contrib(completions): add --table option to qsv excel by @rzmk in #2197
- completions: add --header-row option to qsv excel e8794d5
- added new apply operations sentiment benchmark b745e64
- docs: added indexing section to PERFORMANCE.md 804145a
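The idea behind const inferencing can be sketched in a few lines of Python: a column that holds a single distinct value is described with the JSON Schema const keyword. The function name and sample data are hypothetical, and every column is typed as a string for simplicity; the schema command's actual type inference is richer than this.

```python
import csv
import io


def infer_schema(csv_text):
    """Build a minimal JSON Schema, adding "const" for single-valued columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    properties = {}
    for column in (rows[0].keys() if rows else []):
        unique = {row[column] for row in rows}
        prop = {"type": "string"}  # simplification: no real type inference here
        if len(unique) == 1:
            prop["const"] = next(iter(unique))
        properties[column] = prop
    return {"type": "object", "properties": properties}


schema = infer_schema("country,city\nUS,Boston\nUS,Austin\n")
```

In this sample, `country` gets a const of "US" while `city`, which has two distinct values, does not.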
Changed
- stats: various minor micro-optimizations 62d95fc 2c2862a
- validate: renamed custom keyword dynenum to dynamicEnum to be more consistent with JSON schema naming conventions 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
- validate: optimizations for increased performance; replace serde_json with simd_json 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
- apply new clippy::ref_option lint to Config::new API #2192
- Update debian package readme by @tino097 in #2187
- deps: bump calamine from 0.25 to 0.26 b42279a
- deps: jsonschema use latest 0.22.3 upstream with unreleased features/fixes
- deps: polars use latest 0.43.1 upstream with unreleased features/fixes
- deps: created our own fork of unmaintained vader_sentiment crate b426761
- deps: use serde_json upstream with unreleased perf improvement/fixes https://github.com/jqnatividad/qsv/blob/1c1174b3b8b65d9dfd9c841597366fb09d0a047c/Cargo.toml#L221
- build(deps): bump flate2 from 1.0.33 to 1.0.34 by @dependabot in #2171
- build(deps): bump flexi_logger from 0.29.0 to 0.29.1 by @dependabot in #2189
- build(deps): bump flexi_logger from 0.29.1 to 0.29.2 by @dependabot in #2196
- build(deps): bump hashbrown from 0.14.5 to 0.15.0 by @dependabot in #2186
- build(deps): bump jsonschema from 0.20.0 to 0.21.0 by @dependabot in #2177
- build(deps): bump jsonschema from 0.22.1 to 0.22.2 by @dependabot in #2191
- build(deps): bump regex from 1.10.6 to 1.11.0 by @dependabot in #2176
- build(deps): bump reqwest from 0.12.7 to 0.12.8 by @dependabot in #2183
- build(deps): bump simd-json from 0.14.0 to 0.14.1 #2199
- build(deps): bump simple-expand-tilde from 0.4.2 to 0.4.3 by @dependabot in #2190
- build(deps): bump sysinfo from 0.31.4 to 0.32.0 by @dependabot in #2193
- build(deps): bump tempfile from 3.12.0 to 3.13.0 by @dependabot in #2175
- apply select clippy lints
- bumped indirect dependencies
- aligned Rust nightly to Polars nightly - 2024-09-29 7cd2de1
Fixed
- schema: fix enum so it only adds a list when the number of unique values > --enum-threshold #2180
- Upload artifact fix for Debian package publishing by @tino097 in #2168
- fixed typos configuration 627de89
- fixed various GitHub Actions publishing workflow issues
Full Changelog: 0.135.0...0.136.0