create suite-wide metadata cache, caching dataset-level statistics/metadata #2097

Open
jqnatividad opened this issue Aug 30, 2024 · 0 comments

Currently, the stats command compiles field/column statistics and persists them to a cache file.

This cache file is used by the stats command to return stats instantaneously if the CSV has not changed.
Other "smart" commands also use the stats cache to work faster & smarter.
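
To make the "return cached stats if the CSV has not changed" behavior concrete, here is a minimal sketch in Python. It is not qsv's implementation: the fingerprint (file size plus modification time), the JSON cache layout, and the names `fingerprint`, `load_cached_stats`, and `save_stats` are all assumptions made for illustration.

```python
import json
import os
from pathlib import Path


def fingerprint(csv_path):
    """Cheap change detector (assumed): file size and modification time."""
    st = os.stat(csv_path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}


def save_stats(csv_path, cache_path, stats):
    """Persist stats together with the fingerprint of the CSV they describe."""
    payload = {"fingerprint": fingerprint(csv_path), "stats": stats}
    Path(cache_path).write_text(json.dumps(payload))


def load_cached_stats(csv_path, cache_path):
    """Return cached stats if the CSV is unchanged, otherwise None."""
    cache_file = Path(cache_path)
    if not cache_file.exists():
        return None
    cached = json.loads(cache_file.read_text())
    if cached.get("fingerprint") != fingerprint(csv_path):
        return None  # the CSV changed after the cache was written
    return cached["stats"]
```

A caller would try `load_cached_stats` first and only recompute (then call `save_stats`) when it returns `None`. A content hash would be a stricter fingerprint, at the cost of reading the whole file.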

qsv should have a suite-wide metadata cache that compiles dataset-level statistics and metadata like:

  • record-level stats/metadata
    • record width (max, min, avg, median, variance, stddev, mad); the count --width option would then be removed
  • package-level stats/metadata
    • number of duplicate records, as compiled by the existing sortcheck command and added to a CSV's stats cache when sortcheck is executed. If the CSV has not changed and sortcheck is executed again, it fetches the existing duplicate record count from the cache.
    • data dictionary, as initially inferred by describegpt. A flag indicates whether the data dictionary has been manually curated, which prevents auto-updates by future runs of describegpt. If the dataset changes, this flag is reset.
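
Two of the items above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed implementation: record width is taken to be the UTF-8 byte length of each raw record, variance and stddev use the population formulas, "mad" is read as median absolute deviation, and the names `record_width_stats`, `DataDictionary`, and `refresh_dictionary` are hypothetical.

```python
import statistics
from dataclasses import dataclass, field


def record_width_stats(records):
    """Summarize record widths for an iterable of raw record strings."""
    # Assumption: width = UTF-8 byte length of the record, without line terminator.
    widths = [len(r.encode("utf-8")) for r in records]
    median = statistics.median(widths)
    return {
        "max": max(widths),
        "min": min(widths),
        "avg": statistics.fmean(widths),
        "median": median,
        "variance": statistics.pvariance(widths),  # population variance (assumed)
        "stddev": statistics.pstdev(widths),
        # Assumption: mad = median absolute deviation from the median.
        "mad": statistics.median([abs(w - median) for w in widths]),
    }


@dataclass
class DataDictionary:
    fields: dict = field(default_factory=dict)
    curated: bool = False  # True once a human has edited the dictionary


def refresh_dictionary(current, inferred_fields, dataset_changed):
    """Apply a newly inferred dictionary unless the current one is curated."""
    if dataset_changed:
        # A changed dataset resets the curated flag, so inference applies again.
        current.curated = False
    if not current.curated:
        current.fields = dict(inferred_fields)
    return current
```

With this shape, a curated dictionary survives repeated inference runs on an unchanged dataset, and is only overwritten after the dataset itself changes.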