* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* Remove license insert for validation notebook
* Add validation utils
* Minor cleanups (#858)
* nits
* logger
* add log
* lint
* update utils/__init__.py to include extra validation functions
* update notebook
* update
* update
* Read UC delta table (#773)
* initial commit
* use databricks-sql to read delta table and convert to json
* update
* update
* update
* add mocked unittest
* Fix lints
* update
* update
* restructure code
* Add timer for optimizing
* Add db-connect
* add wrapper
* update
* add install dbconnect
* update
* update
* patch dbconnect to allow multiple return formats
* update
* add arrow
* use compression
* clean up
* Add cluster rt check
* Fix lints
* remove patch.py for CI
* update
* update
* update
* update
* fix tests
* fix lint
* update
* update
* Add more tests
* update
* update
* update
* change to download_json
* update
* fix lints
* Add decompressed option for arrow
* format json to jsonl
* Add comments
* Make cf_collect_type global option
* fix comments
* fix lints
* fix comments
* Fix lints
* change to use WorkspaceClient
* Add CPT support
* Rewire method assignment logic
* Fix bug in stripping https
* Add tests for rewired method assignment logic
* Fix lints
* Fix lints
* Removed logger set_level
* Remove pyspark. It conflicts with databricks-connect
* Update the comment
* skip cluster version check when cluster_id is serverless
* Add use_serverless flag
* update tests with use_serverless flag
* Fix lints
---------
Co-authored-by: Xiaohan Zhang <[email protected]>
* Add download remote function to util
* update
* remove fused layernorm (#859)
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Remove hardcoded combined.jsonl with a flag (#861)
* Remove hardcoded combined.jsonl with a flag
* update
* change output_json_path to output_json_folder
---------
Co-authored-by: Xiaohan Zhang <[email protected]>
* bump (#828)
* Add dask and dataframe_to_mds
* update
* update
* update
* update
* Add notebook
* update
* update
* remove script and tests, keep notebook
* update
* update
* update
* update
* Always initialize dist (#864)
* fix dev
* lint
* remove gpu
* updated notebook
* remove scripts keep notebook
* update notebook. rephrase.
* update
* Add response tokens
* update
* update
* Disable MDSWrite, return token counts
* Change plot settings
* update notebook
* update
* update notebook
* update
* update notebook
* update pip install link
* Change done file location
* Create the dest folder
* update notebook
* update
---------
Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>