Validation (#1027) #1

Merged

Conversation

XiaohanZhangCMU
Owner

  • add validation script

  • update

  • change token count function

  • reorganize cells

  • Add unit tests

  • Add a printout for CPT

  • update question

  • Add questions

  • Fix lints

  • update format

  • update

  • nb source

  • add validation script

  • update

  • change token count function

  • reorganize cells

  • Add unit tests

  • Add a printout for CPT

  • update question

  • Add questions

  • Fix lints

  • update format

  • update

  • nb source

  • Remove license insert for validation notebook

  • Add validation utils

  • Minor cleanups (mosaicml/llm-foundry#858)

  • nits

  • logger

  • add log

  • lint

  • update utils/__init__.py to include extra validation functions

  • update notebook

  • update

  • update

  • Read UC delta table (mosaicml/llm-foundry#773)

  • initial commit

  • use databricks-sql to read delta table and convert to json

  • update

  • update

  • update

  • add mocked unittest

  • Fix lints

  • update

  • update

  • restructure code

  • Add timer for optimizing

  • Add db-connect

  • add wrapper

  • update

  • add install dbconnect

  • update

  • update

  • patch dbconnect to allow multiple return formats

  • update

  • add arrow

  • use compression

  • clean up

  • Add cluster rt check

  • Fix lints

  • remove patch.py for CI

  • update

  • update

  • update

  • update

  • fix tests

  • fix lint

  • update

  • update

  • Add more tests

  • update

  • update

  • update

  • change to download_json

  • update

  • fix lints

  • Add decompressed option for arrow

  • format json to jsonl

  • Add comments

  • Make cf_collect_type global option

  • fix comments

  • fix lints

  • fix comments

  • Fix lints

  • change to use workspaceclient

  • Add CPT support

  • Rewire method assignment logic

  • Fix bug in stripping https

  • Add tests for rewired method assignment logic

  • Fix lints

  • Fix lints

  • Removed logger set_level

  • Remove pyspark. It conflicts with databricks-connect

  • Update the comment

  • skip cluster version check when cluster_id is serverless

  • Add use_serverless flag

  • update tests with use_serverless flag

  • Fix lints



  • bump (Bump to turbo v8, mosaicml/llm-foundry#828)

  • Add dask and dataframe_to_mds

  • update

  • update

  • update

  • update

  • Add notebook

  • update

  • update

  • remove script and tests, keep notebook

  • update

  • update

  • update

  • update

  • Always initialize dist (mosaicml/llm-foundry#864)

  • fix dev

  • lint

  • remove gpu

  • updated notebook

  • remove scripts keep notebook

  • update notebook. rephrase.

  • update

  • Add response tokens

  • update

  • update

  • Disable MDSWrite, return token counts

  • Change plot settings

  • update notebook

  • update

  • update notebook

  • update

  • update notebook

  • update pip install link

  • Change done file location

  • Create the dest folder

  • update notebook

  • update


* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* update

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* update

* Add response tokens

* update

* update

* Disable MDSWrite, return token counts

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* update pip install link

* Change done file location

* Create the dest folder

* update notebook

* update

---------

Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
XiaohanZhangCMU merged commit 67f7b4c into XiaohanZhangCMU:validation on Mar 14, 2024