Validation (#1027) #1

Merged

Commits on Mar 14, 2024

  1. Validation (#1027)

    * add validation script
    
    * update
    
    * change token count function
    
    * reorganize cells
    
    * Add unit tests
    
    * Add a printout for CPT
    
    * update question
    
    * Add questions
    
    * Fix lints
    
    * update format
    
    * update
    
    * nb source
    
    * Remove license insert for validation notebook
    
    * Add validation utils
    
    * Minor cleanups (#858)
    
    * nits
    
    * logger
    
    * add log
    
    * lint
    
    * update utils/__init__.py to include extra validation functions
    
    * update notebook
    
    * update
    
    * update
    
    * Read UC delta table (#773)
    
    * initial commit
    
    * use databricks-sql to read delta table and convert to json
    
    * update
    
    * update
    
    * update
    
    * add mocked unittest
    
    * Fix lints
    
    * update
    
    * update
    
    * restructure code
    
    * Add timer for optimizing
    
    * Add db-connect
    
    * add wrapper
    
    * update
    
    * add install dbconnect
    
    * update
    
    * update
    
    * patch dbconnect to allow multiple return formats
    
    * update
    
    * add arrow
    
    * use compression
    
    * clean up
    
    * Add cluster rt check
    
    * Fix lints
    
    * remove patch.py for CI
    
    * update
    
    * update
    
    * update
    
    * update
    
    * fix tests
    
    * fix lint
    
    * update
    
    * update
    
    * Add more tests
    
    * update
    
    * update
    
    * update
    
    * change to download_json
    
    * update
    
    * fix lints
    
    * Add decompressed option for arrow
    
    * format json to jsonl
    
    * Add comments
    
    * Make cf_collect_type global option
    
    * fix comments
    
    * fix lints
    
    * fix comments
    
    * Fix lints
    
    * change to use workspaceclient
    
    * Add CPT support
    
    * Rewire method assignment logic
    
    * Fix bug in stripping https
    
    * Add tests for rewired method assignment logic
    
    * Fix lints
    
    * Fix lints
    
    * Removed logger set_level
    
    * Remove pyspark. It conflicts with databricks-connect
    
    * Update the comment
    
    * skip cluster version check when cluster_id is serverless
    
    * Add use_serverless flag
    
    * update tests with use_serverless flag
    
    * Fix lints
    
    ---------
    
    Co-authored-by: Xiaohan Zhang <[email protected]>
    
    * Add download remote function to util
    
    * update
    
    * remove fused layernorm (#859)
    
    * update
    
    * update
    
    * update
    
    * update
    
    * update
    
    * update
    
    * update
    
    * update
    
    * update
    
    * Remove hardcoded combined.jsonl with a flag (#861)
    
    * Remove hardcoded combined.jsonl with a flag
    
    * update
    
    * change output_json_path output_json_folder
    
    ---------
    
    Co-authored-by: Xiaohan Zhang <[email protected]>
    
    * bump (#828)
    
    * Add dask and dataframe_to_mds
    
    * update
    
    * update
    
    * update
    
    * update
    
    * Add notebook
    
    * update
    
    * update
    
    * remove script and tests, keep notebook
    
    * update
    
    * update
    
    * update
    
    * update
    
    * Always initialize dist  (#864)
    
    * fix dev
    
    * lint
    
    * remove gpu
    
    * updated notebook
    
    * remove scripts keep notebook
    
    * update notebook. rephrase.
    
    * update
    
    * Add response tokens
    
    * update
    
    * update
    
    * Disable MDSWrite, return token counts
    
    * Change plot settings
    
    * update notebook
    
    * update
    
    * update notebook
    
    * update
    
    * update notebook
    
    * update pip install link
    
    * Change done file location
    
    * Create the dest folder
    
    * update notebook
    
    * update
    
    ---------
    
    Co-authored-by: Xiaohan Zhang <[email protected]>
    Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
    Co-authored-by: Mihir Patel <[email protected]>
    3 people authored Mar 14, 2024
    9fd91cf