From 944b260c98e4ceb6d24b142e4001c19c950e7676 Mon Sep 17 00:00:00 2001
From: Xiaohan Zhang
Date: Wed, 13 Mar 2024 21:58:44 -0700
Subject: [PATCH] Validation (#1028)

* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* Remove license insert for validation notebook
* Add validation utils
* Minor cleanups (#858)
* nits
* logger
* add log
* lint
* update utils/__init__.py to include extra validation functions
* update notebook
* update
* update
* Read UC delta table (#773)
* initial commit
* use databricks-sql to read delta table and convert to json
* update
* update
* update
* add mocked unittest
* Fix lints
* update
* update
* restructure code
* Add timer for optimizing
* Add db-connect
* add wrapper
* update
* add install dbconnect
* update
* update
* patch dbconnect to allow multiple return formats
* update
* add arrow
* use compression
* clean up
* Add cluster rt check
* Fix lints
* remove patch.py for CI
* update
* update
* updat
* update
* fix tests
* fix lint
* update
* update
* Add more tests
* update
* update
* update
* change to download_json
* update
* fix lints
* Add decompressed option for arrow
* format json to jsonl
* Add comments
* Make cf_collect_type global option
* fix comments
* fix lints
* fix comments
* Fix lints
* change to use workspaceclient
* Add CPT support
* Rewire method assignment logic
* Fix bug in stripping https
* Add tests for rewired method assignment logic
* Fix lints
* Fix lints
* Removed logger set_level
* Remove pyspark. It conflicts with databricks-connect
* Update the comment
* skip cluster version check when cluster_id is serverless
* Add use_serverless flag
* update tests with use_serverless flag
* Fix lints

---------

Co-authored-by: Xiaohan Zhang

* Add download remote function to util
* update
* remove fused layernorm (#859)
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Remove hardcoded combined.jsonl with a flag (#861)
* Remove hardcoded combined.jsonl with a flag
* update
* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang

* bump (#828)
* Add dask and dataframe_to_mds
* update
* update
* update
* update
* Add notebook
* update
* update
* remove script and tests, keep notebook
* update
* update
* update
* update
* Always initialize dist (#864)
* fix dev
* lint
* remove gpu
* updated notebook
* remove scripts keep notebook
* update notebook. rephrase.
* update
* Add response tokens
* update
* update
* Disable MDSWrite, return token counts
* Change plot settings
* update notebook
* update
* update notebook
* update
* update notebook
* update pip install link
* Change done file location
* Create the dest folder
* update notebook
* update
* update notebook

---------

Co-authored-by: Xiaohan Zhang
Co-authored-by: xiaohanzhan-db
Co-authored-by: Mihir Patel
---
 notebooks/validate_and_tokenize_data.ipynb | 25 ++--------------------
 1 file changed, 2 insertions(+), 23 deletions(-)

diff --git a/notebooks/validate_and_tokenize_data.ipynb b/notebooks/validate_and_tokenize_data.ipynb
index 63d40a2a04..8dcab220f8 100644
--- a/notebooks/validate_and_tokenize_data.ipynb
+++ b/notebooks/validate_and_tokenize_data.ipynb
@@ -290,7 +290,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 0,
+   "execution_count": null,
    "metadata": {
     "application/vnd.databricks.v1+cell": {
      "cellMetadata": {
@@ -710,7 +710,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 0,
+   "execution_count": null,
    "metadata": {
     "application/vnd.databricks.v1+cell": {
      "cellMetadata": {
@@ -803,27 +803,6 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "application/vnd.databricks.v1+cell": {
-     "cellMetadata": {
-      "byteLimit": 2048000,
-      "rowLimit": 10000
-     },
-     "inputWidgets": {},
-     "nuid": "f5aea2a8-db29-40c9-8ed2-b6a1d032e7ab",
-     "showTitle": false,
-     "title": ""
-    }
-   },
-   "outputs": [],
-   "source": [
-    "import os\n",
-    "os.makedirs(temporary_mds_output_path, exist_ok=True)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 0,
    "metadata": {
     "application/vnd.databricks.v1+cell": {
      "cellMetadata": {