Paloma release #19

Merged 77 commits on Dec 13, 2023

Commits (77)
905de7a
Merge remote-tracking branch 'origin/small-fixes' into perplexity-sui…
IanMagnusson Oct 27, 2023
4b867a0
Merge remote-tracking branch 'origin/token-ppls' into perplexity-suit…
IanMagnusson Oct 27, 2023
9e4c9fb
Merge remote-tracking branch 'origin/other-metrics-per-subdomain' int…
IanMagnusson Oct 27, 2023
382efdc
Merge remote-tracking branch 'origin/main' into perplexity-suite-paper
IanMagnusson Oct 30, 2023
6a6af5b
pythia 7b runs
IanMagnusson Oct 30, 2023
9a143fb
domla 1b runs
IanMagnusson Oct 30, 2023
9173088
fix the hf_olmo image
IanMagnusson Oct 31, 2023
7578d2c
add aws secrets
IanMagnusson Oct 31, 2023
0ed28b4
handle local tokenizer problem in hf olmo
IanMagnusson Oct 31, 2023
e8aed05
add feature for saving to file
IanMagnusson Oct 31, 2023
080e219
handle s3 auth with env vars
IanMagnusson Oct 31, 2023
c5d4a44
still fixing s3
IanMagnusson Oct 31, 2023
81cbe9d
split up sheets into different files
IanMagnusson Oct 31, 2023
98b1203
make json lines instead of one big json
IanMagnusson Oct 31, 2023
2f6b7de
passing arg by right name
IanMagnusson Oct 31, 2023
2d1fe30
save dolma 1b to file
IanMagnusson Oct 31, 2023
b625b85
pythia 1b
IanMagnusson Oct 31, 2023
6367feb
pythia 7b
IanMagnusson Oct 31, 2023
4d99b8b
initial results exploration
IanMagnusson Nov 1, 2023
8382500
dolma 7b
IanMagnusson Nov 1, 2023
c3ceb44
Merge branch 'perplexity-suite-paper' of github.com:allenai/ai2-llm-e…
IanMagnusson Nov 1, 2023
e33d37c
rp without save file yet
IanMagnusson Nov 1, 2023
285bc8c
Try with manually uploaded fixed files
IanMagnusson Nov 1, 2023
af67310
first line chart
IanMagnusson Nov 1, 2023
2345106
clean up
IanMagnusson Nov 1, 2023
2a2a242
now with win rate
IanMagnusson Nov 2, 2023
b6009d2
New ppl and win rate viz
IanMagnusson Nov 2, 2023
b5889e8
exclude fringe datasets
IanMagnusson Nov 2, 2023
d244801
subdomain bar charts
IanMagnusson Nov 2, 2023
8ae7255
add support for olmo models in s3
IanMagnusson Nov 2, 2023
aa4e49a
just add olmo to the path instead
IanMagnusson Nov 2, 2023
a2f36dc
fix figues labels
IanMagnusson Nov 2, 2023
7f967ae
subdomain line charts
IanMagnusson Nov 3, 2023
5f192ab
added new subdomains by tasks figures
IanMagnusson Nov 3, 2023
0b7b28b
Inital results over all models
IanMagnusson Nov 4, 2023
edee19e
clean up aggregation over subdomains tables
IanMagnusson Nov 6, 2023
25ec471
Add curves by macro subdomains
IanMagnusson Nov 7, 2023
9989734
Also add median over subdomains
IanMagnusson Nov 7, 2023
193fac9
subdomains by order of performance
IanMagnusson Nov 7, 2023
d8773b8
dolma7b 1T
IanMagnusson Nov 7, 2023
9e5f420
Merge branch 'perplexity-suite-paper' of github.com:allenai/ai2-llm-e…
IanMagnusson Nov 7, 2023
0c7bf7e
RP save results
IanMagnusson Nov 7, 2023
7af8bb7
domains by rank by task
IanMagnusson Nov 9, 2023
b9d04a5
fringe curves
IanMagnusson Nov 10, 2023
fc8d17a
add pmi filtered metrics
IanMagnusson Nov 10, 2023
b9d617a
track split on token counts
IanMagnusson Nov 11, 2023
efed63a
Merge branch 'perplexity-suite-paper' of github.com:allenai/ai2-llm-e…
IanMagnusson Nov 11, 2023
9b7a85b
fix pmi ppl
IanMagnusson Nov 11, 2023
ef62450
reweighting inside ppl
IanMagnusson Nov 11, 2023
8a83142
exclude ice and stack from "all" metrics
IanMagnusson Nov 12, 2023
1afd076
remove references to "subdomain" from figures
IanMagnusson Nov 12, 2023
e5a3f27
domain improvement inequality
IanMagnusson Nov 12, 2023
044d676
most and least improved domains
IanMagnusson Nov 12, 2023
634514d
ppl reduction over model size
IanMagnusson Nov 20, 2023
c3a0594
pmi ppl on twitter aae
IanMagnusson Nov 20, 2023
1c42981
add pythia 160m and standardize model size names
IanMagnusson Nov 21, 2023
ae550a9
pile lumi
IanMagnusson Nov 25, 2023
6a8b81c
Merge branch 'perplexity-suite-paper' of github.com:allenai/ai2-llm-e…
IanMagnusson Nov 25, 2023
2fb71b7
save code!
IanMagnusson Nov 28, 2023
b597465
Merge branch 'perplexity-suite-paper' of github.com:allenai/ai2-llm-e…
IanMagnusson Nov 28, 2023
7ebe22f
fix tokens seen on baselines
IanMagnusson Nov 30, 2023
8538d66
only change unsharded dirs
IanMagnusson Dec 5, 2023
a115353
Merge branch 'perplexity-suite-paper' of github.com:allenai/ai2-llm-e…
IanMagnusson Dec 5, 2023
d61a5db
doma 1b test eos fix
IanMagnusson Dec 5, 2023
c25ce9f
more tokens on the 7b
IanMagnusson Dec 5, 2023
a6dbc07
count non-embedding params
IanMagnusson Dec 5, 2023
f9ff98b
A script to get checkpoint that were made on lumi
IanMagnusson Dec 5, 2023
92d4828
roll back paper specifics
IanMagnusson Dec 5, 2023
3d2b22a
more paper specific rollback
IanMagnusson Dec 5, 2023
365edfc
Merge branch 'main' of github.com:allenai/ai2-llm-eval into paloma-re…
IanMagnusson Dec 5, 2023
b58c78a
minimal PPL inference
IanMagnusson Dec 5, 2023
c494108
style stuff
IanMagnusson Dec 5, 2023
c0002e5
changeloooooog
IanMagnusson Dec 5, 2023
c65dec9
remove local path
IanMagnusson Dec 5, 2023
9ed51d1
centralize documentation
IanMagnusson Dec 12, 2023
a1cbad4
Update README.md
AkshitaB Dec 13, 2023
62600b4
Update README.md
AkshitaB Dec 13, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- Updated code that records the fine-grained perplexity metrics per subdomain to also include perplexity over words, characters, bytes, and also bits per byte
- Added option to track avg logit per token type
- Added script that uses the tango steps as functions, and bypasses the tango caching mechanism, for simpler execution
- Added a minimal example of how to run Paloma from the HF hub, as well as a step to output results in jsonl.gz format

### Fixed

44 changes: 44 additions & 0 deletions configs/example_paloma_config.jsonnet
@@ -0,0 +1,44 @@
/*--------------------------------------- Configurations -----------------------------------------*/

local utils = import 'utils.libsonnet';

//❗ To run this config you will need to first set up the data following the instructions in paloma/eval_data/README.md
//❗ Also note that this will run validation results. Change to paloma_hf_release_test.libsonnet to run test results.
local ppl_suite = import 'task_sets/paloma_hf_release_val.libsonnet';


//❗Set gsheet to the name of your google sheet.
// Set it to null if you do not want your results to be uploaded to a google sheet (they will still be saved as an object).
local gsheet = null;
//❗Set output_dir to a directory where you want to save outputs as jsonl.gz files.
// Set it to null if you do not want your results saved as jsonl.gz files.
local output_dir = null;

local create_models = function(model_path, revisions, gpus_needed) [
    {
        model_path: model_path,
        revision: rev,
        gpus_needed: gpus_needed,
        prediction_kwargs: {
            model_max_length: 2048, //❗Ensure that this is set to the actual max len of your model
            limit: 2, //❗ Here we only run 2 examples per task for testing purposes. Set this to null to run all examples.
        }
    }
    for rev in revisions
];

local revisions = [
    "step" + std.toString(i * 10000)
    for i in std.range(14, 14) //❗ Set this to the range of revisions you want to run.
];


local models = create_models("EleutherAI/pythia-160m-seed1", revisions, 1);
local task_sets = [
    ppl_suite.task_set
];


{
    steps: utils.create_fine_grained_pipeline(models, task_sets, gsheet, output_dir)
}
53 changes: 53 additions & 0 deletions configs/task_sets/paloma_hf_release_test.libsonnet
@@ -0,0 +1,53 @@

local task_utils = import 'task_utils.libsonnet';

local common_kwargs = {
    task_name: "ppl_custom",
    task_kwargs: {
        keep_all_instance_fields_except: ["text", "tokens"],
        detailed_output: true,
    },
    prediction_kwargs: {
        split: "test",
        model_max_length: task_utils.model_max_length,
    }
};

// TODO: refactor catwalk's Perplexity task so that it actually uses the s3 path.
// until then, let the path be present in nfs ($EVAL_DATA_PATH).
local data_dir = "paloma/";

local create_task_kwargs(task_names) = [
    {
        task_kwargs: {
            task_rename: "ppl_" + task_name,
            files: [data_dir + "/" + task_name + "/test"]
        }
    }
    for task_name in task_names
];

local task_dicts = create_task_kwargs(
    [
        "m2d2_s2orc_unsplit",
        "m2d2_wikipedia_unsplit",
        "c4_100_domains",
        "c4_en",
        "mc4",
        "4chan_meta_sep",
        "manosphere_meta_sep",
        "gab",
        "twitterAAE_HELM_fixed",
        "wikitext_103",
        "ptb",
        "redpajama",
        "falcon-refinedweb",
        "dolma-v1_5",
        "dolma_100_subreddits",
        "dolma_100_programing_languages"
    ]
);

{
    task_set: task_utils.create_task_set_from_task_dicts("eval_suite", task_dicts, common_kwargs)
}
53 changes: 53 additions & 0 deletions configs/task_sets/paloma_hf_release_val.libsonnet
@@ -0,0 +1,53 @@

local task_utils = import 'task_utils.libsonnet';

local common_kwargs = {
    task_name: "ppl_custom",
    task_kwargs: {
        keep_all_instance_fields_except: ["text", "tokens"],
        detailed_output: true,
    },
    prediction_kwargs: {
        split: "validation",
        model_max_length: task_utils.model_max_length,
    }
};

// TODO: refactor catwalk's Perplexity task so that it actually uses the s3 path.
// until then, let the path be present in nfs ($EVAL_DATA_PATH).
local data_dir = "paloma/";

local create_task_kwargs(task_names) = [
    {
        task_kwargs: {
            task_rename: "ppl_" + task_name,
            files: [data_dir + "/" + task_name + "/val"]
        }
    }
    for task_name in task_names
];

local task_dicts = create_task_kwargs(
    [
        "m2d2_s2orc_unsplit",
        "m2d2_wikipedia_unsplit",
        "c4_100_domains",
        "c4_en",
        "mc4",
        "4chan_meta_sep",
        "manosphere_meta_sep",
        "gab",
        "twitterAAE_HELM_fixed",
        "wikitext_103",
        "ptb",
        "redpajama",
        "falcon-refinedweb",
        "dolma-v1_5",
        "dolma_100_subreddits",
        "dolma_100_programing_languages"
    ]
);

{
    task_set: task_utils.create_task_set_from_task_dicts("eval_suite", task_dicts, common_kwargs)
}
19 changes: 17 additions & 2 deletions configs/utils.libsonnet
@@ -192,6 +192,18 @@ local create_processed_outputs_as_rows_multiple_metrics_steps(model_task_configs
        }
    };

local create_save_write_outputs_as_rows_multiple_metrics_as_file_steps(output_dir) =
    {
        "save-to-file": {
            type: "save-write-outputs-as-rows-multiple-metrics-as-file",
            write_outputs: {type: "ref", ref: "combine-all-outputs"},
            output_dir: output_dir,
            step_resources: {
                gpu_count: 0
            }
        }
    };

local create_pipeline(models, task_sets, gsheet) =

    // Model steps
@@ -218,7 +230,7 @@ local create_pipeline(models, task_sets, gsheet) =

    all_steps;

local create_fine_grained_pipeline(models, task_sets, gsheet) =
local create_fine_grained_pipeline(models, task_sets, gsheet, output_dir = null) =

    // Model steps
    local model_location_steps = create_model_location_steps(models);
@@ -237,13 +249,16 @@ local create_fine_grained_pipeline(models, task_sets, gsheet) =
    // Aggregate results for each task set and model combination
    local combine_all_outputs = create_processed_outputs_as_rows_multiple_metrics_steps(model_task_configs, gsheet);

    local save_to_file = create_save_write_outputs_as_rows_multiple_metrics_as_file_steps(output_dir);

    local all_steps =
        model_location_steps +
        catwalk_model_steps +
        task_steps +
        outputs_steps +
        processed_outputs_steps +
        combine_all_outputs;
        combine_all_outputs +
        save_to_file;

    all_steps;

30 changes: 30 additions & 0 deletions llm_eval/steps/run_catwalk.py
@@ -442,6 +442,7 @@ def run(
        row = {}
        task = d["task"]
        row["model"] = model
        row["split"] = pred_kwargs["split"]
        if "revision" in d["model_kwargs"]:
            row["revision"] = d["model_kwargs"]["revision"]
        row["subdomain"] = subdomain
@@ -462,6 +463,35 @@ def run(
        return per_metric_type_tsv_outputs


@Step.register("save-write-outputs-as-rows-multiple-metrics-as-file")
class SaveWriteOutputsAsRowsMultipleMetricsAsFile(Step):
    VERSION = "001"

    def run(self, write_outputs: Dict[str, List[Dict]], output_dir: str) -> None:
        import smart_open

        if output_dir is None:
            logger.info("output_dir is None, skipping save to file")
            return
        transport_params = None
        if output_dir.startswith("s3://"):
            import boto3

            session = boto3.Session(
                aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
                aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                aws_session_token=os.environ["AWS_SESSION_TOKEN"],
            )
            client = session.client("s3")
            transport_params = dict(client=client)
        for table_name in write_outputs:
            output_file = os.path.join(output_dir, table_name + ".jsonl.gz")
            with smart_open.open(output_file, "wb", transport_params=transport_params) as f:
                for row in tqdm(write_outputs[table_name], desc=f"writing {table_name} to file"):
                    f.write(json.dumps(row).encode())
                    f.write(b"\n")


def write_to_gsheet(gsheet: str, rows: List[Dict], sheet_title: str = "Sheet1"):
    import pygsheets

44 changes: 44 additions & 0 deletions paloma/README.md
@@ -0,0 +1,44 @@
# Paloma

The Paloma benchmark makes use of this repo to run evaluation inference. This README explains everything you need to know to get results on Paloma and make a submission to our benchmark.

Links:

[Data](https://huggingface.co/datasets/allenai/paloma)

## Getting existing results from the benchmark
Paloma is first and foremost a suite of results from the research community, organized by comparability. These are formatted as *.jsonl.gz files recording perplexity per domain over our 585 domains, as well as additional metrics discussed in our paper. These files are the same type of results that are output by running the code in this repo for a given model.

We are also building out a website to allow interactive inspection of these multi-dimensional results. Until then, please contact us by emailing the first author of Paloma if you would like access to the raw benchmark results.

So far the models evaluated by the benchmark are the 6 baseline 1B parameter models that we release with Paloma, as well as `EleutherAI/pythia-160m`, `EleutherAI/pythia-1B`, and `EleutherAI/pythia-6.9b`.
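
A minimal sketch of how you might inspect one of these results files in Python. The filename and the exact per-row field names are illustrative assumptions; check the keys in the file you actually download.

```python
import gzip
import json

# Each line of a *.jsonl.gz results file is one JSON record.
# "ppl_metrics.jsonl.gz" is a hypothetical filename.
with gzip.open("ppl_metrics.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        print(row)  # e.g. model, domain, and perplexity metrics
```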

## Setup
Start by following the installation instructions for this repo in this [readme](../README.md).

Then follow the instructions in this [readme](eval_data/README.md) to obtain and set up the evaluation data.

## Running evaluation
After following the setup instructions above, you can make an evaluation configuration based on our template [here](../configs/example_paloma_config.jsonnet). This is designed to work with any model hosted on the HuggingFace hub: just specify the name of the model on the hub and any revisions (i.e., checkpoints) that you want results over. Read the comments in the configuration marked with the ❗ symbol for details you may need to fill in. Finally, make sure to set `output_dir` to the directory where you want the job to write your results.

Now you can run your evaluation job locally with the following command (from the root of this repo):
```
tango --settings tango.yml run configs/example_paloma_config.jsonnet --workspace my-eval-workspace
```

## Pretraining your model
Note that if you want to make a submission to our benchmark, you must choose whether to opt in to several experimental controls that allow your submission to be marked for the greatest level of comparability. In this section we detail how you can accomplish these experimental controls.

### Decontaminating your pretraining data
Our decontamination approach is implemented in the Dolma tooling repo. It allows you to remove any document from your pretraining data that is contaminated with respect to Paloma.

To do this, please follow the instructions [here](https://github.com/allenai/dolma/blob/decon-instructions/docs/paloma_decontamination.md) to decontaminate your own pretraining data.

### Fixing the training data order
Our approach for fixing the training data order requires the use of the same training code that we employ to train our 1B parameter baselines. This training code cannot yet be released, as it is being developed for a separate, ongoing project. When that code is released, we will update our instructions here to enable use of this experimental control. If you wish to use this control before then, please feel free to reach out to the first author of Paloma.

### Fixing the vocabulary
We ask that submissions that do not investigate changes in vocabulary opt in to our standardized vocabulary to enable the greatest level of comparability. That vocabulary is available from the tokenizer hosted on the HuggingFace hub as `allenai/gpt-neox-olmo-dolma-v1_5`.
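
A minimal sketch of loading that tokenizer, assuming the HuggingFace `transformers` library is installed:

```python
from transformers import AutoTokenizer

# Fetch the tokenizer that defines the standardized vocabulary.
tokenizer = AutoTokenizer.from_pretrained("allenai/gpt-neox-olmo-dolma-v1_5")
print(tokenizer.vocab_size)
```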

## Making a submission
At present we are building out an automatic submission process that will soon be available. Until then, please reach out to us by emailing the first author of Paloma if you would like to submit results to the benchmark.
15 changes: 15 additions & 0 deletions paloma/eval_data/README.md
@@ -0,0 +1,15 @@
# Local evaluation data

This directory is used as a temporary workaround until we implement perplexity inference with HF hub datasets.

To use Paloma with this pipeline, you will need to first download the data from the HF hub (install git lfs first if necessary):
```
huggingface-cli login
git lfs install
git clone https://huggingface.co/datasets/allenai/paloma
```

Then, when you run the pipeline, you will first need to export the path to this data:
```
export EVAL_DATA_PATH=$(pwd)
```
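
As an optional sanity check (a sketch, not part of the pipeline), you can verify that the cloned data sits where the task configs expect it, i.e. under `$EVAL_DATA_PATH/paloma/<domain>/val`:

```python
import os

# The task configs read files at paloma/<domain>/val (and /test) relative to EVAL_DATA_PATH.
root = os.path.join(os.environ["EVAL_DATA_PATH"], "paloma")
for domain in sorted(os.listdir(root)):
    if domain.startswith("."):
        continue  # skip .git and other hidden entries
    has_val = os.path.exists(os.path.join(root, domain, "val"))
    print(domain, "ok" if has_val else "missing val split")
```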