Merge pull request #12 from pat-alt/11-activations-for-cls-head

pat-alt · Jan 27, 2024 · db4bb89
2 parents 9ea8230 + 45da081
commit db4bb89

Showing 7 changed files with 164 additions and 54 deletions.
diff --git a/dev/jcon_proposal.md b/dev/jcon_proposal.md
diff --git a/dev/juliacon/biblio.bib b/dev/juliacon/biblio.bib
@@ -0,0 +1,17 @@
+@Misc{shah2023trillion,
+  author        = {Agam Shah and Suvan Paturi and Sudheer Chava},
+  title         = {Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis},
+  eprint        = {2305.07972},
+  archiveprefix = {arXiv},
+  primaryclass  = {cs.CL},
+  year          = {2023},
+}
+
+@Misc{alain2018understanding,
+  author        = {Guillaume Alain and Yoshua Bengio},
+  title         = {Understanding intermediate layers using linear classifier probes},
+  eprint        = {1610.01644},
+  archiveprefix = {arXiv},
+  primaryclass  = {stat.ML},
+  year          = {2018},
+}
diff --git a/dev/juliacon/proposal.md b/dev/juliacon/proposal.md
@@ -0,0 +1,62 @@
+# Trillion Dollar Words in Julia
+
+# 💰 Trillion Dollar Words in Julia
+
+**Abstract**: [TrillionDollarWords.jl](https://github.com/pat-alt/TrillionDollarWords.jl) provides access to a novel financial dataset and a large language model fine-tuned for classifying central bank communications as either ‘hawkish’, ‘dovish’ or ‘neutral’. It ships with essential functionality for model probing, an important aspect of mechanistic interpretability.
+
+## Description
+
+In the age of forward guidance, central bankers spend a great deal of time thinking about how their communications are perceived by markets. What if there was a way to predict the impact of communications on financial markets directly from text? Shah, Paturi, and Chava (2023) attempt to do just that in their [ACL 2023 paper](https://arxiv.org/abs/2305.07972) (which the author of this package is not affiliated with).
+
+### Background
+
+The authors of the paper have collected and preprocessed a corpus of around 40,000 time-stamped sentences from meeting minutes, press conferences and speeches by members of the Federal Open Market Committee (FOMC). The total sample period spans from January 1996 to October 2022. In order to train various rule-based models and large language models (LLMs) to classify sentences as either ‘hawkish’, ‘dovish’ or ‘neutral’, they have manually annotated a subset of around 2,500 sentences. The best-performing model, a large RoBERTa model with around 355 million parameters, was open-sourced on [HuggingFace](https://huggingface.co/gtfintechlab/FOMC-RoBERTa?text=A+very+hawkish+stance+excerted+by+the+doves).
+
+### Data
+
+While the authors of the paper did publish their data, much of it is unfortunately scattered across CSV and Excel files stored in a public GitHub repo. We have collected and merged that data, yielding a combined dataset with indexed sentences and additional metadata that may be useful for downstream tasks.
+
+``` julia
+julia> using TrillionDollarWords
+julia> load_all_sentences() |> names
+8-element Vector{String}:
+ "sentence_id"
+ "doc_id"
+ "date"
+ "event_type"
+ "label"
+ "sentence"
+ "score"
+ "speaker"
+```
+
+In addition to the sentences, market data about price inflation and the US Treasury yield curve can also be loaded. All datasets are loaded as `DataFrame`s and share common keys that make it possible to join them. Alternatively, a complete dataset combining the corpus of sentences with market data can also be loaded with `load_all_data()`.
+
+### Loading the Model
+
+The model can be loaded with or without the classifier head. Under the hood, we use [Transformers.jl](https://github.com/chengchingwen/Transformers.jl) to retrieve the model from HuggingFace. Any keyword arguments accepted by `Transformers.HuggingFace.HGFConfig` can also be passed. For example, to load the model without the classifier head and enable access to layer-wise activations, the following command can be used: `load_model(; load_head=false, output_hidden_states=true)`.
+
+### Model Inference
+
+For our own research, we have been interested in probing the model. This involves using linear models to estimate the relationship between layer-wise transformer embeddings and some outcome variable of interest (Alain and Bengio 2018). To do this, we first had to run a single forward pass for each sentence through the RoBERTa model and store the layer-wise embeddings. The package ships with functionality for doing just that, but to save others valuable GPU hours we have archived activations of the hidden state on the first entity token for each layer as [artifacts](https://github.com/pat-alt/TrillionDollarWords.jl/releases/tag/activations_2024-01-17). To download the last-layer activations in an interactive Julia session, for example, users can proceed as follows:
+
+``` julia
+julia> using LazyArtifacts
+
+julia> artifact"activations_layer_24"
+"$HOME/.julia/artifacts/1785c2c64e603af5e6b79761150b1cc15d03f525"
+```
+
+We have found that despite the small sample size, the RoBERTa model appears to have distilled useful representations for downstream tasks that it was not explicitly trained for. The chart below shows the average out-of-sample root mean squared error for predicting various market indicators from layer activations. Consistent with findings in related work (Alain and Bengio 2018), we find that performance typically improves for layers closer to the final output layer of the transformer model. The measured performance is at least on par with baseline autoregressive models.
+
+![](https://raw.githubusercontent.com/pat-alt/TrillionDollarWords.jl/11-activations-for-cls-head/dev/juliacon/rmse_pca_128.png)
+
+### Intended Purpose
+
+We hope that this small package may be useful to members of the Julia community who are interested in the interplay between Economics, Finance and Artificial Intelligence. It should, for example, be straightforward to use this package in combination with Transformers.jl to fine-tune additional models on the classification task or other tasks of interest. Any contributions are very much welcome.
+
+## References
+
+Alain, Guillaume, and Yoshua Bengio. 2018. “Understanding Intermediate Layers Using Linear Classifier Probes.” 
+
+Shah, Agam, Suvan Paturi, and Sudheer Chava. 2023. “Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis.” 
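
The probing approach described in the proposal above (a linear model fitted on layer activations and evaluated by out-of-sample root mean squared error) can be sketched roughly as follows. This is a minimal illustration on synthetic data and not the package's implementation: the matrix `A` merely stands in for stored activations, `y` is a made-up outcome variable, and `fit_probe`, `predict_probe` and `rmse` are hypothetical helper names. Storing one sentence per column is an assumption, chosen to match the `size(A, 2) == n` checks in the tests.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(42)

# Synthetic stand-ins (assumptions): d-dimensional activations for n sentences,
# one sentence per column.
d, n = 16, 200
A = randn(d, n)
w_true = randn(d)
y = A' * w_true .+ 0.1 .* randn(n)

# Hypothetical helper: ordinary least squares with an intercept.
function fit_probe(A::AbstractMatrix, y::AbstractVector)
    X = hcat(ones(size(A, 2)), A')   # n × (d + 1) design matrix
    return X \ y                      # least-squares solution
end

# Hypothetical helper: predictions from a fitted probe.
predict_probe(β, A) = hcat(ones(size(A, 2)), A') * β

# Root mean squared error between observed and predicted outcomes.
rmse(y, ŷ) = sqrt(mean(abs2, y .- ŷ))

# Simple train/test split to obtain an out-of-sample error.
train, test = 1:150, 151:n
β = fit_probe(A[:, train], y[train])
err = rmse(y[test], predict_probe(β, A[:, test]))
println("out-of-sample RMSE: ", round(err; digits = 3))
```

With real activations, one such probe would be fitted per layer so that the errors can be compared across layers, as in the chart referenced in the proposal.
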
diff --git a/dev/juliacon/proposal.qmd b/dev/juliacon/proposal.qmd
@@ -0,0 +1,64 @@
+---
+title: Trillion Dollar Words in Julia
+bibliography: biblio.bib
+format: 
+  commonmark: 
+    wrap: none
+---
+
+# 💰 Trillion Dollar Words in Julia
+
+**Abstract**: [TrillionDollarWords.jl](https://github.com/pat-alt/TrillionDollarWords.jl) provides access to a novel financial dataset and a large language model fine-tuned for classifying central bank communications as either 'hawkish', 'dovish' or 'neutral'. It ships with essential functionality for model probing, an important aspect of mechanistic interpretability.
+
+## Description
+
+In the age of forward guidance, central bankers spend a great deal of time thinking about how their communications are perceived by markets. What if there was a way to predict the impact of communications on financial markets directly from text? @shah2023trillion attempt to do just that in their [ACL 2023 paper](https://arxiv.org/abs/2305.07972) (which the author of this package is not affiliated with).
+
+### Background
+
+The authors of the paper have collected and preprocessed a corpus of around 40,000 time-stamped sentences from meeting minutes, press conferences and speeches by members of the Federal Open Market Committee (FOMC). The total sample period spans from January 1996 to October 2022. In order to train various rule-based models and large language models (LLMs) to classify sentences as either 'hawkish', 'dovish' or 'neutral', they have manually annotated a subset of around 2,500 sentences. The best-performing model, a large RoBERTa model with around 355 million parameters, was open-sourced on [HuggingFace](https://huggingface.co/gtfintechlab/FOMC-RoBERTa?text=A+very+hawkish+stance+excerted+by+the+doves).
+
+### Data
+
+While the authors of the paper did publish their data, much of it is unfortunately scattered across CSV and Excel files stored in a public GitHub repo. We have collected and merged that data, yielding a combined dataset with indexed sentences and additional metadata that may be useful for downstream tasks. 
+
+```julia
+julia> using TrillionDollarWords
+julia> load_all_sentences() |> names
+8-element Vector{String}:
+ "sentence_id"
+ "doc_id"
+ "date"
+ "event_type"
+ "label"
+ "sentence"
+ "score"
+ "speaker"
+```
+
+In addition to the sentences, market data about price inflation and the US Treasury yield curve can also be loaded. All datasets are loaded as `DataFrame`s and share common keys that make it possible to join them. Alternatively, a complete dataset combining the corpus of sentences with market data can also be loaded with `load_all_data()`.
+
+### Loading the Model
+
+The model can be loaded with or without the classifier head. Under the hood, we use [Transformers.jl](https://github.com/chengchingwen/Transformers.jl) to retrieve the model from HuggingFace. Any keyword arguments accepted by `Transformers.HuggingFace.HGFConfig` can also be passed. For example, to load the model without the classifier head and enable access to layer-wise activations, the following command can be used: `load_model(; load_head=false, output_hidden_states=true)`.
+
+### Model Inference
+
+For our own research, we have been interested in probing the model. This involves using linear models to estimate the relationship between layer-wise transformer embeddings and some outcome variable of interest [@alain2018understanding]. To do this, we first had to run a single forward pass for each sentence through the RoBERTa model and store the layer-wise embeddings. The package ships with functionality for doing just that, but to save others valuable GPU hours we have archived activations of the hidden state on the first entity token for each layer as [artifacts](https://github.com/pat-alt/TrillionDollarWords.jl/releases/tag/activations_2024-01-17). To download the last-layer activations in an interactive Julia session, for example, users can proceed as follows:
+
+```julia
+julia> using LazyArtifacts
+
+julia> artifact"activations_layer_24"
+```
+
+We have found that despite the small sample size, the RoBERTa model appears to have distilled useful representations for downstream tasks that it was not explicitly trained for. The chart below shows the average out-of-sample root mean squared error for predicting various market indicators from layer activations. Consistent with findings in related work [@alain2018understanding], we find that performance typically improves for layers closer to the final output layer of the transformer model. The measured performance is at least on par with baseline autoregressive models.
+
+![](https://raw.githubusercontent.com/pat-alt/TrillionDollarWords.jl/11-activations-for-cls-head/dev/juliacon/rmse_pca_128.png)
+
+### Intended Purpose
+
+We hope that this small package may be useful to members of the Julia community who are interested in the interplay between Economics, Finance and Artificial Intelligence. It should, for example, be straightforward to use this package in combination with Transformers.jl to fine-tune additional models on the classification task or other tasks of interest. Any contributions are very much welcome.
+
+## References
+
diff --git a/dev/juliacon/rmse_pca_128.png b/dev/juliacon/rmse_pca_128.png
diff --git a/src/baseline_model.jl b/src/baseline_model.jl
@@ -64,23 +64,36 @@ end
 """
     get_embeddings(atomic_model::HGFRobertaForSequenceClassification, tokens::NamedTuple)
 
-Extends the `embeddings` function to `HGFRobertaForSequenceClassification`.
+Extends the `get_embeddings` function to `HGFRobertaForSequenceClassification`. Performs a forward pass through the base model, then passes the resulting embeddings through the classification head up to, but not including, its final linear layer, and returns the activations going into that final layer.
 """
-get_embeddings(atomic_model::HGFRobertaForSequenceClassification, tokens::NamedTuple) =
-    atomic_model.model(tokens)
+function get_embeddings(
+    atomic_model::HGFRobertaForSequenceClassification,
+    tokens::NamedTuple,
+)
+    clf = atomic_model.cls
+    b = atomic_model.model(tokens)
+    # Perform forward pass through classification head:
+    b = clf.layer.layers[1](b).hidden_state |> x -> clf.layer.layers[2](x)
+    return b
+end
+
 
 """
     layerwise_activations(mod::BaselineModel, queries::Vector{String})
 
-Computes a forward pass of the model on the given queries and returns the layerwise activations for the `HGFRobertaModel`. If `output_hidden_states=false` was passed to `load_model` (default), only the last layer is returned. If `output_hidden_states=true` was passed to `load_model`, all layers are returned. Even if the model is loaded with the head for classification, the head is not used for computing the activations.
+Computes a forward pass of the model on the given queries and returns the layerwise activations for the `HGFRobertaModel`. If `output_hidden_states=false` was passed to `load_model` (default), only the last layer is returned. If `output_hidden_states=true` was passed to `load_model`, all layers are returned. If the model is loaded with the head for classification, the activations going into the final linear layer are returned.
 """
 function layerwise_activations(mod::BaselineModel, queries::Vector{String})
     embeddings = get_embeddings(mod, queries)
-    pooler = Transformers.HuggingFace.FirstTokenPooler()
-    if haskey(embeddings, :outputs)
-        output = [pooler(x.hidden_state) for x in embeddings.outputs]
+    if typeof(mod.mod) <: HGFRobertaForSequenceClassification
+        output = embeddings.hidden_state[:, :]
     else
-        output = pooler(embeddings.hidden_state)
+        pooler = Transformers.HuggingFace.FirstTokenPooler()
+        if haskey(embeddings, :outputs)
+            output = [pooler(x.hidden_state) for x in embeddings.outputs]
+        else
+            output = pooler(embeddings.hidden_state)
+        end
     end
     return output
 end

diff --git a/test/load_model.jl b/test/load_model.jl
@@ -40,13 +40,11 @@ end
             @test size(A, 2) == n
             A_cls = layerwise_activations(mod_cls, queries.sentence)
             @test size(A_cls, 2) == n
-            @test isequal(A, A_cls)
         end
 
         @testset "To data frame" begin
             A = layerwise_activations(mod, queries)
             A_cls = layerwise_activations(mod_cls, queries)
-            @test isequal(A, A_cls)
         end
 
     end
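
The test changes above drop the `isequal(A, A_cls)` checks but keep the size checks. A toy sketch in plain arrays may help to see why: once the classifier-head path applies a further transformation to the pooled first-token activations, the values differ while the number of columns (one per sentence) stays the same. Everything here is an assumption for illustration only: the `features × tokens × sentences` layout, the dense-plus-`tanh` stand-in for the head, and all variable names. None of it is taken from Transformers.jl or from this package.

```julia
using Random

Random.seed!(1)

# Assumed layout: features × tokens × sentences.
d, seq_len, n = 8, 5, 3
H = randn(d, seq_len, n)

# First-token pooling: keep only the first token of every sentence.
pooled = H[:, 1, :]                 # d × n

# Stand-in for the head's hidden layer (assumption: dense layer + tanh).
W, b = randn(d, d), randn(d)
head_act = tanh.(W * pooled .+ b)   # d × n

println("pooled size:   ", size(pooled))
println("head_act size: ", size(head_act))
println("identical values? ", pooled == head_act)
```

Both matrices have `n` columns, which is what the remaining `size(A_cls, 2) == n` tests check, but an element-wise equality test between them would no longer be expected to pass.
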