fix: do not download more files at runtime
Due to a design choice in Transformers, remote code embedded in a model on HF is
downloaded at runtime, and this cannot be disabled, despite our attempts to
download it at build time. As a practical result, the built Docker container
image needs network access to download the remote code, which is not ideal.

This fix works around the problem by downloading the remote code at build time
and embedding it into the model, updating the `config.json` file so that its
`auto_map` entries no longer carry the remote repository prefix. This should
only be needed for Nomic-style models, since they are among the few embedding
models that use remote code.
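
For illustration, the `config.json` rewrite can be sketched in Python. This is
not what the commit runs (the build uses `gron` and `sed`); it is a minimal
sketch that assumes each `auto_map` value is a string of the form
`<repo>--<module>.<Class>`. The function name `strip_remote_repo_prefix` and the
sample values in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def strip_remote_repo_prefix(config_path: Path) -> dict:
    """Rewrite auto_map values "<repo>--<module>.<Class>" to "<module>.<Class>".

    Assumption: auto_map values are plain strings. Anything else is left alone.
    """
    config = json.loads(config_path.read_text())
    auto_map = config.get("auto_map", {})
    for key, value in auto_map.items():
        if isinstance(value, str) and "--" in value:
            # Keep only the part after the last "--", i.e. the local module path.
            auto_map[key] = value.rsplit("--", 1)[1]

    # Write to a temporary file first, then move it into place.
    tmp_path = config_path.with_name(config_path.name + ".tmp")
    tmp_path.write_text(json.dumps(config, indent=2) + "\n")
    tmp_path.replace(config_path)
    return config


if __name__ == "__main__":
    # Hypothetical sample config, for demonstration only.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "config.json"
        sample = {"auto_map": {"AutoConfig": "some-org/some-repo--configuration_x.XConfig"}}
        path.write_text(json.dumps(sample))
        print(strip_remote_repo_prefix(path)["auto_map"])
```

With the prefix removed, the class references resolve against the Python files
copied next to `config.json` rather than against a repository on the Hub.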

Many thanks to Tom Aarsen for providing me with the steps to fix this.

Relevant issue: UKPLab/sentence-transformers#2613 (comment)
thoughtpolice committed Apr 25, 2024
1 parent f002ad4 commit 2c3968b
Showing 1 changed file with 10 additions and 3 deletions.
13 changes: 10 additions & 3 deletions flake.nix
@@ -82,6 +82,7 @@
buildInputs = [
python
embedding-server-py
pkgs.gron
] ++ pythonLibs;
} ''
mkdir -p $out tmp
@@ -92,19 +93,25 @@
${python}/bin/python \
  ${embedding-server-py}/libexec/embedding-server.py \
  --save-models-to $out
# FIXME: why doesn't the nomic model include this file?
# https://github.com/UKPLab/sentence-transformers/issues/2613
cp -v \
  $HF_HOME/hub/models--nomic-ai--nomic-embed-text-v1/snapshots/02d96723811f4bb77a80857da07eda78c1549a4d/configuration_hf_nomic_bert.py \
  $HF_HOME/hub/models--nomic-ai--nomic-embed-text-v1/snapshots/02d96723811f4bb77a80857da07eda78c1549a4d/modeling_hf_nomic_bert.py \
  $out/nomic-embed-text-v1
gron $out/nomic-embed-text-v1/config.json \
  | sed -E 's/json\.auto_map\.(.*?)\s=\s".*?\-\-/json\.auto_map\.\1 = "/' \
  | gron --ungron \
  > $out/nomic-embed-text-v1/config.json.tmp
mv $out/nomic-embed-text-v1/config.json.tmp $out/nomic-embed-text-v1/config.json
'';

/* finally, just re-package the data with a fixed-output sha256 hash */
in pkgs.runCommand "model-data" {
outputHashMode = "recursive";
outputHashAlgo = "sha256";
outputHash = "sha256-nMRQtxrJdLxAnNLGHP61nAePGWJ/ZVG6ad0v9wUBTeY=";
outputHash = "sha256-QpmYSk396ShTFyXA9+DWjiGfEuZMSAul9EnOfG6SeRU=";
passthru = { inherit real-data; };
} "mkdir -p $out && cp -r ${real-data}/* $out";
};