How to save embedding_model and document_store locally? #135

Closed · arthurbarros opened this issue Jun 7, 2020 · 29 comments · Fixed by #1506

Labels: topic:DPR, type:feature (New feature or request)

@arthurbarros
Contributor

I wonder if it is possible to do something similar to reader.save(directory="my_path") but for an EmbeddingRetriever.

@Timoeller
Contributor

Hey @arthurbarros,
the retrievers are not trained (for now), so there is not much need to store them.
The underlying model like "deepset/sentence_bert" is downloaded and cached when first calling the EmbeddingRetriever, so you only download it once.
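
For reference, a minimal sketch of that first call (constructor arguments vary across Haystack versions, so treat this as illustrative):

from haystack.retriever.dense import EmbeddingRetriever

# The first instantiation downloads "deepset/sentence_bert" into the local
# transformers cache; later instantiations reuse the cached files.
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence_bert",
                               use_gpu=False)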

Though saving the retrievers might be an interesting feature if you are in an offline setting or want to copy the model between machines; working with the cache files directly is rather cumbersome. Could you elaborate on why you need this functionality, so we can better prioritize this feature?

@Timoeller Timoeller added type:feature New feature or request topic:DPR labels Jun 10, 2020
@Timoeller Timoeller self-assigned this Jun 10, 2020
@Timoeller
Contributor

Closing this now. Feel free to reopen.

@violetcodes

violetcodes commented Oct 13, 2020

Hi @Timoeller,
I have used the DPR retriever on Colab with a GPU (following Tutorial 6) for QA on my documents, and I want to use this model on my local machine (or an instance) without a GPU. Running document_store.update_embeddings(retriever) (document_store: FAISSDocumentStore, retriever: DensePassageRetriever) on the instance takes much longer, as it is a low-spec machine, and this step only needs to be run once. If I could save the retriever or the document_store (with its updated embeddings) and copy them to my instance, I would only have to load them, skipping the bottleneck step, and the demo would be fast every time.
Am I doing something wrong?

Also, I couldn't use document_store.save("filepath") because FAISSDocumentStore.load() needs sql_url. Why don't we need sql_url during initialization when it is essential during loading?

@Weilin37

I'm also wondering how to save the document store once we've taken the time to generate the embeddings, and how to load it back.

@tholor
Member

tholor commented Nov 19, 2020

Hey @Weilin37,

What type of document store are you using?

If you use ElasticsearchDocumentStore, the data will be automatically persisted by Elasticsearch.

If you use FAISS, the embeddings are stored in a FAISS index. You can save that index via FAISSDocumentStore.save("file_path"). All remaining data (e.g. the document text and metadata) is automatically kept persistent in SQL. You can load it back via FAISSDocumentStore.load(faiss_file_path="file_path", sql_url="sqlite:///").

Hope this helps!

@Weilin37

Thanks! This is definitely the answer I was looking for! (I am using FAISS)

@AzureAlph

AzureAlph commented Nov 19, 2020

Hi @tholor,

Just curious if that also applies to a document_store that already had update_embeddings applied? Using FAISS, by the way.
I did try your example, but when I loaded it, the embeddings were empty. I haven't tried all the examples in the documentation yet, so maybe I'm missing something.

@tholor
Member

tholor commented Nov 19, 2020

Yes, absolutely.

@tholor tholor changed the title from "How to save embedding_model locally?" to "How to save embedding_model and document_store locally?" Nov 19, 2020
@AzureAlph

So I was trying this out in Colab:

document_store = FAISSDocumentStore()
document_store.write_documents(datadict)
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence_bert",
                               use_gpu=False)

document_store.update_embeddings(retriever)
document_store.save("testfile_path")
document_store2 = FAISSDocumentStore()
document_store2.load(faiss_file_path="testfile_path", sql_url="sqlite:///")

but document_store2 always comes up empty. Please guide me on where I went wrong. I think it's in the SQL part, but I am also not familiar with it.

@tholor
Member

tholor commented Nov 20, 2020

Can you please try:

document_store = FAISSDocumentStore(sql_url="sqlite:///haystack_test_faiss.db")
document_store.write_documents(datadict)
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence_bert", use_gpu=False)
document_store.update_embeddings(retriever)
document_store.save("testfile_path")
document_store2 = FAISSDocumentStore.load(faiss_file_path="testfile_path", sql_url="sqlite:///haystack_test_faiss.db")

@AzureAlph

It worked! Thank you.

Also, I noticed that after applying update_embeddings, when I check the embedding value of items via get_all_documents on the document_store, the embedding values remain empty. Is this the expected behavior, or am I misunderstanding how update_embeddings works?

@tholor
Member

tholor commented Nov 23, 2020

You are totally right, the embeddings should not remain empty. @tanaysoni can you please check this and verify that saving/loading is working as expected?

@tholor tholor reopened this Nov 23, 2020
@tanaysoni
Contributor

Hi @AzureAlph, currently, the FAISSDocumentStore does not return embeddings with the documents by default.

With #615, an option to get the embeddings using the return_embedding parameter will be added to FAISSDocumentStore.get_all_documents().
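
Once that is merged, usage would presumably look like this (a sketch based on the parameter name above):

docs = document_store.get_all_documents(return_embedding=True)
print(docs[0].embedding)  # the embedding vector, instead of None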

@harikc456

In Google Colab, I used the code below to update the embeddings and save the document_store and the retriever.

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", sql_url="sqlite:///haystack_test_faiss.db")
from haystack.retriever.dense import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)
document_store.update_embeddings(retriever)
retriever.save("retriever.pt")
document_store.save("faiss.index")

Later, on my local machine, I have the following snippet, which loads the document store and the saved retriever:

document_store = FAISSDocumentStore.load("faiss.index", sql_url="sqlite:///haystack_test_faiss.db", index=None)
retriever = DensePassageRetriever.load("retriever.pt", document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
finder = Finder(reader, retriever)
prediction = finder.get_answers(question="Who is the sister of Sansa?", top_k_retriever=10, top_k_reader=5)

But this returns 0 candidates from the retriever, even though the same question was answered in Google Colab.

Am I loading the files wrong?

@benscottie

@tholor I am experiencing the same problem as above. Is there a solution? I'm wondering if setting index to None has anything to do with it.

@Timoeller
Contributor

Hey @benscottie, thanks for bringing this up here. Could you open a separate issue with a more detailed explanation of your setup? I could imagine that different FAISS versions for saving and loading might complicate things.
Or do you save and load in a single Colab environment?

@benscottie

Hey @Timoeller, thanks for your response. I am saving and loading in one environment and having the issue. Ultimately, I would like to be able to save and load in separate environments. I will open a separate issue.

@fingoldo
Contributor

document_store2 = FAISSDocumentStore.load(faiss_file_path="testfile_path", sql_url="sqlite:///haystack_test_faiss.db")

Hi, does your example mean that, if sql_url was not specified from the start, the documents are not persisted anywhere and it's not possible to load them again? If I'm wrong, where can I find them? According to the docs, the default path should be "sqlite:///faiss_document_store.db", but I can't find such a file, faiss_document_store.db, anywhere on my PC.

@tholor
Member

tholor commented Sep 24, 2021

Hey @fingoldo,
We recently changed the save/load behaviour for FAISS (see #1459). With the latest master version, you should indeed have the default sqlite:///faiss_document_store.db and find a file accordingly on your system. This was not the case in older Haystack versions, where we used in-memory SQLite as the default.

You still need to call FAISSDocumentStore.save() to save also the FAISS index and the configuration. Just having the SQL DB won't be enough to restore your doc store.
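
In other words, with the new behaviour the round trip looks roughly like this (file names are examples, and exact signatures may differ between versions):

document_store = FAISSDocumentStore()  # now defaults to sqlite:///faiss_document_store.db on disk
document_store.update_embeddings(retriever)
document_store.save("my_index.faiss")  # persists the FAISS index plus a config file next to it

# later, or on another machine (with the index, config and .db files copied over):
document_store = FAISSDocumentStore.load("my_index.faiss")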

Hope this helps.

@fingoldo
Contributor

Thank you very much for the clarification... I installed farm-haystack recently, a few days ago. After reading ~2M small documents, updating embeddings, and calling .save(), I can't find the .db file anywhere, but the RAM footprint is 18 GB. So I probably somehow had an earlier version, which uses in-memory SQL. I guess the only option I have is to reinstall Haystack and reindex from scratch...

@tholor
Member

tholor commented Sep 24, 2021

Yes, the above PR was merged just 4 days ago (so after our latest 0.10 PyPI release). I'd suggest you re-install directly from master via:

git clone https://github.com/deepset-ai/haystack.git
cd haystack
pip install --editable .

@fingoldo
Contributor

I apologize, but is changing the vector_dim of a FAISS store supported without reimporting all the documents? For example, when I want to test many models to see which one gives the most sensible results. I tried applying "sentence-transformers/all-MiniLM-L6-v2" and it turned out there is a mismatch between the default vector_dim of the store (768) and the actual dimension of the retriever's model (384; is there some standard way to query this number from the retriever, by the way?). When I set FAISS_store.vector_dim = 384 and tried update_embeddings() again, I got the same error, so I guess changing it is not supported at all, or must be done at store initialization.

@fingoldo
Contributor

Also, it seems that there is a small inaccuracy in the code.

file:
haystack\haystack\document_store\faiss.py

method:
def load(cls, index_path: Union[str, Path], config_path: Optional[Union[str, Path]] = None):

code:

        if not config_path:
            index_path = Path(index_path)
            faiss_init_params_path = index_path.with_suffix(".json")

        init_params: dict = {}
        try:
            with open(faiss_init_params_path, 'r') as ipp:
                init_params = json.load(ipp)
        except OSError as e:

If I provide config_path, load() fails because faiss_init_params_path is never initialized.
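
A minimal sketch of the kind of fix this suggests (my assumption, not necessarily the patch that was merged):

        if config_path:
            faiss_init_params_path = Path(config_path)
        else:
            index_path = Path(index_path)
            faiss_init_params_path = index_path.with_suffix(".json")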

@tholor
Member

tholor commented Sep 27, 2021

is changing of vector_dim of a FAISS store supported, without reimporting all the documents?

No, that's currently not supported as FAISS doesn't allow any such permutations on an existing index. Most transformers use 768 as the dim - however, I see your point when switching to smaller retriever models.
I think we can try to find a way to delete the FAISS index under the hood and repopulate it with embeddings from the new retriever using the documents that are still stored in SQL. Created a new issue for this: #1505
Let me know if you are interested in contributing here.

if I provide config_path, load fails as faiss_init_params_path is not initialized.

@ZanSara Can you please check?

@fingoldo
Contributor

Most transformers use 768 as the dim - however, I see your point when switching to smaller retriever models. I think we can try to find a way to delete the FAISS index under the hood and repopulate it with embeddings from the new retriever using the documents that are still stored in SQL.

Will things change if I switch to Milvus? I mean the "index" parameter in many of the doc stores. Can it be used somehow to achieve that goal of testing multiple retrievers (and, therefore, embedding models) on the same document set?

I'd be willing to contribute something useful, but I'm not experienced at all in contributing.

@tholor
Member

tholor commented Sep 27, 2021

I'd be willing to contribute something useful, but I'm not experienced at all in contributing.

How about you give it a shot and if you don't succeed, we can still take over in a couple of weeks?
I added some first thoughts on the design here.

You can find some tips for contributing to Haystack here. In short: If you create an early "work in progress" PR, we can also try to help and give feedback. We recommend using our CI for running the tests until #1353 is finished (which will simplify running local tests).

@lalitpagaria
Contributor

@tholor @fingoldo
I have a bit of a hacky solution:

faiss_doc_store.faiss_indexes[index].reset()
faiss_doc_store.faiss_indexes.pop(index)
faiss_doc_store.vector_dim = 384
# To update the config (optional step), a `to_dict()` method would need to be implemented
# faiss_doc_store.set_config(**faiss_doc_store.to_dict())
# Now call `update_embeddings`
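
To complete the sketch, the final step would look something like this (the 384-dim model, and the assumption that update_embeddings recreates the popped index, are taken from the comments above):

new_retriever = EmbeddingRetriever(document_store=faiss_doc_store,
                                   embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # 384-dim model
                                   use_gpu=False)
faiss_doc_store.update_embeddings(new_retriever)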

@ZanSara
Contributor

ZanSara commented Sep 27, 2021

Also, it seems that there is a small inaccuracy in the code. [...] If I provide config_path, load() fails because faiss_init_params_path is never initialized.

@fingoldo The bug should be fixed now, check it out. Thank you for spotting it!
