FARMReader slow #1077
Hey @bappctl,
One thing that you can try to narrow down the root cause: disable multiprocessing for inference. You can do that via `num_processes=0` when initializing the FARMReader. As a side note: with only two docs the retriever is basically useless, as it will just return both documents anyway. |
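For anyone reading along, a minimal sketch of what disabling multiprocessing looks like when constructing the reader (the model name and flags are placeholders, not taken from this thread):

```python
from haystack.reader.farm import FARMReader

# num_processes=0 disables the multiprocessing pool that FARM's Inferencer
# would otherwise spawn for preprocessing/inference.
reader = FARMReader(
    model_name_or_path="distilbert-base-uncased-distilled-squad",
    use_gpu=True,
    num_processes=0,
)
```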
TransformersReader works, but I couldn't get FARMReader to work. In the above code I set num_processes=0 as suggested, but it gets stuck here for almost 40 minutes (I had to kill the process). I notice a peculiar behavior with FARMReader: when I kill the pod, I see the correct result printed in the container log before the app exits; if I don't kill the pod, it stays stuck as mentioned above. I see this behavior only with FARMReader; if I switch to TransformersReader the app works fine as expected. |
@tholor With TransformersReader, how do I train on custom data (similar to FARMReader's train())? |
@oryx1729 have you seen such an issue in our Kubernetes deployment, or do you have an idea what might cause the deadlock here? |
@tholor Finally made it work. It's something to do with Gunicorn threads. When I faced the issue I had it set to 3 threads (the very minimum); I removed that and went with workers and worker-connections instead, and then it started working with FARMReader. But threads didn't cause any issue with TransformersReader; it happens only with FARMReader. |
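For reference, a sketch of the kind of worker-based (no `--threads`) Gunicorn setup described here, written as a Python `gunicorn.conf.py`; the values are illustrative, not a recommendation from this thread:

```python
# gunicorn.conf.py -- process-based workers instead of threads.
bind = "0.0.0.0:8000"
workers = 1                 # one worker process per model copy
worker_connections = 1000   # only relevant for async worker classes (gevent/eventlet)
timeout = 300               # generous timeout so model loading/inference isn't killed early
```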
Hi @oryx1729
The deploy pod has 4 CPUs, 10 GB RAM and 1 GPU. I just tried with 2 documents of no more than 2 pages each.
The other thing I notice is that the GPU memory is not freed after predict.
|
@bappctl are you using FastAPI for the APIs? In that case, the Gunicorn worker class should be `uvicorn.workers.UvicornWorker`. Can you share the complete code for your API endpoint? |
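A minimal, hypothetical sketch of a FastAPI version of such an endpoint and how it would be served with the Uvicorn worker class; the hosts, index name, model and route are placeholders, and the Haystack import paths assume the 0.x package layout seen elsewhere in this thread:

```python
# main.py -- illustrative FastAPI variant of the predict endpoint.
from fastapi import FastAPI, Form  # Form requires the python-multipart package
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.reader.farm import FARMReader
from haystack.pipeline import ExtractiveQAPipeline

# Build the pipeline once at startup, not per request.
document_store = ElasticsearchDocumentStore(host="localhost", port=9200, index="document")
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad",
                    use_gpu=True, num_processes=0)
pipe = ExtractiveQAPipeline(reader, retriever)

app = FastAPI()

@app.post("/haystack/predict")
def predict(question: str = Form(...)):
    prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)
    return {"code": 200, "status": "success", "answers": prediction["answers"]}

# Served under Gunicorn with the Uvicorn worker class, e.g.:
#   gunicorn main:app -w 1 -k uvicorn.workers.UvicornWorker
```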
I am not using FastAPI. No luck with both train and predict
|
Hi @bappctl, in the code you shared:

```python
document_store = ElasticsearchDocumentStore(host=eshost, port=esport, username='', password='', index=index)
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
pipe = ExtractiveQAPipeline(reader, retriever)

@app.route('/haystack/predict', methods=['POST'])
def predict():
    question = request.form['question']
    index = request.form['index']
    eshost = request.form['host']
    esport = request.form['port']
    prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)
    answer = []
    for res in prediction['answers']:
        answer.append(res)
    return json.dumps({'code': 200, 'status': 'success', 'message': 'Predict successful.', 'result': answer})
```
|
@oryx1729 |
@oryx1729 docker main.py

```
05/29/2021 21:19:32 - INFO - farm.utils - Using device: CUDA
[2021-05-29 21:26:50 +0000] [1] [INFO] Handling signal: term
[2021-05-29 22:10:44 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
```
|
Hi @bappctl, apologies for the delay in getting back on this. A few observations in the case above:
|
@oryx1729 No worries. I tried with 1 worker too; it didn't help. After the failed tries I dismantled that instance and moved on, so unfortunately I can't verify it now. I will look into the last option. There is no preference between Flask and FastAPI on my end; all I want is to get it running successfully. So far no luck. I will give it another try and update you. |
Went with your FastAPI suggestion. It works. I have a couple of questions
|
Hi @bappctl,
Can you provide more details of the use case here?
It seems like you're running into memory issues when using multiprocessing (by setting `num_processes`). |
@oryx1729 The other question I have: sometimes during training the process gets killed for some reason, model save() fails, and I run into the issue below; after that it no longer trains and keeps throwing the same error. OSError: Unable to load weights from pytorch checkpoint file for language_model.bin. If I replace language_model.bin with the old file, the error goes away, but that's not the right approach. How can I overcome it? |
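Not the fix that resolves this later in the thread, but a generic way to avoid ending up with a half-written checkpoint is to save into a scratch directory and only swap it into place once the save has fully completed; a sketch, with illustrative paths:

```python
import shutil
from pathlib import Path

def save_reader_safely(reader, final_dir="model", tmp_dir="model_tmp"):
    """Write the checkpoint to a scratch directory first, so a killed process
    cannot leave a truncated language_model.bin in the directory the API loads from."""
    tmp, final = Path(tmp_dir), Path(final_dir)
    if tmp.exists():
        shutil.rmtree(tmp)
    reader.save(tmp)          # FARMReader.save() writes the model + tokenizer files
    if final.exists():
        shutil.rmtree(final)  # removed only after the new save fully succeeded
    tmp.rename(final)         # move the complete checkpoint into place
```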
Hi @bappctl, can you share the full error stack trace that you get when the process is killed? What version of PyTorch are you using? |
PyTorch 1.7.1 + cu110

----- error stack -----

```
04:58:03 +0000] [282] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 398, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.7/dist-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/fastapi/applications.py", line 199, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/cors.py", line 78, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/dist-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 580, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 241, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 52, in app
    response = await func(request)
  File "/usr/local/lib/python3.7/dist-packages/fastapi/routing.py", line 202, in app
    dependant=dependant, values=values, is_coroutine=is_coroutine
  File "/usr/local/lib/python3.7/dist-packages/fastapi/routing.py", line 148, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/controller/train.py", line 109, in _start_train
    reader = FARMReader(model_name_or_path=model_path)
  File "/usr/local/lib/python3.7/dist-packages/haystack/reader/farm.py", line 112, in __init__
    strict=False)
  File "/usr/local/lib/python3.7/dist-packages/farm/infer.py", line 252, in load
    model = BaseAdaptiveModel.load(load_dir=model_name_or_path, device=device, strict=strict)
  File "/usr/local/lib/python3.7/dist-packages/farm/modeling/adaptive_model.py", line 53, in load
    model = cls.subclasses["AdaptiveModel"].load(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/farm/modeling/adaptive_model.py", line 338, in load
    language_model = LanguageModel.load(load_dir)
  File "/usr/local/lib/python3.7/dist-packages/farm/modeling/language_model.py", line 142, in load
    language_model = cls.subclasses[config["name"]].load(pretrained_model_name_or_path)
  File "/usr/local/lib/python3.7/dist-packages/farm/modeling/language_model.py", line 830, in load
    distilbert.model = DistilBertModel.from_pretrained(farm_lm_model, config=config, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py", line 1208, in from_pretrained
    f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for '/model/language_model.bin' at '/model/language_model.bin'
If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
```
|
Hi @bappctl, could it be possible that you're using an older PyTorch version? Can you try with torch v1.8.1? |
@oryx1729 I don't see it happening frequently if I reduce the number of workers to 1. Though not frequent, at times I still see the process get killed during save. I can try with 1.8.1. |
Hi @bappctl, model training is a long-running task, so doing it within the REST API is not a recommended approach. Could you share more details about the use case for triggering model training via an API? An alternate approach is to train the model with a separate script & later use the trained model for inference with the API. |
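A sketch of that separate-script approach using FARMReader's train()/save(); the data paths, filenames and hyperparameters are placeholders:

```python
# train_reader.py -- run as a standalone job, not inside the REST API.
from haystack.reader.farm import FARMReader

reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad",
                    use_gpu=True)

# Expects SQuAD-format annotations in data_dir/train_filename.
reader.train(
    data_dir="data/custom_qa",
    train_filename="train.json",
    n_epochs=1,
    use_gpu=True,
    save_dir="my_model",
)

# The API process can later load the trained model for inference with:
#   FARMReader(model_name_or_path="my_model")
```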
Closing as it seems that the issue was solved using PyTorch 1.8.1. Feel free to open a new issue if you still face problems. |
Hey @tholor, loving the FARMReader interface. However, for a single prediction I'm seeing FARMReader be ~6x slower than both TransformersReader and the Hugging Face QA pipeline with num_processes=0 or 1, and ~7.5x slower with num_processes=None. Is there something obvious I'm missing here? Should we expect inference-time parity? Using the latest farm-haystack and transformers, PyTorch==1.12.1. Colab notebook: https://colab.research.google.com/drive/1DmbqWaFw9U4NLzn2dI_u1ypGScKdrGqp?usp=sharing |
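For context, a rough sketch of this kind of single-query timing comparison (the model, document text and top_k are placeholders; the Haystack 1.x API is assumed):

```python
import time
from haystack import Document
from haystack.nodes import FARMReader, TransformersReader

docs = [Document(content="Python is a programming language created by Guido van Rossum.")]
query = "Who created Python?"

for reader in (
    FARMReader("deepset/roberta-base-squad2", use_gpu=False, num_processes=0),
    TransformersReader("deepset/roberta-base-squad2", use_gpu=False),
):
    reader.predict(query=query, documents=docs, top_k=1)   # warm-up run
    start = time.perf_counter()
    reader.predict(query=query, documents=docs, top_k=1)   # timed run
    print(type(reader).__name__, f"{time.perf_counter() - start:.3f}s")
```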
Hey @clharman, we'd expect some time difference, as the two readers have quite different postprocessing (e.g. tokenizers, handling of no_answers, and aggregating logits across multiple passages; see the docs for more info). However, a difference like that is totally out of the expected range and unacceptable. There's still a difference between the two readers, but this doesn't seem like a "critical bug" to me and might rather be a topic for some thorough profiling + refactoring. cc @ZanSara |
Thanks for the follow-up @tholor. When I ran with a GPU I saw results matching yours, roughly a 1.5-2x slowdown. However, running the notebook CPU-only (which is how I was doing it originally), the 6x slowdown persisted after any warm-up period. Also, leaving num_processes=None on GPU seems to make the gap even wider -- I'm seeing 16x slower. Just FYI, I've been keeping multiprocessing turned off, but thought it was weird. |
Ok, thanks for the clarification! The diff on CPU might be related to multiprocessing. @ZanSara @vblagoje can one of you please take over here and try to replicate this? If the gap is consistently that huge on CPU, it might make sense to open a new issue about it. @vblagoje weren't you investigating getting rid of multiprocessing anyway? |
Question
I am running one of the samples in a K8s pod (GPU). It gets stuck in FARMReader for a long time (30+ mins) and times out. Any reason? All I added was 2 .txt documents.
```
[2021-05-19 23:34:10 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
05/19/2021 23:34:10 - INFO - farm.infer - Got ya 23 parallel workers to do inference ...
05/19/2021 23:34:10 - INFO - farm.infer - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
05/19/2021 23:34:10 - INFO - farm.infer - /w\ /w\ /w\ /w\ /w\ /w\ /w\ /|\ /w\ /w\ /w\ /w\ /w\ /w\ /|\ /w\ /|\ /|\ /|\ /|\ /w\ /w\ /|
05/19/2021 23:34:10 - INFO - farm.infer - /'\ / \ /'\ /'\ / \ / \ /'\ /'\ /'\ /'\ /'\ /'\ / \ /'\ /'\ / \ /'\ /'\ /'\ /'\ / \ / \ /'
05/19/2021 23:34:10 - INFO - farm.infer -
05/19/2021 23:34:10 - INFO - elasticsearch - POST http://10.x.x.x:8071/sidx/_search [status:200 request:0.003s]
05/19/2021 23:34:10 - WARNING - farm.data_handler.dataset - Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
05/19/2021 23:34:10 - WARNING - farm.data_handler.dataset - Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
[2021-05-19 23:34:40 +0000] [8] [WARNING] Worker graceful timeout (pid:8)
[2021-05-19 23:34:42 +0000] [8] [INFO] Worker exiting (pid: 8)
```