New PubMed eutil functions #131

Merged
merged 26 commits into from
Jun 15, 2023

caufieldjh
Member

A rewrite of the PubmedClient class to access the NCBI APIs directly with requests rather than through the eutils package.

@justaddcoffee
Member

that was quick!

@caufieldjh
Member Author

caufieldjh commented Jun 13, 2023

Remaining:

  • Handle instances of API returning malformed JSON (requests.exceptions.JSONDecodeError: Invalid control character at: ...)
  • Handle HTTP error 414. The query URI is too long, partly a side effect of requests encoding commas as %2C. The percent-encoded query still resolves, but the error appears once the URI exceeds roughly 2500 characters, percent-encoded or not, so it arises with long ID lists. Try chunking the IDs and sending them to the history server.
  • Bring back the search function and text scoring
  • Modify tests as needed
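
For the malformed-JSON item, one possible approach is a lenient fallback parse. This is a minimal sketch, not the handling used in this PR: `parse_json_lenient` is a hypothetical helper name, and it assumes the failure comes from raw control characters (such as unescaped newlines) inside a JSON string, which is the case `json.loads(..., strict=False)` tolerates.

```python
import json
from typing import Any


def parse_json_lenient(text: str) -> Any:
    """Parse JSON, retrying in non-strict mode if strict parsing fails.

    Hypothetical helper: strict=False lets the decoder accept control
    characters (e.g. a raw newline) inside string values, which strict
    mode rejects with "Invalid control character at ...".
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Second attempt; any other kind of malformed JSON still raises here.
        return json.loads(text, strict=False)


# Illustrative payload (made up): the "\n" below is a real newline
# character embedded inside the JSON string value.
raw = '{"esearchresult": {"ERROR": "line one\nline two"}}'
parsed = parse_json_lenient(raw)
```

requests passes keyword arguments from `Response.json()` through to the JSON loader, so `response.json(strict=False)` may be a shorter route, though that is untested here.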

@caufieldjh
Member Author

caufieldjh commented Jun 14, 2023

It's strange that POST requests still return that 414 error. It could be that requests is passing the ID list in the wrong place, but it gets parsed anyway.

https://www.ncbi.nlm.nih.gov/books/NBK25499/ says:

For sequence databases (nuccore, popset, protein), the UID list may be a mixed list of GI numbers and accession.version identifiers. Note: When using accession.version identifiers, there is a conversion step that takes place that causes large lists of identifiers to time out, even when using POST. Therefore, we recommend batching these types of requests in sizes of about 500 UIDs or less, to avoid retrieving only a partial amount of records from your original POST input list.

But apparently that applies to pubmed, too.
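
The batching recommendation quoted above can be sketched as follows. This is an example under stated assumptions rather than the code in this PR: `build_efetch_payloads` and `batch_size` are hypothetical names, and the 500-ID default simply follows the NCBI guidance. The function only builds form payloads; nothing is sent.

```python
from typing import Dict, List, Sequence

# EFetch endpoint of the NCBI E-utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_payloads(
    pmids: Sequence[str], batch_size: int = 500
) -> List[Dict[str, str]]:
    """Split a long PMID list into POST form payloads of <= batch_size IDs.

    Hypothetical helper: each dict is meant to be sent as a form body,
    so the ID list never ends up in the request URI.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    payloads = []
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        payloads.append({
            "db": "pubmed",
            "id": ",".join(batch),  # comma-separated UID list
            "retmode": "xml",
        })
    return payloads


# 1200 made-up IDs produce batches of 500, 500 and 200.
payloads = build_efetch_payloads([str(n) for n in range(1, 1201)])
```

Each payload would then go out as `requests.post(EFETCH_URL, data=payload)`. Passing the dict as `data=` puts the IDs in the request body, whereas `params=` would append them to the URL and could plausibly trigger the same 414.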

@caufieldjh
Member Author

caufieldjh commented Jun 15, 2023

Pubmed client tests fail as:

$ poetry run python -m unittest tests.integration.test_clients.test_pubmed_client
Title: Cystic fibrosis: current therapeutic targets and future approaches.
Abstract: Study of currently approved drugs and exploration of future clinical development pipeline therapeutics for cystic fibrosis, and possible limitations in their use.Extensive literature search using individual and a combination of key words related to cystic fibrosis therapeutics.Cystic fibrosis is an autosomal recessive disorder due to mutations in CFTR gene leading to abnormality of chloride channels in mucus and sweat producing cells. Respiratory system and GIT are primarily involved but eventually multiple organs are affected leading to life threatening complications. Management requires drug therapy, extensive physiotherapy and nutritional support. Previously, the focus was on symptomatic improvement and complication prevention but recently the protein rectifiers are being studied which are claimed to correct underlying structural and functional abnormalities. Some improvement is observed by the corrector drugs. Other promising approaches are gene therapy, targeting of cellular interactomes, and newer drugs for symptomatic improvement.The treatment has a long way to go as most of the existing therapeutics is for older children. Other limiting factors include mutation class, genetic profile, drug interactions, adverse effects, and cost. Novel approaches like gene transfer/gene editing, disease modeling and search for alternative targets are warranted.
Keywords: CFTR; Chloride; Hereditary; Respiratory; Sweat
.Testing...
E
======================================================================
ERROR: test_search (tests.integration.test_clients.test_pubmed_client.TestCompletion)
Test PMID search.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 105 (char 104)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/harry/ontogpt/tests/integration/test_clients/test_pubmed_client.py", line 22, in test_search
    results = list(self.client.search("Long covid", ["review", "treatment", "therapies"]))
  File "/home/harry/ontogpt/src/ontogpt/clients/pubmed_client.py", line 393, in search
    esr = self.get_pmids(term=term)
  File "/home/harry/ontogpt/src/ontogpt/clients/pubmed_client.py", line 181, in get_pmids
    data = response.json()
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Invalid control character at: line 1 column 105 (char 104)

----------------------------------------------------------------------
Ran 2 tests in 2.105s

FAILED (errors=1)

This isn't an error in search() specifically, as running get_pmids(term="Long covid") produces the same error.

@caufieldjh
Member Author

Ah, that's right:

{'header': {'type': 'esearch', 'version': '0.3'}, 'esearchresult': {'ERROR': "Search Backend failed: Exception:\n'retstart' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/"}}

(For reference, by EDirect they mean Entrez Direct, the CLI tools.)
So for now I will set a limit of 9,999 results here.
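
One way to enforce that cap when paging through ESearch results is sketched below. It is an illustration only: `esearch_windows` and `page_size` are hypothetical names, and the 9,999 figure is taken from the error message above rather than from the merged code.

```python
from typing import Iterator, Tuple

# From the ESearch error message: only the first 9,999 PubMed records
# are retrievable, so retstart can never exceed 9998.
PUBMED_ESEARCH_LIMIT = 9999


def esearch_windows(
    total_count: int,
    page_size: int = 1000,
    limit: int = PUBMED_ESEARCH_LIMIT,
) -> Iterator[Tuple[int, int]]:
    """Yield (retstart, retmax) pairs that stay within the ESearch limit.

    Hypothetical helper: total_count is the hit count reported by the
    search; records beyond `limit` are silently skipped.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    reachable = min(total_count, limit)
    for retstart in range(0, reachable, page_size):
        # Shrink the final window so retstart + retmax never passes the limit.
        yield retstart, min(page_size, reachable - retstart)


# A query with 25,000 hits, paged 5,000 at a time, is cut off at 9,999.
windows = list(esearch_windows(25000, page_size=5000))
```

Each pair would be passed as the `retstart` and `retmax` parameters of an ESearch request. Anything past the cap would need a different route, such as the EDirect tools the error message suggests.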

@caufieldjh caufieldjh marked this pull request as ready for review June 15, 2023 18:11
@caufieldjh caufieldjh merged commit 598487e into ibd_template Jun 15, 2023
@caufieldjh caufieldjh deleted the pubmed_retrieve branch June 15, 2023 18:11
caufieldjh added a commit that referenced this pull request Jun 16, 2023