documentation example : headers #149

Closed
dharmatech opened this issue Nov 11, 2024 · 1 comment
Labels: documentation (Improvements or additions to documentation)

Comments

dharmatech commented Nov 11, 2024

Example from documentation

This documentation page:

https://py-xbrl.readthedocs.io/en/latest/usage.html

has the following example:

import logging
from xbrl.cache import HttpCache
from xbrl.instance import XbrlParser, XbrlInstance
# just to see which files are downloaded
logging.basicConfig(level=logging.INFO)

cache: HttpCache = HttpCache('./cache')
cache.set_headers({'From': '[email protected]', 'User-Agent': 'py-xbrl/2.1.0'})
parser = XbrlParser(cache)

schema_url = "https://www.sec.gov/Archives/edgar/data/0000320193/000032019321000105/aapl-20210925.htm"
inst: XbrlInstance = parser.parse_instance(schema_url)

Note that it sets the headers as follows:

cache.set_headers({'From': '[email protected]', 'User-Agent': 'py-xbrl/2.1.0'})

Issue

When I used headers in that format (with my own email address), the request failed with repeated 403 responses:

>>> inst: XbrlInstance = parser.parse_instance(schema_url)
urllib3.exceptions.ResponseError: too many 403 error responses

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\urllib3\connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\urllib3\connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\urllib3\connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  [Previous line repeated 2 more times]
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\urllib3\connectionpool.py", line 938, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\urllib3\util\retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.sec.gov', port=443): Max retries exceeded with url: /Archives/edgar/data/0000320193/000032019321000105/aapl-20210925.htm (Caused by ResponseError('too many 403 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\xbrl\instance.py", line 740, in parse_instance
    return parse_ixbrl_url(uri, self.cache) if is_url(uri) else parse_ixbrl(uri, self.cache, instance_url, encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\xbrl\instance.py", line 425, in parse_ixbrl_url
    instance_path: str = cache.cache_file(instance_url)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\xbrl\cache.py", line 81, in cache_file
    query_response = self.connection_manager.download(file_url, headers=self.headers)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\xbrl\helper\connection_manager.py", line 51, in download
    response = self._session.get(url, headers=headers, allow_redirects=True, verify=self.verify_https)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\requests\sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dharm\python-environments\env-3.12-pandas\Lib\site-packages\requests\adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='www.sec.gov', port=443): Max retries exceeded with url: /Archives/edgar/data/0000320193/000032019321000105/aapl-20210925.htm (Caused by ResponseError('too many 403 error responses'))
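
One way to check whether the 403 depends on the request headers rather than on py-xbrl itself is to fetch the same URL with only the standard library and compare status codes for different User-Agent values. This is a minimal sketch, not part of py-xbrl: the helper names `build_request` and `probe_status` are hypothetical, and the status the server returns for any given header is not guaranteed.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that carries only the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe_status(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code (e.g. 403) is on the exception.
        return err.code


# Example (needs network access; the User-Agent value is a placeholder):
# url = "https://www.sec.gov/Archives/edgar/data/0000320193/000032019321000105/aapl-20210925.htm"
# print(probe_status(build_request(url, "Example Corp admin@example.com")))
```

Running the probe once per candidate User-Agent makes it easy to see which header format the server accepts before configuring the cache.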

Solution

When I set the headers as follows:

headers = {
    'User-Agent': 'COMPANY [email protected]'
}

cache.set_headers(headers)

I no longer got the 403 errors shown above.
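
Putting the working header format together with the documentation example might look like the sketch below. `HttpCache`, `set_headers`, `XbrlParser`, and `parse_instance` come from the example quoted above; the helpers `build_sec_headers` and `make_parser`, the company name, and the email address are hypothetical placeholders, and the "COMPANY email" format is simply the one that worked in this report.

```python
def build_sec_headers(company: str, email: str) -> dict[str, str]:
    """Return headers with a 'COMPANY email' User-Agent, the format that worked above."""
    company, email = company.strip(), email.strip()
    if not company or "@" not in email:
        raise ValueError("need a non-empty company name and a contact email")
    return {"User-Agent": f"{company} {email}"}


def make_parser(cache_dir: str, company: str, email: str):
    """Build an XbrlParser whose cache sends the declared User-Agent."""
    # Imported lazily so build_sec_headers stays usable without py-xbrl installed.
    from xbrl.cache import HttpCache
    from xbrl.instance import XbrlParser

    cache = HttpCache(cache_dir)
    cache.set_headers(build_sec_headers(company, email))
    return XbrlParser(cache)


# Example (needs py-xbrl and network access; values are placeholders):
# parser = make_parser("./cache", "Example Corp", "admin@example.com")
# inst = parser.parse_instance(schema_url)
```

Keeping the header construction in one small function means the format only has to change in one place if the SEC revises its requirements again.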

manusimidt self-assigned this Nov 13, 2024
manusimidt added the documentation label Nov 13, 2024

manusimidt (Owner) commented Nov 13, 2024

True, it seems the SEC has updated its requirements for the User-Agent header.
Thanks, @dharmatech!

https://www.sec.gov/search-filings/edgar-search-assistance/accessing-edgar-data

manusimidt added a commit that referenced this issue Nov 13, 2024