datamule v0.400 update megathread #28

Closed
john-friedman opened this issue Dec 12, 2024 · 9 comments
Labels
Major Update: Major update and/or rework

Comments

@john-friedman
Owner

john-friedman commented Dec 12, 2024

Datamule has been updated to v0.400. There will be bugs; please report them.

The good stuff:

  • Premium Downloader with no rate limits. (I'm hosting my own SEC archive now)
  • Downloader has been modified to download all attachments + metadata in one API call.
  • Monitor replaces downloader.watch() for monitoring new SEC filings.
  • Submission and Document classes replace Filing. There is no need to declare the type, as Submission() automatically detects it from metadata.
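
To illustrate the idea behind the last bullet, here is a rough sketch of a container that reads the form type from metadata instead of asking the caller for it. This is not datamule's actual code or API: the class names `SketchSubmission` and `SketchDocument`, the method `documents_of_type`, and the metadata layout (a dict with a `form_type` key and a `documents` list) are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for one document or attachment in a submission."""
    doc_type: str
    filename: str


@dataclass
class SketchSubmission:
    """Hypothetical stand-in: the type comes from metadata, not from the caller."""
    metadata: dict  # assumed layout: {"form_type": str, "documents": [{"type", "filename"}]}
    form_type: str = field(init=False)
    documents: list = field(init=False)

    def __post_init__(self):
        # The caller never passes a type; it is read from the metadata.
        self.form_type = self.metadata["form_type"]
        self.documents = [
            SketchDocument(doc_type=d["type"], filename=d["filename"])
            for d in self.metadata["documents"]
        ]

    def documents_of_type(self, doc_type):
        """Return the documents whose type matches doc_type."""
        return [d for d in self.documents if d.doc_type == doc_type]


# Made-up metadata for illustration only.
example_metadata = {
    "form_type": "8-K",
    "documents": [
        {"type": "8-K", "filename": "main.htm"},
        {"type": "EX-99.1", "filename": "ex991.htm"},
    ],
}

submission = SketchSubmission(example_metadata)
exhibits = submission.documents_of_type("EX-99.1")
```

The point is only that the type travels with the downloaded metadata, so the caller does not have to repeat it.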

The bad stuff:

  • I've removed a lot of non-core functionality. If this affects your work, please comment here, and I'll see what I can do.

I'm going to be moving away from using the sec.gov API towards my own infrastructure for several reasons. While the SEC API is great, it has become frustrating for me to use: the rate limits are low, historical files are missing, it goes down sometimes, etc. Hosting my own archive lets me expand the historical timeline, optimize download speed, and iterate faster.

I'm open to making my SEC archive a public utility. If you are someone who can make that happen, feel free to reach out. I will also be releasing a guide on how to host your own SEC archive. The code is open-source.

Benchmarks

File Size      Examples   Downloader   Premium Downloader
Small Files    3, 4, 5    5/s          300/s
Medium Files   8-K        5/s          60/s
Large Files    10-K       3/s          5/s

Note: soon after release I will be updating the Premium Downloader to be 10x as fast.
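
To put the benchmark numbers in perspective, here is a small Python sketch that works out the speedup per file size and how long a batch would take at each rate. The rates are copied from the table above; reading "/s" as items downloaded per second, and the batch size of 10,000, are assumptions for illustration.

```python
# Rates copied from the benchmark table; "/s" is assumed to mean items per second.
BENCHMARKS = {
    "Small Files (3, 4, 5)": {"downloader": 5, "premium": 300},
    "Medium Files (8-K)": {"downloader": 5, "premium": 60},
    "Large Files (10-K)": {"downloader": 3, "premium": 5},
}


def speedup(row):
    """How many times faster the Premium Downloader rate is for one table row."""
    return row["premium"] / row["downloader"]


def seconds_for(n_items, rate_per_second):
    """Time in seconds to download n_items at a steady rate."""
    return n_items / rate_per_second


if __name__ == "__main__":
    batch = 10_000  # hypothetical batch size, not from the thread
    for label, row in BENCHMARKS.items():
        print(
            f"{label}: {speedup(row):.1f}x faster, "
            f"{seconds_for(batch, row['downloader']):.0f}s vs "
            f"{seconds_for(batch, row['premium']):.0f}s for {batch} items"
        )
```

At these rates the gain is largest for small files and smallest for large ones.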

@john-friedman added the Major Update label Dec 12, 2024
@john-friedman
Owner Author

Features in the pipeline:

  1. Websocket for new submissions
  2. Parse almost every document type and attachment
  3. Databases for insider trading, 13F-HR information tables
  4. Always up-to-date alternative datasets generated from text

@firmai

firmai commented Jan 4, 2025

Hey @john-friedman, hope you have been well. How has (2) been coming along, following on from the previously closed issue #19?

@john-friedman
Owner Author

Hey @firmai, should be coming along this month. I pushed parsing back a bit due to setting up my own SEC archive.

@john-friedman
Owner Author

Are you under time pressure for one of your projects?

@firmai

firmai commented Jan 5, 2025

That's helpful to know. I just want to coordinate with you to make sure I am not replicating what you are doing unnecessarily. Currently my time is spent on a patents dataset, but I will be moving to filings in about 3 weeks. If you have something by then, I will run it and offer suggestions on a thread over here.

@john-friedman
Owner Author

Gotcha.

Btw, for the patents dataset, are you using Google BigQuery? I did some work on patents back in 2022.

@firmai

firmai commented Jan 13, 2025

Yeah, using the BigQuery dataset, but it has some issues which I have to resolve! And it takes surprisingly long to do.

@john-friedman
Owner Author

Downloading the data or cleaning it? BigQuery has some big downsides. I ended up doing a lot of stuff locally.

@firmai

firmai commented Jan 15, 2025

The cleaning is the tough part! Will talk to you about it on our next call.
