Archiving URL references #14
Comments
About 50% of the cve.org URL data is dead: either truly dead or replaced by
something like a marketing page. This is after all the sun.com references and
the like were removed. I grabbed two domains that had expired and are still
listed in the data set.
You also need to archive it for the simple reason that what you downloaded
and processed may not be what I downloaded and processed, assuming the site
is even up and running. E.g., hunter.dev was down for a few days one time
while I wanted to get some data from it.
Also, I contacted archive.org sales (they do private/public custom
archives/services/etc.) twice and started a sales process, but they went dark
and I gave up. So good luck with that; if you do manage to get their sales
people to sell something, please let me know who so I can contact them.
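The split described at the top of this comment (truly dead versus replaced by an unrelated page) could be approximated mechanically once each reference has been fetched. The following is only a rough sketch: the function name `classify_reference`, the three labels, and the "final host differs from original host" heuristic are illustrative assumptions, not how the 50% figure was actually measured.

```python
from urllib.parse import urlsplit


def _strip_www(host):
    """Treat www.example.com and example.com as the same host."""
    return host[4:] if host.startswith("www.") else host


def classify_reference(original_url, status, final_url):
    """Label one fetched reference (hypothetical labels, for illustration).

    status    -- HTTP status code, or None if there was no response at all
                 (DNS failure, timeout, connection refused)
    final_url -- URL after following redirects, or None if unknown
    """
    if status is None or status >= 400:
        return "dead"
    original_host = urlsplit(original_url).hostname or ""
    final_host = urlsplit(final_url or original_url).hostname or ""
    # Assumption: landing on a different host suggests a parked, sold,
    # or marketing page rather than the original reference.
    if _strip_www(original_host) != _strip_www(final_host):
        return "redirected-offsite"
    return "alive"
```

A same-host marketing page would still be labeled "alive" by this heuristic, so a manual spot check of the results would still be needed.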
On Wed, Nov 30, 2022 at 7:58 AM Art Manion ***@***.***> wrote:
Link rot is a problem, how serious is this? Vulnerability information is
often conveyed in social media (e.g., Twitter/Mastodon posts), which are
typically more ephemeral than other types of references. What options do we
have? archive.org and the Library of Congress? wget or some other
in-house solution?
CC @todb <https://github.com/todb>
--
Kurt Seifried (He/Him)
Thanks @kurtseifried! Yeah, I'll poke you if I end up on the same path with them.
For references' sake, I did grab all the extant Twitter references a couple of weeks ago when Twitter doom was becoming obvious -- they're now stashed on https://archive.today, and the list is at https://github.com/todb/junkdrawer/blob/master/cve-twitter-refs/archives.csv
I've put out a call for help to the US Library of Congress, which I suspect is a more stable institution than my current solution of archive.today. I'll hit up archive.org next, but I suspect that the LoC's Web Archive Team would be all over this. In the grand scheme of things, it's not an impossibly huge list of references to archive (several thousand, not several million).
A rare but real case brought up in CNA Slack: URLs might point to videos, which are often short-lived (they get flagged, made private, etc.).
Just as a quick update, I'm actually in touch with both archive.org and LoC people, so stay tuned. Holidays slowed down comms but I expect that to pick up again!
I have ArchiveBox running and while further testing, use, and discussion is needed, so far it looks like a reasonable self-hosted option.
I really dislike videos as (primary) vulnerability reports, but ArchiveBox supports grabbing videos.
I think you mentioned that the archive.org and LoC options will not work? One idea I had early on was just to submit every reference to archive.org. Pay for an API key/sufficient rate limits if needed.
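For what it's worth, the "submit every reference to archive.org" idea could look roughly like the sketch below. It assumes the public Wayback Machine "Save Page Now" URL form (`https://web.archive.org/save/<url>`) and the availability lookup at `https://archive.org/wayback/available`. The function names, the User-Agent string, and the 10-second delay are made up for illustration, and the real rate limits (or whether a paid or authenticated tier would be needed for several thousand URLs) are not something this thread establishes.

```python
import time
import urllib.parse
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def save_request_url(reference_url):
    """URL that asks the Wayback Machine to capture reference_url."""
    return SAVE_ENDPOINT + reference_url


def availability_request_url(reference_url):
    """URL that asks whether a snapshot of reference_url already exists."""
    query = urllib.parse.urlencode({"url": reference_url})
    return AVAILABILITY_ENDPOINT + "?" + query


def http_get_status(url, timeout=30):
    """Perform a GET and return the HTTP status code."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "cve-ref-archiver-sketch"}  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def submit_all(reference_urls, fetch=http_get_status,
               delay_seconds=10.0, sleep=time.sleep):
    """Submit each reference in turn, pausing between requests.

    delay_seconds is a guess, not a documented limit. Failures are
    recorded per URL instead of aborting the whole run.
    """
    results = {}
    for index, reference_url in enumerate(reference_urls):
        if index:
            sleep(delay_seconds)
        try:
            results[reference_url] = fetch(save_request_url(reference_url))
        except Exception as exc:
            results[reference_url] = repr(exc)
    return results
```

Calling `submit_all(urls)` on the reference list would then return a per-URL status map that could be re-run for the failures. The `fetch` and `sleep` parameters are injectable mainly so the loop can be exercised without touching the network.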
ArchiveTeam did a thing!