hyperref deanonymization issue #376
Comments
This won't work with natbib (and probably other bibliography packages). There the bibcite command looks, for example, like this:
and the second argument is not suitable for a destination name, and it is also not equal to some list counter value. For a natbib solution to hide the key, see https://tex.stackexchange.com/a/728373/2388. For general scrambling of destination names see |
Thank you for the report, but I am not sure this is hyperref's problem to solve; whatever system is anonymising the document should presumably adjust these links if they leak information. I don't think the code is safe, and you provide no example or test file. You are using the 2nd argument as an anchor, but that may have structure and not be suitable to generate a link name, depending on the bibliography format in use. A quick grep of aux files I have locally, for example, shows
where the Similarly in your bibitem redefinition, you appear to be assuming a numbered bibliography. The cp/sed code isn't directly usable in our build system (although I can read it) as (a) the build system is cross platform including platforms that don't have |
There is no system anonymizing the PDFs. I understand that my patch and my code are not directly usable in hyperref, obviously. Still, in general it seems inappropriate to leak bib keys into the links here. The bib keys are not obviously something authors are providing for inclusion in the PDF. |
ah I thought you were redacting printed names and objected to the link anchors being there
natbib was not offered as a solution to the problem, rather an indication that your suggested change would break natbib and many other similar bibliography packages,
The code on stackexchange will show how to hash or otherwise hide these, however I'm not sure that I'd agree with the statement that they are not expected to be used. Like |
I would consider this equally problematic. Many academics use hyperref in their papers (often even enforced by the conference template) without taking any precautions against deanonymization through this package. This should not be the default behavior. There is no functional need to have these names in the PDF. And this problem does not exist if the hyperref package is not included at all. It is a problem that hyperref introduces, which is why I would consider it a kind of security vulnerability (as it is a confidentiality breach) in the hyperref package rather than somewhere else. |
hyperref does not write label names into the PDF by default. If you get them, your document, your class or some package changes that. Counter names like chapter, section, enumi are used for destinations. Imho all standard counter names are sensible and harmless, but the set of counters in your PDF can say something about the packages you use in your document.
The bib keys that hyperref uses for destinations are typically built out of some combination of author, year and title of the work: data that is also in the bibliography and so also elsewhere in the PDF. So I do not see why the key should add to deanonymization or present a security risk. Imho there would only be a problem if you use bib keys like
hyperref also adds metadata to the PDF, e.g. that the creator was "LaTeX with hyperref". So yes, hyperref does reveal something about the source, the production process and your bib keys. But this is not only by accident: named counter destinations allow links from the outside to parts of the document. And when producing tagged PDF the destination names are used to describe links, so e.g. a good, speaking bib key adds to accessibility:
Generally, a PDF (and other complex formats like docx too) can contain lots of data that is not directly visible on the screen. The author, title, creation date and producer of the PDF are typically added to its metadata. Included pictures can contain metadata. If you clip and include a picture, the clipped parts are still present. Fonts have dates and versions. Source code can be embedded in the PDF. Some standards like ZUGFeRD require embedded XML invoices. If you edit a PDF or remove metadata with a tool like exiftool, then old versions are still present in the PDF and can be recovered.
There are tools like ghostscript or qpdf that allow you to postprocess a PDF and to remove or change such additional data, but this often comes at the cost of making your PDF less usable and less accessible, so one has to find the right balance. In any case I do not think that bib keys and the other default destination names are on the wrong side here. If you do not want them in the PDF, use the code I suggested on tex.sx, or postprocess with ghostscript, or use random bib keys in your bib files. |
it is good to learn that other labels are not used in the same way.
As a security researcher I have to fundamentally disagree. There are papers about deanonymizing authors during double-blind peer review; it is a threat to the double-blind peer review process, and in this case hyperref introduces an unnecessary one. Indeed, cite.thishorribleguy would be problematic (and cases like this exist, and hyperref is responsible for exposing them to the public). But the more problematic case is when people reuse their bib entries, which they usually do (after all, that's the idea behind bib entries: you don't have to write the bibliography by hand every time). Then grabbing the list of bib keys from my 100 published papers and comparing it with a paper submitted for double-blind peer review will immediately tell everyone that the paper is from me, because of the distinct style of how my bibliography is organized, which exact bib keys are in there (and were in previous papers in exactly the same form), etc.
That PDFs can contain additional information is clear. The question is whether this information poses a threat to scientific integrity. If I submit a paper for double-blind peer review, then author and producer will not contain deanonymizing information, exactly so as not to undermine scientific integrity. |
It is not unnecessary. hyperref has no real option to create a link between a citation and a bib entry other than through the bib key. Numbers do not work, as not every bibliography is numbered and the label data often contains formatting commands or additional structure. The bib key is the ID that links both.
Sorry, but if someone has access to 100 published papers from you, they don't need the bib keys to identify you as the author of the 101st paper. It is naive to believe that removing the bib keys will make the paper anonymous. |
There is a link between the bib item and the reference already, and that has a human-readable, accessible format, like a number, some letters, or maybe letters and a year, depending on the bib style. Exactly the same string could be used for the link as well. Why are we using a different one? What is a good reason for that? It would be more beautiful to use the same identifier for the same thing all the time, whether it is in printed letters or in a link. And of course everyone has access to 100 published papers from me, because they are all open access. Having the bib keys makes deanonymization trivial compared to training a neural network on the papers and then getting a probability that a paper is from one person or another... Should the recommendation for the scientific community rather be to avoid hyperref in order to not undermine scientific integrity? |
No there isn't. A link is simply an active area on the page, it has no connection to the text under it. And even if it had: such a text has a complex formatting which is not suitable as an ID-string as needed in an annotation.
My recommendation for the scientific community would be that they check PDFs meant for peer review for unneeded data and postprocess them if needed e.g. with Ghostscript or other tools. Even if hyperref gets some option to replace these destination names: it would only be an option. hyperref will not by default scramble destination names as there are users who are used to and want readable names. So the scientific community would have to agree to activate this option (and you would have to convince packages like natbib or biblatex to make use of this option) and imho this is much less probable to work. |
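A rough way to check a PDF for leftover citation destinations before submission could look like the sketch below. It is only an illustration, not part of hyperref or any PDF tool: it assumes the destinations carry the `cite.` prefix mentioned in this thread and that the names appear as plain bytes, which is only the case if the file is not compressed (for instance after decompressing it with a tool such as qpdf). The function name `find_cite_names` and the sample bytes are made up for the example.

```python
import re

# Assumption: citation destinations carry the "cite." prefix mentioned in
# this thread and appear as plain bytes (i.e. the PDF is not compressed).
CITE_NAME = re.compile(rb"cite\.[A-Za-z0-9_:+-]+")


def find_cite_names(pdf_bytes: bytes) -> list[str]:
    """Return the sorted, de-duplicated cite.* names found in the raw bytes."""
    return sorted({match.decode("ascii") for match in CITE_NAME.findall(pdf_bytes)})


# Hypothetical fragment standing in for the contents of a PDF file.
sample = b"<< /Dest (cite.MaxMuster2024) >> [(cite.Doe2019) 12 0 R (cite.MaxMuster2024) 13 0 R]"
print(find_cite_names(sample))
```

For a real file the bytes would come from something like `pathlib.Path("paper.pdf").read_bytes()`, where `paper.pdf` is a placeholder. An empty result on a compressed PDF proves nothing, so this is a quick check rather than a guarantee.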
Just out of curiosity: Wouldn't it be possible to introduce an option to |
that would solve the issue, yes |
I hate to say it, but a hash from a key is as unique as the key itself. So if people can identify you by comparing keys like "MaxMuster2024", they can also identify you by comparing the hashes. Apart from that, this is easy to implement by using the answer I already cited:
You then get destination names like this:
|
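The point that a plain hash is as linkable as the key can be illustrated with a small sketch. This is not the code from the tex.sx answer; the function `hashed_destination`, the choice of SHA-256, the 16-character truncation and the bib keys are all assumptions made up for the example.

```python
import hashlib


def hashed_destination(key: str) -> str:
    """Return a scrambled stand-in for a destination name (unsalted hash)."""
    return "cite." + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# Hypothetical bib keys, invented for this example.
published = {"MaxMuster2024", "Doe2019", "Smith2021"}
submission = {"MaxMuster2024", "Doe2019", "Lee2023"}

plain_overlap = published & submission
hashed_overlap = {hashed_destination(k) for k in published} & {
    hashed_destination(k) for k in submission
}

# The same entries match before and after hashing, so comparing the
# hashed names links the two documents just as well as comparing the keys.
print(len(plain_overlap), len(hashed_overlap))
```

Because the hash is deterministic, identical keys map to identical names in every document, which is why an unsalted hash hides the wording of a key but not the reuse of it.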
That's why I suggested to make it possible to use some Maybe, it would even be feasible to generate a default value for |
the salt could even be a random value at compile time |
As long as it stays stable during the compilation. Sure. As I wrote already a few times: you can scramble the destinations if you want. But hyperref will not do it by default, so you will have to convince your peers to use this option too, or your papers will be very recognizable as they have such long destination names. I mean, your problem is not that the destination names are readable but that they differ from the destination names of other authors, and that doesn't change if you hash them. |
Obviously it could be, but it would make the destination names unusable for anything other than internal links, as they would be unstable. But as Ulrike shows, a user only needs to add one line to hash the things, and adding in the time or a random number (both of which TeX has available) or some such would only add a couple more commands; this should be a user choice. Apart from loss of usability, many systems have a requirement for reproducible builds. |
apart from the deanonymization problems hyperref creates, it is still inconsistent that hyperref uses a different approach for ref links to sections/figures/etc and to bib entries. and it is still not clear why anyone would want to handle these two cases differently. either sections/figures/etc should also use the internal names like \label{fig:internal_name}, or for bib entries the internal name \cite{internal_name} should not be used either, to make these two cases consistent. Sure, the link and the underlying printed text are independent elements, but these two cases, links to section, links to bib entries, are identical. |
hyperref has an option |
Then the destlabel option should also control which label is used for bib entries, for consistency, right? These are labels as well. |
This was not my intention as the salt must not change due to the potential need of multiple compilations for name resolution.
That's why I suggested to be able to set a user defined value.
I am aware of that, I just made up something. What about the following options:
|
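The idea of a user-defined salt that stays fixed across compilation runs can be sketched as follows. This only illustrates the principle and is not a hyperref option: the function `salted_destination`, the use of HMAC-SHA-256, the truncation length and the salt strings are assumptions invented for the example.

```python
import hashlib
import hmac


def salted_destination(key: str, salt: str) -> str:
    """Derive a destination name from a bib key and a per-document salt."""
    digest = hmac.new(salt.encode("utf-8"), key.encode("utf-8"), hashlib.sha256)
    return "cite." + digest.hexdigest()[:16]


# Hypothetical salts, one chosen by the author for each document.
first_run = salted_destination("MaxMuster2024", "paper-A-salt")
second_run = salted_destination("MaxMuster2024", "paper-A-salt")
other_paper = salted_destination("MaxMuster2024", "paper-B-salt")

# Same salt: the name is stable across runs, so references still resolve.
# Different salt: the same key yields an unrelated name in another document.
print(first_run == second_run, first_run == other_paper)
```

A fixed salt keeps the build reproducible, while a salt drawn freshly on every run would not, which is the trade-off raised in this thread. The names also remain noticeably different from the readable defaults other authors produce.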
Why should the names be consistent? As long as they are unique you can use what you want. You can name one part after Shakespeare plays and the other after flowers. Besides this: lots of destination names are actually not managed by hyperref but by packages (e.g. biblatex), classes or user code, and such destinations can look quite different. The lilypond documentation, for example, has destination names like this:
hyperref has with |
This won't happen. I just wanted to contribute some ideas if you would be willing to extend hyperref
And that's great as this provides a way how @dgruss can achieve the desired behavior |
The bibliography links hyperref generates can be used to deanonymize authors.
The following change fixes the problem by just using the number in the bibliography instead.
or here as a two-liner, easy to integrate in your Makefile / build system:
maybe something like this could be patched in to avoid deanonymization issues by default?