hyperref deanonymization issue #376

Open
dgruss opened this issue Jan 21, 2025 · 23 comments

@dgruss

dgruss commented Jan 21, 2025

The bibliography links hyperref generates can be used to deanonymize authors.

The following change fixes the problem by just using the number in the bibliography instead.

--- /usr/share/texlive/texmf-dist/tex/latex/hyperref/hyperref.sty	2021-06-07 22:05:52.000000000 +0200
+++ hyperref.sty	2024-11-15 00:07:41.817170900 +0100
@@ -7477,7 +7477,7 @@
   \providecommand*\@extra@b@citeb{}%
   \def\bibcite#1#2{%
     \@newl@bel{b}{#1\@extra@binfo}{%
-      \hyper@@link[cite]{}{cite.#1\@extra@b@citeb}{#2}%
+      \hyper@@link[cite]{}{cite.#2\@extra@b@citeb}{#2}%
     }%
   }%
   \gdef\@extra@binfo{}%
@@ -7519,7 +7519,7 @@
   \def\@bibitem#1{%
     \@skiphyperreftrue\H@item\@skiphyperreffalse
     \Hy@raisedlink{%
-      \hyper@anchorstart{cite.#1\@extra@b@citeb}\relax\hyper@anchorend
+      \hyper@anchorstart{cite.\the\value{\@listctr}\@extra@b@citeb}\relax\hyper@anchorend
     }%
     \if@filesw
       \begingroup

or here as a two-liner, easy to integrate in your Makefile / build system:

	cp `kpsewhich hyperref.sty` .
	sed -i -e 's/\\hyper@@link\[cite\]{}{cite\.\#1\\@extra@b@citeb}/\\hyper@@link[cite]{}{cite.#2\\@extra@b@citeb}/' -e 's/\\hyper@anchorstart{cite\.\#1\\@extra@b@citeb}/\\hyper@anchorstart{cite.\\the\\value{\\@listctr}\\@extra@b@citeb}/' hyperref.sty

maybe something like this could be patched in to avoid deanonymization issues by default?

@u-fischer
Member

This won't work with natbib (and probably more bibliography packages). There the bibcite command looks like e.g. this

\bibcite{doody}{{1}{1974}{{Doody}}{{}}}

and the second argument is not suitable for a destination name, and it is also not equal to some list counter value.

For a natbib solution to hide the key see

https://tex.stackexchange.com/a/728373/2388

For general scrambling of destination names see

https://tex.stackexchange.com/a/560287/2388

@davidcarlisle
Member

Thank you for the report, but I am not sure this is hyperref's problem to solve; whatever system is anonymising the document should presumably adjust these links if they leak information.

I don't think the code is safe, and you provide no example or test file. You are using the 2nd argument as an anchor but that may have structure and not be suitable to generate a link name, depending on the bibliography format in use, a quick grep of aux files I have locally for example shows

\bibcite{myref2}{{2}{2024{}}{{author}}{{}}}

where the #2 in your code would be {2}{2024{}}{{author}}{{}} not "a number"

Similarly in your bibitem redefinition, you appear to be assuming a numbered bibliography.

The cp/sed code isn't directly usable in our build system (although I can read it) as (a) the build system is cross platform including platforms that don't have cp or sed and (b) it is editing hyperref.sty which is a generated file, presumably you intended to edit the source?

@dgruss
Author

dgruss commented Jan 21, 2025

there is no system anonymizing the pdfs. I understand that my patch and my code is not directly usable in hyperref - obviously.
It's a quickfix workaround.
natbib is not a solution as it is not compatible with some conference templates unfortunately.
I'll give the other scrambling option a try though.

In general, still, it seems inappropriate to leak bib keys into the links here. the bibkeys are not obviously something authors are providing for inclusion in the pdf.

@davidcarlisle
Member

there is no system anonymizing the pdfs.

ah I thought you were redacting printed names and objected to the link anchors being there
but it's just the internal names that concern you, OK.

natbib is not a solution as it is not compatible with some conference templates unfortunately.

natbib was not offered as a solution to the problem, rather an indication that your suggested change would break natbib and many other similar bibliography packages,

In general, still, it seems inappropriate to leak bib keys into the links here. the bibkeys are not obviously something authors are providing for inclusion in the pdf.

The code on stackexchange will show how to hash or otherwise hide these, however I'm not sure that I'd agree with the statement that they are not expected to be used. Like \label keys they don't appear in print but label keys similarly get used as a basis for anchors, if you load hyperref and create a digital linked document then internal keys are part of the document,

@dgruss
Author

dgruss commented Jan 22, 2025

i would consider this equally problematic. many academics use hyperref in their papers (often even enforced by the conference template) without taking any precautions against deanonymization through this package.
both bibkeys and labels and any other information have author-specific preferences and styles. as paper PDFs are public afterwards it would be trivial to collect many PDFs from a specific community and create a mapping of keys/labels -> authors. maybe using machine learning to automate it. the link containing the deanonymizing information is shown in the footline of chrome for instance.

This should not be the default behavior. There is no functional need to have these names in the PDF.
All of these leak information that users of the hyperref package may be unaware they are sharing. just as i wouldn't expect a latex comment or a latex command to appear somewhere in the PDF, labels and bibkeys are equally internal names.

And this is not a problem that exists if the hyperref package is not included at all - it is a problem that hyperref introduces, which is why i would consider it a kind of security vulnerability (as it is a confidentiality breach) in the hyperref package rather than somewhere else.

@u-fischer
Member

hyperref does not write label names into the PDF by default. If you get them, your document or your class or some package changes the \label definition (which is done e.g. by the beamer class).

counter names like chapter, section, enumi are used for destinations. Imho all standard counter names are sensible and harmless. But the set of counters in your PDF can say something about the packages you use in your document.

bib keys that hyperref uses for destinations typically are built out of some combination of author, year and title of the work -- data that is also in the bibliography and so also elsewhere in the PDF. So I do not see why the key should add to deanonymization or present a security risk. Imho there would only be a problem if you use bib keys like thishorribleguy.2023.

hyperref adds also metadata to the PDF, e.g. that the creator was "LaTeX with hyperref".

So yes, hyperref does reveal something about the source and the production process and your bib keys. But this is not only by accident: Named counter destinations allow links from the outside to parts of the document. And when producing tagged PDF the destination names are used to describe links, so e.g. a good, descriptive bib key adds to accessibility:

Image

Generally, a PDF (and other complex formats like docx etc. too) can contain lots of data that are not visible directly on the screen. The author, title, creation date and producer of the PDF are typically added to the metadata of the PDF. Included pictures can contain metadata. If you clip and include a picture, the clipped parts are still present. Fonts have dates and versions. Source code can be embedded in the PDF. Some standards like ZUGFeRD require embedded XML invoices. If you edit a PDF or remove metadata with a tool like exiftool then old versions are still present in the PDF and can be recovered. There are tools like ghostscript or qpdf that allow you to postprocess a PDF and to remove or change such additional data, but this often comes at the cost of making your PDF less usable and less accessible, so one has to find the right balance.

In any case I do not think that bib keys and the other default destination names are on the wrong side here. If you do not want them in the PDF use the code I suggested on tex.sx, or postprocess with ghostscript, or use random bib keys in your bib-files.

@dgruss
Author

dgruss commented Jan 22, 2025

it is good to learn that other labels are not used in the same way.

So yes, hyperref does reveal something about the source and the production process and your bib keys. But this is not only by accident: Named counter destinations allow links from the outside to parts of the document. And when producing tagged PDF the destination names are used to describe links, so e.g. a good, descriptive bib key adds to accessibility:

as a security researcher i have to fundamentally disagree. there are papers about deanonymizing authors during double-blind peer review, it is a threat to the double-blind peer review process. in this case, hyperref introduces an unnecessary threat to the double-blind peer review process.

Indeed, cite.thishorribleguy would be problematic (and cases like this exist! and hyperref is responsible for exposing these cases to the public) but the more problematic case is that when people reuse their bib entries, which they usually do (after all that's the idea behind bib entries, that you don't have to write the bibliography by hand every time), then grabbing the list of bibkeys from my 100 published papers and comparing it with a paper submitted for double-blind peer review will immediately tell everyone that the paper is from me because of a distinct style of how my bibliography is organized, which exact bib keys are in there (and were in previous papers in exactly the same form), etc....
Even capitalization is leaked. This is so rich in information that you could probably from this alone deanonymize most papers submitted to a conference if previous papers from the same people are available.

The author, title, creation date and producer of the PDF are typically added to the metadata of the PDF.

that pdfs can contain additional information is clear. the question is does this information pose a threat to scientific integrity. if i submit a paper for double-blind peer review then, author and producer will not contain deanonymizing information, exactly to not undermine scientific integrity.
with this unnecessary choice of how bib keys are used, hyperref is undermining these efforts of anonymization and thereby scientific integrity.

@u-fischer
Member

with this unnecessary choice of how bib keys hyperref

it is not unnecessary. hyperref has no real other option to create a link between a citation and a bib-entry than through the bib-key. Numbers do not work as not every bibliography is numbered and the label data often contain formatting commands or additional structure. The bib key is the ID that links both.

then grabbing the list of bibkeys from my 100 published papers and comparing it with a paper submitted for double-blind peer review will immediately tell everyone that the paper is from me

Sorry but if someone has access to 100 published papers from you they don't need the bib keys to identify you as the author of the 101st paper. It is naive to believe that removing the bib keys will make the paper anonymous.

@dgruss
Author

dgruss commented Jan 22, 2025

there is a link between the bib item and the reference already and that has a human-readable accessible format, like a number, some letters, maybe letters and a year, depending on the bib style - exactly the same string could be used for the link as well - why are we using a different one? what is a good reason for that? it would be more beautiful to use the same identifier for the same thing all the time, whether it is in printed letters or in a link.

and, of course everyone has access to 100 published papers from me because they are all open access. having the bib keys makes deanonymization trivial compared to training a neural network on the papers and then getting a probability that it is a paper from one or the other person...

should the recommendation for the scientific community be rather to avoid hyperref in order to not undermine scientific integrity?

@u-fischer
Member

there is a link between the bib item and the reference already and that has a human-readable accessible format,

No there isn't. A link is simply an active area on the page, it has no connection to the text under it. And even if it had: such a text has a complex formatting which is not suitable as an ID-string as needed in an annotation.

should the recommendation for the scientific community be rather to avoid hyperref in order to not undermine scientific integrity?

My recommendation for the scientific community would be that they check PDFs meant for peer review for unneeded data and postprocess them if needed e.g. with Ghostscript or other tools.

Even if hyperref gets some option to replace these destination names: it would only be an option. hyperref will not by default scramble destination names as there are users who are used to and want readable names. So the scientific community would have to agree to activate this option (and you would have to convince packages like natbib or biblatex to make use of this option) and imho this is much less likely to work.

@mrpiggi

mrpiggi commented Jan 22, 2025

Just out of curiosity: Wouldn't it be possible to introduce an option to hyperref let's say hashedlabels=true|false|<salt> which would then just use \str_mdfive_hash:e{#1\hyper@hashedlabels@salt} for anchors, links etc.?

@dgruss
Author

dgruss commented Jan 22, 2025

that would solve the issue, yes

@u-fischer
Member

I hate to say it, but a hash from a key is as unique as the key itself. So if people can identify you by comparing keys like "MaxMuster2024" they can also identify you by comparing the hashes.

But besides this, it is easy to implement by using the answer I already cited:

\documentclass{article}

\usepackage{hyperref}
\ExplSyntaxOn
\def\HyperDestNameFilter#1{\exp_args:Ne\tl_map_function:nN {#1}\str_mdfive_hash:e}
\ExplSyntaxOff

\begin{document}
\section{abc}\label{test}
\cite{doody}\cite{herrmann}
\newpage
\bibliographystyle{plain}
\bibliography{biblatex-examples}

\end{document}

You get then destination names like this:

/Names [(03C7C0ACE395D80182DB07AE2C30F034E1671797C52E15F763380B45E841EC324A8A08F09D37B73795649038408B5F33E358EFA489F58062F10DD7316B65649E865C0C0B4AB0E063E5CAA3387C1A8741D95679752134A2D9EB61DBD7B91C4BCC7B8B965AD4BCA0E41AB51DE7B31363A13389DAE361AF79B04C9C8E7057F60CC65058F1AF8388633F609CADB75A75DC9DC4CA4238A0B923820DCC509A6F75849B) 22 0 R (03C7C0ACE395D80182DB07AE2C30F034E1671797C52E15F763380B45E841EC324A8A08F09D37B73795649038408B5F33E358EFA489F58062F10DD7316B65649E865C0C0B4AB0E063E5CAA3387C1A8741D95679752134A2D9EB61DBD7B91C4BCC7B8B965AD4BCA0E41AB51DE7B31363A15058F1AF8388633F609CADB75A75DC9DC4CA4238A0B923820DCC509A6F75849B) 2 0 R (4A8A08F09D37B73795649038408B5F33865C0C0B4AB0E063E5CAA3387C1A8741E358EFA489F58062F10DD7316B65649EE1671797C52E15F763380B45E841EC325058F1AF8388633F609CADB75A75DC9D2510C39011C5BE704182423E3A695E91E1671797C52E15F763380B45E841EC324B43B0AEE35624CD95B910189B3DC2314B43B0AEE35624CD95B910189B3DC2316F8F57715090DA2632453988D9A1501B0CC175B9C0F1B6A831C399E2697726617B8B965AD4BCA0E41AB51DE7B31363A17B8B965AD4BCA0E41AB51DE7B31363A1) 17 0 R (4A8A08F09D37B73795649038408B5F33865C0C0B4AB0E063E5CAA3387C1A8741E358EFA489F58062F10DD7316B65649EE1671797C52E15F763380B45E841EC325058F1AF8388633F609CADB75A75DC9D8277E0910D750195B448797616E091ADD95679752134A2D9EB61DBD7B91C4BCCD95679752134A2D9EB61DBD7B91C4BCC8277E0910D750195B448797616E091AD415290769594460E2E485922904F345D) 16 0 R (83878C91171338902E0FE0FB97A8C47A0CC175B9C0F1B6A831C399E269772661B2F5FF47436671B6E533D8DC3614845DE1671797C52E15F763380B45E841EC325058F1AF8388633F609CADB75A75DC9DC4CA4238A0B923820DCC509A6F75849B) 11 0 R (83878C91171338902E0FE0FB97A8C47A0CC175B9C0F1B6A831C399E269772661B2F5FF47436671B6E533D8DC3614845DE1671797C52E15F763380B45E841EC325058F1AF8388633F609CADB75A75DC9DC81E728D9D4C2F636F067F89CC14862C) 21 0 R]

@mrpiggi

mrpiggi commented Jan 22, 2025

I hate to say it, but a hash from a key is as unique as the key itself. So if people can identify you by comparing keys like "MaxMuster2024" they can also identify you by comparing the hashes.

That's why I suggested making it possible to use some <salt> before applying the hashing. It would still be up to the author to ensure a unique value for <salt> for every document but at least it would be possible.

Maybe, it would be even feasible to generate a default value for <salt> out of some provided meta data like \def\hyper@hashedlabels@salt{\str_mdfive_hash:e{\@author\@title\@date}} which still would not make it impossible to reconstruct the original labels but would at least give a fairly high chance to generate different hashes of the same label for different documents.
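
The salted variant discussed here can be sketched with the \HyperDestNameFilter hook that is used elsewhere in this thread. This is only a minimal sketch, not an existing hyperref option: the name \c_mydoc_dest_salt_tl and the salt value are hypothetical, and it assumes a kernel recent enough to provide \str_mdfive_hash:e. It hashes the salt together with the complete destination name in one go, so each name should come out as a single 32-digit MD5 string.

```latex
\documentclass{article}
\usepackage{hyperref}

\ExplSyntaxOn
% Hypothetical per-document salt (assumption, not a hyperref feature):
% change it for every submission, but keep it fixed across all LaTeX
% runs of one document so that links and anchors keep matching.
\tl_const:Nn \c_mydoc_dest_salt_tl { replace-with-a-fresh-value }
% Hash salt + full destination name once; the filter must stay expandable.
\def\HyperDestNameFilter#1
  { \str_mdfive_hash:e { \c_mydoc_dest_salt_tl #1 } }
\ExplSyntaxOff

\begin{document}
\section{abc}\label{test}
\cite{doody}\cite{herrmann}
\newpage
\bibliographystyle{plain}
\bibliography{biblatex-examples}
\end{document}
```

With a fixed salt the build stays reproducible; with a different salt per document the same bib key should no longer map to the same destination name across papers. As noted in the thread, this only affects destinations written through hyperref commands.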

@dgruss
Author

dgruss commented Jan 22, 2025

the salt could even be a random value at compile time

@u-fischer
Member

u-fischer commented Jan 22, 2025

\@author, \@title and \@date are typically not expandable and should not be used there.

the salt could even be a random value at compile time

as long as it stays stable during the compilation. Sure. As I wrote already a few times: you can scramble the destinations if you want. But hyperref will not do it by default, so you will have to convince your peers to use this option too, or your papers will be very recognizable as they have such long destination names. I mean your problem is not that the destination names are readable but that they differ from destination names from other authors and that doesn't change if you hash them.

@davidcarlisle
Member

the salt could even be a random value at compile time

obviously it could be but it would make the destination names unusable for anything other than internal links as they would be unstable. But as Ulrike shows a user only needs to add one line to hash the things and adding in the time or a random number (both of which tex has available) or some such would only add a couple more commands, but this should be a user choice. Apart from loss of usability, many systems have a requirement for reproducible builds.

@dgruss
Author

dgruss commented Jan 22, 2025

apart from the deanonymization problems hyperref creates, it is still inconsistent that hyperref uses a different approach for ref links to sections/figures/etc and to bib entries. and it is still not clear why anyone would want to handle these two cases differently. either sections/figures/etc should also use the internal names like \label{fig:internal_name}, or for bib entries the internal name \cite{internal_name} should not be used either, to make these two cases consistent.

Sure, the link and the underlying printed text are independent elements, but these two cases, links to section, links to bib entries, are identical.

@u-fischer
Member

hyperref has an option destlabel. If you use it then \section{test}\label{sec:test} will create the destination name sec:test instead of the usual section.1. As the destination is created at the beginning of the section, while the label comes after, this requires two compilations to resolve the names. So if you want to show people also the labels you use, you can do it. Naturally to get this for every section you have to manually add \label everywhere - a destination name using the section counter can be created automatically.
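
A minimal test document for the destlabel behaviour described above might look as follows. The section titles and the label name are made up for illustration, and the file has to be compiled twice before the label-based name is resolved.

```latex
\documentclass{article}
\usepackage[destlabel]{hyperref}

\begin{document}
% With destlabel, the label text is used as the destination name
% for this section (after the second compilation).
\section{Test}\label{sec:test}

% No \label here, so this section keeps its counter-based name.
\section{No label}

See Section~\ref{sec:test}.
\end{document}
```

Inspecting the /Names array of the resulting PDF shows which destination names were actually written.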

@dgruss
Author

dgruss commented Jan 22, 2025

Then the destlabel option should also control which label is used for bib entries, for consistency, right? These are labels as well.

@mrpiggi

mrpiggi commented Jan 22, 2025

the salt could even be a random value at compile time

This was not my intention as the salt must not change due to the potential need of multiple compilations for name resolution.

many systems have a requirement for reproducible builds

That's why I suggested to be able to set a user defined value.

\@author, \@title and \@date are typically not expandable and should not be used there

I am aware of that, I just made up something. What about the following options:

  • hashlabel=false (default): obviously the current behavior
  • hashlabel=true: hashing without any salt at all (disputable if feasible but would lead to reproducible builds)
  • hashlabel=generate: generate a random number used as salt and writing to the aux file, if not present (meaning generate initially and leave untouched for consecutive runs) -- would probably be also fine for hashlabel=true
  • hashlabel=<user defined value>: do not generate but use the given value, which would lead to reproducible builds

@u-fischer
Member

for consistency, right?

why should the names be consistent? As long as they are unique you can use what you want. You can name one part after Shakespeare plays and the other after flowers. Besides this: Lots of destination names are actually not managed by hyperref but by packages (e.g. biblatex), classes or by user code and such destinations can look quite different. The lilypond documentation for example has destination names like this:

  /Names [
    (Accidental glyphs)
    1744 0 R
    (Accidentals)
    137 0 R
    (Accordion)
    773 0 R
    (Accordion glyphs)
    1756 0 R
    (Accordion registers)
    1876 0 R
    (Adding dynamics marks to stanzas)
    631 0 R
    (Adding singers' names to stanzas)

What about the following options:

With \HyperDestNameFilter, hyperref has a generic method that can be used to implement various scrambling schemes. You can write a package if you think there is a need to have some predefined scrambling options. Be aware that this will only affect destinations written with hyperref commands and not destinations created with the primitives or commands from the pdfmanagement.

@mrpiggi

mrpiggi commented Jan 22, 2025

You can write a package if you think there is a need to have some predefined scrambling options

This won't happen. I just wanted to contribute some ideas if you would be willing to extend hyperref

With \HyperDestNameFilter, hyperref has a generic method that can be used to implement various scrambling schemes

And that's great as this provides a way how @dgruss can achieve the desired behavior
