Privacy Preserving Learning

Manav Singhal edited this page Sep 17, 2021 · 5 revisions

The command line argument --privacy_activation implements aggregated learning by saving only those features that have been updated by at least a minimum threshold of users.

Motivation:

  • In many real-world scenarios, the recommender cannot use the feature preferences of a user directly for learning due to privacy constraints.
  • However, the recommender can learn from aggregated data which would uphold the privacy of the user.

Methodology:

  • For each feature, a 32-bit vector is defined.
  • We calculate a 5-bit hash of the tag of the example.
  • For each feature weight updated by a non-zero value, we use the 5-bit hash to look up a bit in the 32-bit vector and set it to 1.
  • When saving the weights into a file, we calculate the number of bits set to 1 for a feature. If it is greater than the threshold, the weights for that feature are saved.
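The bookkeeping described above can be sketched as follows. This is a minimal illustration, not VW's internals: the function names are invented, and CRC32 stands in for VW's actual tag hash.

```python
import zlib

NUM_BITS = 32  # one 32-bit vector is kept per feature

def tag_hash(tag: str) -> int:
    """5-bit hash of the example's tag (illustrative stand-in for VW's hash)."""
    return zlib.crc32(tag.encode()) & 0x1F  # keep the low 5 bits -> 0..31

def record_update(bitvec: int, tag: str) -> int:
    """When a feature weight receives a non-zero update, set the bit
    selected by the 5-bit hash of the example's tag."""
    return bitvec | (1 << tag_hash(tag))

def should_save(bitvec: int, threshold: int = 10) -> bool:
    """Save a feature's weights only if enough bits are set, i.e. the
    feature has (approximately) been touched by enough distinct users."""
    return bin(bitvec).count("1") >= threshold
```

Because distinct tags can collide in the 32 hash buckets, the popcount is a lower bound on the number of distinct users that touched the feature.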

Threshold:

  • The default value of the threshold is 10.
  • Treat each update as hashing to one of k bits uniformly at random; the number of trials until an additional bit is flipped follows a geometric distribution.
  • Given that n bits are already flipped, the probability that the next update flips a new bit is (k-n)/k.
  • The expectation of a geometric distribution is 1/p, hence the expected waiting time until m bits are flipped is the summation from n = 0 to m-1 of k/(k-n).
  • On calculation, the expected waiting time for flipping 10 bits out of 32 was 11.76, while it was 22.21 for flipping 10 bits out of 11.
  • This implies that at least 12 unique users are needed, in expectation, to flip 10 bits out of 32.
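The waiting-time figures above can be checked by evaluating the summation directly (function name is illustrative; m and k are as in the bullets):

```python
def expected_flips(m: int, k: int) -> float:
    """Expected number of updates, under uniformly random bit choices,
    until m distinct bits out of k are set: sum over n of k / (k - n)."""
    return sum(k / (k - n) for n in range(m))

# about 11.77 and 22.22, matching the truncated 11.76 and 22.21 quoted above
print(expected_flips(10, 32))
print(expected_flips(10, 11))
```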

Implementation:

  • --privacy_activation : activates the feature
  • --privacy_activation_threshold arg (=10) : sets the threshold
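For example, assuming a VW-format data file named train.dat (the filename is hypothetical), the feature can be enabled as:

```shell
# enable privacy-preserving aggregated learning with the default threshold of 10
vw --privacy_activation -d train.dat -f model.vw

# or with a custom threshold
vw --privacy_activation --privacy_activation_threshold 15 -d train.dat -f model.vw
```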

Future Work:

  • Implement the feature for save_resume.
  • Work on aggregations in the online setting.

Credits:
