Added implementation of https://eprint.iacr.org/2024/1077 #1449

Open: sandy9999 wants to merge 13 commits into master

sandy9999

Added implementation of the Decision Tree protocol from "Securely Training Decision Trees Efficiently" by Divyanshu Bhardwaj, Sandhya Saravanan, Nishanth Chandran and Divya Gupta: eprint link.

This decision tree training protocol has communication complexity $O(mN \log N + hmN + hN \log N)$, which is an improvement by a factor of $\approx \min(h, m, \log N)$ over Hamada et al. link.
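To make the claimed factor concrete, here is a minimal plaintext sketch. It assumes that the baseline of Hamada et al. costs $O(hmN \log N)$, which is not stated in this thread, and it reads N as the number of samples, m as the number of attributes and h as the tree height. It ignores all constants, so it only illustrates the asymptotic ratio. The function names are hypothetical.

```python
import math

def baseline_cost(h, m, n):
    # Assumed baseline (not stated in this thread): h * m * N * log N
    return h * m * n * math.log2(n)

def new_cost(h, m, n):
    # Expression quoted in the PR description, constants dropped
    log_n = math.log2(n)
    return m * n * log_n + h * m * n + h * n * log_n

def improvement(h, m, n):
    # Algebraically equal to 1 / (1/h + 1/m + 1/log N)
    return baseline_cost(h, m, n) / new_cost(h, m, n)

if __name__ == "__main__":
    h, m, n = 8, 16, 2 ** 20
    smallest = min(h, m, math.log2(n))
    print(improvement(h, m, n), smallest)
```

Because the ratio equals $1 / (1/h + 1/m + 1/\log N)$, it always lies between $\min(h, m, \log N)/3$ and $\min(h, m, \log N)$ under these assumptions, which is one way to read the "$\approx$" in the description.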

-Sandhya Saravanan

@mkskeller
Member

Thank you for submitting this pull request. However, the code isn't suitable for inclusion due to the following reasons:

  • Using the code fails with an indentation error
  • If the error is corrected, it fails because radix_sort_permutation_from_matrix is not available
  • I don't see a good reason for a separate module if the functionality it provides is essentially the same as an existing module
  • decision_tree_new is not a meaningful name and will be even less meaningful as time goes on
  • Multiple functions are just copy-pasted instead of imported. This is bad practice as future improvements to them will go unused.

I would suggest the following apart from correcting the compilation errors:

  • Add the code to the decision_tree module where it can be found by users.
  • Avoid using "new" as prefix or suffix. Use either "optimized" or something more prescriptive.
  • Avoid the use of entirely copy-pasted functions.
  • Consider sub-classing the existing classes.

@mkskeller
Member

Thank you for your additions. I'm afraid I still see two issues here:

  1. Running Scripts/compile-emulate.py breast_tree nearest gives worse results on this branch than on the main branch. This might be related to the fact that the main branch outputs a leaf NID of 32 but in this branch the NIDs only go up to 16.
  2. I'm hesitant to merge this branch into the main branch because of the separate module. The reason is that, in the long term, I want to incorporate the changes into the main functionality in the decision_tree module. After that, anyone who used the decision_tree_optimized module will need to change their code as I don't want to add any hacks. If your functionality were available within the decision_tree module, the adaptation would be much smoother. That said, I'm happy to incorporate your changes in a separate branch, and I intend to merge that in the future to give you credit. Please let me know your thoughts on this.

@mkskeller
Member

After having another look, the first issue might be related to not calling UpdateState after the last layer.
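As a rough illustration of how a skipped final update could produce exactly these numbers, here is a plaintext sketch. It assumes a hypothetical numbering in which a node with ID n has children 2n - 1 and 2n, so that IDs at depth k run from 1 to 2^k. The function `leaf_ids` and its flag are made up for this example and are not the UpdateState routine from the PR.

```python
def leaf_ids(height, update_after_last_layer=True):
    """Return the node IDs samples can carry after `height` layers.

    Hypothetical numbering: the root has ID 1 and a node with ID n
    has children 2n - 1 and 2n.
    """
    ids = [1]
    for layer in range(height):
        is_last = layer == height - 1
        if is_last and not update_after_last_layer:
            # Samples keep the IDs of the last internal layer
            break
        ids = [child for nid in ids for child in (2 * nid - 1, 2 * nid)]
    return ids

if __name__ == "__main__":
    print(max(leaf_ids(5)))         # with the final update
    print(max(leaf_ids(5, False)))  # final update skipped
```

Under this assumed numbering, a height of 5 gives a largest ID of 32 with the final update and 16 without it, matching the two values mentioned above. Whether the actual code numbers nodes this way would need to be checked against the decision_tree module.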

@sandy9999
Author

Hi @mkskeller Thank you for your detailed comments. I understand your concerns, and will try to work with the decision_tree module.

@sandy9999
Author

Hi @mkskeller I have updated the PR as discussed. Would this work?

@mkskeller
Member

I'm afraid there are still a few issues:

  1. decision_tree_optimized still exists, did you mean to delete this?
  2. The changes to decision_tree break all applications thereof: adult, easy_adult, bench-dt, and breast_tree (if changed to use decision_tree).

@sandy9999
Author

sandy9999 commented Dec 23, 2024

Hi @mkskeller

Sorry for the extremely late follow-up; this got missed. Points 1 and 2 have been addressed. That is,

./compile.py -Z 3 -R 64 breast_tree
(tmux)
./replicated-ring-party.x 0 breast_tree
./replicated-ring-party.x 1 breast_tree
./replicated-ring-party.x 2 breast_tree

and

./compile.py -Z 3 -R 64 bench-dt 10 10
(tmux)
./replicated-ring-party.x 0 bench-dt-10-10
./replicated-ring-party.x 1 bench-dt-10-10
./replicated-ring-party.x 2 bench-dt-10-10

do not break anymore. adult and easy_adult broke even on the older version of decision_tree.py (with adult, I need instructions on how to take input, and for easy_adult the process got killed).

@mkskeller
Member

I'm afraid breast_tree still breaks in the sense that the accuracy is considerably less:

$ Scripts/compile-run.py ring breast_tree
(...)
train for height 1: 159/426 (159/159, 0/267)
test for height 1: 53/143 (53/53, 0/90)
train for height 2: 159/426 (159/159, 0/267)
test for height 2: 53/143 (53/53, 0/90)
train for height 3: 159/426 (159/159, 0/267)
test for height 3: 53/143 (53/53, 0/90)
train for height 4: 159/426 (159/159, 0/267)
test for height 4: 53/143 (53/53, 0/90)
train for height 5: 320/426 (87/159, 233/267)
test for height 5: 103/143 (24/53, 79/90)
(...)

This means that for height 5, only 103 out of 143 samples in the test set are classified correctly. For comparison, the old code results in the following:

train for height 1: 393/426 (146/159, 247/267)
test for height 1: 126/143 (48/53, 78/90)
train for height 2: 401/426 (139/159, 262/267)
test for height 2: 134/143 (48/53, 86/90)
train for height 3: 416/426 (155/159, 261/267)
test for height 3: 132/143 (52/53, 80/90)
train for height 4: 423/426 (156/159, 267/267)
test for height 4: 132/143 (51/53, 81/90)
train for height 5: 423/426 (156/159, 267/267)
test for height 5: 132/143 (51/53, 81/90)

The accuracy here is 132 out of 143. While this varies slightly from run to run, it is always around 130 rather than 100. Similarly, the training accuracy in the original code is close to 100% but much lower in the changed code. My understanding is that the new scheme doesn't change the algorithm, so the accuracy should be the same. Is this correct?
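For comparing runs like the two above, a small helper along these lines may be handy. It is only a sketch: the regular expression is inferred from the output lines quoted in this thread, and it assumes that the two fractions in parentheses are per-class counts (they sum to the overall fraction in every quoted line). The name `parse_accuracy` is hypothetical.

```python
import re

# Pattern inferred from lines such as
# "test for height 5: 103/143 (24/53, 79/90)"
LINE_RE = re.compile(
    r"(?P<split>train|test) for height (?P<height>\d+): "
    r"(?P<correct>\d+)/(?P<total>\d+) "
    r"\((?P<c0>\d+)/(?P<t0>\d+), (?P<c1>\d+)/(?P<t1>\d+)\)"
)

def parse_accuracy(line):
    """Parse one output line; return None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = {
        key: (value if key == "split" else int(value))
        for key, value in match.groupdict().items()
    }
    fields["accuracy"] = fields["correct"] / fields["total"]
    return fields

if __name__ == "__main__":
    for line in (
        "test for height 5: 103/143 (24/53, 79/90)",
        "test for height 5: 132/143 (51/53, 81/90)",
    ):
        result = parse_accuracy(line)
        print(result["split"], result["height"], f"{result['accuracy']:.1%}")
```

Read this way, the lines for heights 1 to 4 in the changed code (159/159, 0/267) say that every sample is assigned to the same class, so the tree is effectively not splitting until height 5.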

@sandy9999
Author

sandy9999 commented Jan 2, 2025

My suspicion is that if 2 attributes, say A_i and A_j with their corresponding thresholds T_i and T_j are associated with the same Gini Index, then the best attribute among A_i and A_j is picked arbitrarily, and this could affect the accuracy.
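To illustrate the tie situation described here, the following is a plaintext sketch, not the secure protocol from the PR. It assumes binary labels and the standard Gini impurity $2p(1-p)$, scores every (attribute, threshold) pair, and breaks ties deterministically by the lowest attribute index and then the lowest threshold. All function names are hypothetical, and the PR's actual scoring may use a different but related measure.

```python
from fractions import Fraction

def gini_impurity(labels):
    # Binary Gini impurity 2p(1-p); exact arithmetic so ties are exact
    if not labels:
        return Fraction(0)
    p = Fraction(sum(labels), len(labels))
    return 2 * p * (1 - p)

def split_score(column, labels, threshold):
    # Weighted impurity of the two sides of the split x <= threshold
    left = [y for x, y in zip(column, labels) if x <= threshold]
    right = [y for x, y in zip(column, labels) if x > threshold]
    n = len(labels)
    return (Fraction(len(left), n) * gini_impurity(left)
            + Fraction(len(right), n) * gini_impurity(right))

def best_split(columns, labels):
    """Return (score, attribute index, threshold) of the best split."""
    candidates = [
        (split_score(column, labels, threshold), attr, threshold)
        for attr, column in enumerate(columns)
        for threshold in sorted(set(column))[:-1]
    ]
    # Tuple ordering breaks ties by attribute index, then threshold
    return min(candidates) if candidates else None

if __name__ == "__main__":
    columns = [[1, 2, 3, 4], [10, 20, 30, 40]]
    labels = [0, 0, 1, 1]
    print(best_split(columns, labels))
```

In this example both attributes separate the labels perfectly, so their best scores tie at zero and only the tie-break rule decides which one is chosen. Making that rule identical in the old and new code would be one way to test whether ties explain the accuracy gap.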

@mkskeller
Member

I'm not sure that's the whole explanation. Even a "bad" index choice can be corrected at a further level, so I would still expect the training set accuracy to approach 100% as with the old code.
