Fixed shuffle function and combine all shuffled independence test p vals for a target node into a flat list to feed into Anderson Darling Test #1781

Merged
merged 2 commits into from
May 31, 2024

vbcwonderland
Collaborator

@vbcwonderland vbcwonderland commented May 28, 2024

The shuffle function is now fixed.

When shuffleThreshold = 1.0, the number of independence test p-values equals the total number of nodes in the whole graph.
When shuffleThreshold = 0.5, the number of independence test p-values is twice the total number of nodes in the whole graph.

The p-values from all shuffles are then combined into one flat list, which is sent to ADTest to get an ADTest p-value for the target node.
...
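
As a minimal sketch of the "flat list" step described above: the helper below concatenates the per-shuffle p-value lists into a single list. The class name FlattenPValuesSketch and the method flatten are hypothetical names used only for illustration; they are not taken from the PR's code.

```java
import java.util.ArrayList;
import java.util.List;

final class FlattenPValuesSketch {
    // Hypothetical helper: concatenates every per-shuffle list of p-values
    // into one flat list, preserving the order of shuffles and of facts.
    static List<Double> flatten(List<List<Double>> pValsList) {
        List<Double> flat = new ArrayList<>();
        for (List<Double> pVals : pValsList) {
            flat.addAll(pVals);
        }
        return flat;
    }
}
```

The flat list would then be the single input to the Anderson-Darling check for that target node.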

@jdramsey
Collaborator

Can you explain a little more? Are you ending up with one AD p-value per node?

@vbcwonderland
Collaborator Author

Yes, each target node now has one ADTest p-value associated with each shuffle.

shuffleThreshold: a double giving the fraction of the data selected for each shuffle.
shuffleTimes: the total number of shuffles to run, chosen so that the shuffles together approximately cover the full data.

For example, a shuffleThreshold of 0.2 leads to shuffling the data 5 times, each time taking 20% of the data.
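
The relationship in this example can be sketched as follows, assuming shuffleTimes is the ceiling of 1 / shuffleThreshold. That formula is an assumption that matches the 0.2, 0.5 and 1.0 cases mentioned in this thread; the PR's actual computation may differ, and the class and method names are hypothetical.

```java
final class ShuffleTimesSketch {
    // Assumption: the number of shuffles is ceil(1 / shuffleThreshold).
    // The small epsilon guards against floating-point results such as
    // 5.000000000000001 being rounded up to 6.
    static int shuffleTimes(double shuffleThreshold) {
        if (!(shuffleThreshold > 0.0 && shuffleThreshold <= 1.0)) {
            throw new IllegalArgumentException("shuffleThreshold must be in (0, 1]");
        }
        return (int) Math.ceil(1.0 / shuffleThreshold - 1e-9);
    }
}
```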

List<List<Double>> pVals_list: a list of lists of doubles, where each sublist contains the p-values calculated for one shuffle of the data. These are the p-values that the independence test (e.g. FisherZ) returns for the node and each of its local nodes.

The loop runs shuffleTimes times, each time performing the following steps:

  1. Data Subsampling:
    getSubsampleRows(shuffleThreshold) returns a list of row indices, chosen according to the shuffle threshold, indicating which rows of data to include in the test.
    ((RowsSettable) independenceTest).setRows(rows) then sets the rows that the test should consider.
  2. Calculating P-Values:
    A new list pVals is initialized to store the p-values of this iteration. The inner loop goes through each IndependenceFact f in facts and calculates its p-value according to the type of independenceTest:
  • Fisher Z-test (IndTestFisherZ): The p-value is calculated and added directly to pVals.
  • Chi-square test (IndTestChiSquare): The p-value is calculated and added only if it is non-null.
    After all facts are processed for this shuffle, pVals is added to pVals_list, so pVals_list ends up holding one list of doubles per shuffle.
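
The two steps above can be sketched in a self-contained form. This is an illustration under stated assumptions, not the PR's implementation: subsampleRows stands in for getSubsampleRows, the BiFunction pValueOf stands in for the independence test, and the null check is applied to every test type rather than only to chi-square. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.BiFunction;

final class LocalPValuesSketch {
    // Hypothetical stand-in for getSubsampleRows: picks a random subset of
    // row indices whose size is shuffleThreshold * nRows (rounded).
    static List<Integer> subsampleRows(int nRows, double shuffleThreshold, Random rng) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < nRows; i++) {
            rows.add(i);
        }
        Collections.shuffle(rows, rng);
        int k = (int) Math.round(shuffleThreshold * nRows);
        return new ArrayList<>(rows.subList(0, k));
    }

    // One inner list of p-values per shuffle. pValueOf is an assumed callback
    // that returns the p-value of a fact on the given rows, or null if the
    // test could not produce one.
    static <F> List<List<Double>> localPValues(List<F> facts, int nRows,
            double shuffleThreshold, int shuffleTimes,
            BiFunction<F, List<Integer>, Double> pValueOf, Random rng) {
        List<List<Double>> pValsList = new ArrayList<>();
        for (int s = 0; s < shuffleTimes; s++) {
            // Step 1: choose the rows this shuffle will use.
            List<Integer> rows = subsampleRows(nRows, shuffleThreshold, rng);
            // Step 2: collect one p-value per fact, skipping null results.
            List<Double> pVals = new ArrayList<>();
            for (F fact : facts) {
                Double p = pValueOf.apply(fact, rows);
                if (p != null) {
                    pVals.add(p);
                }
            }
            pValsList.add(pVals);
        }
        return pValsList;
    }
}
```

Passing the test as a callback keeps the sketch independent of any particular independence test class; in the real code the rows are set on the test object itself via setRows.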

Later on, getLocalPValues is used in methods such as getAndersonDarlingTestAcceptsRejectsNodesForAllNodes in the following way:

            // All local nodes' p-values for node x
            List<List<Double>> shuffledlocalPValues = getLocalPValues(independenceTest, localIndependenceFacts, shuffleThreshold);
            for (List<Double> localPValues : shuffledlocalPValues) {
                // P value obtained from AD test using the localPValues
                Double ADTest = checkAgainstAndersonDarlingTest(localPValues);
                if (ADTest <= threshold) {
                    rejects.add(x);
                } else {
                    accepts.add(x);
                }
            }

Here each inner list of independence test p-values is fed into ADTest to generate one ADTest p-value for this target node.
This means that when shuffleThreshold is set to 1.0, shuffleTimes is 1 and the result is 1 ADTest p-value.
When we want to generate more data by shuffling, setting shuffleThreshold to 0.2 gives a shuffleTimes of 5 and 5 ADTest p-values as the result, which is exactly what we want.

@vbcwonderland vbcwonderland changed the title Fixed shuffle function Fixed shuffle function and combine all shuffled independence test p vals for a target node into a flat list to feed into Anderson Darling Test May 31, 2024
Collaborator

@jdramsey jdramsey left a comment

Looks good.

@jdramsey jdramsey merged commit e7a4b28 into development May 31, 2024
@jdramsey jdramsey deleted the vbc-05-28 branch May 31, 2024 21:03