Issue with fges algorithm #73
Hello. Can you please provide the log file? What is the error message?
There is no error message; the command line just prints "Killed" along with the command I used. In causal-cmd.log, the messages stop at the following: 2022-08-03 09:17:31 INFO edu.pitt.dbmi.causal.cmd.data.DataValidations:166 - Start data validation on file dataset.txt. And in the fges_*.txt, the last message is the following:
Is 'verbose' on? I.e., --verbose? No, I don't see it; I think it's off by default. It's possible the algorithm is running at this point but not printing anything?
Also I discovered some issues in my previous pipeline for doing large models which I need to think about. @kvb2univpitt Is --faithfulnessAssumed false by default? It would help if it were false here. (I think that's the flag).
@jdramsey Sorry, I missed your question. Yes, faithfulnessAssumed is set to false by default.
@kvb2univpitt @Zarmas Hmmm...... OK I need to think more. BTW the laptop I'm currently using is just 16 GB so it's a little hard for me to comment well, but I'm hopefully getting a 64 GB laptop soon and I'll try it when I get it. Also there's another different sort of algorithm I just thought of yesterday that I'd like to try on the larger laptop that would definitely benefit from parallelization. It's based on this paper that we just got published in UAI 2022: https://arxiv.org/abs/2206.05421 though it does assume that you're interested in a particular variable (or list of variables). Would that happen to be true in your case, @Zarmas ? Or did you need causal structure for all 20,000 variables?
I need the causal structure for all the variables, or to run it individually for all 20000 variables, if it can run fast.
Interestingly, just to note, I recently went back to look at an old algorithm of mine from years ago called MBFS; I just renamed it to PC-MB for clarity. It's very fast for finding the Markov blanket, or the causes and effects, of a variable. In fact, if the model is not too dense it's very competitive. If you don't mind my being curious, what sort of application do you have where you actually need to know the causal structure over all 20,000 nodes? I can't currently think of one, which is why I've been going back to thinking about these partial algorithms the last few days. But maybe you can convince me that it's worthwhile finding the whole causal structure so that I will be motivated to work on the problem for you.
(I ask this despite having spent a lot of time trying to find the causal structure over many nodes like that.)
Actually, strategies for doing 20000 genes individually in parallel are occurring to me, but if I'm going to do this work for you, you should send me emails so we can work out the details. I'm thinking if you have a big machine it might be doable, using a couple of strategies.
@Zarmas I take it this is a stale issue, so I'll close it. If you're still interested, you can re-open the issue.
@Zarmas Another comment--what I realized was that this was a problem specifically for running parallelized FGES in a context where more than one copy of FGES was also being run in parallel--something we hadn't previously tried to do. But it came up again recently, so then I looked back at this issue and realized it was the same issue. :-D
Hello again, I tried using the fges algorithm in the new version on a dataset that could previously run in version 1.4.1 with no problems. The command I used was the following:
java -Xmx24G -jar causal-cmd-1.7.0-SNAPSHOT-jar-with-dependencies.jar --data-type continuous --delimiter tab --parallelized --json-graph --algorithm fges --dataset dataset.txt --score sem-bic-score --penaltyDiscount 10.0 --maxDegree 20
And this time I get a new error. I wonder why a dataset that was small enough to be processed in the old version can no longer be supported. The error message I got was the following:
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
The out-of-memory issue is caused by outputting the graph in JSON format. See issue #89.
I am still having the original problem of not being able to run because of RAM limitations, even without the JSON graph.
Can you give me the dimensions of your problem again? I'm seriously not seeing any problems like this on my end, though maybe I've got the dimensions wrong. Also how much RAM are you using? I know there was a problem with the JSON graphs (that we're looking into).
I've got 64 GB of memory on my machine--is that more than you're using?
Also, is your problem continuous? Discrete? Mixed?
It's a continuous problem; I used it on a 128 GB RAM machine. The dimensions are approximately 20000 variables and 3000 samples. You can see details about it in the original message of this issue.
That's what I thought. OK, I'll try it again later.
OK, I just added a large FGES simulation module to our (new) py-tetrad project and got it up to 5000 variables by 3000 samples, average degree of 2. I guess I'll try to keep increasing the size of the problem incrementally to see how large I can make it. BTW when you say you're doing 20,000 variables, I have gotta know, are you trying to run a genome? (Maybe we talked about this before; many thoughts through my mind since then.) If so, could you limit the variables given to FGES by identifying some target variables and removing all nodes not unconditionally dependent on any of those target variables? Here's my simulation: https://github.com/cmu-phil/py-tetrad/blob/main/examples/run_large_fges.py Result:
This is on macOS Ventura, M1-Max chip.
Parallelized, the search speeds up a little, but not much:
It's interesting that you're unable to do even 200 variables with sample size 100:
Now my interest is piqued... why can I do bigger datasets than you can? I haven't scaled to 10,000 or 20,000 variables yet, because I've already gone further than you could... why? Any ideas? I don't currently have an Ubuntu machine to try this on, but could you tell me, when you type
what does it print out? By the way, it's taking far longer for my code to simulate a large sample like this than to actually run FGES on it. I used to parallelize the simulation code; I think I took that out after the million-variable paper, but maybe I can put that back in. Hold it, another question. Do you have any idea how dense your model is? I'm assuming very sparse in the above overall.
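For a sense of what that kind of simulation involves, here is a minimal sketch in plain numpy (not the py-tetrad simulation code, which lives in the linked repository) of drawing data from a random DAG with a target average degree under a linear SEM with Gaussian noise; the function name and parameter values are illustrative only:

```python
import numpy as np

def simulate_linear_gaussian(n_vars=1000, n_samples=500, avg_degree=2, seed=0):
    """Simulate a random DAG with roughly the given average degree and draw
    samples from a linear SEM with standard-Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Edge probability chosen so the expected average degree matches avg_degree:
    # avg degree = 2 * #edges / #nodes, and each of the n*(n-1)/2 pairs is an
    # edge with probability p, so p = avg_degree / (n_vars - 1).
    p_edge = avg_degree / (n_vars - 1)
    # Coefficient matrix B, with B[i, j] != 0 meaning i -> j for i < j
    # (so the variable order is already a topological order).
    B = np.zeros((n_vars, n_vars))
    for j in range(1, n_vars):
        has_parent = rng.random(j) < p_edge
        B[:j, j] = np.where(has_parent, rng.uniform(-1.0, 1.0, size=j), 0.0)
    # Generate samples variable by variable in topological order.
    X = np.empty((n_samples, n_vars))
    for j in range(n_vars):
        X[:, j] = X[:, :j] @ B[:j, j] + rng.standard_normal(n_samples)
    return X, B

# For example, a smaller version of the run described above:
# X, B = simulate_linear_gaussian(n_vars=5000, n_samples=3000, avg_degree=2)
```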
OK, here's what I get for 10000 variables, 3000 samples, average degree 2:
This Python JPype module: (The contents of that may change--I may go ahead and try 20,000 variables overnight... though I've got to get back to work at the moment.) In any case, I hope this gives you some ideas. I think either it's a platform issue you're up against, or else your model is just very dense. I think FGES should be able to run on the problem given enough memory and time. Oh--it could also be a data loading issue--I didn't load data here but simulated it. Here's the JVM I was using for this: openjdk version "11.0.9.1" 2020-11-04 LTS
They are the same Java versions I used in the fges-mb issue: 15.0.2, 18.0.2, and 11.0.18. The model is dense; did you set a low "max degree" when running the algorithm?
I did. Check the code.
Feel free to play with that script by the way.
By the way, "average degree 2" simply means that the total number of edges in the graph is about equal to the total number of nodes. Locally, it could be quite dense. |
Which is why I'm quite curious what kind of data you're analyzing.
Here's another run I did with m = 10000, N = 3000, avg degree = 2, penalty discount = 8. Good precision.
I find that FGES does really well in the large, sparse case. For dense cases, precisions can fall. We have other algorithms that do really well in the dense regime (GRaSP, BOSS, etc.) but don't scale as well. Actually, we will have a version of BOSS that we've scaled accurately to 1000 nodes, average degree of 10, which is not bad; that will come out in our next version I think. But in your case, to use it, you'd have to do some variable selection. m = 20000 is too much. I don't know of any other algorithms that will scale that well for the dense regime that are correct for linear, Gaussian data; perhaps for some specialized types of data, you might know of some.
It seems to run better with a low max degree, but for 2000 variables and 500 samples it seems like it ran for too long (about 4 hours), and checking the machine, for most of the time only one CPU was being used, so I am not sure it was running parallelized. It seems that the parallelized option has been removed and the default is not to use parallelization. This was the command I used:
java -Xmx24G -jar causal-cmd-1.7.0-SNAPSHOT-jar-with-dependencies.jar --data-type continuous --default --delimiter tab --algorithm fges --dataset data2000x500.txt --score sem-bic-score --penaltyDiscount 10.0 --maxDegree 3
I am wondering if I did something wrong in the command and it didn't run parallelized.
My script returns with an average degree of 6, m = 2000, n = 500. CPU elapsed time of 28 seconds. I am trying to figure out what's going on with your machine or with your dataset--you'll have to fiddle with the script I wrote and see if you can make it take that long. Unfortunately, I can't help with that part of the project. All I've got is my simulated data and my machine. I could simulate some data of that size and try causal-cmd on it. I've got some things I need to do, and I need to run to campus for a faculty meeting, so that may have to wait until tomorrow.
Actually, here, I did it real quick. This ran in 31 seconds on my machine. Maybe it's because I used the --faithfulnessAssumed flag, which is fine for large, sparse models. (Sparse for 2000 variables can have a fairly high average degree, since density = avg degree / (# vars - 1); e.g., avg degree 6 over 2000 variables gives a density of only about 0.003.)
java -Xmx24G -jar causal-cmd-1.6.1-jar-with-dependencies.jar --data-type continuous --default --delimiter comma --algorithm fges --dataset mydata.csv --score sem-bic-score --penaltyDiscount 10.0 --verbose --faithfulnessAssumed
This dataset was 2000 x 500, avg degree 6. I generated it using this script, with those parameter settings: https://github.com/cmu-phil/py-tetrad/blob/main/examples/simulating_data.py
Yeah, it was the --faithfulnessAssumed parameter that did it, I'm sure.
Sorry, I got caught up with several projects. I wanted to comment on parallelization for FGES. SFAIK, there are only two places where you can take advantage of parallelization. One is in the initial step, where you try to determine the "effect graph"--i.e., which single edges, out of all possible single edges, are worth adding to the graph. That step is highly parallelizable and should engage all processors. Then comes a very serial part of the algorithm: adding each edge. This is serial in that each edge addition requires the previous one to have been added. Each edge addition is parallelizable in that you choose an edge from all possible edges to add, and that choice can be parallelized. But the overall algorithm can only be made partially parallel. This is a limitation of the algorithm.
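A rough structural sketch of the shape described above (not Tetrad's actual implementation; `score_edge_addition` is a hypothetical stand-in for the real local score change, and the backward phase and CPDAG bookkeeping are omitted):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def forward_search_sketch(variables, score_edge_addition, max_workers=8):
    """Rough shape of an FGES-style forward search, showing where
    parallelism can and cannot help."""
    graph = set()  # current set of adjacencies
    candidates = list(combinations(variables, 2))

    # Step 1: the "effect edges" step -- score every possible single edge
    # against the empty graph. These computations are independent, so this
    # step parallelizes well and can engage all processors.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        gains = list(pool.map(lambda pair: score_edge_addition(graph, *pair),
                              candidates))

    # Step 2: the greedy forward phase. Each addition depends on the graph
    # produced by the previous one, so this outer loop is inherently serial;
    # only the re-scoring of candidates within an iteration can be
    # parallelized (done serially here for brevity).
    while candidates:
        best_gain = max(gains)
        if best_gain <= 0:
            break
        best = candidates[gains.index(best_gain)]
        graph.add(best)
        candidates.remove(best)
        gains = [score_edge_addition(graph, x, y) for x, y in candidates]

    return graph
```

This is only meant to illustrate the serial bottleneck described in the comment above; the real algorithm does considerably more work per step.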
We've lately been concentrating on scaling to denser graphs; we've only recently started to think again about scaling to graphs with many variables. Ideally, you'd want algorithms that can do both at once--we are thinking about that. Here's a paper we published for denser graphs: Lam, W. Y., Andrews, B., & Ramsey, J. (2022, August). Greedy relaxations of the sparsest permutation algorithm. In Uncertainty in Artificial Intelligence (pp. 1052-1062). PMLR. We're working on another algorithm like GRaSP that scales to a larger number of variables, but we still need to scale it to 2000 for a dense graph. It's a hard problem; I don't know of any other algorithm in the literature that even tries to do that. For sparse graphs, FGES is a good option; in that sense, we've just not been interested in making better algorithms to scale to large numbers of variables for sparse graphs; there is already a good algorithm for that. Anyway, it's something we're thinking about.
@Zarmas OK, I finally took the time today to run an example for FGES with 20,000 variables. I used py-tetrad with this script: https://github.com/cmu-phil/py-tetrad/blob/main/pytetrad/algcomparison_large_fges.py with the larger parameter settings. This used 20,000 variables with an average degree of 6, N = 500; you can vary the parameters to suit. The simulation took 15 minutes; the algorithm took 51 minutes to run, with this result:
AP is adjacency precision; AHP is arrowhead precision; these are essentially perfect, so there's no false information in the resulting graph. The recall could be a lot higher, for two possible reasons. First, the sample size is just 500; if the sample size were greater than that, the recall would come up. Also, the range for coefficient values was (-1, 1) with no interval taken out about zero; weak coefficients could make it challenging to find edges. So it can be done. I did set the Faithfulness parameter to True, as I suggested earlier.
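For reference, the adjacency statistics here are just set comparisons between estimated and true adjacencies; a minimal sketch of that bookkeeping (arrowhead precision/recall would compare orientations analogously; node names in the example are made up):

```python
def adjacency_precision_recall(true_edges, est_edges):
    """Compare estimated vs. true adjacencies, ignoring orientation.
    Edges are given as (node_a, node_b) pairs."""
    true_adj = {frozenset(e) for e in true_edges}
    est_adj = {frozenset(e) for e in est_edges}
    tp = len(true_adj & est_adj)  # correctly recovered adjacencies
    precision = tp / len(est_adj) if est_adj else 1.0
    recall = tp / len(true_adj) if true_adj else 1.0
    return precision, recall

# Example: 3 true edges, 2 estimated, both estimated ones correct.
print(adjacency_precision_recall(
    [("X1", "X2"), ("X2", "X3"), ("X3", "X4")],
    [("X2", "X1"), ("X2", "X3")]))  # -> (1.0, 0.666...)
```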
I tried using py-tetrad on a small dataset with a Python script similar to the one you shared, but I didn't use the comparison functions, and it looks like this:
df = pd.read_csv("resources/dataset.txt", sep="\t")
score = ts.SemBicScore(data)
score.setPenaltyDiscount(8)
fges_graph = search.fges(score)
I get results without a problem, but I can't figure out how to set parameters like parallelized or maxDegree without the parameters module, similar to how I set the penalty discount using score.setPenaltyDiscount, and I couldn't find other parameter-setting functions when I looked into it more.
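One plausible way to reach those settings is to construct the Java Fges object directly and call its setters before searching, rather than going through the wrapper function. A hedged sketch along those lines; the setter names (setMaxDegree, setFaithfulnessAssumed, setParallelized, setVerbose) are assumptions mirroring the causal-cmd flags, so check them against the Tetrad javadocs for your version:

```python
import jpype.imports

# Assumes the JVM has already been started with the Tetrad jar on the
# classpath, e.g. jpype.startJVM(classpath=["resources/tetrad-current.jar"]),
# as in the py-tetrad examples, and that `data` is already a Tetrad DataSet
# (the conversion from the pandas DataFrame is omitted here, as in the
# snippet above).
import edu.cmu.tetrad.search as ts

score = ts.SemBicScore(data)
score.setPenaltyDiscount(8)

fges = ts.Fges(score)
# Setter names below are assumptions inferred from the causal-cmd flags
# (--maxDegree, --faithfulnessAssumed, --parallelized, --verbose).
fges.setMaxDegree(3)
fges.setFaithfulnessAssumed(True)
fges.setParallelized(True)
fges.setVerbose(True)

fges_graph = fges.search()
```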
Hello,
I am trying to make a causal inference study, and I am using the following command:
java -Xmx128G -jar causal-cmd-1.4.1-SNAPSHOT-jar-with-dependencies.jar --data-type continuous --delimiter tab --parallelized --json-graph --algorithm fges --dataset dataset.txt --score sem-bic-score --penaltyDiscount 10.0 --maxDegree 20
on a 14-thread, 128 GB RAM system. The full dataset I enter is approximately 20,000 continuous variables and 3,000 samples.
The process gets killed after running for approximately 14 minutes. Even with smaller samples the process gets killed. From the top command, it seems that memory usage is close to 100% before it stops running. In the log file, the last entry is the following:
Initializing effect edges: 21000
I tried dividing the dataset, and the smallest data size it could run was 2000 variables and 100 samples. In the paper accompanying the causal-cmd executable (Ramsey et al. 2017), it is stated that it can be used for a million variables and more, but it seems impossible to use it on more than 2000 variables on an above-average system. What can I do in order to use it for the dataset I want to, with the system I currently use?
Thank you in advance,
George