Error Parsing C++ Files for Code2Seq #198

Open
estiver-alvarez opened this issue Nov 15, 2021 · 15 comments

estiver-alvarez commented Nov 15, 2021

Hi sir, I want to take this opportunity to express my admiration for your great work, which is extremely useful for my Bachelor's thesis.
Currently, I am trying to convert my dataset of about 100000 cpp files, but I always get this error.

image

Originally I thought the issue was related to memory capacity, but I quickly dismissed this because I work with machine instances with up to 150 GB of RAM. I also checked the limits set by the OS (Linux), and they are far above the number of threads that exist when the app crashes. Given the capacity of the instances I have worked with, it should even be possible to create a thread for each cpp file without any performance or configuration problems.
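
For readers who want to repeat the limit check described above, here is a minimal sketch that prints the Linux settings which most commonly cap the number of threads a process can create. The helper names `read_limit` and `print_thread_limits` are made up for this illustration, and the list is an assumption about which limits are relevant, not a complete diagnosis.

```shell
#!/usr/bin/env bash
# Sketch: print Linux limits that commonly cap thread creation.
# read_limit and print_thread_limits are hypothetical helper names.

# Print the content of a /proc file, or "n/a" if it cannot be read
read_limit() {
  if [ -r "$1" ]; then cat "$1"; else echo "n/a"; fi
}

print_thread_limits() {
  # Per-user limit on processes/threads for the current shell
  echo "max user processes (ulimit -u): $(ulimit -u 2>/dev/null || echo n/a)"
  # System-wide limits
  echo "kernel.threads-max: $(read_limit /proc/sys/kernel/threads-max)"
  echo "kernel.pid_max: $(read_limit /proc/sys/kernel/pid_max)"
  # Every thread stack needs memory mappings, so this can also be a cap
  echo "vm.max_map_count: $(read_limit /proc/sys/vm/max_map_count)"
}

print_thread_limits
```

Run it in the same shell (or container) that starts astminer, since the per-user limit can differ between sessions.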

I kept track of the JVM logs and, with the help of an analysis tool, found that over 99% of the threads are in the TIMED_WAITING state, and all of them are associated with MVStore.
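
The same kind of summary can be reproduced without a dedicated analysis tool. The sketch below assumes a plain-text thread dump (for example one saved with `jstack <pid> > dump.txt`) and counts threads per state and per name prefix. The function names are hypothetical, and the `MVStore` prefix in the usage comment is taken from the observation above.

```shell
#!/usr/bin/env bash
# Sketch: summarize a JVM thread dump saved as a text file.
# count_thread_states and count_threads_named are hypothetical helper names.

# Print how many threads are in each state, most frequent first
count_thread_states() {
  grep 'java.lang.Thread.State:' "$1" \
    | sed -E 's/.*java\.lang\.Thread\.State: ([A-Z_]+).*/\1/' \
    | sort | uniq -c | sort -rn
}

# Print how many threads have a name starting with the given prefix
# (thread header lines in a dump start with the quoted thread name)
count_threads_named() {
  grep -c "^\"$1" "$2"
}

# Usage, assuming dump.txt exists:
#   count_thread_states dump.txt
#   count_threads_named MVStore dump.txt
```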

image

image

Could you help me with this issue?

Thank you very much for your attention.

estiver-alvarez (Author) commented

This is the configuration that I'm currently applying.

image

illided (Contributor) commented Nov 15, 2021

Hi! At startup astminer tells you how many threads will be created (by default only 1, but you can change it in the config with the numOfThreads option). Could you send us the full console output and log.txt (it should be created in the root of the project)?

estiver-alvarez (Author) commented

Of course, here is the file. Thanks for your support.
log.txt

illided (Contributor) commented Nov 15, 2021

It's really strange. Astminer created only 1 thread, but some subcomponents created multiple threads. Could you also provide us with some of your system info: OS, gcc compiler version, JDK version, etc.? I think anything can help :)

In the meantime you can run astminer from docker. For this, run ./gradlew shadowJar and docker build -t voudy/astminer . in the root of the project. From then on, the cli script will run a docker container instead of the jar.

estiver-alvarez (Author) commented Nov 18, 2021

@illided Thank you very much for your support. Unfortunately I still have the issue. This time the processing went further than in the previous attempt, but it still didn't get through even 20% of the dataset. The strangest part is that this happened both when I ran it through the cli and through docker. The app runs on a machine instance with 4 CPUs and 132 GB of RAM, under Linux (Ubuntu 20.04 and Debian 10), with the latest stable version of Java 11.

P.S. When I ran the app under docker, I didn't know how to configure the Java heap parameters such as -Xms and -Xmx, so I ran the app with the defaults.

image

Here are the last 100000 lines of the log:
log1.txt

illided (Contributor) commented Nov 18, 2021

Ok, now it's super strange :)
Is it possible for you to share the data on which you run astminer? If the dataset is very large, at least a small part of it would help.

illided (Contributor) commented Nov 18, 2021

Also, if you are in a real hurry, you can implement C++ support through a tree-sitter grammar. The parser that we are using for C++ right now is not the most stable.

We are planning to do this in the future, but a PR is welcome.

estiver-alvarez (Author) commented

@illided Of course, that's very kind of you :). At the link below you will find the dataset that I'm trying to process.
https://drive.google.com/file/d/1CyNKBDYequb7izMAsk3wHj9A0HOWZsfc/view?usp=sharing

pppyx commented Dec 15, 2021

I'm now running into the same situation: an OOM error when trying to create a thread, and my dataset is also very large.

I hope you will find the cause soon, thanks!

illided (Contributor) commented Dec 15, 2021

Sorry, I didn't have time to respond.

As I suspected, the fuzzy parser creates multiple threads and can't close them properly. In fact, it creates up to 6 threads for each file! This issue is fixed in the new version of the fuzzy parser, but unfortunately that version is not compatible with astminer, and we have no time right now to fully rewrite this part :(

You can try to experiment with the code or, as I said earlier, try to implement support for C++ through tree-sitter. It shouldn't be that hard, as there are a lot of examples of what you need to implement.

See astminer/src/main/kotlin/astminer/parse/treesitter/java/
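
To put the "up to 6 threads per file" figure in perspective, here is a small worked calculation. It assumes the worst case, in which none of the leaked threads is released during a run, and uses the dataset size of about 100000 files from the original report. The function names and the thread budget of 30000 are hypothetical values chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: worst-case estimate of leaked threads, assuming none are released.
# worst_case_threads and max_files_per_batch are hypothetical helper names.

# worst_case_threads FILES THREADS_PER_FILE
worst_case_threads() {
  echo $(( $1 * $2 ))
}

# max_files_per_batch THREAD_BUDGET THREADS_PER_FILE
# Largest batch that stays within a given thread budget per astminer run
max_files_per_batch() {
  echo $(( $1 / $2 ))
}

worst_case_threads 100000 6    # prints 600000
max_files_per_batch 30000 6    # prints 5000 (30000 is an illustrative budget)
```

Comparing the first number with the limits reported by the OS gives a rough idea of why a single run over the whole dataset can fail, and the second number suggests a batch size if the dataset is processed in several runs.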

pppyx commented Dec 15, 2021

Hi, illided. Thank you for your prompt reply. I wrote a shell script that runs astminer on a small batch of the dataset at a time and concatenates the resulting path_contexts.c2s files. I think this can be a quick temporary solution for this problem.

illided (Contributor) commented Dec 15, 2021

Sounds interesting! Could you share this solution here in case anyone else runs into the same problem?

estiver-alvarez (Author) commented

Could you please share the script?

SpirinEgor (Contributor) commented

Hi everyone!
Running on small batches and concatenating the results afterwards looks interesting. But you should also be aware of the vocabulary collected for each batch: each run builds its own vocabulary, so numeric ids from different batches would not match. You need to disable the nodesToNumber property in the config.

pppyx commented Dec 16, 2021

Hi, sorry for not replying in time. This is my shell script. I haven't fully tested it, so treat it as sample code.
This snippet is for the validation dataset; the code for the train/test datasets is similar.

# Placeholders: replace with your own (absolute) paths and names
VAL_DIR="your val set"
AST_MINER_HOME="your astminer directory"
DATASET_NAME="your dataset name"

# model.yaml: an example is given below
MODEL_CONFIG=${AST_MINER_HOME}/model.yaml

CURRENT_WORK_DIRECTORY=$PWD
VAL_OUTPUT_DIR=${CURRENT_WORK_DIRECTORY}/data/${DATASET_NAME}/output/val

# mkdir -p also creates the missing parent directories
mkdir -p "${VAL_OUTPUT_DIR}/total"

# This is the c2s file after concatenating.
VAL_DATA_FILE=${VAL_OUTPUT_DIR}/total/path_contexts.c2s

# Create the file, or truncate it if it is left over from a previous run
: > "$VAL_DATA_FILE"

# Before this, you have to split your dataset into small folders!

# Temporary config, rewritten for every batch
VAL_CONFIG_TMP=$AST_MINER_HOME/val_tmp.yaml

cd "$AST_MINER_HOME" || exit 1
echo "change pwd to: $PWD"

# Loop over the sub-folders of VAL_DIR (a glob instead of parsing ls output)
for dir in "$VAL_DIR"/*/; do
  [ -d "$dir" ] || continue
  dir=${dir%/}
  echo "$dir"

  # Remove the output of the previous batch
  rm -f "${VAL_OUTPUT_DIR}/c/data/path_contexts.c2s"

  # Build the config for this batch: input/output dirs plus the shared model.yaml
  echo "inputDir: $dir" > "$VAL_CONFIG_TMP"
  echo "outputDir: ${VAL_OUTPUT_DIR}" >> "$VAL_CONFIG_TMP"
  cat "$MODEL_CONFIG" >> "$VAL_CONFIG_TMP"

  source "$AST_MINER_HOME/cli.sh" "$VAL_CONFIG_TMP"

  # Append the output of this batch to the concatenated file
  VAL_DATA_FILE_TMP=${VAL_OUTPUT_DIR}/c/data/path_contexts.c2s
  if [ -e "$VAL_DATA_FILE_TMP" ]; then
    cat "$VAL_DATA_FILE_TMP" >> "$VAL_DATA_FILE"
    echo "" >> "$VAL_DATA_FILE"
  fi
done
A sample model.yaml (without inputDir and outputDir, which the script adds for each batch):

# parse C/C++ files with the fuzzy parser
parser:
  name: fuzzy
  languages: [cpp,c]

# use function name as labels
# this selects the function level granularity
label:
  name: function name

# save path contexts to disk (code2vec storage)
storage:
  name: code2vec
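
The script above expects the dataset to be split into small folders beforehand. A minimal sketch of that step could look like the following. The function name `split_into_batches`, the `batch_00000` folder naming, and the choice to copy rather than move files are assumptions made for this illustration, not part of the script shared in this thread. File names containing newlines are not handled.

```shell
#!/usr/bin/env bash
# Sketch: copy C/C++ sources from SRC_DIR into numbered sub-folders of DST_DIR,
# with at most BATCH_SIZE files per folder.
# split_into_batches is a hypothetical helper name.

# split_into_batches SRC_DIR DST_DIR BATCH_SIZE
split_into_batches() {
  local src=$1 dst=$2 size=$3
  local i=0 batch
  find "$src" -type f \( -name '*.cpp' -o -name '*.c' \) | sort |
    while IFS= read -r f; do
      # Folder index is the file index divided by the batch size
      batch=$(printf '%s/batch_%05d' "$dst" $(( i / size )))
      mkdir -p "$batch"
      # Prefix with the running index so equal base names do not collide
      cp "$f" "$batch/${i}_$(basename "$f")"
      i=$(( i + 1 ))
    done
}

# Usage (paths are placeholders):
#   split_into_batches /path/to/val /path/to/val_batches 5000
```

The destination folder can then be used as VAL_DIR in the batching script.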

