Error Parsing C++ Files for Code2Seq #198

estiver-alvarez · 2021-11-15T01:40:27Z

Hi, I want to take this opportunity to express my admiration for your great work, which has been extremely useful for my Bachelor's thesis.
I am currently trying to convert my dataset of about 100,000 C++ files, but I always get this error.

Originally I thought the issue was related to memory capacity, but I quickly dismissed this because I work with machine instances with up to 150 GB of RAM. I also checked the limits set by the OS (Linux), and they are far above the number of threads that exist when the app crashes. In fact, given the capacity of the instances I have worked with, it should be possible to create one thread per C++ file without the system having any performance or configuration problems.

I have kept track of the JVM logs and examined them with an analysis tool. I found that over 99% of the threads are in the TIMED_WAITING state, and all of them are associated with MVStore.

Could you help me with this issue?

Thank you very much for your attention.
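
For readers who want to repeat the thread-state check described above, here is a minimal sketch. It assumes a thread dump has been saved to a text file (for example with the JDK's jstack tool). The function name count_thread_states is illustrative only and is not part of astminer.

```shell
# Count how many threads are in each state in a saved JVM thread dump.
# Assumption: the dump contains lines such as
#   java.lang.Thread.State: TIMED_WAITING (sleeping)
count_thread_states() {
  grep -o 'java.lang.Thread.State: [A-Z_]*' "$1" |
    awk '{ print $2 }' | sort | uniq -c | sort -rn
}
```

Called with the path of a dump file, it prints one line per state with the number of threads in that state, most frequent first. A state that dominates the list, as TIMED_WAITING did here, points at threads that are parked rather than doing work.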

estiver-alvarez · 2021-11-15T01:45:10Z

This is the configuration that I am currently using.

illided · 2021-11-15T08:30:22Z

Hi! At startup, astminer tells you how many threads will be created (by default only 1, but you can change this in the config with the numOfThreads option). Could you send us the full console output and log.txt (it should be created in the root of the project)?

estiver-alvarez · 2021-11-15T16:30:07Z

Of course, here is the file. Thanks for your support.
log.txt

illided · 2021-11-15T21:31:42Z

It's really strange. Astminer created only 1 thread, but some subcomponents created multiple threads. Could you also provide us with some of your system info: OS, gcc compiler version, JDK version, etc.? I think anything can help :)

In the meantime, you can run astminer from Docker. To do this, run ./gradlew shadowJar and docker build -t voudy/astminer . in the root of the project. From then on, the cli script will run a Docker container instead of the jar.

estiver-alvarez · 2021-11-18T04:31:29Z

@illided Thank you very much for your support. Unfortunately, I still have the issue. This time the processing got further than in the previous attempts, but it still did not reach even 20%. The strangest part is that this happened both when I ran it through the cli script and through Docker. The app runs on a machine instance with 4 CPUs and 132 GB of RAM, under Linux (Ubuntu 20.04 and Debian 10), with the latest stable version of Java 11.

P.S. When I ran the app under Docker, I did not know how to configure the Java heap parameters such as -Xms and -Xmx, so I ran it with the defaults.

Here are the last 100,000 lines of the log:
log1.txt

illided · 2021-11-18T21:04:18Z

OK, now it's super strange :)
Is it possible for you to share the data on which you run astminer? If the dataset is very large, at least a small part of it would help.

illided · 2021-11-18T21:10:39Z

Also, if you are in a real hurry, you can implement C++ support through a tree-sitter grammar. The parser that we are using for C++ right now is not the most stable.

We are planning to do this in the future, but a PR is welcome.

estiver-alvarez · 2021-11-19T15:09:19Z

@illided Of course, that is very kind of you :). At the following link you will find the dataset that I am trying to process.
https://drive.google.com/file/d/1CyNKBDYequb7izMAsk3wHj9A0HOWZsfc/view?usp=sharing

pppyx · 2021-12-15T18:55:57Z

I'm now running into the same situation: an OOM error when trying to create a thread, and my dataset is also very large.

I hope you find the cause soon, thanks!

illided · 2021-12-15T19:06:43Z

Sorry, I didn't have time to respond.

As I suspected, the fuzzy parser creates multiple threads and can't close them properly. In fact, it creates up to 6 threads for each file! This issue is fixed in the new version of the fuzzy parser, but unfortunately it's not compatible with astminer, and we have no time right now to fully rewrite this part :(

You can try to experiment with the code or, as I said earlier, try to implement support for C++ through tree-sitter. It shouldn't be that hard, as there are a lot of examples of what you need to implement.

See astminer/src/main/kotlin/astminer/parse/treesitter/java/
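
To put the "up to 6 threads for each file" figure in perspective, here is a small back-of-the-envelope sketch. It computes a worst case that assumes none of those threads is ever released, which may overstate what actually happens. The function name leaked_threads_estimate is made up for illustration.

```shell
# Rough upper bound on threads left behind if none are ever released.
# Arguments: number of files, threads created per file.
leaked_threads_estimate() {
  echo $(( $1 * $2 ))
}

# Dataset size and per-file figure taken from this thread.
leaked_threads_estimate 100000 6
```

On Linux, the result can be compared with the system-wide limit in /proc/sys/kernel/threads-max and the per-user limit reported by ulimit -u.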

pppyx · 2021-12-15T19:16:01Z

Hi, illided. Thank you for your prompt reply. I wrote a shell script that runs astminer on a small batch of the dataset each time and concatenates the resulting path_contexts.c2s files. I think this can serve as a quick temporary solution for this problem.

illided · 2021-12-15T19:18:02Z

Sounds interesting! Could you share this solution here in case anyone else runs into the same problem?

estiver-alvarez · 2021-12-15T19:33:05Z

Please, could you share the script?

SpirinEgor · 2021-12-16T10:34:55Z

Hi everyone!
Running on small batches and concatenating afterwards looks interesting. But you should also be aware that the vocabulary is collected separately for each batch, so you need to disable the nodesToNumber property in the config.

pppyx · 2021-12-16T14:15:20Z

Hi, sorry for not replying in time. This is my shell script. I haven't fully tested it, so treat it as sample code.
This snippet is for the validation dataset; the code for the train/test datasets is similar.

#!/usr/bin/env bash
# Use absolute paths: the script changes directory while it runs.
VAL_DIR=/path/to/your/val/set
AST_MINER_HOME=/path/to/your/astminer/directory
DATASET_NAME=your_dataset_name

# model.yaml: an example is given below
MODEL_CONFIG=${AST_MINER_HOME}/model.yaml

CURRENT_WORK_DIRECTORY=$PWD
VAL_OUTPUT_DIR=${CURRENT_WORK_DIRECTORY}/data/${DATASET_NAME}/output/val

# mkdir -p also creates every missing parent directory
mkdir -p "${VAL_OUTPUT_DIR}/total"

# This is the .c2s file after concatenation; create it empty.
# (The train/test variants need their own *_DATA_FILE variables.)
VAL_DATA_FILE=${VAL_OUTPUT_DIR}/total/path_contexts.c2s
: > "$VAL_DATA_FILE"

# Before this, you have to split your dataset into small folders!

cd "$VAL_DIR" || exit 1
echo "change pwd to: $PWD"

# Collect the sub-folders of VAL_DIR (one batch per folder)
shopt -s nullglob
dirs=(*/)
echo "${#dirs[@]}"

VAL_CONFIG_TMP=${AST_MINER_HOME}/val_tmp.yaml

cd "$AST_MINER_HOME" || exit 1
echo "change pwd to: $PWD"

for dir in "${dirs[@]}"; do
  dir=${dir%/}   # drop the trailing slash left by the glob
  echo "$VAL_DIR/$dir"

  # Remove the output of the previous batch
  VAL_DATA_FILE_TMP=${VAL_OUTPUT_DIR}/c/data/path_contexts.c2s
  rm -f "$VAL_DATA_FILE_TMP"

  # Build the per-batch config: input/output dirs plus the shared model.yaml
  echo "inputDir: $VAL_DIR/$dir" > "$VAL_CONFIG_TMP"
  echo "outputDir: ${VAL_OUTPUT_DIR}" >> "$VAL_CONFIG_TMP"
  cat "$MODEL_CONFIG" >> "$VAL_CONFIG_TMP"

  source "$AST_MINER_HOME/cli.sh" "$VAL_CONFIG_TMP"

  # Append this batch's result to the concatenated file
  if [ -e "$VAL_DATA_FILE_TMP" ]; then
    cat "$VAL_DATA_FILE_TMP" >> "$VAL_DATA_FILE"
    echo "" >> "$VAL_DATA_FILE"
  fi
done

A sample model.yaml (inputDir and outputDir are not included):


# parse C/C++ files with the fuzzy parser
parser:
  name: fuzzy
  languages: [cpp,c]

# use function name as labels
# this selects the function level granularity
label:
  name: function name

# save to disk ASTs in the code2seq format
storage:
  name: code2vec
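
The script above expects the dataset to be split into small folders beforehand. One possible way to do that is sketched below. The function name split_into_batches, the batch_NNNNN folder naming, and the batch size are illustrative assumptions, not something astminer provides. The sketch copies files flat into each batch folder, so files that share a base name within one batch would overwrite each other, and it does not handle file names that contain newlines. The destination should be outside the source directory.

```shell
# split_into_batches SRC_DIR DEST_DIR BATCH_SIZE
# Copy every .c/.cpp file under SRC_DIR into DEST_DIR/batch_00000,
# DEST_DIR/batch_00001, ... with at most BATCH_SIZE files per folder.
# The body runs in a subshell "( ... )" so its variables stay local.
split_into_batches() (
  src=$1; dest=$2; size=$3
  find "$src" -type f \( -name '*.cpp' -o -name '*.c' \) | sort |
  {
    i=0
    while IFS= read -r file; do
      # Integer division maps file index i to its batch number
      batch=$(printf '%s/batch_%05d' "$dest" $(( i / size )))
      mkdir -p "$batch"
      cp "$file" "$batch/"
      i=$(( i + 1 ))
    done
  }
)
```

For example, split_into_batches /path/to/val /path/to/val_batches 1000 would produce folders of up to 1000 files each, and VAL_DIR in the script could then point at /path/to/val_batches. A suitable batch size is something to find by experiment.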

estiver-alvarez closed this as completed Nov 15, 2021

estiver-alvarez reopened this Nov 15, 2021

SpirinEgor mentioned this issue Feb 7, 2022

Integrating astminer with code2vec for C source codes #200

Closed
