Add JS support #22

Closed
egor-bogomolov opened this issue Jul 2, 2019 · 17 comments
Labels
languages Issues regarding support of new languages

Comments

@egor-bogomolov
Collaborator

Add JS support via

  1. ANTLR grammar
  2. Wrap an existing tool
@utkarsh-agrawaal

Hey, any update on JS support?

@elena-lyulina
Contributor

Hey @utkarsh-agrawaal, the news is as follows:
Regarding the first option, we hope to add JS support via an ANTLR grammar in a few days, but there are some issues, since a context-free grammar does not fully describe JS syntax (here you can read about it in detail), so please make sure that is okay for you.
Regarding the second option, we were unable to find any tool that would suit our needs.

@HaseebLUMS

HaseebLUMS commented Oct 12, 2019

Hey @elena-lyulina , any update on JS support?

@elena-lyulina
Contributor

elena-lyulina commented Oct 14, 2019

Hey @HaseebLUMS,
A beta version of the JS parser is under review.
May I ask what you plan to use the parser for?

@HaseebLUMS

@elena-lyulina
I need it to use with code2vec in a project where I am applying different ML techniques to scripts from web pages.

When should I expect the JS parser to be publicly released?

@elena-lyulina
Contributor

@HaseebLUMS
Let's see what @egor-bogomolov can say about it, because the review stage depends on him.

@HaseebLUMS

Hello @egor-bogomolov
Can you please give a tentative date for when the JS parser will be ready?

@egor-bogomolov
Collaborator Author

@HaseebLUMS I will review the parser's code tonight.

@HaseebLUMS

@egor-bogomolov
Thank you. Will the output of this parser be compatible with code2vec, like the other parsers?

@egor-bogomolov
Collaborator Author

@HaseebLUMS hope so :)

@kvenux

kvenux commented Nov 6, 2019

@egor-bogomolov Hi, how's it going?
I need some support here. Many thanks!

egor-bogomolov added the "languages" label (Issues regarding support of new languages) on Dec 7, 2019
@nashid

nashid commented Feb 16, 2021

@egor-bogomolov @elena-lyulina I would like to know what the current status of JS support in astminer is. Is there a plan for when JS will be supported?

@egor-bogomolov
Collaborator Author

Hi @nashid, big thanks for reminding us about JS :)
I added JS to the CLI (see #123). You can build the cli-javascript branch yourself and use it right away. If you need any further help, don't hesitate to contact us.

@nashid

nashid commented Feb 17, 2021

@egor-bogomolov I have attempted to use it with the following sample input:

example:
sum(a, b)

execution:
./cli.sh pathContexts --lang js --project context-ml-dataset --output context-ml-dataset-output --maxL 5000 --maxW 5000 --maxContexts 10 --maxTokens 5000 --maxPaths 10

Output files:

tokens.csv

id,token
3,
2,a
4,b
1,(
5,)

node_types.csv

id,node_type
1,OpenParen UP
2,arguments TOP
4,Comma DOWN
3,singleExpression|Identifier DOWN
6,singleExpression|Identifier UP
5,CloseParen DOWN
7,Comma UP

paths.csv

id,path
1,1 2 3
2,1 2 4
3,1 2 5
5,6 2 3
4,6 2 4
7,7 2 3
6,6 2 5
8,7 2 5

path_contexts.csv
context-ml-dataset/temp.js 1,1,2 1,2,3 1,1,4 1,3,5 2,4,3 2,5,4 2,6,5 3,7,4 3,8,5 4,6,5

A couple of pertinent questions:

  • Firstly, why do some paths contain parentheses and commas?

    • Is there a way to omit or suppress the comma (,) and the left and right parentheses, since they are not supposed to be part of the AST path?

  • Finally, I presume that before feeding the output into code2vec we are supposed to replace the IDs in path_contexts with the actual values from tokens, node_types, and paths? I understand that PathMiner stores IDs in the path contexts to reduce memory usage, and I can write a simple Python snippet to replace them. Or am I missing something, i.e. can PathMiner also perform the token replacement?

I would also be curious to know how to make the path output closer to the AST, i.e. without commas and parentheses.

Also AST output from Esprima illustrating the problem:
[image: Esprima AST output]

I am happy to contribute to the repo as required.

@SpirinEgor
Contributor

Hi!

Storing parentheses, commas, etc. is strange, but I think this is determined by the ANTLR4 grammar that we use for JS. Maybe ANTLR4 has some parameters for rule generation that control how parsing is set up... Btw, did you try running astminer on more complex examples? For example, on some functions?

As for converting the IDs in paths back to words, it's completely unnecessary. You can already feed this data into code2vec. You need this back conversion only at inference time, to produce readable output for users.
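For readers who still want to inspect the output by eye, the back conversion can be sketched in a few lines of Python. This is a minimal sketch, not part of astminer: the helper names `load_mapping` and `decode_line` are hypothetical, and it assumes the file layout shown in the sample output above (`id,value` rows with a header, and a path_contexts line made of a label followed by space-separated `start_token_id,path_id,end_token_id` triples). The sample rows are copied from that output.

```python
from typing import Dict, Iterable, List, Tuple


def load_mapping(lines: Iterable[str]) -> Dict[int, str]:
    """Parse 'id,value' rows (header first) into a dict keyed by integer id."""
    rows = iter(lines)
    next(rows, None)  # skip the header row, e.g. 'id,token'
    mapping: Dict[int, str] = {}
    for line in rows:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first comma only, because a value may itself be ','.
        key, _, value = line.partition(",")
        mapping[int(key)] = value
    return mapping


def decode_line(
    line: str,
    tokens: Dict[int, str],
    node_types: Dict[int, str],
    paths: Dict[int, str],
) -> Tuple[str, List[Tuple[str, List[str], str]]]:
    """Turn one path_contexts line into (label, [(start, node types, end), ...]).

    Assumes the label (here a file path) contains no spaces.
    """
    label, *contexts = line.split()
    decoded = []
    for context in contexts:
        start_id, path_id, end_id = (int(part) for part in context.split(","))
        # A path is a space-separated list of node type ids.
        path = [node_types[int(node_id)] for node_id in paths[path_id].split()]
        decoded.append((tokens[start_id], path, tokens[end_id]))
    return label, decoded


# Sample rows copied from the output in this thread; in practice these
# would be read from tokens.csv, node_types.csv and paths.csv.
TOKENS = load_mapping(["id,token", "3,", "2,a", "4,b", "1,(", "5,)"])
NODE_TYPES = load_mapping([
    "id,node_type",
    "1,OpenParen UP",
    "2,arguments TOP",
    "4,Comma DOWN",
    "3,singleExpression|Identifier DOWN",
    "6,singleExpression|Identifier UP",
    "5,CloseParen DOWN",
    "7,Comma UP",
])
PATHS = load_mapping([
    "id,path", "1,1 2 3", "2,1 2 4", "3,1 2 5", "5,6 2 3",
    "4,6 2 4", "7,7 2 3", "6,6 2 5", "8,7 2 5",
])

label, decoded = decode_line(
    "context-ml-dataset/temp.js 1,1,2 2,5,4", TOKENS, NODE_TYPES, PATHS
)
for start, path, end in decoded:
    print(label, repr(start), path, repr(end))
```

The node types are kept as a list rather than joined into one string, because each node type already contains a space (the direction suffix such as UP or DOWN).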

@egor-bogomolov
Collaborator Author

@nashid, you don't see the sum token because you've set a very tight limit on the number of extracted contexts -- only 10. If you raise it up to, let's say, 100, you will see that all the expected tokens are there.

As @SpirinEgor mentioned, all the non-alphanumeric tokens (like , and () are due to the ANTLR4 grammar used under the hood. If you run the code2vec task instead of pathContexts, all such tokens will be replaced with EMPTY_TOKEN. You can either change the code a little so that such contexts are not stored, or just clean them up afterward.
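Cleaning the contexts afterward could look roughly like the sketch below. It is a post-processing script, not an astminer feature: the names `is_meaningful` and `clean_line` are hypothetical, and the rule "keep a context only if both of its tokens contain at least one letter or digit" is an assumption that also drops identifiers such as `_` or `$`. The token table and the path_contexts line are copied from the sample output earlier in this thread.

```python
from typing import Dict


def is_meaningful(token: str) -> bool:
    """True if the token contains at least one letter or digit.

    Assumption: punctuation-only and empty tokens are noise. Identifiers
    made only of symbols (for example '_' or '$') are dropped as well.
    """
    return any(ch.isalnum() for ch in token)


def clean_line(line: str, tokens: Dict[int, str]) -> str:
    """Drop contexts whose start or end token is punctuation-only or empty.

    Expects a label followed by space-separated
    'start_token_id,path_id,end_token_id' triples, as in path_contexts.csv.
    """
    label, *contexts = line.split()
    kept = []
    for context in contexts:
        start_id, _path_id, end_id = context.split(",")
        if is_meaningful(tokens[int(start_id)]) and is_meaningful(tokens[int(end_id)]):
            kept.append(context)
    return " ".join([label, *kept])


# Token table as listed in tokens.csv above (token 3 is shown with an empty value).
SAMPLE_TOKENS = {1: "(", 2: "a", 3: "", 4: "b", 5: ")"}

SAMPLE_LINE = (
    "context-ml-dataset/temp.js "
    "1,1,2 1,2,3 1,1,4 1,3,5 2,4,3 2,5,4 2,6,5 3,7,4 3,8,5 4,6,5"
)

print(clean_line(SAMPLE_LINE, SAMPLE_TOKENS))
```

On the sample line, only the context connecting the two identifiers a and b survives. Entries that become unused in paths.csv and tokens.csv are left in place, since the remaining IDs still resolve correctly.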

@SpirinEgor I guess we need to work on the configuration so that we can automatically ignore such tokens and corresponding contexts.

@SpirinEgor
Contributor

Since JavaScript support has been added and there are no open questions at the moment, I will close the issue. Feel free to reopen it at any time if you have any.

7 participants