Node range #194

Merged on Nov 12, 2021 (52 commits)
Changes from 48 commits
Commits
65a489d
gumtree update
illided Oct 23, 2021
4d88dde
normalization and token logic extracted as separate class
illided Oct 24, 2021
b681d84
normalization and token logic extracted as separate class
illided Oct 24, 2021
b6ea814
Merge remote-tracking branch 'origin/token_range' into token_range
illided Oct 24, 2021
2c9e535
renaming and refactoring
illided Oct 24, 2021
c3413c5
code style fixes
illided Oct 24, 2021
b5ab209
init restrictions removed
illided Oct 24, 2021
84badf7
token range added to javaparser
illided Oct 24, 2021
8b006e5
javaparser doc update
illided Oct 24, 2021
32a6917
spoon token position introduced
illided Oct 24, 2021
8d376d9
range now is a Node field + some interface rearrangement
illided Oct 28, 2021
95400d8
test fixes
illided Oct 28, 2021
b858836
simple node default parameters added
illided Oct 28, 2021
4ca400d
node range introduced in ANTLR
illided Oct 28, 2021
02192f5
ANTLR util refactor
illided Oct 28, 2021
8d15d54
spoon bug fixed
illided Oct 28, 2021
8798c06
node range refactor
illided Oct 28, 2021
7badac3
foreign parser update
illided Oct 28, 2021
5523bc7
added nodeRange extraction in tree sitter
illided Oct 28, 2021
abbb4f1
setup.py code style refactor
illided Oct 28, 2021
4194258
node range support in gumtree
illided Oct 28, 2021
dfcb5eb
doc little fix
illided Oct 28, 2021
a4eceee
code style fixes
illided Oct 28, 2021
38e705b
serial name change
illided Oct 30, 2021
734a2a9
new option in json ast storage
illided Oct 30, 2021
da2faf4
option to disable range serialization properly added
illided Oct 30, 2021
d4e60b3
Merge branch 'master' into token_range
illided Nov 2, 2021
c47517d
test fixed
illided Nov 2, 2021
f4bea6e
now it compiles
illided Nov 2, 2021
e2b9750
range support added
illided Nov 2, 2021
70b820e
Merge branch 'gumtree_update' into token_range
illided Nov 3, 2021
3d7ae17
now it compiles
illided Nov 3, 2021
5844f8d
code style fixes
illided Nov 3, 2021
1f4999a
Merge branch 'master' into token_range
illided Nov 4, 2021
beab3d2
position tests added
illided Nov 4, 2021
c1df042
tree sitter bug fix
illided Nov 4, 2021
383425e
unused imports removed
illided Nov 4, 2021
2ba6bef
expected function positions changed
illided Nov 4, 2021
edc24d3
positions in javalang test adjusted
illided Nov 4, 2021
396ab20
doc fix
illided Nov 4, 2021
2186e89
Normalization refactor
illided Nov 9, 2021
8a64df8
EMPTY token updated
illided Nov 9, 2021
5898169
file renamed and redundant map deleted
illided Nov 9, 2021
cbb70f0
code style fixes
illided Nov 9, 2021
f2a96a6
test fixed
illided Nov 9, 2021
64219d2
normalization object deleted
illided Nov 10, 2021
66716c0
removed empty line
illided Nov 10, 2021
bd76ac1
documentation added
illided Nov 10, 2021
95c33e5
doc and normalization fixes
illided Nov 11, 2021
bf5914b
docs update
illided Nov 11, 2021
5f86b14
Improve storage documentation
SpirinEgor Nov 12, 2021
37ae185
Change snippet format
SpirinEgor Nov 12, 2021
2 changes: 2 additions & 0 deletions detekt.yaml
@@ -26,6 +26,8 @@ style:
     max: 5
   WildcardImport:
     active: false
+  UseDataClass:
+    allowVars: true
Contributor:

Where do you use vars in a data class?

Contributor (author):

I use a var in the Token class. For some reason, without this option detekt suggests making it a data class, and when I do that, detekt reports the var usage in Token.

Contributor:

As far as I can see, Token is not a data class at the moment, and detekt doesn't report anything.


 formatting:
   autoCorrect: true
23 changes: 23 additions & 0 deletions src/main/kotlin/astminer/common/SimpleNode.kt
@@ -0,0 +1,23 @@
package astminer.common

import astminer.common.model.Node
import astminer.common.model.NodeRange

/** Node simplest implementation **/
class SimpleNode(
    override val typeLabel: String,
    override val children: MutableList<SimpleNode>,
    override val parent: Node? = null,
    override val range: NodeRange? = null,
    token: String?
) : Node(token) {
    override fun removeChildrenOfType(typeLabel: String) {
        children.removeIf { it.typeLabel == typeLabel }
    }

    override fun getChildrenOfType(typeLabel: String) = super.getChildrenOfType(typeLabel).map { it as SimpleNode }
    override fun getChildOfType(typeLabel: String) = super.getChildOfType(typeLabel) as? SimpleNode

    override fun preOrder() = super.preOrder().map { it as SimpleNode }
    override fun postOrder() = super.postOrder().map { it as SimpleNode }
}
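
To illustrate what an optional range on a node looks like in use, here is a minimal, self-contained sketch. It relies on simplified stand-ins (`Position`, `NodeRange`, and a hypothetical `SketchNode`) rather than astminer's actual `Node` hierarchy, and it leaves out the serialization annotations.

```kotlin
// Simplified stand-ins that mirror the names in this PR; this is not the library code.
data class Position(val line: Int, val column: Int)

data class NodeRange(val start: Position, val end: Position) {
    override fun toString(): String = "[${start.line}, ${start.column}] - [${end.line}, ${end.column}]"
}

// Hypothetical node type: a label, an optional token, an optional range, and children.
class SketchNode(
    val typeLabel: String,
    val token: String? = null,
    val range: NodeRange? = null,
    val children: MutableList<SketchNode> = mutableListOf()
) {
    // Visit the node itself first, then each subtree from left to right.
    fun preOrder(): List<SketchNode> = listOf(this) + children.flatMap { it.preOrder() }
}

// Builds a three-node tree; the labels and tokens are illustrative values.
fun buildSampleTree(): SketchNode {
    val root = SketchNode("i_am_root", range = NodeRange(Position(0, 0), Position(1, 6)))
    root.children += SketchNode("left_child", "Hello", NodeRange(Position(0, 0), Position(0, 5)))
    root.children += SketchNode("right_child", "World!", NodeRange(Position(1, 0), Position(1, 6)))
    return root
}

fun main() {
    buildSampleTree().preOrder().forEach { println("${it.typeLabel} ${it.range}") }
}
```

Because `range` defaults to `null`, parsers that cannot report positions can still construct nodes without extra arguments.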
@@ -1,19 +1,35 @@
package astminer.common

-const val EMPTY_TOKEN = "EMPTY"
+const val EMPTY_TOKEN = "<E>"
+const val TOKEN_DELIMITER = "|"

+/** Splits tokens in sub-tokens and normalizes them by removing new lines, whitespaces, quotes etc
Contributor:

Marks the token as empty, or splits it into subtokens and normalizes each of them.

+ * @see splitToSubtokens
+ * @see normalizeSubToken**/
Contributor:

* @see normalizeSubToken
*/

+fun normalizeToken(token: String?): String {
+    if (token == null) return EMPTY_TOKEN
+    val subTokens = splitToSubtokens(token)
+    return if (subTokens.isEmpty()) EMPTY_TOKEN else subTokens.joinToString(TOKEN_DELIMITER)
+}
+
 /**
  * The function was adopted from the original code2vec implementation in order to match their behavior:
  * https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/Common/Common.java
  */
Contributor:

/**
* Splits a token into subtokens following common naming conventions, e.g. `camelCase` or `snake_case`.
* Returns a list of non-empty, normalized subtokens.
* The function was adopted from the original code2vec implementation in order to match their behavior:
 * https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/Common/Common.java
* @see normalizeToken
*/

+fun splitToSubtokens(token: String) = token
+    .trim()
+    .split(splitRegex)
+    .map { s -> normalizeSubToken(s, "") }
+    .filter { it.isNotEmpty() }
+    .toList()

-val newLineReg = "\\\\n".toRegex()
-val whitespaceReg = "//s+".toRegex()
-val quotesApostrophesCommasReg = "[\"',]".toRegex()
-val unicodeWeirdCharReg = "\\P{Print}".toRegex()
-val notALetterReg = "[^A-Za-z]".toRegex()
+private val splitRegex = "(?<=[a-z])(?=[A-Z])|_|[0-9]|(?<=[A-Z])(?=[A-Z][a-z])|\\s+".toRegex()

-fun normalizeToken(token: String, defaultToken: String): String {
+/**
+ * The function was adopted from the original code2vec implementation in order to match their behavior:
+ * https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/Common/Common.java
+ */
Contributor:

/**
* Normalizes a token by converting it to lower case and removing new lines, whitespace, quotes, and other weird Unicode characters.
* The function was adopted from the original code2vec implementation in order to match their behavior:
* https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/Common/Common.java
*/

+fun normalizeSubToken(token: String, defaultToken: String): String {
Contributor:

Rename it back to normalizeToken, since there is no special behavior for subtokens; it may also be applied to a full token.

     val cleanToken = token.lowercase()
         .replace(newLineReg, "") // escaped new line
         .replace(whitespaceReg, "") // whitespaces
@@ -30,16 +46,8 @@ fun normalizeToken(token: String, defaultToken: String): String {
     }
 }

-/**
- * The function was adopted from the original code2vec implementation in order to match their behavior:
- * https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/Common/Common.java
- */
-
-val splitRegex = "(?<=[a-z])(?=[A-Z])|_|[0-9]|(?<=[A-Z])(?=[A-Z][a-z])|\\s+".toRegex()
-
-fun splitToSubtokens(token: String) = token
-    .trim()
-    .split(splitRegex)
-    .map { s -> normalizeToken(s, "") }
-    .filter { it.isNotEmpty() }
-    .toList()
+private val newLineReg = "\\\\n".toRegex()
+private val whitespaceReg = "//s+".toRegex()
+private val quotesApostrophesCommasReg = "[\"',]".toRegex()
+private val unicodeWeirdCharReg = "\\P{Print}".toRegex()
+private val notALetterReg = "[^A-Za-z]".toRegex()
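
The split-then-normalize pipeline in this file can be tried out with the rough sketch below. It reuses the split pattern shown in the diff, but the body of the per-subtoken normalization is collapsed in this view, so the sketch approximates it by lowercasing and dropping every non-letter; that approximation and all function names ending in `Sketch` are assumptions. One observation on the diff itself: as written, the `"//s+"` pattern matches two literal slashes followed by one or more `s` characters rather than whitespace, so the sketch does not reproduce it.

```kotlin
val EMPTY_TOKEN = "<E>"
val TOKEN_DELIMITER = "|"

// Same split pattern as in the diff: camelCase and acronym boundaries, underscores, digits, whitespace.
val splitRegex = "(?<=[a-z])(?=[A-Z])|_|[0-9]|(?<=[A-Z])(?=[A-Z][a-z])|\\s+".toRegex()

// Assumption: the collapsed normalization body is approximated by removing everything except a-z.
val notALetterRegex = "[^a-z]".toRegex()

fun normalizeSubTokenSketch(subToken: String): String =
    subToken.lowercase().replace(notALetterRegex, "")

// Split the token, normalize each piece, and drop pieces that end up empty.
fun splitToSubtokensSketch(token: String): List<String> = token
    .trim()
    .split(splitRegex)
    .map { normalizeSubTokenSketch(it) }
    .filter { it.isNotEmpty() }

// A null token, or a token with no usable subtokens, maps to the empty-token marker.
fun normalizeTokenSketch(token: String?): String {
    if (token == null) return EMPTY_TOKEN
    val subTokens = splitToSubtokensSketch(token)
    return if (subTokens.isEmpty()) EMPTY_TOKEN else subTokens.joinToString(TOKEN_DELIMITER)
}

fun main() {
    listOf("getFooBar", "HTTPServer", "snake_case_2", "123", null).forEach {
        println("$it -> ${normalizeTokenSketch(it)}")
    }
}
```

With this sketch, `getFooBar` becomes `get|foo|bar`, and a token made only of digits falls back to `<E>`.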
2 changes: 1 addition & 1 deletion src/main/kotlin/astminer/common/model/FunctionInfoModel.kt
@@ -20,7 +20,7 @@ interface NamedTree<T : Node> {
     val nameNode: T?
         get() = notImplemented("nameNode")
     val name: String?
-        get() = nameNode?.originalToken
+        get() = nameNode?.token?.original
     val root: T
         get() = notImplemented("root")
     val body: T?
49 changes: 13 additions & 36 deletions src/main/kotlin/astminer/common/model/ParsingModel.kt
@@ -1,31 +1,22 @@
package astminer.common.model

-import astminer.common.EMPTY_TOKEN
-import astminer.common.splitToSubtokens
+import kotlinx.serialization.SerialName
+import kotlinx.serialization.Serializable
 import java.io.File
 import java.io.InputStream
-import java.util.*

-abstract class Node(val originalToken: String?) {
+abstract class Node(originalToken: String?) {
     abstract val typeLabel: String
     abstract val children: List<Node>
     abstract val parent: Node?
-
-    val normalizedToken: String =
-        originalToken?.let {
-            val subtokens = splitToSubtokens(it)
-            if (subtokens.isEmpty()) EMPTY_TOKEN else subtokens.joinToString(TOKEN_DELIMITER)
-        } ?: EMPTY_TOKEN
-
-    var technicalToken: String? = null
-
-    val token: String
-        get() = technicalToken ?: normalizedToken
-
+    abstract val range: NodeRange?
     val metadata: MutableMap<String, Any> = HashMap()
+    val token = Token(originalToken)

     fun isLeaf() = children.isEmpty()
+
+    override fun toString(): String = "$typeLabel : $token"

     fun prettyPrint(indent: Int = 0, indentSymbol: String = "--") {
         repeat(indent) { print(indentSymbol) }
         println(this)
@@ -52,30 +43,16 @@ abstract class Node(val originalToken: String?) {

     fun postOrderIterator(): Iterator<Node> = postOrder().listIterator()
     open fun postOrder(): List<Node> = mutableListOf<Node>().also { doTraversePostOrder(it) }
-
-    companion object {
-        const val TOKEN_DELIMITER = "|"
-    }
 }

-/** Node simplest implementation **/
-class SimpleNode(
-    override val typeLabel: String,
-    override val children: MutableList<SimpleNode>,
-    override val parent: Node?,
-    token: String?
-) : Node(token) {
-    override fun removeChildrenOfType(typeLabel: String) {
-        children.removeIf { it.typeLabel == typeLabel }
-    }
-
-    override fun getChildrenOfType(typeLabel: String) = super.getChildrenOfType(typeLabel).map { it as SimpleNode }
-    override fun getChildOfType(typeLabel: String) = super.getChildOfType(typeLabel) as? SimpleNode
-
-    override fun preOrder() = super.preOrder().map { it as SimpleNode }
-    override fun postOrder() = super.postOrder().map { it as SimpleNode }
+@Serializable
+data class NodeRange(val start: Position, val end: Position) {
+    override fun toString(): String = "[${start.line}, ${start.column}] - [${end.line}, ${end.column}]"
 }
+
+@Serializable
+data class Position(@SerialName("l") val line: Int, @SerialName("c") val column: Int)

 interface Parser<T : Node> {
     /**
      * Parse input stream into an AST.
23 changes: 23 additions & 0 deletions src/main/kotlin/astminer/common/model/Token.kt
@@ -0,0 +1,23 @@
package astminer.common.model

import astminer.common.normalizeToken

class Token(val original: String?) {
Contributor:

/**
* Class to wrap logic with token processing.
* It is responsible for token normalization or replacing it with technical information.
* Use `token.original` to access the original token.
*/

    /** Final token after all normalizations and shadowing
     * @see technical
     * @see normalized **/
Contributor:

/**
* Access to the final representation of the token after normalization and other preprocessing.
* It returns the assigned technical token if it exists, or the normalized token otherwise.
*/

    val final: String
        get() = technical ?: normalized
Contributor:

Maybe it's better to declare final as a function?

fun final() = technical ?: normalized

From a usage perspective, a property looks like something created together with the class instance and fixed afterwards.
For the final token representation, I want to make it explicit that it is calculated anew on every request.


    /** Token that shadows any original or normalized token
     * and have the most priority in calculating final token
     * that will be saved. It can be useful when it's necessary to hide something
     * (for example method name in method name prediction problem) **/
Contributor:

/**
* The technical token is used to shadow the original token with a mining-pipeline-specific value.
* For example, for the method name prediction problem
* we want to set the technical `<METHOD_NAME>` token to hide the real method name.
*/

    var technical: String? = null

    /** Original token after string normalization
     * @see normalizeToken **/
Contributor:

/**
* Applies the normalization algorithm to the original token.
* @see normalizeToken
*/

    val normalized = normalizeToken(original)

    override fun toString(): String = final
}
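
The precedence between the original, normalized, and technical representations can be sketched as follows. This is an illustrative stand-in, not the merged class: `TokenSketch` and `normalizeSketch` are hypothetical names, the normalization is reduced to lowercasing, and `final` is written as a function, following the reviewer's suggestion, to show that the result is recomputed on every call.

```kotlin
// Hypothetical stand-in for normalizeToken: lowercases and falls back to "<E>" for null or blank input.
fun normalizeSketch(token: String?): String =
    token?.lowercase()?.takeIf { it.isNotBlank() } ?: "<E>"

class TokenSketch(val original: String?) {
    // Set by a pipeline step to shadow the real token, e.g. to hide a method name.
    var technical: String? = null

    // Computed once, when the instance is created.
    val normalized: String = normalizeSketch(original)

    // A function rather than a property, to signal that the result is re-evaluated on every call.
    fun final(): String = technical ?: normalized

    override fun toString(): String = final()
}

fun main() {
    val token = TokenSketch("FooBar")
    println(token.final())
    token.technical = "<METHOD_NAME>"
    println(token.final())
    println(token.original)
}
```

Setting `technical` changes what `final()` returns, while `original` and `normalized` stay untouched, which is why the original value remains available for labels.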
12 changes: 10 additions & 2 deletions src/main/kotlin/astminer/config/StorageConfigs.kt
@@ -42,8 +42,16 @@ class DotAstStorageConfig : StorageConfig() {
*/
@Serializable
@SerialName("json AST")
-class JsonAstStorageConfig(private val withPaths: Boolean = false) : StorageConfig() {
-    override fun createStorage(outputDirectoryPath: String) = JsonAstStorage(outputDirectoryPath, withPaths)
+class JsonAstStorageConfig(
+    private val withPaths: Boolean = false,
+    private val withRanges: Boolean = false
+) : StorageConfig() {
+    override fun createStorage(outputDirectoryPath: String) =
+        JsonAstStorage(
+            outputDirectoryPath,
+            withPaths,
+            withRanges
+        )
}

/**
2 changes: 1 addition & 1 deletion src/main/kotlin/astminer/featureextraction/TreeFeature.kt
@@ -57,7 +57,7 @@ object Tokens : TreeFeature<List<String>> {

     private fun findTokens(node: Node, tokensList: MutableList<String>): List<String> {
         node.children.forEach { findTokens(it, tokensList) }
-        tokensList.add(node.token)
+        tokensList.add(node.token.final)
         return tokensList
     }
 }
5 changes: 3 additions & 2 deletions src/main/kotlin/astminer/filters/CommonFilters.kt
@@ -1,5 +1,6 @@
package astminer.filters

+import astminer.common.TOKEN_DELIMITER
import astminer.common.model.*
import astminer.featureextraction.NumberOfNodes

@@ -23,8 +24,8 @@ class TreeSizeFilter(private val minSize: Int = 0, private val maxSize: Int? = n
  * Filter that excludes trees that have more words than [maxWordsNumber] in any token of their node.
  */
 class WordsNumberFilter(private val maxWordsNumber: Int) : FunctionFilter, FileFilter {
-    private fun validateTree(root: Node) =
-        !root.preOrder().any { node -> node.token.split(Node.TOKEN_DELIMITER).size > maxWordsNumber }
+    private fun validateTree(root: Node) = root.preOrder()
+        .none { node -> node.token.final.split(TOKEN_DELIMITER).size > maxWordsNumber }

override fun validate(functionInfo: FunctionInfo<out Node>) = validateTree(functionInfo.root)

@@ -14,13 +14,13 @@ object FunctionNameLabelExtractor : FunctionLabelExtractor {
     private const val RECURSIVE_CALL_TOKEN = "SELF"

     override fun process(functionInfo: FunctionInfo<out Node>): LabeledResult<out Node>? {
-        val normalizedName = functionInfo.nameNode?.normalizedToken ?: return null
+        val normalizedName = functionInfo.nameNode?.token?.normalized ?: return null
         functionInfo.root.preOrder().forEach { node ->
-            if (node.originalToken == functionInfo.nameNode?.originalToken) {
-                node.technicalToken = RECURSIVE_CALL_TOKEN
+            if (node.token.original == functionInfo.nameNode?.token?.original) {
+                node.token.technical = RECURSIVE_CALL_TOKEN
             }
         }
-        functionInfo.nameNode?.technicalToken = HIDDEN_METHOD_NAME_TOKEN
+        functionInfo.nameNode?.token?.technical = HIDDEN_METHOD_NAME_TOKEN
         return LabeledResult(functionInfo.root, normalizedName, functionInfo.qualifiedPath)
     }
}
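
The label extraction above can be followed on a toy tree with the sketch below. It is a simplified illustration under stated assumptions: `NameNode` is a hypothetical node type, the label is the raw original token rather than its normalized form, and the value `"METHOD_NAME"` for the hidden-name marker is assumed because the constant's definition is not visible in this diff.

```kotlin
// Hypothetical, simplified node: only the original token, a technical override, and children.
class NameNode(val original: String?, val children: MutableList<NameNode> = mutableListOf()) {
    var technical: String? = null
    fun preOrder(): List<NameNode> = listOf(this) + children.flatMap { it.preOrder() }
}

val RECURSIVE_CALL_TOKEN = "SELF"
val HIDDEN_METHOD_NAME_TOKEN = "METHOD_NAME" // assumed value, not shown in the diff

// Returns the label (the function name) and shadows every occurrence of it in the tree.
fun hideFunctionName(root: NameNode, nameNode: NameNode): String? {
    val label = nameNode.original ?: return null
    // First mark every node carrying the function's name as a recursive call...
    root.preOrder().forEach { node ->
        if (node.original == label) node.technical = RECURSIVE_CALL_TOKEN
    }
    // ...then overwrite the declaration node itself with the hidden-name marker.
    nameNode.technical = HIDDEN_METHOD_NAME_TOKEN
    return label
}

fun main() {
    val name = NameNode("fact")
    val call = NameNode("fact")
    val root = NameNode(null, mutableListOf(name, call, NameNode("n")))
    println(hideFunctionName(root, name))
    println(root.preOrder().map { it.technical })
}
```

The order of the two steps matters: the declaration node also matches the name comparison, so it is marked as a recursive call first and only then overwritten.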
35 changes: 30 additions & 5 deletions src/main/kotlin/astminer/parse/ForeignParser.kt
@@ -1,7 +1,8 @@
package astminer.parse

+import astminer.common.SimpleNode
+import astminer.common.model.NodeRange
 import astminer.common.model.Parser
-import astminer.common.model.SimpleNode
import astminer.config.FileExtension
import astminer.config.ParserType
import kotlinx.serialization.Serializable
@@ -23,17 +24,29 @@ import kotlin.io.path.createTempDirectory
  *         {
  *             "token": null,
  *             "nodeType": "i_am_root",
- *             "children": [1,2]
+ *             "children": [1,2],
+ *             "range" : {
+ *                 "start" : { "l" : 0, "c" : 0 },
+ *                 "end" : { "l" : 1, "c" : 4 }
+ *             }
  *         },
  *         {
  *             "token": "Hello",
  *             "nodeType": "left_child",
  *             "children": [],
+ *             "range" : {
+ *                 "start" : { "l" : 0, "c": 0 },
+ *                 "end" : { "l": 0, "c": 5 }
+ *             }
  *         },
  *         {
  *             "token": "World!",
  *             "nodeType": "right_child",
- *             "children": []
+ *             "children": [],
+ *             "range" : {
+ *                 "start" : { "l" : 1, "c" : 0 },
+ *                 "end" : { "l" : 1, "c" : 6 }
+ *             }
  *         }
  *     ]
  * }
@@ -57,7 +70,14 @@ private fun launchScript(args: List<String>): String {

 private fun convertFromForeignTree(context: ForeignTree, rootId: Int = 0, parent: SimpleNode? = null): SimpleNode {
     val foreignNode = context.tree[rootId]
-    val node = SimpleNode(foreignNode.nodeType, mutableListOf(), parent, foreignNode.token)
+
+    val node = SimpleNode(
+        children = mutableListOf(),
+        parent = parent,
+        typeLabel = foreignNode.nodeType,
+        token = foreignNode.token,
+        range = foreignNode.range
+    )
     val children = foreignNode.children.map { convertFromForeignTree(context, it, node) }
     node.children.addAll(children)
     return node
@@ -67,7 +87,12 @@ private fun convertFromForeignTree(context: ForeignTree, rootId: Int = 0, parent
 private data class ForeignTree(val tree: List<ForeignNode>)

 @Serializable
-private data class ForeignNode(val token: String?, val nodeType: String, val children: List<Int>)
+private data class ForeignNode(
+    val token: String?,
+    val nodeType: String,
+    val range: NodeRange? = null,
+    val children: List<Int>
+)

 /** Use this parser to get a tree from external script.
  * It uses `getTreeFromScript` and `getArguments` functions to generate
4 changes: 3 additions & 1 deletion src/main/kotlin/astminer/parse/antlr/AntlrNode.kt
@@ -1,11 +1,13 @@
package astminer.parse.antlr

 import astminer.common.model.Node
+import astminer.common.model.NodeRange

 class AntlrNode(
     override val typeLabel: String,
     override var parent: AntlrNode?,
-    originalToken: String?
+    originalToken: String?,
+    override val range: NodeRange? = null
 ) : Node(originalToken) {

     override val children: MutableList<AntlrNode> = mutableListOf()
23 changes: 23 additions & 0 deletions src/main/kotlin/astminer/parse/antlr/compressedTreesUtil.kt
@@ -0,0 +1,23 @@
package astminer.parse.antlr

import astminer.common.model.Node

fun decompressTypeLabel(typeLabel: String) = typeLabel.split("|")

fun AntlrNode.lastLabel() = decompressTypeLabel(typeLabel).last()

fun AntlrNode.firstLabel() = decompressTypeLabel(typeLabel).first()

fun AntlrNode.hasLastLabel(label: String): Boolean = lastLabel() == label

fun AntlrNode.lastLabelIn(labels: List<String>): Boolean = labels.contains(lastLabel())

fun AntlrNode.hasFirstLabel(label: String): Boolean = firstLabel() == label

fun AntlrNode.firstLabelIn(labels: List<String>): Boolean = labels.contains(firstLabel())

fun Node.getTokensFromSubtree(): String =
if (isLeaf()) token.original ?: "" else children.joinToString(separator = "") { it.getTokensFromSubtree() }

fun AntlrNode.getItOrChildrenOfType(typeLabel: String): List<AntlrNode> =
if (hasLastLabel(typeLabel)) listOf(this) else this.getChildrenOfType(typeLabel)
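
The helpers in this file treat a type label as a `|`-separated chain of labels. The sketch below shows the first/last label lookup and the leaf-token concatenation on a hypothetical `LabelNode`; the node type and the example labels such as `expression|methodCall` are illustrative assumptions, not taken from astminer's grammars.

```kotlin
// Hypothetical minimal node; astminer's AntlrNode carries more state.
class LabelNode(
    val typeLabel: String,
    val token: String? = null,
    val children: List<LabelNode> = emptyList()
) {
    fun isLeaf() = children.isEmpty()
}

// A compressed label is a chain of labels joined with "|".
fun decompressTypeLabel(typeLabel: String): List<String> = typeLabel.split("|")

fun LabelNode.firstLabel(): String = decompressTypeLabel(typeLabel).first()
fun LabelNode.lastLabel(): String = decompressTypeLabel(typeLabel).last()

// Concatenate the tokens of all leaves in the subtree, left to right.
fun LabelNode.tokensFromSubtree(): String =
    if (isLeaf()) token ?: "" else children.joinToString(separator = "") { it.tokensFromSubtree() }

fun main() {
    val call = LabelNode(
        "expression|methodCall",
        children = listOf(LabelNode("identifier", "foo"), LabelNode("arguments", "()"))
    )
    println(call.firstLabel())
    println(call.lastLabel())
    println(call.tokensFromSubtree())
}
```

A label without a delimiter decompresses to a single-element list, so its first and last labels coincide.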