Node range #194

Merged · 52 commits · Nov 12, 2021
Changes shown from 45 commits
65a489d
gumtree update
illided Oct 23, 2021
4d88dde
normalization and token logic extracted as separate class
illided Oct 24, 2021
b681d84
normalization and token logic extracted as separate class
illided Oct 24, 2021
b6ea814
Merge remote-tracking branch 'origin/token_range' into token_range
illided Oct 24, 2021
2c9e535
renaming and refactoring
illided Oct 24, 2021
c3413c5
code style fixes
illided Oct 24, 2021
b5ab209
init restrictions removed
illided Oct 24, 2021
84badf7
token range added to javaparser
illided Oct 24, 2021
8b006e5
javaparser doc update
illided Oct 24, 2021
32a6917
spoon token position introduced
illided Oct 24, 2021
8d376d9
range now is a Node field + some interface rearrangement
illided Oct 28, 2021
95400d8
test fixes
illided Oct 28, 2021
b858836
simple node default parameters added
illided Oct 28, 2021
4ca400d
node range introduced in ANTLR
illided Oct 28, 2021
02192f5
ANTLR util refactor
illided Oct 28, 2021
8d15d54
spoon bug fixed
illided Oct 28, 2021
8798c06
node range refactor
illided Oct 28, 2021
7badac3
foreign parser update
illided Oct 28, 2021
5523bc7
added nodeRange extraction in tree sitter
illided Oct 28, 2021
abbb4f1
setup.py code style refactor
illided Oct 28, 2021
4194258
node range support in gumtree
illided Oct 28, 2021
dfcb5eb
doc little fix
illided Oct 28, 2021
a4eceee
code style fixes
illided Oct 28, 2021
38e705b
serial name change
illided Oct 30, 2021
734a2a9
new option in json ast storage
illided Oct 30, 2021
da2faf4
option to disable range serialization properly added
illided Oct 30, 2021
d4e60b3
Merge branch 'master' into token_range
illided Nov 2, 2021
c47517d
test fixed
illided Nov 2, 2021
f4bea6e
now it compiles
illided Nov 2, 2021
e2b9750
range support added
illided Nov 2, 2021
70b820e
Merge branch 'gumtree_update' into token_range
illided Nov 3, 2021
3d7ae17
now it compiles
illided Nov 3, 2021
5844f8d
code style fixes
illided Nov 3, 2021
1f4999a
Merge branch 'master' into token_range
illided Nov 4, 2021
beab3d2
position tests added
illided Nov 4, 2021
c1df042
tree sitter bug fix
illided Nov 4, 2021
383425e
unused imports removed
illided Nov 4, 2021
2ba6bef
expected function positions changed
illided Nov 4, 2021
edc24d3
positions in javalang test adjusted
illided Nov 4, 2021
396ab20
doc fix
illided Nov 4, 2021
2186e89
Normalization refactor
illided Nov 9, 2021
8a64df8
EMPTY token updated
illided Nov 9, 2021
5898169
file renamed and redundant map deleted
illided Nov 9, 2021
cbb70f0
code style fixes
illided Nov 9, 2021
f2a96a6
test fixed
illided Nov 9, 2021
64219d2
normalization object deleted
illided Nov 10, 2021
66716c0
removed empty line
illided Nov 10, 2021
bd76ac1
documentation added
illided Nov 10, 2021
95c33e5
doc and normalization fixes
illided Nov 11, 2021
bf5914b
docs update
illided Nov 11, 2021
5f86b14
Improve storage documentation
SpirinEgor Nov 12, 2021
37ae185
Change snippet format
SpirinEgor Nov 12, 2021
2 changes: 2 additions & 0 deletions detekt.yaml
@@ -26,6 +26,8 @@ style:
max: 5
WildcardImport:
active: false
UseDataClass:
allowVars: true
Contributor:
Where do you use vars in a data class?

Contributor (author):
I use var in the Token class, and for some reason without this option detekt suggests converting it to a data class; when I do, detekt reports the var usage in Token.

Contributor:
As far as I can see, Token is not a data class at the moment, and detekt doesn't report anything.


formatting:
autoCorrect: true
23 changes: 23 additions & 0 deletions src/main/kotlin/astminer/common/SimpleNode.kt
@@ -0,0 +1,23 @@
package astminer.common

import astminer.common.model.Node
import astminer.common.model.NodeRange

/** Simplest [Node] implementation. */
class SimpleNode(
override val typeLabel: String,
override val children: MutableList<SimpleNode>,
override val parent: Node? = null,
override val range: NodeRange? = null,
token: String?
) : Node(token) {
override fun removeChildrenOfType(typeLabel: String) {
children.removeIf { it.typeLabel == typeLabel }
}

override fun getChildrenOfType(typeLabel: String) = super.getChildrenOfType(typeLabel).map { it as SimpleNode }
override fun getChildOfType(typeLabel: String) = super.getChildOfType(typeLabel) as? SimpleNode

override fun preOrder() = super.preOrder().map { it as SimpleNode }
override fun postOrder() = super.postOrder().map { it as SimpleNode }
}
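For readers unfamiliar with the relocated class, a minimal usage sketch (assuming this `SimpleNode` and the `Node` base class from `ParsingModel.kt` are on the classpath; node names are illustrative):

```kotlin
import astminer.common.SimpleNode

fun main() {
    // parent and range default to null, so only typeLabel, children, and token are required
    val root = SimpleNode(typeLabel = "root", children = mutableListOf(), token = null)
    val leaf = SimpleNode(typeLabel = "identifier", children = mutableListOf(), parent = root, token = "foo")
    root.children.add(leaf)

    check(root.getChildOfType("identifier") == leaf)
    check(root.preOrder().map { it.typeLabel } == listOf("root", "identifier"))

    root.removeChildrenOfType("identifier")
    check(root.isLeaf())
}
```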
52 changes: 52 additions & 0 deletions src/main/kotlin/astminer/common/TokenNormalization.kt
@@ -0,0 +1,52 @@
package astminer.common

object TokenNormalization {
const val EMPTY_TOKEN = "<E>"
const val TOKEN_DELIMITER = "|"

private val newLineReg = "\\\\n".toRegex()
private val whitespaceReg = "//s+".toRegex() // kept verbatim from code2vec's Common.java; "\\s+" was likely intended upstream
private val quotesApostrophesCommasReg = "[\"',]".toRegex()
private val unicodeWeirdCharReg = "\\P{Print}".toRegex()
private val notALetterReg = "[^A-Za-z]".toRegex()

private val splitRegex = "(?<=[a-z])(?=[A-Z])|_|[0-9]|(?<=[A-Z])(?=[A-Z][a-z])|\\s+".toRegex()

fun normalizeToken(token: String?): String {
if (token == null) return EMPTY_TOKEN
val subTokens = splitToSubtokens(token)
return if (subTokens.isEmpty()) EMPTY_TOKEN else subTokens.joinToString(TOKEN_DELIMITER)
}

/**
* The function was adopted from the original code2vec implementation in order to match their behavior:
* https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/Common/Common.java
*/
fun splitToSubtokens(token: String) = token
.trim()
.split(splitRegex)
.map { s -> normalizeSubToken(s, "") }
.filter { it.isNotEmpty() }
.toList()

/**
* The function was adopted from the original code2vec implementation in order to match their behavior:
* https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/Common/Common.java
*/
fun normalizeSubToken(token: String, defaultToken: String): String {
val cleanToken = token.lowercase()
.replace(newLineReg, "") // escaped new line
.replace(whitespaceReg, "") // whitespaces
.replace(quotesApostrophesCommasReg, "") // quotes, apostrophes, commas
.replace(unicodeWeirdCharReg, "") // unicode weird characters

val stripped = cleanToken.replace(notALetterReg, "")

return stripped.ifEmpty {
val carefulStripped = cleanToken.replace(" ", "_")
carefulStripped.ifEmpty {
defaultToken
}
}
}
}
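A short sanity check of the pipeline above (a sketch assuming `TokenNormalization` from this diff is on the classpath; expected values follow the code2vec splitting rules):

```kotlin
import astminer.common.TokenNormalization

fun main() {
    // camelCase and snake_case identifiers split into lowercase subtokens
    // joined by TOKEN_DELIMITER
    check(TokenNormalization.normalizeToken("getItemById") == "get|item|by|id")
    check(TokenNormalization.normalizeToken("MAX_VALUE") == "max|value")

    // null tokens and tokens without letters collapse to EMPTY_TOKEN
    check(TokenNormalization.normalizeToken(null) == "<E>")
    check(TokenNormalization.normalizeToken("42") == "<E>")
}
```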
45 changes: 0 additions & 45 deletions src/main/kotlin/astminer/common/TreeUtil.kt

This file was deleted.

2 changes: 1 addition & 1 deletion src/main/kotlin/astminer/common/model/FunctionInfoModel.kt
@@ -20,7 +20,7 @@ interface NamedTree<T : Node> {
val nameNode: T?
get() = notImplemented("nameNode")
val name: String?
get() = nameNode?.originalToken
get() = nameNode?.token?.original
val root: T
get() = notImplemented("root")
val body: T?
49 changes: 13 additions & 36 deletions src/main/kotlin/astminer/common/model/ParsingModel.kt
@@ -1,31 +1,22 @@
package astminer.common.model

import astminer.common.EMPTY_TOKEN
import astminer.common.splitToSubtokens
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import java.io.File
import java.io.InputStream
import java.util.*

abstract class Node(val originalToken: String?) {
abstract class Node(originalToken: String?) {
abstract val typeLabel: String
abstract val children: List<Node>
abstract val parent: Node?

val normalizedToken: String =
originalToken?.let {
val subtokens = splitToSubtokens(it)
if (subtokens.isEmpty()) EMPTY_TOKEN else subtokens.joinToString(TOKEN_DELIMITER)
} ?: EMPTY_TOKEN

var technicalToken: String? = null

val token: String
get() = technicalToken ?: normalizedToken

abstract val range: NodeRange?
val metadata: MutableMap<String, Any> = HashMap()
val token = Token(originalToken)

fun isLeaf() = children.isEmpty()

override fun toString(): String = "$typeLabel : $token"

fun prettyPrint(indent: Int = 0, indentSymbol: String = "--") {
repeat(indent) { print(indentSymbol) }
println(this)
@@ -52,30 +43,16 @@ abstract class Node(val originalToken: String?) {

fun postOrderIterator(): Iterator<Node> = postOrder().listIterator()
open fun postOrder(): List<Node> = mutableListOf<Node>().also { doTraversePostOrder(it) }

companion object {
const val TOKEN_DELIMITER = "|"
}
}

/** Node simplest implementation **/
class SimpleNode(
override val typeLabel: String,
override val children: MutableList<SimpleNode>,
override val parent: Node?,
token: String?
) : Node(token) {
override fun removeChildrenOfType(typeLabel: String) {
children.removeIf { it.typeLabel == typeLabel }
}

override fun getChildrenOfType(typeLabel: String) = super.getChildrenOfType(typeLabel).map { it as SimpleNode }
override fun getChildOfType(typeLabel: String) = super.getChildOfType(typeLabel) as? SimpleNode

override fun preOrder() = super.preOrder().map { it as SimpleNode }
override fun postOrder() = super.postOrder().map { it as SimpleNode }
@Serializable
data class NodeRange(val start: Position, val end: Position) {
override fun toString(): String = "[${start.line}, ${start.column}] - [${end.line}, ${end.column}]"
}

@Serializable
data class Position(@SerialName("l") val line: Int, @SerialName("c") val column: Int)

interface Parser<T : Node> {
/**
* Parse input stream into an AST.
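The `@SerialName` annotations shorten `line`/`column` to `l`/`c` in serialized output. A sketch of both representations (assuming the `NodeRange`/`Position` classes above and `kotlinx.serialization` on the classpath):

```kotlin
import astminer.common.model.NodeRange
import astminer.common.model.Position
import kotlinx.serialization.json.Json

fun main() {
    val range = NodeRange(Position(line = 0, column = 0), Position(line = 1, column = 4))

    // human-readable form from the overridden toString
    check(range.toString() == "[0, 0] - [1, 4]")

    // compact serialized form with the short "l"/"c" keys
    val json = Json.encodeToString(NodeRange.serializer(), range)
    check(json == """{"start":{"l":0,"c":0},"end":{"l":1,"c":4}}""")
}
```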
14 changes: 14 additions & 0 deletions src/main/kotlin/astminer/common/model/Token.kt
@@ -0,0 +1,14 @@
package astminer.common.model

import astminer.common.TokenNormalization

class Token(val original: String?) {
Contributor (suggesting a doc comment):
/**
 * Class that wraps token-processing logic.
 * It is responsible for normalizing the token or replacing it with technical information.
 * Use `token.original` to access the original token.
 */

val final: String
get() = technical ?: normalized
Contributor:

Maybe it's better to declare final as a function?

fun final() = technical ?: normalized

From a usage perspective, a property reads as if it were computed once when the instance was created and then stayed permanent. But the final token representation is actually recalculated on every access.


var technical: String? = null

val normalized = TokenNormalization.normalizeToken(original)

override fun toString(): String = final
}
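How `original`, `normalized`, `technical`, and `final` interact, as a sketch (assuming the `Token` class above; the `SELF` value mirrors the recursive-call masking in `FunctionNameLabelExtractor`):

```kotlin
import astminer.common.model.Token

fun main() {
    val token = Token("setMaxValue")
    check(token.original == "setMaxValue")      // raw token is kept as-is
    check(token.normalized == "set|max|value")  // normalized once, eagerly
    check(token.final == "set|max|value")       // no technical override yet

    token.technical = "SELF"                    // e.g. mask a recursive call
    check(token.final == "SELF")                // technical takes precedence
    check(token.toString() == "SELF")
}
```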
12 changes: 10 additions & 2 deletions src/main/kotlin/astminer/config/StorageConfigs.kt
@@ -42,8 +42,16 @@ class DotAstStorageConfig : StorageConfig() {
*/
@Serializable
@SerialName("json AST")
class JsonAstStorageConfig(private val withPaths: Boolean = false) : StorageConfig() {
override fun createStorage(outputDirectoryPath: String) = JsonAstStorage(outputDirectoryPath, withPaths)
class JsonAstStorageConfig(
private val withPaths: Boolean = false,
private val withRanges: Boolean = false
) : StorageConfig() {
override fun createStorage(outputDirectoryPath: String) =
JsonAstStorage(
outputDirectoryPath,
withPaths,
withRanges
)
}

/**
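Programmatic use of the new flag could look like this (a sketch; the output path is illustrative, and by default `withRanges = false` keeps the serialized ASTs as compact as before):

```kotlin
import astminer.config.JsonAstStorageConfig

fun main() {
    // opt in to serializing each node's range alongside the AST
    val config = JsonAstStorageConfig(withPaths = false, withRanges = true)
    val storage = config.createStorage("output/")
    println(storage::class.simpleName)
}
```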
2 changes: 1 addition & 1 deletion src/main/kotlin/astminer/featureextraction/TreeFeature.kt
@@ -57,7 +57,7 @@ object Tokens : TreeFeature<List<String>> {

private fun findTokens(node: Node, tokensList: MutableList<String>): List<String> {
node.children.forEach { findTokens(it, tokensList) }
tokensList.add(node.token)
tokensList.add(node.token.final)
return tokensList
}
}
5 changes: 3 additions & 2 deletions src/main/kotlin/astminer/filters/CommonFilters.kt
@@ -1,5 +1,6 @@
package astminer.filters

import astminer.common.TokenNormalization
import astminer.common.model.*
import astminer.featureextraction.NumberOfNodes

@@ -23,8 +24,8 @@ class TreeSizeFilter(private val minSize: Int = 0, private val maxSize: Int? = n
* Filter that excludes trees that have more words than [maxWordsNumber] in any token of their node.
*/
class WordsNumberFilter(private val maxWordsNumber: Int) : FunctionFilter, FileFilter {
(SpirinEgor marked this conversation as resolved.)
private fun validateTree(root: Node) =
!root.preOrder().any { node -> node.token.split(Node.TOKEN_DELIMITER).size > maxWordsNumber }
private fun validateTree(root: Node) = root.preOrder()
.none { node -> node.token.final.split(TokenNormalization.TOKEN_DELIMITER).size > maxWordsNumber }

override fun validate(functionInfo: FunctionInfo<out Node>) = validateTree(functionInfo.root)

4 changes: 2 additions & 2 deletions src/main/kotlin/astminer/filters/FunctionFilters.kt
@@ -1,9 +1,9 @@
package astminer.filters

import astminer.common.TokenNormalization
import astminer.common.model.FunctionFilter
import astminer.common.model.FunctionInfo
import astminer.common.model.Node
import astminer.common.splitToSubtokens

/**
* Filter that excludes functions that have at least one of modifiers from the [excludeModifiers] list.
@@ -38,7 +38,7 @@ object ConstructorFilter : FunctionFilter {
class FunctionNameWordsNumberFilter(private val maxWordsNumber: Int) : FunctionFilter {
override fun validate(functionInfo: FunctionInfo<out Node>): Boolean {
val name = functionInfo.name
return name != null && splitToSubtokens(name).size <= maxWordsNumber
return name != null && TokenNormalization.splitToSubtokens(name).size <= maxWordsNumber
}
}

@@ -14,13 +14,13 @@ object FunctionNameLabelExtractor : FunctionLabelExtractor {
private const val RECURSIVE_CALL_TOKEN = "SELF"

override fun process(functionInfo: FunctionInfo<out Node>): LabeledResult<out Node>? {
val normalizedName = functionInfo.nameNode?.normalizedToken ?: return null
val normalizedName = functionInfo.nameNode?.token?.normalized ?: return null
functionInfo.root.preOrder().forEach { node ->
if (node.originalToken == functionInfo.nameNode?.originalToken) {
node.technicalToken = RECURSIVE_CALL_TOKEN
if (node.token.original == functionInfo.nameNode?.token?.original) {
node.token.technical = RECURSIVE_CALL_TOKEN
}
}
functionInfo.nameNode?.technicalToken = HIDDEN_METHOD_NAME_TOKEN
functionInfo.nameNode?.token?.technical = HIDDEN_METHOD_NAME_TOKEN
return LabeledResult(functionInfo.root, normalizedName, functionInfo.qualifiedPath)
}
}
35 changes: 30 additions & 5 deletions src/main/kotlin/astminer/parse/ForeignParser.kt
@@ -1,7 +1,8 @@
package astminer.parse

import astminer.common.SimpleNode
import astminer.common.model.NodeRange
import astminer.common.model.Parser
import astminer.common.model.SimpleNode
import astminer.config.FileExtension
import astminer.config.ParserType
import kotlinx.serialization.Serializable
@@ -23,17 +24,29 @@ import kotlin.io.path.createTempDirectory
* {
* "token": null,
* "nodeType": "i_am_root",
* "children": [1,2]
* "children": [1,2],
* "range" : {
* "start" : { "l" : 0, "c" : 0 },
 *               "end" : { "l" : 1, "c" : 4 }
* }
* },
* {
* "token": "Hello",
* "nodeType": "left_child",
 *           "children": [],
* "range" : {
* "start" : { "l" : 0, "c": 0 },
* "end" : { "l": 0, "c": 5 }
* }
* },
* {
* "token": "World!",
* "nodeType": "right_child",
* "children": []
* "children": [],
* "range" : {
* "start" : { "l" : 1, "c" : 0 },
* "end" : { "l" : 1, "c" : 6 }
* }
* }
* ]
* }
@@ -57,7 +70,14 @@ private fun launchScript(args: List<String>): String {

private fun convertFromForeignTree(context: ForeignTree, rootId: Int = 0, parent: SimpleNode? = null): SimpleNode {
val foreignNode = context.tree[rootId]
val node = SimpleNode(foreignNode.nodeType, mutableListOf(), parent, foreignNode.token)

val node = SimpleNode(
children = mutableListOf(),
parent = parent,
typeLabel = foreignNode.nodeType,
token = foreignNode.token,
range = foreignNode.range
)
val children = foreignNode.children.map { convertFromForeignTree(context, it, node) }
node.children.addAll(children)
return node
@@ -67,7 +87,12 @@ private fun convertFromForeignTree(context: ForeignTree, rootId: Int = 0, parent
private data class ForeignTree(val tree: List<ForeignNode>)

@Serializable
private data class ForeignNode(val token: String?, val nodeType: String, val children: List<Int>)
private data class ForeignNode(
val token: String?,
val nodeType: String,
val range: NodeRange? = null,
val children: List<Int>
)

/** Use this parser to get a tree from external script.
* It uses `getTreeFromScript` and `getArguments` functions to generate
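Since `range` is declared with a `null` default, foreign-parser scripts that predate this PR and emit no `range` field still deserialize cleanly. A standalone sketch mirroring the private classes (the mirror types here are illustrative, not part of the PR):

```kotlin
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json

// Illustrative stand-ins for the private classes declared in ForeignParser.kt
@Serializable
data class Pos(val l: Int, val c: Int)

@Serializable
data class Range(val start: Pos, val end: Pos)

@Serializable
data class MirrorForeignNode(
    val token: String?,
    val nodeType: String,
    val range: Range? = null,
    val children: List<Int>
)

fun main() {
    // a node emitted by an older script with no "range" field still parses:
    // the null default makes the new field backward compatible
    val legacy = Json.decodeFromString(
        MirrorForeignNode.serializer(),
        """{"token":"Hello","nodeType":"left_child","children":[]}"""
    )
    check(legacy.range == null)

    // a node emitted in the new format carries its range
    val withRange = Json.decodeFromString(
        MirrorForeignNode.serializer(),
        """{"token":"Hello","nodeType":"left_child","children":[],"range":{"start":{"l":0,"c":0},"end":{"l":0,"c":5}}}"""
    )
    check(withRange.range == Range(Pos(0, 0), Pos(0, 5)))
}
```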