A multi-language tokenizer for extracting identifiers (or, theoretically, anything else) from source code.
The tool is already employed in searching for similar repositories and studying the dynamics of topics in code.
The tool currently works on Linux and macOS; the correct versions of the files will be downloaded automatically.
- The project uses tree-sitter and its grammars as submodules, so update them after cloning:
  git submodule update --init --recursive --depth 1
- Install the required dependencies:
  pip3 install cython
  pip3 install -r requirements.txt
- Create an input file with a list of repositories. In the default mode, the list must contain links to GitHub; in the local mode (activated by passing the -l argument), the list must contain the paths to local directories.
- Run from the command line with python3 -m identifiers_extractor.run and the following arguments:
  -i: a path to the input file;
  -o: a path to the output directory;
  -b: the size of the batch of projects that will be saved together (by default 100);
  -l: if passed, switches the tokenization into the local mode, where the input file must contain the paths to local directories.
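When the tool is driven from another script, the argument list can be assembled programmatically. The following is a minimal sketch that uses only the flags described above; the helper name build_command is hypothetical and is not part of the project.

```python
def build_command(input_file, output_dir, batch_size=100, local=False):
    """Assemble the argument list for one run of the tokenizer.

    build_command is a hypothetical helper; only the module name and
    the -i, -o, -b and -l flags come from the description above.
    """
    command = [
        "python3", "-m", "identifiers_extractor.run",
        "-i", str(input_file),   # path to the input file
        "-o", str(output_dir),   # path to the output directory
        "-b", str(batch_size),   # number of projects saved together
    ]
    if local:
        # -l switches to the local mode: the input file lists directories.
        command.append("-l")
    return command


if __name__ == "__main__":
    print(build_command("repositories.txt", "output", batch_size=50, local=True))
```

The resulting list can be passed to subprocess.run(command, check=True) from the root of the cloned repository.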
For every batch, two files will be created:
- docword: for every repository, all of its subtokens are listed as id:count, one repository per line, in descending order of counts. The ids are the same for the entire batch.
- vocab: all unique subtokens are listed as id;subtoken, one subtoken per line, in ascending order of ids.
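To make the relationship between the two files concrete, here is a rough sketch of how such lines could be produced from per-repository counters. It is not the tool's actual code: the function name, the comma used to separate id:count pairs, the ids starting at 0, and the tie-breaking by id are all assumptions made for illustration.

```python
from collections import Counter


def build_batch_files(repo_counters, pair_sep=","):
    """Return (docword_lines, vocab_lines) for one batch of repositories.

    repo_counters: list of Counter objects mapping subtoken -> count.
    pair_sep: separator between id:count pairs (an assumption; the
    description above does not specify it).
    """
    # Assign one id per unique subtoken, shared by the whole batch.
    # Starting the ids at 0 is an assumption.
    vocabulary = {}
    for counter in repo_counters:
        for subtoken in counter:
            vocabulary.setdefault(subtoken, len(vocabulary))

    # vocab: "id;subtoken", one subtoken per line, ascending order of ids.
    vocab_lines = [
        f"{idx};{subtoken}"
        for subtoken, idx in sorted(vocabulary.items(), key=lambda kv: kv[1])
    ]

    # docword: "id:count" pairs, one repository per line, descending counts.
    # Ties are broken by id here, which is also an assumption.
    docword_lines = []
    for counter in repo_counters:
        pairs = sorted(counter.items(), key=lambda kv: (-kv[1], vocabulary[kv[0]]))
        docword_lines.append(
            pair_sep.join(f"{vocabulary[subtoken]}:{count}" for subtoken, count in pairs)
        )
    return docword_lines, vocab_lines


if __name__ == "__main__":
    batch = [Counter({"token": 3, "pars": 1}), Counter({"pars": 2, "tree": 2})]
    docword, vocab = build_batch_files(batch)
    print("\n".join(docword))
    print("\n".join(vocab))
```

Because the ids are shared across the batch, a docword line can only be decoded with the vocab file from the same batch.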
After the target project is downloaded, it is processed in three main steps:
- Language recognition. Firstly, the languages of the project are recognized with enry. This operation returns a dictionary with languages as keys and corresponding lists of files as values. Only the files in supported languages are passed on to the next step (see the full list below).
- Parsing. Every file is parsed with one of the two parsers. The most popular languages are parsed with tree-sitter, and the languages that do not yet have a tree-sitter grammar are parsed with pygments. At this point, identifiers are extracted and every identifier is passed on to the next step.
- Subtokenizing. Every identifier is split into subtokens by camelCase and snake_case, small subtokens are connected to longer ones, and the subtokens are stemmed. In general, the preprocessing is carried out as described in this paper.
The counters of subtokens are aggregated for every project and saved to a file.
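The splitting and counting part of the last step can be illustrated with a short sketch. It covers only the camelCase and snake_case splitting and the aggregation into a counter; the merging of small subtokens and the stemming are left out, and the function names and the regular expression are illustrative assumptions rather than the code in subtokenizing.py.

```python
import re
from collections import Counter

# Zero-width boundaries inside a camelCase word:
#   lower case or digit followed by upper case ("fooBar" -> "foo", "Bar"),
#   end of an acronym followed by a capitalized word ("HTTPServer" -> "HTTP", "Server").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_identifier(identifier):
    """Split an identifier into lower-cased subtokens."""
    subtokens = []
    # Underscores and any other non-alphanumeric characters act as separators.
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        if part:
            subtokens.extend(p.lower() for p in _CAMEL_BOUNDARY.split(part) if p)
    return subtokens


def count_subtokens(identifiers):
    """Aggregate subtoken counts over all identifiers of a project."""
    counter = Counter()
    for identifier in identifiers:
        counter.update(split_identifier(identifier))
    return counter


if __name__ == "__main__":
    print(split_identifier("parseHTTPResponse"))
    print(count_subtokens(["getName", "set_name"]))
```

Splitting on zero-width matches requires Python 3.7 or later.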
Every step of the pipeline can be modified:
- Languages can be added by modifying SUPPORTED_LANGUAGES in parsing.py.
- The tool can extract not only identifiers, but anything that is detected by either tree-sitter or pygments. This can be done by modifying NODE_TYPES in the TreeSitterParser class and TYPES in the PygmentsParser class.
- Subtokenization can be modified in subtokenizing.py. The tokens can be connected together, stemmed, filtered by length, etc.
Currently, the following languages are supported: C, C#, C++, Go, Haskell, Java, JavaScript, Kotlin, PHP, Python, Ruby, Rust, Scala, Shell, Swift, and TypeScript.