feat(transformers): References and definitions from code #186

Merged
merged 8 commits into from
Jul 18, 2024
18 changes: 14 additions & 4 deletions README.md
@@ -112,6 +112,12 @@ _You can find more examples in [/examples](https://github.com/bosun-ai/swiftide/

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Vision

Our goal is to create a fast, extendable platform for data indexing and querying to further the development of automated LLM applications, with an easy-to-use and easy-to-extend API.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Features

- Fast streaming indexing pipeline with async, parallel processing
@@ -123,11 +129,15 @@ _You can find more examples in [/examples](https://github.com/bosun-ai/swiftide/
- Store into multiple backends
- `tracing` supported for logging and tracing, see /examples and the `tracing` crate for more information.

<p align="right">(<a href="#readme-top">back to top</a>)</p>
### In detail

## Vision

Our goal is to create a fast, extendable platform for data indexing and querying to further the development of automated LLM applications, with an easy-to-use and easy-to-extend API.
| **Feature** | **Details** |
| -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Supported Large Language Model providers** | OpenAI (and Azure) - All models and embeddings <br> AWS Bedrock - Anthropic and Titan <br> Groq - All models |
| **Loading data** | Files <br> Scraping <br> Other pipelines and streams |
| **Transformers and metadata generation** | Generate questions and answers for both text and code (HyDE) <br> Summaries, titles and queries via an LLM <br> Extract definitions and references with tree-sitter |
| **Splitting and chunking** | Markdown <br> Code (with tree-sitter) |
| **Storage** | Qdrant <br> Redis |

<p align="right">(<a href="#readme-top">back to top</a>)</p>

209 changes: 209 additions & 0 deletions swiftide/src/integrations/treesitter/code_tree.rs
@@ -0,0 +1,209 @@
//! Code parsing
//!
//! Extracts typed semantics from code.
#![allow(dead_code)]
use itertools::Itertools;
use tree_sitter::{Parser, Query, QueryCursor, Tree};

use anyhow::{Context as _, Result};
use std::collections::HashSet;

use crate::integrations::treesitter::queries::{python, ruby, rust, typescript};

use super::SupportedLanguages;

#[derive(Debug)]
pub struct CodeParser {
language: SupportedLanguages,
}

impl CodeParser {
pub fn from_language(language: SupportedLanguages) -> Self {
Self { language }
}

/// Parses code and returns a `CodeTree`.
///
/// Tree-sitter is lenient and will also parse invalid code. If the code is invalid,
/// queries might find no matches and return no results.
///
/// This leniency makes parsing safe to use on chunked code as well.
///
/// # Errors
///
/// Errors if the language is not supported or if the tree cannot be parsed.
pub fn parse<'a>(&self, code: &'a str) -> Result<CodeTree<'a>> {
let mut parser = Parser::new();
parser.set_language(&self.language.into())?;
let ts_tree = parser.parse(code, None).context("No nodes found")?;

Ok(CodeTree {
ts_tree,
code,
language: self.language,
})
}
}

/// A code tree is a queryable representation of code
pub struct CodeTree<'a> {
ts_tree: Tree,
code: &'a str,
language: SupportedLanguages,
}

pub struct ReferencesAndDefinitions {
pub references: Vec<String>,
pub definitions: Vec<String>,
}

impl CodeTree<'_> {
/// Queries for references and definitions in the code. It returns a unique list of non-local
/// references, and local definitions.
///
/// # Errors
///
/// Errors if the query is invalid or fails
pub fn references_and_definitions(&self) -> Result<ReferencesAndDefinitions> {
let (defs, refs) = ts_queries_for_language(self.language);

let defs_query = Query::new(&self.language.into(), defs)?;
let refs_query = Query::new(&self.language.into(), refs)?;

let defs = self.ts_query_for_matches(&defs_query)?;
let refs = self.ts_query_for_matches(&refs_query)?;

Ok(ReferencesAndDefinitions {
// Remove any self references
references: refs
.into_iter()
.filter(|r| !defs.contains(r))
.sorted()
.collect(),
definitions: defs.into_iter().sorted().collect(),
})
}

/// Given a `tree-sitter` query, searches the code and returns a list of matching symbols
fn ts_query_for_matches(&self, query: &Query) -> Result<HashSet<String>> {
let mut cursor = QueryCursor::new();

cursor
.matches(query, self.ts_tree.root_node(), self.code.as_bytes())
.map(|m| {
m.captures
.iter()
.map(|c| {
Ok(c.node
.utf8_text(self.code.as_bytes())
.context("Failed to parse node")?
.to_string())
})
.collect::<Result<Vec<_>>>()
.map(|s| s.join(""))
})
.collect::<Result<HashSet<_>>>()
}
}

fn ts_queries_for_language(language: SupportedLanguages) -> (&'static str, &'static str) {
use SupportedLanguages::{Javascript, Python, Ruby, Rust, Typescript};

match language {
Rust => (rust::DEFS, rust::REFS),
Python => (python::DEFS, python::REFS),
// The univocal proof that TS is just a linter
Typescript | Javascript => (typescript::DEFS, typescript::REFS),
Ruby => (ruby::DEFS, ruby::REFS),
}
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn test_parsing_on_rust() {
let parser = CodeParser::from_language(SupportedLanguages::Rust);
let code = r#"
use std::io;

fn main() {
println!("Hello, world!");
}
"#;
let tree = parser.parse(code).unwrap();
let result = tree.references_and_definitions().unwrap();
assert_eq!(result.references, vec!["println"]);

assert_eq!(result.definitions, vec!["main"]);
}

#[test]
fn test_parsing_on_ruby() {
let parser = CodeParser::from_language(SupportedLanguages::Ruby);
let code = r#"
class A < Inheritance
include ActuallyAlsoInheritance

def a
puts "A"
end
end
"#;

let tree = parser.parse(code).unwrap();
let result = tree.references_and_definitions().unwrap();
assert_eq!(
result.references,
["ActuallyAlsoInheritance", "Inheritance", "include", "puts",]
);

assert_eq!(result.definitions, ["A", "a"]);
}

#[test]
fn test_parsing_python() {
// test with a python class and list comprehension
let parser = CodeParser::from_language(SupportedLanguages::Python);
let code = r#"
class A:
def __init__(self):
self.a = [x for x in range(10)]

def hello_world():
print("Hello, world!")
"#;
let tree = parser.parse(code).unwrap();
let result = tree.references_and_definitions().unwrap();
assert_eq!(result.references, ["print", "range"]);
assert_eq!(result.definitions, vec!["A", "hello_world"]);
}

#[test]
fn test_parsing_on_typescript() {
let parser = CodeParser::from_language(SupportedLanguages::Typescript);
let code = r#"
function Test() {
console.log("Hello, TypeScript!");
otherThing();
}

class MyClass {
constructor() {
let local = 5;
this.myMethod();
}

myMethod() {
console.log("Hello, TypeScript!");
}
}
"#;

let tree = parser.parse(code).unwrap();
let result = tree.references_and_definitions().unwrap();
assert_eq!(result.definitions, vec!["MyClass", "Test", "myMethod"]);
assert_eq!(result.references, vec!["log", "otherThing"]);
}
}
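
The filtering step in `references_and_definitions` above (drop every reference that is also a local definition, then sort both lists) can be sketched without tree-sitter. This is a minimal illustration using only the standard library, not the code from this PR: `SymbolSummary` and `summarize` are hypothetical names, and a `BTreeSet` stands in for the `itertools` `sorted()` call.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical stand-in for `ReferencesAndDefinitions` from the diff.
#[derive(Debug, PartialEq, Eq)]
pub struct SymbolSummary {
    pub references: Vec<String>,
    pub definitions: Vec<String>,
}

/// Keeps only references that are not defined locally and returns both
/// lists in ascending order. The inputs are assumed to be the symbol sets
/// produced by the definitions query and the references query.
pub fn summarize(defs: HashSet<String>, refs: HashSet<String>) -> SymbolSummary {
    // Remove self references: anything defined locally is not an
    // external reference. Collecting into a BTreeSet sorts the result.
    let references: BTreeSet<String> = refs
        .into_iter()
        .filter(|r| !defs.contains(r))
        .collect();

    // Definitions are kept as-is, only sorted.
    let definitions: BTreeSet<String> = defs.into_iter().collect();

    SymbolSummary {
        references: references.into_iter().collect(),
        definitions: definitions.into_iter().collect(),
    }
}
```

Because both inputs are sets, duplicates are already gone before filtering, which is why the result is a unique list. Calling `summarize` with definitions `{main, helper}` and references `{println, helper, io}` leaves `helper` out of the references, since it is defined locally.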
5 changes: 4 additions & 1 deletion swiftide/src/integrations/treesitter/mod.rs
@@ -1,6 +1,9 @@
//! Chunking code with tree-sitter
//! Chunking code with tree-sitter and various tools
mod code_tree;
mod queries;
mod splitter;
mod supported_languages;

pub use code_tree::{CodeParser, CodeTree, ReferencesAndDefinitions};
pub use splitter::{ChunkSize, CodeSplitter, CodeSplitterBuilder};
pub use supported_languages::SupportedLanguages;