CDLI Multi Layer Annotation Query Tool
This summer I worked with CDLI to build a Multi-Layer Annotation Query Tool. With this tool, data carrying multiple layers of annotation can be queried easily.
The objective of the project was to build a tool that makes it easy to query a corpus with multiple layers of annotations and retrieve the relevant documents. The objectives achieved are as follows:-
- Converting the complete CDLI-CoNLL data to CoNLL-U format and then to RDF format so that a Graph could be constructed from this data.
- A graph is built from the RDF files and the relevant results are retrieved from the Graph using SPARQL.
- The tool has a Query Generator GUI which can
  - generate a query that specifies the annotations for a word with various conditions.
  - specify linear and semantic dependencies between two words.
  - specify the proximity between two words.
- The CQP4RDF tool
  - can retrieve the relevant documents from the corpus.
  - is dockerized so that it is easy to install and deploy.
  - is highly configurable so that it can be extended to any other corpus.
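The first two objectives (CoNLL-U to RDF, then a graph that SPARQL can query) can be illustrated with a minimal sketch. It is not the converter used in the project; it only shows the idea of turning each CoNLL-U column into a `conll:` property and linking neighbouring tokens with `nif:nextWord`. The function name, the token identifier scheme and the exact column names are assumptions made for illustration, and multiword-token lines are not handled.

```python
# Column names are an assumption; the report itself only shows conll:UPOSTAG.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def sentence_to_triples(sentence_id, conllu_lines):
    """Turn the token lines of one CoNLL-U sentence into (s, p, o) triples."""
    triples = []
    previous = None
    for line in conllu_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        values = line.split("\t")
        token = f":{sentence_id}_{values[0]}"  # hypothetical token identifier
        for column, value in zip(CONLLU_COLUMNS, values):
            if value != "_":  # "_" marks an empty field in CoNLL-U
                triples.append((token, f"conll:{column}", f'"{value}"'))
        if previous is not None:
            # Linear order is stored explicitly as an edge between tokens.
            triples.append((previous, "nif:nextWord", token))
        previous = token
    return triples


example = [
    "1\tlugal\tlugal\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tdu3\tdu3\tVERB\t_\t_\t0\troot\t_\t_",
]
for triple in sentence_to_triples("s1", example):
    print(" ".join(triple), ".")
```

Because word order becomes an ordinary edge in the graph, a query can ask for it explicitly or leave it out.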
In a linguistic corpus, the data is usually annotated so that it can be understood. Over time, a corpus tends to gain multiple layers of annotation for the same data, each providing a different aspect of information about it. As the number of annotation layers grows, it becomes highly difficult to search such data by specifying conditions across them.
CQP4RDF makes querying data over multiple layers of annotation much easier. The annotation files are converted to RDF files, from which the annotation graph is constructed; the data is then queried over the graph using SPARQL, a query language for graphs. Because SPARQL is difficult to write, our tool uses the syntax of CQP (Corpus Query Processor), which is commonly used to query corpora and is easy to understand. CQP4RDF builds the bridge between the two: the user's CQP query is converted to SPARQL, which is then run over the annotation graph built from the files.
The responsive GUI of the query generator makes it easy to build a CQP query and query the corpus, even for non-technical professionals. The CQP4RDF tool has also been generalized so that it can be easily configured for any other corpus with any other fields.
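To make the CQP-to-SPARQL step more concrete, here is a rough sketch of how a very small subset of CQP (bracketed tokens with `attribute="value"` conditions) could be rewritten as a SPARQL query. This is not the translation implemented in CQP4RDF; the function name, the variable naming scheme and the use of `FILTER regex` are assumptions for illustration, and `PREFIX` declarations are left out.

```python
import re

TOKEN_RE = re.compile(r"\[([^\]]*)\]")               # one [...] per word
CONDITION_RE = re.compile(r'([\w:]+)\s*=\s*"([^"]*)"')  # attribute="value"


def cqp_to_sparql(cqp_query):
    """Translate a tiny CQP subset into a SPARQL SELECT query (sketch only)."""
    variables = []
    patterns = []
    for index, token_body in enumerate(TOKEN_RE.findall(cqp_query), start=1):
        word = f"?w{index}"
        variables.append(word)
        for n, (attribute, value) in enumerate(CONDITION_RE.findall(token_body)):
            value_var = f"{word}_{n}"
            escaped = value.replace("\\", "\\\\")  # keep backslashes literal
            patterns.append(f"{word} {attribute} {value_var} .")
            patterns.append(f'FILTER regex(str({value_var}), "{escaped}")')
    body = "\n  ".join(patterns)
    return f"SELECT {' '.join(variables)} WHERE {{\n  {body}\n}}"


print(cqp_to_sparql('[conll:UPOSTAG="NOUN"] [conll:UPOSTAG="VERB"]'))
```

Each CQP token becomes one SPARQL variable, and each condition becomes a triple pattern plus a filter on its value.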
- Sagar - Student Developer
- Max Ionov - Mentor
- Ilya Khait - Mentor
- SPARQL
- CQP
- Flask
- Python
- Javascript
- JQuery
- Bootstrap
- CSS
- HTML
- My main task for the summer was to build the Query Builder GUI for the CQP Query. It can be seen here:-
- This Query Builder was integrated with the CQP4RDF project. The CQP4RDF project was made more configurable and was also dockerized. My contributions to CQP4RDF can be seen here:-
- Since CQP4RDF is a generalized tool, a version specifically for CDLI was developed. The link for the CDLI version repository is:-
- Also, the CDLI version of CQP4RDF has been integrated with the CDLI framework. The link for the PR is:- LINK HERE
- Improving the CDLI-CoNLL-to-CoNLLU-Converter tool
- Some of the contributions in the data with CDLI:-
- Installation instructions were added to make the framework easier to install:-
- CQP is a popular tool used to query a corpus. SPARQL is commonly used to query over a graph. CQP4RDF uses the best of both worlds and bridges the gap between them. It
  - takes a CQP query as input from the user,
  - converts it to SPARQL,
  - uses the SPARQL query to query over the graph.
- All the types of annotation fields and various conditions for them can be configured.
- The type of each annotation can be specified; on that basis, its values are either shown in a dropdown or suggested while typing in the input box.
- By default, the text entered is treated as a RegEx, but if the user wants to match the text exactly, this can be done by ticking the checkbox.
- Semantic dependency between 2 words can also be specified.
- The proximity between words can also be defined.
- Queries can be generated which can even query in the semantic tree of a sentence.
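The difference between the default RegEx matching and the exact-match checkbox can be sketched as two kinds of SPARQL filter. This is only an illustration of the idea, not code from the tool; both function names and the `exact` parameter are hypothetical.

```python
def sparql_string_literal(text):
    """Quote text as a SPARQL string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_filter(variable, text, exact=False):
    """Return a SPARQL FILTER for one annotation value.

    exact=False mirrors the default behaviour (text is a regular expression);
    exact=True mirrors the ticked checkbox (plain string comparison).
    """
    literal = sparql_string_literal(text)
    if exact:
        return f"FILTER (str({variable}) = {literal})"
    return f"FILTER regex(str({variable}), {literal})"


print(build_filter("?pos", "NOUN"))
print(build_filter("?pos", "NOUN", exact=True))
```

With the RegEx form, `NOUN` would also match a value such as `PROPNOUN`, which is why an exact option is useful.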
- One of the limitations of the project is that we have broken a rule of CQP. This was done to reduce the complexity of the tool and to make it easier for the user. By default, the words in a CQP query are sequential and ordered. We have removed this constraint, so the words in a query are not related to each other. Instead, if the words have to be linearly linked, a `nextWord` dependency has to be added in the dependency section. This process has been automated in the Query Generator GUI, which automatically adds the `nextWord` dependency. For example:-
  - `[conll:UPOSTAG="NOUN"] [conll:UPOSTAG="VERB"]` in CQP means a NOUN followed by a VERB.
  - `[conll:UPOSTAG="NOUN"] [conll:UPOSTAG="VERB"]` in our tool means a sentence with a NOUN and a VERB with no linear dependency.
  - If we want them to be continuous, then a dependency has to be added. `w1:[conll:UPOSTAG="NOUN"] w2:[conll:UPOSTAG="VERB"] :: (w1.nif:nextWord=w2)` means a NOUN followed by a VERB.
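The automatic step described above, labelling the words and adding a `nextWord` dependency between neighbours, can be sketched as follows. This is not the Query Generator's actual code: the function name is hypothetical, and joining several dependencies with `&` is an assumption, since the report only shows the two-word case.

```python
def build_sequential_query(token_constraints):
    """Label each token and chain neighbours with nif:nextWord dependencies."""
    labels = [f"w{i}" for i in range(1, len(token_constraints) + 1)]
    tokens = " ".join(
        f"{label}:[{constraint}]"
        for label, constraint in zip(labels, token_constraints)
    )
    # One dependency per adjacent pair: w1 -> w2, w2 -> w3, ...
    links = [f"{a}.nif:nextWord={b}" for a, b in zip(labels, labels[1:])]
    if not links:
        return tokens  # a single word needs no dependency section
    # Assumption: multiple dependencies are combined with " & ".
    return f"{tokens} :: ({' & '.join(links)})"


print(build_sequential_query(['conll:UPOSTAG="NOUN"', 'conll:UPOSTAG="VERB"']))
```

For the two-word input this reproduces the query from the example, `w1:[conll:UPOSTAG="NOUN"] w2:[conll:UPOSTAG="VERB"] :: (w1.nif:nextWord=w2)`.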
We plan to write a research paper about the complete CQP4RDF tool.
The CQP4RDF project can be extended with the following improvements:-
- Automatic uploading of the data to the Fuseki server instead of doing it manually. A script would be required to auto-upload the data, and it should be runnable from the `docker-compose` file.
- An `ADMIN` portal, which can be used to configure the tool and would save the settings as a config file.
- A portal which can be used to upload the data directly, without uploading from the SPARQL endpoint.
- Improving the API, so that it can return data in multiple formats.
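As a starting point for the auto-upload improvement, the following sketch posts an RDF file to a Fuseki dataset over the SPARQL Graph Store HTTP Protocol using only the Python standard library. It is not part of the project. The server URL, the dataset name `cdli`, the file name, the function names and the assumption that the RDF files are in Turtle are all placeholders that would need to match the real deployment.

```python
import urllib.request


def build_upload_request(fuseki_url, dataset, turtle_data):
    """Build a POST request that adds Turtle data to the dataset's default graph."""
    url = f"{fuseki_url.rstrip('/')}/{dataset}/data?default"
    return urllib.request.Request(
        url,
        data=turtle_data,
        headers={"Content-Type": "text/turtle; charset=utf-8"},
        method="POST",
    )


def upload_file(fuseki_url, dataset, path):
    """Read one RDF file and send it to Fuseki; returns the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_upload_request(fuseki_url, dataset, handle.read())
    with urllib.request.urlopen(request) as response:
        return response.status


# Example call (assumed URL, dataset and file name; needs a running server):
# upload_file("http://localhost:3030", "cdli", "corpus.ttl")
```

A script like this could be run by an extra service in the `docker-compose` file once the Fuseki container is up.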
My main task in the project was to convert the data to RDF format and then build the GUI for CQP4RDF so that CQP queries can be built easily. After this, the tool was dockerized so that it is easy to run and deploy. The most challenging task was deciding which steps to take to improve the tool, so that its power increases while it stays intuitive and easy to understand for non-technical users. Dockerizing the app was also a new experience for me, and I learned a lot from it. The mentors were very helpful and supportive and helped a lot in making decisions on how to proceed with the tool. Their constant support was the major factor in making this project happen.
The work is complete and I am looking forward to contributing a lot more to the project and the organization.