Data Annotation Best Practices Guide
Expert annotation of text resources, when carefully planned and executed, provides valuable and necessary training data for building models, as well as evaluation data for testing them. This document outlines some best practices for conducting data annotation projects. Text annotation requires care and concentration, and can therefore be labor-intensive and time-consuming, so it is important to design annotation projects deliberately.
TRAM presents special challenges for human annotation, given the large number of techniques described in the MITRE ATT&CK framework (over 600 total techniques and subtechniques). TRAM 2.0 focuses on annotating a subset of 50 ATT&CK techniques in order to form a more tractable research problem, but this is still a very large number of tags for an annotation effort.
The purpose of text annotation is to leverage human expertise to mark ground truth in the annotated documents. Annotators should not try to anticipate any needs of downstream processes (e.g., modeling). Such anticipation tends to distort annotation away from ground truth and harms the long-term usability of the annotated documents.
Before you undertake a data annotation project, create an annotation guidelines document with the following characteristics:
- State the purpose of the annotation effort. The purpose statement should emphasize the annotation of ground truth, without reference to any downstream uses of the annotated texts.
- Define elements to be annotated. These definitions should be guided by both domain expertise and linguistic expertise and should clearly delineate the set of tags that the annotators will be applying and the linguistic units (e.g., words, phrases, sentences, or larger blocks of text) to tag.
- Include examples. Each tag discussed should include an example with the tag applied to the appropriate linguistic unit. Include both positive examples (elements of text that should be annotated with a given tag) and negative examples (elements that an annotator might think should be tagged, but in fact should not). Negative examples should explain why a given tag should not be applied.
- Discuss edge cases. When annotating textual resources, there will be edge cases where one or more tags could be argued to apply. Discussing (and disambiguating) these cases in a guidelines document will improve the quality of resulting annotations.
- Offer guidance about annotation passes. For many annotation projects, the quality of annotations can be greatly improved by breaking the task into a reading pass, an annotation pass, and a review pass.
It is inevitable that an annotation guidelines document will need to be updated as pilot annotation occurs, and even during annotation of text to be used in downstream processes. Often these updates will include additional examples and discuss further edge cases.
Annotating text according to guidelines takes practice. Thus, having a pilot annotation phase where annotators can practice on sample documents while they learn the guidelines will increase the quality of subsequent annotation. If there are multiple annotators, they can also freely discuss their annotations with each other during the pilot phase, to benefit from each other’s expertise and perspective, and resolve ambiguities. Pilot annotation also affords an opportunity to refine and update the annotation guidelines.
After initial pilot annotation, one can check inter-annotator agreement and decide if more pilot annotation is needed, or if the project can proceed to the annotation phase. After all pilot documents are annotated, and the annotators are comfortable with the annotation guidelines, the effort can proceed to annotating the target documents.
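The agreement check described above can be sketched in code. The source does not name a specific agreement metric; the example below assumes Cohen's kappa, one common choice for two annotators, and assumes both annotators labeled the same pre-segmented text units (for phrase-level annotation, aligning units is an extra step not shown here). The function name and the sample labels are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("No items to compare")

    # Observed agreement: fraction of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, estimated from each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is undefined,
        # so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pilot labels for six text units; "NONE" marks an untagged unit.
annotator_1 = ["T1059", "NONE", "T1566", "NONE", "T1059", "NONE"]
annotator_2 = ["T1059", "NONE", "NONE", "NONE", "T1059", "T1566"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # prints 0.455
```

What counts as "enough" agreement to leave the pilot phase is a project decision; the score is most useful when tracked across successive pilot rounds.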
During the annotation phase, here are some best practices to keep in mind:
- Divide the full corpus into manageable groups of documents, assessing progress and annotation quality between groups. The size of the groups depends on the length of the documents. For very short documents, it may be reasonable to annotate 20 or more documents between quality checks. For longer documents, assessing quality every 5-10 documents may be appropriate.
- Perform multiple passes per document.
  - Reading Pass: For each document to be annotated, each annotator should perform a reading pass over the document before beginning annotation. Reading through a document before annotating allows the annotator to have the full context of the document in mind during the annotation pass. Having this context can help each annotator to disambiguate edge cases more readily as they go through the document, leading to more consistent annotation.
  - Annotation Pass: After the reading pass, each annotator goes through the document again, this time assigning tags according to the annotation guidelines.
  - Review Pass: A review pass is best performed over all documents within a group, after a reading pass and an annotation pass have been applied to each document. In the review pass, each annotator can assess their own annotations across the group of documents, ensuring the annotations are consistent and conform to the annotation guidelines.
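Splitting the corpus into groups with a quality check between them can be sketched as follows. This is a minimal illustration only: `batch_documents` and the document IDs are hypothetical names, and the batch size of 5 simply reflects the lower end of the range suggested above for longer documents.

```python
def batch_documents(doc_ids, batch_size):
    """Split an ordered list of document IDs into consecutive groups."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [doc_ids[i:i + batch_size] for i in range(0, len(doc_ids), batch_size)]


# Hypothetical corpus of 12 longer reports, checked for quality every 5 documents.
corpus = [f"report_{n:02d}" for n in range(1, 13)]
for number, group in enumerate(batch_documents(corpus, batch_size=5), start=1):
    # Each group gets a reading pass and an annotation pass per document,
    # then one review pass over the whole group before the quality check.
    print(f"group {number}: {len(group)} documents")
```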
Many annotation projects involve more than one person annotating. In these cases, there is a tradeoff between dividing the documents in the corpus among the annotators (single annotation) and having more than one annotator tag each document (double, triple, etc. annotation). In the former case, more documents may be annotated in less time, with a likely sacrifice in annotation quality and consistency. In the latter case, a smaller but higher-quality annotated corpus may be created. A hybrid approach is sometimes appropriate: if early phases of annotation show good consistency among annotators, the project may then move to assigning different documents to different annotators. If performing double annotation, annotators must work independently of one another to preserve the ability to calculate inter-annotator agreement.
When more than one person is annotating it is necessary to ensure the annotators are applying the guidelines consistently. This can be checked during the pilot annotation phase, when annotators can freely discuss their annotations and compare their annotation decisions. Periodically during the annotation phase, even if single annotation is employed, it is a best practice to include a small number of common documents that each annotator tags independently, so that inter-annotator agreement and consistency can be checked and maintained.
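One way to mix single annotation with a small set of common documents is sketched below. The function name, the round-robin distribution, and the fixed random seed are assumptions made for illustration; the source only recommends that some documents be tagged independently by every annotator.

```python
import random


def assign_documents(doc_ids, annotators, n_shared, seed=0):
    """Give every annotator the same n_shared documents; split the rest round-robin."""
    if not annotators:
        raise ValueError("At least one annotator is required")
    if not 0 <= n_shared <= len(doc_ids):
        raise ValueError("n_shared must be between 0 and the number of documents")

    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)

    shared, rest = shuffled[:n_shared], shuffled[n_shared:]
    # Every annotator tags the shared documents independently.
    assignments = {name: list(shared) for name in annotators}
    # Remaining documents are singly annotated, distributed in turn.
    for index, doc in enumerate(rest):
        assignments[annotators[index % len(annotators)]].append(doc)
    return shared, assignments


# Hypothetical group of 10 documents, 2 of which are common to both annotators.
docs = [f"doc_{n}" for n in range(10)]
shared, plan = assign_documents(docs, ["annotator_a", "annotator_b"], n_shared=2)
for name, assigned in plan.items():
    print(name, len(assigned))
```

Only the shared documents feed the agreement check; the rest contribute to corpus size.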
When double (or triple or more) annotation is practiced, there should be an adjudication phase in which the annotations of each document are reconciled into a single, gold-standard version. This can be performed by the annotators themselves in a joint meeting where they discuss each document, or by a third-party expert (the adjudicator). Double (or more) annotation followed by adjudication results in more consistent, higher-quality annotations. An example of adjudicating two annotations is shown in the Reconciliation (adjudication) interface in the MITRE Annotation Toolkit (MAT) below. Annotations from two separate human annotators are shown above the text (one in black, one in green), and an adjudication interface for selecting or modifying annotations is displayed. The results of adjudication should also be used to update and improve the annotation guidelines document.
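Independently of any particular tool, the first step of adjudication (separating what the annotators already agree on from what needs discussion) can be sketched as below. This is not how MAT implements reconciliation. It assumes each annotation is a `(start, end, tag)` tuple of character offsets and treats only exact matches as agreement; the function name, offsets, and tags are hypothetical.

```python
def split_for_adjudication(spans_a, spans_b):
    """Separate exactly matching annotations from those needing adjudication."""
    set_a, set_b = set(spans_a), set(spans_b)
    agreed = sorted(set_a & set_b)   # same offsets and same tag from both annotators
    only_a = sorted(set_a - set_b)   # tagged by annotator A only, or tagged differently
    only_b = sorted(set_b - set_a)   # tagged by annotator B only, or tagged differently
    return agreed, only_a, only_b


# Hypothetical annotations of one document as (start, end, tag) tuples.
annotator_a = [(10, 42, "T1059"), (80, 120, "T1566"), (200, 230, "T1105")]
annotator_b = [(10, 42, "T1059"), (80, 120, "T1204"), (300, 340, "T1105")]

agreed, only_a, only_b = split_for_adjudication(annotator_a, annotator_b)
print("agreed:", agreed)
print("to adjudicate (A):", only_a)
print("to adjudicate (B):", only_b)
```

Recurring disagreement patterns in the two "to adjudicate" lists, such as the same span receiving different tags, are natural candidates for new edge-case discussions in the guidelines.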
As stated earlier, annotation is labor-intensive, and thus adding annotators increases the cost of the annotation effort. Some researchers have looked closely at this issue. Dorr et al. (2006) established that there is considerable variability among human annotators, even in the comparatively constrained task of marking mentions of personal health identifiers (PHI) in clinical health records. This variability means that there will be items that different annotators miss or tag with different labels. Relying on a single annotator almost guarantees that the annotation will be incomplete and will contain mistakes that a second annotator could have caught. In a later study (also in the domain of marking PHI in clinical records), Carrell et al. (2016) compared annotations produced by one, two, three, or four annotators, and showed that adding a second annotator yields measurable improvements in annotation quality and coverage, but that the benefits diminish rapidly when adding a third or fourth.
Annotating mentions of ATT&CK techniques is a more complex task than annotating PHI: there are more than 600 ATT&CK techniques, while the PHI space is typically divided into 10-15 tag types. This effort focused on just 50 ATT&CK techniques, but this is still a very large number of tags for annotators to apply consistently and correctly. For annotating mentions of ATT&CK techniques, it is advisable to perform double annotation and to follow the practices outlined above to ensure consistency and quality.
We considered several tool features in selecting an annotation tool for the TRAM project. One of these was support for multiple annotations per sentence. As the project moved to phrasal rather than sentence-level annotation, this criterion became less crucial, but it remains important because descriptions of ATT&CK techniques can overlap in threat intelligence reports. Other criteria deemed essential include the ability to annotate plain text documents (necessary for most downstream modeling processes), the ability to export in standard formats (e.g., JSON and/or XML text formats, which matter for interoperability and portability to other tools), and support for multiple annotators. Other desirable features are local installation and configuration, a free license, and good documentation. We also considered whether a tool could natively annotate PDF files, though this is a rare feature in text annotation tools.
The MITRE Annotation Toolkit (MAT) meets all of these criteria. Furthermore, the TRAM project team has access to MAT experts inside MITRE.