This documentation provides a detailed overview of the sub-corpora (datasets) used in the training and evaluation of MaintNorm and its baseline comparisons. The focus of this documentation is to offer insights into the composition, structure, and usage of these datasets within the context of MaintNorm's application in heavy mobile equipment (HME) maintenance.
The MaintNorm dataset comprises 12,000 randomly sampled maintenance short texts (MSTs) originating from three prominent Australian mining and mineral processing organisations. These texts are instrumental in understanding the linguistic nuances and technical jargon prevalent in maintenance short texts.
- Single Annotator: The entire dataset was annotated by a single annotator.
- Annotation Guidelines: The guidelines followed during annotation are comprehensive, ensuring consistency and accuracy. These guidelines, along with the masking scheme employed, are detailed here.
To maintain confidentiality, sensitive information within the dataset has been substituted with obfuscated mock data. This process was executed programmatically and preserves the semantic structure of the original texts. For example, `BF079`, which is masked as an `<id>`, is not the true value; instead, it mocks the semantic structure of the original information, e.g. two uppercase alphabetical characters followed by three digits. This process is applied to all of the masks used in MaintNorm.
- Case Sensitivity: It is important to note that the dataset is case sensitive, reflecting the real-world usage of text in HME maintenance.
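The shape-preserving substitution described above (for example, `BF079` mocked as two uppercase letters followed by three digits) can be illustrated with a short sketch. This is not the obfuscation code used to build MaintNorm; the function name `mock_like` and the choice to preserve character classes position by position are assumptions made for illustration only.

```python
import random
import string


def mock_like(value: str, rng: random.Random) -> str:
    """Return a random string with the same character-class shape as `value`.

    Hypothetical helper: digits stay digits, uppercase letters stay uppercase,
    lowercase letters stay lowercase, and everything else is copied unchanged.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep punctuation and whitespace as-is
    return "".join(out)


rng = random.Random(0)  # seeded only so that repeated runs give the same mock
mock = mock_like("BF079", rng)
print(mock)  # a different value with the same shape: two letters, three digits
```

Because casing is preserved per character, a sketch like this also respects the case sensitivity of the dataset.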
The dataset is structured within the `./data` directory, segregated based on the source company (A, B, C, and combined A+B+C). The table below presents a correlation between the folder names in the `./data` directory and the experiments conducted as part of the MaintNorm study.
| Folder Name | Description |
|---|---|
| `company_a` | Dataset from Company A. Used in experiments with and without additional data. |
| `company_b` | Dataset from Company B. Used in experiments with and without additional data. |
| `company_c` | Dataset from Company C. Used in experiments with and without additional data. |
| `combined` | Combined dataset from Companies A, B, and C. The training portion of this dataset is used with `company_a`/`b`/`c` to form the "extra data" experiments. |
- Extra Data Concept: In some experiments, the training portion of datasets from other companies (not the primary one being experimented on) is used. This approach is referred to as 'using extra data'.
- Purpose: The rationale behind using extra data is to enrich the support for lexical normalisation and masking, thereby enhancing the model's robustness and applicability across different corporate contexts.
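The extra data concept can be sketched as a simple merge of training splits. The example below is a hedged illustration only: the function `build_training_set`, the in-memory dictionary of splits, and the placeholder texts are all hypothetical, and the source does not specify how the files in `./data` are loaded or combined.

```python
from typing import Dict, List


def build_training_set(
    primary: str,
    train_splits: Dict[str, List[str]],
    use_extra: bool,
) -> List[str]:
    """Return the training texts for an experiment on `primary`.

    With `use_extra=True`, the training portions of all other companies
    are appended to the primary company's training portion.
    """
    texts = list(train_splits[primary])
    if use_extra:
        for company, split in train_splits.items():
            if company != primary:
                texts.extend(split)
    return texts


# Placeholder training splits keyed by folder name (illustrative values only).
splits = {
    "company_a": ["a1", "a2"],
    "company_b": ["b1"],
    "company_c": ["c1"],
}
print(build_training_set("company_a", splits, use_extra=False))  # ['a1', 'a2']
print(build_training_set("company_a", splits, use_extra=True))   # ['a1', 'a2', 'b1', 'c1']
```

Only training portions are merged in this sketch, so the evaluation data of the primary company stays untouched.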
This section outlines the original corpus statistics, the normalised and masked corpus statistics, and the detailed characteristics of each sub-corpus.
The following table provides a summary of the MaintNorm corpus statistics. It displays statistics for 4,000 texts from each company, focusing on heavy mobile equipment, and includes token-based text length and vocabulary size. Each company has two rows: the original corpus, and the corpus after normalisation and masking. Changes due to normalisation and masking are indicated as signed percentages (+/- X%). The right-hand section of the table delineates the text transformations, categorising them as Modified for texts undergoing normalisation or masking, Norm Only for texts exclusively normalised, and Mask Only for texts solely subjected to masking.

| Company | Corpus | Length | Vocab Size | Tokens | Modified | Norm Only | Mask Only |
|---|---|---|---|---|---|---|---|
| A | Original | 5.2 (1.2) | 2,561 | 20,944 | - | - | - |
| A | Normalised and masked | 5.4 (1.3) (+3%) | 1,106 (-57%) | 21,591 (+3%) | 3,998 | 115 | 45 |
| B | Original | 5.5 (1.4) | 3,100 | 21,919 | - | - | - |
| B | Normalised and masked | 6.2 (1.8) (+13%) | 1,360 (-56%) | 24,690 (+13%) | 3,946 | 192 | 321 |
| C | Original | 5.1 (1.5) | 4,168 | 20,559 | - | - | - |
| C | Normalised and masked | 5.5 (1.8) (+7%) | 2,048 (-51%) | 22,114 (+7%) | 3,431 | 1,879 | 150 |
| A+B+C | Original | 5.3 (1.4) | 7,612 | 63,422 | - | - | - |
| A+B+C | Normalised and masked | 5.7 (1.7) (+8%) | 2,872 (-62%) | 68,395 (+8%) | 11,375 | 2,116 | 586 |
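Statistics of this kind can be computed with a few lines of code. The sketch below is a minimal illustration, assuming whitespace tokenisation, a case-sensitive vocabulary, and that the parenthesised value in the Length column is a standard deviation; none of these are confirmed by the source, and `corpus_stats` and `pct_change` are hypothetical helper names.

```python
from statistics import mean, stdev
from typing import Dict, List


def corpus_stats(texts: List[str]) -> Dict[str, float]:
    """Token-based length, vocabulary size, and token count for a list of texts."""
    token_lists = [text.split() for text in texts]  # assumed whitespace tokenisation
    lengths = [len(tokens) for tokens in token_lists]
    vocab = {token for tokens in token_lists for token in tokens}  # case sensitive
    return {
        "mean_length": mean(lengths),
        "std_length": stdev(lengths) if len(lengths) > 1 else 0.0,
        "vocab_size": len(vocab),
        "tokens": sum(lengths),
    }


def pct_change(before: float, after: float) -> int:
    """Signed percentage change from `before` to `after`, rounded to an integer."""
    return round(100 * (after - before) / before)


# Company A, using the counts reported in the table above.
print(pct_change(2561, 1106))    # vocabulary size: -57
print(pct_change(20944, 21591))  # tokens: 3
```

Keeping the vocabulary case sensitive matters here, since `REPLACE` and `replace` count as distinct entries before normalisation.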
The detailed characteristics (transformations, etc.) of the corpora after normalisation and masking are outlined in the following table.
| | A | B | C | A+B+C |
|---|---|---|---|---|
| **Normalisation Operations** | | | | |
| Character addition | 3,022 | 4,704 | 2,781 | 10,507 |
| Character removal | 191 | 939 | 247 | 1,377 |
| Character rearrangement | 145 | 118 | 233 | 358 |
| Character replacement | 209 | 508 | 231 | 950 |
| Token expansion | 662 | 2,264 | 1,281 | 4,207 |
| Token removal | 194 | 97 | 195 | 486 |
| Title cased | 69 | 118 | 97 | 284 |
| Partial casing added | 8 | 6 | 9 | 23 |
| All casing removed | 13,826 | 9,214 | 7,098 | 30,138 |
| All casing added | 4 | 29 | 36 | 69 |
| No change | 1,978 | 7,178 | 10,173 | 19,338 |
| **Normalisation Transformations** | | | | |
| 1:1 | 17,898 | 12,233 | 8,694 | 38,825 |
| 1:N | 662 | 2,264 | 1,281 | 4,207 |
| N:1 | 194 | 97 | 195 | 486 |
| N:M | 2 | 4 | 2 | 8 |
| N:0 | 7 | 15 | 6 | 28 |
| **Masking Operations** | | | | |
| `<id>` | 4,055 | 3,916 | 1,116 | 9,087 |
| `<sensitive>` | 44 | 25 | 155 | 224 |
| `<num>` | 573 | 1,349 | 847 | 2,769 |
| `<date>` | 9 | 2 | 49 | 60 |
Text normalisation involves various operations and transformations to modify and standardise text data. Here, we provide detailed descriptions and examples for each category.
Normalisation operations are fundamental changes to individual characters or tokens in the text. These include:
- Character Addition: Adding missing characters for correction.
  - Example: `rp` to `replace`
- Character Removal: Removing extraneous characters.
  - Example: `reeplace` to `replace`
- Character Rearrangement: Correcting character order.
  - Example: `erplace` to `replace`
- Character Replacement: Substituting one character for another.
  - Example: `teplace` to `replace`
- Token Expansion: Expanding abbreviations or acronyms.
  - Example: `c/o` to `change out`
- Token Removal: Removing unnecessary tokens.
  - Example: `replace $` to `replace`
- Title Cased: Capitalising the first letter of each word.
  - Example: `a-frame` to `A-frame`
- Partial Casing Added: Adding case to part of a token.
  - Example: `TEco` to `TECO`
- All Casing Removed: Converting all characters to lower case.
  - Example: `REPLACE` to `replace`
- All Casing Added: Converting all characters to upper case.
  - Example: `teco` to `TECO`
- No Change: No modification required.
  - Example: `replace`
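To make the character-level and casing categories above concrete, the following sketch labels a single source/target token pair. It is a heuristic written for illustration, not the procedure used to produce the MaintNorm statistics: the function names and decision rules are assumptions, and token expansion and token removal are deliberately left out because they change the number of tokens.

```python
def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def classify_operation(src: str, tgt: str) -> str:
    """Heuristically label the operation that turns token `src` into `tgt`."""
    if src == tgt:
        return "No change"
    if src.lower() == tgt.lower():  # the two tokens differ only in casing
        if tgt == tgt.lower():
            return "All casing removed"
        if tgt == tgt.upper():
            # Fully lower-cased source: all casing was added; otherwise only part.
            return "All casing added" if src == src.lower() else "Partial casing added"
        return "Title cased"
    if len(tgt) > len(src) and is_subsequence(src, tgt):
        return "Character addition"
    if len(tgt) < len(src) and is_subsequence(tgt, src):
        return "Character removal"
    if sorted(src) == sorted(tgt):
        return "Character rearrangement"
    if len(src) == len(tgt):
        return "Character replacement"
    return "Other"  # mixed edits that this sketch does not try to separate


for src, tgt in [("rp", "replace"), ("erplace", "replace"), ("TEco", "TECO")]:
    print(src, "->", tgt, ":", classify_operation(src, tgt))
```

Real annotations may combine several operations on one token, which a single-label heuristic like this cannot represent.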
Normalisation transformations involve changing the structure of the text, often affecting the number of tokens:
- `1:1`: One-to-one transformation, replacing a single token with another single token.
  - Example: `repl` to `replace`
- `1:N`: One-to-many transformation, replacing a single token with multiple tokens.
  - Example: `c/o` to `change out`
- `N:1`: Many-to-one transformation, consolidating multiple tokens into a single token.
  - Example: `repl ace` to `replace`
- `N:M`: Many-to-many transformation, replacing several tokens with a different number of tokens.
  - Example: `c/ outeng` to `change out engine`
- `N:0`: Removing tokens without replacement.
  - Example: `$` to `_`
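The transformation types above depend only on how many tokens appear on each side of an alignment, which the sketch below makes explicit. It assumes whitespace tokenisation and a hypothetical function name, `transformation_type`; it also treats any alignment with several tokens on both sides as `N:M`, which is a simplification of the definition given above.

```python
from typing import Sequence


def transformation_type(source_tokens: Sequence[str], target_tokens: Sequence[str]) -> str:
    """Label an aligned (source, target) token span by its token counts."""
    n, m = len(source_tokens), len(target_tokens)
    if n == 0:
        raise ValueError("an alignment needs at least one source token")
    if m == 0:
        return "N:0"  # tokens removed without replacement
    if n == 1 and m == 1:
        return "1:1"
    if n == 1:
        return "1:N"
    if m == 1:
        return "N:1"
    return "N:M"


# The examples from the list above; an empty target string stands for removal.
pairs = [
    ("repl", "replace"),
    ("c/o", "change out"),
    ("repl ace", "replace"),
    ("c/ outeng", "change out engine"),
    ("$", ""),
]
for src, tgt in pairs:
    print(repr(src), "->", repr(tgt), ":", transformation_type(src.split(), tgt.split()))
```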
Masking operations involve replacing specific types of information with semantic tags:
- `<id>`: Masking identifiers.
  - Example: "replace ENG01" to "replace `<id>`"
- `<sensitive>`: Masking sensitive information.
  - Example: "John Smith to inspect" to "`<sensitive>` to inspect"
- `<num>`: Masking numerical values.
  - Example: "replace 2 vents" to "replace `<num>` vents"
- `<date>`: Masking date information.
  - Example: "inspection 2nd March" to "inspection `<date>`"
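As a rough illustration of what the `<id>` and `<num>` masks do to a text, the sketch below applies two regular expressions. This is not how the MaintNorm masks were produced (the dataset was annotated manually); the patterns and the function name `mask_text` are assumptions that only cover the simple examples above, and `<sensitive>` and `<date>` are omitted because they cannot be captured reliably by a short pattern.

```python
import re

# Hypothetical patterns: an identifier is one or more uppercase letters followed
# by digits (e.g. ENG01, BF079); a number is a standalone run of digits.
ID_PATTERN = re.compile(r"\b[A-Z]+\d+\b")
NUM_PATTERN = re.compile(r"\b\d+\b")


def mask_text(text: str) -> str:
    """Replace identifier-like and number-like tokens with semantic tags."""
    text = ID_PATTERN.sub("<id>", text)  # identifiers first, so their digits are not
    return NUM_PATTERN.sub("<num>", text)  # later mistaken for standalone numbers


print(mask_text("replace ENG01"))    # replace <id>
print(mask_text("replace 2 vents"))  # replace <num> vents
```

The identifier pattern is case sensitive, in line with the case sensitivity of the dataset.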