This is the datasheet for the AGB-DE corpus.
Why was the dataset created? (e.g., were there specific tasks in mind, or a specific gap that needed to be filled?)
The dataset was created to enable the training and evaluation of machine learning models that can detect potentially void clauses in consumer standard form contracts.
What (other) tasks could the dataset be used for? Are there obvious tasks for which it should not be used?
The dataset can also be used for clause topic classification.
Has the dataset been used for any tasks already? If so, where are the results so others can compare (e.g., links to published papers)?
ToDo add link
Who funded the creation of the dataset? If there is an associated grant, provide the grant number.
The data collection and annotation were supported by funds from the Federal Ministry of Justice and Consumer Protection (BMJV) based on a decision of the Parliament of the Federal Republic of Germany, via the Federal Office for Agriculture and Food (BLE), under the innovation support programme.
What are the instances? (that is, examples; e.g., documents, images, people, countries) Are there multiple types of instances? (e.g., movies, users, ratings; people, interactions between them; nodes, edges)
Each instance consists of a clause from a consumer standard form contract and includes the text of the clause, the title (if any), the language of the clause, a unique ID, a unique ID identifying the contract the clause is from, and three annotations: whether the annotators considered the clause potentially void, a list of topics, and a list of subtopics.
Are relationships between instances made explicit in the data (e.g., social network links, user/movie ratings, etc.)?
Clauses from the same contract are linked through the contract ID.
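For illustration, assuming the corpus has been loaded into a pandas DataFrame with columns named id, contract_id, and text (the column names here are assumptions, not the corpus's authoritative schema), clauses from the same contract can be regrouped as follows:

```python
import pandas as pd

# Toy data with assumed column names; the actual corpus schema may differ.
clauses = pd.DataFrame(
    {
        "id": ["c1", "c2", "c3"],
        "contract_id": ["k1", "k1", "k2"],
        "text": ["Klausel 1 ...", "Klausel 2 ...", "Klausel 3 ..."],
    }
)

# Reassemble contracts from their clauses via the shared contract ID.
for contract_id, contract_clauses in clauses.groupby("contract_id"):
    print(contract_id, len(contract_clauses), "clauses")
```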
How many instances of each type are there?
The dataset consists of 3764 clauses in total; 179 have been annotated as potentially void and 3585 as likely valid.
What data does each instance consist of? “Raw” data (e.g., unprocessed text or images)? Features/attributes? Is there a label/target associated with instances? If the instances are related to people, are subpopulations identified (e.g., by age, gender, etc.) and what is their distribution?
Each instance consists of the clause text, the title of the clause (if any), the language of the clause, a unique ID, a unique ID identifying the contract the clause is from, and three annotations: whether the annotators considered the clause potentially void, a list of topics, and a list of subtopics.
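As a sketch, a single instance might look like the following; the field names are illustrative assumptions based on the description above, not the corpus's actual schema:

```python
# Hypothetical instance layout; field names and values are placeholders.
example_instance = {
    "id": "1234",                  # unique ID of the clause
    "contract_id": "42",           # unique ID of the originating contract
    "title": "§ 5 Haftung",        # clause title, may be empty
    "text": "Die Haftung des Anbieters ist ausgeschlossen ...",
    "language": "de",
    "void": True,                  # annotated as potentially void
    "topics": ["liability"],       # list of topics
    "subtopics": ["liability_exclusion"],  # list of subtopics
}
```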
Is everything included or does the data rely on external resources? (e.g., websites, tweets, datasets) If external resources, a) are there guarantees that they will exist, and remain constant, over time; b) is there an official archival version. Are there licenses, fees or rights associated with any of the data?
Everything is included in the dataset.
Are there recommended data splits or evaluation measures? (e.g., training, development, testing; accuracy/AUC)
Splits for training and testing are available together with the corpus. We suggest using metrics that work well on unbalanced data and strongly discourage using accuracy as a metric on this dataset.
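As a minimal sketch of this recommendation (using scikit-learn; the labels below are placeholders, with 1 = potentially void and 0 = likely valid), per-class precision, recall, and F1 are more informative than accuracy:

```python
from sklearn.metrics import classification_report, f1_score

# Placeholder predictions on a toy sample with the same imbalance direction
# as the corpus: far fewer potentially void (1) than likely valid (0) clauses.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# With only ~5% positive instances in the corpus, a trivial "always valid"
# classifier already reaches ~95% accuracy, which is why accuracy misleads here.
print(f1_score(y_true, y_pred))               # F1 of the minority (void) class
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```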
What experiments were initially run on this dataset? Have a summary of those results and, if available, provide the link to a paper with more information here.
The dataset was initially used to train classifiers that are able to detect potentially void clauses in consumer contracts.
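A minimal baseline in the spirit of this task, not the models actually used, could pair TF-IDF features with a class-weighted linear classifier; all texts and labels below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder German clause texts (1 = potentially void, 0 = likely valid).
texts = [
    "Die Haftung des Anbieters ist vollständig ausgeschlossen.",
    "Die Lieferzeit beträgt in der Regel fünf Werktage.",
    "Der Gerichtsstand ist ausschließlich der Sitz des Anbieters.",
    "Der Vertrag kann monatlich gekündigt werden.",
]
labels = [1, 0, 1, 0]

# class_weight="balanced" counters the strong label imbalance in the corpus.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced"),
)
model.fit(texts, labels)

print(model.predict(["Der Anbieter haftet nicht für Schäden jeglicher Art."]))
```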
How was the data collected? (e.g., hardware apparatus/sensor, manual human curation, software program, software interface/API; how were these constructs/measures/methods validated?)
The data was manually collected by human annotators and copied into a structured Excel format.
Who was involved in the data collection process? (e.g., students, crowdworkers) How were they compensated? (e.g., how much were crowdworkers paid?)
The data was collected by fully qualified lawyers during their usual working hours. All participants worked for organizations that pay according to the collective labor agreement for public service workers in the German states.
Over what time-frame was the data collected? Does the collection time-frame match the creation time-frame?
The data was collected between 2021 and 2022 and annotated between 2021 and 2023. The creation date of most of the items is unknown.
How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part of speech tags; model-based guesses for age or language)? If the latter two, were they validated/verified and if so how?
The clause texts were directly observable; the annotations were created manually by annotators who are experts in the subject of the annotation.
Does the dataset contain all possible instances? Or is it, for instance, a sample (not necessarily random) from a larger set of instances?
No, the dataset does not claim completeness in any sense.
If the dataset is a sample, then what is the population? What was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Is the sample representative of the larger set (e.g., geographic coverage)? If not, why not (e.g., to cover a more diverse range of instances)? How does this affect possible uses?
We believe that the dataset is somewhat representative of standard form consumer contracts in Germany. It is sampled from different industries (e.g., e-commerce and fitness).
Is there information missing from the dataset and why? (this does not include intentionally dropped instances; it might include, e.g., redacted text, withheld documents) Is this data missing because it was unavailable?
The data has been anonymised, i.e., company names, phone numbers, addresses, tax IDs, and similar information have been removed.
How is the dataset distributed? (e.g., website, API, etc.; does the data have a DOI; is it archived redundantly?)
It is archived on GitHub and, for easier access, also available on the Hugging Face Hub.
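For convenience, the corpus can then be loaded with the datasets library; the repository identifier and split name below are assumptions and should be checked against the actual Hugging Face Hub page:

```python
from datasets import load_dataset

# Hypothetical repository ID; replace with the actual one from the Hub.
agb_de = load_dataset("d4br4/agb-de")

print(agb_de)               # lists the available splits
print(agb_de["train"][0])   # first instance of the (assumed) train split
```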
When will the dataset be released/first distributed? (Is there a canonical paper/reference for this dataset?)
June 2024.
What license (if any) is it distributed under? Are there any copyrights on the data?
The annotations are licensed under CC-BY-SA 4.0.
Are there any fees or access/export restrictions?
No.
Who is supporting/hosting/maintaining the dataset? How does one contact the owner/curator/manager of the dataset (e.g. email address, or other contact info)?
Daniel Braun, [email protected]
Will the dataset be updated? How often and by whom? How will updates/revisions be documented and communicated (e.g., mailing list, GitHub)? Is there an erratum?
There are no plans to update the dataset unless significant errors come to light.
If the dataset becomes obsolete how will this be communicated?
On the GitHub page.
Is there a repository to link to any/all papers/systems that use this dataset?
Yes.
If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? If so, is there a process for tracking/assessing the quality of those contributions. What is the process for communicating/distributing these contributions to users?
We suggest creating a fork on GitHub.
If the dataset relates to people (e.g., their attributes) or was generated by people, were they informed about the data collection? (e.g., datasets that collect writing, photos, interactions, transactions, etc.)
No information about individuals is contained in the data, and none was recorded during the annotation of the data.
Does the dataset comply with the EU General Data Protection Regulation (GDPR)? Does it comply with any other standards, such as the US Equal Employment Opportunity Act?
Yes, since only publicly available information was collected, the dataset complies with the GDPR and similar regulations.
Does the dataset contain information that might be considered sensitive or confidential? (e.g., personally identifying information)
No.
Does the dataset contain information that might be considered inappropriate or offensive?
No.