avichaychriqui/Legal-HeBERT

Legal-HeBERT

Legal-HeBERT is a BERT model for the Hebrew legal and legislative domains. It is intended to advance legal NLP research and tool development in Hebrew. We release two versions of Legal-HeBERT. The first is HeBERT fine-tuned on legal and legislative documents. The second follows HeBERT's architecture guidelines to train a BERT model from scratch.
We continue to collect legal data, examine different architectural designs, and prepare tagged datasets and legal tasks for evaluating and developing Hebrew legal tools.

Training Data

Our training datasets are:

| Name | Hebrew Description | Size (GB) | Documents | Sentences | Words | Notes |
|---|---|---|---|---|---|---|
| The Israeli Law Book | ספר החוקים הישראלי | 0.05 | 2,338 | 293,352 | 4,851,063 | |
| Judgments of the Supreme Court | מאגר פסקי הדין של בית המשפט העליון | 0.7 | 212,348 | 5,790,138 | 79,672,415 | |
| Custody courts | החלטות בתי הדין למשמורת | 2.46 | 169,708 | 8,555,893 | 213,050,492 | |
| Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment | תזכירי חוק, טיוטות חקיקת משנה וטיוטות מבחני תמיכה שהופצו להערות הציבור | 0.4 | 3,291 | 294,752 | 7,218,960 | |
| Supervisors of Land Registration judgments | מאגר פסקי דין של המפקחים על רישום המקרקעין | 0.02 | 559 | 67,639 | 1,785,446 | |
| Decisions of the Labor Court - Corona | מאגר החלטות בית הדין לעניין שירות התעסוקה – קורונה | 0.001 | 146 | 3,505 | 60,195 | |
| Decisions of the Israel Lands Council | החלטות מועצת מקרקעי ישראל | | 118 | 11,283 | 162,692 | aggregate file |
| Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal | פסקי דין של בית הדין למשמעת ובית הדין לערעורים של משטרת ישראל | 0.02 | 54 | 83,724 | 1,743,419 | aggregate files |
| Disciplinary Appeals Committee in the Ministry of Health | ועדת ערר לדין משמעתי במשרד הבריאות | 0.004 | 252 | 21,010 | 429,807 | 465 files are scanned and were not parsed |
| Attorney General's Positions | מאגר התייצבויות היועץ המשפטי לממשלה | 0.008 | 281 | 32,724 | 813,877 | |
| Legal-Opinion of the Attorney General | מאגר חוות דעת היועץ המשפטי לממשלה | 0.002 | 44 | 7,132 | 188,053 | |
| Total | | 3.665 | 389,139 | 15,161,152 | 309,976,419 | |

We thank Yair Gardin for referring us to the governance data, Elhanan Schwarts for collecting and parsing the Israeli Law Book, and Jonathan Schler for collecting the judgments of the Supreme Court.

Training process

  • Vocabulary size: 50,000 tokens
  • 4 epochs (approximately 1M steps)
  • Learning rate: 5e-5
  • mlm_probability=0.15
  • Batch size: 32 per GPU
  • NVIDIA GeForce RTX 2080 Ti + NVIDIA GeForce RTX 3090 (1 week of training)
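As a rough sanity check on the step count, the figures above can be combined with the sentence total from the training-data table. This is a minimal sketch under two assumptions that the source does not state: each training example is one sentence, and both GPUs act as data-parallel workers (so the effective batch is 64). The helper `optimizer_steps` is a hypothetical name introduced only for this illustration.

```python
# Figures taken from the training-data table and the settings listed above.
TOTAL_SENTENCES = 15_161_152
BATCH_SIZE_PER_GPU = 32
EPOCHS = 4

# Assumptions (not stated in the source): one training example per sentence,
# and both GPUs used as data-parallel workers.
NUM_GPUS = 2


def optimizer_steps(num_examples, per_gpu_batch, num_gpus, epochs):
    """Estimate the number of optimizer steps for a full training run."""
    effective_batch = per_gpu_batch * num_gpus
    # Ceiling division: a final partial batch still costs one step.
    steps_per_epoch = -(-num_examples // effective_batch)
    return steps_per_epoch * epochs


total_steps = optimizer_steps(
    TOTAL_SENTENCES, BATCH_SIZE_PER_GPU, NUM_GPUS, EPOCHS
)
print(total_steps)  # 947572
```

Under these assumptions the estimate lands close to the reported figure of roughly 1M steps. A different example definition (for instance fixed-length token chunks) would give a different count.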

Additional training settings:

Fine-tuned HeBERT model: The first eight layers were frozen (as suggested by Lee et al. (2019)).
Legal-HeBERT trained from scratch: The training process is similar to that of HeBERT and is inspired by Chalkidis et al. (2020).
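Freezing the first eight encoder layers could look roughly like the sketch below. It is an illustration, not the authors' training script: `freeze_first_layers` is a hypothetical helper, the embeddings are left trainable (the source does not say whether they were frozen), and a small randomly initialised BERT is used so the example runs offline. In practice the model would be loaded from the HeBERT checkpoint instead.

```python
from transformers import BertConfig, BertForMaskedLM


def freeze_first_layers(model, num_frozen=8):
    """Disable gradient updates for the first `num_frozen` encoder layers."""
    for layer in model.bert.encoder.layer[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
    return model


# Tiny stand-in model with 12 layers (assumed sizes, chosen only to keep the
# example fast); replace with the real pretrained checkpoint when fine-tuning.
config = BertConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=128,
)
model = freeze_first_layers(BertForMaskedLM(config), num_frozen=8)

# Indices of encoder layers that still receive gradient updates.
trainable_layers = [
    i
    for i, layer in enumerate(model.bert.encoder.layer)
    if any(p.requires_grad for p in layer.parameters())
]
print(trainable_layers)  # [8, 9, 10, 11]
```

Only the top four layers (and, in this sketch, the embeddings and the MLM head) are updated, which reduces memory use and training time during domain adaptation.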

How to use

The models are available on the Hugging Face Hub and can be fine-tuned for any downstream task:

# !pip install transformers==4.14.1
from transformers import AutoModel, AutoTokenizer, pipeline

# Choose one of the two released checkpoints:
# model_name = 'avichr/Legal-heBERT_ft'  # the fine-tuned HeBERT model
model_name = 'avichr/Legal-heBERT'  # Legal-HeBERT trained from scratch

# Load the tokenizer and the encoder, e.g. as a starting point for fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Predict the masked token with the fill-mask pipeline.
fill_mask = pipeline(
    "fill-mask",
    model=model_name,
)
# English: "Corona took [MASK] and we were left with nothing."
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
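The encoder returned by `AutoModel` produces per-token hidden states. One common way to turn them into a single vector per text is mean pooling over the non-padding tokens, sketched below. This is an illustrative choice and not something the source prescribes; `mean_pool` and `embed` are hypothetical helper names. The small demo uses dummy tensors so it runs without downloading a model.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)  # avoid division by zero
    return summed / counts


def embed(texts, tokenizer, model):
    """Encode a list of texts into one vector each (hypothetical helper)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return mean_pool(output.last_hidden_state, batch["attention_mask"])


# Offline demo: one sequence of three tokens, the last one is padding.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled)  # averages only the two unmasked token vectors
```

With a tokenizer and model loaded via `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained`, calling `embed(["..."], tokenizer, model)` would return a tensor of shape `(number of texts, hidden size)` that can feed a downstream classifier.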

Stay tuned!

We are still working on our models and datasets, and we will update this page as we progress. We are open to collaborations.

If you use this model, please cite us as:

Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai, Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts (June 27, 2022). Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4147127

@article{chriqui2021hebert,
  title={Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts},
  author={Chriqui, Avihay and Yahav, Inbal and Bar-Siman-Tov, Ittai},
  journal={SSRN preprint:4147127},
  year={2022}
}

Contact us

Avichay Chriqui, The Coller AI Lab
Inbal Yahav, The Coller AI Lab
Ittai Bar-Siman-Tov, the BIU Innovation Lab for Law, Data-Science and Digital Ethics

Thank you, תודה, شكرا
