A Materials Domain Language Model for Text Mining and Information Extraction
```bash
bash install_requirements.sh
```
```python
import torch
from normalize_text import normalize
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('m3rg-iitd/matscibert')
model = AutoModel.from_pretrained('m3rg-iitd/matscibert')

sentences = ['SiO2 is a network former.']
# Normalize text before tokenization (handles unicode and materials-specific notation)
norm_sents = [normalize(s) for s in sentences]

tokenized_sents = tokenizer(norm_sents)
# Convert token id lists to LongTensors
# (equivalently: tokenizer(norm_sents, padding=True, return_tensors='pt'))
tokenized_sents = {k: torch.Tensor(v).long() for k, v in tokenized_sents.items()}

with torch.no_grad():
    last_hidden_state = model(**tokenized_sents)[0]
```
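The `last_hidden_state` tensor has shape `(batch, sequence_length, 768)`; to get one fixed-size vector per sentence you can average the token embeddings, masking out padding. A minimal sketch of such pooling (the `mean_pool` helper is illustrative, not part of this repository, and the tensors below are dummies standing in for model output):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts

# Demonstration on dummy tensors with MatSciBERT's hidden size (768):
hidden = torch.randn(2, 10, 768)
mask = torch.ones(2, 10, dtype=torch.long)
mask[1, 6:] = 0  # second sentence is padded after 6 tokens
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 768])
```

In practice, pass the model's `last_hidden_state` and the tokenizer's `attention_mask` to get one 768-dimensional embedding per input sentence.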
If you use MatSciBERT in your research, please cite the following paper:
@article{gupta_matscibert_2022,
title = "{MatSciBERT}: A Materials Domain Language Model for Text Mining and Information Extraction",
author = "Gupta, Tanishq and
Zaki, Mohd and
Krishnan, N. M. Anoop and
Mausam",
year = "2022",
month = may,
journal = "npj Computational Materials",
volume = "8",
number = "1",
pages = "102",
issn = "2057-3960",
url = "https://www.nature.com/articles/s41524-022-00784-w",
doi = "10.1038/s41524-022-00784-w"
}