
MatSciBERT

A Materials Domain Language Model for Text Mining and Information Extraction

Using MatSciBERT

Installing requirements

bash install_requirements.sh

Importing libraries

import torch
from normalize_text import normalize
from transformers import AutoModel, AutoTokenizer

Loading pre-trained model and tokenizer

tokenizer = AutoTokenizer.from_pretrained('m3rg-iitd/matscibert')
model = AutoModel.from_pretrained('m3rg-iitd/matscibert')
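
The model is used here only for inference, so it can optionally be switched to evaluation mode to disable dropout. This is a standard PyTorch step, not part of the original instructions:

model.eval()  # inference only: disable dropout so embeddings are deterministic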

Tokenizing sentences

sentences = ['SiO2 is a network former.']
norm_sents = [normalize(s) for s in sentences]
tokenized_sents = tokenizer(norm_sents)
tokenized_sents = {k: torch.Tensor(v).long() for k, v in tokenized_sents.items()}  # convert token id lists to LongTensors
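
The manual tensor conversion above works when all sentences tokenize to the same length, as with the single example sentence. For a batch of sentences of varying length, the tokenizer can pad, truncate, and return PyTorch tensors directly; the arguments below are standard Hugging Face transformers options rather than part of the original example:

tokenized_sents = tokenizer(norm_sents, padding=True, truncation=True, return_tensors='pt')  # padded batch with input_ids and attention_mask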

Obtaining MatSciBERT embeddings

with torch.no_grad():  # no gradient tracking needed for inference
    last_hidden_state = model(**tokenized_sents)[0]  # shape: (batch_size, sequence_length, hidden_size)
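
If fixed-size sentence vectors are needed from last_hidden_state, two common pooling choices (not prescribed by this repository) are the [CLS] token embedding and an attention-mask-weighted mean over tokens, sketched below:

cls_embeddings = last_hidden_state[:, 0]  # embedding of the [CLS] token for each sentence

mask = tokenized_sents['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
mean_embeddings = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # mask-aware mean over tokens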

Citing

If you use MatSciBERT in your research, please cite "MatSciBERT: A materials domain language model for text mining and information extraction":

@article{gupta_matscibert_2022,
  title   = "{MatSciBERT}: A Materials Domain Language Model for Text Mining and Information Extraction",
  author  = "Gupta, Tanishq and 
            Zaki, Mohd and 
            Krishnan, N. M. Anoop and 
            Mausam",
  year    = "2022",
  month   = may,
  journal = "npj Computational Materials",
  volume  = "8",
  number  = "1",
  pages   = "102",
  issn    = "2057-3960",
  url     = "https://www.nature.com/articles/s41524-022-00784-w",
  doi     = "10.1038/s41524-022-00784-w"
}
Fine-tuned models for named entity recognition, abstract classification, and relation classification, trained with the code in this repository, are archived and available via DOI.
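
As an illustration only, a fine-tuned checkpoint downloaded from that archive could be loaded with the corresponding transformers head (AutoModelForTokenClassification for NER); the local path below is hypothetical:

from transformers import AutoModelForTokenClassification, AutoTokenizer

ner_dir = './matscibert_ner_checkpoint'  # hypothetical path to a downloaded fine-tuned NER model
ner_tokenizer = AutoTokenizer.from_pretrained(ner_dir)
ner_model = AutoModelForTokenClassification.from_pretrained(ner_dir)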

