Sentence embeddings topology analysis: a project for NPFL087 Statistical Machine Translation at MFF UK.

johnvojtech/sentence-topology

Topology of Czech sentences

This repo contains the code for a paper we wrote for LREC-COLING 2024. The paper is called "Unveiling Semantic Information in Sentence Embeddings" and is available in the ACL Anthology.

Installation

Just run the following command (using your virtual environment) in the root of the project:

pip install -e .

Scripts

  • sent-transfomer-embedding - gets embeddings using sentence-transformers package
# For unsupervised embeddings
sent-transfomer-embedding -i ./data/COSTRA1.1.tsv -o ./embeddings/{model}.tsv

# For supervised embeddings
sent-transfomer-embedding -i ./data/COSTRA1.1.tsv -o ./embeddings/{model}_{split_ind}.tsv --train_objective "transformation-prediction"
  • bow-embedding - creates embeddings for BOW and TF-IDF models
# Pure BOW; filters out tokens based on their document frequency
bow-embedding -i ./data/COSTRA1.1.tsv -o ./embeddings/bow_limited.tsv --max_df 0.8 --min_df 0.001 --no-tfidf

# TF-IDF; does not limit tokens
bow-embedding -i ./data/COSTRA1.1.tsv -o ./embeddings/tfidf.tsv --tfidf
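
To illustrate what document-frequency filtering means for a pure BOW model, here is a minimal, standard-library-only sketch. It is not the code behind bow-embedding: the function name bow_vectors, the whitespace tokenization, the toy sentences, and the reading of the thresholds as fractions of sentences are all assumptions made for this example.

```python
from collections import Counter


def bow_vectors(sentences, max_df=1.0, min_df=0.0):
    """Hypothetical helper: count vectors after document-frequency filtering.

    Assumption: max_df and min_df are fractions of sentences, and tokens
    are obtained by lowercasing and splitting on whitespace.
    """
    tokenized = [s.lower().split() for s in sentences]
    n_docs = len(tokenized)

    # Document frequency: number of sentences containing the token at least once.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # Keep only tokens whose share of sentences lies within [min_df, max_df].
    vocab = sorted(
        tok for tok, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )
    index = {tok: i for i, tok in enumerate(vocab)}

    # One count vector per sentence, with one dimension per kept token.
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok in toks:
            if tok in index:
                vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors


# Toy data, made up for this illustration.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the bird flew away",
    "the cat chased the bird",
]
vocab, vectors = bow_vectors(sentences, max_df=0.8, min_df=0.3)
print(vocab)
print(vectors)
```

In this toy run, "the" occurs in every sentence and is dropped by the upper threshold, while tokens that occur in only one sentence are dropped by the lower one. The thresholds here are chosen for a four-sentence example and are unrelated to the values used in the command above.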

Cite this

If you use our work, please cite the paper mentioned above:

@inproceedings{zhang-etal-2024-unveiling,
    title = "Unveiling Semantic Information in Sentence Embeddings",
    author = "Zhang, Leixin  and
      Burian, David  and
      John, Vojt{\v{e}}ch  and
      Bojar, Ond{\v{r}}ej",
    editor = "Bonial, Claire  and
      Bonn, Julia  and
      Hwang, Jena D.",
    booktitle = "Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.dmr-1.5",
    pages = "39--47",
    abstract = "This study evaluates the extent to which semantic information is preserved within sentence embeddings generated from state-of-art sentence embedding models: SBERT and LaBSE. Specifically, we analyzed 13 semantic attributes in sentence embeddings. Our findings indicate that some semantic features (such as tense-related classes) can be decoded from the representation of sentence embeddings. Additionally, we discover the limitation of the current sentence embedding models: inferring meaning beyond the lexical level has proven to be difficult.",
}
