Topology of Czech sentences

This repository contains the code for our paper "Unveiling Semantic Information in Sentence Embeddings", published at the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024. The paper is available in the ACL Anthology.

Installation

Run the following command from the root of the project, inside your virtual environment:

pip install -e .

Scripts

sent-transfomer-embedding - computes sentence embeddings using the sentence-transformers package

# For unsupervised embeddings
sent-transfomer-embedding -i ./data/COSTRA1.1.tsv -o ./embeddings/{model}.tsv

# For supervised embeddings
sent-transfomer-embedding -i ./data/COSTRA1.1.tsv -o ./embeddings/{model}_{split_ind}.tsv --train_objective "transformation-prediction"

bow-embedding - creates embeddings for BOW and TF-IDF models

# Pure BOW; filters out tokens based on their document frequency
bow-embedding -i ./data/COSTRA1.1.tsv -o ./embeddings/bow_limited.tsv --max_df 0.8 --min_df 0.001 --no-tfidf

# TF-IDF; does not limit tokens
bow-embedding -i ./data/COSTRA1.1.tsv -o ./embeddings/tfidf.tsv --tfidf
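
To make the document-frequency filter and the TF-IDF option more concrete, here is a minimal pure-Python sketch. It is not the implementation behind bow-embedding. It assumes that --min_df and --max_df are fractions of sentences in which a token occurs, and that TF-IDF means raw term count times log(N / df). The whitespace tokenizer, the function names, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return sentence.lower().split()


def build_vocabulary(sentences, min_df=0.0, max_df=1.0):
    """Keep tokens whose document-frequency ratio lies in [min_df, max_df]."""
    n_docs = len(sentences)
    doc_freq = Counter()
    for sentence in sentences:
        # Count each token once per sentence.
        doc_freq.update(set(tokenize(sentence)))
    kept = sorted(
        token for token, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )
    return kept, doc_freq


def embed(sentences, min_df=0.0, max_df=1.0, tfidf=False):
    """Return the vocabulary and one count (or TF-IDF) vector per sentence."""
    vocab, doc_freq = build_vocabulary(sentences, min_df, max_df)
    n_docs = len(sentences)
    vectors = []
    for sentence in sentences:
        counts = Counter(tokenize(sentence))
        row = []
        for token in vocab:
            weight = float(counts[token])
            if tfidf:
                # Assumed weighting: plain idf = log(N / df), no smoothing.
                weight *= math.log(n_docs / doc_freq[token])
            row.append(weight)
        vectors.append(row)
    return vocab, vectors


# Toy data, not taken from COSTRA.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the bird flew away",
]

# Pure BOW with an upper document-frequency limit.
vocab, bow = embed(sentences, max_df=0.8)

# TF-IDF without any token limit.
vocab_tfidf, tfidf_vectors = embed(sentences, tfidf=True)
```

In this toy example "the" occurs in every sentence, so the max_df limit removes it from the pure BOW vocabulary, while the unrestricted TF-IDF variant keeps it and assigns it a weight of zero.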

Cite this

If you use our work, please cite the paper:

@inproceedings{zhang-etal-2024-unveiling,
    title = "Unveiling Semantic Information in Sentence Embeddings",
    author = "Zhang, Leixin  and
      Burian, David  and
      John, Vojt{\v{e}}ch  and
      Bojar, Ond{\v{r}}ej",
    editor = "Bonial, Claire  and
      Bonn, Julia  and
      Hwang, Jena D.",
    booktitle = "Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.dmr-1.5",
    pages = "39--47",
    abstract = "This study evaluates the extent to which semantic information is preserved within sentence embeddings generated from state-of-art sentence embedding models: SBERT and LaBSE. Specifically, we analyzed 13 semantic attributes in sentence embeddings. Our findings indicate that some semantic features (such as tense-related classes) can be decoded from the representation of sentence embeddings. Additionally, we discover the limitation of the current sentence embedding models: inferring meaning beyond the lexical level has proven to be difficult.",
}
