VELD registry

This is a living collection of VELD repositories and their contained velds.

The technical concept for the VELD design can be found here: https://zenodo.org/records/13318651

data velds

https://github.com/veldhub/veld_data__akp_ner_linkedcat
- linkedcat/veld.yaml
  - valid: True
  - metadata:
    - description: The preferred dataset is not this one, but linkedcat2! This dataset was created by applying a custom SpaCy NER model, trained on APIS / ÖBL data, to the data set 'linkedcat2' at our solr index. The csv file is split into an id column, the character start index of the recognized entity, the character end index of the entity, the label of the entity type, and a small context window.
    - topics: NLP, Named Entity Recognition
    - file_type: csv
    - contents: NER data, inferenced NLP data
- linkedcat2/veld.yaml
  - valid: True
  - metadata:
    - description: The preferred dataset is this one, not linkedcat! This dataset was created by applying a custom SpaCy NER model, trained on APIS / ÖBL data, to the data set 'linkedcat2' at our solr index. The csv file is split into an id column, the character start index of the recognized entity, the character end index of the entity, the label of the entity type, and a small context window.
    - topics: NLP, Named Entity Recognition
    - file_type: csv
    - contents: NER data, inferenced NLP data
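
The csv layout described above (id, start index, end index, label, context window) could be read along the following lines. This is a minimal sketch, not the repository's own loader: the field names `doc_id`, `start`, `end`, `label`, `context`, the column order, and the sample row are all assumptions made for illustration.

```python
import csv
import io
from typing import NamedTuple


class EntityRow(NamedTuple):
    # Field names are hypothetical; the registry only describes the columns in prose.
    doc_id: str
    start: int
    end: int
    label: str
    context: str


def read_entities(handle):
    """Parse rows of (id, start, end, label, context) into EntityRow tuples."""
    rows = []
    for record in csv.reader(handle):
        if len(record) != 5:
            continue  # skip rows that do not have the five described columns
        doc_id, start, end, label, context = record
        if not (start.isdigit() and end.isdigit()):
            continue  # skip a possible header row or non-numeric offsets
        rows.append(EntityRow(doc_id, int(start), int(end), label, context))
    return rows


# Hypothetical sample row, not taken from the real dataset.
sample = io.StringIO('doc1,4,10,PER,"the Mozart family"\n')
entities = read_entities(sample)
```

Using a `NamedTuple` keeps the offsets typed as integers, so downstream code can slice text without repeated conversion.
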
https://github.com/veldhub/veld_data__amc_we_training_data
- 203_vert_rftt_inhalt_nodup/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq__stripped/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq__stripped__lowercased/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq__stripped__lowercased__punctuation_removed/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq__stripped__lowercased__punctuation_removed__cleaned/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq__stripped__sampled/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq__stripped__sampled__lowercased/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq__stripped__sampled__lowercased__punctuation_removed/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- 203_vert_rftt_inhalt_nodup__uniq__stripped__sampled__lowercased__punctuation_removed__cleaned/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
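
The validation message above reports a `dict` where a primitive value was expected. The registry's actual validator is not shown here, so the following is only a sketch of how a check producing a message of that shape might look; the function name `check_primitive`, the set of types treated as primitive, and the sample document are assumptions.

```python
# Types treated as "primitive" in this sketch (an assumption, not the validator's definition).
PRIMITIVES = (str, int, float, bool, type(None))


def check_primitive(document, path):
    """Return None if the value at `path` is primitive, else an error string."""
    node = document
    for key in path:
        node = node[key]
    if isinstance(node, PRIMITIVES):
        return None
    location = "/" + "/".join(path) + "/"
    return f"is not primitive type, but {type(node)}, at: {location}"


# Hypothetical parsed veld.yaml content, for illustration only.
parsed = {"x-veld": {"data": {"additional": {"note": "free-form"}}}}
message = check_primitive(parsed, ["x-veld", "data", "additional"])
```
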
https://github.com/veldhub/veld_data__apis_oebl__ner_gold
- data_cleaned/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- data_cleaned_simplified/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- data_uncleaned/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
https://github.com/veldhub/veld_data__apis_spacy_ner_models
- m1/model-best/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m2/model-best/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
https://github.com/veldhub/veld_data__bert_amc_embeddings_db
https://github.com/veldhub/veld_data__demo_inference_input_ts-vienna-2024
- veld.yaml
  - valid: True
  - metadata:
    - description: A single txt file, used as inference input to a self-trained udpipe model as a demonstration
    - topics: NLP, universal dependencies
    - file_type: txt
    - contents: raw text
https://github.com/veldhub/veld_data__demo_train_data_ts-vienna-2024
- veld.yaml
  - valid: True
  - metadata:
    - description: A single conllu file, used to train a udpipe model as a demonstration
    - topics: NLP, universal dependencies
    - file_type: conllu
    - contents: linguistically enriched text, tokenized text, lemmatized text
https://github.com/veldhub/veld_data__eltec_conllu_stats
https://github.com/veldhub/veld_data__eltec_original_selection
- veld.yaml
  - valid: True
  - metadata:
    - description: parent git repo that integrates various ELTeC corpora as submodules for downstream processing.
    - file_type: xml
    - contents: TEI, annotated literature
https://github.com/veldhub/veld_data__fasttext_models
- m1/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m3/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m4/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m5/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m6/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m7/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m8/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m9/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
https://github.com/veldhub/veld_data__glove_models
- m1/veld.yaml
  - valid: False, is not primitive type, but <class 'list'>, at: /x-veld/data/file_type/
- m3/veld.yaml
  - valid: False, is not primitive type, but <class 'list'>, at: /x-veld/data/file_type/
https://github.com/veldhub/veld_data__word2vec_models
- m3/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m4/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m5/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m6/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m7/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m8/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
- m9/veld.yaml
  - valid: False, is not primitive type, but <class 'dict'>, at: /x-veld/data/additional/
https://github.com/veldhub/veld_data__wordembeddings_evaluation
- evaluation_gold_data/capitalized/veld.yaml
  - valid: True
  - metadata:
    - description: custom evaluation data for evaluating word embeddings models. Words are capitalized.
    - topics: NLP, word embeddings
    - file_type: yaml
    - contents: evaluation data, NLP gold data
- evaluation_gold_data/lowercase/veld.yaml
  - valid: True
  - metadata:
    - description: custom evaluation data for evaluating word embeddings models. Words are all lowercase.
    - topics: NLP, word embeddings
    - file_type: yaml
    - contents: evaluation data, NLP gold data

code velds

https://github.com/veldhub/veld_code__analyse_conllu
- veld.yaml
  - valid: True
  - metadata:
    - topics: NLP, Machine learning, tokenization, lemmatization, part of speech, dependency parsing, universal dependencies, grammatical annotation
    - inputs:
      - 1:
        
        file_type: conllu
    - outputs:
      - 1:
        
        file_type: json
        
        contents: statistics, NLP statistics
https://github.com/veldhub/veld_code__apis_ner_evaluate_old_models
- veld_evaluate.yaml
  - valid: True
  - metadata:
    - description: hard-coded evaluation of several spaCy 2.2.4 models.
    - topics: NLP, Machine learning, Named entity recognition
    - inputs:
      - 1:
        
        description: This input is hard-wired to the apis spacy-ner repo and not made for generic usage.
        
        file_type: pickle, txt, json, spacy model
        
        contents: NER gold data, Machine learning model, NLP model
    - outputs:
      - 1:
        
        description: evaluation report of the models from the apis spacy-ner repo.
        
        file_type: md
        
        contents: evaluation report
https://github.com/veldhub/veld_code__apis_ner_transform_to_gold
- veld.yaml
  - valid: True
  - metadata:
    - description: hard-coded conversion of apis ner models to custom json format.
    - topics: ETL, data cleaning
    - inputs:
      - 1:
        
        description: This input is hard-wired to the apis spacy-ner repo and not made for generic usage.
        
        file_type: pickle, txt, json
        
        contents: NER gold data
    - outputs:
      - 1:
        
        description: raw, uncleaned data as it was originally, only transformed to json.
        
        file_type: json
        
        contents: NER gold data
      - 2:
        
        description: removed empty entity annotations and fixed border issues.
        
        file_type: json
        
        contents: NER gold data
      - 3:
        
        description: in addition to cleaning, this data is stripped of superfluous entity ids in favor of simplified entity classes.
        
        file_type: json
        
        contents: NER gold data
      - 4:
        
        file_type: txt
https://github.com/veldhub/veld_code__bert_embeddings
- veld_infer_and_create_index.yaml
  - valid: True
https://github.com/veldhub/veld_code__downloader
- veld.yaml
  - valid: True
  - metadata:
    - description: A very simple curl call. Since many veld chains need to download data, it makes sense to encapsulate the download functionality in a dedicated downloader code veld.
    - topics: ETL
    - outputs:
      - 1:
        
        description: environment var is optional. If unset, this script will fetch the file name from the resource.
https://github.com/veldhub/veld_code__fasttext
- veld_jupyter_notebook.yaml
  - valid: True
  - metadata:
    - description: a fasttext training and inference jupyter notebook.
    - topics: NLP, Machine Learning, word embeddings
- veld_train.yaml
  - valid: True
  - metadata:
    - description: a fasttext training and inference jupyter notebook.
    - topics: NLP, Machine Learning, word embeddings
    - inputs:
      - 1:
        
        description: training data must be expressed as one sentence per line.
        
        file_type: txt
        
        contents: raw text
    - outputs:
      - 1:
        
        file_type: bin, fasttext model
        
        contents: fasttext model, word embeddings
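
The training input above must hold one sentence per line. A minimal sketch of normalizing already-segmented sentences into that layout follows; the helper name `to_one_sentence_per_line` is hypothetical, and sentence segmentation itself is assumed to have happened upstream.

```python
def to_one_sentence_per_line(sentences):
    """Render sentences so each occupies exactly one line.

    Inner newlines and repeated whitespace collapse to single spaces,
    and empty sentences are dropped.
    """
    cleaned = []
    for sentence in sentences:
        flat = " ".join(sentence.split())
        if flat:
            cleaned.append(flat)
    return "".join(line + "\n" for line in cleaned)


# Illustrative input only.
training_text = to_one_sentence_per_line(
    ["Hello   world.", "Second\nsentence here.", "  "]
)
```
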
https://github.com/veldhub/veld_code__glove
- veld_jupyter_notebook.yaml
  - valid: True
  - metadata:
    - description: A jupyter notebook that loads GloVe vectors and provides some convenient functions to use them.
    - topics: NLP, Machine learning, word embeddings
- veld_train.yaml
  - valid: True
  - metadata:
    - description: This code repo encapsulates the original code from https://github.com/stanfordnlp/GloVe/tree/master
    - topics: NLP, Machine learning, word embeddings
    - inputs:
      - 1:
        
        description: In the txt file, each line must be one sentence
        
        file_type: txt
        
        contents: natural text
    - outputs:
      - 1:
        
        file_type: bin
        
        contents: GloVe global word cooccurrence matrix, GloVe vectors
      - 2:
        
        file_type: bin
        
        contents: GloVe global word cooccurrence matrix, GloVe vectors
      - 3:
        
        file_type: bin
        
        contents: GloVe global word cooccurrence matrix, GloVe vectors
      - 4:
        
        file_type: bin
        
        contents: GloVe global word cooccurrence matrix, GloVe vectors
https://github.com/veldhub/veld_code__jupyter_notebook_base
- veld.yaml
  - valid: True
  - metadata:
    - description: template veld code repo for a jupyter notebook
https://github.com/veldhub/veld_code__simple_docker_test
- veld.yaml
  - valid: True
  - metadata:
    - description: prints information about the python interpreter within the docker container.
    - topics: testing
https://github.com/veldhub/veld_code__spacy
- veld_convert.yaml
  - valid: True
  - metadata:
    - description: prepares data for spacy NER training, since spacy expects entity annotation indices to align precisely with the beginning and end of words and does not allow overlapping entity annotations. It then converts the data to spaCy docbin and prepares it for training by splitting it into train, dev, eval subsets and shuffling them randomly.
    - topics: ETL, NLP, Machine learning
    - inputs:
      - 1:
        
        description: name of the csv file, containing NER gold data
        
        file_type: json
        
        contents: NER gold data
    - outputs:
      - 1:
        
        description: path to folder where spacy docbin files will be stored with file names train.spacy, dev.spacy, eval.spacy
        
        file_type: spacy docbin
        
        contents: NER gold data
      - 2:
        
        description: log file of conversion
        
        file_type: spacy docbin
        
        contents: NER gold data
- veld_create_config.yaml
  - valid: True
  - metadata:
    - description: Creates a spacy config by encapsulating init config ( https://spacy.io/api/cli#init-config ) and init fill-config ( https://spacy.io/api/cli#init-fill-config ). The output is a config file used for training; see more here: https://spacy.io/usage/training/#config
    - topics: NLP, Machine learning
    - outputs:
      - 1:
        
        description: See https://spacy.io/usage/training/#config
        
        file_type: cfg
        
        contents: spacy training config
- veld_publish_to_hf.yaml
  - valid: True
  - metadata:
    - description: simple service to push spacy models to huggingface. IMPORTANT: Only works from spacy v3.* onwards!
    - topics: NLP, ETL
    - inputs:
      - 1:
        
        file_type: spacy model
        
        contents: NLP model
- veld_train.yaml
  - valid: True
  - metadata:
    - description: A spacy training setup, utilizing spacy v3's config system.
    - topics: NLP, Machine learning
    - inputs:
      - 1:
        
        file_type: spacy docbin
        
        contents: NLP gold data, ML gold data, gold data
      - 2:
        
        file_type: spacy docbin
        
        contents: NLP gold data, ML gold data, gold data
      - 3:
        
        file_type: spacy docbin
        
        contents: NLP gold data, ML gold data, gold data
      - 4:
        
        description: See https://spacy.io/usage/training/#config
        
        file_type: cfg
        
        contents: spacy training config
    - outputs:
      - 1:
        
        file_type: spacy model
        
        contents: NLP model, spacy model
      - 2:
        
        description: path to the train log file
        
        file_type: txt
        
        contents: logs
      - 3:
        
        description: path to the eval log file
        
        file_type: txt
        
        contents: logs
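
The conversion step listed above shuffles the gold data and splits it into train, dev, and eval subsets. A sketch of that split alone, without any spaCy dependency, could look like this; the ratios, the seed, and the function name `shuffle_and_split` are assumptions, since the registry does not state the values the code veld uses.

```python
import random


def shuffle_and_split(examples, dev_ratio=0.1, eval_ratio=0.1, seed=42):
    """Shuffle examples reproducibly and split them into train, dev, eval lists.

    The ratios and the seed are illustrative defaults, not the code veld's settings.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_ratio)
    n_eval = int(len(items) * eval_ratio)
    dev = items[:n_dev]
    eval_set = items[n_dev:n_dev + n_eval]
    train = items[n_dev + n_eval:]
    return train, dev, eval_set


# Integers stand in for annotated documents here.
train, dev, eval_set = shuffle_and_split(range(20))
```

Seeding a dedicated `random.Random` instance keeps the split reproducible without touching the global random state.
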
https://github.com/veldhub/veld_code__teitok-tools
- veld_parseudpipe.yaml
  - valid: True
  - metadata:
    - description: This code veld encapsulates and veldifies the parseudpipe script. All its settings here are passed down to the script. For more information on its usage and settings, see: https://github.com/ufal/teitok-tools?tab=readme-ov-file#parseudpipe
    - topics: NLP, ETL, tokenization, universal dependencies
    - inputs:
      - 1:
        
        file_type: xml
    - outputs:
      - 1:
        
        file_type: xml
- veld_udpipe2teitok.yaml
  - valid: True
  - metadata:
    - description: This code veld encapsulates and veldifies the udpipe2teitok script. All its settings here are passed down to the script. For more information on its usage and settings, see: https://github.com/ufal/teitok-tools?tab=readme-ov-file#udpipe2teitok
    - topics: NLP, ETL, tokenization, universal dependencies
    - inputs:
      - 1:
        
        file_type: txt
    - outputs:
      - 1:
        
        file_type: xml
- veld_xmltokenize.yaml
  - valid: True
  - metadata:
    - description: This code veld encapsulates and veldifies the xmltokenize script. All its settings here are passed down to the script. For more information on its usage and settings, see: https://github.com/ufal/teitok-tools?tab=readme-ov-file#xmltokenize
    - topics: NLP, ETL, tokenization, universal dependencies
    - inputs:
      - 1:
        
        description: The xml file to be tokenized
        
        file_type: xml
    - outputs:
      - 1:
        
        description: The output tokenized xml
        
        file_type: xml
https://github.com/veldhub/veld_code__udpipe
- veld_infer.yaml
  - valid: True
  - metadata:
    - description: udpipe inference setup
    - topics: NLP, Machine learning, tokenization, lemmatization, part of speech, dependency parsing, universal dependencies, grammatical annotation
    - inputs:
      - 1:
        
        description: txt files to run inference on. Note that the environment var in_txt_file is optional; if it is not present, the entire input folder will be processed recursively
        
        file_type: txt
        
        contents: raw text
      - 2:
        
        file_type: udpipe model
        
        contents: NLP model, tokenizer, lemmatizer
    - outputs:
      - 1:
        
        description: The file name of the output conllu is derived from the corresponding input txt file, since recursive processing requires such automatic logic
        
        file_type: conllu, tsv
        
        contents: inferenced NLP data, tokenized text, lemmatized text, part of speech of text, universal dependencies of text, grammatically annotated text, linguistic data
- veld_train.yaml
  - valid: True
  - metadata:
    - description: udpipe training setup
    - topics: NLP, Machine learning, tokenization, lemmatization, part of speech, dependency parsing, universal dependencies, grammatical annotation
    - inputs:
      - 1:
        
        file_type: conllu
        
        contents: tokenized text, enriched text, linguistic data
    - outputs:
      - 1:
        
        file_type: udpipe model
        
        contents: NLP model, tokenizer, lemmatizer
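
The inference entry above says that the output conllu name follows from the input txt file, and that the whole input folder is processed recursively when `in_txt_file` is unset. One way to sketch that mapping is shown below; the function `plan_outputs`, its parameters, and the example paths are hypothetical, and mirroring the folder structure in the output is an assumption.

```python
from pathlib import Path


def plan_outputs(in_dir, out_dir, in_txt_file=None):
    """Pair each input txt path with a conllu output path of the same relative name.

    If `in_txt_file` is given, only that file is planned; otherwise every
    *.txt file below `in_dir` is collected recursively.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    if in_txt_file is not None:
        sources = [in_dir / in_txt_file]
    else:
        sources = sorted(in_dir.rglob("*.txt"))
    return [
        (src, out_dir / src.relative_to(in_dir).with_suffix(".conllu"))
        for src in sources
    ]


# Hypothetical paths; no file system access happens when a single file is named.
pairs = plan_outputs("/data/in", "/data/out", in_txt_file="novels/book.txt")
```
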
https://github.com/veldhub/veld_code__wikipedia_nlp_preprocessing
- veld_download_and_extract.yaml
  - valid: True
  - metadata:
    - description: downloading wikipedia archive and extracting each article to a json file.
    - topics: NLP, Machine Learning, ETL
    - outputs:
      - 1:
        
        description: a folder containing json files, where each file contains the contents of a wikipedia article
        
        file_type: json
        
        contents: NLP training data, raw text
- veld_transform_wiki_json_to_txt.yaml
  - valid: True
  - metadata:
    - description: transforming wikipedia raw jsons to a single txt file.
    - topics: NLP, Machine Learning, ETL
    - inputs:
      - 1:
        
        description: a folder containing json files, where each file contains the contents of a wikipedia article
        
        file_type: json
        
        contents: NLP training data, raw text
    - outputs:
      - 1:
        
        description: single txt file, containing only raw content of wikipedia pages, split into sentences or per article with a newline each, possibly being only a sampled subset for testing.
        
        file_type: txt
        
        contents: NLP training data, word embeddings training data, raw text
https://github.com/veldhub/veld_code__word2vec
- veld_jupyter_notebook.yaml
  - valid: True
  - metadata:
    - description: a word2vec jupyter notebook, for quick experiments
    - topics: NLP, Machine Learning, word embeddings
    - inputs:
      - 1:
        
        description: arbitrary storage for word2vec experiments
        
        file_type: word2vec model, training data, NLP training data, raw text
        
        contents: NLP model, word embeddings model, model metadata, NLP training data, word embeddings training data, raw text
    - outputs:
      - 1:
        
        description: arbitrary storage for word2vec experiments
- veld_train.yaml
  - valid: True
  - metadata:
    - description: word2vec training setup
    - topics: NLP, Machine Learning, word embeddings
    - inputs:
      - 1:
        
        description: training data. Must be one single txt file, one sentence per line.
        
        file_type: txt
        
        contents: NLP training data, word embeddings training data, raw text
    - outputs:
      - 1:
        
        description: self trained word embeddings word2vec model
        
        file_type: word2vec model
        
        contents: NLP model, word embeddings model
https://github.com/veldhub/veld_code__wordembeddings_evaluation
- veld_analyse_evaluation.yaml
  - valid: True
  - metadata:
    - description: data visualization of all evaluation data. In a jupyter notebook.
    - topics: NLP, word embeddings, data visualization
    - inputs:
      - 1:
        
        description: summary of the custom evaluation logic on word embeddings
        
        file_type: yaml
        
        contents: evaluation data
    - outputs:
      - 1:
        
        description: data visualization of all evaluation data, expressed as interactive html
        
        file_type: html
        
        contents: data visualization
      - 2:
        
        description: data visualization of all evaluation data, expressed as png
        
        file_type: png
        
        contents: data visualization
- veld_analyse_evaluation_non_interactive.yaml
  - valid: True
  - metadata:
    - description: data visualization of all evaluation data. non-interactive version of the jupyter code.
    - topics: NLP, word embeddings, data visualization
    - inputs:
      - 1:
        
        description: summary of the custom evaluation logic on word embeddings
        
        file_type: yaml
        
        contents: evaluation data
    - outputs:
      - 1:
        
        description: data visualization of all evaluation data, expressed as interactive html
        
        file_type: html
        
        contents: data visualization
      - 2:
        
        description: data visualization of all evaluation data, expressed as png
        
        file_type: png
        
        contents: data visualization
- veld_eval_fasttext.yaml
  - valid: True
  - metadata:
    - description: custom evaluation logic on fasttext word embeddings.
    - topics: NLP, Machine learning, evaluation
    - inputs:
      - 1:
        
        file_type: fasttext model
        
        contents: NLP model, word embeddings model
      - 2:
        
        file_type: yaml
        
        contents: metadata
      - 3:
        
        file_type: yaml
        
        contents: NLP gold data
    - outputs:
      - 1:
        
        file_type: yaml
      - 2:
        
        file_type: txt
        
        contents: logs
- veld_eval_glove.yaml
  - valid: True
  - metadata:
    - description: custom evaluation logic on GloVe word embeddings.
    - topics: NLP, Machine learning, evaluation
    - inputs:
      - 1:
        
        file_type: GloVe vector model
        
        contents: NLP model, word embeddings model
      - 2:
        
        file_type: yaml
        
        contents: metadata
      - 3:
        
        file_type: yaml
        
        contents: NLP gold data
    - outputs:
      - 1:
        
        file_type: yaml
      - 2:
        
        file_type: txt
        
        contents: logs
- veld_eval_word2vec.yaml
  - valid: True
  - metadata:
    - description: custom evaluation logic on word2vec word embeddings.
    - topics: NLP, Machine learning, evaluation
    - inputs:
      - 1:
        
        description: word2vec model file to be evaluated
        
        file_type: word2vec model
        
        contents: NLP model, word embeddings model
      - 2:
        
        description: word2vec model metadata
        
        file_type: yaml
        
        contents: metadata
      - 3:
        
        file_type: yaml
        
        contents: NLP gold data
    - outputs:
      - 1:
        
        file_type: yaml
      - 2:
        
        file_type: txt
        
        contents: logs
https://github.com/veldhub/veld_code__wordembeddings_preprocessing
- veld_preprocess_clean.yaml
  - valid: True
  - metadata:
    - description: Removes lines that don't reach a threshold regarding the ratio of textual content to non-textual (numbers, special characters) content. Splits output into clean and dirty file.
    - topics: NLP, preprocessing, ETL
    - inputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
    - outputs:
      - 1:
        
        description: clean lines, where each line's ratio is above the configured threshold
        
        file_type: txt
        
        contents: raw text
      - 2:
        
        description: dirty lines, where each line's ratio is below the configured threshold
        
        file_type: txt
        
        contents: raw text
- veld_preprocess_lowercase.yaml
  - valid: True
  - metadata:
    - description: makes entire text lowercase
    - topics: NLP, preprocessing, ETL
    - inputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
    - outputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
- veld_preprocess_remove_punctuation.yaml
  - valid: True
  - metadata:
    - description: removes punctuation from text with spaCy pretrained models
    - topics: NLP, preprocessing, ETL
    - inputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
    - outputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
      - 2:
        
        file_type: txt
        
        contents: raw text
- veld_preprocess_sample.yaml
  - valid: True
  - metadata:
    - description: takes a random sample of lines from a txt file. Randomness can be fixed with a seed.
    - topics: NLP, preprocessing, ETL
    - inputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
    - outputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
- veld_preprocess_strip.yaml
  - valid: True
  - metadata:
    - description: removes all lines before and after given line numbers
    - topics: NLP, preprocessing, ETL
    - inputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
    - outputs:
      - 1:
        
        file_type: txt
        
        contents: raw text
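
Two of the preprocessing steps above lend themselves to a short sketch: splitting lines into clean and dirty by their ratio of textual content, and taking a seeded random sample. The code below is an illustration under stated assumptions, not the code velds' implementation: the ratio definition (alphabetic characters over non-whitespace characters), the default threshold, the treatment of lines exactly at the threshold as clean, and all function names are hypothetical.

```python
import random


def text_ratio(line):
    """Share of alphabetic characters among all non-whitespace characters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def split_clean_dirty(lines, threshold=0.8):
    """Route each line to `clean` if its ratio reaches the threshold, else to `dirty`."""
    clean, dirty = [], []
    for line in lines:
        (clean if text_ratio(line) >= threshold else dirty).append(line)
    return clean, dirty


def sample_lines(lines, k, seed=None):
    """Draw k distinct lines at random; a fixed seed makes the draw repeatable."""
    return random.Random(seed).sample(list(lines), k)


# Illustrative input only.
lines = ["Hello world", "12345 -- 678", "abc 123"]
clean, dirty = split_clean_dirty(lines)
picked = sample_lines(lines, 2, seed=1)
```
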
https://github.com/veldhub/veld_code__xmlanntools
- veld_ann2standoff.yaml
  - valid: True
  - metadata:
    - description: A demo code veld, integrating the ann2standoff script. For more documentation, see: https://github.com/czcorpus/xmlanntools?tab=readme-ov-file#ann2standoff
    - topics: NLP, ETL
    - inputs:
      - 1:
        
        file_type: conllu, tsv
      - 2:
        
        file_type: txt
      - 3:
        
        file_type: ini
    - outputs:
      - 1:
        
        file_type: json
- veld_standoff2xml.yaml
  - valid: True
  - metadata:
    - description: A demo code veld, integrating the standoff2xml script. For more documentation, see: https://github.com/czcorpus/xmlanntools?tab=readme-ov-file#standoff2xml
    - topics: NLP, ETL
    - inputs:
      - 1:
        
        file_type: txt
      - 2:
        
        file_type: json
      - 3:
        
        file_type: json
    - outputs:
      - 1:
        
        file_type: xml
- veld_tag_ud.yaml
  - valid: True
  - metadata:
    - description: A demo code veld, integrating the tag_ud script. For more documentation, see: https://github.com/czcorpus/xmlanntools?tab=readme-ov-file#tag_ud
    - topics: NLP, ETL
    - inputs:
      - 1:
        
        file_type: txt
    - outputs:
      - 1:
        
        file_type: tsv, conllu
- veld_xml2standoff.yaml
  - valid: True
  - metadata:
    - description: A demo code veld, integrating the xml2standoff script. For more documentation, see: https://github.com/czcorpus/xmlanntools?tab=readme-ov-file#xml2standoff
    - topics: NLP, ETL
    - inputs:
      - 1:
        
        file_type: xml
    - outputs:
      - 1:
        
        file_type: txt
      - 2:
        
        file_type: json
- veld_xml2vrt.yaml
  - valid: True
  - metadata:
    - description: A demo code veld, integrating the xml2vrt script. For more documentation, see: https://github.com/czcorpus/xmlanntools?tab=readme-ov-file#xml2vrt
    - topics: NLP, ETL
    - inputs:
      - 1:
        
        file_type: xml
      - 2:
        
        file_type: ini
    - outputs:
      - 1:
        
        file_type: xml
https://github.com/veldhub/veld_code__xml_xslt_transformer
- veld.yaml
  - valid: False, elements not matching anything at: /x-veld/code/inputs/0/optional

chain velds

https://github.com/veldhub/veld_chain__akp_ner_inference
- veld_infer.yaml
  - valid: True
  - metadata:
    - description: This repo uses self-trained spaCy NER models on the linkedcat dataset to extract entities, which are stored in csv files.
    - topics: NLP, Machine learning, Named entity recognition
https://github.com/veldhub/veld_chain__apis_ner_evaluate_old_models
- veld_evaluate.yaml
  - valid: True
  - metadata:
    - description: hard-coded evaluation of several spaCy 2.2.4 models.
    - topics: NLP, Machine learning, Named entity recognition
https://github.com/veldhub/veld_chain__apis_ner_transform_to_gold
- veld.yaml
  - valid: True
  - metadata:
    - description: Conversion of apis ner model data to harmonized custom json format.
    - topics: ETL, data cleaning
https://github.com/veldhub/veld_chain__demo_teitok-tools
- veld_parseudpipe.yaml
  - valid: True
  - metadata:
    - description: This chain veld exemplifies usage of the respective code veld. For more information on the underlying tool and its usage, see: https://github.com/ufal/teitok-tools?tab=readme-ov-file#parseudpipe
    - topics: NLP, ETL, tokenization, universal dependencies
- veld_udpipe2teitok.yaml
  - valid: True
  - metadata:
    - description: This chain veld exemplifies usage of the respective code veld. For more information on the underlying tool and its usage, see: https://github.com/ufal/teitok-tools?tab=readme-ov-file#udpipe2teitok
    - topics: NLP, ETL, tokenization, universal dependencies
- veld_xmltokenize.yaml
  - valid: True
  - metadata:
    - description: This chain veld exemplifies usage of the respective code veld. For more information on the underlying tool and its usage, see: https://github.com/ufal/teitok-tools?tab=readme-ov-file#xmltokenize
    - topics: NLP, ETL, tokenization, universal dependencies
https://github.com/veldhub/veld_chain__demo_udipe_ts-vienna-2024
- veld_infer.yaml
  - valid: True
  - metadata:
    - description: A demonstration of a VELD chain inferencing on a txt with a self-trained udpipe model
    - topics: NLP, universal dependencies
- veld_train.yaml
  - valid: True
  - metadata:
    - description: A demonstration of a VELD chain training a udpipe model from scratch
    - topics: NLP, universal dependencies
https://github.com/veldhub/veld_chain__demo_wordembeddings_multiarch
- veld_jupyter_word2vec.yaml
  - valid: True
  - metadata:
    - description: demo word2vec jupyter notebook
    - topics: NLP, Machine Learning, word embeddings
- veld_preprocess.yaml
  - valid: True
  - metadata:
    - description: Download and preprocessing of the bible
    - topics: ETL, NLP, bible studies
- veld_train_word2vec.yaml
  - valid: True
  - metadata:
    - description: demo word2vec training on the bible
    - topics: NLP, Machine Learning, word embeddings
https://github.com/veldhub/veld_chain__eltec_udpipe_inference
- veld_analyse.yaml
  - valid: True
  - metadata:
    - description: chain to analyse the conllu data which was inferenced by udpipe on several ELTeC corpora.
    - topics: NLP, Machine learning, tokenization, lemmatization, part of speech, dependency parsing, universal dependencies, grammatical annotation
- veld_infer.yaml
  - valid: False, elements not matching anything at: /x-vars
- veld_preprocess.yaml
  - valid: False, elements not matching anything at: /x-vars
https://github.com/veldhub/veld_chain__mara_load_and_publish_models
- veld_publish_to_hf.yaml
  - valid: False, non-optional key missing: 'extends', at: /services/veld_publish_to_hf/
https://github.com/veldhub/veld_chain__train_infer_wordembeddings_multiple_architectures__amc
- veld_analyse_evaluation.yaml
  - valid: True
- veld_analyse_evaluation_non_interactive.yaml
  - valid: True
- veld_eval_fasttext.yaml
  - valid: False, elements not matching anything at: /services/veld_eval_fasttext/depends_on
- veld_eval_glove.yaml
  - valid: True
- veld_eval_word2vec.yaml
  - valid: True
- veld_jupyter_notebook_fasttext.yaml
  - valid: False, elements not matching anything at: /services/veld_jupyter_notebook_fasttext/ports
- veld_jupyter_notebook_glove.yaml
  - valid: False, elements not matching anything at: /services/veld_jupyter_notebook_glove/ports
- veld_jupyter_notebook_word2vec.yaml
  - valid: False, elements not matching anything at: /services/veld_jupyter_notebook_word2vec/ports
- veld_preprocess_clean.yaml
  - valid: True
- veld_preprocess_lowercase.yaml
  - valid: True
- veld_preprocess_remove_punctuation.yaml
  - valid: True
- veld_preprocess_sample.yaml
  - valid: True
- veld_preprocess_strip.yaml
  - valid: True
- veld_train_fasttext.yaml
  - valid: True
- veld_train_glove.yaml
  - valid: True
- veld_train_word2vec.yaml
  - valid: True
https://github.com/veldhub/veld_chain__train_infer_wordembeddings_multiple_architectures__wikipedia
- veld_playground_jupyter_notebook_fasttext.yaml
  - valid: True
  - metadata:
    - description: jupyter notebook for playing with fasttext models
    - topics: NLP
- veld_playground_jupyter_notebook_glove.yaml
  - valid: True
  - metadata:
    - description: jupyter notebook for playing with glove models
    - topics: NLP
- veld_playground_jupyter_notebook_word2vec.yaml
  - valid: True
  - metadata:
    - description: jupyter notebook for playing with word2vec models
    - topics: NLP
- veld_step_01_preprocess_download_and_extract.yaml
  - valid: True
  - metadata:
    - description: downloading wikipedia archive and extracting each article to a json file.
    - topics: NLP, Machine Learning, ETL
- veld_step_02_preprocess_transform_wiki_json_to_txt.yaml
  - valid: True
  - metadata:
    - description: transforming wikipedia jsons to a single txt file.
    - topics: NLP, Machine Learning, ETL
- veld_step_03_preprocess_lowercase.yaml
  - valid: True
  - metadata:
    - description: preprocessing by making the entire text lowercase.
    - topics: NLP
- veld_step_04_preprocess_remove_punctuation.yaml
  - valid: True
  - metadata:
    - description: preprocessing by removing punctuation of the entire text.
    - topics: NLP
- veld_step_05_train_fasttext.yaml
  - valid: True
  - metadata:
    - description: training a fasttext model on wikipedia
    - topics: NLP
- veld_step_06_train_word2vec.yaml
  - valid: True
  - metadata:
    - description: training a word2vec model on wikipedia
    - topics: NLP
- veld_step_07_train_glove.yaml
  - valid: True
  - metadata:
    - description: training a glove model on wikipedia
    - topics: NLP
- veld_step_08_eval_fasttext.yaml
  - valid: True
  - metadata:
    - description: evaluate fasttext model against evaluation gold data
    - topics: NLP
- veld_step_09_eval_word2vec.yaml
  - valid: True
  - metadata:
    - description: evaluate word2vec model against evaluation gold data
    - topics: NLP
- veld_step_10_eval_glove.yaml
  - valid: True
  - metadata:
    - description: evaluate glove model against evaluation gold data
    - topics: NLP
- veld_step_11_analyse_evaluation.yaml
  - valid: True
  - metadata:
    - description: chain of analysing and evaluating models trained on wikipedia
    - topics: NLP
- veld_step_all_multi_chain.yaml
  - valid: True
  - metadata:
    - description: An entire multi chain, going through everything (fetching, preprocessing, training, evaluation) in one service. This chain is composed of the other chains and is meant as a demonstration of the entire setup.
    - topics: NLP
https://github.com/veldhub/veld_chain__train_spacy_apis_ner
- veld_analysis.yaml
  - valid: True
- veld_convert.yaml
  - valid: True
  - metadata:
    - description: cleaning and converting json into spaCy docbin
    - topics: ETL, NLP, Machine learning
- veld_create_config.yaml
  - valid: True
- veld_publish_to_hf.yaml
  - valid: True
  - metadata:
    - description: pushing spacy model to huggingface.
    - topics: NLP
- veld_train.yaml
  - valid: True
  - metadata:
    - description: A NER training setup, utilizing spaCy 3's config system.
    - topics: NLP, Machine learning, Named entity recognition
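
The listing above follows a regular layout: a repository URL, then one entry per veld file, then a `valid:` line that carries an optional failure reason. As a rough illustration only, the sketch below parses that layout to collect the velds that currently fail validation. The function name `summarize_registry` and the `sample` string are hypothetical and not part of the VELD tooling. The parsing rules are inferred from the text of this registry, not from any official specification.

```python
import re


def summarize_registry(text):
    """Group veld files by repository and record their validity.

    Returns a dict mapping repo URL -> list of (veld_file, is_valid, reason).
    Assumption: the layout is exactly the one used in this registry listing.
    """
    summary = {}
    repo = None
    veld_file = None
    for line in text.splitlines():
        if line.startswith("https://github.com/"):
            # A repository URL opens a new group of veld files.
            repo = line.strip()
            summary[repo] = []
        elif line.startswith("- ") and repo is not None:
            # An unindented list item names a veld file.
            veld_file = line[2:].strip()
        elif veld_file is not None:
            # The first indented 'valid:' line belongs to the current veld file.
            match = re.match(r"\s+- valid: (True|False)(?:, (.*))?$", line)
            if match:
                is_valid = match.group(1) == "True"
                summary[repo].append((veld_file, is_valid, match.group(2)))
                veld_file = None
    return summary


# Two entries copied from the listing above, used as sample input.
sample = """\
https://github.com/veldhub/veld_chain__mara_load_and_publish_models
- veld_publish_to_hf.yaml
  - valid: False, non-optional key missing: 'extends', at: /services/veld_publish_to_hf/
https://github.com/veldhub/veld_chain__train_spacy_apis_ner
- veld_analysis.yaml
  - valid: True
"""

result = summarize_registry(sample)
for repo, entries in result.items():
    for veld_file, is_valid, reason in entries:
        if not is_valid:
            print(f"{repo} :: {veld_file} :: {reason}")
```

Metadata lines (description, topics) are skipped on purpose, since only the validity status is collected here.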
