This project was created as part of a step-by-step video tutorial. It uses sense2vec
and Prodigy to bootstrap an NER model to detect ingredients in Reddit comments and to calculate how mentions change over time. The results were then used to create a bar chart race visualization of selected ingredients.
The `project.yml` defines the data assets required by the project, as well as the
available commands and workflows. For details, see the spaCy projects documentation.

The following assets are defined by the project. They can be fetched by running
`spacy project assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/reddit_r_cooking.jsonl` | URL | Extracted data from r/cooking on Reddit from 2012-2019 (2,174,118 comments) |
| `assets/reddit_r_cooking_sample.jsonl` | Local | Randomly selected sample from the r/cooking data for annotation and testing (10,000 comments) |
| `assets/food_patterns.jsonl` | Local | Match patterns created with `sense2vec.teach` and `terms.to-patterns`, used to bootstrap annotation or at runtime via spaCy's `EntityRuler` (209 patterns) |
| `assets/food_data.jsonl` | Local | Full dataset annotated with `INGRED` (ingredient) entities (949 examples) |
| `assets/counts.csv` | Local | Full entity counts by month, generated by running the trained model over the whole Reddit data |
| `assets/tok2vec_cd8_model289.bin` | URL | Pretrained tok2vec weights to initialize the model |
| `assets/s2v_reddit_2015_md.tar.gz` | URL | sense2vec vectors trained on Reddit comments from 2015, used to bootstrap the terminology list |
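To use the exported patterns at runtime, they can be loaded into an `EntityRuler`. Here's a minimal sketch, assuming spaCy v2.x (which matches the `en_vectors_web_lg`-based commands later in this README) and the patterns file in `assets/`:

```python
import spacy
from spacy.pipeline import EntityRuler

# Any pipeline works here; a blank English one keeps the example small.
nlp = spacy.blank("en")

# Build an EntityRuler from the exported JSONL patterns and add it to the pipeline.
ruler = EntityRuler(nlp).from_disk("assets/food_patterns.jsonl")
nlp.add_pipe(ruler)

doc = nlp("Toss the chicken breast with olive oil and cumin.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Prints INGRED spans for any terms covered by the 209 patterns.
```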
| Model | F-Score | Precision | Recall | # Examples |
| --- | --- | --- | --- | --- |
| spaCy blank | 80.85 | 82.13 | 79.61 | 760 |
| spaCy `en_vectors_web_lg` + tok2vec¹ | 84.95 | 87.28 | 82.74 | 760 |

1. Representations trained on 1 billion words from Reddit comments using
   `spacy pretrain`, predicting the `en_vectors_web_lg` vectors (~8 hours on GPU).
   Download: `tok2vec_cd8_model289.bin`
Labelling the data took about 2.5 hours and was done manually, using the patterns (first half) and the pretrained NER model (second half) to pre-highlight suggestions. For more details, see the data creation and training workflow below. The raw text was sourced from the r/Cooking subreddit.
| File | Count | Description |
| --- | --- | --- |
| `reddit_r_cooking.jsonl` | 2,174,118 | Extracted data from Reddit (2012-2019). |
| `reddit_r_cooking_sample.jsonl` | 10,000 | Randomly selected sample for annotation and testing. |
| File | Count | Description |
| --- | --- | --- |
| `food_patterns.jsonl` | 209 | Match patterns created with `sense2vec.teach` and `terms.to-patterns`. Can be used to bootstrap annotation with `ner.manual` or at runtime via spaCy's `EntityRuler`. |
| `food_data.jsonl` | 949 | Full dataset annotated with `INGRED` (ingredient) entities. |
| `counts.csv` | 113,577 | Full `INGRED` entity counts by month, generated by running the trained model over `reddit_r_cooking.jsonl`. See the notebooks section for the code. |
The datasets are distributed in Prodigy's simple JSONL (newline-delimited JSON)
format. Each entry contains a `"text"` and a list of `"spans"` with the
`"start"` and `"end"` character offsets and the `"label"` of the annotated
entities. The data also includes the tokenization and meta information (name of
the subreddit and UTC timestamp). Here's a simplified example entry:
```json
{
"text": "Where do you get the mock duck?",
"meta": { "section": "Cooking", "utc": "1364690064" },
"tokens": [
{ "text": "Where", "start": 0, "end": 5, "id": 0 },
{ "text": "do", "start": 6, "end": 8, "id": 1 },
{ "text": "you", "start": 9, "end": 12, "id": 2 },
{ "text": "get", "start": 13, "end": 16, "id": 3 },
{ "text": "the", "start": 17, "end": 20, "id": 4 },
{ "text": "mock", "start": 21, "end": 25, "id": 5 },
{ "text": "duck", "start": 26, "end": 30, "id": 6 },
{ "text": "?", "start": 30, "end": 31, "id": 7 }
],
"spans": [
{
"start": 21,
"end": 30,
"token_start": 5,
"token_end": 6,
"label": "INGRED"
}
],
"answer": "accept",
"_input_hash": -768143800,
"_task_hash": -1459747024
}
```
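Since each line is a standalone JSON object, the data is easy to process with just the standard library. A minimal sketch that recovers the annotated entity text from the character offsets:

```python
import json

# Print each accepted annotation's entity spans, sliced out via character offsets.
with open("assets/food_data.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        for span in eg.get("spans", []):
            entity = eg["text"][span["start"]:span["end"]]
            print(span["label"], repr(entity))
```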
- Create a phrase list using seed terms. Requires `sense2vec` and a vectors
  package (see the sketch after these steps for roughly what this recipe does).

  ```bash
  prodigy sense2vec.teach food_terms ./s2v_reddit_2015_md --seeds "garlic, avocado, cottage cheese, olive oil, cumin, chicken breast, beef, iceberg lettuce"
  ```

- Convert the phrase list to a match patterns file.

  ```bash
  prodigy terms.to-patterns food_terms --label INGRED --spacy-model blank:en > ./food_patterns.jsonl
  ```

- Label data manually with help from the patterns.

  ```bash
  prodigy ner.manual food_data blank:en ./reddit_r_cooking_sample.jsonl --label INGRED --patterns food_patterns.jsonl
  ```

- Train a temporary model to check results and to improve later on. Uses the
  pretrained tok2vec weights `tok2vec_cd8_model289.bin`.

  ```bash
  prodigy train ner food_data en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./tmp_model --eval-split 0.2
  ```

- Label more data by correcting the model's predictions.

  ```bash
  prodigy ner.correct food_data_correct ./tmp_model ./reddit_r_cooking_sample.jsonl --label INGRED --exclude food_data
  ```

- Train the final model.

  ```bash
  prodigy train ner food_data,food_data_correct en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./food_model --eval-split 0.2 --n-iter 20
  ```
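For intuition, the `sense2vec.teach` step builds on querying the sense2vec vectors for terms similar to the seeds. A hedged sketch, not the recipe's actual code; it assumes the `s2v_reddit_2015_md` archive has been extracted locally:

```python
from sense2vec import Sense2Vec

# Load the standalone sense2vec vectors (extracted from s2v_reddit_2015_md.tar.gz).
s2v = Sense2Vec().from_disk("./s2v_reddit_2015_md")

# sense2vec keys pair a phrase with its sense tag, e.g. "avocado|NOUN".
query = "avocado|NOUN"
if query in s2v:
    # Suggest terms close to the seed in vector space, as the teach recipe does.
    for term, score in s2v.most_similar(query, n=10):
        print(term, score)
```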
The notebooks include the code used to preprocess the raw Reddit data, run the trained model over it, and create the final counts.
| File | Description |
| --- | --- |
| `01_Preprocess_Reddit.ipynb` | Preprocess and clean Reddit data and export JSONL for annotation. |
| `02_Process_text_and_counts.ipynb` | Use the trained model to count entities. |
| `03_Generate_counts.ipynb` | Generate the final counts for the visualization and select examples. |
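The counting step in `02_Process_text_and_counts.ipynb` boils down to streaming the comments through the trained model and tallying `INGRED` entities per month. A rough sketch under assumed details (the real code lives in the notebook; file paths follow this project's layout):

```python
import json
from collections import Counter
from datetime import datetime, timezone

import spacy

nlp = spacy.load("./food_model")  # the final trained pipeline
counts = Counter()

with open("assets/reddit_r_cooking.jsonl", encoding="utf8") as f:
    examples = (json.loads(line) for line in f)
    data = ((eg["text"], eg["meta"]) for eg in examples)
    # nlp.pipe with as_tuples=True streams (text, context) pairs efficiently.
    for doc, meta in nlp.pipe(data, as_tuples=True):
        month = datetime.fromtimestamp(int(meta["utc"]), tz=timezone.utc).strftime("%Y-%m")
        for ent in doc.ents:
            if ent.label_ == "INGRED":
                counts[month, ent.text.lower()] += 1

# counts now maps (month, ingredient) to a frequency, ready to export as CSV.
```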