Add Green Talks Scraper (#2)
* Add Green Talks Scraper

In this commit, I introduce the following key files of the Green Talks Scraper:

1. `scraper.sh`: This shell script is the core of the Green Talks Scraper tool. It fetches conference schedules from the URLs listed in `urls.txt`, extracts the relevant data, and identifies sustainability-related talks based on specified keywords.
2. `urls.txt`: This file contains a list of URLs pointing to conference schedules.
3. `README.md`: The README file contains essential information about the tool: an overview, prerequisites, and usage guidelines.

I also introduce a GitHub Actions workflow tailored to the Green Talks Scraper. The workflow automates running the scraper shell script, publishing the results to `talks.md`, and pushing the changes to the repository.

Signed-off-by: Al-HusseinHameedJasim <[email protected]>

---------

Signed-off-by: Al-HusseinHameedJasim <[email protected]>
Al-HusseinHameedJasim authored Oct 26, 2023
1 parent ab524bd commit 536e4c4
Showing 4 changed files with 238 additions and 0 deletions.
48 changes: 48 additions & 0 deletions .github/workflows/green-talks-scraper.yml
@@ -0,0 +1,48 @@
name: Green Talks Scraper

on:
  push:
    branches:
      - main
    paths:
      - '.github/workflows/green-talks-scraper.yml'
      - 'green-talks-scraper/*'
      - '!green-talks-scraper/README.md'

jobs:
  scrape:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          persist-credentials: false

      - name: Prepend line to talks.md
        run: echo "### An automatically generated list of environmental sustainability-related talks at The Linux Foundation events" > green-talks-scraper/talks.md

      - name: Run Scraper Script
        run: |
          chmod +x green-talks-scraper/scraper.sh
          ./green-talks-scraper/scraper.sh >> green-talks-scraper/talks.md

      - name: Commit file
        run: |
          # Check if "talks.md" has been modified
          if git diff --name-only | grep "talks.md" || git ls-files --others --exclude-standard | grep "talks.md"; then
            git config --local user.email "[email protected]"
            git config --local user.name "green-talks-scraper-workflow"
            git add green-talks-scraper/talks.md
            git commit -m "Update the green talks list [skip actions]"
            echo "FILE_COMMITTED=true" >> $GITHUB_ENV # Set an environment variable
          else
            echo "The list of talks is up to date"
          fi

      - name: Push changes
        if: env.FILE_COMMITTED == 'true'
        uses: ad-m/github-push-action@master
        with:
          github_token: ${{ secrets.MY_GITHUB_TOKEN }}
          force: true
29 changes: 29 additions & 0 deletions green-talks-scraper/README.md
@@ -0,0 +1,29 @@
# Green Talks Scraper

The Green Talks Scraper is a script that scrapes conference schedule URLs and extracts talk titles containing specific keywords. It compiles a list of talks presented (or to be presented) at [KubeCon + CloudNativeCon](https://KubeCon.io), Open Source Summit, and their co-located events, where at least one of the following keywords appears in the title: carbon, climate, energy, green, kepler, sustainability, or sustainable.

These talks focus on topics such as energy efficiency, environmental sustainability, and green computing within the cloud native ecosystem.
The repository also includes a GitHub Actions workflow that automates the scraping process and publishes the extracted talk titles to a [markdown file](talks.md).
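As a rough illustration of the matching idea (not the actual extraction logic in `scraper.sh`, which parses out full session titles and applies an exclusion list), a single schedule page from `urls.txt` can be checked for the keywords with cURL and grep:

```bash
# Count raw keyword hits on one schedule page (illustration only; the shipped script extracts and de-duplicates titles)
curl -s https://kccncna2023.sched.com \
  | grep -oiE 'carbon|climate|energy|green|kepler|sustainability|sustainable' \
  | tr '[:upper:]' '[:lower:]' | sort | uniq -c
```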

## Getting Started

### Prerequisites

- Bash
- cURL

### Usage

To get started, fork and clone this repository to your local machine and run the script:

```bash
git clone https://github.com/YOUR-USERNAME/tag-env-tooling.git
cd tag-env-tooling
chmod +x green-talks-scraper/scraper.sh
./green-talks-scraper/scraper.sh
```

You can also modify the `scraper.sh` script to search for additional keywords, or to exclude more words and phrases from the matched talk titles.
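For instance, a minimal sketch of such a customization, assuming you also want to match the keyword "emission" and to ignore titles containing the (hypothetical) phrase "green room", would change the two arrays at the top of `scraper.sh`:

```bash
# Keywords searched for in talk titles ("emission" is an illustrative addition, not part of the shipped script)
keywords=("carbon" "climate" "energy" "green" "kepler" "sustainability" "sustainable" "emission")

# Exclusion list applied when a title matches only one keyword ("green room" is an illustrative addition)
specific_words=("blue/green" "blue-green" "blue green" "green light" "green room" "greenberg" "greendale" "greene" "greenfield" "greenlee" "greenley" "greentree" "greenwald" "greenwich" "greenwood")
```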

### Output

For each conference schedule URL with at least one match, the script prints the schedule link followed by a bulleted list of the matching talk titles.
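A run against the bundled `urls.txt` produces output shaped like the following for every schedule with matches (the schedule link is taken from `urls.txt`; the titles here are placeholders):

```text
Conference schedule link: https://kccncna2023.sched.com

Talks:
- <talk title containing one of the keywords>
- <another matching talk title>
```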
86 changes: 86 additions & 0 deletions green-talks-scraper/scraper.sh
@@ -0,0 +1,86 @@
#!/bin/bash

# An array of keywords to search for in the content of URLs
keywords=("carbon" "climate" "energy" "green" "kepler" "sustainability" "sustainable")

# An array of specific words/phrases; if one of them is found and the title matched only one keyword, the title is excluded
specific_words=("blue/green" "blue-green" "blue green" "green light" "greenberg" "greendale" "greene" "greenfield" "greenlee" "greenley" "greentree" "greenwald" "greenwich" "greenwood")

# A function to check if a title contains a specific word/phrase
contains_specific_word() {
  local title="$1"

  for word in "${specific_words[@]}"; do
    if [[ "$title" == *"$word"* ]]; then
      return 0 # Match found
    fi
  done
  return 1 # No match found
}

# A function to process the content of a URL
process_url() {
  local url="$1"
  local content

  # Fetch the content of the URL
  content=$(curl -s "$url")

  # An associative array to store encountered titles
  declare -A encountered_titles

  # An array to store talks for the current URL
  talks=()

  # Loop through the array of keywords
  for keyword in "${keywords[@]}"; do
    matched_lines=($(echo "$content" | grep -ni "$keyword" | cut -d':' -f1))

    # Loop through the matched lines
    for line_number in "${matched_lines[@]}"; do
      title=$(echo "$content" | sed -n "${line_number}s/.*'>\(.*\)<span class=\"vs\">.*/\1/p")

      # Check that the title has at least 40 characters (to filter out unwanted matches) and has not been encountered before
      if [ "${#title}" -ge 40 ] && [ -z "${encountered_titles["$title"]}" ]; then
        encountered_titles["$title"]=1 # Mark the title as encountered
        talks+=("$title")              # Add title to talks array
      fi
    done
  done

  # An array to store filtered talks for the current URL
  filtered_talks=()

  # Filter the talks
  if [ "${#talks[@]}" -gt 0 ]; then
    for title in "${talks[@]}"; do
      title_lowercase=$(echo "$title" | tr '[:upper:]' '[:lower:]') # Convert title to lowercase for keyword counting
      # A counter for the number of matched keywords in each title
      count=0
      for keyword in "${keywords[@]}"; do
        if [[ "$title_lowercase" == *"$keyword"* ]]; then
          ((count++))
        fi
      done
      # Exclude the title only if it matched a single keyword and it contains an excluded word/phrase
      if ! { [ "$count" -eq 1 ] && contains_specific_word "$title_lowercase"; }; then
        filtered_talks+=("$title") # Add title to filtered_talks array
      fi
    done
  fi

  # Print the conference schedule link and "Talks:" section if talks were found
  if [ "${#filtered_talks[@]}" -gt 0 ]; then
    echo -e "\nConference schedule link: $url"
    echo -e "\nTalks:"
    for filtered_title in "${filtered_talks[@]}"; do
      echo "- $filtered_title"
    done
  fi
}

# Read the list of URLs and process each URL
mapfile -t URLS < green-talks-scraper/urls.txt

for url in "${URLS[@]}"; do
  process_url "$url"
done
75 changes: 75 additions & 0 deletions green-talks-scraper/urls.txt
@@ -0,0 +1,75 @@
https://colocatedeventsna2023.sched.com
https://kccncna2023.sched.com
https://kccncosschn2023.sched.com
https://colocatedeventseu2023.sched.com
https://kccnceu2023.sched.com
https://cdcongitopscon2023.sched.com
https://lssna2023.sched.com
https://osseu2023.sched.com
https://ossna2023.sched.com
https://kccncna2022.sched.com
https://backstageconna22.sched.com
https://cloudnativeebpfdayna22.sched.com
https://cloudnativesecurityconna22.sched.com
https://cloudnativetelcodayna22.sched.com
https://cloudnativewasmdayna22.sched.com
https://envoyconna22.sched.com
https://gitopsconna22.sched.com
https://kubernetesaidayna22.sched.com
https://kubernetesonedgedayna22.sched.com
https://knativeconna22.sched.com
https://kubernetesbatchdayna22.sched.com
https://openobservabilitydayna22.sched.com
https://prometheusdayna22.sched.com
https://servicemeshconna22.sched.com
https://kccnceu2022.sched.com
https://cloudnativeebpfdayeu22.sched.com
https://cloudnativesecurityconeu22.sched.com
https://cloudnativetelcodayeu22.sched.com
https://cloudnativewasmdayeu22.sched.com
https://fluentconeu22.sched.com
https://gitopsconeu22.sched.com
https://knativeconeu22.sched.com
https://kubernetesaidayeu22.sched.com
https://kubernetesbatchdayeu22.sched.com
https://kubernetesonedgedayeu22.sched.com
https://prometheusdayeu22.sched.com
https://servicemeshconeu22.sched.com
https://ossjapan2022.sched.com
https://osslatam22.sched.com
https://ossna2022.sched.com
https://osseu2022.sched.com
https://kccncosschn21.sched.com
https://kccncna2021.sched.com
https://cloudnativedevxdayna21.sched.com
https://cloudnativesecurityconna21.sched.com
https://cloudnativewasmdayna21.sched.com
https://envoyconna21.sched.com
https://fluentconna21.sched.com
https://gitopsconna21.sched.com
https://kubernetesaidayna21.sched.com
https://prodiddayspiffespirena21.sched.com
https://promconna21.sched.com
https://servicemeshconna21.sched.com
https://supplychainsecurityconna21.sched.com
https://kccnceu2021.sched.com
https://cloudnativerustdayeu21.sched.com
https://cloudnativewasmeu21.sched.com
https://cnsecuritydayeu21.sched.com
https://crossplanedayeu21.sched.com
https://fluentconeu21.sched.com
https://kubenetesaidayeu21.sched.com
https://kubenetesedgedayeu21.sched.com
https://promcononline2021.sched.com
https://servicemeshconeu21.sched.com
https://ossalsjp21.sched.com
https://osselc21.sched.com
https://kccncna20.sched.com
https://kccnceu20.sched.com
https://cnosvschina20eng.sched.com
https://cnsdna20.sched.com
https://servicemeshconeu20.sched.com
https://smcna20.sched.com
https://ossalsjp20.sched.com
https://ossna2020.sched.com
https://osseu2020.sched.com
