Add Green Talks Scraper #2

Merged
merged 2 commits into from
Oct 26, 2023
48 changes: 48 additions & 0 deletions .github/workflows/green-talks-scraper.yml
@@ -0,0 +1,48 @@
name: Green Talks Scraper

on:
  push:
    branches:
      - main
    paths:
      - '.github/workflows/green-talks-scraper.yml'
      - 'green-talks-scraper/*'
      - '!green-talks-scraper/README.md'

jobs:
  scrape:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          persist-credentials: false

      - name: Prepend line to talks.md
        run: echo "### An automatically generated list of environmental sustainability-related talks at The Linux Foundation events" > green-talks-scraper/talks.md

      - name: Run Scraper Script
        run: |
          chmod +x green-talks-scraper/scraper.sh
          ./green-talks-scraper/scraper.sh >> green-talks-scraper/talks.md

      - name: Commit file
        run: |
          # Check if "talks.md" has been modified
          if git diff --name-only | grep "talks.md" || git ls-files --others --exclude-standard | grep "talks.md"; then
            git config --local user.email "[email protected]"
            git config --local user.name "green-talks-scraper-workflow"
            git add green-talks-scraper/talks.md
            git commit -m "Update the green talks list [skip actions]"
            echo "FILE_COMMITTED=true" >> $GITHUB_ENV # Set an environment variable
          else
            echo "The list of talks is up to date"
          fi

      - name: Push changes
        if: env.FILE_COMMITTED == 'true'
        uses: ad-m/github-push-action@master
        with:
          github_token: ${{ secrets.MY_GITHUB_TOKEN }}
          force: true
29 changes: 29 additions & 0 deletions green-talks-scraper/README.md
@@ -0,0 +1,29 @@
# Green Talks Scraper

The Green Talks Scraper is a Bash script that fetches conference schedule pages and extracts the talk titles that contain specific keywords. It compiles a list of talks presented, or scheduled to be presented, at [KubeCon + CloudNativeCon](https://KubeCon.io), Open Source Summit, and their co-located events whose titles mention at least one of the following keywords: carbon, climate, energy, green, kepler, sustainability, or sustainable.

These talks focus on topics such as energy efficiency, environmental sustainability, and green computing within the cloud native ecosystem.
The repository also includes a GitHub Actions workflow that automates the scraping process and publishes the extracted talk titles to a [markdown file](talks.md).
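
As a rough illustration of the keyword matching described above, the sketch below counts how many of the listed keywords occur in a talk title, ignoring case. The `count_keywords` helper and the sample title are hypothetical and only mirror the idea; `scraper.sh` itself may differ in detail.

```bash
#!/bin/bash

# Keywords as listed in this README.
keywords=("carbon" "climate" "energy" "green" "kepler" "sustainability" "sustainable")

# count_keywords (hypothetical helper): prints how many keywords occur in a title.
count_keywords() {
    local title_lowercase keyword
    local count=0

    # Lowercase the title so the match is case-insensitive.
    title_lowercase=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')

    for keyword in "${keywords[@]}"; do
        if [[ "$title_lowercase" == *"$keyword"* ]]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Made-up sample title; it matches "energy" and "kepler".
count_keywords "Measuring Energy Use of Kubernetes Workloads With Kepler"   # prints 2
```

A title counts as a match when the printed number is at least 1.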

## Getting Started

### Prerequisites

- Bash 4.0 or later (the script uses associative arrays and `mapfile`)
- cURL

### Usage

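To get started, fork and clone this repository to your local machine, then run the script from the repository root (it reads `green-talks-scraper/urls.txt` relative to the current directory):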

```bash
git clone https://github.com/YOUR-USERNAME/tag-env-tooling.git
cd tag-env-tooling
chmod +x green-talks-scraper/scraper.sh
./green-talks-scraper/scraper.sh
```

You can also modify the `scraper.sh` script to add other keywords you want to search for or exclude within the talk titles.
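
A minimal sketch of such a change, assuming the two arrays keep the names used in `scraper.sh` (`keywords` and `specific_words`). The added entries `emissions` and `evergreen` are hypothetical examples, and the exclusion list is shortened here for brevity.

```bash
#!/bin/bash

# Arrays named as in scraper.sh; the exclusion list is shortened for this sketch.
keywords=("carbon" "climate" "energy" "green" "kepler" "sustainability" "sustainable")
specific_words=("blue/green" "blue-green" "blue green" "green light" "greenfield")

# Hypothetical additions, for illustration only.
keywords+=("emissions")          # a new keyword to search for
specific_words+=("evergreen")    # a new phrase that should not count as a "green" match

# Same check as contains_specific_word in scraper.sh: succeeds if the
# (already lowercased) title contains any exclusion phrase.
contains_specific_word() {
    local title="$1"
    local word

    for word in "${specific_words[@]}"; do
        if [[ "$title" == *"$word"* ]]; then
            return 0 # Match found
        fi
    done
    return 1 # No match found
}

if contains_specific_word "blue/green deployments with argo rollouts"; then
    echo "excluded phrase found"
fi
```

Note that `scraper.sh` drops a title for an exclusion phrase only when exactly one keyword matched, so a title that also mentions, say, "energy" is still listed.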

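### Output

The script prints its results to standard output. For each conference schedule with at least one matching talk, it prints the schedule link followed by a list of the matching talk titles.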
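
The shape of that output can be sketched as follows. `print_talks` is a hypothetical helper written for this example, and the talk titles are placeholders rather than real scraped results; only the layout (a schedule link, a `Talks:` line, then one bullet per title) follows what `scraper.sh` prints.

```bash
#!/bin/bash

# print_talks (hypothetical helper): prints one schedule block in the same
# layout as scraper.sh. The first argument is the URL, the rest are titles.
print_talks() {
    local url="$1"
    shift

    # Print nothing when no talks matched, as scraper.sh does.
    [ "$#" -gt 0 ] || return 0

    printf '\nConference schedule link: %s\n' "$url"
    printf '\nTalks:\n'

    local title
    for title in "$@"; do
        printf -- '- %s\n' "$title"
    done
}

# Placeholder titles, not real talks.
print_talks "https://kccncna2023.sched.com" \
    "Placeholder Title About Green Computing" \
    "Placeholder Title About Carbon-Aware Scheduling"
```

Because the layout is plain markdown, the workflow can append it directly to `talks.md`.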
86 changes: 86 additions & 0 deletions green-talks-scraper/scraper.sh
@@ -0,0 +1,86 @@
#!/bin/bash

# An array of keywords to search for in the content of URLs
keywords=("carbon" "climate" "energy" "green" "kepler" "sustainability" "sustainable")

# An array of words/phrases that cause a title to be excluded when it matches only one keyword
specific_words=("blue/green" "blue-green" "blue green" "green light" "greenberg" "greendale" "greene" "greenfield" "greenlee" "greenley" "greentree" "greenwald" "greenwich" "greenwood")

# A function to check if a title contains a specific word/phrase
contains_specific_word() {
    local title="$1"

    for word in "${specific_words[@]}"; do
        if [[ "$title" == *"$word"* ]]; then
            return 0 # Match found
        fi
    done
    return 1 # No match found
}

# A function to process the content of a URL
process_url() {
    local url="$1"
    local content

    # Fetch the content of the URL
    content=$(curl -s "$url")

    # An associative array to store encountered titles
    declare -A encountered_titles

    # An array to store talks for the current URL
    talks=()

    # Loop through the array of keywords
    for keyword in "${keywords[@]}"; do
        matched_lines=($(echo "$content" | grep -ni "$keyword" | cut -d':' -f1))

        # Loop through the matched lines
        for line_number in "${matched_lines[@]}"; do
            title=$(echo "$content" | sed -n "${line_number}s/.*'>\(.*\)<span class=\"vs\">.*/\1/p")

            # Check if the title has at least 40 characters (to avoid having unwanted matched contents) and is not encountered before
            if [ "${#title}" -ge 40 ] && [ -z "${encountered_titles["$title"]}" ]; then
                encountered_titles["$title"]=1 # Mark the title as encountered
                talks+=("$title") # Add title to talks array
            fi
        done
    done

    # An array to store filtered talks for the current URL
    filtered_talks=()

    # Filter the talks
    if [ "${#talks[@]}" -gt 0 ]; then
        for title in "${talks[@]}"; do
            title_lowercase=$(echo "$title" | tr '[:upper:]' '[:lower:]') # Convert title to lowercase for keyword counting
            # A counter to check the number of matched keywords in each title
            count=0
            for keyword in "${keywords[@]}"; do
                if [[ "$title_lowercase" == *"$keyword"* ]]; then
                    ((count++))
                fi
            done
            # Check if the talk has only one matched keyword and whether it should be excluded if it contains a specific word/phrase
            if ! { [ "$count" -eq 1 ] && contains_specific_word "$title_lowercase"; }; then
                filtered_talks+=("$title") # Add title to filtered_talks array
            fi
        done
    fi

    # Print the conference schedule link and "Talks:" section if talks were found
    if [ "${#filtered_talks[@]}" -gt 0 ]; then
        echo -e "\nConference schedule link: $url"
        echo -e "\nTalks:"
        for filtered_title in "${filtered_talks[@]}"; do
            echo "- $filtered_title"
        done
    fi
}

# Read the URLs from urls.txt (path relative to the repository root) into an array
mapfile -t URLS < green-talks-scraper/urls.txt

# Loop through the array of URLs and process each URL
for url in "${URLS[@]}"; do
    process_url "$url"
done
75 changes: 75 additions & 0 deletions green-talks-scraper/urls.txt
@@ -0,0 +1,75 @@
https://colocatedeventsna2023.sched.com
https://kccncna2023.sched.com
https://kccncosschn2023.sched.com
https://colocatedeventseu2023.sched.com
https://kccnceu2023.sched.com
https://cdcongitopscon2023.sched.com
https://lssna2023.sched.com
https://osseu2023.sched.com
https://ossna2023.sched.com
https://kccncna2022.sched.com
https://backstageconna22.sched.com
https://cloudnativeebpfdayna22.sched.com
https://cloudnativesecurityconna22.sched.com
https://cloudnativetelcodayna22.sched.com
https://cloudnativewasmdayna22.sched.com
https://envoyconna22.sched.com
https://gitopsconna22.sched.com
https://kubernetesaidayna22.sched.com
https://kubernetesonedgedayna22.sched.com
https://knativeconna22.sched.com
https://kubernetesbatchdayna22.sched.com
https://openobservabilitydayna22.sched.com
https://prometheusdayna22.sched.com
https://servicemeshconna22.sched.com
https://kccnceu2022.sched.com
https://cloudnativeebpfdayeu22.sched.com
https://cloudnativesecurityconeu22.sched.com
https://cloudnativetelcodayeu22.sched.com
https://cloudnativewasmdayeu22.sched.com
https://fluentconeu22.sched.com
https://gitopsconeu22.sched.com
https://knativeconeu22.sched.com
https://kubernetesaidayeu22.sched.com
https://kubernetesbatchdayeu22.sched.com
https://kubernetesonedgedayeu22.sched.com
https://prometheusdayeu22.sched.com
https://servicemeshconeu22.sched.com
https://ossjapan2022.sched.com
https://osslatam22.sched.com
https://ossna2022.sched.com
https://osseu2022.sched.com
https://kccncosschn21.sched.com
https://kccncna2021.sched.com
https://cloudnativedevxdayna21.sched.com
https://cloudnativesecurityconna21.sched.com
https://cloudnativewasmdayna21.sched.com
https://envoyconna21.sched.com
https://fluentconna21.sched.com
https://gitopsconna21.sched.com
https://kubernetesaidayna21.sched.com
https://prodiddayspiffespirena21.sched.com
https://promconna21.sched.com
https://servicemeshconna21.sched.com
https://supplychainsecurityconna21.sched.com
https://kccnceu2021.sched.com
https://cloudnativerustdayeu21.sched.com
https://cloudnativewasmeu21.sched.com
https://cnsecuritydayeu21.sched.com
https://crossplanedayeu21.sched.com
https://fluentconeu21.sched.com
https://kubenetesaidayeu21.sched.com
https://kubenetesedgedayeu21.sched.com
https://promcononline2021.sched.com
https://servicemeshconeu21.sched.com
https://ossalsjp21.sched.com
https://osselc21.sched.com
https://kccncna20.sched.com
https://kccnceu20.sched.com
https://cnosvschina20eng.sched.com
https://cnsdna20.sched.com
https://servicemeshconeu20.sched.com
https://smcna20.sched.com
https://ossalsjp20.sched.com
https://ossna2020.sched.com
https://osseu2020.sched.com