Add Green Talks Scraper (#2)
* Add Green Talks Scraper

In this commit, I introduce the following key files of the Green Talks Scraper:

1. `scraper.sh`: This shell script is the core of the Green Talks Scraper tool. It fetches conference schedules from the URLs listed in `urls.txt`, extracts the relevant data, and identifies sustainability-related talks based on specified keywords.
2. `urls.txt`: This file contains a list of URLs pointing to conference schedules.
3. `README.md`: The README file contains essential information about the tool: an overview, prerequisites, and usage guidelines.

I also introduce a GitHub Actions workflow tailored to the Green Talks Scraper. The workflow automates running the scraper shell script, publishing the results to `talks.md`, and pushing the changes to the repository.

Signed-off-by: Al-HusseinHameedJasim <[email protected]>

---------

Signed-off-by: Al-HusseinHameedJasim <[email protected]>
Al-HusseinHameedJasim authored Oct 26, 2023
1 parent ab524bd commit 536e4c4
Showing 4 changed files with 238 additions and 0 deletions.
48 changes: 48 additions & 0 deletions .github/workflows/green-talks-scraper.yml
@@ -0,0 +1,48 @@
name: Green Talks Scraper

on:
  push:
    branches:
      - main
    paths:
      - '.github/workflows/green-talks-scraper.yml'
      - 'green-talks-scraper/*'
      - '!green-talks-scraper/README.md'

jobs:
  scrape:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          persist-credentials: false

      - name: Prepend line to talks.md
        run: echo "### An automatically generated list of environmental sustainability-related talks at The Linux Foundation events" > green-talks-scraper/talks.md

      - name: Run Scraper Script
        run: |
          chmod +x green-talks-scraper/scraper.sh
          ./green-talks-scraper/scraper.sh >> green-talks-scraper/talks.md

      - name: Commit file
        run: |
          # Check if "talks.md" has been modified
          if git diff --name-only | grep "talks.md" || git ls-files --others --exclude-standard | grep "talks.md"; then
            git config --local user.email "[email protected]"
            git config --local user.name "green-talks-scraper-workflow"
            git add green-talks-scraper/talks.md
            git commit -m "Update the green talks list [skip actions]"
            echo "FILE_COMMITTED=true" >> $GITHUB_ENV # Set an environment variable
          else
            echo "The list of talks is up to date"
          fi

      - name: Push changes
        if: env.FILE_COMMITTED == 'true'
        uses: ad-m/github-push-action@master
        with:
          github_token: ${{ secrets.MY_GITHUB_TOKEN }}
          force: true
29 changes: 29 additions & 0 deletions green-talks-scraper/README.md
@@ -0,0 +1,29 @@
# Green Talks Scraper

The Green Talks Scraper is a script that scrapes conference schedule URLs and extracts talk titles containing specific keywords. It compiles a list of talks presented (or to be presented) at [KubeCon + CloudNativeCon](https://KubeCon.io), Open Source Summit, and their co-located events, where at least one of the following keywords appears in the title: carbon, climate, energy, green, kepler, sustainability, or sustainable.

These talks focus on topics such as energy efficiency, environmental sustainability, and green computing within the cloud native ecosystem.
The repository also includes a GitHub Actions workflow that automates the scraping process and publishes the extracted talk titles to a [markdown file](talks.md).
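As a rough illustration of the matching idea (not the actual extraction logic in `scraper.sh`, which parses out full session titles and applies an exclusion list), a single schedule page from `urls.txt` can be checked for the keywords with cURL and grep:

```bash
# Count raw keyword hits on one schedule page (illustration only; the shipped script extracts and de-duplicates titles)
curl -s https://kccncna2023.sched.com \
  | grep -oiE 'carbon|climate|energy|green|kepler|sustainability|sustainable' \
  | tr '[:upper:]' '[:lower:]' | sort | uniq -c
```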

## Getting Started

### Prerequisites

- Bash
- cURL

### Usage

To get started, fork and clone this repository to your local machine and run the script:

```bash
git clone https://github.com/YOUR-USERNAME/tag-env-tooling.git
cd tag-env-tooling
chmod +x green-talks-scraper/scraper.sh
./green-talks-scraper/scraper.sh
```

You can also modify the `scraper.sh` script to search for additional keywords, or to exclude more words and phrases from the matched talk titles.
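For instance, a minimal sketch of such a customization, assuming you also want to match the keyword "emission" and to ignore titles containing the (hypothetical) phrase "green room", would change the two arrays at the top of `scraper.sh`:

```bash
# Keywords searched for in talk titles ("emission" is an illustrative addition, not part of the shipped script)
keywords=("carbon" "climate" "energy" "green" "kepler" "sustainability" "sustainable" "emission")

# Exclusion list applied when a title matches only one keyword ("green room" is an illustrative addition)
specific_words=("blue/green" "blue-green" "blue green" "green light" "green room" "greenberg" "greendale" "greene" "greenfield" "greenlee" "greenley" "greentree" "greenwald" "greenwich" "greenwood")
```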

### Output

For each conference schedule URL with at least one match, the script prints the schedule link followed by a bulleted list of the matching talk titles.
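A run against the bundled `urls.txt` produces output shaped like the following for every schedule with matches (the schedule link is taken from `urls.txt`; the titles here are placeholders):

```text
Conference schedule link: https://kccncna2023.sched.com

Talks:
- <talk title containing one of the keywords>
- <another matching talk title>
```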
86 changes: 86 additions & 0 deletions green-talks-scraper/scraper.sh
@@ -0,0 +1,86 @@
#!/bin/bash

# An array of keywords to search for in the content of URLs
keywords=("carbon" "climate" "energy" "green" "kepler" "sustainability" "sustainable")

# An array of specific words/phrases; if one of them is found and the title matched only one keyword, the title is excluded
specific_words=("blue/green" "blue-green" "blue green" "green light" "greenberg" "greendale" "greene" "greenfield" "greenlee" "greenley" "greentree" "greenwald" "greenwich" "greenwood")

# A function to check if a title contains a specific word/phrase
contains_specific_word() {
  local title="$1"

  for word in "${specific_words[@]}"; do
    if [[ "$title" == *"$word"* ]]; then
      return 0 # Match found
    fi
  done
  return 1 # No match found
}

# A function to process the content of a URL
process_url() {
  local url="$1"
  local content

  # Fetch the content of the URL
  content=$(curl -s "$url")

  # An associative array to store encountered titles
  declare -A encountered_titles

  # An array to store talks for the current URL
  talks=()

  # Loop through the array of keywords
  for keyword in "${keywords[@]}"; do
    matched_lines=($(echo "$content" | grep -ni "$keyword" | cut -d':' -f1))

    # Loop through the matched lines
    for line_number in "${matched_lines[@]}"; do
      title=$(echo "$content" | sed -n "${line_number}s/.*'>\(.*\)<span class=\"vs\">.*/\1/p")

      # Check that the title has at least 40 characters (to filter out unwanted matches) and has not been encountered before
      if [ "${#title}" -ge 40 ] && [ -z "${encountered_titles["$title"]}" ]; then
        encountered_titles["$title"]=1 # Mark the title as encountered
        talks+=("$title")              # Add title to talks array
      fi
    done
  done

  # An array to store filtered talks for the current URL
  filtered_talks=()

  # Filter the talks
  if [ "${#talks[@]}" -gt 0 ]; then
    for title in "${talks[@]}"; do
      title_lowercase=$(echo "$title" | tr '[:upper:]' '[:lower:]') # Convert title to lowercase for keyword counting
      # A counter for the number of matched keywords in each title
      count=0
      for keyword in "${keywords[@]}"; do
        if [[ "$title_lowercase" == *"$keyword"* ]]; then
          ((count++))
        fi
      done
      # Exclude the title only if it matched a single keyword and it contains an excluded word/phrase
      if ! { [ "$count" -eq 1 ] && contains_specific_word "$title_lowercase"; }; then
        filtered_talks+=("$title") # Add title to filtered_talks array
      fi
    done
  fi

  # Print the conference schedule link and "Talks:" section if talks were found
  if [ "${#filtered_talks[@]}" -gt 0 ]; then
    echo -e "\nConference schedule link: $url"
    echo -e "\nTalks:"
    for filtered_title in "${filtered_talks[@]}"; do
      echo "- $filtered_title"
    done
  fi
}

# Read the list of URLs and process each URL
mapfile -t URLS < green-talks-scraper/urls.txt

for url in "${URLS[@]}"; do
  process_url "$url"
done
75 changes: 75 additions & 0 deletions green-talks-scraper/urls.txt
@@ -0,0 +1,75 @@
https://colocatedeventsna2023.sched.com
https://kccncna2023.sched.com
https://kccncosschn2023.sched.com
https://colocatedeventseu2023.sched.com
https://kccnceu2023.sched.com
https://cdcongitopscon2023.sched.com
https://lssna2023.sched.com
https://osseu2023.sched.com
https://ossna2023.sched.com
https://kccncna2022.sched.com
https://backstageconna22.sched.com
https://cloudnativeebpfdayna22.sched.com
https://cloudnativesecurityconna22.sched.com
https://cloudnativetelcodayna22.sched.com
https://cloudnativewasmdayna22.sched.com
https://envoyconna22.sched.com
https://gitopsconna22.sched.com
https://kubernetesaidayna22.sched.com
https://kubernetesonedgedayna22.sched.com
https://knativeconna22.sched.com
https://kubernetesbatchdayna22.sched.com
https://openobservabilitydayna22.sched.com
https://prometheusdayna22.sched.com
https://servicemeshconna22.sched.com
https://kccnceu2022.sched.com
https://cloudnativeebpfdayeu22.sched.com
https://cloudnativesecurityconeu22.sched.com
https://cloudnativetelcodayeu22.sched.com
https://cloudnativewasmdayeu22.sched.com
https://fluentconeu22.sched.com
https://gitopsconeu22.sched.com
https://knativeconeu22.sched.com
https://kubernetesaidayeu22.sched.com
https://kubernetesbatchdayeu22.sched.com
https://kubernetesonedgedayeu22.sched.com
https://prometheusdayeu22.sched.com
https://servicemeshconeu22.sched.com
https://ossjapan2022.sched.com
https://osslatam22.sched.com
https://ossna2022.sched.com
https://osseu2022.sched.com
https://kccncosschn21.sched.com
https://kccncna2021.sched.com
https://cloudnativedevxdayna21.sched.com
https://cloudnativesecurityconna21.sched.com
https://cloudnativewasmdayna21.sched.com
https://envoyconna21.sched.com
https://fluentconna21.sched.com
https://gitopsconna21.sched.com
https://kubernetesaidayna21.sched.com
https://prodiddayspiffespirena21.sched.com
https://promconna21.sched.com
https://servicemeshconna21.sched.com
https://supplychainsecurityconna21.sched.com
https://kccnceu2021.sched.com
https://cloudnativerustdayeu21.sched.com
https://cloudnativewasmeu21.sched.com
https://cnsecuritydayeu21.sched.com
https://crossplanedayeu21.sched.com
https://fluentconeu21.sched.com
https://kubenetesaidayeu21.sched.com
https://kubenetesedgedayeu21.sched.com
https://promcononline2021.sched.com
https://servicemeshconeu21.sched.com
https://ossalsjp21.sched.com
https://osselc21.sched.com
https://kccncna20.sched.com
https://kccnceu20.sched.com
https://cnosvschina20eng.sched.com
https://cnsdna20.sched.com
https://servicemeshconeu20.sched.com
https://smcna20.sched.com
https://ossalsjp20.sched.com
https://ossna2020.sched.com
https://osseu2020.sched.com
