audit scraping tutorial (and audit the HTML) #23

Open
hannesdatta opened this issue May 17, 2023 · 6 comments
@hannesdatta
Contributor

hannesdatta commented May 17, 2023

Background:

We've built our site so others can learn how to scrape. But, we've never actually tried scraping it ourselves!

The purpose of this task is to build a "scraping tutorial" for the site, BUT ALSO revise our HTML templates to make the site "scraping-friendly".

We need to ensure that the tutorial covers a range of "identifiers" for getting data from the site. These should include:

  • TAGS (e.g., "h1")
  • CLASS NAMES (e.g., class = 'artist')
  • ATTRIBUTE-VALUE PAIRS (e.g., id = 123)

Further, we need to ensure students can extract information (1) from the TEXT content of HTML elements and (2) from attribute values.
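
To make the requirements above concrete, here is a minimal sketch of the three kinds of identifiers and the two kinds of extraction in BeautifulSoup. The HTML snippet, the class name "artist", the ids, and the data-country attribute are made up for illustration; the site's actual templates may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration only; not taken from the real site.
html = """
<html><body>
  <h1>Top artists</h1>
  <div class="artist" id="123" data-country="NL">
    <a href="/artist/123">Example Artist</a>
  </div>
  <div class="artist" id="456" data-country="BE">
    <a href="/artist/456">Another Artist</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Identify an element by its TAG
title = soup.find("h1").get_text(strip=True)

# 2. Identify elements by CLASS NAME
artists = soup.find_all("div", class_="artist")

# 3. Identify an element by an ATTRIBUTE-VALUE PAIR
artist_123 = soup.find("div", attrs={"id": "123"})

# Extract information from the TEXT content of elements
names = [div.get_text(strip=True) for div in artists]

# Extract information from ATTRIBUTE VALUES
links = [div.find("a")["href"] for div in artists]
countries = [div.get("data-country") for div in artists]

print(title, names, links, countries)
```

If the templates expose each piece of data through at least one of these routes, the tutorial can cover all of them with real pages instead of a toy snippet.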

Deliverable:

  • A tutorial in Python that teaches anyone how to scrape our site using BeautifulSoup. As an example, see this tutorial.
  • The tutorial can initially be developed and tried out in a Jupyter Notebook. Later on, we will add it directly to our site.
  • Running into "weird" things with scraping? Or do you think our HTML templates are not yet good enough? Then give feedback about the HTML source code so we can improve it.

Next steps:

  • Upon approval of the tutorial, we can put it directly on our site using the article HTML template.
  • Another step would be to develop the same tutorial using R.
@hannesdatta hannesdatta changed the title audit scraping process audit scraping process and add tutorial May 17, 2023
@hannesdatta hannesdatta mentioned this issue May 17, 2023
10 tasks
@hannesdatta hannesdatta changed the title audit scraping process and add tutorial develop scraping tutorial (and audit the HTML) May 19, 2023
@fleurlemire
Contributor

Hi Hannes,

This is what I have included so far. I cannot attach it here since it is a Jupyter notebook. I will send the right version via email (since it looks like images are not displaying well in Colab), but I am adding a Google Colab link here too: https://colab.research.google.com/drive/1F64Po-c3weJAm_ZrABAQRBJzXS-Qj5y4?usp=sharing

I did not include the recently played or top 10 songs for users and artists, since I can only scrape the table as a whole, but I have code for that if we want to include it later.
I did not include song information yet, since that page caused an error. I can try some things and add it if we want; I guess that one is a little more difficult.
I also did not add code to save the results as a pandas DataFrame yet. I can include that if we want to.

I can also remove things if some parts are already too extensive.
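
For the tables mentioned above, one possible way to go from "the table as a whole" to individual rows is to loop over the tr elements and read each cell. This is only a sketch: the table id "recently-played" and the column names are assumptions, not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical table for illustration only; the real markup may differ.
html = """
<table id="recently-played">
  <tr><th>Song</th><th>Artist</th></tr>
  <tr><td>Song A</td><td>Artist X</td></tr>
  <tr><td>Song B</td><td>Artist Y</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"id": "recently-played"})

# Column names come from the header cells
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(dict(zip(headers, cells)))

print(rows)
```

A list of dictionaries like rows can be passed to pandas.DataFrame(rows) to get a DataFrame, which would also cover the saving step.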

@hannesdatta
Contributor Author

Hi @fleurlemire - please commit your work directly to our GitHub repository for this project. You can create a new folder (say: tutorials) in the root directory. Please let me know once that is done.

@fleurlemire
Contributor

Hi @hannesdatta, when I try to commit, I get a "permission denied" error message.

@hannesdatta
Contributor Author

You should now have push access. Can you try again?

@fleurlemire
Contributor

It is added! @hannesdatta

@fleurlemire
Contributor

Hi Hannes, I added some extra information, including how to save the scraped data, and uploaded it.

@hannesdatta hannesdatta changed the title develop scraping tutorial (and audit the HTML) audit scraping tutorial (and audit the HTML) Aug 17, 2023