Fix scraper browser returns duplicate texts #1134

ewgdg · 2025-02-12T07:45:20Z

soup.find_all(tags) returns nested tags even when their text has already been extracted as part of a parent tag. This leads to a large number of duplicate text blocks.

Prefer soup.get_text(strip=True, separator="\n"), which visits each text node once and gives a much smaller result.
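A minimal sketch of the difference, not the actual scraper code. The sample HTML, the tag list, and the helper names texts_via_find_all and text_via_get_text are made up for illustration, and the built-in html.parser backend is assumed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two <p> tags nested inside a <div>.
HTML = """
<html><head><title>Demo</title></head>
<body>
  <div><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def texts_via_find_all(html, tags=("div", "p")):
    """Collect text per matched tag; nested matches repeat their parent's text."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        el.get_text(strip=True, separator="\n")
        for el in soup.find_all(list(tags))
    ]


def text_via_get_text(html):
    """Extract the text once from the body, so each text node appears one time."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.body or soup  # fall back to the whole document if no <body>
    return root.get_text(strip=True, separator="\n")


nested = texts_via_find_all(HTML)
flat = text_via_get_text(HTML)

# The <div> contributes both paragraphs, then each <p> contributes its own again.
print(len(nested), "blocks from find_all")
print(flat)
```

In this sample each paragraph shows up twice in the find_all output (once via the div, once via its own p tag) but only once in the get_text output.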

assafelovic

This is great, thank you so much!

ewgdg added 3 commits February 11, 2025 23:24

fix: scraper browser duplicate texts

9f5a323

fix: extract title from head

9110263

cleanup

611e177

assafelovic approved these changes Feb 12, 2025

assafelovic merged commit 3ffb264 into assafelovic:master Feb 12, 2025
