[Call for submissions] Incompatible site URLs #136

bigbrother666sh · 2024-12-02T03:32:47Z

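wiseflow plans to introduce a brand-new general-purpose crawler based on playwright in the next release (V0.3.2), to better handle complex pages, especially dynamic ones.
Initial tests show good support for sites that were problematic before, including Chinese news sites that tended to be parsed into garbled text and forums where not all information could be retrieved. If you have found URLs in real use that the current version does not handle well, or URLs that are common in your work, please reply in this thread and I will prioritize testing them.

Thanks, everyone!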

bigbrother666sh · 2024-12-06T09:00:39Z

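The following sites have been tested with V0.3.2 and can be read perfectly:
https://cryptopanic.com/news/
https://mp.weixin.qq.com/s/ (regular WeChat Official Account articles)
https://news.bjx.com.cn/rankinglist/qingneng/
https://www.xuexi.cn/ (Xuexi Qiangguo articles)
http://www.people.com.cn/ (People's Daily Online articles)

If you later find sites that cannot be read, please reply in this thread.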

live-in-the-moment · 2024-12-10T07:32:14Z

https://www.gd121.cn/zx/qxzx/list.shtml
For this site, the article detail data cannot be retrieved; when configured, only the list page is fetched.

bigbrother666sh · 2024-12-10T15:00:30Z

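> https://www.gd121.cn/zx/qxzx/list.shtml For this site, the article detail data cannot be retrieved; when configured, only the list page is fetched.

Try the latest V0.3.5. It works in my tests.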

live-in-the-moment · 2024-12-11T03:25:02Z

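> > https://www.gd121.cn/zx/qxzx/list.shtml For this site, the article detail data cannot be retrieved; when configured, only the list page is fetched.
>
> Try the latest V0.3.5. It works in my tests.

I have updated to V0.3.5:

I still only get the list titles. It does not enter the detail pages behind the list entries and use the data there as the final content.
The URL is: https://ftp.gd121.cn/zx/zhxx/list.shtml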

bigbrother666sh · 2024-12-11T14:35:19Z

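> > > https://www.gd121.cn/zx/qxzx/list.shtml For this site, the article detail data cannot be retrieved; when configured, only the list page is fetched.
> >
> > Try the latest V0.3.5. It works in my tests.
>
> I have updated to V0.3.5: I still only get the list titles. It does not enter the detail pages behind the list entries and use the data there as the final content. The URL is: https://ftp.gd121.cn/zx/zhxx/list.shtml

Whether links are followed, and how data is extracted, is driven by the focus point you set. You can write the focus point in more detail, for example "the weather in Guangzhou on December 12, 2024". Otherwise, any information about Guangzhou weather on any page will be extracted (and if there are many pages, you have to wait until all of them have been crawled).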

live-in-the-moment · 2024-12-12T01:13:38Z

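> > > > https://www.gd121.cn/zx/qxzx/list.shtml For this site, the article detail data cannot be retrieved; when configured, only the list page is fetched.
> > >
> > > Try the latest V0.3.5. It works in my tests.
> >
> > I have updated to V0.3.5: I still only get the list titles. It does not enter the detail pages behind the list entries and use the data there as the final content. The URL is: https://ftp.gd121.cn/zx/zhxx/list.shtml
>
> Whether links are followed, and how data is extracted, is driven by the focus point you set. You can write the focus point in more detail, for example "the weather in Guangzhou on December 12, 2024". Otherwise, any information about Guangzhou weather on any page will be extracted (and if there are many pages, you have to wait until all of them have been crawled).

Understood, thank you very much.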

bigbrother666sh · 2024-12-18T10:25:40Z

In my tests, https://www.zhihu.com/topic/19552832/hot does not work.
Waiting for a follow-up solution.

bigbrother666sh · 2024-12-21T10:04:29Z

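Results are mediocre across the entire Xuexi Qiangguo site.
The content can be fetched, but the parsing is incomplete. This site is somewhat unusual. Waiting for a follow-up solution.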

bigbrother666sh · 2024-12-31T05:46:00Z

https://www.gzhu.edu.cn/

tusik · 2025-01-02T05:09:53Z

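https://arxiv.org/list/cs/recent: content is captured, but some of the captured URLs are list-page URLs, there are duplicate URLs, and some of the captured items are not closely related to the focus point.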
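For illustration, duplicate URLs of the kind described above can be collapsed before extraction by comparing a normalized form of each URL. This is only a minimal sketch using the Python standard library; `normalize_url` and `dedupe_urls` are hypothetical helper names, not part of wiseflow, and the normalization rules (lowercase host, drop fragment, drop trailing slash) are assumptions that may not suit every site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a comparison key: lowercase scheme/host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(urls):
    """Keep the first occurrence of each URL, comparing by normalized key."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

The original spelling of the first occurrence is kept, so the stored URL still matches what appeared on the page.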

bigbrother666sh · 2025-01-05T14:00:38Z

> In my tests, https://www.zhihu.com/topic/19552832/hot does not work. Waiting for a follow-up solution.

This works in v0.3.6.

bigbrother666sh · 2025-01-06T01:45:09Z

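> https://arxiv.org/list/cs/recent: content is captured, but some of the captured URLs are list-page URLs, there are duplicate URLs, and some of the captured items are not closely related to the focus point.

You can try v0.3.6. You need to re-pull the repository, run pip uninstall crawlee and pip install crawl4ai==0.4.245, and delete the old pb/pb_data.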

bigbrother666sh · 2025-01-06T01:47:24Z

> https://www.gzhu.edu.cn/

This works in V0.3.6, but do not crawl it too often (once a day is fine).

tusik · 2025-01-06T07:11:29Z

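> > https://arxiv.org/list/cs/recent: content is captured, but some of the captured URLs are list-page URLs, there are duplicate URLs, and some of the captured items are not closely related to the focus point.
>
> You can try v0.3.6. You need to re-pull the repository, run pip uninstall crawlee and pip install crawl4ai==0.4.245, and delete the old pb/pb_data.

I tried it, and article capture is much more usable. A new problem appeared, though: it keeps crawling search pages and never stops. Should there be some kind of negative focus point to exclude this sort of noise?

----
Switching from deepseekv3 to gpt4o captures more, but also more irrelevant items, and the crawl goes rather deep: from arxiv it reached huggingface and then wikipedia.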
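As a rough illustration of the kind of exclusion rule asked about above, a crawler could decide whether to follow a link based on its host, its path, and the current depth. The sketch below is not wiseflow's logic: `should_follow`, `EXCLUDED_PATH_PARTS`, `max_depth`, and `allow_external` are all hypothetical names, and the default of dropping every external link is only an assumption made for the example.

```python
from urllib.parse import urlparse

# Hypothetical list of path segments that mark pages not worth following.
EXCLUDED_PATH_PARTS = ("search",)


def should_follow(url: str, seed_url: str, depth: int,
                  max_depth: int = 2, allow_external: bool = False) -> bool:
    """Decide whether a discovered link should be crawled."""
    parsed = urlparse(url)
    seed = urlparse(seed_url)
    if parsed.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar links
    if depth > max_depth:
        return False  # stop the crawl from wandering too deep
    if not allow_external and parsed.hostname != seed.hostname:
        return False  # e.g. arxiv -> huggingface -> wikipedia
    segments = [s.lower() for s in parsed.path.split("/") if s]
    return not any(s in EXCLUDED_PATH_PARTS for s in segments)
```

With the seed https://arxiv.org/list/cs/recent, a link whose path contains a `search` segment or whose host differs from arxiv.org would be skipped. Setting `allow_external=True` keeps external links but still applies the depth limit.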

bigbrother666sh · 2025-01-08T02:07:59Z

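@tusik Do you mean that gpt4o extracts more than deepseekV3 but also brings in more irrelevant information?
I am working out an approach for handling external links. It is fairly complex, and excluding them entirely would not be appropriate either.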

tusik · 2025-01-08T02:17:04Z

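> @tusik Do you mean that gpt4o extracts more than deepseekV3 but also brings in more irrelevant information? I am working out an approach for handling external links. It is fairly complex, and excluding them entirely would not be appropriate either.

In the end I found that DSV3 was not following the output format in the prompt. get_info requires the output to be """content""", but about 50% of the time it actually outputs `content` wrapped in backticks instead. I added code that falls back to matching backticks when no triple double quotes are found, and after that the two LLMs return roughly the same items.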
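The fallback described above might look roughly like the following. This is a minimal sketch rather than the code tusik actually added: `extract_content` is a hypothetical name, and it assumes the model wraps each item either in triple double quotes or in one or more backticks.

```python
import re

# Preferred delimiter requested by the prompt: """content"""
TRIPLE_QUOTE = re.compile(r'"""(.*?)"""', re.DOTALL)
# Fallback delimiter some models emit instead: `content` or ```content```
BACKTICK = re.compile(r"`+(.*?)`+", re.DOTALL)


def extract_content(llm_output: str) -> list[str]:
    """Return delimited items, trying triple double quotes before backticks."""
    matches = TRIPLE_QUOTE.findall(llm_output)
    if not matches:
        # Only fall back when the preferred delimiter is absent entirely.
        matches = BACKTICK.findall(llm_output)
    return [m.strip() for m in matches if m.strip()]
```

Trying the stricter pattern first means backticks that happen to appear inside a correctly quoted answer are not mistaken for delimiters.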

sinianyutian · 2025-01-09T07:39:10Z

https://www.shenmezhideting.com/app/homepage: the ranking by number of compositions cannot be captured from this site.

bigbrother666sh · 2025-01-17T15:24:57Z

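> https://www.shenmezhideting.com/app/homepage: the ranking by number of compositions cannot be captured from this site.

V0.3.7 will support this site.
To be clear, what I mean is that the number of compositions for each composer can be retrieved. If what you need is ranking information, this site does not contain it directly (there is a visual ranking list, but that is not something current LLMs understand well). That will have to wait for the insight feature in the 0.4.x versions.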

bigbrother666sh pinned this issue Dec 6, 2024

bigbrother666sh changed the title ~~[Call for submissions] Test site URLs~~ [Call for submissions] Incompatible site URLs Dec 6, 2024

This was referenced Dec 11, 2024

Why was only the list page crawled, and not the article detail pages? #128

Closed

The captured content is the cookie notice; the actual articles are not captured. #123

Closed

bigbrother666sh self-assigned this Jan 9, 2025

bigbrother666sh mentioned this issue Jan 18, 2025

Cannot crawl reddit #199

Open
