GetTogether.community content is used to train LLMs #320

cassidyjames · 2023-04-19T15:39:32Z

You can confirm that 8.4k tokens from GetTogether.community were scraped by CommonCrawl and are included in Google's C4 dataset. It's likely that crawlers for other LLMs have scraped, and will continue to scrape, user-generated content from GetTogether.community to train proprietary large language models.

This can be discouraged for CommonCrawl and ChatGPT by adding the following rules to robots.txt:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
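
As a quick sanity check, the rules above can be fed to Python's standard-library `urllib.robotparser` to see which user agents they block. This is only a sketch: the `is_allowed` helper and the `/events/` path are made up for illustration, and robots.txt is advisory, so it only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# The rules proposed in this issue, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""

# Hypothetical page URL, used only for illustration.
EVENTS_URL = "https://gettogether.community/events/"


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Agents with no matching rule group (and no "*" group) stay allowed.
    for agent in ("CCBot", "ChatGPT-User", "Mozilla/5.0"):
        print(agent, "allowed" if is_allowed(agent, EVENTS_URL) else "blocked")
```

With these rules, only the two named crawlers are disallowed; regular browsers and any other bots are unaffected because there is no `User-agent: *` group.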
