Both OpenAI and Google have released guidance for website owners who do not want the content of their sites used to train the companies' large language models.
We've long been supporters of the right to scrape websites (the process of using a computer to load and read pages of a website for later analysis) as a tool for research, journalism, and archiving.
We believe this practice is still lawful when collecting training data for generative AI, but the question of whether something should be illegal is different from whether it may be considered rude, gauche, or unpleasant.
As norms continue to develop around what kinds of scraping and what uses of scraped data are considered acceptable, it is useful to have a tool for website operators to automatically signal their preference to crawlers.
Asking OpenAI and Google not to include scrapes of your site in their models is an easy process as long as you can access your site's file structure.
We've talked before about how these models use art for training, and the general idea and process are the same for text.
The end result, at least currently, is chatbots like Google Bard and ChatGPT. If you do not want your website's content used for this training, you can ask the bots deployed by Google and OpenAI to skip over your site.
Keep in mind that this only applies to future scraping.
If Google or OpenAI already have data from your site, they will not remove it.
You can block Common Crawl, but doing so blocks its crawler from including your data in all of its datasets, many of which have nothing to do with AI. There's also no technical requirement that a bot obey your requests.
These flags don't block other types of scraping used for research or other purposes either, so if you're generally in favor of scraping but uneasy about the use of your website's content in a corporation's AI training set, this is one step you can take.
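If you do decide to block Common Crawl, that works the same way as the opt-outs described below: a robots.txt entry naming CCBot, the user agent Common Crawl's crawler identifies itself with. A minimal sketch:

User-agent: CCBot
Disallow: /

Again, this removes your site from Common Crawl's future crawls entirely, not just from the AI training uses of its data.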
If website owners want to ask a specific search engine or other bot to not scan their site, they can enter that request in their robots.txt file, a plain text file at the root of the site that tells well-behaved crawlers which parts of the site to skip.
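As a general illustration (the bot name here is hypothetical, not one of the AI crawlers discussed below), each entry pairs a User-agent line naming a crawler with one or more Disallow lines listing the paths it should not visit:

User-agent: ExampleBot
Disallow: /private/

A crawler that honors the convention and identifies itself as ExampleBot will skip everything under /private/.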
If you run your own website, you should have some way to access its file structure, either through your hosting provider's web portal or via FTP. You may need to comb through your provider's documentation to figure out how to reach your site's top-level folder.
In most cases, your site will already have a robots.txt file in that top-level folder, and you can simply add to it; if it doesn't, you can create one as a plain text file named robots.txt.
With all that out of the way, here's what to include in your site's robots.txt file if you do not want ChatGPT and Google to use the contents of your site to train their generative AI models.
If you want to cover the entirety of your site, add these lines to your robots.txt file (GPTBot is the user agent OpenAI's crawler uses, and Google-Extended is the token Google checks for AI training opt-outs):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

You can also narrow this down to block access to only certain folders on your site.
Maybe you don't mind if most of the data on your site is used for training, but you have a blog that you use as a journal.
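A minimal sketch of that narrower setup, assuming the journal lives under a /blog/ path (substitute whatever folder your site actually uses):

User-agent: GPTBot
Disallow: /blog/

User-agent: Google-Extended
Disallow: /blog/

Everything outside /blog/ stays available to both crawlers, while the journal itself is flagged as off limits for training.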
As mentioned above, we at EFF will not be using these flags, because we believe scraping is a powerful tool for research and access to information; we want the information we're providing to spread far and wide and to be represented in the outputs and answers provided by LLMs. Of course, individual website owners may have different views about their blogs, portfolios, or whatever else their sites are used for.