Amazon's cloud division has launched an investigation into Perplexity AI. At issue is whether the AI search startup is violating Amazon Web Services rules by scraping websites that attempted to prevent it from doing so, WIRED has learned.
An AWS spokesperson, who talked to WIRED on the condition that they not be named, confirmed the company's investigation of Perplexity.
WIRED had previously found that the startup, which has backing from the Jeff Bezos family fund and Nvidia and was recently valued at $3 billion, appears to rely on content from scraped websites that had forbidden access through the Robots Exclusion Protocol, a common web standard.
While the Robots Exclusion Protocol is not legally binding, terms of service generally are.
The Robots Exclusion Protocol is a decades-old web standard that involves placing a plaintext file on a domain to indicate which pages should not be accessed by automated bots and crawlers.
While companies that use scrapers can choose to ignore this protocol, most have traditionally respected it.
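To illustrate how a well-behaved crawler might consult these rules before fetching a page, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the bot name `ExampleBot`, and the `example.com` URLs are hypothetical and are not taken from any site or crawler named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bans one named bot entirely and keeps
# every other bot out of /private/. Names and paths are illustrative.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/story"),
        ("OtherBot", "https://example.com/story"),
        ("OtherBot", "https://example.com/private/page"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check is enforced by the server. It only works if the crawler chooses to run it and honor the answer, which is why compliance has depended on convention rather than on any technical barrier.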
The Amazon spokesperson told WIRED that AWS customers must adhere to the robots.txt standard when crawling websites.
Scrutiny of Perplexity's practices follows a June 11 report from Forbes that accused the startup of stealing at least one of its articles.
WIRED investigations confirmed the practice and found further evidence of scraping abuse and plagiarism by systems linked to Perplexity's AI-powered search chatbot.
Engineers for Condé Nast, WIRED's parent company, block Perplexity's crawler across all its websites using a robots.txt file.
WIRED found the company had access to a server using an unpublished IP address, 44.221.181.252, which visited Condé Nast properties at least hundreds of times in the past three months, apparently to scrape Condé Nast websites.
The machine associated with Perplexity appears to be engaged in widespread crawling of news websites that forbid bots from accessing their content.
Spokespeople for The Guardian, Forbes, and The New York Times also say they detected the IP address on their servers multiple times.
WIRED traced the IP address to a virtual machine known as an Elastic Compute Cloud instance hosted on AWS, which launched its investigation after we asked whether using AWS infrastructure to scrape websites that forbade it violated the company's terms of service.
He refused to name the company, citing a nondisclosure agreement.
This Cyber News was published on www.wired.com. Publication date: Thu, 27 Jun 2024 22:43:05 +0000