
Perplexity’s Secret IPs: Scraping the Web While Sites Say ‘No’

  • usareisende
  • Aug 5
  • 2 min read

 

The impact of Artificial Intelligence (AI) on today's technology has been massive. It automates tasks, answers basic queries, generates images and videos from a text prompt, and even handles basic software development, so AI looks poised to take the internet by storm. Because of this potential, companies such as Google, Amazon, and Microsoft spend billions on the technology.

 

Unfortunately, AI's advancement has been a bane for some companies. Websites that hold valuable data can end up on the losing side because of AI data scraping.

 

Recently, AI search start-up Perplexity has been in the spotlight. Its AI has been accused of scraping information from websites without permission. The accusation is more troubling because Perplexity claims that it does not actually crawl websites for content.

 

For context, Perplexity's claim that it does not crawl for content appears in its documentation on bots:

 

“PerplexityBot is designed to surface and link websites in search results on Perplexity. It is not used to crawl content for AI foundation models. Perplexity-User...is not used for web crawling or to collect content for training AI foundation models.”

 

Read closely, this description leaves open the possibility that PerplexityBot and Perplexity-User still fetch website content. The distinction Perplexity draws is that its bots retrieve content only for user-requested information, not for training its AI.

 

However, Cloudflare reported the following behavior by Perplexity in a recent blog post:

 

“Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files.”
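To make the behavior Cloudflare describes more concrete, here is a minimal Python sketch of how a site might sort incoming requests by user agent and source IP. Everything in it is an assumption for illustration: the `classify_request` helper is hypothetical, and the IP range is a reserved documentation block (RFC 5737), not a real Perplexity address range.

```python
import ipaddress

# Hypothetical "published" crawler range. 192.0.2.0/24 is a reserved
# documentation block (RFC 5737), NOT a real Perplexity range.
DECLARED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

# User agent names taken from the Perplexity documentation quoted above.
DECLARED_AGENTS = ("perplexitybot", "perplexity-user")


def classify_request(user_agent: str, source_ip: str) -> str:
    """Label a request by whether its user agent and IP match the declared crawler."""
    ip = ipaddress.ip_address(source_ip)
    declares_bot = any(name in user_agent.lower() for name in DECLARED_AGENTS)
    in_declared_range = any(ip in network for network in DECLARED_RANGES)

    if declares_bot and in_declared_range:
        return "declared crawler"
    if declares_bot:
        return "declared user agent from an unlisted IP"
    if in_declared_range:
        return "listed IP with an undeclared user agent"
    # A crawler that changes both its user agent and its network lands here.
    return "looks like an ordinary visitor"


if __name__ == "__main__":
    print(classify_request("PerplexityBot/1.0", "192.0.2.10"))
    print(classify_request("Mozilla/5.0 (Macintosh) Chrome/124.0", "203.0.113.7"))
```

The last branch is the point of the sketch: a request that uses a generic browser user agent and arrives from an unlisted network cannot be told apart from a normal visitor by these two checks alone.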

 

This situation is a big challenge for websites. A robots.txt file tells web crawlers which parts of a site they may and may not fetch, but following it is voluntary. If Perplexity's web crawlers ignore the file, they can fetch the information they want without permission. According to the accusations, the crawlers also operate from IP addresses outside the ranges usually associated with the company, allegedly to get around blocks placed on Perplexity’s known IP addresses.
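As a rough illustration of how robots.txt works, the Python sketch below uses the standard library's `urllib.robotparser` to check a made-up robots.txt that blocks both Perplexity user agents. The file contents, the `example.com` URLs, and the `is_allowed` helper are assumptions for this example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; the rules are made up.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules above permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The declared crawler is told to stay out of the whole site.
    print(is_allowed("PerplexityBot", "https://example.com/article"))
    # The same request under a generic browser user agent only has to
    # respect the catch-all rules, so the article is permitted.
    print(is_allowed("Mozilla/5.0", "https://example.com/article"))
```

The check runs on the crawler's side, so nothing in it stops a bot that skips the check or presents a different user agent, which is the behavior Perplexity is accused of.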

 

Financially, websites will bear the burden of this practice. Before AI web crawling, search engines indexed websites and sent users to the pages that contained the information they needed. Those visits let websites earn money through ads and other online services. With AI web crawling, users can get the information they need without ever visiting the websites that hold the data, and the crawlers take what they need while giving nothing back to the source.

 

Billions in Valuation in AI

 

The accusations that Perplexity uses IP addresses outside its declared range should alarm website owners. AI is slowly taking over from search engines as the source of information. Unfortunately, crawling the web for content means AI companies capture the value while website owners get nothing in return. The accusations of unauthorized web scraping also come after Perplexity recently secured $100 million in funding, according to Tech Funding News. As the company grows bigger, websites have to be on guard against crawlers not just from Perplexity but also from other AI companies.

 
 
 
