Google has quietly updated its list of user-triggered fetchers with new documentation for Google NotebookLM. The significance of this seemingly minor change is that it confirms Google NotebookLM does not obey robots.txt.
Google NotebookLM
NotebookLM is an AI research and writing tool that lets users add a web page URL; the tool then processes the content, enabling them to ask a range of questions and generate summaries based on it.
Google’s tool can automatically create an interactive mind map that organizes topics from a website and extracts takeaways from it.
User-Triggered Fetchers Ignore Robots.txt
Google's user-triggered fetchers are web agents that fetch content at a user's request and, by default, ignore the robots.txt protocol.
According to Google’s User-Triggered Fetchers documentation:
“Because the fetch was requested by a user, these fetchers generally ignore robots.txt rules.”
Google-NotebookLM Ignores Robots.txt
The purpose of robots.txt is to give publishers control over bots that index web pages. But agents like the Google-NotebookLM fetcher aren’t indexing web content, they’re acting on behalf of users who are interacting with the website content through Google’s NotebookLM.
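Because these fetchers disregard the protocol, even a robots.txt rule that explicitly names the agent would have no effect. A rule like the following, shown only to illustrate the point, would not stop NotebookLM from fetching pages:

```
# This robots.txt rule is ignored by user-triggered fetchers
User-agent: Google-NotebookLM
Disallow: /
```

Blocking therefore has to happen at the server level, as described below.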
How To Block NotebookLM
Google uses the Google-NotebookLM user agent when extracting website content. So publishers who wish to block NotebookLM from accessing their content can create rules that automatically block that user agent. For example, a simple solution for WordPress publishers is to use Wordfence to create a custom rule that blocks all website visitors using the Google-NotebookLM user agent.
Another way to do it is with .htaccess using the following rule:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Google-NotebookLM [NC]
RewriteRule .* - [F,L]
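For sites served by Nginx rather than Apache, an equivalent rule can be sketched as follows (an assumption for illustration; the snippet would go inside the relevant server block):

```nginx
# Return 403 Forbidden to any request whose user agent contains Google-NotebookLM
if ($http_user_agent ~* "Google-NotebookLM") {
    return 403;
}
```

As with the .htaccess rule, the [F]/403 response denies the request outright rather than redirecting it.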