SEO · March 30, 2026

Robots.txt Tester Guide: Syntax, Rules & Common Mistakes

By The hakaru Team · Last updated March 2026

Quick Answer

  • Robots.txt tells crawlers which URLs they can and cannot access on your site.
  • It must live at the root domain: yoursite.com/robots.txt — subdirectories are ignored.
  • Disallow blocks crawling, not indexing. Google can still index a blocked URL if other pages link to it.
  • According to Ahrefs (2024), 26.4% of websites have at least one robots.txt syntax error.

What Is Robots.txt?

Robots.txt is a plain text file that sits at the root of your website and tells search engine crawlers which pages they can and cannot access. It follows the Robots Exclusion Protocol, originally proposed in 1994 by Martijn Koster and formalized as RFC 9309 in September 2022.

Every major search engine — Google, Bing, Yandex, Baidu — respects robots.txt directives. So do most legitimate bots. Malicious scrapers, however, ignore it entirely. Robots.txt is a request, not a security mechanism.

Basic Syntax and Structure

A robots.txt file is made up of one or more rule groups. Each group starts with a User-agent line and is followed by one or more Disallow or Allow directives.

| Directive | Purpose | Example |
| --- | --- | --- |
| User-agent | Specifies which crawler the rules apply to | User-agent: Googlebot |
| Disallow | Blocks the crawler from a path | Disallow: /admin/ |
| Allow | Overrides a Disallow for a specific path | Allow: /admin/public/ |
| Sitemap | Points to your XML sitemap | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Seconds between requests (not supported by Google) | Crawl-delay: 10 |

The wildcard User-agent: * applies to any crawler not matched by a more specific group. Googlebot obeys only the single most specific matching user-agent group, so the order of groups in the file does not change which rules apply.
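Putting the directives together, a minimal robots.txt might look like this sketch (the paths and sitemap URL are illustrative placeholders):

```
# One group per crawler; a blank line separates groups.
User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

# Fallback group for every other crawler.
User-agent: *
Disallow: /search

# Sitemap is standalone and not tied to any group.
Sitemap: https://example.com/sitemap.xml
```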

How Googlebot Interprets Robots.txt

Googlebot follows a specific order of priority when evaluating robots.txt rules:

  • Most specific match wins. Allow: /admin/public/ beats Disallow: /admin/ for the path /admin/public/page.html.
  • Longer paths take priority. A rule matching 15 characters beats one matching 7.
  • If rules tie in specificity, Allow wins. This was clarified in Google's 2019 open-source robots.txt parser release.

According to Google Search Central documentation (updated January 2025), Googlebot caches robots.txt for up to 24 hours. Changes may not take effect immediately.
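You can sanity-check rules locally with Python's standard-library parser. One caveat: urllib.robotparser applies rules in file order (first match wins) rather than Google's longest-match rule, so this sketch lists Allow before Disallow, an ordering under which both interpretations agree. The bot name and paths are hypothetical.

```python
import urllib.robotparser

# Hypothetical robots.txt. Allow comes first because Python's parser
# uses first-match semantics, unlike Googlebot's longest-match rule;
# with this ordering, both parsers reach the same verdict.
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("mybot", "/admin/public/page.html"))  # True: Allow matches
print(rp.can_fetch("mybot", "/admin/secret.html"))       # False: Disallow matches
```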

Crawling vs. Indexing: A Critical Distinction

This is where most people get confused. Robots.txt controls crawling (whether a bot visits the page), not indexing (whether the page appears in search results).

If you Disallow a page in robots.txt but other websites link to it, Google may still index the URL. It will appear in search results with the message “No information is available for this page.” A 2023 Google SearchLiaison tweet confirmed that Disallow is not and has never been an indexing directive.

To prevent indexing, use:

  • Meta noindex tag: <meta name="robots" content="noindex">
  • X-Robots-Tag HTTP header: X-Robots-Tag: noindex

But here is the catch: if you block crawling via robots.txt, Googlebot cannot see the noindex tag. You must allow crawling of pages you want to noindex. This circular dependency trips up even experienced SEOs.
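For non-HTML files such as PDFs, which cannot carry a meta tag, the HTTP header is the only route. A minimal sketch, assuming an nginx server (adjust the location pattern to your setup):

```nginx
# Send noindex for every PDF while still allowing crawling.
# Do NOT also Disallow these paths in robots.txt, or the
# header will never be seen by Googlebot.
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex" always;
}
```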

Wildcard Patterns

Robots.txt supports two wildcard characters:

| Pattern | Meaning | Example Match |
| --- | --- | --- |
| * | Matches any sequence of characters | Disallow: /search* blocks /search, /search?q=test |
| $ | Matches end of URL | Disallow: /*.pdf$ blocks /docs/file.pdf but not /docs/file.pdf?v=2 |

These are supported by Googlebot and Bingbot. Not all crawlers support wildcards — the original 1994 protocol did not include them. RFC 9309 formalized wildcard support in 2022.
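The matching semantics can be sketched in a few lines of Python: a robots.txt pattern is a prefix match in which * matches any run of characters and a trailing $ anchors the end of the URL. This helper is an illustration of those rules, not Google's actual parser.

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a robots.txt path pattern into an equivalent regex.

    '*' matches any sequence of characters; a trailing '$' anchors
    the match at the end of the URL. Everything else is a literal,
    and a pattern with no anchor is a plain prefix match.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/docs/file.pdf")))      # True
print(bool(pdf_rule.match("/docs/file.pdf?v=2")))  # False: '$' anchors the end

search_rule = robots_pattern_to_regex("/search")
print(bool(search_rule.match("/search?q=test")))   # True: plain prefix match
```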

Common Robots.txt Mistakes

Blocking CSS and JavaScript

Disallowing /css/ or /js/ directories prevents Googlebot from rendering your pages. Google has stated since 2014 that blocking render-critical resources can negatively impact rankings. A 2024 Screaming Frog audit of 10,000 sites found 12.3% still block at least one critical resource in robots.txt.

Using Robots.txt for Security

Robots.txt is publicly accessible. Putting sensitive paths like /admin/ or /internal-api/ in your Disallow list essentially advertises them. Use proper authentication and access controls instead.

Forgetting Subdomain Coverage

A robots.txt at example.com does not apply to blog.example.com or shop.example.com. Each subdomain needs its own file. This is frequently overlooked when migrating staging subdomains to production.

Empty Disallow Line

An empty Disallow directive (Disallow: with nothing after the colon) means “allow everything.” This is the opposite of what many people expect. To block all crawling, use Disallow: / (with a forward slash).

Blocking AI Crawlers Without a Strategy

As of 2025, Originality.ai reports that 35% of the top 1,000 websites block at least one AI training crawler. Common user-agents to block include GPTBot, Google-Extended, CCBot, and anthropic-ai. But blocking these crawlers also prevents your content from being cited in AI-generated answers, which is an increasingly significant traffic source.
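If you do decide to opt out, a robots.txt that blocks the AI training crawlers named above looks like this (user-agent lines can be stacked to share one rule):

```
# Block common AI training crawlers; all other bots are unaffected.
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
Disallow: /
```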

Testing Your Robots.txt

Always test your robots.txt before deploying changes. A single typo can block your entire site from being crawled. Google Search Console's robots.txt report flags fetch and parse problems, but it only reflects how Googlebot reads the file.

When testing, verify these scenarios:

  • Your important pages (homepage, category pages, product pages) are not blocked
  • Admin panels, staging environments, and internal search results are blocked
  • CSS, JavaScript, and image files needed for rendering are accessible
  • Your sitemap URL is declared and accessible at the specified path
  • Wildcard rules match exactly what you intend — no more, no less
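A quick way to script checks like these is Python's urllib.robotparser. Keep in mind it applies rules in file order and does not understand Google-style wildcards, so its verdicts can differ from Googlebot's; the bot name, rules, and paths below are placeholders.

```python
import urllib.robotparser

def check_paths(robots_txt: str, user_agent: str, paths: list[str]) -> dict[str, bool]:
    """Return {path: allowed} for each path under the given user agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {path: rp.can_fetch(user_agent, path) for path in paths}

# Hypothetical rules; Allow listed first so first-match and
# longest-match interpretations agree.
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Disallow: /internal-search
"""

checks = check_paths(rules, "mybot", ["/", "/admin/login", "/admin/public/help"])
for path, allowed in checks.items():
    print(f"{path}: {'allowed' if allowed else 'BLOCKED'}")
```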

Test your robots.txt for errors

Use our free Robots.txt Tester →

Frequently Asked Questions

What is robots.txt?

Robots.txt is a plain text file placed at the root of a website (e.g., example.com/robots.txt) that tells search engine crawlers which pages or sections they are allowed or not allowed to access. It follows the Robots Exclusion Protocol, first proposed in 1994 and formalized as an Internet standard (RFC 9309) in 2022.

Does robots.txt block pages from appearing in Google?

Not necessarily. Robots.txt prevents crawling, but Google can still index a URL if other pages link to it. The page may appear in search results with the note “No information is available for this page.” To truly prevent indexing, use a noindex meta tag or X-Robots-Tag HTTP header instead. Google's documentation explicitly states that Disallow is not an indexing directive.

Where should robots.txt be placed?

Robots.txt must be placed at the root of the host. For example, https://example.com/robots.txt. It will not work in subdirectories like /blog/robots.txt. For subdomains, each subdomain needs its own robots.txt file (e.g., blog.example.com/robots.txt is separate from example.com/robots.txt).

What happens if my robots.txt has errors?

Syntax errors in robots.txt can cause crawlers to misinterpret your rules. Common issues include typos in directive names (e.g., “Dissallow” instead of “Disallow”), missing colons after directives, or incorrect wildcard usage. Googlebot treats unrecognized directives as empty and may crawl pages you intended to block. According to a 2024 Ahrefs study, 26.4% of websites have at least one robots.txt error.

Should I block AI crawlers in robots.txt?

Many site owners now block AI training crawlers like GPTBot (OpenAI), Google-Extended (Google AI training), CCBot (Common Crawl), and anthropic-ai (Anthropic). As of 2025, approximately 35% of the top 1,000 websites block at least one AI crawler via robots.txt, according to Originality.ai. Add User-agent: GPTBot followed by Disallow: / to block OpenAI's crawler, for example.