20 Prompts for Generating Robots.txt Rules
- **Introduction**
- What Is robots.txt, and Why Does It Matter?
- Who Needs Custom robots.txt Rules?
- What You’ll Find in This Article
- Understanding Robots.txt Syntax and Basics
- The Core Rules: User-agent, Disallow, and Allow
- Wildcards, Dollar Signs, and Case Sensitivity
- How Search Engines Interpret Robots.txt
- Best Practices for Structuring Your Robots.txt File
- Tools to Validate Your Robots.txt File
- Final Thoughts
- 10 Essential Robots.txt Prompts for Common Use Cases
- 1. Blocking Bad Bots from Your Entire Site
- 2. Allowing Most Bots but Blocking a Few
- 3. Blocking Sensitive Directories
- 4. Allowing One File in a Blocked Directory
- 5. Blocking Dynamic URLs with Query Parameters
- 6. Blocking Images, CSS, or JavaScript Files
- 7. Setting Crawl Delays for Aggressive Bots
- 8. Blocking All Bots Except Search Engines
- 9. Blocking Bots by IP or User-Agent (Advanced)
- 10. Allowing Only Specific Bots on a Subdomain
- Final Tips for Writing Robots.txt Rules
- 10 Advanced Robots.txt Prompts for Niche Scenarios
- 1. Blocking Bots Based on Language or Region
- 2. Conditional Rules for Mobile vs. Desktop Bots
- 3. Blocking Bots from Crawling Paginated Content
- 4. Allowing Bots to Crawl Only Specific File Types
- 5. Blocking Bots from Crawling URLs with Specific Patterns
- 6. Combining Robots.txt with Noindex Directives
- 7. Blocking Bots from Crawling Internal Search Results
- 8. Allowing Bots to Crawl Only Sitemap-Indexed Pages
- 9. Blocking Bots from Crawling API Endpoints
- 10. Creating Rules for Multi-Site or Multi-Language Setups
- Final Thoughts: Test, Monitor, and Adjust
- Case Studies: Real-World Examples of Robots.txt Rules
- E-Commerce Site: Stopping Duplicate Product Pages
- News Site: Protecting Paywalled Content
- SaaS Company: Hiding Staging Sites
- Government Site: Restricting Sensitive Data
- Blog: Cleaning Up Low-Value Archives
- Key Takeaways
- Common Mistakes and How to Avoid Them
- Accidentally Blocking All Bots (And How to Recover)
- Messy Syntax and Wildcards That Don’t Work
- Conflicting Rules That Confuse Search Engines
- Case Sensitivity and URL Sloppiness
- Overusing Crawl Delays (And Hurting Your SEO)
- Forgetting to Update After Site Changes
- Tools and Resources for Managing Robots.txt
- Generators and Validators: Make Rule Creation Easy
- Automated Management: Keep Rules Updated Without Manual Work
- Monitoring Bot Activity: See What’s Really Happening on Your Site
- Learning Resources: Where to Go for Help
- Final Tip: Start Small, Then Improve
- Conclusion
- How to Pick the Right Rules for Your Site
- Don’t Set It and Forget It
- Balance Control with Visibility
- Your Next Steps
**Introduction**
Ever wondered how search engines decide which pages to crawl on your website? Or why some of your content never shows up in search results—even when you want it to? The answer often lies in a tiny but powerful file called robots.txt. This simple text file acts like a traffic cop for bots, telling them where they’re allowed to go on your site and where they’re not. But here’s the catch: if you get it wrong, you could accidentally block search engines from indexing your most important pages—or worse, let malicious bots crawl sensitive areas of your site.
What Is robots.txt, and Why Does It Matter?
A robots.txt file is a plain text file placed in the root directory of your website (e.g., yourwebsite.com/robots.txt). Its job is to communicate with web crawlers—like Googlebot or Bingbot—by defining rules about which parts of your site they can or cannot access. Think of it as a “Do Not Enter” sign for bots. For example, you might use it to:
- Block search engines from crawling duplicate content (like printer-friendly versions of pages).
- Prevent bots from indexing staging or test environments.
- Stop spammy or resource-heavy bots from slowing down your site.
But here’s a common misconception: robots.txt doesn’t hide pages from search results. If a page is linked elsewhere on the web, search engines might still index it—even if it’s blocked in robots.txt. To truly keep a page out of search results, you’d need to use noindex tags or password protection.
Who Needs Custom robots.txt Rules?
If you’re a webmaster, SEO professional, or developer, you’ve probably encountered situations where the default robots.txt file just isn’t enough. Maybe you’re:
- Running a staging site and don’t want it appearing in search results.
- Managing an e-commerce store with sensitive admin pages.
- Dealing with scrapers or spam bots that drain your server resources.
- Testing new features and want to keep them hidden until launch.
The problem? Writing robots.txt rules can feel like coding in the dark. One wrong character, and you might block Google from crawling your entire site. That’s where this guide comes in.
What You’ll Find in This Article
We’ve put together 20 ready-to-use prompts to help you generate precise robots.txt rules for any scenario. Whether you need to block a specific bot, allow access to a subdirectory, or fine-tune crawl delays, these prompts will save you time and headaches. For example:
- “Block all bots except Googlebot from crawling /private/”
- “Allow Bingbot but disallow all other bots from /blog/”
- “Set a crawl delay of 5 seconds for all bots”
No more guessing—just copy, customize, and deploy. Ready to take control of your site’s crawling? Let’s dive in.
Understanding Robots.txt Syntax and Basics
Think of your website like a big house. Some rooms are for everyone—like your homepage or blog. But other rooms? Maybe you don’t want strangers poking around, like your admin pages or private files. That’s where robots.txt comes in. It’s like a sign on your front door that tells search engines and bots: “You can come in here, but not there.”
But here’s the thing—it’s not a security tool. It’s more like a polite request. Good bots (like Googlebot) will follow the rules, but bad bots might ignore them. Still, it’s one of the first things search engines check when they visit your site. So if you want to control what gets crawled (and what doesn’t), you need to understand how it works.
The Core Rules: User-agent, Disallow, and Allow
Every robots.txt file has a few basic building blocks. Let’s break them down:
- User-agent: This tells the bot which rules apply to it. For example, `User-agent: Googlebot` means the rules below only apply to Google’s crawler. If you use `User-agent: *`, the rules apply to all bots.
- Disallow: This tells bots which parts of your site they shouldn’t crawl. For example, `Disallow: /private/` blocks access to any URL starting with `/private/`.
- Allow: This is the opposite of `Disallow`—it tells bots they can crawl a specific page or folder, even if a broader `Disallow` rule exists. For example:
User-agent: *
Disallow: /admin/
Allow: /admin/public/
This lets bots access `/admin/public/` but blocks everything else in `/admin/`.
There are two more directives you might see:
- Sitemap: This points bots to your XML sitemap, like `Sitemap: https://example.com/sitemap.xml`. It’s optional but helpful for SEO.
- Crawl-delay: This tells bots to wait a certain number of seconds between requests (e.g., `Crawl-delay: 5`). It’s useful if your server gets overwhelmed by too many requests at once.
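Putting those directives together, a small but complete file (with hypothetical paths and a placeholder sitemap URL) could look like this:

```text
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
```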
Wildcards, Dollar Signs, and Case Sensitivity
Not all rules are straightforward. Sometimes you need to use special characters to make them work:
- Wildcards (`*`) – These match any sequence of characters. For example, `Disallow: /*.pdf` blocks all PDF files, and `Disallow: /blog/*?` blocks all blog URLs with query parameters (like `?utm_source=twitter`).
- Dollar signs (`$`) – These match the end of a URL. For example, `Disallow: /*.jpg$` blocks only URLs ending in `.jpg` (not `/image.jpg?width=500`).
- Case sensitivity – Most bots treat URL paths as case-sensitive, so `Disallow: /Private/` won’t block `/private/`. Always double-check your capitalization!
Pro Tip: If you’re blocking a folder, include the trailing slash (e.g., `Disallow: /private/`). Without it, `Disallow: /private` is a prefix match, so it also blocks any unrelated URL that happens to start with `/private`, such as `/private-events/`.
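Wildcard matching is easy to get wrong, so it can help to see the logic spelled out. Below is a rough Python sketch of Googlebot-style pattern matching; `rule_matches` is a hypothetical helper for illustration, and real crawlers handle more edge cases (percent-encoding, longest-match precedence, and so on):

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern against a URL path.

    Sketch of Googlebot-style matching: patterns anchor at the start
    of the path, '*' matches any character sequence, and a trailing
    '$' anchors the end of the URL.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        # Turn the escaped trailing '$' back into an end anchor.
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

print(rule_matches("/*.jpg$", "/photos/cat.jpg"))        # True
print(rule_matches("/*.jpg$", "/photos/cat.jpg?w=500"))  # False: '$' anchors the end
print(rule_matches("/Private/", "/private/file.html"))   # False: case-sensitive
```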
How Search Engines Interpret Robots.txt
Not all bots play by the same rules. Googlebot, Bingbot, and other major crawlers follow robots.txt closely, but they might interpret things slightly differently:
- Googlebot is very strict about syntax. If you make a mistake (like missing a colon), it might ignore the rule entirely.
- Bingbot is more forgiving but still expects proper formatting.
- Bad bots (like scrapers or spam crawlers) often ignore `robots.txt` completely. If you’re dealing with these, you’ll need additional security measures, like IP blocking or `.htaccess` rules.
Another thing to remember: robots.txt doesn’t hide pages from search results. If a page is linked somewhere else on the web, search engines might still index it (though they won’t show a description). To fully block a page from search results, you’ll need to use:
- Meta robots tags (e.g., `<meta name="robots" content="noindex">`).
- X-Robots-Tag (for non-HTML files like PDFs or images).
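For non-HTML files, you can send that signal as an HTTP header at the server level. A sketch for Apache with mod_headers enabled (an assumption about your stack; adjust the file pattern to the types you actually want kept out of search results):

```apache
# .htaccess / Apache config: mark all PDFs noindex via HTTP header.
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```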
Best Practices for Structuring Your Robots.txt File
A messy robots.txt file is like a confusing road sign—bots won’t know where to go. Here’s how to keep yours clean and effective:
- Group rules by user-agent – Start with `User-agent: *` for global rules, then add specific rules for individual bots (like Googlebot or Bingbot).
- Know how conflicts are resolved – Google and Bing apply the most specific (longest) matching rule rather than simply the first one listed; some older crawlers do read top to bottom, so keep conflicting rules to a minimum either way.
- Test before deploying – Use tools like Google’s robots.txt Tester or Screaming Frog to check for errors.
- Avoid blocking CSS/JS files – If you block these, search engines won’t see your site the way users do, which can hurt rankings.
- Don’t block important pages – Accidentally blocking your homepage or key product pages can tank your SEO.
Common Mistakes to Avoid:
- Using `Disallow: /` (this blocks your entire site!).
- Forgetting to update `robots.txt` after site changes.
- Overusing `Crawl-delay` (this can slow down indexing).
- Not including a `Sitemap` directive (it helps bots discover new pages faster).
Tools to Validate Your Robots.txt File
Before you push your robots.txt live, test it! Here are some free tools to help:
- Google’s robots.txt Tester – Checks if Googlebot can access your pages.
- Screaming Frog SEO Spider – Crawls your site and flags `robots.txt` issues.
- TechnicalSEO.com’s robots.txt Checker – Quickly validates syntax.
- Bing Webmaster Tools – Tests how Bingbot interprets your rules.
Example Workflow:
- Write your `robots.txt` rules in a text editor.
- Upload it to your staging site (not live yet!).
- Run it through Google’s tester to catch errors.
- Fix any issues, then deploy to your live site.
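You can also sanity-check rules offline before uploading anything. Python’s standard library ships a robots.txt parser; here is a quick sketch (the rules and URLs are made up for illustration):

```python
from urllib import robotparser

# Hypothetical draft rules, pasted as a string instead of fetched live.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Check a few URLs the way a well-behaved bot would.
print(rp.can_fetch("TestBot", "https://example.com/blog/post"))      # True
print(rp.can_fetch("TestBot", "https://example.com/admin/settings")) # False
```

One caveat: the stdlib parser applies simple prefix rules and does not understand `*` or `$` wildcards, so for wildcard rules stick with Google’s tester.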
Final Thoughts
Your robots.txt file is like a traffic cop for your website—it directs bots where to go and where to stay out. But it’s not a magic shield. For full control, combine it with meta robots tags, proper server settings, and regular audits.
Now that you understand the basics, you’re ready to start writing your own rules. Need inspiration? The next section has 20 ready-to-use prompts to block (or allow) specific bots and pages. Let’s get started!
10 Essential Robots.txt Prompts for Common Use Cases
Your website is like a house. You don’t want strangers wandering into your bedroom or office, right? That’s where robots.txt comes in. It tells search engines and other bots which parts of your site they can visit—and which parts are off-limits. But writing these rules can feel confusing, especially if you’re not a developer. Don’t worry. These 10 prompts will help you handle the most common situations, from blocking spam bots to hiding sensitive pages.
1. Blocking Bad Bots from Your Entire Site
Some bots are like uninvited guests—they slow down your site, scrape your content, or even try to hack it. If you’ve noticed strange traffic in your analytics (like visits from “SemrushBot” or “AhrefsBot” when you’re not using those tools), it’s time to block them.
Here’s how:
User-agent: BadBot
Disallow: /
Replace BadBot with the name of the bot you want to block. For example, if you’re dealing with a spam crawler called “ScraperBot,” your rule would look like this:
User-agent: ScraperBot
Disallow: /
When to use this: Only block bots that are causing problems. Some bots (like Googlebot) are essential for SEO, so don’t block them unless you have a good reason.
2. Allowing Most Bots but Blocking a Few
What if you want to let search engines crawl your site but keep out a few troublemakers? This is common for e-commerce sites that want to block competitor bots from scraping their product pages.
Here’s the rule:
User-agent: *
Disallow: /private/
User-agent: BadBot
Disallow: /
- `User-agent: *` means “all bots.”
- `Disallow: /private/` blocks all bots from a specific folder.
- `User-agent: BadBot` blocks just one bot from the entire site.
Pro tip: If you’re not sure which bots to block, check your server logs. Look for bots that visit too often or ignore your robots.txt rules.
3. Blocking Sensitive Directories
Some parts of your site should never appear in search results. Think admin pages, login screens, or staging environments. If these pages get indexed, they can create security risks or confuse visitors.
Here’s how to block them:
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /staging/
Common directories to block:
- `/wp-admin/` (WordPress admin)
- `/temp/` (temporary files)
- `/backup/` (site backups)
- `/cgi-bin/` (server scripts)
Warning: Blocking these pages in robots.txt doesn’t make them invisible to hackers. Always use passwords or IP restrictions for extra security.
4. Allowing One File in a Blocked Directory
Sometimes, you need to block a whole folder but allow access to one important file. For example, you might have a private folder with a public PDF.
Here’s how to do it:
User-agent: *
Disallow: /private/
Allow: /private/public-file.pdf
- The `Disallow` rule blocks the entire `/private/` folder.
- The `Allow` rule lets bots access just the PDF.
Use case: This is useful for whitepapers, legal documents, or press releases that need to be public but are stored in a private folder.
5. Blocking Dynamic URLs with Query Parameters
Dynamic URLs (like ?sessionid=123) can create duplicate content issues. Search engines might index the same page multiple times with different URLs, which hurts your SEO.
Here’s how to block them:
User-agent: *
Disallow: /*?sessionid=
Disallow: /*?utm_
- `/*?sessionid=` blocks any URL with `sessionid` in it.
- `/*?utm_` blocks tracking parameters (like `utm_source`).
Why this matters: Google doesn’t like duplicate content. Blocking these URLs helps your site rank better.
6. Blocking Images, CSS, or JavaScript Files
You might not want search engines to index your images or CSS files. Maybe you’re running a membership site and don’t want people finding your images via Google Images. Or perhaps you want to save bandwidth.
Here’s how to block them:
User-agent: *
Disallow: /*.jpg$
Disallow: /*.png$
Disallow: /*.css$
Disallow: /*.js$
- The `$` at the end means “only block files that end with this.”
- This won’t block HTML pages, just the file types you specify.
When to use this: If you’re running a site with lots of images (like a stock photo site), blocking them can save server resources.
7. Setting Crawl Delays for Aggressive Bots
Some bots crawl your site too fast, which can slow down your server. This is common with tools like Ahrefs or Majestic. You can tell them to slow down with a Crawl-delay rule.
Here’s how:
User-agent: Bingbot
Crawl-delay: 10
- `Crawl-delay: 10` means “wait 10 seconds between requests.”
Note: Googlebot ignores Crawl-delay. If you need to slow down Google, use Google Search Console instead.
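If you are writing your own polite crawler, Python’s standard-library robots.txt parser can read `Crawl-delay` too. A minimal sketch (the rules below are hypothetical):

```python
from urllib import robotparser

# Hypothetical rules with a crawl delay for Bingbot only.
rules = """\
User-agent: Bingbot
Crawl-delay: 10
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.crawl_delay("Bingbot"))       # 10
print(parser.crawl_delay("SomeOtherBot"))  # None (no rule applies)
```

A considerate crawler would `time.sleep()` for that many seconds between requests when the value isn’t `None`.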
8. Blocking All Bots Except Search Engines
What if you want to keep your site private but still show up in search results? This is common for internal tools or beta sites.
Here’s how:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
- `User-agent: *` blocks all bots.
- `User-agent: Googlebot` allows just Google.
Use case: This is useful for staging sites that you want to test in Google’s index before launching.
9. Blocking Bots by IP or User-Agent (Advanced)
Robots.txt alone isn’t enough to stop determined bots. For extra security, you can block them at the server level (using .htaccess for Apache or nginx.conf for Nginx).
Here’s an example for .htaccess:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC]
RewriteRule .* - [F,L]
- This blocks any bot with “BadBot” in its user-agent string.
When to use this: If you’re dealing with scrapers or hackers, combine robots.txt with server-level blocking.
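If you are on Nginx instead of Apache, a roughly equivalent block goes in your server configuration. This is a sketch; replace "badbot" with the actual user-agent string you see in your logs:

```nginx
# Inside the server { } block: refuse requests whose User-Agent
# contains "badbot", matched case-insensitively.
if ($http_user_agent ~* "badbot") {
    return 403;
}
```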
10. Allowing Only Specific Bots on a Subdomain
Staging or development subdomains (like dev.yoursite.com) should never be indexed. But what if you want to test how Googlebot sees your site?
Here’s how:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
- `User-agent: Googlebot` allows Google to crawl everything.
- `User-agent: *` blocks all other bots.
Use case: This is perfect for testing new features before they go live.
Final Tips for Writing Robots.txt Rules
- Test your rules using Google’s robots.txt Tester.
- Don’t block CSS or JavaScript unless you have a good reason. Google needs these files to render your pages correctly.
- Update your rules when you add new pages or folders.
- Remember: `robots.txt` is a suggestion, not a security tool. Don’t rely on it to hide sensitive data.
Now you have 10 ready-to-use prompts for your robots.txt file. Pick the ones that fit your needs, customize them, and take control of your site’s crawling. Your server (and your SEO) will thank you.
10 Advanced Robots.txt Prompts for Niche Scenarios
You’ve mastered the basics of robots.txt—blocking admin pages, disallowing duplicate content, and keeping staging sites hidden. But what about those trickier situations? The ones where you need to control crawling for specific bots, file types, or even entire language versions of your site? That’s where advanced rules come in.
Think of your robots.txt file like a bouncer at an exclusive club. The basic rules are like the guest list—letting in the right people and keeping out the obvious troublemakers. But advanced rules? They’re the VIP section. They decide who gets access to what, when, and why. And just like a good bouncer, they need to be precise. One wrong move, and you might accidentally block Googlebot from your most important pages—or let a scraper waltz right in.
Let’s break down 10 niche scenarios where standard robots.txt rules won’t cut it. These are the situations that keep SEOs and developers up at night: multilingual sites, mobile vs. desktop bots, paginated content, and more. Each example includes a ready-to-use prompt, a real-world use case, and tips to avoid common pitfalls.
1. Blocking Bots Based on Language or Region
Multilingual sites are a double-edged sword. On one hand, they open your content to a global audience. On the other, they can create a mess of duplicate or thin content if search engines crawl every language version indiscriminately. The solution? Use robots.txt to guide bots to the right pages—and away from the wrong ones.
Example:
User-agent: *
Disallow: /es/
User-agent: Googlebot-es
Allow: /es/
How it works:
- Blocks all bots from crawling Spanish pages (`/es/`).
- Makes an exception for `Googlebot-es`, Google’s Spanish-language crawler.
When to use this:
- You have a site with multiple language versions (e.g., `/en/`, `/es/`, `/fr/`).
- You want to prioritize crawling for specific regions (e.g., only allow `Googlebot-es` to crawl `/es/`).
- You’re using hreflang tags but want an extra layer of control.
Pro tip: Don’t rely only on robots.txt for language targeting. Combine it with hreflang tags and geo-targeting in Google Search Console for the best results. And always test your rules with Google’s robots.txt Tester to avoid accidental blocks.
2. Conditional Rules for Mobile vs. Desktop Bots
Mobile-first indexing is the default for Google, but that doesn’t mean you should let mobile and desktop bots crawl the same content. If your site has separate mobile and desktop experiences (e.g., /mobile/ vs. /desktop/), you need to control which bot sees what.
Example:
User-agent: Googlebot
Disallow: /mobile/
User-agent: Googlebot-Mobile
Allow: /mobile/
How it works:
- Blocks the standard `Googlebot` from crawling mobile-specific pages.
- Allows `Googlebot-Mobile` to access them.
When to use this:
- Your site has a dedicated mobile subdirectory (e.g., `example.com/mobile/`).
- You’re testing a new mobile design and don’t want it indexed yet.
- You want to prevent duplicate content issues between mobile and desktop versions.
Watch out for:
- Google has deprecated `Googlebot-Mobile` in favor of a single, mobile-first crawler. If you’re using this rule, double-check that it’s still working as intended. For most sites, it’s better to use responsive design and avoid separate mobile URLs altogether.
3. Blocking Bots from Crawling Paginated Content
Pagination is a necessary evil for sites with long lists of content—think e-commerce category pages, blog archives, or forum threads. But if you’re not careful, search engines can waste crawl budget on endless /page/2/, /page/3/, and so on. Worse, they might index thin or duplicate content, hurting your SEO.
Example:
User-agent: *
Disallow: /*/page/
How it works:
- Blocks all bots from crawling any URL whose path contains `/page/` (e.g., `/blog/page/2/`). Remember that robots.txt patterns match from the start of the path, so a leading wildcard is needed to catch pagination inside subdirectories.
When to use this:
- Your site has paginated content (e.g., `example.com/blog/page/2/`).
- You’re using `rel="next"` and `rel="prev"` tags but want to reinforce the signal.
- You’ve noticed Googlebot wasting crawl budget on paginated pages.
Better alternatives:
- Use `rel="canonical"` tags to point paginated pages back to the main category page.
- Implement “Load More” or infinite scroll with JavaScript (but make sure Google can still crawl the content).
- If you must block pagination, consider allowing the first few pages (e.g., `Allow: /page/1/`) to give search engines a taste of the content.
4. Allowing Bots to Crawl Only Specific File Types
Not all files on your site are created equal. Some—like HTML pages—are meant to be indexed. Others, like PDFs, videos, or CSS/JS files, might not be. If you want to control which file types search engines can crawl, robots.txt can help.
Example:
User-agent: *
Disallow: /
Allow: /*.html$
Allow: /*.pdf$
How it works:
- Blocks all bots from crawling everything (`Disallow: /`).
- Makes exceptions for HTML and PDF files (`Allow: /*.html$`, `Allow: /*.pdf$`).
When to use this:
- Your site has a mix of content types (e.g., blog posts in HTML, whitepapers in PDF).
- You want to prevent search engines from indexing non-HTML files (e.g., images, videos).
- You’re dealing with a large site where crawl budget is a concern.
Common mistakes:
- Forgetting the `$` at the end of the file extension. Without it, `Allow: /*.html` would also allow `/*.html/extra-stuff/`, which might not be what you want.
- Blocking CSS or JS files. This can prevent Google from rendering your pages properly, hurting your rankings.
5. Blocking Bots from Crawling URLs with Specific Patterns
Sometimes, you need to block bots from entire categories of URLs—like preview pages, drafts, or temporary content. Instead of listing each URL individually, you can use wildcards to block patterns.
Example:
User-agent: *
Disallow: /*-preview/
Disallow: /draft-*/
How it works:
- Blocks any URL containing `-preview/` (e.g., `example.com/new-feature-preview/`).
- Blocks any URL starting with `/draft-` (e.g., `example.com/draft-blog-post/`).
When to use this:
- You’re testing new features and don’t want them indexed yet.
- Your CMS generates preview URLs (e.g., `?preview=true`).
- You have draft content that shouldn’t be public.
Pro tip: If you’re using WordPress, you can block preview URLs with:
User-agent: *
Disallow: /*?preview=true
Disallow: /*&preview=true
This keeps your drafts out of search results without affecting live content.
6. Combining Robots.txt with Noindex Directives
Robots.txt and noindex tags are like peanut butter and jelly—they work best together, but you have to use them correctly. Blocking a page in robots.txt prevents search engines from crawling it, which means they’ll never see the noindex tag. So when should you use one over the other?
When to use Disallow in robots.txt:
- You want to save crawl budget (e.g., blocking admin pages, staging sites).
- The page has no SEO value and you don’t want it indexed at all.
- You’re dealing with sensitive or duplicate content.
When to use noindex:
- You want the page to be crawled but not indexed (e.g., thank-you pages, internal search results).
- You’re testing changes and might want to index the page later.
- The page has backlinks or traffic you don’t want to lose.
Hybrid approach: If you’re unsure, you can:
- Allow crawling in robots.txt.
- Add a `noindex` tag to the page.
- Once the page is de-indexed, block it in robots.txt to save crawl budget.
Example:
# robots.txt
User-agent: *
Allow: /private-page/
# On the page itself:
<meta name="robots" content="noindex">
7. Blocking Bots from Crawling Internal Search Results
Internal search pages (e.g., example.com/search?q=keyword) are a goldmine for duplicate content. They’re also a black hole for crawl budget—Googlebot can spend hours crawling endless variations of search queries. The fix? Block them in robots.txt.
Example:
User-agent: *
Disallow: /search/
How it works:
- Blocks all bots from crawling any URL that begins with `/search/`. If your search URLs omit the slash (like `/search?q=keyword`), use `Disallow: /search` or `Disallow: /*?q=` instead.
When to use this:
- Your site has an internal search function (e.g., e-commerce stores, forums).
- You’ve noticed Google indexing search result pages.
- You want to prevent thin or duplicate content issues.
What to watch for:
- Some sites use `/search/` for other purposes (e.g., `/search-engine-optimization/`). Make sure you’re not accidentally blocking important pages.
- If you must allow crawling of some search pages (e.g., popular queries), use `Allow` rules to whitelist them.
8. Allowing Bots to Crawl Only Sitemap-Indexed Pages
For large sites, crawl budget is everything. If Googlebot is wasting time on unimportant pages, your most valuable content might get overlooked. One way to fix this? Use robots.txt to enforce a “sitemap-only” crawling policy.
Example:
User-agent: *
Disallow: /
Allow: /sitemap.xml
Allow: /*.html$
How it works:
- Blocks all bots from crawling everything (`Disallow: /`).
- Makes exceptions for the sitemap and HTML pages.
When to use this:
- Your site has thousands of pages, but only a fraction are important for SEO.
- You’ve noticed Googlebot crawling low-value pages (e.g., tag pages, author archives).
- You want to prioritize crawling for pages listed in your sitemap.
How to implement this:
- Generate a sitemap with only your most important pages.
- Submit it to Google Search Console.
- Use robots.txt to block everything except the sitemap and key pages.
Warning: This is an aggressive strategy. If you’re not careful, you could accidentally block important pages. Always test your rules before deploying them.
9. Blocking Bots from Crawling API Endpoints
APIs are the backbone of modern websites—but they’re also a favorite target for scrapers and bad bots. If your API endpoints are publicly accessible, they can drain server resources, expose sensitive data, or even get indexed by search engines. The solution? Block them in robots.txt.
Example:
User-agent: *
Disallow: /api/
How it works:
- Blocks all bots from crawling any URL beginning with `/api/`.
When to use this:
- Your site has a REST API (e.g., `example.com/api/products`).
- You’ve noticed bots hitting your API endpoints.
- You want to prevent search engines from indexing API responses.
What to watch for:
- Some APIs are used by legitimate services (e.g., payment processors, third-party integrations). Make sure you’re not blocking them.
- If your API is used by JavaScript on your site, blocking it in robots.txt won’t affect functionality—it only blocks bots.
10. Creating Rules for Multi-Site or Multi-Language Setups
Managing robots.txt for a global website—especially one with subdomains or subdirectories—can feel like herding cats. Should you use one robots.txt file for the entire site? Or separate files for each language or subdomain? The answer depends on your setup.
Option 1: Subdirectories (e.g., example.com/es/)
If your site uses subdirectories for different languages or regions, you can manage everything in one robots.txt file.
Example:
User-agent: *
Disallow: /es/private/
Disallow: /fr/private/
User-agent: Googlebot-es
Allow: /es/
User-agent: Googlebot-fr
Allow: /fr/
How it works:
- Blocks private pages in each language.
- Allows language-specific bots to crawl their respective sections.
Option 2: Subdomains (e.g., es.example.com)
If your site uses subdomains, each one can have its own robots.txt file.
Example (for es.example.com/robots.txt):
User-agent: *
Disallow: /private/
Allow: /
How it works:
- Each subdomain has its own rules.
- You can customize crawling for each language or region.
Best practices:
- Use hreflang tags to help search engines understand your language/region setup.
- Test your rules with Google’s robots.txt Tester.
- If you’re using a CDN, make sure it’s not caching an old version of your robots.txt file.
Final Thoughts: Test, Monitor, and Adjust
Advanced robots.txt rules are powerful—but they’re not set-and-forget. Bots change, websites evolve, and what works today might break tomorrow. Here’s how to stay on top of it:
- Test before deploying. Use Google’s robots.txt Tester to check for errors.
- Monitor crawl stats. In Google Search Console, look for spikes or drops in crawling activity.
- Audit regularly. Every few months, review your robots.txt file to make sure it still aligns with your SEO goals.
- Combine with other tools. Robots.txt is just one part of your crawling strategy. Use `noindex`, canonical tags, and server-side blocking (e.g., `.htaccess`) for a layered approach.
The goal isn’t to block everything—it’s to guide search engines to your most important content while keeping the rest out of sight. With these advanced prompts, you’re not just controlling crawling; you’re taking charge of your site’s SEO destiny. Now go forth and optimize.
Case Studies: Real-World Examples of Robots.txt Rules
Ever wonder how big websites keep search engines from crawling the wrong pages? A well-written robots.txt file can make a huge difference. Let’s look at some real examples where smart rules solved big problems.
E-Commerce Site: Stopping Duplicate Product Pages
Online stores often have the same product in different colors or sizes. This creates duplicate pages like ?color=red or ?size=large. Search engines get confused and waste time crawling these instead of important pages.
The fix? Simple robots.txt rules:
Disallow: /*?color=*
Disallow: /*?size=*
This tells bots to skip all filtered versions. The result? Better crawl efficiency and higher rankings for the main product pages. One store saw a 30% drop in duplicate content issues after making this change.
News Site: Protecting Paywalled Content
News websites want search engines to index their articles—but not the premium ones behind paywalls. If Google crawls these, readers might bypass the paywall through search results.
The solution is straightforward:
User-agent: *
Disallow: /premium/
This blocks all bots from crawling the /premium/ folder. The news site kept its SEO benefits while ensuring only paying subscribers could access premium content. No more free rides through search!
SaaS Company: Hiding Staging Sites
Developers often test new features on staging sites before going live. But sometimes, these staging sites appear in search results—causing confusion and security risks.
The fix? A simple rule on staging subdomains:
User-agent: *
Disallow: /
This blocks all crawling of the staging site. No more accidental leaks of unfinished features. One company even noticed fewer bot attacks after implementing this rule.
Government Site: Restricting Sensitive Data
Government websites sometimes accidentally expose internal documents. This can lead to compliance issues and security risks.
The solution? Block sensitive folders:
User-agent: *
Disallow: /internal/
Disallow: /confidential/
This keeps search engines out of restricted areas. One agency reported fewer data leaks after adding these rules. A small change with big security benefits.
Blog: Cleaning Up Low-Value Archives
Blogs often have old date-based archives like /2020/ or /2021/. These pages usually have thin content and waste crawl budget.
The fix? Block old archives:
User-agent: *
Disallow: /2020/
Disallow: /2021/
This tells search engines to focus on fresh, valuable content. One blog saw a 20% increase in organic traffic after making this change. Less clutter, better rankings.
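If you have many archive years, you can generate the Disallow lines instead of typing them one by one. A quick sketch (the year range is an assumption):

```python
def archive_disallows(first_year: int, last_year: int) -> list[str]:
    # One Disallow line per date-based archive year, inclusive of both ends.
    return [f"Disallow: /{year}/" for year in range(first_year, last_year + 1)]

rules = ["User-agent: *"] + archive_disallows(2018, 2021)
print("\n".join(rules))
```

This keeps the file consistent as old years accumulate; just bump the range when you decide another year's archive has gone stale.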
Key Takeaways
- E-commerce: Block filtered product pages to avoid duplicate content.
- News sites: Protect paywalled content without hurting SEO.
- SaaS companies: Hide staging sites to prevent leaks.
- Government sites: Restrict sensitive data for compliance.
- Blogs: Block old archives to improve crawl efficiency.
A good robots.txt file doesn’t just block pages—it guides search engines to what matters. Try these rules on your site and see the difference!
Common Mistakes and How to Avoid Them
Creating a robots.txt file seems simple, but small mistakes can cause big problems. One wrong line can block search engines from your entire site or let bots crawl pages you want to keep private. Let’s look at the most common errors and how to fix them before they hurt your SEO.
Accidentally Blocking All Bots (And How to Recover)
The biggest mistake is using User-agent: * Disallow: / without thinking. This tells every bot to stay away from your whole website. Many people add this rule when testing a new site, then forget to remove it. The result? Your pages disappear from Google search results.
If this happens, don’t panic. First, remove the rule from your robots.txt file. Then, go to Google Search Console and use the “URL Inspection” tool to request indexing for your important pages. Google usually recrawls within a few days, but you can speed it up by submitting a sitemap. Pro tip: Always double-check your robots.txt after making changes—use Google’s robots.txt tester to see if your rules work as intended.
Messy Syntax and Wildcards That Don’t Work
Another common issue is using wildcards incorrectly. For example, Disallow: /*.pdf does more than you might expect: because rules match from the start of the URL onward, it also blocks URLs that merely contain .pdf, such as /report.pdf.html. To block only URLs that end with .pdf, write Disallow: /*.pdf$; the $ anchors the rule to the end of the URL. Keep in mind that * and $ are extensions honored by major engines like Google and Bing, and simpler bots may ignore them.
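If you want to check wildcard patterns offline (Python's built-in robotparser only does plain prefix matching), you can translate a pattern to a regex the way the wildcard rules are described. A small sketch, not a full implementation of any engine's matcher:

```python
import re

def pattern_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt path pattern: '*' matches any run of characters,
    and a trailing '$' anchors the match to the end of the URL."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # honor the end anchor
    return re.match(regex, path) is not None

print(pattern_matches("/*.pdf", "/docs/report.pdf.html"))   # True: matches anywhere
print(pattern_matches("/*.pdf$", "/docs/report.pdf.html"))  # False: must end in .pdf
print(pattern_matches("/*.pdf$", "/docs/report.pdf"))       # True
```

Running a handful of real URLs from your site through a checker like this catches most wildcard surprises before they go live.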
Here’s how to test your rules:
- Use Google’s robots.txt tester to see if your rules block the right pages.
- Check your server logs to see if bots are still crawling blocked areas.
- If something isn’t working, simplify your rules—complex patterns often cause more problems than they solve.
Conflicting Rules That Confuse Search Engines
What happens if you have Disallow: /private/ but also Allow: /private/public/? Some bots will follow the first rule and ignore the second, while others might do the opposite. Most search engines, including Google, follow the most specific rule. In this case, /private/public/ would be allowed because it’s a longer, more precise path.
To avoid confusion:
- List Allow rules before Disallow rules when possible.
- Keep your rules simple—don’t mix too many Allow and Disallow rules for the same path.
- If you’re unsure, test with Google’s tool to see which rule wins.
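Python's standard-library parser resolves conflicts by first-match order rather than Google's longest-match rule, so listing Allow first makes both interpretations agree. A quick check (paths are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /private/public/",   # listed first, so first-match parsers agree with Google
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/public/report"))  # True
print(rp.can_fetch("*", "https://example.com/private/notes"))          # False
```

If swapping the order of the two rules changes the answer for some bot, that is exactly the kind of ambiguity worth removing from the file.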
Case Sensitivity and URL Sloppiness
URLs are case-sensitive, but many people forget this. If you block /Admin/ but your site uses /admin/, bots might still crawl the page. This is especially common on Linux servers, where /Admin/ and /admin/ are treated as different paths.
Best practices for consistent URLs:
- Always use lowercase in your robots.txt file.
- Redirect uppercase URLs to lowercase versions (e.g., /Admin/ → /admin/).
- Check your site’s internal links to make sure they’re consistent.
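Path matching really is case-sensitive, and you can see it directly with the standard-library parser (the domain is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",   # lowercase rule
])

print(rp.can_fetch("*", "https://example.com/admin/users"))  # False: blocked
print(rp.can_fetch("*", "https://example.com/Admin/users"))  # True: different path!
```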
Overusing Crawl Delays (And Hurting Your SEO)
Setting Crawl-delay: 30 for all bots might seem like a good way to reduce server load, but it can slow down how quickly your new content gets indexed. Some bots, like Bing’s, respect this directive, but Google ignores it entirely and instead adapts its crawl rate to how quickly your server responds.
If you need to limit crawling:
- Only set crawl delays for specific bots (e.g., a User-agent: Bingbot group with Crawl-delay: 10).
- Use Disallow for pages that don’t need to be crawled at all.
- Monitor your server logs to see which bots are causing the most traffic.
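You can read a per-bot crawl delay back out with the standard library, which is a handy way to confirm the file parses the way you intended:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Bingbot",
    "Crawl-delay: 10",
    "",
    "User-agent: *",
    "Disallow: /tmp/",
])

print(rp.crawl_delay("Bingbot"))    # 10
print(rp.crawl_delay("Googlebot"))  # None (no delay set in the * group)
```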
Forgetting to Update After Site Changes
Many sites break their SEO after a redesign or migration because they forget to update robots.txt. For example, if you move your blog from /blog/ to /articles/, old rules might still block the new path. Always audit your robots.txt file when making big changes to your site.
Quick checklist for site migrations:
- Review all Disallow rules to make sure they still make sense.
- Update sitemaps and submit them to Google Search Console.
- Use 301 redirects for any moved pages to preserve SEO value.
A well-maintained robots.txt file keeps your site running smoothly. Take the time to review yours today—your future self (and your SEO) will thank you.
Tools and Resources for Managing Robots.txt
Creating a good robots.txt file is not just about writing rules—it’s about making sure they work. Many website owners write rules but never check if search engines follow them. This can lead to big problems, like important pages being blocked or private content being indexed. The good news? There are many tools to help you create, test, and manage your robots.txt file without guesswork.
Generators and Validators: Make Rule Creation Easy
If you’re not sure how to write robots.txt rules, generators can help. These tools ask simple questions—like which bots to block or which folders to hide—and then create the correct syntax for you. Some popular options include:
- Google’s Robots.txt Generator – Simple and free, great for beginners.
- Ryte’s Robots.txt Generator – Lets you test rules before applying them.
- Screaming Frog’s SEO Spider – Not just a generator, but also a powerful crawler that checks if your rules work.
Once you have your file, you should always validate it. Google’s Robots.txt Tester (inside Google Search Console) is the best for this. It shows if your rules block or allow the right pages. For example, if you block /private/ but Google can still crawl it, the tester will flag the issue. This saves you from accidentally hiding important pages from search results.
Automated Management: Keep Rules Updated Without Manual Work
For large or dynamic websites, manually updating robots.txt is a headache. That’s where automation comes in. Many CMS platforms have plugins or built-in tools to manage rules for you:
- WordPress – Plugins like Yoast SEO or All in One SEO let you edit robots.txt from your dashboard.
- Shopify – You can edit the file directly in the Online Store > Themes > Edit Code section.
- Magento – Extensions like MageWorx SEO Suite help manage rules for product pages and categories.
For developers, tools like GitHub Actions or Cloudflare Workers can automatically update robots.txt when new pages are added. This is useful for sites with frequent changes, like e-commerce stores with new products daily. The key is to set up a system that keeps your rules in sync with your site’s structure.
Monitoring Bot Activity: See What’s Really Happening on Your Site
Writing rules is only half the battle—you also need to know if they’re working. Tools like Google Search Console show which pages search engines crawl and if they hit any blocked areas. For deeper insights, log analyzers like AWStats or Loggly track all bot visits, not just Google’s. This helps you spot:
- Aggressive bots that ignore your rules.
- Crawl errors that waste your server resources.
- Pages that should be blocked but aren’t.
For example, if you see a bot from a suspicious IP crawling your site too often, you can add a rule to block it. Or if Google keeps trying to access a deleted page, you can update your robots.txt to prevent wasted crawl budget.
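Even without a log platform, a few lines of Python can summarize which bots hit you hardest. A minimal sketch over access-log-style lines (the sample lines, log format, and bot list are illustrative):

```python
from collections import Counter

def bot_hits(log_lines):
    """Count requests per known bot, matched by user-agent substring."""
    bots = ("Googlebot", "bingbot", "AhrefsBot", "SemrushBot")
    counts = Counter()
    for line in log_lines:
        for bot in bots:
            if bot in line:
                counts[bot] += 1
    return counts

sample = [
    '1.2.3.4 - - [10/May/2024] "GET /premium/a HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '5.6.7.8 - - [10/May/2024] "GET /blog/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"',
    '5.6.7.8 - - [10/May/2024] "GET /tag/x HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"',
]
print(bot_hits(sample).most_common())  # [('AhrefsBot', 2), ('Googlebot', 1)]
```

A tally like this makes it obvious which user-agents deserve their own robots.txt group, and which ones ignore your rules entirely and need blocking at the server level instead.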
Learning Resources: Where to Go for Help
Even with tools, it’s good to understand the basics. Search engines like Google and Bing have official guides on robots.txt best practices. For deeper learning, check out:
- Google’s Search Central Documentation – Covers everything from basic rules to advanced use cases.
- Moz’s Beginner’s Guide to SEO – Explains how robots.txt fits into overall SEO strategy.
- r/SEO on Reddit – A community where you can ask questions and see real-world examples.
If you prefer books, “The Art of SEO” by Eric Enge has a great chapter on technical SEO, including robots.txt. For hands-on practice, try free courses on Coursera or Udemy that walk you through setting up and testing rules.
Final Tip: Start Small, Then Improve
You don’t need a perfect robots.txt file on day one. Start with the basics—blocking admin pages, staging sites, and duplicate content. Then, use the tools above to test and refine your rules over time. The goal isn’t to block everything—it’s to guide search engines to the right pages while keeping the rest private. With the right tools and a little practice, you’ll have a robots.txt file that works for your site, not against it.
Conclusion
You now have 20 powerful prompts to create robots.txt rules that actually work. These aren’t just random examples—they solve real problems. Need to block spam bots? There’s a prompt for that. Want to hide staging sites? Covered. Trying to keep search engines away from duplicate content? You’ve got options.
How to Pick the Right Rules for Your Site
Not every rule will fit your needs. Here’s how to choose wisely:
- For e-commerce sites: Block checkout pages, admin areas, and internal search results.
- For blogs: Allow crawling of posts but block tag pages or low-value archives.
- For developers: Hide staging sites, test folders, and API endpoints.
- For SEO: Let search engines focus on your most important pages.
Start with the basics—block what shouldn’t be public. Then, test and adjust. A robots.txt file isn’t just about blocking; it’s about guiding search engines to what matters.
Don’t Set It and Forget It
Your website changes. New pages get added. Old ones get removed. If you don’t update your robots.txt, you might accidentally block important content—or let bots crawl things they shouldn’t. Check your file at least once a month. Use tools like Google Search Console to see if your rules are working as intended.
Pro tip: If you’re unsure about a rule, test it first. Google’s robots.txt Tester can show you exactly how search engines will interpret your file.
Balance Control with Visibility
Blocking too much can hurt your SEO. Blocking too little can waste crawl budget. The key is finding the right balance. Ask yourself:
- Are search engines crawling pages that don’t help users?
- Are they missing your most important content?
- Are spam bots wasting your server resources?
If the answer is yes, adjust your rules. If you’re not sure, experiment. Small changes can make a big difference.
Your Next Steps
Ready to optimize your robots.txt? Here’s what to do:
- Audit your current file – Is it blocking what it should? Allowing what it shouldn’t?
- Pick 2-3 prompts from this list – Start with the most urgent fixes.
- Test your changes – Use Google’s tool or a crawler like Screaming Frog.
- Monitor results – Check Google Search Console for crawl errors or indexing issues.
Your robots.txt file is one of the simplest yet most powerful tools for SEO. Use it well, and you’ll guide search engines exactly where you want them to go. Got a use case we didn’t cover? Share it in the comments—we’d love to hear how you’re using these prompts!
Ready to Dominate the Search Results?
Get a free SEO audit and a keyword-driven content roadmap. Let's turn search traffic into measurable revenue.