In this tutorial, we’ll learn how to block bad bots and spiders from your website. Blocking them saves bandwidth, keeps the site fast for real customers, improves security, and stops scrapers from spreading duplicate copies of your content around the web.
Quick Start Instructions/Roadmap
For those looking to get started right away (without a lot of chit-chat), here are the steps to blocking bad bots with .htaccess:
- FTP to your website and find your .htaccess file in your root directory
- Create a page in your root directory called 403.html; the content of the page doesn’t matter (ours is a text file with just the characters “403”)
- Browse to this page on AskApache that has a sample .htaccess snippet complete with bad bots already coded in
- You can add any bots to the sample .htaccess file as long as you follow the .htaccess syntax rules
- Test your .htaccess file with a bot spoofing site like wannabrowser.com
Check Your Server Logs for Bad Bots

If you read your website server logs, you’ll see that bots and crawlers regularly visit your site. These visits can add up to hundreds of hits a day and plenty of bandwidth. The server log pasted above is from TastyPlacement, and the bot identified in red is discoverybot. This bot was nice enough to identify its website for me: DiscoveryEngine.com touts itself as the next great search engine, but presently offers nothing except stolen bandwidth. It’s not a bot I want visiting my site. If you check your server logs, you might see bad bots like sitesnagger, reaper, harvest, and others. Make a note of any suspicious bots you see in your logs.
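If your logs are long, a short script can do the tallying for you. The Python below is a minimal sketch, assuming your server writes Apache’s Combined Log Format, where the User-Agent is the last quoted field on each line. The sample log lines, IP addresses, and agent strings are invented for illustration, and the names `UA_PATTERN` and `count_user_agents` are our own.

```python
import re
from collections import Counter

# Combined Log Format ends with two quoted fields: "referer" "user-agent".
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each User-Agent string to its number of hits."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, invented for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2012:13:55:36 -0500] "GET / HTTP/1.1" 200 5120 "-" "discoverybot/2.0"',
    '203.0.113.5 - - [10/Oct/2012:13:55:40 -0500] "GET /about/ HTTP/1.1" 200 4210 "-" "discoverybot/2.0"',
    '198.51.100.7 - - [10/Oct/2012:13:56:02 -0500] "GET / HTTP/1.1" 200 5120 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1)"',
]

# Print the busiest agents first.
for agent, hits in count_user_agents(SAMPLE_LOG).most_common():
    print(hits, agent)
```

To run it against a real log, pass an open file instead of `SAMPLE_LOG`, for example `count_user_agents(open("access.log"))`, where `access.log` stands in for wherever your host keeps its logs.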
AskApache’s Bad Bot RewriteRules
AskApache maintains a very brief tutorial but a very comprehensive .htaccess code snippet here. What makes that page so great is that the .htaccess snippet already has dozens of bad bots blocked (like reaper, blackwidow, sitesnagger) and you can simply add any new bots you identify.
If we want to block a bot not covered by AskApache’s default text, we just add its name to the “RewriteCond” section, separating each bot with a “|” pipe character. We’ve put “discoverybot” in our file because that’s a visitor we know we don’t want:
# IF THE UA STARTS WITH THESE
RewriteCond %{HTTP_USER_AGENT} ^(verybadbot|discoverybot) [NC,OR]
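It helps to know exactly what that pattern will and won’t catch. The sketch below replays the same regular expression in Python, on the assumption that Python’s `re` module treats this simple pattern the same way Apache’s regex engine does; `re.IGNORECASE` stands in for the `[NC]` flag. The User-Agent strings are invented for illustration, and `is_blocked` is our own helper name.

```python
import re

# Same alternation as the RewriteCond line; [NC] corresponds to re.IGNORECASE.
STARTS_WITH = re.compile(r"^(verybadbot|discoverybot)", re.IGNORECASE)


def is_blocked(user_agent):
    """Return True if the User-Agent begins with one of the listed bot names."""
    return STARTS_WITH.search(user_agent) is not None


# Hypothetical User-Agent strings, invented for illustration.
EXAMPLES = [
    "discoverybot/2.0",                            # starts with the name
    "DiscoveryBot",                                # different case, still matches
    "Mozilla/5.0 (compatible; discoverybot/2.0)",  # name appears later in the string
]

for ua in EXAMPLES:
    print(is_blocked(ua), ua)
```

The last example is not caught, because the leading “^” anchors the match to the start of the User-Agent. If a bot in your logs buries its name in the middle of the string, it belongs in a condition without the “^”.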
If you are on the WordPress platform, be careful not to disrupt existing entries in your .htaccess file. As always, keep a backup of your .htaccess file; it’s quite easy to break your site with one coding error. Also, it’s probably better to put these rewrite rules at the beginning of your .htaccess file so the server checks for bad bots before any other rewrite rules handle the request. Here’s a simplified version of the complete .htaccess file:
ErrorDocument 403 /403.html
RewriteEngine On
RewriteBase /
# IF THE UA STARTS WITH THESE
# LAST CONDITION: NO [OR] FLAG, OR THE RULE WOULD FIRE FOR EVERY VISITOR
RewriteCond %{HTTP_USER_AGENT} ^(black.?hole|blackwidow|discoverybot) [NC]
# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule . - [F,L]
Here’s a translation of the .htaccess file above:
- ErrorDocument sets a webpage titled 403.html to serve as our error document when bad bots are encountered; you want to create a page in your root directory called 403.html. The content of the page doesn’t matter (ours is a text file with just the characters “403”)
- RewriteEngine and RewriteBase simply mean “ready to enforce rewrite rules, and set the base URL to the website root”
- RewriteCond directs the server “if you encounter any of these bot names, enforce the RewriteRule that follows”
- RewriteRule answers every request from a listed bot with a 403 Forbidden status (the “F” flag), and the server then sends our ErrorDocument, 403.html
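When you chain several RewriteCond lines together, the “OR” flag decides how they combine, and it is easy to get wrong. The Python below is a simplified model of how we understand mod_rewrite to walk a list of conditions; it is a sketch for reasoning about your own file, not Apache’s actual code. Each condition is a hypothetical `(pattern, or_next)` pair, where `or_next` is True when the line carries the “OR” flag.

```python
import re


def rule_applies(conditions, user_agent):
    """Model of RewriteCond chaining: True means the RewriteRule would fire.

    conditions is a list of (pattern, or_next) tuples, in file order.
    """
    i = 0
    while i < len(conditions):
        pattern, or_next = conditions[i]
        matched = re.search(pattern, user_agent, re.IGNORECASE) is not None
        if or_next:
            if matched:
                # One member of the OR chain matched: skip the rest of the chain.
                while i < len(conditions) and conditions[i][1]:
                    i += 1
            # If it did not match, simply move on to the next condition.
        elif not matched:
            # A condition without OR that fails stops the rule.
            return False
        i += 1
    return True


PATTERN = r"^(black.?hole|blackwidow|discoverybot)"

# Single condition without OR: only listed bots trigger the rule.
print(rule_applies([(PATTERN, False)], "Mozilla/5.0"))
print(rule_applies([(PATTERN, False)], "BlackWidow"))

# Same condition with a trailing OR: an ordinary browser triggers it too.
print(rule_applies([(PATTERN, True)], "Mozilla/5.0"))
```

Under this model, a trailing “OR” on the final condition lets the rule fire for every visitor, so the last RewriteCond before the RewriteRule should carry “[NC]” only.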
Testing Our .htaccess File
Once you upload your .htaccess file, you can test it by browsing to your site while pretending to be a bad bot. You do this by going to wannabrowser.com and spoofing a User Agent. In this case, we spoofed “SiteSnagger”:

If everything is installed properly, you should be directed to your 403 page, which means you have successfully blocked most bad bots.
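You can also run the same test from your own machine. The sketch below uses Python’s standard `urllib` to send a request with a spoofed User-Agent and report the status code the server returns. The helper names `build_request` and `status_for` are our own, and `http://example.com/` is a placeholder for your site.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent, timeout=10):
    """Fetch the URL and return the HTTP status code, including error codes."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we want to see.
        return err.code


# Example call (placeholder URL; replace it with your own site before running):
# print(status_for("http://example.com/", "SiteSnagger"))
```

If the rules are working, a blocked name such as “SiteSnagger” should come back as 403, while an ordinary browser User-Agent should still get 200.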
Limitations
Now, why don’t we do this with robots.txt and simply tell bots not to index? Simple: because bots might simply ignore our directive, or they’ll crawl anyway and just not index the content, and that’s not a fix. Even this .htaccess fix will only block bots that identify themselves. If a bot is spoofing itself as a legitimate User Agent, then this technique won’t work. We’ll post a tutorial soon about how to block traffic based on IP address. That said, you’ll block 90% of bad bot traffic with this technique.
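To see why robots.txt is only a request, it helps to look at it from the crawler’s side. The sketch below uses Python’s standard `urllib.robotparser` to ask the question a well-behaved bot asks before fetching a page. The robots.txt content, the URL, and the helper name `polite_bot_may_fetch` are invented for illustration.

```python
from urllib import robotparser

# Hypothetical robots.txt content asking one bot to stay away from everything.
ROBOTS_TXT = [
    "User-agent: discoverybot",
    "Disallow: /",
]


def polite_bot_may_fetch(user_agent, url):
    """Answer the question a well-behaved crawler asks itself before fetching."""
    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT)
    return parser.can_fetch(user_agent, url)


# The named bot is told to stay out; any other agent is allowed in.
print(polite_bot_may_fetch("discoverybot", "http://example.com/page.html"))
print(polite_bot_may_fetch("SomeBrowser", "http://example.com/page.html"))
```

Nothing forces a crawler to run a check like this. The server never sees it, which is why the .htaccess rules do the enforcing on our side.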
Enjoy!



About the Author: Michael David
Michael David is the founder, current CEO, and lead strategist at TastyPlacement, based in Austin, Texas. He is the author of "WordPress 3.0 Search Engine Optimization" with the prestigious IT publisher, Packt Publishing. TastyPlacement performs search marketing campaigns, public relations, search engine optimization, social media consulting and online advertising for companies in a wide range of fields.