Tutorial: Block Bad Bots with .htaccess

In this tutorial, we’ll learn how to block bad bots and spiders from your website. This is a standard safety measure we implement with our WordPress SEO service. We can save bandwidth and performance for customers, increase security, and prevent scrapers from putting duplicate content around the web.

Quick Start Instructions/Roadmap

For those looking to get started right away (without a lot of chit-chat), here are the steps to blocking bad bots with .htaccess:

FTP to your website and find your .htaccess file in your root directory
Create a page in your root directory called 403.html. The content of the page doesn’t matter; ours is a text file with just the characters “403”
Browse to this page on AskApache that has a sample .htaccess snippet complete with bad bots already coded in
You can add any bots to the sample .htaccess file as long as you follow the .htaccess syntax rules
Test your .htaccess file with a bot spoofing site like wannabrowser.com

Check Your Server Logs for Bad Bots

If you read your website server logs, you’ll see that bots and crawlers regularly visit your site. These visits can amount to hundreds of requests a day and plenty of bandwidth. In one server log from TastyPlacement, the bot that stood out was discoverybot. This bot was nice enough to identify its website, DiscoveryEngine.com, which touts itself as the next great search engine but presently offers nothing except stolen bandwidth. It’s not a bot I want visiting my site. If you check your server logs, you might see bad bots like sitesnagger, reaper, harvest, and others. Make a note of any suspicious bots you see in your logs.
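If your log is large, a short script can tally the User Agents for you. The following is a minimal Python sketch, assuming your server writes the common “combined” log format, in which each line ends with the referer and the User Agent in quotes. The names count_user_agents, UA_AT_END, and SAMPLE_LINES are invented for this sketch, and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# In the common "combined" log format, each line ends with
# "referer" "user-agent". Other log formats would need a different pattern.
UA_AT_END = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def count_user_agents(lines):
    """Count how often each User-Agent string appears in access-log lines."""
    counts = Counter()
    for line in lines:
        match = UA_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines for illustration only (not real log data).
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "examplebot/1.0"',
    '203.0.113.5 - - [01/Jan/2024:10:00:01 +0000] "GET /about HTTP/1.1" 200 734 "-" "examplebot/1.0"',
    '198.51.100.7 - - [01/Jan/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# Print the busiest User Agents first.
for agent, hits in count_user_agents(SAMPLE_LINES).most_common():
    print(f"{hits:4d}  {agent}")
```

To run this against a real log, pass an open file instead of the sample list. The file name and location depend on your host, so treat any path you use as a placeholder until you have confirmed it.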

AskApache’s Bad Bot RewriteRules

AskApache maintains a very brief tutorial but a very comprehensive .htaccess code snippet here. What makes that page so great is that the .htaccess snippet already has dozens of bad bots blocked (like reaper, blackwidow, sitesnagger), and you can simply add any new bots you identify.

If we want to block a bot not covered by AskApache’s default text, we just add its name to the list in the “RewriteCond” line, separating each bot with a “|” pipe character. We’ve put “discoverybot” in our file because that’s a visitor we know we don’t want:

# IF THE UA STARTS WITH THESE
RewriteCond %{HTTP_USER_AGENT} ^(verybadbot|discoverybot) [NC,OR]

If you are on the WordPress platform, be careful not to disrupt existing entries in your .htaccess file. As always, keep a backup of your .htaccess file; it’s quite easy to break your site with one coding error. Also, it’s probably better to put these rewrite rules at the beginning of your .htaccess file, so the server evaluates them before any other rewrite rules (such as WordPress’s own) handle the request. Here’s a simplified version of the complete .htaccess file:

ErrorDocument 403 /403.html

RewriteEngine On
RewriteBase /

# IF THE UA STARTS WITH THESE
# (NO [OR] ON THE LAST RewriteCond BEFORE THE RULE, OR THE RULE APPLIES TO EVERY REQUEST)
RewriteCond %{HTTP_USER_AGENT} ^(black.?hole|blackwidow|discoverybot) [NC]

# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule . - [F,L]

Here’s a translation of the .htaccess file above:

ErrorDocument sets the page 403.html as the error document served with a 403 (Forbidden) response; create this page in your root directory. The content of the page doesn’t matter; ours is a text file with just the characters “403”
RewriteEngine and RewriteBase simply mean “ready to enforce rewrite rules, and set the base URL to the website root”
RewriteCond directs the server “if the User Agent starts with any of these bot names (the ^ anchors the match to the start, and NC ignores case), enforce the RewriteRule that follows”
RewriteRule answers every request from a matching bot with a 403 Forbidden status (the F flag), and the server sends our ErrorDocument, 403.html, as the response body
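To get a feel for what the RewriteCond pattern above does and does not match, you can try it outside the server. The following is a rough Python sketch, assuming Python’s re module behaves like Apache’s regex engine for a pattern this simple. The names is_blocked, STARTS_WITH, and CONTAINS are invented for this sketch, CONTAINS is a hypothetical variant that is not in the file above, and the User Agent strings are illustrative rather than taken from real logs.

```python
import re

# Same alternation as the RewriteCond above; the [NC] flag corresponds
# to re.IGNORECASE here.
STARTS_WITH = re.compile(r"^(black.?hole|blackwidow|discoverybot)", re.IGNORECASE)

# Hypothetical variant without the ^ anchor: matches the name anywhere
# in the User-Agent string, not only at the start.
CONTAINS = re.compile(r"(black.?hole|blackwidow|discoverybot)", re.IGNORECASE)


def is_blocked(user_agent, pattern=STARTS_WITH):
    """Return True if the User-Agent string matches the given pattern."""
    return pattern.search(user_agent) is not None


# Illustrative User-Agent strings (assumptions, not real log entries).
print(is_blocked("BlackWidow"))                                           # True
print(is_blocked("Mozilla/5.0 (compatible; discoverybot/2.0)"))           # False
print(is_blocked("Mozilla/5.0 (compatible; discoverybot/2.0)", CONTAINS)) # True
```

The second result is the one to watch for. If a bot in your logs reports a User Agent that begins with something like “Mozilla/5.0 (compatible; ...”, a condition anchored with ^ will not catch it, and a condition without the anchor would be needed.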

Testing Our .htaccess File

Once you upload your .htaccess file, you can test it by browsing to your site and pretending to be a bad bot. You do this by going to wannabrowser.com and spoofing a User Agent. In this case, we spoofed “SiteSnagger”.

If everything is installed properly, you should be served your 403 page, which means you have successfully blocked most bad bots.
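If you would rather test from your own machine, the same check can be scripted. The following is a minimal Python sketch using only the standard library. The function names build_request and status_for_user_agent are invented for this sketch, and https://example.com/ is a placeholder for your own site.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a GET request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for_user_agent(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is on the exception.
        return err.code


# Example call (replace the placeholder URL with your own site):
# print(status_for_user_agent("https://example.com/", "SiteSnagger"))
```

If the block is working, a spoofed bad-bot name should come back as 403, while an ordinary browser User Agent should still come back as 200. It is worth checking both, since a mistake in the rules can block everyone.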

Some Limitations

Now, why don’t we do this with robots.txt and simply tell bots not to index? Simple: bots might ignore our directive, or they’ll crawl anyway and just not index the content, and that’s not a fix. Even this .htaccess fix only blocks bots that identify themselves. If a bot spoofs a legitimate User Agent, this technique won’t work. We’ll post a tutorial soon about how to block traffic based on IP address. That said, by our estimate you’ll block about 90% of bad bot traffic with this technique.

Enjoy!

stuart says:

February 20, 2014 at 1:51 am

Hi Michael can you block backlinking bots like Majestic, Ahrefs, Moz etc… from crawling your site by using the .htaccess?

How can you test the .htaccess to see if this works correctly?

Michael David says:
February 21, 2014 at 10:34 pm

I was just looking into this tonight. I want to block “builtwith.com”, but can’t see it in my server logs. I added the text “builtwith” to my htaccess file just in case. Majestic shows up in my server logs and I do believe I can block it. To test it, go to WannaBrowser.com and spoof the user agent. Just enter the name of the user agent and that will spoof it for you.

Bill Minozzi says:

February 1, 2020 at 6:52 am

Hi,
We developed this free PHP App to block bots:
http://stopbadbots.com/
Cheers,
Bill
Developer

Prakash Gohel says:

September 16, 2020 at 11:30 pm

Which is the right way to block bad bots? Right now I am using the above method; some bloggers are recommending the Cloudflare firewall rule and some are using robots.txt.
Also should we block baidu & archive.org because it’s crawling my site a week.

Thank You.
