
The TastyPlacement Blog


Tutorial: Block Bad Bots with .htaccess

In this tutorial, we’ll learn how to block bad bots and spiders from your website. Doing so saves bandwidth, preserves performance for real customers, increases security, and prevents scrapers from putting duplicate copies of your content around the web.

Quick Start Instructions/Roadmap

For those looking to get started right away (without a lot of chit-chat), here are the steps to blocking bad bots with .htaccess:

  • FTP to your website and find your .htaccess file in your root directory
  • Create a page in your root directory called 403.html. The content of the page doesn’t matter; ours is a text file with just the characters “403”
  • Browse to this page on AskApache that has a sample .htaccess snippet complete with bad bots already coded in
  • You can add any bots to the sample .htaccess file as long as you follow the .htaccess syntax rules
  • Test your .htaccess file with a bot spoofing site like wannabrowser.com

Check Your Server Logs for Bad Bots

Bad Bots Server Log

If you read your website server logs, you’ll see that bots and crawlers regularly visit your site. These visits can add up to hundreds of hits a day and plenty of bandwidth. The server log pasted above is from TastyPlacement, and the bot identified in red is discoverybot. This bot was nice enough to identify its website for me, but DiscoveryEngine.com, which touts itself as the next great search engine, presently offers nothing except stolen bandwidth. It’s not a bot I want visiting my site. If you check your server logs, you might see bad bots like sitesnagger, reaper, harvest, and others. Make a note of any suspicious bots you see in your logs.
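If your logs are too long to read by eye, a short script can tally visits per User Agent. The sketch below is only an illustration: it assumes your server writes the common “combined” log format, where the User Agent is the last quoted field on each line, and the sample log lines (IP addresses, dates, and agent strings) are made up for the example.

```python
import re
from collections import Counter

# In the combined log format, each line ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')

def count_user_agents(lines):
    """Return a Counter mapping each User-Agent string to its number of hits."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines; on a real server, pass open("access.log") instead.
sample = [
    '203.0.113.5 - - [10/Oct/2013:13:55:36 -0500] "GET / HTTP/1.1" 200 2326 "-" "discoverybot/2.0"',
    '203.0.113.5 - - [10/Oct/2013:13:55:40 -0500] "GET /blog/ HTTP/1.1" 200 5120 "-" "discoverybot/2.0"',
    '198.51.100.7 - - [10/Oct/2013:13:56:02 -0500] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
]

# Print the busiest agents first.
for agent, hits in count_user_agents(sample).most_common():
    print(hits, agent)
```

Agents with many hits and unfamiliar names are the ones worth looking up before you decide to block them.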

AskApache’s Bad Bot RewriteRules

AskApache maintains a very brief tutorial with a very comprehensive .htaccess code snippet. What makes that page so great is that the .htaccess snippet already has dozens of bad bots blocked (like reaper, blackwidow, sitesnagger) and you can simply add any new bots you identify.

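If we want to block a bot not covered by AskApache’s default text, we just add its name inside the parentheses of a “RewriteCond” line, separating each bot with a “|” pipe character. The “^” anchors the match to the start of the User Agent string, “NC” makes the match case-insensitive, and “OR” chains this condition to the one that follows it. We’ve put “discoverybot” in our file because that’s a visitor we know we don’t want: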

# IF THE UA STARTS WITH THESE
RewriteCond %{HTTP_USER_AGENT} ^(verybadbot|discoverybot) [NC,OR]
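
Because of the “^” anchor, this condition only catches User Agents that begin with one of the listed names. The Python sketch below mimics that behavior so you can try names before editing your live file. It is an approximation only: Python’s re module stands in for Apache’s regex engine, and the helper names and sample User Agent strings are our own inventions for illustration.

```python
import re

def build_ua_pattern(bot_names):
    """Join bot names with '|' inside ^( ... ), mirroring the RewriteCond line.

    Names are treated as regex fragments (e.g. 'black.?hole'), so they are
    deliberately not escaped.
    """
    return re.compile("^(" + "|".join(bot_names) + ")", re.IGNORECASE)

pattern = build_ua_pattern(["verybadbot", "discoverybot"])

def starts_with_bad_bot(user_agent):
    """True if the User-Agent string begins with one of the listed names."""
    return pattern.search(user_agent) is not None

# Hypothetical User-Agent strings:
print(starts_with_bad_bot("DiscoveryBot/1.0"))                            # True
print(starts_with_bad_bot("Mozilla/5.0 (compatible; discoverybot/2.0)"))  # False
```

The second result is the one to watch for: a bot whose User Agent begins with “Mozilla” slips past a “starts with” condition, so check the exact string in your logs and, if needed, use a condition without the “^” anchor.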

If you are on the WordPress platform, be careful not to disrupt the existing entries in your .htaccess file. As always, keep a backup of your .htaccess file; it’s quite easy to break your site with one coding error. It’s also probably better to put these rewrite rules at the beginning of your .htaccess file so the server evaluates them before any other rewrite rules handle the request. Here’s a simplified version of the complete .htaccess file:

ErrorDocument 403 /403.html

RewriteEngine On
RewriteBase /

# IF THE UA STARTS WITH THESE
# (last condition: no OR flag, since a trailing OR can make the rule fire for every visitor)
RewriteCond %{HTTP_USER_AGENT} ^(black.?hole|blackwidow|discoverybot) [NC]

# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule . - [F,L]

Here’s a translation of the .htaccess file above:

  • ErrorDocument sets a page named 403.html as the error document served when bad bots are blocked. Create this page in your root directory; its content doesn’t matter, and ours is a text file with just the characters “403”
  • RewriteEngine and RewriteBase simply mean “ready to enforce rewrite rules, and set the base URL to the website root”
  • RewriteCond directs the server “if the User Agent matches any of these bot names, enforce the RewriteRule that follows”
  • RewriteRule answers every matching request with a 403 Forbidden status (the F flag), and the server then serves our ErrorDocument, 403.html

Testing Our .htaccess File

Once you upload your .htaccess file, you can test it by browsing to your site while pretending to be a bad bot. You do this by going to wannabrowser.com and spoofing a User Agent. In this case, we spoofed “SiteSnagger”.

If you installed properly, you should be directed to your 403 page, and you have successfully blocked most bad bots.
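
If you’d rather test from your own machine, the same check can be sketched with Python’s standard library. This is a minimal sketch, not a replacement for the test above: the function names are our own, and the URL in the usage comment is a placeholder to swap for your site.

```python
import urllib.error
import urllib.request

def build_spoofed_request(url, user_agent):
    """Build a GET request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def status_for_user_agent(url, user_agent):
    """Fetch url as the given User-Agent and return the HTTP status code."""
    request = build_spoofed_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return error.code

# Usage (placeholder URL, replace with your own site):
# print(status_for_user_agent("http://www.example.com/", "SiteSnagger"))
# print(status_for_user_agent("http://www.example.com/", "Mozilla/5.0"))
```

If the rules are working, a blocked name should come back as 403 while an ordinary browser User Agent still gets 200.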

Limitations

Now, why don’t we do this with robots.txt and simply tell bots not to index? Simple: bots might ignore our directive, or they’ll crawl anyway and just not index the content, and that’s not a fix. Even this .htaccess fix only blocks bots that identify themselves. If a bot spoofs a legitimate User Agent, this technique won’t work. We’ll post a tutorial soon about how to block traffic based on IP address. That said, you’ll block 90% of bad bot traffic with this technique.

Enjoy!

 

 

2 Responses

  1. Hi Michael, can you block backlinking bots like Majestic, Ahrefs, Moz, etc. from crawling your site by using .htaccess?

    How can you test the .htaccess to see if this works correctly?

    1. I was just looking into this tonight. I want to block “builtwith.com”, but can’t see it in my server logs. I added the text “builtwith” to my .htaccess file just in case. Majestic shows up in my server logs and I do believe I can block it. To test it, go to WannaBrowser.com and spoof the user agent. Just enter the name of the user agent and that will spoof it for you.
