I received requests from a few webmasters some time ago asking me if there was a way to block unwanted bots from their website.
This article shows you how you can do this using .htaccess.
"Bots", for those not familiar with the term, are basically computer programs that "surf" multiple websites to perform a variety of automated tasks. The word is short for "robots". Examples of bots include those used by the search engines. Those bots retrieve a copy of your web page so that they can include relevant terms from that page in their search index. Not all bots are benign, however. Some bots go through your website looking for web forms and email addresses to send you spam. Other bots probe your website for security vulnerabilities.
Before you rush to implement the things suggested by this article, I should probably mention the following prerequisites.
Your website must be hosted on an Apache web server, and your web host must have a facility known as ".htaccess overrides" enabled. If this is not the case, you won't be able to do anything mentioned here without bringing your site down. In practice, this usually means that your website is hosted with a commercial web host since most free web hosts don't allow you to override server behaviour using .htaccess.
You need to be able to check your site's raw web logs. Again, this probably means that you are using a commercial web host rather than a free one. If all you have is the web statistics provided by your web host or a free web statistics and analytics service, you won't be able to get the information you need to block the bot.
You need to have a specific bot that you wish to block. If you arrived at this page hoping to find a list of bots to block, you're at the wrong place. This article is a practical guide designed to help webmasters who already know what they want to block.
Don't think that you can really get rid of all unwanted bots from your website using the method described here. Trying to block all undesirable bots from your site is like trying to rid the world of pests. Swat one, and another few will take its place. This doesn't mean that you can't try, of course. I'm just saying this so that you don't get your hopes too high about what you can actually achieve.
Before you can block a bot, you will need to know at least one of two things: the IP address where the bot is coming from or the "User Agent string" that the bot is using. The easiest way to find this is to look into your raw web log.
Download your web log from your web host, uncompress it using an archiver (if needed), and open it in a plain text editor. You'll probably need to use a better editor than Notepad if your logs are large. If you have a search and replace utility like those listed on the Free Text Search and Replace Utilities page, you can use those instead of the editor. Search through the file for the bot you want to block. It helps if you know either the page it tried to access or the time it hit your website, so that you can narrow your search down.
Once you've located the entries that belong to the bot, look for the IP address and the user agent string.
The IP address is a series of 4 numbers separated by dots. They look like "127.0.0.1". The "User Agent string" is just the name that the program accessing your site goes by. For example, version 58.0 of the Firefox web browser has a user agent string of "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0" (among others) while the Google search engine bot goes by "Googlebot/2.1 (+http://www.google.com/bot.html)" (among others). You won't need to know the entire user-agent string. Just find some part of the user agent string that is unique to that particular bot, that is, that no other bot or web browser uses.
Note the IP addresses used by the bot and the user agent string.
Be careful though. Just because a bad bot has visited your website using a particular IP address does not mean that if you block that IP address, you'll be rid of that bot forever. Some viruses and malware infect a normal computer user's machine to turn it into a machine that sends spam and probes sites for vulnerabilities. The IP address that you plan to block may well belong to such an ordinary person, and you could be blocking an internet provider's IP address. When that user disconnects from the Internet, and another user logs in, the internet provider could assign the new user the same IP address. When you block by IP address, you may end up blocking an entire internet provider, and thus a lot of real users and potential customers.
Likewise, many bad bots intentionally use User Agent names that correspond to normal web browsers. As such you won't be able to tell from the user agent alone whether it's a bot or a real user. If you wantonly block user agents with the name of "Mozilla", for example, you could end up blocking nearly every human from your website.
In general, if you don't know what you're doing, it's best not to block anything, unless you don't mind inadvertently blocking users and perhaps even whole countries (if you're especially careless).
Once you know the bot's IP address or user agent string, connect to your site using an FTP or SFTP client. Go to the top web directory of your site, where your home page is located. Look for a file named ".htaccess". If it exists, download it to your computer.
If it doesn't exist, make sure that it is not hidden from your view. Depending on the FTP program you use, you may need to log off, set a "Remote file mask" of "-a" (without the quotation marks) in the options for the program, and log in again to check. (The "remote file mask" is the term used in one FTP client. Your program may use a different term.)
Alternatively, log into your site using your web host's control panel. Most commercial web hosts allow you to access your web directories from your web browser and download files that way. If your host has a setting to "show hidden files" or the like, make sure you enable it to look for the .htaccess file.
If, after all your efforts to find it, you cannot locate any .htaccess file in the top web directory of your site, don't worry. It's quite normal not to have any in the default setup for most web hosts. You will simply have to create one yourself. The reason we went to all that trouble to locate it is that if one exists, you will need to get it so that you can add to the settings already present in that file. If you don't, and you create one from scratch to overwrite an existing one, you may inadvertently wipe out some other settings that you want for your site.
If you've managed to get the .htaccess file, open it in a plain text editor (like Notepad). If one does not exist, use the editor to create a new blank document. The rest of this article will assume that you have already started the editor with the .htaccess file open, or with a blank document if no .htaccess file previously existed.
WARNING: do not use a word processor like Word, Office, or WordPad to create or edit your .htaccess file. If you do, your site will mysteriously fail when you upload the file to your web server.
To block a certain IP address, say, 127.0.0.1, add the following lines to your .htaccess file. If your file already has some content, just move your cursor to the end of the file, and add the following on a new line in the file. If you don't have an existing .htaccess file, just type it into your blank document. You should of course change the numbers "127.0.0.1" to point to the correct IP address you want to block.
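Going by the line-by-line explanation that follows, the two lines for this example would look something like the sketch below. It uses the older "Order" and "Deny" directives that this article describes. On Apache 2.4 these are handled by the mod_access_compat module, and I am assuming here that your web host has that module loaded.

```apache
Order Deny,Allow
Deny from 127.0.0.1
# The two lines above are the whole rule. 127.0.0.1 is only the example
# address used in this article; replace it with the IP address you want to block.
# Assumption: on Apache 2.4, mod_access_compat must be loaded for these
# directives to be recognised.
```

The comment lines (those beginning with "#") are optional and are ignored by the server.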
The first line has the effect that if the web server encounters a request that matches any Deny rule, it will deny the request. If the request does not match any Deny rule, it will be allowed. This is generally the behaviour that most people want for the normal web directories on their site.
The second line sets the rule that if a request comes from the IP address "127.0.0.1", the web server is to deny the request. The program making that request will receive the "Forbidden" error instead of the normal page at that address.
If you have more than one IP address to block, just add another "Deny from" line with that IP address underneath. For example, if you also want to block "192.168.1.1" in addition to "127.0.0.1", the code to use is as follows:
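A sketch of what that would look like, assuming the same "Order" and "Deny" syntax as before, with one "Deny from" line per address:

```apache
Order Deny,Allow
Deny from 127.0.0.1
Deny from 192.168.1.1
# Both addresses are just the examples from this article.
# Each extra address to block gets its own "Deny from" line.
```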
You may add as many IP addresses as you wish, although if your .htaccess file becomes very large, your site may become sluggish due to the number of rules the server has to process each time it has to deliver your site's pages.
To block a bot by a user agent string, look for a part of the user agent string that is unique to that robot and that contains ordinary letters of the alphabet with no spaces, slashes or punctuation marks (unless you are familiar with regular expressions).
For example, if you are planning to block a robot that has this user agent string, "SpammerRobot/5.1 (+http://www.example.com/bot.html)", and you decide that the portion "SpammerRobot" is unique to this robot, add the following lines to your .htaccess file.
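Based on the explanation given further down, the three lines would look something like this sketch. "BrowserMatchNoCase" is provided by Apache's mod_setenvif module, which I am assuming is enabled on your host, and "bad_bot" is simply the made-up variable name used in this article.

```apache
BrowserMatchNoCase SpammerRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
# "SpammerRobot" is the example user agent fragment from this article and
# "bad_bot" is an arbitrary environment variable name; change the former
# to match the bot you actually want to block.
# Assumption: mod_setenvif is enabled (needed for BrowserMatchNoCase).
```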
As in the case of blocking by IP address, add the lines to the end of the file if you already have an existing .htaccess file. Otherwise, type the lines into your blank document. You should of course change "SpammerRobot" to the actual user agent you want to block.
The first line tells the web server to check the user agent string of the program making the request. If the user agent string contains the word "SpammerRobot", it will set an "environment variable" (a sort of internal flag used by the server) called bad_bot. Note that the word "SpammerRobot" can be in any mixture of capital (uppercase) or small (lowercase) letters. If you only want to match the exact case, use BrowserMatch instead of BrowserMatchNoCase. In addition, I simply made up the name "bad_bot" for the purpose of this article. You can call your environment variable some other name if you wish, although if you're not familiar with the rules for naming such variables, just accept "bad_bot".
The second line has already been explained above, in the section on blocking by IP address.
The third line tells the server to deny the request if it finds that an environment variable called "bad_bot" has been set.
To add more user agent strings to your block list, just add another "BrowserMatchNoCase" line. For example, if you want to block "SecurityHoleRobot" in addition to "SpammerRobot", the lines to use are:
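A sketch of those lines, assuming both robots set the same "bad_bot" variable so that a single "Deny from env" line covers them:

```apache
BrowserMatchNoCase SpammerRobot bad_bot
BrowserMatchNoCase SecurityHoleRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
# Either user agent fragment sets bad_bot, so one Deny line is enough.
# Both robot names are invented examples.
```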
(Before you ask, "SpammerRobot" and "SecurityHoleRobot" are just names I invented for this article. As far as I know, these robots don't exist.)
Note that your .htaccess file can contain block rules for both user agents and IP addresses. Just put them all in the same file. An example of a .htaccess file with rules to block both by IP address and user agent strings is as follows:
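Putting the earlier examples together, a combined file might look like the sketch below. The addresses and robot names are the same made-up examples used throughout this article, and the same assumptions about your server apply (mod_setenvif enabled, and mod_access_compat loaded if your host runs Apache 2.4).

```apache
BrowserMatchNoCase SpammerRobot bad_bot
BrowserMatchNoCase SecurityHoleRobot bad_bot
Order Deny,Allow
Deny from 127.0.0.1
Deny from 192.168.1.1
Deny from env=bad_bot
# One Order line covers every Deny rule in the file, whether the rule
# blocks by IP address or by the bad_bot environment variable.
```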
There's no need to repeat the "Order Deny,Allow" line when you combine the rules.
Once you've finished with blocking unwanted bots in your .htaccess file, save the file. If you are using Notepad, and are creating a new document, remember to save the file as ".htaccess", including the quotation marks. Otherwise, you will encounter the problem of Notepad adding a .txt extension to your filename.
Then upload the file to your web server using an FTP/SFTP program (or with your web host's control panel). If you want to use an FTP program, and don't know how to do so, check out my tutorial on How to Upload a File to Your Website Using the FileZilla FTP Client.
Properly implemented, the method described in this article will allow you to block specific bots from accessing your website by either their IP address or their User Agent string.
This article can be found at https://www.thesitewizard.com/apache/block-bots-with-htaccess.shtml
Copyright © 2008-2020 by Christopher Heng. All rights reserved.
This article is copyrighted. Please do not reproduce or distribute this article in whole or part, in any form.