In some cases, robots can go berserk and cause performance issues with an Apache server. To some extent, robots can be helpful as they are used by search engines to index a world-visible site.
But “bad robots” can be blocked.
robots.txt
Robots that are less bad will pay attention to a file named “robots.txt”, which is a plain text file in the root of each virtual web server (i.e. the URL for it would be https://your.server.name/robots.txt). Its contents look something like :-
User-agent: BadBot
Disallow: /

User-agent: NotQuiteSoBadBot
Disallow: /infinite-depth
The “User-agent” line specifies which robots the following directive(s) apply to. A “User-agent: *” would apply to all robots (but not web browsers piloted by people); you can also limit access to certain parts of the web-site – as shown by “/infinite-depth” in the second example.
Some automatically created sub-sites will actually be of infinite depth which is why robots.txt was invented in the first place.
But we’re probably more interested in the really bad robots that ignore robots.txt.
Blocking Robots
This method involves changing the Apache configuration files (at least the way I’ll show you). The first step is to check the file /etc/sysconfig/apache2 to verify that it contains :-
APACHE_MODULES="actions alias auth_basic authn_file authz_host authz_groupfile authz_core authz_user autoindex cgi dir env expires include log_config mime negotiation setenvif ssl socache_shmcb userdir reqtimeout authn_core headers"
In particular, check that the list includes “setenvif” so that the module gets enabled. If it doesn’t, you will have to add it and restart the Apache service (a full restart, and not just apachectl graceful) :-
systemctl restart apache2.service
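The check for “setenvif” can also be scripted rather than done by eye. What follows is a minimal sketch, not a definitive tool: the function name has_apache_module is made up for illustration, and it assumes the module list sits on a single, uncommented APACHE_MODULES="..." line as in the example above.

```shell
# has_apache_module FILE MODULE  (hypothetical helper, not part of Apache)
# Succeeds if MODULE appears as a whole word in the APACHE_MODULES list.
# Assumes a single uncommented APACHE_MODULES="..." line in FILE.
has_apache_module() {
    # Extract the text between the double quotes on the APACHE_MODULES line.
    mods=$(sed -n 's/^[[:space:]]*APACHE_MODULES="\(.*\)".*/\1/p' "$1")
    # Pad with spaces so only complete module names match.
    case " $mods " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (left commented out so defining the function touches nothing):
# has_apache_module /etc/sysconfig/apache2 setenvif || echo "setenvif is missing"
```

Matching on whole words matters here because a plain substring search for “auth” would be satisfied by “auth_basic”.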
The next step is to add configuration for the setenvif module to add a “tag” to requests with the relevant User-Agent value. Within the file server-tuning.conf is a section guarded by “<IfModule mod_setenvif.c>”. Add to that section something like :-
BrowserMatchNoCase "Bytespider" bad_bot
BrowserMatchNoCase "Bytedance" bad_bot
BrowserMatchNoCase "AhrefsBot" bad_bot
BrowserMatchNoCase "PetalBot" bad_bot
BrowserMatchNoCase "SemrushBot" bad_bot
This list can be modified of course; all you need is a unique string found within the robot’s “User-Agent” string to use in place of “Bytespider”.
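BrowserMatchNoCase treats its first argument as a regular expression and matches it case-insensitively against the User-Agent header. Before adding a new pattern, it can be handy to try it against a few User-Agent strings taken from the access log. The sketch below approximates that matching with grep; the names BAD_BOT_PATTERNS and is_bad_bot are invented for illustration, and grep’s extended regular expressions are not identical to the ones Apache uses, although for plain strings like these the result should be the same.

```shell
# The same strings as in the BrowserMatchNoCase lines, joined with "|".
BAD_BOT_PATTERNS='Bytespider|Bytedance|AhrefsBot|PetalBot|SemrushBot'

# is_bad_bot USER_AGENT  (hypothetical helper)
# Succeeds if USER_AGENT matches any pattern, ignoring case.
is_bad_bot() {
    grep -qiE -- "$BAD_BOT_PATTERNS" <<EOF
$1
EOF
}

# Example use (commented out):
# is_bad_bot "Mozilla/5.0 (compatible; AhrefsBot/7.0)" && echo "would be tagged bad_bot"
```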
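The next step is to add the following to each virtual server. Because it applies to the whole virtual server, it is quite likely there is already a section like this, so you may have to adapt the configuration example. Note that “Order” and “Deny from” are the older Apache 2.2 style of access control; on Apache 2.4 they are provided by the access_compat module, so if Apache rejects them as invalid commands, that module is probably not loaded :-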
<Location "/">
Order deny,allow
Deny from env=bad_bot
</Location>
The final step is to restart Apache, but first it is helpful to check that the configuration is at least syntactically valid :-
# apachectl configtest
Syntax OK
Running apachectl graceful performs the same check before it does anything else; it will return nothing at all if there is nothing wrong, and will return some useful information if there’s an error :-
# apachectl graceful
AH00526: Syntax error on line 232 of /etc/apache2/apache2.conf:
Invalid command 'BlahBlah', perhaps misspelled or defined by a module not included in the server configuration
Action 'graceful' failed.
The Apache error log may have more information.
Once finished, you can restart Apache with apachectl graceful (which is intentionally non-disruptive as existing sessions remain running with the old configuration), but if you have been messing with which modules are loaded, you may need to restart using systemctl :-
systemctl restart apache2.service
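If you find yourself repeating the check-then-reload sequence while experimenting, the two commands can be chained so that a reload is only attempted when the syntax check passes. This is only a sketch: the function name safe_reload and the APACHECTL override variable are my own inventions, there so the function can be tried out without touching a live server.

```shell
# safe_reload  (hypothetical helper)
# Runs "apachectl configtest" and only then "apachectl graceful".
# Set APACHECTL to use a different command in place of apachectl.
safe_reload() {
    ctl="${APACHECTL:-apachectl}"
    if "$ctl" configtest; then
        "$ctl" graceful
    else
        echo "configtest failed; not reloading" >&2
        return 1
    fi
}

# Example use, as root (commented out):
# safe_reload
```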
Testing
There is no point in changing configuration if you are not going to test the changes – it certainly took me an attempt or six to get the right stuff in the right place! The key is to have a browser that will let you change the “User-Agent” string, which is apparently something often available in ordinary browsers via the “Developer Tools”.
However, for better or worse, I did my testing with curl :-
» curl -H "User-Agent: Bytespider" http://brucellosis-ice.nss.eps.is.port.ac.uk/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>
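When testing several User-Agent strings in a row, the HTML body is just noise and the HTTP status code is all that matters. A small wrapper around curl, sketched here with the made-up name http_status, prints only the code; it relies on curl’s -A option to set the User-Agent and -w '%{http_code}' to print the status.

```shell
# http_status URL USER_AGENT  (hypothetical helper)
# Prints only the HTTP status code returned for the given User-Agent.
http_status() {
    # -s: no progress meter, -o /dev/null: discard the body,
    # -w: print the status code, -A: set the User-Agent header.
    curl -s -o /dev/null -w '%{http_code}' -A "$2" "$1"
}

# Example use (commented out; substitute your own server name):
# http_status https://your.server.name/ "Bytespider"
# http_status https://your.server.name/ "Mozilla/5.0"
```

If the block is working, the first call should print 403, while the second should print whatever the site normally returns to an ordinary browser.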