Guide to using the robots.txt file, which allows us to tell Google how it should interact with our content, and which pages it can index.
What is robots.txt?
It is a simple text file placed in the root of our website, so it is always found at the path /robots.txt directly under the domain.
This file “suggests” to the search engines (i.e. Mr. Google) how they should interact with the content of our website: which URLs they can crawl and index and which they cannot, since by default Google will index ALL the content it finds.
On the other hand, be careful, because this file is public! Suppose you have a private page that you don’t want to be indexed. If we put it in robots.txt, it is true that robots will stay away from it (although Google can still list the bare URL if other sites link to it), but it is also true that we are placing it in a file that anyone can see. In fact, many attackers start by looking there, to see if they can find something interesting to attack. So let’s not use robots.txt to hide information, but simply to prevent useless content from being indexed.
User agents (robots)
The first and most typical thing to do is to indicate the “User-agent”, that is, the robot or search engine we want to target, which is what gives the “robots.txt” file its name. If we want to target any robot, we simply use an asterisk: “User-agent: *”.
This tells robots that the rules below apply to all of them, whether they are from Google, Bing, Yahoo, or anyone else. But if we want to target one in particular, we have to use its name. In the case of Google, it is Googlebot.
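As a minimal sketch, a file with one group for every robot and another just for Google's main crawler could look like this. The directory names are hypothetical placeholders, not taken from any real site.

```txt
# Group for every robot that has no more specific group
User-agent: *
Disallow: /example-directory/

# Group only for Google's main crawler
User-agent: Googlebot
Disallow: /another-directory/
```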
There are many more bots than we might think. To give you an idea, Google alone already had nine at the time of writing: for mobile, for video, for images, and so on. Google publishes the full list in its crawler documentation.
Let’s look at the two most typical directives: Allow and Disallow.
Allow and Disallow in robots.txt
For example, if we do NOT want a particular directory to be indexed, we add a “Disallow” line with its path.
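A minimal sketch of such a rule, assuming a hypothetical directory called /private-directory/:

```txt
User-agent: *
# Keep robots out of this directory and everything beneath it
Disallow: /private-directory/
```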
This prevents robots from crawling that URL and everything beneath it. A very typical case is to block the administration folders.
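For WordPress, the rule might look like the sketch below. It assumes the two folders meant here are /wp-admin/ and /wp-includes/, the usual candidates; check your own installation before copying it.

```txt
User-agent: *
# Assumed folder names for a standard WordPress installation
Disallow: /wp-admin/
Disallow: /wp-includes/
```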
These two folders come with every WordPress installation. On the other hand, we should never block the /wp-content/ folder, as this is normally where the images are stored, and we want them to be indexed.
The two most extreme uses of “Disallow” are leaving it empty, which allows access to everything, and setting it to a single slash, “Disallow: /”, which allows access to NOTHING (beware that this makes us disappear from Google).
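Written out, the two extremes differ by a single character, so they are worth double-checking. First, an empty Disallow, which blocks nothing:

```txt
User-agent: *
Disallow:
```

And a single slash, which blocks the whole site:

```txt
User-agent: *
Disallow: /
```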
On the other hand, we also have the Allow directive, which lets us permit a specific subdirectory or file inside a directory that we have disallowed. So, within the disallowed directory, we can create a one-off exception that is still indexed, if we want.
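A sketch of such an exception, assuming a hypothetical /private-directory/ that contains one page we do want robots to reach:

```txt
User-agent: *
Disallow: /private-directory/
# For Google, the longer and more specific path takes precedence
Allow: /private-directory/public-page.html
```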
Hierarchy of instructions
There is a certain hierarchy of instructions that depends on how specific each “User-agent” is. If several groups of rules could apply to the same robot, the most specific one wins. For example, suppose the file has a first group for “User-agent: *” and a second group for “User-agent: Googlebot”, and their rules contradict each other.
In this case, Google could match the first group, since the asterisk addresses all bots, but also the second, which names it specifically. As the rules contradict each other, the second one wins because it is more specific: Googlebot follows only that group and ignores the first.
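A sketch of two contradictory groups like the ones described, with rules chosen only for illustration:

```txt
# Group 1: applies to robots with no more specific group
User-agent: *
Disallow: /

# Group 2: Googlebot follows this group and ignores group 1
User-agent: Googlebot
Disallow:
```

With this file, Googlebot may crawl the whole site, while robots that only match the asterisk are kept out of everything.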
Alternative: robots meta tag
An interesting alternative to the robots file is the robots meta tag, which instead of going in the robots file, goes at the page level:
<meta name="robots" content="noindex">
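Unlike the robots.txt file, this is not listed anywhere public, but lives in the code of each page. Apart from the “noindex” value, we can also use “nofollow”, so that the links on that page are not followed. Keep in mind that a robot can only read this tag if it is allowed to crawl the page, so a page that relies on “noindex” should not also be blocked in robots.txt.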
Wildcards in robots.txt
Another very useful feature of the robots.txt file is the possibility of using wildcard characters. The asterisk * stands for any sequence of characters, so if we want to block all URLs containing a question mark, we write “Disallow: /*?”.
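As a complete group, the rule could be sketched like this:

```txt
User-agent: *
# * matches any sequence of characters, so this covers every URL containing a ?
Disallow: /*?
```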
This is useful if we do not want URLs with parameters, such as searches, comments, or custom campaigns, to be indexed. In the specific case of WordPress search, the “s” parameter is used, so we could be more specific and block only URLs containing “?s=”.
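A sketch of the narrower rule; the search term in the comment is a made-up example:

```txt
User-agent: *
# Matches URLs whose query string starts with the s parameter, e.g. /?s=shoes
Disallow: /*?s=
```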
This keeps robots away from every search page, even if there is an internal or external link to it, and thus avoids duplicate content problems. Another wildcard character is the dollar sign $, which marks the end of the URL, so a rule only matches URLs that end with a certain string. For example, if we want to block all files ending in .php, we combine both wildcards: “Disallow: /*.php$”.
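Written as a full group, that rule might look like this:

```txt
User-agent: *
# * matches any path, and $ anchors the match to the end of the URL
Disallow: /*.php$
```

Without the $, the rule would also match URLs that merely contain “.php” somewhere in the middle, such as a .php file followed by parameters.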
Obviously, we could do this with any extension or string, which lets us make sure that files we are not interested in stay out of the index.