Mistakes to avoid while creating robots.txt
What is the Robots.txt file?
A robots.txt file is, as you may guess from its *.txt extension, a plain text file. It gives instructions to the search engine spiders from Google, Bing, Yahoo, Yandex, and many other search engines or bots which crawl your website for various reasons. Usually, website administrators use a robots.txt file for search engine optimization and also to keep out bots which are not good for your website or your business.
Now, let's try to make it simpler with the following definition of ours.
"The robots.txt file is a file read by robots over the web, including search engines, which tells them which pages or URLs should be crawled and indexed and which shouldn't be".
Let's explain with a simple example. Shall we?
There are typically two kinds of URLs you would want to keep out of search results:
01. You have a /deactivateAccount URL which should be available only to your customers, who can deactivate their account only after authenticating successfully, so it would be irrelevant to show the URL in search results. Only URLs like https://example.com/, https://example.com/login, and https://example.com/shop are the intended URLs to be crawled and indexed by search engines, as they drive the business.
02. Sensitive URLs like /admin, /admin.php, /adminLogin.asp, etcetera. We hope you get the idea of what we are trying to communicate here.
Now that you have got the hang of what robots.txt basically does, let us take a deeper look at the syntax.
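The example file itself is not reproduced in the source, so what follows is our reconstruction: the directives are the standard User-agent and Disallow ones, the paths come from the surrounding discussion, and /uploads/ is an assumed name for the upload folder mentioned later.

```txt
# "*" applies the rules below to every crawler that honours robots.txt
User-agent: *

# Account-management page, irrelevant to search results
Disallow: /deactivateAccount

# Administrator login page (we will see shortly why this line is a mistake)
Disallow: /myAdmin.aspx

# Upload folder containing sensitive files (also a mistake, as discussed below)
Disallow: /uploads/
```

Each Disallow line asks compliant crawlers not to fetch URLs beginning with that path; an empty value (Disallow: with nothing after it) would allow everything.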
Where is the problem?
Let us investigate what is wrong in the above example. At first glance, everything may look completely right: we are segregating the URLs which shouldn't be indexed by search engines because they are irrelevant. For instance, website administrators will not go to Google to search for the URL to log in. They know the URL, or they probably have the link in their email or chat transcripts.
Without further ado, let's take a look at the first rule in the file.
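The snippet itself is not reproduced in the source; based on the surrounding discussion, the first rule is presumably:

```txt
Disallow: /deactivateAccount
```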
Is there a problem here?
Well, this rule appears to be fine, because it points to an internal URL that is accessible only to logged-in users anyway. Are we on the same page here?
All right, let's proceed to the second and third rules of the robots.txt file in our example.
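Again, the snippet is not reproduced in the source; judging from the discussion that follows, the rules are presumably along these lines, with /uploads/ standing in as an assumed name for the upload folder:

```txt
Disallow: /myAdmin.aspx
Disallow: /uploads/
```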
Is there a problem here?
Well, probably not, you may think. Let us help you by explaining why we see it as a problem.
/myAdmin.aspx appears to be a login interface reserved for administrators, who have the highest privileges, so the URL should remain secret and not be disclosed via any files such as robots.txt or even sitemap.xml. Disclosing a sensitive URL like this is a security vulnerability: someone may attack the authentication behind the login interface. A motivated attacker or a malicious hacker may try brute-forcing the login credentials and gain access to the whole application's data, which could be crucial to your business and your users.
So how do we protect it?
01. Do not add any sensitive URLs to the robots.txt file
The first thing to do is stop adding sensitive URLs, such as the admin login page or an upload folder that holds sensitive files, to robots.txt. Better still, do not keep those sensitive URLs publicly accessible across the internet at all. We recommend a whitelist of IP addresses and two-factor or multi-factor authentication in order to gate access to the sensitive data. For anything that appears to be sensitive, think of security measures.
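As a sketch, a safer version of the example file keeps the irrelevant URL but omits the secret paths entirely; /myAdmin.aspx comes from the discussion above, and the upload-folder name is illustrative:

```txt
User-agent: *

# Fine to list: not secret, merely irrelevant to search results
Disallow: /deactivateAccount

# The admin login (/myAdmin.aspx) and the upload folder are deliberately
# NOT listed here; hide them behind an IP whitelist and multi-factor
# authentication instead of disclosing them in a public file
```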
02. Do not use default robots.txt file from third-party sources (For instance, WordPress)