
Mistakes to avoid while creating robots.txt

What is the Robots.txt file?

A robots.txt file is a plain text file, as the .txt extension suggests. It gives instructions to the crawlers (spiders) of Google, Bing, Yahoo, Yandex, and other search engines, as well as to other bots that crawl your website for various reasons. Website administrators usually use a robots.txt file for search engine optimization and to turn away bots that are not good for the website or the business. Keep in mind that the file is only a request: well-behaved crawlers honor it, but nothing forces a bot to obey it.

Now, let's make it simpler with a definition of our own.

"The robots.txt file is a file that robots on the web, including search engines, read to learn the rules they should follow: which pages or URLs may be crawled and indexed, and which may not."

Let's explain with a simple example. Shall we?

A robots.txt file has the following simple syntax:

User-Agent: *

Disallow: /url_path_here

Disallow: /directory_path_here

 

01. You have a /deactivateAccount URL that should be available only to your customers, who can deactivate their account only after authenticating successfully. Showing this URL in search results would be irrelevant. Only URLs such as https://example.com/, https://example.com/login, and https://example.com/shop are meant to be crawled and indexed by search engines, because they drive the business.

02. You have sensitive URLs such as /admin, /admin.php, /adminLogin.asp, and so on. We hope this gives you an idea of what we are trying to communicate here.

Now that you have got the hang of what robots.txt basically does, let us take a deeper look at the syntax.

Disallow: /deactivate-account

Disallow: /myAdmin.aspx

Disallow: /uploads/results/health-reports-database
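To see how a well-behaved crawler would read rules like these, here is a minimal sketch using Python's standard urllib.robotparser module. The "User-agent: *" line, the example.com URLs, and the "ExampleBot" crawler name are assumptions added for illustration; they are not part of the example above.

```python
from urllib.robotparser import RobotFileParser

# The three Disallow rules from the example. The "User-agent: *" line is an
# assumption added here so that the rules belong to a group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /deactivate-account
Disallow: /myAdmin.aspx
Disallow: /uploads/results/health-reports-database
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; it falls under the "*" group.
for url in (
    "https://example.com/shop",
    "https://example.com/deactivate-account",
    "https://example.com/myAdmin.aspx",
):
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "disallowed"
    print(f"{url}: {verdict}")
```

Note that this only models a polite crawler. A bot that ignores robots.txt can still request every one of these URLs.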

Where is the problem?

Let us investigate what's wrong with the above example. At first glance, everything may look right: we are separating out the URLs that shouldn't be indexed by search engines because they are irrelevant to people searching. For instance, website administrators will not go to Google to find the URL of their login page. They already know the URL, or they have the link in their email or chat transcripts.

Without further ado, let's take a look at the first line, which is:

Disallow: /deactivate-account

Is there a problem here?

Well, this rule appears reasonable, because it covers an internal URL that is accessible only to logged-in users. Are we on the same page here?

All right, let's proceed to the second and third lines of the robots.txt file in our example.

Disallow: /myAdmin.aspx

Disallow: /uploads/results/health-reports-database

Is there a problem here?

At first glance, probably not. Let us explain why we see it as a problem.

/myAdmin.aspx appears to be a login interface meant only for administrators, who have the highest privileges. Such a URL should be kept secret and not disclosed through files such as robots.txt or sitemap.xml. Remember that robots.txt is publicly readable: anyone can open https://example.com/robots.txt and see every path listed in it. Disclosing a sensitive URL this way is a security weakness, because someone may attack the authentication through the login interface. A motivated attacker may try brute-forcing the login credentials and gain access to the whole application's data, which could be crucial to your business and your users. The same goes for /uploads/results/health-reports-database, which tells everyone exactly where sensitive files are stored.
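Because anyone can read robots.txt, it can help to review your own file the way an attacker would. The following is a minimal sketch that lists the Disallow paths and flags those that look sensitive. The keyword list in SENSITIVE_HINTS is a hypothetical starting point, not an authoritative rule set, and the "User-agent: *" line is an assumption added to the example rules.

```python
# Hypothetical keywords that suggest a path may be sensitive. Adjust for your site.
SENSITIVE_HINTS = ("admin", "login", "upload", "database", "backup")


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow value found in the robots.txt text."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


def flag_sensitive(paths, hints=SENSITIVE_HINTS):
    """Return the paths that contain any of the hint keywords (case-insensitive)."""
    return [p for p in paths if any(h in p.lower() for h in hints)]


example = """\
User-agent: *
Disallow: /deactivate-account
Disallow: /myAdmin.aspx
Disallow: /uploads/results/health-reports-database
"""

for path in flag_sensitive(disallowed_paths(example)):
    print(f"Review before publishing: {path}")
```

A simple keyword match like this will miss cleverly named paths and may flag harmless ones, so treat the output as a prompt for manual review.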

So how do we protect it?

01. Do not add any sensitive URLs in the robots.txt file

The first thing to do is to stop listing sensitive URLs, such as the admin login page or an upload folder that holds sensitive files, in robots.txt. More importantly, do not leave those sensitive URLs publicly accessible on the internet in the first place. We recommend an allowlist of IP addresses together with two-factor or multi-factor authentication for access to sensitive data. For anything that appears to be sensitive, think about security measures.
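As one illustration of the IP allowlist idea, here is a minimal sketch using Python's standard ipaddress module. The networks in ADMIN_ALLOWLIST are reserved documentation ranges used as placeholders, and the function name is hypothetical. In a real deployment this check usually lives in the web server, firewall, or framework middleware, alongside multi-factor authentication rather than instead of it.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges). Replace with your own.
ADMIN_ALLOWLIST = ("203.0.113.0/24", "198.51.100.7/32")


def is_ip_allowed(client_ip: str, allowlist=ADMIN_ALLOWLIST) -> bool:
    """Return True only if client_ip parses and falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed input is rejected, never trusted
    return any(address in ipaddress.ip_network(net) for net in allowlist)


for ip in ("203.0.113.42", "198.51.100.8", "not-an-ip"):
    print(ip, "->", "allow" if is_ip_allowed(ip) else "deny")
```

Denying by default when the address cannot be parsed keeps a malformed header or proxy value from slipping past the check.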

02. Do not use the default robots.txt file from third-party sources (for instance, WordPress)

 

Further Reading

http://www.robotstxt.org/

https://neilpatel.com/blog/robots-txt/

 

 

 

 

 

Santhosh Tuppad

Santhosh Tuppad has played different roles in his life which include being an entrepreneur, liar, lover, boyfriend, husband, thief, passionate software tester, blogger, reader, trainer, coach, black-hat hacker, white-hat hacker, grey-hat hacker and what not. In this amazing journey of life, he has experienced his salvation. Not to forget that, “Salvation comes at a price” and of course he has paid that price. Before he was known for being merciless, ruthless, unkind, evil, etc. And today he is known for kindness, humbleness, and some people call him “Privacy Fighter”.
