It takes more than keywords and backlinks – both on-page and off-page – to start ranking a site and get above the fold of search results. Often, knowing robots.txt best practices – the dos and don'ts of this small but critical file – makes the difference.
For obvious reasons, it's important to control what information a search engine can view, how it's viewed, and the speed at which it can crawl the content of your website. All of this passes through a gatekeeper of your website, otherwise known as your robots.txt file.
The internal SEO strategies that surround index optimization and crawl optimization depend on the robots.txt file of your website, and we’re going to go through how to maximize the use of your robots.txt file to optimize your rankings.
What is a Robots.txt File and Why Your Site Needs It
The robots.txt file – also known as the robots exclusion protocol – is a small text file and a means of crawl optimization. Think of it as an instruction manual for search engines on how to read your website.
You create this file to tell crawlers what you want them to access and crawl with the goal of improving the site’s SEO performance.
The robots.txt file lets you control which pages you want and don't want search engine crawlers to access, such as user pages or automatically generated pages. If the website doesn't have this file, search engines will proceed to crawl the entire website.
In short, this file makes the job of search engine crawlers easier and faster.
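To make this concrete, here's a minimal sketch of what a robots.txt file can look like. The paths and domain are illustrative, but every directive shown is standard:

```text
# robots.txt — lives at the root of the domain,
# e.g. https://www.example.com/robots.txt
User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap.xml
```

The `User-agent` line says which crawlers the group applies to (`*` means all of them), each `Disallow` line names a path prefix those crawlers should skip, and the optional `Sitemap` line points crawlers at your XML sitemap.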
So, how do you maximize the robots.txt file to better optimize a website? Here are the best practices to get you started:
1. Identify Pages You Don’t Want to Be Crawled
There's a reason why you need to identify pages that you don't want crawled, and it has to do with the crawl rate limit Google sets. Simply put, if a website has too many server errors, Google crawls that site less.
One thing you need to understand about the robots.txt file is that it controls crawling, not indexing. Disallowing a page stops search engines from fetching it, but it does not, by itself, determine whether that page appears in search results.
Pages that you don’t want to be crawled are those that are likely to have no value to your SEO efforts:
- Login page to the site’s backend
- Thank you pages
- Pages containing sensitive information
- Website search pages
- Category pages
A common misconception among SEO specialists is that using Disallow prevents search engines from indexing the page.
One of the robots.txt best practices is to add a noindex directive for pages that you don’t want to be indexed by search engines, or indicate these pages as noindex in your meta robots tag instead.
However, if you add a noindex tag to these pages, make sure they’re not disallowed in your robots.txt file—otherwise, search engines won’t be able to crawl these pages and your noindex tag will be ineffective.
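As a sketch of that approach, this is what the meta robots tag looks like on a page you want kept out of the index (the page itself is hypothetical):

```html
<!-- In the <head> of the page you want de-indexed.
     The page must NOT be disallowed in robots.txt,
     or crawlers will never see this tag. -->
<meta name="robots" content="noindex">
```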
2. Robots.txt Best Practices
There are cases when Disallow may affect pages that have partial matches. This means any URLs that begin with the same path will be affected by the directive on your robots.txt. If you want to add a URL to Disallow in your robots.txt file, you have to be as specific as possible.
Here’s an example:
- User-agent: *
- Disallow: /category
This type of directive may unintentionally prevent crawlers from accessing any URL that begins with /category – for example, /category-news or /category-archive. To ensure this doesn't happen and search bots can still crawl important pages, include the specific URL path – such as a trailing slash (/category/) – in the Disallow line.
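You can check this prefix-matching behavior yourself with Python's standard-library robots.txt parser. The domain and paths below are illustrative:

```python
# Demonstrates robots.txt prefix matching with Python's standard library.
from urllib.robotparser import RobotFileParser

broad = RobotFileParser()
broad.parse("""
User-agent: *
Disallow: /category
""".splitlines())

# "Disallow: /category" matches every path that *starts with* /category,
# so an unrelated page gets blocked too:
print(broad.can_fetch("*", "https://www.exampledomainlink.com/category-news"))   # False

specific = RobotFileParser()
specific.parse("""
User-agent: *
Disallow: /category/
""".splitlines())

# With the trailing slash, only URLs under the /category/ directory are blocked:
print(specific.can_fetch("*", "https://www.exampledomainlink.com/category-news"))  # True
print(specific.can_fetch("*", "https://www.exampledomainlink.com/category/page"))  # False
```

Note that the stdlib parser does literal prefix matching; it's a handy sanity check, though Google's own matching additionally supports wildcards.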
3. Check for Errors, Typos, and Mis-Capitalization
Always double-check your robots.txt file for typos. A misspelled word, an incorrect character, or a spacing issue can break your robots.txt file and prevent search engines from crawling the site properly.
You also need to be mindful of letter case, because path matching is case-sensitive. For example, adding /seo under a Disallow directive won't automatically disallow /SEO as well. You need to put them on separate Disallow lines, like this:
- Disallow: /seo
- Disallow: /SEO
If your robots.txt file isn't well formatted or contains errors, search engine crawlers may not understand the contents of the file. When this happens, they may ignore the configurations you made and crawl the website anyway.
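The case-sensitivity point can also be verified with Python's standard-library parser; the paths are illustrative:

```python
# Shows that robots.txt path matching is case-sensitive.
from urllib.robotparser import RobotFileParser

one_case = RobotFileParser()
one_case.parse("""
User-agent: *
Disallow: /seo
""".splitlines())

print(one_case.can_fetch("*", "https://www.example.com/seo"))  # False — blocked
print(one_case.can_fetch("*", "https://www.example.com/SEO"))  # True — NOT blocked

both_cases = RobotFileParser()
both_cases.parse("""
User-agent: *
Disallow: /seo
Disallow: /SEO
""".splitlines())

# With both spellings listed, both variants are blocked:
print(both_cases.can_fetch("*", "https://www.example.com/SEO"))  # False
```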
4. Avoid Using Robots.txt To Hide Duplicate Content
Duplicate content is a big issue for SEO, and it's a common problem that every SEO expert will face at least once in their career. The trouble with duplicate content is that it confuses search engines, which then can't make the optimal decision about which version to display for a query.
When it comes to dealing with duplicate content, some SEO specialists suggest using the robots.txt file to prevent search engines from crawling pages with duplicate content. But, this is an outdated tactic.
The appropriate way to deal with duplicate content is to use the rel="canonical" link tag.
If you use the robots.txt file to block crawlers from accessing pages with duplicate content, other sites that already link to those pages will not be able to pass any authority or signal of "importance" – otherwise known as their link juice.
In this case, rel="canonical" is the more viable option, as it consolidates link signals from other sites onto the preferred page.
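As a sketch, the canonical tag goes in the head of each duplicate page and points at the version you want indexed (the URL below is illustrative):

```html
<!-- In the <head> of each duplicate page, pointing at the
     preferred version that should receive the link signals. -->
<link rel="canonical" href="https://www.example.com/preferred-page/">
```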
5. Keep RankBrain in Mind

RankBrain is Google's learning system based on artificial intelligence and machine learning. As RankBrain is just one of the algorithms that measure the value of your site for display on SERPs, it depends on the relevancy of the text content being loaded. What makes it far more advanced than previous algorithms is that it also takes the interactivity of your website into account.
6. Be Careful of Conflicting Directives
One of the rookie mistakes when it comes to robots.txt is having conflicting directives. In these cases, search engines will be confused about what to do with a specific URL – whether they’re allowed to access it or they’re blocked from it.
Here’s an example of conflicting directives:
- User-agent: *
- Allow: /category
- Disallow: /*.html
If this is how the directives are structured, search engines won't know whether they can access https://www.exampledomainlink.com/category.html
Another case of conflicting directives involves the user-agent – for example, when you restrict access for all user-agents but also have a specific directive for one user-agent. The right way to structure this is:
- User-agent: *
- Disallow: /sample/
- Disallow: /test-page/
- Disallow: /staging/
- User-agent: googlebot-news
- Disallow: /sample/
In this example, all crawlers except Googlebot-News are blocked from /sample/, /test-page/, and /staging/. Because Googlebot-News follows only its own group of directives, it is blocked from /sample/ alone.
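The group-resolution rule above – a crawler follows the most specific group that names it, not the "*" group – can be verified with Python's standard-library parser:

```python
# Verifies how robots.txt user-agent groups are resolved.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /sample/
Disallow: /test-page/
Disallow: /staging/

User-agent: googlebot-news
Disallow: /sample/
""".splitlines())

# Googlebot-News follows only its own group, so /test-page/ is open to it:
print(rp.can_fetch("googlebot-news", "/sample/"))     # False
print(rp.can_fetch("googlebot-news", "/test-page/"))  # True

# Every other crawler falls back to the "*" group:
print(rp.can_fetch("otherbot", "/staging/"))          # False
```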
7. Always Do a Quick Check of Your File
Once you're done creating or updating your robots.txt file, be sure to go over it or use Google Search Console's robots.txt tester. The tool behaves like Googlebot and flags syntax warnings and logic errors in your file. It also checks whether the URL paths under a Disallow directive are blocked properly.
Note that the robots.txt tester only checks files against Google user-agents. Other tools you can use to check your robots.txt file include Screaming Frog's SEO Spider and SeoBook's Robots.txt Analyzer.
Last but not least, your robots.txt file should sit in your top-level directory. The filename must be all lowercase – robots.txt, not Robots.txt – as it is case-sensitive, and each subdomain should have its own robots.txt file.
This is basic robots.txt hygiene at its best.
Do you really need to have a robots.txt file?
It's better to have one on the website than not at all. With a robots.txt file, you can control, to a certain extent, how search engines treat your site.
Remember that when it comes to SEO, every aspect of the website matters in how it can perform. More importantly, how you optimize these aspects can be a key factor in making the website outrank the competition.
If you have more SEO-related questions, visit our SEO FAQ 😉
Itamar Gero is the founder and CEO of SEOReseller.com, a white label SEO services and digital marketing solutions provider that empowers agencies—and their local business clientele—all over the world. When he isn’t working, he’s traveling the world, meditating, or dreaming (in code).