Making Sense of Google’s Robots.txt Contradictions

If you’re in SEO these days, you may be suspicious of Google, but you still have to keep up with what they tell the webmaster community so that you can give your clients the best recommendations. One of the main places I go to follow this information is Barry Schwartz’s site, Search Engine Roundtable. During one of my reading sessions last week, I saw Barry’s article about John Mueller’s Google Webmaster Central Hangout, and I bookmarked it to watch later because the video was almost an hour long. Yesterday I finally got around to watching it, and I noticed a couple of apparent contradictions between what John said and what his colleague Gary Illyes said a month earlier in a Stack Overflow comment.

Google Robots.txt Contradiction #1

At around the 28:25 mark, John says specifically that the rule “Allow: .js” which was espoused by Gary would only affect URLs that begin with .js (for example, /.jsSampleURL). To unblock .js files, you’d need to include the wildcard operator (*) such that the rule should be “Allow: *.js”. Interestingly, Gary does mention using the wildcard later in his Stack Overflow comment, but for the “catch-all” rule, he omits it for some reason.

Which format is correct in this case?

Testing This Using Search Console

Luckily for us mere mortals who don’t have regular access to Google’s engineers, Google’s Search Console has a Robots.txt Testing Tool where you can make mock adjustments to your current robots.txt file and test that against specific URLs to see how your changes would affect the site in question.

For this test, I used the example of http://www.example.com/.jsSampleURL and http://www.example.com/sample.js.

Test #1 Rule
Disallow: .js

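In this test, Gary’s advice implies that http://www.example.com/sample.js will be blocked, while John says that you need the wildcard (*) before .js to have this effect. For his part, John said that this rule should block http://www.example.com/.jsSampleURL. Note that these tests use Disallow rather than Allow; path patterns are matched the same way for both directives, so the results apply to “Allow: .js” as well.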

Test #1 Winner – NEITHER!

The testing tool showed that *neither* URL was blocked.  Thus, both John AND Gary are partially wrong here.  Very interesting.

But what happens when you add the wildcard like John said?

Test #2 Rule
Disallow: *.js

Test #2 Winner – John Mueller!
Both URLs are now blocked!  The wildcard makes all the difference, just like John said.
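
One way to make sense of both results is to model how a path pattern is compared against a URL. The sketch below is a simplified illustration, not Google’s actual matcher: it assumes a pattern is matched from the very start of the URL path (which always begins with “/”) and that “*” stands for any run of characters. The function name rule_matches is my own, and the sketch ignores the “$” end-of-URL operator.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Simplified check: does a robots.txt path pattern match a URL path?

    Assumptions of this sketch: matching starts at the first character of
    the path, '*' matches any sequence of characters, and '$' is not handled.
    """
    # Escape each literal piece and join the pieces with '.*' for the wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # re.match only succeeds if the pattern matches at the start of the path.
    return re.match(regex, path) is not None

for pattern in (".js", "*.js"):
    for path in ("/.jsSampleURL", "/sample.js"):
        print(f"{pattern!r} vs {path!r}: {rule_matches(pattern, path)}")
```

Under these assumptions, “.js” matches neither URL because every path starts with “/” rather than “.”, while “*.js” matches both, which lines up with what the testing tool reported.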

Contradiction #2 – Not Really a Contradiction

At around the 27-minute mark, John says that the rule “Allow: *.js” would NOT allow crawling of JavaScript resources within a previously disallowed subfolder. Rather, John says, you’d have to add the rule “Allow: /blockedsubfolder/*.js”. Presumably, according to John, you’d have to add this for every single blocked subfolder in the file. Originally, when I read Barry’s article, I thought that Gary’s advice said that one simple rule (“Allow: .js”) would allow crawling of all JavaScript resources across the site. This would have created a contradiction between the two Googlers. However, in his full comment on Stack Overflow, Gary actually spells out the subfolder issue as well. Still, it is instructive to give an example to show how this works.

Testing Google’s Guidelines Again

For our purposes, I ran two tests through the Search Console testing tool using the sample JavaScript resource URL http://www.example.com/blockedsubfolder/sample.js. In both tests, the robots.txt file already disallows the subfolder /blockedsubfolder/.

Test #1 Rules
Allow: *.js

Test #2 Rules
Allow: /blockedsubfolder/*.js

One might think that Test #1 should allow Google to crawl /blockedsubfolder/sample.js because ALL JavaScript resources are explicitly allowed. However, according to both Gary and John, Test #1 would NOT allow Google to crawl the resource, because the rule disallowing the subfolder is more specific than the general “Allow: *.js” rule and therefore overrides it. Test #2 is the preferred structure, which Google says will indeed allow the resource to be crawled.
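
The “more specific rule wins” behavior can be sketched in a few lines. This is a simplified model under stated assumptions, not Google’s implementation: it treats the longest matching pattern as the most specific, lets Allow win a tie, and assumes the subfolder is blocked with a rule written as “Disallow: /blockedsubfolder/” (the exact Disallow line is not shown above, so that spelling is my assumption). The names pattern_matches and is_allowed are hypothetical helpers.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path pattern), e.g. ("allow", "*.js")

def pattern_matches(pattern: str, path: str) -> bool:
    # Match from the start of the path; '*' matches any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None

def is_allowed(rules: List[Rule], path: str) -> bool:
    # Collect (pattern length, is_allow) for every rule that matches the path.
    matching = [
        (len(pattern), directive == "allow")
        for directive, pattern in rules
        if pattern_matches(pattern, path)
    ]
    if not matching:
        return True  # no rule applies, so crawling is allowed
    # max() picks the longest pattern; on equal length, True (allow) sorts higher.
    _, allowed = max(matching)
    return allowed

path = "/blockedsubfolder/sample.js"
blocked = ("disallow", "/blockedsubfolder/")  # assumed existing rule

test_1 = [blocked, ("allow", "*.js")]
test_2 = [blocked, ("allow", "/blockedsubfolder/*.js")]

print("Test #1 allowed:", is_allowed(test_1, path))
print("Test #2 allowed:", is_allowed(test_2, path))
```

In this model, the Disallow pattern is longer than “*.js”, so it wins in Test #1, while “/blockedsubfolder/*.js” is longer than the Disallow pattern, so the Allow wins in Test #2.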

Testing Outcome – Google’s Guidelines Confirmed!

As John and Gary said, the testing tool showed that the resource would not be crawlable in Test #1, because the Disallow rule on the subfolder takes precedence over the general Allow rule. However, when the subfolder was added to the Allow rule as in Test #2, the tool showed the resource as allowed.

SEO Takeaways

There are a few things I learned from this experience:

  1. Make sure to include wildcards when using the method Gary Illyes described.
  2. Make sure to include separate “Allow” rules for each subfolder that contains the files you are trying to unblock.
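
If a site has many blocked subfolders, writing the per-folder rules by hand gets tedious. As a rough sketch of the two takeaways above, a small helper can generate one wildcard Allow line per folder. The function name allow_rules_for and the folder “/private/” are hypothetical, and the sketch assumes each folder is given with a leading and trailing slash.

```python
from typing import Iterable, List

def allow_rules_for(disallowed_folders: Iterable[str], extension: str = ".js") -> List[str]:
    """Build one 'Allow' line per blocked folder, with the wildcard included."""
    rules = []
    for folder in disallowed_folders:
        # Make sure the folder ends with a slash before appending the wildcard.
        prefix = folder if folder.endswith("/") else folder + "/"
        rules.append(f"Allow: {prefix}*{extension}")
    return rules

# "/private/" is a made-up folder used only for illustration.
print("\n".join(allow_rules_for(["/blockedsubfolder/", "/private/"])))
```

Whatever the helper produces should still be checked in the Robots.txt Testing Tool before it goes live.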

Last, and most importantly, robots.txt is a complicated business.  Make sure that you have an experienced SEO or web developer looking at these issues for you if you can’t get Google to crawl your site.

About Ari Roth

Ari Roth is a Senior Digital Marketing Manager at DriveHill Media. Ari has worked with some of the world's biggest brands throughout his career to optimize their online presence and increase their ROI. As a full-service digital marketer, Ari manages PPC budgets totaling $50,000+ per month and is well-versed in Conversion Rate Optimization, App Store Optimization (ASO), & analyzing Google Analytics data in addition to his specialization in technical SEO audits. Ari has headed the SEO audit efforts for two separate agencies by designing a comprehensive audit checklist and training junior SEOs in how to use it. When he's not working, Ari enjoys spending time with his family, reading tech industry news, and watching his beloved New York sports teams in action.
