We’ve been doing SEO for WordPress for a long time. A big part of that has always been controlling the number and quality of indexed pages, since WordPress automatically creates so many different flavors of content. If you’ve read Michael David’s book on WordPress SEO, you’ve seen his ultimate robots.txt file
https://www.tastyplacement.com/book-excerpt-the-ultimate-wordpress-robots-txt-file which goes something like this:
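The code block didn’t survive here, so below is a representative sketch of that style of aggressive WordPress robots.txt — illustrative rules in the same spirit as the linked excerpt, not a verbatim copy of it:

```
User-agent: *
# Block WordPress internals
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
# Block duplicate-content archives and feeds
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: */trackback/
# Block query-string variants
Disallow: /*?*
```

The idea was to keep the index down to real content pages and nothing else.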
Unfortunately, we’re in a post-Mobilegeddon world. Google now expects unrestricted access to every page’s resources so it can render the page in its entirety and infer the sort of experience a user would have on various mobile devices. A few weeks ago, a significant portion of the WordPress installations in the world received the Google Search Console warning:
Googlebot cannot access CSS and JS files
Some of you may be wondering why we don’t just remove all robots.txt disallow rules, stop being fussy about what’s allowed and disallowed, and let Googlebot decide what it thinks is important. For security reasons, you don’t want deep indexing of your site to be publicly searchable. For instance, the following search term gives you a list of thousands of WordPress installations which expose the highly hackable timthumb.php:
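The search term itself was lost in formatting; a query along these lines (an illustrative reconstruction, not necessarily the exact term from the original) surfaces indexed copies of the script:

```
inurl:timthumb.php
```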
Just something to think about when you assume that Google has your site’s best interests at heart.
You could go through each resource and allow the precise file paths line by line, but that would be very time-consuming.
The solution which has been going around (advocated by the likes of SEOroundtable and Peter Mahoney) is to add a few lines which explicitly allow Google’s spiders access to the resources in question:
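The snippet being passed around looks roughly like this — a sketch of the circulating advice, appended as a new group below the site’s existing `User-agent: *` rules:

```
User-agent: Googlebot
Allow: *.css
Allow: *.js
```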
#THE ABOVE CODE IS WRONG!
If you haven’t read the Google developers page on robots.txt, I highly recommend doing so. It’s like 50 Shades of Grey for nerds. The section under “Order of Precedence for User-Agents” states: “Only one group of group-member records is valid for a particular crawler . . . the most specific user-agent that still matches. All other groups of records are ignored by the crawler.” By creating a new group for Googlebot, you are effectively erasing all prior disallow commands.
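To make the precedence rule concrete, consider this sketch (illustrative paths, assuming a stock WordPress layout). Googlebot matches the more specific `User-agent: Googlebot` group and ignores the `User-agent: *` group entirely:

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

User-agent: Googlebot
Allow: *.css
Allow: *.js
# Googlebot reads ONLY this group. The two Disallow
# lines above are now invisible to it, so it is free
# to crawl /wp-admin/ and /wp-includes/.
```

Other crawlers still honor the first group, which makes the breakage easy to miss in testing.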
And conflict resolution between wildcard rules is undefined, so it’s a toss-up result for something like:
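For example, with a group like the following (hypothetical rules for illustration), both lines match a URL such as /wp-includes/js/jquery/jquery.js, and the spec doesn’t cleanly say which one wins:

```
User-agent: Googlebot
Disallow: /wp-includes/
Allow: /*.js$
```

Whether that script is crawlable depends on how the crawler happens to break the tie.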
The long and the short of it is that there is no simple cut-and-paste solution to this issue. We’re approaching it on a case-by-case basis, doing what’s necessary for each WordPress installation.
As far as keeping the indexes clean goes, we’re going to lean heavily on robots meta tags, as managed by our (still) favorite SEO plugin. Expect the role of robots.txt to be greatly reduced going forward.
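In practice, that means letting Google crawl and render everything, then controlling indexation per page. On an archive or search-results page that should stay out of the index, the plugin emits markup along these lines (illustrative output):

```
<meta name="robots" content="noindex, follow">
```

Unlike a robots.txt disallow, this lets Googlebot fetch the page’s CSS and JS for rendering while still keeping the page itself out of the index.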