Avoiding SEO duplicate content between the "master" branch URL and the actual production URL

I am running into an issue related to SEO and am wondering how other teams deal with this basic SEO problem, which is a side effect of Netlify's feature of creating URLs for all deploys (deploy previews, branch deploys…).

We want to avoid having Google crawl and index our deploy previews and master-xxx Netlify-generated URLs, to avoid duplicate content, which can have a negative impact on the site's rankings.

For deploy previews it was pretty easy: our site is generated by a Node.js build, so when rendering the HTML from the EJS template, we use this code:

<% if (environment !== "production") { %>
  <meta name="robots" content="noindex">
<% } %>

This means that for deploy-preview URLs, where the environment variable equals “deploy-preview” (hence not “production”), we output a meta robots noindex tag for Google.
GREAT!
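For reference, Netlify exposes the deploy context in a CONTEXT environment variable (“production”, “deploy-preview”, “branch-deploy”), which is one way the environment flag above could be populated during the build. A minimal sketch; the robotsMeta helper name is ours, not part of the real build:

```javascript
// Returns the robots meta tag for a given Netlify deploy context.
// Netlify sets CONTEXT to "production", "deploy-preview" or "branch-deploy";
// anything other than production gets a noindex tag.
function robotsMeta(context) {
  return context === "production"
    ? ""
    : '<meta name="robots" content="noindex">';
}

// During the build, the context comes from the environment:
const tag = robotsMeta(process.env.CONTEXT || "development");
console.log(tag);
```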

Our real issue is finding a solution for the master URLs: in that case the environment is also “production”, so the code above does not help. These are the URLs created when you merge a GitHub branch into the master branch: Netlify automatically creates TWO sites, one at https://master--zen-freedom-2b123.netlify.com/fr (the master branch deploy, which should not be indexed by Google) and one at https://zen-freedom-2b123.netlify.com/ (the actual final website reached by users, which must be indexed).

Is there a way to tell Google NOT to index master--zen-freedom-2b123.netlify.com/ while still indexing zen-freedom-2b123.netlify.com/? We can’t use the conditional above: we tried environment variables such as deploy_prime_url and deploy_url, but they don’t work, as they’ll still be equal to “master” when you deploy something to production.

The challenge boils down to: is it possible (when building a static website with Node, in our case) to show, on the homepage for example, text that is DIFFERENT on zen-freedom-2b123.netlify.com/ than on master--zen-freedom-2b123.netlify.com/?

If not, how do people prevent Google from indexing the master-xxx website (sometimes indexed just because somebody posted the master-xxx URL on some page by mistake) to avoid duplicate content?

Thanks for any help or advice,

M.

Hi there,

There are some “security through obscurity” tricks you could try, like not naming your production branch “master” so nobody would guess it, but that isn’t a very good type of advice either.

Our general advice is best summarized as “don’t share links to content you don’t want indexed”. For builds on master that you will publish as production, headers are locked in for a deploy at deploy time, so it’s tough to make them conditional and support both use cases. Another workflow that might shield that content from Google is to deploy manually: then there is no branch name to guess, and a visitor would instead have to guess the unique deploy ID to come up with your hostname. That ID is a 25-character hexadecimal hash, which, while still obscurity-based, is much less likely to be guessed.
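As a sketch, the manual-deploy workflow with the Netlify CLI looks like this (assuming the CLI is installed, the site is linked, and the build output lands in dist/, which is an assumption about your setup):

```sh
# Build locally, then push the output folder straight to production.
# The deploy is addressed by a unique deploy ID rather than a branch name,
# so its generated hostname is effectively unguessable.
npm run build
netlify deploy --prod --dir=dist
```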

For other scenarios we do have some protection:

  1. We DO prevent our deploy previews (that is, builds from PRs) from being crawled, by automatically setting an X-Robots-Tag header with a value of noindex on them.
  2. You can selectively set headers based on context, but that’s more useful when you want to protect e.g. all of staging or all of beta: see Selective Password Protection.

I think adding a rel="canonical" link pointing at the production URL should be enough to avoid any penalties and tell Google to prefer the linked page.
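Presumably the suggestion is a canonical link element. A minimal sketch of what that could look like in the page's head, using the production hostname from this thread's example; since the same tag is emitted in every build, even the master--… copy points Google at the production URL as the preferred version:

```html
<!-- Emitted on every deploy, so duplicate hostnames all declare the
     production URL as the canonical version of this page. -->
<link rel="canonical" href="https://zen-freedom-2b123.netlify.com/">
```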

Thanks for the reply !

We DO prevent our deploy previews (that is, builds from PRs) from being crawled, by automatically setting an X-Robots-Tag header with a value of noindex on them.

It’s a great feature: I did not see it in the docs. I think it would be very valuable to have it documented officially.

thanks,
M.


Great suggestion! I’ll get that in front of our docs team 🙂
