UPDATE 2019-04-17 — The example at the bottom, which shows that the, er, randomly chosen site has the offending <meta> tag present, is an old result. As of this update timestamp, that robots noindex tag is no longer on the site. Since the tag’s presence is in flux, I’ll keep monitoring it.
Say your organization has done something pretty terrible. Terrible enough that you really didn’t want to acknowledge it at first, but eventually blogged about it. Say, further, that you haven’t added a blog post in a long time, so that entry sits at the top of your blog index page, which Google can still index and will, since it’s been linked to from this site, which has a high internal rating in their massive database.
If you wanted to help ensure nobody finds that original page, there are lots of ways to do that.
First, you could add a Disallow entry for it in your robots.txt. Ironically, some organizations don’t go that route, yet do try to prevent Google (et al.) from indexing their terms of use and privacy policy. That might suggest they don’t want a historical record folks could compare changes against, and perhaps that they are even planning changes (it might be good if more than just me saves off some copies of those now).
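As a minimal sketch, here’s what such an entry could look like (the Disallow path is a hypothetical placeholder, not taken from any real site):

User-Agent: *
Disallow: /blog/that-post-we-regret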
Now, robots.txt modifications are fairly straightforward. They are also super easy to check (the snippet at the end of this post does exactly that).
So, what if you wanted to hide your offense from Google (et al.) and not make it obvious in your robots.txt? For that, you can use a special <meta> tag in the <head> of the page.
This is an example of what that looks like up close in the page source:
<meta name="robots" content="noindex" class="next-head" />
<title class="next-head">A note to our community (article) - DataCamp</title>
<link rel="canonical" href="https://www.datacamp.com/community/blog/note-to-our-community" class="next-head" />
<meta property="og:url" content="https://www.datacamp.com/community/blog/note-to-our-community" class="next-head" />
That initial <meta>
tag will generally be respected by all search engines.
And, if you want to be really sneaky, you can add a special X-Robots-Tag: noindex HTTP header to your web server’s responses for any page you want to leave no permanent record of, and sneak past even more eyes.
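As a sketch of that sneakier version, here’s what setting the header could look like in an nginx config (the location path is a hypothetical placeholder; on Apache, the mod_headers Header set directive does the same job):

location /blog/that-post-we-regret {
  # ask crawlers not to index this page, with no trace in robots.txt or the HTML
  add_header X-Robots-Tag "noindex";
}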
Unfortunately, some absolute novices who did know how to do the <meta> tag trick aren’t bright enough to do the sneakier version and get caught. Here’s an example of a site that doesn’t use the super stealthy header approach (it’s the, er, randomly chosen site probed in the R snippet below):
FIN
So, if you’re going to be childish and evil, now you know what you really should do to try to keep things out of public view.
Also, if you’re one of the folks who likes to see justice done, you now know where to check, and you can use this R snippet to do so whenever you like. Just swap in the site/page you want to monitor for the randomly chosen one below.
library(httr)
library(xml2)
library(magrittr) # the %>% pipe used below comes from here
httr::GET(
url = "https://www.datacamp.com/community/blog/note-to-our-community"
) -> res
data.frame(
name = names(res$all_headers[[1]]$headers), # if there is more than one set (i.e. redirects) you'll need to iterate
value = unlist(res$all_headers[[1]]$headers, use.names = FALSE)
) -> hdrs
hdrs[grepl("robots", hdrs[["name"]], ignore.case = TRUE),] # catches X-Robots-Tag regardless of case
## [1] name value
## <0 rows> (or 0-length row.names)
httr::content(res) %>%
xml_find_all(".//meta[@name='robots']")
## {xml_nodeset (1)}
## [1] <meta name="robots" content="noindex" class="next-head">\n
readLines("https://www.datacamp.com/robots.txt")
## [1] "User-Agent: *"
## [2] "Disallow: /users/auth/linkedin/callback"
## [3] "Disallow: /terms-of-use"
## [4] "Disallow: /privacy-policy"
## [5] "Disallow: /create/how"
## [6] "Sitemap: http://assets.datacamp.com/sitemaps/main/production/sitemap.xml.gz"
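If you want to run that whole check repeatedly, here’s a small convenience wrapper (my own sketch, not part of the original snippet; the check_noindex name is made up for illustration) that probes all three hiding spots, i.e. the X-Robots-Tag header, the robots <meta> tag, and the site-wide robots.txt, in one call:

library(httr)
library(xml2)
library(magrittr)

check_noindex <- function(page_url) {

  res <- httr::GET(url = page_url)
  u <- httr::parse_url(page_url)

  list(
    # HTTP header check (httr header access is case-insensitive)
    x_robots_tag = res$headers[["x-robots-tag"]],
    # any <meta name="robots"> directives in the page <head>
    meta_robots = httr::content(res) %>%
      xml_find_all(".//meta[@name='robots']") %>%
      xml_attr("content"),
    # the site-wide robots.txt
    robots_txt = readLines(sprintf("%s://%s/robots.txt", u$scheme, u$hostname))
  )

}

check_noindex("https://www.datacamp.com/community/blog/note-to-our-community")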
Thank you for reading to the end of this note to our community.