GatsbyJS sitemap excluding pages

October 18, 2017

Oops, I screwed up with that previous SEO update. Let’s hack the sitemap module for a quick and easy custom sitemap.

I ran into an SEO thing on Dr Lingua. I have a game that does drag-n-drop hiragana and drag-n-drop katakana. It works well, and Japanese language teachers are using it in their classes. That’s great. To make it better for them and, I thought, to help with SEO, I split it into three games: the original game, one starting with hiragana, and one starting with katakana.

I didn’t think through all the SEO ramifications. By splitting it into three pages, I was diminishing the page value of each page, especially when it comes to inbound links.

Let’s say we have page A, and 100 inbound links. Yay. OK, now to make it better for users, I add pages A1 and A2 with almost the same content. I get 100 more inbound links, but the links are now split between the three pages. Instead of 200 going to A, I may have A: 140, A1: 40, A2: 20. Well, that’s kind of sucky. Time for a couple of changes.

First, I set A1 and A2 to have A as the canonical link. That should ensure all SEO juice from A1 or A2 flows back to A. But there was still a problem: by mentioning Hiragana drag and drop on the Hiragana version, I was diluting the strength of that keyword throughout the site, which weakened it for A. The next thing was to remove pages A1 and A2 from the index.

Adding a meta tag to A1 and A2 of

<meta name='robots' content='noindex' />

means that the pages should be removed from the index. But there was still a problem: GatsbyJS’s sitemap generator adds all pages to the sitemap.
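To put the two tags side by side, here is a rough sketch of the head tags that A1 and A2 end up carrying: a canonical link pointing at A, plus the noindex meta tag. The helper name variantHeadTags and the example.com URL are made up for illustration, and this is not how the tags are actually wired into the Gatsby pages.

```javascript
// Hypothetical helper (not part of Gatsby): build the <head> tags for a
// variant page such as A1 or A2.
function variantHeadTags(canonicalUrl) {
  return [
    // Point search engines at the main page A.
    `<link rel="canonical" href="${canonicalUrl}" />`,
    // Ask for this variant to be left out of the index.
    `<meta name="robots" content="noindex" />`,
  ].join('\n');
}

// Placeholder URL standing in for page A.
const tags = variantHeadTags('https://example.com/A/');
console.log(tags);
```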

I know Google is way smart, and won’t index the pages because I added the noindex meta tag, right? To be doubly sure I wanted to remove any references to these pages from the sitemap, and some other ‘utility’ pages while I was at it.

I don’t have the time to get up to scratch with GraphQL at the moment, or build up a decent understanding of the GatsbyJS dev API, and do things the right way (though I intend to when I get a bit of time). Instead I quickly hacked the sitemap module.

In node_modules/gatsby-plugin-sitemap/internals.js I created a simple array holding the pages I want to exclude from the sitemap like so:

var exclude = ['/404/', '/contact/', '/A1/', '/A2/']

I then ran a quick filter over allSitePage.edges to exclude those pages before passing them into the map function:

...
return allSitePage.edges
  .filter(function (edge) {
    return !(exclude.includes(edge.node.path))
  })
  .map(function (edge)
...
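To try the filter outside of the plugin, here is a small standalone sketch. The sample edges are made up but follow the shape of allSitePage.edges, and the withTrailingSlash and filterEdges helpers are my own additions (not part of gatsby-plugin-sitemap), there so that a path written as '/A2' still matches '/A2/'.

```javascript
// An exclude list like the one in the post; the last entry deliberately
// has no trailing slash.
const exclude = ['/404/', '/contact/', '/A1/', '/A2'];

// Hypothetical helper: make sure a path ends with a slash.
function withTrailingSlash(path) {
  return path.endsWith('/') ? path : path + '/';
}

// Hypothetical helper: drop every edge whose path is on the exclude list.
function filterEdges(edges, excluded) {
  const blocked = new Set(excluded.map(withTrailingSlash));
  return edges.filter(function (edge) {
    return !blocked.has(withTrailingSlash(edge.node.path));
  });
}

// Made-up sample data in the shape of allSitePage.edges.
const sampleEdges = [
  { node: { path: '/' } },
  { node: { path: '/A/' } },
  { node: { path: '/A1/' } },
  { node: { path: '/A2/' } },
  { node: { path: '/contact/' } },
];

const kept = filterEdges(sampleEdges, exclude).map(function (edge) {
  return edge.node.path;
});
console.log(kept); // [ '/', '/A/' ]
```

Normalising the slash on both sides is a judgment call. With a plain includes check, a list entry that differs from the page path by a single slash quietly fails to match and the page stays in the sitemap.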

Sitemaps now generate on a build as usual, those pesky pages are no longer referenced, and all is good with the world (well, not really, but I’ll take a little win).


Written by Adrian Gray making web apps and games in Sydney. Find me on Twitter.