Written on - 11/05/2024
My web app "RomanizedMM" is making some good progress. I average around 30-40 users a day which I think is great for the content that is pretty niche. I used to advertise it on Reddit and I still get DMs asking about the app, the content and many others. If you don't know what RomanizedMM is, visit my project @ romanizedmm.com.
There's one problem I have been facing since I first deployed my app. No search engine could index my web pages, due to one issue: content loading. I think it's important to describe the architecture of my app first. That'll make things easier to understand later.
My app functions just like a normal website: you are directed to a new page when you click a link. However, this happens under the SPA architecture, also known as Single Page Application. Instead of a traditional website where a whole new page is loaded on every click, an SPA swaps the content in place on the page you are already on, while still taking you to the new route. This makes the app a lot smoother and removes the hassle of a full reload on every jump.
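To make that concrete, here is roughly what routing looks like in a React SPA. This is a minimal sketch, not my actual code; the component and path names are placeholders:

```javascript
import { BrowserRouter, Routes, Route } from "react-router-dom";

// Hypothetical page components; the real app has its own.
import Home from "./pages/Home";
import SongPage from "./pages/SongPage";

// The browser downloads one HTML shell once. After that, React Router
// swaps components in place whenever the URL changes: no full reload.
export default function App() {
  return (
    <BrowserRouter>
      <Routes>
        <Route path="/" element={<Home />} />
        <Route path="/song/:songName" element={<SongPage />} />
      </Routes>
    </BrowserRouter>
  );
}
```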
Now, let's jump to the issue. All my song data is stored in MongoDB. Every time a user browses a song, the data is fetched from the database through the API I created and delivered to the frontend for display. This takes time, and that's the problem. Sending the request, fetching the song and delivering it back all cost time. Plus, the website is React based, so the JavaScript components take some time to load too. When all of these combine, it takes a good chunk of time for the data to get back to where it belongs.
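In code, the fetch path looks roughly like this. It's a sketch with a hypothetical endpoint; the real API route may differ:

```javascript
import { useEffect, useState } from "react";

// Sketch of the client-side fetch: the page renders empty first,
// then fills in once the API round trip completes.
function SongPage({ songName }) {
  const [song, setSong] = useState(null);

  useEffect(() => {
    // Hypothetical endpoint. The request travels to the server,
    // the server queries MongoDB, and the response travels back.
    fetch(`/api/songs/${songName}`)
      .then((res) => res.json())
      .then(setSong);
  }, [songName]);

  // Until the data arrives, this is all a visitor (or a crawler) sees.
  if (!song) return <p>Loading...</p>;

  return (
    <article>
      <h1>{song.title}</h1>
      <pre>{song.lyrics}</pre>
    </article>
  );
}
```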
Bot crawlers don't like this. "Time is money", that's what people say. Bot crawlers think the same way. When a page is up, they crawl it and decide if it should be indexed based on its content. Only when they approve do the pages appear on search engines. Here comes the actual problem. When the bots crawl through my pages, they find nothing and flag them. I think you can already see why. The bots crawled while the pages were still waiting for the content to arrive. While the data is in flight, the page is empty, so bot crawlers think all the pages look the same, since they are all equally empty. So they decide not to index my pages, and my pages aren't visible on Google, Yahoo etc.
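You can simulate a naive crawler's view yourself by fetching the raw HTML before any JavaScript runs. A tiny Node script, with a made-up song path for illustration:

```javascript
// Node 18+ has fetch built in; run as an ES module (e.g. node crawl.mjs).
// This requests the page the way a naive crawler would: raw HTML only,
// no JavaScript execution.
const res = await fetch("https://romanizedmm.com/song/some-song");
const html = await res.text();

// For a client-side-rendered React app, the body is essentially an
// empty shell like <div id="root"></div>. The lyrics only appear after
// the JS bundle runs and the API call finishes, so from this vantage
// point every song page looks identical.
console.log(html);
```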
I came up with a short-term solution. I decided to ditch fetching songs from the backend. Instead, I moved all the files to the frontend, so they become static. Fetching static files and displaying the content takes much, much less time than going through the database and all that. It worked well, so I kept moving on; my pages were now getting indexed and are visible on Google, which is exactly what I want.
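The workaround boils down to bundling the data with the app instead of fetching it. A sketch, with hypothetical file and field names:

```javascript
// Before: data arrived over the network after the page rendered.
// After: the JSON ships inside the bundle, so it's there on first render.
import songs from "./data/songs.json";

function SongPage({ songName }) {
  // A plain lookup: no request, no loading state, and no empty page
  // for a crawler to catch.
  const song = songs.find((s) => s.slug === songName);
  return (
    <article>
      <h1>{song.title}</h1>
      <pre>{song.lyrics}</pre>
    </article>
  );
}
```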
But I don't like this solution. My song data is visible to anyone who visits my repository. It's not some secret data I need to hide, but the fact that it's just sitting there doesn't sit right with me. The app should function in proper MERN fashion, where songs are fetched from the backend.
I tried to find solutions that would let me keep fetching song data from the database while also giving bots proper content to crawl. The first solution, and possibly the best in the long term, is utilizing SSR @ Server Side Rendering. An SPA does not provide SSR; in fact, it is CSR @ Client Side Rendering. CSR just has that longer load time, as things have to be rendered on the client side, which doesn't really suit my scenario of prioritizing SEO. Looking further, I found that Next.js is a framework that provides Server Side Rendering. What it does is render the webpage on the server and deliver it so that things like bot crawlers see the page with its full content, boosting SEO and making the pages far easier to index. Fine, this is a sound solution. The catch: I would need to migrate my entire project to Next.js.
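For reference, here is roughly what a song page would look like with SSR in Next.js (pages router). This is a sketch, not a migration plan; the endpoint and prop names are assumptions:

```javascript
// pages/song/[songName].js: Next.js runs getServerSideProps on the
// server for every request, so the HTML arrives with the lyrics in it.
export async function getServerSideProps({ params }) {
  // Hypothetical internal API; one could also query MongoDB directly here.
  const res = await fetch(`https://api.example.com/songs/${params.songName}`);
  const song = await res.json();
  return { props: { song } };
}

export default function SongPage({ song }) {
  // By the time a crawler sees this page, the content is real HTML.
  return (
    <article>
      <h1>{song.title}</h1>
      <pre>{song.lyrics}</pre>
    </article>
  );
}
```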
This solution is going to be off the table for some time. I have never worked with Next.js before, so I would need to learn it first. It's a JS framework, so it shouldn't be that hard to pick up, but migrating to a different tech stack just sounds like an awful lot of work, and I cannot make time for that right now. So, there I go, searching for an alternative solution.
Another solution is pre-rendering. I learnt that this used to be a big thing back in the early 2010s, but people say bots have gotten smart enough that the service is not much in demand anymore. I tried digging in and found that this could be a viable option for me. What happens is, a third-party site crawls my webpages and stores them in a cache for bots to fetch. So, when bots try to crawl my pages, instead of letting them crawl the live pages, I direct them to fetch the versions pre-rendered by the third party. That way, the bot crawls through the actual content and knows there actually is content.
I found a great third-party site called https://page-replica.com/. It has a decent number of users, and it was very easy to set up as well. I set it up through Cloudflare, hosting worker routes that redirect bot crawlers to fetch the static pages instead. I also get 5,000 render requests per month, which is way more than I need, because I only have 70-80 pages. All one has to do is request a pre-render for individual pages, or import the sitemap to get all pages rendered.
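Page Replica supplies its own worker setup, but the idea behind the Cloudflare Worker route is roughly this. The bot list and the cache URL below are placeholders, not their actual script:

```javascript
// Cloudflare Worker (module syntax). A real deployment uses the
// service's own bot list and cache origin; these are made up.
const BOT_UA = /googlebot|bingbot|yandex|duckduckbot|slurp/i;

export default {
  async fetch(request) {
    const ua = request.headers.get("User-Agent") || "";
    const url = new URL(request.url);

    if (BOT_UA.test(ua)) {
      // Bots get the pre-rendered snapshot of the same path.
      return fetch(`https://cache.example.com/romanizedmm.com${url.pathname}`);
    }
    // Human visitors get the normal SPA.
    return fetch(request);
  },
};
```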
So, I tried this, but it still returned empty pages :( I thought this was going to solve my ongoing problem, but it didn't. I contacted the creator of Page Replica and asked how long the scraper waits before capturing a page, because if the wait is short, it would make sense that it just captured an empty page. I'll wait for a reply and see how that goes; the service seems to guarantee that pages are only captured after a full load, but that doesn't appear to be happening for my app.
Another solution I can try is to find ways to optimize my fetching method, which could possibly speed up the whole requesting, fetching and delivering process (see the sketch below). If that works, great; if not, I will have to go back to the old way of storing local static files until I find another solution. Either way, I will definitely keep hunting for a proper one.
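If I go down that road, the obvious first steps would be an index on the lookup field and a small server-side cache. A sketch with hypothetical model and field names, not my actual backend:

```javascript
const express = require("express");
const mongoose = require("mongoose");

// Hypothetical schema. The key part is the index on the lookup field,
// so each song query is a quick index hit instead of a collection scan.
const songSchema = new mongoose.Schema({
  slug: { type: String, index: true, unique: true },
  title: String,
  lyrics: String,
});
const Song = mongoose.model("Song", songSchema);

const app = express();
const cache = new Map(); // naive in-memory cache; fine for ~80 songs

app.get("/api/songs/:slug", async (req, res) => {
  const { slug } = req.params;
  if (cache.has(slug)) return res.json(cache.get(slug));

  // lean() skips Mongoose document overhead; we only need plain JSON.
  const song = await Song.findOne({ slug }).lean();
  if (!song) return res.status(404).json({ error: "Not found" });

  cache.set(slug, song);
  res.json(song);
});
```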