Disaster Recovery

Published 09-22-2019 15:55:30

A year's worth of work… just gone. One simple misunderstanding of how the software worked. It was all gone. No backups. It's the cloud. You shouldn't need backups. SaaS offerings do that for you, right?

Hello Shopify

Unfortunately, not in the case of Shopify. One of the nice things about Shopify is that it allows external parties to build tooling and sell their wares to “enhance” the platform. This also has a downside. There's not always the “out-of-the-box” solution that you'd expect or hope for. It's yet another USD $10-a-month charge on top of the bill you're already paying to Shopify and the 5 to 10 other apps you've added to get your store humming along. Not a critique of Shopify by any means; it's just not as straightforward for less tech-savvy store owners trying to run a business.

This brings us back to our little intro there. A partner of mine, who has a shop hosted on Shopify, decided to reorganise some of their blogs. In doing so, they were trying to shift some articles from one blog section to another. After creating the new sections, they proceeded to delete the old. Sadly, the articles were not actually migrated over. Gone. Even the “Are you really sure?” prompt wasn't enough to save the content from extinction. Panic set in. Research into what could be done, what should have been done, etc.

Shortly after, I got the panicked call of “What can I do?”. So of course, I did the exact same thing they had done: cracked out the ol’ Google search for “Shopify + data recovery”. There are some nice apps that allow you to back up and restore your shop's data, including point-in-time restores. But if you haven't already signed up for said services, you're kind of out of luck.

Getting Creative

You may (or may not) be surprised that your content is already backed up for you on the internet. It's mostly unintentional on your part, but it serves the purpose of a “historical record”. Tools like Google's Cached View and Archive.org's Wayback Machine spend their days scraping the internet and keeping snapshots of the content. This was my first stop. Apparently, my gut is in line with other creative types, as you can find several suggestions along the same lines.

I could engineer the shit out of this problem. The dev in me started writing code. Hacking bits and pieces together to re-scrape back from Google. To pull each page back from the abyss of that delete button.

4 hours later…

Yeah… I didn't think this one through at the start. Google kind of doesn't like it when you start throwing “robots” at their stuff. And if you want to jump through their search results, you need a full browser (or at least a headless one) to render the page and execute their Javascript in order to populate the page. I think it's time to get a little lazy and hacky here.

Now what I really want is to find every single (potential) blog page for the target site. You've got a nice little “site search” option under Google via the site query … site:elliottpolk.dev. This allows you to pull back everything Google has cached on that site. In the case of the partner's shop, they conveniently have a /blogs path that nicely divides up the content I was now on the hunt for. Instead of clicking on each link for the site, there's a nice little dropdown indicator that, when clicked, gives you the “Cached” option for the associated URL.
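Since everything I was after lived under that path, the query can be narrowed down to just the blog side of things. Using my own domain as a stand-in for the partner's shop, it would look something along the lines of:

site:elliottpolk.dev/blogs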

Cached

Early on, this is where I made a mistake. I thought copying the webcache URL straight out and having code that just replaced the target site URL in the search query would work a treat. Nope. Google isn't too fond of that. They've got a mechanism that includes a unique caching bit as part of the URL. If you don't have this, they will happily block your IP (from the coding side, but still accessible in the browser).
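For illustration, those cached links look something like the below, with that unique token wedged into the query string (the token and paths here are just placeholders):

https://webcache.googleusercontent.com/search?q=cache:<token>:<shop-domain>/blogs/<article-path>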

OK… So, I've got to get each one of those magic URLs with the token as part of the query parameters. How do I quickly get an HTML + Javascript renderer and start scraping those URLs? I could use NodeJS with things like Puppeteer, Zombie, or PhantomJS, but I'm already 4-ish hours into this. I just want to get it out of the way.

Le sigh… I've already got a browser installed on my machine that allows me to run custom Javascript on the active page: Chrome and its dev tools. Yup. I overthought it. Let's crack open the dev tools on the first page of the search results, paste the below snippet into the console, and away we go.

// create a new 'textarea' element to hold the text to enable copying the text to the clipboard
let ta = document.createElement('textarea');

// append all the current cached URLs to the new textarea
ta.value = [...document.getElementsByClassName('fl')].map(e => e.href).filter(href => href && href.startsWith('https://webcache.googleusercontent.com/search?q=cache:')).join('\n');

// - set it to 'readonly' to help ensure nothing else happens to the data
// - attach to the dom
// - select the text in the textarea
ta.setAttribute('readonly','');
document.body.appendChild(ta);
ta.select();

// copy to clipboard and remove the textarea from the dom
document.execCommand('copy');
document.body.removeChild(ta);

// navigate to the next page in the search results or output 'done' if on the last page
let n = document.getElementById('pnnext');
n ? n.click() : console.log('done');

So, why? What does it do? Well… It generates a textarea element, drops it into the current DOM, grabs all the <a> tags with the (as of this writing) appropriate class name, and dumps their href content into the textarea's value. “But why the textarea?”, one might ask. Well, simply put, I'm lazy. I don't want to copy ’n paste from the console when I can have the code do it for me. Tossing the string into the textarea, triggering select() on it, and having the document execute the copy command to push the contents to the clipboard seems far more lazy-like, IMO.

From there, we'll have the page go ahead and navigate to the next set of search results in preparation for the next run of the script. But, just before that, we need to paste the contents of the clipboard somewhere. In this instance, I just tossed it into a simple urls.txt file, where each URL lives on its own line. I use this to my advantage later.
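How you get the clipboard into that file is up to you. One suitably lazy option on macOS is pbpaste; xclip does the same job on Linux (assuming it's installed):

# macOS
pbpaste >> urls.txt

# Linux
xclip -selection clipboard -o >> urls.txt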

Rinse and repeat until the last search results page, where the script should output done.

Back to the original code

So, I didn't quite show the original code in this post, but I do revisit it to do the bulk of the pull. Now, I could've just used something like wget or cURL to pull down the raw pages, but Google adds in this little cache header that I wanted to remove for my partner. This is the main reason I went back to my code.

In combination with the x/net/html package and some scratch code, I was able to strip out the cache header and write the raw HTML to disk.

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "path/filepath"
    "strings"

    "golang.org/x/net/html"
)

// render serialises the modified document tree back out to an HTML string
func render(doc *html.Node) string {
    buf := new(bytes.Buffer)
    if err := html.Render(buf, doc); err != nil {
        panic(err)
    }
    return buf.String()
}

func main() {
    res, err := http.Get(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer res.Body.Close()

    doc, err := html.Parse(res.Body)
    if err != nil {
        panic(err)
    }

    var (
        title string
        fn    func(n *html.Node)
    )

    fn = func(n *html.Node) {
        if n.Type == html.ElementNode {
            switch n.Data {
            case "body":
                // the body's id attribute doubles as a usable file name
                for _, attr := range n.Attr {
                    if attr.Key == "id" {
                        title = attr.Val
                    }
                }
            case "div":
                // strip out the cache header Google injects at the top of the page
                for _, attr := range n.Attr {
                    if attr.Key == "id" && strings.HasSuffix(attr.Val, "__google-cache-hdr") {
                        n.Parent.RemoveChild(n)
                    }
                }
            }
        }

        // grab the next sibling before recursing so that removing a node
        // doesn't cut the walk short
        for c := n.FirstChild; c != nil; {
            next := c.NextSibling
            fn(c)
            c = next
        }
    }
    fn(doc)

    var (
        d = filepath.Join("data", "raw")
        f = fmt.Sprintf("%s.html", filepath.Base(title))
    )
    if err := ioutil.WriteFile(filepath.Join(d, f), []byte(render(doc)), 0644); err != nil {
        panic(err)
    }
}
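The code expects the data/raw directory to exist, so a quick single-URL test looks something like this (assuming it's saved as main.go; the cached URL below is just a placeholder):

mkdir -p data/raw
go run main.go 'https://webcache.googleusercontent.com/search?q=cache:<token>:<shop-domain>/blogs/<article-path>'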

It's worth noting that this is meant to be quick and dirty code, not pretty. Once we've got it working with one URL, it's time to get it working with many. Let's use a simple bash script for that.

#!/bin/bash

get () {
    # include some random sleeps so as not to irritate the Google god
    sleep $(( (RANDOM % 10) + 1 ))s
    go run main.go "$1"
}

# loop through the list of URLs, pass each one to the newly minted `get` func,
# and place the call in the background to make things a little 'faster'
for i in $(cat urls.txt); do
    echo "${i}"
    get "${i}" &
done

# need to wait for all those background tasks to finish
wait
exit 0
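Drop that next to main.go as something like fetch.sh (the name is arbitrary), make it executable, and let it churn through the list:

chmod +x fetch.sh
./fetch.sh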

Conclusion

So… With a bit of Javascript, bash, and Go, we've got back as much as Google had. From here, I handed the HTML over to my partner to copy ’n paste the content to their heart's desire. I could continue, help strip the actual blog content out, and push it to the Shopify API, but that wasn't their priority at the moment. It's now time to get them on a backup plan. More to come on that.