Scraping Content From Google Groups

I was helping a friend out who wanted to build a word cloud from the text in Google Groups posts. If you’ve made any efforts to try to get content out of Google Groups you know that the only way to do so is to ensure you subscribe to the group posts via e-mail, then extract all those messages. If you don’t e-mail subscribe to a group, there really is no way to create an archive of the content.

After hacking around a bit and failing, I pulled up the mobile version of the group. You can do that for any Google Group by using the following URL and filling in GROUPNAME for the group you’re interested in: https://groups.google.com/forum/m/#!topic/GROUPNAME.

input_text_not_updated_-_Google_Groups

Then, you’ll need to navigate to a thread, use the double-down arrow to expand all the items in the thread, open up the JavaScript inspector on one of the posts and look for <div dir="ltr">. If that surrounds the post, the following hack will work. Google only seems to add this left-to-right attribute on newer groups, so if you have an older group you need to work with, you’ll need to figure out a different selector (which is coming up in a bit).

With all of the posts expanded, paste the following code into the JavaScript console:

nl = document.querySelectorAll('[dir=ltr]');
s="" ; 
for (i=0; i<nl.length; i++) {
  s = s + nl[i].textContent + "<br/><br/>";
}; 
nw = window.open(); 
nd = nw.document; 
nd.write(s); 
nd.close()

and hit return (I have it spaced out in the code above just for clarity; it will all fit on one line which makes it easier to execute in the console).

Untitled_and_input_text_not_updated_-_Google_Groups

You should get a new browser window (so, you may need to temporarily enable popups on Google Groups for this to work) with the text of all the posts in it. I only put the double <br/> tags in there for the purposes of this example. I just needed the raw text, but you can mark the posts any way you’d like.

You can tweak this hack in many ways to pull as much post metadata as you need since it’s all wrapped in heavily marked <div>s and the base technique should work in a GreaseMonkey or TamperMonkey userscript for those of you with time to code one up.

This hack only lessens the tedium a small amount. You still need to go topic by topic in the group if you want all the content. There’s probably a way to get that navigation automation coded into the script as well. Thankfully, I didn’t need to do that this time around.

If you have other ways to free Google Groups content, drop a note in the comments.

Cover image from Data-Driven Security
Amazon Author Page

1 Comment Scraping Content From Google Groups

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.