
Welcome to the goose wiki!

Try it out online! http://jimplush.com/blog/goose

You can follow the latest developments on my Twitter account, jimplush: http://twitter.com/#!/jimplush

Links of importance

  • Projects actively using the Goose library

Using Goose from the command line

You can now use Goose from the command line to batch process extractions, or to do a quick one-off extraction for testing.

  1. Download the goose source

  2. cd into the goose directory

  3. mvn compile

  4. MAVEN_OPTS="-Xms256m -Xmx2000m" mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt

That will put the results of the extraction into the gooseresult.txt file on your desktop.
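If you'd rather drive the same batch extraction from code instead of the shell, a minimal sketch using the Goose API might look like the following. BatchExtract and urls.txt are illustrative names, not part of Goose.

import scala.io.Source
import com.gravity.goose.{Configuration, Goose}

object BatchExtract {
  def main(args: Array[String]) {
    val goose = new Goose(new Configuration)
    // urls.txt: one article URL per line (hypothetical input file)
    for (url <- Source.fromFile("urls.txt").getLines()) {
      val article = goose.extractContent(url)
      println(article.cleanedArticleText)
    }
  }
}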

Overview

Project Goose is an article extractor written in Scala, using Maven for dependency management. It's an open source project born from Gravity Labs (http://gravity.com). Its goal is to take a webpage, analyze it, and extract the main text of the article, as well as recommend which image is likely the most relevant one on the page. Goose aims to be an easy-to-use, scalable extractor that can plug into any application that needs to extract structure from unstructured web pages.

Goose was born for a project that needed to take any article page, extract the pure text of the content, and pick what we thought was the most important image from that page. It's geared toward NLP-type processing where you care only about the raw text of the article, but it's coded so that new OutputFormatter classes can override that behavior and give you more of a Flipboard-style extraction where the content stays inline. Goose has performed tens of millions of extractions, and we wanted to give back what we've learned about extraction.

Project Goose was in fact named after the Top Gun character call sign "Goose". We were on a major Top Gun kick one week, and that's what happens: projects get weird names. Goose is based on Arc90's readability code but has definitely moved away from that initial implementation and added image extraction. To see how it works, check out some of the unit tests: https://github.com/jiminoc/goose/blob/master/src/test/scala/com/gravity/goose/GoldSitesTestIT.scala

No article extractor will ever be 100% accurate on every single site, so when you come across articles Goose did not properly extract, please log an issue and I'll get it looked at.

Requirements

  • Scala (min 2.8)
  • Maven
  • ImageMagick (for image extraction)
    • ON OSX: sudo port install ImageMagick
    • ON UBUNTU: sudo apt-get install imagemagick

Example Use

Here is what it looks like to use Goose from Java:

import com.gravity.goose.Article;
import com.gravity.goose.Configuration;
import com.gravity.goose.Goose;

String url = "http://www.cnn.com/2010/POLITICS/08/13/democrats.social.security/index.html";
Goose goose = new Goose(new Configuration());
Article article = goose.extractContent(url);
System.out.println(article.cleanedArticleText());

And from Scala:

import com.gravity.goose.{Configuration, Goose}

// the two-argument overload lets you pass raw HTML you've already fetched
val goose = new Goose(new Configuration)
val article = goose.extractContent(url, rawHTML)
println(article.cleanedArticleText)

You'll receive back an Article object that exposes the features extracted from the article. To see everything currently extracted, look at: https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/Article.scala

Roadmap

  • Continue to add unit tests for Gold List Sites
  • Show code examples of overriding the Configuration object
  • Explain the Configuration Object
  • Add an online app that given a URL will show you the extracted text and image (DONE)
  • Add additional output formatters
  • Add ability for users to define custom ids and classes for known sites to help with extraction (by domain?)
  • Be able to follow multiple pages of articles

Workflow

Goose goes through three phases during execution:

  1. Document cleaning
  2. Content / image extraction
  3. Output formatting

Document cleaning

When you pass a URL to Goose, the first thing we do is clean up the document to make it easier to parse. We go through and remove comments and common social-network sharing divs, convert em and other inline tags to text nodes, try to convert divs used as text containers into paragraphs, and perform general document cleanup.
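Goose's actual cleaners live in the source tree; as a rough sketch of the kind of pass described above, here is what removing comments and share widgets looks like with Jsoup, which Goose builds on. The selector and the method name are illustrative, not Goose's real ones.

import org.jsoup.Jsoup
import org.jsoup.nodes.{Comment, Document, Node}
import scala.collection.JavaConversions._

def cleanDocument(html: String): Document = {
  val doc = Jsoup.parse(html)
  // drop common social-network sharing containers by class name
  doc.select("div[class~=(?i)(share|social|twitter|facebook)]").remove()
  // walk the tree and strip HTML comment nodes
  def stripComments(node: Node) {
    node.childNodes().toList.foreach { child =>
      if (child.isInstanceOf[Comment]) child.remove()
      else stripComments(child)
    }
  }
  stripComments(doc)
  doc
}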

Content / Image Extraction

When dealing with random article links you're bound to come across the craziest of HTML files; some sites even include two HTML documents in a single page. We use a scoring system based on clustering of English stop words, along with other factors you can find in the code. We also apply descending scoring, so the farther down the page a node sits, the lower its score becomes. The goal is to find the strongest grouping of text nodes inside a parent container and assume that's your content, as long as that grouping is high enough up on the page.
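As a rough illustration of the stop-word idea (not Goose's exact scoring, which you can read in the extractor code), a node's text can be scored by how many English stop words it contains, then discounted the farther down the page the node sits:

val stopWords = Set("the", "and", "of", "to", "a", "in", "is", "that", "it", "for")

// more stop words usually means real sentences, not navigation or boilerplate
def stopWordScore(text: String): Int =
  text.toLowerCase.split("\\s+").count(stopWords.contains)

// descending scoring: nodes farther from the top candidate are weighted less
def weightedScore(text: String, nodesFromTop: Int): Double =
  stopWordScore(text) * math.max(0.1, 1.0 - 0.1 * nodesFromTop)

// stopWordScore("the quick brown fox jumped over the lazy dog") == 2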

Image extraction is the piece that took the longest. Finding the most important image on a page proved challenging, and required downloading all the candidate images and inspecting them with ImageMagick; Java's image functions were just too unreliable and inaccurate. ImageMagick is well documented, well tested, fast, and accurate. Image search starts from the top node where we found the content, then recursively works outwards, trying to find good images that aren't ads, banners, or author logos.
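Part of that filtering boils down to rejecting images whose URLs match known junk patterns. A deliberately naive, hypothetical version of such a filter (the real known-junk list lives in the Goose source):

// reject images whose URLs look like ads, banners, or logos
def looksLikeGoodImage(imageUrl: String): Boolean =
  "(?i)\\b(ads?|banner|sprite|logo|spacer|icon)\\b".r.findFirstIn(imageUrl).isEmpty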

OutputFormatting

Once we have the top node where we think the content is, we format that node's content for the application at hand. For NLP-type applications, for example, my output formatter just pulls out all the text and ignores everything else; other formatters can be built to offer a more Flipboard-like experience where the content stays inline.
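The exact OutputFormatter interface is in the Goose source; purely as a sketch of the idea (the class and method names here are assumptions, not the real API), a formatter aiming at inline content could keep the cleaned HTML of the top node instead of flattening it to text:

import org.jsoup.nodes.Element

// Hypothetical: an NLP-style formatter would return topNode.text(),
// while an inline, Flipboard-style one keeps the cleaned markup.
class InlineOutputFormatter {
  def getFormattedText(topNode: Element): String = topNode.html()
}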