jiminoc edited this page Jan 22, 2011 · 4 revisions

There will be times when the default settings, such as file locations, do not suit your environment. Goose provides a configuration object that you can pass to the extractor to override those defaults.

See example below:

// This is an example of using the configuration object with Goose.
// You will probably not want certain things in certain places, so anything
// in the configuration object can be overridden.

String url = "http://www.msnbc.msn.com/id/41207891/ns/world_news-europe/";

// set my configuration options for goose
Configuration configuration = new Configuration();
configuration.setMinBytesForImages(5000);
configuration.setLocalStoragePath("/opt/goose");
configuration.setEnableImageFetching(false); // i don't care about the image, just want text, this is much faster!
configuration.setImagemagickConvertPath("/opt/local/bin/convert");

ContentExtractor contentExtractor = new ContentExtractor(configuration);
Article article = contentExtractor.extractContent(url);
assertTrue(article.getCleanedArticleText().startsWith("Prime Minister Brian Cowen announced Saturday"));
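
The general pattern behind a configuration object like this (defaults that hold until a setter overrides them) can be sketched in a few lines. The snippet below is only an illustration, not Goose's actual implementation: `ExtractorSettings`, `SettingsDemo`, `describe`, and the default values are all hypothetical stand-ins chosen so the example runs without Goose on the classpath.

```java
// Hypothetical stand-in for a configuration object.
// This is NOT Goose's Configuration class; names and defaults are assumptions.
final class ExtractorSettings {
    // Placeholder defaults for illustration only; Goose's real defaults may differ.
    private int minBytesForImages = 4500;
    private String localStoragePath = "/tmp/goose";
    private boolean enableImageFetching = true;

    int getMinBytesForImages() { return minBytesForImages; }
    void setMinBytesForImages(int minBytesForImages) { this.minBytesForImages = minBytesForImages; }

    String getLocalStoragePath() { return localStoragePath; }
    void setLocalStoragePath(String localStoragePath) { this.localStoragePath = localStoragePath; }

    boolean isEnableImageFetching() { return enableImageFetching; }
    void setEnableImageFetching(boolean enableImageFetching) { this.enableImageFetching = enableImageFetching; }
}

class SettingsDemo {
    // Renders the effective settings so overrides are easy to inspect.
    static String describe(ExtractorSettings s) {
        return "minBytesForImages=" + s.getMinBytesForImages()
                + ", localStoragePath=" + s.getLocalStoragePath()
                + ", enableImageFetching=" + s.isEnableImageFetching();
    }

    public static void main(String[] args) {
        ExtractorSettings settings = new ExtractorSettings();
        // Override only what differs in this environment; everything else keeps its default.
        settings.setLocalStoragePath("/opt/goose");
        settings.setEnableImageFetching(false);
        System.out.println(describe(settings));
    }
}
```

Only the two overridden values change here; `minBytesForImages` keeps its placeholder default, which is the same behavior the Goose example above relies on for any option it does not set.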
