Extract main textual content from a webpage.
Today I am going to discuss some libraries that can be used to extract the main textual content of a webpage and remove boilerplate or clutter.
We see tons of pages every day full of advertisements, copyright statements, links, images, etc. These are not the actual relevant content of the webpage but boilerplate.
There are many Java libraries we can use to extract textual content from Wikipedia pages, news articles, blog posts, etc.
Before exploring the libraries, it is important to know that:
- Each page has a different structure (in terms of tags).
- The actual data is segregated into different paragraphs, headings, divs with content classes, etc.
- For example, search for “Obama” and view the source of the first two results, i.e. http://en.wikipedia.org/wiki/Barack_Obama and http://www.barackobama.com/. Both pages have different structures.
No parser has any artificial intelligence; behind the scenes there is just a heuristic algorithm with well-defined rules, operating on the DOM (Document Object Model). Most parsers or HTML page strippers either require the user to supply a tag name to get the data of an individual tag, or return the whole page text.
These libraries don’t work on all pages, because page content varies widely in terms of tags.
We will see examples of the following libraries:
Boilerpipe: Boilerpipe is a Java library written by Christian Kohlschütter. It is based on the paper “Boilerplate Detection using Shallow Text Features”. You can read more about shallow text features here.
There is also a test page deployed on Google App Engine where you can enter a link and it will give you the page text.
URL: http://boilerpipe-web.appspot.com/
Boilerpipe is very easy to use. Add the following repository and dependency to your POM:
<repository>
    <id>boilerpipe-m2-repo</id>
    <url>http://boilerpipe.googlecode.com/svn/repo/</url>
</repository>

<dependency>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.2.0</version>
</dependency>
There are five types of extractors:
- ARTICLE_EXTRACTOR: Works very well for most types of article-like HTML.
- CANOLA_EXTRACTOR: Trained on krdwrd Canola (different definition of “boilerplate”). You may give it a try.
- DEFAULT_EXTRACTOR: Usually worse than ARTICLE_EXTRACTOR, but simpler/no heuristics.
- KEEP_EVERYTHING_EXTRACTOR: Dummy extractor; should return the input text. Use this to double-check whether your problem is within a particular BoilerpipeExtractor or somewhere else.
- LARGEST_CONTENT_EXTRACTOR: Like DEFAULT_EXTRACTOR, but keeps only the largest text block.
Java Example
package com.test;

import java.net.URL;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class BoilerpipeTextExtraction {

    public static void main(String[] args) throws Exception {
        // Fetch the page and parse it into Boilerpipe's document model
        final HTMLDocument htmlDoc = HTMLFetcher.fetch(
                new URL("http://www.basicsbehind.com/stack-data-structure/"));
        final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();

        // Extract the main article text, discarding boilerplate
        String content = CommonExtractors.ARTICLE_EXTRACTOR.getText(doc);
        System.out.println(content);
    }
}
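All five extractors implement the same BoilerpipeExtractor interface, so they are interchangeable. Below is a minimal sketch (not from the original example) that runs two of them over the same page; since an extractor modifies the TextDocument it processes, a fresh TextDocument is built for each run:

package com.test;

import java.net.URL;

import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class ExtractorComparison {

    public static void main(String[] args) throws Exception {
        // Fetch the raw HTML once
        final HTMLDocument htmlDoc = HTMLFetcher.fetch(
                new URL("http://www.basicsbehind.com/stack-data-structure/"));

        BoilerpipeExtractor[] extractors = {
                CommonExtractors.ARTICLE_EXTRACTOR,
                CommonExtractors.LARGEST_CONTENT_EXTRACTOR };

        for (BoilerpipeExtractor extractor : extractors) {
            // Extractors mutate the TextDocument, so build a fresh one per run
            TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
            System.out.println(extractor.getClass().getSimpleName() + ": "
                    + extractor.getText(doc).length() + " chars extracted");
        }
    }
}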
JSoup: As per the jsoup official page – jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
JSoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
Test page for JSoup: http://try.jsoup.org/
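Because it implements the HTML5 parsing rules, jsoup repairs tag soup the same way a browser would. A quick sketch with a made-up snippet:

package com.test;

import org.jsoup.Jsoup;

public class JsoupNormalizeDemo {

    public static void main(String[] args) {
        // Unclosed tags are balanced, just as a browser would do
        String html = "<p>Stack is a <b>LIFO structure";
        System.out.println(Jsoup.parse(html).body().html());
        // prints: <p>Stack is a <b>LIFO structure</b></p>
    }
}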
How to use JSoup:
You can use jQuery-like selectors to get the content of a tag.
Add the following dependency to your POM:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.3</version>
</dependency>
Java Code
package com.test;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupExtractor {

    public static void main(String[] args) throws Exception {
        // Fetch and parse the page into a DOM
        Document doc = Jsoup.connect("http://www.basicsbehind.com/stack-data-structure/").get();

        // Title of the page
        System.out.println(doc.title());

        // Text of the whole page
        System.out.println(doc.text());

        // Text of the body
        System.out.println(doc.getElementsByTag("body").text());

        // Text of all paragraphs
        System.out.println(doc.getElementsByTag("p").text());
    }
}
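The example above selects elements by tag name. jsoup's select method also accepts jQuery-like CSS selectors, which is what the “jQuery-like” claim refers to. A short sketch; the selectors here are illustrative, not tied to the page above:

package com.test;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupSelectorExample {

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.basicsbehind.com/stack-data-structure/").get();

        // CSS selectors, jQuery style (selector names are illustrative)
        System.out.println(doc.select("h1").text());        // text of all <h1> headings
        System.out.println(doc.select("div p").text());     // paragraphs nested inside divs
        System.out.println(doc.select("a[href]").size());   // number of links on the page
    }
}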
Apache Tika: Apache Tika is a content analysis toolkit which can be used to extract metadata and text content from various documents.
Add the following dependency to your POM. Note that the HtmlParser used below is packaged in the tika-parsers module, which pulls in tika-core transitively:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.5</version>
</dependency>
Java Code
package com.test;

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;

public class ApacheTikaExtractor {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.basicsbehind.com/stack-data-structure/");
        InputStream input = url.openStream();

        // One handler each for links, plain text, and cleaned-up HTML;
        // the TeeContentHandler feeds all three during a single parse
        LinkContentHandler linkHandler = new LinkContentHandler();
        ContentHandler textHandler = new BodyContentHandler();
        ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
        TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();

        HtmlParser parser = new HtmlParser();
        parser.parse(input, teeHandler, metadata, parseContext);

        System.out.println("title:\n" + metadata.get("title"));
        System.out.println("links:\n" + linkHandler.getLinks());
        System.out.println("text:\n" + textHandler.toString());
        System.out.println("html:\n" + toHTMLHandler.toString());
    }
}
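Tika is not limited to HTML: with the tika-parsers dependency above, the same pipeline handles PDF, Word, and many other formats. As a minimal sketch, the Tika facade class auto-detects the content type and returns the plain text in one call:

package com.test;

import java.net.URL;

import org.apache.tika.Tika;

public class TikaFacadeExample {

    public static void main(String[] args) throws Exception {
        // The facade detects the document type and picks the right parser automatically
        Tika tika = new Tika();
        System.out.println(tika.parseToString(new URL("http://www.basicsbehind.com/stack-data-structure/")));
    }
}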