HTML Parser - The Bio One

HTML and XML parsing for the masses

Project Description

HTML Parser - The Bio One is a minimalistic open source HTML parsing library implemented in Java 5.0. Our goal is not merely to build a practically usable HTML parser; there are plenty of those on the Internet. We aim for HTML and XML parsing that is simple and easy to understand, so the document object model is kept as simple as possible. This makes the parser suitable for educational purposes and as a base for a custom parser.

License

This software is licensed under the terms of the GNU General Public License.

Features

Parsing of HTML and XML files (attributes and comments are not yet handled adequately).
Automatic correction of unclosed or mismatched tags (correction of mistyped tags is still to do).
Generation of HTML/XML text from object model.
Customizable attribute value delimiter, the " character by default (still to do).
Simple and easy to read source code.
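
The automatic correction of unclosed or mismatched tags listed above can be pictured with a stack of open tag names. The following is only a minimal sketch of that general technique, not the library's actual implementation: the class TagBalancer and its balance method are hypothetical names, and the sketch assumes tags carry no attributes.

```java
import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; this class is NOT part of htmlparser-bio.
class TagBalancer {
    // Matches an opening or closing tag without attributes, e.g. <b> or </b>.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String balance(String input) {
        StringBuilder out = new StringBuilder();
        LinkedList<String> open = new LinkedList<String>(); // used as a stack
        Matcher m = TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // copy the text between tags
            last = m.end();
            String name = m.group(2);
            if (m.group(1).length() == 0) {
                // Opening tag: remember it and emit it unchanged.
                open.addFirst(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // Closing tag: first close everything opened after its match.
                String top;
                do {
                    top = open.removeFirst();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // A closing tag with no matching opening tag is dropped.
        }
        out.append(input.substring(last));
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.removeFirst()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<html><body><b>Hello</body>"));
    }
}
```

A real parser would build tree nodes instead of re-emitting text, but the stack discipline is the same: the innermost open tag is always closed first.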

Requirements

For users

Java Runtime Environment 1.5 (can be downloaded from Java's web site)
htmlparser-bio.jar (download from the project's download section)

For developers

Java SDK 1.5 (can be downloaded from Java's web site)
JUnit framework (available by default in Eclipse; can be downloaded from JUnit's web site)
The project's source code (available in our CVS tree)
The test project (also in CVS)
Eclipse IDE version 3.1 or above (recommended for development)

Download

You can download the HTML parser's source, binaries and tests here.

CVS

The source code repository information is available here and the repository can be browsed via HTTP here.

Documentation

The project's Javadoc can be accessed here.

Quickstart Example

import java.util.ArrayList;

import com.bioinformatixx.htmlparser.*;
import com.bioinformatixx.htmlparser.dom.*;

class Main
{
	public static void main(String[] args)
	{
		// Parse an in-memory HTML string into a list of root nodes.
		Parser parser = new Parser("<html><body>Hello, World!</body></html>");
		ArrayList<SimpleNode> rootElements = parser.parseHtml();

		// Walk down the tree: <html>, then <body>, then its text node.
		Node html = (Node)rootElements.get(0);
		Node body = (Node)html.getChildren().get(0);
		TextNode bodyInnerText = (TextNode)body.getChildren().get(0);

		// Print the text content of <body>.
		System.out.println(bodyInnerText.getText());
	}
}
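
The quickstart goes from text to a tree. The opposite direction, generating HTML/XML text from an object model (listed under Features), usually comes down to a recursive walk over the tree. The sketch below illustrates that idea with stand-in classes; SketchNode, SketchText, SketchElement and SketchDemo are hypothetical names invented for this example and are not the library's own classes or API.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in classes for illustration only; they are NOT the library's classes.
abstract class SketchNode {
    abstract void writeTo(StringBuilder out);
}

// A leaf holding plain text.
class SketchText extends SketchNode {
    private final String text;

    SketchText(String text) { this.text = text; }

    void writeTo(StringBuilder out) { out.append(text); }
}

// An element with a tag name and child nodes (attributes are omitted).
class SketchElement extends SketchNode {
    private final String name;
    private final List<SketchNode> children = new ArrayList<SketchNode>();

    SketchElement(String name) { this.name = name; }

    SketchElement add(SketchNode child) {
        children.add(child);
        return this; // allow chained calls
    }

    void writeTo(StringBuilder out) {
        out.append('<').append(name).append('>');
        for (SketchNode child : children) {
            child.writeTo(out); // recurse into each child in document order
        }
        out.append("</").append(name).append('>');
    }
}

class SketchDemo {
    static String render(SketchNode root) {
        StringBuilder out = new StringBuilder();
        root.writeTo(out);
        return out.toString();
    }

    public static void main(String[] args) {
        SketchNode html = new SketchElement("html")
            .add(new SketchElement("body").add(new SketchText("Hello, World!")));
        System.out.println(render(html)); // <html><body>Hello, World!</body></html>
    }
}
```

Because every node writes itself and then delegates to its children, the output is always well nested, whatever the input looked like before parsing.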

Contributing

If you are interested in the development of HTML Parser - The Bio One, the source code is available in the project's CVS tree.
To contribute source code, you must first be added to the project's active developers; please contact the project administrator.
Development and testing are supported by the Eclipse IDE and the JUnit test framework.

Contacts

Project administrator - v_bachvarov (at-no-spam) users.sourceforge.net

Links

Project summary page at SourceForge.net
Project download section
Java web site
JUnit web site
The Eclipse project