Document Formats

SiteSurfer Builder will extract text from the following document formats:

Text
HTML
Ami Pro
Windows Write
Microsoft Word 3.x to 5.x
Microsoft Word for Windows 1.0 to 8.0
Corel WordPerfect 5.0 to 7.0

The HTML filter has special features for handling web documents. When extracting the main text stream, it ignores everything between the following tags, so that content outside the visible text is not indexed as part of the text:


<APPLET>...</APPLET>
<OBJECT>...</OBJECT>
<SCRIPT>...</SCRIPT>
<STYLE>...</STYLE>
<TITLE>...</TITLE>
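
As an illustration only, the skipping behavior described above can be sketched in a few lines of Python using the standard html.parser module. This is not SiteSurfer Builder's actual filter; the names SKIPPED, VisibleTextExtractor, and visible_text are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser

# Elements whose contents stay out of the main text stream
# (taken from the tag list above).
SKIPPED = {"applet", "object", "script", "style", "title"}


class VisibleTextExtractor(HTMLParser):
    """Hypothetical sketch: collects text that lies outside SKIPPED elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when no skipped element is open.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<HTML><HEAD><TITLE>My Page</TITLE>"
    "<STYLE>p {color: red}</STYLE></HEAD>"
    "<BODY><P>Hello <B>world</B></P>"
    "<SCRIPT>var x = 1;</SCRIPT><P>Bye</P></BODY></HTML>"
)
print(visible_text(page))
```

In this sketch the title, style rules, and script source never reach the text stream; only the paragraph text does.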

The HTML filter can also extract the document title, several pre-defined META tags, and user-defined META tags. When these features are enabled, the SiteSurfer applet can search these fields in addition to text, size, and date.
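
A minimal Python sketch of this kind of field extraction follows, assuming META tags of the form <META NAME="..." CONTENT="...">. The names FieldExtractor and extract_fields, and the sample META names used, are hypothetical; which META tags SiteSurfer Builder pre-defines is not stated here.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Hypothetical sketch: collects the document title and NAME/CONTENT META pairs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.title_parts = []
        self.meta = {}        # lowercased META name -> content string

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            # Attribute names arrive lowercased; values keep their case.
            attr_map = dict(attrs)
            name = attr_map.get("name")
            content = attr_map.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_fields(html):
    """Return (title, meta) where meta maps META names to their content."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip(), parser.meta


sample = (
    '<HTML><HEAD><TITLE>Site Map</TITLE>'
    '<META NAME="Keywords" CONTENT="search, index">'
    '<META NAME="Author" CONTENT="J. Doe"></HEAD>'
    '<BODY>text</BODY></HTML>'
)
title, meta = extract_fields(sample)
print(title)
print(meta)
```

Keeping the title and META values in separate fields, rather than mixing them into the main text, is what makes it possible to search them individually.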

SiteSurfer also handles sites that use HTML frames. Links to child frames are followed just like other links. If an HTML page has a <NOFRAMES>...</NOFRAMES> area, its text is indexed as normal text.
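
The frame handling can be pictured with the following Python sketch, which treats FRAME SRC attributes as links to follow and leaves NOFRAMES text in the normal text stream. It is an assumption-laden illustration, not SiteSurfer's crawler: FrameAwareParser, scan_page, and the example.com URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameAwareParser(HTMLParser):
    """Hypothetical sketch: gathers links (including child frames) and page text."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []   # absolute URLs to visit next
        self.text = []    # text to index, NOFRAMES content included

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(urljoin(self.base_url, attr_map["href"]))
        elif tag == "frame" and attr_map.get("src"):
            # A child frame is followed just like an ordinary link.
            self.links.append(urljoin(self.base_url, attr_map["src"]))

    def handle_data(self, data):
        # NOFRAMES is not skipped, so its text is collected like any other.
        if data.strip():
            self.text.append(data.strip())


def scan_page(html, base_url):
    """Return (links, text) for one HTML page."""
    parser = FrameAwareParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text)


frameset = (
    '<HTML><FRAMESET COLS="20%,80%">\n'
    '<FRAME SRC="menu.html"><FRAME SRC="docs/main.html">\n'
    '<NOFRAMES>This site uses frames.</NOFRAMES>\n'
    '</FRAMESET></HTML>'
)
links, text = scan_page(frameset, "http://example.com/site/index.html")
print(links)
print(text)
```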