This is an off-site copy of the corresponding Product report page on the SearchTools.com website, and it is designed to allow you to comment on the product and/or the reporting. For more information about the topic of search and tools visit SearchTools.com where you can browse many articles, in-depth analysis and overviews of external resources.
- Fast but polite robot crawler for indexing internal/external web sites
- Flexible include/exclude rules using regexp (grep) patterns
- Accesses SSL secure sites via HTTPS
- Handles proxy servers and password protected areas.
- Indexes mounted file system volumes in native formats: NFS, Samba, Novell
- Handles file formats: HTML, ASCII text, RTF, Microsoft Word, Excel, PowerPoint, Acrobat PDF, PostScript
- Indexes XML files and searches within XML tags; can define DTDs and metadata.
- Indexes relational databases (MySQL, Oracle, PostgreSQL) using JDBC interfaces.
- Update scheduler options.
- Metadata fields: URL, images, mailtos, hrefs, anchors, Dublin Core and AGLS
- External metadata assignment to documents or directories
- Supports Western European languages (ISO-8859-1): Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, and Swedish
- Advanced search interface for additional metadata and query operators
- Synonym list
- Spellchecking using aspell, dictionaries and/or the site content text.
- Can enable stemming for searches.
- Can search in specified subsites or combined meta-collections
- Search page and results page customization using HTML templates.
- Results sorted by relevance, with extra weight for metadata matches
- Date sort option
- Shows "Featured pages" (manual recommendations).
- Shows search word in context in results pages and/or metadata content
- Option to view cached versions of files
- Advanced results customization uses Perl syntax for extensive flexibility
- Can return results in XML
- Web-browser administration interface for general customization
- Extensive config files for complete control
- Query log reports include most common queries, no-matches, time taken
- Based on PADRE (PArallel Document Retrieval Engine), started in 1994
- Scales to over 18 million web pages (100 gigabytes) on low-end hardware.
- Australian Department of Industry, Tourism and Resources - 3570 servers, 2 million pages
- Australian Broadcasting Corporation - very large broadcast site
- University of Sydney - university external and specialized sites
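
The crawler's include/exclude rules listed above are described only as regexp (grep) patterns; the product's actual configuration syntax is not shown in this report. As a rough illustration of how such rules typically behave, here is a minimal Python sketch. The pattern lists, the example.com URLs, and the function name `should_crawl` are all hypothetical and are not taken from the product.

```python
import re

# Hypothetical rule sets for illustration only; the product's real
# configuration format is not documented in this report.
INCLUDE_PATTERNS = [r"^https?://www\.example\.com/"]
EXCLUDE_PATTERNS = [r"/private/", r"\.(zip|exe)$"]


def should_crawl(url, include=INCLUDE_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Return True if url matches at least one include pattern
    and does not match any exclude pattern."""
    if not any(re.search(pattern, url) for pattern in include):
        return False
    return not any(re.search(pattern, url) for pattern in exclude)


if __name__ == "__main__":
    for candidate in (
        "https://www.example.com/docs/index.html",
        "https://www.example.com/private/notes.html",
        "https://other.example.org/",
    ):
        print(candidate, should_crawl(candidate))
```

In this sketch exclude rules take precedence over include rules, which is one common convention; a real crawler may resolve conflicts differently.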