More about file format parsing tools for indexing 
5th-Mar-2009 05:54 pm
This is a follow-up to my article about file format access tools and Lucene Tika.

More free open-source packages: Charlie Hull directed me to the file converters listed in the Omega overview, part of the Xapian project. Presumably, these are packages that they've tested for quality.

From the Omega overview (I've added links to the packages):

PDF (Acrobat) documents using pdftotext (comes with xpdf)
PostScript files using ps2pdf (comes with Ghostscript) and then pdftotext (comes with xpdf)
OpenOffice/StarOffice documents - compressed XML - (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) using unzip
OpenDocument format documents - compressed XML - (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) using unzip
MS Word documents (.doc, .dot) using antiword or catdoc
MS Excel documents (.xls, .xlb, .xlt) using xls2csv (comes with catdoc)
MS PowerPoint documents (.ppt, .pps) using catppt (comes with catdoc)
WordPerfect documents (.wpd) using wpd2text (comes with libwpd)
MS Works documents (.wps, .wpt) using wps2text (comes with libwps)
AbiWord documents (.abw)
Compressed AbiWord documents (.zabw) using gzip
Rich Text Format documents (.rtf) using unrtf
Perl POD documentation (.pl, .pm, .pod) using pod2text
TeX DVI files (.dvi) using catdvi
DjVu files (.djv, .djvu) using djvutxt
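
To make the list above a little more concrete, here is a rough Python sketch of how an indexer might pick a converter by file extension and capture its text output. This is not how Omega itself does it; the names CONVERTERS, build_command and extract_text are made up for illustration, only a handful of the formats are shown, and the command-line arguments are my assumptions about each tool's usual usage, so check the man pages before relying on them.

```python
import shutil
import subprocess
from pathlib import Path

# Extension -> argv template. "{path}" is replaced with the file to convert.
# ASSUMPTION: these invocations reflect each tool's typical usage
# (text written to stdout); verify against the tool's own documentation.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],   # "-" asks for output on stdout
    ".doc": ["antiword", "{path}"],
    ".xls": ["xls2csv", "{path}"],
    ".ppt": ["catppt", "{path}"],
    ".rtf": ["unrtf", "--text", "{path}"],
    ".pod": ["pod2text", "{path}"],
    ".dvi": ["catdvi", "{path}"],
    ".djvu": ["djvutxt", "{path}"],
}


def build_command(path):
    """Return the argv list for converting `path`, or None if unsupported."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [str(path) if arg == "{path}" else arg for arg in template]


def extract_text(path):
    """Run the matching converter and return its output as text.

    Returns None when the format is not in the table or the tool
    is not installed on this machine.
    """
    cmd = build_command(path)
    if cmd is None or shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") would hand the file to pdftotext if it is installed, and return None otherwise, so the indexer can skip or log files it cannot read. The zip-based formats (OpenOffice, OpenDocument) need an extra step to pull the XML out of the archive, so I've left them out of the table.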

Corrections to my previous file format access article:

Corrected Tika project link
Outside-In is now owned by Oracle
Corrected Mark Bennett's Filters article link

(if only the Firefox Link Check plugin still worked...)
