SearchTools Blog (searchtools) wrote,

More about file format parsing tools for indexing

This is a follow-up to my article about file format access tools and Lucene Tika.

More free open-source packages: Charlie Hull directed me to the file converters listed in the Omega overview, part of the Xapian project. Presumably, the list covers packages that they've tested for quality.

Here is the list from the Omega overview, with links to the packages added:

PDF (Acrobat) using pdftotext (comes with xpdf)
PostScript using ps2pdf (comes with Ghostscript), then pdftotext (comes with xpdf)
OpenOffice/StarOffice documents - compressed XML - (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) using unzip
OpenDocument format documents - compressed XML - (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) using unzip
MS Word documents (.doc, .dot) using antiword or catdoc
MS Excel documents (.xls, .xlb, .xlt) using xls2csv (comes with catdoc)
MS Powerpoint documents (.ppt, .pps) using catppt (comes with catdoc)
Wordperfect documents (.wpd) using wpd2text (comes with libwpd)
MS Works documents (.wps, .wpt) using wps2text (comes with libwps)
AbiWord documents (.abw)
Compressed AbiWord documents (.zabw) using gzip
Rich Text Format documents (.rtf) using unrtf
Perl POD documentation (.pl, .pm, .pod) using pod2text
TeX DVI files (.dvi) using catdvi
DjVu files (.djv, .djvu) using djvutxt
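
To show how a list like this might be wired into an indexer, here is a minimal sketch that maps file extensions to the command-line converters named above. The table and the helper names (CONVERTERS, command_for, extract_text) are my own assumptions for illustration, not Omega's actual code, and the command-line arguments shown are the ones I'd expect to send text to standard output; check each tool's man page before relying on them.

```python
import shutil
import subprocess
from pathlib import Path

# Extension -> command template; "{}" is replaced by the file path.
# Tool names come from the list above; the arguments are assumptions.
CONVERTERS = {
    ".pdf": ["pdftotext", "{}", "-"],   # "-" asks for text on stdout
    ".doc": ["antiword", "{}"],
    ".dot": ["antiword", "{}"],
    ".xls": ["xls2csv", "{}"],
    ".ppt": ["catppt", "{}"],
    ".pps": ["catppt", "{}"],
    ".wpd": ["wpd2text", "{}"],
    ".wps": ["wps2text", "{}"],
    ".rtf": ["unrtf", "--text", "{}"],
    ".pod": ["pod2text", "{}"],
    ".dvi": ["catdvi", "{}"],
    ".djv": ["djvutxt", "{}"],
    ".djvu": ["djvutxt", "{}"],
}


def command_for(path):
    """Return the argv list for converting `path`, or None if unsupported."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [str(path) if arg == "{}" else arg for arg in template]


def extract_text(path):
    """Run the converter and return its output, or None if unavailable."""
    argv = command_for(path)
    # Skip formats we don't know and tools that aren't installed.
    if argv is None or shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") would return the extracted text if pdftotext is on the PATH, and None otherwise. Keeping the table separate from the code that runs the tools makes it easy to add a format or swap one converter for another (catdoc for antiword, say) without touching the rest.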

Corrections to my previous file format access article:

Corrected Tika project link
Outside-In is now owned by Oracle
Corrected Mark Bennett's Filters article link

(if only the Firefox Link Check plugin still worked...)
