SearchTools Blog (searchtools) wrote,

links: Boilerplate-removal code library, enterprise relevance, HTML5

  • boilerpipe - removes clutter around web page content (Java code library)

    The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example, news article extraction) and can also be extended for individual problem settings. Extraction is very fast (milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate. Boilerpipe is a Java library written by Christian Kohlschütter, released under the Apache License 2.0. The algorithms used by the library are based on, and extend, some concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, in New York City, NY, USA. The paper and the presentation slides are available online, and a video of the presentation is freely available (turn speaker balance to the left to improve audio quality). Commercial support is available through Kohlschütter Search Intelligence.

    tags: analysis APIs indexing

  • What makes relevance such a challenge in the enterprise? (sharepoint & fast search blog)

    Nice overview of why internal search is often worse than web search: mainly that there's little meaningful linking within an intranet, little incentive to make a site easily searchable, and security issues with access control. The post recommends setting realistic expectations, not indexing low-value content, looking at third-party relevance tools, offering scoped or zoned search, and tagging content.

    tags: enterprise search engines intranets overviews relevance

  • HTML5 specification (W3C)

    Complex and difficult to read, though I can tell they're trying to make it easier.

    tags: site-search web-search research

  • HTML5 - A Step Forward Towards Semantic Web

    Nice introduction to the new structural tags in HTML5 (section, article, aside, header, hgroup, footer, and nav) and the new content tags (figure, video, audio, and canvas).

    tags: semantic search web-search
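
To make the "shallow text features" idea behind the boilerpipe link a bit more concrete, here is a minimal Java sketch. It does not call the boilerpipe library and does not reproduce its algorithms. It only shows the general flavor: split a page into blocks, then judge each block by two cheap features, word count and link density (the share of words that sit inside links). The class name, the regular expressions, and both thresholds are my own assumptions for illustration, not values from the paper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy block classifier for illustration only; NOT boilerpipe's actual algorithm. */
final class ShallowFeatureSketch {

    // Hypothetical thresholds picked for this demo, not taken from the paper.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Tags treated as block boundaries (a crude stand-in for real HTML parsing).
    private static final Pattern BLOCK_BREAK =
            Pattern.compile("(?i)</?(p|div|li|ul|ol|h[1-6]|td|tr|table|br)\\b[^>]*>");
    // Anchor elements; group 1 is the linked markup.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Number of whitespace-separated words once all tags are stripped. */
    static int countWords(String markup) {
        String text = ANY_TAG.matcher(markup).replaceAll(" ").trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    /** Fraction of a block's words that appear inside anchor elements. */
    static double linkDensity(String block) {
        int total = countWords(block);
        if (total == 0) {
            return 0.0;
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(block);
        while (m.find()) {
            linked += countWords(m.group(1));
        }
        return (double) linked / total;
    }

    /** A block counts as content if it is long enough and not dominated by links. */
    static boolean isContent(String block) {
        return countWords(block) >= MIN_WORDS && linkDensity(block) <= MAX_LINK_DENSITY;
    }

    /** Splits the page at block-level tags and returns the plain text of content blocks. */
    static List<String> extractContent(String html) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            if (isContent(block)) {
                String text = ANY_TAG.matcher(block).replaceAll(" ").trim();
                kept.add(text.replaceAll("\\s+", " "));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html =
                "<div><a href=\"/\">Home</a> <a href=\"/news\">News</a></div>"
                + "<p>The boilerpipe library detects and removes the clutter "
                + "around the main text of a web page.</p>"
                + "<p>Copyright 2010</p>";
        for (String block : extractContent(html)) {
            System.out.println(block);
        }
    }
}
```

On the small sample page in main, the navigation links and the copyright line are dropped because they are too short, and only the full sentence survives. Like the library described above, the sketch needs nothing beyond the single input document, but a real extractor would use a proper HTML parser and more careful rules than these two thresholds.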

Posted from Diigo. The rest of my favorite links are here.

