Google This

Lately, the project at work has revolved around an application search engine. We’re storing data in a database alongside documents uploaded by users, and one of the specs calls for search functionality that can quickly query both datasets and return the results. We’re using the Jakarta Lucene search framework as the foundation. Basically, Lucene is a library that provides the under-the-hood search engine functionality: indexing the data according to the keyword and stopword analysis you specify, optimizing the index for the fastest possible searches, and supplying the search algorithms that parse and run queries. Interesting stuff.
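To make the indexing idea a bit more concrete, here is a toy sketch in plain Java of what an indexer does at its simplest: break text into terms, drop stopwords, and map each remaining term to the documents that contain it. This is not Lucene's API or how Lucene is implemented; the class name ToyIndex, the stopword list, and the AND-only search are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index, for illustration only (not Lucene's API).
class ToyIndex {
    // Hypothetical stopword list; a real analyzer would supply its own.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in"));

    // Maps each term to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Lowercase the text, split on non-alphanumerics, and drop stopwords.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Index one document under the given id.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of documents that contain every term in the query.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : analyze(query)) {
            Set<Integer> ids = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }
}
```

The same analyze step runs at index time and at query time, so a query for "the database" only ever looks up "database". A real engine layers scoring, index optimization, and a proper query syntax on top of this basic structure.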

The biggest hurdle with Lucene is the absolute lack of good documentation. There are a few brief tutorials on the web site, and a few other articles out there, but these provide little more than the classic “Hello, World” adapted to Lucene. After a week or so of hacking around the indexing functionality, we’ve got that fairly well nailed down, but now I’m fighting through parsing the search results and returning them to the UI in some sort of generic format. What really turns the process on its ear is the multiple data types that can be searched. If we were just dealing with documents, that wouldn’t be so bad: simply return the document name, path, and perhaps an excerpt with the search terms highlighted. Or, if it were just database data, that wouldn’t be so bad either (though we would still have to deal with pulling the data from various tables in the database). But combine the two and things get sticky. I think I’m going to approach it from both the document and database angles, developing a possible solution for each, then finding the common items between the two and going from there.
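As a first stab at the "common items" idea, here is a rough sketch of what a shared result shape might look like. Nothing here comes from Lucene or from our actual code: the SearchHit class, its fields, and the "table:primaryKey" locator format are hypothetical, and merging by score assumes the scores from both sources are comparable (for example, because they come out of a single index).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical common shape for one search hit, whatever its origin.
final class SearchHit {
    enum Source { DOCUMENT, DATABASE }

    final Source source;
    final String title;    // file name, or a display label built from the row
    final String locator;  // file path, or "table:primaryKey" (assumed format)
    final String excerpt;  // text surrounding the matched terms
    final float score;     // relevance as reported by the search engine

    private SearchHit(Source source, String title, String locator,
                      String excerpt, float score) {
        this.source = source;
        this.title = title;
        this.locator = locator;
        this.excerpt = excerpt;
        this.score = score;
    }

    // Build a hit for an uploaded document.
    static SearchHit fromDocument(String name, String path,
                                  String excerpt, float score) {
        return new SearchHit(Source.DOCUMENT, name, path, excerpt, score);
    }

    // Build a hit for a database row, identified by table and primary key.
    static SearchHit fromRow(String table, String primaryKey, String label,
                             String excerpt, float score) {
        return new SearchHit(Source.DATABASE, label,
                table + ":" + primaryKey, excerpt, score);
    }

    // Combine both result lists into one, highest score first.
    static List<SearchHit> merge(List<SearchHit> documents, List<SearchHit> rows) {
        List<SearchHit> all = new ArrayList<>(documents);
        all.addAll(rows);
        all.sort(Comparator.comparingDouble((SearchHit h) -> h.score).reversed());
        return all;
    }
}
```

The UI would only ever see a list of SearchHit objects and could use the source field to decide whether the locator opens a file or loads a record. Whether a single locator string is enough for rows pulled from several tables is still an open question.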