Build your own search engine with open source software

Developing a search engine is a complex undertaking, and the complexity depends on the features and the type of service offered to users. Today a company, or even an individual web developer, can build its own search engine using free tools. Implementing a general-purpose search engine is too costly in terms of resources for most, but developing a search engine optimized and customized for a specific segment is an opportunity worth assessing.
The first interesting tool is Apache Lucene, a high-performance, full-featured text search library written entirely in Java. Lucene can be used in any application that requires full-text search.
Apache Lucene is released under an open source license (the Apache License).
Lucene works on a textual database called an index, which can be stored on disk and/or in RAM, depending on its size and intended purpose.
The index contains a list of documents. When a new resource (document) is added to this list, it is parsed and “indexed”, and the extracted information becomes available for searching.
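To make the idea of parsing and indexing a document more concrete, here is a minimal sketch of an in-memory inverted index in Python. It is not Lucene's API: the `ToyIndex` class, the `tokenize` helper and the sample documents are all hypothetical names invented for illustration, and a real index does much more (analysis, stored fields, on-disk segments).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """Hypothetical in-memory inverted index: term -> ids of documents containing it."""

    def __init__(self):
        self.documents = []   # the list of indexed documents
        self.postings = {}    # term -> set of document ids

    def add_document(self, text):
        """Parse a document, record its terms, and return its id."""
        doc_id = len(self.documents)
        self.documents.append(text)
        for term in tokenize(text):
            self.postings.setdefault(term, set()).add(doc_id)
        return doc_id

    def search(self, query):
        """Return the sorted ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        matches = [self.postings.get(term, set()) for term in terms]
        return sorted(set.intersection(*matches))


index = ToyIndex()
index.add_document("Lucene is a search library written in Java")
index.add_document("Hadoop is a framework for distributed computing")
index.add_document("A search engine for the Linux world")

print(index.search("search"))       # [0, 2]
print(index.search("search java"))  # [0]
```

A document becomes searchable as soon as `add_document` returns, which mirrors the behaviour described above on a very small scale.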
Each search result is “weighted” with a score that measures its relevance to the submitted query.
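As a rough illustration of what such a score can look like, the sketch below ranks documents with a plain TF-IDF weighting in Python. This is an assumption made for the example, not Lucene's actual scoring formula, which is more elaborate; the `score_documents` function and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_documents(query, documents):
    """Rank documents by a simple TF-IDF score; returns (doc_id, score) pairs, best first."""
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    scores = []
    for doc_id, counts in enumerate(term_counts):
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]                      # occurrences of the term in this document
            if tf == 0:
                continue
            df = sum(1 for c in term_counts if term in c)   # documents containing the term
            idf = math.log(1 + n_docs / df)        # rarer terms weigh more
            score += tf * idf
        if score > 0:
            scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


documents = [
    "cheap hotels in Rome",
    "Rome travel guide: hotels, hotels and more hotels",
    "Linux kernel news",
]
ranking = score_documents("hotels rome", documents)
print([doc_id for doc_id, _ in ranking])  # [1, 0]
```

The second document ranks first because it mentions a query term more often, and the third is left out because it matches nothing in the query.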
Lucene can be used on its own to build the search engine “core”, but in more complex cases it has to be placed in a broader context. Of course you can extend Lucene to fit your search needs, or use it as a basis for experimenting with new search paradigms, such as semantic search.
If the search engine works on a considerable amount of data, the first problem that arises is the need for storage (disk space) and computational resources. Here too an open project, Hadoop, is rapidly gaining ground. In February 2008 Yahoo! said that it was running a cluster of 10,000 cores based on Hadoop (and on the Linux operating system).
Apache Hadoop is a framework written in Java for distributed computing on large clusters. Its architecture and engineering solutions enable applications to scale out easily to thousands of nodes and petabytes of data. It is no secret that Hadoop was inspired by two projects “made in Google”: MapReduce and GFS (Google File System).
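For readers who have never seen the MapReduce model, the following Python sketch runs a word count through its three steps (map, shuffle, reduce) inside a single process. It only illustrates the programming model and does not use Hadoop's API; all function names are hypothetical, and on a real cluster the framework would spread these steps across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["search engine", "open search"])
print(counts["search"])  # 2
```

Because each call to `map_phase` looks at one document only and each call to `reduce_phase` looks at one key only, the work can be split across machines, which is what lets this model scale out.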
With these references, this first post dedicated to this attractive market segment comes to an end. The next posts will review other tools and technologies that can help you.
It is important to note that Lucene and Hadoop alone are enough to develop a complex search engine, for example a search engine for tourism or for the Linux world … the only limits are imagination and … the budget.
Apache Lucene Official Web Site
Apache Hadoop Official Web Site
Source: build_your_own_search_engine_with_opensource