Technology

Overview

SearchArea is a commercial geo-search engine with the following features:
  • Powerful support for location search based on OpenGIS standards (documents can represent polygons, points, multipoint geometries, etc.)
  • Faster than leading commercial search engines
  • Highly relevant search results based on state-of-the-art search algorithms
  • Fully distributed architecture for massive scalability in data and user volumes
  • Pure Java implementation runs on all Java supported platforms

Full-text search features

The Apache Software Foundation's Lucene search library is at the core of SearchArea. Lucene is a fully featured search engine with fast indexing and state-of-the-art search algorithms. Unlike the search facilities found in many relational databases, the engine offers excellent relevance ranking and a powerful query language.
For the average user who wants to perform basic queries by typing in keywords, the search engine returns the most relevant results first, using statistical natural language processing techniques to identify the most significant terms in the query and the underlying documents.
For power users who want more control over queries, the query syntax supports boolean, phrase, fuzzy, wildcard and fielded queries as well as term and document boosting.
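The query types listed above can be illustrated with a few strings written in Lucene's classic query parser syntax. This is only a sketch: the field names (title, body) and the search terms are invented for illustration, and it is an assumption that a SearchArea application exposes this syntax unchanged. The code uses generics, so it needs Java 5 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative query strings in Lucene's classic query parser syntax.
// Field names ("title", "body") and terms are hypothetical.
class QuerySyntaxExamples {

    static Map<String, String> examples() {
        Map<String, String> q = new LinkedHashMap<String, String>();
        q.put("boolean",    "restaurant AND (italian OR french) NOT chain");
        q.put("phrase",     "\"fish and chips\"");      // exact word sequence
        q.put("fuzzy",      "resturant~");              // tolerates misspellings
        q.put("wildcard",   "rest*");                   // prefix match
        q.put("fielded",    "title:restaurant");        // restrict to one field
        q.put("term boost", "title:restaurant^4 body:restaurant");
        return q;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : examples().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Only term boosting appears here because it is part of the query string; document boosting is applied when a document is indexed.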


The "significant terms" feature is an extension to the Lucene functionality and is used to automatically identify key words or phrases found in search results. With a single mouse click these terms can be added to or removed from the query to help refine the concept and improve the quality of search results.

A "highlighter" feature offers the ability to summarise and highlight the most relevant parts of a document when displaying query results.
The core engine is modular in design and can easily be reconfigured to use different relevance scoring algorithms and text processors (e.g. a different stemmer).

Spatial search features

Relational databases such as Oracle implement the OpenGIS.org specifications in order to provide support for spatial queries. SearchArea is the first search engine to follow the OpenGIS standard for use in pure search technology.
Geography is defined internally using co-ordinates in order to provide the best support for queries. A spatial query tests the relationship between a shape representing the searcher's area of interest and shapes held in the index representing the location of documents' subject matter.
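One of the simplest spatial relationships is "does the area of interest contain this document's point?". The sketch below tests it with the even-odd (ray casting) rule on planar co-ordinates. It illustrates the idea only; the source does not describe how SearchArea evaluates spatial relationships, and the class and method names are hypothetical.

```java
// Minimal sketch of one spatial relationship test: is a point inside a polygon?
// Uses the even-odd (ray casting) rule on planar co-ordinates.
// Not SearchArea's implementation; names are hypothetical.
class PointInPolygon {

    // xs/ys hold the polygon's vertices in order; the ring is closed implicitly.
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        int n = xs.length;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            // Does the edge from vertex j to vertex i straddle the horizontal line y = py?
            boolean straddles = (ys[i] > py) != (ys[j] > py);
            if (straddles) {
                // x co-ordinate where the edge crosses that horizontal line
                double xCross = xs[j] + (py - ys[j]) * (xs[i] - xs[j]) / (ys[i] - ys[j]);
                if (px < xCross) {
                    inside = !inside; // a crossing to the right of the point flips the state
                }
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        // Area of interest: a 10 x 10 square
        double[] xs = {0, 10, 10, 0};
        double[] ys = {0, 0, 10, 10};
        System.out.println(contains(xs, ys, 5, 5));   // true
        System.out.println(contains(xs, ys, 15, 5));  // false
    }
}
```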
Documents stored in a SearchArea index can represent any kind of shape defined in the OpenGIS.org Simple Features Specification (e.g. lines, points, polygons, multipoints, multipolygons). The shapes are expressed in a standard plain text format called “WKT” (Well Known Text), and queries which have a spatial element use the same WKT format to define the area of interest. The WKT format is powerful enough to express many location types, e.g. single retail locations (using points), business franchises (using multipoints), geographical studies (using polygons) and queries for restaurants situated within 5 miles of the next 10 motorway exits (using multipolygons).
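To make the WKT format concrete, the following sketch builds the text for a point, a multipoint and a polygon. The helper class is hypothetical and is not part of any SearchArea API; it also assumes x is longitude and y is latitude, which the source does not state.

```java
// Hypothetical helpers that format a few geometry types as WKT text.
// Assumes x = longitude and y = latitude.
class WktExamples {

    // Joins co-ordinate pairs as "x1 y1, x2 y2, ...".
    static String coords(double[][] pts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pts.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append(pts[i][0]).append(' ').append(pts[i][1]);
        }
        return sb.toString();
    }

    // e.g. a single retail location
    static String point(double x, double y) {
        return "POINT (" + x + " " + y + ")";
    }

    // e.g. all branches of a franchise. Some WKT writers wrap each point
    // in its own parentheses; both spellings are seen in practice.
    static String multiPoint(double[][] pts) {
        return "MULTIPOINT (" + coords(pts) + ")";
    }

    // e.g. a study area. The ring must be closed: last vertex equals the first.
    static String polygon(double[][] ring) {
        return "POLYGON ((" + coords(ring) + "))";
    }

    public static void main(String[] args) {
        System.out.println(point(-0.1276, 51.5072));
        System.out.println(multiPoint(new double[][] {{10, 40}, {40, 30}}));
        System.out.println(polygon(new double[][] {{0, 0}, {10, 0}, {10, 10}, {0, 10}, {0, 0}}));
    }
}
```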
The SearchArea Developer's Guide offers further details on spatial queries.

Defining location for end users

End users do not want to type in co-ordinates to perform queries, so SearchArea applications are typically deployed with a user interface that makes it easy to define a location, such as:
  • Text-based interfaces that accept the user's choice of town name or zip code
  • Map-based interfaces where users can highlight their choice of location on a map (now freely provided by Google, Yahoo, ESRI and Microsoft)
  • Location-enabled devices, such as mobile phones, which automatically provide the user's location information
Each type of interface shown above converts the user's choice of location into a WKT representation that can be used to query the SearchArea index. The demos on this website show examples of Google Map based interfaces with Google Earth integration.
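As a sketch of that conversion, the code below turns "within N miles of this position" into a WKT polygon that approximates a circle. The class name, the 69 miles-per-degree constant and the flat-earth approximation are all assumptions made for illustration; they are adequate for small radii away from the poles but are not taken from SearchArea.

```java
import java.util.Locale;

// Hypothetical converter: a centre position plus a radius becomes a WKT polygon.
// Uses a rough planar approximation (about 69 miles per degree of latitude).
class LocationToWkt {

    static final double MILES_PER_DEGREE_LAT = 69.0; // rough average, an assumption

    static String circleAsPolygon(double lon, double lat, double radiusMiles, int segments) {
        double dLat = radiusMiles / MILES_PER_DEGREE_LAT;
        // Degrees of longitude shrink with the cosine of the latitude.
        double dLon = radiusMiles / (MILES_PER_DEGREE_LAT * Math.cos(Math.toRadians(lat)));

        StringBuilder sb = new StringBuilder("POLYGON ((");
        for (int i = 0; i <= segments; i++) {
            // i % segments makes the final vertex identical to the first, closing the ring.
            double angle = 2 * Math.PI * (i % segments) / segments;
            double x = lon + dLon * Math.cos(angle);
            double y = lat + dLat * Math.sin(angle);
            if (i > 0) sb.append(", ");
            sb.append(String.format(Locale.ROOT, "%.6f %.6f", x, y));
        }
        return sb.append("))").toString();
    }

    public static void main(String[] args) {
        // 5 miles around a position reported by a map click or a mobile device
        System.out.println(circleAsPolygon(-0.1276, 51.5072, 5.0, 16));
    }
}
```

A text-based interface would do the same after first looking up the co-ordinates for the town name or zip code the user entered.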

Defining location for documents

Not all documents which need to be placed into a SearchArea index come conveniently prepopulated with coordinate information in the WKT format. Applications built on the SearchArea technology therefore typically involve some form of document parsing to recognize location information such as:
  • Postcodes
  • Telephone codes
  • Town names
  • Metadata such as latitude-longitude coordinates
These location references are then converted into a WKT representation before the document is placed in the index. More information on the WKT format and spatial querying can be found in the developer's guide.
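The parsing step might look like the following sketch, which finds UK-style postcodes with a simplified regular expression and maps them to WKT points through a lookup table. The pattern does not cover every valid postcode, and the gazetteer entries and their co-ordinates are illustrative placeholders; a real application would use a proper postcode dataset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical document parser: recognises postcodes and converts them to WKT.
class PostcodeLocator {

    // Simplified UK postcode pattern: outward code, optional space, inward code.
    static final Pattern POSTCODE =
            Pattern.compile("\\b([A-Z]{1,2}[0-9][0-9A-Z]?) ?([0-9][A-Z]{2})\\b");

    // Placeholder gazetteer: outward code -> {longitude, latitude}.
    // Values are illustrative, not authoritative.
    static final Map<String, double[]> GAZETTEER = new HashMap<String, double[]>();
    static {
        GAZETTEER.put("SW1A", new double[] {-0.1357, 51.501});
        GAZETTEER.put("M1", new double[] {-2.235, 53.478});
    }

    // Returns one WKT point per recognised postcode found in the text.
    static List<String> locate(String text) {
        List<String> wkt = new ArrayList<String>();
        Matcher m = POSTCODE.matcher(text);
        while (m.find()) {
            double[] c = GAZETTEER.get(m.group(1));
            if (c != null) {
                wkt.add("POINT (" + c[0] + " " + c[1] + ")");
            }
        }
        return wkt;
    }

    public static void main(String[] args) {
        System.out.println(locate("Visit us at 10 Downing Street, London SW1A 2AA."));
    }
}
```

Telephone codes and town names could be handled the same way, each with its own pattern or dictionary feeding the same lookup.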

Architecture overview

SearchArea is a pure Java solution running on the Java 1.4 platform. The engine is composed of a number of components that can be deployed in a highly scalable distributed solution running on many machines, or can be configured to operate as a library that runs in-process in a small-scale application.

Distributed large scale deployment

In large-scale deployments, brokers and indexes can be distributed across multiple machines to provide load balancing and fail-over. Index servers can be both partitioned and replicated, which avoids having to fit large data volumes on a single machine or to service large volumes of search requests from one. The broker automatically manages the merging of query results from multiple partitions, the load balancing of requests across replicated index servers, and fail-over in the event of failure. Java's Remote Method Invocation protocol is used for communication between servers, and indexes can register with more than one broker to avoid a single point of failure.
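The merging step can be sketched as follows: each partition returns its own ranked hits and the broker keeps the overall best k. The class and field names are hypothetical, and the sketch assumes that scores from different partitions are directly comparable, which the source does not discuss. It also uses generics, so it targets a newer Java than the 1.4 platform mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a broker merging results from several index partitions.
class BrokerMerge {

    static class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Combines the per-partition result lists and returns the k highest-scoring hits.
    static List<Hit> mergeTopK(List<List<Hit>> partitionResults, int k) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> partition : partitionResults) {
            all.addAll(partition);
        }
        // Sort by descending score.
        Collections.sort(all, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Double.compare(b.score, a.score);
            }
        });
        return new ArrayList<Hit>(all.subList(0, Math.min(k, all.size())));
    }
}
```

Sorting the combined list keeps the sketch short; with many partitions, a bounded priority queue would avoid sorting hits that can never reach the top k.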

Small scale deployment

Not all applications need the overhead of a brokered architecture, so the SearchArea engine is designed to be embeddable: the broker and index components can run within a single Java Virtual Machine without any need for remote method calls. Should the need arise, applications can migrate to a distributed architecture without having to redesign the application code.

Typical application configuration

The SearchArea engine is often configured in a web-based environment although this is not a mandatory requirement. Each deployment tends to have its own requirements for providing a user interface that allows users to pick a location and has its own source of location-based data that needs to be searched.
The user interface typically needs to let the end user define a location using a map or by entering the name of a location.
The source of data to be searched also varies in format between applications. It ranges from structured data, such as XML or a database containing precise coordinate information, to unstructured data, such as web pages, where location information must be extracted by identifying key patterns in the text such as postcodes or telephone dialling codes. A batch task is required to add this content to the index using the index APIs. We can tailor existing example solutions to help with these application-specific tasks in order to get a solution up and running quickly.

Copyright © 2004-2006, Inperspective Technology Ltd. All rights reserved.