March 21, 2008
Vol. 30 Issue 12
Page(s) 29 in print issue
dtSearch Performs Incredible Feats
Search Terabytes Of Text In Less Than A Second
Imagine a text retrieval product that can search through terabytes of data in less than a single second in most cases. Now imagine that you can perform that same function in Chinese, Arabic, and hundreds of other international languages that fall under the Unicode standard. Depending on font size and type, page size, margins, spacing, and software used, multiple experts estimate that a terabyte of data is approximately 500 million typewritten pages, all of which, once indexed, can be searched for specific matching criteria in less than a second.
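To put that page estimate in perspective, the arithmetic can be sketched in a few lines of Python. The sketch assumes a decimal terabyte (10^12 bytes), which the estimate itself does not specify, and the variable names are purely illustrative.

```python
# Assumption: a decimal terabyte (10**12 bytes); the quoted estimate
# does not say which definition of "terabyte" it uses.
TERABYTE_BYTES = 10**12
TYPEWRITTEN_PAGES = 500_000_000  # page count quoted in the article

# Average number of bytes each typewritten page would occupy.
bytes_per_page = TERABYTE_BYTES // TYPEWRITTEN_PAGES
print(bytes_per_page)
```

Under that assumption, the estimate works out to roughly 2,000 bytes per page, or about 2,000 characters of single-byte plain text.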
dtSearch (301/263-0731; www.dtsearch.com) has a unique system to accomplish these tasks. Development began on dtSearch in 1988, and after years of fine-tuning, the company released the first application in early 1991. Now, dtSearch can instantly search terabytes of text in many dozens of formats, including an ever-expanding list of Microsoft file types and non-Microsoft file formats, as well as data that's available through HTTP/HTTPS connections, public Web sites, popular Web-ready file types, and dynamically generated Web data.
"In addition to expanding the types of data that the dtSearch product line supports," says David Thede, president of dtSearch, "we have also expanded the list of programming languages that our developer component supports. Originally, we supported only C/C++. Then we added Java, COM, and .NET. More recently, for example, we have expanded the .NET Spider API to include not only current versions of .NET but also sample code covering 32-bit and 64-bit .NET."
History & Evolution
According to Thede, the original idea behind dtSearch was to write a better text-retrieval application. The goal was to search a large volume of files instantly and with 100% accuracy. After executing a search, the goal was to intelligently sort matching documents and then to highlight hits in retrieved files. In early 1991, the first dtSearch application was released.
"While the ultimate goal of writing a better text retrieval application has remained," notes Thede, "the company has seen some fairly radical transformations in implementing that original goal." One transformation was related to the typical size of data that customers needed to search. "Originally," says Thede, "our slogan was 'instantly search megabytes of text.' Then, it evolved to 'instantly search gigabytes of text.' And now, today, the slogan is 'instantly search terabytes of text.'"
"But it isn't just the slogan that changed," notes Thede. "The index capacity is greater now by orders of magnitude compared to what it was in 1991. Very few people could even conceive of what a terabyte of data was back then. Now, a single index can hold more than a terabyte of data, with indexed search time still typically less than a second."
Partners & Changes
"Another major change was adding a developer component to the dtSearch product line," adds Thede. In the first few years after its initial release, the dtSearch product line was an end-user application only. Then, in 1994, Symantec (www.symantec.com) approached dtSearch about including its search technology in one of the first applications for 32-bit Windows.
"To accommodate Symantec," says Thede, "we turned the dtSearch end-user application into a dynamic link library (DLL), which Symantec could programmatically integrate with its own code. Symantec embedded the DLL in Norton Navigator," explains Thede, "which was released alongside Microsoft's (www.microsoft.com) initial release of a 32-bit Windows operating system. They sold millions of copies of Norton Navigator with the embedded dtSearch technology."
According to Thede, after the Norton Navigator release, the company began marketing the DLL as the dtSearch Engine component for programmers to embed in their own applications. While dtSearch has continued to manufacture enterprise-ready solutions for Windows, it is in the development component arena that the dtSearch product line has seen the largest growth.
Many Phases, Many Options
The dtSearch Text Retrieval Engine now comes in several different versions for programmers, including Engine for Linux and Engine for Win & .NET. The company added an easy-to-use, no-programming-required developer application for Microsoft IIS (Internet Information Services) Web servers and a similar developer application for publishing data to CDs, DVDs, and other portable media. "And," Thede says, "we added a built-in Spider that has also turned into a development component."
"The original dtSearch line was English only," notes Thede. "But as the international market grew, full Unicode support was added, including hundreds of international languages." Input from customers with advanced Asian and Arabic text analysis needs inspired dtSearch to add API support for third-party morphological analyzers.
"Combining its text retrieval with advanced morphological analysis enabled better handling of issues such as Asian and Arabic morphology and entity extraction," continues Thede. "We worked with a leading vendor of language analyzer technologies and also with some of the smaller vendors to ensure that the resulting API would work efficiently with their products."
The initial implementation of the API was functional, but post-release feedback from linguistic technology vendors, as well as customers, made dtSearch realize that it should change the API to allow the analyzers access to greater volumes of input data. Giving the language analyzers larger, consistently sized blocks of text to work with was an important step for optimizing certain advanced linguistic processes. That change subsequently resulted in a release that incorporated the new generation of language integration technology.
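The idea of handing an analyzer larger, consistently sized blocks of text can be illustrated with a minimal Python sketch. This is not the dtSearch API; the function name `fixed_size_blocks`, the `block_size` parameter, and the whitespace-based break rule are all assumptions made for illustration only.

```python
from typing import Iterator


def fixed_size_blocks(text: str, block_size: int) -> Iterator[str]:
    """Yield blocks of at most block_size characters.

    Hypothetical helper: prefers to end a block just after a space so that
    words are not split, and falls back to a hard cut when the block
    contains no usable space (common in text written without spaces).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    start = 0
    n = len(text)
    while start < n:
        end = min(start + block_size, n)
        if end < n:
            # Look for the last space inside the candidate block.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the preceding block
        yield text[start:end]
        start = end


sample = "alpha beta gamma delta"
blocks = list(fixed_size_blocks(sample, 12))
```

A caller would pass each block to whatever analyzer it uses. Because the blocks join back into the original text exactly, character offsets reported per block can be mapped back to the source document.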
What's Different From The Competition?
In addition to its ability to quickly search a large amount of text, dtSearch offers built-in proprietary file format support and converters, eliminating the need for third-party support. "We built our own proprietary file format and display technology," says Thede. "Relying on our own file format support enables us to greatly simplify licensing for our developer customers and allows us to optimize the file parsing process for efficient indexing, searching, and hit-highlighting."
The built-in Spider is another significant distinction. According to Thede, the dtSearch product line offers extensive and highly flexible distributed or federated search options. All products are equipped with integrated relevancy-ranking and hit-highlighted file displays. The product line also has a flexible developer API for integrating dtSearch text search and file format support capabilities into other applications.
"And last," says Thede, "our company is known for speedy adoption of new programming standards, new operating systems, and new file types. Plus, we have a flexible licensing model that doesn't involve per-document and other arbitrary limits that enterprises and developers find maddening. For example, we sell a lot of royalty-free OEM licenses, which make the product line an easy choice."
by Julie Sartain