Data Retrieval at Your Fingertips

RICHARD O'REILLY is director of computer analysis for The Times

Information is everywhere, but getting your hands on what you need when you need it can be murder.

PageKeeper 1.0, an $895 Windows program from Caere Corp. of Los Altos, Calif., aims to help. Its goal is to gather all kinds of text into your computer or networked file server and then show you how relevant it is to any particular information request you make.

Caere has built a good reputation for optical character recognition--turning images of scanned text into files that can be edited--with its OmniPage software. In PageKeeper, Caere has essentially built a document indexing system around its OmniPage Professional OCR system to create a complete document database program.

Thus, armed with a scanner, your computer can take in any combination of pictures and text from newspapers, magazines, journals, court transcripts and books.

PageKeeper takes a unique approach to data retrieval that goes beyond the typical Boolean and proximity word searches usually found in text databases. You type in a list of words the sought-after articles should contain and PageKeeper’s “weighted-word search” determines the degree of relevancy each article has to your request. Titles of documents found are then listed in declining order of relevance.
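To make the idea of a weighted-word search concrete, here is a minimal sketch in Python. It is not Caere's algorithm, which is not described here; it assumes a common stand-in in which each query word is weighted by how often it appears in an article and how rare it is across the whole collection. The function names `tokenize` and `rank_documents` are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Split text into lowercase words with surrounding punctuation removed."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def rank_documents(query_words, documents):
    """Return (title, score) pairs in declining order of relevance.

    `documents` maps a title to its text. The weighting below is an
    assumed stand-in, not the scheme PageKeeper actually uses.
    """
    counts = {title: Counter(tokenize(body)) for title, body in documents.items()}
    n_docs = len(documents)
    scores = {}
    for title, words in counts.items():
        total = sum(words.values()) or 1
        score = 0.0
        for q in query_words:
            q = q.lower()
            # Number of documents that contain this query word at all.
            doc_freq = sum(1 for c in counts.values() if q in c)
            if doc_freq == 0:
                continue
            # Rare words count for more than words found in every document.
            rarity = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
            score += (words[q] / total) * rarity
        scores[title] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_documents(["scanner", "ocr"], documents)` returns every title with a score, so an article that contains none of the words still appears, at the bottom of the list with a score of zero.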

You won’t have any trouble understanding this hierarchy of documents because the titles are color-coded in a bar chart. Names displayed with a red bar most closely match the search criteria specified. Green bars are less relevant, and blue bars are the least relevant. The length of each bar also indicates the degree of relevancy, so the scheme works as well on a monochrome screen.

Once you have found one article that meets your requirements, you can instruct the program to find all other articles that most closely match it.

According to Larry Miller, Caere’s vice president of advanced products, PageKeeper indexes documents in a complex process that analyzes how words fall within sentences and paragraphs and creates a mathematical model of the article. It also creates a list of the words for each article.
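The details of that mathematical model are not given, but the general idea of finding articles that resemble a chosen one can be sketched with a simple assumption: treat each article as a vector of word counts and compare vectors by the cosine of the angle between them. This sketch ignores where words fall within sentences and paragraphs, which PageKeeper reportedly takes into account, and the names `word_vector`, `cosine` and `most_similar` are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def word_vector(text):
    """Build a word-count vector for one article."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 when either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(seed_title, documents):
    """Rank every other article by its similarity to the seed article."""
    vectors = {title: word_vector(body) for title, body in documents.items()}
    seed = vectors[seed_title]
    others = [(t, cosine(seed, v)) for t, v in vectors.items() if t != seed_title]
    return sorted(others, key=lambda kv: kv[1], reverse=True)
```

Articles that share no words with the seed score zero, and an article with exactly the same word counts would score 1.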

If you select a scanned article from the database, you can see the page image as well as the text file created from it.

That has a lot of advantages. You can see pages as they were printed, along with any pictures and illustrations. And because you can read the text exactly as it appeared on the page, any errors made during the optical character recognition phase can be worked around.

Any program that stores page images is going to eat up a lot of hard disk storage space, but PageKeeper is supposed to minimize the requirements by doing what it calls “super compression” of the scanned image. Unfortunately, a bug prevents the current version from achieving maximum compression. A fix is in the works and will be shipped free to registered PageKeeper owners when it is ready.

Another welcome change will allow you to easily select multiple word processing or other text files for inclusion in a PageKeeper database. Several steps are now needed to select each one, which is slow and aggravating.

OmniPage Professional is an excellent optical character recognition program, so PageKeeper naturally has fine OCR capabilities. But they aren’t foolproof, and that has an impact on how you use the program.

If you have a desktop scanner with an automatic sheet feeder, it is possible to load it with pages to be scanned, choose the batch scanning, recognition and indexing options of PageKeeper and leave the computer to do its work unattended. But the results may not be what you expect.

The program can figure out where one article ends and another begins if you put a blank sheet of paper in the stack between them. And often it can automatically follow the flow of text across the columns of a page, even when a picture or two breaks up the columns.

But some page layouts will fool it--for instance, a page with lots of small illustrations with long captions surrounding short columns of text, or pages with type printed on various colored backgrounds, or unusual sizes and styles of type used for headlines or for the initial capital letter of the first sentence on the page.

On the other hand, you should be able to run court transcripts through the program without a problem. The best advice is to monitor the program each time you scan a new kind of document layout to see how well it performs.

I needed to adjust the recognition zones the program would have used on some magazine pages I scanned. If the program misunderstands the page structure, the resulting text file will run sentence fragments together, and it won’t make any sense.

Because of the weighted-word search method of retrieval, PageKeeper may still find the article despite the scrambled sentences.

The words, after all, are still there. And you’ll be able to read the original text from the page image, which the program has saved.
