A Better System to Scan Text Into PC

Personal computers are superb at churning out beautifully printed pages. But getting a printed page back into a computer where its text can be revised, indexed and stored is a more difficult task.

The process is called optical character recognition. It requires a peripheral device called a scanner, connected to the computer, which essentially takes a picture of the page. Then special software deciphers the letters and numerals on the page, turning them into the coded symbols that computers can manipulate.

Caere Corp. of Los Gatos, Calif., and Hewlett-Packard Co. of Cupertino, Calif., have teamed up to help the process work a little better.

H-P’s new color scanner, the ScanJet IIc, comes with a built-in enabling technology called AccuPage. Caere’s new OmniPage Professional 2.0 software, available for both IBM-compatible and Macintosh computers, uses it to advantage.

It is not a cheap solution. The ScanJet IIc lists for $2,195 for PCs, including the expansion card needed for the connection, and $1,995 for Macintosh computers, which don’t need the card. The OmniPage Professional 2.0 software retails for $995 in either Mac or PC versions. The PC version requires Windows 3.0 to operate.

(You can buy less-expensive scanners to use with OmniPage Professional 2.0, including H-P’s $995 monochrome ScanJet Plus, but you forgo the advantage of AccuPage.)
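For readers adding up the bill, the list prices quoted above can be totaled in a few lines of Python. This is only a convenience sketch: the dictionary labels are made up for illustration, and it assumes the ScanJet Plus price applies regardless of computer type, which the prices above do not spell out.

```python
# List prices quoted in the article, in dollars.
SOFTWARE_PRICE = 995  # OmniPage Professional 2.0, Mac or PC

# Labels are illustrative; the ScanJet Plus price is assumed to be platform-independent.
scanner_prices = {
    "ScanJet IIc for PC (with card)": 2195,
    "ScanJet IIc for Macintosh": 1995,
    "ScanJet Plus (monochrome, no AccuPage)": 995,
}

# Each complete system is one scanner plus one copy of the software.
system_totals = {name: price + SOFTWARE_PRICE for name, price in scanner_prices.items()}

for name, total in system_totals.items():
    print(f"{name}: ${total:,}")
```

By this tally the color system runs $3,190 on a PC and $2,990 on a Macintosh, while the monochrome route comes to $1,990.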

With AccuPage, the new scanner is able to vary the contrast as a page is scanned so that it is always optimal. The second step of the process, optical character recognition, or OCR, depends on proper contrast between the type image and the background to match the character patterns it knows with the image it sees as accurately as possible.

The new system shows its strength on a magazine page on which portions of text are printed over colored backgrounds. Without AccuPage, the contrast between the text and the backgrounds is too slight for OCR to work. But with AccuPage turned on, the scanner automatically adjusts the contrast for each colored block of text, as well as for the remainder of the text on the white portion of the page. The result is virtually perfect character recognition.
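The idea of adjusting contrast block by block can be illustrated in software, though what follows is not H-P’s method. AccuPage works inside the scanner and its details are not described here. This is a minimal Python sketch of a related, well-known technique, choosing a separate black-or-white cutoff for each region of an image. The function names, the pixel values and the tiny test page are all assumptions made up for the example.

```python
def global_threshold(image, cutoff=128):
    """Mark a pixel as ink (1) if it is darker than one page-wide cutoff."""
    return [[1 if px < cutoff else 0 for px in row] for row in image]


def block_threshold(image, block_width):
    """Pick a separate cutoff for each vertical block of columns.

    The cutoff is the midpoint between the darkest and lightest pixel
    in the block, a deliberately simple stand-in for real contrast control.
    """
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for start in range(0, width, block_width):
        end = min(start + block_width, width)
        values = [image[y][x] for y in range(height) for x in range(start, end)]
        cutoff = (min(values) + max(values)) / 2
        for y in range(height):
            for x in range(start, end):
                result[y][x] = 1 if image[y][x] < cutoff else 0
    return result


# Hypothetical gray levels (0 = black, 255 = white).
WHITE, INK, TINT, INK_ON_TINT = 240, 20, 100, 40

# Left half: dark type on white paper. Right half: dark type on a colored tint.
page = [
    [WHITE, INK,   WHITE, WHITE, TINT, INK_ON_TINT, TINT,        TINT],
    [WHITE, INK,   INK,   WHITE, TINT, INK_ON_TINT, INK_ON_TINT, TINT],
    [WHITE, WHITE, INK,   WHITE, TINT, TINT,        INK_ON_TINT, TINT],
]

one_cutoff = global_threshold(page)
per_block = block_threshold(page, block_width=4)

for row in per_block:
    print("".join("#" if px else "." for px in row))
```

With a single cutoff, the whole tinted half turns black and the type on it disappears. With a cutoff chosen per block, the type on both halves survives, which is the kind of result OCR software needs.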

The system works well with printed pages and clean photocopies, whether the text is of typeset quality or typewriter quality or comes from a dot matrix printer. But it does a poor job with pages that have been received by fax because the characters are generally too irregular. Fax images received directly by the computer, without ever being printed, can be converted with more success, especially if they were sent in fine-resolution mode.

The OCR portion of the process is fairly fast, and the OmniPage Professional 2.0 software shows you how it is progressing by displaying the portion of the scanned image being recognized. The system is complete enough to read common type styles from about 1/12 inch high (six-point type) to an inch high (72-point type). It doesn’t work with stylized type such as script or other fancy faces. You can teach the software how to recognize symbols or characters that it consistently misses.
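The type sizes above follow from simple arithmetic, sketched here in Python. The sketch assumes the desktop-publishing point of exactly 1/72 inch, consistent with 72-point type being an inch high; the traditional printer’s point is very slightly smaller. The function name is invented for the example.

```python
from fractions import Fraction

POINTS_PER_INCH = 72  # assumption: the desktop-publishing point


def points_to_inches(points):
    """Convert a type size in points to an exact height in inches."""
    return Fraction(points, POINTS_PER_INCH)


smallest = points_to_inches(6)   # smallest type the software reads
largest = points_to_inches(72)   # largest type the software reads

print(f"6-point type: {smallest} inch, about {float(smallest):.3f}")
print(f"72-point type: {largest} inch")
```

Six-point type works out to 1/12 inch, a little under a tenth of an inch.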

Of course, you also can scan photographs and other graphic images in color or black and white, but the OmniPage Professional 2.0 software treats them all as black and white, with up to 256 shades of gray. The software includes a variety of image editing features.

H-P’s own software included with the ScanJet IIc allows you to retain and control the color in scanned images. That software also allows you to optimize color images for printing on black and white laser printers.

Caere also makes a less-expensive scanning system, Typist Plus Graphics, which is a hand-held unit capable of scanning a swath 5 inches wide. It is priced at $595 for the PC and $695 for the Macintosh.

You can scan a whole page by making several passes sideways across it, and the software will automatically merge the images so that no lines are repeated.
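How Caere’s software merges the passes is not described, and it works on images rather than text. Still, the basic idea of dropping lines that appear in two overlapping passes can be sketched in Python. In this simplified, hypothetical version each pass is already a list of recognized text lines, and the function name and sample lines are made up for illustration.

```python
def merge_swaths(swaths):
    """Join passes in order, dropping lines repeated where passes overlap."""
    merged = []
    for swath in swaths:
        overlap = 0
        # Look for the longest run at the end of the merged text
        # that matches the start of the next pass.
        for size in range(min(len(merged), len(swath)), 0, -1):
            if merged[-size:] == swath[:size]:
                overlap = size
                break
        merged.extend(swath[overlap:])
    return merged


# Three hypothetical passes down a page; neighbors share a line or two.
first_pass = ["line 1", "line 2", "line 3"]
second_pass = ["line 3", "line 4", "line 5"]
third_pass = ["line 4", "line 5", "line 6"]

page_lines = merge_swaths([first_pass, second_pass, third_pass])
print(page_lines)
```

A real image-based merge has to cope with passes that are skewed or scanned at slightly different speeds, which exact matching like this cannot handle.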

I tested the previous version, simply called Typist, which lacked gray-scale graphics capabilities. It was finicky. If the page contained simple text without graphics and was printed with good quality on white paper, Typist worked quite well.

But newspaper stories needed a fair amount of editing to correct scanning errors. Pages with large sections of indented text, as found in some technical manuals, also confused Typist, often causing it to ignore the indented material. Books and magazines are usually impossible to scan unless individual pages can be removed and laid flat.

On the other hand, even with the older version I was able to scan a color photograph as a black and white image and then erase the background and enhance the image with Windows’ Paintbrush program to create a letterhead logo.

With its new gray-scale imaging and graphics editing software, Typist Plus Graphics may be an acceptable graphics scanner. There are cheaper hand scanners for that task, however, such as Logitech’s ScanMan 256/GS at $495 for the PC. For serious OCR work, get a full-page scanner.