Inventors:
M. Margaret Withgott - Los Altos CA
Steven C. Bagley - Palo Alto CA
Dan S. Bloomberg - Palo Alto CA
Daniel P. Huttenlocher - Ithaca NY
Ronald M. Kaplan - Palo Alto CA
Todd A. Cass - Cambridge MA
Per-Kristian Halvorsen - Los Altos CA
Ramana B. Rao - San Francisco CA
Douglass R. Cutting - Menlo Park CA
Assignee:
Xerox Corporation - Rochester NY
International Classification:
G06K 9/00
Abstract:
A method and apparatus for processing a document image, using a programmed general-purpose or special-purpose computer, includes forming the image into image units and determining at least one image unit classifier for at least one of the image units, without decoding the content of that image unit. The classifier of that image unit is then compared with a classifier of another image unit. The classifier may be image unit length, width, location in the document, font, typeface, cross-section, number of ascenders, number of descenders, average pixel density, length of the top line contour, length of the base contour, location of the image unit with respect to neighboring image units, vertical position, horizontal inter-image-unit spacing, and so forth. The comparison can be made against classifiers of image units of words in a reference table, or against classifiers of other image units in the document. Equivalence classes of image units can be generated, from which word frequency and significance can be determined.
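
The idea in the abstract can be illustrated with a minimal sketch, which is not the patented implementation. It assumes each image unit has already been segmented into a small binary bitmap (a list of rows of 0/1 pixels), and it uses only three of the listed classifiers: width, height, and average pixel density. The function names (`classify`, `equivalence_classes`, `frequencies`), the 5% density bin, and the use of exact descriptor equality as the comparison are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter

def classify(unit, bin_percent=5):
    """Return a shape descriptor for one image unit without decoding it.

    `unit` is a list of equal-length rows of 0/1 pixels (hypothetical input
    format). The descriptor is (width, height, density bin).
    """
    height = len(unit)
    width = len(unit[0]) if height else 0
    area = width * height
    ink = sum(sum(row) for row in unit)
    # Quantize ink density into integer bins so tiny differences in
    # pixel count do not split otherwise similar units.
    density_bin = (100 * ink) // (area * bin_percent) if area else 0
    return (width, height, density_bin)

def equivalence_classes(units):
    """Group unit indices whose descriptors compare equal."""
    classes = {}
    for index, unit in enumerate(units):
        classes.setdefault(classify(unit), []).append(index)
    return classes

def frequencies(units):
    """Count how often each descriptor occurs, a stand-in for word frequency."""
    return Counter(classify(unit) for unit in units)

# Three toy image units: the first two have the same shape, the third differs.
unit_a = [[1, 1, 0],
          [1, 0, 0]]
unit_b = [row[:] for row in unit_a]
unit_c = [[1, 1, 1, 1],
          [1, 0, 0, 1]]
units = [unit_a, unit_b, unit_c]

classes = equivalence_classes(units)
counts = frequencies(units)
```

Here the two matching units fall into one class and the third into another, so the largest class identifies the most frequent unit without any character recognition. A practical system would likely need a tolerance-based comparison over more of the listed classifiers, since scanned copies of the same word rarely match pixel for pixel.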