Specification
BACKGROUND
[0001] As time progresses, people become more dependent on computers to
help with both work and leisure activities. However, computers operate in a digital domain that requires discrete states to be identified in order for information to be processed. This is contrary to humans who function in a distinctly analog manner where occurrences are never completely black or white, but rather in shades of gray in between. Thus, a central distinction between digital and analog is that digital requires discrete states that are disjunct over time (e.g., distinct levels) while analog is continuous over time. As humans naturally operate in an analog fashion, computing technology has evolved to alleviate difficulties associated with interfacing humans to computers (e.g., digital computing interfaces) caused by the aforementioned temporal distinctions.
[0002] Technology first focused on attempting to input existing typewritten or
typeset information into computers. Scanners or optical imagers were used, at first, to
"digitize" pictures (e.g., input images into a computing system). Once images could
be digitized into a computing system, it followed that printed or typeset material
should also be able to be digitized. However, an image of a scanned page cannot be
manipulated as text or symbols after it is brought into a computing system because it
is not "recognized" by the system, i.e., the system does not understand the page. The
characters and words are "pictures" and not actually editable text or symbols. To
overcome this limitation for text, optical character recognition (OCR) technology was
developed to utilize scanning technology to digitize text as an editable page. This
technology worked reasonably well if a particular text font was utilized that allowed
the OCR software to translate a scanned image into editable text.
[0003] Although text was "recognized" by the computing system, important
additional information was lost by the process. This information included such things as formatting of the text, spacing of the text, orientation of the text, and general page layout and the like. Thus, if a page was double-columned with a picture in the upper right corner, an OCR-scanned page would become a grouping of text in a word processor without the double columns and picture. Or, if the picture was included, it typically ended up embedded at some arbitrary point within the text. This is even
more of a problem when different document construction standards are utilized. A typical OCR technique is generally unable to "convert" or properly recognize structure from another document standard. Instead, the resulting recognition attempts to confine or force recognized parts into its associated standard. When this occurs, an OCR process usually inputs "unknown" markers, such as question marks, into the recognized portions to indicate that it cannot process these components of the document.
SUMMARY
[0004] The following presents a simplified summary of the subject matter in
order to provide a basic understanding of some aspects of subject matter embodiments. This summary is not an extensive overview of the subject matter. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the subject matter. Its sole purpose is to present some concepts of the subject matter in a simplified form as a prelude to the more detailed description that is presented later.
[0005] Systems and methods are provided that employ grammatical parsing to
facilitate in recognition of document structures. A two-dimensional representation of
a document is leveraged to extract a hierarchical structure that facilitates recognition
of the document. The visual structure of the document is grammatically parsed
utilizing two-dimensional adaptations of statistical parsing algorithms. This allows
recognition of layout structures (e.g., columns, authors, titles, footnotes, etc.) and the
like such that structural components of the document can be accurately interpreted.
Additional techniques can also be employed to facilitate document layout recognition.
For example, grammatical parsing techniques that utilize machine learning, parse
scoring based on image representations, boosting techniques, and/or "fast features"
and the like can be employed to facilitate in document recognition. This provides for
efficient document recognition with substantially improved accuracy.
[0006] To the accomplishment of the foregoing and related ends, certain
illustrative aspects of embodiments are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the subject matter may be employed, and the subject matter is intended to include all such aspects
and their equivalents. Other advantages and novel features of the subject matter may become apparent from the following detailed description when considered in conjunction with the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] FIG. 1 is a block diagram of a document visual structure analysis
system in accordance with an aspect of an embodiment.
FIG. 2 is another block diagram of a document visual structure analysis system in accordance with an aspect of an embodiment.
FIG. 3 is yet another block diagram of a document visual structure analysis system in accordance with an aspect of an embodiment.
FIG. 4 is an illustration of an example page from the UWIII database in accordance with an aspect of an embodiment.
FIG. 5 is an illustration of an example equation used to train a mathematical expression recognizer in accordance with an aspect of an embodiment.
FIG. 6 is an illustration of a mathematical expression in accordance with an aspect of an embodiment.
FIG. 7 is a flow diagram of a method of facilitating document visual structure analysis in accordance with an aspect of an embodiment.
FIG. 8 is another flow diagram of a method of facilitating document visual structure analysis in accordance with an aspect of an embodiment.
FIG. 9 illustrates an example operating environment in which an embodiment can function.
FIG. 10 illustrates another example operating environment in which an embodiment can function.
DETAILED DESCRIPTION
[0008] The subject matter is now described with reference to the drawings,
wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the subject matter. It may be evident, however, that subject matter embodiments may be practiced without these
specific details. In other instances, well-known structures and devices are shown in
block diagram form in order to facilitate describing the embodiments.
[0009] As used in this application, the term "component" is intended to refer
to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a computer component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. A "thread" is the entity within a process that the operating system kernel schedules for execution. As is well known in the art, each thread has an associated "context" which is the volatile data associated with the execution of the thread. A thread's context includes the contents of system registers and the virtual address space belonging to the thread's process. Thus, the actual data comprising a thread's context varies as it executes.
[0010] Systems and methods are provided to facilitate in the recognition of
documents via utilization of visual structures. The inherent hierarchical structure of
the document (e.g., document → pages → sections → columns → paragraphs, etc.) is
recognized utilizing two-dimensional parsing mechanisms that employ grammar-
based techniques. By further utilizing machine learning processes with the
grammatical parsing mechanisms, the efficiency of recognizing documents can be
substantially improved while still providing high accuracy. Image scoring techniques
can also be utilized to facilitate in increasing parsing speed and efficiency. Selection
of fast features of the document as well as boosting techniques for parse learning can
also be utilized to increase productivity of the systems and methods.
[0011] Grammatical parsing is utilized for processing computer languages and
natural languages. In the case of computer languages, the grammar is unambiguous and given the input there is one and only one valid parse. In the case of natural languages, the grammar is ambiguous and given the input sequence there are a very large number of potential parses. The desire in statistical natural language parsing is to employ machine learning to yield a scoring function which assigns the highest score to the correct parse. In the systems and methods provided herein, the visual
structure layout is modeled as a grammar, and a global search for an optimal parse is
performed based on a grammatical cost function. Machine learning can then be
utilized to discriminatively select features and set all parameters in the grammatical
parsing process, adapting to a variety of visual structure layouts.
[0012] In FIG. 1, a block diagram of a document visual structure analysis
system 100 in accordance with an aspect of an embodiment is shown. The document visual structure analysis system 100 is comprised of a document visual structure analysis component 102 that receives an input 104 and provides an output 106. The document visual structure analysis component 102 utilizes a non-generative grammatical model of a visual structure layout of a document to facilitate in determining an optimal parse tree for the visual structure layout. The input 104 includes, for example, a visual layout of a page of a document. The document visual structure analysis component 102 parses the input 104 utilizing a grammatical parsing process that parses the visual structure of a document to provide the output 106. The output 106 can be comprised of, for example, an optimal parse tree for the document visual structure layout. A globally learned "reference" grammar can also be established to provide parsing solutions for different tasks without requiring additional grammar learning.
[0013] Referring to FIG. 2, another block diagram of a document visual structure
analysis system 200 in accordance with an aspect of an embodiment is illustrated. The document visual structure analysis system 200 is comprised of a document visual structure analysis component 202 that receives a visual structure input 204 and provides an optimal parse tree 206. The document visual structure analysis component 202 utilizes a discriminative grammatical model of a document visual structure layout. The document visual structure analysis component 202 is comprised of a receiving component 208 and a grammar component 210. The receiving component 208 receives the visual structure input 204 and relays it 204 to the grammar component 210. In other instances, the functionality of the receiving component 208 can be included in the grammar component 210, allowing the grammar component 210 to directly receive the visual structure input 204. The grammar component 210 also receives a basic structure layout grammar 212. The basic structure layout grammar 212 provides an initial visual structure grammar framework for the document layout. The grammar component 210 parses the visual
structure input 204 to obtain an optimal parse tree 206. It 210 accomplishes this via
utilization of a grammatical parsing process that parses the visual structure of a
document. The grammar component 210 employs a dynamic programming process to
determine a globally optimal parse tree. This prevents the optimal parse tree 206
from only being evaluated locally, yielding improved global results.
[0014] Turning to FIG. 3, yet another block diagram of a document visual
structure analysis system 300 in accordance with an aspect of an embodiment is
depicted. The document visual structure analysis system 300 is comprised of a
document visual structure analysis component 302 that receives a visual structure
input 304 and provides an optimal parse tree 306. The document visual structure
analysis component 302 utilizes a discriminative grammatical model of a document
visual structure layout for parsing. The document visual structure analysis component
302 is comprised of a receiving component 308 and a grammar component 310. The
grammar component 310 is comprised of a parsing component 312 and a document
structure extraction component 314. The parsing component 312 is comprised of a
visual structure grammar model 316 with a grammatical cost function 318. The visual
structure input 304 includes, for example, a visual layout of a document page. The
receiving component 308 receives the visual structure input 304 and relays it 304 to
the parsing component 312. In other instances, the functionality of the receiving
component 308 can be included in the parsing component 312, allowing the parsing
component 312 to directly receive the visual structure input 304. The parsing
component 312 parses the document visual structure from the visual structure input
304 based initially on a visual structure layout grammar 320. The parsing component
312 interacts with the document structure extraction component 314 to specifically
facilitate in extracting visual structure information from the visual structure input 304.
[0015] The document structure extraction component 314 utilizes complex
local and/or global features to facilitate the parsing component 312 in parsing the visual structure input 304. It 314 can utilize various optional mechanisms to augment visual structure layout parsing by the parsing component 312 including, but not limited to, machine learning 322, parse boosting 324, fast features 326, image scoring 328, and/or other 330 and the like. Other 330 represents additional efficiency and/or visually oriented mechanisms that expedite and/or enhance the parsing component 312.
[0016] For example, machine learning 322 can be provided by the document
structure extraction component 314 to facilitate the parsing component 312 in order to
generate a chart. It 312 then converts the chart into a subsequent set of labeled
examples that are relayed to a classification process. The classification process
utilizes the subsequent set of labeled examples along with the machine learning to
train a set of classifiers. The classification process then determines identifying
properties between positive and negative examples. The identifying properties allow
the classifiers to facilitate in assigning proper costs to correct and/or incorrect parses.
The parsing component 312 then utilizes the set of classifiers in the grammatical cost
function 318 of the visual structure grammar model 316 to facilitate in scoring sub-
parses of the subsequent set of labeled examples. In this manner, the process
continues iteratively until an optimal parse tree 306 is obtained (e.g., no higher
scoring parse tree is obtained or no lower cost parse tree is obtained).
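As a concrete illustration of the iterative loop described above, the sketch below trains a simple perceptron-style classifier to separate feature vectors of correct sub-parses (positive examples) from incorrect ones (negative examples), and exposes the negated classifier score as a parse cost. This is a minimal stand-in under assumed feature representations; the function names and the perceptron learner are illustrative, not the actual classification process of the embodiment.

```python
import numpy as np

def train_parse_scorer(pos_feats, neg_feats, epochs=20, lr=0.1):
    """Perceptron-style stand-in for a learned parse classifier:
    finds a weight vector separating features of correct sub-parses
    (label +1) from features of incorrect sub-parses (label -1)."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.array([1.0] * len(pos_feats) + [-1.0] * len(neg_feats))
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:  # misclassified: nudge weights toward yi
                w += lr * yi * xi
    return w

def parse_cost(w, feats):
    """Lower cost for sub-parses the classifier scores as correct,
    so a cost-minimizing parser prefers them."""
    return -float(np.dot(feats, w))
```

In use, features of candidate sub-parses taken from the chart would be scored by `parse_cost`, and the resulting best parse would supply the next round of labeled examples.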
[0017] Similarly, the parse boosting mechanism 324 can be provided to the
parsing component 312 to facilitate in learning correct parses more efficiently. A fast feature mechanism 326 can be provided to compute parse images via computation of integral images of document features and/or utilization of constellations of integral images to enhance the parsing efficiency. The image scoring mechanism 328 can facilitate parsing by providing scores of parsed images for the grammatical cost function 318. These mechanisms 322-330 are optional and not required for parsing of the visual structure input 304.
[0018] When utilizing constellations of integral images, rather than a single
integral image for an entire page of a document, an integral image is computed for each element of the page (e.g., character, word, and/or line as appropriate, etc.). Attention can be focused by including only the critical characters in a feature computation. The systems and methods herein can also utilize computed integral images of document features as well. For example, document features such as large white space rectangles, vertical alignments of bounding boxes, and/or horizontal alignments of text lines and the like can be utilized.
[0019] Thus, by utilizing the integral image, it is possible to quickly compute
the number of white and/or black pixels within an image rectangle. Computing the integral image for an image is expensive, but once it is computed, rectangular sums can be quickly computed. When a set of objects is given that may or may not be in an
image, there is an exponential number of images that may be rendered from the objects (the power set P(N)). Rendering these images and computing the rectangle sums for each rendered image is prohibitively expensive. So, instead, an integral image is rendered for each of the objects; these are denoted "integral image constellations." Thus, the rectangle sum for any subset of the objects is the sum of the rectangle sums from the constellations.
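The integral-image computations described above can be sketched as follows. This is a minimal illustration assuming NumPy arrays of pixel values; the function names are illustrative.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows then columns: entry (r, c) holds the
    sum of all pixels above and to the left of (r, c), inclusive.
    Computing this is the one-time expensive step."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the rectangle [top..bottom] x [left..right],
    computed in O(1) from at most four integral-image lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def constellation_rect_sum(element_iis, subset, top, left, bottom, right):
    """A "constellation": one integral image per page element, so the
    rectangle sum for any subset of elements is the sum of the
    per-element rectangle sums -- no re-rendering required."""
    return sum(rect_sum(element_iis[k], top, left, bottom, right)
               for k in subset)
```

Once each element's integral image is computed, the cost of scoring any of the exponentially many element subsets over a rectangle is linear in the subset size.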
Two-Dimensional Parsing
[0020] While there are a number of competing parsing algorithms, one simple
yet generic framework is called "chart parsing" (see, M. Kay, "Algorithm schemata and data structures in syntactic processing," pp. 35-70, 1986). Chart parsing attempts to fill in the entries of a chart C(A, R). Each entry stores the best score of a non-terminal A as an interpretation of the sub-sequence of terminals R. The cost of any non-terminal can be expressed as the following recurrence:

    C(A, R₀) = min over A → BC and {R₁, R₂} of [ l(A → BC) + C(B, R₁) + C(C, R₂) ]

where {BC} ranges over all productions for A, R₀ is a subsequence of terminals (denoted as a "region"), and R₁ and R₂ are subsequences which are disjoint and whose union is R₀ (i.e., they form a "partition"). Essentially, the recurrence states that the score for A is computed by finding a low-cost decomposition of the terminals into two disjoint sets. Each production is assigned a cost (or loss or negative log probability) in a table, l(A → BC). The entries in the chart (sometimes called edges) can be filled in any order, either top down or bottom up. The complexity of the parsing process arises from the number of chart entries that must be filled and the work required to fill each entry. The chart constructed while parsing a linear sequence of N terminals using a grammar including P non-terminals has O(PN²) entries (there are ½N² ∈ O(N²) contiguous subsequences {i, j} such that 0 ≤ i < j ≤ N).
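The chart-parsing recurrence for a linear sequence of terminals can be sketched as a memoized search over contiguous regions. The toy grammar, its costs, and the function names below are illustrative assumptions, not taken from the source; each chart entry cost(A, i, j) minimizes a production cost plus the costs of the two sub-regions, exactly as in the recurrence above.

```python
from functools import lru_cache

# Toy grammar in binary form: costs (negative log probabilities) for
# productions A -> B C, and for terminal productions A -> word.
BINARY = {("S", ("NP", "VP")): 0.1,
          ("NP", ("Det", "N")): 0.2,
          ("VP", ("V", "NP")): 0.3}
LEXICAL = {("Det", "the"): 0.0, ("N", "dog"): 0.0,
           ("N", "cat"): 0.0, ("V", "saw"): 0.0}

def chart_parse_cost(words):
    """Fill the chart C(A, R) for contiguous regions R = words[i:j];
    return the cost of the best parse of the whole sequence as S."""
    @lru_cache(maxsize=None)
    def cost(A, i, j):
        if j - i == 1:  # a single-terminal region
            return LEXICAL.get((A, words[i]), float("inf"))
        best = float("inf")
        for (head, (B, C)), prod_cost in BINARY.items():
            if head != A:
                continue
            for k in range(i + 1, j):  # partition R into R1, R2
                best = min(best, prod_cost + cost(B, i, k) + cost(C, k, j))
        return best
    return cost("S", 0, len(words))
```

For a linear sequence the regions are the contiguous spans (i, j), giving the O(PN²) chart size noted above; the two-dimensional adaptation replaces spans with geometric regions of the page.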
Documents