Abstract: A system for managing documents is disclosed, comprising interfaces to: a user interface, providing an application programming interface; a database of document images; and a remote server, configured to communicate a text representation of the document from the optical character recognition engine to the remote server, and to receive from the remote server a classification of the document; and logic configured to receive commands from the user interface, and to apply the classifications received from the remote server to the document images through the interface to the database. A corresponding method is also provided.
Claims: We Claim:
1. A system for managing documents, comprising:
a. an interface to a user interface, providing an application programming interface;
b. an interface to a document database storing a record;
c. an interface to a remote server to communicate the semantic content of the series of documents; and
d. at least one automated processor, configured to:
e. receive commands from the user interface for manually classifying a respective document portion; and
f. control the document database through the interface to the document database.
2. The system according to claim 1, wherein the at least one automated processor is further configured to include a task for execution by the user interface, selectively dependent on a content of a respective document portion.
3. The system according to claim 1, wherein the at least one automated processor is further configured to access a document archive for reprocessing of a set of documents in the document archive according to an updated classification.
4. The system according to claim 1, wherein the at least one automated processor is further configured to access a document archive for processing a set of documents in the document archive according to a set of classifications.
Description: Technical Field of the Invention
The present invention relates to data management and storage on the web.
Background of the Invention
Classification of textual documents denotes assigning an unknown document to one of a set of predefined classes. This is a straightforward concept from pattern recognition or from supervised machine learning. It implies the existence of a labelled training data set, a way to represent the documents, and a statistical classifier trained using the chosen representation of the training set.
Linear Discriminant Analysis (LDA) may be applied to document classification, when vector space document representations are employed. LDA is a well-known method in statistical pattern recognition literature. Its aim is to learn a discriminative transformation matrix from the original high-dimensional space to a desired dimensionality. The idea is to project the documents into a low dimensional space in which the classes are well separated. This can also be viewed as extracting features that only carry information pertinent to the classification task.
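By way of non-limiting illustration, the projection idea described above may be sketched for the simplest case of two classes and two-dimensional document vectors. The sketch computes the classical Fisher direction (the inverse of the within-class scatter applied to the difference of class means). The function names and the sample vectors are hypothetical and are not taken from the specification; a practical system would operate on far higher-dimensional vectors and would use a numerical library.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def scatter(vectors, m):
    """2x2 scatter matrix: sum of outer products of the centred vectors."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        d = [v[0] - m[0], v[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Direction w = Sw^-1 (mean_a - mean_b) for two classes of 2-D vectors."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(vector, w):
    """Project a 2-D document vector onto the one-dimensional direction w."""
    return vector[0] * w[0] + vector[1] * w[1]


# Hypothetical normalized frequencies of two terms in six labelled documents.
invoices = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.2]]
letters = [[0.2, 0.7], [0.1, 0.9], [0.3, 0.8]]
w = fisher_direction(invoices, letters)
invoice_scores = [project(v, w) for v in invoices]
letter_scores = [project(v, w) for v in letters]
```

In this toy data set the projected scores of the two classes do not overlap, which is the sense in which the classes are well separated in the lower-dimensional space.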
A known prior technique provides a vectorial text document representation based on the so-called bag-of-words approach, in which each document is essentially represented as a histogram of terms, or as a function of the histogram. One straightforward function is normalization: the histograms are divided by the number of terms of the document to account for different document lengths. Terms (words) that occur in every document obviously do not convey much useful information for classification. The same applies to rare terms that are found only in a few documents. These, as well as common stop words, are usually filtered out of the corpus. Furthermore, the words may be stemmed. These operations leave a term dictionary that can range in size from thousands to tens of thousands. Correspondingly, this is the dimension of the space in which documents are now represented as vectors. Although the dimension may be high, a characteristic of this representation is that the vectors are sparse. For many statistical pattern classification methods this dimensionality may be too high. Thus, dimension reduction methods are called for. Two possibilities exist: either selecting a subset of the original features, or transforming the features (or combinations) into derivative features.
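The bag-of-words representation described above may be illustrated by the following minimal sketch, which filters stop words, drops terms that are too rare or that occur in every document, and divides each histogram by the document length. The stop-word list, the `min_docs` threshold and the four-document corpus are assumptions chosen only for illustration; stemming is omitted.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "is"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(corpus, min_docs=2):
    """Keep terms found in at least min_docs documents but not in all of them."""
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)) - STOP_WORDS)
    n = len(corpus)
    return sorted(t for t, df in doc_freq.items() if min_docs <= df < n)


def vectorize(text, dictionary):
    """Term histogram over the dictionary, divided by the document length."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[t] / total for t in dictionary]


corpus = [
    "The invoice total is due",
    "Invoice number and total amount",
    "The meeting agenda is attached",
    "Agenda of the board meeting",
]
dictionary = build_dictionary(corpus)
first_vector = vectorize(corpus[0], dictionary)
```

Each position of the resulting vector corresponds to one dictionary term, so the vector dimension equals the dictionary size and most entries are zero for any single document.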
Optimal feature selection coupled with a pattern recognition system leads to a combinatorial problem since all combinations of available features need to be evaluated, by actually training and evaluating a classifier. This is called the wrapper configuration. Obviously, the wrapper strategy does not allow learning of parametric feature transforms, such as linear projections, because all possible transforms cannot be enumerated. Another approach is to evaluate some criterion related to the final classification error that would reflect the “importance” of a feature or a few features jointly. This is called the filter configuration in feature selection. An optimal criterion would normally reflect the classification error rate. Approximations to the Bayes error rate can be used, based on the Bhattacharyya bound or an interclass divergence criterion. However, these joint criteria are usually accompanied by a parametric, such as Gaussian, estimation of the multivariate densities at hand, and are characterized by heavy computational demands.
In document classification problems, the dominant approach has been sequential greedy selection using various criteria. This is dictated by the sheer dimensionality of the document-term representation. However, greedy algorithms based on sequential feature selection using any criterion are suboptimal because they fail to find a feature set that would jointly optimize the criterion. For example, two features might both be very highly ranked by the criterion, but they may carry exactly the same information about class discrimination and thus be redundant. On the other hand, feature selection through any joint criterion, such as the actual classification error, leads to a combinatorial explosion in computation. For this very reason, finding a transform to lower dimensions might be easier than selecting features, given an appropriate objective function.
One well-known dimension-reducing transform is principal component analysis (PCA), also called the Karhunen-Loève transform. PCA seeks to optimally represent the data in a lower dimensional space in the mean squared error sense. The transform is derived from the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the training data. In the information retrieval community this method has been named Latent Semantic Indexing (or LSI). The covariance matrix of the data in PCA now corresponds to the document-term matrix multiplied by its transpose. Entries in the covariance matrix represent co-occurring terms in the documents. Eigenvectors of this matrix corresponding to the dominant eigenvalues are directions related to dominant combinations of terms occurring in the corpus. These dominant combinations can be called “topics” or “semantic concepts”. A transform matrix constructed from these eigenvectors projects a document onto these “latent semantic concepts”, and the new low dimensional representation consists of the magnitudes of these projections. The eigenanalysis can be computed efficiently by a sparse variant of singular value decomposition of the document-term matrix. LSI was introduced to improve precision/recall, and it is useful in various information retrieval tasks. However, it is not an optimal representation for classification. LSI/PCA is completely unsupervised, that is, it pays no attention to the class labels of the existing training data. LSI aims at optimal representation of the original data in the lower dimensional space in the mean squared error sense. This representation has nothing to do with the optimal discrimination of the document classes.
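As a non-limiting illustration of the eigenanalysis described above, the following sketch forms the term co-occurrence matrix from a small document-term matrix and finds its dominant eigenvector, that is, the strongest “topic”. Plain power iteration is used here in place of the sparse singular value decomposition mentioned in the text, solely to keep the sketch dependency-free; the matrix values and function names are hypothetical. Note that no class labels are used anywhere, which reflects the unsupervised nature of LSI.

```python
import math


def cooccurrence(doc_term):
    """Term-by-term matrix: transpose of the document-term matrix times itself."""
    n_terms = len(doc_term[0])
    return [[sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
            for i in range(n_terms)]


def dominant_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and renormalize a start vector."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(doc, topic):
    """Magnitude of a document's projection onto a topic direction."""
    return sum(d * t for d, t in zip(doc, topic))


# Hypothetical term counts; columns: invoice, total, agenda, meeting.
docs = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 1, 1],
]
topic = dominant_eigenvector(cooccurrence(docs))
projections = [project(d, topic) for d in docs]
```

Here the dominant topic is carried almost entirely by the two billing-related terms, so the fourth document projects to approximately zero on it regardless of how the documents might be labelled.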
Object of the Invention
The object of the invention is to provide a system and method implementing a supervised document classification algorithm, executed by a programmable processor, which employs alphanumeric strings, typically extracted from documents, which begin and end along word boundaries.
Summary of the Invention
A string may be a single word or a plurality of words. As each new document is added to the library, the classifier algorithm is executed. If the classification is not deemed reliable, a user is prompted to manually classify the document, and the resulting training data is used to adaptively improve the algorithm. If the document fits well within the existing classes, it is automatically classified, and the algorithm is updated based on the new document.
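The intake behaviour described above may be sketched as follows, under stated assumptions. The `Library` class, the `overlap_classifier` stand-in (a nearest-neighbour comparison by word overlap) and the 0.5 threshold are hypothetical and serve only to show the control flow: classify automatically when the result is reliable, otherwise prompt the user, and in both cases add the document to the training data.

```python
class Library:
    """Files documents automatically when reliable, otherwise asks the user."""

    def __init__(self, classify, ask_user, threshold=0.8):
        self.classify = classify    # callable: (text, training_data) -> (label, confidence)
        self.ask_user = ask_user    # callable: text -> label chosen by the user
        self.threshold = threshold  # minimum confidence for automatic filing
        self.training_data = []     # (text, label) pairs seen so far

    def add_document(self, text):
        label, confidence = self.classify(text, self.training_data)
        source = "automatic"
        if label is None or confidence < self.threshold:
            label = self.ask_user(text)
            source = "manual"
        # Either way, the new document becomes training data.
        self.training_data.append((text, label))
        return label, source


def overlap_classifier(text, training_data):
    """Stand-in classifier: label of the most similar known document (Jaccard)."""
    words = set(text.lower().split())
    best_label, best_score = None, 0.0
    for known_text, label in training_data:
        known = set(known_text.lower().split())
        score = len(words & known) / len(words | known)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


library = Library(overlap_classifier, ask_user=lambda text: "invoice", threshold=0.5)
first = library.add_document("invoice total due")
second = library.add_document("invoice total due now")
```

The first document has nothing to be compared against and is therefore routed to the user; the second resembles the first closely enough to be filed without intervention.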
A determination of reliability of classification may be statistical; that is, if the strings and/or characteristics of the document occur commonly within a single classification, and uncommonly within documents having different classifications, then the classification according to that string or characteristic is deemed reliable, and an automated classification takes place. On the other hand, if none of the available strings and/or characteristics has significant probative power for any existing classification, then the user is prompted to classify the document, and the resulting classification is then populated with the characteristic strings and characteristics of the document. In some cases, users may accidentally or intentionally establish redundant document classifications. That is, the same or a substantially identical document is sought to be classified in multiple classes. In that case, the system may prompt the user regarding the issue, or automatically generate an alias, which can then be used to handle the ambiguity. Indeed, the classification may be user-specific, such that each user may have a private or customized classification. In general, the classification will be objective and consistent across users, and the classification unambiguous, such that inconsistencies are resolved at the time the document is being entered.
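One possible reading of the statistical reliability test described above is sketched below. For each string, the fraction of documents in each class that contain it is computed; a string is treated as probative when it is common in one class and uncommon in every other class. The thresholds `min_in_class` and `max_elsewhere`, the simple voting rule and the sample data are assumptions made for illustration, not values given in the specification.

```python
from collections import defaultdict


def string_class_rates(labelled_docs):
    """Map each string to {class: fraction of that class's documents containing it}."""
    class_sizes = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for strings, label in labelled_docs:
        class_sizes[label] += 1
        for s in strings:
            hits[s][label] += 1
    return {s: {c: hits[s].get(c, 0) / n for c, n in class_sizes.items()}
            for s in hits}


def classify(strings, rates, min_in_class=0.8, max_elsewhere=0.2):
    """Return a class when probative strings agree on it, else None (prompt the user)."""
    votes = defaultdict(int)
    for s in strings:
        per_class = rates.get(s)
        if not per_class:
            continue  # string never seen in training data
        best = max(per_class, key=per_class.get)
        others = [r for c, r in per_class.items() if c != best]
        if per_class[best] >= min_in_class and all(r <= max_elsewhere for r in others):
            votes[best] += 1
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None  # no probative string, or an unresolved tie
    return ranked[0][0]


training = [
    ({"invoice number", "total due"}, "invoice"),
    ({"invoice number", "remit to"}, "invoice"),
    ({"dear sir", "sincerely"}, "letter"),
    ({"dear sir", "total due"}, "letter"),
]
rates = string_class_rates(training)
reliable = classify({"invoice number", "total due"}, rates)
unreliable = classify({"total due"}, rates)
```

In the sample data the string "total due" occurs equally often in both classes and so carries no probative power, whereas "invoice number" occurs in every invoice and in no letter.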
A newly received document is scanned and processed by optical character recognition (OCR). Documents from other sources may also be integrated, such as fax, email, word-processing format, and archive. The OCRed document is then parsed on word boundaries, and all the single words and strings of words up to, for example, 256 characters, are indexed. The strings may be processed to eliminate spaces and/or “stop” words. Indexed strings are then compared to lists, which provide exclusions of strings that are common. For example, the word “the” alone would likely not result in important classification power between different classes, and thus may be excluded from the analytics. Even if a string is present in many different classes, so long as it is not present in certain classes, it has distinctive power. On the other hand, if the string is unreliably present in any class, and unreliably absent in any class, then it has low distinctive power. Through string analysis, a statistical classification system may be implemented, with tunable rules. That is, when errors are encountered, the strings that caused the misclassification are analysed (e.g., by a person), and the rules are altered to achieve the correct result.
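The string indexing step described above may be sketched as follows. The text is split on word boundaries, and every run of consecutive words whose joined length does not exceed the character limit is indexed, except for strings on an exclusion list. The 256-character limit comes from the description; the exclusion list and the function name `index_strings` are illustrative assumptions.

```python
import re

MAX_STRING_CHARS = 256          # example limit given in the description
EXCLUDED = {"the", "a", "of"}   # illustrative exclusion list of common strings


def index_strings(text, max_chars=MAX_STRING_CHARS, excluded=EXCLUDED):
    """Return every word and run of consecutive words up to max_chars long."""
    words = re.findall(r"\w+", text.lower())
    strings = set()
    for start in range(len(words)):
        length = -1  # offsets the separator counted before the first word
        for end in range(start, len(words)):
            length += len(words[end]) + 1
            if length > max_chars:
                break  # longer runs from this start can only be longer still
            candidate = " ".join(words[start:end + 1])
            if candidate not in excluded:
                strings.add(candidate)
    return strings


# A small limit is used here only to make the result easy to inspect.
indexed = index_strings("The total due", max_chars=9)
```

With a limit of nine characters, the three-word string is too long to be indexed and the word "the" alone is excluded, while "the total" is retained because only the exact strings on the list are removed.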
Brief Description of Drawings
FIG. 1 shows a schematic diagram of a preferred system architecture.
Detailed Description of Invention
FIG. 1 shows a schematic diagram of a preferred system architecture, in which a client user interface communicates with a C-server having a client database, which communicates with the Internet. Through the Internet, the C-server communicates with an A-server, which has an associated central database. The user interface is provided as, for example, a native Windows 7 compliant application (using the .NET platform, Windows Communication Foundation (WCF), and Windows Presentation Foundation (WPF)). The user interface provides a set of windows, for example, an intake screen, which monitors incoming documents or batches of documents, and indicates their progress through automated tasks, such as the classification. Another window provided is the sender window, which, for example, allows the user to control sending documents of any format to the C-server for pre-processing (image deskewing, despeckling, other image enhancement) as might be necessary, forming a batch from multiple documents of the same or different document type. The sender also provides a preview function to permit the user to view the document(s) to be sent to the C-server, and provides the ability to pace and schedule submissions to the C-server, to thus permit administration of workloads and workflows. The sender window can send files to user-specific or function-specific designated data zones, which can be, for example, an indication of security status, privacy flags, data partitioning, workflow delineation, etc. The client user interface software may provide a generic communication function for interaction with other systems, which may include both input and output functions (controlled through a window) that permit communication of data with external processes. This function is managed by the client user interface software, but implemented by the C-server, typically without having data pass through the user interface component, using, for example, TCP/IP communications, XML, ODBC, SOAP, and RPC.
On the other hand, in some cases, communication can be to or through the client user interface software, for example with a local file system, USB drive or DVD-R, or using OLE, COM or other data feeds. One task that may be desired is migrating archival or external databases into the C-server database. A function may be provided in the sender to access these documents and present them to the C-server. Alternately, the client software may provide a configuration file or command to the C-server for automated processing without requiring these documents to pass through the client software. In either case, the C-server will generally give preference to processing new documents from the sender, rather than to the external workflows.
The sender provides an important but optional functionality for the client software and may be separately licensed and/or enabled. The client software may thus be usable in a data search and retrieval-only mode or in a create-and-consume mode. The C-server can directly synchronize with another system to acquire documents input through the other system. The client user interface software is still used to classify documents, to search and retrieve documents, to otherwise interact with the A-server, and to provide ancillary functionality. The client software in some cases can act as an add-on or add-in to another document management system. The interaction may be tightly coupled, or non-cooperative. In some cases, documents are placed in a directory structure or other simple database (e.g., an email-type archive) by a separate document management system. The client user interface software or C-server can monitor this directory structure or simple database to concurrently input these new documents into the C-server database. Further, a window may be provided for pending user intervention, i.e., messages or tasks that require the user to provide input to permit completion of processing.
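Monitoring of a directory structure populated by a separate document management system, as described above, may be sketched by comparing successive snapshots of the directory tree. The polling approach and the function names `scan` and `new_documents` are assumptions made for illustration; the specification does not state how the monitoring is implemented, and an operating-system change-notification facility could equally be used.

```python
import os


def scan(directory):
    """Return {relative path: modification time} for every file under directory."""
    snapshot = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            snapshot[os.path.relpath(path, directory)] = os.path.getmtime(path)
    return snapshot


def new_documents(previous, current):
    """Paths present in the current snapshot but absent from the previous one."""
    return sorted(path for path in current if path not in previous)
```

A caller would typically take a snapshot at a fixed interval, pass each newly found path to the intake process, and then keep the current snapshot as the baseline for the next comparison.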
The pending window shows jobs that require confirmation before filing, or that are incomplete or unidentifiable from the automation process of the server. Typically, the operation of the A-server is not in real time with respect to the user interface software, so the pending jobs are delayed with respect to original user inputs for those jobs. A window showing complete jobs, representing documents or batches of documents that are completely filed and are thus available for search and retrieval, is also provided. This window, and other windows within the client user interface software, provide an ability for a user to review historical information such as jobs submitted and/or completed within a given date range.
The intake monitoring (which encompasses the sender), pending, and complete processes are preferably subject to data zone partitioning, and thus may be separately filterable and controllable on that basis. A view window is provided, which provides a method of presenting a directory or group of documents, representing a targeting of documents, filtered based on target categories. These target categories are user selectable, and thus provide a convenient means for interacting with the database, providing instant directed search and document retrieval. The view window provides an ability to preformat certain “cabinet” views based on user-defined criteria. Document views may be represented in a strip of thumbnails presented as a scrollable transparent overlay over a larger selected full-page view, to facilitate user navigation.
A search window is provided to manage search and retrieval of documents from the C-server database. The search window provides full and complex search facilities, including logical (Boolean), key phrase and full text searches, field range, etc. The search window may be used for detailed data analysis, for example in a medical information database, to extract information on patients, medical conditions, and productivity. The user may save a formulated search for future re-execution or to later retrieve the same results. Another function is a sub-search of a document set defined by another search or other document set definition. Various document sets may be named, that is, the limiting criteria may be referred to by a shorthand name.
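The Boolean search, saved search and sub-search functions described above may be illustrated by the following minimal sketch over an in-memory list of documents. The criteria names `all_of`, `any_of` and `none_of`, the `saved_searches` registry and the sample records are hypothetical; an actual implementation would query the database rather than filter a list.

```python
def matches(doc, all_of=(), any_of=(), none_of=()):
    """Boolean match on whole words: AND over all_of, OR over any_of, NOT over none_of."""
    words = set(doc["text"].lower().split())
    return (all(term in words for term in all_of)
            and (not any_of or any(term in words for term in any_of))
            and not any(term in words for term in none_of))


def search(documents, **criteria):
    """Return the documents in the given set that satisfy the criteria."""
    return [doc for doc in documents if matches(doc, **criteria)]


saved_searches = {}  # shorthand name -> limiting criteria


def save_search(name, **criteria):
    """Store the criteria under a name for later re-execution."""
    saved_searches[name] = criteria


def run_saved(name, documents):
    """Re-execute a previously saved search against a document set."""
    return search(documents, **saved_searches[name])


records = [
    {"id": 1, "text": "patient diabetes follow up"},
    {"id": 2, "text": "patient fracture report"},
    {"id": 3, "text": "invoice for diabetes supplies"},
]
save_search("diabetes", all_of=("diabetes",))
diabetes_set = run_saved("diabetes", records)
# A sub-search runs against the result of another search.
diabetes_patients = search(diabetes_set, all_of=("patient",))
other = search(records, any_of=("fracture", "invoice"), none_of=("diabetes",))
```

Because a search accepts any document set as input, a sub-search is simply a search whose input is the output of an earlier one, and a named set is the stored criteria that regenerate it.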
| # | Name | Date |
|---|---|---|
| 1 | 201921035167-STATEMENT OF UNDERTAKING (FORM 3) [31-08-2019(online)].pdf | 2019-08-31 |
| 2 | 201921035167-POWER OF AUTHORITY [31-08-2019(online)].pdf | 2019-08-31 |
| 3 | 201921035167-FORM FOR STARTUP [31-08-2019(online)].pdf | 2019-08-31 |
| 4 | 201921035167-FORM FOR SMALL ENTITY(FORM-28) [31-08-2019(online)].pdf | 2019-08-31 |
| 5 | 201921035167-DRAWINGS [31-08-2019(online)].pdf | 2019-08-31 |
| 6 | 201921035167-EVIDENCE FOR REGISTRATION UNDER SSI [31-08-2019(online)].pdf | 2019-08-31 |
| 7 | 201921035167-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [31-08-2019(online)].pdf | 2019-08-31 |
| 8 | 201921035167-FIGURE OF ABSTRACT [31-08-2019(online)].jpg | 2019-08-31 |
| 9 | 201921035167-FORM 1 [31-08-2019(online)].pdf | 2019-08-31 |
| 10 | 201921035167-COMPLETE SPECIFICATION [31-08-2019(online)].pdf | 2019-08-31 |
| 11 | Abstract.jpg | 2019-09-13 |
| 12 | 201921035167- ORIGINAL UR 6(1A) FORM 26-170919.pdf | 2019-09-21 |
| 13 | 201921035167-Proof of Right [29-11-2020(online)].pdf | 2020-11-29 |