Abstract: A self-regulating method for swiftly categorizing documents of a specific domain, fetched from different websites, using commodity hardware is presented. The content on these websites changes over time, and the method is designed to monitor them on a timely basis. A change identified on a website may be plain text or a web page link pointing to another website. The content is fetched and converted to textual format. The unlabelled text is categorized using a probability-based classification engine, which is trained to classify domain-specific text expeditiously.
Claims:
1. Classification and clustering are examples of the more general problem of pattern recognition, which is the assignment of some sort of output value to a given input value.
2. The Naive Bayes (NB) algorithm's feature-independence assumption makes feature order irrelevant and, consequently, means that the presence of one feature does not affect other features in classification tasks.
3. There has been an expeditious increase in the generation of electronic documents on the web. With the substantial growth of these electronic documents, the task of web scraping and categorizing the documents becomes painstaking.
4. This phase helps different users to monitor and maintain their websites independently. The tool gives every user the liberty to have their own environment by maintaining separate projects within it.
5. Machine-learning models typically cannot be re-used between different cases; one has to start all over again for each case. The model also typically cannot handle dynamic collections: when new documents are added to a case, one has to start all over again. To handle such scenarios, the tool offers a facility to dynamically add domain-specific keywords in the data preparation phase. This makes the tool more intelligent with respect to the domain knowledge that we are trying to train it upon.
Description:
FIELD OF THE INVENTION
The present invention relates to the categorization of documents of a specific domain from different websites. It provides a method and system for automated document classification using enriched domain inputs and automated parallel execution on a commodity computer.
BACKGROUND OF THE INVENTION
In legal services, there are a few thousand federal, state and local websites that regularly publish content relevant to income tax analysts. The frequency at which content on these websites changes is uncertain, so they need to be monitored daily. This process is done manually by a few hundred domain experts day in, day out. Once a change is identified, it must be categorized. Because the process is manual, issues like the following can arise:
1. The current process is repetitive, painstaking and involves manual labour, resulting in under-utilisation of the skilled domain professionals.
2. There is a fair chance of failure in doing this critical job manually, which might lead to inaccuracy of the outcome.
3. If the data size grows, this process will require a significant amount of time and human effort, which will be unmanageable.
4. The repetitive manual labour might also result in inconsistency in results.
Hence, the result is poor analysis, so it is essential to automate this process.
DESCRIPTION
[1] To efficiently and accurately classify changed content on web pages, we have defined a detailed technical flow as shown in FIG 1.
[2] Content on thousands of websites changes multiple times a day. To monitor these websites, we download all the web pages multiple times a day. Once the latest version of a web page is available, it must be compared with the previous version so that changes can be tracked. The older version of the web page is stored in a NoSQL database, for the reason described in [4]. Once both versions of the web page are available, we compare them and identify whether the content has changed. If the content has changed, we consider it for further analysis, which is scraping exactly what has changed. This overall process is shown in FIG 2.
[3] Downloading the web pages mentioned in [2] may take a long time if done linearly, so we distribute the load over a cluster using a load distribution mechanism (refer to FIG 4). This speeds up the process so that further analysis can be done sooner.
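The load distribution described in [3] can be sketched with a worker pool on a single machine; this is a minimal sketch, not the cluster mechanism of FIG 4, and `fetch_page` is a stub standing in for the actual HTTP download:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url: str) -> str:
    """Stub for the real download; an actual system would issue an HTTP GET."""
    return f"<html>content of {url}</html>"

def download_all(urls, workers=8):
    """Distribute page downloads over a pool instead of fetching linearly."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so results line up with their URLs
        return dict(zip(urls, pool.map(fetch_page, urls)))

pages = download_all(["http://a.example", "http://b.example"])
```

In a real deployment the same fan-out pattern is applied across cluster nodes rather than threads.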
[4] A NoSQL database can store huge amounts of data, including unstructured data. It also makes storing and retrieving different versions of the same web page quite fast.
[5] As shown in FIG 3, indexes can be fetched from the scraping queue for both versions of the web pages, and using those, the contents of the pages are stored in a data structure. The text of the two web page versions is compared, and differing content is loaded into another queue. The details of content comparison are mentioned in [6].
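The version comparison in [5] can be sketched with the standard library's `difflib`; a minimal sketch, assuming the two page versions have already been reduced to plain text lines:

```python
import difflib

def changed_lines(old_text: str, new_text: str) -> list:
    """Return lines that were added or replaced in the new version of a page."""
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""
    )
    # keep only additions, skipping the '+++' file-header line
    return [line[1:] for line in diff
            if line.startswith("+") and not line.startswith("+++")]

changes = changed_lines("Tax rate: 5%\nContact us", "Tax rate: 7%\nContact us")
```

Only the differing content (here the updated rate line) is forwarded to the next queue; unchanged lines are dropped.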
[6] The changed content on a web page can be in textual form or a uniform resource locator. If the changed content is in textual form, it can be used directly for categorizing. If it is a URL, we download the document it points to. The URL may point to a document in any format, ranging from PDFs and Microsoft documents to CSVs, XMLs and Excel files.
[7] To extract the text from the above file types, there is a converter which converts any file format to text. The converter works fast because conversion is done over the cluster mentioned in FIG 4. Once the text is available from the documents related to all the changed web pages, it is loaded into the NoSQL database and the metadata and indexes are updated.
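The converter in [7] can be sketched as a dispatch on file extension; a minimal sketch using only the standard library, where the PDF and Office branches are placeholders since those formats need external tooling:

```python
import csv, io
import xml.etree.ElementTree as ET

def to_text(filename: str, data: bytes) -> str:
    """Convert a downloaded document to plain text based on its extension."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext == "csv":
        rows = csv.reader(io.StringIO(data.decode("utf-8")))
        return "\n".join(" ".join(row) for row in rows)
    if ext == "xml":
        # concatenate all text nodes in document order
        return " ".join(ET.fromstring(data).itertext())
    if ext in ("pdf", "doc", "docx", "xls", "xlsx"):
        # placeholder: a real system hands these to a dedicated converter
        raise NotImplementedError(f"external converter needed for .{ext}")
    return data.decode("utf-8", errors="replace")

text = to_text("rates.csv", b"state,rate\nKA,5%")
```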
[8] Indexes are read for newly added text files and their older versions. The text of both versions is fetched and compared, and if the content does not match, the indexes are added to the classification queue as shown in FIG 5.
[9] Text classification can be done using supervised learning, where the algorithm is fed training data so that it can learn from past examples. Once the model is trained, fresh unlabelled data is given to it for categorization. Classifying unstructured data is difficult because it involves a lot of noise. The data pre-processing is described in detail in [10]. After pre-processing, the features are ready to be trained on or classified. Training the algorithm is the most crucial part of machine learning and must be done carefully, as mentioned in [11]. The trained classification model is then used to classify the text into certain categories.
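The probability-based engine of [9] (claim 2 names Naive Bayes) can be sketched as a multinomial Naive Bayes classifier; a minimal pure-Python sketch, where the tiny training corpus and the add-one smoothing choice are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # per-label word frequencies
        self.label_counts = Counter(labels)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, tokens):
        best, best_score = None, float("-inf")
        total_docs = sum(self.label_counts.values())
        for label, n_docs in self.label_counts.items():
            # log prior plus smoothed log likelihood of each token
            score = math.log(n_docs / total_docs)
            total = sum(self.word_counts[label].values())
            for w in tokens:
                score += math.log((self.word_counts[label][w] + 1)
                                  / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.fit([["tax", "rate", "income"], ["court", "ruling"]], ["tax", "legal"])
```

Log probabilities are summed instead of multiplying raw probabilities to avoid numeric underflow on long documents.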
[10] Data pre-processing, also called data cleansing, involves many stages, such as uniform casing and removal of noise. There is no exact sequence in which these stages must be executed; it depends on the importance of the words in your data, but a generalized process is shown in FIG 6. The input text may contain non-ASCII characters; since they cannot be stored in the classification model anyway, we remove them. Character case usually varies across the text, so to have a clear distinction between words we make them all lower case. Data must be tokenized so that meaning can be derived from the unstructured chunks; typically we start with a higher degree of tokenization, as a bag of words carries more meaning. Noise can be removed from the tokens using a standard list of stop words. Additionally, a few irrelevant domain-specific words or sets of words are also removed. The data is then normalized using techniques like stemming and lemmatizing. Once the data is clean enough, features can be derived from it. Feature extraction is done by counting the bag of words in the clean corpus, which yields a frequency matrix. Sometimes the frequency matrix is not enough, and features can be given more relevance by assigning them more weight. In the weight assignment process, extra weight is given to the domain-specific keywords to make the features more distinct.
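The stages of [10] can be sketched end-to-end; a minimal sketch, where the stop-word list, the domain keywords and the weight factor are all illustrative assumptions (and stemming/lemmatizing is omitted, as it needs an external library):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and"}   # illustrative subset
DOMAIN_KEYWORDS = {"tax", "income"}            # assumed domain terms
KEYWORD_WEIGHT = 3                             # assumed extra weight

def preprocess(text: str) -> list:
    """Remove non-ASCII characters, lowercase, tokenize, drop stop words."""
    text = text.encode("ascii", errors="ignore").decode().lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

def features(text: str) -> Counter:
    """Bag-of-words frequency row, with domain keywords given extra weight."""
    counts = Counter(preprocess(text))
    for word in counts:
        if word in DOMAIN_KEYWORDS:
            counts[word] *= KEYWORD_WEIGHT
    return counts

row = features("The income tax rate is revised")
```

The up-weighting step is what makes the dynamically added domain keywords of claim 5 more distinct in the feature space.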
[11] Training a supervised learning algorithm involves extracting features and tagging those features with a certain label. The feature extraction is described in [10]. The accuracy of the classification algorithm depends on how well the model is trained. The classification model can be improved by retraining; this optimization is described in [12].
[12] To optimize the classification model, the training data can be split into a few samples: one set is used for training the model and another to test the accuracy of the tool. After the first round of classification, accuracy is tested by comparing the algorithm's results with the manually tagged ones. If the average accuracy is less than a threshold, retraining is done. While retraining the model, the feature extraction process must be changed: the domain-specific keywords are tuned for the documents that were wrongly classified.
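The retraining check in [12] reduces to an accuracy comparison against a threshold; a minimal sketch, where the 0.9 threshold value is an assumption:

```python
ACCURACY_THRESHOLD = 0.9   # assumed cut-off; tuned per deployment

def accuracy(predicted, manual):
    """Fraction of documents where the algorithm agrees with the manual tag."""
    hits = sum(p == m for p, m in zip(predicted, manual))
    return hits / len(manual)

def misclassified(docs, predicted, manual):
    """Documents to revisit when tuning domain keywords before retraining."""
    return [d for d, p, m in zip(docs, predicted, manual) if p != m]

preds = ["tax", "legal", "tax", "tax"]
truth = ["tax", "legal", "legal", "tax"]
acc = accuracy(preds, truth)
needs_retraining = acc < ACCURACY_THRESHOLD
to_review = misclassified(["d1", "d2", "d3", "d4"], preds, truth)
```

The `to_review` list identifies the wrongly classified documents whose domain keywords are then tuned before the next training round.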
[13] Probability lies between 0 and 1. When fresh features are presented to a probability-based classification model, it returns different scores, each associated with a category. The category with the highest score is tagged to the fresh feature. But this approach is not always useful, as the data distribution differs from case to case. We use an innovative approach to improve accuracy: the algorithm inspects the probability score, and if it is close to the borderline, the document is tagged with a separate class altogether. Documents in this separate category can later be reviewed by a business user, which helps the business user identify wrongly classified documents.
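The borderline rule of [13] can be sketched as a margin check on the category scores; a minimal sketch, where the 0.1 margin and the `REVIEW` label name are both assumptions:

```python
MARGIN = 0.1          # assumed borderline width
REVIEW = "REVIEW"     # assumed label for documents needing manual review

def tag(scores: dict) -> str:
    """Pick the top-scoring category, or REVIEW if the top two are too close."""
    ranked = sorted(scores.values(), reverse=True)
    best = max(scores, key=scores.get)
    if len(ranked) > 1 and ranked[0] - ranked[1] < MARGIN:
        return REVIEW    # borderline: route to business-user review
    return best

clear = tag({"tax": 0.8, "legal": 0.2})
borderline = tag({"tax": 0.52, "legal": 0.48})
```

Routing borderline documents to a separate class avoids silently committing a low-confidence label and surfaces likely misclassifications for human review.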
| # | Name | Date |
|---|---|---|
| 1 | 201821008143-STATEMENT OF UNDERTAKING (FORM 3) [06-03-2018(online)].pdf | 2018-03-06 |
| 2 | 201821008143-POWER OF AUTHORITY [06-03-2018(online)].pdf | 2018-03-06 |
| 3 | 201821008143-FORM 1 [06-03-2018(online)].pdf | 2018-03-06 |
| 4 | 201821008143-FIGURE OF ABSTRACT [06-03-2018(online)].jpg | 2018-03-06 |
| 5 | 201821008143-DRAWINGS [06-03-2018(online)].pdf | 2018-03-06 |
| 6 | 201821008143-DECLARATION OF INVENTORSHIP (FORM 5) [06-03-2018(online)].pdf | 2018-03-06 |
| 7 | 201821008143-COMPLETE SPECIFICATION [06-03-2018(online)].pdf | 2018-03-06 |
| 8 | Abstract1.jpg | 2018-08-11 |
| 9 | 201821008143-Form 18-140220.pdf | 2020-02-17 |
| 10 | 201821008143-FER.pdf | 2021-10-18 |
| 1 | 2021-05-2510-40-44E_25-05-2021.pdf | |