Abstract: A system and method for data classification are disclosed. The method includes receiving, by a data classifier, a data corpus comprising one or more words. The method further includes comparing the data corpus with at least one pre-classified category of words to determine an overlap ratio between the data corpus and each of the at least one pre-classified category of words. The method further includes computing a confidence score of the data corpus for each of the at least one pre-classified category of words based on the overlap ratio and a predefined confidence score associated with the data corpus for each of the at least one pre-classified category of words. Finally, the method includes classifying the data corpus, based on the confidence score, into the at least one pre-classified category. Figure 2
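By way of a non-limiting illustration, the receiving, comparing, computing, and classifying steps summarized above may be sketched as follows. The disclosure does not fix a formula at this point, so the sketch makes several assumptions: the overlap ratio is taken as the fraction of the distinct words of the data corpus that also occur in a category, and the confidence score is taken as a weighted blend of that ratio with the predefined confidence score. The names `tokenize`, `overlap_ratio`, `confidence_scores`, `classify`, and the `weight` parameter are hypothetical and chosen only for this example.

```python
from typing import Dict, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the corpus and split on whitespace (a deliberately simple tokenizer)."""
    return set(text.lower().split())


def overlap_ratio(corpus_words: Set[str], category_words: Set[str]) -> float:
    """Assumed definition: share of the corpus's distinct words found in the category."""
    if not corpus_words:
        return 0.0
    return len(corpus_words & category_words) / len(corpus_words)


def confidence_scores(
    corpus_words: Set[str],
    categories: Dict[str, Set[str]],
    predefined: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Assumed combination: weighted blend of overlap ratio and predefined score."""
    return {
        name: weight * overlap_ratio(corpus_words, words)
        + (1.0 - weight) * predefined.get(name, 0.0)
        for name, words in categories.items()
    }


def classify(
    text: str,
    categories: Dict[str, Set[str]],
    predefined: Dict[str, float],
    weight: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Return the highest-scoring category together with all per-category scores."""
    scores = confidence_scores(tokenize(text), categories, predefined, weight)
    return max(scores, key=scores.get), scores


# Illustrative data only; these categories and scores are not taken from the disclosure.
categories = {
    "sports": {"match", "goal", "team", "score"},
    "finance": {"bank", "loan", "interest", "rate"},
}
predefined = {"sports": 0.5, "finance": 0.5}

label, scores = classify("team scored goal today", categories, predefined)
print(label, scores)
```

In this example two of the four corpus words ("team" and "goal") occur in the "sports" category, so its overlap ratio is 0.5 and its blended score is 0.5, against 0.25 for "finance". Other combinations, for example a product of the two quantities, would fit the same description equally well.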
Claims:
WE CLAIM:
1. A method of data classification, comprising:
receiving, by a data classifier, a data corpus comprising one or more words;
comparing, by the data classifier, the data corpus with at least one pre-classified category of words to determine an overlap ratio between the data corpus and each of the at least one pre-classified category of words;
computing, by the data classifier, a confidence score of the data corpus for each of the at least one pre-classified category of words based on the overlap ratio and a predefined confidence score associated with the data corpus for each of the at least one pre-classified category of words; and
classifying, by the data classifier, the data corpus based on the confidence score into the at least one pre-classified category.
2. The method of claim 1, wherein the overlap ratio is based on one or more words common between the data corpus and the at least one pre-classified category of words.
3. The method of claim 1, wherein the confidence score is the probability of the data corpus belonging to a category of the at least one pre-classified category of words.
4. The method of claim 1, further comprising determining a boost value for the confidence score of the data corpus for each of the at least one pre-classified category of words based on a change in the confidence score for each of the at least one pre-classified category of words from the predefined confidence score associated with the data corpus for each of the at least one pre-classified category of words.
5. A system for data classification, comprising:
a hardware processor; and
a memory storing instructions executable by the hardware processor for:
receiving a data corpus comprising one or more words;
comparing the data corpus with at least one pre-classified category of words to determine an overlap ratio between the data corpus and each of the at least one pre-classified category of words;
computing a confidence score of the data corpus for each of the at least one pre-classified category of words based on the overlap ratio and a predefined confidence score associated with the data corpus for each of the at least one pre-classified category of words; and
classifying the data corpus based on the confidence score into the at least one pre-classified category.
6. The system of claim 5, wherein the overlap ratio is based on one or more words common between the data corpus and the at least one pre-classified category of words.
7. The system of claim 5, wherein the confidence score is the probability of the data corpus belonging to a category of the at least one pre-classified category of words.
8. The system of claim 5, further comprising determining a boost value for the confidence score of the data corpus for each of the at least one pre-classified category of words based on a change in the confidence score for each of the at least one pre-classified category of words from the predefined confidence score associated with the data corpus for each of the at least one pre-classified category of words.
Dated this 29th day of November, 2016
Swetha SN
Of K&S Partners
Agent for the Applicant
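As a further non-limiting illustration, the boost value recited in claims 4 and 8 may be sketched as follows. The claims state only that the boost value is based on a change in the confidence score from the predefined confidence score. The sketch therefore assumes the simplest reading, namely that the boost value equals the signed difference between the two. The function name `boost_values` and the sample figures are hypothetical.

```python
from typing import Dict


def boost_values(
    computed: Dict[str, float], predefined: Dict[str, float]
) -> Dict[str, float]:
    """Assumed definition: boost = computed confidence minus predefined confidence.

    A positive value means the corpus matched the category better than the
    predefined score suggested; a negative value means it matched worse.
    """
    return {
        name: score - predefined.get(name, 0.0)
        for name, score in computed.items()
    }


# Illustrative figures only; they are not taken from the disclosure.
computed = {"sports": 0.75, "finance": 0.25}
predefined = {"sports": 0.5, "finance": 0.5}

print(boost_values(computed, predefined))
```

Here the "sports" category receives a boost of 0.25 and the "finance" category a boost of -0.25. How the boost value is subsequently applied is not specified in the claims and is left open in this sketch.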
Description:
TECHNICAL FIELD
This disclosure relates to natural language processing, and more particularly to a system and method for data classification.