Abstract: A method for detecting outliers is provided, the method comprising: receiving a digitized text corpus comprising a plurality of data points; identifying k clusters of the plurality of data points; sampling a data point among the plurality of data points as a first cluster center of the k clusters; determining a sampling probability of each of the remaining data points of the plurality of data points; sampling the next cluster center based on the sampling probability, and iterating the determining of the sampling probability and the sampling of the next cluster center until k cluster centers are sampled; generating a weightage for each of the k cluster centers; determining sensitivity scores of the data points belonging to each of the k cluster centers; and labeling a data point having a sensitivity score greater than a threshold value as an outlier and removing the outlier from the digitized text corpus. FIG. 1
Claims: WE CLAIM:
1. An apparatus for detecting outliers of a digitized text corpus, the apparatus comprising:
a memory for storing instructions; and
a processor that, when executing the instructions, performs a method, the method comprising:
receiving the digitized text corpus comprising a plurality of data points ($X=\{x_i\}_{i=1}^{n}$);
identifying k clusters of the plurality of data points, wherein k is a natural number smaller than n;
sampling a data point among the plurality of data points as a first cluster center of the k clusters;
determining a sampling probability of each of the remaining data points of the plurality of data points, wherein the remaining data points are the plurality of data points excluding the sampled data point, and the sampling probability indicates a probability of each of the remaining data points being sampled as a next cluster center of the k clusters;
sampling the next cluster center based on the sampling probability, and iterating the process of determining the sampling probability and the process of sampling the next cluster center until k cluster centers are sampled to form a set of cluster centers $C = \{c_i\}_{i=1}^{k}$, wherein each of the k cluster centers corresponds to a respective one of the k clusters;
generating weightage for each of the k cluster centers by counting a number of data points belonging to each of the k cluster centers;
determining sensitivity scores of the data points belonging to each of the k cluster centers based on the weightage for each of the k cluster centers;
labeling, based on the determined sensitivity scores, a data point having a sensitivity score greater than a threshold value as an outlier of the digitized text corpus and removing the outlier from the digitized text corpus; and
providing a first parameter of the digitized text corpus by analyzing the removed outlier or a second parameter of the digitized text corpus by analyzing data points without the outlier.
2. The apparatus of claim 1, wherein the k clusters of the plurality of data points are identified by a spherical k-means clustering algorithm applied to the plurality of data points.
3. The apparatus of claim 1, wherein the first cluster center is selected among data points belonging to a cluster having a highest density of data points among the k clusters.
4. The apparatus of claim 1, wherein the sampling probability is determined by calculating an angular distance between the sampled data point and each of the remaining data points and normalizing the calculated angular distance by a sum of all distances between the sampled data point and the remaining data points.
5. The apparatus of claim 1, wherein a sensitivity of a data point is determined based on determination of a maximal ratio between a cost contribution of the data point and an average cost contribution of the plurality of data points.
6. The apparatus of claim 1, wherein the number of data points belonging to each of the k cluster centers is determined by assigning each of the plurality of data points to a nearest cluster center among the k cluster centers and determining distance between each of the plurality of data points and the corresponding nearest cluster center.
7. The apparatus of claim 1, wherein the threshold value is determined based on a number of outliers that a user has decided to detect and remove from the digitized text corpus.
8. A method for detecting outliers of a digitized text corpus, the method comprising:
receiving the digitized text corpus comprising a plurality of data points ($X=\{x_i\}_{i=1}^{n}$);
identifying k clusters of the plurality of data points, wherein k is a natural number smaller than n;
sampling a data point among the plurality of data points as a first cluster center of the k clusters;
determining a sampling probability of each of the remaining data points of the plurality of data points, wherein the remaining data points are the plurality of data points excluding the sampled data point, and the sampling probability indicates a probability of each of the remaining data points being sampled as a next cluster center of the k clusters;
sampling the next cluster center based on the sampling probability, and iterating the process of determining the sampling probability and the process of sampling the next cluster center until k cluster centers are sampled to form a set of cluster centers $C = \{c_i\}_{i=1}^{k}$, wherein each of the k cluster centers corresponds to a respective one of the k clusters;
generating weightage for each of the k cluster centers by counting a number of data points belonging to each of the k cluster centers;
determining sensitivity scores of the data points belonging to each of the k cluster centers based on the weightage for each of the k cluster centers;
labeling, based on the determined sensitivity scores, a data point having a sensitivity score greater than a threshold value as an outlier of the digitized text corpus and removing the outlier from the digitized text corpus; and
providing a first parameter of the digitized text corpus by analyzing the removed outlier, or a second parameter of the digitized text corpus by analyzing data points without the outlier.
9. The method of claim 8, wherein identifying k clusters of the plurality of data points further comprises:
applying, by the processor, a spherical k-means clustering algorithm to the plurality of data points.
10. The method of claim 8, wherein the first cluster center is selected among data points belonging to a cluster having a highest density of data points among the k clusters.
11. The method of claim 8, wherein determining sampling probability of each of remaining data points of the plurality of data points further comprises:
calculating, by the processor, an angular distance between the sampled data point and each of the remaining data points; and
normalizing, by the processor, the calculated angular distance by a sum of all distances between the sampled data point and the remaining data points.
12. The method of claim 8, wherein a sensitivity of a data point is determined based on determination of a maximal ratio between a cost contribution of the data point and an average cost contribution of the plurality of data points.
13. The method of claim 8, wherein generating weightage of each of the k cluster centers further comprises:
assigning, by the processor, each of the plurality of data points to a nearest cluster center among the k cluster centers; and
determining, by the processor, distance between each of the plurality of data points and the corresponding nearest cluster center.
14. The method of claim 8, wherein the threshold value is determined based on a number of outliers that a user has decided to detect and remove from the digitized text corpus.
Dated this 28th day of September, 2018
R Ramya Rao
Of K&S Partners
Agent for the Applicant
IN/PA-1607
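The center-sampling steps recited in claims 1, 4, 8 and 11 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names `angular_distance` and `sample_centers` are hypothetical, the first center is drawn uniformly at random (the claims do not fix this choice, and claims 3 and 10 describe a density-based alternative), and once more than one center has been sampled the distance to the nearest already-sampled center is used, which is an assumption not stated in the claims.

```python
import math
import random


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def sample_centers(points, k, rng=None):
    """Return k distinct indices into `points` chosen as cluster centers."""
    if rng is None:
        rng = random.Random()
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must be a natural number smaller than n")
    # First center: uniform draw (assumption made for this sketch).
    centers = [rng.randrange(n)]
    while len(centers) < k:
        remaining = [i for i in range(n) if i not in centers]
        # Angular distance of each remaining point to its nearest sampled
        # center (using the nearest center is an assumption of this sketch).
        dists = [
            min(angular_distance(points[i], points[c]) for c in centers)
            for i in remaining
        ]
        total = sum(dists)
        if total == 0.0:
            # Every remaining point coincides with a center: sample uniformly.
            probs = [1.0 / len(remaining)] * len(remaining)
        else:
            # Normalize each distance by the sum of all distances.
            probs = [d / total for d in dists]
        centers.append(rng.choices(remaining, weights=probs, k=1)[0])
    return centers
```

Passing a seeded generator, for example `sample_centers(points, 3, random.Random(0))`, makes the draw repeatable. Points far from the centers sampled so far receive a higher probability, so the sampled centers tend to spread across the corpus.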
Description: FIELD
Apparatuses, methods and systems consistent with the present disclosure relate generally to detecting and removing outliers, and more particularly, to apparatuses, methods and systems that detect outliers from a text corpus using a dynamically determined sensitivity score.
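One way to picture sensitivity-based outlier detection is the Python sketch below. It is an illustration under stated assumptions, not the scoring formula of the disclosure, which is not reproduced in this section. The function name `score_outliers` is hypothetical, and the score used here (a point's cost divided by the average cost, plus the corpus size divided by the weightage of the point's cluster) is an assumed stand-in. The threshold is applied by keeping the `num_outliers` highest-scoring points, following the claims that tie the threshold to a user-chosen number of outliers.

```python
import math
from collections import Counter


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def score_outliers(points, center_ids, num_outliers):
    """Return (scores, outlier_indices) for the given cluster centers."""
    n = len(points)
    assigned, cost = [], []
    for p in points:
        # Nearest center and the distance to it (the point's cost).
        d, c = min((angular_distance(p, points[c]), c) for c in center_ids)
        assigned.append(c)
        cost.append(d)
    # Weightage of a center: number of points assigned to it.
    weight = Counter(assigned)
    mean_cost = sum(cost) / n
    scores = []
    for i in range(n):
        # Assumed score: cost ratio plus a term that grows as the
        # point's cluster gets smaller. Illustrative only.
        ratio = cost[i] / mean_cost if mean_cost > 0 else 0.0
        scores.append(ratio + n / weight[assigned[i]])
    # Keep the num_outliers highest scores, which is equivalent to a
    # threshold placed just below the lowest of those scores.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return scores, sorted(ranked[:num_outliers])
```

A point that lies far from every center, or that sits in a sparsely populated cluster, receives a high score and is labeled first. The remaining points can then be analyzed separately from the removed ones.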
| # | Name | Date |
|---|---|---|
| 1 | 201841036828-STATEMENT OF UNDERTAKING (FORM 3) [28-09-2018(online)].pdf | 2018-09-28 |
| 2 | 201841036828-REQUEST FOR EXAMINATION (FORM-18) [28-09-2018(online)].pdf | 2018-09-28 |
| 3 | 201841036828-POWER OF AUTHORITY [28-09-2018(online)].pdf | 2018-09-28 |
| 4 | 201841036828-FORM 18 [28-09-2018(online)].pdf | 2018-09-28 |
| 5 | 201841036828-FORM 1 [28-09-2018(online)].pdf | 2018-09-28 |
| 6 | 201841036828-DRAWINGS [28-09-2018(online)].pdf | 2018-09-28 |
| 7 | 201841036828-DECLARATION OF INVENTORSHIP (FORM 5) [28-09-2018(online)].pdf | 2018-09-28 |
| 8 | 201841036828-COMPLETE SPECIFICATION [28-09-2018(online)].pdf | 2018-09-28 |
| 9 | 201841036828-Request Letter-Correspondence [09-10-2018(online)].pdf | 2018-10-09 |
| 10 | 201841036828-Power of Attorney [09-10-2018(online)].pdf | 2018-10-09 |
| 11 | 201841036828-Form 1 (Submitted on date of filing) [09-10-2018(online)].pdf | 2018-10-09 |
| 12 | 201841036828-Proof of Right (MANDATORY) [20-12-2018(online)].pdf | 2018-12-20 |
| 13 | Correspondence by Agent_Form30,Form1_31-12-2018.pdf | 2018-12-31 |
| 14 | 201841036828-RELEVANT DOCUMENTS [04-10-2021(online)].pdf | 2021-10-04 |
| 15 | 201841036828-PETITION UNDER RULE 137 [04-10-2021(online)].pdf | 2021-10-04 |
| 16 | 201841036828-OTHERS [04-10-2021(online)].pdf | 2021-10-04 |
| 17 | 201841036828-Information under section 8(2) [04-10-2021(online)].pdf | 2021-10-04 |
| 18 | 201841036828-FORM 3 [04-10-2021(online)].pdf | 2021-10-04 |
| 19 | 201841036828-FER_SER_REPLY [04-10-2021(online)].pdf | 2021-10-04 |
| 20 | 201841036828-DRAWING [04-10-2021(online)].pdf | 2021-10-04 |
| 21 | 201841036828-CORRESPONDENCE [04-10-2021(online)].pdf | 2021-10-04 |
| 22 | 201841036828-COMPLETE SPECIFICATION [04-10-2021(online)].pdf | 2021-10-04 |
| 23 | 201841036828-CLAIMS [04-10-2021(online)].pdf | 2021-10-04 |
| 24 | 201841036828-FER.pdf | 2021-10-17 |
| 25 | 201841036828-PatentCertificate14-02-2024.pdf | 2024-02-14 |
| 26 | 201841036828-IntimationOfGrant14-02-2024.pdf | 2024-02-14 |
| 27 | 201841036828-PROOF OF ALTERATION [02-05-2024(online)].pdf | 2024-05-02 |
| 1 | 2021-04-0616-24-42E_06-04-2021.pdf | |