Method And System For Extracting Information From Templateless Formats

Abstract: The present disclosure relates to a method and system for extracting information from templateless formats. The method comprises: (1) receiving, by a processing unit [102], a data message; (2) identifying a set of specific patterns of entity values; (3) substituting the identified specific patterns with corresponding key values to generate a processed data message; (4) storing each key value in a memory unit [108]; (5) generating, by a fine-tuned tokenizer [104], a set of data message tokens, organisation data tokens, and domain data from the processed data message; (6) receiving, by a pre-trained sub-system [106], the generated set of data message tokens, the organisation data tokens, and the domain data from the fine-tuned tokenizer [104]; (7) extracting information comprising a set of entity values and meta data associated with the data message; and (8) associating each of the identified entity values with a corresponding key value stored in the memory unit [108].

Patent Information

Application #

202241063464

Filing Date

07 November 2022

Publication Number

04/2023

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Patent Number

Legal Status

Grant Date

2024-10-01

Renewal Date

Applicants

FLIPKART INTERNET PRIVATE LIMITED

Buildings Alyssa, Begonia & Clover, Embassy Tech Village, Outer Ring Road, Deverabeesanahalli Village, Bengaluru - 560103, Karnataka, India

Inventors

1. Tal Elisberg

40 Kislev St., Ashdod, Israel

2. Shlomi Lifshits

16 Katsenelson St., Tel Aviv, Israel

Specification

We Claim:

1. A method for extracting information from templateless formats, the method comprising:

- receiving, by a processing unit [102], a data message;

- identifying, by the processing unit [102], a set of one or more specific patterns of entity values;

- substituting, by the processing unit [102], each entity value of the identified specific patterns of entity values with a corresponding key value to generate a processed data message;

- storing, by the processing unit [102], each entity value in a memory unit [108];

- generating, by a fine-tuned tokenizer [104], a set of one or more data message tokens, organisation data tokens and a domain data, from the processed data message;

- receiving, by a pre-trained sub-system [106], the generated set of one or more data message tokens, the organisation data tokens, and the domain data from the fine-tuned tokenizer [104];

- extracting, by the pre-trained sub-system [106], an information comprising a set of one or more entity values and a meta data associated with the data message; and

- associating, by the pre-trained sub-system [106], each of the identified entity values with a corresponding key value stored in the memory unit [108].
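
The substitution, storage, and association steps of claim 1 can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the two regular expressions, the `<LABEL_n>` key format, and the helper names `mask_entities` and `restore_entities` do not come from the specification, and a plain dictionary stands in for the memory unit [108].

```python
import re

# Hypothetical patterns; the claims do not say which entity types are matched.
PATTERNS = {
    "AMOUNT": re.compile(r"(?:Rs\.?|INR)\s?\d+(?:,\d{3})*(?:\.\d+)?"),
    "DATE": re.compile(r"\b\d{2}-\d{2}-\d{4}\b"),
}


def mask_entities(message):
    """Replace each matched entity value with a key and remember the mapping."""
    store = {}      # stands in for the memory unit [108]
    counters = {}   # per-label running index used to build unique keys

    def substitute(label):
        def _sub(match):
            counters[label] = counters.get(label, 0) + 1
            key = f"<{label}_{counters[label]}>"   # hypothetical key format
            store[key] = match.group(0)            # keep the original value
            return key
        return _sub

    for label, pattern in PATTERNS.items():
        message = pattern.sub(substitute(label), message)
    return message, store


def restore_entities(tagged_keys, store):
    """Associate each key tagged downstream with its stored entity value."""
    return {key: store[key] for key in tagged_keys if key in store}


processed, store = mask_entities(
    "Your order of Rs. 1,299.00 will arrive on 12-11-2022."
)
print(processed)
print(restore_entities(["<DATE_1>"], store))
```

Replacing variable values with keys before tokenization keeps the token vocabulary small, since the downstream model sees one placeholder per entity rather than every possible amount or date.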

2. The method as claimed in claim 1, the method further comprising storing, by the pre-trained sub-system [106], each of the one or more entity values and the meta data in the memory unit [108].

3. The method as claimed in claim 1, wherein the pre-trained sub-system [106] is a layered system.

4. The method as claimed in claim 3, wherein the pre-training of the sub-system [106] further comprises:

- converting, by one or more embedding layers, the organisation data tokens into organisation data embeddings and the data message tokens into data message embeddings; and

- feeding the organisation data embeddings and the data message embeddings to one or more bi-directional Long Short-Term Memory (LSTM) layers.
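
The data flow of claim 4 (embedding lookup followed by a bi-directional LSTM) can be sketched in plain Python. This is an illustration of shapes and wiring only: the weights are random and untrained, the vocabularies and dimensions are hypothetical, and concatenating a single organisation embedding onto every message token embedding is an assumption, since the claim does not say how the two embeddings are combined.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """One LSTM cell with input (i), forget (f), candidate (g), output (o) gates."""

    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        # One weight matrix per gate, applied to [input ; previous hidden state].
        self.w = {k: rand_matrix(hidden_size, input_size + hidden_size) for k in "ifgo"}
        self.b = {k: [0.0] * hidden_size for k in "ifgo"}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {k: [a + b for a, b in zip(matvec(self.w[k], z), self.b[k])] for k in "ifgo"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        g = [math.tanh(v) for v in pre["g"]]
        o = [sigmoid(v) for v in pre["o"]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def run(cell, inputs):
    """Run one direction and return the hidden state at every position."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for x in inputs:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs


def bilstm(forward_cell, backward_cell, inputs):
    """Concatenate left-to-right and right-to-left hidden states per token."""
    fwd = run(forward_cell, inputs)
    bwd = run(backward_cell, inputs[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


# Hypothetical vocabularies and sizes, chosen only for the example.
MSG_VOCAB = {"<unk>": 0, "your": 1, "order": 2, "of": 3, "<AMOUNT_1>": 4, "shipped": 5}
ORG_VOCAB = {"<unk>": 0, "acme": 1}
MSG_DIM, ORG_DIM, HIDDEN = 4, 2, 3
msg_table = rand_matrix(len(MSG_VOCAB), MSG_DIM)
org_table = rand_matrix(len(ORG_VOCAB), ORG_DIM)


def embed(tokens, vocab, table):
    return [table[vocab.get(t, 0)] for t in tokens]


message_tokens = ["your", "order", "of", "<AMOUNT_1>", "shipped"]
msg_emb = embed(message_tokens, MSG_VOCAB, msg_table)
org_emb = embed(["acme"], ORG_VOCAB, org_table)[0]
inputs = [m + org_emb for m in msg_emb]  # assumed combination: concatenation

features = bilstm(
    LSTMCell(MSG_DIM + ORG_DIM, HIDDEN), LSTMCell(MSG_DIM + ORG_DIM, HIDDEN), inputs
)
print(len(features), len(features[0]))
```

Each token ends up with a feature vector of size `2 * HIDDEN`, combining context from both sides of the message; a practical system would use a trained layer from a deep-learning library rather than this hand-rolled cell.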

5. The method as claimed in claim 3, wherein the extracting, by the pre-trained sub-system [106], an information comprising a set of one or more entity values and a meta data associated with the data message, further comprises:

- feeding tokens, from the one or more bi-directional Long Short-Term Memory (LSTM) layers, to a fully connected layer and a fully connected and adaptive Max-pooling Layers; and

- predicting, by a Conditional Random Field (CRF) layer, an entity sequence and a set of meta data labels.

6. The method as claimed in claim 5, wherein the predicting, by a Conditional Random Field (CRF) layer, an entity sequence and a set of meta data labels, is based on a transition matrix and a set of Negative Log Likelihood Loss (NLLL) weights.
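
The role of the transition matrix in CRF prediction (claims 5 and 6) can be illustrated with a small linear-chain CRF written from scratch. The label set, emission scores, and transition scores below are made-up numbers, and the loss shown is the standard CRF negative log-likelihood; how the claimed "NLLL weights" enter the computation is not specified, so they are not modelled here, and neither are the meta data labels.

```python
import math

LABELS = ["O", "B-AMOUNT", "B-DATE"]  # hypothetical label set


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, path):
    """Unnormalised score of one label path: emissions plus transitions."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all label paths."""
    n = len(transitions)
    alpha = list(emissions[0])
    for em in emissions[1:]:
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(n)]) + em[j]
            for j in range(n)
        ]
    return log_sum_exp(alpha)


def nll(emissions, transitions, gold_path):
    """Negative log-likelihood of the gold path under the CRF."""
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, gold_path
    )


def viterbi(emissions, transitions):
    """Return the highest-scoring label path as a list of label indices."""
    n = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for em in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            step_back.append(best_i)
            step_scores.append(score[best_i] + transitions[best_i][j] + em[j])
        score = step_scores
        backpointers.append(step_back)
    path = [max(range(n), key=lambda j: score[j])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    return path[::-1]


# Per-token scores for each label, e.g. from a fully connected layer (made up).
emissions = [[2.0, 0.5, 0.1], [0.3, 2.5, 0.2], [0.4, 0.1, 1.8]]
# transitions[i][j]: score of moving from label i to label j (made up).
transitions = [[0.5, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]

best_path = viterbi(emissions, transitions)
print([LABELS[k] for k in best_path])
print(nll(emissions, transitions, best_path))
```

During training, the loss is minimised with respect to both the emission scores and the transition matrix, so the matrix learns which label sequences are plausible; at prediction time, Viterbi decoding uses the same matrix to pick the best overall sequence rather than the best label per token.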

7. A system for extracting information from templateless formats, the system comprising:

- a processing unit [102] configured to:

o receive a data message;

o identify a set of one or more specific patterns of entity values;

o substitute each entity value of the identified specific patterns of entity values with a corresponding key value to generate a processed data message; and

o store each entity value in a memory unit [108] coupled with the processing unit [102];

- a fine-tuned tokenizer [104] connected to the processing unit [102] and the memory unit [108], the fine-tuned tokenizer [104] configured to:

o generate a set of one or more data message tokens, organisation data tokens and a domain data, from the processed data message; and

- a pre-trained sub-system [106] connected to the processing unit [102], the memory unit [108], and the fine-tuned tokenizer [104], the pre-trained sub-system [106] configured to:

o receive the generated set of one or more data message tokens, the organisation data tokens, and the domain data from the fine-tuned tokenizer [104];

o extract an information comprising a set of one or more entity values and a meta data associated with the data message; and

o associate each of the identified entity values with a corresponding key value stored in the memory unit [108].

8. The system as claimed in claim 7, wherein the pre-trained sub-system [106] is further configured to store each of the one or more entity values and the meta data in the memory unit [108].

9. The system as claimed in claim 7, wherein the pre-trained sub-system [106] is a layered system.

10. The system as claimed in claim 9, wherein the sub-system [106], for the pre-training, is further configured to:

- convert, using one or more embedding layers of the sub-system [106], the organisation data tokens into organisation data embeddings and the data message tokens into data message embeddings; and

- feed the organisation data embeddings and the data message embeddings to one or more bi-directional Long Short-Term Memory (LSTM) layers.

11. The system as claimed in claim 9, wherein the pre-trained sub-system [106], for extracting an information comprising a set of one or more entity values and the meta data associated with the data message, is further configured to:

- feed tokens, from the one or more bi-directional Long Short-Term Memory (LSTM) layers, to a fully connected layer and a fully connected and adaptive Max-pooling Layers; and

- predict, by a Conditional Random Field (CRF) layer, an entity sequence and a set of meta data labels.

12. The system as claimed in claim 11, wherein the pre-trained sub-system [106] predicts, by a Conditional Random Field (CRF) layer, the entity sequence and a set of meta data labels, based on a transition matrix and a set of Negative Log Likelihood Loss (NLLL) weights.

Documents

Application Documents

#	Name	Date
1	202241063464-STATEMENT OF UNDERTAKING (FORM 3) [07-11-2022(online)].pdf	2022-11-07
2	202241063464-REQUEST FOR EXAMINATION (FORM-18) [07-11-2022(online)].pdf	2022-11-07
3	202241063464-REQUEST FOR EARLY PUBLICATION(FORM-9) [07-11-2022(online)].pdf	2022-11-07
4	202241063464-PROOF OF RIGHT [07-11-2022(online)].pdf	2022-11-07
5	202241063464-POWER OF AUTHORITY [07-11-2022(online)].pdf	2022-11-07
6	202241063464-FORM-9 [07-11-2022(online)].pdf	2022-11-07
7	202241063464-FORM 18 [07-11-2022(online)].pdf	2022-11-07
8	202241063464-FORM 1 [07-11-2022(online)].pdf	2022-11-07
9	202241063464-FIGURE OF ABSTRACT [07-11-2022(online)].pdf	2022-11-07
10	202241063464-DRAWINGS [07-11-2022(online)].pdf	2022-11-07
11	202241063464-DECLARATION OF INVENTORSHIP (FORM 5) [07-11-2022(online)].pdf	2022-11-07
12	202241063464-COMPLETE SPECIFICATION [07-11-2022(online)].pdf	2022-11-07
13	202241063464-Request Letter-Correspondence [09-11-2022(online)].pdf	2022-11-09
14	202241063464-Power of Attorney [09-11-2022(online)].pdf	2022-11-09
15	202241063464-Form 1 (Submitted on date of filing) [09-11-2022(online)].pdf	2022-11-09
16	202241063464-Covering Letter [09-11-2022(online)].pdf	2022-11-09
17	202241063464-Correspondence_Form-1 And POA_21-11-2022.pdf	2022-11-21
18	202241063464-FER.pdf	2023-03-10
19	202241063464-FER_SER_REPLY [08-09-2023(online)].pdf	2023-09-08
20	202241063464-PatentCertificate01-10-2024.pdf	2024-10-01
21	202241063464-IntimationOfGrant01-10-2024.pdf	2024-10-01

Search Strategy

1	SearchHistoryE_10-03-2023.pdf