Abstract: State-of-the-art techniques can lead to erroneous results because their white space-dead space technology landscaping is limited to the corpus level. A method and system for white space-dead space technology landscaping from unstructured data is disclosed. A global topic set is obtained from a global corpus by generating a topic model. An enterprise corpus, which is a collection of research documents specifying a research area and a research subarea identified by the enterprise, is processed by the topic model to identify an enterprise topic set from among the global topic set. Topics from the global topic set and the enterprise topic set are mapped to each of the research subareas. Trend analysis is performed over the mapped topics to identify trending topics and declining topics for each subarea. Topic relevancy is determined to identify majority topics and emerging topics. A topic flowchart is created, providing the evolution of each topic over the years. [To be published with FIG. 2]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR WHITE SPACE-DEAD SPACE TECHNOLOGY LANDSCAPING FROM UNSTRUCTURED DATA
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The embodiments herein generally relate to data analysis of unstructured data to derive insights and, more particularly, to a method and system for white space-dead space technology landscaping from unstructured data.
BACKGROUND
[002] Research area forecasting has become an indispensable part of every organization to aid its accelerated success. In each research area and its related subareas, there may be a multitude of topics, each having its own area of specialization and impact on business. Research teams working in a specific research area need to work on the most relevant topics of the day to establish a name and stay ahead of the competition. To do this, they need a breakdown of research in their area of work into a comprehensive list of topics, and they need to keep track of the trending topics and ensure that they are invested in them. Spotting the trending topics that have been missed is known as white space mapping. Similarly, they also need to spot dead spaces, which are technologies that are in sharp decline globally and need to be retired, so that resources can be shifted from them to other research areas. Existing tools in the market indicate broad trends primarily based on bibliometrics. However, what is desired is an analysis that is tailored to the work being done in a specific subarea in the context of developments at large. The white space spotting problem has been difficult to automate, and currently research and development (R&D) establishments employ experts to manually identify these spaces. Experts with sufficient capability to perform this work are hard to find, and even when they are found, they are prone to human error.
[003] In the domain of white space-dead space analysis, the majority of work focuses on merely counting the frequency of a topic at the corpus level. However, any topic identified at the corpus level generally represents a broad subject that cuts across many subareas, and the same topic can trend differently in each subarea. For example, machine learning can be a topic in both the robotics and the data science research areas. Thus, analyzing the topics in the context of the area/subarea of research is crucial to enable factual white space-dead space analysis, so as to guide an enterprise or organization in its research approach. The simple global-to-enterprise mapping used in conventional approaches can lead to erroneous results, as its analysis focus is limited to the corpus level.
[004] Further, to determine declining and emerging topics, conventional methods utilize the standard Compound Annual Growth Rate (CAGR) approach, which is calculated considering the beginning and end values of the change rate of the topic of interest over a given time span. However, to capture technology trends that are changing at a fast pace, a CAGR computation that better captures the fast changes is required.
SUMMARY
[005] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
[006] For example, in one embodiment, a method for white space-dead space technology landscaping from unstructured data is provided. The method includes creating, i) a global corpus comprising a plurality of global documents, and ii) an enterprise corpus comprising a plurality of enterprise documents with each of the plurality of enterprise documents identified with a research area among a plurality of research areas and a research subarea among a plurality of research subareas defined for each area. The plurality of global documents and the plurality of enterprise documents belong to a predefined time span.
[007] Further, the method includes preprocessing each of the plurality of global documents and each of the plurality of enterprise documents by performing tokenization, dictionary generation and Term Frequency–Inverse Document Frequency (TF-IDF) computation.
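The preprocessing step can be illustrated with a minimal sketch, offered only as one possible reading of the step and not as the disclosed implementation. The function names (tokenize, build_dictionary, tfidf), the regular-expression tokenizer, the particular TF-IDF weighting and the sample documents are all assumptions introduced for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(tokenized_docs):
    # Map every distinct token in the corpus to an integer id.
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}


def tfidf(tokenized_docs, dictionary):
    # Assumed weighting: tf = count / document length, idf = ln(N / document frequency).
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    vectors = []
    for doc in tokenized_docs:
        if not doc:
            vectors.append({})
            continue
        counts = Counter(doc)
        vectors.append({
            dictionary[tok]: (c / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        })
    return vectors


# Hypothetical two-document corpus.
docs = [
    "Robots learn grasping with machine learning",
    "Machine learning for data science pipelines",
]
tokens = [tokenize(d) for d in docs]
dictionary = build_dictionary(tokens)
vectors = tfidf(tokens, dictionary)
```

In this sketch a term that occurs in every document (here, "machine") receives a weight of zero, while a term unique to one document (here, "robots") receives a positive weight.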
[008] Further, the method includes generating a topic model by applying a Latent Dirichlet Allocation (LDA) technique on the preprocessed global documents, wherein the topic model generates a global topic set comprising a predefined number of topics derived from the global documents. Each topic among the global topic set is assigned a topic score per global document to generate document-to-topics pairing for each of the preprocessed global documents with topics from the global topic set having topic scores above a predefined score threshold.
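The document-to-topics pairing can be sketched as a simple filter over per-document topic scores. The sketch below assumes the topic model has already produced a score for every topic in every document; the scores, the threshold value of 0.2 and the helper names are hypothetical, and the LDA fitting itself is not shown.

```python
def pair_documents_to_topics(doc_topic_scores, score_threshold):
    # Keep, for each document, only the topics whose score exceeds the threshold.
    return {
        doc_id: {t: s for t, s in scores.items() if s > score_threshold}
        for doc_id, scores in doc_topic_scores.items()
    }


def topic_set_of(pairing):
    # Union of all topics that survive the threshold in at least one document.
    topics = set()
    for kept in pairing.values():
        topics.update(kept)
    return topics


# Hypothetical topic scores, as a topic model might assign them per document.
doc_topic_scores = {
    "d1": {0: 0.70, 1: 0.25, 2: 0.05},
    "d2": {0: 0.10, 1: 0.45, 2: 0.45},
}
pairing = pair_documents_to_topics(doc_topic_scores, score_threshold=0.2)
surviving_topics = topic_set_of(pairing)
```

The same filter would apply unchanged to enterprise documents scored by the same topic model, in which case the union of surviving topics plays the role of the enterprise topic set.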
[009] Furthermore, the method includes processing the preprocessed enterprise documents by applying the topic model to generate an enterprise topic set comprising one or more topics from among the global topic set. Each topic among the enterprise topic set is assigned the topic score per enterprise document to generate the document-to-topics pairing for each of the preprocessed enterprise documents with the topics from the enterprise topic set having the topic scores above the predefined score threshold.
[0010] Further, the method includes mapping the topics from among the enterprise topic set and the global topic set with each research subarea among the plurality of research subareas. The mapping comprises grouping the plurality of preprocessed enterprise documents into a plurality of groups in accordance with each research subarea among the plurality of research subareas, and applying an iterative process over the plurality of groups to allocate the topics from the enterprise topic set and the global topic set to each research subarea. Furthermore, the iterative process comprises aggregating the topic score of each topic from the document-to-topics pairing of each of the plurality of preprocessed enterprise documents in each group among the plurality of groups and tabulating the aggregated topic score for each topic across the plurality of groups. Thereafter, allocating one or more topics among the enterprise topic set to the research subarea if the aggregated topic score for the one or more topics is above a predefined aggregated threshold. Thereafter, allocating, to each group among the plurality of groups, one or more preprocessed global documents from among the plurality of global documents based on a similarity between each topic in the document-to-topics pairing of each of the plurality of preprocessed global documents and one or more topics allocated to the subarea if the topic scores satisfy the predefined score threshold to obtain updated subgroups. Each preprocessed global document is allocated to a maximum of a predefined fraction of a total number of subareas defined by the enterprise. Finally, repeating the iterative process on a plurality of documents in the updated subgroups until each of the plurality of subareas is mapped to a maximum of half a number of total topics in the global topic set.
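A single pass of the iterative mapping can be sketched as follows. This is only an illustration under stated assumptions: the "similarity" between a global document and a subarea is taken here to be the summed score of the topics they share, which the description does not specify, and the repeat-until-stop loop is omitted. All names, thresholds and data are hypothetical.

```python
from collections import defaultdict


def aggregate_scores(doc_subarea, pairing):
    # Tabulate the aggregated topic score per subarea group.
    table = defaultdict(lambda: defaultdict(float))
    for doc_id, subarea in doc_subarea.items():
        for topic, score in pairing.get(doc_id, {}).items():
            table[subarea][topic] += score
    return table


def allocate_topics(table, aggregated_threshold):
    # A topic is allocated to a subarea if its aggregated score exceeds the threshold.
    return {
        sub: {t for t, s in scores.items() if s > aggregated_threshold}
        for sub, scores in table.items()
    }


def allocate_global_docs(global_pairing, subarea_topics, fraction):
    # Each global document joins at most fraction * (number of subareas) groups.
    cap = max(1, int(fraction * len(subarea_topics)))
    groups = {sub: [] for sub in subarea_topics}
    for doc_id, scores in global_pairing.items():
        # Assumed similarity: summed score of topics shared with the subarea.
        overlap = {
            sub: sum(s for t, s in scores.items() if t in topics)
            for sub, topics in subarea_topics.items()
        }
        ranked = sorted(
            (sub for sub, v in overlap.items() if v > 0),
            key=lambda sub: (-overlap[sub], sub),
        )
        for sub in ranked[:cap]:
            groups[sub].append(doc_id)
    return groups


# Hypothetical enterprise metadata and thresholded pairings.
doc_subarea = {"e1": "robotics", "e2": "robotics", "e3": "data science"}
enterprise_pairing = {"e1": {0: 0.6, 1: 0.3}, "e2": {0: 0.5}, "e3": {1: 0.7, 2: 0.25}}
global_pairing = {"g1": {0: 0.8}, "g2": {1: 0.6, 0: 0.3}, "g3": {2: 0.9}}

table = aggregate_scores(doc_subarea, enterprise_pairing)
subarea_topics = allocate_topics(table, aggregated_threshold=0.5)
groups = allocate_global_docs(global_pairing, subarea_topics, fraction=0.5)
```

In a full implementation, the documents in the updated groups would be fed back into the aggregation step until the stopping condition on the number of topics per subarea is reached.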
[0011] Furthermore, the method includes, determining trending topics and declining topics from among the mapped topics to each research subarea for generating landscaping of white space technologies and dead space technologies for each of the enterprise documents based on an intersection of the mapped topics of each subarea and the global topic, wherein the white space technologies refer to technologies missed by the enterprise for research, and the dead space technologies refer to the technologies that need to be retired by the enterprise from research, and a topic relevancy by generating a network chart, for each subarea among a plurality of research subareas, with mapped topics of the subarea as nodes, which are connected by an edge if topics among the mapped topics corresponding to two nodes are in a single enterprise document. The major topics and emerging topics within each subarea are identified based on centrality measures of each node in the network chart.
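The network chart can be sketched as an undirected co-occurrence graph over the mapped topics. The sketch below uses degree centrality as one example of a centrality measure; the description does not fix which measure is used, and the topic names and documents are hypothetical.

```python
from itertools import combinations


def build_network(doc_topics, mapped_topics):
    # Nodes are mapped topics; an edge joins two topics found in one enterprise document.
    adjacency = {t: set() for t in mapped_topics}
    for topics in doc_topics.values():
        present = sorted(set(topics) & set(mapped_topics))
        for a, b in combinations(present, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def degree_centrality(adjacency):
    # Fraction of the other nodes to which each node is directly connected.
    n = len(adjacency)
    if n < 2:
        return {t: 0.0 for t in adjacency}
    return {t: len(neighbours) / (n - 1) for t, neighbours in adjacency.items()}


# Hypothetical mapped topics of one subarea and the enterprise documents containing them.
mapped_topics = {"slam", "grasping", "rl", "vision"}
doc_topics = {
    "e1": ["slam", "vision"],
    "e2": ["vision", "rl"],
    "e3": ["vision", "grasping"],
}
adjacency = build_network(doc_topics, mapped_topics)
centrality = degree_centrality(adjacency)
```

Here "vision" co-occurs with every other topic and therefore has the highest centrality, which is the kind of signal on which topics could be ranked within a subarea.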
[0012] Furthermore, the method includes, determining the trending topics and the declining topics from among the mapped topics of each research subarea by determining a total number of the plurality of global documents published per year over a time range and determining a topic frequency for each topic of the global topic set per year by counting the plurality of global documents for the topic per year for the time range to generate a plurality of topic frequencies. Thereafter, normalizing each of the plurality of topic frequencies to the average number of documents for the topic per year, and then plotting the normalized topic frequencies over the years in the time range, wherein a topic is defined to be trending topic if there is a percentage growth in the topic frequency of the topic over a last one year, and wherein a topic is defined as declining topic if there is a negative cumulative growth in last predefined number of years.
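The trend determination can be sketched as below. The sketch assumes per-year document counts for a single topic are already available, normalizes them by the topic's average yearly count, and reports two independent flags, since the description does not say how a topic that satisfies both conditions would be labeled. The function names, the window size and the counts are hypothetical.

```python
def normalized_frequencies(counts_by_year):
    # Normalize yearly counts by the topic's average number of documents per year.
    years = sorted(counts_by_year)
    mean = sum(counts_by_year.values()) / len(years)
    if mean == 0:
        return [0.0 for _ in years]
    return [counts_by_year[y] / mean for y in years]


def classify_topic(counts_by_year, window):
    # Trending: growth over the last one year.
    # Declining: negative cumulative growth over the last `window` years.
    series = normalized_frequencies(counts_by_year)
    if len(series) < 2:
        raise ValueError("at least two years of data are needed")
    recent = series[-(window + 1):]
    cumulative = sum(b - a for a, b in zip(recent, recent[1:]))
    return {"trending": series[-1] > series[-2], "declining": cumulative < 0}


# Hypothetical yearly document counts for two topics.
rising = classify_topic({2019: 10, 2020: 20, 2021: 30}, window=2)
falling = classify_topic({2019: 30, 2020: 20, 2021: 10}, window=2)
```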
[0013] Further, the method includes determining white space technologies and dead space technologies for the enterprise based on intersection of the enterprise topic set and the global topic set for a subarea, by identifying one or more topics present in the global topic set but absent in the mapped topics for the subarea as the white space technologies for the enterprise for future research, and then identifying one or more topics present in the mapped topics of the subarea, which are identified as declining topics in the global topic set, as the dead space technologies for the enterprise, which need to be retired during future research.
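Expressed with sets, the white space and dead space determination reduces to a difference and an intersection. The sketch below is illustrative only; the topic labels are hypothetical.

```python
def landscape(global_topics, mapped_topics, declining_topics):
    # White space: topics in the global topic set that the subarea has not mapped.
    white_space = global_topics - mapped_topics
    # Dead space: mapped topics of the subarea that are declining globally.
    dead_space = mapped_topics & declining_topics
    return white_space, dead_space


# Hypothetical topic labels for one subarea.
global_topics = {"vision", "rl", "fuzzy control", "foundation models"}
mapped_topics = {"vision", "fuzzy control"}
declining_topics = {"fuzzy control"}

white_space, dead_space = landscape(global_topics, mapped_topics, declining_topics)
```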
[0014] Furthermore, the method includes computing a topic progression based on cosine distance between each topic identified in a year with each topic identified in previous years to get a progression of topics from year to year, wherein the topic progression enables identifying carry forward topics, emerging topics and extinct topics, and creating a topic flowchart by plotting topic progression score for each topic as a Sankey diagram providing evolution of the topic over the years.
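The topic progression can be sketched by comparing the keyword distributions of topics from consecutive years. The sketch below uses cosine similarity (one minus cosine distance) and a similarity cut-off of 0.5 to decide whether a topic is carried forward; the cut-off, the representation of a topic as a keyword-weight dictionary, and all names and data are assumptions.

```python
import math


def cosine_similarity(a, b):
    # Cosine similarity between two sparse keyword-weight vectors.
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def progression(prev_topics, curr_topics, threshold):
    # Link each current topic to its closest previous topic if similar enough.
    links, emerging, matched_prev = [], [], set()
    for curr_name, curr_vec in curr_topics.items():
        best_name, best_sim = None, 0.0
        for prev_name, prev_vec in prev_topics.items():
            sim = cosine_similarity(prev_vec, curr_vec)
            if sim > best_sim:
                best_name, best_sim = prev_name, sim
        if best_name is not None and best_sim >= threshold:
            links.append((best_name, curr_name, best_sim))  # carry-forward topic
            matched_prev.add(best_name)
        else:
            emerging.append(curr_name)
    extinct = [p for p in prev_topics if p not in matched_prev]
    return links, emerging, extinct


# Hypothetical keyword distributions of topics in two consecutive years.
prev_topics = {
    "2020/vision": {"image": 0.6, "camera": 0.4},
    "2020/fuzzy": {"fuzzy": 0.7, "rule": 0.3},
}
curr_topics = {
    "2021/vision": {"image": 0.5, "camera": 0.3, "transformer": 0.2},
    "2021/llm": {"language": 0.6, "transformer": 0.4},
}
links, emerging, extinct = progression(prev_topics, curr_topics, threshold=0.5)
```

Each (source, target, score) triple in links has the shape that a Sankey diagram typically takes as input for one flow between two years.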
[0015] In an aspect, a system for white space-dead space technology landscaping from unstructured data is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to create, i) a global corpus comprising a plurality of global documents, and ii) an enterprise corpus comprising a plurality of enterprise documents with each of the plurality of enterprise documents identified with a research area among a plurality of research areas and a research subarea among a plurality of research subareas defined for each area. The plurality of global documents and the plurality of enterprise documents, comprising the unstructured data, belong to a predefined time span.
[0016] Further, the one or more hardware processors are configured to preprocess each of the plurality of global documents and each of the plurality of enterprise documents by performing tokenization, dictionary generation and Term Frequency–Inverse Document Frequency (TF-IDF) computation.
[0017] Further, the one or more hardware processors are configured to generate a topic model by applying a Latent Dirichlet Allocation (LDA) technique on the preprocessed global documents, wherein the topic model generates a global topic set comprising a predefined number of topics derived from the global documents. Each topic among the global topic set is assigned a topic score per global document to generate document-to-topics pairing for each of the preprocessed global documents with topics from the global topic set having topic scores above a predefined score threshold.
[0018] Furthermore, the one or more hardware processors are configured to process the preprocessed enterprise documents by applying the topic model to generate an enterprise topic set comprising one or more topics from among the global topic set. Each topic among the enterprise topic set is assigned the topic score per enterprise document to generate the document-to-topics pairing for each of the preprocessed enterprise documents with the topics from the enterprise topic set having the topic scores above the predefined score threshold.
[0019] Further, the one or more hardware processors are configured to map the topics from among the enterprise topic set and the global topic set with each research subarea among the plurality of research subareas. The mapping comprises grouping the plurality of preprocessed enterprise documents into a plurality of groups in accordance with each research subarea among the plurality of research subareas, and applying an iterative process over the plurality of groups to allocate the topics from the enterprise topic set and the global topic set to each research subarea. Furthermore, the iterative process comprises aggregating the topic score of each topic from the document-to-topics pairing of each of the plurality of preprocessed enterprise documents in each group among the plurality of groups and tabulating the aggregated topic score for each topic across the plurality of groups. Thereafter, allocating one or more topics among the enterprise topic set to the research subarea if the aggregated topic score for the one or more topics is above a predefined aggregated threshold. Thereafter, allocating, to each group among the plurality of groups, one or more preprocessed global documents from among the plurality of global documents based on a similarity between each topic in the document-to-topics pairing of each of the plurality of preprocessed global documents and one or more topics allocated to the subarea if the topic scores satisfy the predefined score threshold to obtain updated subgroups. Each preprocessed global document is allocated to a maximum of a predefined fraction of a total number of subareas defined by the enterprise. Finally, repeating the iterative process on a plurality of documents in the updated subgroups until each of the plurality of subareas is mapped to a maximum of half a number of total topics in the global topic set.
[0020] Furthermore, the one or more hardware processors are configured to determine trending topics and declining topics from among the mapped topics to each research subarea for generating landscaping of white space technologies and dead space technologies for each of the enterprise documents based on an intersection of the mapped topics of each subarea and the global topic, wherein the white space technologies refer to technologies missed by the enterprise for research, and the dead space technologies refer to the technologies that need to be retired by the enterprise from research, and a topic relevancy by generating a network chart, for each subarea among a plurality of research subareas, with mapped topics of the subarea as nodes, which are connected by an edge if topics among the mapped topics corresponding to two nodes are in a single enterprise document. The major topics and emerging topics within each subarea are identified based on centrality measures of each node in the network chart.
[0021] Furthermore, the one or more hardware processors are configured to determine the trending topics and the declining topics from among the mapped topics of each research subarea by determining a total number of the plurality of global documents published per year over a time range and determining a topic frequency for each topic of the global topic set per year by counting the plurality of global documents for the topic per year for the time range to generate a plurality of topic frequencies. Thereafter, normalizing each of the plurality of topic frequencies to the average number of documents for the topic per year, and then plotting the normalized topic frequencies over the years in the time range, wherein a topic is defined to be trending topic if there is a percentage growth in the topic frequency of the topic over a last one year, and wherein a topic is defined as declining topic if there is a negative cumulative growth in last predefined number of years.
[0022] Further, the one or more hardware processors are configured to determine white space technologies and dead space technologies for the enterprise based on intersection of the enterprise topic set and the global topic set for a subarea, by identifying one or more topics present in the global topic set but absent in the mapped topics for the subarea as the white space technologies for the enterprise for future research, and then identifying one or more topics present in the mapped topics of the subarea, which are identified as declining topics in the global topic set, as the dead space technologies for the enterprise, which need to be retired during future research.
[0023] Furthermore, the one or more hardware processors are configured to compute a topic progression based on cosine distance between each topic identified in a year with each topic identified in previous years to get a progression of topics from year to year, wherein the topic progression enables identifying carry forward topics, emerging topics and extinct topics, and creating a topic flowchart by plotting topic progression score for each topic as a Sankey diagram providing evolution of the topic over the years.
[0024] In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors causes a method for white space-dead space technology landscaping from unstructured data. The method includes creating, i) a global corpus comprising a plurality of global documents, and ii) an enterprise corpus comprising a plurality of enterprise documents with each of the plurality of enterprise documents identified with a research area among a plurality of research areas and a research subarea among a plurality of research subareas defined for each area. The plurality of global documents and the plurality of enterprise documents, comprising the unstructured data, belong to a predefined time span.
[0025] Further, the method includes preprocessing each of the plurality of global documents and each of the plurality of enterprise documents by performing tokenization, dictionary generation and Term Frequency–Inverse Document Frequency (TF-IDF) computation.
[0026] Further, the method includes generating a topic model by applying a Latent Dirichlet Allocation (LDA) technique on the preprocessed global documents, wherein the topic model generates a global topic set comprising a predefined number of topics derived from the global documents. Each topic among the global topic set is assigned a topic score per global document to generate document-to-topics pairing for each of the preprocessed global documents with topics from the global topic set having topic scores above a predefined score threshold.
[0027] Furthermore, the method includes processing the preprocessed enterprise documents by applying the topic model to generate an enterprise topic set comprising one or more topics from among the global topic set. Each topic among the enterprise topic set is assigned the topic score per enterprise document to generate the document-to-topics pairing for each of the preprocessed enterprise documents with the topics from the enterprise topic set having the topic scores above the predefined score threshold.
[0028] Further, the method includes mapping the topics from among the enterprise topic set and the global topic set with each research subarea among the plurality of research subareas. The mapping comprises grouping the plurality of preprocessed enterprise documents into a plurality of groups in accordance with each research subarea among the plurality of research subareas, and applying an iterative process over the plurality of groups to allocate the topics from the enterprise topic set and the global topic set to each research subarea. Furthermore, the iterative process comprises aggregating the topic score of each topic from the document-to-topics pairing of each of the plurality of preprocessed enterprise documents in each group among the plurality of groups and tabulating the aggregated topic score for each topic across the plurality of groups. Thereafter, allocating one or more topics among the enterprise topic set to the research subarea if the aggregated topic score for the one or more topics is above a predefined aggregated threshold. Thereafter, allocating, to each group among the plurality of groups, one or more preprocessed global documents from among the plurality of global documents based on a similarity between each topic in the document-to-topics pairing of each of the plurality of preprocessed global documents and one or more topics allocated to the subarea if the topic scores satisfy the predefined score threshold to obtain updated subgroups. Each preprocessed global document is allocated to a maximum of a predefined fraction of a total number of subareas defined by the enterprise. Finally, repeating the iterative process on a plurality of documents in the updated subgroups until each of the plurality of subareas is mapped to a maximum of half a number of total topics in the global topic set.
[0029] Furthermore, the method includes, determining trending topics and declining topics from among the mapped topics to each research subarea for generating landscaping of white space technologies and dead space technologies for each of the enterprise documents based on an intersection of the mapped topics of each subarea and the global topic wherein the white space technologies refer to technologies missed by the enterprise for research, and the dead space technologies refer to the technologies that need to be retired by the enterprise from research, and a topic relevancy by generating a network chart, for each subarea among a plurality of research subareas, with mapped topics of the subarea as nodes, which are connected by an edge if topics among the mapped topics corresponding to two nodes are in a single enterprise document. The major topics and emerging topics within each subarea are identified based on centrality measures of each node in the network chart.
[0030] Furthermore, the method includes, determining the trending topics and the declining topics from among the mapped topics of each research subarea by determining a total number of the plurality of global documents published per year over a time range and determining a topic frequency for each topic of the global topic set per year by counting the plurality of global documents for the topic per year for the time range to generate a plurality of topic frequencies. Thereafter, normalizing each of the plurality of topic frequencies to the average number of documents for the topic per year, and then plotting the normalized topic frequencies over the years in the time range, wherein a topic is defined to be trending topic if there is a percentage growth in the topic frequency of the topic over a last one year, and wherein a topic is defined as declining topic if there is a negative cumulative growth in last predefined number of years.
[0031] Further, the method includes determining white space technologies and dead space technologies for the enterprise based on intersection of the enterprise topic set and the global topic set for a subarea, by identifying one or more topics present in the global topic set but absent in the mapped topics for the subarea as the white space technologies for the enterprise for future research, and then identifying one or more topics present in the mapped topics of the subarea, which are identified as declining topics in the global topic set, as the dead space technologies for the enterprise, which need to be retired during future research.
[0032] Furthermore, the method includes computing a topic progression based on cosine distance between each topic identified in a year with each topic identified in previous years to get a progression of topics from year to year, wherein the topic progression enables identifying carry forward topics, emerging topics and extinct topics, and creating a topic flowchart by plotting topic progression score for each topic as a Sankey diagram providing evolution of the topic over the years.
[0033] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0034] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[0035] FIG. 1 is a functional block diagram of a system for white space-dead space technology landscaping from unstructured data, in accordance with some embodiments of the present disclosure.
[0036] FIG. 2 depicts an architectural overview of the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0037] FIGS. 3A and 3B (collectively referred to as FIG. 3) depict a flow diagram illustrating a method for white space-dead space technology landscaping from unstructured data, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0038] FIG. 4 is a flow diagram illustrating a process of the method for mapping the topics from among the enterprise topic set and the global topic set with each research subarea among a plurality of research subareas of an enterprise, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0039] FIG. 5 depicts an example network chart generated from the mapped topics to provide topic relevancy among the mapped topics of a research subarea, in accordance with some embodiments of the present disclosure.
[0040] FIGS. 6A, 6B and 6C depict example topic charts generated from the mapped topics to provide topic progression among the mapped topics of the research subarea, in accordance with some embodiments of the present disclosure.
[0041] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS
[0042] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[0043] State-of-the-art techniques for white space-dead space technology landscaping utilize a simple approach of mapping global topics to enterprise topics. With the analysis focus limited to the corpus level, the conventional approaches can lead to erroneous results. Further, to capture technology trends that change at a faster pace, a change in approach to the standard Compound Annual Growth Rate (CAGR) computation is required. State-of-the-art techniques measure the growth rate of trending topics by the standard CAGR approach or by a weighted average of technology, which usually does not directly correspond to any single topic.
[0044] Furthermore, understanding of how the topics evolve and how the keyword distributions under a topic vary and understanding relevancy among the topics plays a significant role, and needs to be captured for accurate technology landscaping to derive true research insights for an enterprise.
[0045] Embodiments of the present disclosure provide a method and system for white space-dead space technology landscaping from unstructured data. White space technologies refer to technologies missed by the enterprise for research, and dead space technologies refer to the technologies that need to be retired by the enterprise from research. The method disclosed maps topics to specific research areas and research subareas, enabling a topic mapping better aligned to the research interests of the enterprise. Further, the trend analysis approach utilizes the average of the actual rates of change observed during the time window, which better captures the fast-moving trends across the research areas. The method disclosed considers the year-wise activity for a topic and then calculates the difference between the nth year’s activity and the (n-1)th year’s activity. Adding these differences and then dividing by the number of years gives a CAGR with the modified approach, which may be positive or negative. A negative value indicates a declining topic.
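The modified growth computation described above can be sketched in a few lines. The sketch reads "number of years" as the count of yearly data points in the window; the number of year-to-year intervals would be an equally plausible reading, and the choice is an assumption. The function name and the activity figures are hypothetical.

```python
def modified_growth_rate(activity_by_year):
    # Sum the year-over-year differences in activity, then divide by the number of years.
    years = sorted(activity_by_year)
    if len(years) < 2:
        raise ValueError("at least two years of activity are needed")
    differences = [
        activity_by_year[curr] - activity_by_year[prev]
        for prev, curr in zip(years, years[1:])
    ]
    return sum(differences) / len(years)


# Hypothetical yearly activity (for example, document counts) for one topic.
rate = modified_growth_rate({2018: 40, 2019: 35, 2020: 20, 2021: 10})
is_declining = rate < 0
```

With these figures the yearly differences are -5, -15 and -10, so the result is negative and the topic would be flagged as declining.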
[0046] Referring now to the drawings, and more particularly to FIGS. 1 through 6, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[0047] FIG. 1 is a functional block diagram of a system for white space-dead space technology landscaping from unstructured data, in accordance with some embodiments of the present disclosure.
[0048] In an embodiment, the system 100 includes a processor(s) 104, communication interface device(s), alternatively referred to as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware processors is configured to execute functions of one or more functional blocks of the system 100.
[0049] Referring to the components of system 100, in an embodiment, the processor(s) 104, can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, and the like.
[0050] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface to display the generated target images and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular and the like. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting to a number of external devices or to another server or devices.
[0051] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0052] Further, the memory 102 includes a database 108 that stores a global corpus, an enterprise corpus, a global topic set, an enterprise topic set, metadata associated with each research document in the enterprise corpus comprising the research area and the research subarea within the research area for each enterprise document, and so on, as depicted in the architectural overview of the system in FIG. 2. The terms research subarea and subarea are used interchangeably hereinafter. Further, the memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. In an embodiment, the database 108 may be external (not shown) to the system 100 and coupled to the system via the I/O interface 106. Functions of the components of the system 100 are explained in conjunction with the architectural overview of the system 100 in FIG. 2 and FIGS. 3A through 6.
[0053] FIG. 2 depicts an architectural overview of the system of FIG. 1, in accordance with some embodiments of the present disclosure. The global corpus and the enterprise corpus, mined for technology landscaping, are collections of research documents of various types from various technology domains, and are predominantly unstructured data. The system 100 models the entire research work done universally, captured in the global corpus, into the global topic set by generating a topic model. Further, the enterprise corpus, which is a collection of research documents with each research document associated with metadata specifying a research area and a research subarea within the research area, is processed by the topic model to identify the enterprise topic set from among the global topic set. Further, the system 100 maps topics from the global topic set and the enterprise topic set to each corresponding research subarea. With topics spread across subareas, the technical challenge lies in mapping the identified topics from the global corpus and the enterprise corpus to specific subareas identified by the enterprise for its research domains. The technical challenge is addressed herein by the system 100 by disclosing an approach based on topic score distribution. A trend analysis is performed over the mapped topics of each subarea to identify trending and declining topics, specifying white spaces and dead spaces, for each subarea. Furthermore, the method includes determining topic relevancy within the mapped topics to identify majority topics and emerging topics for the enterprise by creating a network chart, and creating a topic flowchart by plotting a topic progression score for each topic, providing the evolution of the topic over the years.
[0054] Thus, the technology landscaping provided by the method disclosed herein enables the enterprise to understand the breadth and depth of the current research in the context of the research areas and subareas defined for the enterprise, rather than merely following the global topic trends. Accordingly, the enterprise can work on white space technologies, also referred to as white spaces, and retire dead space technologies, also referred to as dead spaces.
[0055] FIG. 3A and 3B (collectively referred to as FIG. 3) depict a flow diagram illustrating a method 300 for white space-dead space technology landscaping from unstructured data, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0056] In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 300 by the processor(s) or one or more hardware processors 104. The steps of the method 300 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and FIG. 2, and the steps of the flow diagrams as depicted in FIG. 3A through FIG. 4. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[0057] Referring to the steps of the method 300, at step 302 of the method 300, the one or more hardware processors 104 create the global corpus comprising a plurality of global documents, and the enterprise corpus comprising a plurality of enterprise documents. Each of the plurality of enterprise documents is identified with the metadata indicating a research area among a plurality of research areas and a research subarea among a plurality of research subareas defined for each area. The global corpus and the enterprise corpus are collected over a predefined time span. For example, the global corpus and the enterprise corpus are collected year wise.
[0058] At step 304 of the method 300, the one or more hardware processors 104 preprocess each of the plurality of global documents and each of the plurality of enterprise documents by performing tokenization, dictionary generation and Term Frequency–Inverse Document Frequency (TF-IDF) computation. In an example implementation, a content extractor is started by a batch job to extract textual content from each research document in each corpus. A tokenizer decomposes the research documents into word tokens after spell check and clean-up. A stop word remover removes common words such as a, an, the, of, etc. Further, a dictionary generator builds a dictionary of relevant terms for the entire corpus, including common phrases. A metadata extractor extracts each document’s metadata such as title, author, date, location, etc., along with the research area and subarea identified for the enterprise documents. The database 108 is an index database, which allows fast search and retrieval of documents.
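The preprocessing of step 304 can be illustrated with a minimal sketch. It assumes a tiny hypothetical stop-word list, a regular-expression tokenizer, and the common tf = count / document length, idf = log(N / document frequency) weighting; the function names and sample documents are illustrative and are not taken from the disclosure. Spell check, phrase detection and metadata extraction are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "for", "is"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_dictionary(token_lists):
    """Map each distinct term in the corpus to an integer id."""
    terms = sorted({t for tokens in token_lists for t in tokens})
    return {term: idx for idx, term in enumerate(terms)}

def tf_idf(token_lists):
    """Return one {term: score} dict per document.

    tf = term count / document length, idf = log(N / document frequency).
    """
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Illustrative documents (not from the disclosure).
docs = [
    "The security of the sensing network",
    "A laser for optical sensing",
]
tokens = [tokenize(d) for d in docs]
dictionary = build_dictionary(tokens)
weights = tf_idf(tokens)
```

In this sketch a term that occurs in every document, such as "sensing", receives a TF-IDF score of zero, while terms specific to one document receive positive scores.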
[0059] Once the research documents of the global corpus and the enterprise corpus are preprocessed by cleaning the documents, then at step 306, the one or more hardware processors 104 generate a topic model by applying a Latent Dirichlet Allocation (LDA) technique, known in the art, on the preprocessed global documents. Any other similar technique may also be used in another implementation of the system 100. The topic model generates the global topic set comprising a predefined number of topics derived from the global documents. Further, each topic among the global topic set is assigned a topic score per global document to generate a document-to-topics pairing for each of the preprocessed global documents. The topic score is the probability of each topic in the corresponding document’s topic distribution. The topics having a topic score below a predefined score threshold are least significant and are discarded. Thus, the global topic set includes topics having topic scores above the predefined score threshold. In general, only 5 to 10 topics are significant per document.
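The thresholding of topic scores described above can be sketched as follows. The sketch assumes that a topic model (for example, an LDA implementation) has already produced a per-document topic distribution; the scores, the threshold value of 0.10 and the function name are hypothetical.

```python
def significant_topics(doc_topic_scores, score_threshold):
    """Keep, per document, only the topics whose score is above the threshold."""
    return {
        doc_id: {t: s for t, s in topics.items() if s > score_threshold}
        for doc_id, topics in doc_topic_scores.items()
    }

# Hypothetical per-document topic distributions, as a topic model might emit.
doc_topic_scores = {
    "GD1": {"T1": 0.55, "T2": 0.30, "T3": 0.10, "T4": 0.05},
    "GD2": {"T1": 0.02, "T3": 0.70, "T4": 0.28},
}

# Document-to-topics pairing after discarding the least-significant topics.
pairing = significant_topics(doc_topic_scores, score_threshold=0.10)

# The retained topics across all documents form the topic set.
global_topic_set = {t for topics in pairing.values() for t in topics}
```

The same filter could be applied to the enterprise documents once the topic model has scored them.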
[0060] Similar to generating the global topic set, at step 308, the one or more hardware processors 104 process the preprocessed enterprise documents by applying the topic model on the preprocessed enterprise documents to generate an enterprise topic set comprising one or more topics from among the global topic set. Each topic among the enterprise topic set is assigned the topic score per enterprise document to generate the document-to-topics pairing for each of the preprocessed enterprise documents. As mentioned above, the enterprise topic set includes the topics that have topic scores above the predefined score threshold, thus eliminating the insignificant topics. In an example implementation, a text mining framework, which is a collection of independent modules, mines text for specific ends. A term frequency analyzer computes the TF-IDF scores for each term in the corpus dictionary to assess its relevance to a document and the corpus. A topic modeler uses the Latent Dirichlet Allocation (LDA) model to identify a set of topics present in the documents, specifically, the global topic set from the global corpus and the enterprise topic set from the enterprise corpus as mentioned in step 306 and step 308.
[0061] The topic model used by the method disclosed herein is progressive, being calculated from year to year. As topics tend to evolve, their keyword distributions change. For instance, an Artificial Intelligence (AI) topic from 5 years ago may not have the same keyword distribution now. The method disclosed compensates for this by calculating a fresh topic model each year and mapping similar topics from year to year.
[0062] Upon generating the global topic set and the enterprise topic set, at step 310, the one or more hardware processors 104 map the topics from among the enterprise topic set and the global topic set to each research subarea among the plurality of research subareas. An illustrative example is provided to explain the mapping of topics to research areas. It is to be understood that the research area and the research subarea are two separate metadata fields and can be used interchangeably, with the underlying data points changing based on the filter. Hence, as required, the topic mapping can be performed in accordance with the research areas of the enterprise instead of drilling down to research subareas.
Example:
Consider a corpus with global documents GD1 to GDn and enterprise documents ED1 to EDn.
For topic-area mapping:
For ED1, research area is RA1; suppose 3 topics are above threshold and are assigned as T1, T4, T6.
For ED2, research area is RA2; suppose 2 topics are above threshold and are assigned as T1, T8.
For ED3, research area is RA1; suppose 3 topics are above threshold and are assigned as T1, T5, T6.
For ED4, research area is RA2; suppose 2 topics are above threshold and are assigned as T8, T9.
Then, for mapping,
The topic-area score is aggregated over all enterprise documents.
For T1: RA1 score = ED1T1Score + ED3T1Score + ... (summed over every enterprise document of RA1 that contains T1), RA2 score = ED2T1Score + ..., and so on for the n areas. Likewise, for T8: RA2 score = ED2T8Score + ED4T8Score.
RA1 is assigned to a topic after all topic-research area scores are computed. It is assigned as follows:
1) All topics with an RA1 score above the threshold have RA1 assigned to them.
2) The threshold is varied to keep the number of topics allocated to a research area below 50%.
For global document-to-area mapping:
By this step, the first-level topic-area mapping is already done.
Suppose
T1 has research areas RA1, RA5
T2 has research areas RA1, RA7
T3 has research areas RA5, RA8
GD1 has topics T1, T2, T3 with topic-document scores GD1T1Score1, GD1T2Score2, GD1T3Score3.
Areas are assigned based on the topics and the scores computed.
Step 1: GD1 will have research areas RA1, RA5, RA7, RA8 with scores
RA1Score = GD1T1Score1 + GD1T2Score2; RA5Score = GD1T1Score1 + GD1T3Score3
RA7Score = GD1T2Score2; RA8Score = GD1T3Score3
Step 2: Assign areas on the condition that the document receives fewer than 1/3 of the total areas in Step 1.
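The example above can be worked through in a short sketch. It assumes that each enterprise document contributes its topic scores only to its own research area, that a global document's area score is the sum of the scores of its topics mapped to that area, and that the highest-scoring areas are kept up to the predefined fraction; the numeric scores, the area threshold of 0.4, the total of 9 areas and all function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def aggregate_topic_area_scores(enterprise_docs):
    """Sum each topic's score over the enterprise documents of each research area."""
    totals = defaultdict(lambda: defaultdict(float))
    for doc in enterprise_docs:
        for topic, score in doc["topics"].items():
            totals[topic][doc["area"]] += score
    return totals

def assign_areas_to_topics(totals, area_threshold):
    """Assign an area to a topic when the aggregated score is above the threshold."""
    return {
        topic: {area for area, score in areas.items() if score > area_threshold}
        for topic, areas in totals.items()
    }

def score_document_areas(doc_topics, topic_areas):
    """Sum a global document's topic scores into every area its topics map to."""
    area_scores = defaultdict(float)
    for topic, score in doc_topics.items():
        for area in topic_areas.get(topic, ()):
            area_scores[area] += score
    return dict(area_scores)

def top_areas(area_scores, total_areas, max_fraction=Fraction(1, 3)):
    """Keep the best-scoring areas, at most max_fraction of all defined areas.

    Keeping the highest-scoring areas is an assumption of this sketch.
    """
    limit = max(1, int(total_areas * max_fraction))
    ranked = sorted(area_scores, key=area_scores.get, reverse=True)
    return ranked[:limit]

# Enterprise documents from the example, with hypothetical scores.
enterprise_docs = [
    {"id": "ED1", "area": "RA1", "topics": {"T1": 0.5, "T4": 0.25, "T6": 0.25}},
    {"id": "ED2", "area": "RA2", "topics": {"T1": 0.25, "T8": 0.75}},
    {"id": "ED3", "area": "RA1", "topics": {"T1": 0.5, "T5": 0.25, "T6": 0.25}},
    {"id": "ED4", "area": "RA2", "topics": {"T8": 0.5, "T9": 0.5}},
]
totals = aggregate_topic_area_scores(enterprise_docs)
topic_to_areas = assign_areas_to_topics(totals, area_threshold=0.4)

# Global document GD1 from the example, with the topic-area mapping given there.
example_topic_areas = {"T1": {"RA1", "RA5"}, "T2": {"RA1", "RA7"}, "T3": {"RA5", "RA8"}}
gd1_scores = score_document_areas({"T1": 0.5, "T2": 0.25, "T3": 0.125}, example_topic_areas)
gd1_areas = top_areas(gd1_scores, total_areas=9)
```

With these hypothetical scores, T1 is assigned to RA1 only, T8 to RA2, and GD1 is allocated to RA1, RA5 and RA7. In the disclosed method the area threshold is not fixed but varied until the stated allocation conditions are met; that loop is not shown here.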
[0063] The mapping comprises grouping the plurality of preprocessed enterprise documents into a plurality of groups in accordance with each research subarea among the plurality of research subareas (310a). Further, an iterative process is applied over the plurality of groups to allocate the topics from the enterprise topic set and the global topic set to each research subarea (310b). The iterative process of mapping of topics used by the method 300 is explained in conjunction with FIG. 4. FIG. 4 is a flow diagram illustrating a process 400 of the method 300 for mapping the topics from among the enterprise topic set and the global topic set to each research subarea among a plurality of research subareas of an enterprise, using the system of FIG. 1, in accordance with some embodiments of the present disclosure. The iterative process comprises the following steps 402 through 408.
a) Aggregating the topic score of each topic from the document-to-topics pairing of each of the plurality of preprocessed enterprise documents in each group among the plurality of groups and tabulating the aggregated topic score for each topic across the plurality of groups (402).
b) Allocating one or more topics among the enterprise topic set to the research subarea if the aggregated topic score for the one or more topics is above a predefined aggregated threshold (404).
c) Allocating, to each group among the plurality of groups, one or more preprocessed global documents from among the plurality of global documents based on a similarity between each topic in the document-to-topics pairing of each of the plurality of preprocessed global documents and one or more topics allocated to the subarea if the topic scores satisfy the predefined score threshold to obtain updated subgroups. Each preprocessed global document is allocated to a maximum of a predefined fraction of a total number of subareas defined by the enterprise (406). It can be understood that a document is no longer relevant if it is mapped to all subareas. Thus, in an example implementation the maximum predefined fraction value is set to 1/3 based on experimentation and analysis by a subject matter expert.
d) Repeating the iterative process on a plurality of documents in the updated subgroups until each of the plurality of subareas is mapped to a maximum of half the total number of topics in the global topic set (408). Each subarea is allocated less than 50% of the topics, while at least 70% of the topics have an area assigned to them. This is achieved by varying the topic score threshold. This criterion of half the number of topics is selected because, as mentioned earlier, if a topic maps to all the areas, it is no longer relevant. The values of 50% and 70% in the condition above are selected to arrive at the correct threshold for allocating the final area to a topic.
[0064] Referring back to the steps of method 300, once the mapped topics per research subarea are identified, then at step 312, the one or more hardware processors 104 determine a) trending topics and declining topics, and b) a topic relevancy by generating a network chart. The trending topics and the declining topics are determined from among the topics mapped to each research subarea for generating landscaping of the white space technologies and the dead space technologies for each of the enterprise documents based on an intersection of the mapped topics of each subarea and the global topic set. The process steps of the method 300 to determine the trending topics and declining topics are explained below:
1) Determine a total number of the plurality of global documents published per year over a time range.
2) Determine a topic frequency for each topic of the global topic set per year by counting the plurality of global documents for the topic per year for the time range to generate a plurality of topic frequencies.
3) Normalize each of the plurality of topic frequencies to the average number of documents for the topic per year.
4) Plot the normalized topic frequencies over the years in the time range, wherein a topic is defined to be a trending topic if there is a percentage growth in the topic frequency of the topic over the last one year, and wherein a topic is defined as a declining topic if there is a negative cumulative growth in the last predefined number of years.
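The steps listed above can be sketched as follows. The sketch reads "normalizing to the average number of documents for the topic per year" as dividing each yearly count by the topic's mean yearly count, and reads "cumulative growth" as the change between the start and end of the window; both readings, the sample counts and the function names are assumptions made for illustration.

```python
def normalized_frequencies(yearly_counts):
    """Divide each year's document count by the topic's mean yearly count."""
    mean = sum(yearly_counts.values()) / len(yearly_counts)
    if mean == 0:
        return {year: 0.0 for year in sorted(yearly_counts)}
    return {year: count / mean for year, count in sorted(yearly_counts.items())}

def trend_flags(norm_freq, window):
    """Flag a topic as trending and/or declining.

    trending: positive percentage growth over the last one year.
    declining: negative cumulative growth over the last `window` years.
    Requires at least window + 1 years of data.
    """
    years = sorted(norm_freq)
    prev, last = norm_freq[years[-2]], norm_freq[years[-1]]
    pct_growth = (last - prev) / prev * 100 if prev else 0.0
    cumulative = last - norm_freq[years[-(window + 1)]]
    return {"trending": pct_growth > 0, "declining": cumulative < 0}

# Hypothetical yearly document counts for two topics.
rising = {2019: 10, 2020: 20, 2021: 30, 2022: 40}
falling = {2019: 40, 2020: 30, 2021: 20, 2022: 10}

rising_flags = trend_flags(normalized_frequencies(rising), window=3)
falling_flags = trend_flags(normalized_frequencies(falling), window=3)
```

The two flags are returned separately because the text does not state which label takes precedence when a topic shows a one-year uptick within a longer cumulative decline.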
[0065] Thus, the method 300 collects the corpus year wise, performs topic modeling, and identifies topics (as year-wise topics). Then a topic similarity check is performed, where the top N words of each topic in the current year are compared with all topics of the previous (n-1) year. The previous-year topic with the highest similarity score is matched with the current-year topic. Thus, the method disclosed considers year-wise activity for a topic and then calculates the difference between the nth year’s activity and the (n-1)th year’s activity. Adding these differences and then dividing by the number of years gives a modified compound annual growth rate (CAGR), which may be positive or negative. A negative value indicates a declining topic.
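The modified growth measure described above can be sketched as an average of year-over-year differences. The sketch divides the summed differences by the number of year-to-year intervals, which is one reading of "number of years" and does not affect the sign of the result; the activity counts and the function name are hypothetical.

```python
def average_annual_change(activity_by_year):
    """Average of the differences between each year's activity and the previous year's."""
    years = sorted(activity_by_year)
    diffs = [
        activity_by_year[curr] - activity_by_year[prev]
        for prev, curr in zip(years, years[1:])
    ]
    return sum(diffs) / len(diffs)

# Hypothetical yearly activity counts for two topics.
growing_topic = {2018: 10, 2019: 14, 2020: 12, 2021: 22}    # differences: +4, -2, +10
declining_topic = {2018: 30, 2019: 24, 2020: 25, 2021: 15}  # differences: -6, +1, -10

growing_rate = average_annual_change(growing_topic)
declining_rate = average_annual_change(declining_topic)
is_declining = declining_rate < 0
```

Unlike a conventional CAGR, this measure remains defined when a year's activity is zero and it can take negative values directly.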
[0066] Once the trending topics and the declining topics are identified, the method 300 further comprises determining white space technologies and dead space technologies for the enterprise based on the intersection of the enterprise topic set and the global topic set for a subarea, by:
a) identifying one or more topics present in the global topic set but absent in the mapped topics for the subarea as the white space technologies for the enterprise for future research; and
b) identifying one or more topics present in the mapped topics of the subarea, which are identified as declining topics in the global topic set, as the dead space technologies for the enterprise, which need to be retired during future research.
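The two identification rules above reduce to set operations, sketched below under the assumption that the topic sets for a subarea are already available; the topic identifiers and the function name are hypothetical.

```python
def landscape(global_topics, mapped_topics, declining_topics):
    """Return (white_spaces, dead_spaces) for one subarea.

    white spaces: topics in the global topic set but absent from the mapped topics.
    dead spaces: mapped topics that are declining in the global topic set.
    """
    white_spaces = global_topics - mapped_topics
    dead_spaces = mapped_topics & declining_topics
    return white_spaces, dead_spaces

# Hypothetical topic sets for one subarea.
global_topics = {"T1", "T2", "T3", "T4"}
mapped_topics = {"T1", "T4"}      # topics the enterprise is already working on
declining_topics = {"T4"}         # globally declining topics

white_spaces, dead_spaces = landscape(global_topics, mapped_topics, declining_topics)
```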
[0067] Further, the topic relevancy is determined by generating a network chart, for each subarea among a plurality of research subareas, with the mapped topics of the subarea as nodes. Two nodes are connected by an edge if the corresponding mapped topics occur in a single enterprise document. Further, major topics and emerging topics within each subarea are identified based on centrality measures of each node in the network chart. The major topics correspond to nodes of the network chart that have high centrality measures, and the emerging topics correspond to the nodes having a high betweenness centrality measure. The major topics are the ones that are already well established in that research area or subarea, indicating the specializations in the area, whereas emerging topics refer to the ones with recent emergence. Once the mapped topics are obtained for the research subarea of interest, a network chart is created for each research subarea with all the topics as nodes, and their intersecting document count as the edges, as depicted in FIG. 5. The output of the network chart provides the list of major topics and emerging topics. The communities formed help in identifying the specializations within the subarea. Nodes with high centrality measures indicate the major topics. If a node is connected with other major nodes and has high betweenness centrality, it represents an emerging topic. Once the network chart is generated, a list of major topics and emerging topics is recommended to the enterprise. Distinct specializations within the subarea are also identified. Research insights derived for the organization from the example network chart of FIG. 5 are as below:
The communities in the network diagram represent specializations within the subarea
Communities observed:
1) Sensing
2) Cognition
3) Security
Major Topics : Indicated by dashed arrows
1) Scheme Code Complexity Transmission (Scheme Privacy community)
2) Security Attack Protection Signature (Privacy Protection)
3) Level Multi Multiple Single (Multiset Selection)
4) Determine Plurality Computer Configure (Information processing)
5) Hand motion
Emerging topics : Indicated by bold arrows
1) Dynamic Change Shape Profile
2) Optical Active Laser Propagation Pulse
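The construction of the network chart and the centrality measures can be sketched as follows. The sketch assumes an unweighted view of the graph for the centrality computation, uses degree centrality as one possible "centrality measure", and computes betweenness with Brandes' algorithm; the document-to-topic data and the function names are hypothetical, and community detection is not shown.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_network(doc_topics):
    """Nodes are topics; an edge's weight counts the documents containing both topics."""
    edges = defaultdict(int)
    for topics in doc_topics.values():
        for a, b in combinations(sorted(topics), 2):
            edges[(a, b)] += 1
    return edges

def adjacency(edges):
    """Build an undirected adjacency map from the edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def degree_centrality(adj):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adj)
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}

def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = dict.fromkeys(adj, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # each pair was counted twice

# Hypothetical enterprise documents and the mapped topics found in each.
doc_topics = {"D1": {"A", "B"}, "D2": {"B", "C"}, "D3": {"B", "C"}, "D4": {"C", "D"}}
edges = build_network(doc_topics)
adj = adjacency(edges)
degree = degree_centrality(adj)
betweenness = betweenness_centrality(adj)
```

In this small chain A-B-C-D, topics B and C lie on the shortest paths between the other topics and so receive the highest betweenness scores.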
[0068] Once the trending topics and declining topics are identified and the topics for each year are identified, at step 312, the method 300 further computes a topic progression based on the cosine distance between each topic identified in a year and each topic identified in previous years to get a progression of topics from year to year. The topic progression enables identifying carry-forward topics, emerging topics, and extinct topics. Thereafter, the method 300 comprises creating a topic flowchart by plotting the topic progression score for each topic as a Sankey diagram providing the evolution of the topic over the years. Further, an example below depicts the topic progression list calculation:
[0069] Topics of (year+1) are compared with those of year; a similarity score is calculated and the assignment is done. So, if topic1 of (year+1) is merged with topicN of year, the merged topic is given a new name, UniversalTopic X. Next, when the (year+2) topics arrive, they are compared with the (year+1) topics and a similarity score is calculated. If the similarity is above the threshold, the (year+1) topicN will be similar to (year+2) topicM, and UniversalTopic X will have the (year+2) topic in its progression.
For example:
UNIVERSAL_TOPIC_11(Main topic name) -
Topic progressions content:
2013_UNIVERSAL_TOPIC_19 2014_UNIVERSAL_TOPIC_99
2015_UNIVERSAL_TOPIC_89 2016_UNIVERSAL_TOPIC_3
2017_UNIVERSAL_TOPIC_86
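The chaining of yearly topics into a progression can be sketched as follows. The sketch represents each yearly topic as a hypothetical keyword-weight vector, uses cosine similarity (one minus the cosine distance) between adjacent years, and links a topic to its most similar predecessor when the similarity reaches an assumed threshold; all names and values are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_progressions(topics_by_year, threshold):
    """Chain yearly topics into universal topics via their best previous-year match."""
    years = sorted(topics_by_year)
    chains = {}  # universal topic name -> list of "<year>_<topic>" labels
    owner = {}   # (year, topic name) -> universal topic name
    for i, year in enumerate(years):
        for name, vec in topics_by_year[year].items():
            best, best_score = None, 0.0
            if i > 0:
                for prev_name, prev_vec in topics_by_year[years[i - 1]].items():
                    score = cosine_similarity(vec, prev_vec)
                    if score > best_score:
                        best, best_score = prev_name, score
            if best is not None and best_score >= threshold:
                universal = owner[(years[i - 1], best)]  # carry-forward topic
            else:
                universal = f"UNIVERSAL_TOPIC_{len(chains) + 1}"  # newly emerged topic
                chains[universal] = []
            owner[(year, name)] = universal
            chains[universal].append(f"{year}_{name}")
    return chains

# Hypothetical yearly topics with keyword weights.
topics_by_year = {
    2013: {"TOPIC_19": {"privacy": 1.0, "attack": 1.0}, "TOPIC_2": {"laser": 1.0}},
    2014: {
        "TOPIC_99": {"privacy": 1.0, "attack": 1.0, "signature": 1.0},
        "TOPIC_7": {"hand": 1.0, "motion": 1.0},
    },
}
chains = build_progressions(topics_by_year, threshold=0.5)
```

A chain that stops receiving new yearly topics corresponds to an extinct topic, and a chain that starts in a later year corresponds to an emerging topic.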
[0070] From the topic flow chart, the enterprise can track the path and position of a research subarea’s topics. It is observed that the topic lanes are relative to the specializations identified in the network chart. By tracking the major and emerging topics of the subarea, a pattern of the topic trend over the years can be identified. It helps the enterprise in understanding whether a topic may merge with another or may branch out into two or more areas of interest. There is also a possibility of emergence of a new topic from two pre-existing topics. The unfolding of a distinct novel topic indicates an upward trend in the research subarea. The emergence of a new path can also be predicted from the topic flow chart, which may aid in future trend prediction.
[0071] FIG. 6A, 6B and 6C depict example topic flow charts generated from the mapped topics to provide topic progression among the mapped topics of the research subarea, in accordance with some embodiments of the present disclosure. Key topics selected from the Cyber Physical Security subarea of the enterprise are depicted in FIG. 6A, while simplified flow paths of the key topics are depicted in FIG. 6B. The index to FIG. 6B is depicted in FIG. 6C. The insights derived for the enterprise from the topic flow chart indicate that significant topics in the subarea are: 1) Privacy protection, 2) Information processing, 3) Scheme Privacy Community, 4) Multiset Selection, 5) Hand motion, and 6) Optical active laser. Each number represents a new topic and a star indicates emergence of a topic.
[0072] In the topic flow chart generated by the method 300, the topic progression score, obtained for each topic, is plotted in the form of a Sankey Diagram for understanding the evolution of the topic over the years. The plot is customized in such a way that the width of each node(topic) increases over the years and the topics are ordered across years based on the activity count. The width of the link between each topic indicates the strength of similarity. On selection of a topic, its entire path gets highlighted, which shows how the topic has emerged and evolved.
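The data behind such a Sankey diagram can be sketched as a list of nodes and weighted links. The sketch assumes the pairwise similarity scores between yearly topics have been preserved, and that links below a user-selected minimum similarity are hidden; the labels, scores and function name are hypothetical, and rendering is left to whichever plotting tool is used.

```python
def sankey_links(similarities, min_similarity):
    """Build node and link lists for a Sankey diagram.

    similarities maps (source_label, target_label) to a similarity score.
    Only links at or above min_similarity are kept; the link value (width)
    is the similarity score.
    """
    kept = {pair: s for pair, s in similarities.items() if s >= min_similarity}
    nodes = sorted({label for pair in kept for label in pair})
    index = {label: i for i, label in enumerate(nodes)}
    links = [
        {"source": index[a], "target": index[b], "value": s}
        for (a, b), s in kept.items()
    ]
    return nodes, links

# Hypothetical preserved similarity scores between topics of adjacent years.
similarities = {
    ("2013_T19", "2014_T99"): 0.8,
    ("2013_T19", "2014_T7"): 0.2,
    ("2014_T99", "2015_T89"): 0.6,
}
nodes, links = sankey_links(similarities, min_similarity=0.5)
```

Because all scores are retained, the same function can be called again with a different minimum similarity to redraw the chart.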
[0073] One of the works in the literature, “a text visualization method for cross-domain research topic mining” by Xinyi Jiang and Jiawan Zhang, uses colors for edges and mentions that if a topic is not as similar as other topics, then the topic may be a disappeared topic. Unlike the literature work, the topic flow chart created by the method 300 captures the flow of the topic: whether a topic evolves or splits into a different topic, combines with an existing topic to form another major topic, or ceases to exist is depicted by the topic flow chart, giving a better understanding of topic evolution and decline as compared to the state of the art. Further, the literature work considers only adjacent years, while the method 300 captures the presence of the topic in the topic flow chart and marks it from the year it came into existence. Further, the method 300 preserves all the similarity scores between topics so that the similarity can be varied using the similarity bar in the Sankey diagram to see the topic progressions. The state of the art techniques focus on predicting evolving cross-domain relationships. Since topics often cut across multiple domains, popular topics will always form cross-domain relationships as use cases multiply. However, the method 300 focuses on how clusters of topics evolve, merge, split and die, in order to predict the new areas of specialization that are evolving around the topics of interest. Thus, the communities of the network chart and the lanes of the topic flow chart help the enterprise in identifying the specializations within the domain. The outputs are easily readable and do not involve any logical complexities, enhancing the usability aspect of the system 100.
[0074] The research insights derived by the method 300, can be displayed by the system 100 over a User Interface (UI), wherein the trending topics, the declining topics, the network chart, the topic flow chart can be displayed to the enterprise (end user).
[0075] The outputs provided by the method disclosed, such as the listing of the white space and dead space technologies for the enterprise, generation of topic relevancy, topic interrelation and the like, can be further provided as input data for many automation tools for handling the R&D department of an enterprise. The R&D management process is fraught with several operational, methodological, strategic and efficiency challenges. In order to manage their R&D activities effectively, enterprises or companies have been employing multiple generations of R&D management models since the 1950s. These models define multiple approaches, such as Technology Push, Market Pull, Collaboration with Stakeholders, External Collaboration, Open Innovation, etc. Over time, in order to closely interact with the marketplace and the ecosystem, R&D operations are becoming more decentralized. A number of processes and process automation tools have emerged to assist the R&D managers, such as:
[0076] Automated Budget Allocation: A sub research area gets its budget based on the demonstrated value of its research in achieving marketplace objectives. In a decentralized R&D center, the budget allocation is automatically approved based on certain performance criteria. One such criterion is the extent of research coverage within the subarea. The operative metrics for the coverage are provided by the methods disclosed herein as follows:
1) Number of white spaces and dead spaces
2) Percentage of relevant coverage (Total research volume – white space research volume – dead space research volume)
3) Percentage of effort/cost budgeted for identified white spaces
4) Percentage of effort/cost budgeted for identified dead spaces
[0077] The automated budget allocation system will clear the budget if the above metrics are within permitted thresholds; otherwise, the budget goes through a more elaborate manual review process.
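The coverage metrics and the clearance check can be sketched as follows. The threshold names and values, the input figures and the function names are hypothetical; the disclosure does not specify which thresholds are permitted.

```python
def coverage_metrics(total_volume, white_volume, dead_volume,
                     white_budget, dead_budget, total_budget):
    """Compute the percentage metrics listed for automated budget allocation."""
    relevant = total_volume - white_volume - dead_volume
    return {
        "relevant_coverage_pct": 100.0 * relevant / total_volume,
        "white_budget_pct": 100.0 * white_budget / total_budget,
        "dead_budget_pct": 100.0 * dead_budget / total_budget,
    }

def auto_clear(metrics, min_relevant_pct, max_dead_budget_pct):
    """Clear the budget only when the metrics fall within the permitted thresholds."""
    return (metrics["relevant_coverage_pct"] >= min_relevant_pct
            and metrics["dead_budget_pct"] <= max_dead_budget_pct)

# Hypothetical figures for one sub research area.
metrics = coverage_metrics(total_volume=200, white_volume=30, dead_volume=20,
                           white_budget=25, dead_budget=5, total_budget=100)
cleared = auto_clear(metrics, min_relevant_pct=70.0, max_dead_budget_pct=10.0)
```

A budget that is not cleared by such a check would be routed to the manual review process.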
[0078] Automated Approval of New Research Proposals: In a decentralized automated R&D management environment, the number of research proposals submitted in any quarter may run into hundreds, and typically a provision is made to approve proposals. The proposals which address white spaces in the research area are given higher weightage and those addressing dead spaces are given lower weightage. Thus, the method disclosed herein enables generating a score for each proposal and enables auto-approval or rejection of the proposal.
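One possible way to score a proposal is sketched below. The weights, the averaging rule, the topic identifiers and the function name are all assumptions made for illustration; the disclosure states only that white space topics carry higher weightage and dead space topics lower weightage.

```python
def proposal_score(proposal_topics, white_spaces, dead_spaces,
                   white_weight=2.0, neutral_weight=1.0, dead_weight=0.0):
    """Average a per-topic weight over the topics a proposal addresses."""
    if not proposal_topics:
        return 0.0
    total = 0.0
    for topic in proposal_topics:
        if topic in white_spaces:
            total += white_weight      # white space topics weigh more
        elif topic in dead_spaces:
            total += dead_weight       # dead space topics weigh less
        else:
            total += neutral_weight
    return total / len(proposal_topics)

# Hypothetical white space and dead space topics for a research area.
white_spaces = {"T1"}
dead_spaces = {"T3"}

score_mixed = proposal_score({"T1", "T2"}, white_spaces, dead_spaces)
score_dead = proposal_score({"T3"}, white_spaces, dead_spaces)
```

An approval rule could then compare each score against a cut-off chosen by the enterprise.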
[0079] Automated Selection of External Collaborators: Recently evolved R&D Management models emphasize external collaboration for research in specific subareas where the organization has no research capability. There are several academic institutions and dedicated research facilities that may compete for this collaboration. Thus, the method disclosed herein provides a good scoring mechanism that can be used to select the most appropriate collaborator in the research subarea.
[0080] Automated Team Recommender System: Recently, Team Recommender Systems (TRS) have become extremely common because they are software tools and techniques that help organizations compose the team needed to carry out a task requiring multiple skills. A TRS ensures that other important criteria, such as diversity, are properly satisfied. In the R&D management context within a research subarea, forming teams among researchers who have dissimilar skills and yet can complement each other in productive ways is a serious challenge, as a research team is future-focused. The method disclosed herein can overcome this challenge using the predictions provided by the system 100 about the different specializations existing or newly forming within the subarea. The specializations are identified as topic clusters and their growth trends are clearly identified via the topic flow chart mechanism. The current skills of each research team member can be enumerated in terms of topics inferred from the research artifacts produced by the team member. Using this skill profile, the system can automatically identify the right team combination for taking up research work in a predicted growth area, thereby optimizing work allocation within the subarea. If a required suite of skills is absent in the research area, the system can analyze researcher profiles from external collaborator organizations and suggest the best fit.
[0081] Automated Member Assessment for Special Interest Groups (SIG): The external collaboration approach is emerging as the preferred model for accelerated R&D and one of the most popular methods adopted by this approach is the formation of a Special Interest Group. Since an SIG is a highly focused group, often with shared business objectives, the organization setting up the SIG employs a set of criteria that are used to filter out non-value adding applicants to the SIG. One of the important set of criteria is the relevant technical skills needed to fulfil the objective of the SIG. The method disclosed herein is eminently suited to identify the skills required for a given SIG in terms of topic clusters and their predicted trends based on current work in the field. Thus, the method disclosed generates the most critical input for automated processing of selecting members of an SIG established by an R&D organization.
[0082] Thus, the method and system disclosed herein apply a topic trending approach that is more sophisticated than the broad, bibliometrics-based trends indicated by existing tools. As topics can be at different stages of evolution, the method ensures the trend is captured at a subarea level. Further, as topics exhibit wildly varying growth, the method utilizes the average of annual growths to realistically model the trend. Further, the topic model is partitioned in three different dimensions, comprising a global and enterprise dimension, multiple granular overlapping subareas, and annual trend: rising, declining or stable. This information is further used to obtain white spaces and dead spaces in an enterprise’s activity within a subarea. The definition of white space and dead space as used by the method disclosed is more rigorous and purposeful, wherein the white space reflects a serious gap in activity within the subarea and hence presents a risk, rather than merely the potential opportunity implied in other prior art. Similarly, the dead space is where an enterprise’s resources are locked in unproductively, as that train has already left the station. These definitions underline the significance of the analysis provided by the method in governance of the research activity.
[0083] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0084] It is to be understood that the scope of the protection extends to such a program and, in addition, to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementing one or more steps of the method when the program runs on a server, a mobile device, or any suitable programmable device. The hardware device can be any kind of device that can be programmed, e.g. any kind of computer such as a server or a personal computer, or any combination thereof. The device may also include means which could be hardware means, e.g. an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[0085] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0086] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0087] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0088] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
We Claim:
1. A processor implemented method (300) for white space-dead space technology landscaping from unstructured data, the method comprising:
creating, by one or more hardware processors, i) a global corpus comprising a plurality of global documents, and ii) an enterprise corpus comprising a plurality of enterprise documents with each of the plurality of enterprise documents identified with a research area among a plurality of research areas and a research subarea among a plurality of research subareas defined for each area, wherein the plurality of global documents and the plurality of enterprise documents, comprising the unstructured data, belong to a predefined time span (302);
preprocessing, by the one or more hardware processors, each of the plurality of global documents and each of the plurality of enterprise documents by performing tokenization, dictionary generation and Term Frequency–Inverse Document Frequency (TF-IDF) computation (304);
generating, by the one or more hardware processors, a topic model by applying a Latent Dirichlet Allocation (LDA) technique on the preprocessed global documents, wherein the topic model generates a global topic set comprising a predefined number of topics derived from the global documents, and wherein each topic among the global topic set is assigned a topic score per global document to generate document-to-topics pairing for each of the preprocessed global documents with topics from the global topic set having topics scores above a predefined score threshold (306);
processing, by the one or more hardware processors, the preprocessed enterprise documents by applying the topic model to generate an enterprise topic set comprising one or more topics from among the global topic set, wherein each topic among the enterprise topic set is assigned the topic score per enterprise document to generate the document-to-topics pairing for each of the preprocessed enterprise documents with the topics from the enterprise topic set having the topics scores above the predefined score threshold (308);
mapping the topics, by the one or more hardware processors, from among the enterprise topic set and the global topic set with each research subarea among the plurality of research subareas (310), the mapping comprising:
a) grouping the plurality of preprocessed enterprise documents into a plurality of groups in accordance with each research subarea among the plurality of research subareas (310a); and
b) applying an iterative process over the plurality of groups to allocate the topics from the enterprise topic set and the global topic set to each research subarea (310b), the iterative process comprising:
aggregating the topic score of each topic from the document-to-topics pairing of each of the plurality of preprocessed enterprise documents in each group among the plurality of groups and tabulating the aggregated topic score for each topic across the plurality of groups (402);
allocating one or more topics among the enterprise topic set to the research subarea if the aggregated topic score for the one or more topics is above a predefined aggregated threshold (404);
allocating, to each group among the plurality of groups, one or more preprocessed global documents from among the plurality of global documents based on a similarity between each topic in the document-to-topics pairing of each of the plurality of preprocessed global documents and one or more topics allocated to the subarea if the topic scores satisfy the predefined score threshold to obtain updated subgroups, wherein each preprocessed global document is allocated to a maximum of a predefined fraction of a total number of subareas defined by the enterprise (406); and
repeating the iterative process on a plurality of documents in the updated subgroups until each of the plurality of subareas is mapped to a maximum of half a number of total topics in the global topic set (408); and
determining, by the one or more hardware processors, (312):
a) trending topics and declining topics from among the mapped topics to each research subarea for generating landscaping of white space technologies and dead space technologies for each of the enterprise documents based on an intersection of the mapped topics of each subarea and the global topic set, wherein the white space technologies refer to technologies missed by the enterprise for research, and the dead space technologies refer to the technologies that need to be retired by the enterprise from research; and
b) a topic relevancy by generating a network chart, for each subarea among a plurality of research subareas, with mapped topics of the subarea as nodes, which are connected by an edge if topics among the mapped topics corresponding to two nodes are in a single enterprise document, wherein major topics and emerging topics within each subarea are identified based on centrality measures of each node in the network chart.
2. The method as claimed in claim 1, wherein determining the trending topics and the declining topics from among the mapped topics of each research subarea comprises:
a) determining a total number of the plurality of global documents published per year over a time range;
b) determining a topic frequency for each topic of the global topic set per year by counting the plurality of global documents for the topic per year for the time range to generate a plurality of topic frequencies;
c) normalizing each of the plurality of topic frequencies to the average number of documents for the topic per year; and
d) plotting the normalized topic frequencies over the years in the time range, wherein a topic is defined to be a trending topic if there is a percentage growth in the topic frequency of the topic over the last one year, and wherein a topic is defined as a declining topic if there is a negative cumulative growth in a last predefined number of years.
3. The method as claimed in claim 2, wherein determining the white space technologies and the dead space technologies for the enterprise based on intersection of the enterprise topic set and the global topic set for a subarea comprises:
identifying one or more topics present in the global topic set but absent in the mapped topics for the subarea as the white space technologies for the enterprise for future research; and
identifying one or more topics present in the mapped topics of the subarea, which are identified as declining topics in the global topic set, as the dead space technologies for the enterprise, which need to be retired during future research.
4. The method as claimed in claim 1, wherein the major topics correspond to nodes of the network chart that have high centrality measures and the emerging topics correspond to the nodes having high betweenness centrality measure.
5. The method as claimed in claim 2, wherein the method further comprises:
computing a topic progression based on cosine distance between each topic identified in a year with each topic identified in previous years to get a progression of topics from year to year, wherein the topic progression enables identifying carry forward topics, emerging topics and extinct topics; and
creating a topic flowchart by plotting topic progression score for each topic as a Sankey diagram providing evolution of the topic over the years.
6. The method as claimed in claim 1, wherein the topic scores are based on probability of occurrence of each topic in each of the plurality of preprocessed global documents, and wherein topics with topic scores above the predefined score threshold are used to generate the document-to-topics pairing.
7. A system (100) for white space-dead space technology landscaping from unstructured data, the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
create, i) a global corpus comprising a plurality of global documents, and ii) an enterprise corpus comprising a plurality of enterprise documents with each of the plurality of enterprise documents identified with a research area among a plurality of research areas and a research subarea among a plurality of research subareas defined for each area, wherein the plurality of global documents and the plurality of enterprise documents, comprising the unstructured data, belong to a predefined time span;
preprocess, each of the plurality of global documents and each of the plurality of enterprise documents by performing tokenization, dictionary generation and Term Frequency–Inverse Document Frequency (TF-IDF) computation;
generate, a topic model by applying a Latent Dirichlet Allocation (LDA) technique on the preprocessed global documents, wherein the topic model generates a global topic set comprising a predefined number of topics derived from the global documents, and wherein each topic among the global topic set is assigned a topic score per global document to generate document-to-topics pairing for each of the preprocessed global documents with topics from the global topic set having topics scores above a predefined score threshold;
process, the preprocessed enterprise documents by applying the topic model to generate an enterprise topic set comprising one or more topics from among the global topic set, wherein each topic among the enterprise topic set is assigned the topic score per enterprise document to generate the document-to-topics pairing for each of the preprocessed enterprise documents with the topics from the enterprise topic set having the topics scores above the predefined score threshold;
map the topics, from among the enterprise topic set and the global topic set with each research subarea among the plurality of research subareas, by:
a) grouping the plurality of preprocessed enterprise documents into a plurality of groups in accordance with each research subarea among the plurality of research subareas; and
b) applying an iterative process over the plurality of groups to allocate the topics from the enterprise topic set and the global topic set to each research subarea, the iterative process comprising:
aggregating the topic score of each topic from the document-to-topics pairing of each of the plurality of preprocessed enterprise documents in each group among the plurality of groups and tabulating the aggregated topic score for each topic across the plurality of groups;
allocating one or more topics among the enterprise topic set to the research subarea if the aggregated topic score for the one or more topics is above a predefined aggregated threshold;
allocating, to each group among the plurality of groups, one or more preprocessed global documents from among the plurality of global documents based on a similarity between each topic in the document-to-topics pairing of each of the plurality of preprocessed global documents and one or more topics allocated to the subarea if the topic scores satisfy the predefined score threshold to obtain updated subgroups, wherein each preprocessed global document is allocated to a maximum of a predefined fraction of a total number of subareas defined by the enterprise; and
repeating the iterative process on a plurality of documents in the updated subgroups until each of the plurality of subareas is mapped to a maximum of half a number of total topics in the global topic set; and
determine:
a) trending topics and declining topics from among the mapped topics to each research subarea for generating landscaping of white space technologies and dead space technologies for each of the enterprise documents based on an intersection of the mapped topics of each subarea and the global topic set, wherein the white space technologies refer to technologies missed by the enterprise for research, and the dead space technologies refer to the technologies that need to be retired by the enterprise from research; and
b) a topic relevancy by generating a network chart, for each subarea among a plurality of research subareas, with mapped topics of the subarea as nodes, which are connected by an edge if topics among the mapped topics corresponding to two nodes are in a single enterprise document, wherein major topics and emerging topics within each subarea are identified based on centrality measures of each node in the network chart.
8. The system as claimed in claim 7, wherein the one or more hardware processors are configured to determine the trending topics and the declining topics from among the mapped topics of each research subarea by:
a) determining a total number of the plurality of global documents published per year over a time range;
b) determining a topic frequency for each topic of the global topic set per year by counting the plurality of global documents for the topic per year for the time range to generate a plurality of topic frequencies;
c) normalizing each of the plurality of topic frequencies to the average number of documents for the topic per year; and
d) plotting the normalized topic frequencies over the years in the time range, wherein a topic is defined to be a trending topic if there is a percentage growth in the topic frequency of the topic over the last one year, and wherein a topic is defined as a declining topic if there is a negative cumulative growth in a last predefined number of years.
9. The system as claimed in claim 8, wherein the one or more hardware processors are further configured to determine the white space technologies and the dead space technologies for the enterprise based on intersection of the enterprise topic set and the global topic set for a subarea by:
identifying one or more topics present in the global topic set but absent in the mapped topics for the subarea as the white space technologies for the enterprise for future research; and
identifying one or more topics present in the mapped topics of the subarea, which are identified as declining topics in the global topic set, as the dead space technologies for the enterprise, which need to be retired during future research.
10. The system as claimed in claim 7, wherein the major topics correspond to nodes of the network chart that have high centrality measures and the emerging topics correspond to the nodes having high betweenness centrality measure.
11. The system as claimed in claim 8, wherein the one or more hardware processors are further configured to:
compute a topic progression based on cosine distance between each topic identified in a year with each topic identified in previous years to get a progression of topics from year to year, wherein the topic progression enables identifying carry forward topics, emerging topics and extinct topics; and
create a topic flowchart by plotting topic progression score for each topic as a Sankey diagram providing evolution of the topic over the years.
12. The system as claimed in claim 7, wherein the topic scores are based on probability of occurrence of each topic in each of the plurality of preprocessed global documents, and wherein topics with topic scores above the predefined score threshold are used to generate the document-to-topics pairing.
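By way of a non-limiting illustration, the topic progression recited in claims 5 and 11 may be sketched as follows. This is a minimal sketch under stated assumptions: topics are represented as hypothetical term-weight dictionaries, the linking `threshold` is an assumed value, and cosine similarity (one minus the cosine distance) is used as the progression score. The resulting (source, target, score) triples are the kind of link data a Sankey diagram is typically drawn from; the plotting itself is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_progression(prev_topics, curr_topics, threshold=0.5):
    """Link topics of a previous year to topics of the current year.

    threshold is a hypothetical cut-off on similarity, not a value from
    the specification.
    """
    links = []
    for prev_name, prev_vec in prev_topics.items():
        for curr_name, curr_vec in curr_topics.items():
            score = cosine_similarity(prev_vec, curr_vec)
            if score >= threshold:
                links.append((prev_name, curr_name, score))
    linked_prev = {p for p, _, _ in links}
    linked_curr = {c for _, c, _ in links}
    return {
        "links": links,  # (source, target, score) triples
        "carry_forward": sorted(linked_curr),
        "emerging": sorted(set(curr_topics) - linked_curr),
        "extinct": sorted(set(prev_topics) - linked_prev),
    }


# Hypothetical topics for two consecutive years.
topics_2020 = {
    "2020_t1": {"lda": 0.6, "topic": 0.4},
    "2020_t2": {"svm": 0.7, "kernel": 0.3},
}
topics_2021 = {
    "2021_t1": {"lda": 0.5, "topic": 0.3, "neural": 0.2},
    "2021_t2": {"transformer": 0.8, "attention": 0.2},
}
result = topic_progression(topics_2020, topics_2021)
print(result["carry_forward"], result["emerging"], result["extinct"])
```

In this sketch a current-year topic with no sufficiently similar predecessor is labelled emerging, and a previous-year topic with no successor is labelled extinct; other labelling rules are equally possible.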
| # | Name | Date |
|---|---|---|
| 1 | 202121035610-STATEMENT OF UNDERTAKING (FORM 3) [06-08-2021(online)].pdf | 2021-08-06 |
| 2 | 202121035610-REQUEST FOR EXAMINATION (FORM-18) [06-08-2021(online)].pdf | 2021-08-06 |
| 3 | 202121035610-PROOF OF RIGHT [06-08-2021(online)].pdf | 2021-08-06 |
| 4 | 202121035610-FORM 18 [06-08-2021(online)].pdf | 2021-08-06 |
| 5 | 202121035610-FORM 1 [06-08-2021(online)].pdf | 2021-08-06 |
| 6 | 202121035610-FIGURE OF ABSTRACT [06-08-2021(online)].jpg | 2021-08-06 |
| 7 | 202121035610-DRAWINGS [06-08-2021(online)].pdf | 2021-08-06 |
| 8 | 202121035610-DECLARATION OF INVENTORSHIP (FORM 5) [06-08-2021(online)].pdf | 2021-08-06 |
| 9 | 202121035610-COMPLETE SPECIFICATION [06-08-2021(online)].pdf | 2021-08-06 |
| 10 | 202121035610-Proof of Right [17-08-2021(online)].pdf | 2021-08-17 |
| 11 | 202121035610-FORM-26 [21-10-2021(online)].pdf | 2021-10-21 |
| 12 | Abstract1.jpg | 2022-02-15 |
| 13 | 202121035610-FER.pdf | 2023-08-03 |
| 14 | 202121035610-OTHERS [19-01-2024(online)].pdf | 2024-01-19 |
| 15 | 202121035610-FER_SER_REPLY [19-01-2024(online)].pdf | 2024-01-19 |
| 16 | 202121035610-CLAIMS [19-01-2024(online)].pdf | 2024-01-19 |
| 17 | 202121035610-US(14)-HearingNotice-(HearingDate-03-06-2024).pdf | 2024-05-13 |
| 18 | 202121035610-Correspondence to notify the Controller [24-05-2024(online)].pdf | 2024-05-24 |
| 19 | 202121035610-Written submissions and relevant documents [13-06-2024(online)].pdf | 2024-06-13 |
| 20 | 202121035610-PatentCertificate23-07-2024.pdf | 2024-07-23 |
| 21 | 202121035610-IntimationOfGrant23-07-2024.pdf | 2024-07-23 |
| 1 | SearchHistoryE_11-07-2023.pdf | |