BACKGROUND
[1] Input methods can be employed to convert phonetic strings (readings) into display characters for East Asian languages such as Chinese, Korean, and Japanese, for example, and can also process strokes, as in Traditional Chinese characters. Ambiguity exists in conversion due to homonyms and various possible word segmentations. An input method tries to resolve the ambiguity based on a general (e.g., baseline, default) language model and the user input history. Adaptation to the user input history can be performed in several ways, for example, via short-term memory and long-term memory. Short-term memory corresponds to the quickness of adaptation, and long-term memory corresponds to the stability of the adaptation. Conversion results are determined by adding information from the short-term and long-term memory to the general language model.
[2] Short-term memory can be implemented by boosting word scores or changing word rank based on a previous user choice of words (user input history). However, some words do not appear soon enough after being used, and some words appear unexpectedly in unacceptable contexts after being used. Long-term memory can be implemented by accumulating user input history. However, some words still appear unexpectedly in unacceptable contexts in spite of the utilization of long-term memory.
SUMMARY
[3] The following presents a simplified summary in order to provide a basic understanding of some novel embodiments described herein. This summary is not an extensive overview, and it is not intended to identify key/critical elements or to delineate the scope thereof. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
[4] The disclosed architecture suppresses the unexpected appearance of words by applying appropriate restrictions to long-term and short-term memory. The quickness of adaptation is also realized by leveraging the restriction.
[5] The architecture includes a history component for processing user input history for conversion of a phonetic string by a conversion process that outputs conversion results, and an adaptation component for adapting the conversion process to the user input history based on restriction(s) applied to short-term memory that impact word appearances during the conversion process. The architecture performs probability boosting based on context-dependent probability differences (short-term memory), and dynamic linear interpolation between long-term memory and a baseline language model based on the frequency of the preceding context of a word (long-term memory).
[6] To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings. These aspects are indicative of the various ways in which the principles disclosed herein can be practiced and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. Other advantages and novel features will become apparent from the following detailed description when considered in conjunction with the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[7] FIG. 1 illustrates a computer-implemented phonetic system in accordance with the disclosed architecture.
[8] FIG. 2 illustrates a system that includes additional aspects of the phonetic system of FIG. 1.
[9] FIG. 3 illustrates a graph for the weight transition.
[10] FIG. 4 illustrates a graph for a cache weight transition.
[11] FIG. 5 illustrates a computer-implemented phonetic method.
[12] FIG. 6 illustrates additional aspects of the method of FIG. 5.
[13] FIG. 7 illustrates additional aspects of the method of FIG. 5.
[14] FIG. 8 illustrates a block diagram of a computing system operable to execute fast and stable adaptation for a statistical language model in accordance with the disclosed architecture.
DETAILED DESCRIPTION
[15] Although the conversion accuracy of existing phonetic systems can be high in a general scenario, users can still be disappointed because the language space of a specific user differs from the generic space. This is especially true for personal names and expression preferences, which naturally vary among users and thus cannot be addressed by the generic language model.
[16] The disclosed architecture is a self-tuning technique where the user no longer needs to open a candidate list after using the product for a short period of time (e.g., 2-3 weeks). Moreover, the disclosed self-tuning technique improves a user's work performance. The architecture performs probability boosting based on context-dependent probability differences (short-term memory), and dynamic linear interpolation between long-term memory and a baseline language model based on the frequency of the preceding context of a word (long-term memory).
[17] Reference is now made to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding thereof. It may be evident, however, that the novel embodiments can be practiced without these specific details. In other instances, well known structures and devices are shown in block diagram form in order to facilitate a description thereof. The intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the claimed subject matter.
[18] FIG. 1 illustrates a computer-implemented phonetic system 100 in accordance with the disclosed architecture. The system 100 includes a history component 102 for processing user input history 104 for conversion of a phonetic string 105 by a conversion process that outputs conversion results 106, and an adaptation component 108 for adapting the conversion process to the user input history 104 based on restriction(s) 110 applied to short-term memory 112 that impact word appearances during the conversion process.
[19] The adaptation component 108 performs dynamic linear interpolation between long-term memory 114 and a baseline language model based on the long-term memory 114. The restriction(s) 110 boost the probability of a word when the word is other than a first candidate of a candidate list. The restriction(s) 110 applied to the short-term memory 112 employ a context-sensitive short-term memory bigram probability. The restriction(s) 110 applied to the short-term memory 112 boost a probability based on a word and a context of the word in a sentence. The context includes a preceding context and a succeeding context relative to the word in the sentence. The adaptation component 108 includes a learning algorithm that performs flag-learning based on a difference between a first candidate of a candidate list and a selected candidate of the candidate list, and moves the selected candidate to the first conversion result position in a next conversion process.
[20] FIG. 2 illustrates a system 200 that includes additional aspects of the phonetic system 100 of FIG. 1. The system 200 includes the history component 102 for processing the user input history 104 for conversion of the phonetic string 105 by a conversion process 204, and the adaptation component 108 for adapting the conversion process 204 to the user input history 104 based on the restriction(s) 110 applied to the short-term memory 112 that impacts word appearances during the conversion process 204.
[21] The adaptation component 108 performs dynamic linear interpolation between the long-term memory 114 and a baseline language model 208 based on the long-term memory 114. The restriction(s) 110 boost the probability of a word when the word is other than a first candidate of a candidate list. The restriction(s) 110 applied to the short-term memory 112 employ a context-sensitive short-term memory bigram probability. The restriction(s) 110 applied to the short-term memory 112 boost a probability based on a word and a context of the word in a sentence. The context includes a preceding context and a succeeding context relative to the word in the sentence. The adaptation component 108 includes a learning algorithm that performs flag-learning based on a difference between a first candidate of a candidate list and a selected candidate of the candidate list, and moves the selected candidate to the first conversion result position in a next conversion process.
[22] The system 200 further comprises a restriction component 206 for applying the restriction(s) 110 by boosting a probability based on context-dependent probability differences. The restriction component 206 can also apply one or more of the restriction(s) 110 to the long-term memory 114 by boosting a probability based on a context-dependent probability difference.
[23] Put another way, the phonetic system 200 includes the history component 102 for processing the user input history 104 for conversion of the phonetic string 105 during the conversion process 204, the restriction component 206 for applying one or more of the restriction(s) 110 to the user input history 104 during the conversion process 204. The history 104 includes the short-term memory 112 and the long-term memory 114. The system 200 also includes the adaptation component 108 for adapting the conversion process 204 to the user input history 104 based on the restriction(s) 110.
[24] The restriction component 206 applies one or more of the restriction(s) 110 to the short-term memory 112. The applied restriction(s) 110 employ a context-sensitive short-term memory bigram probability, and one or more of the restriction(s) 110 applied to the long-term memory 114 boost a probability based on a context-dependent probability difference. The adaptation component 108 performs dynamic linear interpolation between the long-term memory 114 and the baseline language model 208 based on the long-term memory 114. The restriction(s) 110 boost the probability of a word when the word is other than a first candidate of a candidate list. The restriction(s) 110 applied to the short-term memory 112 boost a probability based on a word and a context of the word in a sentence. The context includes a preceding context and a succeeding context relative to the word in the sentence.
[25] Following is a detailed description of the computations employed for fast and stable adaptation for a statistical language model.
[26] The input method conversion result for an input phonetic string can be determined by the following probability:
P(W) = P(w1|<s>) · P(w2|w1) · P(w3|w2) · ... · P(wN|wN-1) · P(</s>|wN)

where W is a sentence that includes a word sequence {w1, w2, w3, ..., wN}, and <s> and </s> are symbols for sentence-start and sentence-end, respectively. The equation is for the bigram model, but can be represented with trigram or higher-order n-gram models.
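As a concrete illustration, the sentence probability above can be computed in log space to avoid floating-point underflow, and the most probable candidate taken as the conversion result. The following Python sketch uses a small hypothetical bigram table; a real input method would obtain these probabilities from its statistical language model.

```python
import math

# Hypothetical bigram probabilities for illustration only; a real input
# method derives these values from its statistical language model.
BIGRAM = {
    ("<s>", "w1"): 0.2,
    ("w1", "w2"): 0.5,
    ("w2", "</s>"): 0.4,
}

def sentence_log_prob(words, bigram=BIGRAM, floor=1e-12):
    """Score a candidate sentence W = {w1, ..., wN} as the product of
    its bigram probabilities, including the sentence-start <s> and
    sentence-end </s> symbols, computed in log space."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log(bigram.get((prev, cur), floor))
    return logp

# The most probable candidate sentence is selected as the conversion result.
candidates = [["w1", "w2"], ["w3"]]
best = max(candidates, key=sentence_log_prob)
```

Unseen bigrams receive a small floor probability here; in practice a smoothed model would assign them backed-off estimates instead.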
[27] There can be many possible word sequences W for an input phonetic string due to homonyms and ambiguity of word segmentation.
[28] A most probable candidate sentence is selected as a conversion result.
Ŵ = argmax_W P(W)
[29] The probability for each word can be defined as,
P(wn|wn-1) = α·Pbaseline(wn|wn-1) + β·Pltm(wn|wn-1) + δ·Pstm(wn|wn-1)

where α, β, and δ are linear interpolation coefficients that sum to one (α + β + δ = 1), Pbaseline(wn|wn-1) is a baseline bigram probability estimated from the training text database (when using the input method for the first time, only this probability has a value), Pltm(wn|wn-1) is the bigram probability for the long-term memory, and Pstm(wn|wn-1) is the bigram probability for the short-term memory. The bigram probability for the long-term memory can be calculated from the user input history, as follows:

Pltm(wn|wn-1) = Cuser(wn-1, wn) / Cuser(wn-1)

where Cuser(wn) is the number of times the user used the word wn, and Cuser(wn-1, wn) is the number of times the user used the word sequence wn-1, wn.
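A minimal Python sketch of this interpolation follows. The coefficient values are illustrative placeholders; the actual weights are tuned dynamically as described later.

```python
def p_ltm(prev, word, user_bigram, user_unigram):
    """Long-term memory bigram estimate from the user input history:
    Cuser(prev, word) / Cuser(prev); zero when the context is unseen."""
    denom = user_unigram.get(prev, 0)
    return user_bigram.get((prev, word), 0) / denom if denom else 0.0

def interpolate(p_baseline, p_ltm_val, p_stm_val,
                alpha=0.6, beta=0.3, delta=0.1):
    """Linear interpolation of the three bigram probabilities, with
    alpha + beta + delta = 1.  On first use, only the baseline term
    has a value, so only it contributes to the result."""
    return alpha * p_baseline + beta * p_ltm_val + delta * p_stm_val
```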
[30] The bigram probability for short-term memory, Pstm(wn|wn-1), boosts the probability for a word when the word is not the first candidate of the result, but the user selects the word from the candidate list:

Pstm(wn|wn-1) = Cuser-sel(wn-1, wn) / M

where Cuser-sel(wn-1, wn) is the number of times that the user selects the word sequence wn-1, wn from the candidate list, and M is the maximum count for selecting. Note that Cuser-sel(wn-1, wn) ≤ M.
[31] The above equation can be generalized by exponentiation.
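The selection-count bookkeeping of paragraph [30] can be sketched as follows, assuming an illustrative cap of M = 5 (the actual value of M is a tuning choice):

```python
M = 5  # maximum selection count (illustrative value)

def record_selection(counts, prev, word, m=M):
    """Count a user selection of (prev, word) from the candidate list,
    capping the count at m so the boost saturates."""
    key = (prev, word)
    counts[key] = min(counts.get(key, 0) + 1, m)

def p_stm(counts, prev, word, m=M):
    """Short-term memory bigram probability Cuser-sel(prev, word) / m;
    repeated selections raise it toward a ceiling of 1.0."""
    return counts.get((prev, word), 0) / m
```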
[32] Following is additional description for long-term memory. The linear-interpolation weights α and β for a word wn change depending on Cuser(wn-1). This means the weights differ depending on the previous word.
[33] The target weights αtarget and βtarget are defined and used when Cuser(wn-1) is sufficiently large. The actual weights α and β for wn can be calculated as follows,
[34] FIG. 3 illustrates a graph 300 for the weight transition. The graph 300 shows the relative vertical range segments for the short-term memory weight δ, the long-term memory weight β, and the baseline weight α, with the long-term memory weight designated βtarget and the baseline weight designated αtarget. The graph 300 indicates that as the number of times the word is used increases, at a time t, the weighting for the long-term memory reaches βtarget.
[35] When Cuser(wn-1) is small, the long-term bigram probability tends to be high and yields an unexpected appearance of words. However, this weight-adjustment can suppress these kinds of side-effects.
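The exact transition function is not given above; the sketch below assumes a simple linear ramp that reaches βtarget at count t, consistent with the graph of FIG. 3. The linear shape and the values of βtarget and t are assumptions for illustration.

```python
def ltm_weight(c_prev, beta_target=0.3, t=100):
    """Ramp the long-term memory weight beta from 0 toward beta_target
    as Cuser(wn-1) grows, reaching beta_target at count t.  The linear
    shape and the parameter values here are assumptions."""
    return beta_target * min(c_prev, t) / t

def baseline_weight(beta, delta):
    """alpha absorbs the remaining mass (alpha + beta + delta = 1), so a
    small history keeps conversion close to the baseline model."""
    return 1.0 - beta - delta
```

With a small Cuser(wn-1) the long-term weight stays near zero, which is exactly the suppression of unexpected word appearances described above.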
[36] Following is additional description for short-term memory. Two approaches can be employed, either separately or combined: context-sensitive use of a short-term memory bigram probability, and probability boosting depending on the probability difference.
[37] With respect to the first approach, the context-sensitive use of short-term memory bigram probability, the probability is regarded as zero when the selected-count is zero for the succeeding word sequence.
[38] Similar results can be obtained using the preceding word sequence. These conditions can be varied depending on the part-of-speech (POS) of the words. Based on these conditions, the probability boosting depends on the context, and the unexpected appearance of previously-selected words can be suppressed.
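A sketch of the context-sensitive condition, using the Cuser-sel selection counts of paragraph [30]; the POS-dependent variation of the conditions is omitted here:

```python
def p_stm_contextual(counts, prev, word, nxt, m=5):
    """Context-sensitive short-term memory probability: regarded as zero
    when the selected-count for the succeeding word sequence (word, nxt)
    is zero, so a previously selected word is not boosted in an
    unrelated context."""
    if counts.get((word, nxt), 0) == 0:
        return 0.0
    return counts.get((prev, word), 0) / m
```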
[40] With respect to the second approach, probability boosting depending on the probability difference, incrementing Cuser-sel(wn-1, wn) one-by-one may be insufficient for some words and too much for other words. The appropriate amount of incrementing of Cuser-sel(wn-1, wn) depends on the word and the context of the word.
[41] The user selects the word from the candidate list because the probability of the sentence that includes the word is lower than the probability of the other sentence (the first candidate). Thus, in order to obtain the word the next time, the probability of the sentence that includes the word should be higher than the probability of the other sentence (the first sentence at the previous conversion).
[42] FIG. 4 illustrates a graph 400 for a cache weight transition. In an alternative embodiment, a cache weight transition is provided using a linear function, and the cache weight is only for the bigram cache (bicache). The bigram cache weight depends on the unigram cache (unicache) amount of the preceding word. This means that the weight for the bigram cache probability Pbicache(wi|wi-1) depends on Cunicache(wi-1).
[43] The flag weight δ+ε is constant. The weight for the unigram cache is constant as well, but an offset value is added to the total unigram cache count to reduce the side-effects by the earlier cache.
[44] Flag-learning depends on the probability differences. The level of increase of a bigram flag changes depending on the amount of difference estimated between the first candidate and the selected candidate. The selected candidate becomes the first subsequent conversion result if the surrounding context is the same.
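The flag-count updates described here and in paragraphs [49]-[50] below can be sketched as follows. The difference-dependent bigram increment is passed in as a parameter, since its computation from the probability gap is derived later.

```python
def flag_learn(uniflags, biflags, prev, first, selected, bi_increment):
    """Flag-learning after the user picks `selected` over the first
    candidate `first` that followed context word `prev`.  `bi_increment`
    is the difference-dependent amount for the bigram flag."""
    # The displaced first candidate's bigram flag count drops by one.
    biflags[(prev, first)] = max(biflags.get((prev, first), 0) - 1, 0)
    # The selected word's unigram flag count is incremented by one.
    uniflags[selected] = uniflags.get(selected, 0) + 1
    # Its bigram flag count grows by the difference-dependent amount.
    biflags[(prev, selected)] = biflags.get((prev, selected), 0) + bi_increment
```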
[45] The following cases can be considered and the algorithm below covers all cases.
Case #1: {wa, wb, wc} after conversion → {wa, wx, wc} after editing
Case #2: {wa, wb1 ... wbm, wc} after conversion → {wa, wx, wc} after editing
Case #3: {wa, wb, wc} after conversion → {wa, wx1 ... wxn, wc} after editing
Case #4: {wa, wb1 ... wbm, wc} after conversion → {wa, wx1 ... wxn, wc} after editing
[46] The following definitions are provided.
[47] P(wb|wa) is the word bigram probability before learning, including baseline, cache, and flag probabilities.

P(wb|wa) = α·Pbaseline(wb|wa) + β·Pcache(wb|wa) + γ·Pcache(wb) + δ·Pflag(wb|wa) + ε·Pflag(wb)
[48] PL(wb|wa) is the word bigram probability after learning. The change of cache probabilities is ignored here for simplification, and only the flag probabilities change after learning.
[49] The flag counts for candidate words, which are the first candidates when a user selects an alternative candidate from the candidate list, are decremented by one after learning.
[50] The unigram flag counts for the corresponding candidate words, which candidate words are selected from the candidate list, are incremented by one. The bigram flag counts for the corresponding candidate words, which are selected from the candidate list, are incremented, the amount of increment to be determined.
[51] With respect to the algorithm, before learning, the magnitude relation between the first candidate and the selected candidate is as follows,

P(wb|wa) · P(wc|wb) > P(wx|wa) · P(wc|wx)
[52] The magnitude relation after learning becomes,

PL(wb|wa) · PL(wc|wb) < PL(wx|wa) · PL(wc|wx)
[53] The probabilities PL(wb|wa) · PL(wc|wb) and P(wx|wa) · P(wc|wx) are known, and how to calculate PL(wx|wa) · PL(wc|wx) is desired.
[54] The change of probabilities by learning can be represented as an exponentiation (or power):

PL(wx|wa) · PL(wc|wx) = [P(wx|wa) · P(wc|wx)]^φ

Therefore,

PL(wb|wa) · PL(wc|wb) < [P(wx|wa) · P(wc|wx)]^φ

Then φ can be calculated as,

φ < log[PL(wb|wa) · PL(wc|wb)] / log[P(wx|wa) · P(wc|wx)]

where the inequality direction flips because the logarithm of a probability is negative.
[55] Now, calculate Pflag(wx|wa) from φ.

PL(wx|wa) · PL(wc|wx) = [P(wx|wa) · P(wc|wx)]^φ = P(wx|wa)^φ · P(wc|wx)^φ
[56] If the equalities below are satisfied, the above equality is satisfied:

PL(wx|wa) = P(wx|wa)^φ and PL(wc|wx) = P(wc|wx)^φ
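Numerically, the exponent φ and the per-factor update can be sketched as below. The small margin that makes the selected path strictly win at the next conversion is an implementation assumption, not something the derivation above specifies.

```python
import math

def learning_exponent(p_first_path, p_selected_path, margin=1e-6):
    """Solve p_selected_path ** phi > p_first_path for phi.  Taking logs
    gives phi < log(p_first_path) / log(p_selected_path); the inequality
    flips because the log of a probability is negative.  `margin` nudges
    phi just below the bound so the selected path strictly wins."""
    return math.log(p_first_path) / math.log(p_selected_path) - margin

def powered(p, phi):
    """Per-factor update PL(w|v) = P(w|v) ** phi; a phi below 1 raises a
    probability that is below 1."""
    return p ** phi
```

For example, if the first-candidate path scores 0.001 after its flags are decremented and the selected path scores 0.0005, the computed φ is just under 1, and raising each factor of the selected path to the power φ lifts its product above 0.001.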