
"Automated Collation Creation"

Abstract: A collation creation process is provided to automatically establish collation support for sorted linguistic data. The sorted linguistic data is examined to determine if it matches an existing collation support. If not, a new collation support is created for the sorted linguistic data. The provider of the sorted linguistic data may participate in the collation creation process by answering queries concerning the sorted linguistic data. The provider's input is integrated into the sorted linguistic data before the collation creation process is applied to the sorted linguistic data. A user interface is provided that enables the interaction between the provider of the sorted linguistic data and the collation creation process. The user interface provides visual cues identifying distinctions among the strings in the sorted linguistic data.


Patent Information

Application #
Filing Date
03 November 2005
Publication Number
45/2009
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

MICROSOFT CORPORATION
ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, U.S.A

Inventors

1. CATHERINE A. WISSINK
ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, U.S.A
2. MICHAEL S. KAPLAN
ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052, U.S.A

Specification

AUTOMATED COLLATION CREATION

FIELD OF THE INVENTION

The present invention relates to a computer program and, more particularly, to a computer program for collating linguistic data.

BACKGROUND OF THE INVENTION

One of the greatest challenges in the globalization of computer technologies is to properly handle the numerous written languages used in different parts of the world. Languages may differ greatly in the linguistic symbols they use and in their grammatical structures. Consequently, it can be a daunting task to support most, if not all, languages in various forms of computer data processing.

To facilitate the support of different languages by computers, a standardized coding system known as Unicode was developed to uniquely identify every symbol in a language with a distinct numeric value, i.e., codepoint, and a distinct name. Codepoints are expressed as hexadecimal numbers with four to six digits. For example, the English letter "A" is identified by the codepoint 0041, while the English letter "a" is identified by the codepoint 0061, the English letter "b" is identified by the codepoint 0062, and the English letter "c" is identified by the codepoint 0063 in the Unicode system.

A fundamental operation on linguistic characters (or graphemes) of a given language is collation, which may be defined as sorting strings according to a set of rules that is culturally correct to users of a particular language. Collation is used any time a user orders linguistic data or searches for linguistic data in a logical fashion within the structure of a given language. Support of collation on a computer requires an in-depth understanding of the language. Specifically, there must be a good understanding of the graphemes used in the language and the relationship between the graphemes/phonemes and the Unicode codepoints used to construct them. For example, in English, a speaker expects a word starting with the letter "Q" to sort after all words beginning with the letter "P" and before all words starting
with the letter "R." As another example, in Traditional Chinese, the ideographs are often sorted according to their pronunciations based on the "bopomofo" phonetic system as well as by the numbers of strokes in the characters. Further, the proper sorting of the graphemes also has to take into account variations on the graphemes. Common examples of such variations include casings (upper or lower case) of the symbols and modifiers (diacritics, Indic matras, vowel marks) applied to the symbols.

Collation, i.e., sorting, is one of the most fundamental features that a user expects to simply work. Ideally, collation should be transparent. People simply expect that when they click on the top of a column in Windows® Explorer, the column will be sorted according to their linguistic expectations. Such expectation may be easy to meet from a technical perspective for simple languages, such as English; however, when support for additional languages is needed, such support can be more complicated.

The challenges in achieving proper collation are due to several factors. For example, people usually have a clear idea of how the information they choose to collate should be ordered. However, few people can really describe the rules by which collation works for any but the simplest of languages, such as English. To make the matter even more complicated, collations that are appropriate for one language are often not appropriate for another; in fact, many collation schemes contradict each other. Furthermore, people who generally understand the technical issues of collation do not understand the language or the linguistic structure. Contrariwise, experts in languages often lack the technical expertise to provide collation in a form that can be used in a traditional multi-weighted collation format.

In addition, existing platforms providing collation extensibility require full collation information as input. This requires extensive technical skill, knowledge of internal methodology and structures, and overt
collation knowledge.

Usually collation is done manually by professional collation providers, such as professional linguists. FIGURE 1 illustrates a linguist 102 operating a computer 104 to collate linguistic data such as the set of strings 106. Linguistic data can be comprised of as few as a handful of strings or as many as tens of thousands of strings and characters included in a language. However, a single professional collation provider, or even a small group of them, can only do so much at a time. Thus, there is a need to automate the collation process so that collation support for a given language can be easily provided.

Additionally, different institutions often need the capability of collating data in a linguistically appropriate fashion. Such institutions, for example, the U.S. Homeland Security Agency, may prefer not to share data with a professional collation provider. Therefore, there is a need to provide an automated collation support so as to allow data to be collated in a private manner.

In summary, proper collation support requires a comprehensive understanding of the language and its linguistic structure. Manual input of collation information by professional collation providers, such as linguists, limits the ability to add collation support for linguistic data. As a result, there is a need to automate the collation process such that collation support can be easily extended for any given language and collation can be done by a general user when privacy is preferred. The invention described below is directed to addressing this need.

SUMMARY OF THE INVENTION

The invention is directed to a tool that automatically establishes collation support for sorted linguistic data. The tool analyzes the sorted linguistic data to identify the underlying collation rules. During the analyzing process, the tool may ask the user who provided the sorted linguistic data iterative questions concerning the sorted linguistic data, thus collaborating with the user in reaching a correct collation
support for the sorted linguistic data. The tool may further test the resultant collation support by sorting test data provided by the user.

In accordance with one aspect of the invention, analyzing the sorted linguistic data to establish collation support includes searching existing collation support schemes and locating a matching collation support scheme for the sorted linguistic data. If no existing collation support scheme is available for the sorted linguistic data, a new collation support is established by analyzing the sorted linguistic data.

In accordance with another aspect of the invention, to establish a new collation support based on the sorted linguistic data, each character in each string contained in the sorted linguistic data is analyzed to identify the underlying weighting structure, beginning with the first character in each string. When analyzing each character in a string, the strings in the sorted linguistic data are first grouped based on the primary weight, i.e., the alphabetic weight, of the character in each string. The strings resulting from the first grouping are then further grouped based on the secondary weight, i.e., the diacritic weight, of the character in each string. The strings are then further grouped based on the tertiary weight, i.e., the casing weight, of the character in each string. Establishing a new collation support based on the sorted linguistic data further includes analyzing the behaviors of special characters, such as diacritics, combining marks, and scripts.

In accordance with yet another aspect of the invention, when analyzing the sorted linguistic data to establish collation support for the sorted linguistic data, the sorted linguistic data is preprocessed. The preprocessing first validates the sorted linguistic data to ensure that it is consistent in ordering and complete in coverage. Preferably, validating the sorted linguistic data includes identifying a problem in the sorted linguistic data, requesting correction to the sorted
linguistic data, and applying the correction to the sorted linguistic data. Preprocessing the sorted linguistic data may also include normalizing the sorted linguistic data.

In accordance with yet another aspect of the invention, after establishing collation support for the sorted linguistic data, the collation support may be verified, preferably by the user who provided the sorted linguistic data. The user may correct the collation support by adjusting the ordering of the sorted linguistic data which has been collated by the collation support. Any changes provided by the user are integrated into the sorted linguistic data, which is analyzed again to establish a correct collation support reflecting the changes made by the user.

In accordance with a further aspect of the invention, after establishing the collation support for the sorted linguistic data, test data may be provided to test the collation support. The test data can be sorted itself to verify if the application of the collation support on the sorted test data maintains the ordering of the test data. The test data can also be unsorted. Upon applying the collation support to the unsorted test data, the ordering of the collated test data is preferably examined to verify whether it reflects the user's expectation. If the collated test data does not meet the user's expectation, the ordering of the test data may be adjusted by the user, and the adjusted test data may then be integrated into the sorted linguistic data, which may be analyzed again to generate the correct collation support.

In accordance with another aspect of the invention, the collation support information may be built into a bm
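
By way of illustration only, the codepoint notation, the three-level weighting (alphabetic, diacritic, and casing weights), and the check that a collation maintains the ordering of already sorted test data may be sketched in Python as follows. The function names `codepoint`, `collation_key`, and `maintains_order` are hypothetical and do not appear in the specification. The use of Unicode canonical decomposition to separate a base letter from its diacritics, and the choice to order lower case before upper case, are assumptions made for this sketch and are not asserted to be the method of the invention.

```python
import unicodedata


def codepoint(ch):
    """Return the Unicode codepoint of one character as hexadecimal digits."""
    return f"{ord(ch):04X}"


def collation_key(text):
    """Build a three-level sort key (primary, secondary, tertiary).

    Assumption for this sketch: canonical decomposition (NFD) splits a
    character into a base letter followed by its combining marks.
    """
    primary, secondary, tertiary = [], [], []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        # Primary (alphabetic) weight: the base letter, ignoring case.
        primary.append(base.lower())
        # Secondary (diacritic) weight: no mark sorts before any mark.
        secondary.append(marks)
        # Tertiary (casing) weight: lower case first (an assumption).
        tertiary.append(0 if base.islower() else 1)
    # Tuples compare level by level, so a lower level only matters when
    # all higher levels are equal for the whole string.
    return (primary, secondary, tertiary)


def maintains_order(sorted_strings, key=collation_key):
    """Check that collating already sorted data leaves its order unchanged."""
    return sorted(sorted_strings, key=key) == list(sorted_strings)


if __name__ == "__main__":
    words = ["Resume", "role", "r\u00e9sum\u00e9", "resume"]
    print(codepoint("A"), codepoint("a"))
    print(sorted(words, key=collation_key))
    print(maintains_order(["role", "resume"]))
```

Under these assumptions, the four sample words collate as resume, Resume, résumé, role: the casing difference is consulted only after the alphabetic and diacritic weights tie, and a list given as "role" before "resume" is reported as not maintaining its order.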
