First, we decided to improve our « picking up » module, which corrects spelling mistakes. Note that such mistakes are numerous in news texts. The goal was to avoid silence: a misspelled word cannot be matched, so the text containing it cannot be proposed as a candidate during search. The second decision was to apply natural language processing to the input corpora during indexing. We had the following reasons for doing so:
1) we wanted to recognize compound words as genuine compounds in order to:
a) avoid noise due to false interpretation of the components: a « pomme de terre » (potato) is not a « pomme » (apple).
b) take advantage of the many compound words recorded in the lexicon: we can identify which words are worth indexing and are therefore useful for identifying the documents in which they appear.
2) we wanted to filter out certain grammatical categories; for instance, we wanted to avoid indexing empty words (stop words) and adverbs.
3) we wanted to insert into the index only the lemmatized forms, not the full forms, in order to group the various occurrences of the same lemma and compute a single weight over all occurrences of its various full forms. This criterion holds for both simple and compound words.
4) we wanted to disambiguate certain difficult (and frequent) French words such as « tu », which can be either a pronoun or the past participle of the verb « taire ».
5) we needed local grammars in order to recognize dates, times, numbers, etc., and the morphological analyzer already implemented these algorithms.
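
Reasons 1 to 3 can be illustrated with a minimal sketch. It is not the system described here: the toy lexicon, the compound table, the category tags and the function name `index_terms` are all hypothetical, chosen only to show how compound recognition, category filtering and lemmatization could combine when building index terms.

```python
from collections import Counter

# Hypothetical toy lexicon: surface form -> (lemma, grammatical category).
LEXICON = {
    "les": ("le", "DET"),
    "la": ("le", "DET"),
    "pommes": ("pomme", "NOUN"),
    "pomme": ("pomme", "NOUN"),
    "de": ("de", "PREP"),
    "terre": ("terre", "NOUN"),
    "sont": ("être", "AUX"),
    "est": ("être", "AUX"),
    "très": ("très", "ADV"),
    "bonnes": ("bon", "ADJ"),
    "cuite": ("cuire", "VERB"),
}

# Hypothetical compound table: token sequence -> lemmatized compound.
COMPOUNDS = {
    ("pomme", "de", "terre"): "pomme de terre",
    ("pommes", "de", "terre"): "pomme de terre",
}

# Categories assumed to be excluded from the index (empty words, adverbs).
FILTERED = {"DET", "PREP", "AUX", "ADV"}


def index_terms(tokens):
    """Return lemma -> occurrence count for the terms worth indexing."""
    terms = Counter()
    max_len = max(len(key) for key in COMPOUNDS)
    i = 0
    while i < len(tokens):
        # Try the longest compound starting at position i first, so that
        # "pomme de terre" is never indexed as "pomme" plus "terre".
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            lemma = COMPOUNDS.get(tuple(tokens[i:i + n]))
            if lemma is not None:
                terms[lemma] += 1
                i += n
                break
        else:
            # No compound matched: fall back to the simple word.
            lemma, category = LEXICON.get(tokens[i], (tokens[i], "UNKNOWN"))
            if category not in FILTERED:
                terms[lemma] += 1
            i += 1
    return terms


text = "les pommes de terre sont très bonnes la pomme de terre est cuite"
terms = index_terms(text.split())
```

In this sketch, the singular and plural occurrences of the compound are grouped under one lemma, so a weight can be computed over both, while « pomme » alone never enters the index. Disambiguation (reason 4) and local grammars (reason 5) are left out because they require context that a simple lookup table cannot provide.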

