Xiao Han on Chinese word segmentation technology in the love Shanghai search engine


What is Chinese word segmentation

Search engine word segmentation technology matters a great deal to our Shanghai Longfeng work: both keyword layout and link architecture depend heavily on how words are segmented. Here Xiao Han talks about Chinese word segmentation in love Shanghai (of course this is not limited to love Shanghai; other search engines work in much the same way). This article is divided into two parts: first a summary of the existing explanations of word segmentation, then my own extended ideas on the topic.

Dictionary-based segmentation solves many problems, but it also has limitations.

As we all know, an English sentence consists of words separated by spaces, so segmenting it is straightforward. Chinese, however, is written as a continuous run of characters, so segmentation is considerably more complex. Chinese word segmentation is the process of cutting a Chinese sentence into individual characters and recombining them, according to certain rules, into a sequence of words.

2, reverse maximum matching method (from right to left);

This method starts with a large dictionary, that is, an indexed word library. The string to be segmented is then matched against the entries in that library according to certain rules; whenever a substring is found in the library, the match succeeds and a word is cut out. Matching is done in the following four ways:


4, bidirectional maximum matching method (two scans: one from left to right and one from right to left).
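To make the two-scan idea concrete, here is a minimal Python sketch. It is not how any particular search engine implements it. The toy dictionary, the function names and the tie-breaking rule (prefer fewer words, then fewer single-character words, then the right-to-left result) are all assumptions chosen for illustration.

```python
# Toy dictionary, assumed for illustration only; a real word library is far larger.
TOY_DICTIONARY = {"研究", "研究生", "生命", "的", "起源"}


def max_match(text, dictionary, reverse=False):
    """Greedy maximum matching, scanning left to right or (reverse=True) right to left."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    remaining = text
    while remaining:
        # Try the longest candidate first and shrink until the dictionary has it.
        for size in range(min(max_len, len(remaining)), 0, -1):
            candidate = remaining[-size:] if reverse else remaining[:size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                remaining = remaining[:-size] if reverse else remaining[size:]
                break
    # The right-to-left scan collects words back to front, so restore reading order.
    return words[::-1] if reverse else words


def bidirectional_max_match(text, dictionary):
    """Run both scans and keep the result that looks better under a simple heuristic."""
    forward = max_match(text, dictionary)
    backward = max_match(text, dictionary, reverse=True)

    def cost(words):
        # Assumed heuristic: fewer words first, then fewer single-character words.
        return (len(words), sum(1 for word in words if len(word) == 1))

    # On a tie the right-to-left result is kept.
    return forward if cost(forward) < cost(backward) else backward


print(max_match("研究生命的起源", TOY_DICTIONARY))                # ['研究生', '命', '的', '起源']
print(max_match("研究生命的起源", TOY_DICTIONARY, reverse=True))  # ['研究', '生命', '的', '起源']
print(bidirectional_max_match("研究生命的起源", TOY_DICTIONARY))  # ['研究', '生命', '的', '起源']
```

The two scans disagree on this sentence, which is exactly the kind of ambiguity that the comparison step is meant to catch.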

Two, word segmentation based on statistics

Usually, search engines combine several of these methods. This still leaves the search engine with problems to solve, such as handling ambiguity (the key being that our Chinese language is broad and profound). To improve matching accuracy, the search engine simulates human sentence comprehension in order to recognize words. The basic idea is to carry out syntactic and semantic analysis alongside segmentation, and to use the syntactic and semantic information to resolve ambiguity. Such a system usually consists of three parts: a word segmentation subsystem, a syntactic and semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to judge ambiguous cases, which simulates how a person understands a sentence. This approach requires a large amount of linguistic knowledge and information, and of course search engines are constantly improving as well.

3, minimum segmentation (cutting each sentence into the smallest possible number of words);
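As a rough illustration of minimum segmentation, the following Python sketch uses dynamic programming to find a cut with the fewest words. The dictionary, the function name and the rule that any single character may stand as a word are assumptions made for this example, not a description of a real engine.

```python
# Toy dictionary, assumed for illustration only.
TOY_DICTIONARY = {"独立", "自主", "独立自主", "和平", "平等", "互利", "平等互利"}


def min_word_segmentation(text, dictionary):
    """Cut text into the fewest words, allowing any single character as a fallback word."""
    n = len(text)
    max_len = max((len(word) for word in dictionary), default=1)
    # best[i] is the fewest words that cover text[:i]; prev[i] is where the last word starts.
    best = [0] + [n + 1] * n
    prev = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            is_word = end - start == 1 or piece in dictionary
            if is_word and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                prev[end] = start
    # Walk the prev links backwards to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[prev[end]:end])
        end = prev[end]
    return words[::-1]


print(min_word_segmentation("独立自主和平等互利", TOY_DICTIONARY))
# ['独立自主', '和', '平等互利']
```

A greedy left-to-right scan over the same dictionary would take "和平" first and end up with four words, so the example shows why a fewest-words criterion can give a different cut.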

One, word segmentation based on dictionary matching


Word segmentation plays a major role in search engines. It is the foundation of text mining and helps a program automatically identify the meaning of a sentence, so that search results match the query closely; the quality of segmentation directly affects the quality of search results. Search engines mainly use two kinds of segmentation methods: dictionary matching and statistical methods.

1, forward maximum matching method (from left to right);
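A minimal Python sketch of the left-to-right scan may help here. The toy dictionary and the function name are assumptions for illustration; a real search engine's word library and matching rules are not public and will differ.

```python
# Toy dictionary, assumed for illustration only.
TOY_DICTIONARY = {"研究", "研究生", "生命", "的", "起源"}


def forward_max_match(text, dictionary):
    """Scan left to right, always cutting the longest dictionary word at the current position."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary has it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan keeps moving.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


print(forward_max_match("研究生命的起源", TOY_DICTIONARY))
# ['研究生', '命', '的', '起源']
```

Because the scan is greedy, it takes "研究生" first and leaves "命" stranded, which is a typical ambiguity error of a pure left-to-right match.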
