ES Query: Analyzers and Tokenizers


Concepts

  1. Tokenization: splitting text into terms.
  2. Normalization: improves recall, i.e. the proportion of relevant results that a search can actually find (a minimal sketch follows below).
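A minimal sketch of normalization using the built-in english analyzer (the sample text is made up for illustration): lowercasing and stemming map different surface forms of a word to the same term, which is what raises recall at query time.

GET /_analyze
{
  "analyzer": "english",
  "text": "Running Apples"
}
# expected terms, roughly: run, appl
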
Analyzer
  1. character filter (mapping): pre-processing before tokenization (strip useless characters and punctuation, apply conversions such as & => and, 《Elasticsearch》 -> Elasticsearch)
    • HTML Strip Character Filter: html_strip
      • Parameter escaped_tags: the HTML tags to keep
    • Mapping Character Filter: type mapping
    • Pattern Replace Character Filter: type pattern_replace
# HTML Strip Character Filter
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",   # or "standard"
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}


GET index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "mashibing <a><b>deu</b></a>"
}


# Mapping Character Filter
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            ".=>0",
            ")=>1",
            "(=>2"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}

GET index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "sds.sd(sd) dss "
}


# Pattern Replace Character Filter
# e.g. 213-456-789
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d)-(?=\\d)",
          "replacement": "$1_"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",   # or "standard"
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}

GET index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "213-456-789"
}

  2. tokenizer: splits the text into terms.
  3. token filter: stop words, tense conversion, case conversion, synonym conversion, filler-word handling, etc. For example has => have, him => he, apples => apple, and words like the/oh/a get dropped. A lowercase example follows, with a synonym-filter sketch after it.
    GET /_analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "MA SHI BING"
    }
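
    A hedged sketch of the synonym conversion mentioned above; the index name synonym_index, the filter name my_synonyms, and the word pair are made up for illustration:

    PUT synonym_index
    {
      "settings": {
        "analysis": {
          "filter": {
            "my_synonyms": {
              "type": "synonym",
              "synonyms": ["mashibing, msb"]   # hypothetical equivalence pair
            }
          },
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "my_synonyms"]
            }
          }
        }
      }
    }

    GET synonym_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "MSB"
    }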

Built-in ES tokenizers (7.6 ships with 15 by default)

  1. standard analyzer: the default analyzer; Chinese support is poor, it splits Chinese text character by character.
    1. max_token_length: maximum token length. Tokens longer than this are split at max_token_length intervals. Defaults to 255.
  2. Pattern Tokenizer: splits text into terms on a regex-matched separator.
  3. Simple Pattern Tokenizer: matches the terms themselves with a restricted regex, and is generally faster than the Pattern Tokenizer. A sketch of the first two follows this list.
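
A minimal sketch of the first two, assuming hypothetical index names standard_index and pattern_index and made-up sample text:

# standard analyzer with max_token_length
PUT standard_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

GET standard_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "elasticsearch demo"
}
# "elasticsearch" should come back split into 5-character chunks

# pattern tokenizer splitting on commas
PUT pattern_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}

GET pattern_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "java,python,go"
}
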
# Custom analyzer
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "filter": {
        "test_filter": {
          "type": "stop",
          "stopwords": ["is", "a", "at", "the"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter",
            "html_strip"
          ],
          "filter": ["lowercase", "test_filter"],
          "tokenizer": "standard"
        }
      }
    }
  }
}

GET index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "TEST and T a T at S the T"
}

Chinese analyzers

  1. IK analyzer (a short sketch follows below)
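
A hedged sketch, assuming the elasticsearch-analysis-ik plugin is installed (it registers the ik_max_word and ik_smart analyzers):

# requires the analysis-ik plugin; ik_smart gives a coarser split
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}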
