Elasticsearch and Spanish Accents
I'm trying to use Elasticsearch to index research paper data, and I'm fighting with accents. For instance, if I use:
GET /_analyze?tokenizer=standard&filter=asciifolding&text="boletínes de investigaciónes"
I get:
{ "tokens": [ { "token": "bolet", "start_offset": 1, "end_offset": 6, "type": "<alphanum>", "position": 1 }, { "token": "nes", "start_offset": 7, "end_offset": 10, "type": "<alphanum>", "position": 2 }, { "token": "de", "start_offset": 11, "end_offset": 13, "type": "<alphanum>", "position": 3 }, { "token": "investigaci", "start_offset": 14, "end_offset": 25, "type": "<alphanum>", "position": 4 }, { "token": "nes", "start_offset": 26, "end_offset": 29, "type": "<alphanum>", "position": 5 } ] }
but what I would like to get is:
{ "tokens": [ { "token": "boletines", "start_offset": 1, "end_offset": 6, "type": "<alphanum>", "position": 1 }, { "token": "de", "start_offset": 11, "end_offset": 13, "type": "<alphanum>", "position": 3 }, { "token": "investigacion", "start_offset": 14, "end_offset": 25, "type": "<alphanum>", "position": 4 } ] }
What should I do?
To prevent the tokens from being split up like this, you need to use an alternative tokenizer; try the whitespace tokenizer, for example.
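As a quick check, you could re-run your _analyze call with the whitespace tokenizer in place of the standard one. This is only a sketch in the same query-string style as your original request; on newer Elasticsearch versions the parameters go in a JSON body instead:

GET /_analyze?tokenizer=whitespace&filter=asciifolding&text="boletínes de investigaciónes"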
Alternatively, use a language analyzer and specify the language (here, Spanish); see the sketch below.
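A minimal sketch of that approach, assuming a recent Elasticsearch version with typeless mappings (the index name papers and field name title are just placeholders), would be to map the field to the built-in spanish analyzer when creating the index:

PUT /papers
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "spanish" }
    }
  }
}

The spanish analyzer applies lowercasing, Spanish stop words, and light stemming, and the stemming step also normalizes accented vowels, so accented and unaccented forms such as "investigaciónes" and "investigaciones" should reduce to the same token.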