Elasticsearch

Elasticsearch is a search engine built on Lucene, similar to Solr.

Installation

Elasticsearch requires a Java 8 runtime; see jdk for setup.

Download a suitable package from https://www.elastic.co/downloads/elasticsearch

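wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.4.0.tar.gz
tar zxvf elasticsearch-6.4.0.tar.gz
# Elasticsearch refuses to start as root, so create a dedicated user.
# (useradd -p expects an already-encrypted password, so it is omitted here.)
groupadd elsearch
useradd -g elsearch elsearch
chown -R elsearch:elsearch elasticsearch-6.4.0
su elsearch
cd elasticsearch-6.4.0
# -d runs Elasticsearch as a daemon
./bin/elasticsearch -d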

If everything is working, running curl http://localhost:9200 prints:

{
  "name": "7P-d1nE",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "qO3-26L6QiWCl6LLPKf9JA",
  "version": {
    "number": "6.4.0",
    "build_flavor": "default",
    "build_type": "tar",
    "build_hash": "595516e",
    "build_date": "2018-08-17T23:18:47.308994Z",
    "build_snapshot": false,
    "lucene_version": "7.4.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}
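
The same check can be scripted. The following is a minimal Python sketch, assuming an unsecured node on http://localhost:9200 as above; the helper names parse_version and fetch_root are illustrative and not part of any Elasticsearch client.

```python
import json
import urllib.request


def parse_version(body):
    """Return the version number from the root endpoint's JSON response."""
    info = json.loads(body)
    return info["version"]["number"]


def fetch_root(url="http://localhost:9200"):
    """Fetch the root endpoint as text; assumes an unsecured local node."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Calling parse_version(fetch_root()) against the node installed above should return the string from the "number" field of the response.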

Chinese analysis plugins

bin/elasticsearch-plugin install analysis-smartcn
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.0/elasticsearch-analysis-ik-6.4.0.zip

Pinyin analysis plugin

The plugin version must match the Elasticsearch version, and it has to be built with Maven.

git clone https://github.com/medcl/elasticsearch-analysis-pinyin.git
cd elasticsearch-analysis-pinyin
mvn package

Find elasticsearch-analysis-pinyin-6.3.0.zip in the elasticsearch-analysis-pinyin/target/releases directory, copy it to elasticsearch-6.3.0/plugins/pinyin, and unzip it there.

A prebuilt 6.3.0 package: http://files.php.net.cn/elasticsearch-analysis-pinyin-6.3.0.zip

Configuration

To allow external access to port 9200, edit config/elasticsearch.yml (vi config/elasticsearch.yml) and set the following option:

network.host: 0.0.0.0

Concepts

Index

 [
        'settings' => [
            "index" => [
                'number_of_shards' => 5, 
                "analysis"         => [
                    "analyzer"  => [
                        "pinyin_analyzer" => [
                            "tokenizer" => "my_pinyin",
                        ],
                    ],
                    "tokenizer" => [
                        "my_pinyin" => [
                            "type"                       => "pinyin",
                            "keep_separate_first_letter" => false,
                            "keep_full_pinyin"           => true,
                            "keep_original"              => true,
                            "limit_first_letter_length"  => 16,
                            "lowercase"                  => true,
                            "remove_duplicated_term"     => true
                        ],
                    ],
                ],
            ],
        ],
        "mappings" => [
            Elastic::DEFAULT_TYPE => [
                'properties' => [
                    'id'          => [
                        "type" => "long",
                    ],
                    "title"     => [
                        "type"    => "text",
                        "copy_to" => "search_text"
                    ],
                    'search_text' => [
                        "type"   => "text",
                        "fields" => [
                            "pinyin" => [
                                "type"        => "text",
                                "store"       => false,
                                "term_vector" => "with_offsets",
                                "analyzer"    => "pinyin_analyzer",
                                "boost"       => 10,
                            ],
                        ],
                    ],
                    'status'     => [
                        "type" => "long",
                    ],
                ],
            ],
        ],
    ];

settings

settings configures the number of shards and replicas, as well as the analyzers and tokenizers.
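
As a rough illustration, the settings part of an index definition can be built as a plain dictionary and serialized to the JSON body of a PUT /<index> request. This sketch assumes the pinyin plugin is installed; build_settings is a hypothetical helper, and the replica count of 1 is an assumed value that does not appear in the example above.

```python
import json


def build_settings(shards=5, replicas=1):
    """Build an index-creation body: shard/replica counts plus a pinyin analyzer."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,  # assumed value, not from the example
                "analysis": {
                    # The analyzer refers to the tokenizer by name.
                    "analyzer": {"pinyin_analyzer": {"tokenizer": "my_pinyin"}},
                    "tokenizer": {"my_pinyin": {"type": "pinyin", "keep_original": True}},
                },
            }
        }
    }


body = json.dumps(build_settings(shards=5, replicas=1), indent=2)
print(body)
```

The number of shards is fixed when the index is created, while the number of replicas can be changed later.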

mappings

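Data types

Elasticsearch 6.x mapping settings: https://juejin.im/post/5b799dcb6fb9a019be279bd7

String - text

Used for full-text search. Fields of this type are run through an analyzer, and the resulting tokens are what gets indexed.

String - keyword

Not analyzed: only the complete, exact value of the field matches. Typically used for filtering, sorting, and aggregations.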
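
The difference between the two string types can be pictured with a toy model. This is only a sketch: the analyze function below (lowercase, split on whitespace) is a simplified stand-in for a real analyzer, and the function names are made up for illustration.

```python
def analyze(text):
    """Toy stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def text_match(field_value, query):
    """A text field matches if any analyzed query term is among the indexed terms."""
    indexed = set(analyze(field_value))
    return any(term in indexed for term in analyze(query))


def keyword_match(field_value, query):
    """A keyword field matches only on the exact, unanalyzed value."""
    return field_value == query


title = "My first blog post"
print(text_match(title, "blog"))     # True: "blog" is one of the indexed terms
print(keyword_match(title, "blog"))  # False: the whole value must match
```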

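Numeric

  • long: signed 64-bit integer, -2^63 to 2^63 - 1
  • integer: signed 32-bit integer, -2^31 to 2^31 - 1
  • short: signed 16-bit integer, -32768 to 32767
  • byte: signed 8-bit integer, -128 to 127
  • double: 64-bit IEEE 754 floating-point number
  • float: 32-bit IEEE 754 floating-point number
  • half_float: 16-bit IEEE 754 floating-point number
  • scaled_float: floating-point number stored as a long, scaled by a fixed scaling_factor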
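
The integer ranges above suggest a simple way to pick the narrowest type for a field. The sketch below is a hypothetical helper (smallest_integer_type is not an Elasticsearch API) that applies those ranges to the minimum and maximum values a field is expected to hold.

```python
# (type name, bit width), ordered from narrowest to widest.
INTEGER_TYPES = [
    ("byte", 8),
    ("short", 16),
    ("integer", 32),
    ("long", 64),
]


def smallest_integer_type(lo, hi):
    """Return the narrowest integer type whose signed range covers [lo, hi]."""
    for name, bits in INTEGER_TYPES:
        if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return name
    raise ValueError("range does not fit in a signed 64-bit integer")


print(smallest_integer_type(0, 127))    # byte
print(smallest_integer_type(0, 40000))  # integer
```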


Boolean - boolean

Accepted values: false, "false", true, "true"


Date - date

JSON has no date type, so Elasticsearch decides whether a string is a date by checking whether it matches the pattern defined by format.

The default format is strict_date_optional_time||epoch_millis.
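
A minimal Python sketch of producing values in the two default shapes, an ISO 8601 string and epoch milliseconds, from a timezone-aware datetime. The helper names are illustrative; the sketch only formats values and does not validate them against Elasticsearch's date parser.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_iso(dt):
    """ISO 8601 string with a UTC offset."""
    return dt.isoformat()


def to_epoch_millis(dt):
    """Whole milliseconds since the Unix epoch, computed with integer arithmetic."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


dt = datetime(2018, 8, 17, 23, 18, 47, tzinfo=timezone.utc)
print(to_iso(dt))  # 2018-08-17T23:18:47+00:00
print(to_epoch_millis(dt))
```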


Binary - binary

Fields of this type take their value as a Base64-encoded string. By default the value is not stored and is not searchable.
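
For example, raw bytes can be prepared for a binary field with Python's standard base64 module. This is a small sketch; the field name blob and the helper names are hypothetical.

```python
import base64


def encode_binary(raw):
    """Encode raw bytes as the Base64 string a binary field expects."""
    return base64.b64encode(raw).decode("ascii")


def decode_binary(value):
    """Decode a Base64 string read back from a document."""
    return base64.b64decode(value)


doc = {"blob": encode_binary(b"hello")}  # "blob" is a made-up field name
print(doc)  # {'blob': 'aGVsbG8='}
```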

Usage

1. List all indices

curl -X GET 'http://localhost:9200/_cat/indices?v'
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   my_index RFXw1NmhRcuxFelQQObm_Q   5   1          0            0      1.2kb          1.2kb
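
When this output needs to be consumed by a script, the column layout can be split into dictionaries. The sketch below assumes every column has a value in every row, which is not guaranteed (closed indices, for instance, leave some columns empty); requesting JSON with the format=json query parameter is likely the more robust option. parse_cat is an illustrative name.

```python
def parse_cat(text):
    """Parse whitespace-separated _cat output (with a header row) into dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


sample = """\
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   my_index RFXw1NmhRcuxFelQQObm_Q   5   1          0            0      1.2kb          1.2kb
"""
rows = parse_cat(sample)
print(rows[0]["index"], rows[0]["health"])  # my_index yellow
```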

2. List all types (mappings)

curl 'localhost:9200/_mapping?pretty=true'
{
  "my_index" : {
    "mappings" : { }
  }
}

3. Bulk operations (bulk): https://www.elastic.co/guide/cn/elasticsearch/guide/current/bulk.html

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} 
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }
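
The bulk body is newline-delimited JSON: one action line, optionally followed by one source line, and a newline after every line including the last. A minimal sketch of assembling such a body in Python follows; bulk_body is a hypothetical helper, and sending the result (typically with Content-Type: application/x-ndjson) is left out.

```python
import json


def bulk_body(actions):
    """Serialize (action, source_or_None) pairs into a _bulk request body."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:  # delete actions carry no source line
            lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body([
    ({"delete": {"_index": "website", "_type": "blog", "_id": "123"}}, None),
    ({"create": {"_index": "website", "_type": "blog", "_id": "123"}},
     {"title": "My first blog post"}),
])
print(body, end="")
```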

scroll

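scroll creates a temporary snapshot of the search results. However, if the search conditions are identical, the scroll_id may be the same, so several requests can end up consuming the same scroll at the same time.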
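
One way to avoid several consumers sharing a scroll is to drive each scroll from a single loop. The sketch below shows only the paging logic; start and next_page are caller-supplied callables (hypothetical names) that perform the initial search with a scroll parameter and the follow-up scroll request, and both return the parsed JSON response.

```python
def scroll_hits(start, next_page):
    """Yield every hit from a scrolled search, one page at a time.

    start() issues the initial search request;
    next_page(scroll_id) fetches the following page.
    """
    page = start()
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        # Always pass along the most recent _scroll_id.
        page = next_page(page["_scroll_id"])
```

The loop stops at the first page with no hits, which is how the end of a scroll is signalled.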

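SQL queries

One drawback is that LIMIT does not support an offset.

FAQs

Aggregations fail with "Fielddata is disabled on text fields by default". Sorting or aggregating on a text field causes this; either set fielddata=true on the field or explicitly map it to a concrete type such as long.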

PUT megacorp/_mapping/employee/
{
  "properties": {
    "interests": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

See https://blog.csdn.net/wild46cat/article/details/62889554

Related

Search technology: lucene solr sphinx elasticsearch

References