SpringCloud (9) - Elasticsearch aggregation and auto-completion

One. Data Aggregation

1. Classification of aggregations

Aggregations make it possible to run statistics and analysis over document data. There are three common kinds of aggregation:

1. Bucket aggregation (Bucket)

Note: fields of type text do not support bucket aggregation

Bucket aggregation (Bucket) is used to group documents. The more common ones are:

  • terms aggregation: groups documents by a field's value, similar to GROUP BY in MySQL
  • date_histogram: groups by date interval, for example one week or one month per bucket

2. Metric aggregation (Metric)

Note: fields of type text and keyword do not support metric aggregation

Metric aggregation is used to calculate values such as max, min and average. The common ones are:

  • Avg: average value
  • Max: maximum value
  • Min: minimum value
  • Sum: sum
  • Stats: computes avg, max, min and sum at the same time

3. Pipeline aggregation (pipeline)

Pipeline aggregation aggregates based on the results of other aggregations; a sketch is shown below.

Note that fields participating in aggregation cannot be of type text; they are generally keyword, date, boolean, integer, etc.
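
As a minimal sketch (using the hotel index from the later examples), the built-in avg_bucket pipeline aggregation can compute the average of each brand bucket's average score:

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": { "field": "brand", "size": 20 },
      "aggs": {
        "score_avg": { "avg": { "field": "score" } }
      }
    },
    "avg_brand_score": {
      "avg_bucket": { "buckets_path": "brandAgg>score_avg" }
    }
  }
}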

2. DSL implements Bucket aggregation

Here is an example syntax:

GET /[indexName]/_search
{
  "size": 0,  // number of documents to return (default 10); set to 0 to return only aggregation results and no documents
  "aggs": {
    "aggregation name (customizable)": {
      "aggregation type, usually terms": {
        "field": "field name",
        "size": 20  // number of buckets to return
      }
    }
  }
}

By default, a bucket aggregation counts the documents in each bucket as _count and sorts the buckets by _count in descending order.

To change this, add the order attribute and set the sorting rule.

The following is an example that aggregates on the brand field:

GET /hotel/_search
{
  "size":0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order":{
           "_count":"asc"
        }
      }
    }
  }
}

By default, a bucket aggregation runs over all documents in the index library, which consumes a lot of memory.

We can limit the scope of the aggregation by adding a query.

For example, to aggregate only documents whose price is in the range 200-300:

GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 200,
        "lte": 300
      }
    }
  }, 
  "size":0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order":{
           "_count":"asc"
        }
      }
    }
  }
}

3. DSL implements Metric aggregation

Use the stats aggregation to get the metric values of a specified field:

GET /hotel/_search
{
  "size":0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      },
      "aggs":{           // Sub-aggregation of brandAgg: computed separately for each bucket after grouping
        "score_stats":{  // Aggregation name
          "stats": {     // Aggregation type; can also be min, max, avg, etc.
            "field": "score"  // The aggregated field must be numeric, since only numbers can be added, subtracted, multiplied and divided
          }
        }
      }
    }
  }
}


Running this DSL, stats computes the maximum, minimum, sum, average and count for each brand at the same time.

To sort the buckets by one of the aggregated values, reference an attribute of score_stats in order.

For example, to sort by the highest score in descending order:

GET /hotel/_search
{
  "size":0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order": {
          "score_stats.max": "desc"
        }
      },
      "aggs":{
        "score_stats":{
          "stats": {
            "field": "score"
          }
        }
      }
    }
  }
}

4. RestClient implements aggregation

The Java API mirrors the structure of the DSL. Here is a test example:

@Test
public void testAggregationBrand() throws IOException {

    //1. Create a SearchRequest object and specify the name of the index library
    SearchRequest request = new SearchRequest("hotel");
    //2. Set size to 0 so that only aggregation results are returned, not documents
    request.source().size(0);
    request.source().aggregation(
            AggregationBuilders
                    //Set the aggregation type to terms and name the aggregation
                    .terms("brandAgg")
                    //Set the field to aggregate on
                    .field("brand")
                    //Set the number of buckets to return
                    .size(20)
                    //Set the sort order: ascending by _count
                    .order(BucketOrder.aggregation("_count", true))
    );
    //3. Send request
    SearchResponse response = restHighLevelClient.search(request, RequestOptions.DEFAULT);

    //Get all aggregation results
    Aggregations aggregations = response.getAggregations();
    //Get the aggregation result according to the aggregation name
    Terms brandTerms = aggregations.get("brandAgg");
    //Get the buckets
    List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();
    //Traverse the buckets
    for (Terms.Bucket bucket : buckets) {
        String brandName = bucket.getKeyAsString();
        Long docCount = bucket.getDocCount();
        System.out.println(brandName+","+docCount);
    }
}

Note: Use AggregationBuilders to build an aggregation object
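
The stats sub-aggregation from the DSL section can be built the same way via subAggregation. A minimal sketch, reusing the restHighLevelClient from the test above (Stats is the result interface the client returns for a stats aggregation):

@Test
public void testAggregationWithStats() throws IOException {
    SearchRequest request = new SearchRequest("hotel");
    request.source().size(0);
    request.source().aggregation(
            AggregationBuilders.terms("brandAgg").field("brand").size(20)
                    //Nest the stats metric aggregation inside each brand bucket
                    .subAggregation(AggregationBuilders.stats("score_stats").field("score"))
    );
    SearchResponse response = restHighLevelClient.search(request, RequestOptions.DEFAULT);

    Terms brandTerms = response.getAggregations().get("brandAgg");
    for (Terms.Bucket bucket : brandTerms.getBuckets()) {
        //Read the sub-aggregation result from the bucket by its name
        Stats stats = bucket.getAggregations().get("score_stats");
        System.out.println(bucket.getKeyAsString() + ": max=" + stats.getMax() + ", avg=" + stats.getAvg());
    }
}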

Two. Auto-completion

1. Using the pinyin tokenizer

GitHub address: elasticsearch-analysis-pinyin

installation steps:

  1. Download the version of elasticsearch-analysis-pinyin that matches your es version (this document uses v7.12.1)
  2. Unzip it and upload it to the plugin directory mounted by the es container (the same directory as the ik tokenizer)
  3. Restart the es container
  4. Test:
GET /_analyze
{
  "text":["Home Inn"],
  "analyzer": "pinyin"
}

If the result returns pinyin terms, the tokenizer is installed and working.

2. Custom tokenizer

1. The composition of an analyzer

  • character filters: process the text before it reaches the tokenizer, for example deleting or replacing characters
  • tokenizer: cuts the text into terms according to certain rules; for example, the keyword tokenizer keeps the whole text as a single term
  • tokenizer filter: further processes the terms output by the tokenizer, such as case conversion, synonym handling, pinyin conversion, etc.

Not all three parts are required when customizing an analyzer; it depends on actual business needs. The sketch below shows all three parts for reference; the hotel examples that follow use only a tokenizer and a filter, with no character filter.
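
A minimal sketch of an analyzer that uses all three parts, combining the built-in html_strip character filter, standard tokenizer and lowercase filter (demo_index is just an illustrative name):

PUT /demo_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "three_part_analyzer": {
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}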

2. Implementing a custom tokenizer

When creating an index library, configure a custom analyzer (word breaker) through settings.

Because the custom analyzer is specified when the index library is created, it only takes effect for that index library.

Here is an example syntax:

PUT /[indexName]
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom analyzer name":{
          "tokenizer":"tokenizer name",
          "filter":"filter name"
        }
      }
    }
  }
}

Implementation example:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"ik_max_word",
          "filter":"pinyin"
        }
      }
    }
  }
}
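
To see what this analyzer produces, test it with the _analyze API; since my_analyzer is defined inside the test index, the request must target that index:

GET /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["Home Inn"]
}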

With the default settings, the implementation above turns each Chinese character into a separate pinyin term, which is not what we want. Refer to the pinyin plugin documentation to set the relevant parameters.

Here is an example implementation:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"ik_max_word",
          "filter":"py"
        }
      },
      "filter": {
        "py":{
          "type":"pinyin",   // Use the pinyin word breaker, the following are some parameter settings of the pinyin word breaker
          "keep_full_pinyin":false,
          "keep_joined_full_pinyin":true,
          "keep_original":true,
          "limit_first_letter_length":16,
          "remove_duplicated_term":true,
          "none_chinese_pinyin_tokenize":false
        }
      }
    }
  }
}
  • my_analyzer: the name of the custom analyzer
  • py: the name of the custom token filter (based on the pinyin filter)

3. Precautions for the pinyin tokenizer

The pinyin tokenizer should be used when building the inverted index, but not at search time, otherwise homophones would be matched. Therefore the field should use the custom analyzer when indexing (analyzer) and the ik_smart tokenizer when searching (search_analyzer):

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"ik_max_word",
          "filter":"py"
        }
      },
      "filter": {
        "py":{
          "type":"pinyin",
          "keep_full_pinyin":false,
          "keep_joined_full_pinyin":true,
          "keep_original":true,
          "limit_first_letter_length":16,
          "remove_duplicated_term":true,
          "none_chinese_pinyin_tokenize":false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name":{
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

3. Autocomplete query

1. Completion Suggester

elasticsearch provides the Completion Suggester query to implement auto-completion. The query matches and returns terms that start with the text the user has entered.

To make the query efficient, there are some constraints on the fields in the document:

  • The field used for the query must be of type completion
  • The field content is generally an array of the terms to be completed

2. Syntax example

GET /test/_search
{
  "suggest": {
    "custom suggestion name": {
      "text": "YOUR TEXT",      // the user's input
      "completion":{
        "field":"field name",   // The completion field to query
        "skip_duplicates":true, // Skip duplicates
        "size":10               // Return the first 10 results
      }
    }
  }
}

Three. Implementing search auto-completion

1. Modify the original data structure

Use the custom analyzers and add a suggestion field for auto-completion:

PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_analyzer":{
          "tokenizer":"ik_max_word",
          "filter":"py"
        },
        "completion_analyzer":{
          "tokenizer":"keyword",
          "filter":"py"
        }
      },
      "filter": {
        "py":{
          "type":"pinyin",
          "keep_full_pinyin":false,
          "keep_joined_full_pinyin":true,
          "keep_original":true,
          "limit_first_letter_length":16,
          "remove_duplicated_term":true,
          "none_chinese_pinyin_tokenize":false
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "suggestion":{
        "type": "completion",
        "analyzer": "completion_analyzer"
      },
      "all":{
        "type": "text",
        "analyzer": "text_analyzer",
        "search_analyzer": "ik_smart"
      },
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "copy_to": "all", 
        "analyzer": "text_analyzer",
        "search_analyzer": "ik_smart"
      },
      "address":{
        "type": "keyword",
        "index": false
      },
      "price":{
        "type": "double"
      },
      "score":{
        "type": "integer"
      },
      "brand":{
        "type": "keyword",
        "copy_to": "all"
      },
      "city":{
        "type": "keyword",
        "copy_to": "all"
      },
      "starName":{
        "type": "keyword",
        "copy_to": "all"
      },
      "business":{
        "type": "keyword",
        "copy_to": "all"
      },
      "location":{
        "type": "geo_point"
      },
      "pic":{
        "type": "keyword",
        "index": false
      },
      "isAD":{
        "type": "boolean"
      }
    }
  }
} 

2. Reimport data

Modify the correspondence between the entity class and the document, and re-import the data

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;

    private Object distance;
    /**
     * Added auto-completion field
     */
    private List<String> suggestion;

    /**
     * Advertisement flag
     */
    private Boolean isAD;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + "," + hotel.getLongitude();
        this.pic = hotel.getPic();
        //Convert the advertisement field (isAD) from 0/1 to boolean
        this.isAD = hotel.getIsAD() == 1;
        //Handling autocomplete information
        List<String> list =new ArrayList<>();

        list.add(hotel.getName());
        list.add(hotel.getBrand());

        //If the business district information contains "/", it holds several districts and is split; otherwise it is added as a whole
        String business = hotel.getBusiness();
        if (business.contains("/")) {
            String[] businessArr = business.split("/");
            Collections.addAll(list, businessArr);
        } else {
            list.add(business);
        }

        this.setSuggestion(list);

    }
}
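
A minimal sketch of the re-import, assuming a hotelService for reading hotels from the database (e.g. a MyBatis-Plus IService) and fastjson's JSON for serialization; adapt both to your own project:

@Test
void testBulkImport() throws IOException {
    //Read all hotels from the database (hotelService is an assumed service bean)
    List<Hotel> hotels = hotelService.list();
    BulkRequest bulkRequest = new BulkRequest();
    for (Hotel hotel : hotels) {
        //Convert to the document model, which also builds the suggestion field
        HotelDoc hotelDoc = new HotelDoc(hotel);
        bulkRequest.add(new IndexRequest("hotel")
                .id(hotelDoc.getId().toString())
                .source(JSON.toJSONString(hotelDoc), XContentType.JSON));
    }
    restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
}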

Once the data import completes, test it with a DSL statement:

GET /hotel/_search
{
  "suggest": {
    "textSuggestion": {
      "text": "s",
      "completion": {
        "field": "suggestion",
        "skip_duplicates":true,
        "size":10
      }
    }
  }
}

3. RestClient implements auto-completion

The Java API again mirrors the format of the DSL. Test code implementing auto-completion:

@Test
void testSuggestion() throws IOException {
    //1. Create a SearchRequest object
    SearchRequest searchRequest = new SearchRequest("hotel");
    //2. Construct DSL statement
    searchRequest.source().suggest(
            new SuggestBuilder().addSuggestion("textSuggestion",
                    SuggestBuilders.completionSuggestion("suggestion")
                            .prefix("bj")
                            .skipDuplicates(true)
                            .size(10)
            ));
    //3. Send request
    SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    //4. Processing data
    Suggest suggest = response.getSuggest();
    CompletionSuggestion suggestion = suggest.getSuggestion("textSuggestion");

    for (CompletionSuggestion.Entry.Option option : suggestion.getOptions()) {
        String text = option.getText().toString();
        System.out.println(text);
    }
}

The response is parsed layer by layer, following the structure of the returned JSON.


That's a wrap!!!

Tags: Big Data Spring Cloud ElasticSearch
