SpringCloud (9) - Elasticsearch aggregation and auto-completion
1 Data Aggregation
1. Classification of aggregation
Aggregations let you compute statistics over and analyze document data. There are three common categories of aggregation:
1. Bucket aggregation (Bucket)
Note: fields of type text do not support bucket aggregation.
Bucket aggregation is used to group documents into buckets. The common types are:
- terms: groups by a field's value, similar to GROUP BY in MySQL
- date_histogram: groups by date interval, e.g. one bucket per week or per month
2. Metric aggregation (Metric)
Note: fields of type text and keyword do not support metric aggregation.
Metric aggregation is used to compute values over a field, such as max, min and average. The common types are:
- avg: average value
- max: maximum value
- min: minimum value
- sum: sum
- stats: computes avg, max, min, sum and count all at once
3. Pipeline aggregation (pipeline)
Pipeline aggregation aggregates over the results of other aggregations.
A field participating in aggregation cannot be of type text; it is generally keyword, date, boolean, integer, etc.
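The terms and stats aggregations are demonstrated in the sections below, but pipeline aggregation is not, so here is a hedged sketch against the same hotel index used throughout this post: a sibling avg_bucket pipeline that averages the per-brand average price (the aggregation names brandAgg and avgPrice are arbitrary).

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": { "field": "brand", "size": 20 },
      "aggs": {
        "avgPrice": { "avg": { "field": "price" } }
      }
    },
    "avgOfBrandAvgs": {
      "avg_bucket": { "buckets_path": "brandAgg>avgPrice" }
    }
  }
}

The buckets_path expression points the pipeline at the avgPrice sub-aggregation inside each brandAgg bucket.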
2. DSL implementation of Bucket aggregation
Here is an example syntax:
GET /[indexName]/_search
{
  "size": 0,        // page size; defaults to 10. Setting it to 0 returns only the aggregation results, without any documents
  "aggs": {
    "aggName": {    // aggregation name, customizable
      "terms": {    // aggregation type, usually terms
        "field": "fieldName",   // the field to aggregate on
        "size": 20              // the number of buckets to return
      }
    }
  }
}
By default, Bucket aggregation counts the documents in each bucket as _count and sorts the buckets by _count in descending order. To change this, add an order attribute with the desired sorting rule. The following example aggregates by brand:
GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order": {
          "_count": "asc"
        }
      }
    }
  }
}
By default, Bucket aggregation runs over every document in the index, which can consume a lot of memory. We can limit the scope of the aggregation by adding a query. For example, to aggregate only the documents whose price is in the range 200-300:
GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 200,
        "lte": 300
      }
    }
  },
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order": {
          "_count": "asc"
        }
      }
    }
  }
}
3. DSL implementation of Metric aggregation
Use the stats aggregation as a sub-aggregation to compute metrics over a specified field within each bucket:
GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      },
      "aggs": {              // a sub-aggregation of brandAgg, computed separately for each bucket after grouping
        "score_stats": {     // aggregation name
          "stats": {         // aggregation type; could also be min, max, avg, etc.
            "field": "score" // the aggregated field must be numeric, since only numbers can be added, subtracted, multiplied and divided
          }
        }
      }
    }
  }
}
Running this DSL returns, inside each brand bucket, the count, min, max, avg and sum of score, all computed by stats at the same time.
To sort the buckets by one of the aggregated values, reference it through the score_stats sub-aggregation. For example, to sort by the maximum score:
GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order": {
          "score_stats.max": "desc"
        }
      },
      "aggs": {
        "score_stats": {
          "stats": {
            "field": "score"
          }
        }
      }
    }
  }
}
4. RestClient implementation of aggregation
The Java API mirrors the DSL structure one-to-one. Here is a test example:
@Test
public void testAggregationBrand() throws IOException {
    // 1. Create a SearchRequest and specify the index name
    SearchRequest request = new SearchRequest("hotel");
    // 2. Build the DSL: size 0 so only aggregation results are returned, no documents
    request.source().size(0);
    request.source().aggregation(
            AggregationBuilders
                    // aggregation type terms, named "brandAgg"
                    .terms("brandAgg")
                    // the field to aggregate on
                    .field("brand")
                    // the number of buckets to return
                    .size(20)
                    // sort the buckets by _count ascending
                    .order(BucketOrder.aggregation("_count", true))
    );
    // 3. Send the request
    SearchResponse response = restHighLevelClient.search(request, RequestOptions.DEFAULT);
    // 4. Get all aggregation results, then ours by its name
    Aggregations aggregations = response.getAggregations();
    Terms brandTerms = aggregations.get("brandAgg");
    // get the buckets and traverse them
    List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        String brandName = bucket.getKeyAsString();
        long docCount = bucket.getDocCount();
        System.out.println(brandName + "," + docCount);
    }
}
Note: Use AggregationBuilders to build an aggregation object
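The DSL in the previous section nests a stats sub-aggregation inside each bucket; here is a hedged sketch of how that could be built and parsed with the same client (the test name and structure are mine, but every call is standard RestHighLevelClient API):

@Test
public void testAggregationStats() throws IOException {
    SearchRequest request = new SearchRequest("hotel");
    request.source().size(0);
    // terms aggregation with a nested stats sub-aggregation, mirroring the earlier DSL
    request.source().aggregation(
            AggregationBuilders.terms("brandAgg").field("brand").size(20)
                    .subAggregation(AggregationBuilders.stats("score_stats").field("score"))
    );
    SearchResponse response = restHighLevelClient.search(request, RequestOptions.DEFAULT);
    Terms brandTerms = response.getAggregations().get("brandAgg");
    for (Terms.Bucket bucket : brandTerms.getBuckets()) {
        // fetch the sub-aggregation from the bucket by its name
        Stats scoreStats = bucket.getAggregations().get("score_stats");
        System.out.println(bucket.getKeyAsString()
                + ": max=" + scoreStats.getMax()
                + ", avg=" + scoreStats.getAvg());
    }
}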
2 Auto-completion
1. Using the pinyin tokenizer
GitHub address: elasticsearch-analysis-pinyin
Installation steps:
- Download the elasticsearch-analysis-pinyin release matching your ES version (this post uses v7.12.1)
- Unzip and upload it to the plugin directory mounted by the ES container (the same directory as the IK tokenizer)
- Restart the ES container
- Test:
GET /_analyze
{
  "text": ["Home Inn"],
  "analyzer": "pinyin"
}
If the result comes back split into pinyin terms, the tokenizer is installed successfully.
2. Custom analyzers
1. The composition of an analyzer
- character filters: process the text before the tokenizer runs, e.g. deleting or replacing characters
- tokenizer: cuts the text into terms according to certain rules; for example, the keyword tokenizer emits the whole input as a single term
- filter (token filter): further processes the terms output by the tokenizer, e.g. case conversion, synonyms, pinyin conversion
Not all three parts are required in a custom analyzer; it depends on the actual business need. The example in the next section uses only a tokenizer and a filter, with no character filter.
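For contrast, here is a hedged sketch of an analyzer that does use all three parts, built only from ES built-ins (the html_strip character filter, the standard tokenizer and the lowercase filter); the index name demo is made up:

PUT /demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "three_part_analyzer": {
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}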
2. Implementing a custom analyzer
When creating an index, configure the custom analyzer under settings.
Because the custom analyzer is defined when the index is created, it only takes effect for that index.
Here is the syntax:
PUT /[indexName]
{
  "settings": {
    "analysis": {
      "analyzer": {
        "customAnalyzerName": {
          "tokenizer": "tokenizerName",
          "filter": "filterName"
        }
      }
    }
  }
}
Implementation example:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "pinyin"
        }
      }
    }
  }
}
This default configuration converts Chinese characters into pinyin one character at a time, which is not what we want. The plugin documentation describes parameters to tune this behavior. Here is an example:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {                 // a custom filter based on the pinyin plugin; below are some of its parameters
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  }
}
- my_analyzer: the name of the custom analyzer
- py: the name of the custom token filter
3. Precautions for pinyin word segmentation
The pinyin filter should be applied when building the inverted index, but not at search time, otherwise homophones would match each other. So the field should use the custom analyzer for indexing (analyzer) and ik_smart for searching (search_analyzer):
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
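As a hedged illustration (the two documents are made up), index the homophones 狮子 (lion) and 虱子 (louse) and then search. Because search_analyzer is ik_smart, the Chinese query 狮子 matches only document 1, while a pinyin query such as "shizi" would still match both through the indexed pinyin terms:

POST /test/_doc/1
{
  "name": "狮子"
}

POST /test/_doc/2
{
  "name": "虱子"
}

GET /test/_search
{
  "query": {
    "match": {
      "name": "狮子"
    }
  }
}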
3. Autocomplete query
1. Completion Suggester
Elasticsearch provides the Completion Suggester query for auto-completion. It matches terms that start with the text the user has typed and returns them.
To make this query efficient, there are some constraints on the fields in the document:
- the queried field must be of type completion
- the field's content is generally an array of terms to be used as completion candidates
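For instance, a hedged sketch of a completion field with some candidate data (the test2 index and its documents are invented for illustration):

PUT /test2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

POST /test2/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}

POST /test2/_doc
{
  "title": ["SK-II", "PITERA"]
}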
2. Grammar example
GET /test/_search
{
  "suggest": {
    "customSuggestName": {
      "text": "YOUR TEXT",        // the user's input
      "completion": {
        "field": "fieldName",     // the completion field to query
        "skip_duplicates": true,  // skip duplicates
        "size": 10                // get the first 10 results
      }
    }
  }
}
3 Implementing search auto-completion
1. Modify the original data structure
Use the custom analyzers and add a suggestion field of type completion for auto-completion:
PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "suggestion": {
        "type": "completion",
        "analyzer": "completion_analyzer"
      },
      "all": {
        "type": "text",
        "analyzer": "text_analyzer",
        "search_analyzer": "ik_smart"
      },
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "copy_to": "all",
        "analyzer": "text_analyzer",
        "search_analyzer": "ik_smart"
      },
      "address": {
        "type": "keyword",
        "index": false
      },
      "price": {
        "type": "double"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword",
        "copy_to": "all"
      },
      "city": {
        "type": "keyword",
        "copy_to": "all"
      },
      "starName": {
        "type": "keyword",
        "copy_to": "all"
      },
      "business": {
        "type": "keyword",
        "copy_to": "all"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      },
      "isAD": {
        "type": "boolean"
      }
    }
  }
}
2. Re-import the data
Update the mapping between the entity class and the document, then re-import the data:
@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    /**
     * auto-completion candidates
     */
    private List<String> suggestion;
    /**
     * advertisement flag
     */
    private Boolean isAD;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + "," + hotel.getLongitude();
        this.pic = hotel.getPic();
        // convert the advertisement field (isAD) to a boolean
        this.isAD = hotel.getIsAD() == 1;
        // build the auto-completion candidates
        List<String> list = new ArrayList<>();
        list.add(hotel.getName());
        list.add(hotel.getBrand());
        // a business field containing "/" holds several districts, so split it
        String business = hotel.getBusiness();
        if (business.contains("/")) {
            Collections.addAll(list, business.split("/"));
        } else {
            // otherwise add the single district as-is
            list.add(business);
        }
        this.setSuggestion(list);
    }
}
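The re-import itself is not shown in the original; here is a hedged sketch using the bulk API (the hotelService that loads hotels from the database and the fastjson JSON helper are assumptions carried over from earlier posts in this series):

@Test
void testBulkImport() throws IOException {
    // assumption: hotelService.list() loads all hotels from the database
    List<Hotel> hotels = hotelService.list();
    BulkRequest bulkRequest = new BulkRequest();
    for (Hotel hotel : hotels) {
        // convert to the document shape, including the new suggestion field
        HotelDoc hotelDoc = new HotelDoc(hotel);
        bulkRequest.add(new IndexRequest("hotel")
                .id(hotelDoc.getId().toString())
                .source(JSON.toJSONString(hotelDoc), XContentType.JSON));
    }
    restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
}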
Once the import is complete, test with a DSL statement:
GET /hotel/_search
{
  "suggest": {
    "textSuggestion": {
      "text": "s",
      "completion": {
        "field": "suggestion",
        "skip_duplicates": true,
        "size": 10
      }
    }
  }
}
3. RestClient implementation of auto-completion
The Java API again mirrors the DSL structure. Test code to perform the auto-completion:
@Test
void testSuggestion() throws IOException {
    // 1. Create a SearchRequest
    SearchRequest searchRequest = new SearchRequest("hotel");
    // 2. Build the DSL: a completion suggestion on the "suggestion" field
    searchRequest.source().suggest(
            new SuggestBuilder().addSuggestion("textSuggestion",
                    SuggestBuilders.completionSuggestion("suggestion")
                            .prefix("bj")
                            .skipDuplicates(true)
                            .size(10)
            ));
    // 3. Send the request
    SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    // 4. Parse the result: suggestion name -> options -> text
    Suggest suggest = response.getSuggest();
    CompletionSuggestion suggestion = suggest.getSuggestion("textSuggestion");
    for (CompletionSuggestion.Entry.Option option : suggestion.getOptions()) {
        String text = option.getText().toString();
        System.out.println(text);
    }
}
Parsing the result is simply a matter of walking the JSON response layer by layer.
That's a wrap!!!