Lucene 7 grouping with the GroupingSearch encapsulation class

1. Lucene

Lucene is an open-source full-text search toolkit. It is not a complete full-text search engine but a search engine architecture: it provides a complete query engine and index engine, plus part of a text analysis engine (for example, for English and German). Lucene's purpose is to give software developers an easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.

Lucene has four basic data types:

Index: an index, composed of many Documents.
Document: composed of many Fields; it is the smallest unit of indexing and search.
Field: has a field name and a field value; the value is made up of many Terms.
Term: consists of bytes. In general, each smallest unit produced by tokenizing the value of a text-type Field is called a Term.

In Lucene, the read and write paths are separate: create an IndexWriter to write and an IndexSearcher to read.

Apache Solr

Solr is a high-performance, Java 5-based full-text search server built on Lucene. It extends Lucene with a richer query language, is configurable and scalable, optimizes query performance, and provides a complete administration interface. It exposes an API similar to a web service: users can send an HTTP request carrying an XML file in a specified format to the search server to build an index, and can submit a search request through an HTTP GET operation and receive the results in XML format.

Apache Solr is a popular open-source search server that uses a REST-like HTTP API, which means you can use Solr from almost any programming language.

Solr is an open-source search platform for building search applications. It is built on Lucene (a full-text search library). Solr is enterprise-grade, fast, and highly scalable. Applications built with Solr can be sophisticated and still deliver high performance.

There are three essential differences between Solr and Lucene: search server, enterprise level and management. Lucene is essentially a search library, not a standalone application, while Solr is. Lucene focuses on the construction of the underlying search, while Solr focuses on enterprise applications. Lucene is not responsible for the management necessary to support the search service, while Solr is. So, to summarize Solr in one sentence: Solr is an extension of Lucene for enterprise search applications.

2. Grouping

1. Grouping introduction

When searching with Lucene, we may want statistics over the documents that meet some condition, such as counting how many distinct provinces appear. In SQL, DISTINCT serves a similar purpose, and GROUP BY groups query results by column.

The main use of grouping is to compute group statistics over the documents in a Lucene index that share the same field value.
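As a point of comparison, the effect of GROUP BY with a count can be sketched in plain Java. This is not Lucene code: the class name ProvinceCount and the sample province values are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ProvinceCount {

    /** Counts how many records fall into each province, like GROUP BY province with COUNT(*). */
    static Map<String, Long> countByProvince(List<String> provinces) {
        Map<String, Long> counts = new TreeMap<>();
        for (String province : provinces) {
            // Start at 1 for a new province, otherwise add 1 to the running count.
            counts.merge(province, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: one province value per document.
        List<String> provinces = Arrays.asList(
                "Guangdong", "Hunan", "Guangdong", "Sichuan", "Hunan", "Guangdong");
        Map<String, Long> counts = countByProvince(provinces);
        System.out.println("distinct provinces: " + counts.size());
        System.out.println("documents per province: " + counts);
    }
}
```

The number of keys plays the role of DISTINCT, and the values play the role of the per-group counts that grouping reports.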

2. Grouping parameters

groupField: the field to group by. For example, to group by province, pass in the name of the province field. Note that documents that lack groupField are collected into a group whose value is null;

groupSort: how the groups are sorted; this sort determines the order in which the groups are displayed;

topNGroups: the number of groups to collect; only groups 0 through topNGroups are kept;

groupOffset: the index of the first group to display. For example, if groupOffset is 3, groups 3 through topNGroups are displayed; this value can be used for paging;

withinGroupSort: how documents are sorted within each group;

maxDocsPerGroup: the maximum number of documents collected for each group;

withinGroupOffset: the index of the first document displayed within each group;
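How the offset and limit parameters combine can be sketched in plain Java. This is only an illustration that follows the descriptions above, not Lucene's collectors: the class GroupPagingSketch and its page method are hypothetical, and the input is assumed to be already ordered by groupSort (map order) and withinGroupSort (list order).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupPagingSketch {

    /**
     * Returns groups [groupOffset, topNGroups) and, inside each of them,
     * documents [withinGroupOffset, maxDocsPerGroup).
     */
    static Map<String, List<String>> page(Map<String, List<String>> sortedGroups,
                                          int groupOffset, int topNGroups,
                                          int withinGroupOffset, int maxDocsPerGroup) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        int groupIndex = 0;
        for (Map.Entry<String, List<String>> entry : sortedGroups.entrySet()) {
            if (groupIndex >= topNGroups) {
                break; // past the last requested group
            }
            if (groupIndex >= groupOffset) {
                List<String> docs = entry.getValue();
                // Clamp both bounds so short groups yield a short (or empty) slice.
                int from = Math.min(withinGroupOffset, docs.size());
                int to = Math.max(from, Math.min(maxDocsPerGroup, docs.size()));
                result.put(entry.getKey(), new ArrayList<>(docs.subList(from, to)));
            }
            groupIndex++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical groups: author -> document ids.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("author1", Arrays.asList("1", "2", "3"));
        groups.put("author2", Arrays.asList("4"));
        groups.put("author3", Arrays.asList("5", "6"));
        groups.put("author4", Arrays.asList("7"));

        // Second page of groups (two groups per page), at most one document per group.
        System.out.println(page(groups, 2, 4, 0, 1));
    }
}
```

With this sample data the call prints {author3=[5], author4=[7]}: the first two groups are skipped by groupOffset, and maxDocsPerGroup trims author3 to its first document.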

3. Other important parameters

Property in Sort    Property in SortField    Meaning
Sort.INDEXORDER     SortField.FIELD_DOC      Sort by index order
Sort.RELEVANCE      SortField.FIELD_SCORE    Sort by relevance score

3. Test

The API recommends two main grouping approaches: a two-pass method and a single-pass method. The encapsulation class GroupingSearch can perform either one.
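The two-pass idea can be sketched in plain Java as follows. This is a simplified illustration, not Lucene's collector classes: the Hit class, the method names, and the scores are hypothetical. The first pass picks the best groups by their top score, and the second pass collects the documents of only those groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoPassGroupingSketch {

    /** A matching document: its group value, an id, and a relevance score. */
    static final class Hit {
        final String group;
        final String id;
        final double score;

        Hit(String group, String id, double score) {
            this.group = group;
            this.id = id;
            this.score = score;
        }
    }

    /** Pass 1: track the best score per group, then keep the topNGroups best groups. */
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Double> bestScore = new HashMap<>();
        for (Hit hit : hits) {
            bestScore.merge(hit.group, hit.score, Math::max);
        }
        List<String> groups = new ArrayList<>(bestScore.keySet());
        groups.sort((a, b) -> {
            int byScore = Double.compare(bestScore.get(b), bestScore.get(a)); // higher score first
            return byScore != 0 ? byScore : a.compareTo(b); // tie-break by name for determinism
        });
        return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
    }

    /** Pass 2: for each selected group, collect its hits, best score first, capped at maxDocsPerGroup. */
    static Map<String, List<String>> secondPass(List<Hit> hits, List<String> topGroups,
                                                int maxDocsPerGroup) {
        Map<String, List<Hit>> perGroup = new LinkedHashMap<>();
        for (String group : topGroups) {
            perGroup.put(group, new ArrayList<>());
        }
        for (Hit hit : hits) {
            List<Hit> bucket = perGroup.get(hit.group);
            if (bucket != null) { // hits from groups that were not selected in pass 1 are skipped
                bucket.add(hit);
            }
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Hit>> entry : perGroup.entrySet()) {
            List<Hit> bucket = entry.getValue();
            bucket.sort((a, b) -> Double.compare(b.score, a.score));
            List<String> ids = new ArrayList<>();
            for (Hit hit : bucket.subList(0, Math.min(maxDocsPerGroup, bucket.size()))) {
                ids.add(hit.id);
            }
            result.put(entry.getKey(), ids);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("author1", "1", 0.9), new Hit("author1", "2", 0.4),
                new Hit("author2", "4", 0.7), new Hit("author3", "5", 0.5),
                new Hit("author3", "6", 0.8));
        List<String> topGroups = firstPass(hits, 2);
        System.out.println(secondPass(hits, topGroups, 10));
    }
}
```

With these sample scores the program prints {author1=[1, 2], author3=[6, 5]}. The hits are traversed twice, which is the cost the two-pass method pays for not needing any special index layout.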

Features implemented:

1. Group by author; GroupDocsLimit sets the upper limit on the number of documents in each group

2. Paging: each page interleaves documents from all groups in round-robin order (TODO: to be added later)
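Since the paging feature is still marked as a TODO, the following is only one possible reading of it, sketched in plain Java: take the first document of every group, then the second of every group, and so on, and cut the interleaved list into pages. The class RoundRobinPagingSketch, its methods, and the sample document names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RoundRobinPagingSketch {

    /** Interleaves the groups: first doc of every group, then second doc of every group, and so on. */
    static List<String> interleave(List<List<String>> groups) {
        List<String> merged = new ArrayList<>();
        int longest = 0;
        for (List<String> group : groups) {
            longest = Math.max(longest, group.size());
        }
        for (int round = 0; round < longest; round++) {
            for (List<String> group : groups) {
                if (round < group.size()) { // groups that ran out of docs are skipped
                    merged.add(group.get(round));
                }
            }
        }
        return merged;
    }

    /** Returns page pageIndex (0-based) of the interleaved list; empty if past the end. */
    static List<String> page(List<List<String>> groups, int pageIndex, int pageSize) {
        List<String> merged = interleave(groups);
        int from = Math.min(pageIndex * pageSize, merged.size());
        int to = Math.min(from + pageSize, merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("a1-doc1", "a1-doc2", "a1-doc3"),
                Arrays.asList("a2-doc1"),
                Arrays.asList("a3-doc1", "a3-doc2"));
        System.out.println(page(groups, 0, 4));
        System.out.println(page(groups, 1, 4));
    }
}
```

For the sample data the first page is [a1-doc1, a2-doc1, a3-doc1, a1-doc2] and the second is [a3-doc2, a1-doc3]. Interleaving the full result before slicing is the simplest approach; it assumes the grouped result is small enough to hold in memory.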

Test code (codeTest-easy-lucene usage)
Step 1: Create an index

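// Imports needed by the snippets in this section (Lucene 7.x)
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.grouping.GroupDocs;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.junit.Test; // assumption: JUnit 4; the original does not say which JUnit version is used

    // The fields and methods below belong inside the test class.
    // Index directory
    static String indexDir = "D:\\codeTest\\luceneTest\\easyLucene";
    static Analyzer analyzer = new StandardAnalyzer();
    // Name of the field to group on
    static String groupField = "author";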

    @Test
    public void mainTest() throws Exception{
        createIndex();
//        Directory directory = FSDirectory.open(Paths.get(indexDir));
//        IndexReader reader = DirectoryReader.open(directory);
//        IndexSearcher searcher = new IndexSearcher(reader);
//        Query query = new TermQuery(new Term("content", "random"));
//        /** sorting rules inside each group */
//        Sort groupSort = Sort.RELEVANCE;
//        groupBy(searcher, query, groupSort);
    }


    /**
     * Create an index document for testing
     *
     * @throws IOException
     */
    public static void createIndex() throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
        addDocuments(groupField, writer);
    }

    /**
     * Add index document
     *
     * @param groupField
     * @param writer
     * @throws IOException
     */
    public static void addDocuments(String groupField, IndexWriter writer)
            throws IOException {
        // 0
        Document doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "random text", Field.Store.YES));
        doc.add(new StringField("id", "1", Field.Store.YES));
        writer.addDocument(doc);

        // 1
        doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "some more random text",
                Field.Store.YES));
        doc.add(new StringField("id", "2", Field.Store.YES));
        writer.addDocument(doc);

        // 2
        doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "some more random textual data",
                Field.Store.YES));
        doc.add(new StringField("id", "3", Field.Store.YES));
        writer.addDocument(doc);

        // 3
        doc = new Document();
        addGroupField(doc, groupField, "author2");
        doc.add(new StringField("author", "author2", Field.Store.YES));
        doc.add(new TextField("content", "some random text", Field.Store.YES));
        doc.add(new StringField("id", "4", Field.Store.YES));
        writer.addDocument(doc);

        // 4
        doc = new Document();
        addGroupField(doc, groupField, "author3");
        doc.add(new StringField("author", "author3", Field.Store.YES));
        doc.add(new TextField("content", "some more random text",
                Field.Store.YES));
        doc.add(new StringField("id", "5", Field.Store.YES));
        writer.addDocument(doc);

        // 5
        doc = new Document();
        addGroupField(doc, groupField, "author3");
        doc.add(new StringField("author", "author3", Field.Store.YES));
        doc.add(new TextField("content", "random", Field.Store.YES));
        doc.add(new StringField("id", "6", Field.Store.YES));
        writer.addDocument(doc);

        // 6 -- the only document by author4
        doc = new Document();
        addGroupField(doc, groupField, "author4");
        doc.add(new StringField("author", "author4", Field.Store.YES));
        doc.add(new TextField("content",
                "random word stuck in alot of other text", Field.Store.YES));
        doc.add(new StringField("id", "7", Field.Store.YES)); // "6" is already used by the previous document
        writer.addDocument(doc);
        writer.commit();
        writer.close();
    }

    /**
     * Add the field used for grouping
     *
     * @param doc
     *            index document
     * @param groupField
     *            name of the field to group by
     * @param value
     *            field value
     */
    private static void addGroupField(Document doc, String groupField,
                                      String value) {
        //The field for grouping must be of the SortedDocValuesField type
        doc.add(new SortedDocValuesField(groupField, new BytesRef(value)));
    }

Step 2: Group the results

@Test
    public void lucene7GroupBy() throws Exception{
        GroupingSearch groupingSearch = new GroupingSearch(groupField); // field to group by
        groupingSearch.setGroupSort(new Sort(SortField.FIELD_SCORE)); // how the groups are sorted
        groupingSearch.setFillSortFields(true); // whether to populate the sort values of each group
        groupingSearch.setCachingInMB(4.0, true); // cache up to 4 MB of documents and scores for the second pass
        groupingSearch.setAllGroups(true); // also compute the total number of matching groups
        //groupingSearch.setAllGroupHeads(true);
        groupingSearch.setGroupDocsLimit(10); // maximum number of documents returned per group

        // No search term: match documents written by any of the four authors
        BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("author", "author1")), BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term
                                ("author", "author2")),
                        BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("author", "author3")), BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("author", "author4")), BooleanClause.Occur.SHOULD).build();

        //specify search term
//        Analyzer analyzer = new StandardAnalyzer();
//        QueryParser parser = new QueryParser("content", analyzer);
//        String queryExpression = "some content";
//        Query query = parser.parse(queryExpression);
        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        // Run the query; the hits are grouped by the author field.
        // The last two arguments are groupOffset (0) and groupLimit (1000).
        TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 1000);
        int totalHit = result.totalHitCount;
        //total hits
        System.out.println("total hits:"+totalHit);
        //Number of groups
        System.out.println("Number of groups:"+result.groups.length);
        //Print query results by group
        Map<String, List<Document>> groupingMap = new HashMap<>();
        for (GroupDocs<BytesRef> groupDocs : result.groups){
            List<Document> totalDoc = new ArrayList<>();
            if (groupDocs != null) {
                if (groupDocs.groupValue != null) {
                    System.out.println("group:" + groupDocs.groupValue.utf8ToString());
                }else{
                    // Documents indexed without a SortedDocValuesField for the group field
                    // end up in a group whose groupValue is null
                    System.out.println("group:" + "unknown");
                }
                System.out.println("The number of data in the group:" + groupDocs.totalHits);

                ScoreDoc[] scoreDocs = groupDocs.scoreDocs;
                int maxCount = Math.min(totalHit, scoreDocs.length);
                for(int i = 0; i < maxCount; i++){
                    Document document = searcher.doc(scoreDocs[i].doc);
                    totalDoc.add(document);
                }
                groupingMap.put(totalDoc.get(0).get("author"), totalDoc);
                for(ScoreDoc scoreDoc : groupDocs.scoreDocs){
                    System.out.println("author:" + searcher.doc(scoreDoc.doc).get("author"));
                    System.out.println("content:" + searcher.doc(scoreDoc.doc).get("content"));
                    System.out.println();
                }
                System.out.println("=====================================");
            }
        }
        reader.close(); // release the index reader once the results have been printed
    }

Posted by sebastienp on Wed, 18 Jan 2023 17:31:35 +0530