1. Lucene
Lucene is an open source full-text search toolkit. It is not a complete full-text search engine but a full-text search engine architecture: it provides a complete query engine and index engine, plus part of a text analysis engine (for two Western languages, English and German). Its purpose is to give software developers an easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.
Lucene has four basic data types:
Index: an index, composed of many Documents.
Document: composed of many Fields; it is the smallest unit of indexing and searching.
Field: composed of many Terms; it has a Field Name and a Field Value.
Term: composed of many bytes. In general, each smallest unit produced by tokenizing the Field Value of a Text-type field is called a Term.
In Lucene, the read and write paths are separate: create an IndexWriter for writing and an IndexSearcher for reading.
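To make the relationship between Index, Document, Field and Term concrete, here is a minimal plain-Java model of an inverted index. It is only a sketch: `TinyIndex`, its methods and the naive tokenizer are illustrative names that are not part of Lucene, and real Lucene analysis and storage are far more elaborate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of Index -> Document -> Field -> Term (hypothetical, not Lucene's classes).
class TinyIndex {
    // Each document is a map from field name to field value.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Postings: "field:term" -> sorted IDs of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Write path: store the document and index every term of every field.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String term : tokenize(field.getValue())) {
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                        .add(docId);
            }
        }
        return docId;
    }

    // Read path: look up the IDs of the documents that contain a term in a field.
    List<Integer> search(String field, String term) {
        TreeSet<Integer> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // Naive tokenizer (an assumption): lower-case, then split on anything
    // that is not a letter or a digit.
    static List<String> tokenize(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

The write path (`addDocument`) and the read path (`search`) are separate methods, loosely mirroring the IndexWriter and IndexSearcher split.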
Apache Solr
Solr is a high-performance full-text search server that is Java 5-based and built on Lucene. It extends Lucene with a richer query language, is configurable and scalable, optimizes query performance, and provides a complete management interface. It exposes an API similar to a web service: users can send an HTTP request that submits an XML file in a defined format to the search server to build the index, and can send a search request with an HTTP GET and receive the results in XML format.
Apache Solr is a popular open source search server with a REST-like HTTP API, so it can be used from almost any programming language.
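As a rough illustration of the HTTP API, the sketch below only builds the URL of a search request against Solr's `/select` handler. The host, port and the core name `articles` are placeholder assumptions for a local installation; `q`, `rows` and `wt` are standard Solr query parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SolrSelectUrl {
    // Builds a select URL. baseUrl and core are placeholders for your own deployment.
    static String build(String baseUrl, String core, String query, int rows) {
        return baseUrl + "/" + core + "/select"
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8) // URL-encode the query
                + "&rows=" + rows   // maximum number of results
                + "&wt=xml";        // ask for an XML response
    }

    public static void main(String[] args) {
        // "articles" is a hypothetical core name.
        System.out.println(build("http://localhost:8983/solr", "articles", "content:random", 10));
    }
}
```

The resulting URL can be fetched with any HTTP client, which is why Solr is usable from almost any language.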
Solr is an open source search platform for building search applications. It builds on Lucene, the full-text search library. Solr is enterprise-grade, fast, and highly scalable, and applications built with it can be sophisticated while still delivering high performance.
There are three essential differences between Solr and Lucene: search server, enterprise focus, and management. Lucene is essentially a search library rather than a standalone application, whereas Solr is a standalone search server. Lucene focuses on the underlying search machinery, while Solr focuses on enterprise applications. Lucene does not handle the management needed to run a search service, while Solr does. In one sentence: Solr is an extension of Lucene for enterprise search applications.
2. Grouping
1. Grouping introduction
When searching with Lucene, we may need statistics over the data that matches some condition, such as counting how many distinct provinces there are. In SQL, DISTINCT does something similar, and GROUP BY groups the query result by column.
The main use of grouping is to compute group statistics over different Documents in Lucene that share the same field value.
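The SQL analogy can be sketched in plain Java with the standard `Collectors.groupingBy`. This is not Lucene code; `GroupCountDemo` is a hypothetical helper and the province names used with it are made-up sample data.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GroupCountDemo {
    // Similar in spirit to: SELECT province, COUNT(*) FROM t GROUP BY province
    static Map<String, Long> countByValue(List<String> fieldValues) {
        return fieldValues.stream()
                .collect(Collectors.groupingBy(v -> v, TreeMap::new, Collectors.counting()));
    }

    // Similar in spirit to: SELECT COUNT(DISTINCT province) FROM t
    static int distinctCount(List<String> fieldValues) {
        return countByValue(fieldValues).size();
    }

    public static void main(String[] args) {
        List<String> provinces = List.of("Hunan", "Hubei", "Hunan", "Guangdong"); // sample data
        System.out.println(countByValue(provinces));
        System.out.println(distinctCount(provinces));
    }
}
```

Lucene grouping plays the same role for search hits: documents with the same value in the group field end up in the same group.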
2. Grouping parameters
groupField: the field to be grouped; for example, if we group provinces, we need to pass in the corresponding value of province. It should be noted that if groupField does not exist in the document, a null group will be returned;
groupSort: How the groups are sorted, the sort field determines the order in which the group content is displayed;
topNGroups: the number of groups to display; only groups 0 to topNGroups are collected;
groupOffset: the group position to start from; for example, if groupOffset is 3, the groups from 3 to topNGroups are displayed, so this value can be used for paginated queries;
withinGroupSort: how to sort within each group;
maxDocsPerGroup: how many Documents are processed in each group;
withinGroupOffset: the initial position of the document displayed by each group;
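The offset and limit parameters above can be read as a window over an already sorted list: groups from groupOffset up to topNGroups, and, inside each group, documents from withinGroupOffset up to maxDocsPerGroup. The following plain-Java sketch models that reading; `GroupWindow` is a hypothetical helper, not a Lucene class.

```java
import java.util.ArrayList;
import java.util.List;

class GroupWindow {
    // Returns the items at positions [offset, topN) of a sorted list,
    // clamped so that out-of-range values yield a shorter (or empty) list.
    static <T> List<T> window(List<T> sorted, int offset, int topN) {
        int end = Math.max(0, Math.min(topN, sorted.size()));
        int start = Math.min(Math.max(offset, 0), end);
        return new ArrayList<>(sorted.subList(start, end));
    }

    public static void main(String[] args) {
        List<String> groups = List.of("g0", "g1", "g2", "g3", "g4");
        // groupOffset = 3, topNGroups = 5: the second "page" of groups.
        System.out.println(window(groups, 3, 5));
    }
}
```

The same helper applies to both levels: `window(groups, groupOffset, topNGroups)` selects the groups, and `window(docsOfGroup, withinGroupOffset, maxDocsPerGroup)` selects the documents shown inside one group.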
3. Other important parameters
| Properties in Sort | Properties in SortField | Meaning |
|---|---|---|
| Sort.INDEXORDER | SortField.FIELD_DOC | Sort by index order |
| Sort.RELEVANCE | SortField.FIELD_SCORE | Sort by relevance score |
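The two orderings in the table can be modeled with ordinary comparators. This is a plain-Java sketch rather than Lucene code: `Hit` is a hypothetical stand-in for a scored document, and breaking score ties by document ID is an assumption of this model.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortOrderDemo {
    // Hypothetical stand-in for a search hit: document ID plus relevance score.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Index order: ascending document ID.
    static final Comparator<Hit> INDEX_ORDER = Comparator.comparingInt(h -> h.docId);

    // Relevance: descending score; ties fall back to ascending document ID.
    static final Comparator<Hit> RELEVANCE = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
    };

    // Sorts a copy of the hits and returns the document IDs in that order.
    static List<Integer> docIds(List<Hit> hits, Comparator<Hit> order) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(order);
        List<Integer> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.docId);
        }
        return ids;
    }
}
```

In grouping, the same kind of ordering is chosen twice: once for the groups (groupSort) and once for the documents inside each group (withinGroupSort).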
3. Test
The API recommends two main grouping approaches: a two-pass method and a single-pass method. The wrapper class GroupingSearch can perform either of them.
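The two-pass idea can be modeled in plain Java: a first pass over the hits picks the best groups, and a second pass collects the best documents for only those groups. The sketch below is a simplified model, not Lucene's collectors; `TwoPassGrouping` and `Hit` are hypothetical names, and ranking a group by its best-scoring hit is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoPassGrouping {
    // Hypothetical hit: group value, document ID and relevance score.
    static final class Hit {
        final String group;
        final int docId;
        final float score;

        Hit(String group, int docId, float score) {
            this.group = group;
            this.docId = docId;
            this.score = score;
        }
    }

    // Pass 1: rank groups by their best-scoring hit and keep the top N.
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Float> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.group, h.score, Float::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> Float.compare(best.get(b), best.get(a)));
        return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
    }

    // Pass 2: for the selected groups only, keep the top documents by score.
    static Map<String, List<Integer>> secondPass(List<Hit> hits, List<String> groups,
                                                 int maxDocsPerGroup) {
        Map<String, List<Hit>> byGroup = new LinkedHashMap<>();
        for (String g : groups) {
            byGroup.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> bucket = byGroup.get(h.group);
            if (bucket != null) { // hits of unselected groups are skipped
                bucket.add(h);
            }
        }
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Hit>> e : byGroup.entrySet()) {
            List<Hit> docs = e.getValue();
            docs.sort((a, b) -> Float.compare(b.score, a.score));
            List<Integer> ids = new ArrayList<>();
            for (Hit h : docs.subList(0, Math.min(maxDocsPerGroup, docs.size()))) {
                ids.add(h.docId);
            }
            result.put(e.getKey(), ids);
        }
        return result;
    }
}
```

A single-pass method would instead have to decide group membership and per-group documents in one traversal.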
Features implemented:
1. Group by author; GroupDocsLimit sets the upper limit on the number of documents in each group
2. Paging: each page interleaves documents from all groups in round-robin order (TODO: to be added later)
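One possible reading of the round-robin paging item is sketched below in plain Java: take the first document of every group, then the second of every group, and so on, and cut the merged list into pages. `RoundRobinPager` is a hypothetical helper, and this interpretation of the to-do item is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class RoundRobinPager {
    // Interleaves the groups: first item of each group, then second item of each group, ...
    static <T> List<T> interleave(List<List<T>> groups) {
        int longest = 0;
        for (List<T> g : groups) {
            longest = Math.max(longest, g.size());
        }
        List<T> merged = new ArrayList<>();
        for (int i = 0; i < longest; i++) {
            for (List<T> g : groups) {
                if (i < g.size()) {
                    merged.add(g.get(i));
                }
            }
        }
        return merged;
    }

    // Returns one page of the interleaved list; pageIndex is zero-based and assumed non-negative.
    static <T> List<T> page(List<List<T>> groups, int pageIndex, int pageSize) {
        List<T> merged = interleave(groups);
        int start = Math.min(pageIndex * pageSize, merged.size());
        int end = Math.min(start + pageSize, merged.size());
        return new ArrayList<>(merged.subList(start, end));
    }
}
```

This version re-merges all groups for every page, which is acceptable for small result sets but would need a cursor per group for large ones.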
Test code (codeTest-easy-lucene usage)
Step 1: Create an index
    // Imports needed by this fragment (Lucene 7.x and JUnit 4 are assumed).
    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.SortedDocValuesField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;
    import org.junit.Test;

    // The members below belong inside the test class.

    // Index directory
    static String indexDir = "D:\\codeTest\\luceneTest\\easyLucene";
    static Analyzer analyzer = new StandardAnalyzer();
    // Name of the field to group on
    static String groupField = "author";

    @Test
    public void mainTest() throws Exception {
        createIndex();
        // Directory directory = FSDirectory.open(Paths.get(indexDir));
        // IndexReader reader = DirectoryReader.open(directory);
        // IndexSearcher searcher = new IndexSearcher(reader);
        // Query query = new TermQuery(new Term("content", "random"));
        // /** sorting rules inside each group */
        // Sort groupSort = Sort.RELEVANCE;
        // groupBy(searcher, query, groupSort);
    }

    /**
     * Create the index documents used for testing.
     *
     * @throws IOException
     */
    public static void createIndex() throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        // CREATE_OR_APPEND keeps an existing index, so running this twice
        // adds the same documents again.
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
        addDocuments(groupField, writer);
    }

    /**
     * Add the index documents, then commit and close the writer.
     *
     * @param groupField name of the field to group on
     * @param writer     writer for the target index
     * @throws IOException
     */
    public static void addDocuments(String groupField, IndexWriter writer) throws IOException {
        // 0
        Document doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "random text", Field.Store.YES));
        doc.add(new StringField("id", "1", Field.Store.YES));
        writer.addDocument(doc);

        // 1
        doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "some more random text", Field.Store.YES));
        doc.add(new StringField("id", "2", Field.Store.YES));
        writer.addDocument(doc);

        // 2
        doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "some more random textual data", Field.Store.YES));
        doc.add(new StringField("id", "3", Field.Store.YES));
        writer.addDocument(doc);

        // 3
        doc = new Document();
        addGroupField(doc, groupField, "author2");
        doc.add(new StringField("author", "author2", Field.Store.YES));
        doc.add(new TextField("content", "some random text", Field.Store.YES));
        doc.add(new StringField("id", "4", Field.Store.YES));
        writer.addDocument(doc);

        // 4
        doc = new Document();
        addGroupField(doc, groupField, "author3");
        doc.add(new StringField("author", "author3", Field.Store.YES));
        doc.add(new TextField("content", "some more random text", Field.Store.YES));
        doc.add(new StringField("id", "5", Field.Store.YES));
        writer.addDocument(doc);

        // 5
        doc = new Document();
        addGroupField(doc, groupField, "author3");
        doc.add(new StringField("author", "author3", Field.Store.YES));
        doc.add(new TextField("content", "random", Field.Store.YES));
        doc.add(new StringField("id", "6", Field.Store.YES));
        writer.addDocument(doc);

        // 6 (the original reused id "6" here; "7" keeps the ids unique)
        doc = new Document();
        addGroupField(doc, groupField, "author4");
        doc.add(new StringField("author", "author4", Field.Store.YES));
        doc.add(new TextField("content", "random word stuck in alot of other text", Field.Store.YES));
        doc.add(new StringField("id", "7", Field.Store.YES));
        writer.addDocument(doc);

        writer.commit();
        writer.close();
    }

    /**
     * Add the group field to a document.
     *
     * @param doc        index document
     * @param groupField name of the field to group on
     * @param value      field value
     */
    private static void addGroupField(Document doc, String groupField, String value) {
        // The field used for grouping must be a SortedDocValuesField
        doc.add(new SortedDocValuesField(groupField, new BytesRef(value)));
    }
Step 2: Group the results
    // Imports this method needs in addition to those used for index creation
    // (Lucene 7.x is assumed).
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.grouping.GroupDocs;
    import org.apache.lucene.search.grouping.GroupingSearch;
    import org.apache.lucene.search.grouping.TopGroups;

    @Test
    public void lucene7GroupBy() throws Exception {
        // Specify the field to group by
        GroupingSearch groupingSearch = new GroupingSearch(groupField);
        // Specify how the groups are sorted
        groupingSearch.setGroupSort(new Sort(SortField.FIELD_SCORE));
        // Whether to populate the sortValues of SearchGroup
        groupingSearch.setFillSortFields(true);
        groupingSearch.setCachingInMB(4.0, true);
        groupingSearch.setAllGroups(true);
        // groupingSearch.setAllGroupHeads(true);
        // Maximum number of documents in a group
        groupingSearch.setGroupDocsLimit(10);

        // No search term specified: match documents of any of the four authors
        BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("author", "author1")), BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("author", "author2")), BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("author", "author3")), BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("author", "author4")), BooleanClause.Occur.SHOULD)
                .build();

        // Alternative: specify a search term on the content field
        // Analyzer analyzer = new StandardAnalyzer();
        // QueryParser parser = new QueryParser("content", analyzer);
        // String queryExpression = "some content";
        // Query query = parser.parse(queryExpression);

        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Run the query; the results are grouped by the value of the author field
        TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 1000);
        int totalHit = result.totalHitCount;
        System.out.println("total hits:" + totalHit);
        System.out.println("Number of groups:" + result.groups.length);

        // Print query results by group
        Map<String, List<Document>> groupingMap = new HashMap<>();
        for (GroupDocs<BytesRef> groupDocs : result.groups) {
            List<Document> totalDoc = new ArrayList<>();
            if (groupDocs != null) {
                if (groupDocs.groupValue != null) {
                    System.out.println("group:" + groupDocs.groupValue.utf8ToString());
                } else {
                    // Documents indexed without a SortedDocValuesField for the
                    // group field end up in a group whose groupValue is null
                    System.out.println("group:" + "unknown");
                }
                System.out.println("The number of data in the group:" + groupDocs.totalHits);

                ScoreDoc[] scoreDocs = groupDocs.scoreDocs;
                int maxCount = Math.min(totalHit, scoreDocs.length);
                for (int i = 0; i < maxCount; i++) {
                    Document document = searcher.doc(scoreDocs[i].doc);
                    totalDoc.add(document);
                }
                // Guard against an empty group before reading its first document
                if (!totalDoc.isEmpty()) {
                    groupingMap.put(totalDoc.get(0).get("author"), totalDoc);
                }

                for (ScoreDoc scoreDoc : groupDocs.scoreDocs) {
                    Document document = searcher.doc(scoreDoc.doc);
                    System.out.println("author:" + document.get("author"));
                    System.out.println("content:" + document.get("content"));
                    System.out.println();
                }
                System.out.println("=====================================");
            }
        }

        // Release the index resources
        reader.close();
        directory.close();
    }