Using Java to write a search engine system: now that's impressive...

Preface

With our small server we obviously can't build a Baidu- or Sogou-style engine; those are whole-web search engines. What we build here is an in-site search, which is feasible: it searches only the resources within a single website.

I How a search engine searches

A search engine is like a little bee gathering honey every day: it crawls all kinds of web pages and then builds an index so we can search them.

Here we could write a Python crawler, or simply download a zipped copy of the documentation. We'll use the downloaded zip; it's much faster. I wanted to build an index of League of Legends heroes, but I couldn't find a usable data set. If I find one later, I'll share it.

I suggest you don't crawl sites at will (you could get into legal trouble; though back when we were practicing, we crawled our own school's official website freely). So why do we need an index at all?

Because there is far too much crawled data. Without an index, could we just traverse every document on each query? The time complexity is too high: a linear scan touches every document's full text for every single search.

So we need to build indexes. There are two kinds: the forward index and the inverted index.

Take LOL as an example. The forward index maps a document to its contents. When we mention the skills of the Limitless Swordsman (Master Yi), we can list:

  • Q skill: Alpha Strike
  • W skill: Meditate
  • E skill: Wuju Style
  • R skill: Highlander

So the forward index goes from the hero's name to his skills.

The inverted index is the reverse: given a feature, it tells us who has it. For example, which heroes in LOL carry a sword:

  1. Tryndamere
  2. The Limitless Swordsman (Master Yi)
  3. Fiora

So the inverted index goes from a characteristic to the heroes that have it.
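
To make this concrete, here is a minimal sketch of the two index structures in Java (the hero data and the class name are purely illustrative, not part of the project code):

import java.util.*;

public class IndexSketch {
    // Forward index: document (hero) -> its contents (features)
    private final Map<String, List<String>> forward = new HashMap<>();
    // Inverted index: feature -> the documents (heroes) that contain it
    private final Map<String, List<String>> inverted = new HashMap<>();

    public void add(String hero, List<String> features) {
        forward.put(hero, features);
        for (String feature : features) {
            inverted.computeIfAbsent(feature, k -> new ArrayList<>()).add(hero);
        }
    }

    public static void main(String[] args) {
        IndexSketch idx = new IndexSketch();
        idx.add("Master Yi", Arrays.asList("sword", "melee"));
        idx.add("Tryndamere", Arrays.asList("sword", "melee"));
        System.out.println(idx.forward.get("Master Yi")); // forward: hero -> features
        System.out.println(idx.inverted.get("sword"));    // inverted: feature -> heroes
    }
}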

II Module division

1. Index module

1) Scan the downloaded documents, analyze their contents, and build the forward index and the inverted index; then save the index contents to a file.

2) Load the index that was built, and provide APIs for looking up the forward index and the inverted index (a minimal interface sketch follows after this section).

2. Search module

1) Call the index module to implement a complete search process.

Input: the user's query words. Output: the complete search results.

3. Web module

We also need a simple web program that interacts with users through web pages. It contains both the front end and the back end.
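
As a rough sketch of how the modules could fit together (the interface names and signatures here are my assumptions, not the project's final API):

// Illustrative interfaces only; names and signatures are assumptions
interface Index {
    void addDoc(String title, String url, String content); // used while building
    String getDocInfo(int docId);   // forward lookup: doc id -> document info
    int[] getInverted(String term); // inverted lookup: term -> doc ids
    void save();                    // persist the in-memory index to a file
    void load();                    // load the index file back into memory
}

interface Searcher {
    // Input: the user's query words; output: the complete search results
    String search(String query);
}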

III How to implement word segmentation

Principles of word segmentation:

1. Dictionary-based

Try to enumerate all possible words and put them in a dictionary file.

2. Statistics-based

Collect a large corpus and annotate it manually, so we learn which characters have a high probability of appearing together as a word.

There are also many third-party Java libraries that implement word segmentation.

For example, ansj (you may well have heard of it, ha ha) is a third-party word segmentation library available from the Maven central repository.

Download the latest version and add it to the pom.
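
A minimal sketch of the dependency declaration (the version number is only illustrative; check Maven Central for the current release):

<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <!-- illustrative version; use the latest from Maven Central -->
    <version>5.1.6</version>
</dependency>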

Run it directly in the test package: we use the following test code to try out how the library works.

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import java.util.List;

public class TestAnsj {
    public static void main(String[] args) {
        String str = "Master Yi is an assassin and warrior hero with high mobility. He is good at quickly defeating his opponents with quick blows. As the last successor of limitless Kendo, Master Yi can quickly cut a lot of damage. At the same time, he can also use his skills to avoid fierce attacks and the enemy's fire.";
        // ToAnalysis.parse segments the string; getTerms returns the token list
        List<Term> terms = ToAnalysis.parse(str).getTerms();
        for (Term term : terms) {
            System.out.println(term.getName()); // print each token on its own line
        }
    }
}
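
Running this prints one token per line. Note that term.getName() returns only the word itself; ansj also attaches part-of-speech information to each Term, which we don't need here.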

IV File reading

Copy the path of the downloaded documents into a String and keep it as a constant.

This step traverses the directory tree to collect all the HTML files. We use recursion: if the current entry is an ordinary file, add it to the file list; if it is a directory, recurse into it and keep adding the files inside.

import java.io.File;
import java.util.ArrayList;

// Read the documents we just downloaded
public class Parser {
    private static final String INPUT_PATH = "D:/test/docs/api";

    public void run() {
        // Entry point of the whole Parser class
        // 1. List all the documents (html) under the path
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        System.out.println(fileList);
        System.out.println(fileList.size());
        // 2. For each file listed above, open it, read its contents, and parse them
        // 3. Save the index data structure constructed in memory to the specified file
    }

    // First parameter: where to start the traversal; second parameter: the result list
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        File rootPath = new File(inputPath);
        // listFiles gets the entries in the first-level directory (null if the path is not a directory)
        File[] files = rootPath.listFiles();
        if (files == null) {
            return;
        }
        for (File f : files) {
            // Decide whether to recurse based on the type of f:
            // if f is an ordinary file, add it to fileList; if it is a directory, recurse
            if (f.isDirectory()) {
                enumFile(f.getAbsolutePath(), fileList);
            } else {
                fileList.add(f);
            }
        }
    }

    public static void main(String[] args) {
        // The main method drives the whole indexing process
        Parser parser = new Parser();
        parser.run();
    }
}

Let's try running it. There are too many files: everything gets printed, including files we don't care about. So the next step is to filter the files and keep only the useful ones.

} else {
    if (f.getAbsolutePath().endsWith(".html")) {
        fileList.add(f);
    }
}

This code keeps only the files whose names end in .html.

4.1 Open the file and parse its contents

There are three parts here: parsing the title, parsing the URL, and parsing the content.

4.1.1 Parsing the title

f.getName() is a method that directly returns the file name.

We call name.substring(0, name.length() - 5). Why subtract 5 from the file name's length? Because the extension ".html" is exactly five characters, so stripping it leaves the bare title: for example, Collection.html becomes Collection.

private String parseTitle(File f) {
    String name = f.getName();
    // Strip the trailing ".html" (5 characters) to get the title
    return name.substring(0, name.length() - 5);
}

4.1.2 Parsing the URL

The URL here is what you type into the browser's address bar to reach a page. Our file's absolute path is local; cutting off the local prefix gives us the relative path, and splicing that onto the documentation site's HTTP base URL gives a link that opens the page directly.

private String parseUrl(File f) {
    // part1 is the online doc site's base URL; left empty here
    String part1 = "";
    // part2 is the path relative to INPUT_PATH
    String part2 = f.getAbsolutePath().substring(INPUT_PATH.length());
    return part1 + part2;
}
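
For example, if the downloaded documents are the JDK API docs (an assumption based on the D:/test/docs/api path), a local file such as D:/test/docs/api/java/util/Collection.html yields the relative part /java/util/Collection.html, and prepending the documentation site's base URL in part1 produces a link that opens the online page directly.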

4.1.3 Parsing the content

We use '<' and '>' as a switch while reading the data. Why read into an int instead of a char? Because read() returns -1 at the end of the file, and an int can hold that sentinel value, so we can tell when reading is finished. The code below is easy to follow.

private String parseContent(File f) throws IOException {
    // Read one character at a time, using '<' and '>' as the switch
    try (FileReader fileReader = new FileReader(f)) {
        // Switch controlling whether we are copying (true outside tags)
        boolean isCopy = true;
        // A builder to hold the result
        StringBuilder content = new StringBuilder();
        while (true) {
            // read() returns an int, not a char:
            // at end of file it returns -1, which is the advantage of using int
            int ret = fileReader.read();
            if (ret == -1) {
                break;
            }
            char c = (char) ret;
            if (isCopy) {
                if (c == '<') {
                    // Entering a tag: stop copying
                    isCopy = false;
                    continue;
                }
                // Normalize line breaks to spaces, copy other characters directly
                if (c == '\n' || c == '\r') {
                    c = ' ';
                }
                content.append(c);
            } else {
                if (c == '>') {
                    // Leaving the tag: resume copying
                    isCopy = true;
                }
            }
        }
        return content.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    return "";
}

The complete code for this module is as follows:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

// Read the documents we just downloaded
public class Parser {
    private static final String INPUT_PATH = "D:/test/docs/api";

    public void run() {
        // Entry point of the whole Parser class
        // 1. List all the documents (html) under the path
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        System.out.println(fileList);
        System.out.println(fileList.size());
        // 2. For each file listed above, open it, read its contents, and parse them
        for (File f : fileList) {
            System.out.println("Start parsing " + f.getAbsolutePath());
            parseHTML(f);
        }
        // 3. Save the index data structure constructed in memory to the specified file
    }

    private String parseTitle(File f) {
        String name = f.getName();
        // Strip the trailing ".html" (5 characters) to get the title
        return name.substring(0, name.length() - 5);
    }

    private String parseUrl(File f) {
        // part1 is the online doc site's base URL; left empty here
        String part1 = "";
        // part2 is the path relative to INPUT_PATH
        String part2 = f.getAbsolutePath().substring(INPUT_PATH.length());
        return part1 + part2;
    }

    private String parseContent(File f) throws IOException {
        // Read one character at a time, using '<' and '>' as the switch
        try (FileReader fileReader = new FileReader(f)) {
            // Switch controlling whether we are copying (true outside tags)
            boolean isCopy = true;
            // A builder to hold the result
            StringBuilder content = new StringBuilder();
            while (true) {
                // read() returns an int: -1 at end of file
                int ret = fileReader.read();
                if (ret == -1) {
                    break;
                }
                char c = (char) ret;
                if (isCopy) {
                    if (c == '<') {
                        // Entering a tag: stop copying
                        isCopy = false;
                        continue;
                    }
                    // Normalize line breaks to spaces, copy other characters directly
                    if (c == '\n' || c == '\r') {
                        c = ' ';
                    }
                    content.append(c);
                } else {
                    if (c == '>') {
                        // Leaving the tag: resume copying
                        isCopy = true;
                    }
                }
            }
            return content.toString();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return "";
    }

    private void parseHTML(File f) {
        // Parse out the title
        String title = parseTitle(f);
        // Resolve the corresponding url
        String url = parseUrl(f);
        // Parse the corresponding body text
        try {
            String content = parseContent(f);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // First parameter: where to start the traversal; second parameter: the result list
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        File rootPath = new File(inputPath);
        // listFiles gets the entries in the first-level directory (null if not a directory)
        File[] files = rootPath.listFiles();
        if (files == null) {
            return;
        }
        for (File f : files) {
            // If f is a directory, recurse; otherwise keep it only if it is an .html file
            if (f.isDirectory()) {
                enumFile(f.getAbsolutePath(), fileList);
            } else {
                if (f.getAbsolutePath().endsWith(".html")) {
                    fileList.add(f);
                }
            }
        }
    }

    public static void main(String[] args) {
        // The main method drives the whole indexing process
        Parser parser = new Parser();
        parser.run();
    }
}

 