Lucune 技术

用户搜索的过程
- 用户输入关键词
- ——–搜索过程——–
- 展示结果
搜索相关技术
- lucene是什么？
  - lucene是一个用来构建搜索引擎的类库，并不是一个完整的搜索引擎。
  - lucene核心功能：创建索引和查询索引
- solr是什么？
  - 是一个完整的搜索引擎，是基于lucene。
  - 官网地址
搜索的过程

在mysql数据库进行数据查询–mysql
- 去哪里查？ mysql
- 怎么查询快一点？ select * from 表 where title like “%关键词%”；
- 在海量数据下放弃数据库的查询技术
lucene的快速查询
- 能够通过lucene的索引库实现海量数据的快速查询。
- 核心功能：创建索引、查询索引
  - 创建索引时，需要将数据读取之后进行分词。分词技术
- lucene使用一个技术，叫做倒排索引，能够加快查询。
  - 什么是倒排索引
  - 顺序查找的过程是如果有一个万个文档，依次打开每个文档，然后逐行查询。
  - 倒排索引时，通过提前计算，将一个关键词出现在哪些文档中，记录下来成为索引数据，查询的时候，去索引数据中查找。
  - 举例

document1
我爱我的祖国，我的祖国很强大。
document2
我爱你祖国，祖国我爱你。
document3
爱我中华，为中华之崛起而编写爬虫。

需求1：查找出现过“祖国”两个子的文档，并显示出来。
顺序查找：
readfile1->readline->find keyword
readfile2->readline->find keyword
……
倒排索引：提前创建索引
使用分词技术，将读取的一行句子分词。

我|爱|我的|祖国|我的|祖国|很|强大
我爱你|祖国|祖国|我爱你
爱|我|中华|为|中华|之|崛起|而|编写|爬虫

我：{document1[1:1],document3[1:2]}
爱：{docuemnt1[1:2],document3[1:1]}
我的：{document1[1:3,1:5]}
祖国：{document1[1:4,1:6],document2[1:2,1:3]}

使用倒排技术进行查询
1，先通过关键词找到对应的索引数据

祖国：{document1[1:4,1:6],document2[1:2,1:3]}

2,根据索引数据拼装response返回结果集
3，展示给用户结果

4、lucene快速入门

创建maven项目导入lucene的pom依赖

<dependencies>
	   <dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-core</artifactId>
		<version>4.10.2</version>
	</dependency>
	<!-- 查询索引的一些包 -->
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-queries</artifactId>
		<version>4.10.2</version>
	</dependency>
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-test-framework</artifactId>
		<version>4.10.2</version>
	</dependency>
	<!-- 创建索引时，需要创建倒排索引信息，创建倒排需要分词 -->
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-analyzers-common</artifactId>
		<version>4.10.2</version>
	</dependency>
	<!--  如果用户的输入的关键词是个组合词，或者一个句子：为什么打雷不下雨？ -->
	<!-- 用来解析用户的查询意图 -->
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-queryparser</artifactId>
		<version>4.10.2</version>
	</dependency>
 </dependencies>

4.1、创建索引的过程

public class LuceneMain {

	public static void main(String[] args) throws Exception {
		// 需求:使用lucene创建索引
		// 1.使用某种技术
		// document--->分词--->倒排--->写入到索引库----介质(硬盘或者内存)

		Directory d = FSDirectory.open(new File("index"));
		IndexWriterConfig conf = new IndexWriterConfig(Version.LATEST, new StandardAnalyzer());
		// 第一步:创建indexWriter，传入两个参数，一个directory(FSDriectory,RAMDriectory);一个是IndexWriterConfig
		// IndexWriterConfig,传入两个参数，一个是版本号，一个分词器(中文分词器在3.1时过期了，推荐使用标准StandardAnalyzer)
		IndexWriter indexWriter = new IndexWriter(d, conf);

		// 第二步:对文件创建索引----虎嗅网站：id,title,author,createTime,content,url
		// 一个文档中，有很多字段，想通过title进行查询，这个字段需要被indexed。
		// 如果一个想创建索引，有两种选择，一种不被分词就是完全匹配，一种是分词。
		Document document = new Document();
		// StringField 不分词
		document.add(new StringField("id", "100010", Store.YES));
		// TextField 分词
		document.add(new TextField("title", "《跑男》《极限挑战3》相继停播，政策已成为今年综艺的最大风险", Store.YES));
		// StringField 不分词
		document.add(new StringField("author", "文娱商业观察", Store.YES));
		// StringField 不分词
		document.add(new StringField("createTime", "100010", Store.YES));
		// TextField 分词
		document.add(new TextField("content",
				"虽然安全播了三季，但每年都要被停播个那么几周。果不其然，这周日《极限挑战》又要停播，联系前段时间东方卫视《金星秀》《今晚80后脱口秀》宣布停播，这次《极限挑战》的停播是在警示谁？",
				Store.YES));
		// StringField 不分词
		document.add(new StringField("url", "https://www.huxiu.com/article/213893.html", Store.YES));

		// 第三步：使用indexWriter创建索引
		indexWriter.addDocument(document);
		// commit将内存中等待的数据提交到硬盘创建索引。
		indexWriter.commit();

		// 第四步：关闭indexwriter
		indexWriter.close();

	}
}

4.2、使用luke工具查看索引库

分析出lucene索引库会存储两种类型的数据
- 保存索引数据
- 保存原始数据

4.3、查询索引的过程

private static void searchIndex() throws IOException {
		//需求:使用lucene查询索引
		FSDirectory directory = FSDirectory.open(new File("index"));
		// 第一步:创建IndexSearcher，IndexSearcher 是基于索引库进行查询 ，需要先读取索引库。
		IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(directory));
		// 第二步：设置用户的关键词
		String keyword = "文娱商业观察";
		// 第三步：根据词条进行查询
		// 将关键词转换成一个term对象，需要指定关键词要查询字段 new Term("author", keyword)
		TopDocs res = indexSearcher.search(new TermQuery(new Term("author", keyword)), Integer.MAX_VALUE);
		// 第四步：从topdocs中获取数据
		System.out.println("当前查询命中多少天数据:"+res.totalHits);
		ScoreDoc[] scoreDocs = res.scoreDocs;
		for (ScoreDoc scoreDoc : scoreDocs) {
			System.out.println("维护在lucene内部的文档编号:"+scoreDoc.doc);
			System.out.println("当前文档的得分:"+scoreDoc.score);
			//第五步:获取单篇文章的信息
			Document doc = indexSearcher.doc(scoreDoc.doc);
			System.out.println("id:   "+doc.get("id"));
			System.out.println("title:   "+doc.get("title"));
			System.out.println("author:   "+doc.get("author"));
			System.out.println("createTime:   "+doc.get("createTime"));
			System.out.println("content:   "+doc.get("content"));
			System.out.println("url:   "+doc.get("url"));
		}
	}

如何知道一次搜索过程中，哪些文档相关性更强？
- 10个文件，都包含关键词 “传智播客”
- 每篇文章中关键词都会出现？
  - 出现的次数比较多的，相关性比较大。

#5、在项目中使用ikanalyzer分词器

是什么？中国林良益 2012版本之后在阿里

maven项目，2012年发布jar，没有发布到maven。

需要将jar包导入到本地mvn仓库

mvn install:install-file -Dfile=IKAnalyzer2012FF_u1.jar -DgroupId=org.wltea.ik-analyzer -DartifactId=ik-analyzer -Dversion=4.10.2 -Dpackaging=jar

在项目中使用

<dependency>
	<groupId>org.wltea.ik-analyzer</groupId>
	<artifactId>ik-analyzer</artifactId>
	<version>4.10.2</version>
</dependency>

使用自己的词库
ik分词器和lucene的关系
- ik是一个独立的分词器
- 由于lucene的查询功能底层使用了倒排索引的技术，而倒排索引需要对内容进行分词，才使用了分词技术。

6、lucene的花式查询

// TermQuery query = new TermQuery(new Term("content", "战"));

//// 通过两次置换，能够得到一个词条
 Term term = new Term("content","大据数");
 FuzzyQuery query = new FuzzyQuery(term);

//通过前缀查询
PrefixQuery query = new PrefixQuery(new Term("content", "大"));

WildcardQuery query = new WildcardQuery(new Term("content","大*"));


BooleanQuery query = new BooleanQuery();

WildcardQuery wildcardQuery = new WildcardQuery(new Term("content", "大*"));
query.add(wildcardQuery, Occur.MUST);

PrefixQuery prefixQuery = new PrefixQuery(new Term("content", "金"));
query.add(prefixQuery, Occur.MUST);

//重要：在大多数情况下，用户的输入不一定是一个词条，
//所以我们需要对用户的输入进行分词，将输入编程多个词条之后进行查询。
Analyzer analyzer = new IKAnalyzer();
QueryParser queryParser = new QueryParser("content", analyzer);
Query query =queryParser.parse("学习大数据");

//重要：有时候业务会提供多个字段供用户选择，店铺，商家，旺旺。
MultiFieldQueryParser multiFieldQueryParser 
= new MultiFieldQueryParser(new String[]{"title","content"},new IKAnalyzer());
Query query =multiFieldQueryParser.parse("学习大数据");

7、如何使用lucene开发一个搜索引擎

对外开发一个controller，这个controller提供两个方法
- init:读取数据库的数据，创建索引
  - ———dao————-
  - mysql jdbc DataSource
  - jdbctemplate mybatis访问
  - select * from huxiu_article;
  - ———-service———–
  - 得到数据库查询的结果，并封装到list集合当中
  - 迭代list中的元素，将每个元素封装一个document对象
    - 调用indexWriter.add(doc)创建索引
    - 调用indexWriter.commit()
  - 关闭indexWriter.close()
- query:查询方法，主要接受用户输入的关键字。
  - ——–controller———
  - 获取数据
  - ————service——–
  - 将用户的关键词封装成一个query对象
    - 使用QueryParser对用户输入的关键词进行分词
    - indexSearcher.search(query,integer.max)
  - 得到索引库的返回值，封装成一个list对象。
  - 返回list对象给页面
  - 页面通过jstl等等技术，进行展现。

Duncan's Blog

搜索引擎的核心基础技术