A Roundup of Search Technologies

Implementation language
Name
Description
Related links
python
gensim
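Gensim is a free Python library for scalable statistical semantics: analyze plain-text documents for semantic structure and retrieve semantically similar documents.
An open-source Python library for flexible statistical semantics and retrieval of similar documents. It can be used to build your own text model (word tag) for similarity search.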
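To make "retrieve semantically similar documents" concrete, here is a minimal sketch in plain Python that ranks documents by cosine similarity of bag-of-words vectors. It is not Gensim's API; the helper names `bow`, `cosine` and `most_similar` and the sample documents are assumptions made up for illustration, and raw word counts are a much cruder model than the ones Gensim provides.

```python
import math
from collections import Counter


def bow(text):
    """Turn a plain-text document into a bag-of-words vector (token -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, docs):
    """Return (doc index, score) pairs, best match first."""
    q = bow(query)
    scores = [(i, cosine(q, bow(d))) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample corpus, for illustration only.
docs = [
    "human computer interaction survey",
    "graph of trees and minors",
    "user interface for computer system",
]
ranking = most_similar("computer user interface", docs)
print(ranking)
```

The third document shares all three query words, so it ranks first; the second shares none and scores zero.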
python
whoosh
Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly. Some of Whoosh’s features include:
● Pythonic API.
● Pure-Python. No compilation or binary packages needed, no mysterious crashes.
● Fielded indexing and search.
● Fast indexing and retrieval — faster than any other pure-Python search solution I know of. See Benchmarks.
● Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
● Powerful query language.
● Production-quality pure Python spell-checker (as far as I know, the only one).
Whoosh might be useful in the following circumstances:
● Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
● As a research platform (at least for programmers that find Python easier to read and work with than Java).
● When an easy-to-use Pythonic interface is more important to you than raw speed.
● If your application can make good use of one deeply integrated search/lookup solution you can rely on just being there rather than having two different search solutions (a simple/slow/homegrown one integrated, an indexed/fast/external binary dependency one as an option).
Whoosh was created and is maintained by Matt Chaput. It was originally created for use in the online help system of Side Effects Software’s 3D animation software Houdini. Side Effects Software Inc. graciously agreed to open-source the code.
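In short: whoosh is a fast, feature-rich full-text indexing and search library implemented in Python. Programmers can easily add search to their own applications or websites, and every part of whoosh can be extended or replaced to suit the use case. It was written by an individual developer, Matt Chaput, as part of his job at a company.
Official documentation: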
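As a rough picture of what "fielded indexing and search" means, the sketch below builds a toy inverted index keyed by (field, term) in plain Python. It does not use Whoosh and does not reflect its internals; the `TinyIndex` class, its methods and the sample documents are hypothetical, and there is no scoring, analysis or on-disk storage.

```python
from collections import defaultdict


class TinyIndex:
    """A toy fielded inverted index: (field, term) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, **fields):
        """Index every whitespace-separated term of every field."""
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose `field` contains all query terms."""
        result = None
        for term in query.lower().split():
            ids = self.postings.get((field, term), set())
            # Copy on first term so callers never mutate the stored postings.
            result = set(ids) if result is None else result & ids
        return result if result is not None else set()


ix = TinyIndex()
ix.add_document(1, title="First document",
                content="This is the first document we have added")
ix.add_document(2, title="Second document",
                content="The second one is even more interesting")

print(ix.search("title", "document"))
print(ix.search("content", "first document"))
```

Because terms are stored per field, the word "document" matches both titles but only the first document's content.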
python, java, c++, etc.
jieba
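"Jieba" Chinese word segmentation: aims to be the best Python Chinese word segmentation component. Features: Chinese word segmentation, part-of-speech tagging and keyword extraction. Implementations now exist in various languages. Combined with whoosh, it makes Chinese search easy to implement.
Official: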
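To show why Chinese text needs a segmentation step before it can be indexed, here is a minimal sketch of dictionary-based forward maximum matching in plain Python. This is a deliberately simpler technique swapped in for illustration, not jieba's algorithm or API; the function name `forward_max_match`, the `max_len` parameter and the tiny vocabulary are all assumptions.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedily take the longest dictionary word at each position.

    Characters not covered by any dictionary word are emitted one by one.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary, for illustration only.
vocab = {"中文", "分词", "搜索", "引擎", "搜索引擎"}
tokens = forward_max_match("中文分词搜索引擎", vocab)
print(tokens)
```

The resulting tokens can then be fed to an indexer in place of whitespace-separated words.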
rust
sonic
Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples that can then be queried against in a microsecond’s time. Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query. Sonic is an identifier index, rather than a document index; when queried, it returns IDs that can then be used to refer to the matched documents in an external database. A strong attention to performance and code cleanliness has been given when designing Sonic. It aims at being crash-free, super-fast and puts minimum strain on server resources (our measurements have shown that Sonic – when under load – responds to search queries in the μs range, eats ~30MB RAM and has a low CPU footprint; see our benchmarks).
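sonic is a fast, lightweight, schema-less search backend focused on microsecond-level queries over text and identifier pairs. In some scenarios it can replace heavyweight, full-featured search engines such as elasticsearch. It can normalize natural-language search queries, auto-complete search terms and return the most relevant results. sonic is an identifier index rather than a document index: a query returns not the matching documents but IDs that refer to them in an external database. Performance and code cleanliness were priorities from the start of the design; the goals are to be crash-free, very fast and light on resources. It originated at the company Crisp, where the motivation was simplicity and a small resource footprint.
Official:
(describes the key features in detail)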
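The "identifier index, not document index" idea can be sketched in a few lines of plain Python. This is not Sonic's protocol or any client library; the `IdentifierIndex` class, its method names, the `database` dict standing in for an external database and the sample IDs are all hypothetical.

```python
from collections import defaultdict


class IdentifierIndex:
    """Toy identifier index: keeps term -> object ids, never the documents."""

    def __init__(self):
        self._terms = defaultdict(set)

    def ingest(self, object_id, text):
        """Index the text under the given id; the text itself is not stored."""
        for term in text.lower().split():
            self._terms[term].add(object_id)

    def query(self, terms):
        """Return the sorted ids of objects that contain every query term."""
        words = terms.lower().split()
        if not words:
            return []
        ids = set.intersection(*(self._terms.get(w, set()) for w in words))
        return sorted(ids)

    def suggest(self, prefix):
        """Return indexed terms starting with the prefix (auto-completion)."""
        prefix = prefix.lower()
        return sorted(t for t in self._terms if t.startswith(prefix))


# The documents live elsewhere; a dict stands in for the external database.
database = {
    "msg:1": "The quick brown fox",
    "msg:2": "A lazy brown dog",
    "msg:3": "Quick search backends",
}
index = IdentifierIndex()
for object_id, text in database.items():
    index.ingest(object_id, text)

hit_ids = index.query("quick")
documents = [database[i] for i in hit_ids]  # second step: resolve ids
print(hit_ids, documents, index.suggest("qu"))
```

The two-step lookup (search for IDs, then fetch from the database) is the trade-off the description points at: the index stays small because it never holds the documents.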
java
Elasticsearch
java
Solr
Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.
A subproject of the Apache Lucene project. Solr is a popular, blazing-fast, open-source enterprise search platform.
Official:
java
Lucene
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
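A subproject of the Apache Lucene project. Apache Lucene is a high-performance full-text search engine library written in Java. It is a good technical choice for applications that need full-text search, especially cross-platform ones.
Official: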
c++
sphinx
Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It’s written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems. Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily, or index and search data on the fly, working with Sphinx pretty much as with a database server. A variety of text processing features enable fine-tuning Sphinx for your particular application requirements, and a number of relevance functions ensures you can tweak search quality as well. Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL. Sphinx clusters scale up to tens of billions of documents and hundreds of millions of search queries per day, powering top websites such as Craigslist, Living Social, MetaCafe and Groupon. To view a complete list of known users please visit our Powered-by page. And last but not least, it’s licensed under GPLv2.
Official:
Key features:
c++
faiss
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.
Similarity search and clustering of dense vectors.
Official:
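For a sense of what "similarity search of dense vectors" computes, here is a minimal brute-force sketch in plain Python that returns the k vectors nearest to a query by squared Euclidean distance. It is not Faiss's API; `squared_l2`, `knn_search` and the sample vectors are hypothetical, and scanning every vector like this becomes impractical on the very large collections the description mentions.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_search(vectors, query, k):
    """Return (distance, index) pairs for the k stored vectors nearest to query."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Hypothetical 2-dimensional vectors, for illustration only.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
neighbors = knn_search(vectors, [1.0, 1.0], k=2)
print(neighbors)
```

The exact scan serves as a correctness baseline: an approximate index is usually judged by how often it returns the same neighbors.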
c++
SPTAG
A distributed approximate nearest neighborhood search (ANN) library which provides a high quality vector index build, search and distributed online serving toolkits for large scale vector search scenario.
Official:

 

An introduction to and comparison of 20 open-source search engines
https://my.oschina.net/u/2274056/blog/1592809

Solr vs. Elasticsearch: who is the king of open-source search engines? (Chinese and English)
https://www.cnblogs.com/xiaoqi/p/solr-vs-elasticsearch.html
https://logz.io/blog/solr-vs-elasticsearch/

Pros and cons of open-source search engines: Lucene, Solr, Sphinx and others
https://blog.csdn.net/belalds/article/details/82667692

Microsoft open-sources a key algorithm behind Bing search
https://www.oschina.net/news/106730/microsoft-open-sources-sptag?nocache=1558414049690

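Data scraping tools:

    Client simulation tools

seleniumhq https://www.seleniumhq.org/
phantomjs
appium

    Crawler frameworks

http://nutch.apache.org/
pyspider: I have used it personally; it is friendly to small projects, with simple, easy-to-use features.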
