分类：未分类

秒杀极简版实现

2020年9月12日 by whomm·0评论

秒杀系统在各种业务场景里面都会出现。对于电商公司商品秒杀会涉及到：会场、排期、招商、C端（商详店详）、金融等中台或业务线的通力合作，系统本身也会涉及到网关、风控、基础设施（db，cache，容器）多个环节。但是如果现在仅仅考虑1个固定商品的秒杀一天就一个固定场次，不涉及支付（0元就可以秒），用户限购1个，这样市场或营销的场景，现在看一下如何短平快实现一个高并发支持实现简单快捷的秒杀系统。

图里面的系统简单起见都认为已经有了，不需要再进行考虑了。

1. 开始具体的的设计
数据库一张表名字叫buy：
字段：id, item_id，user_id，number(秒杀序号)
uniq key： item_id，user_id
uniq key: item_id, number

2. 秒杀接口（登陆态必须）：
input ：userid，itemid
output：是否秒杀成功

3. 核心逻辑

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

/**
* -1无库存
* @param userId 用户id
* @param itemId 商品id
* @param stock 商品库存
* @return
*/
public long buyNow(long userId, long itemId, long stock) {

String nowNo = jedis.get(String.format("buynow_%d",itemId));
if (nowNo != null && Long.valueOf(nowNo) >= stock) {
// 没有库存了
return -1;
}
if(jedis.pfadd(String.format("buynow_u_%d",itemId),String.valueOf(userId)) > 0) {
// 这个uid没来过
long byno = jedis.incr(String.format("buynow_%d",itemId));
if (byno <= stock) {
// 获取到购买序号了

// 插入表格返回id，插入失败返回null
Long id = buyDao.insert(userId,itemId,byno);
if (id !=null ) {
// 记录数据库成功
return id;
}

}
}
// 没有抢到
return -1;
}

上面的逻辑主要看当前号是否已发放完了，没发放完，给没来过的用户发号。里面用到了hyperloglog和string两个redis的数据类型。但是需要这个kv系统是可靠的。
如果库存很小的情况下方法基本没有瑕疵，库存大的话，有些用户没来过会判断已经来过了。不会给他号。不过还好足够简单了。

mvn 加加速

2020年3月22日 by whomm·0评论

家里网上太慢了，项目依赖的snapshot包太多了，本地编译太费劲了，如何搞
修改 settings.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

<profile>
<id>default</id>
<repositories>
<repository>
<id>central</id>
<url>http://central</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>true</enabled>


<updatePolicy>never</updatePolicy>
</snapshots>
</repository>
</repositories>
</profile>

此时快速编译一下：

mvn package -nsu -Dmaven.test.skip=true
# 如果只使用这个命令是没什么效果的，settings.xml里面如果不改，致此处还会不停的从远程检查。

强制更新一下snapshot 并编译一下
mvn package -U -Dmaven.test.skip=true

其他的更新策略，也可以尝试一下
The frequency for downloading updates – can be “always”, “daily” (default), “interval:XXX” (in minutes) or “never” (only if it doesn’t exist locally).

参考文档： http://maven.apache.org/ref/3.5.0/maven-settings/settings.html

地理位置索引整理进化

2019年11月10日 by whomm·0评论

空间映射&编码

墨卡托模型： http://cdn.hujiulong.com/geohey/blog/mercator/play.html
Geohash 是一种地理编码，由 Gustavo Niemeyer 发明的。它是一种分级的数据结构，把空间划分为网格。Geohash 属于空间填充曲线中的 Z 阶曲线（Z-order curve）的实际应用。
s2 参考文档 https://github.com/halfrost/Halfrost-Field/blob/master/contents/Go/gospatialsearch.md

空间索引的方法

Lucene的方法：

2D空间下的索引：
1. geo_point ：前缀索引可以解决附近点和区域内点的搜索
2. geo_sharp : 形状索引的实现。 Tessellation Decomposition + Block K Dimensional Trees (Bkd)，三角拆分+blockKDtree
3D空间下的索引
1. XYShape for Spatial…in Lucene 8.2+ Tessellating & Indexing Virtual Worlds
lucene geo索引原理参考文档
1. https://www.slideshare.net/elasticsearch/geospatial-advancements-in-elasticsearch
2. https://portal.opengeospatial.org/files/?artifact_id=90337

索引涉及的数据结构和发展

空间数据库中最基本的查询
1. 正交范围查询(range query)
2. 窗口查询(window query): 在二维坐标数据中，窗口查询是一个坐标对齐矩形，查询目标是数据库中的落在矩形内所有点。
  - 支持窗口查询的的多维点索引结构有： k-d B-tree hB-tree Bkd-tree R-tree

kdtree 的发展：

K-d tree 1975 Jon Louis Bentley 发明（https://en.wikipedia.org/wiki/K-d_tree）
1. k-d树是每个节点都为k维点的二叉树。所有非叶子节点可以视作用一个超平面把空间分割成两个半空间。节点左边的子树代表在超平面左边的点，节点右边的子树代表在超平面右边的点。选择超平面的方法如下：每个节点都与k维中垂直于超平面的那一维有关。因此，如果选择按照x轴划分，所有x值小于指定值的节点都会出现在左子树，所有x值大于指定值的节点都会出现在右子树。这样，超平面可以用该x值来确定，其法线为x轴的单位向量。
2. 基本结构&构建：
  - 有很多种方法可以选择轴垂直分割面（ axis-aligned splitting planes ），所以有很多种创建k-d树的方法。最典型的方法如下：
    1. 随着树的深度轮流选择轴当作分割面。（例如：在三维空间中根节点是 x 轴垂直分割面，其子节点皆为 y 轴垂直分割面，其孙节点皆为 z 轴垂直分割面，其曾孙节点则皆为 x 轴垂直分割面，依此类推。）
    2. 选择的轴的具体座标的中位数作为作为分割面，同时该点放入子树。
  - 这个方法产生一个平衡的k-d树每个叶节点的高度都十分接近。但是平衡的树不一定所有场景都是最佳的
3. 支持：最近邻查询和范围查询
4. kdtree的主要问题：
  - 因为k-d树是多维排序的，做到平衡比较复杂。不能使用树旋转技术来平衡它们，因为这可能会破坏不变量
  - 平衡K-D树的几种算法：divided k-d tree, pseudo k-d tree, k-d B-tree, hB-tree and Bkd-tree
5. 制约和限制：
  - 维度灾难：在高维空间中，k-d树叶并不能做很高效的* 最邻近搜索。一般的准则是：在k维情况下，数据点数目N应当远远大于2^k时，k-d树的最邻近搜索才可以很好的发挥其作用
K-D-B-tree (k-dimensional B-tree) k-d-b-tree的目标是提高平衡k-d树的搜索效率，同时提高b-tree面向块的存储以优化外部内存访问
- J. T. Robinson. The K-D-B-tree: A search structure for large multidimensional dynamic indexes. In Proc. SIGMOD Intl. Conf. on Management of Data, pages 10–18, 1981
- https://en.wikipedia.org/wiki/K-D-B-tree
- 原文： http://www.csee.usf.edu/~tuy/Literature/KDB-SIGMOD81.pdf
- 也有简称kdtree的
- 与k-d树类似，k-d-b树组织k维空间中的点，对于范围搜索和多维数据库查询等任务非常有用。k-d-b树通过比较单个域中的元素将空间划分为两个子空间。以二维b树（二维k-d-b树）为例，空间的细分方式与k-d树相同：仅使用其中一个域或轴中的一个点，在这种情况下，所有其他值都小于或大于当前值，并分别落在拆分平面的左侧和右侧。与k-d树不同，每个半空间不是自己的节点。相反，与b-树一样，k-d-b-树中的节点存储为页，树存储指向根页的指针。
- kdbtree基本结构
  todo： https://en.wikipedia.org/wiki/K-D-B-tree
Bkd-tree 是 kd-tree 的扩展 2003发明
- 参考文档：https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf
hB-tree
1. http://www.cs.bu.edu/fac/gkollios/cs591/hb-tree.pdf

空间数据分割树

R树 R*树 R+树 X树 M树线段树可持久化线段树希尔伯特R树优先R树
R-tree 1984 Antonin Guttman 发明 (https://en.wikipedia.org/wiki/R-tree https://zh.wikipedia.org/wiki/R%E6%A0%91) 在空间索引应用上非常典型也非常成功。最典型的应用就是：附近两公里的餐厅这样的查询
R树也是平衡树（所以所有叶子节点都在同一深度）。它把数据按（内存）页存储，是为磁盘存储设计的（跟数据库的做法相同）。每一页可以存储不超过一个上限的条目，这个上限一般用M表示。R树会在除根节点以外的节点上维持数据条目不小于最小条目数

Elastic search中空间索引和lucene映射历史

1
2
3
4
5
6
7
8
9
10
11
12
13
14

？ Lucene 8.2+ XYShape for Spatial... Tessellating & Indexing Virtual Worlds
Elasticsearch 7.0 bundles Lucene 8
2019年3月26日 - Elasticsearch 6.7.0, based on Lucene 7.7.0.
2019-2-4 6.6.0 Integrate Lucene's LatLonShape (BKD Backed GeoShapes) as default geo_shape
Elasticsearch 6.5.0, based on Lucene 7.5.0.
Elasticsearch 6.4.0, based on Lucene 7.4.0.
2018年6月13日 Elasticsearch 6.3.0, based on Lucene 7.3.0.
Elasticsearch 6.2.0, based on Lucene 7.2.1.
6.6+ "triangle" encoding(Bkd-tree)
pre 6.6 terms/postings encoding
Elasticsearch 5.4.1, based on Lucene 6.5.1.
Elasticsearch 5.4.0, based on Lucene 6.5.0.
Elasticsearch 5.3.3, based on Lucene 6.4.2.
Elasticsearch 2.4.0, based on Lucene 5.5.2.

elastic search 7.4 Geo 相关支持情况

Geo query（https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html）
ElasticSearch支持两种类型的地理数据：支持纬度/经度对的地理点域和支持点、线、圆、多边形、多多边形等的地理形状域。
支持的查询类型：
1. 地理框查询（geoboundingbox）
  - 查找具有落在指定矩形中的地理点的文档。（支持经纬度查询数组、字符串、或geohash编码构建的矩形区域进行查询）
2. 地理距离查询（geo_distance）
  - 查找地理点位于中心点指定距离内的文档。
3. 地理多边形查询（geo_polygon）
  - 查找指定多边形内具有地理点的文档。
4. 地理图形查询（geo_shape）
  - 查找具有与指定的地理形状相交、包含或不相交的地理形状的文档。
  - 能够过滤带有geo_shap类型字段的索引
  - 查询支持直接给的一个形状也支持在另外一个索引里面定义的形状。
  - 空间关系运算符：
    1. intersects-（默认）返回其geo_shape字段与查询几何体相交的所有文档。
    2. DISJOINT-返回其geo_shape字段与查询几何图形没有任何共同点的所有文档。
    3. within-返回其geo_shape字段位于查询几何体中的所有文档。
    4. contains-返回其geo_shape字段包含查询几何图形的所有文档。注意：这只支持使用递归前缀树策略[6.6]
Shap query（https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-shape-query.html）
1. 查询带有shap类型的索引
2. 和geo_shape类型查询很像，shap query 用 GeoJSON 或 Well Known Text (WKT)表征形状。
3. 支持的空间关系运算：
  1. intersects-（默认）返回其geo_shape字段与查询几何体相交的所有文档。
  2. DISJOINT-返回其geo_shape字段与查询几何图形没有任何共同点的所有文档。
  3. within-返回其geo_shape字段位于查询几何体中的所有文档。

es lbs 查询实例

场景：查询当前点在可服务区域里面的所有doc并按照直线距离排序。 poi_zone 代表可服务区域， poi_point 代表服务提供方所在位置。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81

DELETE /shapes

PUT /shapes
{
"mappings": {
"properties": {
"poi_zone": {
"type": "geo_shape"
},
"poi_point": {
"type": "geo_point"
}
}
}
}

PUT /shapes/_doc/deu
{
"poi_zone": {
"type": "polygon",
"coordinates" : [
[[116.402508,39.947163], [116.414653,39.94697], [116.414869,39.940083], [116.40276,39.939972],[116.402508,39.947163]]
]
},
"poi_point":{
"lon":116.408401,
"lat":39.943734
}

}

PUT /shapes/_doc/deu2
{
"poi_zone": {
"type": "polygon",
"coordinates" : [
[[116.409443,39.947136], [116.420043,39.946997], [116.420043,39.945615], [116.409443,39.945587],[116.409443,39.947136]]
]
},
"poi_point":{
"lon":116.414096,
"lat":39.946513
}
}

GET /shapes/_search

GET /shapes/_search
{
"query":{
"bool": {
"must": {
"match_all": {}
},
"filter": {
"geo_shape": {
"poi_zone": {
"relation": "intersects",
"shape": {
"coordinates": [116.409766,39.946389],
"type": "point"
}
}
}

}
}
},
"sort": [{
"_geo_distance": {
"poi_point": {
"lon": 116.413503,
"lat": 39.946776
},
"unit": "km",
"order": "asc"
}
}]
}

ntp两个机器如何实现时间同步

2019年11月7日 by whomm·0评论

仔细想想这个问题还是挺有意思的。如图即将见到方法：

client相对服务器的时间偏差是：
offset = ((t1-t0)+(t2-t3))/2 如何理解呢，这个假设网络往返时间是相同的（实际情况仅同一网络内可以这么理解，广域网不会一样）。 t1-t0 是网络延迟+时间差别 t2-t3 是负的网络延迟+时间差别两个合起来除2就是时间差距了。
另外：网络延迟
delay = (t3-t0) – (t2-t1) 除去服务器的处理时间。

关于时间偏差如下示例：

1
2
3
4
5
6
7
8
9
10

timeline->
server: 0 1 2 3 4 5 6 7 8
client: 1 2 3 4 5 6 7 8 9

clock synchronization:
c 2
s 2 3
c 5

offset: -1=((2-2) + (3-5))/2

utc 是 tai（基于铯133的物理计时（购买原子钟可以联系我））的协调时间会有闰秒，每年的年中 6.30 和 12.31 是采用行政的方式加1秒，地球是转速的变慢的所已会加1秒。
闰秒的存在会使得同步差别更大。集群里面有些地方是要注意这个问题的，但是有折中的解决的办法。
另外代码时间相关的需要考虑到个闰秒的影响。

参考：
关于NTP，你需要知道的一切: https://www.jianshu.com/p/8096c0477230
wiki: https://en.wikipedia.org/wiki/Network_Time_Protocol
Raspberry Pi与GPS构建NTP服务器: https://blog.csdn.net/xiaohu50/article/details/78731534
Time synchronization with a Garmin GPS: https://www.lammertbies.nl/comm/info/GPS-time.html

FST & tire-tree 及应用场景

2019年10月26日 by whomm·0评论

FSA
FSA:确定无环有限状态接收机（Deterministric acyclic finite state acceptor, FSA） tire-tree就是这种

（构建工具http://examples.mikemccandless.com/fst.py）

FST

fst： Finite State Transducers（有限状态传感器）

（构建工具http://examples.mikemccandless.com/fst.py）
图中输入框里面单词代表输入后面的数字代表状态
用单词遍历这个有向无环图，初始状态是0，遇到的数字就累加上，到达终点节点的话就是对应的结果状态，如果没有到达终点代表咱们这里不支持这个输入
红色的边代表可优化存储的边 a large reduction of the FST size(实际实现中它的到达节点就是它的下一个节点,这个和实际存储实现由很大关系 )
加粗的边代表下个节点是终结（这个地方注意黑色的有加粗的红色的也有 0/1 g/1 这两条边都是加粗的）

构建方法：
输入的词是有序的（有序的构建的是最小树），添加词，再添加词（锁定部分边有些是不会动的边了可以锁定了）调整未锁定的边，同时调整里面的数值让输出是目标值。再添加的时候有些边又可以锁定了，然后调整其他未锁定的。
输出状态可以是数字大于等于0的整数：每个路，经边的值加和。
输出状态可以是文本：每个路，经过的值得合并,下面以上图为例 Lucene 的fst为例进行说明

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58

目的是使用1维的byte数组将状态变化全部保存起来

上图拉伸变形后

c a t
d e e p
o g s
e g g

镜像后得到（镜像是和插入顺序相关）
c d e g g
o g s
e e p
a t

从后到前依次填入到byte里面
fst的bytes字段
#一行10个byte 10进制

0 116 15 97 6 112 15 101 6 1
115 31 1 103 23 1 111 23 8 101
0 103 15 103 6 6 101 22 20 2
100 16 4 1 99 16

#转换arc为字母标记状态
(0) [t] 15 [a] 6 [p] 15 [e] 6 <1>
[s] 31 <1> [g] 23 <1> [o] 23 {8} [e]
0 [g] 15 [g] 6 <6> [e] 22 {20} <2>
[d] 16 {4} <1> [c] 16

lucene 的arc 标记位8个bit定义了6个标记
BIT_FINAL_ARC = 1 << 0; 表示此边的目标节点是否为可接受状态
BIT_LAST_ARC = 1 << 1; 表示此边是节点的最后一条边，此边后面的内容就是下一个节点了。一个节点的所有边按顺序写入到bytes中，只有最后一条边被标记为BIT_LAST_ARC
BIT_TARGET_NEXT = 1 << 2; 表示此边应用了target_next优化，bytes中的下一个Node就是此边的目标节点
BIT_STOP_NODE = 1 << 3; 表示此边的目标节点是-1，已经遍历到了FST的末尾。
BIT_ARC_HAS_OUTPUT = 1 << 4; 表示此边有输出
BIT_ARC_HAS_FINAL_OUTPUT = 1 << 5; 则该ARC需要存储output信息。
BIT_MISSING_ARC = 1 << 6; 表示此边的目标节点有输出(final output)

flag=16，二进制为 ‭00010000‬ 。
第三位为BIT_TARGET_NEXT=0，因此有目标节点target；
第五位为BIT_ARC_HAS_OUTPUT=1，因此有输出符号output

flag=22, 二进制为‭ 00010110‬ 。
第二位BIT_LAST_ARC=1，说明这是节点的最后一条边；
第三位BIT_TARGET_NEXT=1，因此不需要存储target，其目标节点为下一个节点往前找就可以了；
第五位为BIT_ARC_HAS_OUTPUT=1，因此有输出符号output

用的时候从后往前查找比如查找 do
(0) 代表结束
[] 代表arc
<> 代表输出
{} 代表目标跳转
纯数字代表flag

应用场景：

1.搜索功能的sug模块： tire-tree和 fst均可作为底层数据结构，实现产品效果是一样的，查询效率相同都是 o(length(queryword))。但是fst的索引构建相较tire的复杂，tire的使用的内存较多，不能共享尾部。对于中文sug来转化为拼音和拼音前缀的Tire很浪费内存。【这个地方重点是压缩率】sug不用这些结构也是可以的（提前构建好用hashmap也可以）
2.Lucene的 term和 doc id的映射就是用的fst。
3.拼写检查

lucence fst使用示例

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69

public static void main(String[] args) throws Exception {
/*
参考文档
http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/util/fst/package-summary.html
*/
String inputValues[] = {"cat", "deep", "do", "dog", "dogs", "egg",};
long outputValues[] = {1, 2, 3, 4, 5, 6};

PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
IntsRefBuilder scratchInts = new IntsRefBuilder();
for (int i = 0; i < inputValues.length; i++) {
BytesRef bb = new BytesRef(inputValues[i]);
System.out.println(bb.toString());
builder.add(Util.toIntsRef(bb, scratchInts), outputValues[i]);
}
FST<Long> fst = builder.finish();

System.out.println(fst.toString());

//find-----------------------------------
BytesRef xx = new BytesRef("cat");
System.out.println("-------------");
System.out.println("search result:"+Util.get(fst, xx));

System.out.println("-------------");
System.out.println("foreach fst:");
//foreach-------------------------------
IntsRefFSTEnum<Long> fstEnum = new IntsRefFSTEnum<Long>(fst);
IntsRefFSTEnum.InputOutput<Long> inputOutput;
BytesRefBuilder scratch = new BytesRefBuilder();
while ((inputOutput = fstEnum.next()) != null) {
String input = Util.toBytesRef(inputOutput.input, scratch).utf8ToString();
System.out.println(input + "\t" + inputOutput.output);
}

//sug-------------------------------------
System.out.println("-------------\nsug result:");
String forsug = "do";

BytesRef bytesRef = new BytesRef(forsug);
IntsRef input = Util.toIntsRef(bytesRef, new IntsRefBuilder());
FST.Arc<Long> arc = fst.getFirstArc(new FST.Arc<Long>());
FST.BytesReader fstReader = fst.getBytesReader();
for (int i = 0; i < input.length; i++) {
if (fst.findTargetArc(input.ints[input.offset + i], arc, arc, fstReader) == null) {
System.out.println("can not find ...");
return;
}
}
System.out.println(arc.toString());

//"do" find top 5 shortest path
Util.TopResults<Long> results = Util.shortestPaths(fst, arc, PositiveIntOutputs.getSingleton().getNoOutput(), new Comparator<Long>() {
@Override
public int compare(Long o1, Long o2) {
return o1.compareTo(o2);
}
}, 5, false);
BytesRefBuilder bytesRefBuilder = new BytesRefBuilder();
for (Util.Result<Long> result : results) {
IntsRef intsRef = result.input;
System.out.println( forsug + Util.toBytesRef(intsRef, bytesRefBuilder).utf8ToString());
}

}

其他字典树：

Array/List 使用二分法查找，不平衡
HashMap/TreeMap 性能高，内存消耗大，几乎是原始数据的三倍
Skip List 跳跃表，可快速查找词语，在lucene、redis、Hbase等均有实现。相对于TreeMap等结构，特别适合高并发场景（Skip List介绍）
Trie 适合英文词典，如果系统中存在大量字符串且这些字符串基本没有公共前缀，则相应的trie树将非常消耗内存（数据结构之trie树）
Double Array Trie 适合做中文词典，内存占用小，很多分词工具均采用此种算法（An Implementation of Double-Array Trie ： https://linux.thai.net/~thep/datrie/datrie.html）
Ternary Search Tree 三叉树，每一个node有3个节点，兼具省空间和查询快的优点（https://en.wikipedia.org/wiki/Ternary_search_tree）
Finite State Transducers (FST) 一种有限状态转移机，Lucene 4有开源实现，并大量使用

FST的开源实现

1. OpenFst, an open-source library for FST operations.
2. Stuttgart Finite State Transducer Tools, another open-source FST toolkit
3. java FST Framework, an open-source java FST Framework capable of handling OpenFst text format.
4. Vcsn, an open-source platform (C++ & IPython) platform for weighted automata and rational expressions.
5. Finite State Morphology–The Book XFST/ LEXC, a description of Xerox’s implementation of finite-state transducers intended for linguistic applications.
6. FOMA, an open-source implementation of most of the capabilities of the Xerox XFST/ LEXC implementation.

【one more thing】

利用Lucene的开源工具实现了一个小的中文sug的demo：sugdemo

参考资料：

lucene字典实现原理——FST https://www.cnblogs.com/bonelee/p/6226185.html
关于Lucene的词典FST深入剖析 http://www.shenyanchao.cn/blog/2018/12/04/lucene-fst/
https://blog.csdn.net/zx2011302580235/article/details/88594342
fst wiki https://en.wikipedia.org/wiki/Finite-state_transducer

php两个小问题

2019年9月24日 by whomm·0评论

1） opcache缓存的问题。如果开启了opcache缓存，关闭了脚本检查opcache.validate_timestamps=0，修改代码会生效吗？如果不生效需要怎么办？

opcache.validate_timestamps boolean
如果启用，那么 OPcache 会每隔 opcache.revalidate_freq 设定的秒数检查脚本是否更新。如果禁用此选项，你必须使用 opcache_reset() 或者opcache_invalidate() 函数来手动重置 OPcache，也可以通过重启 Web 服务器来使文件系统更改生效。

opcache.revalidate_freq integer
检查脚本时间戳是否有更新的周期，以秒为单位。设置为 0 会导致针对每个请求， OPcache 都会检查脚本更新。如果 opcache.validate_timestamps 配置指令设置为禁用，那么此设置项将会被忽略。

【注意】如果禁用此选项，你必须使用 opcache_reset() 或者 opcache_invalidate() 函数来手动重置 OPcache，也可以通过重启 Web 服务器来使文件系统更改生效。重启php-fpm或nginx都是没有用的，缓存存储到共享内存里面的，不清除共享内存，缓存是不会被清理的。
如果修改源代码一直不生效，就差删除文件了，可以排查一下这个设置。

2）php.ini修改后php-fpm子进程如果异常退出了，会加载最新配置吗？
不会pm = dynamic本身就支持动态子进程的，如果子进程重新创建使用新配置就乱套了。

参考文档：
https://www.php.net/manual/zh/opcache.configuration.php#ini.opcache.revalidate-freq

位运算小技巧

2019年9月19日 by whomm·0评论

位运算（位运算baike）

奇偶判断

1
2
3

if(a&1 == 1){
//偶数
}

两数交换

1
2
3
4
5
6
7
8

//常规
x = x^y
y = x^y
x = x^y
//简写
x ^= y
y ^= x
x ^= y

数组里面就一个数出现一次其他都是两次求出这个数

1
2
3
4
5
6
7
8
9

int find(int[] arr)
{
int tmp = arr[0];
for(int i=1;i<arr.length;i++)
{
tmp ^= arr[i];
}
return tmp;
}

不用加减运算符求两数只和

https://leetcode-cn.com/problems/sum-of-two-integers
这个地方要了解加法器是如何运行的
另外一个重点是负数在内存里面存储是补码（https://blog.csdn.net/zl10086111/article/details/80907428）

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42

int getSum(int a, int b) {
while (b != 0) {
unsigned x = a ^ b;
unsigned y = (unsigned)(a & b) << 1;
a = x;
b = y;
}
return a;
}
/*
整数按照加法器来搞
5 + 3
0101
0011

x= 5^3 = 0110
y= 5&3 = 0001

x^ (y<<1) =
0110
0010 = 0100
x& (y<<1)
0110
0010 = 0010

x = 0100
y = 0010

x^ (y<<1) =
0100
0100 = 0000
x& (y<<1)
0100
0100 = 0100

x = 0000
y = 0100

x = 0000 ^ 1000 = 1000
y = 0000 & 1000 = 0
*/

其他请参考：位运算装逼指南 https://blog.csdn.net/Mrchai521/article/details/90318128

redis trick

2019年5月18日 by whomm·一条评论

1. 集群配置+proxy 不支持事务，不支持 script load 的情况下, 一种实现事务的方法。

eval 执行是具有原子性的（https://redis.io/commands/eval#atomicity-of-scripts）但是得注意 cluster的环境下
用eval 执行多语句，遇到多个key的情况
(error) CROSSSLOT Keys in request don’t hash to the same slot
一个可以用的方案： https://redis.io/topics/cluster-spec#keys-hash-tags

eval(script) 组装一个脚本，但是多个key必须用 keys hash tags的方法到一个slot里面，代理只会按照第一个key的hash来。

2. redis 大厂解决方案
https://www.cnblogs.com/me115/p/9043420.html
很有必要参考，这个大厂确实是这么做的。开发过程中坑位很多。

搜索技术整理

2019年5月18日 by whomm·0评论

实现语言	名称	简介	相关链接
python	gensim	Gensim is a FREE Python library Scalable statistical semantics Analyze plain-text documents for semantic structure Retrieve semantically similar documents 开源python库：灵活的语意统计检索相似文档可用来构建自己的文本模型（word tag）进行相识度检索	官方：https://radimrehurek.com/gensim/index.html 中文文档：https://gensim.apachecn.org/#/
python	whoosh	Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly. Some of Whoosh’s features include: ● Pythonic API. ● Pure-Python. No compilation or binary packages needed, no mysterious crashes. ● Fielded indexing and search. ● Fast indexing and retrieval — faster than any other pure-Python search solution I know of. See Benchmarks. ● Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc. ● Powerful query language. ● Production-quality pure Python spell-checker (as far as I know, the only one). Whoosh might be useful in the following circumstances: ● Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them). ● As a research platform (at least for programmers that find Python easier to read and work with than Java 😉 ● When an easy-to-use Pythonic interface is more important to you than raw speed. ●If your application can make good use of one deeply integrated search/lookup solution you can rely on just being there rather than having two different search solutions (a simple/slow/homegrown one integrated, an indexed/fast/external binary dependency one as an option). Whoosh was created and is maintained by Matt Chaput. It was originally created for use in the online help system of Side Effects Software’s 3D animation software Houdini. Side Effects Software Inc. graciously agreed to open-source the code. whoosh 是一个python实现的快速且功能丰富的全文索引和搜索库。程序员可以轻而易举的给自己的应用或网站添加搜索功能。根据具体的场景，whoosh实现的每个部分都可以二次开发和替换。个人开发者Matt Chaput 在公司上班的时候开发的。	官方：https://bitbucket.org/mchaput/whoosh/wiki/Home 官方文档： https://whoosh.readthedocs.io/en/latest/
python、java、c++等	jieba	“结巴”中文分词：做最好的 Python 中文分词组件。功能：中文分词、词性标注、关键词抽取目前已有各种语言的实现。和whoosh结合轻松实现中文检索。	官方： https://github.com/fxsjy/jieba
rust	sonic	Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples that can then be queried against in a microsecond’s time. Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query. Sonic is an identifier index, rather than a document index; when queried, it returns IDs that can then be used to refer to the matched documents in an external database. A strong attention to performance and code cleanliness has been given when designing Sonic. It aims at being crash-free, super-fast and puts minimum strain on server resources (our measurements have shown that Sonic – when under load – responds to search queries in the μs range, eats ~30MB RAM and has a low CPU footprint; see our benchmarks). sonic 是个快速、轻量级、非结构化的搜索后端。专注文本和标记对的毫秒级查询。 sonic在某些场景下可以对重量级的全功能搜索引擎进行替代，比如elasticsearch。它能实现统一的自然语言查询，检索词自动补全，输出相关性结果。sonic是一个标记索引系统而不是文档索引系统：sonic的查询结果不是相关文档，而是外部数据库的ID标识。设计sonic的时候就对性能和代码的整洁度非常关注。无故障、超快、轻度资源依赖就是sonic的目标。起源于Crisp公司，背景就是简单+轻量资源。	官方： https://github.com/valeriansaliou/sonic https://crates.io/crates/sonic-server （详细描述了关键特性）
java	Elasticsearch
java	Solr	Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™. Apache Lucence 项目的一个子项目。 Solr是一个流行、超快、开源的企业级的搜索平台。	官方： http://lucene.apache.org/solr/
java	Lucene	Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucence 项目的一个子项目。 Apache Lucence是一个java写的高性能的全文搜索引擎库。对于需要全文检索的应用比较好的技术解决方案，特别是跨平台的应用。	官方： http://lucene.apache.org/core/
c++	sphinx	Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It’s written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems. Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as with a database server. A variety of text processing features enable fine-tuning Sphinx for your particular application requirements, and a number of relevance functions ensures you can tweak search quality as well. Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL. Sphinx clusters scale up to tens of billions of documents and hundreds of millions search queries per day, powering top websites such as Craigslist, Living Social, MetaCafe and Groupon… to view a complete list of known users please visit our Powered-by page. And last but not least, it’s licensed under GPLv2.	官方： http://sphinxsearch.com/ 关键特性： http://sphinxsearch.com/about/sphinx/
c++	faiss	Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research. 相似向量搜索和密集向量聚类	官方： https://github.com/facebookresearch/faiss
c++	SPTAG	A distributed approximate nearest neighborhood search (ANN) library which provides a high quality vector index build, search and distributed online serving toolkits for large scale vector search scenario.	官方： https://github.com/microsoft/SPTAG

20款开源搜索引擎介绍与比较
https://my.oschina.net/u/2274056/blog/1592809

Solr vs. Elasticsearch谁是开源搜索引擎王者（中英文）
https://www.cnblogs.com/xiaoqi/p/solr-vs-elasticsearch.html
https://logz.io/blog/solr-vs-elasticsearch/

开源搜索引擎Lucene、Solr、Sphinx等优劣势比较
https://blog.csdn.net/belalds/article/details/82667692

微软开源了 Bing 搜索背后的关键算法
https://www.oschina.net/news/106730/microsoft-open-sources-sptag?nocache=1558414049690

数据抓取工具：

客户端模拟工具

seleniumhq https://www.seleniumhq.org/
phantomanjs
appuim

爬虫框架

http://nutch.apache.org/
pyspider 个人用过对小型项目很友好，功能简单易用。

go one line code – http

2018年12月7日 by whomm·0评论

反向代理

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

package main

import (
"net/http"
"net/http/httputil"
"net/url"
)

func main() {
//【[one line code] 一行代码http反向代理】
//127.0.0.1:8900 转发到 10.0.12.110:6335
//测试: curl http://127.0.0.1:8900

u, _ := url.Parse("http://10.0.12.110:6335") //反向代理的目的地址
http.ListenAndServe(":8900", httputil.NewSingleHostReverseProxy(u))

}

静态服务

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

package main

import (
"net/http"
)

func main() {

//【[one line code] 一行代码静态web服务】
//测试： curl http://127.0.0.1:8901/a.txt

// 静态web服务:
http.ListenAndServe(":8901", http.FileServer(http.Dir("/tmp")))

//【[one line code] 一行代码静态web服务带路径转发】
// curl 127.0.0.1:8902/tmpfiles/a.txt
// url路径/tmpfile 映射到/tmp
/*
http.Handle("/tmpfiles/", http.StripPrefix("/tmpfiles/", http.FileServer(http.Dir("/tmp"))))
http.ListenAndServe(":8902", nil)
*/

}

劳动笔记

又一个WordPress站点

分类：未分类

秒杀极简版实现

mvn 加加速

地理位置索引整理进化

空间映射&编码