通过本篇文章，将会学习搜索引擎背后的自然语言处理方法 TF-IDF，以及它的改进 TF-IWE 和 BM25 算法。

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) 是被经常用在信息检索中的加权方法，用于评估一个词在文档或语料库中的重要性。本质上来讲，就是 TF 与 IDF 相乘：

$$\text{TF-IDF}(t, d, D)=\text{TF}(t, d) \cdot \text{IDF}(t, D)$$

Term Frequency (TF) is the frequency of term $t$ in document $d$

$$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$ $$\text{DF}(t)=\text{Number of documents containing the term t}$$

IDF 是对词语所提供的信息量的测量，换句话说，是词语在所有documents中的罕见程度。

$$\text{IDF}(t, D) = \log \frac{\text{ total number of documents in the corpus D}}{\text{number of documents in which the term $t$ appears} + 1}$$ 在分母部分加 1 以避免实际情况中出现 $t$ 不包含在 $D$ 中的所有 documents。

举例来说，以下是莎士比亚第37场戏剧的DF和IDF表格：

Word	df	idf
Romeo	1	1.57
Falstaff	4	0.967
wit	34	0.037
fool	36	0.012
good	37	0

可以看见的是 Romeo 和 Falstaff 这两个词语在莎士比亚的所有戏剧中并不常见，所以这两个词语与本场戏剧的关联性很强。而 fool 和 wit 出现在了很多场戏剧中，因此其实也就缺少了关于本场戏剧的信息性。

BM25算法

Okapi BM25 (Best Matching 25)算法，在 ElasticSearch, Lucene 等都有所应用。

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$ $$\text{IDF}(q_i) = \log\left(\frac{n(q_i) + 0.5}{N - n(q_i) + 0.5} + 1\right)$$

$D$ 是文档， $|D|$ 是文档中词语的数量
$Q$ is full query, potentially $Q={q_1, q_2, \dots,q_i}$
$k_i$ 是调整参数
$avgdl$ 是语料库中文档的平均长度

TF-IWE

$IDF$ 的结构较为简单，在同类语料库中，

改进的加权算法 Term Frequency - Inverse Word Frequency

使用 Sklearn

停用词

Refs

https://en.wikipedia.org/wiki/Tf%E2%80%93idf