Sklearn 中的朴素贝叶斯分类算法

Scikit-learn 中提供了3中朴素贝叶斯分类算法:

高斯朴素贝叶斯：特征变量是连续变量，符合高斯分布，比如说人的身高，物体的长度。

多项式朴素贝叶斯：特征变量是离散变量，符合多项分布，在文档分类中特征变量体现在一个单词出现的次数，或者是单词的 TF-IDF 值等。

伯努利朴素贝叶斯：特征变量是布尔变量，符合 0/1 分布，在文档分类中特征是单词是否出现。

伯努利朴素贝叶斯是以文件为粒度，如果该单词在某文件中出现了即为 1，否则为 0。而多项式朴素贝叶斯是以单词为粒度，会计算在某个文件中的具体次数。而高斯朴素贝叶斯适合处理特征变量是连续变量，且符合正态分布（高斯分布）的情况。比如身高、体重这种自然界的现象就比较适合用高斯朴素贝叶斯来处理。而文本分类是使用多项式朴素贝叶斯或者伯努利朴素贝叶斯。

实例

假设有4个文档, 计算文档中有那些单词, 这些单词在不同文档中的 TF-IDF

文档 1：this is the bayes document；

文档 2：this is the second second document；

文档 3：and the third one；

文档 4：is this the document。

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer() 
"""
TfidfVectorizer 可以传入以下参数
* stop_words 自定义停止词列表
* token_pattern 过滤规则, 正则表达式
* max_df=0.5 当一个单词在50%的文档中都出现过, 就不纳入分词统计
"""

# 创建数据
documents = [
    'this is the bayes document',
    'this is the second second document',
    'and the third one',
    'is this the document'
]
tfidf_matrix = tfidf_vec.fit_transform(documents) # 拟合

print('不重复的词:', tfidf_vec.get_feature_names())
"""
Output:
不重复的词: ['and', 'bayes', 'document', 'is', 'one', 'second', 'the', 'third', 'this']
"""

print('每个单词的 ID:', tfidf_vec.vocabulary_)
"""
Output:
每个单词的 ID: {'this': 8, 'is': 3, 'the': 6, 'bayes': 1, 'document': 2, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
"""

print('每个单词的 tfidf 值: \\n', tfidf_matrix.toarray())
"""
Output:
每个单词的 tfidf 值:
 [[0.         0.63314609 0.40412895 0.40412895 0.         0.
  0.33040189 0.         0.40412895]
 [0.         0.         0.27230147 0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.         0.52210862 0.52210862 0.         0.
  0.42685801 0.         0.52210862]]
"""

对文档进行分类

对文档进行分词

import nltk
word_list = nltk.word_tokenize(text) # 分词
nltk.pos_tag(word_list) # 标注单词的词性

# 或者中文分词
import jieba
word_list = jieba.cut (text) # 中文分词

加载停止词表

stop_words = [line.strip().decode('utf-8') for line in io.open('stop_words.txt').readlines()]

计算单词的权重

tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5)
features = tf.fit_transform(train_contents)

生成朴素贝叶斯分类器

我们将特征训练集的特征空间 train_features，以及训练集对应的分类 train_labels 传递给贝叶斯分类器 clf，它会自动生成一个符合特征空间和对应分类的分类器。

这里我们采用的是多项式贝叶斯分类器，其中 alpha 为平滑参数。为什么要使用平滑呢？因为如果一个单词在训练样本中没有出现，这个单词的概率就会被计算为 0。但训练集样本只是整体的抽样情况，我们不能因为一个事件没有观察到，就认为整个事件的概率为 0。为了解决这个问题，我们需要做平滑处理。

当 alpha=1 时，使用的是 Laplace 平滑。Laplace 平滑就是采用加 1 的方式，来统计没有出现过的单词的概率。这样当训练样本很大的时候，加 1 得到的概率变化可以忽略不计，也同时避免了零概率的问题。

当 0<alpha<1 时，使用的是 Lidstone 平滑。对于 Lidstone 平滑来说，alpha 越小，迭代次数越多，精度越高。我们可以设置 alpha 为 0.001。
```
# 多项式贝叶斯分类器
from sklearn.naive_bayes import MultinomialNB  
clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)
```

使用生成的分类器做预测

test_tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5, vocabulary=tf.vocabulary_)
test_features=test_tf.fit_transform(test_contents)

predicted_labels=clf.predict(test_features)

计算准确率

from sklearn import metrics
print metrics.accuracy_score(test_labels, predicted_labels)