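To apply machine learning to text documents, we first need to convert the text content into numerical feature vectors.
This approach is simple and effective. Text is turned into a numerical vector in the following steps: tokenization, counting, then normalization and weighting.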
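The tokenizing and counting steps can be sketched on a toy corpus. This is a minimal illustration, assuming scikit-learn's CountVectorizer with its default settings; the two sentences are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()           # tokenizes and lowercases by default
counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

vocabulary = vectorizer.vocabulary_      # maps each token to its column index
dense = counts.toarray()                 # dense copy, only sensible for tiny data

print(sorted(vocabulary))
print(dense)
```

Each row is one document and each column one token, so a document becomes a vector of token counts. Normalization and weighting are applied on top of this matrix.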
Limitations:
N-grams
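Instead of building a simple collection of unigrams (n=1), one can use bigrams (n=2), which count occurrences of pairs of consecutive words and therefore keep some information about word order.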
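A small sketch of why bigrams keep word order, assuming CountVectorizer's ngram_range parameter. The two sentences are hypothetical and chosen so that they contain exactly the same words in a different order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences: same words, different order.
docs = ["the dog bit the man", "the man bit the dog"]

unigram = CountVectorizer(ngram_range=(1, 1))
bigram = CountVectorizer(ngram_range=(2, 2))

uni_counts = unigram.fit_transform(docs).toarray()
bi_counts = bigram.fit_transform(docs).toarray()

# Unigram vectors cannot tell the two sentences apart; bigram vectors can.
same_unigrams = (uni_counts[0] == uni_counts[1]).all()
same_bigrams = (bi_counts[0] == bi_counts[1]).all()
bigram_vocab = sorted(bigram.vocabulary_)

print(same_unigrams, same_bigrams)
print(bigram_vocab)
```

Passing ngram_range=(1, 2) would keep both unigrams and bigrams in one vocabulary, at the cost of a larger feature space.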
character n-grams
Character n-grams can represent misspellings and derived word forms.
from sklearn.feature_extraction.text import CountVectorizer
ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
# in scikit-learn 1.0 and later this method is get_feature_names_out()
ngram_vectorizer.get_feature_names()
Out: [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']
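Tokenization and stop-word removal
from sklearn.feature_extraction.text import CountVectorizer
# twenty_train is not defined in these notes; it is assumed to be the
# 20 newsgroups training set, e.g. from sklearn.datasets.fetch_20newsgroups(subset='train')
count_vect = CountVectorizer(min_df=1, stop_words='english')
X_train_counts = count_vect.fit_transform(twenty_train.data)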
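To see what stop-word removal does without downloading a dataset, here is a minimal sketch on a single made-up sentence. It compares the vocabulary with and without scikit-learn's built-in English stop-word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-sentence corpus, used only for illustration.
docs = ["the quick brown fox jumps over the lazy dog"]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words='english').fit(docs)

plain_vocab = set(plain.vocabulary_)
filtered_vocab = set(filtered.vocabulary_)

# Tokens that the stop-word list dropped from the vocabulary.
removed = sorted(plain_vocab - filtered_vocab)
print(removed)
```

The built-in list is generic. For a specific domain it may be worth passing a custom list to stop_words instead.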
tfidf
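The resulting tf-idf vectors are usually normalized by applying the Euclidean (L2) norm: v_norm = v / ||v||_2. The hand-written version that follows leaves this step out.
# without normalization
import numpy as np

def tfidf(term, doc, docset):
    # doc is a list of tokens; docset is a list of such documents
    tf = float(doc.count(term)) / len(doc)  # term frequency within this document
    # the 1 in the denominator avoids division by zero for unseen terms
    idf = np.log(float(len(docset)) / (1 + len([d for d in docset if term in d])))
    return tf * idf

# Term Frequency times Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)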
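The following sketch checks the L2 normalization step numerically. It assumes TfidfTransformer's default settings (smooth_idf=True, norm='l2') and uses a made-up count matrix; the hand-written part mirrors the smoothed idf, ln((1 + n) / (1 + df)) + 1, and is meant as an illustration rather than a reference implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term-count matrix: 3 documents x 3 terms.
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 2, 0]], dtype=float)

# scikit-learn with default settings (smoothed idf, L2 normalization).
sk_tfidf = TfidfTransformer().fit_transform(counts).toarray()

# The same computation by hand.
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
raw = counts * idf                          # tf * idf, not yet normalized
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize each row

row_norms = np.linalg.norm(sk_tfidf, axis=1)
print(np.allclose(sk_tfidf, manual))
print(row_norms)
```

After normalization every document vector has unit length, so long and short documents become comparable.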
Stemming
import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # reuse the stock analyzer (tokenizing, stop words), then stem each token
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

tfidf_vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
X_train_stemmed_tfidf = tfidf_vectorizer.fit_transform(twenty_train.data)
Handling numbers
from sklearn.feature_extraction.text import TfidfVectorizer

def number_normalizer(tokens):
    """Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant. By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)

class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))

vectorizer = NumberNormalizingVectorizer(stop_words='english', min_df=5)
Chinese text
import numpy as np
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

class Tokenizer(object):
    def __init__(self):
        self.n = 0  # number of lines tokenized so far

    def __call__(self, line):
        tokens = []
        for query in line.split('\t'):
            words = [word for word in jieba.cut(query)]
            # emit word unigrams and bigrams, joined with "_*_"
            for gram in [1, 2]:
                for i in range(len(words) - gram + 1):
                    tokens += ["_*_".join(words[i:i+gram])]
        # occasionally print a sample for inspection
        if np.random.rand() < 0.00001:
            print(line)
            print('='*20)
            print(tokens)
        self.n += 1
        if self.n % 10000 == 0:
            print(self.n, end=' ')  # progress indicator
        return tokens

tfv = TfidfVectorizer(tokenizer=Tokenizer(), min_df=3, max_df=0.95, sublinear_tf=True)
# df_all is not defined in these notes; assumed to be a pandas DataFrame with a 'query' column
X_sp = tfv.fit_transform(df_all['query'])
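Part-of-speech tagging (POS tagging)
A POS tagger parses the whole sentence, aiming to rearrange it into a dependency tree. Each node in the tree corresponds to a word, and the parent-child relations determine which word it depends on. With this tree, better decisions can be made, for example whether "book" is a noun ("This is a good book.") or a verb ("Could you please book the flight?").
import nltk
# requires the NLTK tokenizer and tagger data (installable via nltk.download)
nltk.pos_tag(nltk.word_tokenize("This is a good book."))
nltk.pos_tag(nltk.word_tokenize("Could you please book the flight?"))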
Bringing in external knowledge, e.g. SentiWordNet
SentiWordNet is a 13 MB file that assigns most English words a positive score and a negative score.
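A rough sketch of turning such a file into a lookup table. It assumes a tab-separated layout with the columns POS, ID, PosScore, NegScore, SynsetTerms and Gloss, and comment lines starting with "#". The rows below are made up for illustration and are not real SentiWordNet scores; load_scores is a hypothetical helper name.

```python
import io
from collections import defaultdict

# Made-up rows in the assumed layout; the scores are NOT real SentiWordNet values.
SAMPLE = (
    "# comment lines start with '#'\n"
    "a\t00000001\t0.75\t0.0\tgood#1 nice#2\tillustrative gloss\n"
    "a\t00000002\t0.25\t0.0\tgood#2\tillustrative gloss\n"
    "v\t00000003\t0.0\t0.5\tspoil#1\tillustrative gloss\n"
)

def load_scores(handle):
    """Return {(pos, word): (mean positive score, mean negative score)}."""
    collected = defaultdict(list)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        pos, _id, pos_score, neg_score, terms, _gloss = line.rstrip("\n").split("\t")
        for term in terms.split(" "):
            word = term.split("#")[0]  # drop the sense number, e.g. "good#1" -> "good"
            collected[(pos, word)].append((float(pos_score), float(neg_score)))
    # A word can appear in several synsets; average its scores over all of them.
    return {
        key: (sum(p for p, _ in vals) / len(vals),
              sum(n for _, n in vals) / len(vals))
        for key, vals in collected.items()
    }

scores = load_scores(io.StringIO(SAMPLE))
print(scores[("a", "good")])
```

Keying the table by part of speech as well as word is what makes the POS tags from the previous section useful here: the same word can carry different sentiment as a noun and as a verb.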
Text preprocessing:
Customizing the vectorizer classes
To make the preprocessor, tokenizer and analyzers aware of the model parameters, it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead of passing custom functions.
1.
>>> def my_tokenizer(s):
...     return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
...     ['some...', 'punctuation!'])
True
2.
>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> class LemmaTokenizer(object):
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())
3.
>>> import re
>>> def to_british(tokens):
...     for t in tokens:
...         t = re.sub(r"(...)our$", r"\1or", t)
...         t = re.sub(r"([bt])re$", r"\1er", t)
...         t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
...         t = re.sub(r"ogue$", "og", t)
...         yield t
...
>>> class CustomVectorizer(CountVectorizer):
...     def build_tokenizer(self):
...         tokenize = super(CustomVectorizer, self).build_tokenizer()
...         return lambda doc: list(to_british(tokenize(doc)))
...
>>> print(CustomVectorizer().build_analyzer()(u"color colour"))
['color', 'color']
Discussed separately; see the blog.
Topic models
Discussed separately; see the blog.
Word vectors
Building Machine Learning Systems with Python, pp. 37 and 106