Data Mining - Similarity
相似度的計算有 cosine similarity, Pearson correlation, Spearman rank correlation
Cosine Similarity
- 等於 0 : 不相似
- 小於 0 : 正相關
- 大於 0 : 負相關
若將 \(D_i, D_j\) 表達為兩個向量維度:
\(D_i = (w_{i1}, w_{i2}, \dots, w_{in})\) ,
\(D_j = (w_{j1}, w_{j2}, \dots, w_{jn})\) ,
則兩者的 餘弦相似度
為:
如果以 \(\vec{a}, \vec{b}\) 來表示,則為: \(cos \theta = \frac {\vec{a} \cdot \vec{b}} {\Vert{a}\Vert \ \Vert{b}\Vert}\)
以下面的例子,餘弦相似度為 15/(sqrt(14)*sqrt(18)) = 0.945
a = [1, 2, 3] b = [0, 3, 3]
# calculate with Python
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> from scipy.sparse import csr_matrix
>>> A = csr_matrix([[1, 2, 3], [0, 3, 3], [-2, 2, 1]])
>>> cosine_similarity(A[0], A)
array([[ 1. , 0.94491118, 0.4454354 ]])