Methods for computing page similarity

TF-IDF, cosine similarity, Euclidean distance, SVD-LSI, simhash

TF-IDF

TF--term frequency = number of times the term occurs in the document / total number of terms in the document

IDF--inverse document frequency = log(total number of documents / number of documents containing the term)

TF-IDF = TF*IDF
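
A minimal sketch of the calculation, assuming each document is already split into a list of terms, that TF is the term count divided by the document length, and that IDF is log(total documents / documents containing the term) with the natural logarithm. The function names and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf(term, doc):
    """Term frequency: occurrences of `term` / total number of terms in `doc`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(total docs / docs containing `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    # Assumption: the term occurs in at least one document,
    # otherwise this division would fail.
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """TF-IDF weight of `term` in `doc` relative to the collection `docs`."""
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy collection of tokenized documents.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "date"],
]

apple_score = tf_idf("apple", docs[0], docs)    # (2/3) * log(3/1)
banana_score = tf_idf("banana", docs[0], docs)  # (1/3) * log(3/2)
```

"apple" scores higher than "banana" in the first document because it is both more frequent there and rarer across the collection.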

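Cosine similarity

Build a vector from each article, where each component is the number of times a given word appears in that article.

Compute the cosine of the angle between the two articles' vectors; the closer the value is to 1, the more similar the two articles are.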
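
The two steps above can be sketched as follows, assuming the articles are already tokenized into word lists and that raw word counts are used as vector components. The function name and the sample inputs are illustrative only.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the word-count vectors of two articles."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # The shared vocabulary defines the vector dimensions;
    # a Counter returns 0 for words that do not occur.
    vocab = set(counts_a) | set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in vocab)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty article is similar to nothing
    return dot / (norm_a * norm_b)

same = cosine_similarity(["a", "b", "b"], ["a", "b", "b"])  # identical articles
half = cosine_similarity(["a", "b"], ["a", "c"])            # one shared word
none = cosine_similarity(["a", "b"], ["c", "d"])            # no shared words
```

Because word counts are never negative, the result stays between 0 and 1 for this kind of vector.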

Euclidean distance
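
A minimal sketch, assuming the same word-count vectors used for the cosine method: the distance is the square root of the sum of squared differences between the two vectors, so 0 means identical counts and larger values mean less similar articles. The function name and the inputs are illustrative only.

```python
import math
from collections import Counter

def euclidean_distance(words_a, words_b):
    """Euclidean distance between the word-count vectors of two articles."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    vocab = set(counts_a) | set(counts_b)
    # A Counter returns 0 for missing words, so every dimension is covered.
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocab))

zero = euclidean_distance(["a", "b"], ["a", "b"])  # identical articles
dist = euclidean_distance(["a", "b"], ["a", "c"])  # differs in "b" and "c"
```

Unlike the cosine, this value depends on article length, so vectors are often normalized before comparing documents of very different sizes.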

SVD-LSI

http://www.52nlp.cn/%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97%E4%B8%A4%E4%B8%AA%E6%96%87%E6%A1%A3%E7%9A%84%E7%9B%B8%E4%BC%BC%E5%BA%A6%E4%B8%80

simhash

http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html
http://blog.csdn.net/lance_yan/article/details/11451781
http://taop.marchtea.com/06.03.html
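
A rough sketch of the simhash idea covered in the links above: hash every token, let each bit of each hash vote +1 or -1 on the corresponding bit position, keep the positions with a positive total as the fingerprint, and compare fingerprints by Hamming distance. The choices here are assumptions for illustration and are not taken from the linked articles: a 64-bit fingerprint, MD5 truncated to 8 bytes as the token hash, and a weight of 1 for every token (real implementations usually weight tokens, for example by TF-IDF).

```python
import hashlib

BITS = 64  # assumed fingerprint width

def token_hash(token):
    """64-bit hash of a token (assumption: first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")

def simhash(tokens):
    """Fingerprint of a token list; every token votes with weight 1."""
    votes = [0] * BITS
    for token in tokens:
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumps over the lazy cat".split()
distance = hamming_distance(simhash(doc_a), simhash(doc_b))
```

A small Hamming distance suggests near-duplicate pages; the cutoff to use is a tuning decision that this sketch does not settle.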