TF-IDF, cosine similarity, Euclidean distance, SVD-LSI, simhash
TF-IDF
TF (term frequency) = number of times the term occurs in the document / total number of terms in the document
IDF (inverse document frequency) = log(total number of documents / number of documents containing the term)
TF-IDF = TF * IDF
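The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are already tokenized into lists of words; the function name tf_idf and the sample documents are hypothetical, and the natural logarithm is an assumption since the notes do not fix a base.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # TF = count / total terms; IDF = log(N / document frequency)
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample data, already tokenized.
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
scores = tf_idf(docs)
```

A term that appears in every document (here "banana") gets IDF = log(1) = 0 and therefore weight 0. Libraries often smooth the IDF term to avoid this; the sketch keeps the unsmoothed form from the notes.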
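Cosine similarity
Build a vector for each article, where each component is the number of times a given word occurs in that article.
Compute the cosine of the angle between the two articles' vectors; the closer the value is to 1, the more similar the two articles are.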
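A minimal sketch of this comparison, assuming each article has already been split into a list of words. The function name cosine_similarity and the choice to return 0.0 for an empty article are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the word-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)

    # Only words present in both articles contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))

    # Assumption: an empty article is treated as having similarity 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because word counts are never negative, the result lies between 0 (no words in common) and 1 (same word proportions).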
Euclidean distance
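For two word-count vectors, the Euclidean distance is the square root of the sum of squared component differences; a smaller distance means more similar articles. A minimal sketch, assuming tokenized input and a hypothetical function name euclidean_distance:

```python
import math
from collections import Counter


def euclidean_distance(tokens_a, tokens_b):
    """Euclidean distance between the word-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)

    # A word missing from one article counts as 0 there (Counter default).
    vocabulary = a.keys() | b.keys()
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in vocabulary))
```

Unlike the cosine measure, this value depends on article length, so two articles with the same word proportions but different lengths are not at distance 0.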
SVD-LSI
http://www.52nlp.cn/%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97%E4%B8%A4%E4%B8%AA%E6%96%87%E6%A1%A3%E7%9A%84%E7%9B%B8%E4%BC%BC%E5%BA%A6%E4%B8%80
simhash
http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html
http://blog.csdn.net/lance_yan/article/details/11451781
http://taop.marchtea.com/06.03.html