StdSort: Efficient Pre-Processing For Faster Vector Similarity Join Using Standard Deviation

Information

Title StdSort: Efficient Pre-Processing For Faster Vector Similarity Join Using Standard Deviation
Authors
Hyun Joon Kim, Sang-goo Lee
Year 2015 / 1
Keywords
Acknowledgement SRC
Publication Type International Conference
Publication International Conference on Ubiquitous Information Management and Communication (ICUIMC 2015)

Abstract

Vector similarity join is an important operation used in duplicate detection, entity resolution, and other data analysis tasks. Because it is essential in many fields, it has been researched extensively. In this paper, we propose an efficient data pre-processing technique called StdSort. It exploits the fact that the dimensions of vectors have different standard deviation values. Applied to the prefix and length filtering technique, the StdSort method can expedite the vector similarity join process. It requires O(n) pre-processing time, which is equal to that of the existing pre-processing method. Through experiments, we show that StdSort reduces both the overall time taken for the similarity join operation and the number of candidate similar pairs, compared with the existing pre-processing method.
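
The pre-processing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the sort direction or how the new dimension order is fed into prefix and length filtering, so the descending order used here is an assumption, and the names `dimension_stds`, `std_sort`, and `descending` are hypothetical. The sketch computes the standard deviation of each dimension in a single pass over the data and then reorders the dimensions of every vector by that value.

```python
import math


def dimension_stds(vectors):
    """Population standard deviation of each dimension, from one pass over the data."""
    n = len(vectors)
    d = len(vectors[0])
    sums = [0.0] * d
    sq_sums = [0.0] * d
    for v in vectors:
        for j, x in enumerate(v):
            sums[j] += x
            sq_sums[j] += x * x
    stds = []
    for j in range(d):
        mean = sums[j] / n
        # Clamp at zero to guard against tiny negative values from rounding.
        var = max(sq_sums[j] / n - mean * mean, 0.0)
        stds.append(math.sqrt(var))
    return stds


def std_sort(vectors, descending=True):
    """Reorder the dimensions of every vector by per-dimension standard deviation.

    Assumption: the sort direction is a free parameter in this sketch; the
    abstract does not say which direction the paper uses.
    """
    stds = dimension_stds(vectors)
    order = sorted(range(len(stds)), key=lambda j: stds[j], reverse=descending)
    reordered = [[v[j] for j in order] for v in vectors]
    return order, reordered


# Toy data: dimension 0 is constant, dimension 1 varies the most.
vectors = [
    [1.0, 0.0, 5.0],
    [1.0, 2.0, 4.0],
    [1.0, 4.0, 3.0],
]
order, reordered = std_sort(vectors)
print(order)      # dimension indices, highest standard deviation first
print(reordered)  # the same vectors with their dimensions permuted
```

Because the same permutation is applied to every vector, dot products and vector norms are unchanged, so the reordering does not alter which pairs are similar. It can only change how many candidate pairs a filter that depends on dimension order has to examine.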