StdSort: Efficient Pre-Processing For Faster Vector Similarity Join Using Standard Deviation
Vector Similarity Join is an important operation that is used in duplication detection, entity resolution and other data analysis. It is an essential operation used in many fields, therefore researched extensively. In this paper we propose an efficient data pre-processing technique called StdSort. It utilizes the fact that the dimensions of vectors have different standard deviation values. Applied to the prefix and length filtering technique, StdSort method can expedite the vector similarity join process. It requires O(n) of pre-processing time which is equal to the existing pre-processing method. Through experiments, we showed that StdSort reduces the overall time taken for similarity join operation and the number of candidates for similar pairs than existing pre-processing method.