World Wide Web DOI 10.1007/s11280-013-0232-6
Fast T-overlap query algorithms using graphics processor units and its applications in web data query Mengjuan Li & Lianyin Jia & Jinguo You & Jianqing Xi & HaiFei Qin & Rui Zeng
Received: 12 August 2012 / Revised: 24 May 2013 / Accepted: 5 June 2013 © Springer Science+Business Media New York 2013
Abstract Given a collection of sets and a query set, a T-Overlap query identifies all sets having at least T common elements with the query. T-Overlap query is the foundation of set similarity query and join and plays an important role in web data query and processing, such as the behavior analysis of web users and the near-duplicate detection of web documents. To answer T-Overlap queries efficiently, unlike traditional CPU based algorithms, we design efficient GPU based algorithms. We first design an inverted index in GPU, then choose ScanCount, a straightforward but efficient T-Overlap algorithm, as the underlying algorithm for our GPU based T-Overlap algorithms. Depending on whether queries are processed serially or in parallel, three new efficient algorithms based on our GPU based inverted index are proposed. Among these three algorithms, GS-Parallel-Group processes a group of queries in parallel and supports a high degree of parallelism. Extensive experiments are carried out to compare our GPU based algorithms with other state-of-the-art CPU based algorithms. Results show that GS-Parallel-Group outperforms the CPU based algorithms significantly.

Keywords T-Overlap query . set similarity query . GPU based inverted index . ScanCount . GS-Parallel-Group

M. Li, Library, Yunnan Normal University, Kunming, China. e-mail: [email protected]
L. Jia (*) : J. You, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. e-mail: [email protected]; J. You e-mail: [email protected]
J. Xi, School of Computer Science & Engineering, South China University of Technology, Guangzhou, China. e-mail: [email protected]
H. Qin, Department of Computer Science, Chuxiong Normal University, Chuxiong, China. e-mail: [email protected]
R. Zeng, School of Information, Yunnan Normal University, Kunming, China. e-mail: [email protected]
1 Introduction

Given a dataset D made up of a collection of records (all sets) and a query q (also a set), a T-Overlap query identifies all records r∈D (the record identifier of r is denoted by rid) having at least T common elements with q, that is, |r∩q|≥T. T-Overlap query is the foundation of set similarity query and join. Similarity is researched extensively in many fields, e.g. information retrieval [17] and natural language processing [1, 6]. In this paper, we mainly focus on T-Overlap query, which plays an important role in web data query and processing, such as the behavior analysis of web users and the near-duplicate detection of web documents. For example, a web portal contains multiple areas and each user session visits a subset of them; a T-Overlap query here can find the users visiting at least T common areas with a specific user. As another example, when integrating job information from different sites, a T-Overlap query can be used to avoid storing near-duplicate jobs, because a job may be published many times on one site or on different sites.

Many algorithms have been proposed to solve T-Overlap query and join. Some of them, such as Allpair [5] and PPJoin [25], can only be used for T-Overlap join. Others, such as ProbeCount [20], MergeOpt [20], ScanCount [16], MergeSkip [16] and T-Similarity [15], can be used for both T-Overlap query and T-Overlap join. Answering a T-Overlap query is quite different from answering a T-Overlap join: algorithms efficient for T-Overlap join are not always efficient for T-Overlap query [15, 16]. For example, PPJoin is efficient for T-Overlap join but cannot be used for T-Overlap query, because it scans records one by one and builds an index over only part of the data. T-Overlap query, however, can serve as a basis for solving T-Overlap join. In this paper, we mainly focus on T-Overlap query. The algorithms mentioned above have two major weaknesses.
Firstly, these algorithms run only on CPU and need to be accelerated by modern high-performance hardware. From the perspective of a single user, there is no great difference between 20 ms and 200 ms to get the result of a query; from the perspective of a server, the former means ten times the throughput of the latter. Higher throughput is especially useful for web servers handling millions of queries per second. Secondly, none of the above algorithms is best for all thresholds T: some are only good at low thresholds, whereas others work better at high thresholds. To overcome these two weaknesses, in this paper we use Graphics Processing Units (GPU) as our main platform to solve T-Overlap query. GPUs were first designed for gaming applications and image processing, but recently they have also been used extensively in databases [3, 10, 11, 14], data mining [7, 9, 22], information retrieval [2, 8] and many other general-purpose applications. A new-generation GPU has an order of magnitude higher memory bandwidth and computation capability than a multi-core CPU, so much higher speed can be expected when processing queries in GPU.
In this paper, we design a GPU based inverted index (GII) and use ScanCount, an efficient T-Overlap algorithm, as our underlying algorithm. Depending on whether queries are processed serially or in parallel, two serial algorithms (GS-Serial and GS-Serial-Atomic) and one grouped parallel algorithm (GS-Parallel-Group) are proposed. GS-Serial processes the inverted lists related to a query one by one, GS-Serial-Atomic processes the inverted lists in parallel using atomic operations, and GS-Parallel-Group processes a group of queries in parallel. We make the following contributions: (1) To the best of our knowledge, we are the first to use a GPU based inverted index to solve T-Overlap query. (2) We carry out extensive experiments to compare our GPU based algorithms with CPU based algorithms; the results show that GS-Parallel-Group outperforms state-of-the-art CPU algorithms significantly. (3) By tuning two important parameters, the multiprocessor coefficient (MPC) and threads per block (TPB), we can choose an appropriate group size for GS-Parallel-Group and get optimal performance while keeping a reasonable GPU memory size.

The remainder of this paper is organized as follows. The next section discusses related work on T-Overlap query algorithms. Section 3 briefly introduces the GPU architecture. The GPU primitives that are necessary building blocks for our GPU based T-Overlap algorithms are described in section 4. We design GII in section 5 and three GII based algorithms in section 6. Experimental results on real-world datasets are presented in section 7. Section 8 concludes the paper.
2 Related work

The most prominent recent T-Overlap query algorithms are ProbeCount, MergeOpt, ScanCount and MergeSkip, which are all inverted index based and work efficiently on CPU.

Sarawagi and Kirpal [20] propose the ProbeCount and MergeOpt algorithms. Given a query q, there are |q| ordered inverted lists (lists for short) related to q; the i-th list is made up of the rids of the records containing the i-th element of q. ProbeCount maintains the frontiers of the |q| lists as a heap. We iteratively pop the top of the heap, push the next rid of the same list into the heap, and increment the count of the popped rid. If the count of a rid reaches the threshold T, we output the rid. The bottleneck of ProbeCount lies in pushing and popping all rids of all associated lists when the index is large. To overcome this bottleneck, MergeOpt divides the associated lists into T−1 long lists and |q|−T+1 short lists. The short lists are merged with the heap algorithm; for each popped rid, binary search on the long lists verifies whether the rid exists in at least T of the |q| lists in total. The simple idea behind MergeOpt is that each qualified rid exists in at least one of the short lists, so rids appearing only in long lists need not be checked. In particular, when T=1, MergeOpt degenerates into ProbeCount.

Li et al. [16] propose the ScanCount and MergeSkip algorithms. ScanCount trades space for time and abandons the heap and its operations: it simply scans all |q| lists and increments the count of each rid encountered, which is an improvement over ProbeCount. To do this, the algorithm needs a count array CA holding a counter for every rid in the dataset. MergeSkip also maintains the frontiers of the |q| lists as a heap. It first pops the n rids having the same value as the heap top t. If n is equal to or greater than T, we output t as a result and push
the next rids of the popped lists into the heap. Otherwise t cannot be a qualified result; we continue to pop T−1−n additional rids from the heap and mark the current heap top t′. For each popped list, we locate its smallest rid r satisfying r ≥ t′ and push r into the heap. The advantage of MergeSkip lies in skipping unqualified rids by exploiting T.

Unlike the inverted index based T-Overlap algorithms discussed above, [15] proposes the T-Similarity algorithm based on an expanded trie index (ETI). Each record is mapped to a path from root to leaf in ETI, and the common prefix of different records is mapped to a common trie path. The node structure of ETI is expanded to facilitate T-Overlap query. Based on ETI, an efficient T-Overlap query algorithm, T-Similarity, converts the T-Overlap problem to finding query nodes with query depth equal to T and runs much faster than the four inverted index based algorithms when the overlap threshold is low.

There are also algorithms using GPU based inverted indexes [2, 8, 23, 24], but they are mainly designed for intersection queries, not T-Overlap queries, so we do not discuss them in detail.
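The ScanCount baseline that the rest of the paper builds on can be sketched on the CPU as follows (a minimal Python sketch; the index shown is the inverted index of the example database of Table 1 later in the paper, and the function name is illustrative):

```python
def scan_count(inverted_index, query, T):
    """ScanCount: scan each inverted list associated with the query and
    count how often every rid appears; rids counted at least T times
    share at least T elements with the query."""
    counts = {}
    for element in query:
        for rid in inverted_index.get(element, []):
            counts[rid] = counts.get(rid, 0) + 1
    return sorted(rid for rid, c in counts.items() if c >= T)

# Inverted index for the example database of Table 1
index = {
    'a': [0, 2], 'b': [0, 1, 2, 3], 'c': [1, 2], 'd': [1],
    'e': [0, 3], 'f': [0], 'g': [0, 3], 'h': [0, 2, 3],
}
print(scan_count(index, ['a', 'b', 'e', 'h'], 3))  # → [0, 2, 3]
```

Records 0, 2 and 3 each share at least three elements with {a,b,e,h}, while record 1 shares only b, so it is filtered out.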
3 GPU architecture

A brief architecture of GPU is illustrated in Figure 1. GPU works as a co-processor for the CPU in a SIMD manner [19]. A GPU contains a number of multiprocessors (MPs), each of which consists of a number of SIMD processors. All processors in an MP execute the same instruction on different data at the same time. GPU does not support allocating device memory (DM) dynamically, so we usually allocate DM statically. Additionally, GPU cannot access main memory (MM) directly, so data must be transferred to DM beforehand and transferred back to MM after being processed in GPU. Each MP also has a small but very fast shared memory (SM) which is shared by all processors in the MP. Note that when accessing data in DM, it is necessary to take advantage of one of the most important features of GPU, coalesced access; when accessing data in SM, it is necessary to avoid bank conflicts.

Figure 1 A brief architecture of GPU (a CPU with main memory, and a GPU with device memory and N multiprocessors, each holding processors P1..Pn and a shared memory)

We use the compute unified device architecture (CUDA) released by NVIDIA as our GPU computing environment. CUDA provides a direct C language interface and is easy for programmers to use. CUDA code is divided into two parts, the host and the kernel. The host runs on the CPU and transfers data and results between MM and DM; the kernel processes data on the GPU in parallel. The basic programming unit of GPU is the thread. Threads in GPU are lightweight and can be created and terminated without much overhead, so we can create thousands or even millions of threads to process data in parallel. Threads are organized into blocks, each of which usually contains a multiple of 32 threads; let TPB denote threads per block. Threads in the same block can exchange information through SM. Threads in different blocks exchange information through DM, but the latency of DM is much larger than that of SM. To ensure the correctness of parallel algorithms and to synchronize threads, CUDA provides atomic functions, e.g. atomicCAS, atomicDec and atomicAdd. These atomic functions are very useful when multiple threads access the same address at the same time.
4 Basic primitives

In this section, we introduce the basic primitives we implemented in GPU: map, parallel scan, conditional scatter and block reduce. These primitives are essential building blocks for our GPU based T-Overlap query algorithms and take advantage of advanced features of GPU, such as massive thread parallelism and coalesced access.

4.1 Map

The map primitive we implemented is similar to the map primitive in [14]. Given an input array Ain[0,…,n−1] and a map function f, the output array Aout[0,…,n−1] is computed as Formula 1 shows:

Aout[i] = f(Ain[i]), i = 0, …, n−1    (1)
Depending on the size of the input array, multiple blocks are started to map elements; each thread is responsible for a single element of Ain.

4.2 Parallel scan

Parallel scan, also called parallel prefix sum, comes in an exclusive and an inclusive variant and is an important building block for a broad range of algorithms. Given an input array Ain[0,…,n−1] and an associative binary operator ⊕, exclusive parallel scan and inclusive parallel scan compute the output array Aout[0,…,n−1] as Formulas 2 and 3 show:

Aout[i] = ⊕_{j<i} Ain[j], i = 0, …, n−1    (2)

Aout[i] = ⊕_{j≤i} Ain[j], i = 0, …, n−1    (3)
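Sequential reference versions of Formulas 2 and 3 (with ⊕ = add), useful for checking a parallel implementation, can be sketched as:

```python
def exclusive_scan(a):
    # Formula 2 with add: out[i] = sum of a[j] for all j < i
    out, running = [], 0
    for x in a:
        out.append(running)
        running += x
    return out

def inclusive_scan(a):
    # Formula 3: the exclusive result plus a[i] itself
    return [e + x for e, x in zip(exclusive_scan(a), a)]

print(exclusive_scan([2, 4, 2, 1, 2, 1, 2, 3]))  # → [0, 2, 6, 8, 9, 11, 12, 14]
print(inclusive_scan([2, 4, 2, 1, 2, 1, 2, 3]))  # → [2, 6, 8, 9, 11, 12, 14, 17]
```

The sample array is the one scanned in Figure 2; a GPU implementation computes the same result in O(log n) parallel steps per stage rather than one sequential pass.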
The associative binary operator ⊕ used in parallel scan can be add, min, max, logical or, or logical and; in this paper, we use add as the default operator. From the formulas above, inclusive parallel scan can be computed easily from exclusive parallel scan by adding Ain[i] to Aout[i]. In this paper, we always use exclusive parallel scan as the default parallel scan primitive. The parallel scan primitive we use is implemented in the cudpp 2.0 library [http://code.google.com/p/cudpp/], which is based on the ideas of [12, 21]. The parallel scan consists of two stages, reduce and down-sweep. Each stage needs log(n) parallel steps, and the total work is O(n). An example of the parallel scan primitive is depicted in Figure 2; the arrows show the data movement, and two arrows meeting together denote the binary add.

Figure 2 An example of the parallel scan primitive: the exclusive add-scan of [2, 4, 2, 1, 2, 1, 2, 3] through the reduce and down-sweep stages yields [0, 2, 6, 8, 9, 11, 12, 14]

4.3 Conditional scatter

Given an input array Ain[0,…,n−1], a location array L[0,…,n−1] and a function f, conditional scatter computes the output array Aout[0,…,n−1] as Formula 4 shows:
Aout[L[i]] = i, if f(Ain[i]) == true    (4)
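A sequential sketch of Formula 4 (names are illustrative; on the GPU one thread handles each element):

```python
def conditional_scatter(a_in, loc, f):
    """Formula 4: write the location i (not the value a_in[i]) to
    a_out[loc[i]] whenever f(a_in[i]) holds."""
    a_out = [None] * len(a_in)  # None marks positions never written
    for i, x in enumerate(a_in):
        if f(x):
            a_out[loc[i]] = i
    return a_out

# Counts [4, 1, 3, 3] with threshold T = 3; loc is the exclusive scan
# of the qualifying flags [1, 0, 1, 1], i.e. [0, 1, 1, 2].
print(conditional_scatter([4, 1, 3, 3], [0, 1, 1, 2], lambda c: c >= 3))
# → [0, 2, 3, None]
```

With an exclusive scan of the qualifying flags as the location array, the qualifying indices are packed to the front of the output, which is exactly how the primitive is used in section 6.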
The conditional scatter we implemented differs from the scatter used in [13, 14] in two aspects: 1) not all elements are scattered to Aout, but only the elements satisfying f(Ain[i]) == true; 2) what we scatter is not the value of the i-th element of Ain, but its location i.

4.4 Block reduce

Given an input array Ain[0,…,n−1] and an associative binary operator ⊕, block reduce computes an output array Aout[0,…,b−1] for b blocks as Formula 5 shows:

Aout[i] = ⊕_j Ain[j], where i = 0,…,b−1, i⋅⌈n/b⌉ ≤ j ≤ i⋅⌈n/b⌉ + ⌈n/b⌉ − 1 and j ≤ n−1    (5)

The block reduce primitive starts b blocks; each block is responsible for ⌈n/b⌉ elements of Ain. Block reduce differs from the reduce primitive used in parallel scan and many other parallel algorithms: in block reduce, each block works on part of the data and produces its own result, whereas reduce produces a single result for the whole input array. We realize block reduce by slightly modifying the reduce implemented in the cudpp 2.0 library and use SM to speed up the process. An example array processed by block reduce is shown in Figure 3, where cells with the same shading belong to the same block.
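A sequential sketch of Formula 5 (with ⊕ = add; one block per slice of ⌈n/b⌉ elements):

```python
import math

def block_reduce(a_in, b):
    """Formula 5: block i sums the slice a_in[i*ceil(n/b) : (i+1)*ceil(n/b)],
    truncated at the end of the array."""
    chunk = math.ceil(len(a_in) / b)
    return [sum(a_in[i * chunk : (i + 1) * chunk]) for i in range(b)]

# Two blocks over an 8-element 0/1 array give one partial sum per block.
print(block_reduce([1, 0, 1, 1, 0, 1, 0, 0], 2))  # → [3, 1]
```

Applied to a 0/1 array like BA in section 6, each partial sum is the number of qualifying entries in that block's slice, which is how per-query result counts are obtained.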
Figure 3 An example array works on block reduce

5 GII: a GPU based inverted index

Most efficient T-Overlap algorithms work on an inverted index, which has been studied extensively in set similarity query, information retrieval and many other fields. Usually, the inverted index is created on CPU; Figure 4 shows the CPU based inverted index created for the example dataset of Table 1.

Figure 4 Inverted index created for the database of Table 1

In this section, we create a GPU based inverted index (GII). Unlike a CPU based inverted index, GII is array based and is made up of 4 arrays: EA stores all distinct elements of the database; LA stores the corresponding length of the inverted list of each element in EA;
PA stores pointers, each of which points to the inverted list of an element in EA; DA stores all inverted lists sequentially, one for each element in EA. LA and PA are both ordered by EA, so the element corresponding to a value in LA or PA is indicated by its location in the array; the inverted lists in DA are ordered by EA as well. Figure 5 shows the GII created for the database of Table 1. There is still an important issue in building such a GII: how can PA be computed efficiently in GPU? To address this issue, we use a temporary array OA storing the offset of each inverted list in DA. OA can be computed easily by executing parallel scan on LA, as in Figure 2. After getting OA, we compute each value of PA by adding the start address of DA in GPU to the corresponding value in OA.
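The construction just described can be sketched sequentially (a CPU illustration; on the GPU, OA is produced by the parallel scan primitive, and PA is obtained by adding the device address of DA to each offset in OA):

```python
def build_gii(database):
    """Build EA, LA, OA (offsets standing in for the pointer array PA)
    and DA for a database given as a list of records."""
    inverted = {}
    for rid, record in enumerate(database):
        for element in record:
            inverted.setdefault(element, []).append(rid)
    ea = sorted(inverted)                       # distinct elements
    la = [len(inverted[e]) for e in ea]         # inverted-list lengths
    oa, running = [], 0                         # exclusive scan of LA
    for length in la:
        oa.append(running)
        running += length
    da = [rid for e in ea for rid in inverted[e]]  # concatenated lists
    return ea, la, oa, da

# The example database of Table 1
ea, la, oa, da = build_gii([['a','b','e','f','g','h'], ['b','c','d'],
                            ['a','b','c','h'], ['b','e','g','h']])
print(la)  # → [2, 4, 2, 1, 2, 1, 2, 3]
print(oa)  # → [0, 2, 6, 8, 9, 11, 12, 14]
```

The resulting arrays match Figure 5: for instance, element b sits at offset 2 in DA with the list [0, 1, 2, 3].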
Table 1 Example database

RID | Record
0   | {a,b,e,f,g,h}
1   | {b,c,d}
2   | {a,b,c,h}
3   | {b,e,g,h}
Figure 5 GII created for the database of Table 1: EA = [a, b, c, d, e, f, g, h], LA = [2, 4, 2, 1, 2, 1, 2, 3], PA points into DA at offsets [0, 2, 6, 8, 9, 11, 12, 14], and DA = [0 2 | 0 1 2 3 | 1 2 | 1 | 0 3 | 0 | 0 3 | 0 2 3]
6 T-Overlap query algorithms based on GII

We choose ScanCount as the underlying algorithm for our GII based T-Overlap query algorithms. There are two reasons why ScanCount is a good match for GPU: 1) ScanCount is scan based and naturally supports parallelism; 2) ScanCount is easy to implement in GPU. Given a set Q of N queries, the i-th query is denoted Qi. We realize two kinds of GII based ScanCount algorithms according to whether the queries are executed in parallel: the serial GPU ScanCount algorithms and the parallel GPU ScanCount algorithm.

6.1 Serial GPU ScanCount algorithms

The first kind of GII based ScanCount algorithms processes queries one by one in GPU. Before processing Qi, the first problem we face is how to retrieve the |Qi| inverted lists efficiently. A naive idea is to look up the |Qi| elements in EA in parallel, one thread per element, to get the locations of these elements in EA; the efficiency of this approach is low because too many comparisons are executed. A more efficient way is hashing. In this paper, we use a GPU based Cuckoo hashing (G-Cuckoo) implemented in cudpp 2.0, which is a parallel GPU version of Cuckoo hashing [18]. We build G-Cuckoo with the elements of EA as hash keys and the
locations of the elements in EA as hash values. Via G-Cuckoo, we can retrieve the corresponding locations in EA for the |Qi| elements simultaneously, and thus get the corresponding inverted lists and their lengths.

After getting these |Qi| inverted lists, we need a main ScanCount kernel to scan them and store the count of each rid encountered in the array CA discussed in section 2. There are two kernels we can use. The first kernel scans the |Qi| inverted lists serially. To maximize the degree of parallelism, we start multiple blocks to scan a single inverted list according to its length. For each rid scanned, we increment the count at the corresponding location in CA. After one list has been scanned, another set of blocks is started to scan the next inverted list, and this is repeated for all |Qi| inverted lists. We call the algorithm adopting this kernel GS-Serial. The second kernel scans all |Qi| inverted lists in parallel, one block per inverted list. The main problem with this idea is write conflicts when multiple threads increment the count at the same location in CA. Fortunately, CUDA supplies atomic operations: using them we can scan the inverted lists in parallel and still get correct answers. We call the algorithm using this kernel GS-Serial-Atomic.

After the main ScanCount kernel has executed, the counts of the rids are stored in CA. Transferring CA to MM and scanning it on CPU to find the locations with count no less than T is inefficient for two reasons: 1) we would transfer the whole CA back to MM although the number of final results is usually much smaller than the length of CA; 2) scanning CA serially on CPU is too expensive. To avoid these problems, we adopt the following three-step framework to get the final results directly in GPU. 1) Map: we use the map primitive to map CA to an array BA containing only 0s and 1s; if the count at a location in CA is no less than T, the same location in BA is mapped to 1, otherwise to 0. 2) Parallel scan: by executing the parallel scan primitive on BA, we get an output location array (OLA) which gives the output location of every 1 in BA. 3) Conditional scatter: we execute the conditional scatter primitive with BA as input array and OLA as location array, scattering the final results to the front of CA. Given the GII in Figure 5 and a query {a,b,e,h}, the final results, shown as the cells with dark background and bold border, appear at the front of CA in Figure 6. For T-Overlap query, what we need are the qualified rids; after mapping CA to BA we no longer need the counts in CA, so to decrease the memory overhead in GPU we reuse CA to store the final results.

Figure 6 The three-step framework to get final results in GPU for a query {a,b,e,h}
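Putting the three steps together sequentially (an illustrative sketch; on the GPU each step is one of the parallel primitives of section 4):

```python
def final_results(ca, T):
    """Map CA to BA, exclusive-scan BA into OLA, then conditionally
    scatter the qualifying rids to the front of the output array."""
    ba = [1 if count >= T else 0 for count in ca]   # 1) map
    ola, running = [], 0                             # 2) exclusive scan
    for flag in ba:
        ola.append(running)
        running += flag
    num_results = ola[-1] + ba[-1]                   # last OLA + last BA
    out = [0] * num_results
    for rid, flag in enumerate(ba):                  # 3) conditional scatter
        if flag:
            out[ola[rid]] = rid
    return out

# Counts for query {a,b,e,h} on the Table 1 database, with T = 3
print(final_results([4, 1, 3, 3], 3))  # → [0, 2, 3]
```

Note how the number of results falls out of the scan for free as the sum of the last values of OLA and BA, which is exactly the observation used below to size the transfer back to MM.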
Additionally, the number of final results is obtained easily by adding the last value of BA to the last value of OLA. Finally, we transfer exactly the final results back to MM. The complete GS-Serial algorithm is depicted in Algorithm 1. GS-Serial-Atomic differs from GS-Serial only in the kernel of line 4, so we do not discuss it in detail.
Algorithm 1 GS-Serial
1. for i = 0 to N−1
2.   copy query Qi to GPU memory
3.   retrieve the |Qi| inverted lists via G-Cuckoo in parallel
4.   execute a kernel scanning all lists serially to get the count array CA
5.   execute the three-step framework to get the final results in CA
6.   copy the final results back to MM
6.2 Parallel GPU ScanCount algorithm

The other GII based ScanCount algorithm we implemented processes queries in parallel, one block per query. As in the serial algorithms, we need the three arrays CA, BA and OLA in GPU, but each now has length |D|*N for N queries, so the total DM needed for these three arrays is 3*|D|*N*sizeof(int), which may be very large when N is large. To decrease the memory overhead, we do not process all queries simultaneously but divide the queries into groups. Queries in a group are processed in parallel by multiple blocks, one block each; groups are processed serially.

Determining an appropriate group size is important. We define the multiprocessor coefficient (MPC) as a multiple of the number of multiprocessors our GPU contains. To maximize parallelism, the group size should be MPC times the number of multiprocessors, and it should not exceed the maximum number of queries supported. The maximum number of queries supported can be computed as Mavail/(3*|D|*sizeof(int)), where Mavail is the total DM size available for queries before processing. Since one block is responsible for one query of a group, the inverted lists of a query are processed serially within the block; this means we do not need atomic operations when counting the rids encountered. We call the algorithm using this kernel GS-Parallel-Group.

After the kernel has executed, CA stores the counts of the rids for all queries of a group, and we execute the same three-step framework as in the serial GPU ScanCount algorithms to get the final results directly in GPU. One important problem remains: the results of all queries sit together at the front of CA, so how do we split the results per query? To address this issue, we add a step that executes the block reduce primitive on BA to get an array NA containing the number of qualified records for each query.
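The group-size rule described above can be sketched as follows (a hypothetical helper; Mavail, the multiprocessor count, MPC and sizeof(int) are the quantities named in the text):

```python
def group_size(num_mps, mpc, m_avail, num_records, int_size=4):
    """Group size for GS-Parallel-Group: MPC times the number of
    multiprocessors, capped by the number of queries whose CA/BA/OLA
    arrays (3 * |D| * sizeof(int) bytes each) fit in the available DM."""
    max_queries = m_avail // (3 * num_records * int_size)
    return min(num_mps * mpc, max_queries)

# GTX 480: 15 multiprocessors, MPC = 3, 1536 MB of DM, |D| = 32 K records
print(group_size(15, 3, 1536 * 2**20, 32 * 1024))  # → 45
```

With the experimental settings of section 7, memory is not the binding constraint, so the group size is simply 15 multiprocessors times MPC = 3, i.e. 45 queries per group.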
Figure 7 shows the steps for getting the final results in GPU for two queries, {a,b,e,h} and {a,b,e,f}; cells with the same shading belong to the same block. We can see that the first three cells of CA are the results for the first query, and the fourth cell of CA is the result for the second query.
Figure 7 The process of getting final results in GPU for queries {a,b,e,h} and {a,b,e,f}
After the results of a group are transferred back to MM, we execute the next group of queries, repeating until all groups are processed.
7 Experiments

7.1 General setup

To evaluate the performance of our GPU based algorithms, we performed extensive experiments on two real datasets, comparing our GPU based algorithms with five CPU based algorithms. All experiments were carried out on a Core 2 Duo CPU E7500 @ 2.93 GHz with 2 GB memory running Windows 7 32-bit, using Microsoft Visual Studio C++ 2008 as the compiler. The GPU card we used is an NVIDIA GeForce GTX 480; its main parameters are shown in Table 2. The PCI Express bus is version 1.0 and the bandwidth between MM and DM is about 2.5 GB/s.

Table 2 The main parameters of GTX 480

Parameters      | Values
CUDA cores      | 480
Processor clock | 1401 MHz
Multiprocessors | 15
DM size         | 1536 MB
DM bandwidth    | 177.4 GB/sec

7.2 Datasets and queries

1) Datasets. The first real dataset is the MSWEB [4] dataset from the UCI KDD Archive. MSWEB is a one-week log tracing the virtual areas that users visited in the web portal www.microsoft.com.
Each record corresponds to a user session and its set value comprises the areas visited. MSWEB is a small dense dataset: there are 32 K records in total, the vocabulary contains 294 distinct elements, and the maximum cardinality of a record is 35. The second dataset is extracted from the DBLP Bibliography site [http://www.informatik.uni-trier.de/~ley/db/]. We parse the author and editor information of each publication into one record and remove duplicate elements from it. The raw data has about 3 M records, and we select the top 1 M records as our dataset. This is a large dense dataset with 283,885 distinct elements and a maximum record cardinality of 176.

2) Queries. Since a query returning an empty answer set is meaningless, we randomly select 1000 records with cardinality no less than T as queries for each threshold T = 2, 5, 10, 15, 20 and compute the total running times of the different algorithms. When performing queries for a certain threshold T, we use the queries generated for that threshold. In the experiments below, unless pointed out explicitly, TPB is set to 512 and MPC is set to 3. There are 15 multiprocessors in the GTX 480, so the maximum number of queries in a group is 45.

7.3 Comparison with other T-Overlap algorithms

The CPU based algorithms we compare with our GPU based algorithms are ScanCount, ProbeCount, MergeOpt, MergeSkip and T-Similarity. We perform the queries generated above for our three GPU based algorithms and the five CPU based algorithms with thresholds T = 2, 5, 10, 15, 20 respectively. Figure 8 shows the comparison results for MSWEB; the total elapsed times of all algorithms include the time spent transferring data between DM and MM. From Figure 8 we can see that the efficiency of both GS-Serial-Atomic
Figure 8 Query on MSWEB
and GS-Serial is higher than that of all the CPU based algorithms except MergeOpt at high thresholds. GS-Serial-Atomic runs slightly faster than GS-Serial, which shows that atomic operations give a benefit by improving parallelism. GS-Parallel-Group is nearly an order of magnitude faster than its CPU counterpart, ScanCount, and runs much faster than any other CPU based algorithm at all thresholds. For DBLP, we get similar results: GS-Parallel-Group is still the best among all algorithms. Figure 9 shows the corresponding results.

7.4 The effects of MPC and TPB

Query parameters such as MPC and TPB also have an important impact on running time. We fix T = 5 and execute GS-Parallel-Group on the 1000 generated queries for different TPBs and MPCs. Figures 10 and 11 show the corresponding results for MSWEB and DBLP respectively. From Figure 10 we can see that the curves for all TPBs go downwards as MPC increases and become nearly horizontal when MPC is no less than 3. Figure 11 shows a similar trend; the curves for higher TPB decrease much faster. We can also see that the total running times are nearly stable when MPC is 3 or greater for all TPBs except 128. We conclude from these two figures that there is no need for a very large MPC, since a larger MPC means larger memory overhead. By setting an appropriate MPC, we can get near-optimal performance while expending a reasonable amount of GPU memory.
Figure 9 Query on DBLP
Figure 10 The effects of MPCs and TPBs on MSWEB
7.5 Weakness of GII

The main weakness of GII lies in the difficulty of updating the index dynamically, because the current GPU does not support allocating DM dynamically. When GII needs to be updated, we transfer the four arrays mentioned in section 5 from MM to
Figure 11 The effects of MPCs and TPBs on DBLP
DM and rebuild GII. Fortunately, this cost is not too expensive owing to the high bandwidth between MM and DM and the high processing capability of GPU.
8 Conclusion and future work

In this article, unlike traditional CPU based indexes, we design a new inverted index, GII, based on GPU. Based on GII, two serial algorithms (GS-Serial and GS-Serial-Atomic) and one grouped parallel algorithm (GS-Parallel-Group) are proposed to perform T-Overlap query. Extensive experiments are carried out, and the results show that GS-Parallel-Group performs well and outperforms state-of-the-art algorithms significantly. Our future work aims to exploit more characteristics of GII and to further reduce the running times of our GPU based algorithms. Extending GII to other problems, such as set containment query, is also interesting work to be done.

Acknowledgements The research is supported by the Key Project of the Educational Commission in Yunnan Province of China (NO. 2012Z008), Science & Technology Projects of Guangdong Province (NO. 2009B050700008, 2008B090500193) and the Science & Technology Project on integration of production, education and research, Guangdong Province and Ministry of Education (NO. 2010B090400335).
References

1. Ahsaee, M.G., Naghibzadeh, M., Naeini, S.E.Y.: Semantic similarity assessment of words using weighted WordNet. International Journal of Machine Learning and Cybernetics, 1–12 (2012)
2. Ao, N., Zhang, F., Wu, D., Stones, D.S., Wang, G., Liu, X., Liu, J., Lin, S.: Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proceedings of the VLDB Endowment, 470–481 (2011)
3. Bandi, N., Sun, C., Abbadi, A.E., Agrawal, D.: Hardware acceleration in commercial databases: a case study of spatial operations. Proceedings of the VLDB Endowment, 1021–1032 (2004)
4. Bay, S.D., Kibler, D.F., Pazzani, M.J., Smyth, P.: The UCI KDD Archive of large data sets for data mining research and experimentation. ACM SIGKDD Explorations Newsletter, 81–85 (2000)
5. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. Proceedings of the 16th International Conference on World Wide Web, 131–140 (2007)
6. Castillo, J.J.: A WordNet-based semantic approach to textual entailment and cross-lingual textual entailment. Int. J. Mach. Learn. Cybern. 2(3), 177–189 (2011)
7. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Skadron, K.: A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 1370–1380 (2008)
8. Ding, S., He, J., Yan, H., Suel, T.: Using graphics processors for high performance IR query processing. Proceedings of the 18th International Conference on World Wide Web, 421–430 (2009)
9. Fang, W., Lu, M., Xiao, X., He, B., Luo, Q.: Frequent itemset mining on graphics processors. Proceedings of the Fifth International Workshop on Data Management on New Hardware, 34–42 (2009)
10. Govindaraju, N.K., Gray, J., Kumar, R., Manocha, D.: GPUTeraSort: high performance graphics co-processor sorting for large database management. Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 325–336 (2006)
11. Govindaraju, N.K., Lloyd, B., Wang, W., Lin, M.C., Manocha, D.: Fast computation of database operations using graphics processors. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 215–226 (2004)
12. Harris, M.: Parallel prefix sum (scan) with CUDA. www.nvidia.com, April (2007)
13. He, B., Govindaraju, N.K., Luo, Q., Smith, B.: Efficient gather and scatter operations on graphics processors. Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (2007)
14. He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N.K., Luo, Q., Sander, P.V.: Relational joins on graphics processors. Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 511–524 (2008)
15. Jia, L., Xi, J., Li, M., Liu, Y., Miao, D.: ETI: an efficient index for set similarity queries. Frontiers of Computer Science, 700–712 (2012)
16. Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. IEEE 24th International Conference on Data Engineering, 257–266 (2008)
17. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
18. Pagh, R., Rodler, F.F.: Cuckoo hashing. Journal of Algorithms, 122–144 (2004)
19. NVIDIA: CUDA Compute Unified Device Architecture, Programming Guide (2007)
20. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 743–754 (2004)
21. Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for GPU computing. Graphics Hardware, 97–106 (2007)
22. Shalom, S.A.A., Dash, M., Tue, M.: Efficient K-means clustering using accelerated graphics processors. Data Warehousing and Knowledge Discovery, 166–175 (2008)
23. Wu, D., Zhang, F., Ao, N., Wang, G., Liu, X., Liu, J.: Efficient lists intersection by CPU-GPU cooperative computing. 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 1–8 (2010)
24. Wu, D., Zhang, F., Ao, N., Wang, F., Liu, X., Wang, G.: A batched GPU algorithm for set intersection. 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN), 752–756 (2009)
25. Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. Proceedings of the 17th International Conference on World Wide Web, 131–140 (2008)