Analysis of spatiotemporal data relationship using information granules

Data analysis especially data with space and time feature in a human-centric way requires interpretable representation of data. With this motivation, ...

1 downloads 23 Views 1MB Size

Download PDF

Int. J. Mach. Learn. & Cyber. DOI 10.1007/s13042-015-0386-x

ORIGINAL ARTICLE

Analysis of spatiotemporal data relationship using information granules Mingli Song1 • Wenqian Shang1 • Lidong Wang2 • Witold Pedrycz3

Received: 30 November 2014 / Accepted: 4 June 2015 Springer-Verlag Berlin Heidelberg 2015

Abstract Data analysis especially data with space and time feature in a human-centric way requires interpretable representation of data. With this motivation, we present a granular way of data analysis in which the data and the relationships therein are described through a collection of sets or fuzzy sets (information granules). In this paper, data are described by semantically meaningful descriptors-information granules over the space and time domain. The design process is guided by information granulation and degranulation. Thus a performance index used to obtain the best combination of information granules becomes a crucial issue. The effectiveness of the algorithm is demonstrated by experiments on two kinds of synthetic data and data from Alberta agriculture website. Keywords Spatiotemporal data Information granules FCM Granular descriptor

& Mingli Song [email protected]; [email protected] Wenqian Shang [email protected] Lidong Wang [email protected] Witold Pedrycz [email protected] 1

School of Computer Science, Communication University of China, Beijing 100024, China

2

Department of Mathematics, Dalian Maritime University, Dalian 116026, China

3

Department of Electrical and Computer Engineering, University of Alberta, Edmonton T6R 2G7, Canada

1 Introduction Spatiotemporal data become visible in various application fields such as marketing, land use, economics, engineering, industry, multimedia and so on. We have been witnessing a slew of approaches originating from studies developed within the setting of analysis and processing of spatiotemporal data. Some studies focus on mining knowledge from the data [1, 2], some try to build a model using spatiotemporal data [3], and some others determine the relationship of spatiotemporal data [4]. The need for a concise, highly interpretable, and accurate descriptors of data is highly visible so that such descriptions reveal and describe an essence of the main relationships and associations among variables of the systems. Information granules, as the name itself stipulates, are collections of entities, usually originating at the numeric level, that are arranged together due to their similarity, functional adjacency, and coherency or alike. Information granules are seen everywhere: spatial granulation like image processing and geographic information system (GIS), temporal granulation like cultural, legal, business orientation of the designer and so on. They are central to processes of abstraction which guide our intellectual pursuits. In this paper, we concentrate on the study of representation of spatiotemporal data with information granules which will help the subsequent analysis of spatiotemporal data. The spatial and temporal data can be described at various levels of abstraction. In fact, this level of abstraction helps realize a user-friendly approach for data analysis. The user can establish the most suitable perspective to view the data; in other words, the user can identify a proper level of abstraction. We here name this level of abstraction the granularity. This view inherently brings the concept of

123

Int. J. Mach. Learn. & Cyber.

information granules into picture. Information granules, which can be modeled in many formats, such as sets (intervals), fuzzy sets, rough sets, etc. cf. [5, 6, 21–23, 25] are regarded as a collection of conceptual landmarks using which one can look at the data. From data points to information granules, we show two forming ways: adopt clustering mechanisms; first describe data points (time series) in a shorter format (we get some coefficients) and then form information granules on those coefficients. The proper level of granularity can be optimized in an evolutionary way in which an objective function with one or two criteria is needed. The paper is organized as follows. Section 2 introduces the formulation of the problem. In Sect. 3, spatial data and temporal data representation are discussed. A series of synthetic and real data experiments are illustrated in Sect. 3. The conclusion is covered in Sect. 4.

Fuzzy sets provide a possibility to formally express concepts of continuous boundaries. These concepts are everywhere. When expressing ideas, describing concepts and communicating with people, we always use terms to which the yes–no does barely apply. Fuzzy set A is generally described by a membership function which maps the universe of discourse X in which A is defined into a unit interval:

2 Problem formulation

where r is the standard deviation and m is the average value. Another format of information granules is rough sets. The description of information granules completed with the aid of some vocabulary is usually imprecise. Intuitively, such description may lead to some approximations called lower and upper bounds. This is the essence of rough sets. The description of information granules completed with the aid of some vocabulary is usually imprecise. Intuitively, such description may lead to some approximations called lower and upper bounds. This is the essence of rough sets introduced by [10, 11]; refer also to [14]. Clustering and fuzzy clustering have been regarded as a synonym of structure discovery in data. The result, no matter what technique has been used, comes as a collection of information granules which serve as a quantification of concepts serving as descriptors of the phenomenon behind the data. Fuzzy C-Means [6] is one of the commonly used mechanisms of fuzzy clustering. With FCM, one is concerned with discovering the prototypes and partition matrix U by a minimization of the following objective function Q being regarded as a sum of the squared distances:

The spatiotemporal system can be effectively analyzed through separating data into spatial and temporal or using data as a whole entity. Spatial and temporal data are often correlated. For example, in the following Fig. 1, we encounter a collection of spatial data on the earth—cities characterized by x–y coordinates. For each city, there is an associated suite of temperature time series. Each city has its own feature on temperature thus spatial data and temporal data correlate. There are some different formalisms and concepts of information granulesand first we will review some basic concepts and operations on these information granules. Set theory occupies an important and unique place in modern mathematics since it can be shown that it can be used as a starting point for the derivation of all other branches of mathematics. In set theory, interval arithmetic which first appeared in [7–9] offered an important generalization of arithmetic defined on real numbers. If information granules are represented in the form of intervals X, we assume that X = [a, b], in which a and b are numeric values.

A : X ! ½0; 1 Formally, A(x) denotes a degree of membership that describes an extent to which x belongs to A. When A is triangular membership function, it can be expressed in the form Aðx, a, m, bÞ ¼ max fmin ½ðx aÞ=ðm aÞ; ðb xÞ=ðb mÞ; 0g. The membership functions are described by Eq. (1) when A is Gaussian function. Aðx; m; rÞ ¼ e

Q¼

c X N X

ðxmÞ2 r2

2 um ik kxk vi k

ð1Þ

ð2Þ

i¼1 k¼1

Fig. 1 A collection of stations along with associated time series

123

where vi’s are n-dimensional prototypes of the clusters, i = 1, 2,…,c, and U = [uik] stands for a partition matrix, k = 1, 2, …,N; uik is the membership degree of data xk in the ith cluster; m ([1.0) is the fuzzification coefficient which expresses the impact of the membership grades on the individual clusters. A position on the earth can be described by a combination of x coordinate and y coordinate. The information granules in the space x–y form regions by bring together

Int. J. Mach. Learn. & Cyber.

points being in close vicinity to each other. From the algorithm perspective, one can consider to use a clustering mechanism. It should be noted that a distance function used to quantify the closeness of points needs to be carefully defined. From the user perspective, the formed information granules provide a better way to look at the regions on earth, their location and size, and reflect the inner homogeneity of the points. We define a collection of spatial information granules in the x coordinate and y coordinate respectively, Ai : R ! ½0; 1; i ¼ 1; 2; . . .; n Bj : R ! ½0; 1; j ¼ 1; 2; . . .; m The selected forms of fuzzy sets can be: intervals, triangular membership functions and Gaussian membership functions, etc. In this case, they are described by the corresponding membership functions: Ai(x) and Bj(y). Thus we have n fuzzy sets on the x coordinate and m fuzzy sets

Fig. 2 From a collection of stations to spatial information granules

on the y coordinate. These fuzzy sets come with welldefined semantics. After the development of information granules on spatial data, we obtain not only fuzzy sets but also some clusters characterized by their closeness, please refer to Fig. 2. Here we give an example (in Fig. 3), in which we have 200 stations with x–y coordinates. For each coordinate, we define four triangular functions. As to Gaussian functions, there are two parameters needed to be carefully dealt with—average value and standard deviation. It is a headache problem to decide the value of standard deviation and an evolutionary method may be a choice. A time series is a collection of observations made chronologically. The nature of time series data includes: large in data size, high dimensionality and necessary to update continuously. Moreover time series data, which is characterized by its numerical and continuous nature, is always considered as a whole instead of individual numerical field. There are many ways for their representation which are popular in research [24]. One of the major reasons for time series representation is to reduce the dimension (i.e., the number of data point) of the original data. The simplest method perhaps is sampling [12]. In this method, a rate of m/n is used, where m is the length of a time series P and n is the dimension after dimensionality reduction. Thus the coefficients are the n sampled values. However, the sampling method has the drawback of distorting the shape of sampled/compressed time series, if the

Fig. 3 Four triangular functions in A and four triangular functions in B

123

Int. J. Mach. Learn. & Cyber.

sampling rate is too low. An enhanced method is to use the average (mean) value of each segment to represent the corresponding set of data points. First we divide the m points of times series into n parts and calculate the average of each segment as our coefficients. Some other ways of representing time series are in [13]: piece wise linear approximation (PLA), parameters of the autoregressive time series model (AR model), coefficients of the Fourier expansion, cepstral coefficients, components of discrete wavelet transforms (DWT), parameters of fuzzy rule based system (in case of fuzzy modeling of the series) and others. Assume that we have a time series T with t1 points and we cut it into t2 segments with equal length (have same number of time points, for easy analysis). In general, we may have ti = t2 coefficients or ti = 2 9 t2 coefficients according to the method selected. As to the sampling method and average value method, we have ti = t2 coefficients; as to piecewise linear approximation method and Fourier expansion method, we have ti = 2 9 t2 coefficients, etc. Normalize those coefficients by a certain method since we have to unify and compare the method for both spatial data and temporal data. Adopt a clustering mechanism like FCM on the normalized coefficients and we obtain a granular way of description—membership degrees (U in FCM) and prototypes. Cluster the coefficients set W with Fuzzy C-Means (FCM, c clusters). The performance of FCM on W can be illustrated by using from a small number of c to a large number. Thus the number of clusters can be determined in this way. We obtain prototypes Z1, Z2,…,Zc and membership functions C1, C2,…,Cc. Now we have collected information granules for spatial data and temporal data. Hence, we define the granular description (encoding) of spatiotemporal data like: gijl ¼ Ai Bj Cl ð3Þ The operator ‘‘9’’ represents Cartesian product and ‘‘s’’ represents sup-t composition (or inf-s) of fuzzy relations (which can also be seemed as convolution) [6]. The characters ‘‘t’’ in sup-t is a short form of t-norms and ‘‘s’’ in inf-s is s-norms. Thus, we can use the following formulas to rewrite formula Eq. (3).For one dimensional case (only x), there is no y coordinate: gil ¼ maxxk;zk ðAi ðxk Þ tCl ðzk ÞÞ

ð4Þ

or gil ¼ minxk;zk ðð1Ai ðxk ÞÞ s Cl ðzk ÞÞ

ð5Þ

where ‘‘t’’ stands for a certain t-norm, and ‘‘s’’ is the corresponding s-norm, i = 1, 2,…,n, l = 1, 2,…,c. gil is a number between 0 and 1 and is a possibility measure of the temporal data with respect to the selected information granules indexed (i, l). They describe a degree of

123

‘‘activation’’ of a certain element of the Cartesian product of the codebooks (Ai and Bj) by the membership functions of temporal data and also reveal the consistency between spatial and temporal data. The higher values of gil, the more consistent supported by experimental evidence. The Cartesian products of elements of A, B and C of information granules can be ordered linearly depending upon the values of G = [gil]. The reconstruction (decoding) of Cl can produce an upper bound and a lower bound of Cl. G = [gil] calculated from Eq. (4) can be used to generate an upper bound Cul since Cl ( Cul. And G = [gil] calculated from Eq. (5) can be used to generate a lower bound Cll since Cl ) Cll. These two bounds together with Cl provide a way to describe the quality of vocabulary. The solution of the reconstruction is expressed as follows [18–20]: For one dimensional case: If Cl is generated by (4) Cul = Ai / gil = mini;j ½Ai ðxÞ / gil

ð6Þ

If Cl is generated by Eq. (5) Cll ¼ ð1 Ai Þ u gil = maxi;j ½ð1 Ai ðxÞÞ u gil

ð7Þ

The pseudo-residuation operators associated with given t-norm and s-norm are defined respectively in the following form a/b ¼ supfv 2 ½0; 1 j atv bg

ð8Þ

aub = inffv 2 ½0; 1 j asv bg

ð9Þ

The associated performance index expressing the quality of reconstruction is defined as follows Q¼

N X c X

Cul ðxk ; zk Þ Cll ðxk ; zk Þ2

ð10Þ

k¼1 l¼1

The value of Q represents the distance of the reconstructed two bounds of Cl. Thus the smaller the value of Q is, the more compact of the algorithm. We try to minimize Q. The optimization of the vocabulary of information granules depending on the framework of information granules and the number of information granules is realized by means of PSO [15–17], which occurs to a viable optimization alternative for this problem. The objective function is expressed in Eq. (10). The PSO is well documented in the existing literature with numerous modifications and augmentations. Refer to the generic flow of computing in which velocities and positions of the particles are updated. The pseudocode of the whole algorithm is as follows Given the number of intervals n (or fuzzy sets, other information granules), determine each interval Ai, i = 1, 2,…,n, with coefficients therein. (If the spatial data are more than one dimension, determine Bj);

Int. J. Mach. Learn. & Cyber.

Represent the time series with a selected method, here we choose PLA; Run FCM on the coefficients of PLA to obtain Cl; Choose t-norm and s-norm, here we use Godel Calculate gil, Cul, and Cll Run PSO to minimize the objective function Q Finally we get the optimized coefficients of Ai, Cl;

3 Experimental studies In this section, we apply our algorithm on two different types of one dimensional functions as an example, Y1, Y2 and a real world problem. Here Y1 and Y2 can be seemed as Cl and l = 1 in this case. The functions are illustrated in Fig. 4. Y1 is a smooth curve with a crest and a flat area whereas Y2 is a straight line. We select the Godel t-norm and the corresponding s-norm [Go¨del: atb = min (a, b), asb = max (a, b)]. Their pseudo-residuation formulas are: a / b ¼ b if a [ b, a / b ¼ 1 if a b; a u b ¼ 0 if a b; a u b =b if

a \ b. Three types of fuzzy sets are tried: intervals, triangular functions, asymmetric Gaussian functions. Now let’s look at the results of Y1 and Y2. The performance index Q calculated with function Y1 is list in Table 1. The smallest Q occurs using intervals when the number of intervals is eight. From the above table we can see that the best performance occurs when a collection of eight interval functions is used. The two bounds of this best performance are shown below in Fig. 5 which gives an acceptable range (error) of the real function. The performance of PSO in Fig. 6 shows its effectiveness. The curve starts to become flat after 100 generations. The distance between the first generation and the last generation in Fig. 6 is around 0.25. The average distance Q between upper bound and lower bound is calculated with function Y2 in Table 2. We can see that the best performance occurs when a collection of nine interval functions is used. The two bounds of this best performance and the original function are shown below in Fig. 7. The two bounds are compact yet give a range for small errors.

Table 1 The value of Q with different types of fuzzy sets, and different numbers of fuzzy sets t-norm and s-norm type and no. of fuzzy sets

Godel

Intervals 3

0.4055

4

0.2826

5

0.2043

6

0.1742

7

0.1617

8

0.1143

9

0.1591

Triangular 3

0.4177

4

0.3490

5

0.2901

6

0.2193

7

0.1989

8 9

0.1738 0.1625

Gaussian

Fig. 4 Functions of Y1 (a), Y2 (b)

3

0.3214

4

0.2399

5

0.2669

6

0.1756

7

0.1838

8

0.1873

9

0.1484

123

Int. J. Mach. Learn. & Cyber.

Fig. 5 The original function and its two bound

Fig. 7 The original function and its two bounds Table 2 The value of Q with different types of fuzzy sets, and different numbers of fuzzy sets t-norm and s-norm type and no. of fuzzy sets

Godel

Intervals 3

0.3234

4

0.2400

5

0.1900

6

0.1568

7

0.1330

8 9

0.1152 0.1012

Triangular 0.3266

Fig. 6 Performance of PSO

The performance of PSO is shown in Fig. 8. We can see the stability of the curve after around 100 generations. The distance between the first generation and the last generation in Fig. 6 is around 0.025. We collect spatiotemporal data from Alberta agriculture website (http://www.agric.gov.ab.ca/app116/stationview. jsp.). Each station on the map is associated with several time series, like temperature, humidity and others. We use 100 stations and their temperature time series in the following experiments (in Figs. 9, 10). The average distance Q between upper bound and lower bound is: 2.3860 (intervals), 2.3598 (triangular function), 2.3354 (Gaussian function). We choose three

123

4

0.2668

5

0.2247

6

0.2022

7

0.1824

8

0.1672

9

0.1566

Gaussian 3

0.3269

4

0.2438

5

0.1920

6

0.1804

7 8

0.1581 0.1309

9

0.1168

fuzzy sets on A and B and the t-norm and s-norm are Godel norm. The smallest value of Q occurs when using Gaussian function.

Int. J. Mach. Learn. & Cyber.

Fig. 8 Performance of PSO

Fig. 10 Stations converted to x–y coordinates

products whereas temporal data particularly time series are dealt with in the way of representation methods and clustering mechanisms. A series of experimental results done by synthetic data shows a clear outstanding performance of our idea. It is a future research topic that how to combine these two parts into one granular descriptor. Acknowledgments Support from the Natural Science Foundation of China (NSFC) 61305100 and 61203283, 71401026, the Youth Science Foundation of Communication University of China 3132015XNG1501, YXJS201529, the technology research of cinema management system and collaborative network service platform (2012BAH02F04), National Key Technology Support Program (2013BAH66F02) are gratefully appreciated.

References

Fig. 9 Stations in the real map

4 Conclusions In this paper, we show a way how to analyze the relationship between spatial and temporal data with information granules. First spatiotemporal data are split into spatial and temporal parts and then analyzed respectively. Spatial data are described by fuzzy sets like triangular membership functions, Gaussian functions and so on and their Cartesian

1. Alatrista-Salas H, Aze´ J, Bringay S, Cernesson F, SelmaouiFolcher N, Teisseire M (2015) A knowledge discovery process for spatiotemporal data: application to river water quality monitoring. Ecol Inform 26(2):127–139 2. Parrott Lael, Proulx Raphae¨l, Thibert-Plante Xavier (2008) Three-dimensional metrics for the analysis of spatiotemporal data in ecology. Ecol Inform 3(6):343–353 3. Mur J, Lo´pez FA (2010) Modeling spatiotemporal data: an issue in honor of Dr. Jean Paelinck. J Geogr Sys 12:105–109 4. Bai L, Yan L, Ma ZM (2013) Determining topological relationship of fuzzy spatiotemporal data integrated with XML twig pattern. Appl Intell 39:75–100 5. Bargiela A, Pedrycz W (2003) Granular computing: an introduction. Kluwer Academic, London 6. Pedrycz W, Gomide F (2007) Fuzzy systems engineering: toward human-centric computing. Wiley, Hoboken 7. Warmus M (1961) Approximations of inequalities in the calculus of approximations: classification of approximate numbers. Bulletin de l’Academie Polonaise des Sciences 9(4):241–245

123

Int. J. Mach. Learn. & Cyber. 8. Warmus M (1956) Calculus of approximations. Bulletin de l’Academie Polonaise des Sciences 4(5):253–259 9. Sunaga T (1958) Theory of interval algebra and its applications to numerical analysis. Gaukutsu Bunken Fukeyu-kai, Tokyo 10. Pawlak Z (1982) Rough sets. Int J Comp Inform Sci 11:341–356 11. Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Kluwer Academic Publishers, Dordrecht 12. Astrom KJ (1969) On the choice of sampling rates in parametric identification of time series. Inf Sci 1(3):273–278 13. Fu T (2011) A review on time series data mining. Eng Appl Artif Intell 24:164–181 14. Skowron A (1989) The relationship between the rough set theory and evidence theory. Bull Pol Acad Sci Tech 37:87–90 15. Kennedy J, Eberhart RC (1942–1948) ‘‘Particle swarm optimization.’’ IEEE international conference on neural networks. NJ, 1995 16. Shi Y, Eberhart R (1998) ‘‘A modified particle swarm optimizer.’’ Proc IEEE Int Conf Evol Comput 1998. 69–73 17. Shi Y, Eberhart R (1945–1950) ‘‘Empirical study of particle swarm optimization.’’ Proc IEEE Congr Evol Comput 1999

123

18. Nobuhara H, Pedrycz W, Sessa S, Hirota K (2006) A motion compression/reconstruction method based on max t-norm composite fuzzy relational equations. Inf Sci 176:2526–2552 19. Hirota K, Pedrycz W (1999) Fuzzy relational compression. IEEE Trans Syst Man Cybern B Cybern 29(3):407–415 20. DiNola A, Sessa S, Pedrycz W, Sanchez E (1989) Fuzzy relational equations and their applications in knowledge engineering. Kluwer Academic Publishers, Dordrecht 21. Zadeh LA (1979) Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst 90(2):111–127 22. Pedrycz W, Skowron A, Kreinovich V (eds) (2008) Handbook of granular computing, wiley, Boca Raton, London, New York 23. Pedrycz W (2013) Granular computing analysis and design of intelligent systems. CRC Press Taylor and Francis Group, New York 24. Singh P (2015) A brief review of modeling approaches based on fuzzy time series. Int J Mac Learn Cybernet 25. Tan A, Li J (2014) A kind of approximations of generalized rough set model. Int J Mac Learn Cybernet

Analysis of spatiotemporal data relationship using information granules

Recommend Documents