Generalized possibilistic fuzzy c-means with novel cluster validity indices for clustering noisy data
Description of this data
Thank you for using this code and these datasets. This description explains how the GPFCM code works. The code accompanies my paper "Generalized possibilistic fuzzy c-means with novel cluster validity indices for clustering noisy data", published in Applied Soft Computing. The main datasets mentioned in the paper are included together with the GPFCM code.
If you have any questions, feel free to contact me at:
Guidelines for GPFCM algorithm:
- Open the file "GPFCM-Code" using MATLAB.
- DATA1 to DATA6 are the datasets we used in the paper. Each dataset contains the data "yd", the optimal value of ρ "ruopt", and the number of clusters "C".
- In line 13 of the code, change the number in "DATA1" to the number of the desired data set. For example, to load DATA3, change "load DATA1" to "load DATA3".
- Click anywhere in the file "GPFCM-Code" and then press "Ctrl+Enter" to run the code.
- VFCM, VPFCM, and VGPFCM which appear on the command window are cluster centers computed by each of the algorithms FCM, PFCM, and GPFCM, respectively. You can find all of them in the "Workspace" of MATLAB as well.
- Sometimes PFCM may yield two or more coincident clusters for DATA4 or any other data. GPFCM will then also give two or more coincident clusters, because it starts from the PFCM result. If this happens, run the algorithm again; a new run will probably recover all cluster centers accurately. Generally, if you use GFCM rather than GPFCM, you will get better results with no coincident clusters. Settings of the code for GFCM are mentioned in item 14 (the settings are cFCM=1 and cPCM=0).
- Since the algorithm starts randomly, the order of the cluster centers may differ between runs, but their numerical values will not change considerably. For example, a center obtained as the third cluster center in one run (the third column of the matrix VGPFCM) may move to the fifth column of the matrix in another run (provided there are at least five clusters), while its value stays very close to the value from the first run. This is just because of the random initialization of the algorithm. Since FCM (by which GPFCM is initialized) is randomly initialized and is sometimes sensitive to initialization (depending on the data), there may be negligible differences between the cluster centers obtained in different runs. For example, consider DATA3 with 6 clusters. In one run we get:
-4.9960 -1.0169 -4.9708 1.9575 1.0521 -2.0271
-1.9853 -5.0464 5.9470 0.0031 6.0183 1.9896
And in another run we have:
-4.9960 -1.0169 1.9575 1.0521 -2.0271 -4.9708
-1.9853 -5.0464 0.0031 6.0183 1.9896 5.9470
The cluster centers are the same as those of the first run, but their positions (columns) in the matrix VGPFCM have changed.
- Line 46 computes the covariance norm matrix. If you uncomment line 47, the program uses the identity norm matrix (Euclidean distance) instead.
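Because the column order of VGPFCM can change between runs, comparing two result matrices element by element is misleading. The following is a minimal Python sketch (not part of the MATLAB package) of an order-insensitive comparison, using the two DATA3 runs listed above. The function names and the tolerance `tol` are hypothetical choices made for this illustration.

```python
# Illustrative sketch only; the names and the tolerance are assumptions,
# not taken from "GPFCM-Code". Each column of a matrix is one cluster center.

def columns(matrix):
    """Return the columns of a row-major matrix as a list of tuples."""
    return [tuple(col) for col in zip(*matrix)]

def same_centers(a, b, tol=1e-3):
    """True if a and b contain the same centers up to column reordering."""
    remaining = columns(b)
    for center in columns(a):
        # Look for a not-yet-matched column of b within tol of this center.
        match = next(
            (other for other in remaining
             if all(abs(x - y) <= tol for x, y in zip(center, other))),
            None,
        )
        if match is None:
            return False
        remaining.remove(match)
    return not remaining  # every column of b must have been matched

# The two runs on DATA3 reported above (2 rows, 6 cluster centers).
run1 = [[-4.9960, -1.0169, -4.9708, 1.9575, 1.0521, -2.0271],
        [-1.9853, -5.0464, 5.9470, 0.0031, 6.0183, 1.9896]]
run2 = [[-4.9960, -1.0169, 1.9575, 1.0521, -2.0271, -4.9708],
        [-1.9853, -5.0464, 0.0031, 6.0183, 1.9896, 5.9470]]

print(run1 == run2)              # element-wise comparison of the matrices
print(same_centers(run1, run2))  # comparison that ignores column order
```

The greedy matching is adequate when the centers are well separated relative to `tol`, as they are here; for centers that lie closer together than the tolerance, a proper assignment method would be a safer choice.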
Steps to reproduce
- In line 20, "cPCM0" is used for GFCM where cPCM=0. This is because in line 226 we compute typicalities from which R(i) of line 244 is calculated.
- In line 22, "eta" is η.
- The functions in the folder are related to those in the paper as: , , , .
- The parameter "omega" in the GPFCM Code (line 245) and the functions is .
- For details of the parameters cFCM, cPCM, eta, and m, please read the paper. cFCM=0 and cPCM=1 give GPCM, cFCM=1 and cPCM=0 give GFCM, and cFCM≠0 and cPCM≠0 give GPFCM.
- Our experience shows that GFCM is usually better than GPFCM.
- Lines 15 and 16: "N" is the number of data vectors and "D" is the number of independent variables.
- Line 23: "C" is the number of clusters. To enter your own value for the number of clusters, uncomment this line and enter the value. Since the datasets provided here include "C", this line is commented out.
- Line 25: "ruopt" is the optimal value of ρ discussed in equation 13 of the paper. To enter your own value of ρ, uncomment this line. Since the datasets provided here include "ruopt", this line is commented out. Please note that ρ is a pivotal parameter in the GPFCM and GFCM algorithms. An improper value of ρ results in inaccurate cluster centers.
- If line 47 is commented out, the covariance norm (Mahalanobis distance) is used; if it is uncommented, the identity norm (Euclidean distance) is used.
- When running the algorithm, first FCM is applied to the data. Cluster centers calculated by FCM initialize PFCM. Then PFCM is applied to the data and cluster centers computed by PFCM initialize GPFCM. Finally, GPFCM is applied to the data.
- The file "Cluster generator" was used to create DATA1 to DATA6. You can use it to generate your own dataset, where x0 and y0 (lines 6 and 7) are the x-coordinates and y-coordinates of the cluster centers, which you should enter. Finally, in line 51, 2400 noise points are added at random; you can change this number to any value you want.
- The file "Normalization" normalizes the data when required. For details, please read section 6 of the paper and equation (27).
- The file "PFCM" is the code of the PFCM algorithm, with the same symbols as in "GPFCM-Code". cFCM=0 and cPCM=1 give PCM, cFCM=1 and cPCM=0 give FCM, and cFCM≠0 and cPCM≠0 give PFCM.
- In both "GPFCM-Code" and "PFCM", CVI1, CVI2, and CVI3 are the ones discussed in section 3 of the paper. Moreover, CVIXB is Xie-Beni CVI, CVIFS is Fukuyama and Sugeno CVI, and CVIKwon is Kwon CVI.
- The datasets "GlassIdentification", "Ionosphere", "IRIS", "PimaIndiansDiabetes", "Seeds", and "Wine" are the real-life datasets used in section 6 of the paper.
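The difference between the covariance norm (Mahalanobis distance) and the identity norm (Euclidean distance) selected through line 47 can be illustrated with a small Python sketch. It is not taken from "GPFCM-Code": it assumes two-dimensional data and assumes that the covariance norm matrix is the inverse of the sample covariance of the whole data, which may differ from what the MATLAB code actually computes. All names are hypothetical.

```python
# Illustrative sketch only. Assumptions: 2-D data, and a norm matrix equal to
# the inverse sample covariance of all points (may differ from the MATLAB code).

def covariance_2d(points):
    """Sample covariance matrix (divisor n - 1) of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix; raises ValueError if it is singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("singular matrix")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def squared_distance(x, v, norm):
    """Squared norm-induced distance (x - v)^T * norm * (x - v)."""
    dx, dy = x[0] - v[0], x[1] - v[1]
    return (norm[0][0] * dx * dx
            + (norm[0][1] + norm[1][0]) * dx * dy
            + norm[1][1] * dy * dy)

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # identity norm: Euclidean distance

# Toy data spread four times wider along x than along y.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]
COV_NORM = inverse_2x2(covariance_2d(points))  # covariance norm: Mahalanobis

center = (0.0, 0.0)
for x in [(4.0, 0.0), (0.0, 1.0)]:
    print(x, squared_distance(x, center, IDENTITY),
          squared_distance(x, center, COV_NORM))
```

With the identity norm the point (4, 0) is sixteen times farther from the center (in squared distance) than (0, 1); with the covariance norm the two points are equally far, because the distance is rescaled by the spread of the data along each direction.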
This data is associated with the following publication:
Cite this dataset
Askari Lasaki, Salar; Askari Lasaki, Salar (2017), “Generalized possibilistic fuzzy c-means with novel cluster validity indices for clustering noisy data”, Mendeley Data, v1 http://dx.doi.org/10.17632/dgxfv4s5vt.1