Code and Data for "How Large is Large Enough?"

Published: 29 August 2018 | Version 1 | DOI: 10.17632/xsy73v92wv.1
Contributor:
Jinshan Wu

Description

Here we share the data used in our study of the minimum representative size of a subset, together with the code for the metric. When two sets of data points are compared, such as the citations received by all papers of two journals within a certain time period, people usually compare the means, for example via the journal impact factor (JIF). We may instead compare the two sets directly: draw one sample from each set, compare the two samples, and record the fraction of draws in which the sample from set one is larger than the sample from set two. It is quite possible that, even when the mean of set one is larger than that of set two, this fraction can still be very low, especially when the two sets have large variances, i.e. when the sum of the variances is close to or even larger than the difference between the means. In that case there is a large overlap between the two data sets. We find a way to reduce the variance, and thereby the overlap: take K1 samples from set one and K2 samples from set two, then compute and compare the averages of the two subsets. Based on this observation, we find that as long as K1 and K2 are large enough, the fraction of cases in which the K1-average of set one is larger than the K2-average of set two can be quite large, in contrast to the original low fraction for single samples. We then define the size of each subset needed for a reliable comparison of the two sets as the minimum representative size of the set, and apply it to a set of journals. Here we provide the data and the code; examples are provided in the comments in the code.

The metric is implemented in the Python program $PrMultiSamComp\left(X, Y, K_{X}, K_{Y}, Pr, K2PorP2K\right)$. Given the two sets $X, Y$ and the threshold probability $Pr$, the program calculates $K_{X}, K_{Y}$ when the flag $K2PorP2K=0$; given the two sets $X, Y$ and the sizes of the re-sampling subsets $K_{X}, K_{Y}$, the program calculates $Pr$ when $K2PorP2K=1$.
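To make the two directions of the calculation concrete, the following is a minimal Monte Carlo sketch of the idea described above. It is not the distributed $PrMultiSamComp$ program. The function names `prob_mean_greater` and `min_equal_size`, the use of sampling with replacement, the number of trials, and the restriction to equal subset sizes $K_{X}=K_{Y}$ are all assumptions made for illustration, and the demo data are synthetic.

```python
import numpy as np


def prob_mean_greater(x, y, k_x, k_y, n_trials=20000, seed=0):
    """Estimate the probability that the average of k_x values drawn from x
    exceeds the average of k_y values drawn from y.

    Assumption: subsets are drawn with replacement; the original program
    may sample differently.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Each row is one re-sampled subset; average along the row.
    mean_x = rng.choice(x, size=(n_trials, k_x), replace=True).mean(axis=1)
    mean_y = rng.choice(y, size=(n_trials, k_y), replace=True).mean(axis=1)
    return float(np.mean(mean_x > mean_y))


def min_equal_size(x, y, pr, k_max=200, **kwargs):
    """Return the smallest K (with K_X = K_Y = K, an assumption) for which
    the estimated probability reaches the threshold pr, or None if no
    K <= k_max does."""
    for k in range(1, k_max + 1):
        if prob_mean_greater(x, y, k, k, **kwargs) >= pr:
            return k
    return None


# Synthetic demo data (not from the dataset): two heavily overlapping sets.
demo_rng = np.random.default_rng(42)
x_demo = demo_rng.normal(loc=1.0, scale=3.0, size=2000)
y_demo = demo_rng.normal(loc=0.0, scale=3.0, size=2000)

print("single-sample probability:", prob_mean_greater(x_demo, y_demo, 1, 1))
print("50-sample probability:", prob_mean_greater(x_demo, y_demo, 50, 50))
print("smallest equal K for Pr >= 0.9:", min_equal_size(x_demo, y_demo, 0.9))
```

With single samples the estimated probability stays close to one half because the sets overlap heavily, while averaging over larger subsets pushes it towards one. Scanning $K$ upward until the threshold is met is the simplest possible search; a search over unequal $K_{X}$ and $K_{Y}$ would need an additional rule for how the two sizes grow.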

Files

Institutions

Beijing Normal University

Categories

Statistics, Informatics

Licence