public class GTest extends Object
Implements G Test statistics. This is known in statistical genetics as the McDonald-Kreitman test. The implementation handles both known and unknown distributions.
Two-sample tests can be used when the distribution is unknown a priori but provided by one sample, or when the hypothesis under test is that the two samples come from the same underlying distribution.
Constructor and Description |
---|
GTest() |
Modifier and Type | Method and Description |
---|---|
double | g(double[] expected, long[] observed) Computes the G statistic for goodness of fit comparing observed and expected frequency counts. |
double | gDataSetsComparison(long[] observed1, long[] observed2) Computes a G (Log-Likelihood Ratio) two sample test statistic for independence comparing frequency counts in observed1 and observed2. |
double | gTest(double[] expected, long[] observed) Returns the observed significance level, or p-value, associated with a G-Test for goodness of fit comparing the observed frequency counts to those in the expected array. |
boolean | gTest(double[] expected, long[] observed, double alpha) Performs a G-Test (Log-Likelihood Ratio Test) for goodness of fit evaluating the null hypothesis that the observed counts conform to the frequency distribution described by the expected counts, with significance level alpha. |
double | gTestDataSetsComparison(long[] observed1, long[] observed2) Returns the observed significance level, or p-value, associated with a G-Value (Log-Likelihood Ratio) for two sample test comparing bin frequency counts in observed1 and observed2. |
boolean | gTestDataSetsComparison(long[] observed1, long[] observed2, double alpha) Performs a G-Test (Log-Likelihood Ratio Test) comparing two binned data sets. |
double | gTestIntrinsic(double[] expected, long[] observed) Returns the intrinsic (Hardy-Weinberg proportions) p-Value, as described in p64-69 of McDonald, J.H. |
double | rootLogLikelihoodRatio(long k11, long k12, long k21, long k22) Calculates the root log-likelihood ratio for 2 state Datasets. |
public double g(double[] expected, long[] observed) throws NotPositiveException, NotStrictlyPositiveException, DimensionMismatchException
Computes the G statistic for goodness of fit comparing observed and expected frequency counts.
This statistic can be used to perform a G test (Log-Likelihood Ratio Test) evaluating the null hypothesis that the observed counts follow the expected distribution.
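Preconditions:
Expected counts must all be strictly positive.
Observed counts must all be non-negative.
The observed and expected arrays must have the same length, and their common length must be at least 2.
If any of the preconditions are not met, a MathIllegalArgumentException is thrown.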
Note: This implementation rescales the expected array if necessary to ensure that the sums of the expected and observed counts are equal.
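The statistic and the rescaling step described above can be sketched in plain Java as follows. This is a minimal illustration that assumes the usual definition G = 2 * sum(observed[i] * ln(observed[i] / expected[i])); the class name GStatisticSketch is hypothetical, the sketch is not the library's source, and treating a zero observed count as contributing nothing is an assumption of the sketch.

```java
/** Illustrative sketch only; GStatisticSketch is a hypothetical name, not part of the library. */
final class GStatisticSketch {

    /**
     * Computes G = 2 * sum(observed[i] * ln(observed[i] / expected[i])) after
     * rescaling the expected counts so that both arrays have the same total.
     */
    static double g(double[] expected, long[] observed) {
        if (expected.length < 2 || expected.length != observed.length) {
            throw new IllegalArgumentException("arrays must have the same length, at least 2");
        }
        double sumExpected = 0.0;
        double sumObserved = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (expected[i] <= 0.0 || observed[i] < 0) {
                throw new IllegalArgumentException("invalid count at index " + i);
            }
            sumExpected += expected[i];
            sumObserved += observed[i];
        }
        // Rescale the expected counts so their total matches the observed total.
        double ratio = sumObserved / sumExpected;
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            // Assumption of this sketch: a zero observed count contributes nothing.
            if (observed[i] > 0) {
                sum += observed[i] * Math.log(observed[i] / (ratio * expected[i]));
            }
        }
        return 2.0 * sum;
    }

    public static void main(String[] args) {
        double[] expected = {50.0, 50.0};
        long[] observed = {60, 40};
        System.out.println(g(expected, observed));
    }
}
```

Because of the rescaling, passing expected proportions such as {1.0, 1.0} yields the same statistic as passing {50.0, 50.0} for 100 observations.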
Parameters:
observed - array of observed frequency counts
expected - array of expected frequency counts
Throws:
NotPositiveException - if observed has negative entries
NotStrictlyPositiveException - if expected has entries that are not strictly positive
DimensionMismatchException - if the array lengths do not match or are less than 2.
public double gTest(double[] expected, long[] observed) throws NotPositiveException, NotStrictlyPositiveException, DimensionMismatchException, MaxCountExceededException
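Returns the observed significance level, or p-value, associated with a G-Test for goodness of fit comparing the observed frequency counts to those in the expected array.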
The number returned is the smallest significance level at which one can reject the null hypothesis that the observed counts conform to the frequency distribution described by the expected counts.
The probability returned is the tail probability beyond g(expected, observed) in the ChiSquare distribution with degrees of freedom one less than the common length of expected and observed.
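Preconditions:
Expected counts must all be strictly positive.
Observed counts must all be non-negative.
The observed and expected arrays must have the same length, and their common length must be at least 2.
If any of the preconditions are not met, a MathIllegalArgumentException is thrown.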
Note: This implementation rescales the expected array if necessary to ensure that the sums of the expected and observed counts are equal.
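To make the "tail probability beyond g(expected, observed)" concrete, the sketch below evaluates the chi-square upper tail in the special case of an even number of degrees of freedom, where it has the closed form exp(-x/2) * sum over j from 0 to df/2 - 1 of (x/2)^j / j!. This is an illustration only: the class and method names are hypothetical, the restriction to even degrees of freedom is a simplification of the sketch, and a general implementation would need an incomplete gamma function.

```java
/** Illustrative sketch only; ChiSquareTailSketch is a hypothetical name, not part of the library. */
final class ChiSquareTailSketch {

    /**
     * Upper tail probability P(X > x) for a chi-square variable with an even
     * number of degrees of freedom, using the closed form
     * exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!.
     */
    static double upperTailEvenDf(double x, int df) {
        if (df < 2 || df % 2 != 0) {
            throw new IllegalArgumentException("this sketch only handles even df >= 2");
        }
        if (x <= 0.0) {
            return 1.0;
        }
        double half = x / 2.0;
        double term = 1.0; // (x/2)^j / j! for j = 0
        double sum = 1.0;
        for (int j = 1; j < df / 2; j++) {
            term *= half / j;
            sum += term;
        }
        return Math.exp(-half) * sum;
    }

    public static void main(String[] args) {
        // Three bins give 3 - 1 = 2 degrees of freedom for the goodness-of-fit test.
        double gStatistic = 5.99; // hypothetical G value
        System.out.println(upperTailEvenDf(gStatistic, 2));
    }
}
```

With three bins the goodness-of-fit test has two degrees of freedom, so under this closed form the p-value reduces to exp(-G/2).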
Parameters:
observed - array of observed frequency counts
expected - array of expected frequency counts
Throws:
NotPositiveException - if observed has negative entries
NotStrictlyPositiveException - if expected has entries that are not strictly positive
DimensionMismatchException - if the array lengths do not match or are less than 2.
MaxCountExceededException - if an error occurs computing the p-value.
public double gTestIntrinsic(double[] expected, long[] observed) throws NotPositiveException, NotStrictlyPositiveException, DimensionMismatchException, MaxCountExceededException
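Returns the intrinsic (Hardy-Weinberg proportions) p-value, as described in p64-69 of McDonald, J.H.
The probability returned is the tail probability beyond g(expected, observed) in the ChiSquare distribution with degrees of freedom two less than the common length of expected and observed.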
Parameters:
observed - array of observed frequency counts
expected - array of expected frequency counts
Throws:
NotPositiveException - if observed has negative entries
NotStrictlyPositiveException - if expected has entries that are not strictly positive
DimensionMismatchException - if the array lengths do not match or are less than 2.
MaxCountExceededException - if an error occurs computing the p-value.
public boolean gTest(double[] expected, long[] observed, double alpha) throws NotPositiveException, NotStrictlyPositiveException, DimensionMismatchException, OutOfRangeException, MaxCountExceededException
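Performs a G-Test (Log-Likelihood Ratio Test) for goodness of fit evaluating the null hypothesis that the observed counts conform to the frequency distribution described by the expected counts, with significance level alpha. Returns true iff the null hypothesis can be rejected with 100 * (1 - alpha) percent confidence.
Example:
To test the hypothesis that observed follows expected at the 99% level, use gTest(expected, observed, 0.01)
Returns true iff gTest(expected, observed) < alpha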
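Preconditions:
Expected counts must all be strictly positive.
Observed counts must all be non-negative.
The observed and expected arrays must have the same length, and their common length must be at least 2.
0 < alpha <= 0.5
If any of the preconditions are not met, a MathIllegalArgumentException is thrown.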
Note: This implementation rescales the expected array if necessary to ensure that the sums of the expected and observed counts are equal.
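The decision rule described above amounts to comparing a p-value with alpha after checking that alpha lies in (0, 0.5]. A minimal sketch, with a hypothetical class and method name and a plain IllegalArgumentException standing in for the library's exception type:

```java
/** Illustrative sketch only; GTestDecisionSketch is a hypothetical name, not part of the library. */
final class GTestDecisionSketch {

    /**
     * Returns true when the null hypothesis can be rejected at significance
     * level alpha, that is, when pValue < alpha.
     */
    static boolean rejectNull(double pValue, double alpha) {
        // Written so that NaN is rejected as well as values outside (0, 0.5].
        if (!(alpha > 0.0 && alpha <= 0.5)) {
            throw new IllegalArgumentException("alpha must be in (0, 0.5], got " + alpha);
        }
        return pValue < alpha;
    }

    public static void main(String[] args) {
        double pValue = 0.004; // hypothetical p-value from a G-test
        System.out.println(rejectNull(pValue, 0.01));
    }
}
```

A smaller alpha makes rejection harder: alpha = 0.01 corresponds to the 99% level used in the example above.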
Parameters:
observed - array of observed frequency counts
expected - array of expected frequency counts
alpha - significance level of the test
Throws:
NotPositiveException - if observed has negative entries
NotStrictlyPositiveException - if expected has entries that are not strictly positive
DimensionMismatchException - if the array lengths do not match or are less than 2.
MaxCountExceededException - if an error occurs computing the p-value.
OutOfRangeException - if alpha is not strictly greater than zero and less than or equal to 0.5
public double gDataSetsComparison(long[] observed1, long[] observed2) throws DimensionMismatchException, NotPositiveException, ZeroException
Computes a G (Log-Likelihood Ratio) two sample test statistic for independence comparing frequency counts in observed1 and observed2. The sums of frequency counts in the two samples are not required to be the same. The formula used to compute the test statistic is
2 * totalSum * [H(rowSums) + H(colSums) - H(k)]
where H is the Shannon Entropy of the random variable formed by viewing the elements of the argument array as incidence counts; k is a matrix with rows [observed1, observed2]; rowSums, colSums are the row/col sums of k; and totalSum is the overall sum of all entries in k.
This statistic can be used to perform a G test evaluating the null hypothesis that both observed counts are independent.
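The entropy formula above can be written out directly. The following is a minimal sketch assuming natural logarithms and the convention that a zero count contributes nothing to the entropy; the class name is hypothetical, input validation beyond the length check is omitted, and the code is not the library's source.

```java
/** Illustrative sketch only; TwoSampleGSketch is a hypothetical name, not part of the library. */
final class TwoSampleGSketch {

    /** Shannon entropy (natural log) of the counts viewed as incidence counts. */
    static double entropy(long... counts) {
        double total = 0.0;
        for (long c : counts) {
            total += c;
        }
        double h = 0.0;
        for (long c : counts) {
            if (c > 0) { // a zero count contributes nothing
                double p = c / total;
                h -= p * Math.log(p);
            }
        }
        return h;
    }

    /** Computes 2 * totalSum * [H(rowSums) + H(colSums) - H(k)]. */
    static double gDataSetsComparison(long[] observed1, long[] observed2) {
        int n = observed1.length;
        if (n < 2 || n != observed2.length) {
            throw new IllegalArgumentException("arrays must have the same length, at least 2");
        }
        long[] colSums = new long[n];
        long[] k = new long[2 * n]; // all entries of the 2 x n matrix, flattened
        long rowSum1 = 0;
        long rowSum2 = 0;
        for (int i = 0; i < n; i++) {
            rowSum1 += observed1[i];
            rowSum2 += observed2[i];
            colSums[i] = observed1[i] + observed2[i];
            k[i] = observed1[i];
            k[n + i] = observed2[i];
        }
        double totalSum = (double) rowSum1 + rowSum2;
        return 2.0 * totalSum * (entropy(rowSum1, rowSum2) + entropy(colSums) - entropy(k));
    }

    public static void main(String[] args) {
        System.out.println(gDataSetsComparison(new long[] {30, 10}, new long[] {10, 30}));
    }
}
```

When the two samples have proportional counts, H(k) equals H(rowSums) + H(colSums) and the statistic is zero up to rounding, even though the sample totals differ.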
Preconditions:
observed1 and observed2 must have the same length and their common length must be at least 2.
If any of the preconditions are not met, a MathIllegalArgumentException is thrown.
Parameters:
observed1 - array of observed frequency counts of the first data set
observed2 - array of observed frequency counts of the second data set
Throws:
DimensionMismatchException - if the lengths of the arrays do not match or their common length is less than 2
NotPositiveException - if any entry in observed1 or observed2 is negative
ZeroException - if either all counts of observed1 or observed2 are zero, or if the count at the same index is zero for both arrays.
public double rootLogLikelihoodRatio(long k11, long k12, long k21, long k22)
Calculates the root log-likelihood ratio for 2 state Datasets. See gDataSetsComparison(long[], long[]).
Given two events A and B, let k11 be the number of times both events occur, k12 the incidence of B without A, k21 the count of A without B, and k22 the number of times neither A nor B occurs. What is returned by this method is
(sgn) sqrt(gDataSetsComparison({k11, k12}, {k21, k22}))
where sgn is -1 if k11 / (k11 + k12) < k21 / (k21 + k22); 1 otherwise.
Signed root LLR has two advantages over the basic LLR: a) it is positive where k11 is bigger than expected, negative where it is lower b) if there is no difference it is asymptotically normally distributed. This allows one to talk about "number of standard deviations" which is a more common frame of reference than the chi^2 distribution.
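The signed root can be sketched for a 2 x 2 table as below. This sketch computes G with the equivalent direct form 2 * sum(k * ln(k * n / (rowSum * colSum))) rather than the entropy form; the class and helper names are hypothetical, and clamping tiny negative values to zero before the square root is a choice of the sketch, not a documented behavior.

```java
/** Illustrative sketch only; RootLlrSketch is a hypothetical name, not part of the library. */
final class RootLlrSketch {

    /** One term k * ln(k * n / (rowSum * colSum)), taken as 0 when k is 0. */
    static double term(long k, double rowSum, double colSum, double n) {
        return k == 0 ? 0.0 : k * Math.log(k * n / (rowSum * colSum));
    }

    /** Signed square root of the G statistic for the table [[k11, k12], [k21, k22]]. */
    static double rootLogLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double row1 = (double) k11 + k12;
        double row2 = (double) k21 + k22;
        double col1 = (double) k11 + k21;
        double col2 = (double) k12 + k22;
        double n = row1 + row2;
        double g = 2.0 * (term(k11, row1, col1, n) + term(k12, row1, col2, n)
                + term(k21, row2, col1, n) + term(k22, row2, col2, n));
        // Rounding can push a zero statistic slightly below zero.
        double root = Math.sqrt(Math.max(0.0, g));
        // sgn is -1 when k11 / (k11 + k12) < k21 / (k21 + k22), 1 otherwise.
        return (k11 / row1 < k21 / row2) ? -root : root;
    }

    public static void main(String[] args) {
        System.out.println(rootLogLikelihoodRatio(30, 10, 10, 30));
        System.out.println(rootLogLikelihoodRatio(10, 30, 30, 10));
    }
}
```

Swapping the two rows of the table flips the sign but leaves the magnitude unchanged, which is what lets the result be read as a signed "number of standard deviations".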
Parameters:
k11 - number of times the two events occurred together (AB)
k12 - number of times the second event occurred WITHOUT the first event (notA, B)
k21 - number of times the first event occurred WITHOUT the second event (A, notB)
k22 - number of times something else occurred, i.e. was neither of these events (notA, notB)
public double gTestDataSetsComparison(long[] observed1, long[] observed2) throws DimensionMismatchException, NotPositiveException, ZeroException, MaxCountExceededException
Returns the observed significance level, or p-value, associated with a G-Value (Log-Likelihood Ratio) for two sample test comparing bin frequency counts in observed1 and observed2.
The number returned is the smallest significance level at which one can reject the null hypothesis that the observed counts conform to the same distribution.
See gTest(double[], long[]) for details on how the p-value is computed. The number of degrees of freedom used to perform the test is one less than the common length of the input observed count arrays.
Preconditions:
observed1 and observed2 must have the same length and their common length must be at least 2.
If any of the preconditions are not met, a MathIllegalArgumentException is thrown.
Parameters:
observed1 - array of observed frequency counts of the first data set
observed2 - array of observed frequency counts of the second data set
Throws:
DimensionMismatchException - if the lengths of the arrays do not match or their common length is less than 2
NotPositiveException - if any of the entries in observed1 or observed2 are negative
ZeroException - if either all counts of observed1 or observed2 are zero, or if the count at some index is zero for both arrays
MaxCountExceededException - if an error occurs computing the p-value.
public boolean gTestDataSetsComparison(long[] observed1, long[] observed2, double alpha) throws DimensionMismatchException, NotPositiveException, ZeroException, OutOfRangeException, MaxCountExceededException
Performs a G-Test (Log-Likelihood Ratio Test) comparing two binned data sets. The test evaluates the null hypothesis that the two lists of observed counts conform to the same frequency distribution, with significance level alpha. Returns true iff the null hypothesis can be rejected with 100 * (1 - alpha) percent confidence.
See gDataSetsComparison(long[], long[]) for details on the formula used to compute the G (LLR) statistic used in the test and gTest(double[], long[]) for information on how the observed significance level is computed. The number of degrees of freedom used to perform the test is one less than the common length of the input observed count arrays.
Preconditions:
observed1 and observed2 must have the same length and their common length must be at least 2.
0 < alpha <= 0.5
If any of the preconditions are not met, a MathIllegalArgumentException is thrown.
Parameters:
observed1 - array of observed frequency counts of the first data set
observed2 - array of observed frequency counts of the second data set
alpha - significance level of the test
Throws:
DimensionMismatchException - if the lengths of the arrays do not match
NotPositiveException - if any of the entries in observed1 or observed2 are negative
ZeroException - if either all counts of observed1 or observed2 are zero, or if the count at some index is zero for both arrays
OutOfRangeException - if alpha is not in the range (0, 0.5]
MaxCountExceededException - if an error occurs performing the test
Copyright © 2003–2016 The Apache Software Foundation. All rights reserved.