One very useful feature of Google Analytics is its comparison of date ranges for metrics. You can compare the number of sessions, bounce rate, E-commerce transactions, AdWords cost, etc. between adjacent weeks, months, or years. One shortcoming of this comparison feature is that the percentage change between date ranges is not statistically qualified. If there is a 5% change compared to a previous time period, is this evidence of the success of a marketing campaign or site changes, or is it just due to randomness?

This tool helps qualify these date range comparisons with statistical context. What is the probability that a change in a given metric was simply due to chance (the p-value)? If the p-value is less than 10%, 5%, or 1%, depending on the threshold you want to set, then you can conclude that there is a low probability that the percentage change is due to chance, and a 1 − p (90%, 95%, 99%) probability that the change is due to an actual, non-random factor differing between the two date ranges. Of course, even if we get a statistically significant change in a metric, we still don't know the source of this change. Was the change due to a factor inside our control, like site changes or marketing campaigns? Or was it due to an external factor, such as seasonality (increased e-commerce shopping near holidays), a one-off increase in traffic from new external links to the site, or general increased search interest in a topic related to the site?

Also, a quick note on p-values. This extension uses 0.05 as the cutoff. Depending on the nature of the domain, it may make sense to use a different value. For example, in scientific or medical research you might see a lower cutoff like 0.01, while a social study with higher variance might use a higher cutoff of 0.10. As you raise the cutoff, you decrease the chance that you erroneously miss an actual trend (a Type II error), but increase the chance (the cutoff itself) that you wrongly categorize a random outcome as meaningful (a Type I error). As you lower the cutoff, you move toward the opposite trade-off. Here is a better overview if you'd like a refresher on Type I and Type II errors.
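In code terms, the decision itself is a one-line comparison (a sketch only; the extension's actual reporting is richer than a boolean):

```javascript
// Flag a result as significant when its p-value falls below the chosen
// cutoff; 0.05 is this extension's default.
function isSignificant(pValue, cutoff = 0.05) {
  return pValue < cutoff;
}

isSignificant(0.07);       // fails the 0.05 default
isSignificant(0.07, 0.10); // passes a looser 0.10 cutoff
```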

## Statistical Approach

We use paired statistical tests (the paired t-test and the Wilcoxon Signed-Rank test) in order to more easily detect small yet statistically significant changes in metrics across date ranges. A classic situation warranting a paired test is a medical study, say of blood pressure before and after a medication in a group of individuals. Since there is such high variance within any sample group of people, testing the same individuals lessens the effect of this internal variance, and thus yields a more sensitive test (to small changes in blood pressure).
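To make the variance argument concrete, here is a small JavaScript sketch with invented session counts (not real data): the raw daily values swing by hundreds between weekdays and weekends, but the per-pair differences barely vary, which is exactly the spread reduction a paired test exploits.

```javascript
// Sample standard deviation helper.
const sd = (xs) => {
  const mean = xs.reduce((s, x) => s + x, 0) / xs.length;
  return Math.sqrt(xs.reduce((s, x) => s + (x - mean) ** 2, 0) / (xs.length - 1));
};

// Hypothetical daily sessions, Monday through Sunday: large weekday/weekend
// swings within each week, but a steady uplift from week one to week two.
const week1 = [520, 560, 540, 555, 530, 210, 190];
const week2 = [540, 585, 562, 578, 551, 228, 207];
const diffs = week2.map((x, i) => x - week1[i]);

// The spread of the paired differences is a fraction of the spread of the
// raw daily values.
console.log(sd(week1).toFixed(1), sd(diffs).toFixed(1));
```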

Thus, when date ranges are selected, they should begin and end on the same weekdays. Essentially, we are treating a Monday in date range one and a Monday in date range two as the same individual, sampled at different times. This is admittedly a questionable assumption, as two different days are obviously not the same individual. It can also backfire by overreacting to large changes between the same weekdays, since the test assumes they are different measurements of the same individual. On the other hand, if we treat the two date ranges as independent samples, we lose the increased sensitivity derived from a paired test.
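For instance, a simple alignment check along these lines (a hypothetical helper, not part of the extension) can verify that two range start dates fall on the same weekday before pairing:

```javascript
// Pairing day i of range one with day i of range two only makes sense if
// both ranges start on the same weekday; equal range lengths then keep
// every pair on matching weekdays. Dates are ISO strings, parsed as UTC.
function sameWeekdayStart(startDate1, startDate2) {
  const day1 = new Date(startDate1).getUTCDay(); // 0 = Sunday ... 6 = Saturday
  const day2 = new Date(startDate2).getUTCDay();
  return day1 === day2;
}

sameWeekdayStart("2024-01-01", "2024-01-08"); // both Mondays
sameWeekdayStart("2024-01-01", "2024-01-09"); // Monday vs. Tuesday
```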

The ideal approach would be to only partially model the dependence due to the temporal relationship between dates (day of week, month, seasonal effects, etc.), while still maintaining some assumption of independence. This will be the scope of a future release.

### Wilcoxon Signed-Rank Test

We have two main options for comparing groups of paired samples. The typical choice is the paired t-test. This test assumes that the mean of the paired differences is normally distributed, so in order to use it we would normally need to verify this requirement. However, due to the central limit theorem, the distribution of the sample mean approaches a normal distribution as the sample size increases, so when our sample size (in this case, instances of our metric for the given time interval: days, weeks, months, etc.) is large enough, we are okay to use the paired t-test.
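As a sketch of the computation (in JavaScript, since the extension itself is JavaScript; the daily session counts below are invented), the paired t statistic is just the mean of the per-day differences divided by their standard error:

```javascript
// Paired t statistic: t = mean(d) / (sd(d) / sqrt(n)), where d holds the
// per-day differences between the two date ranges. The p-value then comes
// from the t distribution with n - 1 degrees of freedom (omitted here).
function pairedTStatistic(rangeA, rangeB) {
  if (rangeA.length !== rangeB.length) {
    throw new Error("date ranges must contain the same number of days");
  }
  const d = rangeA.map((x, i) => x - rangeB[i]);
  const n = d.length;
  const mean = d.reduce((s, x) => s + x, 0) / n;
  const variance = d.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  return mean / Math.sqrt(variance / n);
}

// Hypothetical daily sessions for two week-long ranges:
const thisWeek = [120, 135, 128, 140, 131, 90, 85];
const lastWeek = [110, 125, 120, 133, 124, 84, 80];
const t = pairedTStatistic(thisWeek, lastWeek);
```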

When we have a smaller sample size, however, say two weeks against the two previous weeks, the paired t-test may be inaccurate since we can't necessarily meet its normality assumption. Instead, we can use the Wilcoxon Signed-Rank test, a non-parametric test that operates by ranking the differences between the paired observations. Since we only rank and then sum these differences (separately for the negative and positive directions of the difference), we also dampen the effect of very large differences (like outliers), and thus do not have the same normality requirement. Of course, there has to be a trade-off when we give up assumptions (information) about our samples: the Wilcoxon test has less statistical power than the t-test.
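In code, the core of the signed-rank computation looks roughly like this (a simplified sketch with invented numbers; the actual port follows SciPy's implementation, including the p-value computation, which is omitted here):

```javascript
// Wilcoxon signed-rank statistic: drop zero differences, rank the absolute
// differences (average ranks for ties), then sum the ranks separately for
// positive and negative differences. W is the smaller of the two sums.
function wilcoxonW(rangeA, rangeB) {
  const d = rangeA.map((x, i) => x - rangeB[i]).filter((x) => x !== 0);
  const order = d.map((_, i) => i).sort((i, j) => Math.abs(d[i]) - Math.abs(d[j]));
  const ranks = new Array(d.length);
  for (let k = 0; k < order.length; ) {
    let end = k;
    while (end + 1 < order.length &&
           Math.abs(d[order[end + 1]]) === Math.abs(d[order[k]])) end++;
    const avgRank = (k + end) / 2 + 1; // 1-based average rank for tied values
    for (let m = k; m <= end; m++) ranks[order[m]] = avgRank;
    k = end + 1;
  }
  let wPlus = 0, wMinus = 0;
  d.forEach((x, i) => (x > 0 ? (wPlus += ranks[i]) : (wMinus += ranks[i])));
  return Math.min(wPlus, wMinus);
}

// Differences here are [10, 10, -2, 7, 7, 6, -3]: mostly up, two days down.
const w = wilcoxonW([120, 135, 128, 140, 131, 90, 85],
                    [110, 125, 130, 133, 124, 84, 88]);
```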

There is currently no JavaScript implementation of the Wilcoxon Signed-Rank test available (jstat, for example, does not include one), so I ported the Python SciPy implementation to JavaScript. Code is here