 Originally Posted by Replicon
I totally get that they're aware of these factors, and do their best to come up with elegant experiments, but even in your hypothetical examples, there are unknowns creeping in that you can't just mitigate easily. If you're interested in the number of reasons they give, you're likely trying to correlate the number of reasons ("rationalization factor?" hehe) to something specific. However, a variance in the number of reasons given might be more heavily affected by another, unanticipated factor. It's probably true that with a big enough sample size, and good enough randomization, a correlation would be meaningful. In other words, if you're isolating two groups of people that are random except for one attribute you want to measure, there is a size where the attribute split will overwhelm other random factors that could affect the results. But since there's no good way to know, statistically, how big a sample you need to have any kind of expectations about the result (since, among other reasons, you don't actually know just how strong a correlation to expect), any experiment conducted by a master's student on a relatively small number of people (in the tens or hundreds), I would think, has a reliability issue to be dealt with.
After reading this passage several times, I think I see the point you're getting at, but I'm still not completely sure, so correct me if I misinterpret you.
First, a side note on "good enough randomization." Achieving essentially perfect random assignment to experimental conditions is trivial. In the simplest case of two experimental groups with no constraints on sample size, it can be done perfectly adequately just by tossing a fair coin before each participant arrives. For more than two groups I like to use RANDOM.ORG - True Random Number Service, and more complicated randomization schemes are usually easy to accomplish with a simple R script, as in the sketch below. So I have to confess that I don't see what your worry is on this issue, if indeed you have one.
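For instance, here is a toy sketch (the condition names and group sizes are made up) that completely randomizes 30 participants into three equal-sized groups in a couple of lines of R:

    # randomly assign 30 participants to 3 groups of 10 each
    set.seed(42)                                        # only so the example is reproducible
    groups <- c("control", "treatmentA", "treatmentB")  # hypothetical condition names
    assignment <- sample(rep(groups, each = 10))        # shuffle a balanced vector of labels
    table(assignment)                                   # confirm 10 participants per group

You can get fancier (blocking, stratification, unequal allocation), but the basic mechanics really are that simple.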
Now let me try to paraphrase my understanding of your more central point about small samples being biased and/or unreliable. Assume we have a small sample of 10 people whom we randomly assign to either the treatment or the control condition. Assume also that, whatever our experimental manipulation and outcome variable may be, it happens to be the case that the manipulation has a strong effect on the outcome for men but little or no effect for women. Let's say that, by chance, we end up with a total of 6 men and 4 women in our sample. Finally, let's suppose that random assignment leaves us with 4 men/1 woman in the treatment group and 2 men/3 women in the control group. Obviously, in this case we should expect to see a stronger effect on the outcome variable in the treatment group than in the control group simply because of the unbalanced gender split. And if for some reason we failed to measure gender and take that factor into account in our analysis, it would look as though the treatment were simply more effective than the control overall, when in fact the truth is more subtle and qualified than that. Your point, then, is that random quirks of this kind should be greatly diminished in much larger samples, thanks to the law of large numbers. Is this interpretation basically right?
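Just to put toy numbers on that scenario (the values are entirely made up): say the baseline outcome is 50 for everyone, the treatment adds 10 points for men and 0 points for women, and the control group experiences no effect at all. Then:

    # hypothetical outcome values under the scenario above
    treatment <- c(60, 60, 60, 60, 50)   # 4 men (true effect +10) and 1 woman (no effect)
    control   <- c(50, 50, 50, 50, 50)   # 2 men and 3 women, none treated
    mean(treatment) - mean(control)      # observed "effect" = 8
    # but in a 50/50 population the average true effect is only (10 + 0) / 2 = 5,
    # so the unbalanced gender split inflates the apparent treatment effect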
That is all true; but of course, dealing with precisely these issues is perhaps the most basic function of statistical analysis. To start off in brief, technical terms: the confidence interval formula for a parameter estimate contains a sample-size term, and it works such that increasing the sample size monotonically narrows the confidence interval around that estimate. So parameter estimates (such as an estimated difference between two experimental groups) that are based on very small samples (as in our example above) will have correspondingly wide confidence intervals around them; that is, a lot of uncertainty is explicitly built into the estimate. This means that with small sample sizes, it typically takes a very strong "true" treatment effect and/or very rigorous methodological controls for the effect to clearly emerge empirically, because of the uncertainty built into the estimates. In larger samples, where the problematic issues we discussed above tend to be less severe, there is correspondingly less uncertainty built around the parameter estimates, and so "true" treatment effects can be revealed more easily.
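To make that concrete (a purely illustrative simulation, with an arbitrary assumed effect size), compare the width of the 95% confidence interval for the group difference at n = 5 per group versus n = 500 per group:

    set.seed(1)
    true_effect <- 0.5                    # assumed true difference in means (arbitrary)
    for (n in c(5, 500)) {
      treat <- rnorm(n, mean = true_effect, sd = 1)
      ctrl  <- rnorm(n, mean = 0, sd = 1)
      ci <- t.test(treat, ctrl)$conf.int  # 95% CI for the difference in means
      cat("n per group =", n, " CI width =", round(diff(ci), 2), "\n")
    }

The interval comes out roughly ten times wider at n = 5 than at n = 500, which is just the familiar 1/sqrt(n) behaviour of the standard error showing up in the interval width.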
All of this is to say that, while the issues you raised are real and will always be a possibility in an analysis, it is not as though small samples leave us especially vulnerable to this problem: the extra uncertainty they bring is made explicit in the analysis rather than hidden.
 Originally Posted by Replicon
What kind of statistical analysis do they go through to ensure, to the best of their abilities, that the results will be reliable? If you think there's a correlation, but the experiment shows just randomness (no correlation), how do you know whether to conclude that the sample set needs to be much, much bigger, or that you can mostly abandon trying to find a correlation, because there isn't one?
Well, you can never "know" for sure, but for most simple tests there are fairly straightforward techniques for obtaining estimates of this kind which the researcher can then use to inform their decision about whether to continue. These techniques are known as power analysis. For simple tests, they essentially consist of working algebraically backward through the standard statistical testing procedure by first assuming some nonzero true effect size (for example, you could assume that the estimated effect size you obtained from a small pilot study is equal to the true effect size) and then solving for what the values of certain variables (such as sample size) would need to be in order to have a particular probability of correctly revealing the true effect--this probability is known as the statistical power of the test. Alternatively, you can solve for statistical power given a fixed a priori sample size, or for what the true effect size would have to be given a fixed power level and sample size, etc.
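In R, for example, the base power.t.test() function does exactly this kind of algebra for a two-sample t-test: you fix all but one of effect size, sample size, significance level, and power, and it solves for the one you leave out (the numbers below are just placeholders):

    # solve for the per-group n needed to detect a medium-ish effect with 80% power
    power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
    # or flip it around: how much power does a fixed n = 30 per group buy you?
    power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)

The first call solves for n and the second for power; in both cases the answer is only as good as the assumed effect size, which is the standard caveat with power analysis.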
For more complicated tests, these estimates are often obtained through Monte Carlo simulation: instead of deriving power analytically from an assumed effect size X, you generate many, many simulated data sets in which the true effect is programmed to be X on average, analyze each data set using the statistical procedure of interest, and then estimate power simply as the proportion of simulated experiments in which the treatment effect was correctly detected.
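A bare-bones version of that for a simple two-group comparison might look like this (with arbitrary values for the effect size, sample size, and number of simulations):

    set.seed(123)
    n_sims <- 2000        # number of simulated experiments
    n      <- 30          # per-group sample size under consideration
    delta  <- 0.5         # assumed true effect size (in SD units)
    significant <- replicate(n_sims, {
      treat <- rnorm(n, mean = delta, sd = 1)
      ctrl  <- rnorm(n, mean = 0,     sd = 1)
      t.test(treat, ctrl)$p.value < 0.05   # did this simulated experiment "detect" the effect?
    })
    mean(significant)     # estimated power = proportion of significant results

That proportion is the Monte Carlo estimate of power for that particular combination of n and effect size; rerun it over a grid of n values and you have a simulated power curve.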
 Originally Posted by Replicon
I guess sometimes, they can do some calibration... like if people are rating pain on a scale from 1 to 10, they understand that everyone has a different tolerance, and therefore a different scale, so they might have them rate common painful things to get a feel for skew...
Mmm, not really sure what you're getting at with this one.