I recently read a very thought-provoking article (sorry, I think a subscription to the journal 'Bioinformatics' is required for this link) by Anne-Laure Boulesteix. The paper's title is 'Over-optimism in bioinformatics research' and one of the points the author makes is that there can be an effect called 'fishing for significance' when developing (for example) new statistical methods.
This problem is a version of what happens when you make multiple hypothesis tests. Imagine that you test 100 different genes to see whether or not they're differentially expressed between two different experiments and that for each gene you compute a regular, single-test p-value. If you decide to keep the genes that have a p-value <0.05, then you would expect to keep about 5 genes that aren't differentially expressed at all, but crop up by chance (false positives). This leads statisticians to make p-value corrections when making multiple hypothesis tests, to avoid getting (so many) false positives.
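The arithmetic is easy to check with a quick simulation. Under the null hypothesis, a well-calibrated p-value is uniformly distributed on [0, 1], so we can draw the 100 'null' p-values directly rather than simulating expression data (a sketch in Python; the Bonferroni correction at the end is just one standard correction, shown for illustration):

```python
import random

random.seed(42)

n_genes = 100
alpha = 0.05

# Under the null (no differential expression), a well-calibrated p-value
# is uniform on [0, 1], so drawing uniforms simulates the null p-values.
p_values = [random.random() for _ in range(n_genes)]

# Uncorrected testing: we expect about alpha * n_genes = 5 false positives.
false_positives = sum(p < alpha for p in p_values)
print(f"Uncorrected: {false_positives} 'significant' genes (about 5 expected)")

# Bonferroni correction: divide the significance threshold by the number
# of tests, so the chance of even one false positive stays below alpha.
bonferroni = sum(p < alpha / n_genes for p in p_values)
print(f"Bonferroni-corrected: {bonferroni} 'significant' genes")
```

Other corrections (Benjamini–Hochberg, for instance) are less conservative, but the principle is the same: the more tests you make, the more sceptical you must be of any single 'significant' result.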
With that idea in mind, consider the typical development process for a new statistical method. Imagine you have a good idea for a new type of statistical method; it's clever, it should be really useful scientifically, and you're very keen to spend some time working on it. You build some software to implement your idea, analyse an example data-set and produce some results. While it works quite well, the results give you several ideas about how to improve it.
You apply each of these ideas in turn, keeping the good ones and discarding the bad ones, and the results improve. It also leads to even more good ideas, which you try out in a similar fashion. And eventually you have a method that's producing pretty impressive results on the test data-set. You write up the results, publish them and move on to the next project.
This is all well and good, and people certainly use this approach to produce genuinely good statistical methods. But there is an element of multiple testing in what I've just described. By trying out a range of ideas on our test data-set, then keeping the 'good' ones, we're optimising our approach to do well on the test data.
What's really happening in this process is that two forms of apparent improvement are going on. The first is genuine: we're developing improved statistical models that simply work better. The second is over-fitting to our test data-set: we're finding models that just happen to work well on these particular data. This second one is a problem because it's an illusion; it won't help us with any other data (it will generalise poorly).
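This selection effect is easy to demonstrate with a toy simulation (everything here is hypothetical: the 'method variants' are pure noise predictors, so their true accuracy on coin-flip data is exactly 50%). Picking the best of many variants on one fixed test set produces a score that looks impressive, but it evaporates on fresh data:

```python
import random

random.seed(2024)

n_samples = 50    # size of our single test data-set
n_variants = 100  # number of 'improvement' ideas we try out

# Fixed test set: coin-flip labels. No method can truly beat 50% here.
test_labels = [random.random() < 0.5 for _ in range(n_samples)]

def accuracy(variant_seed, labels):
    # A 'method variant' that is pure noise: it guesses labels at random,
    # so its true accuracy on any data of this kind is 50%.
    rng = random.Random(variant_seed)
    preds = [rng.random() < 0.5 for _ in labels]
    return sum(p == lab for p, lab in zip(preds, labels)) / len(labels)

# 'Fishing': score every variant on the same test set and keep the best.
scores = {seed: accuracy(seed, test_labels) for seed in range(n_variants)}
best_variant = max(scores, key=scores.get)
print(f"Best variant's test-set accuracy: {scores[best_variant]:.0%}")

# Independent data: the apparent improvement was an illusion.
fresh_labels = [random.random() < 0.5 for _ in range(n_samples)]
print(f"Same variant on fresh data:       {accuracy(best_variant, fresh_labels):.0%}")
```

The best of a hundred worthless variants will typically score well above 50% on the test set, purely by selection; on fresh data it falls straight back to chance. Real method development mixes this illusory gain with genuine improvement, which is exactly what makes it hard to spot.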
So, what's the solution? Probably the most robust way is to validate any new method on independent data once the model has been finalised, measuring performance using metrics that were decided upon in advance. This isn't always easy, because suitable data aren't always easy to obtain, but I think this has to be the goal to aspire to.
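One way to sketch that protocol (again with hypothetical noise-only 'method variants', so any apparent skill is pure selection; the point is the workflow, not the method): all the idea-fishing happens on development data, and the locked-away validation set is consulted exactly once, at the end.

```python
import random

random.seed(7)

def accuracy(variant_seed, labels):
    # Hypothetical noise-only 'method variant' standing in for a real method.
    rng = random.Random(variant_seed)
    return sum((rng.random() < 0.5) == lab for lab in labels) / len(labels)

# Decided in advance: the metric (accuracy) and a validation set that is
# set aside before any method development starts.
development = [random.random() < 0.5 for _ in range(50)]
validation = [random.random() < 0.5 for _ in range(50)]  # locked away

# Develop freely on the development data: try 100 ideas, keep the best.
best = max(range(100), key=lambda seed: accuracy(seed, development))

# Then report a single, final score on the untouched validation data.
print(f"Development score: {accuracy(best, development):.0%}")
print(f"Validation score:  {accuracy(best, validation):.0%}")
```

The development score is inflated by all the fishing; the validation score is an honest estimate, precisely because that data played no part in choosing the method.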
Interestingly, this highlights a merit in turning a clever new statistical method into a tool for people to use: this is a great way to test said method on a wide range of different data-sets. Of course, it can be substantially more effort to develop such a tool, and it can be a bit intimidating to subject your new method to such rigorous scrutiny. But maybe this is the only way to find new methods that really are improvements over the current state of the art.