A discussion on Baseball Primer prompted me to look into the Pythagorean model for wins.
Ben has a nice summary of the math, and Wikipedia offers two formulas for the coefficients.
I used David Smyth's, and looked at the seasons from 1961 through 2003, comparing actual wins to predicted wins. Seasonal data came from the Lahman database, and the number crunching was done in Excel.
Actual wins is always an integer; predicted wins generally isn't. I used the square of the difference as a measure of the error of the predictor.
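For anyone who wants to reproduce the prediction step outside of a spreadsheet, here is a rough sketch. It assumes that the Smyth formula is the one usually called Pythagenpat, where the exponent is ((runs scored + runs allowed) / games) raised to a small power. The 0.287 default below is my assumption about the commonly quoted value, not something stated above, and the function names are made up for illustration.

```python
def pythagenpat_exponent(runs_scored, runs_allowed, games, power=0.287):
    """Exponent based on the run environment (assumed Pythagenpat form)."""
    return ((runs_scored + runs_allowed) / games) ** power


def predicted_wins(runs_scored, runs_allowed, games, power=0.287):
    """Pythagorean win estimate: games * RS^x / (RS^x + RA^x)."""
    x = pythagenpat_exponent(runs_scored, runs_allowed, games, power)
    win_pct = runs_scored ** x / (runs_scored ** x + runs_allowed ** x)
    return win_pct * games


def squared_error(actual_wins, predicted):
    """Square of the difference between actual and predicted wins."""
    return (actual_wins - predicted) ** 2
```

A team that scores exactly as many runs as it allows comes out at a .500 record, which is a quick sanity check on the formula.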
First, I plotted the error against the predicted winning percentage. That graph looks like a big circular blob around (.500, 0). Fitting a linear trendline showed a small bias in the data: the estimator is, on average, a little bit high for good teams and a little bit low for bad ones. This is consistent with Ben's remarks that there's a higher order term to consider.
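The trendline is ordinary least squares, the same thing a spreadsheet chart trendline computes. A minimal sketch follows, assuming the y values are signed errors (predicted minus actual wins) so that a positive slope means the estimate runs high for good teams. The sample numbers are invented purely to show the call; they are not from the Lahman data.

```python
def linear_trend(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: predicted winning percentage and signed error in wins.
sample_pct = [0.420, 0.480, 0.500, 0.530, 0.590]
sample_err = [-1.2, 0.8, -0.3, 0.4, 1.5]
slope, intercept = linear_trend(sample_pct, sample_err)
```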
Next, I took the errors and rounded them to the nearest whole number. With this, I created a simple bar chart: the x-axis is games from the estimate, the y-axis is percentage of the population. No big surprise, it looks a bit like a normal curve.
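The bar chart data can be rebuilt along these lines. This is a sketch, assuming the errors are signed differences in wins and that rounding follows the spreadsheet convention of halves going away from zero (Python's built-in round sends halves to the nearest even number instead, so it is not used here). Both function names are hypothetical.

```python
import math
from collections import Counter


def round_half_away(x):
    """Round to the nearest integer, with .5 going away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def error_distribution(errors):
    """Map each rounded error (in games) to its percentage of the population."""
    counts = Counter(round_half_away(e) for e in errors)
    total = len(errors)
    return {games: 100 * n / total for games, n in sorted(counts.items())}
```

Each key of the returned dictionary is one bar on the x-axis, and the value is its height.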
y = 0.100495 * e^(-0.030926 x^2)
if you believe in that many decimal places. 62% of the population is within 3.5 wins of the estimate, 91% within 6.5 wins, 99% within 9.5 wins.
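As a cross-check on those percentages, the fitted coefficients can be read as a normal curve. Matching e^(-x^2 / (2 sigma^2)) to the decay term gives a standard deviation of roughly 4 wins. The sketch below assumes the error really is normal with that spread, which the bar chart only suggests, and compares the implied coverage with the observed figures; the names are my own.

```python
import math

PEAK = 0.100495   # height of the fitted curve at x = 0
DECAY = 0.030926  # coefficient on x^2 in the exponent


def fitted_share(x):
    """Fitted share of the population that is x games from the estimate."""
    return PEAK * math.exp(-DECAY * x ** 2)


def implied_sigma():
    """Standard deviation implied by matching DECAY to 1 / (2 * sigma^2)."""
    return math.sqrt(1 / (2 * DECAY))


def normal_share_within(k, sigma):
    """Share of a zero-mean normal distribution within +/- k of zero."""
    return math.erf(k / (sigma * math.sqrt(2)))


sigma = implied_sigma()
for k in (3.5, 6.5, 9.5):
    print(f"within {k} wins: {normal_share_within(k, sigma):.1%}")
```

The normal approximation lands close to the observed 62% at 3.5 wins and a little under the observed shares at 6.5 and 9.5 wins.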
May 19, 2004 11:23 PM