For a more robust predictor, we need a wider variety of data. I get that by calculating outcomes analytically from a model, then fitting curves to the results.
For my baseline, I began from true scoring rates and used the Tango distribution to predict the number of runs scored per inning. That per-inning distribution then feeds a combinatoric model, first calculated by Ben Vollmayr-Lee, to determine the frequencies of runs per game. Treating runs scored and runs allowed as independent, this produces probabilities of wins, losses, and ties after nine innings.
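The pipeline above can be sketched roughly in Python. This is only an illustration under stated assumptions: the per-inning probabilities below are made-up numbers, not output from the Tango distribution, and the plain nine-fold convolution stands in for the combinatoric model rather than reproducing it. The function names are my own.

```python
def convolve(a, b):
    """Distribution of the sum of two independent run counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out


def runs_per_game(inning_dist, innings=9):
    """Runs-per-game distribution from a per-inning distribution.

    Assumes every inning is independent and identically distributed.
    inning_dist[k] is the probability of scoring exactly k runs in an inning.
    """
    game = [1.0]  # before any inning: zero runs with certainty
    for _ in range(innings):
        game = convolve(game, inning_dist)
    return game


def outcome_probs(scored, allowed):
    """(p_win, p_loss, p_tie), treating runs scored and allowed as independent."""
    win = loss = tie = 0.0
    for s, ps in enumerate(scored):
        for a, pa in enumerate(allowed):
            if s > a:
                win += ps * pa
            elif s < a:
                loss += ps * pa
            else:
                tie += ps * pa
    return win, loss, tie


# Hypothetical per-inning distributions (illustrative only, truncated at 4 runs).
offense = [0.70, 0.15, 0.08, 0.04, 0.03]
opponent = [0.74, 0.14, 0.07, 0.03, 0.02]

p_win, p_loss, p_tie = outcome_probs(runs_per_game(offense), runs_per_game(opponent))
print(p_win, p_loss, p_tie)
```

The three numbers sum to one; the tie probability is what the extra-inning correction has to redistribute.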
Note that I'm using rates, not runs. If you look at runs, you have to worry about expected runs scored in extra innings, the fact that the home team doesn't bat when leading after eight and a half innings, and the fact that the home team stops batting in the last inning when they outscore the opponent.
None of these corrections changes winning percentage, but I won't pretend to guess how much of an effect they have on runs. If you make the model more sophisticated, correcting for the final-inning courtesies and assuming the home team is limited to outscoring the visitors by one run in the final inning, I suspect you'll have most of the correction you need for a raw runs-to-wins converter.
I'm gambling that you can just use runs per game as a reasonable predictor of run rate, and work that direction.
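There is one correction I do make, for extra innings. If we assume that the innings continue until the tie is resolved, you don't have to fret over the recursion. Let p(W), p(L), p(T) be the probabilities of winning, losing, or tying a single inning. An extra inning either ends the game or leaves you exactly where you started, so the extra-inning winning percentage E satisfies E = p(W) + p(T)*E. Since p(W) + p(L) + p(T) = 1, that gives E = p(W)/(1 - p(T)) = p(W)/(p(W)+p(L)).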
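So in the model, I calculate W = p9(W) + p9(T) * (p1(W)/(p1(W)+p1(L))), where p9 refers to the outcome after nine innings and p1 to the outcome of a single inning. By visual inspection, the correction doesn't change the answer much.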
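As a sketch, the extra-inning correction looks like this in Python. The function names and the sample probabilities are hypothetical, chosen only to exercise the formula; the check at the end compares the closed form against a direct sum over "tie, tie, ..., then win".

```python
def extra_inning_win_pct(p_win, p_loss):
    """Chance of winning when single innings repeat until one is not tied."""
    decided = p_win + p_loss
    if decided <= 0:
        raise ValueError("an inning must have some chance of being decided")
    return p_win / decided


def overall_win_pct(p9, p1):
    """W = p9(W) + p9(T) * p1(W) / (p1(W) + p1(L)).

    p9 and p1 are (win, loss, tie) tuples for nine innings and one inning.
    """
    w9, _l9, t9 = p9
    w1, l1, _t1 = p1
    return w9 + t9 * extra_inning_win_pct(w1, l1)


# Hypothetical probabilities, for illustration only.
p9 = (0.45, 0.45, 0.10)
p1 = (0.30, 0.20, 0.50)

closed_form = extra_inning_win_pct(p1[0], p1[1])
# Direct sum: win in the first extra inning, or tie then win, and so on.
series = sum(p1[0] * p1[2] ** k for k in range(200))

print(closed_form, series, overall_win_pct(p9, p1))
```

With a tie probability this small after nine innings, the correction can move W by at most that amount, which is consistent with it not changing the answer much.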
June 25, 2004 11:34 PM