
September 18, 2004

Winning Percentage

OK, so what should W% look like? Provided that the Heaviside function H satisfies H(0) = 1/2 (which most of the common examples do), we can use the following expressions, which are certain to fit the three fixed points (W% = 0, 1/2 and 1 at X = -1, 0 and 1).


R = RS+RA
X = (RS-RA) / R
W% = ( H(X) + H(1) - 1 ) / ( 2 * H(1) - 1 )

From here, the approach is pretty straightforward: choose a version of the Heaviside function for H, find the appropriate coefficients, and go. As an exercise, it is worth confirming that very small coefficients do ensure that W% is approximately linear (specifically (1+X) / 2 ).

The first order approximations I've experimented with get within a game or two of the right answer; second order gets within a fraction of a game; third order improves on that by a smaller fraction still - not big enough to bother with.

In a 9.7 run/game context, we get


W%
= ( -1.01111) * ( .5 * ERFC( 1.558019 X + 0.240296 X ^ 3 ) - 0.99451 )

This is accurate to about 1/8 of a game. Between 9.2 and 10.2 R/G, these are accurate to about 3/5 of a game. Between 8.7 and 10.7 R/G, these are accurate to about 7/5 of a game.

In the range between 5 and 15 runs, the X coefficient fits well to 0.532476 Ln(R) + .34973; the X^3 term looks like 0.164906 Ln(R) - 0.12689, which produces estimates accurate to 1/9 of a game per 162.
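As a rough sketch of how the pieces above fit together, the following Python puts the normalization, the erfc-based step and the two log fits into one function. The function names are my own, and it assumes RS and RA are per-game rates with R = RS + RA inside the 5 to 15 run range where the fits were made.

```python
import math

def fitted_coefficients(runs_per_game):
    """Log fits quoted above for the X and X^3 coefficients (5 to 15 runs)."""
    log_r = math.log(runs_per_game)
    a = 0.532476 * log_r + 0.34973
    b = 0.164906 * log_r - 0.12689
    return a, b

def heaviside(x, a, b):
    """Smooth step built from erfc; H(0) = 1/2 and H(-x) = 1 - H(x)."""
    return 1.0 - 0.5 * math.erfc(a * x + b * x ** 3)

def win_pct(rs, ra):
    """W% = (H(X) + H(1) - 1) / (2 H(1) - 1), with X = (RS - RA) / (RS + RA)."""
    r = rs + ra
    x = (rs - ra) / r
    a, b = fitted_coefficients(r)
    h1 = heaviside(1.0, a, b)
    return (heaviside(x, a, b) + h1 - 1.0) / (2.0 * h1 - 1.0)

if __name__ == "__main__":
    # A team scoring 5.2 and allowing 4.5 runs per game (9.7 total).
    print(round(162 * win_pct(5.2, 4.5), 1), "wins per 162")
```

The three fixed points come out exactly by construction, whatever the coefficients are, so they make a convenient sanity check on any alternative choice of H.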


September 18, 2004

September 16, 2004

Hockey

Poisson Toolbox, by Alan Ryder, has a tour of goals-to-wins converters for hockey.

I'm vain enough to be disappointed that he hadn't noticed my work (though as far as I can tell, we've been working in parallel, so mine may not have been available). He does seem to have missed a few things I've been covering, so I still feel cutting edge.

How the hell did he manage to write 18 pages on Win Estimators without referencing Ben Vollmayr-Lee? That just seems criminal.

September 16, 2004

September 11, 2004

Fitting the Heavisides

The home stretch - having established that we want to experiment with some sort of Heaviside function, we take a moment to check that the candidates have the right properties, then do some fitting.


September 11, 2004

Another approach to guessing

I ended up attacking this yet again after my walk today, in the hopes of making further progress, and sure enough a few more things broke loose.

Again, the object of the exercise is to come up with a reasonable form of the solution, which has all the right properties, and enough flexibility left over to allow fitting.


September 11, 2004

July 29, 2004

A guess at a solution form

Another interesting possibility is to consider a form of the function that satisfies the boundary conditions and has a satisfactory derivative.

We know that W has a positive derivative with a maximum at 0, and have every reason to believe that it will be well behaved. So casting about for a similar function, one familiar form of the derivative would be A exp[-Bx^2]. Again, we see that I introduce a bias toward e.

Using the fundamental theorem of calculus


W(x)
= Integral[0:x] ( A exp[ -Bz^2 ] dz )
= A Integral[0:x] ( exp[ -Bz^2 ] dz )
= A/sqrt(B) Integral[0:x sqrt(B)] ( exp[-t^2] dt )
= A/sqrt(B) * ( sqrt(pi)/2 ) erf( x sqrt(B) )
= A sqrt(pi) erf( x sqrt(B) ) / ( 2 sqrt(B) )

W(1)
= 1/2
= A sqrt(pi) erf( sqrt(B) ) / ( 2 sqrt(B) )

I wasn't able to find a nice closed form, but you can choose a polynomial representation with a reasonable amount of accuracy (if you restrict yourself to the narrow band where baseball is actually played, for example, you can probably be satisfied with a simple quadratic).
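As a sanity check on the erf form, here is a small Python sketch. It relies on the standard identity Integral[0:u] exp[-t^2] dt = ( sqrt(pi)/2 ) erf(u), solves the W(1) = 1/2 condition for A given B, and compares the result against a brute-force integral of the assumed derivative. The function names are my own.

```python
import math

def amplitude(b):
    """A such that W(1) = 1/2 when dW/dx = A exp(-B x^2)."""
    root_b = math.sqrt(b)
    return root_b / (math.sqrt(math.pi) * math.erf(root_b))

def w(x, b):
    """W(x) = A sqrt(pi) erf(x sqrt(B)) / (2 sqrt(B))."""
    root_b = math.sqrt(b)
    return amplitude(b) * math.sqrt(math.pi) * math.erf(x * root_b) / (2.0 * root_b)

def w_numeric(x, b, steps=20000):
    """Midpoint-rule integral of A exp(-B z^2) from 0 to x, as a cross-check."""
    a = amplitude(b)
    h = x / steps
    return sum(a * math.exp(-b * ((i + 0.5) * h) ** 2) for i in range(steps)) * h

if __name__ == "__main__":
    for b in (0.01, 0.5, 1.0, 2.0):
        print(b, amplitude(b), w(1.0, b), w_numeric(1.0, b))
```

Going this direction (B in, A out) is easy; it is the inverse, B as a function of A, that has no tidy expression. Note that A tends to 1/2 as B shrinks toward zero, which is the straight-line case.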

July 29, 2004

July 26, 2004

Pythag abstraction and correction

It's a useful exercise, I believe, to consider the pythagorean approximation in its abstract form, to contemplate the features which must be present in the correct expression.

"In order to make it easier to think", I modify the scale of the previous solutions.


R = RS + RA
r = RS - RA
x = r/R

As before, we consider the consequences of fixing R. x is allowed to range over [-1,1]. We thus have the following boundary cases


W(0) = 0
W(1) = .5
W(-x) = -W(x)

The last of these conditions is most important (though it makes the first redundant); as before it allows us to restrict the Taylor expansion of W to the odd terms alone.

W(x) = A(R)x + B(R)x^3 + C(R)x^5 +...

There are a few additional conditions that we can work out from the nature of the problem itself. For instance, it should never make a team's winning percentage worse to score a larger percentage of the runs. So the first derivative must be positive everywhere - not just for all x, but also for all R. It is likely that the correct solution also exhibits the law of diminishing returns - so while the first derivative is positive everywhere, it should be maximum at 0. So the second derivative (which is an odd function) should be negative whenever x is in (0,1), for all R.

Working from here, A(R) > 0, for all R. B(R) < 0 for all R.

Actually, we can do a bit better than that in our expression for A(R). The derivative must be maximum at x = 0, where only the A(R) term is in play. Furthermore, by the mean value theorem, the derivative of W(x) has to equal 1/2 somewhere: [ W(1) - W(-1) ] / [ 1 - (-1) ] = 1/2. Therefore, A(R) has to be at least 1/2 for all R.

Now, if the derivative is very close to 1/2 near the origin, it is going to be very close to 1/2 everywhere (it has to be, to meet the terms of the boundary conditions), in which case we're dealing with a line: all of the higher order coefficients are zero.

This is exactly what has to happen in very low scoring conditions. When the run environment is low enough, neither team is capable of scoring two runs, and in effect the team that wins will be that which scores the run. So winning percentage is directly proportional to x.

Of course, A = 1/2 requires that the higher order coefficients cancel out (to achieve the boundary condition W(1) = 1/2). But this naturally follows from the fact that all of the higher order coefficients are zero.

Also note that we never actually achieve this condition - A(0) = 1/2, but W itself makes no sense when R=0, because x = r/R is undefined. This reflects the fact that our definition of the boundary of W (W=.5 if RA=0; W=-.5 if RS=0; W=0 if RS=RA) has an unresolvable contradiction when RS=RA=0.

If A(0) = 1/2, then the previous guesses at A(R) are wrong. We need 1/2 plus some function of R with a root at zero, which is monotonically increasing and tending toward infinity.

One solution form is A(R) = 1/2 + b asinh( cR ), which in the region of interest produces a pretty good fit with b = .330227 and c = .215276, and you can probably find a simpler fit A(R) = 1/2 + b ln ( 1 + cR ).


Of course, you can also simply choose to fit the data as best you can with a polynomial: A(R) = .5 + .0535R + .0088R^2 - .0024R^3 + .0003R^4 + ...
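For anyone who wants to poke at the asinh form, a minimal Python sketch follows. The function name is my own; the constants are the fitted b and c quoted above, and R is taken to be RS + RA.

```python
import math

def slope_at_origin(total_runs, b=0.330227, c=0.215276):
    """A(R) = 1/2 + b asinh(c R): the first order coefficient of W(x)."""
    return 0.5 + b * math.asinh(c * total_runs)

if __name__ == "__main__":
    for total_runs in (0.0, 1.0, 5.0, 9.7, 15.0):
        print(total_runs, slope_at_origin(total_runs))
```

It has the properties asked for: A(0) = 1/2 exactly, and it rises monotonically without bound as R grows.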

July 26, 2004

June 26, 2004

Pythag win predictors - fitted coefficients

Excel was used to determine the equation below, using the tricks previously described. The form of the coefficients was determined by first drawing a plot, and looking to see which trendline gave the right shape. The datapoints used for the fitting covered each 1/10 of a run increment in the scoring rate from 0.1 to 20 (we assume that Tango's distribution continues to hold at this level of offense). That comes to 19900 or so datapoints in all, since the reflections are known to be anti-symmetric.

W / Y
= .5232 X ^ -0.7234
- .0603 X ^ -1.8241 Y^2
+ .0036 X ^ -2.8565 Y^4
-  8e-5 X ^ -3.3888 Y^6

There's nothing magic here about using 4 terms beyond convenience. If we were to be truly rigorous, there would be one last term to ensure that W = .5 when Y = X.
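For convenience, here is the fitted series as a short Python sketch. The function names are my own; the (a, b) pairs are transcribed from the four terms above, and RS and RA are assumed to be per-game scoring rates.

```python
# (a_k, b_k) pairs for the terms a_k * X^b_k * Y^(2k), k = 0..3.
TERMS = [
    (0.5232, -0.7234),
    (-0.0603, -1.8241),
    (0.0036, -2.8565),
    (-8e-5, -3.3888),
]

def rotate(rs, ra):
    """Rotate (RS, RA) into X (run environment) and Y (run differential)."""
    scale = 2 ** -0.5
    return (rs + ra) * scale, (rs - ra) * scale

def w_series(rs, ra):
    """W = Wpct - .500 from the four-term fit; W / Y is a series in Y^2."""
    x, y = rotate(rs, ra)
    return y * sum(a * x ** b * y ** (2 * k) for k, (a, b) in enumerate(TERMS))

if __name__ == "__main__":
    print(0.5 + w_series(5.5, 4.5))
```

Because only even powers of Y appear inside the sum, swapping RS and RA flips the sign of W exactly, whatever the fitted numbers are.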

As a check, let's consider the linear predictor derived by Ben Vollmayr-Lee.

p = 1/2 + n * ( x - 1/2 )
p - 1/2 = n * ( RS/(RS+RA) - 1/2 )
W = n/2 (RS-RA)/(RS+RA)
W = n/2 Y/X

He found a best fit at n = 1.819. Taking only the first order term of the fitted estimator, and rearranging

W 
= .5232 X ^ -0.7234 Y 
= (.5232 X ^  0.2766) (Y/X)
= (1.064 X ^ 0.2766)/2 (Y/X)
n = (1.064 X ^ 0.2766)

If we look at the 2003 AL, the scoring average was roughly 9.72 runs per game. This translates to an X value of 6.8731, which gives n = 1.813.

The coefficients associated with X are likely some function of n (the index of the term in the series), the number of innings in the game, perhaps also Tango's magic coefficient. You could work it out by dickering with the data that is input.

June 26, 2004

June 25, 2004

Pythag win predictors - excel tricks

A short catalogue of the tricks I found myself relying on when trying to fit the data from my model to the curves.

1) We know that the curve we are looking for is odd, but Excel doesn't. So take the choice away: instead of W = f(Y), fit W/Y = g(Y^2).

2) The OFFSET worksheet function allows the creation of ranges using cells to determine the extents of the range. The general drill was to sort the data, define ranges that shared a value for X, apply an estimator to get Y coefficients, then apply another estimator to those coefficients to measure how they depend on X.

3) Y = aX^b => ln(Y) = ln(a) + b * ln(X). To fit a power law, use a linear fit on the logs, then unwind.

4) The LINEST worksheet function can also handle polynomials. LINEST(Y1:Y100,X1:X100^{1,2,3}) gives a range of 4 values - but not in the order I expected. The INDEX worksheet function extracts each coefficient from the fit. Yeah, I had expressions like INDEX(LINEST(OFFSET,OFFSET^{1,2,3})) - bleah.

5) When bouncing around a worksheet with 19,000 rows, use the name box to specify a range before invoking the graph tool. Otherwise there's a lot of plotting to be done.

6) If you suspect a power law, make sure you take the ABS of the range first, so that the errors near zero don't invalidate the estimate.
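Tricks 3 and 6 translate directly outside of Excel. The sketch below is plain Python rather than worksheet functions, and the function name is my own: it fits Y = aX^b by ordinary least squares on the logs, taking absolute values first. Note that the ABS step throws away the sign of a, so it has to be restored by inspection.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of ln|y| = ln(a) + b ln(x); returns (|a|, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(abs(y)) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)  # unwind the log
    return a, b

if __name__ == "__main__":
    # Synthetic check: data generated from a known power law.
    xs = [0.5 * i for i in range(1, 41)]
    ys = [-0.0603 * x ** -1.8241 for x in xs]
    print(fit_power_law(xs, ys))
```

Trick 1 works the same way: divide W by Y, then regress on Y^2 instead of Y, so that the fit cannot introduce even powers.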

June 25, 2004

Pythag win predictors - a population model

For a more robust predictor, we need a wider variety of data. This is achieved by performing analytic calculations from a model, then looking to fit the curves.

For my baseline, I began from true scoring rates, and used the Tango distribution to predict the number of runs scored per inning. This is then applied to a combinatoric model first calculated by Ben Vollmayr-Lee to determine the frequencies of runs per game. Treating runs scored and runs allowed as independent, this produces the probabilities of wins and losses.

Note that I'm using rates, not runs. If you look at runs, you have to worry about expected runs scored in extra innings, the fact that the home team doesn't bat when leading after eight and a half innings, and the fact that the home team stops batting in the last inning when they outscore the opponent.

None of these corrections change winning percentage - but I don't pretend to guess how much of an effect this has on runs. If you make the model more sophisticated, correcting for the final inning courtesies and assuming the home team is limited to outscoring the visitors by one run in the final inning, I suspect you'll have most of the correction you need for a raw runs to wins converter.

I'm gambling that you can just use runs per game as a reasonable predictor of run rate, and work that direction.

There is one correction I do make for extra innings. If we assume that the innings continue until the tie is resolved, you don't have to fret over the recursion. It's fairly trivial to demonstrate that, with probabilities p(W), p(L), p(T), winning percentage in extra innings is p(W)/(p(W)+p(L)).

So in the model, I calculate W = p9(W) + p9(T) * ( p1(W)/(p1(W)+p1(L))). By visual inspection, it doesn't change the answer much.
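The structure of that calculation can be sketched in Python as below. To be clear about assumptions: the per-inning distribution here is a truncated geometric placeholder of my own, not the Tango distribution, and the nine-inning totals are built by straightforward convolution rather than the combinatoric model mentioned above. Only the final step, W = p9(W) + p9(T) * p1(W) / ( p1(W) + p1(L) ), is taken from the text.

```python
def convolve(p, q):
    """Distribution of the sum of two independent run counts."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out

def game_distribution(inning, innings=9):
    """Runs over a fixed number of independent, identical innings."""
    dist = [1.0]
    for _ in range(innings):
        dist = convolve(dist, inning)
    return dist

def outcome_probs(scored, allowed):
    """(p_win, p_tie, p_loss), treating the two run totals as independent."""
    win = tie = loss = 0.0
    for s, p_s in enumerate(scored):
        for a, p_a in enumerate(allowed):
            if s > a:
                win += p_s * p_a
            elif s == a:
                tie += p_s * p_a
            else:
                loss += p_s * p_a
    return win, tie, loss

def win_probability(inning_scored, inning_allowed):
    """W = p9(W) + p9(T) * p1(W) / (p1(W) + p1(L))."""
    w9, t9, _ = outcome_probs(game_distribution(inning_scored),
                              game_distribution(inning_allowed))
    w1, _, l1 = outcome_probs(inning_scored, inning_allowed)
    return w9 + t9 * w1 / (w1 + l1)

def placeholder_inning(mean_runs, max_runs=15):
    """Truncated geometric stand-in for a runs-per-inning distribution."""
    q = mean_runs / (1.0 + mean_runs)
    probs = [(1.0 - q) * q ** k for k in range(max_runs + 1)]
    total = sum(probs)
    return [p / total for p in probs]

if __name__ == "__main__":
    better = placeholder_inning(5.5 / 9)
    worse = placeholder_inning(4.5 / 9)
    print(win_probability(better, worse))
```

The extra-inning term follows from W_extra = p1(W) + p1(T) * W_extra, which rearranges to p1(W) / ( p1(W) + p1(L) ) without any explicit recursion.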

June 25, 2004

Simple pythag predictors

Getting the coefficients of Y for many powers takes work, but for a rough approximation relatively little data is needed.

F is odd in Y - therefore in the neighborhood of Y = 0, F(Y) = dF/dy(0) * Y to first order. Furthermore, we know that F(X) = .5. So we can create a third order expression of F using these two known datapoints.

F(Y) = dF/dy(0) * Y + B * Y^3
F(X) = dF/dy(0) * X + B * X ^ 3
.5 =  dF/dy(0) * X + B * X ^ 3
B = ( .5 - dF/dy(0) * X ) * X ^ -3
F(Y) = dF/dy(0) * Y + ( .5 - dF/dy(0) * X ) * (Y/X)^3

So how do we proceed from here? Well, (RS+e, RS-e) when rotated yields X = RS * 2^.5, Y = e * 2^.5. So, in this neighborhood,

dF/dy
= [ W(RS+e,RS-e) - W(RS-e,RS+e) ] / ( 2 * 2^.5 * e )
= W(RS+e,RS-e) / ( 2^.5 * e )

So to predict W(RS,RA), find Z = (RS+RA)/2 (that is, Z = 2^-0.5 X, so that the reference team shares the same X), find a close match for W(Z+e,Z-e), calculate dF/dy, then find the third order coefficient B. Plug and chug.
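The plug and chug step might look like this in Python. The function names are my own, and the reference record in the example is a made-up number used purely to exercise the arithmetic, not a measured value.

```python
def slope_from_reference(w_ref, e):
    """dF/dy at Y = 0, given a known W(Z + e, Z - e) = w_ref."""
    return w_ref / (2 ** 0.5 * e)

def cubic_predictor(rs, ra, slope):
    """F(Y) = slope * Y + (.5 - slope * X) * (Y / X)^3 in rotated units."""
    scale = 2 ** -0.5
    x = (rs + ra) * scale
    y = (rs - ra) * scale
    return slope * y + (0.5 - slope * x) * (y / x) ** 3

if __name__ == "__main__":
    # Hypothetical reference: a (5.1, 4.9) team that finished at .5185.
    slope = slope_from_reference(0.0185, 0.1)
    # Predict W = Wpct - .500 for a (5.5, 4.5) team, which has the same X.
    print(cubic_predictor(5.5, 4.5, slope))
```

By construction the result is odd in Y, passes through 0 when RS = RA, and reaches .5 for a shutout team, so only the slope has to come from data.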

June 25, 2004

Pythag win predictors, properties of the solution

By first applying some thought to the boundary terms, we can greatly narrow down the space of functions we need to consider.

A team that allows runs as frequently as it scores them should be expected to finish right around .500. A team that scores RS and allows RA should have the opposite record to that of a team which scores RA and allows RS. A team that shuts out its opponents should win every game.

We can simplify the algebra by rewriting these ideas referenced to the average. Rather than looking for winning percentage, instead define W = Wpct - .500. Thus we can write

W = f(RS,RA)

where f is the elusive predictor, and its properties taken from above are

f(RS,RA) = - f(RA,RS)
f(RS,RS) = 0
f(RS,0) = .5

The second equation follows from the first, as a special case, but offers a big hint on how to proceed. We need to rotate from RS, RA to "natural units", where one dimension runs along that axis where W = 0. A simple rotation (technically a rotation and a reflection, to keep the signs consistent with the natural language) by 45 degrees should do it. The factor 2^-0.5 comes from the trigonometry of that rotation.

X = (RS+RA)*2^-0.5
Y = (RS-RA)*2^-0.5

So X is simply an expression for the run scoring environment, with a scaling factor. Y is the run differential with a similar factor. We can now think about fixing X, and studying how W varies with Y.

Observe that X is unchanged when RS and RA are reversed, but Y changes sign. We also need W to change sign with Y [ W(X,a) = - W(X,-a) ], and can deduce that W is an odd function of Y.

In other words, when we consider a Taylor expansion of W, we need only consider odd powers of Y - the even coefficients are known to be zero.

June 25, 2004

Pythag win predictors, the answer

The general form of the increasingly misnamed Pythagorean formula in baseball.

The definitions

W = Wins/Games - 1/2
S = Runs scored per 9 innings
A = Runs allowed per 9 innings

X = (S+A) * 2^-0.5
Y = (S-A) * 2^-0.5

The predictor

W = Y * Sum[k=0:n] ( a[k] X ^ b[k] Y^(2k) )

June 25, 2004