
December 31, 2008

Time keeps on ticking ticking ticking

via Bad Astronomy

The official clock will be ticking tonight, counting down to the New Year. Just before midnight it will read 23 hours 59 minutes 59 seconds. But instead of clicking over to 00 hours 00 minutes 00 seconds on January 1, it will first read 23 hours 59 minutes 60 seconds on December 31!

OK, I admit that I missed this edge case.
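A small sketch of how easy the edge case is to miss, using Python's standard library purely as an illustration (the choice of language and the helper name `datetime_accepts_leap_second` are mine, not anything from the linked article). The `datetime` constructor rejects a seconds value of 60, while `time.strptime` will parse one:

```python
import datetime
import time


def datetime_accepts_leap_second() -> bool:
    """Return True if datetime can represent 23:59:60 on 2008-12-31."""
    try:
        datetime.datetime(2008, 12, 31, 23, 59, 60)
    except ValueError:
        # datetime only allows seconds in the range 0..59
        return False
    return True


# time.strptime, by contrast, accepts a seconds field of 60.
parsed = time.strptime("2008-12-31 23:59:60", "%Y-%m-%d %H:%M:%S")

print(datetime_accepts_leap_second())
print(parsed.tm_sec)
```

So a timestamp for the leap second can be parsed by one part of the library and still be unrepresentable in another.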

December 31, 2008 Comments (0) TrackBack (0)

It's a Boy! and a Girl!

Jeff Atwood celebrates the end of the year by posting the "what are the odds that it's a boy and a girl" puzzle on his blog.

After six and a half hours, there are 435 comments, most of which are really awful. I think I want to nominate "Here's a way of reframing the question that makes it a lot clearer" as the dumbest of these - reframing never works; it only ever serves to establish that you don't understand what they are talking about.

Solution after the break - I only promise that it is correct, not that it is easy. You Have Been Warned...

Some Notation

OK, we're going to be leaning hard on notation here, so some quick definitions.

First, we need a couple of operators for combining propositions:
Q,S = the proposition that Q and S are both true
!Q = the proposition that Q is not true.

I'm limiting myself to the space of boolean propositions, with the following axioms:
P(Q) + P(!Q) = 1
P(Q,!Q) = 0
P(Q,S) = P(S,Q)
P(Q) = P(Q,S) + P(Q,!S)


Next, we need a couple of operators to evaluate probabilities.
P(Q) = the probability of some proposition Q.
P(Q|S) = the probability that Q is true, given the additional information that proposition S is true. Often called the "conditional probability".

The latter of these is actually an English description of a mathematical definition. The equality below defines P(Q|S):
P(Q,S) = P(Q|S) * P(S)

We'll want to extend this definition to any number of propositions:
P(Q,R,S) = P(Q,R|S) * P(S)

Bayes

P(Q|S)
= P(Q,S) / P(S)
= P(S,Q) / P(S)
= [ P(S|Q) * P(Q) ] / P(S)
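A quick numeric check of the derivation above, as a sketch only. The joint probabilities in `p_qs` are made up for illustration; any table that sums to 1 would do.

```python
from fractions import Fraction

# Made-up joint probabilities, keyed by (Q is true, S is true).
p_qs = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

p_q = p_qs[(True, True)] + p_qs[(True, False)]  # P(Q) = P(Q,S) + P(Q,!S)
p_s = p_qs[(True, True)] + p_qs[(False, True)]  # P(S) = P(S,Q) + P(S,!Q)

p_q_given_s = p_qs[(True, True)] / p_s          # P(Q|S) = P(Q,S) / P(S)
p_s_given_q = p_qs[(True, True)] / p_q          # P(S|Q) = P(S,Q) / P(Q)

# Bayes: P(Q|S) = [ P(S|Q) * P(Q) ] / P(S)
bayes = p_s_given_q * p_q / p_s

print(p_q_given_s, bayes)
```

Both routes give the same value, which is all the derivation claims.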

In other words, if we have enough information, we can compute the probability of the reversed relationship. Again, we can use it for more complex terms as well:

P(Q,R|S)
= P(Q,R,S) / P(S)
= P(S,Q,R) / P(S)
= [ P(S|Q,R) * P(Q,R) ] / P(S)

We hadn't actually defined what P(S|Q,R) meant yet, but we see that the expected definition just falls out of the mix magically, assuring us that the definitions are mutually consistent.

Repeat of the problem statement

Let's say, hypothetically speaking, you met someone who told you they had two children, and one of them is a girl. What are the odds that person has a boy and a girl?

This is a typical Bayesian problem, with the added bonus of ambiguous language to make it so much more fun to calculate.

Q = "they had two children, and one of them is a girl"
S = "has a boy and a girl"

First, I call your attention to the fact that Q is not actually part of the problem. What we have been given is Q' - that the person stated Q, not that Q is actually true.

What we're looking for here is
P(S|Q')
= P(S,Q') / P(Q')
= P(S,Q') / [ P(S,Q') + P(!S,Q') ]

P(S,Q')
= P(Q',Q,S) + P(Q',!Q,S)
= P(Q'|Q,S) * P(S|Q) * P(Q) + P(Q'|!Q,S) * P(S|!Q) * P(!Q)

P(!S,Q')
= P(Q',Q,!S) + P(Q',!Q,!S)
= P(Q'|Q,!S) * P(!S|Q) * P(Q) + P(Q'|!Q,!S) * P(!S|!Q) * P(!Q)

Calling attention to a few of the terms

The P(Q'|!Q...) terms are basically a measure of the probability that a subject would make the false statement Q' in various circumstances. The common form of this problem assumes that these terms vanish. When these terms do vanish, we have the added benefit of not needing to worry about how likely the condition Q actually is: P(Q) appears in both the numerator and the denominator, so it cancels.
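The cancellation can be checked numerically. This is a sketch under stated assumptions: the function name `p_s_given_statement` and its parameter names are hypothetical, and the input values are illustrative only. It just evaluates the two sums for P(S,Q') and P(!S,Q') written out above.

```python
from fractions import Fraction


def p_s_given_statement(p_q, p_s_given_q, p_s_given_not_q,
                        say_q_s, say_q_not_s,
                        say_notq_s=Fraction(0), say_notq_not_s=Fraction(0)):
    """P(S|Q') built from the two sums in the text.

    say_q_s        = P(Q'|Q,S)
    say_q_not_s    = P(Q'|Q,!S)
    say_notq_s     = P(Q'|!Q,S)   (a false statement)
    say_notq_not_s = P(Q'|!Q,!S)  (a false statement)
    """
    p_not_q = 1 - p_q
    p_s_and_stmt = (say_q_s * p_s_given_q * p_q
                    + say_notq_s * p_s_given_not_q * p_not_q)
    p_not_s_and_stmt = (say_q_not_s * (1 - p_s_given_q) * p_q
                        + say_notq_not_s * (1 - p_s_given_not_q) * p_not_q)
    return p_s_and_stmt / (p_s_and_stmt + p_not_s_and_stmt)


# False-statement terms vanish: the answer does not depend on P(Q).
a = p_s_given_statement(Fraction(3, 4), Fraction(2, 3), Fraction(0), 1, 1)
b = p_s_given_statement(Fraction(1, 10), Fraction(2, 3), Fraction(0), 1, 1)

# Allow an (assumed) 1-in-10 chance of a false statement and P(Q) matters again.
c = p_s_given_statement(Fraction(3, 4), Fraction(2, 3), Fraction(0), 1, 1,
                        say_notq_not_s=Fraction(1, 10))
d = p_s_given_statement(Fraction(1, 10), Fraction(2, 3), Fraction(0), 1, 1,
                        say_notq_not_s=Fraction(1, 10))

print(a, b, c, d)
```

With the false-statement terms at zero, `a` and `b` agree even though P(Q) differs; once those terms are nonzero, `c` and `d` no longer match.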

If we don't choose to discount that term, we have to show some care with the P(S|!Q) terms, because of the ambiguities of the English language - "has a boy and a girl" doesn't assure us that the subject has exactly one of each.

The other reason to include the Q' term is that the probability that the information is volunteered is not the same in the two cases. There's some nonzero probability that the subject would make some other statement instead of the one given. Confirming Q is not the same as volunteering Q. In particular, someone might tell you "they had two children, and one of them is a boy". Messy messy messy; we need the P(Q'|Q,S) term to get this right.
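To see how much the volunteering term can matter, here is one possible model, and it is only an assumption: the parent picks one of the two children at random and reports that child's sex. Under that model a parent with a boy and a girl says "one of them is a girl" only half the time, while a parent of two girls always does. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def p_mixed_given_says_girl():
    """P(one boy and one girl | parent volunteers "one of them is a girl").

    Assumed model: births are fair coin flips, and the parent mentions
    one child chosen uniformly at random.
    """
    says_girl = Fraction(0)
    says_girl_and_mixed = Fraction(0)
    for family in product("BG", repeat=2):      # BB, BG, GB, GG
        p_family = Fraction(1, 4)
        for mentioned in family:                # each child mentioned half the time
            p = p_family * Fraction(1, 2)
            if mentioned == "G":
                says_girl += p
                if set(family) == {"B", "G"}:
                    says_girl_and_mixed += p
    return says_girl_and_mixed / says_girl


print(p_mixed_given_says_girl())
```

Under this particular assumption the answer comes out to 1/2 rather than 2/3, which is why the wording of the problem carries so much weight.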

If the subject has exactly two children, there are three cases to consider - Q,S represents the mixed-gender result. Q,!S is matched girls. Matched boys are hidden somewhere in the !Q,!S terms. Note that these are your a priori odds. You can go nuts here with census odds and correlations and so forth, but treating the births as independent Bernoulli trials (aka coin flips) is a lot simpler - the error you introduce here is considerably smaller than the error you introduce by misunderstanding or misrepresenting the problem.

In most circumstances when this question is raised, you should understand the following - that the a priori probability of a boy and a girl is

P = mixed / ( girls + boys + mixed ) = about 50%.

The additional statement of "one girl" doesn't change any of the values, but it does eliminate a term in the denominator:

P' = mixed / ( girls + mixed ) = about 67%.
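The two figures above can be reproduced by brute-force enumeration. This sketch assumes the coin-flip model described earlier (each child independently a boy or a girl with equal probability) and reads "one girl" as "at least one girl":

```python
from fractions import Fraction
from itertools import product

# All equally likely two-child families: BB, BG, GB, GG.
families = list(product("BG", repeat=2))

mixed = [f for f in families if set(f) == {"B", "G"}]
with_girl = [f for f in families if "G" in f]

# P = mixed / ( girls + boys + mixed )
prior = Fraction(len(mixed), len(families))

# P' = mixed / ( girls + mixed ): the all-boys family drops out of the denominator.
posterior = Fraction(len([f for f in mixed if "G" in f]), len(with_girl))

print(prior, posterior)
```

The mixed count is the same in both fractions; only the denominator shrinks.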

December 31, 2008 Comments (0) TrackBack (0)

December 29, 2008

Hans Rosling

TED: Hans Rosling.

Primarily this is a presentation about progress the world has made in the last 40 years or so, but it's also worth watching for the data visualization.

December 29, 2008 Comments (0) TrackBack (0)

December 23, 2008

Feared

With yet another Hall of Fame induction coming up, I decided to take another look at Rice, trying to understand the context... not so much of the performance itself, but rather what people had to compare it to.

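Let's look at the AL for a ten year period (1969-1978), and specifically at the leader boards for total bases. Each row gives the year, the league, the leader's margin over the runner up, and then the total bases and name of the leader and of the runner up: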

1969 AL   6 340,Frank Howard     334,Reggie Jackson
1970 AL  12 335,Carl Yastrzemski 323,Tony Oliva
1971 AL  14 302,Reggie Smith     288,Reggie Jackson
1972 AL   9 314,Bobby Murcer     305,Dick Allen
1973 AL   0 295,Sal Bando        295,Dave May
1974 AL   6 287,Joe Rudi         281,Ken Henderson
1975 AL  15 318,George Scott     303,Reggie Jackson
1976 AL  15 298,George Brett     283,Chris Chambliss
1977 AL  31 382,Jim Rice         351,Rod Carew
1978 AL 113 406,Jim Rice         293,Eddie Murray

Suddenly, Rice is head and waist above the rest of the pack. The piece that really caught me by surprise here is that Murray's finish in 1978 doesn't really look historically out of line with the other top AL performances during this period. Instead, it's Carew's 351 that's the outlier; he's a steady 270 guy otherwise. The field is Rice, then nobody.
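For anyone who wants to check the margin column, a short sketch that recomputes it from the 1969-1978 rows above. The data is copied from the table; the names `AL_LEADERS` and `margins` are just labels chosen for this example.

```python
# (year, leader, leader total bases, runner up, runner-up total bases)
AL_LEADERS = [
    (1969, "Frank Howard", 340, "Reggie Jackson", 334),
    (1970, "Carl Yastrzemski", 335, "Tony Oliva", 323),
    (1971, "Reggie Smith", 302, "Reggie Jackson", 288),
    (1972, "Bobby Murcer", 314, "Dick Allen", 305),
    (1973, "Sal Bando", 295, "Dave May", 295),
    (1974, "Joe Rudi", 287, "Ken Henderson", 281),
    (1975, "George Scott", 318, "Reggie Jackson", 303),
    (1976, "George Brett", 298, "Chris Chambliss", 283),
    (1977, "Jim Rice", 382, "Rod Carew", 351),
    (1978, "Jim Rice", 406, "Eddie Murray", 293),
]


def margins(rows):
    """Map each year to the leader's margin over the runner up."""
    return {year: lead_tb - runner_tb
            for year, _leader, lead_tb, _runner, runner_tb in rows}


by_year = margins(AL_LEADERS)
biggest_year = max(by_year, key=by_year.get)
print(biggest_year, by_year[biggest_year])
```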

Compare that to the next 10 years.

1979 AL   6 369,Jim Rice         363,George Brett
1980 AL   2 335,Cecil Cooper     333,Ben Oglivie
1981 AL   4 215,Dwight Evans     211,Tony Armas
1982 AL  22 367,Robin Yount      345,Cecil Cooper
1983 AL   1 344,Jim Rice         343,Cal Ripken
1984 AL   4 339,Tony Armas       335,Dwight Evans
1985 AL  48 370,Don Mattingly    322,George Brett
1986 AL  23 388,Don Mattingly    365,Kirby Puckett
1987 AL  25 369,George Bell      344,Mark McGwire
1988 AL  11 358,Kirby Puckett    347,Jose Canseco

Now we've got some people hitting the ball. Rice's monster seasons are the best of this bunch, but the others no longer look like Munchkin Intramural League scores (excusing the strike in 1981). Bonus: check out the symmetry of 1981 and 1984.

So where does that lead fit, historically? Here are the seasons since the modern era began, where the leader has had 100+ bases more than the runner up...

1922 NL 136 450,Rogers Hornsby 314,Irish Meusel
1948 NL 113 429,Stan Musial    316,Johnny Mize
1978 AL 113 406,Jim Rice       293,Eddie Murray

"One of these things, is not like the others." It's not a crime to be a worse slugger than Hornsby (.722), or Musial (.702) - but none of their contemporaries managed to get within 100 points of slugging percentage.

In 1978, there were a handful of hitters within 100 points of Rice (.5997), but all of them (Hisle, DeCinces, Otis, Thornton) managed to miss 20 or so games, taking them right off the leader board, whereas Rice showed up every day. Which counts, of course - that level of excellence is hard to sustain.

Noting the absence of Ruth, I decided I needed a few more years - so here's the next 10 "dominating" seasons, without comment.

1921 AL  92 457,Babe Ruth      365,Harry Heilmann
1945 NL  88 367,Tommy Holmes   279,Goody Rosen
1924 AL  85 391,Babe Ruth      306,Baby Doll Jacobson
1946 NL  83 366,Stan Musial    283,Enos Slaughter
1917 AL  74 335,Ty Cobb        261,Bobby Veach
1937 NL  73 406,Joe Medwick    333,Johnny Mize
1940 NL  70 368,Johnny Mize    298,Frank McCormick
1921 NL  68 378,Rogers Hornsby 310,George Kelly
1932 AL  68 438,Jimmie Foxx    370,Lou Gehrig
1933 NL  66 365,Chuck Klein    299,Wally Berger

December 23, 2008 Comments (1) TrackBack (0)

December 10, 2008

BBWAA

Law and Neyer make it in on the second ballot, clearly demonstrating that they aren't "inner circle" internet baseball writers.

December 10, 2008 Comments (0) TrackBack (0)

Posted pending further review

Abstract art using Hilbert Space Filling Curves.

December 10, 2008 Comments (0) TrackBack (0)

December 9, 2008

Words to live by

Atrios: "If those were my goals I'd use the word 'fuck' a lot less."

December 9, 2008 Comments (0) TrackBack (0)

December 8, 2008

Wanted

That did not feel like 110 minutes worth of plot.

December 8, 2008 Comments (0) TrackBack (0)

December 1, 2008

No application should fail...

... when the alternative of failing AND pissing off the user is available.

One of my arch-nemeses, of which I would appear to have several in the software world, is the Ant Copy task. I don't think I've ever held a job where Ant was in use without tracking down some subtle bug to the copy semantics designed into that task (usually the choice that overwrite defaults to false).

Today was especially fun. Why, I was asked, is such and such build failing in the continuous integration environment, when it "works fine on the developer's machine"? And with some work, we did in fact track the problem to a local area that had been changed - because of the new choices in defining the fileset for the task, two files with the same name were being picked up. And of course, for one of the usual reasons, the seemingly deterministic ordering of the copies on the developer's machine doesn't match the seemingly deterministic ordering of the copies in the CI environment.

That discovery is the point at which today's rant begins...


We had actually completely worked through the problem, when I decided to verify our intuition by turning on the verbose attribute of the copy task. And sure enough, it happily reported that both files were copied to the same location.

Of course, this being Ant, that isn't what happened. Only the first file was copied - the second was skipped because the overwrite attribute was left at its default. And that seemed a bit silly - but since I happen to have the sources hooked up (precisely for this sort of occasion, which is basically weekly right now), I decided to look into the implementation to confirm for myself that the status message is consistently wrong.

And it ISN'T. Yes, there really is a "hey, we are skipping your file" message in the Copy task, but because our copy is based on a file set, we actually travel down a different branch in the code path - one that assumes that even if you have turned on verbose mode in the hopes of learning something useful, you still wouldn't be interested in knowing what is really going on.
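The order dependence is easy to reproduce in a toy model. To be clear, this is not Ant's implementation, only a simplified sketch of the behaviour described in this post: files are copied into one flat directory, and when overwrite is off the first file to claim a name wins. The function `flatten_copy` and the file paths are hypothetical.

```python
def flatten_copy(sources, overwrite=False):
    """Toy model: copy (path, content) pairs into one flat directory.

    Returns (dest, log), where dest maps file name -> content and
    log records what happened to each source file.
    """
    dest = {}
    log = []
    for path, content in sources:
        name = path.rsplit("/", 1)[-1]
        if name in dest and not overwrite:
            # This is the message one would hope to see in verbose mode.
            log.append(f"skipped {path}")
            continue
        dest[name] = content
        log.append(f"copied {path}")
    return dest, log


# Same two files, two different iteration orders (say, developer box vs CI).
files = [("a/config.xml", "A"), ("b/config.xml", "B")]
dev_dest, dev_log = flatten_copy(files)
ci_dest, ci_log = flatten_copy(list(reversed(files)))

print(dev_dest, dev_log)
print(ci_dest, ci_log)
```

The two runs end up with different contents under the same file name, which is the kind of mismatch a "works on my machine" report tends to hide.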

December 1, 2008 Comments (0) TrackBack (0)