The mind of Ike

Opinions nobody asked for

April 28th, 2008

So I've decided that this isn't really the best place to be putting all of my baseball-related posts. This space was really more of an exercise for myself, intended for a very small audience. Since I'd like to share my baseball findings with a larger audience, I feel it's appropriate that it get a separate space all its own.

So, my new baseball-only blog can be found at:

From now on, I'll only post non-baseball type stuff here, although I may link to something I've put there in this space.

April 1st, 2008

2008 Data

So now that the 2008 MLB season is officially underway, the pitchf/x system has been turned on in all parks. Although I imagine the Sportvision folks have spent a lot of time making the system better in '08 than it was in '07, I expect that there will still be systematic park-to-park variations in the data. We can get an idea of what these variations will be like by looking at the drag coefficient versus average velocity for each park. In my previous post, I outlined how I obtain the drag coefficient, although the method has now changed slightly. It turns out that I get a much more 'accurate' measurement if I remove the part of the Magnus force that acts in the y direction. Alan Nathan and I wrote up a little paper detailing how this is done, and in it, Alan showed that this approximation for the drag coefficient is within 1% (although almost always low) of the values obtained through fitting points on the measured trajectory to the known equations of motion for drag+Magnus+gravity. You can find that write-up here.

So, rather than plot each park separately, I've decided that since we only have a few games at each park so far, I'll just plot all parks that have data on top of each other in one massive plot. In this plot, the vertical axis represents the drag coefficient, and the horizontal axis represents the time-averaged velocity of each pitch. Each point corresponds to one pitch. For each game, I also calculate the air density, so that what I am really drawing is the actual drag coefficient multiplied by standard air density and divided by the actual game-time air density. If there is no variation between parks, we should expect to see only the points of the last few parks drawn, since they would land directly on top of everything drawn before them.
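
The density rescaling described above can be sketched in a few lines. This is only an illustration, not the code behind the plot: the function names are made up, the air density uses the dry-air ideal gas law (humidity is ignored), and the pressure passed in is assumed to be the actual station pressure at the park rather than a sea-level-corrected value.

```python
# Hypothetical helper names; a sketch of the density rescaling only.
R_DRY_AIR = 287.05            # specific gas constant for dry air, J/(kg K)
STANDARD_AIR_DENSITY = 1.225  # kg/m^3 at 15 C and 101325 Pa

def dry_air_density(pressure_pa, temp_c):
    """Ideal-gas density of dry air (assumption: humidity is ignored)."""
    return pressure_pa / (R_DRY_AIR * (temp_c + 273.15))

def normalize_cd(cd, game_density, standard_density=STANDARD_AIR_DENSITY):
    """Drag coefficient times standard density, divided by game-time density."""
    return cd * standard_density / game_density

if __name__ == "__main__":
    # Made-up conditions for a warm game at a high-altitude park.
    rho = dry_air_density(pressure_pa=84000.0, temp_c=25.0)
    print(round(rho, 3), round(normalize_cd(0.40, rho), 3))
```

At a sea-level park on a standard day the rescaling does nothing, so any park-to-park differences left on the plot after this step are not explained by air density alone.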

So it's pretty obvious that there still exist some variations between parks. Some appear to be on the low side and some on the high side. Notably, Citizens Bank and AT&T appear fairly low, and PETCO and the Metrodome are somewhat high. We don't really know what the actual values should be, so it's hard to say which vary from 'true' values most. I think we can pretty safely say, though, that there may be some big problems with Dolphin Stadium. That may require some further investigation. Last year, I saw that the strike zone at Dolphin Stadium appeared to be heavily shifted toward the right-handed batter's box, and I wonder if that's still the case. Interestingly, looking at some of my other Cd plots from last year, it appears that Dolphin Stadium had some very low values of Cd, and now they are very high. So while there appears to have been a re-calibration there, it looks like, at least for accelerations, Dolphin Stadium was overcorrected. The values it's showing here are extremely high.

Anyway, I don't have much more at the moment, but it looks like a correction algorithm for 2008 is going to be needed. I'm presently working on one such method that corrects only accelerations based on measurements of Cd, and gets the 'correct' values for release point and velocity for free by doing so (or at least, it should!). It works really well at some parks, and not so well at others. I hope to post more about it soon. If I can wind up making it work, it should have the nice feature that we can make 2008 data compatible with 2007 data. Looking at a pitcher's evolution from year to year is certainly an interesting thing to do, but without some reliable corrections, I would seriously mistrust any conclusions one might reach in doing so.

February 27th, 2008

I've been playing around with corrections to the pitchF/X data recently. I was kind of saving these results to present at a proposed pitchf/x workshop that was originally slated to be in a couple of weeks, but plans for that fell through. The workshop still may happen, but it will be later in the baseball season if it does. So I figure I'll go ahead and dump what I have so far here, and hopefully when/if the workshop happens, I'll have something more to present.

Anyway, as the title proclaims, this post is all about acceleration, its measurement by the pitchF/X system, and park-by-park corrections to the values of acceleration.

So, if we want to check whether or not any corrections we apply are valid, the absolute best thing to do is to try to measure some physical quantity that we know. With acceleration, primarily with the z component of acceleration, it would be super nice if we could use the pitchf/x data to measure the acceleration of a baseball due to gravity, g. Unfortunately, there just isn't enough information in the data to make a measurement of g. The baseball experiences acceleration due to at least 3 separate phenomena: drag, Magnus force, and gravity. Without prior knowledge of other parameters, it is impossible to separate g out from the other components. In all prior analyses of pitchf/x data I can find on the web, g is simply treated as a constant that is the same everywhere. It's possible that this may not be such a good idea...

October 3rd, 2007

Finding the zone.

So, I wanted to post some pretty plots with this page, and in order to do so, I had to either upgrade to a paid account or to an account that allows ads to be displayed. So if you are annoyed with the advertisements, I apologize. I also had to change the layout to accommodate the size of these plots. I don't really like it right now, but I'll play with some more layout options till I find one I can live with.

So in a similar vein to what Josh has done, corrections-wise, with the initial positions and velocities, I wanted to test the assumption that the best data in the PITCHf/x data is at home plate. The reason for doing so is that if you are going to correct the initial parameters of the pitch, you can either choose to let the final position stay constant, or perform a transformation that changes the final position. Which should you choose? Could the location of the strike zone vary from park to park? In earlier musings, I made the assumption that this should not be the case, due in large part to the existence of an easy calibration point right there near the ball...home plate. But is that really a good assumption to make?

So, here I take a tack similar to the method employed by John Walsh in his two articles for The Hardball Times, Strike Zone: Fact vs. Fiction, and The Eye of the Umpire. Although, to make my pretty plots, instead of calculating a ball fraction, I calculate the probability that a pitch that PITCHf/x at a particular ballpark measures to be in position x,z is called a strike (this works out to be roughly 1 minus the ball fraction that Walsh comes up with). And I calculate this in 2 dimensions, rather than just 1. The original goal was to compare the shape of the strike zone at various ballparks to try to work out what manner of corrections should be applied (and also maybe to ask the question of "is the strike zone really rectangular?"). Unfortunately, this would really only be useful at a handful of parks. Namely, those that had the PITCHf/x system turned on all year. The other parks are somewhat statistics-limited, and shape information for those zones would be essentially useless. But there are still some interesting effects that we can observe with these statistics-limited parks.

Now that I have an entire season's (including the 1 game playoff from Monday) worth of data taking up about 40MB on my disk, I can look at this with the maximum number of pitches that we are going to see until next year.
So to begin with, I take every pitch recorded by PITCHf/x which wound up being either a called strike or a ball. I really don't want to bother with pitches that were swung at, because that will just futz with the distributions. For each pitch, I fill two of three 2D histograms which use 2 inch by 2 inch bins (consistent with the resolution of the umpire+PITCHf/x system found by Walsh). One histogram is for all pitches, one for called strikes only, and one for called balls only. (I'm not going to show any of the results from the called balls histograms, but they were there in my code anyway.) Then for each bin of the called strikes histogram, I divide the bin contents by the contents of the corresponding bin of the "all pitches" histogram, so that I am left with a histogram where all bins run from 0 to 1, which effectively shows the probability of a pitch that PITCHf/x believes wound up at position x,z being called a strike. I perform this procedure for every park that has any PITCHf/x data. I should also note that since Walsh found a difference in the strike zone called for right handed hitters and left handed hitters, I removed all pitches thrown to lefties. This method also assumes that from day to day, the PITCHf/x system performs roughly the same at any given park. Josh Kalk and Mike Fast have shown that this may not be true, but from what they have shown, the difference in the PITCHf/x system's performance in any ballpark between any two random games is likely to be small, so I just ignore it for now.
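
For anyone who wants to try the same thing without ROOT, the histogram division above can be sketched with NumPy. This is an illustration under assumptions, not my actual code: the function name, the argument names, and the plotting ranges are made up, positions are taken to be in inches, and the input is assumed to be already restricted to taken pitches (called strikes and balls) thrown to right handed hitters at a single park.

```python
import numpy as np

def called_strike_probability(px, pz, is_strike, bin_in=2.0,
                              x_range=(-24.0, 24.0), z_range=(0.0, 60.0)):
    """Per-bin fraction of taken pitches that were called strikes.

    px, pz    : plate-crossing positions in inches (assumed units)
    is_strike : True for a called strike, False for a called ball
    Returns (prob, x_edges, z_edges); bins with no pitches are NaN.
    """
    px = np.asarray(px, dtype=float)
    pz = np.asarray(pz, dtype=float)
    is_strike = np.asarray(is_strike, dtype=bool)

    # Bin edges spaced bin_in inches apart across each range.
    nx = int(round((x_range[1] - x_range[0]) / bin_in))
    nz = int(round((z_range[1] - z_range[0]) / bin_in))
    x_edges = np.linspace(x_range[0], x_range[1], nx + 1)
    z_edges = np.linspace(z_range[0], z_range[1], nz + 1)

    # "All pitches" histogram and "called strikes" histogram.
    all_counts, _, _ = np.histogram2d(px, pz, bins=[x_edges, z_edges])
    strike_counts, _, _ = np.histogram2d(px[is_strike], pz[is_strike],
                                         bins=[x_edges, z_edges])

    # Bin-by-bin ratio; empty bins are left as NaN instead of dividing by zero.
    prob = np.full(all_counts.shape, np.nan)
    np.divide(strike_counts, all_counts, out=prob, where=all_counts > 0)
    return prob, x_edges, z_edges

if __name__ == "__main__":
    prob, xe, ze = called_strike_probability(
        px=[0.0, 0.0, 0.0, 0.0, 20.0],
        pz=[30.0, 30.0, 30.0, 30.0, 30.0],
        is_strike=[True, True, True, False, False])
    print(prob[12, 15], prob[22, 15])
```

Leaving empty bins as NaN rather than 0 keeps "no pitches here" distinct from "pitches here were never called strikes", which matters a lot at the statistics-limited parks.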

One nice thing that MLB has done that makes this method useful is that their umpires rotate quite often. If you take a random stadium and look at who the home plate umpire is for each game, it is actually quite rare to see the same name twice. It happens, but not often. So especially for parks with a large statistical sample, this effectively removes any bias that may creep in due to specific umpires calling slightly different zones, and we get to see the average umpire's zone.

So, on to some pretty pictures.  In these plots, I have chosen to display the called strike probability per bin in the form of a contour plot.  This means that the bin by bin information is somewhat lost, but it is easier on the eyes, and makes it fairly easy to pick out where the borders of the zone should be (somewhat).

So, first, let's look at a park that has had PITCHf/x turned on for most of the season. Josh has a nice, albeit slightly dated, plot of the number of pitches recorded at each park here. I'll take "sdn", or the San Diego Padres, who play at PETCO Park.

In this plot, I display the called strike probability for pitches as a function of PETCO x,z. Unfortunately, the color scale gets cut off by the info box right now, but you can imagine where it goes. The scale gets redder as it gets close to one. This isn't too bad. The red parts in the X direction go roughly from -10 inches to 10 inches...a 20 inch wide zone...and it looks like for about 4 to 5 inches on either side of the plate, between 40 and 70 percent of pitches are being called strikes. Not too bad. I should note that for the time being, I am not going to pay attention to the z component of the strike zone, as this is heavily dependent on the batter at the plate, so for now, I'm only looking at where the zone is in the X direction. How about another stadium with a lot of data? Anaheim looks good, so let's look at Angel Stadium.

That's interesting. These two look almost identical; however, at Angel Stadium, it looks like the zone is shifted by about 2 inches to the left. I don't know that I'd believe that pitchers get more inside strikes called at Anaheim than they would at San Diego, by roughly the same crew of umpires. So I would hazard a guess that Anaheim, at least compared just to San Diego, is measuring pitches about 2 inches to the left of where they should be. If we compare this shift to the shift Josh found and posted for the X0 position of pitch release points, we find that there is roughly a 2 inch difference between the release points of these two parks, although his correction goes in the opposite direction from what this would indicate (if umpires are closely calling the rulebook strike zone).

So, for now, that's really all I have. There does appear to be a difference in the X position of the strike zone from one ballpark to another. I'll probably see if I can find a way to correct for that as well as correcting for the initial positions, and post that. Meanwhile, I'm also searching for ideas for teasing out Z corrections to the strike zone...this will be much more difficult, but if the X position is off, presumably so is the Z position, at least in some places.

Anyway, if you want to look at the rest of the plots I have made (one for each stadium with PITCHf/x data), you can peruse them yourself here.

One thing to note: Some of these plots are very limited by statistics. Unfortunately, the number of entries that is reported in the info box is completely meaningless for determining which parks are statistics-limited and by how much, due to the normalization of the plots. For that, I refer you to Josh Kalk's plot referenced above.

September 24th, 2007

more calibration thoughts

So after finding out more about how PITCHf/x works, it looks like I was way off base with my previous post.  Turns out that the PITCHf/x system is entirely different from the one I was describing. No biggie.  It's not the first time I've been wrong, and I can guarantee that it won't be the last time I'm wrong. However, for other reasons, I still think that the data closer to home is more accurate than the data near the release point...and I'll explain my new reasoning below.

So first, it appears that PITCHf/x uses a least squares method of fitting rather than a Kalman filter, and that the shutter speed and pixel density are such that blurring isn't really much of an issue. But according to comments from Dr. Nathan, the PITCHf/x people have told him that the primary sources of error come from camera movement, which they attempt to correct on the fly (presumably by observing the position of the 1st and 3rd base lines and home plate, which remain fixed), and "operator error". That second one worries me a bit. I wonder how much influence an operator has over the output of pitchfx? But for the time being, let's just consider what happens when the camera wiggles a bit. Now, correcting on the fly is certainly possible, and in my opinion, easy to do for data points around home plate, because you've got a lot of "fixed" targets there to locate. Home plate, the 2 lines, and the batter's box. But home plate is certainly the easy choice there because it's somewhat big, and has a rather unique shape and orientation.

But we run into problems if we look for another point near the release point to calibrate on the fly with. The closest would be the pitcher's rubber. But I could easily imagine that this is difficult to pick out in an automated way, since the pitcher's foot is generally still in the vicinity at release, and depending on where the camera is, the pitcher's body may be obscuring the view of the rubber. This means that the calibration has to happen on objects that are more than 40 feet away from the point we want to observe. I think that propagating this on the fly calibration out to the pitcher's mound can be troublesome to do, and could very well be the source of the wide spreads in release point that have been seen in the data. Most likely, the more these cameras are subject to vibration from wind, seismic effects, or whatever else, the more they are off. And that's just for one detector.

Now, I think that this "on the fly" camera calibration may be the cause of a lot of the spread that is seen in horizontal and vertical release points, but I can't find any way that this could explain systematic differences seen in release points (and velocities too!) from park A to park B. Josh is probably on the right track by correcting for these by iterating over pitchers that have thrown at various parks and computing "park factors" for the release point. But then, there is another question that goes with that. If you move the release point by some distance, you can't stop there. You have to correct the velocity and/or accelerations in order to ensure that the pitch trajectory puts the ball in the same place...In other words, we can't break the laws of physics...although the trajectories they calculate aren't exactly the correct trajectories, I now think that they are close enough to do the job that needs to be done....Although as Josh pointed out, some correction for air density may need to be inserted at some point.

But the fact that these systematic effects exist still bugs me. Most people just write it off to miscalibration and start correcting by moving things around to where they think they "should" be. I'm wondering if there is a way to determine exactly how park X is miscalibrated and correct from there. And I may be on to something now, but if I am, I have to assume that the positions and velocities near home are inherently more accurate. I already said above that I think they are, but obviously, I can't be 100% sure of that with the handwaving argument I just gave. So, for the rest of this post, I am going to make this very assumption, and later tonight I plan to see whether or not the effect I think I will see is there. I'm also going to assume for the time being that each park utilizes one 3 dimensional camera that automagically captures the x,y,z position of an object at a given point in time, because doing so makes it easier to articulate what I think might be happening.

So the way I see it, there are 2 main forms of miscalibration that can happen on a systematic level. The first is the simplest to imagine, which I will just call "offset" calibration. With this form of miscalibration, we would think that point x is really at point x+∂x. However, I also think that this is the easiest calibration to correct for, and the "on the fly" calibration that they can do probably does a good job of eliminating offset. The second primary form of miscalibration would be scale calibration, where some distance x-x0 is instead measured to be s*(x-x0), with s a scale factor close to (but not exactly) 1. This could occur as a result of several mismeasurements, and could very well account for park to park differences. Not knowing exactly where the camera is pointed (camera angle), or under/over estimating the zoom on the camera, could easily introduce a scale calibration problem. Furthermore, in the case of scale miscalibration, a 1% miscalibration can have significant consequences. If we are 1% off in y (which has its axis running from the tip of home plate to the center of the pitcher's rubber), we might think that the 60.5 feet that should be there is really 59.9 feet or 61.1 feet. And over the roughly half a second that the ball is in flight, this can mean that we add or subtract close to a whole mile per hour to a pitcher's fastball.
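
To put numbers on the 1% example, here is the arithmetic spelled out. It is just a back-of-the-envelope sketch with a made-up function name, using the 60.5 feet and half-second flight time quoted above.

```python
FT_PER_S_TO_MPH = 3600.0 / 5280.0  # 1 ft/s expressed in mph

def scale_error_effect(distance_ft=60.5, flight_time_s=0.5, scale_error=0.01):
    """Distance error and average-speed error from a fractional scale error in y."""
    distance_error_ft = distance_ft * scale_error
    speed_error_mph = distance_error_ft / flight_time_s * FT_PER_S_TO_MPH
    return distance_error_ft, speed_error_mph

if __name__ == "__main__":
    d_err, v_err = scale_error_effect()
    print(d_err, v_err)  # about 0.6 ft and about 0.8 mph
```

So a 1% scale error in y is worth roughly 0.6 feet of distance and a bit under 1 mph of average speed, which is the same size as the park to park velocity differences people have been trying to correct.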

Realistically, I would expect scale miscalibrations to be on the order of 1%, or maybe less.  Certainly in y, having 2 cameras that can make measurements in that direction can help pin it down, but I can still imagine ways it might creep in there.  In x and z, I can imagine it being much easier to have scale calibration errors. 

So, what to do? Well, first I'd like to test the idea that there may be scale calibration errors in y. To do this, I'll take one pitcher pitching at two different parks. Preferably a starter who has gone 7 innings at both parks to get a good sample of data. I don't really care who. For each of his pitches, I want to propagate the uncorrected calculated trajectory forward and backward from the standard y=55 "release point" to find the y position of his actual "release point". I'll define that as the point in y where the x and z spread for all of his pitches is minimized. If the pitcher actually has a consistent release point for every pitch, curveballs, changeups, and sliders actually help us here. If not, maybe two or three release points are needed. Anyway, comparing these "vertices" in y should give us a good idea of the scale difference between parks, if there is any. What I expect to find if I am right is a difference in the y vertex of between 2 and 6 inches between any two parks. Perhaps more if one park or both parks are badly out of line with each other.
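
Here is a rough sketch of the vertex search described above, assuming the constant-acceleration trajectories that come with the data. Everything in it is illustrative: the function names, the column order of the `pitches` array, and the y grid are my own choices for this example, units are feet and seconds, and it assumes vy0 is negative (the ball travels toward the plate).

```python
import numpy as np

def positions_at_y(pitches, y_target, y0=55.0):
    """Propagate each constant-acceleration trajectory to the plane y = y_target.

    pitches: array of shape (n, 8) with columns
             x0, z0, vx0, vy0, vz0, ax, ay, az, all given at y = y0.
    """
    x0, z0, vx0, vy0, vz0, ax, ay, az = np.asarray(pitches, dtype=float).T
    # Solve y0 + vy0*t + 0.5*ay*t^2 = y_target for the root nearest t = 0,
    # written in a form that still works when ay is zero.
    c = y0 - y_target
    t = -2.0 * c / (vy0 - np.sqrt(vy0**2 - 2.0 * ay * c))
    x = x0 + vx0 * t + 0.5 * ax * t**2
    z = z0 + vz0 * t + 0.5 * az * t**2
    return x, z

def find_release_y(pitches, y_grid, y0=55.0):
    """Return the grid value of y where the combined x,z spread is smallest."""
    y_grid = np.asarray(y_grid, dtype=float)
    spreads = np.empty(len(y_grid))
    for i, y in enumerate(y_grid):
        x, z = positions_at_y(pitches, y, y0)
        spreads[i] = np.sqrt(np.var(x) + np.var(z))
    return float(y_grid[np.argmin(spreads)]), spreads

if __name__ == "__main__":
    # Three made-up pitches that share one point at y = 55 but differ otherwise.
    demo = np.array([
        [-2.0, 6.0, 3.0, -130.0, -4.0, -8.0, 25.0, -15.0],
        [-2.0, 6.0, 5.0, -115.0, -1.0, 4.0, 22.0, -35.0],
        [-2.0, 6.0, 1.0, -120.0, -6.0, -12.0, 28.0, -20.0],
    ])
    best_y, _ = find_release_y(demo, np.linspace(50.0, 60.0, 41))
    print(best_y)
```

Running the same search on one pitcher's starts at two parks and comparing the two returned y values is the comparison of vertices described above. A grid search is crude, but with an expected effect of a few inches, a grid step of an inch or so should be plenty.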

I may also see nothing, in which case, the PITCHf/x people would have done a very good job at calibrating the length scale to their cameras.

Anyway, the first pass at this will be done tonight, and hopefully I'll have some pretty plots to throw up tomorrow.

September 17th, 2007

more PITCHfx thoughts

So I've been thinking a little bit more about some of the problems associated with the PITCHfx data, and reading some more. It seems that a great many people are making a big deal out of the initial positions and initial velocities being off. They might be justified...but I think there is more going on than perhaps they realize.

As Dr. Alan Nathan points out here, the PITCHfx data records the "initial" velocity and position of the baseball and, along with a calculated acceleration that is used as a constant to calculate the pitch trajectory, all of the rest of the numbers are derived.
Several people, including Joe Sheehan, Josh Kalk (here, and in other posts on his blog), and to a lesser extent, Mike Fast over at Fast Balls (although, I do like the way he classifies pitches using Dr. Nathan's spin/rate approximations...that may be useful later) have pointed out that from ballpark to ballpark, the initial velocities and positions are not consistent with one another, and have gone to great lengths to try to correct for this. I'm sure I missed a few, but those are the guys I have been reading a lot of in my ramp up to actually analyzing data. (In case you are wondering when that will happen, I've got a framework ready to store the data...now I just have to write the code to get it in there. And in case you are still wondering, yes, it will be in ROOT, because that's what I'm most familiar with.) Anyway, I think these people are wrong. Sort of. It's not that I think they are wrong in believing that the release points and initial velocities are incorrect. It's that I think they may be going about correcting the data in the wrong way.

At first I thought they might be wrong about worrying about the initial release point height due to variations in mounds. Being a pitcher myself, I have experienced the entire spectrum of mounds throughout my high-school, college, and now weekend-league career. From mounds that are little more than pimples with a rubber spiked into them, to those that in flat states like Illinois and Oklahoma could nearly be called mountains. But seeing the numbers they are showing for this variation, I think that's a second order effect at best. I've also pitched at a number of minor league and MLB spring training facilities, and without a doubt those are by far the most uniform mounds I have ever run across. Sure, there are differences...but I would doubt that for a pitcher with a consistent release point, they would vary by as much as half a foot or more from park to park. Maybe as much as 3 inches or so...but more would probably be a stretch to blame on mound variation.

So maybe there really is something funny going on with PITCHfx.

And I'm sure there is, but I'm not sure that making corrections based on averages to initial position or velocity is the best way to go.

So when in doubt, it's best to know your detectors and reconstruction algorithms as well as you can. And in looking for information in this regard, I came across a nice paper that was linked to on Dr. Nathan's site, which although somewhat dated, is probably a fairly accurate representation of the current system. So, with that in mind, I direct your attention to the blue-colored sidebar on page 5 wherein the trajectory fitting algorithm is described. Hey, look at that, it's a Kalman filter. Most of the people that regularly read this will probably be at least a little familiar with a Kalman filter, as a very similar filter is used in the DØ tracking and vertex reconstruction. OK, so maybe you are also like me, and recognize the name, and know you have read about how they work, but don't really remember it. That's fine. Because really, that blue sidebar contains all you really need to know...almost.

The key phrase in that little blue sidebar is this (although it's a little cryptic...for further explanation, I refer you to the Wikipedia entry for Kalman filters...not because I think it's right, but because it's a link I can easily find):

"Typically P [the error matrix] gradually decreases as the algorithm incorporates more measurements:  Confidence in the state builds up.  Equation 7 shows that if K [a thing related to the error matrix] is large -- which is the case if R [that part of the error matrix due to noise effects] is small, meaning that there is little noise in the measurement [and I would also add, small measurement errors...thats important later], the new measurement z is weighted heavily.  Instead if K is small, the value the current state x predicts has a higher weight"

Ok, so let's parse that a little bit. What's happening is this: The detector makes a measurement. It then makes a prediction based, in this case, on the physics of a body in motion under constant acceleration, of where the next measurement will be made. The next measurement is made, and compared to the prediction. If the prediction has smaller errors than the next measurement, the prediction is weighted more heavily than the measurement. If the measurement has the smaller errors, then the measurement is weighted more heavily than the prediction. From there, the predictive model is adjusted, and a prediction for the next measurement is made, and so on until the measurement process ends for whatever event you are observing. So forgetting about this predictive model for now, the bit to take home is this: At the end of the day (or pitch really), what you wind up getting out of the fitting algorithm is a trajectory that's weighted more heavily toward the most precise measurements taken by the detector, and less heavily toward those that are less precise....This is generally what you want in a measurement...but it causes problems here.
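
If the weighting is easier to see in code, here is the update step of a Kalman filter for a single number (say, one coordinate of the ball). This is the textbook scalar form, not anything taken from PITCHfx; the function name and the example numbers are made up for illustration.

```python
def kalman_update(x_pred, p_pred, z_meas, r_meas):
    """One scalar Kalman update step.

    x_pred : predicted state        p_pred : variance of that prediction
    z_meas : new measurement        r_meas : variance of that measurement
    Returns the blended state, its variance, and the gain K.
    """
    k = p_pred / (p_pred + r_meas)         # gain: near 1 when the measurement is precise
    x_new = x_pred + k * (z_meas - x_pred)  # move the prediction toward the measurement
    p_new = (1.0 - k) * p_pred              # confidence in the state builds up
    return x_new, p_new, k

if __name__ == "__main__":
    # Same prediction and measurement, three different measurement errors.
    for r in (0.01, 1.0, 100.0):
        x_new, p_new, k = kalman_update(x_pred=10.0, p_pred=1.0, z_meas=12.0, r_meas=r)
        print(r, round(k, 3), round(x_new, 3))
```

With a small measurement variance, the result lands almost on the measurement. With a large one, it barely moves off the prediction. That is the whole weighting argument in three lines of arithmetic.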

OK, great, I follow so far. What's this got to do with the price of tea in China? Or more specifically, with initial velocities and positions? Well, here's what's happening. At a rate of about 60Hz, a measurement of the baseball's position is taken with cameras positioned high above and behind both first base and home plate, and mapped to a 3D trajectory. But these measurements have inherent errors in them. As anyone who has ever tried to take action shots with their camera can testify, these errors are highly dependent on the velocity of the thing being photographed. Especially in the direction of motion. We also know that the baseball slows down on its way to the plate. This means that inherent to this system, the most accurate measurements are probably made in the vicinity of home plate, and thus, when determining the trajectory of the baseball, these measurements made in the vicinity of home plate are weighted more heavily than the measurements made near the pitcher's release point. So I think Dr. Nathan is only partially correct in his description of the initial parameters and acceleration used in computing the trajectory. He refers to them as "the most important parameters in the database", and says that "all other parameters are calculated from them". While he is correct that solving the equations of motion using these parameters will give you the final position, I believe that due to the way the trajectory fitting algorithm works, the most accurately measured parameters in the database are actually the final positions and velocities....not the initial ones. Furthermore, I believe that it's quite possible that second order effects are actually the things that conspire to make initial release points and velocities as inaccurate as others have pointed out. While the constant acceleration approximation is probably good to first order, it is certainly not correct. There's more happening there.
Firstly, the magnitude of the drag force on the baseball is highly dependent on the velocity of the baseball. Not to mention dependent on other parameters not as easily measured. Air density (it matters for parks like Coors), perhaps humidity, wind speed, and a host of other environmental considerations have an extra effect on the flight of a baseball. Secondly, loss of velocity is not the only mechanism through which a baseball loses energy on its way to home plate. Its rate of spin also slows down, which affects the Magnus force responsible for pitch break. Although that last one is probably a much smaller effect than the others over the distance from the mound to the plate...so we can probably safely ignore it. But maybe not....I honestly don't have a good estimate in my head of how big this effect is. I'll have to look it up later.

So I think that it's these second order effects, combined with the fact that inherent to the measurement and fitting process, the data points closer to home are weighted more heavily, that are primarily responsible for the wide variations in release point and initial velocity measurements in the data. Let me take a minute here to say that, in my opinion, this is probably a desired benefit for the makers of PITCHfx. Because, to them, the most important thing is putting up those pretty graphics that show you exactly where the pitch went on replays that happen mere seconds after the pitch occurred. So their algorithm is both fast, and most accurate at the point that matters most for television broadcasts. They'll eat whatever inaccuracies they have in initial positions for that, because, let's face it, those pretty graphics make the game a bit more fun to watch. In fact they add almost enough value to the broadcasts to make up for Joe Morgan being in the booth.

But, those inaccuracies at the other end of the ball's flight present a problem for researchers wanting to analyze pitches with this data. So how do you get around that? Well, I haven't thought about that in great detail yet. Unfortunately, I won't be able to for a few days either. But I wanted to get this out there so I know where to start when I do get to thinking about it. Basically, I think what I'll probably start with is using the final positions and velocities and working backwards from there using the second order equations of motion to see if the release points become any more consistent from park to park. In order to do this right though, there are a few other factors that will have to be looked up, which could be time consuming. Primarily, the air density and prevailing winds on any given day strike me as the most important factors to consider.

Perhaps to start, I should begin by looking only at games played in domes, or under retractable roofs with the roof closed, as those would represent the most uniform conditions.  But then again, that also severely cuts into the statistics available.

Feel free to comment.  Please don't be offended if I don't respond to comments for a couple of days though.  But I will get back to them.

(*These thoughts leave out considerations of "bad tracks" which can, and do happen.  I believe it was Joe Sheehan who commented on a pitch during an intentional walk which was tracked as being right down the middle of the plate by pitchfx...That points to a different sort of problem, and at some point, you do want to be able to remove bad tracks from your analysis.  How to identify them though??  With the current data, it might be impossible to identify all of them.  Certainly you can find some though through parameters that just have nonsensical values.)

September 13th, 2007

Thinking about calibration

So in thinking more (and reading from others) about my previous post, it's quickly apparent that calibration of the PITCHf/x data is going to be a problem. Consider that in order to have as large a statistical sample of baseball pitches as possible, you'd want data from every ballgame over a season. Or, for this season anyway, every ballgame that had the PITCHf/x system active. There are 30 MLB teams, which means that over the course of a season, you are dealing with 30* different detectors. IDEALLY, you would hope that if all 30 detectors could simultaneously view the same pitch, they would all give the exact same data. However, others have shown (and I'll add links when I find them again) that this is not likely to be the case. Now, given that, there must be some calibration procedure performed on each detector at some time (ideally, I would do this between every half inning, but that could interrupt the flow of the game...so before the start of every game would be good enough I suppose). But I have no idea when and how often these detectors are calibrated, much less exactly how they are calibrated (presumably, this is published somewhere and I just haven't found it yet), or what the calibration results are. So we are left with the problem of calibrating on data.

Some parks give different release points to the same pitchers (later, after I find the links again, I'll provide links for this)...But how much of that is the detector's fault, and how much of that is due to the fact that no two pitcher's mounds are created equal? If there are differences in mounds, it would be nice if we could pull that out of the data.

Most likely, the 30 different detectors are all running the same reconstruction code, with different calibration parameters. Each detector is a set of two high-speed video cameras, one placed behind home plate and as far up the stadium as possible ("home-high"), and the other somewhere down either the first or third base line, and also as high up as possible ("first-high" or "third-high"). They are supposed to be left in a "static" position at all times; however, wind effects, seismic events, etc., can all conceivably have an effect on the positioning of these cameras. In other words, they can get out of whack, and in some places, at just about any time...even during a game. There's another effect that could play a role, especially when asking questions involving velocity. These high speed cameras measure the position of a baseball at a rate of 60 Hz, determining velocity presumably by measuring the distance traveled from one frame to the next. OK, great. How well should we trust them to all be on the same time though? Everyone I know has had the experience of possessing an alarm clock that ran either faster or slower than it should have. Like those alarm clocks, these cameras are electronic devices, capable of running either fast or slow, for whatever reason you might like. So it could be that at Fenway, the cameras fire at a rate of 61 Hz, while at Wrigley, they fire at a rate of 59 Hz. In that case, Wrigley would consistently yield pitch velocities higher than those measured at Fenway. (For example, a system firing at 59 Hz thinking it was firing at 60 Hz would turn an 85 mph pitch into a roughly 86.4 mph pitch. That's a pretty big difference...about three times the advertised error on a JUGS radar gun if the error on clock frequency is of the order of 1 Hz...thinking about it, 1 Hz might be too large of an error, but without a comparison to an atomic clock, I wouldn't know.)
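
To make the clock arithmetic concrete, here is a minimal sketch. It assumes (as guessed above) that speed is computed as distance between frames divided by a nominal 1/60 s frame interval; the function name `apparent_speed` is just something made up for illustration, not anything from the PITCHf/x software.

```python
def apparent_speed(true_speed_mph, actual_hz, assumed_hz=60.0):
    """Speed reported when frames really arrive at actual_hz
    but the software assumes they arrive at assumed_hz."""
    # The ball covers true_speed / actual_hz between frames; dividing that
    # distance by the assumed frame interval (1 / assumed_hz) gives the
    # speed the system would report.
    return true_speed_mph * assumed_hz / actual_hz


# Hypothetical clock rates from the example in the text.
for park, hz in [("Fenway", 61.0), ("Wrigley", 59.0)]:
    print(park, round(apparent_speed(85.0, hz), 1))
```

The bias scales with the ratio of assumed to actual frame rate, so even a much smaller clock error than 1 Hz would show up as a consistent park-to-park offset rather than as random scatter.
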
Now, we could try to calibrate initial velocities of pitchers in different parks to some park that we will call the standard park. That would work (with enough cross-over games in our sample), but we would have to be careful in doing that. Because doing so with data, one has to make the assumption that a pitcher's velocity does not change over time. But this is wrong. Pitchers lose velocity over the course of a single game, not to mention the fact that some days, your arm feels "live" and others, it feels "dead". Because it would be nice to see what happens to any given pitcher's velocity over the course of a 162-game season, calibrating velocity on data has its own set of pitfalls. Statistically, with enough pitchers in the sample, calibrating on the data may work out. But sample size is important here, and as of now, not every park even has the PITCHf/x system implemented, and others have small samples to work with.

What is needed is a good "standard pitch" by which all detectors are calibrated. But this has its own set of problems as well. Assuming you could set up 30 identical pitching machines by which to calibrate your detectors (which is not completely out of the realm of possibility, but difficult), you need a good knowledge of the physics of a ball in flight. Because at each park, environmental factors play a significant role in pitch trajectory. The high altitude at Coors Field means that pitches decelerate less, and break less, than identical pitches thrown at a park near sea level, like, say, PacBell Park. This is not to mention wind conditions, the effect of humidity, and quite possibly temperature as well.

So overall then, the data are likely only to be consistent with other data taken during the same game, which puts many large-sample statistical analyses out of reach for the current PITCHf/x data (like, say, how velocity changes over the course of a season, and many derivatives of that, and other questions that are heavily dependent on having consistent data from each park). There are still some questions that can be asked and answered from PITCHf/x, but these are limited to game-by-game analyses, and analyses that are less sensitive to detector calibration, like perhaps how often pitcher X throws pitch Y for a strike, or something like that. Anyway, it's disappointing, and I'll try to make corrections when and where I need to, but I have this sinking feeling that for a lot of the questions I'd like to answer, the effect I'd like to measure will quite possibly get lost in the systematic errors.

*PITCHf/x did not come to every park at the same time, and I believe that two parks still do not have it installed. So the data sets for each park are of very different sizes.

September 12th, 2007

A side project perhaps

Yeah, this probably isn't the time to be picking up side projects....but you know what they say about curiosity...

Anyway, I know most people that stumble their way over here probably already know that I am a bit of a baseball nut. Being the type of guy that also likes to look at data, a big smile was brought to my face when I learned that MLB is collecting, and making available to the public, not only full play by play data of all MLB games, but for a lot of games, a new dataset known as PITCHf/x.  This is a very rich dataset of pitches, which records not only velocity, but pitch direction, movement, and location.  These data are collected with sets of dedicated cameras mounted on the stadium.  The entire pitch trajectory of every pitch is calculable (to first order*) using the data in this set.

Obviously, this dataset opens up a lot of possibilities.  And while I have always wanted to undertake a so-called sabermetric analysis, I've always felt that the data available are sorely lacking.  Perhaps not anymore. 

So, the first big question I want to ask of the data:
Does velocity really matter?
At some point, some coach somewhere tells nearly every kid that's ever wanted to pick up a baseball that the three important tools a pitcher can have are, in order of importance: Location, Movement, and Velocity. However, most people in the sport of baseball will acknowledge that while many will claim this to be true, velocity *is* a very good indicator of whether or not a young pitcher is given a chance to play professional baseball. In other words: Throw hard, get drafted. Who cares if you can't hit the broad side of a barn. We've got a pitching coach that can teach you how. That's how scouts generally think. In fact, if you've got a 94 mph fastball with no control in college, you've got a considerably larger probability of even being given a chance than your teammate who only throws 85 but hits spots with movement.

I want to know though if the scouts are justified in that thinking. Does velocity correlate well with effectiveness? Obviously, I need some metric for effectiveness, and I'll probably just use DIPSERA (roughly, Defense Independent Earned Run Average, which is one of those Baseball Prospectus massaged stats that supposedly correlates well for pitchers over multiple seasons) as a start. So I want to know how well a pitcher's average fastball velocity correlates with his DIPSERA. Maybe I'll find a better metric to use eventually, but this is a decent starting point.
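
As a rough sketch of what "correlates well" could mean in practice, the snippet below computes a plain Pearson correlation coefficient between two lists. That choice of statistic is my assumption for illustration; the numbers in the example are made-up placeholders, not real pitcher data, and `pearson_r` is just an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Placeholder values only (NOT real data): one entry per pitcher.
avg_fastball_mph = [88.0, 90.5, 92.0, 94.5]
dips_era = [4.6, 4.1, 4.3, 3.7]
print(pearson_r(avg_fastball_mph, dips_era))
```

A value near -1 would mean harder throwers tend to have lower DIPSERA, a value near 0 would mean velocity tells you little on its own.
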

That's the easy question....although it will require some sort of sorting algorithm to separate out a pitcher's fastballs from his other pitches....more on that in another post, but I already have a few ideas in mind for this. At the same time, I'd like to identify other pitches a pitcher throws as well...

Anyway, no matter what the answer to that question, it will raise further questions: What about Location and Movement? Do they correlate better with effectiveness than velocity?

But that's for a later date, because it will require a few things. It will require a metric for "Location", or better yet, "command"...meaning how often does a given pitcher throw pitches where he wants them. This is problematic because the dataset says nothing about where the pitch was supposed to go...only where it went. It will also require a single metric for "Movement". This is also problematic. Most MLB pitchers have a variety of pitches that move in a variety of ways. Some, like Greg Maddux, wind up making one pitch look like an infinite series of pitches, by constantly varying the speed and spin of the ball in minute ways, so as to give the batter an even more difficult time. But for the most part pitchers have anywhere from 2 to 5 discrete pitches that vary in movement and speed from one another. Does the number of pitches one throws correlate with effectiveness? What about the difference in velocity between pitches? What about the differences in the movement of different pitches?

Anyway there are a multitude of questions that one can ask of this data, and I think I'm going to ask some of these during my free time, just for fun. 

Which means I need a few things:
a) A program to collect the data, and stick it in an easily readable format...or to just keep it in XML if I want to write an XML parser as part of my analysis programs.  (each pitcher for each game is stored in a separate file)
b) A program to pick out each and every pitch that each pitcher throws (minus the games in which PITCHf/x wasn't working)
c) A program to pick fastballs.
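
A minimal sketch of what (a) through (c) might look like, using Python's standard XML parser. Everything specific here is an assumption for illustration: the `<pitch>` element, the attribute names `start_speed`, `px`, and `pz`, the inline sample document, and especially the fixed speed cut in `pick_fastballs`, which is a crude stand-in and not the sorting algorithm I have in mind.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; the element and attribute names are assumptions.
SAMPLE_XML = """
<game>
  <pitch start_speed="91.2" px="0.10" pz="2.50"/>
  <pitch start_speed="78.4" px="-0.60" pz="1.90"/>
  <pitch px="0.30" pz="3.10"/>
</game>
"""


def read_pitches(xml_text, fields=("start_speed", "px", "pz")):
    """Return one dict of floats per <pitch> element that has every field."""
    root = ET.fromstring(xml_text)
    pitches = []
    for node in root.iter("pitch"):
        # Skip pitches with missing tracking data (system not working).
        if any(node.get(name) is None for name in fields):
            continue
        pitches.append({name: float(node.get(name)) for name in fields})
    return pitches


def pick_fastballs(pitches, min_speed_mph=88.0):
    """Placeholder classifier: call anything above a fixed speed a fastball."""
    return [p for p in pitches if p["start_speed"] >= min_speed_mph]


pitches = read_pitches(SAMPLE_XML)
print(len(pitches), "tracked pitches,", len(pick_fastballs(pitches)), "fastball(s)")
```

A fixed cut like this would obviously fail across pitchers (one guy's fastball is another guy's changeup), which is exactly why a per-pitcher sorting scheme is needed.
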

So this could be fun.  Stay tuned.  If you have ideas, or other questions you would like to ask the data, feel free to let me know.  I'll see what I can do.

*To first order in this case means that the data collection methods reconstruct the path of the pitch under the assumption of constant acceleration. This is obviously not the true trajectory (especially for knuckleballs!), as it is well known that the drag coefficient depends on both the velocity and spin of the baseball. Probably also on the orientation of the seams. However, it's not a terrible first approximation.
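
For the curious, here is a small sketch of what a constant-acceleration reconstruction amounts to: each coordinate follows p(t) = p0 + v0*t + (1/2)*a*t^2. The function names and the numbers in the example are placeholders I made up (units of feet and seconds), not values taken from the dataset.

```python
def position(p0, v0, a, t):
    """Position at time t for constant acceleration, one entry per axis."""
    return tuple(p + v * t + 0.5 * acc * t * t for p, v, acc in zip(p0, v0, a))


def velocity(v0, a, t):
    """Velocity at time t for constant acceleration, one entry per axis."""
    return tuple(v + acc * t for v, acc in zip(v0, a))


# Placeholder pitch: (x, y, z) in feet, with y pointing from home toward the mound.
p0 = (0.0, 50.0, 6.0)      # initial position
v0 = (0.0, -130.0, 0.0)    # initial velocity, ft/s
a = (0.0, 0.0, -32.0)      # constant acceleration, ft/s^2 (gravity only here)

print(position(p0, v0, a, 0.25))
print(velocity(v0, a, 0.25))
```

Because the formula is just a polynomial in t, it can be evaluated at any time, including times before the first measured point, which is what makes extrapolated quantities like release point so sensitive to small errors in the fitted parameters.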

February 28th, 2007

Baseball on the brain

You would be right to ask the question "How can you be thinking about baseball when there is snow on the ground?" And to be honest, I don't have an answer for that that is any more satisfying than when my parents would tell me "because God made it that way." But the fact remains that I've got the itch again. The itch to stand in the middle of a diamond and chuck a little ball 60 feet 6 inches into a mitt that is positioned behind a 17 inch wide piece of white-painted vulcanized rubber, which someone will try to hit as hard as they can. Whoever that nameless person is will undoubtedly fail. I can't wait to do it. Luckily, I have the opportunity to do this. 10 years ago there weren't many opportunities for guys like me to get out on a diamond and play. By guys like me, I mean guys who are out of college and not getting paid to play. In fact, I have to pay out of my own pocket to play. $260 for a 22 game schedule.

When you work it out, that's not really all that bad. About 12 bucks a game. Cheaper than a movie when you factor in the popcorn and the coke. Not only that, but it lasts twice as long, and gives me a good reason to stay in relatively decent shape. Oh yeah, and I'm good at it. Obviously not good enough to get paid to do it (or perhaps I should say that I don't throw hard enough to get paid to do it...mid 80's...or that's what I threw when I left college anyway, not 90's), but I still perform fairly well against guys that are looking to play ball while on summer break from college ball. Well enough that I've been able to rack up nearly 600 K's in the last 3 years anyway.

OK, I'll stop tooting my own horn. Maybe.

But really, I love this game. Nearly as much as I love my wife. If she told me that it was baseball or her, I'd need a day or so of mourning after choosing her. Thank goodness I know she won't make me make that choice. At least not until we have kids....and then I'll have coaching to look forward to :)

I don't really know where I was going with this post. But if you are so inclined, have a gander at our new team web page...created by yours truly. It's right here. Our team's outlook for 2007 is great. After getting shut out of the playoffs for the last two years (we missed the cut by one game last year), we have stacked up on more pitching because it's getting harder for me to pitch 9 innings every week, and added more power to our lineup. Not only that, but a few teams left and have been replaced by teams from lower divisions, which I am confident that we can whoop up on. At least, we whooped up on these teams when we were in the lower divisions.

Good God, get me on a ballfield.