The mind of Ike

more PITCHfx thoughts

So I've been thinking a little bit more about some of the problems associated with the PITCHfx data, and reading some more.  It seems that a great many people are making a big deal out of the initial positions and initial velocities being off.  They might be justified...but I think there is more going on than perhaps they realize.

As Dr. Alan Nathan points out here, the PITCHfx data records the "initial" velocity and position of the baseball, along with a calculated acceleration that is treated as constant when calculating the pitch trajectory.  All of the rest of the numbers are derived from those.
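
To make "derived" concrete, here is a minimal sketch of the constant-acceleration model, assuming positions in feet, velocities in ft/s, and accelerations in ft/s^2. The function names and the sample numbers are made up for illustration; they are not PITCHfx field names or real pitch data.

```python
def position(p0, v0, a, t):
    """Position after t seconds under constant acceleration: p0 + v0*t + a*t^2/2."""
    return tuple(p + v * t + 0.5 * acc * t * t for p, v, acc in zip(p0, v0, a))


def velocity(v0, a, t):
    """Velocity after t seconds under constant acceleration: v0 + a*t."""
    return tuple(v + acc * t for v, acc in zip(v0, a))


# Hypothetical "initial" parameters as (x, y, z) triples, with y measured
# from home plate toward the mound and z upward (an assumed convention).
p0 = (0.0, 50.0, 6.0)
v0 = (0.0, -130.0, -5.0)
a = (0.0, 30.0, -20.0)

print(position(p0, v0, a, 0.4))
print(velocity(v0, a, 0.4))
```

Nine numbers fix the whole track in this model, so any error in them, or in the constant-acceleration assumption itself, carries through to everything computed from them.
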
Several people, including Joe Sheehan, Josh Kalk (here, and in other posts on his blog), and to a lesser extent, Mike Fast over at Fast Balls (although I do like the way he classifies pitches using Dr. Nathan's spin/rate approximations...that may be useful later), have pointed out that from ballpark to ballpark, the initial velocities and positions are not consistent with one another, and have gone to great lengths to try to correct for this.  I'm sure I missed a few, but those are the guys I have been reading a lot of in my ramp up to actually analyzing data.  (In case you are wondering when that will happen, I've got a framework ready to store the data...now I just have to write the code to get it in there.  And in case you are still wondering, yes, it will be in ROOT, because that's what I'm most familiar with.)  Anyway, I think these people are wrong.  Sort of.  It's not that I think they are wrong in believing that the release points and initial velocities are incorrect.  It's that I think they may be going about correcting the data in the wrong way.

At first I thought they might be wrong to worry about the initial release point height due to variations in mounds.  Being a pitcher myself, I have experienced the entire spectrum of mounds throughout my high-school, college, and now weekend-league career, from mounds that are little more than pimples with a rubber spiked into them, to those that in flat states like Illinois and Oklahoma could nearly be called mountains.  But seeing the numbers they are showing for this variation, I think that's a second order effect at best.  I've also pitched at a number of minor league and MLB spring training facilities, and without a doubt those are by far the most uniform mounds I have ever run across.  Sure, there are differences...but I doubt that, for a pitcher with a consistent release point, the release height would vary by as much as half a foot or more from park to park.  Maybe as much as 3 inches or so...but more would probably be a stretch to blame on mound variation.

So maybe there really is something funny going on with PITCHfx.

And I'm sure there is, but I'm not sure that making corrections based on averages to initial position or velocity is the best way to go.

So when in doubt, it's best to know your detectors and reconstruction algorithms as well as you can.  And in looking for information in this regard, I came across a nice paper that was linked to on Dr. Nathan's site, which, although somewhat dated, is probably a fairly accurate representation of the current system.  So, with that in mind, I direct your attention to the blue-colored sidebar on page 5, wherein the trajectory fitting algorithm is described.  Hey, look at that, it's a Kalman filter.  Most of the people that regularly read this will probably be at least a little familiar with a Kalman filter, as a very similar filter is used in the DØ tracking and vertex reconstruction.  OK, so maybe you are also like me, and recognize the name, and know you have read about how they work, but don't really remember it.  That's fine.  Because really, that blue sidebar contains all you really need to know...almost.

The key phrase in that little blue sidebar is this (although it's a little cryptic...for further explanation, I refer you to the Wikipedia entry for Kalman filters...not because I think it's right, but because it's a link I can easily find):

"Typically P [the error matrix] gradually decreases as the algorithm incorporates more measurements:  Confidence in the state builds up.  Equation 7 shows that if K [a thing related to the error matrix] is large -- which is the case if R [that part of the error matrix due to noise effects] is small, meaning that there is little noise in the measurement [and I would also add, small measurement errors...thats important later], the new measurement z is weighted heavily.  Instead if K is small, the value the current state x predicts has a higher weight"

OK, so let's parse that a little bit.  What's happening is this:  The detector makes a measurement.  It then makes a prediction, based in this case on the physics of a body in motion under constant acceleration, of where the next measurement will be made.  The next measurement is made, and compared to the prediction.  If the prediction has smaller errors than the next measurement, the prediction is weighted more heavily than the measurement.  If the measurement has the smaller errors, then the measurement is weighted more heavily than the prediction.  From there, the predictive model is adjusted, and a prediction for the next measurement is made, and so on until the measurement process ends for whatever event you are observing.  So forgetting about this predictive model for now, the bit to take home is this:  At the end of the day (or pitch really), what you wind up getting out of the fitting algorithm is a trajectory that's weighted more heavily toward the most precise measurements taken by the detector, and less heavily toward those that are less precise....This is generally what you want in a measurement...but it causes problems here.
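
The weighting described above can be sketched with a one-dimensional Kalman update. This is a toy under stated assumptions (a single scalar state, with made-up names and numbers), not the PITCHfx code, which is not public as far as I know.

```python
def kalman_update(x_pred, p_pred, z, r):
    """Blend a predicted value with a new measurement.

    x_pred, p_pred: predicted state and its variance
    z, r:           measurement and its variance
    """
    k = p_pred / (p_pred + r)          # gain: close to 1 when the measurement is precise
    x_new = x_pred + k * (z - x_pred)  # move toward the measurement by a fraction k
    p_new = (1.0 - k) * p_pred         # confidence in the state builds up
    return x_new, p_new, k


# Same prediction, same measurement, different measurement noise:
print(kalman_update(10.0, 1.0, 12.0, 0.01))   # precise measurement: estimate lands near 12
print(kalman_update(10.0, 1.0, 12.0, 100.0))  # noisy measurement: estimate stays near 10
```

Run over a whole track, the same rule means that the stretch of the trajectory with the smallest measurement errors pulls hardest on the fitted result.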

OK, great, I follow so far.  What's this got to do with the price of tea in China?  Or more specifically, with initial velocities and positions?  Well, here's what's happening.  At a rate of about 60Hz, a measurement of the baseball's position is taken with cameras positioned high above and behind both first base and home plate, and mapped to a 3D trajectory.  But these measurements have inherent errors in them.  As anyone who has ever tried to take action shots with their camera can testify, these errors are highly dependent on the velocity of the thing being photographed.  Especially in the direction of motion.  We also know that the baseball slows down on its way to the plate.  This means that inherent to this system, the most accurate measurements are probably made in the vicinity of home plate, and thus, when determining the trajectory of the baseball, these measurements made in the vicinity of home plate are weighted more heavily than the measurements made near the pitcher's release point.  So I think Dr. Nathan is only partially correct in his description of the initial parameters and acceleration used in computing the trajectory.  He refers to them as "the most important parameters in the database", and says that "all other parameters are calculated from them".  While he is correct that solving the equations of motion using these parameters will give you the final position, I believe that due to the way the trajectory fitting algorithm works, the most accurately measured parameters in the database are actually the final positions and velocities....not the initial ones.  Furthermore, I believe that it's quite possible that second order effects are actually the things that conspire to make initial release points and velocities so inaccurate, as others have pointed out.  While the constant acceleration approximation is probably good to first order, it is certainly not correct.  There's more happening there.
Firstly, the magnitude of the drag force on the baseball is highly dependent on the velocity of the baseball.  Not to mention dependent on other parameters not as easily measured.  Air density (it matters for parks like Coors), perhaps humidity, wind speed, and a host of other environmental considerations have an extra effect on the flight of a baseball.  Secondly, loss of velocity is not the only mechanism through which a baseball loses energy on its way to home plate.  Its rate of spin also slows down, which affects the Magnus force responsible for pitch break.  Although that last one is probably a much smaller effect than the others over the distance from the mound to the plate...so we can probably safely ignore it.  But maybe not....I honestly don't have a good estimate in my head of how big this effect is.  I'll have to look it up later.
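
To put rough numbers on the velocity and air density dependence, here is a small sketch using the standard quadratic drag law, a = (1/2) * rho * Cd * A * v^2 / m, in SI units. The default drag coefficient, ball radius, ball mass, and air densities are approximate, assumed values chosen only for illustration.

```python
import math


def drag_deceleration(speed, air_density=1.2, drag_coeff=0.35,
                      mass=0.145, radius=0.037):
    """Drag deceleration in m/s^2 for a ball moving at `speed` m/s.

    All defaults are assumptions: air density in kg/m^3 (roughly sea level),
    a ballpark drag coefficient, and approximate baseball mass (kg) and radius (m).
    """
    area = math.pi * radius ** 2   # cross-sectional area of the ball
    return 0.5 * air_density * drag_coeff * area * speed ** 2 / mass


for speed in (35.0, 40.0, 45.0):
    sea_level = drag_deceleration(speed)
    thin_air = drag_deceleration(speed, air_density=1.0)  # assumed high-altitude density
    print(speed, round(sea_level, 2), round(thin_air, 2))
```

Because the drag goes as the square of the speed, it is larger at release than at the plate, which is one reason a single constant acceleration can only be an approximation.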

So I think that it's these second order effects, combined with the fact that, inherent to the measurement and fitting process, the data points closer to home are weighted more heavily, that are primarily responsible for the wide variations in release point and initial velocity measurements in the data.  Let me take a minute here to say that, in my opinion, this is probably a desired benefit for the makers of PITCHfx.  Because, to them, the most important thing is putting up those pretty graphics that show you exactly where the pitch went on replays that happen mere seconds after the pitch occurred.  So their algorithm is both fast, and most accurate at the point that matters most for television broadcasts.  They'll eat whatever inaccuracies they have in initial positions for that, because, let's face it, those pretty graphics make the game a bit more fun to watch.  In fact they add almost enough value to the broadcasts to make up for Joe Morgan being in the booth.

But those inaccuracies at the other end of the ball's flight present a problem for researchers wanting to analyze pitches with this data.  So how do you get around that?  Well, I haven't thought about that in great detail yet.  Unfortunately, I won't be able to for a few days either.  But I wanted to get this out there so I know where to start when I do get to thinking about it.  Basically, I think what I'll probably start with is using the final positions and velocities and try working backwards from there using the second order equations of motion to see if the release points become any more consistent from park to park.  In order to do this right though, there are a few other factors that will have to be looked up, which could be time consuming.  Primarily, the air density and prevailing winds on any given day strike me as the most important factors to consider.
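
As a starting point, working backwards could look something like the sketch below: take small constant-acceleration steps, recompute the acceleration (quadratic drag plus gravity) at every step, and run time in reverse from the plate. Everything here is an assumption for illustration: the drag constant, the units (feet and seconds), the coordinate convention (y from the plate toward the mound, z up), and the sample pitch. Wind and the Magnus force are left out entirely.

```python
import math

G = 32.174       # gravitational acceleration, ft/s^2
K_DRAG = 0.002   # assumed drag constant, 1/ft (illustrative, not fitted to data)


def acceleration(vel, k_drag=K_DRAG):
    """Drag opposing the velocity (magnitude k*|v|^2) plus gravity along -z."""
    speed = math.sqrt(sum(c * c for c in vel))
    ax, ay, az = (-k_drag * speed * c for c in vel)
    return (ax, ay, az - G)


def propagate(pos, vel, duration, dt=1e-4, k_drag=K_DRAG):
    """Step the state through `duration` seconds in small constant-acceleration steps.

    A negative duration runs the trajectory backward in time.
    """
    n_steps = int(round(abs(duration) / dt))
    h = math.copysign(dt, duration)   # signed step: negative when going backward
    for _ in range(n_steps):
        acc = acceleration(vel, k_drag)
        pos = tuple(p + v * h + 0.5 * a * h * h for p, v, a in zip(pos, vel, acc))
        vel = tuple(v + a * h for v, a in zip(vel, acc))
    return pos, vel


# Made-up release state, flown forward to the plate and then traced back.
release_pos, release_vel = (-1.5, 50.0, 6.0), (3.0, -130.0, -4.0)
plate_pos, plate_vel = propagate(release_pos, release_vel, 0.4)
back_pos, back_vel = propagate(plate_pos, plate_vel, -0.4)
print(plate_pos, plate_vel)
print(back_pos, back_vel)
```

With real data the starting point would be the measured state at the plate rather than a simulated one, and the drag constant would have to come from the air density for that park and day.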

Perhaps to start, I should begin by looking only at games played in domes, or under retractable roofs with the roof closed, as those would represent the most uniform conditions.  But then again, that also severely cuts into the statistics available.

Feel free to comment.  Please don't be offended if I don't respond to comments for a couple of days though.  But I will get back to them.

(*These thoughts leave out considerations of "bad tracks", which can, and do, happen.  I believe it was Joe Sheehan who commented on a pitch during an intentional walk which was tracked as being right down the middle of the plate by PITCHfx...That points to a different sort of problem, and at some point, you do want to be able to remove bad tracks from your analysis.  How to identify them though??  With the current data, it might be impossible to identify all of them.  Certainly you can find some, though, through parameters that just have nonsensical values.)
  • Ike, I enjoyed reading your article on PITCHf/x and getting your perspective on the measurement error.

    It definitely makes sense that the measurements with the least error would be those around the plate. Sportvision and MLBAM claim their measured locations are within an inch or so around the plate, and I had assumed they were claiming similar accuracy for all data points on the trajectory. Now I understand why that isn't the case.

    I'm curious how you would start with the final position and velocity and work backwards to get the initial position and velocity.

    Josh Kalk has taken a look at air density and its effect on acceleration.

    I have been assuming air density as a constant when solving for spin direction and spin rate, and that works pretty well for my purposes, but obviously it would be an improvement to factor in the game time temperature and altitude since that data is readily available. It would be nice to have the humidity, too, but since that isn't in the XML data from MLB, it would require finding that data from an alternate source.

    We have game time wind speed and approximate direction in the MLB data, too, but I'm reluctant to try to apply that to the data set since wind speed is variable on a much smaller time scale than is temperature or humidity.
    • Hi Mike.
      To be honest, right now I'm not entirely sure how I'm going to go about doing this. I've looked at Josh's post over there, and I think he is doing a good job. One thing I worry about though with all of the focus on correcting the initial parameters is what effect doing so will have on the final parameters, which I think are actually the most well measured. If he's doing so in such a way that he forces the final position to stay the same, and the final velocity to not change a lot, then maybe I should just take his results and start looking at data. But I'm not sure he is doing this.

      So I can't work backward without final velocity information, which doesn't really exist (aside from a speed measurement). If I had final velocity, a simple acceleration similar to what Josh has done would be enough to work backward.

      So what I've been doing with my time since this post is that I've created a pitch monte carlo that generates random pitch trajectories based on the equations in Dr. Nathan's papers and talks (sort of anyway; to simplify things, I just make tiny steps with constant acceleration, and at each step change the acceleration according to those formulae, but it works). I then simulate a measurement with errors and run a toy Kalman filter over those tracks to see where I wind up. From just looking at individual tracks as I have been debugging this (nearly done), it looks like there may be some systematic errors that creep in with respect to the velocities and positions. Anyway, what I'd like to see if I can do is apply corrections to the Kalman filtered tracks to bring them closer to the correct values, and then apply those corrections to the data and see if I get some kind of improvement. I'm not entirely sure this is going to work because my code is probably not the same as their code, and without access to theirs, there's no telling how what I'm trying matches up to what they actually do.
      The hope is that I can use the parameters that are given to get a reliable estimate of final velocity and work backward, and hopefully the initial parameters I get from doing that are more accurate than the initial parameters returned from the filter. It may work, it may not. I don't know yet, but in the process, I'm learning a lot about what's actually going on.

      But regardless of that, I think that there probably are systematic errors resulting from the way all this works...I also think that there may be some additional imprecisions that creep in as a function of camera position. Right now, I just put general (and reasonable) errors on a pure x,y,z measurement, but my code can in principle handle a transformation to the x,y coordinates of two cameras...which is where the real errors are, and see how they propagate.

      One of the things that I really wonder about is this idea of the cameras being calibrated well at home plate, but not so well at the mound. I find that a little counter-intuitive, because it would be pretty dumb for them to not calibrate at the mound too, I think.

      Anyway, I'll probably play around with this for a week or so, and then either give up and go another route, or if things look real promising, keep going until I can get the data as accurate as possible.