So I've been thinking a little bit more about some of the problems associated with the PITCHfx data, and reading some more. It seems that a great many people are making a big deal out of the initial positions and initial velocities being off. They might be justified...but I think there is more going on than perhaps they realize.
As Dr. Alan Nathan points out here, the PITCHfx data records the "initial" velocity and position of the baseball, along with a calculated acceleration that is treated as a constant when calculating the pitch trajectory. All of the rest of the numbers are derived from these.
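As a rough illustration of what "derived" means here, the sketch below propagates a state under constant acceleration. The function name and every number in it are made up for illustration (feet and seconds, loosely pitch-like); none of them come from the PITCHfx database.

```python
def propagate_constant_accel(pos, vel, acc, dt):
    """Advance (or, for dt < 0, rewind) a state under constant acceleration."""
    new_pos = tuple(p + v * dt + 0.5 * a * dt * dt
                    for p, v, a in zip(pos, vel, acc))
    new_vel = tuple(v + a * dt for v, a in zip(vel, acc))
    return new_pos, new_vel


# Hypothetical values, for illustration only (x, y, z in feet; y points
# from the plate toward the mound, so the pitch travels in -y).
release_pos = (-1.5, 50.0, 6.0)
release_vel = (5.0, -130.0, -5.0)
accel = (-10.0, 28.0, -20.0)
flight_time = 0.4  # seconds

# Forward: initial state + constant acceleration -> state near the plate.
plate_pos, plate_vel = propagate_constant_accel(
    release_pos, release_vel, accel, flight_time)

# Backward: the same formula with a negative time step recovers the release state.
back_pos, back_vel = propagate_constant_accel(
    plate_pos, plate_vel, accel, -flight_time)
```

Under this model the two ends of the trajectory carry exactly the same information: give it either end plus the acceleration and it hands back the other. Any difference in how well each end is known has to come from the measurements and the fit, not from the equations.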
Several people, including Joe Sheehan, Josh Kalk (here, and in other posts on his blog), and to a lesser extent, Mike Fast over at Fast Balls (although, I do like the way he classifies pitches using Dr. Nathan's spin/rate approximations...that may be useful later) have pointed out that from ballpark to ballpark, the initial velocities and positions are not consistent with one another, and have gone to great lengths to try to correct for this. I'm sure I missed a few, but those are the guys I have been reading a lot of in my ramp up to actually analyzing data. (In case you are wondering when that will happen, I've got a framework ready to store the data...now I just have to write the code to get it in there. And in case you are still wondering, yes, it will be in ROOT, because that's what I'm most familiar with.)
Anyway, I think these people are wrong. Sort of. It's not that I think they are wrong in believing that the release points and initial velocities are incorrect. It's that I think they may be going about correcting the data in the wrong way.
At first I thought they might be wrong to worry about the release point height at all, since it could simply be due to variations in mounds. Being a pitcher myself, I have experienced the entire spectrum of mounds throughout my high-school, college, and now weekend-league career, from mounds that are little more than pimples with a rubber spiked into them to those that, in flat states like Illinois and Oklahoma, could nearly be called mountains. But seeing the numbers they are showing for this variation, I think that's a second order effect at best. I've also pitched at a number of minor league and MLB spring training facilities, and without a doubt those are by far the most uniform mounds I have ever run across. Sure, there are differences...but I doubt that, for a pitcher with a consistent release point, the release height would vary by as much as half a foot or more from park to park. Maybe as much as 3 inches or so...but more would probably be a stretch to blame on mound variation.
So maybe there really is something funny going on with PITCHfx.
And I'm sure there is, but I'm not sure that making corrections based on averages to initial position or velocity is the best way to go.
So when in doubt, it's best to know your detectors and reconstruction algorithms as well as you can. And in looking for information in this regard, I came across a nice paper that was linked to on Dr. Nathan's site, which, although somewhat dated, is probably a fairly accurate representation of the current system. So, with that in mind, I direct your attention to the blue-colored sidebar on page 5, wherein the trajectory fitting algorithm is described. Hey, look at that, it's a Kalman filter. Most of the people that regularly read this will probably be at least a little familiar with a Kalman filter, as a very similar filter is used in the DØ tracking and vertex reconstruction. OK, so maybe you are also like me, and recognize the name, and know you have read about how they work, but don't really remember it. That's fine. Because really, that blue sidebar contains all you really need to know...almost.
The key phrase in that little blue sidebar is this (although it's a little cryptic...for further explanation, I refer you to the Wikipedia entry for Kalman filters...not because I think it's right, but because it's a link I can easily find):
"[the error matrix] gradually decreases as the algorithm incorporates more measurements: Confidence in the state builds up. Equation 7 shows that if K [a thing related to the error matrix] is large -- which is the case if R [that part of the error matrix due to noise effects] is small, meaning that there is little noise in the measurement [and I would also add, small measurement errors...that's important later], the new measurement z is weighted heavily. Instead if K is small, the value the current state x predicts has a higher weight"
OK, so let's parse that a little bit. What's happening is this: The detector makes a measurement. It then makes a prediction, based in this case on the physics of a body in motion under constant acceleration, of where the next measurement will be made. The next measurement is made and compared to the prediction. If the prediction has smaller errors than the next measurement, the prediction is weighted more heavily than the measurement. If the measurement has the smaller errors, then the measurement is weighted more heavily than the prediction. From there, the predictive model is adjusted, and a prediction for the next measurement is made, and so on until the measurement process ends for whatever event you are observing. So forgetting about this predictive model for now, the bit to take home is this: At the end of the day (or pitch, really), what you wind up getting out of the fitting algorithm is a trajectory that's weighted more heavily toward the most precise measurements taken by the detector, and less heavily toward those that are less precise. This is generally what you want in a measurement...but it causes problems here.
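To make the weighting concrete, here is a minimal sketch of a single measurement update for a one-dimensional Kalman filter. It is the textbook scalar form, not the PITCHfx implementation (which tracks a full state vector), and the numbers are invented purely to show how the gain K responds to the measurement noise R.

```python
def kalman_update(x_pred, p_pred, z, r):
    """One scalar Kalman measurement update.

    x_pred, p_pred: predicted state and the variance of that prediction
    z, r:           new measurement and its noise variance
    Returns the updated state, its variance, and the gain that was used.
    """
    k = p_pred / (p_pred + r)            # gain: near 1 if r is small, near 0 if r is large
    x_new = x_pred + k * (z - x_pred)    # blend prediction and measurement
    p_new = (1.0 - k) * p_pred           # uncertainty shrinks after every update
    return x_new, p_new, k


# Same prediction (10.0 with variance 1.0) and same measurement (12.0),
# but two very different measurement errors. All values are illustrative.
x_precise, p_precise, k_precise = kalman_update(10.0, 1.0, 12.0, 0.01)
x_noisy, p_noisy, k_noisy = kalman_update(10.0, 1.0, 12.0, 100.0)

print(k_precise, x_precise)  # gain close to 1: result lands near the measurement
print(k_noisy, x_noisy)      # gain close to 0: result stays near the prediction
```

With the precise measurement the updated state ends up almost on top of the measurement; with the noisy one it barely moves off the prediction. In both cases the variance goes down, which is the "confidence in the state builds up" part of the quote.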
OK, great, I follow so far. What's this got to do with the price of tea in China? Or more specifically, with initial velocities and positions? Well, here's what's happening. At a rate of about 60 Hz, a measurement of the baseball's position is taken with cameras positioned high above and behind both first base and home plate, and mapped to a 3D trajectory. But these measurements have inherent errors in them. As anyone who has ever tried to take action shots with their camera can testify, these errors are highly dependent on the velocity of the thing being photographed, especially in the direction of motion. We also know that the baseball slows down on its way to the plate. This means that, inherent to this system, the most accurate measurements are probably made in the vicinity of home plate, and thus, when determining the trajectory of the baseball, these measurements are weighted more heavily than the measurements made near the pitcher's release point.
So I think Dr. Nathan is only partially correct in his description of the initial parameters and acceleration used in computing the trajectory. He refers to them as "the most important parameters in the database", and says that "all other parameters are calculated from them". While he is correct that solving the equations of motion using these parameters will give you the final position, I believe that due to the way the trajectory fitting algorithm works, the most accurately measured parameters in the database are actually the final positions and velocities...not the initial ones. Furthermore, I believe that it's quite possible that second order effects are actually the things that conspire to make initial release points and velocities so inaccurate, as others have pointed out. While the constant acceleration approximation is probably good to first order, it is certainly not correct. There's more happening there.
Firstly, the magnitude of the drag force on the baseball is highly dependent on the velocity of the baseball, not to mention other parameters that are not as easily measured. Air density (it matters for parks like Coors), perhaps humidity, wind speed, and a host of other environmental considerations have an extra effect on the flight of a baseball. Secondly, loss of velocity is not the only mechanism through which a baseball loses energy on its way to home plate. Its rate of spin also slows down, which affects the Magnus force responsible for pitch break. That last one is probably a much smaller effect than the others over the distance from the mound to the plate...so we can probably safely ignore it. But maybe not...I honestly don't have a good estimate in my head of how big this effect is. I'll have to look it up later.
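For a feel of how strongly drag depends on speed and air density, here is a small sketch using the standard quadratic drag formula, F = ½ρC_dAv². The ball mass and radius are approximate, and the drag coefficient and the two air densities are assumed round numbers rather than measured values, so treat the output as order-of-magnitude only.

```python
import math

BALL_MASS_KG = 0.145    # approximate mass of a baseball
BALL_RADIUS_M = 0.037   # approximate radius of a baseball
DRAG_COEFF = 0.35       # assumed; the real value varies with speed and spin


def drag_deceleration(speed_ms, air_density):
    """Deceleration (m/s^2) from quadratic drag at a given speed and air density."""
    area = math.pi * BALL_RADIUS_M ** 2
    force = 0.5 * air_density * DRAG_COEFF * area * speed_ms ** 2
    return force / BALL_MASS_KG


# Assumed air densities in kg/m^3: roughly sea level vs. roughly mile-high.
RHO_SEA_LEVEL = 1.2
RHO_ALTITUDE = 1.0

for speed in (36.0, 40.0, 44.0):  # m/s, roughly the range of pitch speeds
    low = drag_deceleration(speed, RHO_SEA_LEVEL)
    high = drag_deceleration(speed, RHO_ALTITUDE)
    print(f"{speed:4.1f} m/s: {low:5.2f} m/s^2 at sea level, {high:5.2f} m/s^2 at altitude")
```

Because the force goes as the square of the speed, the deceleration is not the same at the release point as it is at the plate, which is exactly where a constant acceleration fit has to compromise.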
So I think that it's these second order effects, combined with the fact that, inherent to the measurement and fitting process, the data points closer to home are weighted more heavily, that are primarily responsible for the wide variations in the release point and initial velocity measurements in the data. Let me take a minute here to say that, in my opinion, this is probably a desired benefit for the makers of PITCHfx. Because, to them, the most important thing is putting up those pretty graphics that show you exactly where the pitch went on replays that happen mere seconds after the pitch occurred. So their algorithm is both fast and most accurate at the point that matters most for television broadcasts. They'll eat whatever inaccuracies they have in initial positions for that, because, let's face it, those pretty graphics make the game a bit more fun to watch. In fact they add almost enough value to the broadcasts to make up for Joe Morgan being in the booth.
But those inaccuracies at the other end of the ball's flight present a problem for researchers wanting to analyze pitches with this data. So how do you get around that? Well, I haven't thought about that in great detail yet. Unfortunately, I won't be able to for a few days either. But I wanted to get this out there so I know where to start when I do get to thinking about it. Basically, I think what I'll probably start with is taking the final positions and velocities and working backwards from there, using equations of motion that include the second order effects, to see if the release points become any more consistent from park to park. In order to do this right, though, there are a few other factors that will have to be looked up, which could be time consuming. Primarily, the air density and prevailing winds on any given day strike me as the most important factors to consider.
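One possible starting point for "working backwards" is sketched below: a simple midpoint integrator for gravity plus quadratic drag that can be stepped with a negative time step. Everything here is an assumption made for illustration: the drag constant, the coordinate convention (metres, y toward the mound, z up), the flight time, and the release state. Wind and the Magnus force are left out entirely.

```python
import math

G = 9.81  # m/s^2, gravitational acceleration


def accel(vel, k_drag):
    """Acceleration from gravity plus quadratic drag; k_drag has units of 1/m."""
    speed = math.sqrt(sum(c * c for c in vel))
    ax, ay, az = (-k_drag * speed * c for c in vel)
    return (ax, ay, az - G)


def step(pos, vel, dt, k_drag):
    """One midpoint (second order) step; a negative dt steps backward in time."""
    a_start = accel(vel, k_drag)
    vel_mid = tuple(v + 0.5 * dt * a for v, a in zip(vel, a_start))
    a_mid = accel(vel_mid, k_drag)
    new_pos = tuple(p + dt * v for p, v in zip(pos, vel_mid))
    new_vel = tuple(v + dt * a for v, a in zip(vel, a_mid))
    return new_pos, new_vel


def propagate(pos, vel, duration, k_drag, n_steps=400):
    """Integrate over `duration` seconds (negative to rewind) in n_steps steps."""
    dt = duration / n_steps
    for _ in range(n_steps):
        pos, vel = step(pos, vel, dt, k_drag)
    return pos, vel


K_DRAG = 0.006                    # assumed drag constant, 1/m
release_pos = (0.0, 16.8, 1.8)    # hypothetical release point, metres
release_vel = (0.0, -40.0, 0.0)   # hypothetical release velocity, m/s

# Fly the pitch forward, then start from the plate state and rewind it.
plate_pos, plate_vel = propagate(release_pos, release_vel, 0.4, K_DRAG)
back_pos, back_vel = propagate(plate_pos, plate_vel, -0.4, K_DRAG)
```

In this toy setup the rewound state agrees with the release state to well within a millimetre, but only because the same drag constant is used in both directions. With real data the drag constant would have to come from the air density on the day, which is where all the looking up comes in.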
Perhaps I should begin by looking only at games played in domes, or under retractable roofs with the roof closed, as those would represent the most uniform conditions. But then again, that also severely cuts into the statistics available.
Feel free to comment. Please don't be offended if I don't respond to comments for a couple of days though. But I will get back to them.
(*These thoughts leave out considerations of "bad tracks", which can, and do, happen. I believe it was Joe Sheehan who commented on a pitch during an intentional walk which was tracked as being right down the middle of the plate by PITCHfx...That points to a different sort of problem, and at some point, you do want to be able to remove bad tracks from your analysis. How to identify them, though? With the current data, it might be impossible to identify all of them. Certainly you can find some, though, through parameters that just have nonsensical values.)