So after finding out more about how PITCHf/x works, it looks like I was way off base with my previous post. Turns out that the PITCHf/x system is entirely different from the one I was describing. No biggie. It's not the first time I've been wrong, and I can guarantee that it won't be the last time I'm wrong. However, for other reasons, I still think that the data closer to home is more accurate than the data near the release point...and I'll explain my new reasoning below.
So first, it appears that PITCHf/x uses a least squares method of fitting rather than a Kalman filter, and that the shutter speed and pixel density are such that blurring isn't really much of an issue. But according to comments from Dr. Nathan, the PITCHf/x people have told him that the primary sources of error are camera movement, which they attempt to correct on the fly (presumably by observing the position of the 1st and 3rd base lines and home plate, which remain fixed), and "operator error". That second one worries me a bit. I wonder how much influence an operator has over the output of PITCHf/x? But for the time being, let's just consider what happens when the camera wiggles a bit. Now, correcting on the fly is certainly possible, and in my opinion, easy to do for data points around home plate, because you've got a lot of "fixed" targets there to locate: home plate, the two base lines, and the batter's box. Home plate is certainly the easy choice there because it's fairly big and has a distinctive shape and orientation.
But we run into problems if we look for another point near the release point to calibrate on the fly with. The closest would be the pitcher's rubber. But I could easily imagine that this is difficult to pick out in an automated way, since the pitcher's foot is generally still in the vicinity at release, and depending on where the camera is, the pitcher's body may be obscuring the view of the rubber. This means that the calibration has to happen on objects that are more than 40 feet away from the point we want to observe. I think that propagating this on-the-fly calibration out to the pitcher's mound can be troublesome to do, and could very well be the source of the wide spreads in release point that have been seen in the data. Most likely, the more these cameras are subject to vibration from wind, seismic effects, or whatever else, the more they are off. And that's just for one detector.
Now, I think that this "on the fly" camera calibration may be the cause of a lot of the spread that is seen in horizontal and vertical release points, but I can't find any way that this could explain systematic differences seen in release points (and velocities too!) from park A to park B. Josh is probably on the right track by correcting for these by iterating over pitchers that have thrown at various parks and computing "park factors" for the release point. But then, there is another question that goes with that. If you move the release point by some distance, you can't stop there. You have to correct the velocity and/or accelerations as well, in order to ensure that the pitch trajectory puts the ball in the same place. In other words, we can't break the laws of physics. Although the trajectories they calculate aren't exactly the correct trajectories, I now think that they are close enough to do the job that needs to be done, although as Josh pointed out, some correction for air density may need to be inserted at some point.
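As a rough illustration of why moving the release point forces a change somewhere else, here is a minimal Python sketch. It assumes a constant-acceleration model in one coordinate, and it assumes (purely for illustration) that the flight time and acceleration are held fixed so that only the initial velocity absorbs the correction. The function names are hypothetical and are not part of any PITCHf/x tool.

```python
def plate_position(x0, vx0, ax, flight_time):
    """Position when the ball reaches the plate under constant acceleration."""
    return x0 + vx0 * flight_time + 0.5 * ax * flight_time ** 2


def velocity_after_release_shift(vx0, shift, flight_time):
    """Initial velocity that keeps the plate position fixed when the release
    point is moved by `shift` (assumption: flight time and acceleration are
    left unchanged)."""
    return vx0 - shift / flight_time


# Example with made-up numbers: shift the horizontal release point by 0.25 ft.
x0, vx0, ax, flight_time = -2.0, 5.0, -10.0, 0.4
new_vx0 = velocity_after_release_shift(vx0, 0.25, flight_time)
before = plate_position(x0, vx0, ax, flight_time)
after = plate_position(x0 + 0.25, new_vx0, ax, flight_time)
```

With these numbers `before` and `after` agree, which is the whole point: the shift in release point is paid for by a change in velocity. One could equally put the correction into the acceleration instead; which choice is right is exactly the open question.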
But the fact that these systematic effects exist still bugs me. Most people just write it off to miscalibration and start correcting by moving things around to where they think they "should" be. I'm wondering if there is a way to determine exactly how park X is miscalibrated and correct from there. And I may be on to something now, but if I am, I have to assume that the positions and velocities near home are inherently more accurate. I already said above that I think they are, but obviously, I can't be 100% sure of that with the handwaving argument I just gave. So, for the rest of this post, I am going to make this very assumption, and later tonight I plan to see whether or not the effect I expect is actually there. I'm also going to assume for the time being that each park utilizes one three-dimensional camera that automagically captures the x, y, z position of an object at a given point in time, because doing so makes it easier to articulate what I think might be happening.
So the way I see it, there are 2 main forms of miscalibration that can happen on a systematic level. The first is the simplest to imagine, which I will just call "offset" miscalibration. With this form of miscalibration, we would think that point x is really at point x+∂x. However, I also think that this is the easiest miscalibration to correct for, and the "on the fly" calibration that they can do probably does a good job of eliminating offset. The second primary form would be scale miscalibration, where some distance x-x0 is instead measured to be ∂x*(x-x0) (here ∂x is a multiplicative factor close to 1 rather than an additive offset). This could occur as a result of several mismeasurements, and could very well account for park to park differences. Not knowing exactly where the camera is pointing (camera angle), or under/over estimating the zoom on the camera, could easily introduce a scale calibration problem. Furthermore, in the case of scale miscalibration, a 1% miscalibration can have significant consequences. If we are 1% off in y (which has its axis running from the tip of home plate to the center of the pitcher's rubber), we might think that the 60.5 feet that should be there is really 59.9 feet or 61.1 feet. And over the roughly half a second that the ball is in flight, this can mean that we add or subtract about a mile per hour to a pitcher's fastball.
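The arithmetic behind the 1% example can be checked with a short Python sketch. The function names are hypothetical, and the half-second flight time is just the round number used above, not a measured value.

```python
FEET_PER_MILE = 5280.0
SECONDS_PER_HOUR = 3600.0


def scaled_distance(true_feet, scale_error):
    """Distance as it would be measured with a fractional scale error."""
    return true_feet * (1.0 + scale_error)


def velocity_shift_mph(true_feet, scale_error, flight_time_s):
    """Apparent change in average speed caused by the scale error alone."""
    extra_feet = true_feet * scale_error
    feet_per_second = extra_feet / flight_time_s
    return feet_per_second * SECONDS_PER_HOUR / FEET_PER_MILE


short = scaled_distance(60.5, -0.01)        # 59.895 ft
long = scaled_distance(60.5, +0.01)         # 61.105 ft
shift = velocity_shift_mph(60.5, 0.01, 0.5)  # 0.825 mph
```

With a flight time closer to 0.4 seconds the same function gives a bit over 1 mph, so "about a mile per hour" is the right order of magnitude either way.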
Realistically, I would expect scale miscalibrations to be on the order of 1%, or maybe less. Certainly in y, having 2 cameras that can make measurements in that direction can help pin it down, but I can still imagine ways it might creep in there. In x and z, I can imagine it being much easier to have scale calibration errors.
So, what to do? Well, first I'd like to test the idea that there may be scale calibration errors in y. To do this, I'll take one pitcher pitching at two different parks, preferably a starter who has gone 7 innings at both parks, to get a good sample of data. I don't really care who. For each of his pitches, I want to propagate the uncorrected calculated trajectory forward and backward from the standard y=55 "release point" to find the y position of his actual "release point". I'll define that as the point in y where the x and z spread for all of his pitches is minimized. If the pitcher actually has a consistent release point for every pitch, then curveballs, changeups, and sliders actually help us here, since tracks heading in different directions pin down the crossing point better. If not, maybe two or three release points are needed. Anyway, comparing these "vertices" in y should give us a good idea of the scale difference between parks, if there is any. What I expect to find if I am right is a difference in the y vertex of between 2 and 6 inches between any two parks, perhaps more if one park or both parks are badly out of line with each other.
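The vertex search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each pitch is taken to follow a constant-acceleration track, the state of each pitch is given at the y=55 reference plane, and "spread" is taken to be the variance in x plus the variance in z. The `Pitch` fields and function names are hypothetical, and the simple grid scan is my stand-in for whatever minimizer one prefers.

```python
from math import sqrt
from statistics import pvariance
from typing import NamedTuple


class Pitch(NamedTuple):
    # State at the reference plane y = y_ref (feet, ft/s, ft/s^2).
    x0: float
    z0: float
    vx0: float
    vy0: float  # negative: the ball travels toward home plate
    vz0: float
    ax: float
    ay: float
    az: float


def position_at_y(p, y, y_ref=55.0):
    """Return (x, z) where the constant-acceleration track crosses plane y."""
    disc = p.vy0 ** 2 - 2.0 * p.ay * (y_ref - y)
    # Root of y_ref + vy0*t + 0.5*ay*t^2 = y that gives t = 0 at y = y_ref.
    # Written in this form so it also works when ay == 0.
    t = 2.0 * (y_ref - y) / (sqrt(disc) - p.vy0)
    x = p.x0 + p.vx0 * t + 0.5 * p.ax * t * t
    z = p.z0 + p.vz0 * t + 0.5 * p.az * t * t
    return x, z


def spread(pitches, y, y_ref=55.0):
    """Variance in x plus variance in z of all tracks at plane y."""
    points = [position_at_y(p, y, y_ref) for p in pitches]
    return pvariance([x for x, _ in points]) + pvariance([z for _, z in points])


def find_vertex_y(pitches, y_min=45.0, y_max=60.0, step=0.05, y_ref=55.0):
    """Scan y on a grid and return the y where the spread is smallest."""
    n_steps = int(round((y_max - y_min) / step))
    candidates = [y_min + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda y: spread(pitches, y, y_ref))
```

Running `find_vertex_y` on the same pitcher's pitches from two parks and comparing the two results would give the park-to-park difference in the y vertex. The grid step of 0.05 ft (0.6 inches) is fine enough to resolve a difference of a few inches.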
I may also see nothing, in which case, the PITCHf/x people would have done a very good job at calibrating the length scale to their cameras.
Anyway, the first pass at this will be done tonight, and hopefully I'll have some pretty plots to throw up tomorrow.
camera calibration
Just one quick point regarding "operator error." Such an error does not affect the actual tracking of the pitch but only which pitches get recorded. The most famous example is the Bonds 756th home run pitch which was never actually recorded in the PITCHf/x on-line logs. Instead a "throwback" pitch which occurred just prior to the home run was recorded. That is a clear operator error.
Another point regarding the camera calibration: One might guess that there are three things that can change: 1. the camera focal length, which affects the magnification of the image; 2. the camera location in the global coordinate system (the origin of which is at home plate); 3. the orientation of the camera axis in the global coordinate system. Now, I am guessing that the most likely thing to be affected by vibration, etc. is #3, and the least likely is #1. One must be very careful about just correcting the coordinates arbitrarily. The procedure to go from pixel location to x,y,z is a bit involved and requires information from both cameras. In effect, each camera determines a vector pointing from a fixed point on the camera to the baseball. The intersection of those two vectors (one for each camera) determines the ball location. In actuality, the vectors may not intersect (mostly they do not), and the point of closest approach determines the most probable intersection point. The distance between the two vectors at the closest approach is carefully monitored and needs to be less than an inch or so (I don't know the exact number). If that number is too large, then that signals the operator that a recalibration is necessary.
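As a rough illustration of the closest-approach idea, here is a minimal Python sketch of the standard geometry for two lines in 3D. It is not the actual PITCHf/x reconstruction code. The function name is hypothetical, and taking the midpoint of the closest-approach segment as the ball location is an assumption made here for simplicity.

```python
from math import sqrt


def closest_approach(p1, d1, p2, d2):
    """For lines p1 + s*d1 and p2 + t*d2, return (midpoint, gap), where gap
    is the distance of closest approach and midpoint lies halfway between the
    two closest points."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w0 = tuple(a - b for a, b in zip(p1, p2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("lines are parallel; no unique closest point")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1 = tuple(p + s * k for p, k in zip(p1, d1))  # closest point on line 1
    q2 = tuple(p + t * k for p, k in zip(p2, d2))  # closest point on line 2
    gap = sqrt(sum((u - v) ** 2 for u, v in zip(q1, q2)))
    midpoint = tuple(0.5 * (u + v) for u, v in zip(q1, q2))
    return midpoint, gap


# Two skew lines: one along x through the origin, one along y at height z = 1.
midpoint, gap = closest_approach((0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 0))
```

Here `gap` is the quantity that would be monitored: if it grows beyond some threshold, the two cameras no longer agree on where the ball is.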
Re: camera calibration
I don't think that #1 (magnification) would be affected by vibration, but I do think that it is one likely cause of a systematic error. In other words, the code thinks the magnification level is x when it's really some number that's close to x, but not quite equal to it. Without many calibration points near the mound this could get problematic (unless maybe they can use the dirt-grass border as a calibration point).
I've started taking a look at this by essentially creating pitch vertices in the z-y plane only for some arbitrary pitcher at two different parks, and, suffice it to say, the results are very confusing at this point. I'll post more about it later.
There are a couple of other things I think of as likely candidates. One is time scale. This shouldn't affect anything but the magnitude of velocities and accelerations (i.e. x(y) and z(y) stay the same), but it is a consideration because nearly all electronic clocks run slow or fast to some degree. I can think of a few ways that the PITCHf/x folks could correct for this, but I don't know whether they try to do this or not. (The big dip in velocities that everyone experiences at Boston is the first thing that pops into my mind.)
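The claim that a clock error changes velocities and accelerations but not the shape of the track can be checked with a small Python sketch. It assumes a constant-acceleration model and a clock whose reported intervals are (1 + eps) times the true intervals. The function names and the sample numbers are made up for illustration.

```python
def apply_clock_error(velocity, acceleration, eps):
    """Velocity and acceleration as they would be reported if every time
    interval were measured as (1 + eps) times its true length."""
    k = 1.0 + eps
    reported_v = tuple(c / k for c in velocity)
    reported_a = tuple(c / (k * k) for c in acceleration)
    return reported_v, reported_a


def position(start, velocity, acceleration, t):
    """Constant-acceleration position at time t, per coordinate."""
    return tuple(s + v * t + 0.5 * a * t * t
                 for s, v, a in zip(start, velocity, acceleration))


start = (-2.0, 55.0, 6.0)            # x, y, z in feet
true_v = (5.0, -130.0, -4.0)         # ft/s
true_a = (-10.0, 28.0, -20.0)        # ft/s^2
rep_v, rep_a = apply_clock_error(true_v, true_a, 0.01)

# The same physical instant is t = 0.3 s on a true clock and 0.303 s on the
# 1%-fast clock; the reconstructed position is identical.
true_point = position(start, true_v, true_a, 0.3)
reported_point = position(start, rep_v, rep_a, 0.3 * 1.01)
```

With a clock that is 1% fast, a 130 ft/s pitch is reported at about 128.7 ft/s, while `true_point` and `reported_point` coincide, so x(y) and z(y) are untouched.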
I also wonder whether or not a mismeasurement of #2 (global position of the camera) has an effect...and it probably does. If the reconstruction code thinks the camera is at position x,y,z, and it's really off by some amount, it seems like this would futz with the reconstruction, and could be a source of systematics. The problem there is that this is something that can't be corrected for without more information (like where we think the cameras really are). Part of this would go to making scales wrong in some directions, depending on the cameras affected.
Right now, I'm not going about changing anything, just looking at what I see, as this is my first chance to play with the data...
Also, regarding the comment about the DCA (distance of closest approach) signaling the operator when a recalibration is necessary: I wonder how often that happens? It would be nice to know. IMO, if there are certain parks where this happens much more frequently than others, it would be an indication that there really is a time scale (fast or slow clock) problem with one (or more) of their cameras. Do you happen to know when these recalibrations are done?
So just a few more thoughts...
If I try to look only at pitches in certain speed ranges, the results blow up because there are limited statistics to work from. Interestingly, in September (which is the only month for which I have data on disk right now), Pettitte's 2 starts at Yankee Stadium appear to be vastly different in release point height, so I only used one.
There are some limitations to doing this, though. During the course of a game, even pitchers with a consistent release point will have a fluctuation in their release point of +/- 3 inches or so. They will often compensate for this by throwing with more or less downward velocity, so the intersection of tracks will tend to be forward of the actual release point. However, assuming that they throw the same percentage of the same pitches in each outing, the location of this minimum should be relatively stable over the course of a season. I don't think it will work as a tool for correcting the data, but if the data gets corrected appropriately, this may be one interesting thing to look at for other studies.