Stat 536 - Final Report: Analysis of 2008 Pitchf/x Data for Jon Lester
Jonathan Gruhl
June 6, 2008

1 Introduction

Major League Baseball (MLB) generated over $6 billion of revenue in 2007. Despite the sizable revenue-generating capacity of this sport, the league’s 30 franchises (29 based in the US and 1 in Canada) have until recently relied primarily on the subjective wisdom of baseball insiders to guide personnel decisions and game strategy. Only in recent years has quantitative analysis played a substantive role in team decision-making.

Fan interest in the quantitative analysis of baseball arose in the late 1970s and early 1980s with the publication of Bill James’ The Bill James Baseball Abstract and Pete Palmer’s The Hidden Game of Baseball. These publications are considered the foundations of sabermetrics, the name given to the application of quantitative analysis to baseball. Bill James defined sabermetrics as “the search for objective knowledge about baseball.”[1] As Michael Lewis noted in his NY Times bestseller Moneyball, “By the early 1990s it was clear that ‘sabermetrics’... was an activity that would take place mainly outside of baseball... ‘There was a profusion of new knowledge and it was ignored.’ Well into the late 1990s you didn’t have to look at big league baseball very closely to see its fierce unwillingness to rethink anything.”

In 1997, Billy Beane was hired as General Manager of the Oakland A’s. As a former player, Beane was a baseball insider, but he had also read all 12 of Bill James’ Abstracts. In the following years, the A’s became the model organization for succeeding with a low payroll, astutely valuing personnel with the help of sabermetric methods. Beane was so successful that he ultimately became the main subject of Moneyball. His success validated sabermetrics and led to its increased adoption. Given the grassroots nature of the sabermetrics movement that has reshaped the game, MLB has smartly begun to track increasing amounts of data and, furthermore, has made this data publicly available.
One recent initiative on the part of MLB has been the Pitchf/x system. Pitchf/x tracks a number of physical characteristics for every pitch thrown at a baseball game, including velocity, horizontal and vertical break, horizontal and vertical location, spin, and the release point of the pitch from the pitcher’s hand. MLB then uses these characteristics to attempt to classify each pitch. For example, a high-velocity pitch with relatively little break is likely to be classified as a ‘Fastball’. A pitch with a lot of break would likely be classified as a ‘Curveball’. A pitch with a trajectory similar to a fastball but with less velocity would probably be categorized as a ‘Changeup’. The Pitchf/x system was partially rolled out during the 2006 and 2007 seasons. However, 2008 is the first season in which the system has been fully implemented and data is available from every game. As a result, there is little existing public research on this data.

The purpose of this study is to use the data provided by the Pitchf/x system to investigate two problems of interest regarding Red Sox pitcher Jon Lester’s 2008 performance to date. First, we investigate the association among pitch type, pitch count, and the side from which a batter hits (batter side). We then examine the association among these variables plus another variable, pitch outcome. Given the lack of existing research on this data, this analysis is largely exploratory and data-driven in nature. Nonetheless, we do test two a priori hypotheses related to our problems of interest. For pitch type, pitch count and batter side, we test a model of no second-order interaction. For our second problem of interest, involving pitch outcome, pitch type, pitch count and batter side, we test a model of no second- or third-order interactions. Beyond testing the fit of these models, we conduct secondary analyses that focus on model selection.
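The two a priori models can be written out as loglinear models for the expected cell counts. The notation below is an assumption made for illustration (the report does not display these equations): $\mu$ denotes an expected cell count, the letters O, T, C and B stand for pitch outcome, pitch type, pitch count and batter side, and a "first-order interaction" is read as a two-factor term.

```latex
% Model of no second-order interaction for the three-way table (T, C, B):
% every two-factor term is present, the three-factor term is omitted.
\log \mu_{ijk} = \lambda + \lambda^{T}_{i} + \lambda^{C}_{j} + \lambda^{B}_{k}
               + \lambda^{TC}_{ij} + \lambda^{TB}_{ik} + \lambda^{CB}_{jk}

% Model of no second- or third-order interaction for the four-way table
% (O, T, C, B): main effects and all six two-factor terms only.
\log \mu_{hijk} = \lambda + \lambda^{O}_{h} + \lambda^{T}_{i} + \lambda^{C}_{j} + \lambda^{B}_{k}
                + \lambda^{OT}_{hi} + \lambda^{OC}_{hj} + \lambda^{OB}_{hk}
                + \lambda^{TC}_{ij} + \lambda^{TB}_{ik} + \lambda^{CB}_{jk}
```

Under the first model, the association between any two of the variables is the same at every level of the third, which is the property the hypothesis tests examine.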

2 Data and Methods

2.1 Data

As mentioned above, our dataset is the 2008 year-to-date Pitchf/x data for Red Sox pitcher Jon Lester. Jon Lester is a 24-year-old left-handed starting pitcher for the Boston Red Sox. The fact that Lester is left-handed is relevant, as left-handed pitchers on average tend to be more effective against left-handed batters. Through May 25, Lester had started 12 games. As the first of these games was played in Japan, where the Pitchf/x system was unavailable, pitches thrown by Lester in the last 11 of these 12 games comprise the dataset. The dataset totals 1107 pitches.

In our analysis, we consider four variables: pitch type, pitch count, batter side, and pitch outcome. Each of these variables is categorical.

Pitch type is the classification of the pitch based upon its velocity, break and spin. Each MLB pitcher throws a different set of pitches, but the most common pitches are the ‘Fastball’, ‘Curveball’, ‘Cut Fastball’ or ‘Cutter’, ‘Slider’ and ‘Changeup’. Jon Lester predominantly throws fastballs, curveballs and cutters. As a result, our pitch type categories are ‘Fastball’, ‘Curveball’, ‘Cutter’ and ‘Other’ for any other pitches.

Pitch count is a quantity that tracks the number of balls and strikes to a given batter during his turn at bat. The pitcher typically has the advantage when there are more strikes than balls, while the batter is considered to have the advantage when there are more balls than strikes. As a result, we consolidated the categories for pitch count to ‘First’ when the count is zero strikes and zero balls on the first pitch, ‘Ahead’ when there are more strikes than balls in the count, ‘Behind’ when there are more balls than strikes, and ‘Even/Full’ when neither the pitcher nor the batter has the advantage.

Batter side represents the side of the plate on which the batter hits. As discussed above, batter side is important in that a pitcher may be more comfortable or effective pitching to batters who hit from a particular side. Batter side has two categories, ‘Left’ and ‘Right’. Subjective baseball knowledge suggests that pitch count and batter side are factors affecting the pitcher’s choice of pitch type.

Finally, we have pitch outcome. This variable is considered only in addressing the second problem of interest. Pitch outcome has five categories: ‘Ball’, ‘Strike’, ‘In play, no out’, ‘In play, out(s)’ and ‘Foul’. ‘Strike’ and ‘In play, out(s)’ are considered positive outcomes for a pitcher, while ‘Ball’ and ‘In play, no out’ are considered negative outcomes. ‘Foul’ can be a positive or negative outcome depending on the situation. Again, subjective baseball knowledge suggests that pitch type, pitch count and batter side are factors that influence pitch outcome.
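The consolidation of pitch count into four categories can be sketched as a small function. This is an illustrative reading of the description above, not the author's code: the function name is hypothetical, and the treatment of the full count (3 balls, 2 strikes) as ‘Even/Full’ rather than ‘Behind’ is an assumption based on the category label.

```python
def count_category(balls: int, strikes: int) -> str:
    """Map a ball-strike count to one of the four consolidated categories."""
    if not (0 <= balls <= 3 and 0 <= strikes <= 2):
        raise ValueError("a live count has 0-3 balls and 0-2 strikes")
    if balls == 0 and strikes == 0:
        return "First"
    # Assumption: the full count (3-2) is grouped with the even counts.
    if balls == strikes or (balls == 3 and strikes == 2):
        return "Even/Full"
    if strikes > balls:
        return "Ahead"
    return "Behind"


if __name__ == "__main__":
    for balls, strikes in [(0, 0), (0, 2), (3, 1), (1, 1), (3, 2)]:
        print(f"{balls}-{strikes}: {count_category(balls, strikes)}")
```

Keeping the mapping in one function makes it easy to try a different grouping, for example treating the full count separately, if more data become available.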

2.2 Methods

Because each of the variables included in the analysis is categorical, the data can be represented in a contingency table. One means of measuring associations in contingency tables is the odds ratio. We also use loglinear models to model the cell counts of the contingency table. We can use these models to examine the nature of the associations among the variables as well as how the expected cell counts change with the levels of the categorical variables.

For our first problem of interest, we test the hypothesis that a loglinear model of no second-order interaction fits a contingency table for pitch type, pitch count and batter side. For our second problem of interest, we test the hypothesis that a loglinear model of no second- or third-order interaction fits a contingency table for pitch type, pitch count, batter side and pitch outcome. If the models do not deviate significantly from the observed data, then we can use them to make inferences about the association of those variables and the manner in which the expected cell counts change with the levels of the variables.

To evaluate the fit of these models, we calculate the deviance $G^2$ and the Pearson statistic $X^2$. Both of these statistics are asymptotically distributed $\chi^2$, and we can compare them to this distribution to assess fit. When contingency tables have sampling zeros or many cells with small counts, these fit statistics may not be distributed $\chi^2$. In this situation, we may simulate the distribution of $G^2$ under the model of interest to obtain a hopefully more accurate p-value for the test statistic. For example, we simulate 20,000 contingency tables based on the model of no second-order interaction and then compute $G^2$ for each simulated table. This collection of 20,000 simulated $G^2$ values is then our simulated distribution, to which we may compare the actual $G^2$ for the model of no second-order interaction to obtain a p-value.
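The fit statistics and the simulated reference distribution described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the procedure actually used in the report: to stay short it fits the independence model to a small two-way table (which has a closed-form fit) instead of the loglinear models considered here, and the function names, the toy table and the number of simulations are all hypothetical.

```python
import math
import random


def g2(observed, expected):
    """Deviance G^2 = 2 * sum(o * log(o / e)); empty cells contribute 0."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def x2(observed, expected):
    """Pearson X^2 = sum((o - e)^2 / e)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fit_independence(table):
    """Fitted counts under independence for a two-way table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def flatten(table):
    return [x for row in table for x in row]


def simulated_p_value(table, n_sims=2000, seed=0):
    """Share of simulated G^2 values at least as large as the observed G^2."""
    rng = random.Random(seed)
    n = sum(sum(row) for row in table)
    width = len(table[0])
    fitted = flatten(fit_independence(table))
    g2_obs = g2(flatten(table), fitted)
    cells = list(range(len(fitted)))
    exceed = 0
    for _ in range(n_sims):
        # Draw a table of n observations from the fitted cell probabilities.
        counts = [0] * len(fitted)
        for cell in rng.choices(cells, weights=fitted, k=n):
            counts[cell] += 1
        sim = [counts[i:i + width] for i in range(0, len(counts), width)]
        # Refit the model to the simulated table before computing its G^2.
        if g2(counts, flatten(fit_independence(sim))) >= g2_obs:
            exceed += 1
    return exceed / n_sims


if __name__ == "__main__":
    toy = [[3, 1, 6], [1, 4, 2]]  # hypothetical sparse table
    print(simulated_p_value(toy))
```

The same loop applies to a loglinear model by replacing `fit_independence` with a routine that fits that model; refitting on every simulated table is what makes the simulated values comparable to the observed $G^2$.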
Beyond testing the fit of the proposed models, we perform secondary analyses that focus on model selection, in which we attempt to identify more parsimonious loglinear models whose fit does not deviate significantly from the observed data according to the fit statistics. This process allows us to explore the nature of the associations among the variables. Given the lack of existing research on the data, these secondary analyses are valuable in suggesting models of association for future research in this area.

3 Results

Of the 1107 pitches, pitch type is missing for 65 (5.9%). This leaves us with 1042 pitches for which we have complete data. Examination of the data does not indicate any obvious pattern to the missingness. Given the newness of the system, occasional disruption in the data tracking is not unexpected, and an assumption of missing completely at random (MCAR) does not seem inappropriate. Of the 1042 pitches, 714 (68.5%) are fastballs, 137 (13.1%) are curveballs, 136 (13.1%) are cutters and 55 (5.3%) are categorized as other. In addition, 334 (32.1%) pitches were thrown to left-handed batters (batter side ‘Left’) and 708 (67.9%) were thrown to right-handed batters (batter side ‘Right’).

We first investigate the association among pitch type, pitch count and batter side by testing a model of no second-order interaction. The contingency table for these variables is listed in Table 1. As discussed previously, the intuition behind a model of no second-order interaction is that the batter side will not influence the association between pitch type and pitch count, nor will the pitch count influence the association between pitch type and batter side. After fitting the model of no second-order interaction, we obtain $G^2$ and Pearson $X^2$ fit statistics of 16.33 and 17.02 respectively on 9 df. Using a $\chi^2_9$ reference distribution, we calculate p-values of 0.060 and 0.048 respectively. Relative to a significance level of 0.05, these p-values fall on either side of the threshold and so are ambiguous as to whether the model fits the data. As a result, we compare the conditional spanning cell odds ratios calculated from the fitted model of no second-order interaction with those calculated from the data.

Table 1: Contingency Table for Pitch Count, Batter’s Side and Pitch Type

                       Pitch Type
Count      Side   Curve  Cutter  Fast  Other
First      L         11       8    62      1
First      R         21      11   140     13
Ahead      L         18       6    67      1
Ahead      R         37      34    91     18
Behind     L          4      14    69      3
Behind     R         16      35   138      9
Even/Full  L         13       5    51      1
Even/Full  R         17      23    96      9
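One standard way to fit a model of no second-order interaction to Table 1 is iterative proportional fitting (IPF), which rescales the fitted counts until every two-way margin matches the observed margin. The report does not say which algorithm or software was used, so IPF is swapped in here as an assumption, and all names in the sketch are hypothetical. If the table has been transcribed correctly, the printed statistics should land close to the values reported in the text.

```python
import math

TYPES = ["Curve", "Cutter", "Fast", "Other"]
COUNTS = ["First", "Ahead", "Behind", "Even/Full"]
SIDES = ["L", "R"]

# Table 1: TABLE[count][side] lists the counts in TYPES order.
TABLE = {
    "First": {"L": [11, 8, 62, 1], "R": [21, 11, 140, 13]},
    "Ahead": {"L": [18, 6, 67, 1], "R": [37, 34, 91, 18]},
    "Behind": {"L": [4, 14, 69, 3], "R": [16, 35, 138, 9]},
    "Even/Full": {"L": [13, 5, 51, 1], "R": [17, 23, 96, 9]},
}

# Cells are keyed by (type, count, side).
observed = {
    (t, c, b): float(TABLE[c][b][i])
    for c in COUNTS for b in SIDES for i, t in enumerate(TYPES)
}


def margin(cells, keep):
    """Sum the cells over every dimension not listed in `keep`."""
    totals = {}
    for key, value in cells.items():
        sub = tuple(key[d] for d in keep)
        totals[sub] = totals.get(sub, 0.0) + value
    return totals


def fit_no_second_order(cells, n_iter=200):
    """IPF for [TC][TB][BC]: rescale until all two-way margins match."""
    fitted = {key: 1.0 for key in cells}
    for _ in range(n_iter):
        for keep in [(0, 1), (0, 2), (1, 2)]:
            target = margin(cells, keep)
            current = margin(fitted, keep)
            for key in fitted:
                sub = tuple(key[d] for d in keep)
                fitted[key] *= target[sub] / current[sub]
    return fitted


fitted = fit_no_second_order(observed)

g2 = 2.0 * sum(o * math.log(o / fitted[k]) for k, o in observed.items() if o > 0)
x2 = sum((o - fitted[k]) ** 2 / fitted[k] for k, o in observed.items())
df = (len(TYPES) - 1) * (len(COUNTS) - 1) * (len(SIDES) - 1)  # 3 * 3 * 1 = 9


def fitted_cutter_vs_fast(count):
    """Fitted odds of Cutter vs Fast for side R relative to side L."""
    return (fitted[("Cutter", count, "R")] * fitted[("Fast", count, "L")]) / (
        fitted[("Cutter", count, "L")] * fitted[("Fast", count, "R")]
    )


print(f"G2 = {g2:.2f}, X2 = {x2:.2f}, df = {df}")
print({c: round(fitted_cutter_vs_fast(c), 2) for c in COUNTS})
```

Because the fitted values contain no three-factor term, the fitted Cutter-versus-Fast odds ratio printed on the last line is identical for every pitch count, which is exactly the constraint the model imposes.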

Looking specifically at some of the conditional spanning cell odds ratios for the observed data and the fitted model, we see a couple of differences. Conditional on the pitch count being ‘First’, we observe that the odds of Lester throwing a ‘Cutter’ as compared to a ‘Fastball’ to a right-handed batter (side ‘Right’) are 0.61 times those same odds when Lester is throwing to a left-handed batter (side ‘Left’). Under the model of no second-order interaction, this same conditional spanning cell odds ratio is 1.71. Thus, for the first pitch in an at-bat, we observe that the odds of Lester throwing a cutter as compared to a fastball to a right-handed hitter are lower than the odds of throwing a cutter instead of a fastball to a left-handed hitter. This relationship is reversed under the fitted model of no second-order interaction.

Another such difference in the conditional spanning cell odds ratios occurs when the pitch count is ‘Ahead’. Conditional on the pitch count being ‘Ahead’, we observe that the odds of Lester throwing a ‘Cutter’ as compared to a ‘Fastball’ to a right-handed batter are 4.17 times those same odds to a left-handed batter. Under the model of no second-order interaction, this same conditional spanning cell odds ratio is 1.71. In these two examples, pitch count does appear to have a noticeable influence on the association between pitch type and batter side. Under the model of no second-order interaction, the influence of pitch count is not included and the association between pitch type and batter side is constant across the categories of pitch count. Outside of the above cases, however, the model seems to be a reasonably good fit and we would not be inclined to reject it.

An additional consideration is that three cells in the contingency table have only 1 observation. Similarly, three cells in the expected contingency table under the model of no second-order interaction have counts of approximately 1.
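The observed conditional odds ratios quoted above follow directly from the Table 1 counts. The short sketch below recomputes them; the helper name is hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds a:b divided by odds c:d, i.e. (a / b) / (c / d)."""
    return (a / b) / (c / d)


# Cutter vs Fast counts from Table 1, side R in the numerator, side L in the denominator.
first = odds_ratio(11, 140, 8, 62)   # count 'First'
ahead = odds_ratio(34, 91, 6, 67)    # count 'Ahead'

print(round(first, 3))  # 0.609
print(round(ahead, 3))  # 4.172
```

Both values differ visibly from the single fitted value of 1.71, which is the discrepancy the text discusses.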
Moreover, five of the cells in the table have counts less than 5. Thus, there may be some reason to question whether the $G^2$ statistic is distributed $\chi^2$ in this situation. Using the approach of simulating a distribution for $G^2$ described above, we estimate a p-value of 0.079 for $G^2$, suggesting that we would not reject the fit of the model at a 0.05 level of significance. This result aligns with our intuition, and we do not reject the model of no second-order interaction.

Comfortable that the model of no second-order interaction fits the data reasonably well, we turn our attention to the estimated parameters. Given the number of parameters estimated, we do not list them all here but instead focus on the first-order interactions of interest. Because the parameters for the first (or reference) level of each categorical variable were constrained to 0, the exponentiated first-order interaction parameter estimates are equivalent to conditional spanning cell odds ratios. For instance, the exponentiated estimate under the model of no second-order interaction for the interaction term between pitch type ‘Cutter’ and batter side ‘Right’ is 1.71 and equates to the conditional spanning cell odds ratio discussed above. The 95% confidence interval for this conditional spanning cell odds ratio is (1.12, 2.61). Note that this confidence interval does not include 1, indicating that it would be unlikely to observe this data were the odds of Lester throwing a cutter as opposed to a fastball to a right-handed hitter the same as the corresponding odds for a pitch to a left-handed hitter.

Two parameter estimates that stood out were the estimate for the interaction term between pitch type ‘Cutter’ and pitch count ‘Behind’ and the estimate for the interaction between pitch type ‘Cutter’ and pitch count ‘Even/Full’. (Note that the reference level for pitch type is ‘Fastball’ and for pitch count is ‘First’.) The exponentiated estimate for the interaction between ‘Cutter’ and ‘Behind’ is 2.53 (1.44, 4.46), while the exponentiated estimate for the interaction between ‘Cutter’ and ‘Even/Full’ is 2.05 (1.10, 3.81). Again, note that neither of these confidence intervals includes 1. Because these estimates were produced under the model of no second-order interaction, they represent the same spanning cell odds ratio regardless of whether batter side is ‘Left’ or ‘Right’. These results are particularly interesting because pitch counts ‘Behind’ and ‘Even/Full’ indicate that the pitcher is at a disadvantage or neutral in his matchup with the batter at that point.
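The link between an interaction estimate on the log scale and the reported odds ratio interval can be illustrated as follows. This sketch assumes the reported intervals are Wald intervals (estimate plus or minus 1.96 standard errors on the log scale, then exponentiated); the report does not state how the intervals were computed, and the standard error used here is back-calculated from the reported interval rather than taken from the fitted model.

```python
import math


def wald_ci_odds_ratio(log_or, se, z=1.96):
    """Exponentiate a Wald interval computed on the log odds ratio scale."""
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def implied_se(lower, upper, z=1.96):
    """Standard error on the log scale implied by an interval (lower, upper)."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Cutter x Right interaction: reported odds ratio 1.71 with interval (1.12, 2.61).
se = implied_se(1.12, 2.61)
lower, upper = wald_ci_odds_ratio(math.log(1.71), se)
print(round(se, 3), round(lower, 2), round(upper, 2))
```

The reported interval is nearly symmetric around 1.71 on the log scale, which is what one would expect from an interval of this form.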
Existing baseball knowledge suggests that in these situations pitchers will rely on the pitch types with which they are most comfortable. Given that the fastball is clearly Lester’s dominant pitch, the fact that these conditional spanning cell odds ratios are greater than 1 implies that Lester has a degree of confidence in his cutter that he does not have in his curveball or the pitches designated as other. For an opponent who may face Lester, this is valuable information.

Testing our a priori hypothesis of a model of no second-order interaction among pitch type, batter side, and pitch count was our primary analysis. As part of a secondary analysis, we further consider different models of association for the data. In Table 2, we present a comparison of the models considered. The ‘Sim. P-val’ column refers to p-values obtained for $G^2$ via simulation as described above. In addition to the model of no second-order interaction that appears to fit the observed data, the model of conditional independence between batter side and pitch count given pitch type does not appear to deviate significantly from the observed data. This result suggests that a model of conditional independence may be appropriate for future hypothesis testing and inference. It is admittedly not obvious to me at this point whether this model of association is consistent with existing baseball knowledge.

We now turn to our second problem of interest, which examines the association among pitch outcome, pitch type, pitch count and batter side. The contingency table for these variables is listed in Table 3. We immediately notice that the table is quite sparse (37 of 160 cells are 0). As a result, we may have difficulty obtaining finite maximum likelihood estimates for models, and the fit statistics may not follow $\chi^2_{df}$ distributions. In what follows, we again simulate the distribution of $G^2$ for different models.
Table 2: Model Comparison for Pitch Type (T), Pitch Count (C), and Batter Side (B)

Model           df   $G^2$   $\chi^2$ P-val   Sim. P-val   Pearson $X^2$   $\chi^2$ P-val
[TB][TC][BC]     9   16.33   0.060            0.079        17.02           0.048
[TB][BC]        18   58.00   <0.001           <0.001       54.69           <0.001
[TB][TC]        18   17.71   0.12             0.19         18.25           0.11
[TC][BC]        12   37.30   <0.001           <0.001       32.44           0.0011
[B][TC]         15   38.02   <0.001           0.0013       33.59           0.0039
[C][TB]         21   58.72   <0.001           <0.001       55.21           <0.001

Our a priori hypothesis is that a model of no second- or third-order interaction fits the observed data. After fitting the model of no second- or third-order interaction, we obtain $G^2$ and Pearson $X^2$ fit statistics of 142.41 and 143.68 respectively on 105 degrees of freedom. Under the assumption of a $\chi^2_{105}$ distribution for these statistics, we obtain p-values of 0.0092 and 0.0073. Recognizing that this assumption may be flawed due to the sparsity of the table, we also calculate a p-value of 0.00685 using a simulated distribution for $G^2$ under the model of no second- or third-order interactions. Based on these p-values, we reject the hypothesis that the model fits the observed data.

Table 3: Contingency Table for Pitch Outcome, Pitch Type, Pitch Count and Batter Side

                               First     Ahead     Behind    Even/Full
Outcome          Type          L    R    L    R    L    R    L    R
Ball             Fastball     26   59   21   34   24   64    9   18
Ball             Curveball     8   13    7   14    3    6    5    6
Ball             Cutter        2    3    4   13    5   14    2   11
Ball             Other         1    8    1    9    1    3    0    3
Foul             Fastball     10   11   18   28   14   15   15   36
Foul             Curveball     0    0    2    6    1    2    0    4
Foul             Cutter        0    3    1   11    2    9    2    7
Foul             Other         0    0    0    1    1    2    0    3
In play, no out  Fastball      0    5    6    4    3    5    7   12
In play, no out  Curveball     0    0    1    3    0    1    2    2
In play, no out  Cutter        0    1    0    0    1    3    0    1
In play, no out  Other         0    0    0    2    0    0    0    0
In play, out(s)  Fastball      3   12    8   11    9   24   13   21
In play, out(s)  Curveball     0    0    5    8    0    0    2    2
In play, out(s)  Cutter        2    1    0    6    0    2    0    4
In play, out(s)  Other         0    3    0    2    1    1    0    0
Strike           Fastball     23   53   14   14   19   30    7    9
Strike           Curveball     3    8    3    6    0    7    4    3
Strike           Cutter        4    3    1    4    6    7    1    0
Strike           Other         0    2    0    4    0    3    1    3

In comparing the observed data to the expected counts under the model of no second- or third-order interactions, we find that the discrepancies between the two tables are not limited to one or two cases. The largest discrepancies seem to occur in table entries for pitch type ‘Fastball’, count ‘Behind’ and batter side ‘Right’. We also notice discrepancies in the cells for outcome ‘Strike’, pitch type ‘Curveball’, pitch count ‘First’, and in the cells for outcome ‘Foul’, pitch type ‘Cutter’, pitch count ‘Even/Full’.

Having rejected our a priori hypothesis, we continue with a secondary analysis of this data to further investigate the association among these four variables. Here we consider a large variety of models and check their fit to the data. Again due to the sparseness of the table, we rely on $G^2$, as the Pearson $X^2$ often cannot be calculated. The results of our model search/selection can be seen in Table 4.

Table 4: Model Comparison for Pitch Outcome (O), Pitch Type (T), Pitch Count (C), and Batter Side (B)

Model                    df    $G^2$    $\chi^2$ P-val   Sim. P-val   Pearson $X^2$   $\chi^2$ P-val
[OTB][OTC][OBC][TBC]     36    25.61    0.90             0.99         NA              NA
[OTC][OBC][TBC]          48    37.54    0.86             0.44         NA              NA
[OTB][OBC][TBC]          72    103.36   0.0091           0.0018       NA              NA
[OTB][OTC][TBC]          48    40.25    0.78             0.30         NA              NA
[OTB][OTC][OBC]          45    37.63    0.77             0.28         NA              NA
[OTC][OBC]               57    52.83    0.63             0.016        NA              NA
[OTC][OTB]               57    53.08    0.62             0.24         NA              NA
[OTC][OB][BC][TB]        69    66.72    0.55             0.28         NA              NA
[OTC][OB][TB]            72    68.56    0.59             0.32         NA              NA
[OTC][TB]                76    70.50    0.66             0.39         NA              NA
[OTC][B]                 79    90.82    0.17             0.093        NA              NA
[OT][TC][OC][B]          115   166.21   0.0012           0.0017       154.51          0.0082

As one can see in Table 4, there are some large differences between the $\chi^2$ P-val and Sim. P-val columns. This seems to indicate that the sparseness of the table has a large effect on the reference distribution for $G^2$. Relying primarily on the p-values from the simulated reference distribution for $G^2$, we see that there are a number of models that do not deviate significantly from the data. The most parsimonious of these, [OTC][B], is also the most surprising, as it implies that batter side is jointly independent of pitch outcome, pitch type and pitch count. Prior to this analysis, we would not necessarily have suggested such a model based on common baseball wisdom. If this model of association were appropriate for some pitchers, teams could conceivably alter their strategies to take advantage of this relationship. As such, this model of joint independence is an intriguing selection and, given the sparsity of the table, one whose fit we feel should be tested on a separate set of data for Lester and/or on other pitchers.

4 Conclusion and Discussion

In the above analysis, we have used odds ratios and loglinear modeling of contingency tables to investigate the association of variables related to the pitching of Jon Lester. In doing so, we considered two problems of interest. The first was an investigation of the association of pitch type, pitch count and batter side. We tested a model of no second-order interaction that had been hypothesized a priori. We found that this model did not deviate significantly from the observed data and used the model to make inferences about the data. One particularly interesting observation was the increased odds of Jon Lester using his cutter as compared to his fastball in situations where he might be perceived to be at a disadvantage. The second problem of interest examined the association among the variables pitch outcome, pitch type, pitch count and batter side. We hypothesized a model of no second- or third-order interaction for these variables and rejected it. For both problems of interest, secondary analyses revolved around model selection among different models of association. For the second problem of interest, we found that an unexpectedly parsimonious model of joint independence did not deviate significantly from the data. Testing the fit of this model on additional datasets is a direction for future research.

We should note that our analyses have many significant limitations. To start, the recent implementation of the Pitchf/x system almost ensures that there is some error in the classification process for pitch type as well as other types of measurement error. We did not adjust for these types of errors in any way. In addition, we did not account for the strengths or weaknesses of the batter facing Jon Lester. Likewise, we did not account for the number of pitches thrown by Lester (an indicator of fatigue and game progress), nor did we include the number of runners currently on base, a factor that would almost certainly change the manner in which Lester pitches. Part of the reason for not including these variables as of yet is the sparsity that would result in higher-dimensional contingency tables. Already in the four-dimensional table, we saw a fair amount of sparsity, and interpretation of higher-order parameters is not always straightforward. As more data become available, issues of sparsity may decline, allowing us to undertake more complex and complete analyses.

References

[1] Grabiner, D. The Baseball Manifesto. http://www.baseball1.com/bb-data/grabiner/manifesto.html
[2] Lewis, M. Moneyball.
[3] Rudas, T. Odds Ratios In The Analysis Of Contingency Tables.
[4] Agresti, A. Categorical Data Analysis.
[5] Fast, M. Glossary of the Gameday pitch fields. http://fastballs.wordpress.com/2007/08/02/glossary-ofthe-gameday-pitch-fields/
