Data from all 64 matches in the 2018 FIFA World Cup, along with predicted ability differences based on bookmakers' odds.
Usage
data("FIFA2018", package = "distributions3")
Format
A data frame with 128 rows and 7 columns.
- goals
integer. Number of goals scored in normal time (90 minutes), i.e., excluding potential extra time or penalties in knockout matches.
- team
character. 3-letter FIFA code for the team.
- match
integer. Match ID ranging from 1 (opening match) to 64 (final).
- type
factor. Type of match: groups A to H, round of 16 (R16), quarter-final, semi-final, match for 3rd place, and final.
- stage
factor. Group vs. knockout tournament stage.
- logability
numeric. Estimated log-ability for each team based on the bookmaker consensus model.
- difference
numeric. Difference in estimated log-abilities between a team and its opponent in each match.
Source
The goals for each match have been obtained from Wikipedia (https://en.wikipedia.org/wiki/2018_FIFA_World_Cup) and the log-abilities from Zeileis et al. (2018) based on quoted odds from Oddschecker.com and Bwin.com.
Details
To investigate the number of goals scored per match in the 2018 FIFA World Cup, FIFA2018 provides two rows, one for each team, for each of the matches during the tournament. In addition to some basic meta-information for the matches (an ID, team name abbreviations, type of match, group vs. knockout stage), information on the estimated log-ability for each team is provided. These log-abilities were estimated by Zeileis et al. (2018) prior to the start of the tournament (2018-05-20) based on quoted odds from 26 online bookmakers using the bookmaker consensus model of Leitner et al. (2010). The difference in log-ability between a team and its opponent is a useful predictor for the number of goals scored.
A basic Poisson regression model provides a good fit to the data. It treats the numbers of goals scored by the two teams in a match as independent given the ability difference, which is a reasonable assumption for this data set.
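As a sketch of the model described above, the Poisson regression with the usual log link can be written as follows. The symbols are notation introduced here for illustration and are not taken from the package: y_i is the value of goals in row i, d_i is the corresponding value of difference, and rows i and j are the two rows belonging to the same match, so that d_j = -d_i by the definition of difference.

```latex
\begin{aligned}
y_{i} \mid d_{i} &\sim \mathrm{Poisson}(\lambda_{i}), \\
\log \lambda_{i} &= \beta_{0} + \beta_{1} d_{i}, \\
P(y_{i} = a,\; y_{j} = b \mid d_{i}) &=
  \frac{\lambda_{i}^{a} e^{-\lambda_{i}}}{a!} \cdot
  \frac{\lambda_{j}^{b} e^{-\lambda_{j}}}{b!},
  \qquad \log \lambda_{j} = \beta_{0} - \beta_{1} d_{i}.
\end{aligned}
```

The last line expresses the conditional independence assumption: given the ability difference, the joint probability of a match result is the product of the two teams' Poisson probabilities.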
References
Leitner C, Zeileis A, Hornik K (2010). Forecasting Sports Tournaments by Ratings of (Prob)abilities: A Comparison for the EURO 2008. International Journal of Forecasting, 26(3), 471-481. doi:10.1016/j.ijforecast.2009.10.001
Zeileis A, Leitner C, Hornik K (2018). Probabilistic Forecasts for the 2018 FIFA World Cup Based on the Bookmaker Consensus Model. Working Paper 2018-09, Working Papers in Economics and Statistics, Research Platform Empirical and Experimental Economics, University of Innsbruck. https://EconPapers.RePEc.org/RePEc:inn:wpaper:2018-09
Examples
## load data
data("FIFA2018", package = "distributions3")
## observed relative frequencies of goals in all matches
obsrvd <- prop.table(table(FIFA2018$goals))
## expected probabilities assuming a simple Poisson model,
## using the average number of goals across all teams/matches
## as the point estimate for the mean (lambda) of the distribution
p_const <- Poisson(lambda = mean(FIFA2018$goals))
p_const
#> [1] "Poisson(lambda = 1.297)"
expctd <- pdf(p_const, 0:6)
## comparison: observed vs. expected relative frequencies
## the model slightly overestimates the frequencies of 3 and 4 goals
## and slightly underestimates those of 5 and 6 goals
cbind("observed" = obsrvd, "expected" = expctd)
#>    observed    expected
#> 0 0.2578125 0.273384787
#> 1 0.3750000 0.354545896
#> 2 0.2500000 0.229900854
#> 3 0.0781250 0.099384223
#> 4 0.0156250 0.032222229
#> 5 0.0156250 0.008357641
#> 6 0.0078125 0.001806469
## instead of fitting the same average Poisson model to all
## teams/matches, take ability differences into account
m <- glm(goals ~ difference, data = FIFA2018, family = poisson)
summary(m)
#>
#> Call:
#> glm(formula = goals ~ difference, family = poisson, data = FIFA2018)
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  0.21272    0.08125   2.618  0.00885 **
#> difference   0.41344    0.10579   3.908 9.31e-05 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 144.20 on 127 degrees of freedom
#> Residual deviance: 128.69 on 126 degrees of freedom
#> AIC: 359.39
#>
#> Number of Fisher Scoring iterations: 5
#>
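## a sketch of the arithmetic behind the interpretation stated next,
## using the rounded slope printed in the summary above (0.41344)
## rather than coef(m): because 'difference' is a difference of
## log-abilities, a 1 percent increase in the ratio of abilities adds
## log(1.01) to 'difference' and multiplies the expected number of
## goals by exp(slope * log(1.01)) = 1.01^slope
slope <- 0.41344
pct_change <- 100 * (exp(slope * log(1.01)) - 1)
pct_change ## percent change in expected goals, roughly 0.41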
## when the ratio of abilities increases by 1 percent, the
## expected number of goals increases by around 0.4 percent;
## the model thus yields a different predicted Poisson distribution
## for each team/match
p_reg <- Poisson(lambda = fitted(m))
head(p_reg)
#> 1 2
#> "Poisson(lambda = 1.7680)" "Poisson(lambda = 0.8655)"
#> 3 4
#> "Poisson(lambda = 1.0297)" "Poisson(lambda = 1.4862)"
#> 5 6
#> "Poisson(lambda = 1.4354)" "Poisson(lambda = 1.0661)"
## as an illustration, the following goal distributions
## were expected for the final (that France won 4-2 against Croatia)
p_final <- tail(p_reg, 2)
p_final
#> 127 128
#> "Poisson(lambda = 1.6044)" "Poisson(lambda = 0.9538)"
pdf(p_final, 0:6)
#> d_0 d_1 d_2 d_3 d_4 d_5
#> 127 0.2010078 0.3224993 0.2587107 0.13835949 0.05549639 0.017807808
#> 128 0.3852791 0.3674743 0.1752462 0.05571586 0.01328527 0.002534265
#> d_6
#> 127 0.0047618419
#> 128 0.0004028582
## France was clearly expected to score more goals than Croatia,
## but both teams scored more goals than expected, although not implausibly many
## assuming independence of the number of goals scored, obtain
## table of possible match results (after normal time), along with
## overall probabilities of win/draw/lose
res <- outer(pdf(p_final[1], 0:6), pdf(p_final[2], 0:6))
sum(res[lower.tri(res)]) ## France wins
#> [1] 0.5245018
sum(diag(res)) ## draw
#> [1] 0.2497855
sum(res[upper.tri(res)]) ## France loses
#> [1] 0.2242939
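## a base R cross-check (a sketch, not part of the original example):
## it uses the rounded lambdas printed for the final above instead of
## the fitted values, so results agree only up to rounding; the 0:6
## grid is truncated, hence the three outcome probabilities sum to
## slightly less than 1 and can be renormalized by the covered mass
lam <- c(France = 1.6044, Croatia = 0.9538)
grid <- outer(dpois(0:6, lam[["France"]]), dpois(0:6, lam[["Croatia"]]))
covered <- sum(grid) ## probability mass covered by the 0:6 x 0:6 grid
covered
wdl <- c(
  win = sum(grid[lower.tri(grid)]), ## France scores more (row > column)
  draw = sum(diag(grid)),
  lose = sum(grid[upper.tri(grid)])
) / covered
wdl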
## update expected frequencies table based on regression model
expctd <- pdf(p_reg, 0:6)
head(expctd)
#> d_0 d_1 d_2 d_3 d_4 d_5 d_6
#> 1 0.1706693 0.3017480 0.2667494 0.15720674 0.069486450 0.024570788 0.0072403041
#> 2 0.4208316 0.3642392 0.1576286 0.04547703 0.009840349 0.001703409 0.0002457231
#> 3 0.3571261 0.3677207 0.1893148 0.06497703 0.016726166 0.003444474 0.0005911098
#> 4 0.2262357 0.3362265 0.2498462 0.12377196 0.045986787 0.013668909 0.0033857384
#> 5 0.2380213 0.3416546 0.2452047 0.11732187 0.042100811 0.012086260 0.0028914265
#> 6 0.3443506 0.3671104 0.1956873 0.06954039 0.018534163 0.003951835 0.0007021718
expctd <- colMeans(expctd)
cbind("observed" = obsrvd, "expected" = expctd)
#>    observed    expected
#> 0 0.2578125 0.294374450
#> 1 0.3750000 0.340171469
#> 2 0.2500000 0.214456075
#> 3 0.0781250 0.098236077
#> 4 0.0156250 0.036594546
#> 5 0.0156250 0.011726654
#> 6 0.0078125 0.003332718