《R文章翻译》专栏·第2篇
文| R扫地僧
5155字 | 15分钟浏览
【R语言】已经开通R语言社群 ,五湖四海,咫尺天涯,各行各业,有缘相聚 ,同享R事,雕刻数据,求解问题 ,以创价钱。喜乐入群者,请加微信号luqin360,或者增添为摰友 ,同时附上R-入群 。有朋自远方来,不可开交,并诚邀入群 ,以达相互深造以及提高之幽美心愿。
【R语言】已经开通R语言社群,五湖四海,咫尺天涯 ,各行各业,有缘相聚,同享R事,雕刻数据 ,求解问题,以创价钱。喜乐入群者,请加微信号luqin360 ,或者增添为摰友,同时附上R-入群。有朋自远方来,不可开交 ,并诚邀入群,以达相互深造以及提高之幽美心愿 。
R语言出品
问题:Quantify the Covid19 Impact on the SFO Airport Passenger Air Traffic
泉源:
https://ramikrispin.github.io/2021/01/covid19-effect/
编译:R扫地僧
可怜的是, Covid19年夜流行仍对于年夜少数主要行业发生了重年夜影响。尽管对于某些行业来说 ,影响是踊跃的(比方在线批发,互联网以及蒸汽供应商等),但对于其余行业则是晦气的(比方交通 ,游览,娱乐等)。在这两种状况下,咱们都能够应用 时日序列建模来量化Covid19对于行业的影响 。
量化 Covid19年夜流行影响(不管是侧面照样负面)的一种轻易方法将囊括下列步调:
为了演示这种方法,我将应用来自 sfo包的 sfo_passengers数据集。sfo_passengers数据集供应了2005年7月至2020年9月旧金山国内机场(SFO)地面交通的月度统计数据。对于于数据集的更多细节能够在这个帖子以及文档中找到 。
为了停止剖析 ,咱们将应用下列软件包:
- sfo-旅客航空交通数据集
- dplyr-数据准备
- plotly-数据可视化
- TSstudio-时日序列剖析以及展望
library(sfo)
library(dplyr)
library(plotly)
library(TSstudio)
数据
如上所述,sfo_passengers数据集供应2赢咖4娱乐招商【QQ219871】005年以来SFO机场的航空客运量每一个月统计数据。囊括按分比方种别(如营运航空公司 、地区、终点站等)统计的每一个月旅客人数。让咱们加载数据:
data( "sfo_passengers")
str(sfo_passengers)
在将数据转换为时日序列体例以前,咱们将activity_period转换为日期体例:
df <- sfo_passengers %>%
mutate(date = as.Date(paste( substr(sfo_passengers$activity_period, 1, 4),
substr(sfo_passengers$activity_period, 5, 6),
"01", sep = "/")))
接下来 ,咱们将数据集转换为时日序列体例,将搭客按日期变量分组:
df<- df %>%
group_by(date) %>%
su妹妹arise(y = sum(passenger_count), .groups = "drop")
head(df)
现在,咱们有一个按月的时日序列:
plot_ly(data = df,
x = ~ date,
y = ~ y,
type = "scatter",
mode = "line",
name = "Total Passengers") %>%
add_segments(x = as.Date( "2020-02-01"),
xend = as.Date( "2020-02-01"),
y = min(df$y),
yend = max(df$y) * 1.05,
line = list(color = "black", dash = "dash"),
showlegend = FALSE) %>%
add_annotations(text = "Pre-Covid19",
x = as.Date( "2018-09-01"),
y = max(df$y) * 1.05,
showarrow = FALSE) %>%
add_annotations(text = "Post-Covid19",
x = as.Date( "2021-08-01"),
y = max(df$y) * 1.05,
showarrow = FALSE) %>%
layout(title = "Total Number of Air Passengers - SFO Airport",
yaxis = list(title = "Number of Passengers"),
xaxis = list(title = "Source: San Francisco Open Data Portal"))
从上图能够看出 ,2020年3月以来,Covid19的影响非常明显 。为量化疫情对于旅客人数的影响,咱们将该序列分为下列两个序列:
- Covid19以前序列-所有的考察效果都是在2020年3月以前
- Covid19以后序列-所有考察效果(囊括)2020年3月以后
咱们将应用Covid19以前序列来练习一个时日序列模子。这将使咱们能够展望乘客人数 ,就像不年夜流行同样。
pre_covid &l赢咖4娱乐招商【QQ219871】t;- df %>%
dplyr::filter( date< as. Date( "2020-03-01")) %>%
dplyr::arrange( date)
post_covid <- df %>%
dplyr::filter( date>= as. Date( "2020-03-01")) %>%
dplyr::arrange( date)
咱们将应用Covid前序列练习一个时日序列模子 。这将使咱们能够展望因不这种Covid年夜流行下乘客人数。一旦咱们用Covid19前数据展望Covid19后序列的响应不雅察效果,咱们就能够量化Covid19对于总乘客人数的影响。
剖析数据
在咱们展望该序列以前,让咱们对于该序列停止快速探索性剖析 ,以肯定其主要特色。咱们将应用TSstudio包可视化 pre_covid序列 。
细致,这个包还不反对于tsibble工具。因而,咱们首先将series转换为ts工具:
ts.obj < -ts( pre_covid$ y, start= c(2005,7), frequency= 12)
年夜流行爆发前的序列:
ts_plot(ts.obj,
title = "Total Number of Air Passengers - SFO Airport",
Ytitle = "Number of Passengers",
slider = TRUE)
像年夜少数形容月度航空客流量的序列同样,该序列拥有很强的 月度节令性形式以及正向趋势。您还能够细致到 ,节令性身分的振荡自2017年以来(与前多少年比拟)变患上更年夜 。让咱们应用ts_seasonal函数来建立该序列的节令图:
ts_seasonal(ts.obj = ts.obj, type= "all")
从节令图中能够看出,每一个月的节令效应都在一直增加,而序列却在逐年增进。
相似地 ,咱们能够应用ACF以及PACF函数往返忆与其过去滞后的序列相干性:
ts_cor(ts.obj = ts.obj, lag. max= 36)
正如预期的那样,咱们能够看到该序列与第一以及节令性滞后之间的强相干性。咱们将应用这些信息为节令性数据抉择时日序列模子 。
展望Covid19前序列
我最喜爱的展望策略之一是将分比方时日序列模子之间的赛马以及回溯测试联合起来作为一种练习方法。回溯测试是时日序列等价的机器深造交织考证练习方法。这里的想法很轻易—用回测测试每一个模子,而后在分比方的测试分区上抉择均匀显示最佳的一个 。
TSstudio包中的train_model函数使咱们能够应用 forecast以及 stats包中的模子无缝地应用这个策略。 为了轻易起见 ,咱们将应用分比方的ETS、Holt-Winters模子以及开箱即用的auto.arima以及tslm模子。为了回测,咱们将把序列分成6个测试分区,每一12个月距离3个月 。
methods参数界说要应用的模子 ,train_method参数界说回测的配置。能够在这里找到对于于这个函数的更多细节。
methods < -list( ets1= list(method= "ets",
method_arg= list(opt.crit= "lik"),
notes= "ETS model opt.crit=lik"),
ets2= list(method= "ets",
method_arg= list(opt.crit= "amse"),
notes= "ETS model opt.crit=amse"),
ets3= list(method= "ets",
method_arg= list(opt.crit= "mse"),
notes= "ETS model opt.crit=mse"),
auto_arima= list(method= "auto.arima",
notes= "Auto ARIMA"),
hw1= list(method= "HoltWinters",
method_arg= NULL,
notes= "HoltWinters Model"),
hw2= list(method= "HoltWinters",
method_arg= list(seasonal= "multiplicative"),
notes= "HW with multip. seasonality"),
tslm= list(method= "tslm",
method_arg= list(formula= input~ trend+ season),
notes= "tslm with trend and seasonal"))
train_method= list(partitions= 6,
sample.out= 12,
space= 3)
在界说了方法以及train_method参数以后,咱们将应用train_model函数来练习模子。请细致,展望地平线被配置为post_covid序列的长度 。此外 ,咱们将MAPA配置为偏偏差器量,以评估分比方模子在测试分区上的性能:
md < -train_model( input= ts.obj,
methods= methods,
train_method= train_method,
horizon= nrow(post_covid),
error= "MAPE")
plot_error(md)
plot_model使咱们能够在回测的分比方测试分区上动画化每一个模子的展望值:
plot_model(md)
咱们将抉择霍尔特-温特斯模子(hw1)来盘算covid - 19效应。咱们将把选定的展望增添到post_covid数据集:
post_covid$yhat <- as.numeric(md$forecast$hw1$forecast$mean)
post_covid$upper95 <- as.numeric(md$forecast$hw1$forecast$upper[, 2])
post_covid$lower95 <- as.numeric(md$forecast$hw1$forecast$lower[, 2])
量化Covid19的影响
在咱们增添展望值后,很轻易盘算出Covid19对于旧金山国内机场乘客人数的月度影响:
post_covid $passengers_loss<- post_covid $y- post_covid $yhat
post_covid
2020年3月至9月时代,因为年夜流行 ,估计缩小的搭客总数:
sum(post_covid $passengers_loss)
异样,咱们能够可视化新冠病毒对于航空客运的影响:
plot_ly %>%
add_ribbons(x = post_covid$date,
ymin = post_covid$y,
ymax = post_covid$yhat,
line = list(color = 'rgba(255, 0, 0, 0.05)'),
fillcolor = 'rgba(255, 0, 0, 0.6)',
name = "Estimated Loss") %>%
add_segments(x = as.Date( "2020-02-01"),
xend = as.Date( "2020-02-01"),
y = min(df$y),
yend = max(df$y) * 1.05,
line = list(color = "black", dash = "dash"),
showlegend = FALSE) %>%
add_annotations(text = "Pre-Covid19",
x = as.Date( "2017-09-01"),
y = max(df$y) * 1.05,
showarrow = FALSE) %>%
add_annotations(text = "Post-Covid19",
x = as.Date( "2020-09-01"),
y = max(df$y) * 1.05,
showarrow = FALSE) %>%
add_annotations(text = paste( "Estimated decrease in", "<br>",
"passengers volume: ~30M",
sep = ""),
x = as.Date( "2020-05-01"),
y = 2* 10^ 6,
arrowhead = 1,
ax = -130,
ay = -40,
showarrow = TRUE) %>%
add_lines(x = df$date,
y = df$y,
line = list(color = "#1f77b4"),
name = "Actual") %>%
layout(title = "Covid19 Impact on SFO Air Passenger Traffic",
yaxis = list(title = "Number of Passengers"),
xaxis = list(title = "Time Series Model - Holt-Winters",
range = c( as.Date( "2015-01-01"), as.Date( "2021-01-01"))),
legend = list(x = 0, y = 0.95))
应用畛域
一旦咱们估计了乘客人数的缩小,就能够量化Covid19形成的丢失。比方 ,如果每一位旅客均匀要支付10美元的机场税,那末在特定时代内,估计的税收丧守信为 3亿美元 。
此外 ,因为underline展望是点估计。因而,您能够应用展望距离为航空客运量的下降供应一个范围。
最后但并非最不主要的一点是,您能够应用自上而下的方法并为数据中的某些可用种别停止展望 。比方 ,基于航空公司、地区 、海内/国内航班的搭客下降。
参考的信息
更多对于于这篇文章中应用的包以及工具的细节
数据集
- sfo 包 - https://github.com/RamiKrispin/sfo
- San Francisco Data Portal - https://datasf.org/opendata/
- dplyr包 - https://dplyr.tidyverse.org/
时日序列剖析
- TSstudio 包 - https://github.com/RamiKrispin/TSstudio
- train_model function documentation - https://ramikrispin.github.io/TSstudio/reference/train_model.html
- Great explanation about backtesting from Uber Engineering blog - https://eng.uber.com/omphalos/
数据可视化
- plotly 包 - https://plotly.com/r/
附录:完好代码
library(sfo)
library(dplyr)
library(plotly)
library(TSstudio)
# 数据集
data( "sfo_passengers")
str(sfo_passengers)
df <- sfo_passengers %>%
mutate(date = as.Date(paste(substr(sfo_passengers$activity_period, 1, 4),
substr(sfo_passengers$activity_period, 5, 6),
"01", sep = "/")))
df <- df %>%
group_by(date) %>%
su妹妹arise(y = sum(passenger_count), .groups = "drop")
head(df)
plot_ly(data = df,
x = ~ date,
y = ~ y,
type = "scatter",
mode = "line",
name = "Total Passengers") %>%
add_segments(x = as.Date( "2020-02-01"),
xend = as.Date( "2020-02-01"),
y = min(df$y),
yend = max(df$y) * 1.05,
line = list(color = "black", dash = "dash"),
showlegend = FALSE) %>%
add_annotations(text = "Pre-Covid19",
x = as.Date( "2018-09-01"),
y = max(df$y) * 1.05,
showarrow = FALSE) %>%
add_annotations(text = "Post-Covid19",
x = as.Date( "2021-08-01"),
y = max(df$y) * 1.05,
showarrow = FALSE) %>%
layout(title = "Total Number of Air Passengers - SFO Airport",
yaxis = list(title = "Number of Passengers"),
xaxis = list(title = "Source: San Francisco Open Data Portal"))
# 数据集分别
pre_covid <- df %>%
dplyr::filter(date < as.Date( "2020-03-01")) %>%
dplyr::arrange(date)
post_covid <- df %>%
dplyr::filter(date >= as.Date( "2020-03-01")) %>%
dplyr::arrange(date)
ts.obj <- ts(pre_covid$y, start = c( 2005, 7), frequency = 12)
ts_plot(ts.obj,
title = "Total Number of Air Passengers - SFO Airport",
Ytitle = "Number of Passengers",
slider = TRUE)
ts_seasonal(ts.obj = ts.obj, type = "all")
ts_cor(ts.obj = ts.obj, lag.max = 36)
methods <- list(ets1 = list(method = "ets",
method_arg = list(opt.crit = "lik"),
notes = "ETS model opt.crit=lik"),
ets2 = list(method = "ets",
method_arg = list(opt.crit = "amse"),
notes = "ETS model opt.crit=amse"),
ets3 = list(method = "ets",
method_arg = list(opt.crit = "mse"),
notes = "ETS model opt.crit=mse"),
auto_arima = list(method = "auto.arima",
notes = "Auto ARIMA"),
hw1 = list(method = "HoltWinters",
method_arg = NULL,
notes = "HoltWinters Model"),
hw2 = list(method = "HoltWinters",
method_arg = list(seasonal = "multiplicative"),
notes = "HW with multip. seasonality"),
tslm = list(method = "tslm",
method_arg = list(formula = input ~ trend + season),
notes = "tslm with trend and seasonal"))
train_method = list(partitions = 6,
sample.out = 12,
space = 3)
md <- train_model(input = ts.obj,
methods = methods,
train_method = train_method,
horizon = nrow(post_covid),
error = "MAPE")
plot_error(md)
plot_model(md)
post_covid$yhat <- as.numeric(md$forecast$hw1$forecast$mean)
post_covid$upper95 <- as.numeric(md$forecast$hw1$forecast$upper[, 2])
post_covid$lower95 <- as.numeric(md$forecast$hw1$forecast$lower[, 2])
post_covid$passengers_loss <- post_covid$y - post_covid$yhat
post_covid
sum(post_covid$passengers_loss)
plot_ly %>%
add_ribbons(x = post_covid$date,
ymin = post_covid$y,
ymax = post_covid$yhat,
line = list(color = 'rgba(255, 0, 0, 0.05)'),
fillcolor = 'rgba(255, 0, 0, 0.6)',
name = "Estimated Loss") %>%
add_segments(x = as.Date( "2020-02-01"),
xend = as.Date( "2020-02-01"),
y = min(df$y),
yend = max(df$y) * 1.05,
line = list(color = "black", dash = "dash"),
showlegend = FALSE) %>%
add_annotations(text = "Pre-Covid19",
x = as.Date( "2017-09-01"),
y = max(df$y) * 1.05,
showarrow = FALSE) %>%
add_annotations(text = "Post-Covid19",
x = as.Date( "2020-09-01"),
y = max(df$y) * 1.05,
showarrow = FALSE) %>%
add_annotations(text = paste( "Estimated decrease in", "<br>",
"passengers volume: ~30M",
sep = ""),
x = as.Date( "2020-05-01"),
y = 2* 10^ 6,
arrowhead = 1,
ax = -130,
ay = -40,
showarrow = TRUE) %>%
add_lines(x = df$date,
y = df$y,
line = list(color = "#1f77b4"),
name = "Actual") %>%
layout(title = "Covid19 Impact on SFO Air Passenger Traffic",
yaxis = list(title = "Number of Passengers"),
xaxis = list(title = "Time Series Model - Holt-Winters",
range = c( as.Date( "2015-01-01"), as.Date( "2021-01-01"))),
legend = list(x = 0, y = 0.95))
朋侪们,能够扫描我的微信号,备注“R-入群”。我会邀请你退出R语言群 ,咱们一起探讨与深造 。
R书籍推举
额 本文暂时没人评论 来添加一个吧
发表评论