My semester project focuses on exploring time series analysis using ARIMA/SARIMA and Bayesian models to forecast and analyze the spread of COVID-19 and CO₂ emissions in Mauna Loa, Hawaii. Below are the literature reviews I have completed, highlighting the methods and techniques applied in my research.
Source: An Introduction to Time Series Analysis with ARIMA, Towards Data Science
This study explores time series forecasting techniques using ARIMA, SARIMA, and Bayesian Structural Time Series (BSTS) models on two distinct datasets: the COVID-19 dataset from the Johns Hopkins CSSEGISandData repository and the CO₂ emissions data from Mauna Loa, Hawaii, provided by NOAA. The primary objective is to apply machine learning techniques to predict future data patterns and to evaluate the efficiency of each model, comparing their performance against each other For large datasets, such as COVID-19, both ARIMA and SARIMA models demonstrated rapid processing capabilities; however, SARIMA introduced additional noise due to explicit seasonality, impacting accuracy. Both models produced forecasts with higher variability and were outperformed by the BSTS model, which demonstrated lower Root Mean Squared Error (RMSE) and Hold-Out Mean Absolute Percentage Error (MAPE) but with greater computational complexity causing slower processing time.
In contrast, the smaller Mauna Loa CO₂ dataset, characterized by smoother seasonal patterns, allowed for more precise forecasting. Among the tested models, the BSTS model achieved the highest accuracy with an RMSE of 0.765 and a MAPE of 0.161%, outperforming ARIMA and SARIMA, which had RMSE values of 12.55 and 13.38, respectively. This analysis highlights the effectiveness of the BSTS model for stable, seasonally predictable datasets, while ARIMA-based models may struggle with complex, volatile data like the COVID-19 dataset. These findings underscore the importance of selecting models that align with the dataset’s characteristics to optimize forecasting accuracy.
Anything that is observed sequentially over time is a time series. In the data, I will look at regular intervals of time. Time series forecasting is a statistical method used to predict future values based on past results. The simplest models focus solely on the variable being forecast, ignoring external factors like marketing efforts or economic shifts, and instead extrapolating existing trends and seasonality. Specifically in this capstone, I will be exploring univariate models ARIMA/SARIMA and the Bayesian STS.
The use of time series analysis gained recognition since 1970 when the Box-Jenkins method popularized ARIMA models to find the best fit of different time series models based on past data. However, the concept of analyzing sequential data over time is not new. As the world increasingly embraces data-driven solutions to forecast future industry outcomes, time series analysis has become vital in numerous fields.
Currently, efforts to perfect the science of forecasting is important as the higher the accuracy of a time series based on past data and events, the more essential the tool becomes for modern organizations as accurate forecasts support decision-making across various timeframes—whether short-term, medium-term, or long-term, depending on the specific application. Key considerations in forecasting include seasonality, trends, and external fluctuations, which play a significant role in shaping predictions. These specifically will be explored more as I deal with the data sets of this project. These forecasts are integral for planning, resource allocation, and adjusting strategies in response to upcoming movements.
The main goal of my paper is to explore and discuss the concept and use real life datasets in order to see how time series analysis can use its historical data and leverage practical and meaningful conclusions. In this paper, we will also be discussing any possible limitations that the different time series models have and what some suggestions are to improve our predictions and possibly overcome these limitations.
In this capstone, I will explore the Johns Hopkins CSSEGISandData COVID-19 Repository, which tracks global COVID-19 cases and Our World in Data (OWID), which provides comprehensive COVID-19 datasets across various indicators datasets to test the univariate models ARIMA/SARIMA and Bayesian STS. Using the same model structures, I will also be exploring the Forecasting CO₂ Emissions in Mauna Loa, Hawaii. This dataset is provided by the Global Monitoring Laboratory within NOAA. The goal of this exploration is to apply time series analysis to forecast the future spread of COVID-19, seeing which models are the strongest to do such a task, and to use those same models to appropriately forecast the increasing trend of the CO₂ emission rates. These two different datasets and applications are to show the strength in forecasting in multiple real-life situations. This is important in data science and statistics, particularly in public health and global policy as creating accurate forecasts can allow policymakers a better direction on what should be applied and when.
To achieve the goals of my capstone, ARIMA/SARIMA will be used for capturing linear trends and seasonal effects, and Bayesian STS for incorporating uncertainty and providing probabilistic forecasts. The project will begin with an exploratory data analysis (EDA) to visualize the time series data, identify patterns, and check for seasonality or trends. After cleaning and preprocessing the data, I will test both models using Python on the COVID-19 and CO₂ datasets. The results will then be compared to assess which approach yields the most accurate and insightful forecasts.
In different papers and studies, it is easy to see that there are a multitude of models to explore and chose from, depending on what is being questioned or asked from by the researcher and data. Most of the models can be seen following into different categories. Traditional statistical modeling, which includes a moving average, exponential smoothing, and autoregressive (ARIMA). These are seen as linear functions taking past observations to make sound predictions. Second category of model type is nonlinear models. This will not be covered in my project, but the main difference lies in the complexity and non-proportional relationships in the data. An example of this can be LSTM, ANN, NAR models. The second main model type that I will be covering is the Bayesian Structural Time Series (BSTS). This model type is considered linear, since it relies on the state-space models, but can handle much more complexity such as trend, seasonality, and regression components. So, although it is linear, the approach is more flexible by adding prior distributions allowing for uncertainty estimations. In this capstone, these two ARIMA/SARIMA and Bayesian STS will be explore in depth.
To begin, throughout this project I will be utilizing the ARIMA model as a traditional method of time series forecasting. ARIMA stands for Autoregressive integrated moving average and it predicts future values with past values. It utilizes lagged moving averages to smooth the time series data. The ARIMA model is a univariate model that can be broken down into two different models, AR and MA, which will be define below.
The ARIMA model is defined by the parameters \( (p, d, q) \), where:
The ARIMA model is composed of two submodels, the AR (Autoregressive) and the MA (Moving Average) model, described mathematically as follows:
\[Y(AR)_t = \alpha + \beta_1 Y(AR)_{t-1} + \beta_2 Y(AR)_{t-2} + \cdots + \beta_p Y(AR)_{t-p}\] (1)
where \( Y(AR)_t \) are the lags of the series, \( \beta \) are the lag coefficients estimated by the model, and \( \alpha \) is the model’s intercept term.
\[Y(MA)_t = \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}\] (2)
where \( Y(MA)_t \) depends only on the lagged forecast errors, \( \theta \) are the parameters of the model, and \( \epsilon \) are unrelated error terms.
The mathematical expression of the ARIMA model, as a combination of AR and MA models, is:
\[Y_t = \alpha + \sum_{i=1}^{p} \beta_i Y_{t-i} + \epsilon_t - \sum_{j=1}^{q} \theta_j \epsilon_{t-j}\] (3)
Where \( Y_t \) and \( \epsilon_t \) represent the time series and the random error at time \( t \), \( \beta \) and \( \theta \) are the coefficients of the model.
When estimating the \( p\) and \( q\) parameters of the AR and the MA models, plots from the ACF function and the PACF functions are used. Just to summarize, the autocorrelation function, ACF, is a function that shows the correlation between the observed of the \(t \) time and the observation at the previous times. This function indicates the autocorrelation coefficient, which is the measurement of the correlation between observations at different times. PACF, or partial autocorrelation function differs because it is the correlation between two random variables after the influence of confounding variable is removed [7].
Source: Classification of Time Series
Prior to finding the ACF/PACF, it is obligated for us to know if the data that is obtained is stationary or nonstationary. This is important because if the data is not stationary, the data cannot perform the ARIMA model accurately. Therefore, in order to determine if the model is stationary, it can be visualized or the Dickey-Fuller Test can be performed to determine the classification of the time series data.
When differencing is applied and observed, the \(p\) order deals with the AR model and it is based on the significant spikes from the PACF plot. The MA model, \(q\) order is determined if the ACF plots have sharp cut-off after lags. If there has been any differencing applied to the model, \(d\) is determined by the order of the differencing (i.e. how many times the differencing has been applied).
After the strongest combination of the \( (p, d, q) \) have been selected, it is possible to then run the ARIMA model and see the results. At times, it can be difficult to physically pick out the correct combination. If this becomes the case, there are machine learning approaches to choose the appropriate orders.
Fundamentally, this theorem is unique from ARIMA due to differences in the interpretation of what “probability” means. We will explore criticisms later, but it is important to remember this distinction when understanding the model differences. Bayesian inference is a method of statistical inference where available knowledge about parameters in a statistical model is updated with information from observed data. A Guide to Bayesian Inference.
Bayesian inference provides a powerful framework to update beliefs as new data becomes available. It allows us to calculate the probability of an event occurring, given prior knowledge and new evidence. A simplified version of Bayes’ theorem looks like this:
Bayesian Time Series Forecasting
Referring to the image explaining the visualization, it breaks down the components as follows:
This framework is fundamental in Bayesian time series analysis, allowing models to update predictions as new data arrives. The posterior distribution represents the estimation of a parameter, ensuring the model reflects both the past trends and the latest observations.
Source: Created based on Bayes' theorem principles, adapted for better interpretability.
Now, for Bayesian Structural Time-Series Models, the time series can be broken down into four basic models depending on the following: a level, a local trend, seasonal impacts, and an error term. Structural Time Series models can be defined by the following:
Equation (4) and Equation (5) represent the structure of the model. Equation (4) is trying to understand the distribution of the \( Y_t \)’s of the data, while Equation (5) is closely related to understanding the distribution of the parameters. It provides the evolution model for the parameters and how they change over time. Because the vector evolves through randomness, it describes the distribution of the hidden parameters at the next step. This is crucial because for Bayesian inference to hold, updating the beliefs of the parameters helps incorporate both the priors and the observed data.
The key distinction between the ARIMA and Bayesian methods lies in their underlying theoretical perspectives: frequentist vs. Bayesian. In the frequentist approach (as used in ARIMA models), parameters are treated as fixed values that do not change once estimated. The data is analyzed under the assumption that these parameters have unique, fixed points.
In contrast, the Bayesian approach views parameters as random variables that can evolve over time, with uncertainty captured through prior distributions. This flexibility allows the Bayesian method to incorporate prior knowledge about the parameter’s movement and update these beliefs as new data becomes available, resulting in what is known as posterior distributions.
This difference introduces a potential limitation: if the prior distribution is inaccurate or misinformed, it can negatively affect the model’s output. While prior information can enhance predictions, it can also become a burden if the assumptions are flawed or if prior knowledge conflicts with new observations. Therefore, Bayesian models offer both advantages and risks: they excel when reliable prior knowledge is available but can be inaccurate if the priors introduce bias into the model.
Next, time series analysis inherently faces criticism because it attempts to predict the future—data that has not yet materialized—with as much certainty as possible. When practicing time series forecasting, it is crucial to not only critique the model’s inference but also incorporate evaluation methods into the decision-making process, given the inherent uncertainty of projections.
While cross-validation techniques are still applicable and recommended for time series models, specific approaches such as leave-one-out (LOO) cross-validation can pose challenges. For time series, LOO can be problematic when the goal is to estimate predictive performance for future time points, as removing an observation disrupts the temporal dependencies in the data. This is particularly problematic from a Bayesian perspective, where each time point informs the next. Ignoring these dependencies undermines the theoretical foundation of time series models, which rely on the structure of sequential observations.
My project explores time series forecasting using two distinct datasets: one focused on COVID-19 case trends and the other on the NOAA Mauna Loa CO₂ emission dataset. I will apply my forecasting models to both datasets and evaluate their predictive performance. The models’ strengths will be compared across the two datasets to assess their ability to capture different patterns and trends. Finally, I will generate a summary detailing the strengths, weaknesses, and limitations encountered throughout the project.
This part of the project uses time series forecasting on the COVID-19 dataset from the Johns Hopkins CSSEGISandData, with an emphasis on modeling the data using ARIMA/SARIMA and Bayesian Structural Time Series (BSTS). Both are univariate (single-variable) time series methods without trend or seasonal components, while other time series models, such as SARIMA, can capture multivariate series with trends and seasonal patterns. Within this capstone, I will demonstrate how the predictions appear when forecasted using ARIMA, SARIMA, and BSTS. Then, I will compare the results, determine which model is best suited for the COVID-19 dataset, and revisit the limitations of the models based on the outcomes of their applications.
First, the data will be loaded into JupyterNotebook and processed with Python. Then, the data is already cleaned through the github repository CSSEGISandData.
Using Python, the key libraries utilized for processing my ARIMA model were Pandas, Matplotlib, and Statsmodels. I used Pandas extensively for data manipulation, including loading the dataset, converting date columns to datetime format, applying differencing to achieve stationarity, and other data preparation tasks. Matplotlib was instrumental in visualizing the time series, including plotting the original and differenced data, as well as visualizing the autocorrelation and forecast results. Statsmodels was used to fit the ARIMA model and provided essential tools for analyzing time series properties, such as plotting the partial autocorrelation function (PACF) and autocorrelation function (ACF), as well as fitting the model itself to make predictions.
To begin the data exploration, I processed the data from the Johns Hopkins dataset. The dataset was already prepared, cleaned, and organized by country, province, state, date, confirmed cases, and more. After loading the dataset into JupyterNotebook, I examined its structure and content. To explore the dataset further, I applied a rolling statistic. This approach visualizes a moving window along the time series, which is particularly useful with ARIMA as it helps to observe moving averages. These rolling averages smooth out short-term fluctuations, providing a clearer view of the underlying trends in the data.
rolling_mean_us = us_data['Confirmed'].rolling(window=30).mean()
rolling_std_us = us_data['Confirmed'].rolling(window=30).std()
Rolling Mean and Rolling Standard Deviation over 7-day periods, showing the trend of confirmed COVID-19 cases in Florida (2020-2023).
After creating a line graph displaying the trend of confirmed COVID-19 cases over time, it became clear that the volume of data required narrowing down the location or focus. I chose to concentrate on Florida, my home state. Then, I noticed that the data exhibited seasonality, with confirmed cases rising as the weeks progressed. I began by processing the Johns Hopkins dataset, which was already organized by country, state, and confirmed cases, among other variables. I ensured all date and time variables were correctly formatted, addressing any missing values along the way. I plotted a general USA COVID-19 graph to analyze the structure, which revealed a positive trend and signs of seasonality.
First, to point out the code, the cleaned data set allows me to calculate the means with ease. I used a window of 30 days for both the means and standard deviation. Then, as it can be seen, the blue line of confirmed cases, or the cumulative data has small rigidness on the line. When the rolling mean was applied, it smoothed out the curve based on the 7-day rolling window. This averages out the daily changes to the data over a week at a time. Because the changes of the window size is low, the standard deviation line in green will appear low, indicating that the variations over that period are not significant. All this is applicable to our data as the nature of COVID-19’s spread tends to increase in a consistent matter over time, except as we can see in 2022-01, we had a significant spike. Aside from that, the rolling standard deviation proves that there is lower variability.
As I am exploring and graphing the data, I made sure that the date and time variables were all formatted correctly and cleaned up any empty values. Then, I begin the ARIMA process. Visually, the data appears to be stationary. This can be determined by the characteristics that it is trending upwards and has some sort of seasonality since there is a pattern within the upward trend that appears to be relatively consistent. However, if I want to make sure I will approach by applying the Dickey-Fuller hypothesis testing.
print('Results of Dickey-Fuller Test:')
dftest = adfuller(us_data['Confirmed'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '# Lags Used', 'Number of Observations Used'])
for key, value in dftest[4].items():
This code allows us to test the ADF hypothesis. The outcome of this was that my p-value was greater than 0.05, so we fail to reject the null hypothesis, and therefore perform a first order differencing on the model.
fl_data = cleaned_data[(cleaned_data['Country_Region'] == 'US') & (cleaned_data['Province_State'] == 'Florida')].copy()
fl_data['Last_Update'] = pd.to_datetime(fl_data['Last_Update'], errors='coerce')
fl_data = fl_data.dropna(subset=['Last_Update', 'Confirmed'])
fl_data['Confirmed'] = pd.to_numeric(fl_data['Confirmed'], errors='coerce')
fl_data = fl_data.groupby('Last_Update')['Confirmed'].sum().reset_index()
fl_data = fl_data.sort_values('Last_Update')
fl_data.set_index('Last_Update', inplace=True)
fl_data['Differenced_Confirmed'] = fl_data['Confirmed'].diff()
fl_data.dropna(subset=['Differenced_Confirmed'], inplace=True)
So, I specifically am attempting to apply stationarity to the Florida COVID data, which is seen with the Country Region and State being Florida. Then, after doing data cleaning to structure the data in preparation of stationarization, I apply the differencing to the data. Which is then seen to be stationary under this hypothesis. Since this has now confirmed that the data is stationary, a first order differencing is now necessary to continue.
After the data is differenced, I performed the ADF again, and this time the p-value was less than 0.05. Therefore, the process can continue.
Now, from here it is possible to choose an ideal model from the PACF/ACF models, but I used the ARIMA Order Select AIC function that is found in the statsmodels.tsa.stattools package. The reason I decided to do this instead of looking at the PACF/ACF was because of the variability within those graphs, I believe that using this tool lets me be more efficient and also more accurate since if I chose, I could have human error.
from statsmodels.tsa.stattools import arma_order_select_ic
order_selection = arma_order_select_ic(fl_data['Differenced_Confirmed'].dropna(), ic='aic', max_ar=5, max_ma=5)
print(order_selection.aic_min_order)
From this code, I achieved the following ARIMA order = (5,0,5). Then, I used this to fit my ARIMA model.
model = ARIMA(fl_data['Differenced_Confirmed'].dropna(), order=(5, 0, 5))
fit_arima = model.fit()
print(fit_arima.summary())
To further demonstrate the effectiveness of this model and enable a meaningful comparison, I modified the code to perform a Hold-Out MAPE calculation. In brief, the model initially uses all historical data to forecast future values, which we cannot confirm, evaluate, or measure for accuracy at this time. In time series analysis, given the importance of temporal alignment, performing a Hold-Out MAPE is a logical approach. For this calculation, I split the data, using observations from the beginning up to January 1st, 2023, for training. From January 2nd, 2023, until the latest available date, March 9th, 2023, I generate a forecast and compare the model’s predictions against actual observed values. After calculating both the Hold-Out MAPE and RMSE, these results can be reviewed in the results tab for evaluation.
train_end_date = '2023-01-01'
test_start_date = '2023-01-02'
test_end_date = '2023-03-09'
In this code, I implimented the testing, training, and how I am manipulating the current code to the MAPE necessity These dates are used to split the data for training model and defining that forecasting period.
training_data = fl_data.loc[:train_end_date, 'Differenced_Confirmed'].dropna()
model = ARIMA(training_data, order=(5, 0, 5))
forecast_steps = len(pd.date_range(start=test_start_date, end=test_end_date, freq="D"))
forecast_differenced = fit_arima.forecast(steps=forecast_steps)
last_observed = fl_data.loc[train_end_date, 'Confirmed']
forecast_original_scale = forecast_differenced.cumsum() + last_observed
forecast_dates = pd.date_range(start=test_start_date, periods=forecast_steps)
fit_arima = model.fit()
In this code, I implemented the testing, training, and how I am manipulating the current code to the MAPE necessity These dates are used to split the data for training model and defining that forecasting period.
training_data = fl_data.loc[:train_end_date, 'Differenced_Confirmed'].dropna()
model = ARIMA(training_data, order=(5, 0, 5))
forecast_steps = len(pd.date_range(start=test_start_date, end=test_end_date, freq="D"))
forecast_differenced = fit_arima.forecast(steps=forecast_steps)
last_observed = fl_data.loc[train_end_date, 'Confirmed']
forecast_original_scale = forecast_differenced.cumsum() + last_observed
forecast_dates = pd.date_range(start=test_start_date, periods=forecast_steps)
fit_arima = model.fit()
Here, I am preparing the current differenced dataset and making sure that it will work by dropping any missing values, so only valid points are included in the training set. Then, I run it through the ARIMA model and forecast.
After performing the Hold-Out MAPE, the result is 0.45%, which is a very low error accuracy. Then, for RMSE the result was 44,609.035.
Developed SARIMA Model:
For this COVID-19 dataset, the SARIMA model extends ARIMA by incorporating seasonality. The mathematical expression of my SARIMA model is:
After performing the ARIMA model, it can be seen that the data would have been better fitted with a SARIMA model. SARIMA, an extension of ARIMA, also works with linear regression however takes into consideration of the complexity that seasonality can hold within our data. As it can be seen in the EDA, the data has a cyclical pattern. This is common within different epidemiology such as flu season, where there are times the data spikes and decreases as medical advances (in our case COVID-19 vaccines or leaving the winter season) can cause diagnosis of COVID-19 to increase or decrease.
Within the code we explored, the seasonal part of the SARIMA model would look like:
\[ Y_t = c + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \Phi_1 Y_{t-s} + \Theta_1 \epsilon_{t-s} + \epsilon_t \]
Where:
Now, the main difference between the ARIMA and the SARIMA is the addition of the \( s \) term. When I was exploring the data, it was seen that the data had an upward trend throughout the whole set, but also the data temporally increased every 7 days. Therefore, I chose to have a weekly seasonality within the dataset. The comparison between these two models can be seen in the results tab.
The developed SARIMA model's code still stems from ARIMA, but the difference is the addition of the seasonality.
from statsmodels.tsa.statespace.sarimax import SARIMAX
sarima_model = SARIMAX(fl_data['Differenced_Confirmed'].dropna(),
order=(5, 0, 5),
seasonal_order=(1, 1, 1, 7))
fit_sarima = sarima_model.fit()
forecast_steps = 60
forecast = fit_sarima.forecast(steps=forecast_steps)
Similarly to before, I also did a hold-out MAPE analysis to the SARIMA code to get a critical analysis on the model’s strength on this dataset. I also used the same start and end dates as before for the training and the test data. The biggest difference is the application of seasonality in this model.
training_data = fl_data.loc[:train_end_date, 'Differenced_Confirmed'].dropna()
sarima_model = SARIMAX(training_data, order=(5, 0, 5), seasonal_order=(1, 1, 1, 7))
fit_sarima = sarima_model.fit()
forecast_steps = len(pd.date_range(start=test_start_date, end=test_end_date, freq="D"))
forecast_differenced = fit_sarima.forecast(steps=forecast_steps)
Here, we can clearly see the seasonal_order that is representing the weekly seasonality. Then, instead of ARIMA, it is ultilizing the SARIMAX function which performs the machine learning model given the training and test datasets. The results will be explained in depth in the results tab, but this model produced a Hold-Out MAPE of 1.61% and an RMSE of 132,247.95. Simply, the ARIMA model in this case performed significantly better than SARIMA.