In "Application of Data Science in Quantitative Finance: Index Forecasting (Part 1)", we collected, explored, and preprocessed stock index data. This article covers feature engineering, model selection and training, model evaluation, and model prediction, and then analyzes and summarizes the prediction results.

## Feature engineering

Before formal modeling, we need to process the data further through feature engineering, so that no variable dominates model training merely because of its scale or encoding. Given the characteristics of the existing data, our feature engineering consists of three steps:

- Handle missing values and extract required variables
- Standardize the data
- Handle categorical variables

### 1. Handle missing values and extract required variables

First, we remove rows containing missing values and keep only the required feature columns as x_input, ready for the next steps of feature engineering.

```python
x_input = (
    df_model.dropna()[[
        'Year', 'Month', 'Day', 'Weekday', 'seasonality',
        'sign_t_1', 't_1_PricePctDelta', 't_2_PricePctDelta', 't_1_VolumeDelta',
    ]]
    .reset_index(drop=True)
)
x_input.head(10)
```

Then, the target prediction column y is extracted from the data.

```python
y = df_model.dropna().reset_index(drop=True)['AdjPricePctDelta']
```

### 2. Data standardization

Because the price percentage change and the trading volume difference are on very different scales, a model trained on unstandardized data may be biased toward one variable. To balance each variable's influence, we rescale the non-categorical variables so that their values fall in a similar range. Python offers several standardization tools; the StandardScaler class from scikit-learn is one of the most commonly used. Among the many standardization methods, we choose the one based on mean and standard deviation (z-score). You can choose a method based on your understanding of the data and the model type; for tree-based models, for example, standardization is not a necessary step.

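```python
from sklearn.preprocessing import StandardScaler

numeric_columns = ['t_1_PricePctDelta', 't_2_PricePctDelta', 't_1_VolumeDelta']

scaler = StandardScaler()
x = x_input.copy()
# Standardize only the numeric columns; categorical columns are left untouched
x[numeric_columns] = scaler.fit_transform(x[numeric_columns])
```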
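As a small illustration of what mean-and-standard-deviation scaling does, the sketch below standardizes a toy column by hand with NumPy. The values and variable names are made up for illustration; the only library fact relied on is that StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np

# Toy volume deltas (made-up values, for illustration only)
volume_delta = np.array([120.0, -80.0, 40.0, -40.0, 60.0])

# z-score: subtract the mean, then divide by the population standard deviation
mean = volume_delta.mean()
std = volume_delta.std()  # ddof=0, the same convention StandardScaler uses
z = (volume_delta - mean) / std

# After scaling, the column has mean ~0 and standard deviation ~1
print(round(z.mean(), 10), round(z.std(), 10))
```

After this transformation a price column and a volume column end up on comparable scales, whatever their original units were.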

### 3. Handling categorical variables

One of the most common ways to handle categorical variables is one-hot encoding. For categorical variables with high cardinality, encoding greatly increases the number of columns; in that case, you can consider easing the computational load through dimensionality reduction or more advanced algorithms.

```python
x_mod = pd.get_dummies(data=x, columns=['Year', 'Month', 'Day', 'Weekday', 'seasonality'])
x_mod.columns
```

```python
x_mod.shape
```

We have now completed all the feature engineering steps, and the processed data is ready for model training.

## Model selection and training

First, we need to split the data into training and test sets. When the order of records does not matter, part of the data can be randomly selected as the training set and the rest used as the test set. For time series data, the order of records must be respected. For example, if we want to predict the price change in February, the model must not see any prices after February, otherwise data leaks into training. Since the stock index data is a time series, we use the first 75% of the series as training data and the last 25% as test data.
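A minimal sketch of such a chronological 75/25 split is shown below. The helper name `time_ordered_split` and the toy data are assumptions made for illustration, and the sketch assumes the rows are already sorted by date.

```python
import numpy as np
import pandas as pd


def time_ordered_split(x, y, train_fraction=0.75):
    """Split features and target by position, keeping chronological order."""
    split_at = int(len(x) * train_fraction)
    x_train, x_test = x.iloc[:split_at], x.iloc[split_at:]
    y_train, y_test = y.iloc[:split_at], y.iloc[split_at:]
    return x_train, x_test, y_train, y_test


# Toy data standing in for the processed features and target
dates = pd.date_range("2022-01-03", periods=8, freq="B")
x_demo = pd.DataFrame({"t_1_PricePctDelta": np.arange(8) / 100}, index=dates)
y_demo = pd.Series(np.arange(8) / 100, index=dates, name="AdjPricePctDelta")

x_train, x_test, y_train, y_test = time_ordered_split(x_demo, y_demo)
print(len(x_train), len(x_test))
```

Because the split is positional rather than random, every training record is earlier than every test record.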

### Model selection

In the model selection stage, we will preliminarily determine the direction of the model based on the characteristics of the data, and select appropriate model evaluation indicators.

Because the variables include historical prices and trading volumes, and these variables are highly correlated with one another (multicollinearity), regression models built on linear models are not well suited to the target data. Therefore, our modeling attempts focus mainly on ensemble methods.

During training, we need to choose appropriate metrics to evaluate model performance. For regression models, a popular choice is MSE (mean squared error) or its square root, RMSE. For stock index data, given its time-series nature, we chose MAPE (mean absolute percentage error) alongside RMSE; it is a relative measure that expresses the error as a percentage of the actual value. Unlike MSE, it does not depend on the scale of the data. Note that MAPE is not capped at 100%: it can become very large when actual values are close to zero, which can happen with daily percentage changes. We use MAPE as the main model evaluation metric.
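To make the two metrics concrete, here is a small NumPy sketch. The functions `mape` and `rmse` are simplified illustrations written for this example (no guard against zero actual values), and the numbers are made up.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

# Rescaling the data changes RMSE but leaves MAPE untouched
print(mape(y_true, y_pred), rmse(y_true, y_pred))
print(mape(y_true * 1000, y_pred * 1000), rmse(y_true * 1000, y_pred * 1000))
```

Like this sketch, scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a value multiplied by 100.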

### Model training

In the model training phase, all candidate models are trained with default parameters, and we use the MAPE value to decide which model type to tune further. We tried a variety of algorithms, including linear regression and random forest. For each model fitted on the training set, we printed its performance on the training and test sets and returned the results as a dictionary.
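The helper that runs these trials is not shown in the article. A minimal sketch of what such a function could look like follows; the function name `run_regression_trials`, the list of candidates, the keys `model_name` and `model_train_mape`, and the synthetic data are all assumptions, not the author's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error


def run_regression_trials(x_train, y_train, x_test, y_test):
    """Fit several default-parameter regressors and collect their MAPE."""
    candidates = {
        "linear_regression": LinearRegression(),
        "random_forest": RandomForestRegressor(random_state=42),
        "gradient_boosting": GradientBoostingRegressor(random_state=42),
        "ada_boost": AdaBoostRegressor(random_state=42),
    }
    results = {"model_name": [], "model_train_mape": [], "model_test_mape": []}
    for name, model in candidates.items():
        model.fit(x_train, y_train)
        results["model_name"].append(name)
        results["model_train_mape"].append(
            mean_absolute_percentage_error(y_train, model.predict(x_train)))
        results["model_test_mape"].append(
            mean_absolute_percentage_error(y_test, model.predict(x_test)))
    print(pd.DataFrame(results))
    return results


# Synthetic data, used only to show that the sketch runs end to end
rng = np.random.default_rng(0)
x_all = rng.normal(size=(200, 3))
y_all = 1.0 + 0.1 * x_all[:, 0] + rng.normal(scale=0.01, size=200)
demo_result = run_regression_trials(x_all[:150], y_all[:150], x_all[150:], y_all[150:])
```

Returning a dictionary of equal-length lists makes it easy to wrap the result in a DataFrame and sort by the test-set MAPE.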

## Model evaluation

By running the following code, we can rank the models by their prediction error (MAPE) on the test set. You can also explore other models and select the best one for subsequent fine-tuning based on the evaluation metrics.

```python
trail_result = ensemble_method_reg_trails(x_train, y_train, x_test, y_test)

pd.DataFrame(trail_result).sort_values('model_test_mape', ascending=True)
```

The results show that, among the model types tried, AdaBoost performs best on both the training and test sets, with the smallest MAPE, so we choose AdaBoost for detailed tuning. Random forest and gradient boosting also predict well. Note that although AdaBoost is highly accurate on the training set, its performance is not very stable.

The following model fine-tuning is divided into two steps:

- Use RandomizedSearchCV to find the approximate range of the optimal parameters
- Use GridSearchCV to find more precise parameters

The main parameters that affect AdaBoost's performance are:

- n_estimators
- base_estimator
- learning_rate

Note that both RandomizedSearchCV and GridSearchCV use cross-validation to evaluate each parameter combination. As mentioned earlier, time series require attention to order. Here, the series has already been transformed into a supervised learning format: each record carries its own time information (date variables and lagged values), and the training set contains no information from the test set. The records within the training set can therefore either be cross-validated in a time-ordered scheme (more complex) or be shuffled. We assume that shuffling the training data does not affect model training.
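For readers who prefer to keep chronological order inside cross-validation, scikit-learn's `TimeSeriesSplit` is one time-ordered scheme. The sketch below is an alternative to the approach taken in this article, not what the article uses, and the toy array is only a stand-in for the training set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve consecutive records standing in for a time-ordered training set
x_demo = np.arange(12).reshape(-1, 1)

# Each fold trains on an expanding window and validates on the block that follows
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(x_demo))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

A splitter like `tscv` can be passed as the `cv` argument of RandomizedSearchCV or GridSearchCV in place of an integer.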

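base_estimator is the base learner that AdaBoost boosts, so we need to create a list of candidate base estimators in advance. Note that in scikit-learn 1.2 and later this parameter is named estimator, and the base_estimator name was removed in version 1.4.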

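```python
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor

# Candidate base learners: decision trees of depth 1 to 15, plus one linear SVR
l_base_estimator = []
for i in range(1, 16):
    base = DecisionTreeRegressor(max_depth=i, random_state=42)
    l_base_estimator.append(base)
l_base_estimator += [LinearSVR(random_state=42, epsilon=0.01, C=100)]
```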

### 1. Use RandomizedSearchCV to find the approximate range of optimal parameters

RandomizedSearchCV samples parameter combinations at random from the search space. Here, we try 500 different combinations.

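```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import RandomizedSearchCV

randomized_search_grid = {
    'n_estimators': [10, 50, 100, 500, 1000, 5000],
    'base_estimator': l_base_estimator,
    'learning_rate': np.linspace(0.01, 1),  # 50 evenly spaced values by default
}

search = RandomizedSearchCV(
    AdaBoostRegressor(random_state=42),
    randomized_search_grid,
    n_iter=500,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    cv=5,
    random_state=42,
)
result = search.fit(x_train, y_train)
```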

The best of the 500 parameter combinations and its cross-validation score are:

```python
result.best_params_
```

```python
result.best_score_
```

### 2. Use GridSearchCV to find more precise parameters

According to the results of Randomized Search, we then use GridSearchCV for further fine-tuning:

- n_estimators: 1-50
- base_estimator: Decision Tree with max depth 9
- learning_rate: around 0.7

```python
search_grid = {
    'n_estimators': range(1, 51),
    'learning_rate': np.linspace(0.6, 0.8, num=20),
}
```

The results of GridSearchCV are as follows:

Based on the results from GridSearchCV, we keep the best model, let it train on the whole training set, and make predictions on the test set to evaluate the results.
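The grid search and test-set evaluation code is not shown in the article. The sketch below illustrates the general pattern under stated assumptions: the data is synthetic, the grid is deliberately tiny so that it runs quickly, and the variable names `demo_grid`, `base_tree`, and `test_mape` are made up for this example.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the time-ordered training and test sets
rng = np.random.default_rng(42)
x_all = rng.normal(size=(120, 3))
y_all = 1.0 + 0.1 * x_all[:, 0] + rng.normal(scale=0.01, size=120)
x_train, x_test = x_all[:90], x_all[90:]
y_train, y_test = y_all[:90], y_all[90:]

# Deliberately tiny grid (assumption); the article searches a much finer one
demo_grid = {"n_estimators": [5, 10], "learning_rate": [0.6, 0.7, 0.8]}

# The base tree is passed positionally because its keyword name differs
# between scikit-learn releases (base_estimator vs. estimator)
base_tree = DecisionTreeRegressor(max_depth=9, random_state=42)
grid = GridSearchCV(
    AdaBoostRegressor(base_tree, random_state=42),
    demo_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(x_train, y_train)

# With the default refit=True, best_estimator_ is refit on the whole training set
best_reg = grid.best_estimator_
test_mape = mean_absolute_percentage_error(y_test, best_reg.predict(x_test))
print(grid.best_params_, test_mape)
```

Comparing `grid.best_score_` (cross-validated on the training set) with the test-set metric gives a quick check for overfitting.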

Taken together with the cross-validation results on the training set, the best model performs slightly better on the test set than the models from the selection and training phase. It balances training-set and test-set performance, which helps guard against overfitting.

After settling on the model, we retrain it on all the data, because so far it has only seen the training set and we want to predict future data. We then save the best model as a pickle file.

```python
best_reg.fit(x_mod, y)
```

The model's MAPE on the full dataset is:

```python
m_forecast = best_reg.predict(x_mod)
mean_absolute_percentage_error(y, m_forecast)
```
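Saving the fitted model with pickle and reloading it later might look like the sketch below. The file name `best_reg_demo.pkl` and the small stand-in model are assumptions made for illustration; in practice the fitted `best_reg` would be saved instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small stand-in model trained on synthetic data
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 2))
y_demo = 0.5 * x_demo[:, 0]
model = DecisionTreeRegressor(max_depth=3, random_state=42).fit(x_demo, y_demo)

# Hypothetical file location, placed in the system temp directory
model_path = os.path.join(tempfile.gettempdir(), "best_reg_demo.pkl")

# Save the fitted model
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload it later and confirm that it predicts identically
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(np.allclose(model.predict(x_demo), loaded_model.predict(x_demo)))
```

A pickled model should be loaded with the same scikit-learn version that saved it, and pickle files should only be loaded from trusted sources.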

## Model prediction

Unlike a traditional ARIMA model, our model needs the latest information to be reassembled into model inputs before every forecast. The following function performs this reassembly and can be adapted to different application scenarios.

```python
import datetime

import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta


def forecast_one_period(price_info_adj_data, ml_model, data_processor):
    # Source data: data acquired straight from source
    last_record = price_info_adj_data.reset_index().iloc[-1, :]

    # Build the lagged inputs for the next day from the most recent record
    next_day = last_record['Date'] + relativedelta(days=1)
    next_day_t_1_PricePctDelta = last_record['AdjPricePctDelta']
    next_day_t_2_PricePctDelta = last_record['t_1_PricePctDelta']
    next_day_t_1_VolumeDelta = last_record['Volume_in_M'] - last_record['t_1_VolumeDelta']
    next_day_sign_t_1 = 1 if next_day_t_1_PricePctDelta > 0 else 0

    # Value -99999 is a placeholder which won't be used in the following modeling process
    next_day_input = pd.DataFrame({
        'Date': [next_day],
        'Volume_in_M': [-99999],
        'AdjPricePctDelta': [-99999],
        't_1_PricePctDelta': [next_day_t_1_PricePctDelta],
        't_2_PricePctDelta': [next_day_t_2_PricePctDelta],
        't-1volume': [last_record['Volume_in_M']],
        't-2volume': [last_record['t-1volume']],
        't_1_VolumeDelta': [next_day_t_1_VolumeDelta],
        'sign_t_1': [next_day_sign_t_1],
    }).set_index('Date')

    # If forecast period is post Feb 15, 2020, input data starts from 2020-02-16,
    # as our model is dedicated for market under Covid impact.
    # Another model could be used for pre-Covid market forecast.
    if next_day > datetime.datetime(2020, 2, 15):
        price_info_adj_data = price_info_adj_data[
            price_info_adj_data.index > datetime.datetime(2020, 2, 15)]

    # Add new record to original data for modeling preparation
    price_info_adj_data_next_day = pd.concat([price_info_adj_data, next_day_input])
    input_modified = data_processor.data_modification(price_info_adj_data_next_day)

    # Prep for modeling
    x, y = data_processor.data_modeling_prep(input_modified)
    next_day_x = x.iloc[-1:]
    forecast_price_delta = ml_model.predict(next_day_x)

    # Consolidate prediction results
    forecast_df = {'Date': [next_day],
                   'price_pct_delta': [forecast_price_delta[0]],
                   'actual_pct_delta': [np.nan]}
    return pd.DataFrame(forecast_df)
```
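To forecast more than one period ahead, the same idea can be repeated: each forecast is appended to the history and becomes a lag input for the next step. The sketch below shows only this feedback loop. `ToyModel`, `forecast_n_periods`, and the two-lag feature layout are hypothetical and much simpler than forecast_one_period.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in with a scikit-learn style predict method."""

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        # Next change = half of the t-1 change plus a quarter of the t-2 change
        return 0.5 * x[:, 0] + 0.25 * x[:, 1]


def forecast_n_periods(history, model, n_periods):
    """Recursively forecast by feeding each prediction back in as a lag."""
    values = list(history)
    forecasts = []
    for _ in range(n_periods):
        features = np.array([[values[-1], values[-2]]])  # columns: [t-1, t-2]
        next_value = float(model.predict(features)[0])
        forecasts.append(next_value)
        values.append(next_value)  # the forecast becomes the next step's t-1 input
    return forecasts


forecasts = forecast_n_periods([0.4, 0.8], ToyModel(), n_periods=3)
print(forecasts)
```

Errors compound in such a loop, since later forecasts are built on earlier forecasts rather than on observed values.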

We load the previously saved model and predict the price movement for the next working day. In the output, actual_pct_delta is a placeholder column reserved for the actual result once the future price is released.

According to the forecast results, we believe that the S&P index will rise slightly on November 1, 2022.

## Analyzing forecast results

Based on the data trend of the past two years, the model forecasts a slight rise in the S&P index. However, the actual data for November 1, 2022 show that the index fell that day. This suggests that outside information, such as a change in economic indicators or in the policy climate, shifted market sentiment. After searching related news, we found the following:

"Stocks finished lower as data showing a solid US labor market bolstered speculation that Federal Reserve policy could remain aggressively tight even with the threat of a recession."

While the economy faces multiple challenges, newly released data showed an increase in the number of jobs, leading investors to believe that the labor market is stable and that the Fed will not consider loosening its current policy. This negative outlook carried over into the stock market, and the index closed lower that day.

In practical applications, a model does more than produce forecasts. In this article's case, the forecast of index price changes works more like a benchmark. Because the model learns from historical data, its output represents the change we would expect if the market followed historical patterns with no major external interference. When the actual change on a given day differs, we still need to dig deeper to tell whether it is a "system"-level change or one caused by non-"system" factors. Building on the model, we can draw such inferences and develop further functionality so that the data delivers as much value as possible.

## Summary

Looking back over both articles, our approach to predicting S&P 500 price changes can be summarized as follows:

- Determine the forecast target: an index that reflects the US stock market, the S&P 500;
- Data collection: Download historical price data from public financial websites;
- Exploratory data analysis: preliminary understanding of the characteristics of the data, data visualization, and presentation of time series information in the form of images;
- Data preprocessing: convert time to variables, change price data, find cycles and seasonality, adjust trading volume data according to cycles;
- Feature engineering: handle missing values and extract required variables, standardize the data, handle categorical variables;
- Model selection and training: split the training set and test set, determine the model direction and evaluation indicators, and try to train various models;
- Model evaluation: Select the optimal model according to the indicators, use RandomizedSearchCV to find the approximate range of the best parameters, and then use GridSearchCV to find more precise parameters;
- Model prediction: integrate the input data and predict the price change of the next working day;
- Analysis of forecast results: compare the forecast with what actually happened that day to understand market changes and make full use of the model.

### References:

- Time series into supervised learning problem
- Tuning Ada Boost
- S&P 500 historical data
- Bloomberg News
- Abu (2021). "The Road to Quantitative Trading Using Python to Do Stock Quantitative Analysis". Mechanical Industry Press.