JPX Tokyo Stock Exchange Prediction Summary - sharing a no-leakage solution scoring above 0.3 - 2022-07-14

When I sat down to write this article, I happened to have "The Lonely Brave" playing~ πŸ˜‚
Like the proverbial wild goose leaving its call behind, let me leave a trace and write down all my attempts πŸ˜€

This competition is a little special: to make testing easier for everyone, the data within the prediction period was placed in the supplemental dataset, so there is a data leakage problem πŸ˜‚ The current leaderboard rankings are mostly people having fun with the leaked data.



So far I have seen few shared solutions above 0.3 without data leakage 😳. You can wait until the competition ends in October and then look at the high-scoring solutions 😺.

Competition link ↓↓ (the data can be downloaded from the link):
https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction

Contents

1. Competition analysis

1.1 Basic information

JPX (the Japan Exchange Group) asks participants to model financial data from the Japanese market and predict real returns over a period after the training window.

1.2 Provided data (competition input)

Data folders:

  • data_specifications: descriptions of each field
  • jpx_tokyo_market_prediction: files that enable the API. The API is expected to deliver all rows within five minutes and keep memory usage under 0.5 GB
  • train_files: data covering the main training period
  • supplemental_files: a dynamic window of supplementary training data, updated at the main stages of the competition (early May, early June, and about a week before submissions are locked)
  • example_test_files: data covering the public test period, intended to make offline testing easier

Let's mainly look at the basic information of the data in the train_files training data folder.

stock_prices.csv

Column descriptions for stock_prices:

  • RowId: unique id of the record, a combination of the date and the securities code
  • Date: trade date
  • SecuritiesCode: local securities code
  • Open: opening price
  • High: highest price of the day
  • Low: lowest price of the day
  • Close: closing price
  • Volume: trading volume
  • AdjustmentFactor: adjustment factor
  • ExpectedDividend: expected dividend
  • SupervisionFlag: flag for securities under supervision or about to be delisted
  • Target: change rate of the adjusted closing price between t+2 and t+1, where t+0 is the trade date

I didn't use the other tables much. I looked up the information on their fields and will put it here:
Link (to be added later)

1.3 Result submission (competition output)

From the sample submission file sample_submission.csv in the example_test_files folder, you can see that the submitted result is, for each day in the given date range, a ranking of the 2000 stocks.
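For illustration, here is a hypothetical couple of rows in that format (the values are made up; the real file holds one row per stock, 2000 per date):

import pandas as pd

# Hypothetical submission rows: Rank is a permutation of 0..1999 within each date
submission = pd.DataFrame({
    "Date": ["2021-12-06", "2021-12-06"],
    "SecuritiesCode": [1301, 1332],
    "Rank": [0, 1999],
})
print(submission)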


So, how is the ranking generated?
The competition metric is described in the evaluation section of the competition page.
Here C(k,t) is the closing price of stock k on day t, and r(k,t) (the Target in our training data) is computed as r(k,t) = (C(k,t+2) - C(k,t+1)) / C(k,t+1).

You can calculate the Target column from the Close column; it's the return from buying a stock the next day and selling the day after that

# Read the stock_prices data and take a quick look (train_files_dir points at the train_files folder)
import pandas as pd

df_price = pd.read_csv(f"{train_files_dir}/stock_prices.csv", nrows=10000)

df_need = df_price[df_price["SecuritiesCode"]==1301][["RowId","SecuritiesCode","Close","Target"]]

df_need["Close_shift1"] = df_need["Close"].shift(-1)   # close at t+1
df_need["Close_shift2"] = df_need["Close"].shift(-2)   # close at t+2
df_need["rate"] = (df_need["Close_shift2"] - df_need["Close_shift1"]) / df_need["Close_shift1"]


We can see that the r(k,t) calculated here equals the given Target.

The return computed here assumes buying tomorrow and selling the day after tomorrow; this target is computed for every day.
The target therefore depends on the closing prices of the two days following the current day.
At prediction time, only the current day's data is given, and we predict the target value for that day and rank the stocks by it.

Note that Rank runs from 0 to 1999, not 1 to 2000.

1.4 Evaluation criteria

The rest is how our submissions are evaluated. We only need to submit the daily rankings in the format above; the evaluation itself runs through the API (which prevents using data from after the prediction time).

Submissions are scored on the Sharpe ratio of the daily spread returns.

For a single day, the 200 highest-ranked stocks (ranks 0 to 199) are treated as bought and the 200 lowest-ranked stocks (ranks 1999 down to 1800) are treated as shorted.

The stocks are then weighted linearly by rank (better ranks get larger weights), and the total return of the portfolio is computed assuming the stocks were bought the next day and sold the day after that.

The daily spread return is the weighted sum of the Targets of the top 200 minus the weighted sum of the Targets of the bottom 200. The final score is the mean of the daily spread returns over the evaluation period divided by their standard deviation; higher scores rank higher.
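To make this concrete, here is a minimal sketch of the daily spread-return Sharpe calculation, written from my understanding of the description above (the linear weights running from 2 down to 1 over the 200 ranks follow the publicly shared evaluation code as I recall it, so treat the exact weighting as an assumption):

import numpy as np
import pandas as pd

def calc_spread_return_per_day(day_df, portfolio_size=200, toprank_weight_ratio=2):
    # linear weights: rank 0 gets the largest weight, rank portfolio_size-1 gets weight 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    long_ret = (day_df.sort_values("Rank")["Target"][:portfolio_size] * weights).sum() / weights.mean()
    short_ret = (day_df.sort_values("Rank", ascending=False)["Target"][:portfolio_size] * weights).sum() / weights.mean()
    return long_ret - short_ret   # daily spread return

def calc_spread_return_sharpe(df, portfolio_size=200, toprank_weight_ratio=2):
    daily = df.groupby("Date").apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    return daily.mean() / daily.std()   # final score: mean daily spread return over its standard deviation

Given a dataframe with Date, Rank and Target for every stock, calc_spread_return_sharpe(df) should reproduce a leaderboard-style score locally.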

2. Exploratory data analysis

This part mainly looks at the stock price data. Although I translated the field names of the other datasets, I did not really understand what they mean or how to use them, so I did not go further with them; I looked at what others did later.

2.1 Bivariate analysis

Here we take the data from 2021 onward.

import seaborn as sb
import matplotlib.pyplot as plt
from datetime import datetime


# The full dataset is too heavy to plot in one go, so only take data from 2021 onward
# (this assumes df_price now holds the full stock_prices table, not the nrows=10000 sample above)
df_price.Date = pd.to_datetime(df_price.Date, format="%Y-%m-%d")
df_price2021 = df_price[df_price.Date>datetime.strptime('2021-01-01','%Y-%m-%d')]

# Scatter-plot matrix
g = sb.PairGrid(data = df_price2021, vars = ['Open', 'High', 'Low','Close','Volume','Target'])
# g.map_diag(plt.hist) # map_diag puts histograms on the diagonal; otherwise the diagonal is a degenerate scatter
g.map_offdiag(plt.scatter)


It can be seen that:

  • Open, High, Low and Close are strongly correlated with one another
  • Volume drops off quickly as price increases; high-priced stocks generally trade in low volume
  • Target is roughly normally distributed, and its spread against Volume is wider than against price

How each variable evolves over time (using the stock with the largest Target in 2021 as an example)

# Look at how each field of the stock with the largest Target in 2021 changes over that year
import matplotlib.dates as mdates

SecuritiesCode2021_1 = df_price2021[df_price2021.Target == max(df_price2021.Target)].SecuritiesCode.item()
df_price2021_1 = df_price2021[df_price2021.SecuritiesCode==SecuritiesCode2021_1]
# price2021_1_timeseries = df_price2021_1.set_index("Date")  # (226, 11)
# Reshape the wide table into a long table
df_price2021_2 = df_price2021_1[["Date",'Open', 'High', 'Low','Close']]
df_price2021_2 = df_price2021_2.set_index(["Date"])
df_price2021_3 = df_price2021_2.stack().reset_index()
df_price2021_3[0] = df_price2021_3[0].astype('float64')
df_price2021_3.Date = df_price2021_3.Date.apply(lambda x:mdates.date2num(x))   # passing datetime objects directly raises an error here
df_price2021_3.rename(columns={0:"value"},inplace = True)

# Line chart
ax = sb.lineplot(data=df_price2021_3,x="Date",y="value",hue='level_1')
# get current axis
ax = plt.gca()
format_str = '%Y-%m-%d'
format_ = mdates.DateFormatter(format_str)
ax.xaxis.set_major_formatter(format_)
plt.xticks(rotation=15)
plt.show()

It can be seen that:

  • The stock rose sharply from August to September 2021
  • Open, High, Low and Close move essentially in step

Next, draw the time curves of the price, Target and Volume together.
The time-curve plotting can be wrapped in a function.

# Encapsulated time-series line chart
def time_line(df, col_li, time_col):
    '''
    Plot the trend of each variable over time.
    :param df: dataframe containing the columns in col_li and time_col
    :param col_li: list of variable names to plot
    :param time_col: name of the time column
    :return: None (draws into the current axes)
    '''
    # Reshape the wide table into a long table
    df_wide = df[col_li + [time_col]]
    df_wide = df_wide.set_index([time_col])
    df_long = df_wide.stack().reset_index()
    df_long[0] = df_long[0].astype('float64')
    df_long[time_col] = df_long[time_col].apply(lambda x: mdates.date2num(x))  # passing datetime objects directly raises an error here
    if len(col_li) == 1:
        y_name = col_li[0]
    else:
        y_name = "value"

    df_long.rename(columns={0: y_name}, inplace=True)

    # Line chart of the long-format data
    if len(col_li) == 1:
        ax = sb.lineplot(data=df_long, x=time_col, y=y_name)
    else:
        ax = sb.lineplot(data=df_long, x=time_col, y=y_name, hue='level_1')
    # Format the time axis on the current axes
    ax = plt.gca()
    format_str = '%Y-%m-%d'
    format_ = mdates.DateFormatter(format_str)
    ax.xaxis.set_major_formatter(format_)
    plt.xticks(rotation=15)


# Draw the three charts together
plt.figure(figsize=[12,5])

plt.subplot(1,3,1)
time_line(df_price2021_1,['Open', 'High', 'Low','Close'],"Date")

plt.subplot(1,3,2)
time_line(df_price2021_1,["Target"],"Date")

plt.subplot(1,3,3)
time_line(df_price2021_1,["Volume"],"Date")

plt.tight_layout()
plt.show()


As you can see:

  • Target and the stock price follow a similar trend; both have an extremely steep peak in August 2021, but compared with the price, the Target curve jitters more locally
  • Volume also fluctuates with the stock price, but its peak lags slightly behind

The stock with the lowest Target in 2021 was also analysed:

  • Where the price fluctuates strongly, trading volume also increases significantly
  • Target fluctuates around the zero line, with sharp swings during periods of strong price movement
2.2 Excellent EDA notebooks πŸ˜‚

This is a popular EDA notebook:
https://www.kaggle.com/code/abaojiang/jpx-detailed-eda

This one is mainly a summary of feature engineering in the financial domain πŸ€—πŸš€
https://www.kaggle.com/code/metathesis/feature-engineering-training-with-ta/notebook

3. Model building attempts

3.1 LSTM

First, let's forecast a single stock to see how well the prediction works.

# Import required packages
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Dropout

Load data

# set base_dir to load data
base_dir = r"D:/stock_data"

# train_data
train_files_dir = f"{base_dir}/train_files"

train_df = pd.read_csv(f"{train_files_dir}/stock_prices.csv", parse_dates=["Date"])
valid_df = pd.read_csv(f"{base_dir}/supplemental_files/stock_prices.csv", parse_dates=["Date"])
train_df = pd.concat([train_df, valid_df])  # combine training and supplemental data; 20% of the combined data is used later for testing
features = ['Open', 'High', 'Low', 'Close', 'Volume']

# Use the stock with code 1332 as an example
prices = train_df.query("SecuritiesCode==1332")[features]

Split off a test set

test_split = round(len(prices)*0.2)  # about 252 rows

df_for_training = prices[:-test_split][features]
print(df_for_training.shape)
df_for_training = df_for_training.dropna(how='any')
print(df_for_training.shape)
df_for_testing = prices[-test_split:][features]

Scale the data

scaler = MinMaxScaler(feature_range=(0,1))   # scale features to the range (0,1)
df_for_training_scaled = scaler.fit_transform(df_for_training)
df_for_testing_scaled = scaler.transform(df_for_testing)

Generate training data and test data

# createXY
def createXY(dataset, n_past):
    dataX = []
    dataY = []
    for i in range(n_past, len(dataset)):
        dataX.append(dataset[i - n_past:i, 0:dataset.shape[1]])  # the previous n_past days of all features
        dataY.append(dataset[i, -2])    # label: the Close price (second-to-last feature column) on day i
    return np.array(dataX), np.array(dataY)


# Generate data
trainX,trainY=createXY(df_for_training_scaled,30)
# trainX.shape
testX,testY=createXY(df_for_testing_scaled,30)

Regression with Keras

from keras.wrappers.scikit_learn import KerasRegressor   # imported for an optional grid search, not used below
from sklearn.model_selection import GridSearchCV

grid_model = Sequential()
grid_model.add(LSTM(50,return_sequences=True,input_shape=(30,5)))
grid_model.add(LSTM(50))
grid_model.add(Dropout(0.2))
grid_model.add(Dense(1))    # single output unit for the regression target

grid_model.compile(loss='mse',optimizer = 'adam')

history = grid_model.fit(trainX,trainY,epochs=10,batch_size=30,validation_data=(testX,testY))   # fit on the training data, validating on the test split

Plot the loss curves to see how training went

from matplotlib import pyplot as plt
plt.plot(history.history['loss'],label='train')
plt.plot(history.history['val_loss'],label='test')
plt.legend()
plt.show()


Make predictions and compare the results with real data

prediction=grid_model.predict(testX)

prediction_copies_array = np.repeat(prediction,5, axis=-1)

# Inverse transform: repeat the single predicted column five times so the scaler can map it back,
# then keep the Close column (index 3, matching the column used as the label)
pred=scaler.inverse_transform(np.reshape(prediction_copies_array,(len(prediction),5)))[:,3]

# Real data, inverse transformed the same way
original_copies_array = np.repeat(testY,5, axis=-1)

# original_copies_array.shape

original=scaler.inverse_transform(np.reshape(original_copies_array,(len(testY),5)))[:,3]

Plot the predictions against the real data

plt.plot(original, color = 'red', label = 'Real  Stock Price')
plt.plot(pred, color = 'blue', label = 'Predicted  Stock Price')
plt.title(' Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel(' Stock Price')
plt.legend()
plt.show()
  • Example: stock 1332
  • Example: stock 1377

    The predictions roughly follow the actual data, but the small jitters are not captured. This competition is mainly about the short-term returns generated by daily price fluctuations, so we would either need to adjust the model so that it fits these disturbances as closely as possible and accurately predicts the price over the next two days in order to compute the required Target, or we can try to predict the Target directly. Next I try XGBoost to predict the Target directly and see how feasible that is.

In addition:
For LSTM, I think the chart produced in the notebook below ends up much closer to the real fluctuations of the stock price. I only looked at the chart and did not reproduce it, so I don't know its actual performance or whether it is shown at its best. Have a look if you are interested: https://www.kaggle.com/code/onurkoc83/multivariate-lstm-close-open-high-low-volume

3.2 XGBoost

After reading the cases shared in the forum, I tried to predict the Target directly with XGBoost. I selected some features, tuned hyperparameters with Optuna, and used GPU acceleration. The best result at this stage reached 0.297, and I could not push it any further 😹.

# Import the corresponding module
import os
import traceback
import numpy as np
import pandas as pd
import xgboost as xgb
from tqdm import tqdm
import jpx_tokyo_market_prediction
import warnings; warnings.filterwarnings("ignore")

prices1 = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv",parse_dates=True)

# Select the prices data after 2020
prices1 = prices1[prices1.Date>'2020-01-01']

prices = prices1.copy()
prices = prices.drop(["ExpectedDividend"],axis=1)

prices.isnull().sum()
prices = prices.dropna(how='any')
prices.isnull().sum()  # drop null values
# Get the share price of each stock the previous day
cc = prices.groupby("SecuritiesCode").apply(lambda df: df['Close'].shift(1))
cc = pd.DataFrame(cc).reset_index(level=0)

prices = pd.merge(prices,cc['Close'],left_index=True,right_index=True)
prices.head()
prices.tail()
prices['delta'] = prices['Close_x'] - prices['Close_y']

# Check whether it is up or down
def getadvance(x):
    ret = 0
    if x > 0:
        ret = 1
    return(ret)

prices['advance'] = list(map(getadvance, prices['delta']))
prices['Date'] = pd.to_datetime(prices['Date'], format = "%Y-%m-%d")

def get_month(dt):   # Get month
    x = dt.strftime("%m")
    return(x)

prices['Month'] =  list(map(get_month, prices['Date']))

prices.rename(columns={"Close_x":"Close"},inplace=True)
prices.head()
def upper_shadow(df):
    return df['High'] - np.maximum(df['Close'], df['Open'])

def lower_shadow(df):
    return np.minimum(df['Close'], df['Open']) - df['Low']

prices['Upper_Shadow'] = upper_shadow(prices)
prices['Lower_Shadow'] = lower_shadow(prices)
# Final feature list
features = ['Open', 'High', 'Low', 'Close',
            'Volume', 'AdjustmentFactor', 'SupervisionFlag', 
            'delta', 'advance', 'Month','Upper_Shadow','Lower_Shadow']
            
prices = prices.dropna(how='any')
prices.isnull().sum()

del prices['Date']
# Convert to category format
def cat_col(data) :
    data['SecuritiesCode'] = data['SecuritiesCode'].astype('category')
    data['SupervisionFlag'] = data['SupervisionFlag'].astype('category')
    data['advance'] = data['advance'].astype('category')
    data['AdjustmentFactor'] = data['AdjustmentFactor'].astype('category')
    data['Month'] = data['Month'].astype('category')
    return data

prices = cat_col(prices)
X = prices[features]
y = prices['Target']

# optuna parameter adjustment + training model
import optuna
def objectives(trial):
    param = {
        'tree_method':'gpu_hist',
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.6,0.8,1.0]),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.5,0.7,0.9,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008,0.01,0.02,0.05]),
        "n_estimators" : trial.suggest_int('n_estimators', 300, 1000),
        'max_depth': trial.suggest_categorical('max_depth', [5,9,13,15,17,20]),
        'random_state': trial.suggest_categorical('random_state', [24, 48,2020]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10)
    }
    model = xgb.XGBRegressor(**param, enable_categorical=True)
    model.fit(X, y)
    score = model.score(X, y)   # note: this is R^2 on the training data itself, there is no hold-out set
    return score
    

studyxgb = optuna.create_study(direction='maximize', sampler=optuna.samplers.RandomSampler(seed=0))
studyxgb.optimize(objectives, n_trials=5)

trial = studyxgb.best_trial
params_best = dict(trial.params.items())
print(params_best)
# params_best['random_seed'] = 0

model = xgb.XGBRegressor(**params_best,enable_categorical=True,tree_method='gpu_hist')  # xgb.XGBRegressor(**param, enable_categorical=True)

# Print best parameters
print('study.best_value:', studyxgb.best_trial.value)
print('Number of finished trials:', len(studyxgb.trials))
print('Best trial:', studyxgb.best_trial.params)
print('study.best_params:', studyxgb.best_params)

print(model.tree_method)

model.fit(X,y)
model.score(X,y)

Submit

import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

all_data = prices1.copy()


# Get the most recent closing price before the given date (the "previous day's close")
def latest_close(SecuritiesCode,date):
    temp = all_data[all_data.SecuritiesCode==SecuritiesCode].sort_values(by=["Date"],ascending=False)
    temp = temp[temp.Date<date]
    return temp.iloc[0]['Close']
   
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
#     print(prices)
#     del prices['Date']
#     print(prices)
#     qq = prices
    all_data = pd.concat([all_data,prices])
#     prices["Avg"] = sample_prediction["SecuritiesCode"].apply(get_avg)

    prices['Close_y'] = prices.apply(lambda x:latest_close(x.SecuritiesCode,x.Date),axis=1)
    prices['delta'] = prices['Close'] - prices['Close_y']
    prices['advance'] = list(map(getadvance, prices['delta']))
    prices['Date'] = pd.to_datetime(prices['Date'], format="%Y-%m-%d")
    prices['Month'] = list(map(get_month, prices['Date']))

    prices = cat_col(prices)
    prices['Date'] = prices['Date'].dt.strftime("%Y%m%d").astype(int)
    prices['Upper_Shadow'] = upper_shadow(prices)
    prices['Lower_Shadow'] = lower_shadow(prices)
    
    securities = prices["SecuritiesCode"]
    prices = prices[features]
    print('-------------------------------prices------------------------------')
    print(prices)
    print('------------------------------------------------------------------------------')

    sample_prediction["Prediction"] = model.predict(prices)
#     sample_prediction['SecuritiesCode'] = securities
    print('-------sample_prediction--------')
    print(sample_prediction)
    sample_prediction = sample_prediction.sort_values(by="Prediction", ascending=False)
    sample_prediction.Rank = np.arange(0, 2000)
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    sample_prediction = sample_prediction.drop(["Prediction"], axis=1)
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]
    print('-------------------------------submission------------------------------')
    print(submission)
    print('------------------------------------------------------------------------------')
    env.predict(submission)
3.3 Other attempts
3.3.1 Short-horizon curve fitting

A simpler, rougher alternative to predicting the price with LSTM is to fit the last two or three days with a rolling linear fit and extrapolate the price for the next days as a rough estimate.
Code snippet:

# Extrapolate the stock price for the next days from a rolling linear fit
df.Date = pd.to_datetime(df.Date, format="%Y-%m-%d")
df = df.set_index(['Date'])
df['day3'] = df.Close.rolling(window=3).apply(lambda y:
           np.poly1d(np.polyfit([0,1,2],y,1))(3),raw=True)
df['day2'] = df.Close.rolling(window=2).apply(lambda y:
           np.poly1d(np.polyfit([0,1],y,1))(2),raw=True)
df = df.reset_index()
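As a sketch of how these fitted prices could be turned into a submission (this is my assumption, not the author's exact code): if day2 is read as an estimate of the close at t+1, day3 as an estimate of the close at t+2, and the per-stock frames are concatenated back into one dataframe with Date and SecuritiesCode, then:

# Hypothetical conversion of the fitted prices into a Target estimate and a daily Rank
df = df.dropna(subset=["day2", "day3"])                      # drop rows where the rolling fit is undefined
df["pred_target"] = (df["day3"] - df["day2"]) / df["day2"]   # estimated (t+2 vs t+1) return
df["Rank"] = (
    df.groupby("Date")["pred_target"]
      .rank(ascending=False, method="first")                 # highest estimated return first
      .astype(int) - 1                                       # convert 1..N ranks to 0..1999
)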

The effect of this method is OK, and the score can be about 0.11-0.14.

3.3.2 Prophet

Later, I mainly wanted to improve the XGBoost result a little. Other people's attempts at simply adding extra features were not very effective: https://www.kaggle.com/code/junjitakeshima/jpx-add-new-features-eng.
So I started thinking about combining Prophet with XGBoost to see how it would work 😡 (the painful part begins... hhhh).
My main reference was this article:


Then most of my time went into:

  • solving the PyStan installation problem (maybe a blog post for another day 😷)
  • suppressing PyStan's overly verbose output when using Prophet (Prophet usage and logging notes to be written up another day!!!)
  • adjusting the code to cope with insufficient memory when forecasting 2000 stocks, and with Kaggle's time limit πŸ™†πŸ™†πŸ™†

The final result of this attempt was not ideal. I am not proficient with Prophet and had no time to tune it further πŸ™‡. Part of the code is below; please let me know if you have ideas for improving it.

# prophet section
# import prophet
from prophet import Prophet
from prophet.make_holidays import make_holidays_df
import logging
logging.getLogger('prophet').setLevel(logging.WARNING)

# Context manager to suppress PyStan's console output
import os

class suppress_stdout_stderr(object):
    '''
    A context manager for doing a "deep suppression" of stdout and stderr in
    Python, i.e. will suppress all print, even if the print originates in a
    compiled C/Fortran sub-function.
       This will not suppress raised exceptions, since exceptions are printed
    to stderr just before a script exits, and after the context manager has
    exited (at least, I think that is why it lets exceptions through).

    '''
    def __init__(self):
        # Open a pair of null files
        self.null_fds = [os.open(os.devnull, os.O_RDWR) for x in range(2)]
        # Save the actual stdout (1) and stderr (2) file descriptors.
        self.save_fds = (os.dup(1), os.dup(2))

    def __enter__(self):
        # Assign the null pointers to stdout and stderr.
        os.dup2(self.null_fds[0], 1)
        os.dup2(self.null_fds[1], 2)

    def __exit__(self, *_):
        # Re-assign the real stdout/stderr back to (1) and (2)
        os.dup2(self.save_fds[0], 1)
        os.dup2(self.save_fds[1], 2)
        # Close the null files
        os.close(self.null_fds[0])
        os.close(self.null_fds[1])
# Set up Japanese holidays
year_list = [2017,2018,2019,2020, 2021, 2022]
holidays = make_holidays_df(year_list=year_list, country='JP')

# The market is closed on holidays, so the Target on the actual holiday dates is always 0;
# as an experiment, shift every holiday date back by one day
from datetime import timedelta
holidays['ds'] = holidays['ds'].apply(lambda x:x - timedelta(days=1))
# prophet prediction
def run_prophet(tr):
#     tr = tr[["Date","Target"]]
#     tr.rename(columns={'Target': 'y', 'Date': 'ds'}, inplace=True)
    m = Prophet(holidays=holidays,
                daily_seasonality=False,
                changepoint_prior_scale=0.01)
    with suppress_stdout_stderr():
        m.fit(tr)
    return m


# Prophet output columns to be added as features
add_features = ['trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper', 'additive_terms', 'additive_terms_lower', 'additive_terms_upper', 'holidays']

# Fit a Prophet model for each stock
# (`names`, `cod_list` and the per-stock dataframes cod_<code> are assumed to have been created earlier, e.g. names = globals())
from tqdm import tqdm
pbar = tqdm(total=2000)
count = 0
forecast_all = pd.DataFrame()
for cod in cod_list:
#     print(cod)
    names1 = globals()
    temp = names['cod_'+str(cod)][["Date","Target"]]
    temp.rename(columns={'Target': 'y', 'Date': 'ds'}, inplace=True)
    names1['m_'+str(cod)] = run_prophet(temp)
    new_feature = names1['m_'+str(cod)].predict(temp.drop('y', axis=1))
    names['cod_'+str(cod)] = pd.concat([names['cod_'+str(cod)],new_feature[add_features]],axis=1)  # append the Prophet features to this stock's training data
    # Forecast 120 days into the future
    future = names1['m_'+str(cod)].make_future_dataframe(periods=120)
    forecast = names1['m_'+str(cod)].predict(future)       # keep the forecast features for later lookup at prediction time
#     print(forecast[add_features])
    forecast = forecast[add_features+['ds']]
    forecast['SecuritiesCode'] = cod
    forecast_all = pd.concat([forecast_all,forecast],axis = 0)
    del names1['m_'+str(cod)]
    count += 1
    if count == 200:
        pbar.update(200)
        count = 0
pbar.close()

The Prophet part of the code is mainly as shown above; the rest follows the XGBoost approach described earlier, which is equivalent to appending the add_features columns to the original feature list. See the repository (link) for the complete code.
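For reference, here is a rough sketch of what that merge could look like (my assumption, not the exact code from the repository); it assumes prices still has its Date column at this point and that forecast_all was built as in the loop above:

# Hypothetical merge of the Prophet-derived columns into the XGBoost training table
forecast_all = forecast_all.rename(columns={"ds": "Date"})
prices_ext = pd.merge(prices, forecast_all, on=["Date", "SecuritiesCode"], how="left")

features_ext = features + add_features        # extend the original feature list
X = prices_ext[features_ext]
y = prices_ext["Target"]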

3.4 Final solution

The final submission went back to the XGBoost model that scored 0.297 at the beginning; after handling missing values and applying the price adjustment, the score jumped to 0.332.
The price adjustment is based on the AdjustmentFactor field. We can pick a stock whose AdjustmentFactor is not always 1 to have a look:

prices[prices.SecuritiesCode==3176].head(25) #.query("AdjustmentFactor!=1")


Adjustment code

# Adjust prices using the AdjustmentFactor
def adjust_price(price):
    from decimal import ROUND_HALF_UP, Decimal

    pcols = ["Open", "High", "Low", "Close"]

#     price.ExpectedDividend.fillna(0, inplace=True)

    def qround(x):
        return float(Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))

    def adjust_prices(df):
        df = df.sort_values("Date", ascending=False)
        df.loc[:, "CumAdjust"] = df["AdjustmentFactor"].cumprod()

        # generate adjusted prices
        for p in pcols:
            df.loc[:, p] = (df["CumAdjust"] * df[p]).apply(qround)
        df.loc[:, "Volume"] = df["Volume"] / df["CumAdjust"]
        df.ffill(inplace=True)  # fill remaining null values forward/backward
        df.bfill(inplace=True)

        # generate and fill Targets
        # df.loc[:, "Target"] = df.Close.pct_change().shift(-2).fillna(df.Target).fillna(0)
        df.Target.fillna(0, inplace=True)

        return df

    # generate adjusted prices per stock
#     price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(adjust_prices).reset_index(drop=True)
    price = price.sort_values("RowId")
    return price
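As a usage sketch (my assumption of where it fits in the pipeline): adjust the raw price table first, then rebuild the features and retrain on the adjusted columns.

# Sketch: apply the adjustment to the raw prices before the feature engineering above
prices1 = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv",
                      parse_dates=["Date"])
prices_adj = adjust_price(prices1)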

The adjusted result is 0.332 πŸ˜πŸ™Œ

I don't know where I will finally rank. Although I have not seen many 0.3+ solutions so far, I used up my submission quota on the Prophet runs, so I could not update the model when the supplemental data was added~ πŸš‘πŸš‘πŸš‘ (painful enough to send me straight off 😰😰😰)

4. Summary 😳

  1. The whole competition process was well worth it. I looked things up and learned a lot, and I am very grateful to the active people in the forum 🌺🌺🌺🌿🌿🌿. I also genuinely took part in the whole competition and stuck with it to the end hhhh 🌼🌼🌼
  2. As for the Prophet combination I tried at the end, I think it is worth revisiting when I get the chance. The article I referenced also tackles time-series prediction and reportedly works well, so I would first try to reproduce its code and then adapt it to this case.
  3. The excellent EDA shared above also contains a summary of financial feature engineering; adding features based on that article might improve the result further.
  4. I am not very familiar with Kaggle yet, so I can experiment more with submissions and file storage later.
  5. That's all I can think of for now. I'll keep organizing and practicing what I've learned recently. Thanks for reading, and if anyone wants to team up, let's compete together next time; teams are powerful~~ πŸŽπŸ‘πŸ’πŸ“πŸΈπŸΉπŸ

5. Reference links

https://www.kaggle.com/code/metathesis/feature-engineering-training-with-ta/notebook

https://www.kaggle.com/code/jiripodivin/supervised-stocks-eda-and-basic-pca

https://www.kaggle.com/code/abaojiang/jpx-detailed-eda

https://www.kaggle.com/code/genbufuthark/jpx-datafile-description-in-japanese

https://www.kaggle.com/code/chumajin/english-ver-easy-to-understand-the-competition

https://www.zhihu.com/search?q=%E5%A4%8F%E6%99%AE%E6%AF%94%E7%8E%87&utm_content=search_suggestion&type=content

https://www.kaggle.com/code/bowaka/jpx-buying-strategy-backtest

https://github.com/keras-team/keras/pull/13598/commits/c735ab5b89bbf935075c84aab3437468e1fe8245

https://www.kaggle.com/code/ikeppyo/examples-of-higher-scores-than-perfect-predictions This is a high score technique, which improves scores by reducing the standard deviation of daily profits

https://www.kaggle.com/code/paulorzp/jpx-prophet-forecasting-rolling-regression

https://stackoverflow.com/questions/45551000/how-to-control-output-from-fbprophet

prophet official website link:
https://facebook.github.io/prophet/docs/quick_start.html#python-api

https://www.geeksforgeeks.org/time-series-analysis-using-facebook-prophet/?ref=gcse

https://towardsdatascience.com/time-series-analysis-with-facebook-prophet-how-it-works-and-how-to-use-it-f15ecf2c0e3a

https://towardsdatascience.com/boost-your-time-series-forecasts-combining-gradient-boosting-models-with-prophet-features-8e738234ffd This is an article combined with prophet
