When I was preparing to write this article, I had "The Lonely Brave" playing~
In the spirit of leaving a trace behind, let me write down all the attempts I made along the way.
This competition is a little special: to make offline testing easier, data from the prediction period is included in the supplemental dataset, so there is a data-leakage problem. The current leaderboard rankings are mostly people playing with the leaked data.
So far I have seen few shared solutions above 0.3 that do not rely on the leak. We can wait until the competition ends in October and then look at the high-scoring solutions.
Competition link ↓↓ (the data download address is in the link):
https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction
Contents
1. Competition analysis
1.1 Basic information
JPX, the Japan Exchange Group, asks competitors to build models on financial data from the Japanese market and to predict real returns for a period after the training window.
1.2 Provided data (competition input)
Data folders:
Folder | Description |
---|---|
data_specifications | Definitions of each field |
jpx_tokyo_market_prediction | Files that enable the prediction API. The API is expected to deliver all rows within five minutes and to use less than 0.5 GB of memory (a minimal usage skeleton follows after this table) |
train_files | Data covering the main training period |
supplemental_files | Supplementary training data with a dynamic window. Updated with new data at the main stages of the competition: early May, early June, and roughly a week before submissions lock |
example_test_files | Data covering the public test period, intended to make offline testing easier |
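A minimal usage skeleton of the prediction API (the Rank values here are placeholders; the full submission loop I actually used is in section 3.2):

import jpx_tokyo_market_prediction

env = jpx_tokyo_market_prediction.make_env()   # the environment can only be created once per session
iter_test = env.iter_test()                    # yields one trading day at a time
for prices, options, financials, trades, secondary_prices, sample_prediction in iter_test:
    # Fill in the Rank column (0-1999) for the current day, then hand the frame back
    sample_prediction["Rank"] = range(len(sample_prediction))   # placeholder ranking
    env.predict(sample_prediction[["Date", "SecuritiesCode", "Rank"]])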
Let's mainly look at the basic information of the data in the train_files training folder.
stock_prices.csv
Column descriptions for stock_prices:
column_name | Description |
---|---|
RowId | Unique ID of the record, a combination of date and securities code |
Date | Trade date |
SecuritiesCode | Local securities code |
Open | Opening price |
High | Highest price of the day |
Low | Lowest price of the day |
Close | Closing price |
Volume | Trading volume |
AdjustmentFactor | Adjustment factor |
ExpectedDividend | Expected dividend |
SupervisionFlag | Flag for securities under supervision or to be delisted |
Target | Rate of change of the adjusted closing price between t+2 and t+1, where t+0 is the trade date |
I didn't really use the other tables; I looked up the meaning of their fields and will put the notes here:
Link (put it up later)
1.3 Submission format (competition output)
From the sample submission file sample_submission.csv in the example_test_files folder, you can see that the submitted result is a ranking of the 2000 stocks for each day in the given date range.
So how is this ranking used?
The competition metric is given in the evaluation documentation.
Here $C(k,t)$ is the closing price of stock $k$ on day $t$, and $r(k,t)$ (the Target in our training data) is computed as

$$r(k,t) = \frac{C(k, t+2) - C(k, t+1)}{C(k, t+1)}$$
You can calculate the Target column from the Close column; it's the return from buying a stock the next day and selling the day after that
import pandas as pd

# Read stock_prices and verify the Target calculation (train_files_dir points to the train_files folder)
df_price = pd.read_csv(f"{train_files_dir}/stock_prices.csv", nrows=10000)
df_need = df_price[df_price["SecuritiesCode"] == 1301][["RowId", "SecuritiesCode", "Close", "Target"]]
df_need["Close_shift1"] = df_need["Close"].shift(-1)   # close at t+1
df_need["Close_shift2"] = df_need["Close"].shift(-2)   # close at t+2
df_need["rate"] = (df_need["Close_shift2"] - df_need["Close_shift1"]) / df_need["Close_shift1"]
We can see that r(k,t) computed this way matches the given Target.
The return assumes buying at the next day's close and selling at the close of the day after; this is what each day's Target measures.
In other words, a day's Target is determined by the closing prices of the following two days (t+1 and t+2).
During prediction, only the current day's data is given each time; we predict that day's Target and rank the 2000 stocks by it.
Note that rank is 0-1999, not 1-2000
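As a toy sketch (synthetic scores stand in for real model predictions), the daily ranking is just an argsort of the per-stock predictions:

import numpy as np
import pandas as pd

# 2000 stocks on one day, each with a synthetic predicted Target
one_day = pd.DataFrame({
    "SecuritiesCode": np.arange(2000),
    "Prediction": np.random.randn(2000),
})
one_day = one_day.sort_values("Prediction", ascending=False)
one_day["Rank"] = np.arange(2000)   # 0..1999: Rank 0 = highest predicted return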
1.4 Evaluation metric
The remaining part is how our results are evaluated: we just submit the daily ranking in the format above, and everything else runs through the API (which prevents using data from after the prediction time).
Submissions are evaluated on the Sharpe ratio of the daily spread returns.
The return for a single day treats the 200 highest-ranked stocks (ranks 0 to 199) as purchased and the 200 lowest-ranked stocks (ranks 1999 down to 1800) as shorted.
(In other words, the top 200 are treated as bought and the bottom 200 as sold short.)
The stocks are then weighted by their ranks, and the total return of the portfolio is calculated assuming the stocks were purchased the next day and sold the day after that.
So the daily spread return is the difference between the linearly weighted returns of the top 200 and the bottom 200 stocks: the top 200 Targets times their weights minus the bottom 200 Targets times their weights. The final score is the mean of these daily returns over the evaluation period divided by their standard deviation (a small code sketch follows below).
$$\mathrm{Score} = \frac{\operatorname{mean}(R_{\mathrm{day},t})}{\operatorname{std}(R_{\mathrm{day},t})}$$

where $R_{\mathrm{day},t}$ is the daily spread return described above.
Higher final scores rank higher on the leaderboard.
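To make the metric concrete, here is a small sketch of the daily spread return and the final score written from the description above. I read the "linear weights" as going from 2 down to 1 over the 200 ranks and being normalized by their mean; treat the details as my interpretation rather than the exact official grader.

import numpy as np
import pandas as pd

def daily_spread_return(day_df, portfolio_size=200):
    # day_df: one day's rows with columns Rank (0-1999) and Target
    weights = np.linspace(2, 1, portfolio_size)   # linear weights, Rank 0 (and Rank 1999) weighted the most
    ranked = day_df.sort_values("Rank")
    purchase = (ranked["Target"].head(portfolio_size).values * weights).sum() / weights.mean()        # top 200: bought
    short = (ranked["Target"].tail(portfolio_size).values[::-1] * weights).sum() / weights.mean()     # bottom 200: shorted
    return purchase - short

def competition_score(df):
    # df: Rank and Target for every stock and date in the evaluation period
    daily = df.groupby("Date").apply(daily_spread_return)
    return daily.mean() / daily.std()   # Sharpe-ratio-like score of the daily spread returns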
2. Exploratory data analysis
This part mainly looks at the stock price data. Although I translated the field names of the other datasets, I didn't really understand what they mean or how to use them, so I didn't go further; I later looked at what other people did with them.
2.1 Bivariate analysis
The analysis below uses data from 2021.
import seaborn as sb
import matplotlib.pyplot as plt
from datetime import datetime

# Work on one time slice at a time; the full dataset is too heavy to run in the notebook
# (df_price here is assumed to hold the full stock_prices.csv)
df_price["Date"] = pd.to_datetime(df_price["Date"])
df_price2021 = df_price[df_price.Date > datetime.strptime('2021-01-01', '%Y-%m-%d')]

# Scatter-plot matrix
g = sb.PairGrid(data=df_price2021, vars=['Open', 'High', 'Low', 'Close', 'Volume', 'Target'])
g.map_diag(plt.hist)        # map_diag puts histograms on the diagonal; otherwise the diagonal is a straight-line scatter
g.map_offdiag(plt.scatter)
It can be seen that:
- Open, High, Low and Close are strongly correlated with each other
- Volume drops quickly as price rises; high-priced stocks generally have low trading volume
- Target looks roughly normally distributed, and it is more dispersed against Volume than against the prices
Distribution of each variable over time (using the stock with the largest Target in 2021 as an example):
import matplotlib.dates as mdates

# Look at how each field of the stock with the largest 2021 Target changes over the year
SecuritiesCode2021_1 = df_price2021[df_price2021.Target == max(df_price2021.Target)].SecuritiesCode.item()
df_price2021_1 = df_price2021[df_price2021.SecuritiesCode == SecuritiesCode2021_1]
# price2021_1_timeseries = df_price2021_1.set_index("Date")  # (226, 11)

# Reshape from wide to long
df_price2021_2 = df_price2021_1[["Date", 'Open', 'High', 'Low', 'Close']]
df_price2021_2 = df_price2021_2.set_index(["Date"])
df_price2021_3 = df_price2021_2.stack().reset_index()
df_price2021_3[0] = df_price2021_3[0].astype('float64')
df_price2021_3.Date = df_price2021_3.Date.apply(lambda x: mdates.date2num(x))  # datetime objects error out here
df_price2021_3.rename(columns={0: "value"}, inplace=True)
# Line chart
ax = sb.lineplot(data=df_price2021_3, x="Date", y="value", hue='level_1')
ax = plt.gca()  # get current axis
format_str = '%Y-%m-%d'
format_ = mdates.DateFormatter(format_str)
ax.xaxis.set_major_formatter(format_)
plt.xticks(rotation=15)
plt.show()
It can be seen that:
- The stock rose sharply between August and September 2021
- Open, High, Low and Close move essentially in step
Next, draw the time curves of the prices, Target and Volume together.
The plotting can be wrapped in a function:
# Wrapped time-series line chart
def time_line(df, col_li, time_col):
    '''
    Plot the trend of each variable over time.
    :param df: dataframe containing the time column and the columns to plot
    :param col_li: list of variable names
    :param time_col: name of the time column
    :return: figure
    '''
    # Reshape from wide to long
    df_2 = df[col_li + [time_col]]
    df_2 = df_2.set_index([time_col])
    df_3 = df_2.stack().reset_index()
    df_3[0] = df_3[0].astype('float64')
    df_3[time_col] = df_3[time_col].apply(lambda x: mdates.date2num(x))  # datetime objects error out here
    if len(col_li) == 1:
        y_name = col_li[0]
    else:
        y_name = "value"
    df_3.rename(columns={0: y_name}, inplace=True)
    # Line chart of the long-format data
    if len(col_li) == 1:
        ax = sb.lineplot(data=df_3, x=time_col, y=y_name)
    else:
        ax = sb.lineplot(data=df_3, x=time_col, y=y_name, hue='level_1')
    ax = plt.gca()  # get current axis
    format_str = '%Y-%m-%d'
    format_ = mdates.DateFormatter(format_str)
    ax.xaxis.set_major_formatter(format_)
    plt.xticks(rotation=15)

# Draw the three panels together (plt.show() is called once at the end so the subplots stay in one figure)
plt.figure(figsize=[12, 5])
plt.subplot(1, 3, 1)
time_line(df_price2021_1, ['Open', 'High', 'Low', 'Close'], "Date")
plt.subplot(1, 3, 2)
time_line(df_price2021_1, ["Target"], "Date")
plt.subplot(1, 3, 3)
time_line(df_price2021_1, ["Volume"], "Date")
plt.show()
As you can see,
- Target and the stock price follow a similar trend; both have an extremely steep peak in August 2021, but the Target curve shows much sharper local jitter than the price
- Volume also fluctuates with the stock price, but its peaks lag the price peaks slightly
The stock with the lowest Target in 2021 was analysed in the same way:
- Where the stock price fluctuates strongly, the trading volume also rises significantly
- Target fluctuates around the zero line, with sharp swings during the periods when the price moves strongly
2.2 Good EDA notebooks worth a look
This is a popular EDA notebook:
https://www.kaggle.com/code/abaojiang/jpx-detailed-eda
This one is mainly a summary of feature engineering for the financial domain:
https://www.kaggle.com/code/metathesis/feature-engineering-training-with-ta/notebook
3. Model attempts
3.1 LSTM
Let's first take a single stock and see how well the prediction works.
# Import the required packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Dropout
Load data
# Set base_dir and load the data
base_dir = r"D:/stock_data"
train_files_dir = f"{base_dir}/train_files"
train_df = pd.read_csv(f"{base_dir}/train_files" + '/stock_prices.csv', parse_dates=True)
valid_df = pd.read_csv(f"{base_dir}" + '/supplemental_files/stock_prices.csv', parse_dates=True)
train_df = pd.concat([train_df, valid_df])  # combine training and supplemental data; the last 20% is used as the test set below

features = ['Open', 'High', 'Low', 'Close', 'Volume']
# Test with the stock whose code is 1332
prices = train_df.query("SecuritiesCode==1332")[features]
Split off a test set
test_split = round(len(prices) * 0.2)  # roughly 252 rows
df_for_training = prices[:-252][features]
print(df_for_training.shape)
df_for_training = df_for_training.dropna(how='any')
print(df_for_training.shape)
df_for_testing = prices[-252:][features]
Scale the data
scaler = MinMaxScaler(feature_range=(0, 1))
# Scale the features to [0, 1]
df_for_training_scaled = scaler.fit_transform(df_for_training)
df_for_testing_scaled = scaler.transform(df_for_testing)
Generate training data and test data
# createXY: build sliding windows of n_past days
def createXY(dataset, n_past):
    dataX = []
    dataY = []
    for i in range(n_past, len(dataset)):
        dataX.append(dataset[i - n_past:i, 0:dataset.shape[1]])  # e.g. [0:30, 0:5], the previous 30 days of all features
        dataY.append(dataset[i, -2])  # the Close value on day 30, i.e. the day right after the window
    return np.array(dataX), np.array(dataY)

# Generate the data
trainX, trainY = createXY(df_for_training_scaled, 30)
# trainX.shape
testX, testY = createXY(df_for_testing_scaled, 30)
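A quick sanity check of the window shapes (synthetic data standing in for the scaled prices):

# 100 synthetic days x 5 features, 30-day lookback window
dummy = np.random.rand(100, 5)
X_chk, y_chk = createXY(dummy, 30)
print(X_chk.shape)  # (70, 30, 5): each sample is a 30-day window over all 5 features
print(y_chk.shape)  # (70,): the Close value (column index -2) on the day right after each window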
Regression with Keras
# Regression with Keras (KerasRegressor / GridSearchCV are imported but not used below)
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV

grid_model = Sequential()
grid_model.add(LSTM(50, return_sequences=True, input_shape=(30, 5)))
grid_model.add(LSTM(50))
grid_model.add(Dropout(0.2))
grid_model.add(Dense(1))
# See the Keras docs for the parameter definitions
grid_model.compile(loss='mse', optimizer='adam')

# Fit on the training data
history = grid_model.fit(trainX, trainY, epochs=10, batch_size=30, validation_data=(testX, testY))
Plot the loss curves to check the training
from matplotlib import pyplot as plt

plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()
Make predictions and compare the results with real data
prediction = grid_model.predict(testX)
# The scaler was fitted on 5 columns, so tile the single predicted column 5 times,
# inverse-transform, and keep one column to get back to the original price scale
prediction_copies_array = np.repeat(prediction, 5, axis=-1)
pred = scaler.inverse_transform(np.reshape(prediction_copies_array, (len(prediction), 5)))[:, 0]

# Do the same for the real test values
original_copies_array = np.repeat(testY, 5, axis=-1)
# original_copies_array.shape
original = scaler.inverse_transform(np.reshape(original_copies_array, (len(testY), 5)))[:, 0]
Plot the predictions against the real prices
plt.plot(original, color='red', label='Real Stock Price')
plt.plot(pred, color='blue', label='Predicted Stock Price')
plt.title('Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Stock Price')
plt.legend()
plt.show()
- Stock 1332 (the example above)
- Stock 1377 (a second example)
The predicted curve roughly follows the real data, but the small jitters are not captured. This competition mainly rewards the short-term returns generated by daily price fluctuations, so either we adjust the model until it fits those small movements as well as possible, predict the closing prices of the next two days accurately and compute the required Target from them, or we skip the price entirely and try to predict the Target directly, e.g. with XGBoost (tried next).
One more note:
On the LSTM side, the notebook below produces a curve that looks much closer to the real fluctuations of the stock price. I only looked at the figures and did not reproduce it, so I don't know how well it actually performs or whether the figure shows its best result. Take a look if you're interested: https://www.kaggle.com/code/onurkoc83/multivariate-lstm-close-open-high-low-volume
3.2 XGBoost
After reading cases shared on the forum, I tried predicting the Target directly with XGBoost. With some feature selection, Optuna for hyperparameter tuning, and GPU acceleration, the best result reached 0.297, and I couldn't push it any higher.
# Import the corresponding modules
import os
import traceback
import numpy as np
import pandas as pd
import xgboost as xgb
from tqdm import tqdm
import jpx_tokyo_market_prediction
import warnings; warnings.filterwarnings("ignore")

prices1 = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv", parse_dates=True)
# Keep only the prices data after 2020
prices1 = prices1[prices1.Date > '2020-01-01']
prices = prices1.copy()
prices = prices.drop(["ExpectedDividend"], axis=1)
prices.isnull().sum()
# Drop null values
prices = prices.dropna(how='any')
prices.isnull().sum()
# Get each stock's closing price on the previous day
cc = prices.groupby("SecuritiesCode").apply(lambda df: df['Close'].shift(1))
cc = pd.DataFrame(cc).reset_index(level=0)
prices = pd.merge(prices, cc['Close'], left_index=True, right_index=True)
prices.head()
prices.tail()
prices['delta'] = prices['Close_x'] - prices['Close_y']

# Flag whether the stock went up or down versus the previous day
def getadvance(x):
    ret = 0
    if x > 0:
        ret = 1
    return ret

prices['advance'] = list(map(getadvance, prices['delta']))
prices['Date'] = pd.to_datetime(prices['Date'], format="%Y-%m-%d")

def get_month(dt):
    # Extract the month
    x = dt.strftime("%m")
    return x

prices['Month'] = list(map(get_month, prices['Date']))
prices.rename(columns={"Close_x": "Close"}, inplace=True)
prices.head()
def upper_shadow(df):
    return df['High'] - np.maximum(df['Close'], df['Open'])

def lower_shadow(df):
    return np.minimum(df['Close'], df['Open']) - df['Low']

prices['Upper_Shadow'] = upper_shadow(prices)
prices['Lower_Shadow'] = lower_shadow(prices)
# Final feature list
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor',
            'SupervisionFlag', 'delta', 'advance', 'Month', 'Upper_Shadow', 'Lower_Shadow']
prices = prices.dropna(how='any')
prices.isnull().sum()
del prices['Date']
# Convert selected columns to the category dtype
def cat_col(data):
    data['SecuritiesCode'] = data['SecuritiesCode'].astype('category')
    data['SupervisionFlag'] = data['SupervisionFlag'].astype('category')
    data['advance'] = data['advance'].astype('category')
    data['AdjustmentFactor'] = data['AdjustmentFactor'].astype('category')
    data['Month'] = data['Month'].astype('category')
    return data

prices = cat_col(prices)
X = prices[features]
y = prices['Target']

# Optuna hyperparameter search + model training
import optuna

def objectives(trial):
    param = {
        'tree_method': 'gpu_hist',
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
        'subsample': trial.suggest_categorical('subsample', [0.4, 0.6, 0.8, 1.0]),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3, 0.5, 0.7, 0.9, 1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008, 0.01, 0.02, 0.05]),
        'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
        'max_depth': trial.suggest_categorical('max_depth', [5, 9, 13, 15, 17, 20]),
        'random_state': trial.suggest_categorical('random_state', [24, 48, 2020]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
    }
    model = xgb.XGBRegressor(**param, enable_categorical=True)
    model.fit(X, y)
    score = model.score(X, y)
    return score

studyxgb = optuna.create_study(direction='maximize', sampler=optuna.samplers.RandomSampler(seed=0))
studyxgb.optimize(objectives, n_trials=5)

trial = studyxgb.best_trial
params_best = dict(trial.params.items())
print(params_best)
# params_best['random_seed'] = 0

model = xgb.XGBRegressor(**params_best, enable_categorical=True, tree_method='gpu_hist')

# Print the best parameters
print('study.best_params:', studyxgb.best_trial.value)
print('Number of finished trials:', len(studyxgb.trials))
print('Best trial:', studyxgb.best_trial.params)
print('study.best_params:', studyxgb.best_params)
print(model.tree_method)

model.fit(X, y)
model.score(X, y)
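One caveat with the objective above: it scores the model on the same rows it was fitted on, so Optuna mostly rewards overfitting. A time-aware alternative I would try is to score each trial with sklearn's TimeSeriesSplit instead (a sketch under the assumption that X and y are ordered by Date; used with direction='maximize' as before):

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def objective_cv(trial):
    param = {
        'tree_method': 'gpu_hist',
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008, 0.01, 0.02, 0.05]),
        'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
        'max_depth': trial.suggest_categorical('max_depth', [5, 9, 13]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
    }
    model = xgb.XGBRegressor(**param, enable_categorical=True)
    # Each validation fold lies strictly after its training folds in time
    cv = TimeSeriesSplit(n_splits=3)
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
    return scores.mean()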
Submit
import jpx_tokyo_market_prediction

env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

all_data = prices1.copy()

# Get a stock's most recent closing price on or before a given date
def latest_close(SecuritiesCode, date):
    temp = all_data[all_data.SecuritiesCode == SecuritiesCode].sort_values(by=["Date"], ascending=False)
    temp = temp[temp.Date <= date]
    return temp.iloc[0]['Close']   # iloc[0] is the most recent row, since the frame is sorted by Date descending

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    # print(prices)
    # del prices['Date']
    all_data = pd.concat([all_data, prices])
    # prices["Avg"] = sample_prediction["SecuritiesCode"].apply(get_avg)
    # Rebuild the same features as in training
    prices['Close_y'] = prices.apply(lambda x: latest_close(x.SecuritiesCode, x.Date), axis=1)
    prices['delta'] = prices['Close'] - prices['Close_y']
    prices['advance'] = list(map(getadvance, prices['delta']))
    prices['Date'] = pd.to_datetime(prices['Date'], format="%Y-%m-%d")
    prices['Month'] = list(map(get_month, prices['Date']))
    prices = cat_col(prices)
    prices['Date'] = prices['Date'].dt.strftime("%Y%m%d").astype(int)
    prices['Upper_Shadow'] = upper_shadow(prices)
    prices['Lower_Shadow'] = lower_shadow(prices)
    securities = prices["SecuritiesCode"]
    prices = prices[features]
    print('-------------------------------prices------------------------------')
    print(prices)
    print('------------------------------------------------------------------------------')
    # Predict the Target and turn it into a 0-1999 ranking
    sample_prediction["Prediction"] = model.predict(prices)
    # sample_prediction['SecuritiesCode'] = securities
    print('-------sample_prediction--------')
    print(sample_prediction)
    sample_prediction = sample_prediction.sort_values(by="Prediction", ascending=False)
    sample_prediction.Rank = np.arange(0, 2000)
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    sample_prediction.drop(["Prediction"], axis=1)
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]
    print('-------------------------------submission------------------------------')
    print(submission)
    print('------------------------------------------------------------------------------')
    env.predict(submission)
3.3 Other attempts
3.3.1 Short-window linear fitting
A simpler, cruder alternative to predicting prices with LSTM: fit a straight line through the last two or three days of closing prices and extrapolate it to the next two days as a rough estimate.
code snippet
# Fit the stock price for the next two days
df.Date = pd.to_datetime(df.Date, format="%Y-%m-%d")
df = df.set_index(['Date'])
df['day3'] = df.Close.rolling(window=3).apply(lambda y: np.poly1d(np.polyfit([0, 1, 2], y, 1))(3), raw=True)
df['day2'] = df.Close.rolling(window=2).apply(lambda y: np.poly1d(np.polyfit([0, 1], y, 1))(2), raw=True)
df.reset_index()
This rough method works reasonably well; it scores around 0.11-0.14.
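To connect this back to the Target definition from section 1.3, one way (my own sketch, reusing the same rolling linear fit) is to extrapolate the fit one and two steps ahead and take the relative change between the two extrapolated closes:

# Extrapolate the 3-day linear fit to t+1 and t+2, then approximate Target
close_t1 = df.Close.rolling(window=3).apply(lambda y: np.poly1d(np.polyfit([0, 1, 2], y, 1))(3), raw=True)  # estimated close at t+1
close_t2 = df.Close.rolling(window=3).apply(lambda y: np.poly1d(np.polyfit([0, 1, 2], y, 1))(4), raw=True)  # estimated close at t+2
df['target_est'] = (close_t2 - close_t1) / close_t1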
3.3.2 Prophet
Later I mainly wanted to improve the XGBoost result a little. Simply adding other people's extra features didn't help much (e.g. https://www.kaggle.com/code/junjitakeshima/jpx-add-new-features-eng).
So I started thinking about how to combine Prophet with XGBoost and see what happens (and the pain began... hhhh).
My main reference was an article on combining Prophet features with gradient boosting (the towardsdatascience link in the references).
After that, most of my time went into:
- solving the installation problems of pystan (worth a separate blog post some day)
- suppressing pystan's overly verbose printing when using Prophet (notes on Prophet usage and logging will also go in a separate post!!!)
- adjusting the code to avoid running out of memory when forecasting 2000 stocks, and to stay within Kaggle's time limit
The final result of this attempt was not ideal: I'm not proficient with Prophet and had no time left to tune it. Part of the code is below; let me know if you see ways to optimize it.
# Prophet section
import os
from prophet import Prophet
from prophet.make_holidays import make_holidays_df
import logging
logging.getLogger('prophet').setLevel(logging.WARNING)

# Suppress pystan's printing
class suppress_stdout_stderr(object):
    '''
    A context manager for doing a "deep suppression" of stdout and stderr in
    Python, i.e. will suppress all print, even if the print originates in a
    compiled C/Fortran sub-function.
    This will not suppress raised exceptions, since exceptions are printed
    to stderr just before a script exits, and after the context manager has
    exited (at least, I think that is why it lets exceptions through).
    '''
    def __init__(self):
        # Open a pair of null files
        self.null_fds = [os.open(os.devnull, os.O_RDWR) for x in range(2)]
        # Save the actual stdout (1) and stderr (2) file descriptors.
        self.save_fds = (os.dup(1), os.dup(2))

    def __enter__(self):
        # Assign the null pointers to stdout and stderr.
        os.dup2(self.null_fds[0], 1)
        os.dup2(self.null_fds[1], 2)

    def __exit__(self, *_):
        # Re-assign the real stdout/stderr back to (1) and (2)
        os.dup2(self.save_fds[0], 1)
        os.dup2(self.save_fds[1], 2)
        # Close the null files
        os.close(self.null_fds[0])
        os.close(self.null_fds[1])
# Set up Japanese holidays
year_list = [2017, 2018, 2019, 2020, 2021, 2022]
holidays = make_holidays_df(year_list=year_list, country='JP')

# There is no trading on holidays, so on the real holiday dates the Target is always 0;
# as an experiment, shift every holiday one day earlier
from datetime import timedelta
holidays['ds'] = holidays['ds'].apply(lambda x: x - timedelta(days=1))
# Prophet prediction
def run_prophet(tr):
    # tr = tr[["Date","Target"]]
    # tr.rename(columns={'Target': 'y', 'Date': 'ds'}, inplace=True)
    m = Prophet(holidays=holidays, daily_seasonality=False, changepoint_prior_scale=0.01)
    with suppress_stdout_stderr():
        m.fit(tr)
    return m

# Features added by Prophet
add_features = ['trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
                'additive_terms', 'additive_terms_lower', 'additive_terms_upper', 'holidays']

# Fit a Prophet model for each stock
# (assumes cod_list and the per-stock frames cod_<code> were created earlier, with names = globals())
from tqdm import tqdm
pbar = tqdm(total=2000)
count = 0
forecast_all = pd.DataFrame()
for cod in cod_list:
    # print(cod)
    names1 = globals()
    temp = names['cod_' + str(cod)][["Date", "Target"]]
    temp.rename(columns={'Target': 'y', 'Date': 'ds'}, inplace=True)
    names1['m_' + str(cod)] = run_prophet(temp)
    # Generate the new features for the training data
    new_feature = names1['m_' + str(cod)].predict(temp.drop('y', axis=1))
    names['cod_' + str(cod)] = pd.concat([names['cod_' + str(cod)], new_feature[add_features]], axis=1)
    # Forecast a window into the future and keep the features for later lookup
    future = names1['m_' + str(cod)].make_future_dataframe(periods=120)
    forecast = names1['m_' + str(cod)].predict(future)
    # print(forecast[add_features])
    forecast = forecast[add_features + ['ds']]
    forecast['SecuritiesCode'] = cod
    forecast_all = pd.concat([forecast_all, forecast], axis=0)
    del names1['m_' + str(cod)]
    count += 1
    if count == 200:
        pbar.update(200)
        count = 0
pbar.close()
The Prophet part of the code is mainly as shown above; the rest follows the XGBoost approach from section 3.2, which amounts to appending the add_features columns to the original feature set. See git (link) for the complete code.
3.4 Final scheme
The final submission went back to the XGBoost model that scored 0.297 at the beginning; after handling missing values and applying the price adjustment, the score jumped to 0.332.
The price adjustment is based on the AdjustmentFactor field. You can pick a stock whose AdjustmentFactor is not always 1 and take a look:
prices[prices.SecuritiesCode==3176].head(25) #.query("AdjustmentFactor!=1")
Adjustment code
# Adjust prices using the cumulative adjustment factor
def adjust_price(price):
    from decimal import ROUND_HALF_UP, Decimal
    pcols = ["Open", "High", "Low", "Close"]
    # price.ExpectedDividend.fillna(0, inplace=True)

    def qround(x):
        return float(Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))

    def adjust_prices(df):
        df = df.sort_values("Date", ascending=False)
        df.loc[:, "CumAdjust"] = df["AdjustmentFactor"].cumprod()
        # generate adjusted prices
        for p in pcols:
            df.loc[:, p] = (df["CumAdjust"] * df[p]).apply(qround)
        df.loc[:, "Volume"] = df["Volume"] / df["CumAdjust"]
        # fill remaining null values forward/backward
        df.ffill(inplace=True)
        df.bfill(inplace=True)
        # generate and fill Targets
        # df.loc[:, "Target"] = df.Close.pct_change().shift(-2).fillna(df.Target).fillna(0)
        df.Target.fillna(0, inplace=True)
        return df

    # generate adjusted prices per stock
    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(adjust_prices).reset_index(drop=True)
    price = price.sort_values("RowId")
    return price
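A usage sketch (assuming prices holds the raw stock_prices rows, as loaded at the start of section 3.2):

prices = adjust_price(prices)
prices[prices.SecuritiesCode == 3176].head(25)   # the stock viewed above, now with adjusted prices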
With the adjustment, the result reached 0.332.
I don't know where I will end up in the final ranking. Although I haven't seen many 0.3+ solutions so far, I couldn't update my submission when the new supplemental data needed to be added, because the Prophet experiments had used up my submission attempts~ (painful enough to send me off on the spot)
4. Summary
- The whole competition was a valuable experience. I looked things up and learned a lot, and I'm grateful to the active people in the forum. And I genuinely took part from start to finish and stuck with it until the end hhhh
- The Prophet-combined approach I tried at the end is worth revisiting if I get the chance. The reference article also tackles time-series forecasting and reports better results, so the plan would be to reproduce its code first and then adapt it to this competition.
- The excellent EDA notebooks above also include a summary of financial feature engineering; adding some features from that article might improve the result.
- I'm still not very familiar with Kaggle, so I can experiment more with submissions and file storage next time.
- That's all I can think of for now; this post is mainly to sort out what I've learned recently. Thanks for reading this far, and if anyone wants to team up for the next competition, let's do it, teamwork is powerful~~
5. Reference links
https://www.kaggle.com/code/metathesis/feature-engineering-training-with-ta/notebook
https://www.kaggle.com/code/jiripodivin/supervised-stocks-eda-and-basic-pca
https://www.kaggle.com/code/abaojiang/jpx-detailed-eda
https://www.kaggle.com/code/genbufuthark/jpx-datafile-description-in-japanese
https://www.kaggle.com/code/chumajin/english-ver-easy-to-understand-the-competition
https://www.kaggle.com/code/bowaka/jpx-buying-strategy-backtest
https://github.com/keras-team/keras/pull/13598/commits/c735ab5b89bbf935075c84aab3437468e1fe8245
https://www.kaggle.com/code/ikeppyo/examples-of-higher-scores-than-perfect-predictions This is a high score technique, which improves scores by reducing the standard deviation of daily profits
https://www.kaggle.com/code/paulorzp/jpx-prophet-forecasting-rolling-regression
https://stackoverflow.com/questions/45551000/how-to-control-output-from-fbprophet
prophet official website link:
https://facebook.github.io/prophet/docs/quick_start.html#python-api
https://www.geeksforgeeks.org/time-series-analysis-using-facebook-prophet/?ref=gcse
https://towardsdatascience.com/boost-your-time-series-forecasts-combining-gradient-boosting-models-with-prophet-features-8e738234ffd An article on combining gradient-boosting models with Prophet features