TRADING FOR PROFIT — MACHINE LEARNING WAY
INTRODUCTION
This post analyses a Kaggle code competition whose goal is to develop a predictive model for profitable stock trading using historical data and mathematical and computational tools. The problem focuses on making profitable trading decisions in real time. Electronic trading across hundreds of stock exchanges worldwide can execute thousands of transactions in a fraction of a second, presenting an enormous number of profit opportunities in an idealised, perfectly efficient market where every participant has access to all information. In practice the market is not perfectly efficient, because not all data is available to everyone, so identifying opportunities that will actually be profitable at a given point in time is a very complex problem.
Objective
The objective of the project is to build a model that is as highly predictive as possible and that, when presented with a trading opportunity, suggests whether to accept or reject the trade.
Business constraints
1. The cost of misclassifying the negative class can be very high.
2. There are no very strict latency constraints.
3. Interpretability is not very important.
Metrics
1. The primary metric is ROC-AUC: we need a high true positive rate (TPR) and a low false positive rate (FPR), since a high FPR increases losses while a low TPR forgoes profits, and ROC-AUC captures both (a small computation sketch follows this list).
2. Macro F1 score is used as a secondary metric.
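The snippet below is a minimal sketch of how both metrics can be computed with scikit-learn; the arrays are placeholder values, not actual model outputs.

```python
# Minimal sketch of computing both metrics with scikit-learn.
# y_true, y_score and y_pred are placeholders, not real model outputs.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                      # actual labels
y_score = np.array([0.7, 0.4, 0.6, 0.8, 0.3, 0.55, 0.65, 0.2])   # predicted P(label = 1)
y_pred = (y_score >= 0.5).astype(int)                             # thresholded predictions

print("ROC-AUC :", roc_auc_score(y_true, y_score))   # uses scores, captures the TPR/FPR trade-off
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
```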
Methodology: The historical data is modelled using machine learning techniques. The process involved the following steps:
1. Data import and Pre-processing
2. Exploratory data analysis
a. Univariate analysis
b. Bivariate analysis
c. Multivariate analysis
3. Feature engineering
4. Machine learning model development and validation of the models
a. Base line model
b. Logistic regression
c. Decision tree
d. Random forest
e. XGBoost
5. Results
6. Kaggle score
Import and Pre-processing of raw data: The Kaggle dataset for the problem, containing 130 anonymized features representing real stock market data, was imported from a csv file. Each row in the dataset represents a trading opportunity. In addition to the features, the dataset also includes weight, resp, resp_{1,2,3,4}, date and ts_id columns. date is the trading date and ts_id gives the time ordering of the trades. resp_{1,2,3,4} are returns over different time horizons. The weight column is either positive or zero, and weight*resp represents the profit (when positive) or loss (when negative) of a trading opportunity. In the pre-processing stage, a new column named ‘label’ was created and set to 1 when resp is positive and 0 otherwise. All rows with zero weight were removed, as they represent neither profit nor loss and do not contribute to the scoring evaluation. Since the test data does not contain the columns for returns over different time horizons, resp_{1,2,3,4} are not usable for building the model; they are nevertheless retained to explore how resp varies over different time horizons.
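A minimal pandas sketch of this step is given below; the file name train.csv and the exact column names are assumptions about the competition files, not something stated in this write-up.

```python
# Sketch of the pre-processing described above (pandas assumed; the
# file name "train.csv" is an assumption about the competition file).
import pandas as pd

train = pd.read_csv("train.csv")

# Binary target: 1 when resp is positive, 0 otherwise.
train["label"] = (train["resp"] > 0).astype(int)

# Rows with zero weight contribute neither profit nor loss to the
# evaluation, so they are dropped before modelling.
train = train[train["weight"] > 0].reset_index(drop=True)

print(train.shape, train["label"].mean())
```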
Exploratory data analysis: To understand the nature of the data and the relationships across features, exploratory data analysis is carried out with univariate, bivariate and multivariate methods, as detailed below:
Univariate analysis: The balance of the target is checked by plotting a bar chart of the ‘label’ values, as given below:
The results show that the opportunities for profit and loss are balanced. Next, the probability density function (PDF) of the resp values is plotted; the distribution has a long tail, which suggests the presence of outliers that should be dealt with.
Percentiles of the resp values are calculated, and both the percentile values and a scatter plot indicate the existence of outliers in the data.
The cumulative return of the resp values (their cumulative sum in time order) is computed and plotted after removing outliers.
In general, the curve shows a positive upward trend with some local declines, so it is not monotonically increasing. The same cumulative plots are drawn for resp and resp_{1,2,3,4} to see the variation over different time horizons.
It can be observed that resp_4 and resp have very similar curves, so resp_4 appears to be the time horizon closest to resp, with resp_3, resp_2 and resp_1 progressively further away. As the time horizon lengthens, the local differences even out and the curves become less and less noisy.
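The sketch below reproduces this comparison under stated assumptions: it reuses the train frame from the pre-processing sketch, clips each return column at its 1st and 99th percentiles (the exact outlier cut-offs used in the study are not stated), and plots the cumulative sum of each return column in ts_id order.

```python
# Sketch of the outlier check and the cumulative-return comparison.
# The 1st/99th percentile clipping is an assumption, not the study's
# exact threshold; "train" is the frame from the pre-processing sketch.
import numpy as np
import matplotlib.pyplot as plt

# Percentile check on resp to expose the long tail.
print(np.percentile(train["resp"], [0, 1, 25, 50, 75, 99, 100]))

resp_cols = ["resp", "resp_1", "resp_2", "resp_3", "resp_4"]
ordered = train.sort_values("ts_id")

for col in resp_cols:
    lo, hi = np.percentile(ordered[col], [1, 99])          # clip outliers
    plt.plot(ordered[col].clip(lo, hi).cumsum().values, label=col)

plt.xlabel("trade (ts_id order)")
plt.ylabel("cumulative return")
plt.legend()
plt.show()
```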
Now, out of the 130 features, the 10 most important ones (0, 17, 20, 30, 84, 87, 95, 102, 103, 109) are identified using recursive feature elimination and subjected to univariate analysis.
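A hedged sketch of the selection step follows: the estimator (a shallow decision tree), the subsample size and the elimination step size are assumptions made to keep the example light; only the number of selected features (10) comes from the text above.

```python
# Sketch of recursive feature elimination down to 10 features.
# Estimator, subsample size and step size are assumptions.
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

feature_cols = [c for c in train.columns if c.startswith("feature")]
sample = train.sample(50_000, random_state=42)          # subsample to keep RFE tractable
X, y = sample[feature_cols].fillna(0), sample["label"]

rfe = RFE(estimator=DecisionTreeClassifier(max_depth=5, random_state=42),
          n_features_to_select=10, step=10)
rfe.fit(X, y)

selected = [c for c, keep in zip(feature_cols, rfe.support_) if keep]
print(selected)
```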
Feature_0: Of the 10 features identified for analysis, feature_0 seems particularly interesting and important because it takes just two values (+1 and -1).
The cumulative return of resp*weight is plotted separately for feature_0 = -1 and feature_0 = +1; it shows a general upward trend when feature_0 is -1 and a general downward trend when feature_0 is +1.
When the cumulative plots of resp and resp*weight are drawn for feature_0 = -1, both variables show a similar, generally increasing pattern.
However, for feature_0 = +1, the resp curve stays roughly flat while the resp*weight curve declines.
For the -1 and +1 classes of feature_0, a box plot of the resp data shows no noticeable difference in distribution, whereas the PDFs in a violin plot show less variance for the -1 class than for the +1 class.
Feature_17:
The probability density function (PDF) of feature_17 looks very similar to a Gaussian, except near 0 where there are significantly fewer values than elsewhere; the distributions for the two labels are almost indistinguishable. The CDF reveals that values with label = 1 are slightly lower in general. The box plot also shows almost identical results for both labels. Several other features have distributions similar to this one and appear to originate from the same source over different time horizons.
Feature_20 has a distribution similar to that of feature_17, except with a larger spread. The label-wise distributions are again indistinguishable, and the box plot shows the same. In the CDFs of the two classes the trends are very similar, with no observable difference.
Feature 30 also follows a trend similar to the two features above and may be the same underlying quantity observed over a different time horizon.
Feature 84
This feature has a very long tail in the positive direction, and no class-wise differentiation can be seen in the class-wise PDFs; the CDFs of the two classes are also very close to each other. Since this feature was nevertheless selected by recursive feature elimination, it may become significant when combined with other features.
Feature 87 is quite different from the others, as can be seen from its PDF. A good chunk of points share the same value, but this still does not help to differentiate the classes. The CDFs are also very similar for both labels.
Feature 95 looks similar to a Gaussian but is positively skewed. This feature also exhibits a different type of distribution compared to the others observed so far.
Feature 102 has a distribution similar to that of feature 95, but the two may correspond to different time horizons. There is a considerable number of values close to -5, a trend also observed in feature 109.
Features 103 and 109 are similar to feature 102, but here there is a very large number of values near -4, and this concentration seems to grow across the group. Since the features are anonymized, it is not clear whether it grows as the time horizon increases or as it decreases.
(Distribution plots of feature 103 and feature 109.)
Bivariate Analysis:
Even with bivariate analysis we still cannot find a clear separation between the classes in the pair plots below; the classes appear completely mixed in most cases, although some minor clustering is visible for a few features. The most clustering is found in plots involving feature 84.
In particular, the bivariate plot of features 17 and 84 seems to show good clustering in different parts of the plot. However, no clear-cut relationships can be established using bivariate analysis.
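The sketch below shows how such pair plots can be drawn with seaborn; the subset of features and the subsample size are assumptions chosen to keep the plot readable.

```python
# Sketch of the bivariate pair plots coloured by label.
# The feature subset and subsample size are assumptions.
import seaborn as sns
import matplotlib.pyplot as plt

pair_cols = ["feature_17", "feature_84", "feature_95", "feature_102"]
sample = train.sample(10_000, random_state=42)

sns.pairplot(sample, vars=pair_cols, hue="label",
             plot_kws={"s": 5, "alpha": 0.3})
plt.show()
```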
Multi-variate analysis:
Multivariate analysis of the data is carried out on all features using t-SNE.
Several small clusters can be seen, but they cannot be separated by a simple boundary. None of the univariate, bivariate or multivariate analyses could clearly demarcate criteria for prediction.
Most of the features appear very similar, with no obvious differences between them, so we may need as many columns as possible to be able to differentiate the classes.
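A sketch of the t-SNE projection follows; the subsample size and perplexity are assumptions, since fitting t-SNE on the full dataset would be prohibitively slow.

```python
# Sketch of the t-SNE projection of all 130 features, coloured by label.
# Subsample size and perplexity are assumptions.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feature_cols = [c for c in train.columns if c.startswith("feature")]
sample = train.sample(5_000, random_state=42)
X = sample[feature_cols].fillna(0).values

emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=sample["label"], s=5, alpha=0.5, cmap="coolwarm")
plt.title("t-SNE projection of the 130 features")
plt.show()
```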
Feature engineering:
As the given features are anonymized and their meaning is not clear, it is not possible to engineer new features manually. Instead, an autoencoder is used to engineer 15 new features from the existing 130. A sketch of the autoencoder body is given below.
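This is a minimal Keras version; the hidden-layer sizes, activations and training settings are assumptions, and only the 130-to-15 bottleneck comes from the description above. The ae_* column names are hypothetical.

```python
# Minimal sketch of a dense autoencoder with a 15-unit bottleneck.
# Layer sizes, activations and training settings are assumptions;
# only the 130 -> 15 encoding comes from the text above.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 130

inputs = keras.Input(shape=(n_features,))
x = layers.Dense(64, activation="relu")(inputs)
encoded = layers.Dense(15, activation="relu", name="bottleneck")(x)
x = layers.Dense(64, activation="relu")(encoded)
outputs = layers.Dense(n_features, activation="linear")(x)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, encoded)      # used to generate the 15 new features
autoencoder.compile(optimizer="adam", loss="mse")

feature_cols = [c for c in train.columns if c.startswith("feature")]
X = train[feature_cols].fillna(0).values.astype(np.float32)
autoencoder.fit(X, X, epochs=5, batch_size=4096, validation_split=0.1)

# Append the 15 encoded columns to the original features
# (the "ae_*" column names are hypothetical).
encoded_feats = encoder.predict(X, batch_size=4096)
for i in range(encoded_feats.shape[1]):
    train[f"ae_{i}"] = encoded_feats[:, i]
```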
After encoding, the 15 new features are appended to the existing 130, bringing the total number of features to 145.
Modelling:
In EDA, feature_0 was found to be the most informative feature, so it is a natural starting point for a baseline model: the prediction is 1 when feature_0 is -1 and 0 when feature_0 is +1.
For the baseline model, AUC = 0.503.
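A sketch of this rule-based baseline follows, assuming a simple 80/20 time-ordered train/validation split (the split actually used in the study is not specified).

```python
# Sketch of the baseline: predict 1 when feature_0 == -1, else 0,
# scored with ROC-AUC. The 80/20 time-ordered split is an assumption.
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

feature_cols_all = [c for c in train.columns if c.startswith(("feature", "ae_"))]
X_train, X_val, y_train, y_val = train_test_split(
    train[feature_cols_all], train["label"], test_size=0.2, shuffle=False)

baseline_pred = (X_val["feature_0"] == -1).astype(int)
print("Baseline AUC:", roc_auc_score(y_val, baseline_pred))
```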
Keeping the baseline result as the benchmark, the first model trained was logistic regression. The results were better, with an AUC score of 0.525.
A decision tree model was implemented next, with an AUC of 0.515, better than the baseline but not as effective as logistic regression.
The random forest model showed a better result than logistic regression, with AUC = 0.530.
The last model used was XGBoost, with an AUC of 0.525, similar to logistic regression.
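The sketch below compares the four learned models on the same split as the baseline sketch; the hyperparameters shown are assumptions, not the tuned values behind the AUC figures above, and xgboost is assumed to be installed.

```python
# Sketch of the model comparison on the validation split from the
# baseline sketch. Hyperparameters are assumptions, not tuned values.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=8),
    "random forest": RandomForestClassifier(n_estimators=200, max_depth=10, n_jobs=-1),
    "xgboost": XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05),
}

for name, model in models.items():
    model.fit(X_train.fillna(0), y_train)
    proba = model.predict_proba(X_val.fillna(0))[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_val, proba):.3f}")
```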
Results:
The AUC scores obtained by the different models are summarised below. Out of all the models tested, the random forest performed best with an AUC of 0.530.
Baseline (feature_0 rule): 0.503
Logistic regression: 0.525
Decision tree: 0.515
Random forest: 0.530
XGBoost: 0.525
Kaggle Score:
The test data was not released by Kaggle, and the model was supposed to be tested through their API within a given time frame. Because the case study involves pre-processing, feature engineering and prediction, the full pipeline could not be completed within that time limit, so no leaderboard score could be obtained. The issue was brought to the notice of the project guide, who advised deploying the model locally instead.
Future work:
· As future work, I would like to use deep learning models to improve the results.
· I would also like to experiment with using autoencoders to remove noise from the data before feeding it into a neural network model.
Conclusion:
This is my first machine learning case study, done as part of a course, and it was strictly restricted to machine learning techniques. That restriction pushed me to learn various ways of engineering features to improve the model. Through the case study I worked through every step of the process, from pre-processing to deployment, which really helped me understand everything that goes into building an end-to-end system.
This concludes my work. Thank you for reading!
LinkedIn profile:
https://www.linkedin.com/in/alluri-jairam-23a624172/