
gaj


Total Posts: 25 
Joined: Apr 2018 


Say I am trying to build a trading signal out of a number of features. The signal should predict the move of an instrument in some time horizon, say 15 minutes. What target variable and loss function should I use in my machine learning model?
A first attempt would be something like: target variable = 15min return and loss function = MSE. Two problems with this approach. First, MSE does not tell me anything about PnL. I can have a large MSE with good PnL if the model predicts the right direction every time but not the right magnitude. Second, in practice I will only put on a position if the trading signal is strong enough. So I shouldn't penalize the model too much when the signal is weak (although you could argue it represents a missed opportunity).
Ideally, I want the model to output a trading decision at every data point (buy, sell, or stay flat). Then optimize for PnL. Maybe penalize variance as well. How do I fit this in a supervised learning framework? 




Maggette


Total Posts: 1054 
Joined: Jun 2007 


IMO these are two different steps: training an ML model to predict price movements, and then thinking about strategy and position sizing as a separate model. I like to think about it as a reinforcement learning problem (even though that does not imply you should apply it here).
I came here and saw you and your people smiling,
and said to myself: Maggette, screw the small talk,
better let your fists do the talking...



gaj


Total Posts: 25 
Joined: Apr 2018 


Agreed, prediction and execution should be separate. But I find that a simple MSE optimization on predicting returns is extremely noisy. Consider these cases.
Case 1: model predicts +20bps, actual return is -20bps. This is obviously a misprediction and MSE rightly penalizes it.
Case 2: model predicts +200bps, actual return is +240bps. MSE will penalize it as much as case 1. But I wouldn't penalize it too much, because it predicts the right direction and we can expect a lot of noise at this magnitude. MSE will unnecessarily overfit the model.
Case 3: model predicts +0bps, actual return is +40bps. The model fails to predict an up move. Again, I wouldn't penalize too much because in practice we are not going to trade on this signal (as opposed to case 1 where we would trade in the wrong direction).
In theory, I could come up with some custom loss function to take into account these issues. But I'm curious to hear from the more experienced. 
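To make the three cases concrete, here is a minimal sketch that scores each one by squared error and by PnL. It takes case 1's actual return as -20 bps (consistent with the later remark that going long there loses 20 bps), and the 10 bps entry threshold is a made-up number for illustration only.

```python
# Three cases from the post: (predicted, actual) 15-minute returns in bps.
cases = {
    "case 1": (20.0, -20.0),
    "case 2": (200.0, 240.0),
    "case 3": (0.0, 40.0),
}

ENTRY_THRESHOLD_BPS = 10.0  # hypothetical: trade only if |prediction| >= this


def squared_error(pred, actual):
    return (pred - actual) ** 2


def pnl_bps(pred, actual, threshold=ENTRY_THRESHOLD_BPS):
    """PnL of a unit position taken in the predicted direction, or 0 if flat."""
    if abs(pred) < threshold:
        return 0.0
    direction = 1.0 if pred > 0 else -1.0
    return direction * actual


results = {
    name: (squared_error(p, a), pnl_bps(p, a)) for name, (p, a) in cases.items()
}
for name, (se, pnl) in results.items():
    print(f"{name}: squared error = {se:.0f}, PnL = {pnl:+.0f} bps")
```

All three cases carry the same squared error, while the PnL under this simple threshold rule differs in sign and size.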




Strange


Total Posts: 1436 
Joined: Jun 2004 


I don’t know much about ML, but I think that you do not want a binary signal. Instead, you want to output some sort of continuous score. So instead of 3 modes (long, short and flat) you should have a number that changes sign, with some min and max (a z-score is the simple example). Later, at the sizing stage, you can apply some form of transaction cost control. 
I don't interest myself in 'why?'. I think more often in terms of 'when?'...sometimes 'where?'. And always 'how much?' 


Azx


Total Posts: 37 
Joined: Sep 2009 


You will need some way to determine your position size from the expected return in order to quantify how errors impact your trading.
Let the target variable be expected return and then derive an optimal position size based on maximizing some utility function. Then use the estimated utility gained from trading based on your predicted expected return as the basis for your loss function. This way only errors relevant to trading decisions are penalized. 
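One way this could look, as a rough sketch only: assume a mean-variance utility with a linear transaction cost, size the position optimally for a given expected return, and define the loss as the utility given up by sizing on the predicted rather than the true expected return. The utility form, the parameter values (`cost`, `risk_aversion`, `variance`) and all function names are assumptions for illustration, not something specified in the post.

```python
def optimal_position(expected_return, cost, risk_aversion, variance):
    """Position maximizing p*mu - cost*|p| - 0.5*risk_aversion*variance*p**2."""
    edge = abs(expected_return) - cost
    if edge <= 0.0:
        return 0.0  # expected return does not cover costs: stay flat
    sign = 1.0 if expected_return > 0 else -1.0
    return sign * edge / (risk_aversion * variance)


def utility(position, expected_return, cost, risk_aversion, variance):
    return (
        position * expected_return
        - cost * abs(position)
        - 0.5 * risk_aversion * variance * position ** 2
    )


def utility_loss(predicted, true_expected, cost=5.0, risk_aversion=1.0, variance=100.0):
    """Utility given up by sizing on `predicted` instead of `true_expected`.

    Returns are in bps and variance in bps squared (hypothetical values).
    """
    best = utility(
        optimal_position(true_expected, cost, risk_aversion, variance),
        true_expected, cost, risk_aversion, variance,
    )
    achieved = utility(
        optimal_position(predicted, cost, risk_aversion, variance),
        true_expected, cost, risk_aversion, variance,
    )
    return best - achieved


print(utility_loss(4.0, -4.0))    # both below cost: no trade either way
print(utility_loss(20.0, -20.0))  # traded in the wrong direction
```

Under these assumptions an error that never crosses the cost threshold costs nothing, even if the predicted sign is wrong, which is the "only errors relevant to trading decisions are penalized" property.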




ronin


Total Posts: 339 
Joined: May 2006 


@gaj,
Your dichotomy is a bit false. Why exactly would you penalise Case 1 more than Cases 2&3?
Having said that, deciding on your utility function for the signal is not trivial. You can maximise the signal, minimise the noise, or maximise the signal to noise ratio. In different measures which scale with noise. And then there are umpteen ways to interpret "signal" and "noise".
MSE is a measure of noise. Whether or not it's a good measure of noise for what you are doing is probably the first decision. The second decision is whether you should be optimising for noise in the first place.

"There is a SIX am?"  Arthur 


gaj


Total Posts: 25 
Joined: Apr 2018 


"Your dichotomy is a bit false. Why exactly would you penalise Case 1 more than Cases 2&3?"
I was implicitly thinking in terms of PnL again, which may not be the right approach. In case 1, I would go long and lose 20bps. Case 2, I go long and make 240bps (more than expected). Case 3, I do nothing and stay flat.
"MSE is a measure of noise."
Interesting perspective. What is a measure of signal then? 




ronin


Total Posts: 339 
Joined: May 2006 


The signal would be your return, in some simplified form. Maximising the signal would say "making 200 bps is good", minimising the noise would say "being wrong by 40 bps is bad".
So your optimization could be any of these:
R -> max
MSE -> min
R / MSE -> max
R / MSE^(1/2) -> max
R - lambda * MSE -> max, for some risk tolerance lambda
R - lambda * MSE^(1/2) -> max, for some risk tolerance lambda
And many more.
If the problem isn't trivial, each of those will give you something different. And that's just assuming you are doing something sufficiently Gaussian so that MSE is a good measure of error.
It might be a good exercise to do them all to get some idea what you are looking at, before making final decisions. But then once you have, stick with the choice so you don't introduce model choice bias.
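As a sketch of "doing them all", the snippet below evaluates each candidate objective on one set of predictions. The definition of R used here (mean realized return from holding a unit position in the predicted direction) is only one possible reading of "signal", and the value of `lam` is arbitrary. The sample data are the three cases from earlier in the thread, with case 1's realized return taken as -20 bps.

```python
import math


def sign(x):
    return (x > 0) - (x < 0)


def candidate_objectives(predictions, realized, lam=0.01):
    """Evaluate the candidate objectives for one set of predictions.

    R is taken here as the mean realized return from holding a unit position
    in the predicted direction; that definition is an assumption of this sketch.
    """
    n = len(predictions)
    r = sum(sign(p) * a for p, a in zip(predictions, realized)) / n
    mse = sum((p - a) ** 2 for p, a in zip(predictions, realized)) / n
    rmse = math.sqrt(mse)
    return {
        "R": r,
        "MSE": mse,
        "R / MSE": r / mse,
        "R / MSE^(1/2)": r / rmse,
        "R - lam * MSE": r - lam * mse,
        "R - lam * MSE^(1/2)": r - lam * rmse,
    }


scores = candidate_objectives([20.0, 200.0, 0.0], [-20.0, 240.0, 40.0])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Comparing how different models rank under each entry gives a feel for how much the choice of objective matters before committing to one.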

"There is a SIX am?"  Arthur 



When applying ML to any problem domain, it's always useful to keep the no-free-lunch theorem in mind. For every world where X is the right approach, there's another situation where X is exactly the wrong approach. Sometimes those worlds are absurdly Kafkaesque and don't match reality at all. But you should always be able to articulate where, when and why your favorite approach breaks down.
In the spirit of NFL, you need to think about what kind of inductive biases are true of the application that you're working on. Without at least some domain knowledge, a blind ML approach is almost never going to be useful. In my experience, when working on financial signals, there's a recurring set of stylized facts that are usually true. The below is my go-to list (in order of how universal they seem to be), with the caveat that YMMV depending on your specific problem.
* Markets are mostly efficient, and the signal-to-noise ratio when predicting forward returns is *very* low. Make sure that whatever model you're using is extremely robust to noise in the dependent variable. Luckily there's pretty extensive literature on this topic.
* Returns are mostly normal-ish. At least enough that minimizing MSE is almost always the best approximation for the model's MLE. MSE is stable, tractable, easily computable and data-efficient. Deviating from it isn't worth the minor gain to MLE. (This doesn't apply to data errors, like zeroed prices, which you should clean up before modeling.) (Also keep in mind this is an entirely separate topic from risk management, where deviations from normality *are* important.)
* The unconditional prior on returns should be zero. It's almost always worth it to spend the degrees of freedom to hold datapoints out-of-sample, then single-OLS a shrinkage coefficient on top of the fitted signal. This is particularly true of ML models like trees and nets, which are optimized for classification. Remember in classification, there's no penalty for overconfidence. Whereas in trading there are almost always *very big* penalties for overconfidence.
* The correct param set for the subset of "special" tradable points (like when signal is above the cost threshold) is usually in close proximity to the unconditional param set. MSE is a clear winner and we shouldn't deviate from it in the objective function. It's better to take an EM approach. First unconditionally fit your params. Then use the net signal to score the specialness of each point. Refit giving higher weights to more special points. Rescore with the refitted params. Repeat until you converge.
* Most finished trading systems tend to be a collection of mostly orthogonal subcomponents. Therefore large magnitude signals tend to actually not be that special, as they're mostly driven by the random coincidence of the orthogonal subsignals.
* With regards to the OP's question, don't think too hard about trading a single signal in isolation. Very likely it's going to be mixed with a bunch of other signals, plus some spiffy monetization logic. We're more interested in something that "plays nice" as a building block. Linear models trained with tractable objective functions almost always play nice. Sometimes it's worth it to deviate, but be aware of "downstream costs".
* Within a market, the correct parameters for separate instruments tend to live in proximity to each other. Therefore it's usually better to fit a market-wide model with cardinality of the entire dataset. Then boost instrument-specific models on top of the market model's shrunken out-of-sample prediction.
* Less liquid instruments and periods tend to have high predictability. Therefore it's appropriate to weight MSE in proportion to liquidity or capacity.
* That being said, the less equal the weights the lower the effective cardinality of the data set. If you're data-starved and living near the former end of the bias-variance spectrum, sometimes flattening the weights costs less bias than it frees up.
* Param sets change over time, but in a slow, continuous way. Therefore it's usually better to fit a model using longer periods, then boost using shorter more recent periods. Alternatively you can weigh more recent points heavily using something like exponential decays on the weights.
* Shorter horizons are more predictable than longer ones. Optimal long horizon parameters tend to live in close proximity to short-horizon params. Often it's best to fit with the shortest horizon that makes sense. Optionally boost, shrink and/or stretch to get a long horizon model.
* Interactions between features tend to be pretty shallow. Deep models don't tend to work well in quant trading. Well thought out feature engineering usually guarantees minimal interaction. Consider spending more time on feature engineering if you find yourself getting big gains from depth. Also consider using randomized features as a sanity benchmark. 
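Two of the points in the list above (exponentially decaying weights on older data, and unequal weights lowering the effective cardinality) can be illustrated with a small sketch. The effective sample size formula used here, (sum of weights)^2 / (sum of squared weights), is Kish's standard approximation and is this sketch's choice, not something named in the post. The half-life value is arbitrary.

```python
def exponential_weights(ages, half_life):
    """Weight 1.0 for the newest point (age 0), halving every `half_life` periods."""
    return [0.5 ** (age / half_life) for age in ages]


def effective_sample_size(weights):
    """Kish's approximation: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def weighted_slope(xs, ys, weights):
    """Weighted least-squares slope through the origin (zero prior on returns)."""
    num = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    den = sum(w * x * x for w, x in zip(weights, xs))
    return num / den


flat = [1.0] * 100
decayed = exponential_weights(range(100), half_life=10.0)
print(effective_sample_size(flat))     # equal weights keep all 100 points
print(effective_sample_size(decayed))  # decay leaves far fewer effective points
```

With a 10-period half-life, 100 observations behave roughly like 29 equally weighted ones, which is the trade-off between recency and data efficiency described above.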
Good questions outrank easy answers.
Paul Samuelson 




It's also worth remembering that the NFL theorem is a very worst-case result. 



deeds


Total Posts: 403 
Joined: Dec 2008 


Very generous EL, thank you for sharing 




gaj


Total Posts: 25 
Joined: Apr 2018 


@ EspressoLover: Thank you! So much wisdom in a single post. A few follow up questions. Sorry if these questions sound noobish.
"Markets are mostly efficient, and the signal-to-noise ratio when predicting forward returns is *very* low." What kind of R-squared can I realistically expect? Just wanted to get an idea of when I should keep searching and when I should stop. 0.5%? 3%? 10%?
"single OLS a shrinkage coefficient on top of the fitted signal" Not sure if I understand this correctly. Are you saying multiply the fitted signal by a factor less than 1?
"Within a market, the correct parameters for separate instruments tend to live in proximity to each other. Therefore it's usually better to fit a market-wide model with cardinality of the entire dataset. Then boost instrument-specific models on top of the market model's shrunken out-of-sample prediction." Very interesting. I have a hard time imagining how to fit multiple instruments in the same model. Seems to require a lot of normalization.
"Most finished trading systems tend to be a collection of mostly orthogonal subcomponents." "Interactions between features tend to be pretty shallow." What's the difference between "features" and "subsignals". Why don't we put the subsignals in a single model in the first place? Conversely, we can also ask, if interactions between features are shallow, why don't we treat them as separate subsignals and just focus on one feature per model? 




> What kind of R-squared can I realistically expect?
It really depends a lot on context. Shorter horizons, less liquidity, thicker books, higher t-costs, less developed markets, more event-driven sampling and noisier price measures all increase R-squareds. In general R-squareds above 5% intraday (time-sampled) and 1% interday definitely smell. That's usually a sign that there's some sort of overfitting or lookahead bias leaking into the model. Or that your price metric has a pathology, like bid-ask bounce. On the flip side you may still have a great signal with a much lower R-squared than the above.
> Are you saying multiply the fitted signal by a factor less than 1?
Yes, I'm suggesting holding out some of the dataset. Let's say you're using kNN regression, which is heavily overconfident in the presence of noise. The signals would be way too large and you'd overtrade like crazy. Say you have 10,000 points, keep 1,000 in reserve and use the rest to train kNN. Then generate out-of-sample kNN predictions for the reserved points. If you apply single OLS (regressing the realized returns on those predictions), you'll get a coefficient between 0 and 1. That tells you how much to shrink the kNN predictions by. If your kNN model spits out +20 basis points, and your shrinkage coef is 0.25, then your net output would be +5 basis points.
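A minimal sketch of that shrinkage step follows. The hold-out data are made up so that the arithmetic is easy to follow, and the regression is run without an intercept (in line with a zero unconditional prior on returns); both choices are assumptions of this sketch.

```python
# Toy hold-out set: overconfident raw predictions and realized returns, in bps.
# The numbers are invented; in practice they would come from the reserved points.
holdout_pred = [20.0, 20.0, -20.0, -20.0] * 250   # 1,000 reserved points
holdout_real = [20.0, -10.0, 10.0, -20.0] * 250


def shrinkage_coefficient(predicted, realized):
    """Slope of a no-intercept single-variable OLS of realized on predicted."""
    num = sum(p * r for p, r in zip(predicted, realized))
    den = sum(p * p for p in predicted)
    return num / den


coef = shrinkage_coefficient(holdout_pred, holdout_real)
raw_signal_bps = 20.0
net_signal_bps = coef * raw_signal_bps
print(coef, net_signal_bps)
```

On this toy data the coefficient comes out at 0.25, so a raw +20 bps prediction is shrunk to +5 bps, matching the example in the post.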
> I have a hard time imagining how to fit multiple instruments in the same model.
It's easy to see this approach with something like SGD. There's a fitness landscape of parameterizations, and the "height" of each point is goodness of fit. Our dataset is just a finite sample from the Platonic "true model", so MSE is a noisy measure of a point's height in the landscape. Depending on our flavor of SGD we generally have some sort of scheme where we start with larger steps to avoid pits of noise, until we end up in the neighborhood of a local maximum. At which point we start taking smaller, more careful steps to try to find the exact peak.
Now think of a multi-instrument and a single-instrument fitness landscape. The landscape for AAPL is similar, but not exactly the same as the landscape for all S&P 500 stocks aggregated together. However we have 500 times more data points for the S&P 500, so our estimates are much less noisy. It makes sense to start by looking for an optimal point on S&P. After finding it we can be pretty confident AAPL's optimal point is in the neighborhood. We get a much better starting point this way. The gradient in the proximity of a local max is much less noise driven. In this way we significantly reduce the empirical risk from a small single-instrument dataset.
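A much-simplified sketch of "market model first, instrument-specific boost second", using a single feature and plain least squares instead of SGD. The `shrink` factor is a hypothetical hyperparameter, and the boost here is fit on in-sample residuals for brevity, whereas the post recommends boosting on top of shrunken out-of-sample predictions.

```python
def slope_through_origin(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_market_then_instrument(data, shrink=0.5):
    """data maps instrument name -> (feature values, forward returns)."""
    all_x = [x for xs, _ in data.values() for x in xs]
    all_y = [y for _, ys in data.values() for y in ys]
    market_beta = slope_through_origin(all_x, all_y)  # pooled fit on everything

    betas = {}
    for name, (xs, ys) in data.items():
        # Fit what the market model leaves unexplained for this instrument,
        # then only move part of the way towards the instrument-specific fit.
        residuals = [y - market_beta * x for x, y in zip(xs, ys)]
        correction = slope_through_origin(xs, residuals)
        betas[name] = market_beta + shrink * correction
    return market_beta, betas


toy = {
    "A": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # instrument-level slope 1.0
    "B": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),  # instrument-level slope 2.0
}
market_beta, betas = fit_market_then_instrument(toy)
print(market_beta, betas)
```

Each instrument ends up between the pooled estimate and its own stand-alone estimate, which is the "optimal point is in the neighborhood" idea in parameter terms.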
> if interactions between features are shallow, why don't we treat them as separate subsignals and just focus on one feature per model?
Well, first off I'm distinguishing "interaction" from "multicollinearity". (For simplicity, I'm talking about linear OLS, but most of this analogizes to other supervised learning techniques.) If you have non-orthogonal features, then regressing each one individually produces a worse fit than regressing the features together. A toy example is some pairs signal between X and Y. If X moved down last period and Y moved up, we'd predict X's price to go up. However if X and Y both went up, then there's no divergence and we'd have zero signal. If X and Y rarely diverge, then single-regressing X or Y's last move would offer little predictive value. But by regressing the two together we effectively filter out the non-divergent moves, generating a much stronger signal.
I'm using interaction in the sense of "interaction term", i.e. non-additive influence on signal. Volume is a good example. We may expect certain events to predict price more or less depending on if they were accompanied by unusual volume. In effect the regression would be Y ~ X + X*Volume. Whereas volume as a standalone term (i.e. Y ~ X + Volume) would be insignificant. This is a 2-depth interaction because it involves two features. Image recognition is a "deep learning" problem because you can't just build an additive model of individual pixel values, or even small combos of pixels.
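The pairs example can be checked on simulated data. In this sketch X and Y share a large common move plus small idiosyncratic moves, and the target reverts the divergence between them. All magnitudes, the seed and the sample size are arbitrary assumptions, and regressions are run without an intercept.

```python
import random

rng = random.Random(0)
n = 2000
dx, dy, target = [], [], []
for _ in range(n):
    common = rng.gauss(0.0, 1.0)                 # shared move in X and Y
    x_move = common + 0.1 * rng.gauss(0.0, 1.0)  # X's last-period move
    y_move = common + 0.1 * rng.gauss(0.0, 1.0)  # Y's last-period move
    dx.append(x_move)
    dy.append(y_move)
    # X's next move reverts the divergence, plus unexplained noise.
    target.append(-(x_move - y_move) + 0.05 * rng.gauss(0.0, 1.0))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def r2_single(x, t):
    beta = dot(x, t) / dot(x, x)
    sse = sum((ti - beta * xi) ** 2 for xi, ti in zip(x, t))
    return 1.0 - sse / dot(t, t)


def r2_joint(x, y, t):
    # Solve the 2x2 normal equations for the two coefficients.
    a, b, d = dot(x, x), dot(x, y), dot(y, y)
    p, q = dot(x, t), dot(y, t)
    det = a * d - b * b
    bx = (p * d - q * b) / det
    by = (a * q - b * p) / det
    sse = sum((ti - bx * xi - by * yi) ** 2 for xi, yi, ti in zip(x, y, t))
    return 1.0 - sse / dot(t, t)


print("X alone:", r2_single(dx, target))
print("Y alone:", r2_single(dy, target))
print("X and Y together:", r2_joint(dx, dy, target))
```

Each feature on its own explains almost nothing, while the joint regression recovers the divergence and explains most of the target in this setup.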
> Why don't we put the subsignals in a single model in the first place?
There are plenty of reasons you'd want to segregate features and fits into separate subsignals. First is just practical. You have different researchers working on different problems. Keeping the teams modular with minimal overlapping concerns is easier if each group delivers a separate semi-finalized alpha.
Second is that if feature set A is orthogonal to feature set B, then there are no gains to fitting them together. So, why complicate things? Even if they're non-orthogonal, that dependence often just compresses down to their net alpha. Order book features are often collinear with relative value features, however that's largely an artifact of liquidity providers leaning on RV signals. If you fit an order book alpha, then fit an RV alpha, then regress the two together to get combo weights, you'll often end up with a net signal that's basically the same as if you had just regressed all the features together.
Third, you usually have different Bayesian priors on qualitatively different categories of features. Simple example: signal X includes hundreds of features, many of which are spurious and unstable. Whereas signal Y has a few rock-solid features. If you're doing LASSO regression on X it's likely that you'll need tight regularization. Once you throw the two together, you'll probably over-shrink Y's features and under-shrink X's. In this case you want to keep your hyperparameters pooled separately. 
Good questions outrank easy answers.
Paul Samuelson 



levkly


Total Posts: 28 
Joined: Nov 2014 


Thank you for sharing EL.
1. How do you overcome trading costs with R-squared below 5% intraday?
2. In my experience, at sub-second frequencies the "winner" takes all the cake. What is the average time until your alpha decays for sub-second prediction horizons?




gaj


Total Posts: 25 
Joined: Apr 2018 


Very informative, thank you. I will take time to digest all of that. 





> How do you overcome trading costs with R-squared below 5% intraday?
Well, let's just play with a toy model. Let's say we're looking at a 1-second signal on FB. FB's open-to-close volatility averages about 1.25%. Simplifying to assume that returns are i.i.d., that's roughly 1 bp of 1-second volatility. If our signal has 2% R-squared, its standard deviation will be 0.14 bps. At $182 that's a magnitude of $0.0025 in dollar terms.
Let's say that we're taking liquidity by crossing the spread, the bid-ask spread is almost always 1 tick, and we pay zero fees or commissions. Therefore our one-way transaction cost is $0.005. Relative to our alpha, that's a 2-sigma threshold. Let's simplify again and assume the signal is i.i.d. normally distributed. With 23,400 seconds in a session, we'd expect to get 532 profitable trading opportunities per day with an average net profit of $0.0009.
If we have good execution we can do even better. Start with evaluating the signal continuously in real time rather than at fixed time slices. That gives more opportunity to "catch" a profitable opportunity. If we execute on BYZ and get $0.0015 in rebates for taking liquidity, then the profitability goes way up because we can trade a lot more and net larger profits. Or if we can get significant price improvement from NBBO, either by providing liquidity or hitting dark liquidity.
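The toy-model arithmetic can be reproduced in a few lines. This sketch uses the post's rounded inputs (1 bp of 1-second volatility, $0.0025 signal standard deviation) and the standard normal tail; the variable names are illustrative. The count of 532 corresponds to one tail of the distribution (signal above +2 sigma), so counting the symmetric opposite-side signals as well would roughly double it.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


SECONDS_PER_SESSION = 23_400
DAILY_VOL = 0.0125      # 1.25% open-to-close
R_SQUARED = 0.02
SIGNAL_SD_USD = 0.0025  # the post's rounded dollar figure at $182
COST_USD = 0.005        # half of a one-tick ($0.01) spread

# Unrounded 1-second volatility; the post rounds this to about 1 bp.
vol_1s_bps_exact = DAILY_VOL / math.sqrt(SECONDS_PER_SESSION) * 1e4
# Signal standard deviation in bps, using the rounded 1 bp volatility.
signal_sd_bps = math.sqrt(R_SQUARED) * 1.0

threshold_sigmas = COST_USD / SIGNAL_SD_USD
tail_prob = normal_tail(threshold_sigmas)
opportunities_per_day = SECONDS_PER_SESSION * tail_prob

# Mean of (signal - cost) given signal > cost, for a normal signal.
avg_net_profit_usd = SIGNAL_SD_USD * (
    normal_pdf(threshold_sigmas) / tail_prob - threshold_sigmas
)

print(f"1s vol: {vol_1s_bps_exact:.2f} bps, signal sd: {signal_sd_bps:.2f} bps")
print(f"threshold: {threshold_sigmas:.1f} sigma")
print(f"opportunities per day (one side): {opportunities_per_day:.0f}")
print(f"average net profit: ${avg_net_profit_usd:.4f}")
```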
> From my experience in sub second frequency the "winner" take all the cake.
I generally agree. Some exceptions I can think of though... Different operations have "jitter" between their systems, even if they're modeling exactly the same phenomenon. Your coefficient and model are never going to exactly match mine. So there will always be cases where Alice's signal triggers, whereas Bob's does not. This is more the case when the alpha tends to "drift". If you have an alpha that "jumps", then most of the time the state instantly changes to the point where the opportunity is obvious to everyone. At which point latency wins.
It also depends on the monetization strategy on top of the signal. Taking lit liquidity is definitely biased towards winner take all. Once the signal triggers, then the trader is going to move the price instantly. But for liquidity providers, many participants can simultaneously be trying to fill in the direction of the alpha. Since the space of all providing strategies is much larger and higher dimensional than liquidity-taking, there's often more room for multiple niches, even with identical alphas shared by participants. 
Good questions outrank easy answers.
Paul Samuelson 


gaj


Total Posts: 25 
Joined: Apr 2018 


> Since the space of all providing strategies is much larger and higher dimensional than liquidity-taking, there's often more room for multiple niches, even with identical alphas shared by participants.
Could you expand on this? The only extra dimension I can think of is queue position. 





Well, to start, there are the mechanics of placement. Do you improve, join or post away? Will you take the liquidity of a dying level to be the first to post in the opposite direction? How deep do you layer the book? How do you trade off queue-holding at deeper levels versus using your capital at more fillable prices? How do you divide your quotes between venues? Do you use midpoint pegs? Do you post with ALO?
Then you have to consider that a limit order has an entire life cycle. Unlike an IOC, the order lives for a long time, so there's the continuously re-evaluated decision about whether to cancel, modify or keep it alive. It's not just a one-time calculation of profitability. It's this whole recursive decision process: maybe I'll expect to lose if I get filled in the next iota, but if I don't I'll likely move up in the queue by X. But then what's the likelihood that I wind up canceling anyway...
You also have to manage adverse selection and conditionality. (Taking does have an element of adverse selection, especially in the presence of latency competition, but it's definitely a lot less than providing.) Not only do you need an alpha model, but you also need a toxicity model. Even the alphas may need to be conditioned on your order's position. In addition you have to take into account your pre-existing inventory, and estimate how long it takes to exit positions. (This isn't a problem without adverse selection, because the unconditional drift is zero.)
Finally you have to consider market impact. With a fast-decaying take strategy, you just swipe all the liquidity as fast as you can. But if you're a decent size relative to the market, then putting out a giant visible limit order will push the market away from you before getting filled.
That being said, remember that many high-dimensional optimization problems have low effective dimensionality. 
Good questions outrank easy answers.
Paul Samuelson 


gaj


Total Posts: 25 
Joined: Apr 2018 


Makes sense. Most of what you said is specific to US equities though. If you are trading a deep, liquid, single-venue instrument without fancy order types, the logic can be as simple as: place order if alpha + queue value > tx costs, else cancel order. Am I correct?
Edit: Of course this framework is an oversimplification. In an ideal model, the queue position valuation should incorporate the recursive decision process you mentioned. It should be able to evaluate the value of keeping an order vs the chance of getting a bad fill, the value of taking liquidity vs being the first in the queue, the value of showing a big order vs market impact, etc. The alpha model should also capture toxicity and adverse selection. 
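Written out literally, the simplified rule is a one-line comparison. The function name and the example numbers below are hypothetical, and how `alpha` and `queue_value` are estimated is exactly the hard part left out here.

```python
def quote_decision(alpha, queue_value, tx_cost):
    """Return "place" if expected edge beats costs, otherwise "cancel".

    All three inputs are per-share values in the same units (e.g. dollars).
    """
    return "place" if alpha + queue_value > tx_cost else "cancel"


print(quote_decision(alpha=0.003, queue_value=0.004, tx_cost=0.005))
print(quote_decision(alpha=0.001, queue_value=0.002, tx_cost=0.005))
```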





Not just to *US* equities. It also largely applies to EU equities. 







