
Something wrong in train/valid split #10

Open
rambo-yuanbo opened this issue Jan 13, 2025 · 5 comments

Comments

@rambo-yuanbo

Hello,

The author makes a list of offsets for training and shuffles it:

```python
batch_offsets = np.arange(start=0, stop=valid_index, dtype=int)
...
np.random.shuffle(batch_offsets)
```

After that, to avoid overlapping with the validation data, the author does this:

```python
for j in range(valid_index - lookback_length - steps + 1):
    ....batch_offsets[j]....
```

But since `batch_offsets` is shuffled, it is likely that some offsets belonging to the validation range will be used in training.
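A minimal sketch of the problem, assuming `valid_index`, `lookback_length` and `steps` carry the meanings above (the concrete numbers are made up for illustration):

```python
import numpy as np

np.random.seed(0)  # fixed seed just to make the illustration reproducible

# made-up stand-ins for the repo's real hyperparameters
valid_index = 100        # first index reserved for validation
lookback_length = 20
steps = 5

batch_offsets = np.arange(start=0, stop=valid_index, dtype=int)
np.random.shuffle(batch_offsets)

# the loop bound only limits HOW MANY offsets are visited, not WHICH ones:
# after the shuffle, positions 0..bound-1 can hold any value in [0, valid_index)
bound = valid_index - lookback_length - steps + 1
used = batch_offsets[:bound]

# offsets whose lookback+steps window reaches past the intended boundary
leaking = used[used >= bound]
print(len(leaking), "of", bound, "sampled offsets overlap the validation window")
```

With essentially any shuffle outcome, `leaking` is non-empty, so some training windows extend into the validation span.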

@rambo-yuanbo
Author

Also, looking at the dataset, my best guess is that the data was normalized over all of the train/valid/test data together. Isn't there some risk of looking into the future?

@rambo-yuanbo
Author

I tried something like this:

```python
batch_offsets = np.arange(start=0, stop=valid_index - lookback_length - steps + 1, dtype=int)
...
np.random.shuffle(batch_offsets)
...
for offset in batch_offsets:
    ...
```

Validation performance is not as good as what the original code gives (on the NASDAQ dataset).
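Spelled out, the variant above amounts to generating only offsets whose whole window fits before the validation boundary (the numbers are again made up for illustration):

```python
import numpy as np

valid_index, lookback_length, steps = 100, 20, 5

# keep only offsets whose lookback+steps window ends before the validation span
batch_offsets = np.arange(
    start=0, stop=valid_index - lookback_length - steps + 1, dtype=int)
np.random.shuffle(batch_offsets)

for offset in batch_offsets:
    window_end = offset + lookback_length + steps - 1
    assert window_end < valid_index   # no training window touches validation
```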

@rambo-yuanbo
Author

I also tried strict rolling-window normalization (within each `lookback_length + steps` window, to ensure no future-data leak), which also made things worse.
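By "strict rolling-window normalization" I mean something like z-scoring each sample with statistics from its own lookback part only. A toy sketch (`make_sample` and the toy series are illustrative, not the repo's code):

```python
import numpy as np

def make_sample(series, offset, lookback_length, steps):
    """Normalize one lookback+steps window using statistics computed
    from its lookback part only, so no future data leaks in."""
    window = series[offset: offset + lookback_length + steps]
    mean = window[:lookback_length].mean()
    std = window[:lookback_length].std() + 1e-8   # avoid division by zero
    return (window - mean) / std

series = np.sin(np.linspace(0.0, 10.0, 200))      # toy price series
sample = make_sample(series, offset=0, lookback_length=20, steps=5)
```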

@rambo-yuanbo
Author

[attached screenshot: "无标题" ("Untitled")]

As you can see, both of the issues I posted above,

  1. the train/valid split overlaps
  2. the global normalization (which I am guessing at) might see future data, especially if it sees historical max prices

each boost the IC in the validation period.

@joshua-xia

joshua-xia commented Jan 14, 2025

I think the reasonable way is:
1. Split the train data and validation data at a timestamp: the data before this timestamp is used as training data, and the data after it as validation data.
2. The normalization (for example, a Scaler) should be fit using the training data only, and that same Scaler used to normalize the validation data. In this code, I guess all of the data is normalized by one Scaler.
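A sketch of that split-then-fit pattern in plain NumPy (the array and split point are made up; a scikit-learn `StandardScaler` fit on the training slice would do the same job):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(300, 4))  # toy (time, feature) array

split_t = 200                                 # timestamp-style split point
train, valid = data[:split_t], data[split_t:]

# fit normalization statistics on the training span only ...
mean, std = train.mean(axis=0), train.std(axis=0)

# ... then apply those same statistics to both spans: no look-ahead
train_norm = (train - mean) / std
valid_norm = (valid - mean) / std
```

The key point is that `mean` and `std` never see `valid`, so the validation metrics measure true out-of-sample behavior.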
