
Something wrong in train/valid split #10

Open
rambo-yuanbo opened this issue Jan 13, 2025 · 5 comments

Comments

@rambo-yuanbo

Hello,

The author makes a list of offsets for training and shuffles it:

```python
batch_offsets = np.arange(start=0, stop=valid_index, dtype=int)
...
np.random.shuffle(batch_offsets)
```

After that, to avoid overlapping with the validation data, the author does this:

```python
for j in range(valid_index - lookback_length - steps + 1):
    ....batch_offsets[j]....
```

But since `batch_offsets` is shuffled, it is likely that some offsets belonging to the validation range will be used in training.
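A minimal sketch of the problem, assuming `valid_index`, `lookback_length` and `steps` carry the meanings above (the concrete numbers are made up for illustration):

```python
import numpy as np

np.random.seed(0)  # fixed seed just to make the illustration reproducible

# made-up stand-ins for the repo's real hyperparameters
valid_index = 100        # first index reserved for validation
lookback_length = 20
steps = 5

batch_offsets = np.arange(start=0, stop=valid_index, dtype=int)
np.random.shuffle(batch_offsets)

# the loop bound only limits HOW MANY offsets are visited, not WHICH ones:
# after the shuffle, positions 0..bound-1 can hold any value in [0, valid_index)
bound = valid_index - lookback_length - steps + 1
used = batch_offsets[:bound]

# offsets whose lookback+steps window reaches past the intended boundary
leaking = used[used >= bound]
print(len(leaking), "of", bound, "sampled offsets overlap the validation window")
```

With essentially any shuffle outcome, `leaking` is non-empty, so some training windows extend into the validation span.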

@rambo-yuanbo
Author

Also, looking at the dataset, my best guess is that the data was normalized over all of the train/valid/test data together. Isn't there some risk of looking into the future?

@rambo-yuanbo
Author

I tried something like this:

```python
batch_offsets = np.arange(start=0, stop=valid_index - lookback_length - steps + 1, dtype=int)
...
np.random.shuffle(batch_offsets)
...
for offset in batch_offsets:
    ...
```

Validation performance is not as good as what the original code gives (on the NASDAQ dataset).
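Spelled out, the variant above amounts to generating only offsets whose whole window fits before the validation boundary (the numbers are again made up for illustration):

```python
import numpy as np

valid_index, lookback_length, steps = 100, 20, 5

# keep only offsets whose lookback+steps window ends before the validation span
batch_offsets = np.arange(
    start=0, stop=valid_index - lookback_length - steps + 1, dtype=int)
np.random.shuffle(batch_offsets)

for offset in batch_offsets:
    window_end = offset + lookback_length + steps - 1
    assert window_end < valid_index   # no training window touches validation
```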

@rambo-yuanbo
Author

I also tried strict rolling-window normalization (within each `lookback_length + steps` window, to ensure no future-data leak), which also made things worse.
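By "strict rolling-window normalization" I mean something like z-scoring each sample with statistics from its own lookback part only. A toy sketch (`make_sample` and the toy series are illustrative, not the repo's code):

```python
import numpy as np

def make_sample(series, offset, lookback_length, steps):
    """Normalize one lookback+steps window using statistics computed
    from its lookback part only, so no future data leaks in."""
    window = series[offset: offset + lookback_length + steps]
    mean = window[:lookback_length].mean()
    std = window[:lookback_length].std() + 1e-8   # avoid division by zero
    return (window - mean) / std

series = np.sin(np.linspace(0.0, 10.0, 200))      # toy price series
sample = make_sample(series, offset=0, lookback_length=20, steps=5)
```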

@rambo-yuanbo
Author

[attached screenshot: "无标题" ("Untitled")]

As you can see, both of the issues I posted above,

  1. the train/valid split overlaps
  2. the global normalization (which I am guessing at) might see future data, especially if it sees historical max prices

each boost the IC in the validation period.

@joshua-xia

joshua-xia commented Jan 14, 2025

I think the reasonable way is:
1. Split the train data and validation data at a timestamp: the data before this timestamp is used as training data, and the data after it as validation data.
2. The normalization (for example, a Scaler) should be fit using the training data only, and that same Scaler used to normalize the validation data. In this code, I guess all of the data is normalized by one Scaler.
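A sketch of that split-then-fit pattern in plain NumPy (the array and split point are made up; a scikit-learn `StandardScaler` fit on the training slice would do the same job):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(300, 4))  # toy (time, feature) array

split_t = 200                                 # timestamp-style split point
train, valid = data[:split_t], data[split_t:]

# fit normalization statistics on the training span only ...
mean, std = train.mean(axis=0), train.std(axis=0)

# ... then apply those same statistics to both spans: no look-ahead
train_norm = (train - mean) / std
valid_norm = (valid - mean) / std
```

The key point is that `mean` and `std` never see `valid`, so the validation metrics measure true out-of-sample behavior.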
