Retraining a model on the validation set as well before production

Greetings Chris, Nick and everyone,
I have been seeing contradictory statements about also training on the validation set before deploying ML/DL models into production. What’s your take on this?


@ChrisMcC @nickcdryan any comments on this :slight_smile:

really appreciate your time and help!

Hi Sai,

What are the contradictory ideas you’ve heard?

In most cases, your goal is to train a model that will successfully generalize to new data. The validation set and test set are best thought of as tools to help you towards that goal. The validation set helps indicate whether or not you are overfitting to the training data, and the test set should serve as an ultimate indicator of how well your model will generalize to new data.
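To make the three roles concrete, here is a minimal sketch of a shuffled train/validation/test split using only the standard library. The function name `three_way_split` and the 70/15/15 proportions are assumptions for illustration, not something from this thread; in practice a library helper such as scikit-learn's `train_test_split` does the same job.

```python
import random

def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle items and split them into train, validation and test lists.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]               # touched once, at the very end
    val = shuffled[n_test:n_test + n_val]  # used to watch for overfitting while tuning
    train = shuffled[n_test + n_val:]      # the only part the model is fit on
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The important property is that the three lists never overlap, so the test set stays an honest check on generalization.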

Maybe you’re seeing some practical exceptions? For example, when there isn’t a lot of data, people often shrink the test/validation sets in favor of a larger training set, or simply train on the validation set too, because they believe that the positive effects of training the model with more data (which is basically always good) outweigh the risk of overfitting to the training data.


@nickcdryan, you answered my question. I have seen in a few forums that some people split the data into only two parts, train and validation, sometimes depending on the size of the data set and sometimes regardless of it.

And there are a few posts that recommend splitting the data into train/valid/test, finding hyperparameters that work well, and then using the best combination to retrain on the entire data set before deployment. So I am kind of lost as to whether this is a good/preferred idea.

Train/validation/test is best practice because it removes doubts about overfitting and assists in finding good hyperparameters. Exceptions and tradeoffs come with the amount of data you have, the kind of data, how well your data approximates “real world” data, etc.

“come up with parameters which work good and then using these best parameter combinations to retrain on entire data at the end before deployment.” You can do this but you run a risk of overfitting on the final model because you’re using the same hyperparameters which may no longer be optimal now that you’ve added additional data. Doing this is a tradeoff: risk of overfitting vs. better performance from training on more data. I’ve done something similar myself after building up good confidence/evidence that my model will generalize because I needed all the data I could get.

If you use only two sets of data, you run into problems: you can overfit to the set you validate/test on, and it becomes hard to tell whether the hyperparameters you found are actually good. This is not really a good idea.

If you are using two sets of data, then cross-validation is one solution that helps you build evidence/confidence that you have a robust, generalizable model without using a test set. You do need to examine the results carefully and think about whether your data is a good fit for cross-validation, what kind of cross-validation split to use, and whether you have enough data to make it work.
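For reference, a minimal sketch of how plain k-fold cross-validation carves up the data, again standard library only. The generator name `k_fold_indices` is hypothetical; scikit-learn's `KFold` provides an equivalent, and other split types (stratified, grouped, time-ordered) may suit your data better.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation.

    Hypothetical helper for illustration, not a library API.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, val_idx in splits:
    print(sorted(train_idx), sorted(val_idx))
```

Every sample lands in a validation fold exactly once, so averaging the k validation scores uses all the data for evaluation without ever scoring a model on points it was trained on.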

Cross-validation is a fine way to go but if you’re unsure and can spare the data then train/validation/test is the safest.
