What happens when 'train_dur' < (total duration of files prepared for training)?

Hi, it’s been a while since we last talked.

I recently noticed that I might have been misusing some parameters: train_dur, val_dur, and test_dur. I was setting values much shorter than the total duration of the data prepared as the training set. As a result, the ‘split’ column in the csv file output by vak prep was filled with ‘None’, although training still completed.
So, my questions are:
1. What happens when the sum of the parameter values ‘train_dur’, ‘val_dur’, and ‘test_dur’ is shorter than the total duration of the files prepared for training? Is a randomly selected small subset of files used for each split?
2. What happens when the sum of those three values exceeds the total duration of the files prepared for training?

Thank you for your help.


Hi @mizuki, welcome back :slightly_smiling_face:

These are good questions.

1. What happens when the sum of the parameter values ‘train_dur’, ‘val_dur’, and ‘test_dur’ is shorter than the total duration of the files prepared for training? Is a randomly selected small subset of files used for each split?

Yes: internally, when you call vak prep config.toml, it calls the function vak.split.dataframe, which randomly selects files for each split from the total set of files in the dataset. (It’s randomly selecting rows from the pandas.DataFrame that represents the dataset, in case you’re wondering about the name.)

If you do not specify any of the options {“train_dur”, “val_dur”, “test_dur”} when you run vak prep with a config that has a [TRAIN] section, then it will put all of the data in data_dir in the train split – maybe this is what you were expecting?
Similarly, when you run vak prep with a [PREDICT] config, it just puts all the data in a predict split.
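
To make that concrete, here is a minimal sketch of duration-based random splitting over a table of files like the one vak prep outputs. This is illustrative only, not vak’s actual implementation (that lives in vak.split.dataframe); the “duration” column name and the helper function are assumptions for the example:

```python
# Illustrative sketch of duration-based random splitting (NOT vak's actual code).
# Assumes a table like vak prep's output, with one row per file and a
# "duration" column giving each file's duration in seconds.
import pandas as pd

def assign_splits(df, train_dur=None, val_dur=None, test_dur=None, seed=0):
    df = df.copy()
    df["split"] = "None"  # rows not assigned to any split stay "None"
    # draw rows in random order, like selecting files at random
    shuffled = df.sample(frac=1, random_state=seed)
    for split, target_dur in (("train", train_dur), ("val", val_dur), ("test", test_dur)):
        if target_dur is None:
            continue
        total = 0.0
        for idx in shuffled.index:
            if df.loc[idx, "split"] != "None":
                continue  # already assigned to an earlier split
            df.loc[idx, "split"] = split
            total += df.loc[idx, "duration"]
            if total >= target_dur:
                break  # this split has enough duration
    return df
```

Any rows left marked “None” after the requested durations are filled are simply not used, which is why the split column fills up with “None” when train_dur, val_dur, and test_dur are much shorter than the total duration.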

We probably need to document this better somewhere. Please feel free to raise an issue on the vak GitHub repository suggesting that we do so.

2. What happens when the sum of those three values exceeds the total duration of the files prepared for training?

You should get an error telling you that the sum of those three values exceeds the total duration of the files. Please tell me if not!
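
In case it helps to picture it, the check is along these lines (a hypothetical sketch, not vak’s actual code or error message):

```python
# Hypothetical sketch of the kind of validation involved (NOT vak's actual code).
import pandas as pd

def validate_split_durs(df, train_dur=None, val_dur=None, test_dur=None):
    total_dur = df["duration"].sum()  # total duration of all prepared files
    requested = sum(d for d in (train_dur, val_dur, test_dur) if d is not None)
    if requested > total_dur:
        raise ValueError(
            f"sum of requested split durations ({requested}) exceeds "
            f"total duration of dataset ({total_dur})"
        )
```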

Let me know if that helps!

I see. I might check the source code.
Thank you for your help!!


Sure thing, happy to answer more questions if you have them!

I set only {“train_dur”, “val_dur”} and deleted test_dur from the [PREP] section, and the .csv file output by prep had no “test” in its split column; the column was composed of “train”, “val”, and “None”. Was the data with “None” in the split column used as test data in the actual model training?

Hi @mizuki, sorry I missed this – I set up the forum to send me notifications, but apparently it’s not working as expected.

I realize both the naming and the way that splits are made for different tasks is a bit unclear, sorry for the confusion.

The “test” split is not used during training. It’s only used during “eval”.
During training, the “train” split is what’s used to actually train on, and the “val” split is what’s used to evaluate the model at each validation step.

The reason you can make a test split when you prep the training set is so that you can then write a separate “eval” config pointing at the exact same .csv file, and be extra sure that you have not included any data from the training or validation splits in your test set.
The gold standard is to only ever evaluate your model on this held-out test set, which the model never sees, after you have found good hyperparameters (like the learning rate) with the training and validation splits.
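
One way to convince yourself of this is to load the csv that prep makes and look at the split column directly. A small sketch, where the filename is hypothetical and the column and values are the ones we discussed above:

```python
import pandas as pd

df = pd.read_csv("prep_output.csv")  # hypothetical path to vak prep's csv
# note: pandas may load the string "None" as NaN, so keep NaN in the counts
print(df["split"].value_counts(dropna=False))  # counts of "train", "val", "test", "None"
test_df = df[df["split"] == "test"]  # the held-out rows, used only by eval
```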

In practice in the lab you may go back and forth between them, because you’re trying to figure out what’s going on with a specific bird, for example, but we are trying to make a framework that lets you benchmark and compare models, as well as just train a single model on your data.
Again this is partly confusing because we haven’t documented it very well yet.

We’re in the middle of making it easier to evaluate a model with and without clean-up steps and then we will do a better job of documenting how to run eval.

Does that answer your question?

I was misunderstanding the test set, and now I understand. Thank you.
Also, I had completely overlooked “eval”. It will be useful when I check the accuracy of the model and decide on proper parameters.

One more question. “batch_size” in the [TRAIN] section is explained as “number of samples per batch presented to models during training”, but what does “one sample” refer to? One window of data from the spectrograms, whose size is specified in the [DATALOADER] section?

“batch_size” in the [TRAIN] section is explained as “number of samples per batch presented to models during training”, but what does “one sample” refer to? One window of data from the spectrograms, whose size is specified in the [DATALOADER] section?

Yes, I think you have that right. The way we train TweetyNet is to show it batches of windows taken from spectrograms. So if you have a batch size of 128, that’s 128 windows grabbed at random from the total dataset of all possible windows of the size specified in the [DATALOADER] section. These windows are represented by the WindowDataset class in vak.

So, currently in vak, one “sample” = one window. Calling one element of the dataset a “sample” is terminology from machine learning.
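
In case a concrete picture helps, here is a tiny sketch of “one sample = one window” in plain numpy, not using vak’s WindowDataset; the array shapes and the window_size value are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
spect = rng.random((256, 1000))  # fake spectrogram: (frequency bins, time bins)
window_size = 88                 # like the window_size option in the [DATALOADER] section
batch_size = 128                 # like the batch_size option in the [TRAIN] section

# grab batch_size windows at random start times from the spectrogram
starts = rng.integers(0, spect.shape[1] - window_size, size=batch_size)
batch = np.stack([spect[:, s:s + window_size] for s in starts])
print(batch.shape)  # (128, 256, 88): 128 samples, each one window of the spectrogram
```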

Does that line up with your understanding?

Yes, it does line up with mine.

Now I think I have a much better understanding.
I really appreciate your help. Thank you!

:+1: of course, happy to help whenever @mizuki