What happens when 'train_dur' < (total duration of files prepared for training)?

Hi, it’s been a while since we last talked.

I recently noticed that I might have been misusing some parameters: train_dur, val_dur, and test_dur. I was setting values much shorter than the total duration of the data prepared as the training set. As a result, the ‘split’ column in the csv file output by vak prep was filled with ‘None’, although training still completed.
So, my questions are:
1. What happens when the sum of the parameter values ‘train_dur’, ‘val_dur’, and ‘test_dur’ is shorter than the total duration of the files prepared for training? Is a randomly selected small subset of files used for each split?
2. What happens when the sum of those three values exceeds the total duration of the files prepared for training?

Thank you for your help.


Hi @mizuki, welcome back :slightly_smiling_face:

These are good questions.

1. What happens when the sum of the parameter values ‘train_dur’, ‘val_dur’, and ‘test_dur’ is shorter than the total duration of the files prepared for training? Is a randomly selected small subset of files used for each split?

Yes: internally, when you call vak prep config.toml, it calls the function vak.split.dataframe, which randomly selects files for each split from the total set of files in the dataset. (It’s randomly selecting rows from the pandas.DataFrame that represents the dataset, in case you’re wondering about the name.)

If you do not specify any of the options {“train_dur”, “val_dur”, “test_dur”} when you run vak prep with a config that has a [TRAIN] section, then it will put all of the data in data_dir in the train split – maybe this is what you were expecting?
Similarly, when you run vak prep with a [PREDICT] config, it just puts all the data in a predict split.
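
To make that concrete, here is a minimal sketch of duration-based random splitting over a table of files like the one vak prep outputs. This is illustrative only, not vak’s actual implementation (that lives in vak.split.dataframe); the “duration” column name and the helper function are assumptions for the example:

```python
# Illustrative sketch of duration-based random splitting (NOT vak's actual code).
# Assumes a table like vak prep's output, with one row per file and a
# "duration" column giving each file's duration in seconds.
import pandas as pd

def assign_splits(df, train_dur=None, val_dur=None, test_dur=None, seed=0):
    df = df.copy()
    df["split"] = "None"  # rows not assigned to any split stay "None"
    # draw rows in random order, like selecting files at random
    shuffled = df.sample(frac=1, random_state=seed)
    for split, target_dur in (("train", train_dur), ("val", val_dur), ("test", test_dur)):
        if target_dur is None:
            continue
        total = 0.0
        for idx in shuffled.index:
            if df.loc[idx, "split"] != "None":
                continue  # already assigned to an earlier split
            df.loc[idx, "split"] = split
            total += df.loc[idx, "duration"]
            if total >= target_dur:
                break  # this split has enough duration
    return df
```

Any rows left marked “None” after the requested durations are filled are simply not used, which is why the split column fills up with “None” when train_dur, val_dur, and test_dur are much shorter than the total duration.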

We probably need to document this better somewhere. Please feel free to raise an issue on the vak GitHub repository suggesting that we do so.

2. What happens when the sum of those three values exceeds the total duration of the files prepared for training?

You should get an error telling you that the sum of those three values exceeds the total duration of the files. Please tell me if not!
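
In case it helps to picture it, the check is along these lines (a hypothetical sketch, not vak’s actual code or error message):

```python
# Hypothetical sketch of the kind of validation involved (NOT vak's actual code).
import pandas as pd

def validate_split_durs(df, train_dur=None, val_dur=None, test_dur=None):
    total_dur = df["duration"].sum()  # total duration of all prepared files
    requested = sum(d for d in (train_dur, val_dur, test_dur) if d is not None)
    if requested > total_dur:
        raise ValueError(
            f"sum of requested split durations ({requested}) exceeds "
            f"total duration of dataset ({total_dur})"
        )
```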

Let me know if that helps!

I see. I might check the source code.
Thank you for your help!!


Sure thing, happy to answer more questions if you have them!

I set only {“train_dur”, “val_dur”} and deleted test_dur from the [PREP] section, and the .csv file output by prep had no “test” in its split column; the column was composed of “train”, “val”, and “None”. Was the data with “None” in the split column used as test data in the actual model training?

Hi @mizuki, sorry I missed this – I set up the forum to send me notifications, but apparently it’s not working as expected.

I realize both the naming and the way that splits are made for different tasks is a bit unclear, sorry for the confusion.

The “test” split is not used during training. It’s only used during “eval”.
During training, the “train” split is what’s used to actually train on, and the “val” split is what’s used to evaluate the model at each validation step.

The reason you can make a test split when you prep the training set is so that you can then write a separate “eval” config pointing at the exact same .csv file, and be extra sure that you have not included any data from the training or validation splits in your test set.
The gold standard is to only ever evaluate your model on this held-out test set, which the model never sees, after you have found good hyperparameters (like the learning rate) with the training and validation splits.
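
One way to convince yourself of this is to load the csv that prep makes and look at the split column directly. A small sketch, where the filename is hypothetical and the column and values are the ones we discussed above:

```python
import pandas as pd

df = pd.read_csv("prep_output.csv")  # hypothetical path to vak prep's csv
# note: pandas may load the string "None" as NaN, so keep NaN in the counts
print(df["split"].value_counts(dropna=False))  # counts of "train", "val", "test", "None"
test_df = df[df["split"] == "test"]  # the held-out rows, used only by eval
```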

In practice in the lab you may go back and forth between them, because you’re trying to figure out what’s going on with a specific bird, for example, but we are trying to make a framework that lets you benchmark and compare models, as well as just train a single model on your data.
Again this is partly confusing because we haven’t documented it very well yet.

We’re in the middle of making it easier to evaluate a model with and without clean-up steps and then we will do a better job of documenting how to run eval.

Does that answer your question?

I was misunderstanding the test set, and now I understand. Thank you.
Also, I had completely overlooked “eval”. It will be useful when I check the accuracy of the model and decide on proper parameters.

One more question. “batch_size” in the [TRAIN] section is explained as “number of samples per batch presented to models during training”, but what does “one sample” refer to? One window of data from the spectrograms, whose size is specified in the [DATALOADER] section?

“batch_size” in the [TRAIN] section is explained as “number of samples per batch presented to models during training”, but what does “one sample” refer to? One window of data from the spectrograms, whose size is specified in the [DATALOADER] section?

Yes, I think you have that right. The way we train TweetyNet is to show it batches of windows taken from spectrograms. So if you have a batch size of 128, that’s 128 windows grabbed at random from the total dataset of all possible windows of the size specified in the [DATALOADER] section. These windows are represented by the WindowDataset class in vak.

So, currently in vak, one “sample” = one window. Calling one element of the dataset a “sample” is terminology from machine learning.
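
In case a concrete picture helps, here is a tiny sketch of “one sample = one window” in plain numpy, not using vak’s WindowDataset; the array shapes and the window_size value are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
spect = rng.random((256, 1000))  # fake spectrogram: (frequency bins, time bins)
window_size = 88                 # like the window_size option in the [DATALOADER] section
batch_size = 128                 # like the batch_size option in the [TRAIN] section

# grab batch_size windows at random start times from the spectrogram
starts = rng.integers(0, spect.shape[1] - window_size, size=batch_size)
batch = np.stack([spect[:, s:s + window_size] for s in starts])
print(batch.shape)  # (128, 256, 88): 128 samples, each one window of the spectrogram
```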

Does that line up with your understanding?

Yes, it does line up with mine.

Now I think I have a much better understanding.
I really appreciate your help. Thank you!

:+1: of course, happy to help whenever @mizuki