Solving "ValueError: Could not find subsets of sufficient duration in less than 5000 iterations."

Hi, I recently started getting this error:
ValueError: Could not find subsets of sufficient duration in less than 5000 iterations.

I thought the cause might simply be that the maximum number of iterations was too low for the split, so I increased max_iter in the brute_force function in bruteforce.py to 20000, but I still get this error.
Is there something I'm misunderstanding, or is increasing max_iter the only way to solve this? I specified train_dur and val_dur as floats and set test_dur to -1.

Here is the actual output:

(vak-env-cuda113) C:\Users\finch>python E:\tweetynet\cyn191\manual_re3\check5\edit.py train && vak prep E:\tweetynet\cyn191\manual_re3\check5\cyn191_train.toml && vak train E:\tweetynet\cyn191\manual_re3\check5\cyn191_train.toml
determined that purpose of config file is: train
will add 'csv_path' option to 'TRAIN' section
purpose for dataset: train
will split dataset
making array files containing spectrograms from audio files in: E:\tweetynet\cyn191\manual_re3\train_data5
creating array files with spectrograms
[########################################] | 100% Completed | 13.5s
creating dataset from spectrogram files in: E:\tweetynet\cyn191\manual_re3\train_data5\spectrograms_generated_221216_132220
validating set of spectrogram files
[########################################] | 100% Completed |  2.8s
creating pandas.DataFrame representing dataset from spectrogram files
[########################################] | 100% Completed |  2.6s
Total target duration of splits: 708.9862040816328 seconds. Will be drawn from dataset with total duration: 787.670.
Traceback (most recent call last):
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\Scripts\vak-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\__main__.py", line 45, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\cli\cli.py", line 30, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\cli\prep.py", line 132, in prep
    vak_df, csv_path = core.prep(
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\core\prep.py", line 220, in prep
    vak_df = split.dataframe(
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\split\split.py", line 141, in dataframe
    train_inds, val_inds, test_inds = train_test_dur_split_inds(
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\split\split.py", line 86, in train_test_dur_split_inds
    train_inds, val_inds, test_inds = brute_force(
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\split\algorithms\bruteforce.py", line 193, in brute_force
    raise ValueError(
ValueError: Could not find subsets of sufficient duration in less than 5000 iterations.

Hi @mizuki, sorry you’re running into this.
This function needs to do a better job of explaining why it may not work in the error message; this is an open issue:

Question: does your labelset have any rare labels that occur in only 1-2 files?
The split step tries to find splits such that every label in labelset is present in each split.
If there’s only 1 file with a certain label (to give an extreme example) then any split without that file won’t have all of the labels in labelset, so the algorithm will fail.
I’m guessing something like that is what’s happening.
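One quick way to check for this is to count how many files contain each label. The sketch below is only an illustration: it assumes you have already loaded your annotations into a dict that maps each file name to its sequence of labels (the file names and labels here are made up), and the helper names files_per_label and rare_labels are hypothetical, not part of vak or crowsetta. The threshold of 3 files comes from the reasoning above: if each of the train, val, and test splits must contain every label, a label has to occur in at least three different files.

```python
from collections import Counter


def files_per_label(labels_by_file):
    """Count how many files contain each label at least once."""
    counts = Counter()
    for labels in labels_by_file.values():
        # set() so a label repeated within one file is counted once
        counts.update(set(labels))
    return counts


def rare_labels(labels_by_file, labelset, min_files=3):
    """Return {label: number of files} for labels found in fewer than min_files files."""
    counts = files_per_label(labels_by_file)
    return {
        label: counts.get(label, 0)
        for label in sorted(labelset)
        if counts.get(label, 0) < min_files
    }


# Hypothetical data: file name -> labels annotated in that file
labels_by_file = {
    "song1.wav": ["a", "b", "c"],
    "song2.wav": ["a", "b"],
    "song3.wav": ["a", "b", "b"],
    "song4.wav": ["a", "d"],
}
labelset = set("abcde")

print(rare_labels(labels_by_file, labelset))
```

A label that shows up here with a count of 0 is in labelset but never used in the data; a count of 1 or 2 means it cannot be present in all three splits.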

Thank you for your reply, and sorry for overlooking the issue that was already raised.
I checked how many files include each label and found three cases.
In the first case, I was including a rare label that occurred in only 1-2 files.
In the second case, [PREP][labelset] included a label that I never used in the training data.
In the third case, I still got this error even though the two problems above were fixed. When I decrease the amount of data used for train, validation, and test, the error goes away.

No apologies needed!
I was trying to apologize to you because it’s not a very clear error message, and show you that we are aware that we need to fix it.

🤔 Not sure what’s up with the third case. It could be worth inspecting your data a little more to understand the distribution of labels. Can you tell me what annotation format you’re using? I can paste in a little script to show you how I’d visualize it.

Right back at you.

I’m using simple-seq format.
I tried a few settings. With [test_dur = -1, train_dur = dur * 0.6, val_dur = dur * 0.2], where dur is the total duration in seconds of the data in the folder, it fails. However, with [test_dur = dur * 0.2, train_dur = -1, val_dur = dur * 0.2], it succeeds.

Thank you @mizuki, that helps me understand what’s going on.

I have it on my to-do list to write a short script for visualizing the distributions of labels in a dataset using crowsetta and a dataset of annotations in the simple-seq format (see "DOC: Add vignette of inspecting distribution/ratios of classes in a dataset using simple-seq", Issue #211, vocalpy/crowsetta on GitHub).
Will reply back here with that when I do.

A couple of follow up questions:

  • Your reply gives split sizes as ratios and -1 (where -1 means “use the rest of the data”). Are you calling vak.split.algorithms.brute_force directly? Or are you using a config file?
    • If you’re calling brute_force directly maybe there’s some other reason it fails? It might be worth trying with a config file in that case
  • What happens if you do [test_dur = -1, train_dur = dur * 0.2, val_dur = dur * 0.2]? i.e., use the same size split that worked for you, but don’t change which split is which size
    • I’m just trying the “change one thing and see what breaks” approach here, don’t have a good intuition for why one would work and the other wouldn’t

I have it on my to-do list to write a short script for visualizing the distributions of labels in a dataset using crowsetta and a dataset of annotations in the simple-seq format (see "DOC: Add vignette of inspecting distribution/ratios of classes in a dataset using simple-seq", Issue #211, vocalpy/crowsetta on GitHub).
Will reply back here with that when I do.

Thank you. It will help.

A couple of follow up questions:

  • Your reply gives split sizes as ratios and -1 (where -1 means “use the rest of the data”). Are you calling vak.split.algorithms.brute_force directly? Or are you using a config file?
    • If you’re calling brute_force directly maybe there’s some other reason it fails? It might be worth trying with a config file in that case

I’m using a config file.

  • What happens if you do [test_dur = -1, train_dur = dur * 0.2, val_dur = dur * 0.2]? i.e., use the same size split that worked for you, but don’t change which split is which size
    • I’m just trying the “change one thing and see what breaks” approach here, don’t have a good intuition for why one would work and the other wouldn’t

I got no error with [test_dur = -1, train_dur = dur * 0.2, val_dur = dur * 0.2].

Thank you @mizuki – I should get to that example code this weekend.

Could you by any chance share a sample dataset with me by email that lets me replicate the error? I’d need the audio + annotation files you’re using for vak prep + the .toml config files.

It’s not immediately obvious to me why one ratio would work but the other wouldn’t. Especially since the smaller ratio is working. I wonder if this is some corner case where there’s a rare label and when you ask for a larger split that makes it more likely that all occurrences / renditions of the rare label end up in that split.
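If it helps to experiment with this, here is a toy sketch of a randomized duration split with the "every label in every split" constraint. To be clear, this is not vak's actual brute_force implementation; the function name toy_duration_split and its arguments are made up for illustration, and it assumes whole files are assigned to splits in a shuffled order, with the test split taking whatever is left (like test_dur = -1).

```python
import random


def toy_duration_split(durs, labels, labelset, train_dur, val_dur,
                       max_iter=5000, seed=None):
    """Toy randomized split (NOT vak's brute_force).

    durs: list of file durations in seconds
    labels: list of sets, the labels that occur in each file
    Returns a dict mapping split name to a list of file indices.
    """
    rng = random.Random(seed)
    labelset = set(labelset)
    targets = [("train", train_dur), ("val", val_dur)]
    for _ in range(max_iter):
        order = list(range(len(durs)))
        rng.shuffle(order)
        splits, pos, enough = {}, 0, True
        for name, target in targets:
            inds, total = [], 0.0
            # take files in shuffled order until this split is long enough
            while total < target and pos < len(order):
                inds.append(order[pos])
                total += durs[order[pos]]
                pos += 1
            enough = enough and total >= target
            splits[name] = inds
        # like test_dur = -1: the test split gets whatever is left
        splits["test"] = order[pos:]
        # accept only if every split contains every label in labelset
        has_all = all(
            labelset <= set().union(*(labels[i] for i in inds))
            for inds in splits.values()
        )
        if enough and has_all:
            return splits
    raise ValueError(f"no valid split found in {max_iter} iterations")
```

In this toy version, a label that occurs in only one file makes every iteration fail, so raising max_iter cannot help. When a label is merely uncommon, success depends on the shuffle, so one run can fail where another succeeds.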

Thank you.
Ok, I will send you an e-mail with the data and config file.

Thanks again @mizuki for sharing the data that helped us figure out what was going on.

Just want to follow up here in case anyone reads this topic later.

We were able to get vak prep to work eventually by re-running it.

I made an issue to change the error message so it suggests re-running, and also to add tips/hints in the docs suggesting to do so when vak prep fails this way.
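Until then, re-running can be automated with a small wrapper. This is a minimal sketch, not something vak provides: run_with_retries is a hypothetical helper that re-runs any command until it exits with code 0, and the config file name in the usage comment is a placeholder. Note that it retries on any failure, not only on this particular ValueError, and each attempt repeats the whole command, so it is worth checking the output of the failed attempts.

```python
import subprocess
import sys


def run_with_retries(cmd, max_attempts=5):
    """Run cmd until it exits with code 0, at most max_attempts times.

    Returns the number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} failed with exit code {result.returncode}",
            file=sys.stderr,
        )
    raise RuntimeError(f"command failed {max_attempts} times: {cmd}")


# Example usage (placeholder config path):
# run_with_retries(["vak", "prep", "cyn191_train.toml"])
```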

I think this is the best we can do for now without a better algorithm to do the splits.
It would require optimizing for multiple constraints but I haven’t had a chance to sit down and work out how. Some half-formed thoughts here: