Solving "ValueError: Could not find subsets of sufficient duration in less than 5000 iterations."

mizuki · December 16, 2022, 4:42am

Hi, I've recently been getting this error:
ValueError: Could not find subsets of sufficient duration in less than 5000 iterations.

I thought the cause might simply be that the maximum number of iterations was too low, so I increased max_iter in the brute_force function in bruteforce.py up to 20000, but I still get this error.
Am I missing something, or is increasing max_iter the only way to solve this? I specified train_dur and val_dur as floats and set test_dur to -1.

Here is the actual output.

(vak-env-cuda113) C:\Users\finch>python E:\tweetynet\cyn191\manual_re3\check5\edit.py train && vak prep E:\tweetynet\cyn191\manual_re3\check5\cyn191_train.toml && vak train E:\tweetynet\cyn191\manual_re3\check5\cyn191_train.toml
determined that purpose of config file is: train
will add 'csv_path' option to 'TRAIN' section
purpose for dataset: train
will split dataset
making array files containing spectrograms from audio files in: E:\tweetynet\cyn191\manual_re3\train_data5
creating array files with spectrograms
[########################################] | 100% Completed | 13.5s
creating dataset from spectrogram files in: E:\tweetynet\cyn191\manual_re3\train_data5\spectrograms_generated_221216_132220
validating set of spectrogram files
[########################################] | 100% Completed |  2.8s
creating pandas.DataFrame representing dataset from spectrogram files
[########################################] | 100% Completed |  2.6s
Total target duration of splits: 708.9862040816328 seconds. Will be drawn from dataset with total duration: 787.670.
Traceback (most recent call last):
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\Scripts\vak-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\__main__.py", line 45, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\cli\cli.py", line 30, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\cli\prep.py", line 132, in prep
    vak_df, csv_path = core.prep(
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\core\prep.py", line 220, in prep
    vak_df = split.dataframe(
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\split\split.py", line 141, in dataframe
    train_inds, val_inds, test_inds = train_test_dur_split_inds(
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\split\split.py", line 86, in train_test_dur_split_inds
    train_inds, val_inds, test_inds = brute_force(
  File "C:\Users\finch\anaconda3\envs\vak-env-cuda113\lib\site-packages\vak\split\algorithms\bruteforce.py", line 193, in brute_force
    raise ValueError(
ValueError: Could not find subsets of sufficient duration in less than 5000 iterations.

nicholdav · December 16, 2022, 4:52pm

Hi @mizuki sorry you’re running into this.
This function needs to do a better job of explaining why it may not work in the error message; this is an open issue:

github.com/vocalpy/vak

ENH: Improve error message for `vak.split.algorithms.brute_force`

opened 05:31PM - 05 Nov 22 UTC

NickleDave

ENH: enhancement

@lmpascual ran into an issue where this error message got thrown by `vak.split.algorithms.brute_force`:
https://github.com/vocalpy/vak/blob/4199bc5616eef83ef853289b7ad93249b75e41fb/src/vak/split/algorithms/bruteforce.py#L255

For a user who doesn't have some of the context of what `prep` is doing under the hood here, this is kind of confusing. It might sound like it's telling them they need to have 5000 renditions of a certain syllable, for instance.

The explanation should be a bit clearer, with more context. Something like, "The algorithm that creates splits of the dataset for training, validation, etc., did not converge in the maximum number of iterations".

It should also include information to help the user fix the problem, such as: "Check that your labelset does not include very rare labels, which may make it hard to find splits that all contain at least one instance of each label. You might also consider removing any segments labeled as a 'background' or 'trash' category if you want the trained model to ignore those classes; if the segments are removed they will be given the default 'background' label instead, as long as you have unlabeled segments in your dataset such as silent gaps between vocalizations."

Question: does your labelset have any rare labels that occur in only 1-2 files?
split tries to find splits such that all of those labels are in each split.
If there’s only 1 file with a certain label (to give an extreme example) then any split without that file won’t have all of the labels in labelset, so the algorithm will fail.
I’m guessing something like that is what’s happening.
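
A quick way to check for this is to count how many files contain each label. The sketch below is plain Python and does not use vak or crowsetta; `annotations` is a hypothetical dict mapping each filename to the list of labels in that file, and `min_files=3` is an assumed threshold (at least one file per split for train, val, and test).

```python
from collections import Counter


def files_per_label(annotations):
    """Count how many files contain each label at least once."""
    counts = Counter()
    for labels in annotations.values():
        counts.update(set(labels))  # set(): count each file once per label
    return counts


def check_labelset(annotations, labelset, min_files=3):
    """Return (unused, rare) labels from labelset, each as a sorted list.

    unused: labels that never occur in any file.
    rare: labels that occur in fewer than min_files files.
    """
    counts = files_per_label(annotations)
    unused = sorted(lbl for lbl in labelset if counts[lbl] == 0)
    rare = sorted(lbl for lbl in labelset if 0 < counts[lbl] < min_files)
    return unused, rare


# Made-up example data, for illustration only.
annotations = {
    "a.wav": ["a", "b", "a"],
    "b.wav": ["a", "b"],
    "c.wav": ["a", "c"],
}
unused, rare = check_labelset(annotations, {"a", "b", "c", "d"})
print("unused:", unused)
print("rare:", rare)
```

Any label reported as unused or rare is a candidate to remove from labelset before running vak prep again.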

mizuki · December 17, 2022, 2:58am

Thank you for your reply, and sorry for overlooking the issue that was already raised.
I checked how many files include each label and found three cases.
In the first case, I was including a rare label that occurred in only 1-2 files.
In the second case, I was specifying a label in [PREP][labelset] that I never used in the training data.
In the third case, I still got this error even after fixing the two problems above. When I decreased the amount of data used for each of the train, validation, and test splits, the error went away.

nicholdav · December 17, 2022, 3:21am

No apologies needed!
I was trying to apologize to you because it’s not a very clear error message, and show you that we are aware that we need to fix it.

Not sure what’s up with the third case. It could be worth inspecting your data a little more to understand the distribution of labels. Can you tell me what annotation format you’re using? I can paste in a little script to show you how I’d visualize it.

mizuki · December 17, 2022, 3:45am

Right back at you.

I’m using simple-seq format.
I tried a few settings. With [test_dur = -1, train_dur = dur * 0.6, val_dur = dur * 0.2], where dur is the total duration in seconds of the data in the folder, it fails. However, with [test_dur = dur * 0.2, train_dur = -1, val_dur = dur * 0.2], it succeeds.
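
To make the two settings concrete, here is a small sketch of how -1 could be resolved to "the rest of the data". `resolve_durs` is a hypothetical helper written for illustration, not a vak function, and the total of 100 seconds is made up.

```python
def resolve_durs(total_dur, train_dur, val_dur, test_dur):
    """Replace a single -1 with whatever duration the other splits leave over."""
    durs = {"train": train_dur, "val": val_dur, "test": test_dur}
    rest = [name for name, dur in durs.items() if dur == -1]
    if len(rest) > 1:
        raise ValueError("only one split can be set to -1")
    fixed = sum(dur for dur in durs.values() if dur != -1)
    if fixed > total_dur:
        raise ValueError("requested durations exceed the total duration")
    for name in rest:
        durs[name] = total_dur - fixed
    return durs


dur = 100.0  # assumed total duration in seconds
failing = resolve_durs(dur, train_dur=dur * 0.6, val_dur=dur * 0.2, test_dur=-1)
working = resolve_durs(dur, train_dur=-1, val_dur=dur * 0.2, test_dur=dur * 0.2)
print(failing)
print(working)
```

Under this reading, both settings ask for the same nominal sizes (60% train, 20% val, 20% test); they differ only in which split is given explicitly and which takes the remainder.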

nicholdav · December 20, 2022, 4:21pm

Thank you @mizuki that helps me understand what’s going on.

I have it on my to-do list to write a short script for visualizing the distributions of labels in a dataset using crowsetta and a dataset of annotations in the simple-seq format (DOC: Add vignette of inspecting distribution/ratios of classes in a dataset using simple-seq, Issue #211, vocalpy/crowsetta on GitHub).
Will reply back here with that when I do.

A couple of follow up questions:

- Your reply gives split sizes in ratios and -1 (where -1 means “use the rest of the data”). Are you calling vak.split.algorithms.brute_force directly, or are you using a config file?
  - If you’re calling brute_force directly, maybe there’s some other reason it fails? It might be worth trying with a config file in that case.
- What happens if you do [test_dur = -1, train_dur = dur * 0.2, val_dur = dur * 0.2]? I.e., use the same size split that worked for you, but don’t change which split is which size.
  - I’m just trying the “change one thing and see what breaks” approach here; I don’t have a good intuition for why one would work and the other wouldn’t.

mizuki · December 22, 2022, 10:59pm

I have it on my to-do list to write a short script for visualizing the distributions of labels in a dataset using crowsetta and a dataset of annotations in the simple-seq format. DOC: Add vignette of inspecting distribution/ratios of classes in a dataset using simple-seq · Issue #211 · vocalpy/crowsetta · GitHub
Will reply back here with that when I do.

Thank you. It will help.

A couple of follow up questions:

Your reply gives split sizes in ratios and -1 (where -1 means “use the rest of the data”). Are you calling vak.split.algorithms.brute_force directly, or are you using a config file?

If you’re calling brute_force directly maybe there’s some other reason it fails? It might be worth trying with a config file in that case

I’m using a config file.

What happens if you do [test_dur = -1, train_dur = dur * 0.2, val_dur = dur * 0.2]? i.e., use the same size split that worked for you, but don’t change which split is which size

I’m just trying the “change one thing and see what breaks” approach here, don’t have a good intuition for why one would work and the other wouldn’t

I got no error with [test_dur = -1, train_dur = dur * 0.2, val_dur = dur * 0.2].

nicholdav · December 22, 2022, 11:35pm

Thank you @mizuki – I should get to that example code this weekend.

Could you by any chance share a sample dataset with me by email that lets me replicate the error? I’d need the audio + annotation files you’re using for vak prep + the .toml config files.

It’s not immediately obvious to me why one ratio would work but the other wouldn’t, especially since the smaller ratio is working. I wonder if this is some corner case where there’s a rare label, and asking for a larger split makes it more likely that all occurrences / renditions of the rare label end up in that split.

mizuki · December 23, 2022, 3:34am

Thank you.
Ok, I will send you an e-mail with the data and config file.

nicholdav · December 28, 2022, 5:36pm

Thanks again @mizuki for sharing the data that helped us figure out what was going on.

Just want to follow up here in case anyone reads this topic later.

We were able to get vak prep to work eventually by re-running it.

I made an issue to change the error message so it suggests re-running, and also to add tips/hints to the docs suggesting the same when vak prep fails this way:

github.com/vocalpy/vak

DOC: Add error message + note/tip about re-running `vak prep` if splits fail

opened 05:32PM - 28 Dec 22 UTC

NickleDave

DOC: documentation

**Is your feature request related to a problem? Please describe.**
Sometimes `vak prep` can fail to find dataset splits, as described here: https://forum.vocalpy.org/t/solving-valueerror-could-not-find-subsets-of-sufficient-duration-in-less-than-5000-iterations/52/9
Because of the current algorithm this can just be luck of the draw, and simply re-running can solve the problem. But this is not obvious.

**Describe the solution you'd like**
- [ ] add note to error message saying "try re-running"
- [ ] add tip/hint to docs that better explains this

We should for now add a tip/hint to try re-running vak prep when this happens. Should be in
- troubleshooting
- a detailed vak prep vignette
- linked to from FAQs
- elsewhere?

**Describe alternatives you've considered**
better split algorithm!

**Additional context**
Might be worth providing concrete examples similar to those Mizuki provided
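
To illustrate why this can be luck of the draw, here is a toy randomized splitter. It is a simplified sketch and not vak's actual `brute_force` implementation: the function name `toy_brute_force`, the `(name, duration, labels)` tuples, and the greedy fill strategy are all assumptions made for the example.

```python
import random


def labels_in(members):
    """Union of the labels found in a list of (name, duration, labels) tuples."""
    found = set()
    for _name, _dur, labels in members:
        found |= labels
    return found


def toy_brute_force(files, labelset, train_dur, val_dur, max_iter=5000, seed=None):
    """Reshuffle files until train and val reach their target durations
    and every split (train, val, test) contains every label in labelset."""
    rng = random.Random(seed)
    order = list(files)
    for _ in range(max_iter):
        rng.shuffle(order)  # each attempt is a fresh random ordering
        remaining = iter(order)
        splits = {}
        enough = True
        for split_name, target in (("train", train_dur), ("val", val_dur)):
            members, total = [], 0.0
            while total < target:
                f = next(remaining, None)
                if f is None:  # ran out of files before reaching the target
                    break
                members.append(f)
                total += f[1]
            enough = enough and total >= target
            splits[split_name] = members
        splits["test"] = list(remaining)  # test gets the rest of the data
        if enough and all(labelset <= labels_in(m) for m in splits.values()):
            return tuple([f[0] for f in splits[k]] for k in ("train", "val", "test"))
    raise ValueError(f"no valid split found in {max_iter} iterations")


# Made-up dataset: six 10-second files that all contain both labels.
files = [(f"f{i}.wav", 10.0, {"a", "b"}) for i in range(6)]
train, val, test = toy_brute_force(files, {"a", "b"}, train_dur=20.0, val_dur=10.0)
print(train, val, test)
```

Because each attempt starts from a random shuffle, a second run can find a valid split where an earlier run gave up. If a label occurs in fewer files than there are splits, though, no amount of retrying helps.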

I think this is the best we can do for now without a better algorithm to do the splits.
It would require optimizing for multiple constraints but I haven’t had a chance to sit down and work out how. Some half-formed thoughts here:
