BrokenProcessPool during vak prep

Hi all,
I installed vak on Linux and everything worked fine with the Bengalese finch test set from the tutorial. Now I’m trying my own recordings, but I think I’m running into memory problems. During vak prep it starts as usual, but crashes with the following error after about 10-20 seconds:

File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

(full error message below)

Things I tried:

  • setting device in the toml to both ‘cpu’ and ‘gpu’
  • changing number of workers (between 1 and 8 if I remember correctly)
  • reducing traindur, valdur and testdur, and the number of epochs, but I think these are only used in the training step, right?
  • using shorter files. With only one short file (5 minutes, 59.3 MB) vak prep does seem to work, so I presume the long files are the problem. (The original files are about 30 min and about 330-350 MB per file.)

As a workaround, I could write a script to split all files into shorter ones, but given this issue https://github.com/dask/dask/issues/8506 I thought you might have ideas for a better solution? Thanks of course for already putting effort into this, David!

My vak version is 1.0.3, so more recent than the issue post above.
System info: Linux Mint 21.3 (based on Ubuntu 22.04)
Graphics card: GeForce GTX 1650

Any ideas are welcome.
Thanks in advance!
Sita

full error message:

2024-11-24 16:41:28,443 - vak.prep.frame_classification.frame_classification - INFO - vak version: 1.0.3
2024-11-24 16:41:28,443 - vak.prep.frame_classification.frame_classification - INFO - Will prepare dataset as directory: /media/sita/sth8T/hoornraven/hoornraven_Tweetynet/tweetynet_hornbills_vanlaptopWin/tweetynet_test_hornb/recording120324M/train/prep_out20241124/used-vak-frame-classification-dataset-generated-241124_164128
2024-11-24 16:41:28,647 - vak.prep.spectrogram_dataset.prep - INFO - making array files containing spectrograms from audio files in: /media/sita/sth8T/hoornraven/hoornraven_Tweetynet/tweetynet_hornbills_vanlaptopWin/tweetynet_test_hornb/recording120324M/train/used
2024-11-24 16:41:28,647 - vak.prep.spectrogram_dataset.audio_helper - INFO - creating array files with spectrograms
[                                        ] | 0% Completed | 21.83 sms
Traceback (most recent call last):
  File "/home/sita/anaconda3/envs/tweetyS4/bin/vak", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/__main__.py", line 49, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/cli/cli.py", line 54, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/cli/cli.py", line 28, in prep
    prep(toml_path=toml_path)
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/cli/prep.py", line 134, in prep
    _, dataset_path = prep_module.prep(
                      ^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/prep_.py", line 194, in prep
    dataset_df, dataset_path = prep_frame_classification_dataset(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/frame_classification/frame_classification.py", line 276, in prep_frame_classification_dataset
    source_files_df: pd.DataFrame = get_or_make_source_files(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/frame_classification/source_files.py", line 144, in get_or_make_source_files
    source_files_df = prep_spectrogram_dataset(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/spectrogram_dataset/prep.py", line 151, in prep_spectrogram_dataset
    spect_files = audio_helper.make_spectrogram_files_from_audio_files(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/spectrogram_dataset/audio_helper.py", line 247, in make_spectrogram_files_from_audio_files
    spect_files = list(bag.map(_spect_file))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/dask/bag/core.py", line 1488, in __iter__
    return iter(self.compute())
                ^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/dask/base.py", line 372, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/dask/base.py", line 660, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Hi @Sita, sorry vak is not working for you here!

At first glance, my guess is you are right that file size is the issue.
Can you tell me a little bit more about the dataset?
Is it that you have 30 minute files that contain some number of zebra finch song bouts?
I have had other people recently ask me about finding bouts in longer files, I’d be happy to help write a script that does this, I could add it as an example in the VocalPy docs (since you can do this without needing vak)

re: the issue that you linked, maybe you saw that we added some options to the config that give you a little more control over how dask does its thing?

(that issue links the one on the dask repo you referenced)

But with 30-minute recordings it seems like you would end up with really gigantic files!
Spectrograms typically take up much more space than the audio files too, so if the classes you’re trying to fit the model to occur in, say, less than half of those 30 minutes, you could be using a lot of storage space for data you won’t actually use.

If you can tell me a little bit more about the data that can help us figure out what we need to do here, thanks!

Hi, thanks for the quick reply as usual :slight_smile:
Sorry I missed the dask toml options, I’ll try those indeed. I misinterpreted that the workers were the solution for this. I’ll read it in more detail first :slight_smile:

It’s not zebra finches this time but southern ground hornbills, really cool birds. They don’t produce complex vocalizations, just calls. The calls differ between male and female, and I’d like to automatically recognize male/female call notes in long noisy zoo recordings. So for prep I could extract the annotations to reduce the total amount of sound; indeed there is a lot of unannotated sound right now. But in the end I would ideally apply it to continuous recordings, which could contain anything between 0 and 1000 calls per half hour (maybe even more, this is just a quick check on a few recordings). Those I can also cut into shorter snippets before I run predict on them.

In addition, I could filter out high frequencies, because the sounds are really low, so that could reduce the amount of data as well.

That being said, I would be interested in zebra finch bout detection as well, for other projects :slight_smile:

I’ll let you know how it goes with adjusting the config files! (hopefully tomorrow, now occupied with loads of grading :frowning: )

Thanks heaps!

Sorry I missed the dask toml options, I’ll try those indeed. I misinterpreted that the workers were the solution for this.

No worries, it’s not super clear.

I’ll read it in more detail first

The short version is: you want to add the audio_dask_bag_kwargs option to the [vak.prep] table in the config, and set npartitions to something lower than the default of 100

[vak.prep]
# ... other options here
audio_dask_bag_kwargs = {npartitions = 20}

I have to admit I don’t quite understand what this actually does :confused:
I asked in the dask forum but never got a reply; the relevant section of the docs is here in case you’re curious
https://docs.dask.org/en/stable/bag-creation.html?highlight=npartitions#create-dask-bags

I basically just added it since Hjalmar said it would make prep work for their data. Would love to know whether it helps you too.
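For what it’s worth, my loose mental model of npartitions, sketched with a plain-Python helper (this is not dask’s actual code, which may group inputs differently): it controls how many groups the input files get bundled into, and each group becomes one task handed to a worker process.

```python
import math

def partition(items, npartitions):
    """Split `items` into at most `npartitions` contiguous groups,
    roughly mimicking how a dask bag groups its inputs into tasks."""
    size = math.ceil(len(items) / npartitions)
    return [items[i:i + size] for i in range(0, len(items), size)]

audio_files = [f"rec_{i:02d}.wav" for i in range(10)]
# npartitions larger than the number of files: one file per task
print([len(p) for p in partition(audio_files, 100)])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# npartitions = 2: each worker task gets a batch of 5 files
print([len(p) for p in partition(audio_files, 2)])    # [5, 5]
```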

It’s not zebra finches this time but southern ground hornbills, really cool birds.

Cool!

for prep I could extract the annotations to reduce the total amount of sound

I take this to mean that you have some sections annotated, and you can safely assume the rest is silence? So you could write a script that does something like “find every period of calls with no silent period between them greater than duration $d$, then make that into an audio clip with $b$ seconds before the first call and $a$ seconds after the last, and save it to a file”, or something similar?
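Something like this minimal sketch of that logic, say (a hypothetical helper operating on annotation onset/offset times in seconds):

```python
def bout_windows(onsets_offsets, d=2.0, b=0.5, a=0.5):
    """Group annotated calls into bouts and return (start, stop) clip windows.

    onsets_offsets : list of (onset_s, offset_s) tuples, sorted by onset.
    d : calls separated by silence shorter than this belong to one bout.
    b, a : padding in seconds before the first / after the last call.
    """
    windows = []
    bout_start, bout_stop = onsets_offsets[0]
    for onset, offset in onsets_offsets[1:]:
        if onset - bout_stop > d:  # gap longer than d: close the current bout
            windows.append((max(bout_start - b, 0.0), bout_stop + a))
            bout_start = onset
        bout_stop = max(bout_stop, offset)
    windows.append((max(bout_start - b, 0.0), bout_stop + a))
    return windows
```

For calls at 1.0–1.5 s, 2.0–2.5 s, and 10.0–10.5 s with the defaults above, this yields two clip windows, (0.5, 3.0) and (9.5, 11.0).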

My hunch is that this would be worth doing so that (1) prep will be quick and take up less space, and (2) you don’t create a dataset with a huge imbalance between the number of frames labeled “silent/background” and all the other classes (e.g., the “call” class), which can cause the model to overfit and end up always predicting the background class. This is a failure mode that’s easy to miss, because the model will still look like it’s doing well – e.g., getting 99% frame-wise accuracy – but only because 99% of the frames are “silent/background”!

So you probably want to clip so that you still have a good representation of the call class or classes. If you can maintain the class balance, your model will do better even when you apply it to your continuous recordings. The balance doesn’t have to be perfect – we know it’s still usually true in these datasets that “background” frames outnumber the other classes – you just don’t want it to be extremely imbalanced (I can’t give you a good number for “extremely”, sorry – I think it depends on the data).
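One quick way to check how skewed a file is before deciding whether to clip it (a rough sketch; it treats everything outside annotated segments as background and ignores frame-bin edges):

```python
def background_fraction(total_dur, labeled_segments):
    """Fraction of a recording not covered by any labeled segment.

    total_dur : total duration in seconds.
    labeled_segments : list of (onset_s, offset_s), sorted, non-overlapping.
    """
    labeled = sum(offset - onset for onset, offset in labeled_segments)
    return (total_dur - labeled) / total_dur

# e.g. a 30-minute file with 60 one-second calls is ~97% background frames
frac = background_fraction(1800.0, [(i * 30.0, i * 30.0 + 1.0) for i in range(60)])
```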

Am I explaining that clearly?
Just trying to save you some pain down the road!

I’ll let you know how it goes with adjusting the cofig files! (hopefully tomorrow, now occupied with loads of grading :frowning: )

Totally understood, no rush!

Ok, so setting npartitions to 1 and shortening the files to about 5 min each (about 58 MB) worked: the wav.spect.npz files are now produced. Unfortunately I’m running into another error, but I’ll post about that in a thread with that error message (“Could not find subsets of sufficient duration in less than 5000 iterations”) if I can’t figure it out. Thanks again for the help!

Excellent, glad to hear that worked @Sita.

Curious, did you try other values for npartitions? Did you need to go down to 1 because it wouldn’t work for other values?

Could not find subsets of sufficient duration in less than 5000 iterations

This happens when vak.split can’t find a way to make a split (e.g., the training split) of the specified duration that has at least one occurrence of each label. It might be because there’s some rare label that only occurs 2-3 times, something like that? Please do make a new thread if you’re stuck on this, happy to help
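In case it clarifies that error: the search is roughly like the sketch below (a simplified stand-in, not vak’s actual implementation). It keeps drawing random subsets of files until one both reaches the target duration and contains every label, and gives up after a fixed number of iterations:

```python
import random

def find_split(file_durs, file_labelsets, target_dur, labelset, max_iter=5000):
    """Randomly draw files until a subset reaches `target_dur` *and* contains
    every label in `labelset`; give up after `max_iter` attempts."""
    inds = list(range(len(file_durs)))
    for _ in range(max_iter):
        random.shuffle(inds)
        subset, dur, seen = [], 0.0, set()
        for i in inds:
            subset.append(i)
            dur += file_durs[i]
            seen |= file_labelsets[i]
            if dur >= target_dur:
                break
        if dur >= target_dur and seen >= labelset:
            return subset
    raise ValueError(
        f"Could not find subsets of sufficient duration in less than {max_iter} iterations"
    )
```

So a label that appears in only one or two files makes valid subsets rare, which is when you hit the iteration limit.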

Thanks @nicholdav!
Yes, I tried different numbers of npartitions. Before shortening the recordings I think it helped a bit (as in: it processed one file but not all of them, as opposed to none of the files with higher npartitions), but I’m not 100% sure anymore. After shortening the recordings, npartitions doesn’t change anything (I tried values between 1 and 20). In sum, shorter audio files were the most important change.