BrokenProcessPool during vak prep

Hi all,
I installed vak on Linux and everything worked fine with the Bengalese finch test set from the tutorial. Now I’m trying my own recordings, but I think I’m running into memory problems. During vak prep it starts as usual, but crashes with the following error after about 10-20 seconds:

File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

(full error message below)

Things I tried:

  • setting device in the toml to both ‘cpu’ and ‘gpu’
  • changing number of workers (between 1 and 8 if I remember correctly)
  • reducing traindur, valdur and testdur, and the number of epochs, but I think these are only used in the training step, right?
  • using shorter files. With only one short file (5 minutes, 59.3 MB) vak prep does seem to work, so I presume the long files are the problem. (The original files are about 30 min and about 330-350 MB per file.)

As a workaround, I could write a script to split all files into shorter ones, but given this issue https://github.com/dask/dask/issues/8506 I thought you might have ideas for a better solution? Thanks of course for already putting effort into this, David!

My vak version is 1.0.3, so more recent than the issue post above.
System info: Linux Mint 21.3 (based on Ubuntu 22.04)
Graphics card: GeForce GTX 1650

Any ideas are welcome.
Thanks in advance!
Sita

full error message:

2024-11-24 16:41:28,443 - vak.prep.frame_classification.frame_classification - INFO - vak version: 1.0.3
2024-11-24 16:41:28,443 - vak.prep.frame_classification.frame_classification - INFO - Will prepare dataset as directory: /media/sita/sth8T/hoornraven/hoornraven_Tweetynet/tweetynet_hornbills_vanlaptopWin/tweetynet_test_hornb/recording120324M/train/prep_out20241124/used-vak-frame-classification-dataset-generated-241124_164128
2024-11-24 16:41:28,647 - vak.prep.spectrogram_dataset.prep - INFO - making array files containing spectrograms from audio files in: /media/sita/sth8T/hoornraven/hoornraven_Tweetynet/tweetynet_hornbills_vanlaptopWin/tweetynet_test_hornb/recording120324M/train/used
2024-11-24 16:41:28,647 - vak.prep.spectrogram_dataset.audio_helper - INFO - creating array files with spectrograms
[                                        ] | 0% Completed | 21.83 sms
Traceback (most recent call last):
  File "/home/sita/anaconda3/envs/tweetyS4/bin/vak", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/__main__.py", line 49, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/cli/cli.py", line 54, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/cli/cli.py", line 28, in prep
    prep(toml_path=toml_path)
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/cli/prep.py", line 134, in prep
    _, dataset_path = prep_module.prep(
                      ^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/prep_.py", line 194, in prep
    dataset_df, dataset_path = prep_frame_classification_dataset(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/frame_classification/frame_classification.py", line 276, in prep_frame_classification_dataset
    source_files_df: pd.DataFrame = get_or_make_source_files(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/frame_classification/source_files.py", line 144, in get_or_make_source_files
    source_files_df = prep_spectrogram_dataset(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/spectrogram_dataset/prep.py", line 151, in prep_spectrogram_dataset
    spect_files = audio_helper.make_spectrogram_files_from_audio_files(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/vak/prep/spectrogram_dataset/audio_helper.py", line 247, in make_spectrogram_files_from_audio_files
    spect_files = list(bag.map(_spect_file))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/dask/bag/core.py", line 1488, in __iter__
    return iter(self.compute())
                ^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/dask/base.py", line 372, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/site-packages/dask/base.py", line 660, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sita/anaconda3/envs/tweetyS4/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Hi @Sita, sorry vak is not working for you here!

At first glance, my guess is you are right that file size is the issue.
Can you tell me a little bit more about the dataset?
Is it that you have 30 minute files that contain some number of zebra finch song bouts?
I have had other people recently ask me about finding bouts in longer files, I’d be happy to help write a script that does this, I could add it as an example in the VocalPy docs (since you can do this without needing vak)

re: the issue that you linked, maybe you saw that we added some options to the config that give you a little more control over how dask does its thing?

(that issue links the one on the dask repo you referenced)

But with 30-minute recordings it seems like you would end up with really gigantic files!
Spectrograms typically take up much more space than the audio files too, so if the classes you’re trying to fit the model to occur in, say, less than half of those 30 minutes, you could be using a lot of storage space for data you won’t actually use.

If you can tell me a little bit more about the data that can help us figure out what we need to do here, thanks!

Hi, thanks for the quick reply as usual :slight_smile:
Sorry I missed the dask toml options, I’ll try those indeed. I misinterpreted that the workers were the solution for this. I’ll read it in more detail first :slight_smile:

It’s not zebra finches this time but southern ground hornbills, really cool birds. They don’t produce complex vocalizations, just calls. The calls differ between male and female, and I’d like to automatically recognize male/female call notes in long noisy zoo recordings. So for prep I could extract the annotations to reduce the total amount of sound; indeed there is a lot of unannotated sound right now. But in the end I would ideally apply it to continuous recordings, which could contain anything between 0 and 1000 calls per half hour (maybe even more, this is just a quick check on a few recordings). Those I can also cut into shorter snippets before I run predict on them.

In addition, I could filter out high frequencies, because the sounds are really low, so that could reduce the amount of data as well.

That being said, I would be interested in zebra finch bout detection as well, for other projects :slight_smile:

I’ll let you know how it goes with adjusting the config files! (hopefully tomorrow, now occupied with loads of grading :frowning: )

Thanks heaps!

Sorry I missed the dask toml options, I’ll try those indeed. I misinterpreted that the workers were the solution for this.

No worries, it’s not super clear.

I’ll read it in more detail first

The short version is: you want to add the audio_dask_bag_kwargs option to the [vak.prep] table in the config, and set npartitions to something lower than the default of 100

[vak.prep]
# ... other options here
audio_dask_bag_kwargs = {npartitions = 20}

I have to admit I don’t quite understand what this actually does :confused:
I asked in the dask forum but never got a reply; the relevant section of the docs is here in case you’re curious
https://docs.dask.org/en/stable/bag-creation.html?highlight=npartitions#create-dask-bags

I basically just added it since Hjalmar said it would make prep work for their data. Would love to know whether it helps you too.
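For what it’s worth, my loose mental model of npartitions, sketched with a plain-Python helper (this is not dask’s actual code, which may group inputs differently): it controls how many groups the input files get bundled into, and each group becomes one task handed to a worker process.

```python
import math

def partition(items, npartitions):
    """Split `items` into at most `npartitions` contiguous groups,
    roughly mimicking how a dask bag groups its inputs into tasks."""
    size = math.ceil(len(items) / npartitions)
    return [items[i:i + size] for i in range(0, len(items), size)]

audio_files = [f"rec_{i:02d}.wav" for i in range(10)]
# npartitions larger than the number of files: one file per task
print([len(p) for p in partition(audio_files, 100)])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# npartitions = 2: each worker task gets a batch of 5 files
print([len(p) for p in partition(audio_files, 2)])    # [5, 5]
```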

It’s not zebra finches this time but southern ground hornbills, really cool birds.

Cool!

for prep I could extract the annotations to reduce the total amount of sound

I take this to mean that you have some sections annotated, and you can safely assume the rest is silence? So you could write a script that does something like “find every period of calls with no silent period between them greater than duration $d$, then make that into an audio clip with $b$ seconds before the first call and $a$ seconds after the last, and save it to a file”, or something similar?
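Something like this minimal sketch of that logic, say (a hypothetical helper operating on annotation onset/offset times in seconds):

```python
def bout_windows(onsets_offsets, d=2.0, b=0.5, a=0.5):
    """Group annotated calls into bouts and return (start, stop) clip windows.

    onsets_offsets : list of (onset_s, offset_s) tuples, sorted by onset.
    d : calls separated by silence shorter than this belong to one bout.
    b, a : padding in seconds before the first / after the last call.
    """
    windows = []
    bout_start, bout_stop = onsets_offsets[0]
    for onset, offset in onsets_offsets[1:]:
        if onset - bout_stop > d:  # gap longer than d: close the current bout
            windows.append((max(bout_start - b, 0.0), bout_stop + a))
            bout_start = onset
        bout_stop = max(bout_stop, offset)
    windows.append((max(bout_start - b, 0.0), bout_stop + a))
    return windows
```

For calls at 1.0–1.5 s, 2.0–2.5 s, and 10.0–10.5 s with the defaults above, this yields two clip windows, (0.5, 3.0) and (9.5, 11.0).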

My hunch is that this would be worth doing so that (1) prep will be quick and take up less space, and (2) you don’t create a dataset with a huge imbalance between the number of frames labeled “silent/background” and all the other classes (e.g., the “call” class), which can cause the model to overfit and end up always predicting the background class. This is a failure mode that’s easy to miss, because the model will still look like it’s doing well – e.g., getting 99% frame-wise accuracy – but only because 99% of the frames are “silent/background”!

So you probably want to clip so that you still have a good representation of the call class or classes. If you can maintain the class balance, your model will do better even when you apply it to your continuous recordings. The balance doesn’t have to be perfect – we know it’s still usually true in these datasets that “background” frames outnumber the other classes – you just don’t want it to be extremely imbalanced (I can’t give you a good number for “extremely”, sorry – I think it depends on the data).
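One quick way to check how skewed a file is before deciding whether to clip it (a rough sketch; it treats everything outside annotated segments as background and ignores frame-bin edges):

```python
def background_fraction(total_dur, labeled_segments):
    """Fraction of a recording not covered by any labeled segment.

    total_dur : total duration in seconds.
    labeled_segments : list of (onset_s, offset_s), sorted, non-overlapping.
    """
    labeled = sum(offset - onset for onset, offset in labeled_segments)
    return (total_dur - labeled) / total_dur

# e.g. a 30-minute file with 60 one-second calls is ~97% background frames
frac = background_fraction(1800.0, [(i * 30.0, i * 30.0 + 1.0) for i in range(60)])
```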

Am I explaining that clearly?
Just trying to save you some pain down the road!

I’ll let you know how it goes with adjusting the cofig files! (hopefully tomorrow, now occupied with loads of grading :frowning: )

Totally understood, no rush!

Ok, so setting npartitions to 1 and shortening the files to about 5 min each (about 58 MB) worked: the wav.spect.npz files are now produced. Unfortunately I’m running into another error, but I’ll post about that in a thread with that error message (“Could not find subsets of sufficient duration in less than 5000 iterations”) if I can’t figure it out. Thanks again for the help!

Excellent, glad to hear that worked @Sita.

Curious, did you try other values for npartitions? Did you need to go down to 1 because it wouldn’t work for other values?

Could not find subsets of sufficient duration in less than 5000 iterations

This happens when vak.split can’t find a way to make a split (e.g., the training split) of the specified duration that has at least one occurrence of each label. It might be because there’s some rare label that only occurs 2-3 times, something like that? Please do make a new thread if you’re stuck on this, happy to help
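In case it clarifies that error: the search is roughly like the sketch below (a simplified stand-in, not vak’s actual implementation). It keeps drawing random subsets of files until one both reaches the target duration and contains every label, and gives up after a fixed number of iterations:

```python
import random

def find_split(file_durs, file_labelsets, target_dur, labelset, max_iter=5000):
    """Randomly draw files until a subset reaches `target_dur` *and* contains
    every label in `labelset`; give up after `max_iter` attempts."""
    inds = list(range(len(file_durs)))
    for _ in range(max_iter):
        random.shuffle(inds)
        subset, dur, seen = [], 0.0, set()
        for i in inds:
            subset.append(i)
            dur += file_durs[i]
            seen |= file_labelsets[i]
            if dur >= target_dur:
                break
        if dur >= target_dur and seen >= labelset:
            return subset
    raise ValueError(
        f"Could not find subsets of sufficient duration in less than {max_iter} iterations"
    )
```

So a label that appears in only one or two files makes valid subsets rare, which is when you hit the iteration limit.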

Thanks @nicholdav!
Yes, I tried different numbers of npartitions. Before shortening the recordings I think it helped a bit (as in: it processed one file but not all of them, as opposed to none of the files with higher npartitions), but I’m not 100% sure anymore. After shortening the recordings, npartitions doesn’t change anything (I tried values between 1 and 20). In sum, shorter audio files were the most important change.