Vak + TweetyNet with an Apple M1 Max?

Hello,

I’m trying to work through the introductory vak example to learn to classify notes in the songs of Java sparrows, and am running into trouble on a laptop with an Apple M1 Max chip. I’m able to run ‘vak prep …’, but when I go to train my neural net, I get

TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn’t support float64. Please use float32 instead.

Have other people had success with this hardware and, if so, what might I be doing wrong?

Thanks,
Mark

Hi @mrm and welcome to the forum. Sorry you’re running into this issue.

I’m trying to work through the introductory vak example to learn to classify notes in the songs of Java sparrows

Great, we’d love to get that working for you

and am running into trouble on a laptop with an Apple M1 Max chip

TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn’t support float64. Please use float32 instead.

I can think of a couple of people who use a Mac – e.g. @koparkanya – but I think they’re not using an Apple M1.

My initial googling suggests that what’s going on here is that somewhere we default to float64, but MPS doesn’t have that type (as the error message says). Like in this SO post: python - How to convert float64 to make it work in apple silicon? - Stack Overflow

Could you please paste in the entire traceback so I can see the line of code where it gets triggered? By that I mean the whole long error message leading up to that TypeError. See an example here: BUG: Running vak 1.0.0a1 with device set to CPU crashes · Issue #687 · vocalpy/vak · GitHub

You can use triple backticks so that the traceback reads as code:

# like this

Can you also please tell me what versions you’re using? I’ll guess that you installed with conda; if so, can you run conda list --explicit with the conda environment you created active, and paste that in a reply too? Or attach it as a file if you prefer.

Thank you! :pray: Would really like to get this working for you, but I’m not sure yet whether it will require some changes in our code. It looks like the library we use as a backend has some initial support for MPS: MPS training (basic) — PyTorch Lightning 2.1.2 documentation, so it might be as easy as just specifying 'mps' as the accelerator.
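
For what it’s worth, at the Lightning level that would look something like this (just a sketch of the idea, not our actual code – vak constructs its own Trainer internally):

import pytorch_lightning as pl

# minimal sketch: explicitly request the Apple-GPU ("mps") backend
trainer = pl.Trainer(accelerator="mps", devices=1)
# trainer.fit(model, ...) would then run on the M1 GPU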

I’m about to be out of town for a couple of days but will look at this when I get back.

I am not sure how helpful I can be because I don’t remember what I did, but I was able to run it on a Mac M1! I am happy to print out any versions, etc. if that would be useful!

Hi again and thanks for the quick reply.

I’m pretty sure that I tried setting device = “mps” in the [TRAIN] block, but just to be sure I tried again and still got the following:

(TweetyNet) mark@ultramarine:Practice > vak train gy6or6_train.toml
2023-09-15 13:11:50,211 - vak.cli.train - INFO - vak version: 1.0.0a1
2023-09-15 13:11:50,211 - vak.cli.train - INFO - Logging results to gy6or6/vak/train/results/results_230915_131150
2023-09-15 13:11:50,212 - vak.core.train - INFO - Loading dataset from .csv path: gy6or6/vak/prep/train/032212_prep_230915_131124.csv
2023-09-15 13:11:50,214 - vak.core.train - INFO - Size of timebin in spectrograms from dataset, in seconds: 0.002
2023-09-15 13:11:50,214 - vak.core.train - INFO - using training dataset from gy6or6/vak/prep/train/032212_prep_230915_131124.csv
2023-09-15 13:11:50,214 - vak.core.train - INFO - Total duration of training split from dataset (in s): 57.17199999999999
2023-09-15 13:11:50,362 - vak.core.train - INFO - number of classes in labelmap: 12
2023-09-15 13:11:50,362 - vak.core.train - INFO - no spect_scaler_path provided, not loading
2023-09-15 13:11:50,362 - vak.core.train - INFO - will normalize spectrograms
2023-09-15 13:11:50,405 - vak.core.train - INFO - Duration of WindowDataset used for training, in seconds: 57.172000000000004
2023-09-15 13:11:50,419 - vak.core.train - INFO - Total duration of validation split from dataset (in s): 21.266
2023-09-15 13:11:50,419 - vak.core.train - INFO - will measure error on validation set every 400 steps of training
2023-09-15 13:11:50,426 - vak.core.train - INFO - training TweetyNet
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-15 13:11:50,473 - vak.core.train - INFO - Training start time: 2023-09-15T13:11:50.473669
Missing logger folder: /Users/mark/Current_projects/Anthony Kwong/TweetyNet/Practice/gy6or6/vak/train/results/results_230915_131150/TweetyNet/lightning_logs

  | Name    | Type             | Params
---------------------------------------------
0 | network | TweetyNet        | 1.1 M
1 | loss    | CrossEntropyLoss | 0
---------------------------------------------
1.1 M     Trainable params
0         Non-trainable params
1.1 M     Total params
4.444     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:442: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 10 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0%|                                                                                                                  | 0/2 [00:00<?, ?it/s]/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:212: UserWarning: You called `self.log('val_levenshtein', ...)` in your `validation_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Traceback (most recent call last):
  File "/opt/anaconda3/envs/TweetyNet/bin/vak", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/vak/__main__.py", line 48, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/vak/cli/cli.py", line 49, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/vak/cli/cli.py", line 8, in train
    train(toml_path=toml_path)
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/vak/cli/train.py", line 67, in train
    core.train(
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/vak/core/train.py", line 369, in train
    trainer.fit(
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 376, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 393, in validation_step
    return self.model.validation_step(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/vak/models/windowed_frame_classification_model.py", line 208, in validation_step
    self.log(f'val_{metric_name}', metric_callable(y_pred_labels, y_labels), batch_size=1)
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 447, in log
    value = apply_to_collection(value, (Tensor, numbers.Number), self.__to_tensor, name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 51, in apply_to_collection
    return function(data, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 619, in __to_tensor
    else torch.tensor(value, device=self.device)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

I’ve chased back through all the functions mentioned in this trace, but couldn’t find anywhere that explicitly mentioned float64. I’ve attached the results of conda list --explicit.

Thanks again,
Mark
condaList.txt (17.2 KB)

Thank you, that’s helpful.

Looks like you’re running vak==1.0.0a1 with both pytorch and lightning > 2.0; I wanted to be sure of that.

And from the traceback, it looks like the crash happens when lightning does a “dry run” of the validation step and tries to compute some metric.
So that narrows down the list of likely suspects.
I wonder if it’s because I’m saving some numpy vector as the default float64 and not transforming it when I load it, e.g. the pre-computed vectors of frame labels.
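
If it helps to check, a quick sketch like this would print the dtype of every array in the prepared dataset (the path is just a placeholder for the dataset_path that prep added to your config):

import pathlib

import numpy as np

dataset_path = pathlib.Path('path/prep/added/to/config')  # placeholder path
for npy_path in sorted(dataset_path.rglob('*.npy')):
    # show which saved arrays are float64 vs float32
    print(npy_path, np.load(npy_path).dtype)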

Also, I see lightning says it used ‘mps’; just making sure I follow: did you set that as ‘device’ in the config file or did you change the vak internals manually?

@mrm replying a second time because the Discourse UI is weird on my phone.

Can you please try two things to troubleshoot?

  • convert all the npy files in the prepared dataset to float32 and see if that magically fixes the error
  • update to vak==1.0.0a2 in case that also magically fixes things

Let me know if it’s not clear what I mean by converting; I can write a little snippet to show you how.

Thank you @denajane13, you are awesome!

Yes, just in case it helps, could you attach the output from

conda list --explicit > spec-file.txt

and

conda env export > environment.yml

Appreciate it

Converting would be something like

import pathlib

import numpy as np

dataset_path = pathlib.Path('path/prep/added/to/config')

for split in ('train', 'val'):
    npy_paths = sorted((dataset_path / split).glob('*.npy'))
    for npy_path in npy_paths:
        # load each array, cast it to float32, then save it back in place
        np.save(npy_path, np.load(npy_path).astype(np.float32))

I set device = "mps" in the [TRAIN] section of the config file. I’ll try converting the dataset and let you know how I get on.

Thanks,
Mark

Great, thank you @mrm, definitely just let me know what else I can do to help if you get stuck

Good to know that setting device = "mps" works too

Hi again Mark,

I’m afraid I might have led you on a bit of a wild goose chase. The good news is I think I might have a fix, see below.

I think the problem isn’t the way we save the arrays for the dataset. I confirmed this shouldn’t be an issue but I’ll spare you the details (basically, we transform inputs to float32 when we load them).

My best guess is that the error happens when lightning tries to take a computed metric value that is returned and put it in a tensor, in order to log that value. Because the returned value is float64, we get this error.
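
You can reproduce the underlying behavior in a couple of lines (a sketch, assuming you run it on the M1 where the ‘mps’ device is available):

import numpy as np
import torch

metric_value = np.float64(0.5)  # stand-in for the value a metric returns
# torch.tensor keeps the float64 dtype, which the MPS backend can't represent
torch.tensor(metric_value, device=torch.device('mps'))
# TypeError: Cannot convert a MPS Tensor to float64 dtype ...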

From the traceback you provided, we see where we’re calling self.log:

 File "/opt/anaconda3/envs/TweetyNet/lib/python3.11/site-packages/vak/models/windowed_frame_classification_model.py", line 208, in validation_step
    self.log(f'val_{metric_name}', metric_callable(y_pred_labels, y_labels), batch_size=1)

Here’s line 208 in version 1.0.0a1 that you’re using:

You can see that we’re computing an edit distance metric using string labels. So it can’t be the tensor inputs to the model, and it has to be the returned value. I put in a breakpoint() before that line and ran an eval file to confirm that, yes, the segment_edit_distance returns a numpy float with dtype float64. I’m guessing that’s what causes the crash.
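
Roughly, the fix needs to hand self.log a float32 tensor instead of a numpy float64; something along these lines (a sketch of the idea only, not the actual code in the branch):

import numpy as np
import torch

def to_float32_tensor(value):
    # cast a metric value (e.g. a numpy float64 scalar) to float32 before logging
    return torch.as_tensor(value, dtype=torch.float32)

print(to_float32_tensor(np.float64(0.5)))  # tensor(0.5000)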

I think I have a fix here:

Would you be able to install a development version, check out that branch, and test whether it fixes the problem for you?
Instructions to set up a development environment are here: Contributors Guide - vak documentation
If you run nox -s dev as described there, it will create a virtual environment with the code installed so you can test it and make edits if necessary (e.g. to put in a breakpoint() when troubleshooting or debugging).
After you’ve done that, you’d want to run git checkout -t origin/make-distance-metrics-return-tensors so that you’re working with that branch.

If you’re open to it, I’d be happy to hop on a Zoom call so we can get to the bottom of this more easily. Sometimes lightning makes it a bit tricky to debug, but a Zoom call where I can either ask you to put in breakpoints or do it myself through a shared screen might still be faster than me asking you to try things via posts here and GitHub issues.

If you are able to confirm this is the source of the error, it would be great if you could raise a bug report on the vak repo too.
We’ll be happy to add you as a contributor for catching it.

Hi again,

Thanks for continuing to think about this. I liked your idea of recasting the training data as float32 tensors, and it wasn’t difficult (though I couldn’t find any *.npy files, just *.npz files whose contents I unpacked, recast and repacked), but it didn’t solve the problem. I then had some trouble installing 1.0.0a2 and, as teaching is about to start at my Uni, haven’t had time to sort it out. I’d be happy to try the things that you suggest here, though it may take me a little while.

Thanks,
Mark

P.S. I forgot to add: once I have things set up, if I find myself stuck I’d be happy to Zoom.

Hi again Mark,

Totally understood that you’re busy with other things.

I went ahead and filed a bug report; I think I managed to find your GitHub profile and tagged you to give you credit. If not, please let me know!

I also raised another issue with the fix, and went ahead and merged it in the process,

then released a new alpha version, 1.0.0a3.

You should be able to install it into a virtualenv with pip install vak==1.0.0a3 so you don’t need to bother with setting up a development environment again. (I think you know this already, just making sure though.)

Please test it out when you have a chance and let me know if that fixes things, much appreciated!
–David