Use of both Sequence and BBox in Annotation class

From what I understand, Annotation is the central class that other classes interface with and serves as a common ground for annotation information. The class stores annotations either as a list of BBox or as a Sequence (or list of Sequence). Is there a reason for having two (sort of three) separate ways to store the information rather than using one consistent approach? From an outsider perspective, it would seem the cleanest/easiest to interact with if Annotations always stored annotation information in the same way, even if there is sometimes missing data (eg, low_freq=None)


Hi @samlapp and welcome to the forum.

Is there a reason for having two (sort of three) separate ways to store the information rather than using one consistent approach?

Very good question.

From an outsider perspective, it would seem the cleanest/easiest to interact with if Annotations always stored annotation information in the same way, even if there is sometimes missing data (eg, low_freq=None)

A similar idea has occurred to me.

The thing is, there’s a lot of people who study sequences specifically. See for example this review (sorry if you know it already):

And accordingly there’s a bunch of existing formats that can all be mapped to a Sequence.
So this abstraction is sort of geared toward them, to represent and groups all the formats together, and advertise to people that study sequences “your favorite format probably lives in this module”. And that makes it easier for them to get what they need – e.g. if there was only a BBox class, they’d need to do something like seqs = [(bbox.onset_s, bbox.offset_s, bbox.label) for annot in annots for bbox in annot.bboxes] every time, or we’d end up adding a convenience method to do something like that.

If I were proposing a spec for audio annotation formats with no restrictions from previous work then I would probably do something more like you say. The JAMS spec for music information retrieval achieves something similar, see GitHub - marl/jams: A JSON Annotated Music Specification for Reproducible MIR Research.

I am definitely open to ideas on how to implement something like what you describe. Happy to hear from you or anyone else that’s interested.

Hi @nicholdav and @samlapp,
It is interesting (to me) to also think ahead about what kinds of annotations and flexibility will be good to have in order to expand the kinds of analyses we do.
Specifically, several applications need to annotate overlapping events (e.g., syllables of two or more birds singing at the same time).
One can also consider annotation labels that have more than one dimension, where one of the dimensions is continuous (e.g. labeling vocalizations as ‘happy’ or ‘angry’ but also adding the degree of happiness or anger).

When discussing how to represent annotations, it would be good to (also) consider data type flexibility.

@nicholdav interesting, that makes sense.

I can see the value of being able to access and create sets of annotations in a sequence-like format. I think one way to achieve this, while keeping data storage in the base annotation class consistent, would be with @property (or .as_…()/ .to_…() methods) for access, and .from_sequences() methods for creation.

Concretely, it could look like this (if the Annotation class stores information as a list of BBox):

class Annotation:
    def __init__(self, bboxes, annotation_file, notated_file):
        ...

    @classmethod
    def from_sequences(cls, sequences, annotation_file, notated_file):
        ...

    @property
    def sequences(self):
        return _bboxes_to_sequences(self.bboxes)

def _bboxes_to_sequences(bboxes):
    ...
    return list_of_sequences

anns = Annotation.from_sequences(sequences, annotation_file, notated_file)
anns.sequences  # gives a list of Sequences

alternatively, Annotations could store the data as a dataframe. In that case, it could have methods/properties to return either BBox or Series
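To sketch what that DataFrame-backed alternative might look like (all names here are hypothetical, not crowsetta's actual API), the class could hold a single DataFrame and expose converters for either view:

```python
import pandas as pd

# Hypothetical DataFrame-backed Annotation; column names are illustrative
class Annotation:
    def __init__(self, df: pd.DataFrame):
        self.df = df  # one row per annotated event

    def to_bboxes(self):
        """Return one dict per row; a real implementation
        could return BBox objects instead."""
        return self.df.to_dict(orient="records")

    def to_series(self):
        """Return each column as a pandas.Series, keyed by column name."""
        return {col: self.df[col] for col in self.df.columns}

df = pd.DataFrame({
    "onset_s": [0.1, 0.6], "offset_s": [0.5, 0.9],
    "low_freq": [2000.0, None], "high_freq": [8000.0, None],
    "label": ["a", "b"],
})
ann = Annotation(df)
```

Missing dimensions (like the low_freq=None case mentioned in the first post) just become NaN in their column, so segment-only and bounding-box annotations live in the same table.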

Either way, I think the API would be more intuitive if the information is stored in a consistent way.


For sure! The BBox approach (and maybe the Sequence approach also?) is robust to overlapping events. A nice aspect of the BBox approach is that each single annotation is stored as an object (BBox), so adding an attribute such as my_bbox.happiness_index = 0.3 is always possible from the user’s side, without needing to change the underlying code base.


Thank you, I definitely agree that just always using a DataFrame as the underlying data structure would allow for things to be more consistent. A column would get added for each additional dimension that an annotation type has. E.g., single time points would have a single column. Segments that have two dimensions would have another column. Bounding boxes would add two more columns.

And then methods would allow for getting annotations in a certain format, as you suggest.


I can’t come up with names for the columns that would work across annotation types though. Like, for segments you could have “start” and “stop” but for a single time point it doesn’t make sense to have a “start” without a “stop”. You could do something generic like (“dim1”, “dim2”) and then have the accessor methods know which dims to access.

Definitely food for thought.
I’m not 100% sure it’s needed right at this exact moment but the approach appeals to me.

If you feel like raising an issue about this on Crowsetta I’d be happy to test this at some point, get feedback from you, and give you contribution credit:

Yeah I think this makes sense. Maybe the column for a single “time” shouldn’t even have the same name as the start_time column of data with start time and end times. But if it should, maybe they could be called time0 and time1 or timeA and timeB? And also freq0 and freq1, or freqA and freqB. I can open an issue, and might even be able to implement it if you’d like me to try.


Hi @samlapp, let me say thanks so much for finding time to come back to this, and thanks for raising the detailed issue on the issue tracker for crowsetta. I’d been thinking about this and meaning to follow up with you.

I’m posting the link to the issue, just to make it easier to get there from here: ENH: Store Annotation data in consistent format · Issue #259 · vocalpy/crowsetta · GitHub

Let me also say I’m sorry for not doing a better job of conveying my thinking – reading back over what I wrote above, I think I didn’t make some things clear enough.

You are really helping us by asking these questions. We want crowsetta in particular to be useful for bioacoustics generally, not just people studying acoustic communication with bioacoustics methods. So it helps to know what’s not clear for someone that maybe has more of a bioacoustics background. (I think that’s part of what’s happening here, feel free to tell me if I’m wrong.)

I also think I wasn’t clear enough since @yardenc chimed in; if my collaborator who’s working on the same issues doesn’t understand what I’m talking about then I really haven’t explained myself well!

So because of that I’m going to reply here. Also so anyone else interested can read.

Also I’ll say here at the top that I would love to have you contribute. Based on your question and other things I’ve been thinking about, I can see where we need to rewrite some of the classes and improve the docs. In particular, I think we should add a Boxes class that is implemented basically as you suggest, with a pandas.DataFrame as the underlying data structure. If you want to help with that, or even just review those changes so we can get a fresh set of eyes on them, it would be really helpful.

Ok, with all that out of the way (apologies in advance for the wall of text – my goal is to be very clear this time):

What I have failed to make clear is that, by definition, a sequence is “a series of adjacent, non-overlapping line segments”. Whereas, as you well know, bounding boxes like those that are often used to annotate soundscapes definitely can and do overlap. This is why we need an Annotation data type that represents either sequences or bounding boxes, and why we can’t (easily) convert a set of bounding boxes to sequences just by throwing away the frequency ranges.

There’s no reference I can point you to that says this is what the “official” definition of a sequence is. This is one of the reasons why there’s a real need for a standard for audio annotations across domains, like the one we started to draft at the first AudioXd.

Since we don’t (for now) have a standard, all I can do is give you a couple of reasons why my understanding of sequence is “a series of directly adjacent but non-overlapping line segments”. I’m happy to hear from you, @yardenc, or anyone else that has a different definition, I’m just laying out my thinking here.

What is a sequence?

Basically, when I say sequence in the context of annotations, I am thinking of the kinds of annotations that become the data for analyses like those described in this paper that I linked to above:
You will notice that they also never state explicitly in that paper that a “sequence of units” (their terminology) should be strictly non-overlapping. However, almost all of the behavioral models they describe assume such a sequence, because the models are meant to be fit to sounds produced by a single animal. By definition, an animal cannot produce multiple “units” at the same time (again, their terminology)–it’s the atomic unit of production. Of course we can think of exceptions (e.g., some songbirds vocalize from both sides of the syrinx simultaneously) but in general the goal of these analyses is to fit models like a context-free grammar or a Markov model to a series of discrete utterances, such as a string of letters. So the units in a sequence can’t overlap.

Of course, there are many cases where multiple animals vocalize simultaneously, and then you would have overlapping sequences, as you know from field recordings, as @yardenc pointed out above, and as that paper discusses. But when we fit the kinds of models discussed in that paper to those annotations where multiple sequences overlap, we still want to treat each individual animal’s sounds as a separate sequence of non-overlapping segments. I’ll come back to this below.

How do existing formats for annotating sequences handle overlap?

The clearest example I can think of is the Praat Textgrid format. This example is good because it’s clearly defined for the kinds of analyses I mean–I’m pretty sure most people can only produce one phoneme at a time–and the developers of Praat actually give us multiple indications of how they want to handle sequences.

First of all, notice that they don’t allow segments to overlap–see the paragraph I’ve put in bold below. From the “spec” for the TextGrid format:
(TextGrid file formats)

  1. Restrictions in a TextGrid object

TextGrid objects maintain several invariants, some stronger and some weaker.

The two strongest invariants within an interval tier are positive duration and adjacency. That is, the end time of each interval has to be greater than the starting time of that same interval, and the starting time of each interval (except the first) has to be equal to the end time of the previous interval. As a result, the union of the time domains of the set of intervals in a tier is a contiguous stretch of time, and no intervals overlap. If a TextGrid file violates these invariants, Praat may refuse to read the file and give an error message instead (or Praat may try to repair the TextGrid, but that is not guaranteed).

A weaker invariant is that the starting time of the first interval on a tier equals the starting time of that tier, and the end time of the last interval equals the end time of the tier. When you create a TextGrid with Praat, this invariant is automatically maintained, and most types of modifications also maintain it, except sometimes the commands that combine multiple TextGrid objects with different durations into a new TextGrid. Praat will happily read TextGrid files that do not honour this weak invariant, and will also display such a TextGrid reasonably well in a TextGrid window. It is nevertheless advisable for other programs that create TextGrids to honour this weak invariant.
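Those two strong invariants (positive duration and adjacency) are simple enough to state in a few lines of code. This is just a sketch of a checker for them, not Praat's actual implementation:

```python
def check_interval_tier(intervals):
    """Check Praat's two strong invariants for an interval tier.

    `intervals` is a list of (start, end) tuples, assumed sorted by start time.
    Raises ValueError if an invariant is violated.
    """
    for start, end in intervals:
        if not end > start:  # positive duration
            raise ValueError(f"non-positive duration: ({start}, {end})")
    for (_, prev_end), (start, _) in zip(intervals, intervals[1:]):
        # adjacency: each interval must start exactly where the previous ended,
        # so the tier is a contiguous stretch of time with no overlaps or gaps
        if start != prev_end:
            raise ValueError(f"gap or overlap at t={start}")

check_interval_tier([(0.0, 0.5), (0.5, 1.0)])  # passes: contiguous, positive durations
```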

Second of all, notice that Praat TextGrids have tiers.

When you create a TextGrid, you specify the names of the tiers. For instance, if you want to segment the sound into words and into phonemes, you may want to create two tiers and call them “words” and “phonemes” (you can easily add, remove, and rename tiers later). Since both of these tiers are interval tiers (you label the intervals between the word and phoneme boundaries, not the boundaries themselves), you specify “phonemes words” for Tier names, and you leave the Point tiers empty.

This is how they handle multiple sequences in an annotation for the same file.

Notice that Audacity similarly has a notion of “label tracks” (makes sense – audio people think in terms of tracks) but they don’t currently let you export multiple tracks in annotations: Importing and Exporting Labels - Audacity Manual

How does / should crowsetta handle overlap in sequences and multiple sequences per annotated file?

So I’ll start by saying that the Sequence class does not currently enforce the definition I’ve given above. It doesn’t check for overlap and raise an error if there is any. This is partly my mistake as a programmer, in part because I hadn’t thought about it more deeply until now, and in part because I avoided adding this validation, since I wanted people to be able to load annotations where there are overlapping segments. I’m pretty sure that was my thinking, although I don’t think I documented it.

I now think we actually should enforce this, and make the definition of sequence more explicit in the documentation, for all the reasons I’ve given here.
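A minimal sketch of what that validation could look like (hypothetical, not the current crowsetta implementation; here a segment is just an (onset_s, offset_s, label) tuple):

```python
class Sequence:
    """Sketch of a Sequence that enforces non-overlapping segments.

    Unlike Praat's strong adjacency invariant, gaps between segments
    are allowed here; only overlap is rejected.
    """
    def __init__(self, segments):
        segments = sorted(segments, key=lambda seg: seg[0])  # sort by onset
        for (_, prev_off, _), (onset, _, _) in zip(segments, segments[1:]):
            if onset < prev_off:
                raise ValueError(
                    f"segments overlap: onset {onset} < previous offset {prev_off}"
                )
        self.segments = segments

Sequence([(0.0, 0.5, "a"), (0.6, 0.9, "b")])  # ok: non-overlapping, gap allowed
```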
You can see me reasoning through this on this VocalPy issue, where we also need such classes:

I’m adding this because I think we should basically rewrite both Sequence and BBox in crowsetta to be Sequence and Boxes as described in that issue. In this case we will repeat ourselves because people should be able to use crowsetta without needing vocalpy and because the classes have different uses within each package. You can see I’m basically thinking along the same lines as you, that we should represent a set of bounding boxes with a pandas.DataFrame since we rarely need to work with a single box on its own (and when we do, we can just use a dataframe with a single row).
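Roughly, a Boxes class along those lines might look like the sketch below. This assumes a pandas.DataFrame as the backing store, as discussed; the column names and methods are illustrative, not a settled API:

```python
import pandas as pd

class Boxes:
    """Sketch of a set of bounding boxes backed by a pandas.DataFrame."""
    COLUMNS = ("onset_s", "offset_s", "low_freq", "high_freq", "label")

    def __init__(self, df: pd.DataFrame):
        missing = set(self.COLUMNS) - set(df.columns)
        if missing:
            raise ValueError(f"missing columns: {missing}")
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # a single box is just a Boxes with a one-row dataframe,
        # so we rarely need a separate single-box class
        return Boxes(self.df.iloc[[idx]])

boxes = Boxes(pd.DataFrame({
    "onset_s": [0.1, 0.6], "offset_s": [0.5, 0.9],
    "low_freq": [2000.0, 1500.0], "high_freq": [8000.0, 7000.0],
    "label": ["a", "b"],
}))
```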

That said, we do have a way to handle multiple sequences per file, like those @yardenc mentioned above. I added this explicitly to be able to convert a TextGrid with tiers to an Annotation; we iterate over the tiers converting them to a list of Sequences. To me, this is the most natural, dirt simple way to do it. This also lets us handle the case where an annotator wants to annotate multiple sequences that do not overlap in a file. An example of this is the birdsong-recognition dataset: BirdsongRecognition.
You can see I first thought about dealing with intervals here:
allow for user-defined `tiers` for a Segment, like Praat? · Issue #14 · vocalpy/crowsetta · GitHub
And then added the ability to have Sequences here:
Modify Annotation to allow for multiple seq · Issue #42 · vocalpy/crowsetta · GitHub
Finish pyOpenSci review by NickleDave · Pull Request #243 · vocalpy/crowsetta · GitHub

Long story short, I will raise issues on Crowsetta to rewrite both the Sequence and BBox classes as a Sequence class that enforces non-overlap and a Boxes class, linking to the discussion on VocalPy, and to better document the definitions of these. We’d be very happy to have you contribute however you’d like to that.