From what I understand, Annotation is the central class that other classes interface with and serves as a common ground for annotation information. The class stores annotations either as a list of BBox or as a Sequence (or list of Sequence). Is there a reason for having two (sort of three) separate ways to store the information rather than using one consistent approach? From an outsider's perspective, it would seem the cleanest/easiest to interact with if Annotations always stored annotation information in the same way, even if there is sometimes missing data (e.g., low_freq=None)
Hi @samlapp and welcome to the forum.
Is there a reason for having two (sort of three) separate ways to store the information rather than using one consistent approach?
Very good question.
From an outsider perspective, it would seem the cleanest/easiest to interact with if Annotations always stored annotation information in the same way, even if there is sometimes missing data (eg, low_freq=None)
A similar idea has occurred to me.
The thing is, there's a lot of people who study sequences specifically. See for example this review (sorry if you know it already): https://onlinelibrary.wiley.com/doi/abs/10.1111/brv.12160
And accordingly there's a bunch of existing formats that can all be mapped to a Sequence.
So this abstraction is sort of geared toward them, to represent and group all the formats together, and advertise to people that study sequences: "your favorite format probably lives in this module". And that makes it easier for them to get what they need. E.g., if there were only a BBox class, they'd need to do something like `seqs = [(bbox.onset_s, bbox.offset_s, bbox.label) for annot in annots for bbox in annot.bboxes]` every time, or we'd end up adding a convenience method to do something like that.
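To make that concrete, here is a hypothetical sketch of the conversion users would have to write by hand if only a bounding-box class existed. The `BBox` and `Annot` classes here are simplified stand-ins, not the actual crowsetta API:

```python
from dataclasses import dataclass

@dataclass
class BBox:
    """Simplified stand-in for a bounding-box annotation."""
    onset_s: float
    offset_s: float
    low_freq: float
    high_freq: float
    label: str

@dataclass
class Annot:
    """Simplified stand-in for a per-file annotation holding bounding boxes."""
    bboxes: list

def to_seq_tuples(annots):
    """Flatten per-file bounding boxes into (onset, offset, label) tuples,
    throwing away the frequency ranges."""
    return [
        (bbox.onset_s, bbox.offset_s, bbox.label)
        for annot in annots
        for bbox in annot.bboxes
    ]

annots = [Annot(bboxes=[BBox(0.1, 0.5, 500.0, 8000.0, "a"),
                        BBox(0.6, 0.9, 500.0, 8000.0, "b")])]
seqs = to_seq_tuples(annots)  # [(0.1, 0.5, 'a'), (0.6, 0.9, 'b')]
```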
If I were proposing a spec for audio annotation formats with no restrictions from previous work then I would probably do something more like you say. The JAMS spec for music information retrieval achieves something similar, see GitHub - marl/jams: A JSON Annotated Music Specification for Reproducible MIR Research.
I am definitely open to ideas on how to implement something like what you describe. Happy to hear from you or anyone else that's interested.
Hi @nicholdav and @samlapp,
It is interesting (to me) to also think ahead about what kinds of annotations and flexibility will be good to have in order to expand the kinds of analyses we do.
Specifically, several applications need to annotate overlapping events (e.g. syllables of two or more birds singing at the same time).
One can also consider annotation labels that have more than one dimension, where one of the dimensions is continuous (e.g. labeling vocalizations as "happy" or "angry" but also adding the degree of happiness or anger).
When discussing how to represent annotations, it would be good to also consider data type flexibility.
@nicholdav interesting, that makes sense.
I can see the value of being able to access and create sets of annotations in a sequence-like format. I think one way to achieve this while keeping data storage consistent in the base annotation class would be with @property (or .as_…() / .to_…() methods) for access, and .from_sequences() methods for creation.
Concretely, it could look like this (if the Annotation class stores information as a list of BBox):
```python
class Annotation:
    def __init__(self, bboxes, annotation_file, notated_file):
        ...

    @classmethod
    def from_sequences(cls, sequences, annotation_file, notated_file):
        ...

    @property
    def sequences(self):
        return _bboxes_to_sequences(self.bboxes)


def _bboxes_to_sequences(bboxes):
    ...
    return list_of_sequences


anns = Annotation.from_sequences(sequences)
anns.sequences  # gives list of sequences
```
Alternatively, Annotations could store the data as a dataframe. In that case, it could have methods/properties to return either BBox or Sequence.
Either way, I think the API would be more intuitive if the information is stored in a consistent way.
For sure! The BBox approach (and maybe the Sequence approach also?) is robust to overlapping events. A nice aspect of the BBox approach is that each single annotation is stored as an object (BBox), so adding an attribute such as `my_bbox.happiness_index = 0.3` is always possible from the user's side, without needing to change the underlying code base.
Thank you, I definitely agree that just always using a DataFrame as the underlying data structure would allow for things to be more consistent. A column would get added for each additional dimension that an annotation type has. E.g., single time points would have a single column. Segments that have two dimensions would have another column. Bounding boxes would add two more columns.
And then methods would allow for getting annotations in a certain format, as you suggest.
I can't come up with names for the columns that would work across annotation types though. Like, for segments you could have "start" and "stop", but for a single time point it doesn't make sense to have a "start" without a "stop". You could do something generic like ("dim1", "dim2") and then have the accessor methods know which dims to access.
Definitely food for thought.
I'm not 100% sure it's needed right at this exact moment but the approach appeals to me.
If you feel like raising an issue about this on Crowsetta I'd be happy to test this at some point, get feedback from you, and give you contribution credit:
Yeah I think this makes sense. Maybe the column for a single "time" shouldn't even have the same name as the start_time column of data with start times and end times. But if it should, maybe they could be called time0 and time1, or timeA and timeB? And also freq0 and freq1, or freqA and freqB. I can open an issue, and might even be able to implement it if you'd like me to try.
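In rough sketch form, the generic column scheme could look something like this. Everything here (the column names time0/time1/freq0/freq1, the helper functions) is illustrative, not an actual crowsetta implementation:

```python
# Illustrative sketch: every annotation type shares one set of columns;
# types that lack a dimension leave it as None.
COLUMNS = ("label", "time0", "time1", "freq0", "freq1")

def point_row(time, label):
    # single time point: only time0 is filled
    return {"label": label, "time0": time, "time1": None,
            "freq0": None, "freq1": None}

def segment_row(start, stop, label):
    # segment: start/stop become time0/time1
    return {"label": label, "time0": start, "time1": stop,
            "freq0": None, "freq1": None}

def bbox_row(onset, offset, low_freq, high_freq, label):
    # bounding box: adds the two frequency columns
    return {"label": label, "time0": onset, "time1": offset,
            "freq0": low_freq, "freq1": high_freq}

def is_bbox(row):
    # an accessor method can decide which dims to read back
    # by checking which columns are filled
    return row["freq0"] is not None and row["freq1"] is not None

rows = [point_row(0.2, "call"),
        bbox_row(0.6, 0.9, 500.0, 8000.0, "b")]
```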
Hi @samlapp, let me say thanks so much for finding time to come back to this, and thanks for raising the detailed issue on the issue tracker for crowsetta. I'd been thinking about this and meaning to follow up with you.
I'm posting the link to the issue, just to make it easier to get there from here: ENH: Store Annotation data in consistent format · Issue #259 · vocalpy/crowsetta · GitHub
Let me also say I'm sorry for not doing a better job of conveying my thinking. Reading back over what I wrote above, I think I didn't make some things clear enough.
You are really helping us by asking these questions. We want crowsetta in particular to be useful for bioacoustics generally, not just people studying acoustic communication with bioacoustics methods. So it helps to know what's not clear for someone that maybe has more of a bioacoustics background. (I think that's part of what's happening here; feel free to tell me if I'm wrong.)
I also think I wasn't clear enough, since @yardenc chimed in; if my collaborator who's working on the same issues doesn't understand what I'm talking about, then I really haven't explained myself well!
So because of that I'm going to reply here. Also so anyone else interested can read.
Also I'll say here at the top that I would love to have you contribute. Based on your question and other things I've been thinking about, I can see where we need to rewrite some of the classes and improve the docs. In particular, I think we should add a Boxes class that is implemented basically as you suggest, with a pandas.DataFrame as the underlying data structure. If you want to help with that, or even just review those changes so we can get a fresh set of eyes on them, it would be really helpful.
Ok, with all that out of the way (apologies in advance for the wall of text; my goal is to be very clear this time):
What I have failed to make clear is that, by definition, a sequence is "a series of adjacent, non-overlapping line segments". Whereas, as you well know, bounding boxes like those that are often used to annotate soundscapes definitely can and do overlap. This is why we need an Annotation data type that can represent either sequences or bounding boxes, and why we can't (easily) convert a set of bounding boxes to sequences by throwing away the frequency ranges.
There's no reference I can point you to that says this is what the "official" definition of a sequence is. This is one of the reasons why there's a real need for a standard for audio annotations across domains, like the one we started to draft at the first AudioXd.
Since we don't (for now) have a standard, all I can do is give you a couple of reasons why my understanding of sequence is "a series of directly adjacent but non-overlapping line segments". I'm happy to hear from you, @yardenc, or anyone else that has a different definition; I'm just laying out my thinking here.
What is a sequence?
Basically, when I say sequence in the context of annotations, I am thinking of the kinds of annotations that become the data for analyses like those described in this paper that I linked to above: https://onlinelibrary.wiley.com/doi/abs/10.1111/brv.12160.
You will notice that they also never state explicitly in that paper that a "sequence of units" (their terminology) should be strictly non-overlapping. However, almost all of the behavioral models they describe assume such a sequence, because the models are meant to be fit to sounds produced by a single animal. By definition, an animal cannot produce multiple "units" at the same time (again, their terminology); it's the atomic unit of production. Of course we can think of exceptions (e.g., some songbirds vocalize from both sides of the syrinx simultaneously), but in general the goal of these analyses is to fit models like a context-free grammar or a Markov model to a series of discrete utterances, such as a string of letters. So the units in a sequence can't overlap.
Of course, there are many cases where multiple animals vocalize simultaneously, and then you would have overlapping sequences, as you know from field recordings, as @yardenc pointed out above, and as that paper discusses. But when we fit the kinds of models discussed in that paper to those annotations where multiple sequences overlap, we still want to treat each individual animal's sounds as a separate sequence of non-overlapping segments. I'll come back to this below.
How do existing formats for annotating sequences handle overlap?
The clearest example I can think of is the Praat TextGrid format. This example is good because it's clearly defined for the kinds of analyses I mean (I'm pretty sure most people can only produce one phoneme at a time), and the developers of Praat actually give us multiple indications of how they want to handle sequences.
First of all, notice that they don't allow segments to overlap; see the paragraph I've put in bold below. From the "spec" for the TextGrid format:
(TextGrid file formats)
- Restrictions in a TextGrid object
TextGrid objects maintain several invariants, some stronger and some weaker.
The two strongest invariants within an interval tier are positive duration and adjacency. That is, the end time of each interval has to be greater than the starting time of that same interval, and the starting time of each interval (except the first) has to be equal to the end time of the previous interval. As a result, the union of the time domains of the set of intervals in a tier is a contiguous stretch of time, and no intervals overlap. If a TextGrid file violates these invariants, Praat may refuse to read the file and give an error message instead (or Praat may try to repair the TextGrid, but that is not guaranteed).
A weaker invariant is that the starting time of the first interval on a tier equals the starting time of that tier, and the end time of the last interval equals the end time of the tier. When you create a TextGrid with Praat, this invariant is automatically maintained, and most types of modifications also maintain it, except sometimes the commands that combine multiple TextGrid objects with different durations into a new TextGrid. Praat will happily read TextGrid files that do not honour this weak invariant, and will also display such a TextGrid reasonably well in a TextGrid window. It is nevertheless advisable for other programs that create TextGrids to honour this weak invariant.
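The two "strongest invariants" quoted above can be sketched as a simple check. This is just an illustration, not Praat's actual code; I'm assuming intervals are represented as (start, end, label) tuples:

```python
def check_interval_tier(intervals):
    """Check the two strong TextGrid invariants for an interval tier:
    positive duration, and adjacency (each interval starts where the
    previous one ends, so intervals cannot overlap)."""
    for i, (start, end, _label) in enumerate(intervals):
        if end <= start:
            raise ValueError(f"interval {i} has non-positive duration")
        if i > 0 and start != intervals[i - 1][1]:
            raise ValueError(
                f"interval {i} does not start where interval {i - 1} ends"
            )

# A valid tier is a contiguous stretch of non-overlapping intervals:
check_interval_tier([(0.0, 0.5, "ph1"), (0.5, 1.2, "ph2")])
```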
Second of all, notice that Praat TextGrids have tiers.
https://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html
When you create a TextGrid, you specify the names of the tiers. For instance, if you want to segment the sound into words and into phonemes, you may want to create two tiers and call them āwordsā and āphonemesā (you can easily add, remove, and rename tiers later). Since both of these tiers are interval tiers (you label the intervals between the word and phoneme boundaries, not the boundaries themselves), you specify āphonemes wordsā for Tier names, and you leave the Point tiers empty.
This is how they handle multiple sequences in an annotation for the same file.
Notice that Audacity similarly has a notion of "label tracks" (makes sense, audio people think in terms of tracks) but they don't currently let you export multiple tracks in annotations: Importing and Exporting Labels - Audacity Manual
How does / should crowsetta handle overlap in sequences and multiple sequences per annotated file?
So I'll start by saying that the Sequence class does not currently enforce the definition I've given above. It doesn't check for overlap and raise an error if there is any. This is partly my mistake as a programmer, in part because I hadn't thought about it more deeply until now, and in part because I avoided adding this validation, since I wanted people to be able to load annotations where there are overlapping segments. I'm pretty sure that was my thinking, although I don't think I documented it.
I now think we actually should enforce this, and make the definition of sequence more explicit in the documentation, for all the reasons I've given here.
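The validation could look roughly like this. A sketch only, assuming segments can be reduced to (onset_s, offset_s) pairs; this is not the current Sequence implementation:

```python
def validate_no_overlap(segments):
    """Raise if any two (onset_s, offset_s) segments overlap.
    Directly adjacent segments (offset of one == onset of the next)
    are allowed, matching the working definition of a sequence."""
    ordered = sorted(segments)
    for (on1, off1), (on2, off2) in zip(ordered, ordered[1:]):
        if on2 < off1:
            raise ValueError(
                f"segments overlap: ({on1}, {off1}) and ({on2}, {off2})"
            )

validate_no_overlap([(0.1, 0.5), (0.5, 0.9)])  # adjacent is fine
```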
You can see me reasoning through this on this VocalPy issue, where we also need such classes:
I'm adding this because I think we should basically rewrite both Sequence and BBox in crowsetta to be Sequence and Boxes as described in that issue. In this case we will repeat ourselves, because people should be able to use crowsetta without needing vocalpy, and because the classes have different uses within each package. You can see I'm basically thinking along the same lines as you: we should represent a set of bounding boxes with a pandas.DataFrame, since we rarely need to work with a single box on its own (and when we do, we can just use a dataframe with a single row).
That said, we do have a way to handle multiple sequences per file, like those @yardenc mentioned above. I added this explicitly to be able to convert a TextGrid with tiers to an Annotation; we iterate over the tiers, converting them to a list of Sequences. To me, this is the most natural, dirt-simple way to do it. This also lets us handle the case where an annotator wants to annotate multiple sequences that do not overlap in a file. An example of this is the birdsong-recognition dataset: BirdsongRecognition.
You can see I first thought about dealing with intervals here:
allow for user-defined `tiers` for a Segment, like Praat? · Issue #14 · vocalpy/crowsetta · GitHub
And then added the ability to have Sequences here:
Modify Annotation to allow for multiple seq · Issue #42 · vocalpy/crowsetta · GitHub
Finish pyOpenSci review by NickleDave · Pull Request #243 · vocalpy/crowsetta · GitHub
Long story short, I will raise issues on Crowsetta to rewrite both the Sequence and BBox classes as a Sequence class that enforces non-overlap and a Boxes class, linking to the discussion on VocalPy, and to better document the definitions of these. We'd be very happy to have you contribute however you'd like to that.