Predicting Japanese pitch accent
I made a model that predicts Japanese pitch accent that you can try out here: https://huggingface.co/spaces/mizoru/Japanese_pitch
But what is pitch accent and how did I make this?
Let's look at two Japanese words: 飴 "candy" and 雨 "rain". They're both spelled ame, but they don't sound the same. Let's listen to how they're pronounced!
Can you tell the difference? The second word has a drop in pitch after the first vowel.
This is pitch accent. In pitch-accent languages, one syllable of a word can be marked with contrasting pitch, rather than with loudness or length as in languages with stress accent. Pitch-accent systems exist in many languages, including Swedish, Norwegian, Serbo-Croatian and Lithuanian. In fact, pitch accent is reconstructed for Proto-Indo-European, the ancestor of most of the languages of Europe.
In Japanese, pitch accent is characterized by a drop in pitch. The mora preceding the drop is said to carry the accent, so the pitch accent of a word can be described with a single number designating the accented mora. The word 雨 "rain" has the accent on the first mora, so its pitch accent is written as 「1」. But there's a different way to classify words by pitch accent.
When the first mora of a word carries the accent, the word's pitch accent pattern is 頭高 atamadaka, literally "head-high". If the pitch drops later in the word, the pattern is 中高 nakadaka, "middle-high". Many words in Japanese have no drop in pitch at all; these are called 平板 heiban, "flat".
Let's listen to some examples.
Atamadaka:
Nakadaka:
Heiban:
In some words, the drop in pitch is apparent only if there is a particle attached to the word. In these words the last mora is accented, so a particle attached after it is low. These words have the pitch accent pattern 尾高 odaka, "tail-high". In practice, the phonetic word sounds heiban when there's no particle attached and nakadaka when there is, so we're not going to use these for training.
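To make the classification concrete, here's a tiny sketch of how an accent position and a mora count map onto these pattern names. It's purely illustrative and not part of the training pipeline; the function name and interface are made up.
def accent_pattern(accent_pos, n_morae):
    # accent_pos follows dictionary notation: 0 means no drop in pitch
    if accent_pos == 0:
        return 'heiban'      # flat, no drop
    if accent_pos == 1:
        return 'atamadaka'   # drop right after the first mora
    if accent_pos == n_morae:
        return 'odaka'       # drop only audible on a following particle
    return 'nakadaka'        # drop somewhere in the middle

accent_pattern(1, 2), accent_pattern(0, 2)  # 雨 -> 'atamadaka', 飴 -> 'heiban'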
Let's import the libraries we're going to be using to make the model.
from fastai.vision.all import *
from fastaudio.core.all import *
from fastaudio.augment.all import *
Here are the fastai, fastaudio and torchaudio versions, respectively:
import fastai
import fastaudio
import torchaudio
fastai.__version__, fastaudio.__version__, torchaudio.__version__
Let's take a look at our data.
Path("dict1").ls()
Path("dict2").ls()
dict1 = pd.read_csv('dict1_labels.csv')
dict1
dict2 = pd.read_csv('dict2_labels.csv')
dict2
The first dictionary is what I started with. It has recordings by 3-4 male voices; which word is pronounced by which speaker is not marked anywhere. I had to parse a JSON file to get the labels for it.
I was afraid that my model wouldn't generalize well, so I needed the data from the second dictionary. The most important thing was having at least one female voice in the training data, and having a male voice not present in the training data for the validation set was nice as well. The pitch accent labels for these recordings were XML inside a JSON file.
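I won't reproduce the real schemas here, but the parsing boiled down to something like the following sketch, where the file name and the entries/audio/accent keys are invented placeholders rather than the actual structure:
import json
# Purely illustrative: the file name and keys below are hypothetical placeholders.
with open('dict1_raw.json', encoding='utf-8') as f:
    raw = json.load(f)
rows = [{'path': e['audio'], 'pattern': e['accent']} for e in raw['entries']]
pd.DataFrame(rows).to_csv('dict1_labels.csv', index=False)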
Let's convert our previous example audio into a tensor and then make a spectrogram out of it.
at = AudioTensor.create("あめー雨.mp3")
at.show()
The easiest way to make a spectrogram in fastaudio is to first create a config object with the parameters:
cfg = AudioConfig.Voice()
aud2spec = AudioToSpec.from_cfg(cfg)
show_image(aud2spec(at))
I don't know why it comes out flipped, but it's easy to fix:
show_image(aud2spec(at)).invert_yaxis()
Let's load a couple of examples to try this configuration on.
ex_paths = [Path('dict2').ls()[20006], Path('dict2').ls()[20007], Path('dict1').ls()[21345]]
ex_paths
Let's make a function that plots a spectrogram for a given path and config.
def show_spectro_cfg(cfg, path):
    at = AudioTensor.create(path)
    aud2spec = AudioToSpec.from_cfg(cfg)
    spec = aud2spec(at)
    show_image(spec, figsize=(12,8)).invert_yaxis()
    return spec.shape
for path in ex_paths:
    print(show_spectro_cfg(cfg, path))
These are good enough and I got to 98% accuracy with them, but let's see if we can do better.
To see what the parameters for AudioConfig mean, let's look at the torchaudio docs.
Edit: this looks like a much better resource.
??torchaudio.transforms.MelSpectrogram
I decided to decrease the upper bound on frequency, since we don't really need the information up there for pitch, and to increase the window size so as to get better low-frequency resolution at the cost of temporal resolution. The rest is heuristics.
pitch_cfg = AudioConfig.Voice(f_min=0, f_max=1200, n_fft=1024*4, win_length=1024*2)
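As a rough sanity check on those numbers (assuming the 16 kHz sample rate that I believe AudioConfig.Voice resamples to; worth verifying in your fastaudio version):
sr = 16000                        # assumed Voice-config sample rate
print(sr / (1024*4))              # ~3.9 Hz spacing between FFT bins
print(1024*2 / sr * 1000, 'ms')   # ~128 ms analysis window, so coarser in time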
for path in ex_paths:
    show_spectro_cfg(pitch_cfg, path)
I would also like to add a minimum decibel cut-off, like you can with torchaudio.transforms.AmplitudeToDB, to add contrast and remove noise, but I wasn't able to find that option in fastaudio, if it exists at all.
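For comparison, this is roughly what that cut-off looks like when building the transforms directly with torchaudio instead of through the fastaudio config; the top_db=80 value and the 16 kHz sample rate are just assumptions for illustration.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000,  # should match the audio's actual rate
                                            n_fft=1024*4, win_length=1024*2,
                                            f_min=0, f_max=1200)
to_db = torchaudio.transforms.AmplitudeToDB(top_db=80)  # clamp values more than 80 dB below the peak
show_image(to_db(mel(at)), figsize=(12,8)).invert_yaxis()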
Let's throw out anything that's not atamadaka, nakadaka or heiban for now.
dict1 = dict1[dict1.pattern.isin(['頭高', '中高', '平板'])]
dict2 = dict2[dict2.pattern.isin(['頭高', '中高', '平板'])]
Add the whole path.
dict1.path = 'dict1/'+dict1.path
dict2.path = 'dict2/'+dict2.path
Merge the labels into one dataframe.
all_labels = pd.concat([dict1, dict2]).reset_index(drop=True)
Convert the labels into romaji so we can plot them later.
all_labels.pattern.replace(['頭高', '中高', '平板'], ['atamadaka', 'nakadaka', 'heiban'], inplace=True)
Add a column that we're going to use to split the data into training and validation sets.
all_labels['is_valid'] = False
all_labels.loc[all_labels.type == 'dict2 male', 'is_valid'] = True
all_labels
Let's take a sample of the data to make quick experiments.
sample_df = all_labels.sample(16000)
I'm closely following Zach Mueller's lesson on audio here.
Let's remove the silence from our files (this is going to be most useful at inference), make them the same length and convert them to spectrograms using the usual configuration.
item_tfms = [RemoveSilence, ResizeSignal(2000), aud2spec]
Preparing the Dataloaders
def get_x(df):
    return df.path
def get_y(df):
    return df.pattern
dblock = DataBlock(blocks=[AudioBlock, CategoryBlock],
                   item_tfms=item_tfms,
                   get_x=get_x,
                   get_y=get_y,
                   splitter=ColSplitter())
dls = dblock.dataloaders(sample_df, shuffle=True)
AudioSpectrogram.show() throws an error for me, so I had to work around it by making this function.
def show_spec_batch(dls):
    _,axes = plt.subplots(3,3, figsize=(12,8))
    for (spec, pattern),ax in zip(dls.show_batch(show=False)[2], axes.flatten()):
        show_image(spec, title=pattern, ax=ax).invert_yaxis()
show_spec_batch(dls)
Adding n_out=3 for the number of classes we have.
learn = Learner(dls, xresnet34(pretrained=True, n_out=3), CrossEntropyLossFlat(),
metrics=[accuracy, F1Score(average='weighted')], wd=0.05).to_fp16()
Our spectrograms only have one channel, so we have to change the first convolutional layer.
def alter_learner(learn):
    layer = learn.model[0][0]
    layer.in_channels = 1
    # keep the weights of just one of the three pretrained input channels
    layer.weight = nn.Parameter(layer.weight[:,1,:,:].unsqueeze(1))
    learn.model[0][0] = layer
alter_learner(learn)
learn.unfreeze()
learn.lr_find()
learn.fit_one_cycle(7, 3e-2)
learn.recorder.plot_loss()
Let's make that a function for easier experimenting.
def make_pitch_learner(df, item_tfms, model=xresnet34(pretrained=True, n_out=3)):
    new_dblock = DataBlock(blocks=[AudioBlock, CategoryBlock],
                           item_tfms=item_tfms,
                           get_x=get_x,
                           get_y=get_y,
                           splitter=ColSplitter())
    new_dls = new_dblock.dataloaders(df, shuffle=True)
    new_learn = Learner(new_dls, model, CrossEntropyLossFlat(),
                        metrics=[accuracy, F1Score(average='weighted')], wd=0.05).to_fp16()
    alter_learner(new_learn)
    new_learn.unfreeze()
    return new_learn
Let's try it with the spectrogram parameters we chose.
learn2 = make_pitch_learner(sample_df, [RemoveSilence(), ResizeSignal(2000, AudioPadType.Zeros),
AudioToSpec.from_cfg(pitch_cfg)])
learn2.lr_find()
learn2.fit_one_cycle(7, 3e-2)
learn2.recorder.plot_loss()
It'd be nice to use torchaudio.transforms.PitchShift here, which would force my model to generalize better, but it seems the supported torchaudio version doesn't have it yet.
I tried to implement it on my own, but to no avail.
torchaudio.sox_effects.effect_names()
class SoxEffectTransform(torch.nn.Module):
    def __init__(self, effects):
        super().__init__()
        self.effects = effects
        self.rate = 16000
    def forward(self, tensor: torch.Tensor):
        return torchaudio.sox_effects.apply_effects_tensor(
            tensor, self.rate, self.effects)
effects = [
['pitch', '10']
]
PitchShift = SoxEffectTransform(effects)
at_shift = PitchShift(at)[0]
torchaudio.sox_effects.init_sox_effects()
Audio(at_shift, rate=24000)
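For the record, newer torchaudio releases (0.10+, if I remember correctly) do ship torchaudio.transforms.PitchShift, so with an upgraded environment the augmentation would look roughly like this (assuming AudioTensor's sr attribute holds the sample rate):
# Requires a newer torchaudio than the version used in this notebook.
shift = torchaudio.transforms.PitchShift(sample_rate=at.sr, n_steps=2)  # shift up by two semitones
at_shifted = shift(at)
Audio(at_shifted, rate=at.sr)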
So, let's finalize our project by training a bigger model on the whole dataset.
fin_learn = make_pitch_learner(all_labels, [RemoveSilence(), ResizeSignal(2000, AudioPadType.Zeros),
AudioToSpec.from_cfg(pitch_cfg)], xresnet50(pretrained=True, n_out=3))
show_spec_batch(fin_learn.dls)
fin_learn.lr_find()
fin_learn.fit_one_cycle(4, 3e-4)
fin_learn.recorder.plot_loss()
fin_learn.export("removesilence_pitch.pkl")
fin_learn.lr_find(end_lr=3e-3)
fin_learn.fit_one_cycle(1, 3e-7)
fin_learn.export("removesilence_pitch2.pkl")
fin_learn.fit_one_cycle(1, slice(1e-7, 5e-5))
fin_learn.fit_one_cycle(1, slice(1e-7, 5e-3))
fin_learn.export('removesilence_pitch3.pkl')
I'm getting pretty desperate here, because I had gotten an accuracy of 0.984 before with the default spectrogram parameters.
fin_learn.fit_one_cycle(1, slice(1e-7, 5e-4))
fin_learn.fit_one_cycle(1, slice(1e-8, 1e-4))
fin_learn.fit_one_cycle(1, slice(1e-8, 1e-4))
fin_learn.fit_one_cycle(1, slice(1e-6, 1e-4))
Was my previous model maybe learning some patterns from the phonemes as well?
inf_learn = load_learner("removesilence_pitch3.pkl")
Confirming that the saved learner works.
spec,pred,predtens,probs = inf_learn.predict('あめー雨.mp3', with_input=True)
show_image(spec)
print(probs, pred)
spec,pred,predtens,probs = inf_learn.predict('あめー雨.mp3', with_input=True)
show_image(spec)
print(probs, pred)
This model is not robust; I ended up not using it.
I used HuggingFace spaces with Gradio to deploy the model.
I created a new space. Then I added a requirements.txt file with the following Python libraries:
fastaudio
librosa
soundfile
Then, after some googling, I made a packages.txt containing the name of a C library, libsndfile1, which fastaudio needs to work properly.
Then you create an app.py file.
Importing dependencies.
import gradio as gr
from fastai.vision.all import *
from fastaudio.core.all import *
Loading the learner.
# load_learner needs the functions referenced by the DataBlock to be defined here
def get_x(df):
    return df.path
def get_y(df):
    return df.pattern
learn = load_learner('xresnet50_pitch3.pkl')
labels = learn.dls.vocab
We need a function that will return the predictions and a spectrogram from the inputs it gets. We want users to be able to both upload and record audio.
def predict(Record, Upload):
    if Upload: path = Upload
    else: path = Record
    spec,pred,pred_idx,probs = learn.predict(str(path), with_input=True)
    fig,ax = plt.subplots(figsize=(16,10))
    show_image(spec, ax=ax)
    ax.invert_yaxis()
    return [{labels[i]: float(probs[i]) for i in range(len(labels))}, fig]
So as not to have to make the spectrogram ourselves again, we pass with_input=True to the predict method. Gradio's gr.outputs.Image can take in a matplotlib figure, so we create that first.
We return a list with a dictionary for probabilities for the labels and a spectrogram figure.
Preparing other parameters.
title = "Japanese Pitch Accent Pattern Detector"
description = "This model will predict the pitch accent pattern of a word based on the recording of its pronunciation."
article="<p style='text-align: center'><a href='https://mizoru.github.io/blog' target='_blank'>Blog</a></p>"
examples = [['代わる.mp3'],['大丈夫な.mp3'],['熱くない.mp3'], ['あめー雨.mp3'], ['あめー飴.mp3']]
enable_queue=True
Putting everything into this final call.
gr.Interface(fn=predict,
             inputs=[gr.inputs.Audio(source='microphone', type='filepath', optional=True),
                     gr.inputs.Audio(source='upload', type='filepath', optional=True)],
             outputs=[gr.outputs.Label(num_top_classes=3),
                      gr.outputs.Image(type="plot", label='Spectrogram')],
             title=title, description=description, article=article,
             examples=examples).launch(debug=True, enable_queue=enable_queue)