I made a model that predicts Japanese pitch accent that you can try out here: https://huggingface.co/spaces/mizoru/Japanese_pitch

But what is pitch accent and how did I make this?

Pitch accent

Let's look at two Japanese words: 飴 "candy" and 雨 "rain". They're both spelled ame, but they don't sound the same. Let's listen to how they're pronounced!

Can you tell the difference? The second word has a drop in pitch after the first vowel.

This is pitch accent. In pitch-accent languages, one syllable of a word can be marked with contrasting pitch, rather than by loudness or length as in languages with stress accent. Pitch-accent systems exist in many languages: Swedish, Norwegian, Serbo-Croatian and Lithuanian, among others. In fact, pitch accent is reconstructed for Proto-Indo-European, the ancestor of most of the languages of Europe.

Japanese pitch accent

In Japanese, pitch accent is characterized by a drop in pitch. The mora preceding the drop is said to carry the accent, so the pitch accent of a word can be described with a single number designating the accented mora. The word 雨 "rain" has the accent on the first mora, so its pitch accent would be written down as 「1」. But there's a different way to classify words by pitch accent.

When the first mora of a word carries the accent, the pitch accent pattern of the word is 頭高 atamadaka, literally "head-high". If the pitch drops later in the word, the pattern is 中高 nakadaka, "middle-high". Many words in Japanese don't have a drop in pitch at all; these are called 平板 heiban, "flat".

Let's listen to some examples.

Atamadaka:

Nakadaka:

Heiban:

In some words, the drop in pitch is apparent only if a particle is attached to the word. In these words the last mora of the word is accented, so a particle attached after it is low. These words have the pitch accent pattern 尾高 odaka, "tail-high". In practice, the phonetic word sounds heiban when there's no particle attached, and sounds nakadaka when there is. So we're not going to use these for training.
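To keep the terminology straight, here's a tiny helper of my own (not part of the training code) that maps the accented mora number to a pattern name, using the same drop and morae numbers that show up in the label files below:

def accent_pattern(drop, morae):
    if drop == 0:      return 'heiban'     # no drop at all
    if drop == 1:      return 'atamadaka'  # drop right after the first mora
    if drop == morae:  return 'odaka'      # drop only surfaces on a following particle
    return 'nakadaka'                      # drop somewhere in the middle

accent_pattern(1, 2)   # 雨 "rain"  -> 'atamadaka'
accent_pattern(0, 2)   # 飴 "candy" -> 'heiban'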

Data

I thank Yoga from Migaku for making this data available.

Let's import the libraries we're going to be using to make the model.

from fastai.vision.all import *
from fastaudio.core.all import *
from fastaudio.augment.all import *

Here are the fastai, fastaudio and torchaudio versions, respectively:

import fastai
import fastaudio
import torchaudio
fastai.__version__, fastaudio.__version__,torchaudio.__version__
('2.3.1', '1.0.2', '0.8.1')

Let's take a look at our data.

Path("dict1").ls()
(#80120) [Path('dict1/α粒子.yomi00013233_063E.mp3'),Path('dict1/α線.yomi0001323E_0162.mp3'),Path('dict1/γ.yomi00013247_04EE.mp3'),Path('dict1/γ線.yomi0001324B_0100.mp3'),Path('dict1/λ.yomi00013255_034C.mp3'),Path('dict1/π.yomi0001325F_0238.mp3'),Path('dict1/σ.yomi00013265_07D4.mp3'),Path('dict1/ω.yomi0001326D_05E0.mp3'),Path('dict1/○×式.yomi00013273_004A.mp3'),Path('dict1/ああ.yomi00013280_030E.mp3')...]
Path("dict2").ls()
(#84356) [Path('dict2/あくどい-2830_1_1_male.mp3'),Path('dict2/あくどい-2830_2_1_female.mp3'),Path('dict2/あくどい-2830_2_1_male.mp3'),Path('dict2/あくどいです-2830_3_1_female.mp3'),Path('dict2/あくどいです-2830_3_1_male.mp3'),Path('dict2/あくどかった-2830_6_1_female.mp3'),Path('dict2/あくどかった-2830_6_1_male.mp3'),Path('dict2/あくどかった-2830_6_2_female.mp3'),Path('dict2/あくどかった-2830_6_2_male.mp3'),Path('dict2/あくどく-2830_4_1_female.mp3')...]
dict1 = pd.read_csv('dict1_labels.csv')
dict1
path pattern kana morae drop type
0 dict1/ある.yomi000142BB_0596.mp3 頭高 アル 2 1 dict1
1 dict1/思う.yomi0006C617_043A.mp3 中高 オモウ 3 2 dict1
2 dict1/など.yomi000240B7_0028.mp3 頭高 ナド 2 1 dict1
3 dict1/私.yomi00092F63_0072.mp3 平板 ワタくシ 4 0 dict1
4 dict1/見る.yomi000A41BD_001E.mp3 頭高 ミル 2 1 dict1
... ... ... ... ... ... ...
79480 dict1/捨てがな_捨て仮名.yomi00072538_06BE.mp3 平板 すテカ゚ナ 5 0 dict1
79481 dict1/くも膜下出血_蜘蛛膜下出血.yomi0001AAD1_0622.mp3 中高 クモマッカしュッケツ 9 6 dict1
79482 dict1/捜す.yomi00072507_0088.mp3 平板 サカ゚ス 4 0 dict1
79483 dict1/捜し物.yomi000724FD_0424.mp3 平板 サカ゚シモノ 6 0 dict1
79484 dict1/あこや貝_阿古屋貝.yomi00013767_0114.mp3 中高 アコヤカ゚イ 6 3 dict1

76293 rows × 6 columns

dict2 = pd.read_csv('dict2_labels.csv')
dict2
path pattern kana morae drop type
0 dict2/ある-66_1_1_male.mp3 頭高 ある 2 1 dict2 male
1 dict2/ある-66_1_1_female.mp3 頭高 ある 2 1 dict2 female
2 dict2/あります-66_2_1_male.mp3 中高 あります 4 3 dict2 male
3 dict2/あります-66_2_1_female.mp3 中高 あります 4 3 dict2 female
4 dict2/あって-66_3_1_male.mp3 頭高 あって 3 1 dict2 male
... ... ... ... ... ... ...
84477 dict2/立て-377_10_1_female.mp3 頭高 たて 2 1 dict2 female
84478 dict2/立てる-377_11_1_male.mp3 中高 たてる 3 2 dict2 male
84479 dict2/立てる-377_11_1_female.mp3 中高 たてる 3 2 dict2 female
84480 dict2/立とう-377_12_1_male.mp3 中高 たとう 3 2 dict2 male
84481 dict2/立とう-377_12_1_female.mp3 中高 たとう 3 2 dict2 female

84470 rows × 6 columns

The first dictionary is what I started with. It has recordings by 3–4 male voices; which word is pronounced by which speaker is not marked anywhere. I had to parse a JSON file to get the labels for it.

I was afraid that my model wouldn't generalize well, so I needed the data from the second dictionary. The most important thing was having at least one female voice in the training data, and having a male voice not present in the training data for the validation set was nice as well. The pitch accent labels for these recordings were XML inside a JSON file.

Feature extraction

Let's convert our previous example audio into a tensor and then make a spectrogram out of it.

at = AudioTensor.create("あめー雨.mp3")
at.show()

To make a spectrogram in fastaudio the easiest way is to first create a config object with parameters:

cfg = AudioConfig.Voice()
aud2spec = AudioToSpec.from_cfg(cfg)
show_image(aud2spec(at))

I don't know why it comes out flipped, but it's easy to fix:

show_image(aud2spec(at)).invert_yaxis()

Let's load a few examples to try this configuration on.

ex_paths = [ Path('dict2').ls()[20006],  Path('dict2').ls()[20007],Path('dict1').ls()[21345]]
ex_paths
[Path('dict2/取り残した-473_4_1_male.mp3'),
 Path('dict2/取り残して-473_3_1_female.mp3'),
 Path('dict1/不人気.yomi0003F319_0108.mp3')]

Let's make a function that plots a spectrogram for a given path and config.

def show_spectro_cfg(cfg, path):
    at = AudioTensor.create(path)
    aud2spec = AudioToSpec.from_cfg(cfg)
    spec = aud2spec(at)
    show_image(spec, figsize=(12,8)).invert_yaxis()
    return spec.shape
for path in ex_paths:
    print(show_spectro_cfg(cfg, path))
torch.Size([1, 128, 397])
torch.Size([1, 128, 460])
torch.Size([1, 128, 532])

These are good enough and I got to 98% accuracy with them, but let's see if we can do better.

To see what the parameters for AudioConfig mean, let's look at the torchaudio docs.

Edit: this looks like a much better resource.

??torchaudio.transforms.MelSpectrogram
Init signature:
torchaudio.transforms.MelSpectrogram(
    sample_rate: int = 16000,
    n_fft: int = 400,
    win_length: Optional[int] = None,
    hop_length: Optional[int] = None,
    f_min: float = 0.0,
    f_max: Optional[float] = None,
    pad: int = 0,
    n_mels: int = 128,
    window_fn: Callable[..., torch.Tensor] = <built-in method hann_window of type object at 0x7f2703e74d00>,
    power: Optional[float] = 2.0,
    normalized: bool = False,
    wkwargs: Optional[dict] = None,
    center: bool = True,
    pad_mode: str = 'reflect',
    onesided: bool = True,
    norm: Optional[str] = None,
) -> None
Source:        
class MelSpectrogram(torch.nn.Module):
    r"""Create MelSpectrogram for a raw audio signal. This is a composition of Spectrogram
    and MelScale.

    Sources
        * https://gist.github.com/kastnerkyle/179d6e9a88202ab0a2fe
        * https://timsainb.github.io/spectrograms-mfccs-and-inversion-in-python.html
        * http://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html

    Args:
        sample_rate (int, optional): Sample rate of audio signal. (Default: ``16000``)
        win_length (int or None, optional): Window size. (Default: ``n_fft``)
        hop_length (int or None, optional): Length of hop between STFT windows. (Default: ``win_length // 2``)
        n_fft (int, optional): Size of FFT, creates ``n_fft // 2 + 1`` bins. (Default: ``400``)
        f_min (float, optional): Minimum frequency. (Default: ``0.``)
        f_max (float or None, optional): Maximum frequency. (Default: ``None``)
        pad (int, optional): Two sided padding of signal. (Default: ``0``)
        n_mels (int, optional): Number of mel filterbanks. (Default: ``128``)
        window_fn (Callable[..., Tensor], optional): A function to create a window tensor
            that is applied/multiplied to each frame/window. (Default: ``torch.hann_window``)
        wkwargs (Dict[..., ...] or None, optional): Arguments for window function. (Default: ``None``)
        center (bool, optional): whether to pad :attr:`waveform` on both sides so
            that the :math:`t`-th frame is centered at time :math:`t \times \text{hop\_length}`.
            Default: ``True``
        pad_mode (string, optional): controls the padding method used when
            :attr:`center` is ``True``. Default: ``"reflect"``
        onesided (bool, optional): controls whether to return half of results to
            avoid redundancy. Default: ``True``
        norm (Optional[str]): If 'slaney', divide the triangular mel weights by the width of the mel band
        (area normalization). (Default: ``None``)

    Example
        >>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
        >>> mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)
    """
    __constants__ = ['sample_rate', 'n_fft', 'win_length', 'hop_length', 'pad', 'n_mels', 'f_min']

    def __init__(self,
                 sample_rate: int = 16000,
                 n_fft: int = 400,
                 win_length: Optional[int] = None,
                 hop_length: Optional[int] = None,
                 f_min: float = 0.,
                 f_max: Optional[float] = None,
                 pad: int = 0,
                 n_mels: int = 128,
                 window_fn: Callable[..., Tensor] = torch.hann_window,
                 power: Optional[float] = 2.,
                 normalized: bool = False,
                 wkwargs: Optional[dict] = None,
                 center: bool = True,
                 pad_mode: str = "reflect",
                 onesided: bool = True,
                 norm: Optional[str] = None) -> None:
        super(MelSpectrogram, self).__init__()
        self.sample_rate = sample_rate
        self.n_fft = n_fft
        self.win_length = win_length if win_length is not None else n_fft
        self.hop_length = hop_length if hop_length is not None else self.win_length // 2
        self.pad = pad
        self.power = power
        self.normalized = normalized
        self.n_mels = n_mels  # number of mel frequency bins
        self.f_max = f_max
        self.f_min = f_min
        self.spectrogram = Spectrogram(n_fft=self.n_fft, win_length=self.win_length,
                                       hop_length=self.hop_length,
                                       pad=self.pad, window_fn=window_fn, power=self.power,
                                       normalized=self.normalized, wkwargs=wkwargs,
                                       center=center, pad_mode=pad_mode, onesided=onesided)
        self.mel_scale = MelScale(self.n_mels, self.sample_rate, self.f_min, self.f_max, self.n_fft // 2 + 1, norm)

    def forward(self, waveform: Tensor) -> Tensor:
        r"""
        Args:
            waveform (Tensor): Tensor of audio of dimension (..., time).

        Returns:
            Tensor: Mel frequency spectrogram of size (..., ``n_mels``, time).
        """
        specgram = self.spectrogram(waveform)
        mel_specgram = self.mel_scale(specgram)
        return mel_specgram
File:           ~/.conda/envs/default/lib/python3.9/site-packages/torchaudio/transforms.py
Type:           type
Subclasses:     

I decided to lower the upper frequency bound, since we don't really need the information up there for pitch, and to increase the window size to get better low-frequency resolution at the cost of temporal resolution. The rest is heuristics.

pitch_cfg = AudioConfig.Voice(f_min=0, f_max=1200, n_fft=1024*4, win_length=1024*2)
for path in ex_paths:
    show_spectro_cfg(pitch_cfg, path)
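As a sanity check on that tradeoff, here's the rough arithmetic (assuming the Voice preset keeps torchaudio's default 16 kHz sample rate):

sr, n_fft, win_length = 16000, 1024*4, 1024*2
print(f"frequency bin spacing: {sr / n_fft:.1f} Hz")              # ~3.9 Hz per FFT bin
print(f"analysis window:       {1000 * win_length / sr:.0f} ms")  # 128 ms per frame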

I would also like to add a minimum-decibel cut-off, like you can with torchaudio.transforms.AmplitudeToDB, to add contrast and remove noise, but I wasn't able to find that option in fastaudio, if it exists.
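For reference, here is roughly how such a cut-off could be applied by hand with torchaudio, assuming the spectrogram is still on a power scale (the Voice preset may already convert to decibels, in which case this wouldn't apply as-is):

to_db = torchaudio.transforms.AmplitudeToDB(stype='power', top_db=80)
spec_db = to_db(aud2spec(at))   # clip everything more than 80 dB below the peak
show_image(spec_db).invert_yaxis()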

Preparing our labels for training

Let's throw out anything that's not atamadaka, nakadaka or heiban for now.

dict1 = dict1[dict1.pattern.isin(['頭高', '中高', '平板'])]
dict2 = dict2[dict2.pattern.isin(['頭高', '中高', '平板'])]

Prepend the directory so each entry has the full path.

dict1.path = 'dict1/'+dict1.path
dict2.path = 'dict2/'+dict2.path

Merge the labels into one file.

all_labels = pd.concat([dict1, dict2]).reset_index(drop=True)

Convert the labels into romaji so we can plot them later.

all_labels.pattern.replace(['頭高', '中高', '平板'], ['atamadaka', 'nakadaka', 'heiban'], inplace=True)

Add a column that we're going to use to split the data into training and validation sets.

all_labels['is_valid'] = False
all_labels.loc[all_labels.type == 'dict2 male', 'is_valid'] = True
all_labels
path pattern kana morae drop type is_valid
0 dict1/ある.yomi000142BB_0596.mp3 atamadaka アル 2 1 dict1 False
1 dict1/思う.yomi0006C617_043A.mp3 nakadaka オモウ 3 2 dict1 False
2 dict1/など.yomi000240B7_0028.mp3 atamadaka ナド 2 1 dict1 False
3 dict1/私.yomi00092F63_0072.mp3 heiban ワタくシ 4 0 dict1 False
4 dict1/見る.yomi000A41BD_001E.mp3 atamadaka ミル 2 1 dict1 False
... ... ... ... ... ... ... ...
160758 dict2/立て-377_10_1_female.mp3 atamadaka たて 2 1 dict2 female False
160759 dict2/立てる-377_11_1_male.mp3 nakadaka たてる 3 2 dict2 male True
160760 dict2/立てる-377_11_1_female.mp3 nakadaka たてる 3 2 dict2 female False
160761 dict2/立とう-377_12_1_male.mp3 nakadaka たとう 3 2 dict2 male True
160762 dict2/立とう-377_12_1_female.mp3 nakadaka たとう 3 2 dict2 female False

160763 rows × 7 columns

Let's take a sample of the data to make quick experiments.

sample_df = all_labels.sample(16000)

Making a Learner and training

I'm closely following Zach Mueller's lesson on audio here.

Let's remove the silence from our files (this is going to be most useful at inference), make them the same length, and convert them to spectrograms using the default Voice configuration from earlier.

item_tfms = [RemoveSilence, ResizeSignal(2000), aud2spec]

Preparing the Dataloaders

def get_x(df):
    return df.path
def get_y(df):
    return df.pattern
dblock = DataBlock(blocks=[AudioBlock, CategoryBlock],
                  item_tfms=item_tfms,
                  get_x=get_x,
                  get_y=get_y,
                  splitter=ColSplitter())
dls = dblock.dataloaders(sample_df, shuffle=True)

AudioSpectrogram.show() throws an error for me, so I had to work around it with this function.

def show_spec_batch(dls):
    _,axes = plt.subplots(3,3, figsize=(12,8))
    for (spec, pattern),ax in zip(dls.show_batch(show=False)[2], axes.flatten()):
        show_image(spec, title=pattern, ax=ax).invert_yaxis()
show_spec_batch(dls)

Adding n_out=3 for the number of classes we have.

learn = Learner(dls, xresnet34(pretrained=True, n_out=3), CrossEntropyLossFlat(), 
                metrics=[accuracy, F1Score(average='weighted')], wd=0.05).to_fp16()

Our spectrograms only have one channel, so we have to change the first Conv Layer.

def alter_learner(learn):
    layer = learn.model[0][0]    # first conv layer, pretrained on 3-channel images
    layer.in_channels = 1
    layer.weight = nn.Parameter(layer.weight[:,1,:,:].unsqueeze(1))  # keep the pretrained weights of a single input channel
    learn.model[0][0] = layer
alter_learner(learn)
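An alternative to keeping a single channel's weights would be to average the pretrained weights over the three input channels, which preserves a bit more of the pretraining. A one-line sketch of that variant (not what I used here):

layer.weight = nn.Parameter(layer.weight.mean(dim=1, keepdim=True))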
learn.unfreeze()
learn.lr_find()
SuggestedLRs(lr_min=0.33113112449646, lr_steep=2.7542285919189453)
learn.fit_one_cycle(7, 3e-2)
epoch train_loss valid_loss accuracy f1_score time
0 0.271414 3.523385 0.625543 0.499470 00:55
1 0.196330 3.277715 0.445437 0.402725 00:52
2 0.155347 0.152431 0.947610 0.945649 00:52
3 0.111935 0.165354 0.934814 0.935485 00:52
4 0.071950 0.171538 0.937711 0.934709 00:52
5 0.051120 0.083185 0.972718 0.972574 00:52
6 0.045892 0.065998 0.976098 0.976093 00:52
learn.recorder.plot_loss()

Let's make that a function for easier experimenting.

def make_pitch_learner(df, item_tfms, model=xresnet34(pretrained=True, n_out=3)):
    new_dblock = DataBlock(blocks=[AudioBlock, CategoryBlock],
                  item_tfms=item_tfms,
                  get_x=get_x,
                  get_y=get_y,
                  splitter=ColSplitter())
    new_dls = new_dblock.dataloaders(df, shuffle=True)
    new_learn = Learner(new_dls, model, CrossEntropyLossFlat(), 
                metrics=[accuracy, F1Score(average='weighted')], wd=0.05).to_fp16()
    alter_learner(new_learn)
    new_learn.unfreeze()
    return new_learn

Let's try it with the spectrogram parameters we chose.

learn2 = make_pitch_learner(sample_df, [RemoveSilence(), ResizeSignal(2000, AudioPadType.Zeros), 
                                        AudioToSpec.from_cfg(pitch_cfg)])
learn2.lr_find()
SuggestedLRs(lr_min=0.19054607152938843, lr_steep=1.5848932266235352)
learn2.fit_one_cycle(7, 3e-2)
epoch train_loss valid_loss accuracy f1_score time
0 0.329706 2.352126 0.608402 0.462043 01:22
1 0.206030 0.551953 0.828102 0.829986 01:23
2 0.154101 0.194723 0.952921 0.952699 01:22
3 0.119368 0.105970 0.963061 0.963882 01:22
4 0.091843 0.103478 0.964510 0.964183 01:22
5 0.063340 0.068436 0.975616 0.976229 01:22
6 0.046640 0.060200 0.981169 0.981281 01:22
learn2.recorder.plot_loss()

It'd be nice to use torchaudio.transforms.PitchShift here, which would force my model to generalize better, but it seems the torchaudio version used here doesn't have it yet.

I tried to implement it on my own, but to no avail.

torchaudio.sox_effects.effect_names() 
['allpass',
 'band',
 'bandpass',
 'bandreject',
 'bass',
 'bend',
 'biquad',
 'chorus',
 'channels',
 'compand',
 'contrast',
 'dcshift',
 'deemph',
 'delay',
 'dither',
 'divide',
 'downsample',
 'earwax',
 'echo',
 'echos',
 'equalizer',
 'fade',
 'fir',
 'firfit',
 'flanger',
 'gain',
 'highpass',
 'hilbert',
 'loudness',
 'lowpass',
 'mcompand',
 'norm',
 'oops',
 'overdrive',
 'pad',
 'phaser',
 'pitch',
 'rate',
 'remix',
 'repeat',
 'reverb',
 'reverse',
 'riaa',
 'silence',
 'sinc',
 'speed',
 'stat',
 'stats',
 'stretch',
 'swap',
 'synth',
 'tempo',
 'treble',
 'tremolo',
 'trim',
 'upsample',
 'vad',
 'vol']
class SoxEffectTransform(torch.nn.Module):
    def __init__(self, effects):
        super().__init__()
        self.effects = effects
        self.rate = 16000
    def forward(self, tensor: torch.Tensor):
        # returns a (waveform, sample_rate) tuple
        return torchaudio.sox_effects.apply_effects_tensor(
            tensor, self.rate, self.effects)
effects = [
     ['pitch', '10']  # sox's pitch effect takes the shift in cents
 ]
PitchShift = SoxEffectTransform(effects)
at_shift = PitchShift(at)[0]  # keep only the waveform
torchaudio.sox_effects.init_sox_effects()
Audio(at_shift, rate=24000)
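For the record, here's a rough, untested sketch of how this could be wrapped as a random item transform (I'm assuming an AudioTensor can be rebuilt from a plain tensor and exposes its sample rate as .sr; in any case I didn't get a version of this working well enough to use):

import random

class RandomPitchShift(RandTransform):
    "Shift pitch by a random number of cents using the sox 'pitch' effect"
    def __init__(self, max_cents=100, p=0.5):
        super().__init__(p=p)
        self.max_cents = max_cents
    def encodes(self, x: AudioTensor):
        cents = random.uniform(-self.max_cents, self.max_cents)
        shifted, _ = torchaudio.sox_effects.apply_effects_tensor(
            x, x.sr, [['pitch', f'{cents:.0f}']])
        return AudioTensor(shifted, x.sr)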

Final model

So, let's finalize our project by training a bigger model on the whole dataset.

fin_learn = make_pitch_learner(all_labels, [RemoveSilence(), ResizeSignal(2000, AudioPadType.Zeros), 
                                        AudioToSpec.from_cfg(pitch_cfg)], xresnet50(pretrained=True, n_out=3))
show_spec_batch(fin_learn.dls)
fin_learn.lr_find()
SuggestedLRs(lr_min=0.0006309573538601399, lr_steep=0.3019951581954956)
fin_learn.fit_one_cycle(4, 3e-4)
epoch train_loss valid_loss accuracy f1_score time
0 0.091439 0.182869 0.930946 0.935078 15:07
1 0.050077 0.076877 0.974457 0.974723 14:16
2 0.033197 0.076989 0.977416 0.977832 13:47
3 0.024043 0.070777 0.979973 0.980235 13:36
fin_learn.recorder.plot_loss()
fin_learn.export("removesilence_pitch.pkl")
fin_learn.lr_find(end_lr=3e-3)
SuggestedLRs(lr_min=3.8196694163161736e-08, lr_steep=8.417093340540305e-06)
fin_learn.fit_one_cycle(1, 3e-7)
epoch train_loss valid_loss accuracy f1_score time
0 0.023801 0.068112 0.981677 0.981916 13:16
fin_learn.export("removesilence_pitch2.pkl")
fin_learn.fit_one_cycle(1, slice(1e-7, 5e-5))
epoch train_loss valid_loss accuracy f1_score time
0 0.022423 0.078075 0.978623 0.978896 13:06
fin_learn.fit_one_cycle(1, slice(1e-7, 5e-3))
epoch train_loss valid_loss accuracy f1_score time
0 0.038483 0.059628 0.982719 0.982791 13:55
fin_learn.export('removesilence_pitch3.pkl')

I'm getting pretty desperate here, because I had gotten an accuracy of 0.984 before with the default spectrogram parameters.

fin_learn.fit_one_cycle(1, slice(1e-7, 5e-4))
epoch train_loss valid_loss accuracy f1_score time
0 0.029828 0.060020 0.983476 0.983565 13:41
fin_learn.fit_one_cycle(1, slice(1e-8, 1e-4))
epoch train_loss valid_loss accuracy f1_score time
0 0.019335 0.059957 0.982932 0.983027 13:36
fin_learn.fit_one_cycle(1, slice(1e-8, 1e-4))
epoch train_loss valid_loss accuracy f1_score time
0 0.026071 0.065069 0.981417 0.981602 13:33
fin_learn.fit_one_cycle(1, slice(1e-6, 1e-4))

Was my previous model maybe learning some patterns from the phonemes as well?

Inference

inf_learn = load_learner("removesilence_pitch3.pkl")

Confirming that the saved learner works.

spec,pred,predtens,probs = inf_learn.predict('あめー雨.mp3', with_input=True)
show_image(spec)
print(probs, pred)
tensor([0.6475, 0.3349, 0.0176]) atamadaka
spec,pred,predtens,probs = inf_learn.predict('あめー雨.mp3', with_input=True)
show_image(spec)
print(probs, pred)
tensor([0.3057, 0.6840, 0.0103]) heiban

This model is not robust; I ended up not using it.

HuggingFace Space

I used HuggingFace spaces with Gradio to deploy the model.

I created a new space. Then I added a requirements.txt file with the following Python libraries:

fastaudio
librosa
soundfile

Then, after some googling, I made a packages.txt containing the name of a C package, libsndfile1, which fastaudio needs to work properly.

Then I created an app.py file.

Importing dependencies.

import gradio as gr
from fastai.vision.all import *
from fastaudio.core.all import *

Loading the learner. Note that get_x and get_y have to be defined again here, since the exported learner references them by name.

def get_x(df):
    return df.path

def get_y(df):
    return df.pattern

learn = load_learner('xresnet50_pitch3.pkl')

labels = learn.dls.vocab

We need a function that will return the predictions and a spectrogram from the inputs it gets. We want users to be able to both upload and record audio.

def predict(Record, Upload):
    if Upload: path = Upload
    else: path = Record
    spec,pred,pred_idx,probs = learn.predict(str(path), with_input=True)
    fig,ax = plt.subplots(figsize=(16,10))
    show_image(spec, ax=ax)
    ax.invert_yaxis()
    return [{labels[i]: float(probs[i]) for i in range(len(labels))}, fig]

So that we don't have to make the spectrogram ourselves again, we pass with_input=True to the predict method. Gradio's gr.outputs.Image can take a matplotlib figure, so we create one first.

We return a list with a dictionary for probabilities for the labels and a spectrogram figure.

Preparing other parameters.

title = "Japanese Pitch Accent Pattern Detector"

description = "This model will predict the pitch accent pattern of a word based on the recording of its pronunciation."

article="<p style='text-align: center'><a href='https://mizoru.github.io/blog' target='_blank'>Blog</a></p>"

examples = [['代わる.mp3'],['大丈夫な.mp3'],['熱くない.mp3'], ['あめー雨.mp3'], ['あめー飴.mp3']]

enable_queue=True

Putting everything into this final call.

gr.Interface(fn=predict,inputs=[gr.inputs.Audio(source='microphone', type='filepath', optional=True),
                                gr.inputs.Audio(source='upload', type='filepath', optional=True)], 
             outputs=  [gr.outputs.Label(num_top_classes=3), 
                        gr.outputs.Image(type="plot", label='Spectrogram')],
             title=title, description=description, article=article,
             examples=examples).launch(debug=True, enable_queue=enable_queue)