-1. General information
-======================

-LibriSpeech is a corpus of read speech, based on LibriVox's public domain
-audio books. Its purpose is to enable the training and testing of automatic
-speech recognition (ASR) systems.
-
-
-2. Structure
-============
-
-The corpus is split into several parts to enable users to selectively download
-subsets of it, according to their needs. The subsets with "clean" in their name
-are supposedly "cleaner" (at least on average) than the rest of the audio, and
-closer to US-accented English. That classification was obtained using very crude
-automated means, and should not be considered completely reliable. The subsets
-are disjoint, i.e. the audio of each speaker is assigned to exactly one subset.
-
-The parts of the corpus are as follows:
-
-* dev-clean, test-clean - development and test sets containing "clean" speech.
-* train-clean-100 - training set of approximately 100 hours of "clean" speech
-* train-clean-360 - training set of approximately 360 hours of "clean" speech
-* dev-other, test-other - development and test sets, with speech which was
-                          automatically selected to be more "challenging" to
-                          recognize
-* train-other-500 - training set of approximately 500 hours containing speech
-                    that was not classified as "clean", for some (possibly wrong)
-                    reason
-* intro - subset containing only LibriVox's intro disclaimers for some of the
-          readers.
-* mp3 - the original MP3-encoded audio on which the corpus is based
-* texts - the original Project Gutenberg texts on which the reference transcripts
-          for the utterances in the corpus are based.
-* raw_metadata - SQLite databases which record various pieces of information about
-                 the source text/audio materials used, and the alignment process.
-                 (mostly for completeness - probably not very interesting or useful)
-
-2.1 Organization of the training and test subsets
--------------------------------------------------
-
-When extracted, each of the {dev,test,train} sets re-creates LibriSpeech's root
-directory, containing some metadata, and a dedicated subdirectory for the subset
-itself. The audio for each individual speaker is stored under a dedicated
-subdirectory in the subset's directory, and each audio chapter read by this
-speaker is stored in a separate sub-subdirectory. The following ASCII diagram
-depicts the directory structure:
-
-
-<corpus root>
-    |
-    .- README.TXT
-    |
-    .- READERS.TXT
-    |
-    .- CHAPTERS.TXT
-    |
-    .- BOOKS.TXT
-    |
-    .- train-clean-100/
-                   |
-                   .- 19/
-                       |
-                       .- 198/
-                       |    |
-                       |    .- 19-198.trans.txt
-                       |    |
-                       |    .- 19-198-0001.flac
-                       |    |
-                       |    .- 19-198-0002.flac
-                       |    |
-                       |    ...
-                       |
-                       .- 227/
-                            | ...
-
-
-
-Here 19 is the ID of the reader, and 198 and 227 are the IDs of the chapters
-read by this speaker. The *.trans.txt files contain the transcripts for each
-of the utterances, derived from the respective chapter, and the FLAC files
-contain the audio itself.
-
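The layout described above can be walked with a short script. The following is a minimal sketch, not part of the corpus tooling: the function names are hypothetical, and it assumes each line of a *.trans.txt file has the form "<utterance-id> <transcript text>", which this README does not spell out.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple


def read_transcripts(trans_file: Path) -> Dict[str, str]:
    """Map utterance ID -> transcript text for one chapter directory."""
    transcripts = {}
    for line in trans_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed line format: "<utterance-id> <transcript text>"
        utt_id, _, text = line.partition(" ")
        transcripts[utt_id] = text
    return transcripts


def iter_utterances(subset_dir: Path) -> Iterator[Tuple[str, Path, str]]:
    """Yield (utterance_id, flac_path, transcript) for a subset directory.

    Expects the <subset>/<reader-id>/<chapter-id>/ layout from the diagram.
    """
    for trans_file in sorted(subset_dir.glob("*/*/*.trans.txt")):
        transcripts = read_transcripts(trans_file)
        for utt_id, text in sorted(transcripts.items()):
            flac_path = trans_file.parent / f"{utt_id}.flac"
            # Skip transcript entries whose audio file is missing.
            if flac_path.exists():
                yield utt_id, flac_path, text
```

For example, `iter_utterances(Path("LibriSpeech/train-clean-100"))` would yield one tuple per utterance, where the root path is only an illustration of where the archive might have been extracted.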
-The main metainfo about the speech is listed in READERS.TXT and CHAPTERS.TXT:
-
-- READERS.TXT contains information about each speaker's gender and total amount
-  of audio in the corpus.
-
-- CHAPTERS.TXT has information about the per-chapter audio durations.
-
-The file BOOKS.TXT contains the title of each book whose text is used in
-the corpus, and its Project Gutenberg ID.
-
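The metadata files can be read with a few lines of code. This is a rough sketch under assumptions this README does not state: that the files are "|"-delimited, one record per line, and that lines starting with ";" are comments. The function name and its parameters are hypothetical, so check the files themselves before relying on it.

```python
from typing import Iterable, Iterator, List


def parse_metadata_lines(
    lines: Iterable[str], delimiter: str = "|", maxsplit: int = -1
) -> Iterator[List[str]]:
    """Yield one list of stripped fields per data line.

    Assumptions: fields are separated by `delimiter`, and blank lines or
    lines starting with ';' carry no data. Pass `maxsplit` when the last
    field (e.g. a name or title) may itself contain the delimiter.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        yield [field.strip() for field in line.split(delimiter, maxsplit)]
```

A typical use would be `with open("READERS.TXT", encoding="utf-8") as f: rows = list(parse_metadata_lines(f))`, after which the column order has to be taken from the header comments in the file.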
-2.2 Organization of the "intro-disclaimers" subset
----------------------------------------------------
-
-This part of the data simply contains the LibriVox intro disclaimers that were
-successfully extracted, using a slight modification of the alignment algorithms
-used to derive the test and training sets. The standard LibriVox disclaimer is:
-
-"This is a LibriVox recording. All LibriVox recordings are in the public domain.
- For more information, or to volunteer, please visit: librivox DOT org"
-
-As is the case for the training and test sets, there is one subdirectory for
-each reader, and a sub-subdirectory for each of the chapters read by this
-speaker for which the announcement was successfully extracted.
-
-
-2.3 Organization of the "original-mp3" subset
-----------------------------------------------
-
-This part contains the original MP3-compressed recordings as downloaded from the
-Internet Archive. It is intended to serve as a secure reference "snapshot" for
-the original audio chapters, but also to preserve (most of) the information both
-about audio selected for the corpus and audio that was discarded. I decided to
-try to make the corpus relatively balanced in terms of per-speaker durations, so
-part of the audio available for some of the speakers was discarded. Also, for the
-speakers in the test sets, only up to 10 minutes of audio is used, to
-introduce more speaker diversity during evaluation time. There should be enough
-information in the "mp3" subset to enable the re-cutting of an extended
-"LibriSpeech+" corpus, containing around 150 extra hours of speech, if needed.
-
-The directory hierarchy follows the already familiar pattern. In each
-speaker directory there is a file named "utterance_map" which lists, for each
-of the utterances in the corpus, the original "raw" aligned utterance.
-In the "header" of that file there are also 2 lines that show whether the
-sentence-aware segmentation was used in the LibriSpeech corpus (i.e. if the
-reader is assigned to a test set) and the maximum allowed duration for
-the set to which this speaker was assigned.
-
-Then in the chapter directory, besides the original audio chapter .mp3 file,
-there are two sets of ".seg.txt" and ".trans.txt" files. The former contain
-the time range (in seconds) for each of the original (what I called "raw" above)
-utterances. The latter contain the respective transcriptions. There are two
-sets for the two possible segmentations of each chapter. The ".sents"
-segmentation is "sentence-aware", that is, we only split on silence intervals
-coinciding with (automatically obtained) sentence boundaries in the text.
-The other segmentation was derived by allowing splitting on every silence
-interval longer than 300ms, which leads to better utilization of the aligned
-audio.
-
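To illustrate the second segmentation rule, here is a small sketch that splits a chapter on every silence interval longer than 300 ms. It is not the code used to build the corpus. The function name, the input representation (a list of silence intervals in seconds) and the choice to cut at the midpoint of each silence are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def split_on_silences(
    duration: float, silences: List[Interval], min_silence: float = 0.3
) -> List[Interval]:
    """Split [0, duration] at every silence longer than `min_silence`.

    Assumption: the cut is placed at the midpoint of each qualifying
    silence; shorter silences are left inside a segment.
    """
    cut_points = [
        (start + end) / 2.0
        for start, end in sorted(silences)
        if end - start > min_silence
    ]
    boundaries = [0.0] + cut_points + [duration]
    # Pair consecutive boundaries into (segment_start, segment_end).
    return list(zip(boundaries[:-1], boundaries[1:]))
```

A sentence-aware variant in the spirit of the ".sents" segmentation would additionally require that a silence coincide with a sentence boundary before it is accepted as a cut point.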
-2.4 Organization of the "text" subset
---------------------------------------
-
-This part just contains one subdirectory for each book, named after the ID of
-the text in Project Gutenberg's database. The books are also separated into
-directories by their encoding, which can be either ASCII or UTF-8.
-The sole purpose of this subset is to be a permanent snapshot of the original
-text used for LibriSpeech's construction.
-
-
-2.5 Organization of the "raw-metadata" part
---------------------------------------------
-
-Contains just a few SQLite databases. Some of the more important bits of
-information from these tables are described in the README file within
-the "raw_data" subdirectory.
-
-
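Since the table names are not listed here, a first step when exploring one of these databases is to ask SQLite for its tables. The sketch below uses only Python's standard sqlite3 module; the function name is hypothetical and the database path is whatever file you extracted.

```python
import sqlite3
from contextlib import closing
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in the SQLite database at db_path."""
    # closing() makes sure the connection is closed when the block exits.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Once a table name is known, `PRAGMA table_info(<table>)` shows its columns.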
-Acknowledgments
-===============
-
-First and foremost, I would like to thank the thousands of Project Gutenberg
-and LibriVox volunteers, without whose contributions the LibriSpeech corpus
-would not have existed.
-The successful completion of this project would have been much more difficult,
-and the quality of the finished corpus much worse, if it weren't for the
-generous support and the many helpful pieces of advice provided by Daniel Povey - thanks, Dan!
-I would also like to express my gratitude to Tony Robinson for the very
-interesting and useful discussions on the long audio alignment problem that
-we had some time ago.
-Thanks also to Guoguo Chen and Sanjeev Khudanpur, with whom we are collaborating
-on a (yet-to-be-published) paper on the corpus, and who helped to improve
-LibriSpeech's example scripts in Kaldi.
-
----
-Vassil Panayotov,
-Oct. 2, 2014