Commit 40be1e6

Update README.TXT
1 parent 00d0f9a commit 40be1e6

1 file changed, 0 additions and 177 deletions
  • tensorflow_datasets/testing/test_data/fake_examples/librispeech/train-other-500/LibriSpeech

@@ -1,178 +1 @@
1. General information
======================

LibriSpeech is a corpus of read speech, based on LibriVox's public domain
audio books. Its purpose is to enable the training and testing of automatic
speech recognition (ASR) systems.


2. Structure
============

The corpus is split into several parts to enable users to selectively download
subsets of it, according to their needs. The subsets with "clean" in their name
are supposedly "cleaner" (at least on average) than the rest of the audio, and
are US English accented. That classification was obtained using very crude
automated means, and should not be considered completely reliable. The subsets
are disjoint, i.e. the audio of each speaker is assigned to exactly one subset.

The parts of the corpus are as follows:

* dev-clean, test-clean - development and test sets containing "clean" speech.
* train-clean-100 - training set of approximately 100 hours of "clean" speech
* train-clean-360 - training set of approximately 360 hours of "clean" speech
* dev-other, test-other - development and test sets, with speech which was
                          automatically selected to be more "challenging" to
                          recognize
* train-other-500 - training set of approximately 500 hours containing speech
                    that was not classified as "clean", for some (possibly wrong)
                    reason
* intro - subset containing only LibriVox's intro disclaimers for some of the
          readers.
* mp3 - the original MP3-encoded audio on which the corpus is based
* texts - the original Project Gutenberg texts on which the reference transcripts
          for the utterances in the corpus are based.
* raw_metadata - SQLite databases which record various pieces of information about
                 the source text/audio materials used, and the alignment process.
                 (mostly for completeness - probably not very interesting or useful)

2.1 Organization of the training and test subsets
-------------------------------------------------

When extracted, each of the {dev,test,train} sets re-creates LibriSpeech's root
directory, containing some metadata, and a dedicated subdirectory for the subset
itself. The audio for each individual speaker is stored under a dedicated
subdirectory in the subset's directory, and each audio chapter read by this
speaker is stored in a separate subsubdirectory. The following ASCII diagram
depicts the directory structure:


<corpus root>
    |
    .- README.TXT
    |
    .- READERS.TXT
    |
    .- CHAPTERS.TXT
    |
    .- BOOKS.TXT
    |
    .- train-clean-100/
                   |
                   .- 19/
                       |
                       .- 198/
                       |    |
                       |    .- 19-198.trans.txt
                       |    |
                       |    .- 19-198-0001.flac
                       |    |
                       |    .- 19-198-0002.flac
                       |    |
                       |    ...
                       |
                       .- 227/
                            | ...



Here 19 is the ID of the reader, and 198 and 227 are the IDs of the chapters
read by this speaker. The *.trans.txt files contain the transcripts for each
of the utterances derived from the respective chapter, and the FLAC files
contain the audio itself.

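The layout described above can be traversed with a short script. The following
is a minimal Python sketch, not part of the corpus tooling. The function names
are hypothetical, and it assumes that each line of a *.trans.txt file holds an
utterance ID, a space, and then the transcript text; the README does not spell
out that format, so verify it against a real file.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple


def parse_utterance_id(utt_id: str) -> Tuple[str, str, str]:
    """Split an ID such as '19-198-0001' into (reader, chapter, utterance)."""
    reader, chapter, utterance = utt_id.split("-")
    return reader, chapter, utterance


def read_transcripts(trans_file: Path) -> Dict[str, str]:
    """Map utterance ID -> transcript for one chapter.

    Assumption: one utterance per line, the ID first, then a space, then text.
    """
    transcripts = {}
    for line in trans_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        utt_id, _, text = line.partition(" ")
        transcripts[utt_id] = text
    return transcripts


def iter_utterances(subset_dir: Path) -> Iterator[Tuple[str, Path, str]]:
    """Yield (utterance ID, FLAC path, transcript) for one subset directory."""
    # Layout from the diagram: <subset>/<reader>/<chapter>/<reader>-<chapter>.trans.txt
    for trans_file in sorted(subset_dir.glob("*/*/*.trans.txt")):
        chapter_dir = trans_file.parent
        for utt_id, text in sorted(read_transcripts(trans_file).items()):
            flac_path = chapter_dir / f"{utt_id}.flac"
            if flac_path.exists():  # skip transcripts that have no audio file
                yield utt_id, flac_path, text
```

For example, iter_utterances(Path("LibriSpeech/train-clean-100")) would walk
every reader and chapter directory of that subset, assuming the archive was
extracted into the current directory.
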
The main metainfo about the speech is listed in READERS.TXT and CHAPTERS.TXT:

- READERS.TXT contains information about each speaker's gender and total amount
  of audio in the corpus.

- CHAPTERS.TXT has information about the per-chapter audio durations.

The file BOOKS.TXT contains the title of each book whose text is used in
the corpus, and its Project Gutenberg ID.

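These metadata files can be loaded with a few lines of Python. The sketch below
is illustrative only: it assumes the files are '|'-separated tables in which
lines starting with ';' are comments, and it assumes a READERS.TXT column order
of ID, sex, subset, minutes, name. None of this is stated in the README, so
check it against the actual files. The sample rows are made up.

```python
from typing import List


def parse_metadata(text: str, n_fields: int) -> List[List[str]]:
    """Parse a '|'-separated metadata table into rows of stripped fields.

    Assumption: lines starting with ';' are comments and are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        # Limit the number of splits so a '|' inside the last field survives.
        fields = [field.strip() for field in line.split("|", n_fields - 1)]
        rows.append(fields)
    return rows


# Hypothetical sample in the assumed READERS.TXT layout:
# ID | SEX | SUBSET | MINUTES | NAME
sample = """\
; illustrative rows only, not real corpus data
1  | F | train-clean-100 | 25.0 | Reader One
2  | M | dev-other       | 10.0 | Reader Two
"""

readers = parse_metadata(sample, n_fields=5)

# Total minutes of audio per subset, summed over readers.
minutes_by_subset = {}
for reader_id, sex, subset, minutes, name in readers:
    minutes_by_subset[subset] = minutes_by_subset.get(subset, 0.0) + float(minutes)
print(minutes_by_subset)
```
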
2.2 Organization of the "intro-disclaimers" subset
--------------------------------------------------

This part of the data contains just the LibriVox intro disclaimers that were
successfully extracted, using a slight modification of the alignment algorithms
used to derive the training and test sets. The standard LibriVox disclaimer is:

"This is a LibriVox recording. All LibriVox recordings are in the public domain.
For more information, or to volunteer, please visit: librivox DOT org"

As is the case for the training and test sets, there is one subdirectory for
each reader, and a subsubdirectory for each of the chapters read by this speaker
for which the announcement was successfully extracted.


2.3 Organization of the "original-mp3" subset
---------------------------------------------

This part contains the original MP3-compressed recordings as downloaded from the
Internet Archive. It is intended to serve as a secure reference "snapshot" for
the original audio chapters, but also to preserve (most of) the information both
about audio selected for the corpus and audio that was discarded. I decided to
try to make the corpus relatively balanced in terms of per-speaker durations, so
part of the audio available for some of the speakers was discarded. Also, for the
speakers in the test sets, only up to 10 minutes of audio is used, to
introduce more speaker diversity during evaluation time. There should be enough
information in the "mp3" subset to enable the re-cutting of an extended
"LibriSpeech+" corpus, containing around 150 extra hours of speech, if needed.

The directory hierarchy follows the already familiar pattern. In each
speaker directory there is a file named "utterance_map" which lists, for each
of the utterances in the corpus, the original "raw" aligned utterance.
In the "header" of that file there are also 2 lines that show whether the
sentence-aware segmentation was used in the LibriSpeech corpus (i.e. whether the
reader is assigned to a test set) and the maximum allowed duration for
the set to which this speaker was assigned.

Then in the chapter directory, besides the original audio chapter .mp3 file,
there are two sets of ".seg.txt" and ".trans.txt" files. The former contain
the time range (in seconds) for each of the original (what I called "raw" above)
utterances. The latter contain the respective transcriptions. There are two
sets for the two possible segmentations of each chapter. The ".sents"
segmentation is "sentence-aware", that is, we only split on silence intervals
coinciding with (automatically obtained) sentence boundaries in the text.
The other segmentation was derived by allowing splitting on every silence
interval longer than 300ms, which leads to better utilization of the aligned
audio.

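The difference between the two segmentations can be illustrated with a small
Python sketch. This is not the algorithm used to build the corpus. It assumes
the silence intervals are already known, cuts in the middle of each chosen
silence (an arbitrary choice made here), and represents sentence boundaries as a
set of silence indices; the function name and all parameters are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Interval = Tuple[float, float]


def split_on_silence(
    duration: float,
    silences: List[Interval],
    min_silence: float = 0.3,
    sentence_boundaries: Optional[Set[int]] = None,
) -> List[Interval]:
    """Cut [0, duration] into segments at the selected silence intervals.

    silences: (start, end) pairs in seconds, sorted by start time.
    sentence_boundaries: if given, only silences whose index is in this set
    are used as cut points (the "sentence-aware" variant). Otherwise every
    silence longer than min_silence seconds is a cut point.
    """
    segments = []
    seg_start = 0.0
    for index, (start, end) in enumerate(silences):
        if sentence_boundaries is not None:
            if index not in sentence_boundaries:
                continue
        elif end - start <= min_silence:
            continue
        cut = (start + end) / 2.0  # assumption: cut in the middle of the silence
        segments.append((seg_start, cut))
        seg_start = cut
    segments.append((seg_start, duration))
    return segments


silences = [(1.0, 1.2), (3.0, 3.5), (6.0, 7.0)]
# Every silence longer than 300 ms is a cut point: three segments.
print(split_on_silence(10.0, silences))
# Only the third silence coincides with a sentence boundary: two segments.
print(split_on_silence(10.0, silences, sentence_boundaries={2}))
```

The sentence-aware call produces fewer, longer segments from the same audio,
which is consistent with the README's remark that splitting on every long
silence makes better use of the aligned audio.
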
2.4 Organization of the "text" subset
-------------------------------------

This part just contains one subdirectory for each book, with name equal to the
ID of the text in Project Gutenberg's database. The books are also
separated in directories by their encoding, which can be either ASCII or UTF-8.
The sole purpose of this subset is to be a permanent snapshot of the original
text used for LibriSpeech's construction.


2.5 Organization of the "raw-metadata" part
-------------------------------------------

Contains just a few SQLite databases. Some of the more important bits of
information from these tables are described in the README file within
the "raw_data" subdirectory.


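Since the README does not list the tables, one way to explore such a database is
to ask SQLite for its schema. The following Python sketch uses only the standard
sqlite3 module. The helper name is hypothetical, and the demo runs on an
in-memory database with a made-up table rather than on a real corpus file.

```python
import sqlite3
from typing import Dict, List


def describe_tables(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Return {table name: [column names]} for every table in the database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


# Demo on an in-memory database with a made-up table. For the corpus, pass
# the path of one of the SQLite files to sqlite3.connect() instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, title TEXT)")
print(describe_tables(conn))
conn.close()
```
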
Acknowledgments
===============

First and foremost, I would like to thank the thousands of Project Gutenberg
and LibriVox volunteers, without whose contributions the LibriSpeech corpus
would not have existed.
The successful completion of this project would have been much more difficult,
and the quality of the finished corpus much worse, if it weren't for the
generous support and the many pieces of helpful advice provided by Daniel Povey - thanks, Dan!
I would also like to express my gratitude to Tony Robinson, for the very
interesting and useful discussions on the long audio alignment problem that
we had some time ago.
Thanks also to Guoguo Chen and Sanjeev Khudanpur, with whom we are collaborating
on a (yet-to-be-published) paper on the corpus, and who helped to improve
LibriSpeech's example scripts in Kaldi.

---
Vassil Panayotov,
Oct. 2, 2014
