Update README.TXT

ChanchalKumarMaji · web-flow · commit 40be1e67a653 · 2019-03-12T18:31:40.000+05:30
diff --git a/tensorflow_datasets/testing/test_data/fake_examples/librispeech/train-other-500/LibriSpeech/README.TXT b/tensorflow_datasets/testing/test_data/fake_examples/librispeech/train-other-500/LibriSpeech/README.TXT
@@ -1,178 +1 @@
-1. General information
-======================
 
-LibriSpeech is a corpus of read speech, based on LibriVox's public domain
-audio books. Its purpose is to enable the training and testing of automatic
-speech recognition (ASR) systems.
-
-
-2. Structure
-============
-
-The corpus is split into several parts to enable users to selectively download
-subsets of it, according to their needs. The subsets with "clean" in their name
-are supposedly "cleaner" (at least on average) than the rest of the audio, and
-are US English accented. That classification was obtained using very crude
-automated means, and should not be considered completely reliable. The subsets
-are disjoint, i.e. the audio of each speaker is assigned to exactly one subset.
-
-The parts of the corpus are as follows:
-
-* dev-clean, test-clean - development and test set containing "clean" speech.
-* train-clean-100 - training set, of approximately 100 hours of "clean" speech
-* train-clean-360 - training set, of approximately 360 hours of "clean" speech
-* dev-other, test-other - development and test set, with speech which was 
-                          automatically selected to be more "challenging" to
-                          recognize
-* train-other-500 - training set of approximately 500 hours containing speech
-                    that was not classified as "clean", for some (possibly wrong)
-                    reason
-* intro - subset containing only the LibriVox intro disclaimers for some of the
-          readers.
-* mp3 - the original MP3-encoded audio on which the corpus is based
-* texts - the original Project Gutenberg texts on which the reference transcripts
-          for the utterances in the corpus are based.
-* raw_metadata - SQLite databases which record various pieces of information about 
-                 the source text/audio materials used, and the alignment process.
-                 (mostly for completeness - probably not very interesting or useful)
-
-2.1 Organization of the training and test subsets
--------------------------------------------------
-
-When extracted, each of the {dev,test,train} sets re-creates LibriSpeech's root
-directory, containing some metadata, and a dedicated subdirectory for the subset
-itself. The audio for each individual speaker is stored under a dedicated
-subdirectory in the subset's directory, and each audio chapter read by this
-speaker is stored in a separate subsubdirectory. The following ASCII diagram
-depicts the directory structure:
-
-
-<corpus root>
-    |
-    .- README.TXT
-    |
-    .- READERS.TXT
-    |
-    .- CHAPTERS.TXT
-    |
-    .- BOOKS.TXT
-    |
-    .- train-clean-100/
-                   |
-                   .- 19/
-                       |
-                       .- 198/
-                       |    |
-                       |    .- 19-198.trans.txt
-                       |    |    
-                       |    .- 19-198-0001.flac
-                       |    |
-                       |    .- 19-198-0002.flac
-                       |    |
-                       |    ...
-                       |
-                       .- 227/
-                            | ...
-
-
-
-where 19 is the ID of the reader, and 198 and 227 are the IDs of the chapters
-read by this speaker. The *.trans.txt files contain the transcripts for each
-of the utterances, derived from the respective chapter, and the FLAC files
-contain the audio itself.
-
-The main metainfo about the speech is listed in READERS.TXT and CHAPTERS.TXT:
-
-- READERS.TXT contains information about each speaker's gender and total amount
-  of audio in the corpus.
-
-- CHAPTERS.TXT has information about the per-chapter audio durations.
-
-The file BOOKS.TXT contains the title of each book whose text is used in
-the corpus, and its Project Gutenberg ID.
-
-2.2 Organization of the "intro-disclaimers" subset
---------------------------------------------------
-
-This part of the data contains simply the LibriVox intro disclaimers that were
-successfully extracted, using a slight modification of the alignment algorithms
-used to derive the test and training sets. The standard LibriVox disclaimer is:
-
-"This is a LibriVox recording. All LibriVox recordings are in the public domain. 
- For more information, or to volunteer, please visit: librivox DOT org"
-
-As is the case for the training and test sets, there is one subdirectory for
-each reader, and a subsubdirectory for each of the chapters read by this speaker
-for which the announcement was successfully extracted.
-
-
-2.3 Organization of the "original-mp3" subset
----------------------------------------------
-
-This part contains the original MP3-compressed recordings as downloaded from the
-Internet Archive. It is intended to serve as a secure reference "snapshot" for
-the original audio chapters, but also to preserve (most of) the information both
-about audio selected for the corpus and audio that was discarded. I decided to
-try to make the corpus relatively balanced in terms of per-speaker durations, so
-part of the audio available for some of the speakers was discarded. Also, for the
-speakers in the training sets, only up to 10 minutes of audio is used, to
-introduce more speaker diversity during evaluation time. There should be enough
-information in the "mp3" subset to enable the re-cutting of an extended
-"LibriSpeech+" corpus, containing around 150 extra hours of speech, if needed.
-
-The directory hierarchy follows the already familiar pattern. In each
-speaker directory there is a file named "utterance_map" which lists, for each
-of the utterances in the corpus, the original "raw" aligned utterance.
-In the "header" of that file there are also 2 lines that show whether the
-sentence-aware segmentation was used in the LibriSpeech corpus (i.e. if the
-reader is assigned to a test set) and the maximum allowed duration for
-the set to which this speaker was assigned.
-
-Then in the chapter directory, besides the original audio chapter .mp3 file,
-there are two sets of ".seg.txt" and ".trans.txt" files. The former contain
-the time range (in seconds) for each of the original (what I called "raw" above)
-utterances. The latter contain the respective transcriptions. There are two
-sets for the two possible segmentations of each chapter. The ".sents"
-segmentation is "sentence-aware", that is, we only split on silence intervals
-coinciding with (automatically obtained) sentence boundaries in the text.
-The other segmentation was derived by allowing splitting on every silence
-interval longer than 300ms, which leads to better utilization of the aligned
-audio.
-
-2.4 Organization of the "text" subset
--------------------------------------
-
-This part just contains one subdirectory, with name equal to the ID of the
-text in Project Gutenberg's database, for each book. The books are also
-separated in directories by their encoding, which is either ASCII or UTF-8.
-The sole purpose of this subset is to be a permanent snapshot of the original
-text used for LibriSpeech's construction.
-
-
-2.5 Organization of the "raw-metadata" part
--------------------------------------------
-
-Contains just a few SQLite databases. Some of the more important bits of
-information from these tables are described in the README file within
-the "raw_data" subdirectory.
-
-
-Acknowledgments
-===============
-
-First and foremost, I would like to thank the thousands of Project Gutenberg
-and LibriVox volunteers, without whose contributions the LibriSpeech corpus
-would not have existed.
-The successful completion of this project would have been much more difficult,
-and the quality of the finished corpus much worse, had it not been for the
-generous support and the helpful advice provided by Daniel Povey - thanks, Dan!
-I would also like to express my gratitude to Tony Robinson for the very
-interesting and useful discussions on the long audio alignment problem that
-we had some time ago.
-Thanks also to Guoguo Chen and Sanjeev Khudanpur, with whom we are collaborating
-on a (yet-to-be-published) paper on the corpus, and who helped to improve
-LibriSpeech's example scripts in Kaldi.
-
----
-Vassil Panayotov,
-Oct. 2, 2014
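
Note on the layout described in section 2.1 of the removed README: the directory diagram implies that each FLAC file name encodes the reader ID, the chapter ID and an utterance index, and that the transcript file sits beside the audio. The sketch below is one possible way to move between file names and relative paths. The helper names (`UtteranceId`, `parse_utterance_id`, `utterance_path`, `transcript_path`) are hypothetical, and the four-digit zero padding of the index is an assumption taken from the single example `19-198-0001.flac` in the diagram, not something the README states.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class UtteranceId(NamedTuple):
    """Hypothetical container for the three parts of an utterance file name."""
    reader: int
    chapter: int
    index: int


def parse_utterance_id(name: str) -> UtteranceId:
    """Split a name such as '19-198-0001.flac' into reader, chapter and index."""
    stem = PurePosixPath(name).stem  # drops the '.flac' suffix
    reader, chapter, index = stem.split("-")
    return UtteranceId(int(reader), int(chapter), int(index))


def utterance_path(subset: str, utt: UtteranceId) -> PurePosixPath:
    """Build <subset>/<reader>/<chapter>/<reader>-<chapter>-<index>.flac.

    Assumption: the index is zero-padded to four digits, as in the diagram.
    """
    filename = f"{utt.reader}-{utt.chapter}-{utt.index:04d}.flac"
    return PurePosixPath(subset) / str(utt.reader) / str(utt.chapter) / filename


def transcript_path(subset: str, utt: UtteranceId) -> PurePosixPath:
    """Build the path of the per-chapter '<reader>-<chapter>.trans.txt' file."""
    filename = f"{utt.reader}-{utt.chapter}.trans.txt"
    return PurePosixPath(subset) / str(utt.reader) / str(utt.chapter) / filename
```

For example, `utterance_path("train-clean-100", parse_utterance_id("19-198-0001.flac"))` gives a path relative to the corpus root, so it can be joined onto wherever the archive was extracted.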