LHCP dataset v1.0
www.mllp.upv.es/lhcp-asr
An English speech corpus of high-energy particle physics talks for narrow-domain ASR benchmarking.
Keywords: automatic speech recognition; speech corpus; domain adaptation; manual transcription; pseudo-labelling; particle physics
The LHCP dataset includes:
- 30 hours of 2020 and 2022 LHCP plenary conference talks, with timed, manual (human) verbatim transcriptions, split into two sub-tasks, LHCP-2020 and LHCP-2022, with their respective development and test partitions.
- 205 hours of LHCP conference talks, with timed, automatic verbatim transcriptions generated by a very competitive in-domain ASR system, for training or adaptation purposes.
- 1.5G tokens of in-domain English text data extracted from scientific papers and reports, PhD theses and news, to build in-domain language models.
Download the full LHCP-ASR speech and text corpus from:
https://www.mllp.upv.es/lhcp-asr/lhcp_v1.0.tar.gz
Size: 79 GiB (compressed file)
SHA-256 checksum: 09f423b7bfa042efbd0614fb23b7ade4b9b8da0e60aa6355564a631a5be98cd9
Download the revision guidelines from:
https://www.mllp.upv.es/lhcp-asr/lhcp_guidelines_v1.0.pdf
SHA-256 checksum: f0c27177f6baa731b2d84fd98b4b7775111bb4d406862a5932b3a48c6a3a2738
Total size: 84 GiB (uncompressed)
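Since the archive is large, it may be worth checking the published SHA-256 checksum before unpacking. The following is a minimal sketch using only the Python standard library; the function names and the chunk size are illustrative choices, not part of the dataset tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file's digest matches the published checksum."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling verify_download("lhcp_v1.0.tar.gz", "09f423b7bfa042efbd0614fb23b7ade4b9b8da0e60aa6355564a631a5be98cd9") should return True for an intact download, assuming the archive was saved under that local name.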
The data is organised into two main directories: "speech" and "text". The speech directory contains three subdirectories: LHCP-2020 and LHCP-2022, one for each evaluation task (each with its own "dev" and "test" subdirectories), and LHCP-train, with the samples for training. For each talk, we provide a video file containing the speech of the talk; the timed verbatim transcription in SRT (SubRip) format; and the slides file used by the main speaker, in PDF or PPTX format. For the development and test sets, we also provide the normalised references (lowercased, no punctuation) used to compute the WER%. For the "train" directory and for each "dev" and "test" directory, we provide a list with the IDs of all their talks, in case they are needed.
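As an illustration of how a WER% figure can be obtained against the normalised references, here is a minimal sketch. It assumes that normalisation amounts to lowercasing and stripping ASCII punctuation, which may differ in detail from the rules actually applied to the .ref files, and it computes the standard word-level edit distance; the function names are hypothetical.

```python
import string


def normalise(text):
    """Lowercase and strip ASCII punctuation (an assumed approximation
    of the normalisation applied to the reference files)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def word_error_rate(reference, hypothesis):
    """WER% = 100 * word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)
```

A hypothesis produced by an ASR system would first be passed through normalise() so that it is compared with the reference on equal terms. Established scoring tools should be preferred for reported results; this sketch only shows the computation.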
Inside the "text" top-level directory we can find the in-domain text data gathered (up to February 2025) from two sources: the CERN Document Server (CDS) repository and the CERN News portal. This data is organised into three subsets: CERN-news, CDS-abstracts and CDS-documents. For each subset, we provide two compressed text files: the raw text, without any processing; and a cleaned and normalised version (lowercased, no punctuation).
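The compressed text files can be streamed without unpacking them to disk. The sketch below assumes the files are UTF-8 encoded with one passage per line, and that whitespace splitting is an acceptable token definition; neither assumption is stated by the corpus description, and the function names are illustrative.

```python
import gzip


def iter_lines(path):
    """Yield the lines of a gzip-compressed text file, one at a time."""
    # UTF-8 is an assumption about the corpus encoding.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_tokens(path):
    """Count whitespace-separated tokens without loading the file in memory."""
    return sum(len(line.split()) for line in iter_lines(path))
```

For instance, count_tokens("text/prepro/cern-news.txt.prep.gz") would give a rough token count for the preprocessed CERN News subset, with the path taken relative to the corpus root.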
The complete corpus structure, with additional subdirectories, is as follows:
LHCP-ASR/
├── speech
│   ├── LHCP-2020
│   │   ├── dev
│   │   │   ├── <sample_id>
│   │   │   │   ├── <sample_id>.mp4
│   │   │   │   ├── <sample_id>.pdf
│   │   │   │   ├── <sample_id>.ref
│   │   │   │   └── <sample_id>.srt
│   │   ├── dev_LHCP-2020.lst
│   │   ├── test /* Same as dev */
│   │   └── test_LHCP-2020.lst
│   ├── LHCP-2022 /* Same as LHCP-2020 */
│   └── LHCP-train
│       ├── 2020
│       ├── 2021
│       ├── 2022
│       └── train_samples.lst
└── text
    ├── prepro
    │   ├── cds.abstracts.txt.prep.gz
    │   ├── cds.pdfs.txt.prep.gz
    │   └── cern-news.txt.prep.gz
    └── raw
        ├── cds.abstracts.txt.gz
        ├── cds.pdfs.txt.gz
        └── cern-news.txt.gz
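The layout above can be navigated programmatically. The following sketch assumes that each .lst file holds one talk ID per line and that the timed transcriptions follow the usual SubRip layout (index line, timing line, text lines, blank line); both are assumptions, and all function names are hypothetical. Note that slides may be PPTX rather than PDF for some talks, so the "pdf" path may not exist.

```python
import re
from pathlib import Path


def read_ids(list_path):
    """Read talk IDs, assuming one ID per non-empty line."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def sample_files(partition_dir, sample_id):
    """Map each per-talk extension shown in the tree to its expected path."""
    base = Path(partition_dir) / sample_id
    return {ext: base / f"{sample_id}.{ext}" for ext in ("mp4", "pdf", "ref", "srt")}


# Matches an SRT timing line such as "00:01:02,500 --> 00:01:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, text) tuples, one per cue."""
    cues = []
    # Cues are separated by blank lines; Windows line endings are normalised first.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                cues.append((start, end, " ".join(lines[index + 1:]).strip()))
                break
    return cues
```

A typical loop would call read_ids("speech/LHCP-2020/dev_LHCP-2020.lst"), then sample_files("speech/LHCP-2020/dev", sample_id) for each ID, and finally parse_srt() on the contents of the "srt" entry.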
The authors would like to thank the European Organisation for Nuclear Research (CERN) for their support under the PO OV9177345.
The research leading to these results has received funding from EU4Health Programme 2021-2027 as part of Europe's Beating Cancer Plan under Grant Agreements nos. 101056995 and 101129375; and from the Government of Spain's grant PID2021-122443OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by "ERDF/EU", and grant PDC2022-133049-I00 funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.
The authors gratefully acknowledge the financial support of Generalitat Valenciana under project IDIFEDER/2021/059.
Speech and text data were provided by the European Organisation for Nuclear Research (CERN) under the PO OV9177345. The following disclaimers are those available in the CERN Document Server (CDS) repository on May 30th, 2025:
Use of the CERN Document Server service (hereafter "CDS") denotes agreement with the following terms of use:
- CDS is provided free of charge. It serves as a comprehensive institutional repository and dissemination platform for the research and historical output produced by CERN, the European Organization for Nuclear Research, and its members of personnel. See Content Policy [1] for more details.
- By uploading content to CDS, the content provider affirms that such content complies with all applicable laws, licence conditions and third party rights, and shall hold CERN free and harmless from any related liability.
- All content is provided "as is" and without warranty of any kind. The user shall hold CERN and individual content providers free and harmless from any related liability in connection with its use of such content.
- Users shall respect copyright and all applicable licence conditions. The download and use of content from CDS does not amount to a transfer of intellectual property.
- CERN reserves the right, without notice or liability, and at its sole discretion, to restrict or remove a user's access or remove any uploaded content, where it considers that use of CDS interferes with its operations or violates these Terms and Conditions, and/or applicable laws.
- CERN bases CDS on leading technologies and architectures, operated within the limits of its financial and human resources and made available by CERN on an "as is" and "best efforts" basis. Access to, availability and use of CDS is not guaranteed nor can be expected.
- CERN excludes and disclaims all liability for damage resulting from users' access, or inability to access, or use of CDS.
- These terms and conditions of use are subject to change by CERN at any time and without notice, other than through posting the updated terms on the CDS website. Any revised terms and conditions of use shall become effective immediately upon posting.
If you have any questions or comments with respect to CDS, or if you are unsure whether your intended use is in line with these Terms and Conditions, or if you seek permission for a use that does not fall within these Terms and Conditions, please contact support.