Commit 959063e

Merge pull request #5 from Thoroldvix/feat/playlist-transcripts
Add bulk retrieval of transcripts for playlists and channels
2 parents ebd7b4b + 6eb2f66 commit 959063e

23 files changed, +1322 -116 lines changed

README.md

Lines changed: 54 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -12,7 +12,8 @@
1212
## 📖 Introduction
1313

1414
Java library which allows you to retrieve subtitles/transcripts for a YouTube video.
15-
It supports manual and automatically generated subtitles and does not use headless browser for scraping.
15+
It supports manual and automatically generated subtitles, bulk transcript retrieval for all videos in a playlist or
16+
on a channel, and does not use a headless browser for scraping.
1617
Inspired by [Python library](https://github.com/jdepoix/youtube-transcript-api).
1718

1819
## 🤖 Features
@@ -21,6 +22,8 @@ Inspired by [Python library](https://github.com/jdepoix/youtube-transcript-api).
2122

2223
✅ Automatically generated transcripts retrieval
2324

25+
✅ Bulk transcript retrieval for all videos in the playlist or channel
26+
2427
✅ Transcript translation
2528

2629
✅ Transcript formatting
@@ -79,7 +82,7 @@ TranscriptList transcriptList = youtubeTranscriptApi.listTranscripts("videoId");
7982

8083
// Iterate over transcript list
8184
for(Transcript transcript : transcriptList) {
82-
System.out.println(transcript);
85+
System.out.println(transcript);
8386
}
8487

8588
// Find transcript in specific language
@@ -143,6 +146,8 @@ TranscriptContent transcriptContent = youtubeTranscriptApi.listTranscripts("vide
143146
TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscript("videoId");
144147
```
145148

149+
For bulk transcript retrieval see [Bulk Transcript Retrieval](#bulk-transcript-retrieval).
150+
146151
## 🔧 Detailed Usage
147152

148153
### Use fallback language
@@ -241,7 +246,7 @@ TranscriptFormatter jsonFormatter = TranscriptFormatters.jsonFormatter();
241246
String formattedContent = jsonFormatter.format(transcriptContent);
242247
````
243248

244-
### YoutubeClient customization
249+
### YoutubeClient Customization
245250

246251
By default, `YoutubeTranscriptApi` uses Java 11 HttpClient for making requests to YouTube, if you want to use a
247252
different client,
@@ -275,6 +280,52 @@ TranscriptList transcriptList = youtubeTranscriptApi.listTranscriptsWithCookies(
275280
TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscriptWithCookies("videoId", "path/to/cookies.txt", "en");
276281
```
277282

283+
### Bulk Transcript Retrieval
284+
285+
All bulk transcript retrieval operations are done via the `PlaylistsTranscriptApi` interface. As with
286+
`YoutubeTranscriptApi`,
287+
you can create a new instance of `PlaylistsTranscriptApi` by calling the `createDefaultPlaylistsApi` method of
288+
`TranscriptApiFactory`.
289+
Playlist and channel information is retrieved from
290+
the [YouTube V3 API](https://developers.google.com/youtube/v3/docs/),
291+
so you will need to provide an API key for all methods.
292+
293+
```java
294+
// Create a new default PlaylistsTranscriptApi instance
295+
PlaylistsTranscriptApi playlistsTranscriptApi = TranscriptApiFactory.createDefaultPlaylistsApi();
296+
297+
// Retrieve all available transcripts for a given playlist
298+
Map<String, TranscriptList> transcriptLists = playlistsTranscriptApi.listTranscriptsForPlaylist("playlistId", "apiKey", true);
299+
300+
// Retrieve all available transcripts for a given channel
301+
Map<String, TranscriptList> transcriptLists = playlistsTranscriptApi.listTranscriptsForChannel("channelName", "apiKey", true);
302+
```
303+
304+
Each method also takes a boolean flag, `continueOnError`, which controls whether the operation continues when transcript retrieval
305+
fails for a video. If it's set to `true`, all transcripts that could not be retrieved are
306+
skipped; if
307+
it's set to `false`, the operation fails fast on the first error.
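
The difference between the two `continueOnError` modes can be pictured with a small, library-independent sketch. Everything in it is an assumption made for illustration: `BulkRetrievalSketch`, `fetchAll`, and the `Function`-based fetcher are hypothetical stand-ins rather than part of this library, and unchecked exceptions replace `TranscriptRetrievalException` so the example compiles on its own.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not library code: shows skip-on-error vs. fail-fast.
final class BulkRetrievalSketch {

    // Calls the fetcher for every video ID and collects results by ID.
    static Map<String, String> fetchAll(List<String> videoIds,
                                        Function<String, String> fetcher,
                                        boolean continueOnError) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String videoId : videoIds) {
            try {
                results.put(videoId, fetcher.apply(videoId));
            } catch (RuntimeException e) {
                if (!continueOnError) {
                    throw e; // fail fast on the first error
                }
                // otherwise skip this video and keep going
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in fetcher that fails for one particular video ID.
        Function<String, String> fetcher = id -> {
            if (id.equals("broken")) {
                throw new IllegalStateException("no transcript for " + id);
            }
            return "transcript of " + id;
        };
        List<String> ids = List.of("a", "broken", "b");

        System.out.println(fetchAll(ids, fetcher, true).keySet()); // prints [a, b]

        try {
            fetchAll(ids, fetcher, false);
        } catch (IllegalStateException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

With `continueOnError` set to `true`, the returned map may contain fewer entries than the playlist has videos, so callers that need every video should compare the key set against the IDs they expected.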
308+
309+
All methods also have overloaded versions which accept a path to a [cookies.txt](#cookies) file.
310+
311+
```java
312+
// Retrieve all available transcripts for a given playlist
313+
Map<String, TranscriptList> transcriptLists = playlistsTranscriptApi.listTranscriptsForPlaylist(
314+
"playlistId",
315+
"apiKey",
316+
"path/to/cookies.txt",
317+
true
318+
);
319+
320+
// Retrieve all available transcripts for a given channel
321+
Map<String, TranscriptList> transcriptLists = playlistsTranscriptApi.listTranscriptsForChannel(
322+
"channelName",
323+
"apiKey",
324+
"path/to/cookies.txt",
325+
true
326+
);
327+
```
328+
278329
## 🤓 How it works
279330

280331
Within each YouTube video page, there exists JSON data containing all the transcript information, including an
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
1+
package io.github.thoroldvix.api;
2+
3+
import io.github.thoroldvix.internal.TranscriptApiFactory;
4+
5+
import java.util.Map;
6+
7+
/**
8+
* Retrieves transcripts for all videos in a playlist, or all videos for a specific channel.
9+
* <p>
10+
* Playlists and channel videos are retrieved from the YouTube API, so you will need to have a valid api key to use this.
11+
* </p>
12+
* <p>
13+
* To get implementation for this interface see {@link TranscriptApiFactory}
14+
* </p>
15+
*/
16+
public interface PlaylistsTranscriptApi {
17+
18+
/**
19+
* Retrieves transcript lists for all videos in the specified playlist using provided API key and cookies file from a specified path.
20+
*
21+
* @param playlistId The ID of the playlist
22+
* @param apiKey API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
23+
* @param cookiesPath The file path to the text file containing the authentication cookies. Used when some videos are age-restricted; see <a href="https://github.com/Thoroldvix/youtube-transcript-api#cookies">Cookies</a>
24+
* @param continueOnError Whether to continue if transcript retrieval fails for a video. If true, all transcripts that could not be retrieved will be skipped,
25+
* otherwise an exception will be thrown.
26+
* @return A map of video IDs to {@link TranscriptList} objects
27+
* @throws TranscriptRetrievalException If the retrieval of the transcript lists fails
28+
*/
29+
Map<String, TranscriptList> listTranscriptsForPlaylist(String playlistId, String apiKey, String cookiesPath, boolean continueOnError) throws TranscriptRetrievalException;
30+
31+
32+
/**
33+
* Retrieves transcript lists for all videos in the specified playlist using provided API key.
34+
*
35+
* @param playlistId The ID of the playlist
36+
* @param apiKey API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
37+
* @param continueOnError Whether to continue if transcript retrieval fails for a video. If true, all transcripts that could not be retrieved will be skipped,
38+
* otherwise an exception will be thrown.
39+
* @return A map of video IDs to {@link TranscriptList} objects
40+
* @throws TranscriptRetrievalException If the retrieval of the transcript lists fails
41+
*/
42+
Map<String, TranscriptList> listTranscriptsForPlaylist(String playlistId, String apiKey, boolean continueOnError) throws TranscriptRetrievalException;
43+
44+
45+
/**
46+
* Retrieves transcript lists for all videos for the specified channel using provided API key and cookies file from a specified path.
47+
*
48+
* @param channelName The name of the channel
49+
* @param apiKey API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
50+
* @param cookiesPath The file path to the text file containing the authentication cookies. Used when some videos are age-restricted; see <a href="https://github.com/Thoroldvix/youtube-transcript-api#cookies">Cookies</a>
51+
* @param continueOnError Whether to continue if transcript retrieval fails for a video. If true, all transcripts that could not be retrieved will be skipped,
52+
* otherwise an exception will be thrown.
53+
* @return A map of video IDs to {@link TranscriptList} objects
54+
* @throws TranscriptRetrievalException If the retrieval of the transcript lists fails
56+
*/
57+
Map<String, TranscriptList> listTranscriptsForChannel(String channelName, String apiKey, String cookiesPath, boolean continueOnError) throws TranscriptRetrievalException;
58+
59+
60+
/**
61+
* Retrieves transcript lists for all videos for the specified channel using provided API key.
62+
*
63+
* @param channelName The name of the channel
64+
* @param apiKey API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
65+
* @param continueOnError Whether to continue if transcript retrieval fails for a video. If true, all transcripts that could not be retrieved will be skipped,
66+
* otherwise an exception will be thrown.
67+
* @return A map of video IDs to {@link TranscriptList} objects
68+
* @throws TranscriptRetrievalException If the retrieval of the transcript lists fails
69+
*/
70+
Map<String, TranscriptList> listTranscriptsForChannel(String channelName, String apiKey, boolean continueOnError) throws TranscriptRetrievalException;
71+
}

lib/src/main/java/io/github/thoroldvix/api/TranscriptList.java

Lines changed: 7 additions & 0 deletions
@@ -56,6 +56,13 @@ public interface TranscriptList extends Iterable<Transcript> {
5656
*/
5757
Transcript findManualTranscript(String... languageCodes) throws TranscriptRetrievalException;
5858

59+
/**
60+
* Returns the ID of the video for which the transcripts were retrieved.
61+
*
62+
* @return The video ID.
63+
*/
64+
String getVideoId();
65+
5966
@Override
6067
default void forEach(Consumer<? super Transcript> action) {
6168
Iterable.super.forEach(action);

lib/src/main/java/io/github/thoroldvix/api/TranscriptRetrievalException.java

Lines changed: 23 additions & 4 deletions
@@ -10,7 +10,7 @@ public class TranscriptRetrievalException extends Exception {
1010

1111
private static final String ERROR_MESSAGE = "Could not retrieve transcript for the video: %s.\nReason: %s";
1212
private static final String YOUTUBE_WATCH_URL = "https://www.youtube.com/watch?v=";
13-
private final String videoId;
13+
private String videoId;
1414

1515
/**
1616
* Constructs a new exception with the specified detail message and cause.
@@ -36,10 +36,22 @@ public TranscriptRetrievalException(String videoId, String message) {
3636
}
3737

3838
/**
39-
* @return The ID of the video for which the transcript retrieval failed.
39+
* Constructs a new exception with the specified detail message and cause.
40+
*
41+
* @param message The detail message explaining the reason for the failure.
42+
* @param cause The cause of the failure (which is saved for later retrieval by the {@link Throwable#getCause()} method).
4043
*/
41-
public String getVideoId() {
42-
return videoId;
44+
public TranscriptRetrievalException(String message, Throwable cause) {
45+
super(message, cause);
46+
}
47+
48+
/**
49+
* Constructs a new exception with the specified detail message.
50+
*
51+
* @param message The detail message explaining the reason for the failure.
52+
*/
53+
public TranscriptRetrievalException(String message) {
54+
super(message);
4355
}
4456

4557
/**
@@ -53,5 +65,12 @@ private static String buildErrorMessage(String videoId, String message) {
5365
String videoUrl = YOUTUBE_WATCH_URL + videoId;
5466
return String.format(ERROR_MESSAGE, videoUrl, message);
5567
}
68+
69+
/**
70+
* @return The ID of the video for which the transcript retrieval failed.
71+
*/
72+
public String getVideoId() {
73+
return videoId;
74+
}
5675
}
5776
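
Judging only from the hunks shown, `videoId` is no longer `final` and the two new constructors take no video ID, so `getVideoId()` can presumably return `null` for failures that are not tied to a single video (for example a playlist or channel lookup). The sketch below shows one way a caller might guard against that. `VideoIdSketch`, `RetrievalFailure`, and `describe` are hypothetical names, and the stand-in exception simplifies message building compared with the real class.

```java
import java.util.Optional;

// Hypothetical sketch, not library code: null-safe handling of an optional video ID.
final class VideoIdSketch {

    // Simplified stand-in for an exception whose video ID may be absent.
    static class RetrievalFailure extends Exception {
        private String videoId; // stays null when no single video is involved

        RetrievalFailure(String videoId, String message) {
            super(message);
            this.videoId = videoId;
        }

        RetrievalFailure(String message) {
            super(message);
        }

        String getVideoId() {
            return videoId;
        }
    }

    // Builds a log line without assuming that a video ID is present.
    static String describe(RetrievalFailure e) {
        return Optional.ofNullable(e.getVideoId())
                .map(id -> "video " + id + ": " + e.getMessage())
                .orElse("no single video: " + e.getMessage());
    }

    public static void main(String[] args) {
        System.out.println(describe(new RetrievalFailure("abc", "blocked")));  // prints video abc: blocked
        System.out.println(describe(new RetrievalFailure("quota exceeded"))); // prints no single video: quota exceeded
    }
}
```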

lib/src/main/java/io/github/thoroldvix/api/YoutubeClient.java

Lines changed: 11 additions & 1 deletion
@@ -6,7 +6,6 @@
66
/**
77
* Responsible for sending GET requests to YouTube.
88
*/
9-
@FunctionalInterface
109
public interface YoutubeClient {
1110

1211
/**
@@ -18,5 +17,16 @@ public interface YoutubeClient {
1817
* @throws TranscriptRetrievalException If the request to YouTube fails.
1918
*/
2019
String get(String url, Map<String, String> headers) throws TranscriptRetrievalException;
20+
21+
22+
/**
23+
* Sends a GET request to the specified endpoint and returns the response body.
24+
*
25+
* @param endpoint The endpoint to which the GET request is made.
26+
* @param params A map of parameters to include in the request.
27+
* @return The body of the response as a {@link String}.
28+
* @throws TranscriptRetrievalException If the request to YouTube fails.
29+
*/
30+
String get(YtApiV3Endpoint endpoint, Map<String, String> params) throws TranscriptRetrievalException;
2131
}
2232

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
package io.github.thoroldvix.api;
2+
3+
/**
4+
* The YouTube API V3 endpoints. Used by the {@link YoutubeClient}.
5+
*/
6+
public enum YtApiV3Endpoint {
7+
PLAYLIST_ITEMS("playlistItems"),
8+
SEARCH("search"),
9+
CHANNELS("channels");
10+
private static final String YOUTUBE_API_V3_BASE_URL = "https://www.googleapis.com/youtube/v3/";
11+
12+
private final String resource;
13+
private final String url;
14+
15+
YtApiV3Endpoint(String resource) {
16+
this.url = YOUTUBE_API_V3_BASE_URL + resource;
17+
this.resource = resource;
18+
}
19+
20+
public String url() {
21+
return url;
22+
}
23+
24+
@Override
25+
public String toString() {
26+
return resource;
27+
}
28+
}
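
The commit does not show how a `YoutubeClient` implementation turns an endpoint and a parameter map into a request, so the following is only a sketch of one plausible approach: append the parameters to `url()` as a percent-encoded query string. `EndpointUrlSketch`, the nested `Endpoint` enum (a trimmed local copy so the example compiles on its own), and `buildUrl` are hypothetical; `part`, `playlistId`, and `key` are ordinary YouTube Data API v3 query parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not library code: builds a request URL from endpoint + params.
final class EndpointUrlSketch {

    // Trimmed local copy of the enum from the diff.
    enum Endpoint {
        PLAYLIST_ITEMS("playlistItems"), SEARCH("search"), CHANNELS("channels");

        private static final String BASE_URL = "https://www.googleapis.com/youtube/v3/";
        private final String url;

        Endpoint(String resource) {
            this.url = BASE_URL + resource;
        }

        String url() {
            return url;
        }
    }

    // Joins the parameters into a query string, encoding keys and values.
    static String buildUrl(Endpoint endpoint, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return query.isEmpty() ? endpoint.url() : endpoint.url() + "?" + query;
    }

    // URLEncoder applies form encoding, so a space becomes '+'.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("part", "snippet");
        params.put("playlistId", "PL123");
        params.put("key", "API_KEY");
        System.out.println(buildUrl(Endpoint.PLAYLIST_ITEMS, params));
        // prints https://www.googleapis.com/youtube/v3/playlistItems?part=snippet&playlistId=PL123&key=API_KEY
    }
}
```

A `LinkedHashMap` is used so the parameter order in the URL is predictable; the order does not matter to the API itself.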
