Merge pull request #5 from Thoroldvix/feat/playlist-transcripts
Add bulk retrieval of transcripts for playlists and channels
Thoroldvix authored Jun 12, 2024
2 parents ebd7b4b + 6eb2f66 commit 959063e
Showing 23 changed files with 1,322 additions and 116 deletions.
57 changes: 54 additions & 3 deletions README.md
@@ -12,7 +12,8 @@
## 📖 Introduction

Java library which allows you to retrieve subtitles/transcripts for a YouTube video.
It supports manual and automatically generated subtitles and does not use headless browser for scraping.
It supports manual and automatically generated subtitles, bulk transcript retrieval for all videos in a playlist or
on a channel, and does not use a headless browser for scraping.
Inspired by [Python library](https://github.com/jdepoix/youtube-transcript-api).

## 🤖 Features
@@ -21,6 +22,8 @@ Inspired by [Python library](https://github.com/jdepoix/youtube-transcript-api).

✅ Automatically generated transcripts retrieval

✅ Bulk transcript retrieval for all videos in the playlist or channel

✅ Transcript translation

✅ Transcript formatting
@@ -79,7 +82,7 @@ TranscriptList transcriptList = youtubeTranscriptApi.listTranscripts("videoId");

// Iterate over transcript list
for(Transcript transcript : transcriptList) {
System.out.println(transcript);
System.out.println(transcript);
}

// Find transcript in specific language
@@ -143,6 +146,8 @@ TranscriptContent transcriptContent = youtubeTranscriptApi.listTranscripts("vide
TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscript("videoId");
```

For bulk transcript retrieval see [Bulk Transcript Retrieval](#bulk-transcript-retrieval).

## 🔧 Detailed Usage

### Use fallback language
@@ -241,7 +246,7 @@ TranscriptFormatter jsonFormatter = TranscriptFormatters.jsonFormatter();
String formattedContent = jsonFormatter.format(transcriptContent);
````

### YoutubeClient customization
### YoutubeClient Customization

By default, `YoutubeTranscriptApi` uses Java 11 HttpClient for making requests to YouTube. If you want to use a
different client,
@@ -275,6 +280,52 @@ TranscriptList transcriptList = youtubeTranscriptApi.listTranscriptsWithCookies(
TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscriptWithCookies("videoId", "path/to/cookies.txt", "en");
```

### Bulk Transcript Retrieval

All bulk transcript retrieval operations are done via the `PlaylistsTranscriptApi` interface. As with
`YoutubeTranscriptApi`, you can create a new instance of `PlaylistsTranscriptApi` by calling the
`createDefaultPlaylistsApi` method of `TranscriptApiFactory`.
Playlist and channel information is retrieved from
the [YouTube V3 API](https://developers.google.com/youtube/v3/docs/),
so you will need to provide an API key for all methods.

```java
// Create a new default PlaylistsTranscriptApi instance
PlaylistsTranscriptApi playlistsTranscriptApi = TranscriptApiFactory.createDefaultPlaylistsApi();

// Retrieve all available transcripts for a given playlist
Map<String, TranscriptList> transcriptLists = playlistsTranscriptApi.listTranscriptsForPlaylist("playlistId", "apiKey", true);

// Retrieve all available transcripts for a given channel
Map<String, TranscriptList> transcriptLists = playlistsTranscriptApi.listTranscriptsForChannel("channelName", "apiKey", true);
```

The boolean flag `continueOnError` controls whether the operation continues when transcript retrieval fails for a
video. If it is set to `true`, videos whose transcripts could not be retrieved are skipped. If it is set to `false`,
the operation fails fast on the first error.
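
The two behaviours of `continueOnError` can be sketched with a plain loop. This is only an illustration under stated assumptions, not the library's implementation: `RetrievalException`, `TranscriptFetcher` and `fetchAll` are hypothetical stand-ins, and transcripts are reduced to strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the library's types.
class RetrievalException extends Exception {
    RetrievalException(String message) {
        super(message);
    }
}

interface TranscriptFetcher {
    String fetch(String videoId) throws RetrievalException;
}

class ContinueOnErrorSketch {

    // Fetches a transcript for each video ID. Failures are either skipped
    // (continueOnError = true) or rethrown immediately (continueOnError = false).
    static Map<String, String> fetchAll(List<String> videoIds,
                                        TranscriptFetcher fetcher,
                                        boolean continueOnError) throws RetrievalException {
        Map<String, String> results = new LinkedHashMap<>();
        for (String videoId : videoIds) {
            try {
                results.put(videoId, fetcher.fetch(videoId));
            } catch (RetrievalException e) {
                if (!continueOnError) {
                    throw e; // fail fast on the first error
                }
                // otherwise skip this video and keep going
            }
        }
        return results;
    }

    public static void main(String[] args) throws RetrievalException {
        TranscriptFetcher fetcher = id -> {
            if (id.equals("broken")) {
                throw new RetrievalException("no transcript for " + id);
            }
            return "transcript of " + id;
        };
        List<String> ids = List.of("a", "broken", "b");

        // Skips the failing video and returns the rest.
        System.out.println(fetchAll(ids, fetcher, true).keySet());

        // Stops at the failing video.
        try {
            fetchAll(ids, fetcher, false);
        } catch (RetrievalException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

With `true`, a caller gets a partial map and has to compare its keys against the expected videos to see what was skipped. With `false`, the caller gets either a complete map or an exception.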

All methods also have overloaded versions which accept the path to a [cookies.txt](#cookies) file.

```java
// Retrieve all available transcripts for a given playlist
Map<String, TranscriptList> transcriptLists = playlistsTranscriptApi.listTranscriptsForPlaylist(
        "playlistId",
        "apiKey",
        "path/to/cookies.txt",
        true
);

// Retrieve all available transcripts for a given channel
Map<String, TranscriptList> transcriptLists = playlistsTranscriptApi.listTranscriptsForChannel(
        "channelName",
        "apiKey",
        "path/to/cookies.txt",
        true
);
```

## 🤓 How it works

Within each YouTube video page, there exists JSON data containing all the transcript information, including an
@@ -0,0 +1,71 @@
package io.github.thoroldvix.api;

import io.github.thoroldvix.internal.TranscriptApiFactory;

import java.util.Map;

/**
* Retrieves transcripts for all videos in a playlist, or all videos for a specific channel.
* <p>
* Playlists and channel videos are retrieved from the YouTube API, so you will need a valid API key to use this.
* </p>
* <p>
* To get an implementation of this interface, see {@link TranscriptApiFactory}.
* </p>
*/
public interface PlaylistsTranscriptApi {

/**
* Retrieves transcript lists for all videos in the specified playlist using the provided API key and a cookies file from the specified path.
*
* @param playlistId The ID of the playlist
* @param apiKey API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
* @param cookiesPath The file path to the text file containing the authentication cookies. Used when some videos are age-restricted (see <a href="https://github.com/Thoroldvix/youtube-transcript-api#cookies">Cookies</a>)
* @param continueOnError Whether to continue if transcript retrieval fails for a video. If true, all transcripts that could not be retrieved will be skipped,
* otherwise an exception will be thrown.
* @return A map of video IDs to {@link TranscriptList} objects
* @throws TranscriptRetrievalException If the retrieval of the transcript lists fails
Map<String, TranscriptList> listTranscriptsForPlaylist(String playlistId, String apiKey, String cookiesPath, boolean continueOnError) throws TranscriptRetrievalException;


/**
* Retrieves transcript lists for all videos in the specified playlist using provided API key.
*
* @param playlistId The ID of the playlist
* @param apiKey API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
* @param continueOnError Whether to continue if transcript retrieval fails for a video. If true, all transcripts that could not be retrieved will be skipped,
* otherwise an exception will be thrown.
* @return A map of video IDs to {@link TranscriptList} objects
* @throws TranscriptRetrievalException If the retrieval of the transcript lists fails
*/
Map<String, TranscriptList> listTranscriptsForPlaylist(String playlistId, String apiKey, boolean continueOnError) throws TranscriptRetrievalException;


/**
* Retrieves transcript lists for all videos for the specified channel using the provided API key and a cookies file from the specified path.
*
* @param channelName The name of the channel
* @param apiKey API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
* @param cookiesPath The file path to the text file containing the authentication cookies. Used when some videos are age-restricted (see <a href="https://github.com/Thoroldvix/youtube-transcript-api#cookies">Cookies</a>)
* @param continueOnError Whether to continue if transcript retrieval fails for a video. If true, all transcripts that could not be retrieved will be skipped,
* otherwise an exception will be thrown.
* @return A map of video IDs to {@link TranscriptList} objects
* @throws TranscriptRetrievalException If the retrieval of the transcript lists fails
*/
Map<String, TranscriptList> listTranscriptsForChannel(String channelName, String apiKey, String cookiesPath, boolean continueOnError) throws TranscriptRetrievalException;


/**
* Retrieves transcript lists for all videos for the specified channel using provided API key.
*
* @param channelName The name of the channel
* @param apiKey API key for the YouTube V3 API (see <a href="https://developers.google.com/youtube/v3/getting-started">Getting started</a>)
* @param continueOnError Whether to continue if transcript retrieval fails for a video. If true, all transcripts that could not be retrieved will be skipped,
* otherwise an exception will be thrown.
* @return A map of video IDs to {@link TranscriptList} objects
* @throws TranscriptRetrievalException If the retrieval of the transcript lists fails
*/
Map<String, TranscriptList> listTranscriptsForChannel(String channelName, String apiKey, boolean continueOnError) throws TranscriptRetrievalException;
}
@@ -56,6 +56,13 @@ public interface TranscriptList extends Iterable<Transcript> {
*/
Transcript findManualTranscript(String... languageCodes) throws TranscriptRetrievalException;

/**
* Retrieves the ID of the video for which the transcripts were retrieved.
*
* @return The video ID.
*/
String getVideoId();

@Override
default void forEach(Consumer<? super Transcript> action) {
Iterable.super.forEach(action);
@@ -10,7 +10,7 @@ public class TranscriptRetrievalException extends Exception {

private static final String ERROR_MESSAGE = "Could not retrieve transcript for the video: %s.\nReason: %s";
private static final String YOUTUBE_WATCH_URL = "https://www.youtube.com/watch?v=";
private final String videoId;
private String videoId;

/**
* Constructs a new exception with the specified detail message and cause.
@@ -36,10 +36,22 @@ public TranscriptRetrievalException(String videoId, String message) {
}

/**
* @return The ID of the video for which the transcript retrieval failed.
* Constructs a new exception with the specified detail message and cause.
*
* @param message The detail message explaining the reason for the failure.
* @param cause The cause of the failure (which is saved for later retrieval by the {@link Throwable#getCause()} method).
*/
public String getVideoId() {
return videoId;
public TranscriptRetrievalException(String message, Throwable cause) {
super(message, cause);
}

/**
* Constructs a new exception with the specified detail message.
*
* @param message The detail message explaining the reason for the failure.
*/
public TranscriptRetrievalException(String message) {
super(message);
}

/**
@@ -53,5 +65,12 @@ private static String buildErrorMessage(String videoId, String message) {
String videoUrl = YOUTUBE_WATCH_URL + videoId;
return String.format(ERROR_MESSAGE, videoUrl, message);
}

/**
* @return The ID of the video for which the transcript retrieval failed.
*/
public String getVideoId() {
return videoId;
}
}
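
One consequence of this change: the two new constructors never assign `videoId`, so `getVideoId()` can presumably return `null` for failures that are not tied to a single video. A minimal sketch of null-safe handling follows. `SketchRetrievalException` is a simplified stand-in for the class above and `describe` is a hypothetical helper; neither is part of the library.

```java
import java.util.Optional;

// Simplified stand-in mirroring two of the constructors in the diff above;
// it is not the library class itself.
class SketchRetrievalException extends Exception {
    private String videoId; // stays null when no video ID is supplied

    SketchRetrievalException(String videoId, String message) {
        super("Could not retrieve transcript for the video: " + videoId + ".\nReason: " + message);
        this.videoId = videoId;
    }

    SketchRetrievalException(String message) {
        super(message);
    }

    String getVideoId() {
        return videoId;
    }
}

class VideoIdHandlingSketch {

    // Hypothetical helper: names the failing video when one is known,
    // and falls back to a generic label when the ID is null.
    static String describe(SketchRetrievalException e) {
        return Optional.ofNullable(e.getVideoId())
                .map(id -> "video " + id)
                .orElse("playlist or channel lookup");
    }

    public static void main(String[] args) {
        System.out.println(describe(new SketchRetrievalException("abc123", "transcripts disabled")));
        System.out.println(describe(new SketchRetrievalException("invalid API key")));
    }
}
```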

12 changes: 11 additions & 1 deletion lib/src/main/java/io/github/thoroldvix/api/YoutubeClient.java
@@ -6,7 +6,6 @@
/**
* Responsible for sending GET requests to YouTube.
*/
@FunctionalInterface
public interface YoutubeClient {

/**
@@ -18,5 +17,16 @@ public interface YoutubeClient {
* @throws TranscriptRetrievalException If the request to YouTube fails.
*/
String get(String url, Map<String, String> headers) throws TranscriptRetrievalException;


/**
* Sends a GET request to the specified endpoint and returns the response body.
*
* @param endpoint The endpoint to which the GET request is made.
* @param params A map of parameters to include in the request.
* @return The body of the response as a {@link String}.
* @throws TranscriptRetrievalException If the request to YouTube fails.
*/
String get(YtApiV3Endpoint endpoint, Map<String, String> params) throws TranscriptRetrievalException;
}

28 changes: 28 additions & 0 deletions lib/src/main/java/io/github/thoroldvix/api/YtApiV3Endpoint.java
@@ -0,0 +1,28 @@
package io.github.thoroldvix.api;

/**
* The YouTube API V3 endpoints. Used by the {@link YoutubeClient}.
*/
public enum YtApiV3Endpoint {
PLAYLIST_ITEMS("playlistItems"),
SEARCH("search"),
CHANNELS("channels");
private final static String YOUTUBE_API_V3_BASE_URL = "https://www.googleapis.com/youtube/v3/";

private final String resource;
private final String url;

YtApiV3Endpoint(String resource) {
this.url = YOUTUBE_API_V3_BASE_URL + resource;
this.resource = resource;
}

public String url() {
return url;
}

@Override
public String toString() {
return resource;
}
}
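
For context, here is one way a `YoutubeClient` implementation might turn an endpoint and its parameter map into a request URL. This is a sketch under assumptions, not the code from this commit: `buildUrl` is a hypothetical helper, the base URL is copied from the enum above, and the resource is passed as a plain string so the example stays self-contained.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class EndpointUrlSketch {
    // Base URL copied from YtApiV3Endpoint in the diff above.
    private static final String BASE_URL = "https://www.googleapis.com/youtube/v3/";

    // Hypothetical helper: appends URL-encoded query parameters to a resource URL.
    static String buildUrl(String resource, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return query.isEmpty() ? BASE_URL + resource : BASE_URL + resource + "?" + query;
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("part", "snippet");
        params.put("playlistId", "playlistId");
        params.put("key", "apiKey");
        System.out.println(buildUrl("playlistItems", params));
    }
}
```

Encoding both keys and values keeps channel names with spaces or non-ASCII characters from producing a malformed query string.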