Scraping YouTube Data

Requirements

  • YouTube API Key (generate via Google Cloud Console)

  • Python 3.10+ installed

  • Linux environment or WSL environment

Getting Started (Backend CLI Setup)

  1. Clone the Repository

git clone https://github.com/victorchimakanu/macrocosmos-youtube-scrapper.git
cd macrocosmos-youtube-scrapper

  2. Create and Activate a Virtual Environment

python -m venv venv 
source venv/bin/activate

3. Install Required Packages

pip install -r requirements.txt

This might take a while depending on your environment.

4. Get a YouTube API Key

Visit the Google Cloud Console and follow the tutorial documentation to generate your API key.

  5. Set Up Environment Variables

Create a .env file in the root directory and add your API key:

YOUTUBE_API_KEY="YOUR_API_KEY_HERE"
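The backend reads this key from the environment, typically by loading the .env file at startup (projects like this usually use python-dotenv for that). As an illustration only, here is a minimal stdlib sketch of what that loading amounts to; the `load_env` helper is hypothetical, not part of this repo:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader; python-dotenv does this more robustly."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Don't overwrite variables already exported in the shell.
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

load_env()
api_key = os.environ.get("YOUTUBE_API_KEY")
if not api_key:
    print("YOUTUBE_API_KEY is not set; check your .env file")
```

If the key is missing at runtime, the scraper's YouTube API calls will fail, so it is worth checking early as above.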

6. Finalize Package Setup

pip install -e .

This validates and installs local dependencies including the data-universe package.

Running the Scraper (CLI Mode)

1. Navigate to the YouTube Scraper Module

cd scraping/youtube

2. Run the Scraper

python youtube_custom_scraper.py

If you encounter a ModuleNotFoundError like:

ModuleNotFoundError: No module named 'common.data'

Go back to the root directory and run:

PYTHONPATH="." python3.11 -m scraping.youtube.youtube_custom_scraper
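This works because PYTHONPATH="." puts the repository root on Python's module search path, so top-level packages such as `common` resolve. An equivalent in-code workaround (a sketch, assuming you launch Python from the repo root; not part of the repo itself) is to prepend the root to `sys.path` before the project imports:

```python
import sys
from pathlib import Path

# Assumes the interpreter is started from the repository root;
# adjust the path otherwise.
repo_root = Path.cwd()
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

# After this, imports such as `from common.data import ...` resolve.
```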

3. Choose an Option:

You’ll be prompted to select one of the following:

  1. Scrape using a default test script

  2. Scrape any video of your choice

  3. Scrape up to 5 random videos from a specific channel

Transcripts are returned in the terminal. For local downloads, use the Custom API endpoints below.

Custom APIs – Download Transcripts via HTTP

1. Start the Backend API Server

Navigate to the project root and run:

python backend/app.py

Available Custom Endpoints

Video Scraper

POST http://127.0.0.1:5001/api/scrape/video

Downloads the transcript to your local machine.

Headers

  • X-API-KEY — "youtube_api_key" (your YouTube API key)

Body (JSON)

{
  "video_id": "UH_sOZSIk10"
}

  • video_id — the video ID of the YouTube video to scrape

Response

{
  "job_type": "video",
  "status": "started",
  "video_id": "UH_sOZSIk10"
}
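Instead of Postman or the CLI, you can also call this endpoint from a short script. The following sketch uses only the Python standard library; the base URL and header are taken from this guide, while the `build_request` and `scrape_video` helpers are illustrative names, not part of the repo:

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:5001"   # backend started with `python backend/app.py`
API_KEY = "YOUR_YOUTUBE_API_KEY"     # the same key as in your .env

def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a POST request carrying the X-API-KEY header and a JSON body."""
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-KEY": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )

def scrape_video(video_id: str) -> dict:
    """Start a transcript-scrape job for a single video and return the JSON reply."""
    req = build_request("/api/scrape/video", {"video_id": video_id})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the backend to be running):
#   scrape_video("UH_sOZSIk10")
```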

Channel Scraper

POST http://127.0.0.1:5001/api/scrape/channel

Scrapes random videos from a channel, letting you specify the total number of videos to scrape.

Headers

  • X-API-KEY — "youtube_api_key" (your YouTube API key)

Body (JSON)

{
  "channel_id": "UC92OMuTHmkrk0Crz5Xqi-5w",
  "max_videos": 3
}

  • channel_id — the YouTube channel ID
  • max_videos — the maximum number of videos to scrape

Response

{
  "channel_id": "UC92OMuTHmkrk0Crz5Xqi-5w",
  "job_type": "channel",
  "max_videos": 3,
  "status": "started"
}
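The channel endpoint can be scripted the same way. This stdlib-only sketch mirrors the request shape documented above; `scrape_channel` is an illustrative helper name, not part of the repo:

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:5001"  # backend started with `python backend/app.py`
API_KEY = "YOUR_YOUTUBE_API_KEY"    # the same key as in your .env

def scrape_channel(channel_id: str, max_videos: int = 3) -> dict:
    """Start a scrape job for up to `max_videos` random videos from a channel."""
    payload = {"channel_id": channel_id, "max_videos": max_videos}
    req = urllib.request.Request(
        f"{API_BASE}/api/scrape/channel",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-KEY": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the backend to be running):
#   scrape_channel("UC92OMuTHmkrk0Crz5Xqi-5w", max_videos=3)
```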

These endpoints wrap the CLI scraper logic and save the output to your local Transcripts folder in .txt and .pdf formats.

Testing APIs with Postman

You can test these APIs using Postman by:

  • Setting request method to POST

  • Using the appropriate URL

  • Providing the correct JSON body

  • Viewing transcript generation in your terminal and output folder

For this example, we're using the custom video scraper endpoint. When you send the request, a Transcripts folder is generated and your desired transcript is downloaded to your local machine in .txt and .pdf formats.

Frontend Interface

Now let's interact with the custom scraper using a sample frontend application.

  1. To set up the frontend, open a split terminal and navigate into the frontend folder

cd frontend 

2. Install Dependencies

npm install

  3. Set Up Frontend Environment Variables

Create a .env file in the frontend directory:

VITE_API_BASE="http://127.0.0.1:5001/"
VITE_API_KEY="YOUR_YOUTUBE_API_KEY"

  4. Launch the Frontend App

npm run dev

Follow any of the printed links to open the sample YouTube scraper application on your local machine.

🧑‍💻 Using the Frontend

  • Paste a YouTube video URL or ID

  • Click Scrape Video

  • Then click Download Transcript

Transcripts are saved to the local Transcripts folder in .pdf format. You can monitor scraper activity in the terminal.
