Subnet 13 Mining

Introduction

Miners scrape data from various Data Sources and are rewarded based on how much valuable data they have; see the Incentive Mechanism for full details. The incentive mechanism does not require a Miner to scrape from all Data Sources, allowing Miners to specialize and choose exactly what kinds of data they want to scrape. However, Miners are scored, in part, on the total amount of data they have, so they should make sure they are scraping a sufficient amount of data.

The Miner stores all scraped data in their local database.

System Requirements

Miners do not require a GPU and should be able to run on a low-tier machine, as long as it has sufficient network bandwidth and disk space. Python >= 3.10 is required.

Getting Started

Prerequisites

  1. As of Dec 17th 2023, we support X (Twitter) and Reddit scraping via Apify. You can set up your Apify API token here, or use the official APIs from X (Twitter) and Reddit (recommended). The data delivered by miners must comply with the Miner Data Compliance Policy (v1.0, March 2025).

  2. Clone the repo

git clone https://github.com/RusticLuftig/data-universe.git
  3. Install the requirements. From your virtual environment, run:

cd data-universe
python -m pip install -e .
  4. (Optional) Run your miner in offline mode to scrape an initial set of data.

Running the Miner

For this guide, we'll use pm2 to manage the Miner process, because it'll restart the Miner if it crashes. If you don't already have it, install pm2.
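pm2 is typically installed globally via npm:

npm install -g pm2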

Online

From the data-universe folder, run:
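The exact flags are documented in the repo README; a minimal sketch, assuming the miner entry point is ./neurons/miner.py and using placeholder wallet names:

pm2 start python -- ./neurons/miner.py --wallet.name your-wallet --wallet.hotkey your-hotkey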

Offline

From the data-universe folder, run:
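A minimal sketch, with the same assumptions as the online command plus an --offline flag (check the repo README for the exact flag name):

pm2 start python -- ./neurons/miner.py --wallet.name your-wallet --wallet.hotkey your-hotkey --offline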

Please note that your miner will not respond to validator requests in this mode, so if you have already registered on the subnet you should run in online mode.

Configuring the Miner

Flags

The Miner offers some flags to customize properties, such as the database name and the maximum amount of data to store.

You can view the full set of flags by running:
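For example, assuming the ./neurons/miner.py entry point used above:

python ./neurons/miner.py --help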

Configuring

The frequency and types of data your Miner will scrape are configured in the scraping_config.json file. This file defines which scrapers your Miner will use. To customize your Miner, either edit scraping_config.json or create your own file and pass its filepath via the --neuron.scraping_config_file flag.

By default, scraping_config.json is set up to use both the Apify actor and a personal Reddit account for scraping Reddit.

If you do not want to use Apify, remove the sections where the scraper_id is set to Reddit.lite, X.microworlds, or X.apidojo.

If you do not want to use a personal Reddit account, remove the sections where the scraper_id is set to Reddit.custom.

If either is present in the configuration but not set up properly in your .env file, your miner will log errors but will still scrape using any scrapers that are properly configured.
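As a sketch, your .env might contain credentials like the following. The variable names here are assumptions for illustration; check the repo's documentation for the authoritative names:

# Hypothetical variable names -- verify against the repo before use
APIFY_API_TOKEN=your_apify_token
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_client_secret
REDDIT_USERNAME=your_reddit_username
REDDIT_PASSWORD=your_reddit_password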

For each scraper, you can define:

  1. cadence_seconds: controls how frequently the scraper will run.

  2. labels_to_scrape: defines how much of what type of data to scrape from this source. Each entry in this list consists of the following properties:

    1. label_choices: a list of DataLabels to scrape. Each time the scraper runs, one of these labels is chosen at random.

    2. max_age_hint_minutes: a hint to the scraper of the maximum age of data you'd like to collect for the chosen label. Not all scrapers provide date/time filters, so this is a hint, not a rule.

    3. max_data_entities: the maximum number of items to scrape for this set of labels each time the scraper runs. This gives you full control over the maximum cost of scraping data from paid sources (e.g. Apify).

Let's walk through an example to explain how all these properties fit together.
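A sketch of what such a configuration might look like, built from the properties described above. The top-level scraper_configs key and the max_age_hint_minutes value of 1440 are assumptions here; treat the repo's scraping_config.json as the authoritative schema:

{
  "scraper_configs": [
    {
      "scraper_id": "X.microworlds",
      "cadence_seconds": 300,
      "labels_to_scrape": [
        {
          "label_choices": ["#bittensor"],
          "max_age_hint_minutes": 1440,
          "max_data_entities": 100
        },
        {
          "label_choices": ["#decentralizedfinance", "#tao"],
          "max_data_entities": 50
        }
      ]
    }
  ]
}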

In this example, we configure the Miner to scrape using a single scraper, the "X.microworlds" scraper. The scraper will run every 5 minutes (300 seconds). Each time it runs, it'll perform 2 scrapes:

  1. The first will scrape at most 100 items with #bittensor. The scrape will choose a random TimeBucket in (now - max_age_hint_minutes, now). The probability distribution used to select a TimeBucket matches the Validator's incentive for Data Freshness: that is, it's weighted towards newer data.

  2. The second will scrape either #decentralizedfinance or #tao, chosen uniformly at random. It will scrape at most 50 items, using a random TimeBucket between now and the maximum data freshness threshold.

You can start your Miner with a different scraping config by passing the filepath to --neuron.scraping_config_file.
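For example, a sketch using the same placeholder entry point and wallet flags as the commands above:

pm2 start python -- ./neurons/miner.py --wallet.name your-wallet --wallet.hotkey your-hotkey --neuron.scraping_config_file ./my_scraping_config.json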

On Demand Request Handling

On-demand requests from validators are handled as described in the on demand request handling documentation.

Choosing which data to scrape

As described in the incentive mechanism, miners are scored, in part, based on their data's desirability and uniqueness. We encourage you to tune your Miner to maximize its score by scraping unique, desirable data.

For desirability, the DataDesirabilityLookup defines the exact rules Validators use to compute data desirability.
