
Conversation

@JakobWong

feat(data_collector): add Binance USDⓈ-M perpetual futures collector

Description

This PR adds a new data collector for Binance USDⓈ-M perpetual futures (scripts/data_collector/binance_um/), supporting three frequencies: 1min, 60min (1h), and 1d.

The collector follows Qlib's standard BaseCollector/BaseNormalize/BaseRun pattern and provides:

  • Historical data via ZIP: Downloads monthly ZIP files from data.binance.vision and converts them to per-symbol CSVs
  • Live/incremental data via REST: Fetches klines from Binance Futures REST API (/fapi/v1/klines) with automatic resume from existing CSV tail
  • Field mapping: Maps Binance kline fields to the Qlib schema (sketched below), including:
    • amount (mapped from quote_volume, representing USDT notional turnover)
    • vwap (computed as amount / volume)
    • Standard OHLCV fields plus trades, taker_buy_volume, taker_buy_amount
  • Qlib dump integration: Seamlessly dumps normalized CSVs to Qlib .bin format via dump_bin.py

Instrument names are prefixed with binance_um. (e.g., binance_um.BTCUSDT) to avoid conflicts with other data sources.
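
For illustration, the field mapping and instrument naming described above might look like the following (a minimal sketch, not the PR's actual code; it assumes the raw DataFrame already carries Binance-style column names):

```python
import numpy as np
import pandas as pd

def to_qlib_schema(raw: pd.DataFrame, symbol: str) -> pd.DataFrame:
    """Map raw Binance kline rows to the Qlib-style fields described above."""
    df = pd.DataFrame()
    df["date"] = pd.to_datetime(raw["open_time"], unit="ms")
    for col in ["open", "high", "low", "close", "volume"]:
        df[col] = raw[col].astype(float)
    # amount = USDT notional turnover (Binance "quote asset volume")
    df["amount"] = raw["quote_volume"].astype(float)
    # vwap = amount / volume; left NaN where no base volume traded
    df["vwap"] = df["amount"] / df["volume"].replace(0.0, np.nan)
    # prefix avoids instrument-name collisions with other data sources
    df["symbol"] = f"binance_um.{symbol}"
    return df
```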

Motivation and Context

Binance USDⓈ-M perpetual futures form a major crypto derivatives market. This collector enables Qlib users to:

  • Build factor models and backtests on crypto perpetual futures data
  • Access high-frequency (1min) and daily data in Qlib's unified format
  • Leverage Qlib's existing workflow/pipeline tools with crypto data

The implementation aligns with existing collectors (yahoo, tushare, crypto) for consistency and maintainability.

How Has This Been Tested?

  • If you are adding a new feature, test on your own test scripts.

Tested on local venv with real Binance data:

  1. 1min frequency (ZIP → CSV → normalize → dump):

    • Downloaded BTCUSDT 2024-01 monthly ZIP (1m)
    • Converted to source CSV (44,640 rows)
    • Normalized and dumped to Qlib .bin
    • Verified data retrieval via qlib.data.D.features() with freq='1min' (see the snippet after the test commands)
  2. 60min frequency (ZIP → CSV → normalize → dump):

    • Downloaded BTCUSDT 2024-01 monthly ZIP (1h)
    • Converted, normalized, and dumped successfully
    • Verified data retrieval with freq='60min' (744 rows)
  3. Daily frequency (ZIP → CSV → normalize → dump):

    • Downloaded BTCUSDT 2024-01 monthly ZIP (1d)
    • Converted, normalized, and dumped successfully
    • Verified data retrieval with freq='day' (31 rows)

Test commands used:

```bash
# 1min example
python qlib/scripts/data_collector/binance_um/collector.py download_monthly_zip \
  --months 2024-01 --raw_zip_dir ~/.qlib/binance_um/raw_zip --zip_interval 1m --symbols BTCUSDT
python qlib/scripts/data_collector/binance_um/collector.py convert_monthly_zip_to_source \
  --raw_zip_dir ~/.qlib/binance_um/raw_zip --source_dir ~/.qlib/binance_um/source_1min --zip_interval 1m
python qlib/scripts/data_collector/binance_um/collector.py normalize_data \
  --source_dir ~/.qlib/binance_um/source_1min --normalize_dir ~/.qlib/binance_um/nor_1min --interval 1min
python qlib/scripts/data_collector/binance_um/collector.py dump_to_bin \
  --source_dir ~/.qlib/binance_um/source_1min --normalize_dir ~/.qlib/binance_um/nor_1min \
  --interval 1min --qlib_dir ~/.qlib/qlib_data/binance_um_1min
```
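
The retrieval checks in steps 1–3 can be reproduced along these lines (a sketch; the field list is an assumption, and the provider path matches the dump command above):

```python
import qlib
from qlib.data import D

# Point qlib at the dumped 1min data
qlib.init(provider_uri="~/.qlib/qlib_data/binance_um_1min")

df = D.features(
    instruments=["binance_um.BTCUSDT"],
    fields=["$open", "$high", "$low", "$close", "$volume", "$vwap"],
    start_time="2024-01-01",
    end_time="2024-01-31",
    freq="1min",
)
print(df.shape)  # roughly a month of 1min bars (44,640 rows in the source CSV)
```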

Note: REST API incremental fetching was tested but hit HTTP 451 (region restriction) in the test environment. The code handles this gracefully by returning an empty DataFrame and logging a warning. ZIP-based historical data collection works reliably.
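
The graceful degradation described here is roughly as follows (illustrative only, not the PR's exact code; /fapi/v1/klines and its symbol/interval/startTime/limit parameters are Binance's documented API):

```python
import logging

import pandas as pd
import requests

logger = logging.getLogger(__name__)

def fetch_klines(symbol: str, interval: str, start_ms: int) -> pd.DataFrame:
    """Fetch klines from the Binance Futures REST API; degrade to an empty frame on HTTP 451."""
    resp = requests.get(
        "https://fapi.binance.com/fapi/v1/klines",
        params={"symbol": symbol, "interval": interval,
                "startTime": start_ms, "limit": 1500},
        timeout=10,
    )
    if resp.status_code == 451:
        logger.warning("Binance REST blocked in this region (HTTP 451); returning empty DataFrame")
        return pd.DataFrame()
    resp.raise_for_status()
    return pd.DataFrame(resp.json())
```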

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

@Abhijais4896 left a comment

Binance Futures kline columns (REST & monthly ZIP CSV) use the same order:
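
For reference, that shared order per Binance's public kline documentation is (field names here follow the PR's naming; Binance's docs use descriptive labels like "Quote asset volume"):

```python
# Index layout of each kline row, identical for /fapi/v1/klines responses
# and the rows inside data.binance.vision monthly ZIP CSVs
KLINE_FIELDS = [
    "open_time",                     # 0: ms since epoch
    "open", "high", "low", "close",  # 1-4: prices
    "volume",                        # 5: base asset volume
    "close_time",                    # 6: ms since epoch
    "quote_volume",                  # 7: quote asset volume (USDT notional)
    "trades",                        # 8: number of trades
    "taker_buy_volume",              # 9: taker buy base asset volume
    "taker_buy_quote_volume",        # 10: taker buy quote asset volume
    "ignore",                        # 11: unused
]
```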

@SunsetWolf (Collaborator)

Hi @JakobWong, thanks for contributing the code again.

I tested the code following the description in the documentation and found some problems.
Of the 14 commands in the documentation, 6 raise errors. Based on their tracebacks, these 6 commands fall roughly into 3 categories.

  • Category 1:

    • Commands:
      • python scripts/data_collector/binance_um/collector.py download_monthly_zip --months 2023-11,2023-12,2024-01 --raw_zip_dir ~/.qlib/binance_um/raw_zip --zip_interval 1m --symbols BTCUSDT,ETHUSDT
      • python scripts/data_collector/binance_um/collector.py download_monthly_zip --months 2023-11,2023-12,2024-01 --raw_zip_dir ~/.qlib/binance_um/raw_zip --zip_interval 1h --symbols BTCUSDT,ETHUSDT
    • Traceback: AttributeError: 'tuple' object has no attribute 'split' (see the sketch after this list)
  • Category 2:

    • Command:
      • python scripts/data_collector/binance_um/collector.py convert_monthly_zip_to_source --raw_zip_dir ~/.qlib/binance_um/raw_zip --source_dir ~/.qlib/binance_um/source_1min --zip_interval 1m --inst_prefix binance_um.
    • Traceback: FileNotFoundError: raw zip root not found: /home/admin/.qlib/binance_um/raw_zip/raw/um_perp/1m
  • Category 3:

    • Commands:
      • python scripts/data_collector/binance_um/collector.py dump_to_bin --source_dir ~/.qlib/binance_um/source_1min --normalize_dir ~/.qlib/binance_um/normalize_1min --interval 1min --qlib_dir ~/.qlib/qlib_data/binance_um_1min
      • python scripts/data_collector/binance_um/collector.py dump_to_bin --source_dir ~/.qlib/binance_um/source_60min --normalize_dir ~/.qlib/binance_um/normalize_60min --interval 60min --qlib_dir ~/.qlib/qlib_data/binance_um_60min
      • python qlib/scripts/data_collector/binance_um/collector.py dump_to_bin --source_dir ~/.qlib/binance_um/source_1d --normalize_dir ~/.qlib/binance_um/normalize_1d --interval 1d --qlib_dir ~/.qlib/qlib_data/binance_um_1d
    • Traceback: TypeError: DumpDataBase.__init__() got an unexpected keyword argument 'data_path'
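
For context on Category 1: fire parses --symbols BTCUSDT,ETHUSDT as a tuple of strings rather than a single comma-separated string, so an unconditional symbols.split(",") raises exactly this AttributeError. A defensive normalizer along these lines (a hypothetical helper, not taken from the PR) would accept both forms:

```python
def normalize_symbols(symbols) -> list:
    """Accept "BTCUSDT,ETHUSDT" (str) or ("BTCUSDT", "ETHUSDT") (tuple/list)."""
    if isinstance(symbols, str):
        return [s.strip().upper() for s in symbols.split(",") if s.strip()]
    return [str(s).strip().upper() for s in symbols]
```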

Please fix these problems. Thanks!
