Data

What data types do you support?

We strive to provide complete raw market data capture as published by exchanges' real-time WebSocket feeds. This means that the data types we support can vary on a per-exchange basis. For example, for BitMEX we store liquidations and chat messages in addition to tick-by-tick trades and order book L2 messages, but not for FTX, which doesn't provide those non-standard data types. See historical data details to find out which real-time channels are captured for each exchange.

We also provide the following normalized data types via our client libs:

  • tick-by-tick trades

  • order book L2 updates

  • order book snapshots (tick-by-tick, 10ms, 100ms, 1s, 10s, etc.)

  • quotes

  • derivative tick info (open interest, funding rate, mark price, index price)

  • OHLCV

  • volume/tick based trade bars (see the sketch after this list)

as well as downloadable CSV data files.
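For illustration, here's a minimal sketch (assumptions: Python, trades shaped like the normalized trade message shown later in this section) of how the volume-based trade bars mentioned above can be computed from tick-by-tick trades:

from typing import Iterable, Optional

def volume_bars(trades: Iterable[dict], volume_threshold: float) -> list[dict]:
    # Aggregate tick-by-tick trades into bars, closing a bar once its
    # accumulated traded volume reaches volume_threshold.
    bars: list[dict] = []
    bar: Optional[dict] = None
    for trade in trades:
        price, amount = trade["price"], trade["amount"]
        if bar is None:
            bar = {"open": price, "high": price, "low": price,
                   "close": price, "volume": 0.0}
        bar["high"] = max(bar["high"], price)
        bar["low"] = min(bar["low"], price)
        bar["close"] = price
        bar["volume"] += amount
        if bar["volume"] >= volume_threshold:
            bars.append(bar)
            bar = None
    return bars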

What does high frequency historical data mean?

We always collect and provide data at the highest granularity that each exchange can offer via its real-time WebSocket feeds. High frequency means different things for each exchange: for example, for Coinbase Pro it can mean L3 order book data (market-by-order), for Binance Futures it means all order book L2 real-time updates, and for Binance it means order book updates aggregated in 100ms intervals.

What can L2 order book data be used for?

L2 data (market-by-price) includes bid and ask orders aggregated by price level and can be used to analyse, among other things:

  • order book imbalance

  • average execution cost

  • average liquidity away from midpoint

  • average spread

  • hidden interest (i.e., iceberg orders)

We provide L2 data both in CSV format as incremental order book L2 updates (an initial order book snapshot plus incremental tick-level updates for each updated price level) and in exchange-native format via the API - our client libraries can perform full order book reconstruction client-side.
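For illustration, a minimal sketch of client-side order book reconstruction from an initial snapshot plus incremental L2 updates (the message shape below is a simplified assumption, not our exact library API): an update with amount 0 deletes a price level, any other amount inserts or replaces it.

class OrderBookL2:
    # Maintains full L2 book state from a snapshot + incremental updates.

    def __init__(self) -> None:
        self.bids: dict[float, float] = {}  # price -> amount
        self.asks: dict[float, float] = {}

    def apply(self, message: dict) -> None:
        # A snapshot replaces the whole book; an update patches levels.
        if message["is_snapshot"]:
            self.bids.clear()
            self.asks.clear()
        for side, levels in (("bids", self.bids), ("asks", self.asks)):
            for price, amount in message.get(side, []):
                if amount == 0:
                    levels.pop(price, None)  # level removed
                else:
                    levels[price] = amount   # level added or updated

    def best_bid(self) -> float:
        return max(self.bids)

    def best_ask(self) -> float:
        return min(self.asks)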

What can L3 order book data be used for?

L3 data (market-by-order) includes every order addition, update, cancellation and match and can be used to analyse, among other things:

  • order resting time (see the sketch after this list)

  • order fill probability

  • order queue dynamics

Historical L3 data is currently available via the API for Bitfinex, Coinbase Pro and Bitstamp - the rest of the supported exchanges provide L2 data only.
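As an illustration of the first item above, a minimal sketch (assuming a simplified L3 event shape, not our exact data format) that measures order resting time by pairing each order addition with its later cancellation or match:

def resting_times(events: list[dict]) -> list[float]:
    # Return, in seconds, how long each order rested in the book, assuming
    # events carry {"type", "order_id", "timestamp"} fields with timestamps
    # expressed in epoch seconds.
    added_at: dict[str, float] = {}
    results: list[float] = []
    for event in events:
        if event["type"] == "add":
            added_at[event["order_id"]] = event["timestamp"]
        elif event["type"] in ("cancel", "match"):
            start = added_at.pop(event["order_id"], None)
            if start is not None:
                results.append(event["timestamp"] - start)
    return results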

Do you provide historical options data?

Yes, we do provide historical options data for Deribit and OKEx Options - see the options chain CSV data type.

Do you provide historical futures data?

We specialize in derivatives exchanges market data and cover all top venues that trade futures contracts: BitMEX, Deribit, Binance USDT Futures, Binance COIN Futures, FTX, OKEx, Huobi Futures, Huobi Swap, Bitfinex Derivatives, Bybit and many more.

Can you record market data for an exchange that's not currently supported?

Yes, we're always open to supporting new promising exchanges. Contact us and we'll get back to you as soon as possible to discuss the details.

Do you provide market data in normalized format?

Normalized market data (a unified data format for every exchange) is available via our official libraries and downloadable CSV files. Our HTTP API provides data only in exchange-native format.

What is the difference between exchange-native and normalized data formats?

Cryptocurrency markets are very fragmented and every exchange provides data in its own bespoke data format, which we call the exchange-native data format. Our HTTP API provides market data in this format.

For example, a BitMEX trade message looks like this:

{"table":"trade","action":"insert","data":[{"timestamp":"2019-06-01T00:03:11.589Z","symbol":"ETHUSD","side":"Sell","size":10,"price":268.7,"tickDirection":"ZeroMinusTick","trdMatchID":"ebc230d9-0b6e-2d5d-f99a-f90109a2b113","grossValue":268700,"homeNotional":0.08555051758063137,"foreignNotional":22.987424073915648}]}

and a Deribit trade message:

{"jsonrpc":"2.0","method":"subscription","params":{"channel":"trades.ETH-26JUN20.raw","data":[{"trade_seq":18052,"trade_id":"ETH-10813935","timestamp":1577836825724,"tick_direction":0,"price":132.65,"instrument_name":"ETH-26JUN20","index_price":128.6,"direction":"buy","amount":1.0}]}}

In contrast, the normalized data format means the same unified format across multiple exchanges. We provide normalized data via our client libs (data normalization is performed client-side) as well as via downloadable CSV files.

Sample normalized trade message:

{
  "type": "trade",
  "symbol": "XBTUSD",
  "exchange": "bitmex",
  "id": "282a0445-0e3a-abeb-f403-11003204ea1b",
  "price": 7996,
  "amount": 50,
  "side": "sell",
  "timestamp": "2019-10-23T10:32:49.669Z",
  "localTimestamp": "2019-10-23T10:32:49.740Z"
}
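A minimal sketch of loading such a normalized trade message in Python (the dataclass below is illustrative only, not a type our libraries export):

import json
from dataclasses import dataclass
from datetime import datetime

@dataclass
class NormalizedTrade:
    type: str
    symbol: str
    exchange: str
    id: str
    price: float
    amount: float
    side: str
    timestamp: datetime       # exchange-provided timestamp
    localTimestamp: datetime  # message arrival timestamp

def parse_trade(raw: str) -> NormalizedTrade:
    data = json.loads(raw)
    for field in ("timestamp", "localTimestamp"):
        # fromisoformat doesn't accept the trailing "Z" before Python 3.11.
        data[field] = datetime.fromisoformat(data[field].replace("Z", "+00:00"))
    return NormalizedTrade(**data)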

What time zone is used in the data?

UTC.

How is historical raw market data sourced?

Raw market data is sourced from WebSocket real-time APIs provided by the exchanges. See details.

Is provided raw market data complete?

We do our best to provide the most complete and reliable historical raw data API on the market. To that end, among many other things, we utilize highly available Kubernetes clusters on Google Cloud Platform that offer best-in-class availability, networking and monitoring. However, due to exchanges' API downtimes (maintenance, deployments, connection drops, etc.) we can experience market data gaps and cannot guarantee 100% data completeness. In rare circumstances, when an exchange's API changes without notice or we hit new, unexpected rate limits, we may also fail to record data during such a period - this happens very rarely and is specific to each exchange. Use the /exchanges/:exchange API endpoint and check the incidentReports field to get the most detailed and up-to-date information on that subject.
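For example, a minimal sketch of checking those incident reports (the base URL below is a placeholder assumption - substitute the actual API host you use):

import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # placeholder, not the real host

def incident_reports(exchange: str) -> list:
    # Fetch exchange info and return its incidentReports field.
    with urllib.request.urlopen(f"{API_BASE}/exchanges/{exchange}") as response:
        info = json.load(response)
    return info.get("incidentReports", [])

print(incident_reports("bitmex"))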

Can historical order books reconstructed from L2 updates occasionally be crossed (bid/ask overlap)?

Although it should never happen in theory, in practice, due to various crypto exchange bugs and peculiarities, it can happen (very occasionally) - see posts from users reporting those issues.

For exchanges that provide sequence numbers, we track them on WebSocket L2 order book messages when collecting the data and restart the connection when a sequence gap is detected. Even when sequence numbers are in check, bid/ask overlap can occur: exchanges 'forget' to publish delete messages for the opposite side of the book when publishing a new level for a given side. We validated that hypothesis by taking reconstructed order book snapshots that were crossed (bid/ask overlap), manually removing the opposite-side levels for which the exchange didn't publish a 'delete', and checking that the resulting best bid/ask matched the quote/ticker feeds (for exchanges that provide those) - see the sample code below that implements that manual level removal logic.
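A minimal sketch of that manual level removal (uncrossing) logic, using the dict-based book representation from the reconstruction example earlier in this section:

def uncross(bids: dict[float, float], asks: dict[float, float],
            updated_side: str) -> None:
    # Remove crossed levels on the side opposite to the one that was just
    # updated - those are the levels the exchange 'forgot' to delete.
    if updated_side == "bids" and bids:
        best_bid = max(bids)
        for price in [p for p in asks if p <= best_bid]:
            del asks[price]
    elif updated_side == "asks" and asks:
        best_ask = min(asks)
        for price in [p for p in bids if p >= best_ask]:
            del bids[price]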

Can an exchange publish data with non-monotonically increasing timestamps for a single data channel?

That shouldn't happen in theory, but we've detected that for some exchanges, when a new connection is established, the first message for a given channel & symbol sometimes has a newer timestamp than subsequent messages, e.g., an order book snapshot has a newer timestamp than the first order book update. This is why we provide data via the API and CSV downloads for given date ranges based on local timestamps (the timestamp of message arrival), which are always monotonically increasing.

Do exchanges publish duplicated trade messages?

Some exchanges occasionally publish duplicated trades (trades with the same IDs). Since we collect real-time data, we also collect and provide such duplicate trades via the API if they were published by exchanges' real-time WebSocket feeds. Some of our client libraries can deduplicate such trades when working with normalized data, and similarly we deduplicate tick-by-tick trades in the downloadable CSV files.
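A minimal deduplication sketch, assuming normalized trades with the exchange/symbol/id fields shown earlier in this section:

from typing import Iterable, Iterator

def dedupe_trades(trades: Iterable[dict]) -> Iterator[dict]:
    # Skip any trade whose (exchange, symbol, id) key was already seen.
    seen: set[tuple] = set()
    for trade in trades:
        key = (trade["exchange"], trade["symbol"], trade["id"])
        if key in seen:
            continue  # duplicate published by the exchange feed
        seen.add(key)
        yield trade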

What protocols are used for data collection from exchanges?

We use the WebSocket protocol for real-time data collection and occasionally HTTP REST APIs for fetching initial full order book snapshots and other info for exchanges that do not provide them via WebSocket.

What is the difference between collecting live streaming real-time WebSocket data and collecting data by periodically calling REST endpoints?

Collecting real-time streaming WebSocket data allows us to preserve and provide the most granular data that exchanges offer, in contrast to collecting data by periodically polling REST APIs. Historical data sourced from exchanges' real-time WebSocket feeds adheres to what you'd see when trading live, even if that means occasional connection drops resulting in missing data, real-time data publishing delays by exchanges (especially during larger market moves), duplicated trades or crossed books in some cases. We find that trade-off acceptable: even if the data isn't as clean as data sourced from REST APIs, it allows for more insight into market microstructure and various unusual exchange behaviors that simply can't be captured otherwise. A simple example would be latency spikes for many exchanges during periods of increased volatility, where an exchange publishes trade/order book/quote WebSocket messages with larger-than-usual latency or simply skips some of the updates and then returns them in one batch. Querying the REST API would result in a nice, clean trade history, but such data wouldn't fully reflect real, actionable market behavior.

How are order book data snapshots provided?

Historical market data available via HTTP API provides order book snapshots at the beginning of each day (00:00 UTC) - see details.

We also provide custom order book snapshots with configurable time intervals (from tick-by-tick and milliseconds to minutes or hours) via client libs, in which case the custom snapshots are computed client-side from raw data provided via the HTTP API.
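A minimal sketch of how such fixed-interval snapshots can be derived client-side, reusing the OrderBookL2 class from the reconstruction example earlier in this section (messages are assumed to arrive as (epoch_seconds, message) pairs):

def interval_snapshots(messages, interval: float, top_n: int = 10):
    # Replay timestamped L2 messages and emit the top top_n levels per side
    # every `interval` seconds of market time.
    book = OrderBookL2()
    next_at = None
    for timestamp, message in messages:
        # Emit a snapshot for every interval boundary crossed before
        # applying the current message.
        while next_at is not None and timestamp >= next_at:
            yield {"timestamp": next_at,
                   "bids": sorted(book.bids.items(), reverse=True)[:top_n],
                   "asks": sorted(book.asks.items())[:top_n]}
            next_at += interval
        book.apply(message)
        if next_at is None:
            next_at = timestamp + interval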

Do you collect order books as snapshots or in streaming mode?

Order books are collected in streaming mode - snapshot at the beginning of each day and then incremental updates. See details.

We also provide custom order book snapshots with configurable time intervals (from tick-by-tick and milliseconds to minutes or hours) via client libs, in which case the custom snapshots are computed client-side from raw data provided via the HTTP API.

How are market data messages timestamped?

Each message received via a WebSocket connection is timestamped with 100ns precision using a synchronized clock at arrival time (before any message processing) and stored in ISO 8601 format.
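A minimal sketch of that arrival-time timestamping (illustrative only - the actual achievable resolution depends on the platform clock):

import time
from datetime import datetime, timezone

def arrival_timestamp() -> str:
    # Capture arrival time before any message processing and format it
    # as ISO 8601 with 100ns (7 fractional digits) precision.
    ns = time.time_ns()
    base = datetime.fromtimestamp(ns // 1_000_000_000, tz=timezone.utc)
    fraction_100ns = (ns % 1_000_000_000) // 100  # units of 100ns
    return base.strftime("%Y-%m-%dT%H:%M:%S") + f".{fraction_100ns:07d}Z"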

What is the delay of new historical market data in relation to real-time?

For API access it's 4 minutes (T - 4min); downloadable CSV files for a given day are available on the next day around 03:00 UTC.