Version: Next

HTTP(s) Data Connector Deployment Guide

Production operating guide for the HTTP(s) data connector covering authentication, rate control, retry tuning, and observability.

Authentication & Secrets

The connector supports HTTP Basic, custom-header, and OAuth2 refresh-token authentication. Secrets must be sourced from a secret store in production.

| Parameter | Description |
| --- | --- |
| http_username | Username for HTTP Basic authentication. |
| http_password | Password for HTTP Basic authentication. Use ${secrets:...} to resolve from a secret store. |
| http_headers | Custom headers (e.g. Authorization:Bearer ${secrets:api_token}). Treated as sensitive and not logged. |
| auth_token_url | OAuth2 token endpoint URL (must be HTTPS in production). |
| http_auth_refresh_token | OAuth2 refresh token. Required when auth_token_url is set. |
| http_auth_client_id | OAuth2 client ID (required for confidential clients). |
| http_auth_client_secret | OAuth2 client secret (required for confidential clients). Use ${secrets:...}. |

For OAuth2-protected APIs, prefer the refresh-token flow over storing long-lived bearer tokens. The connector exchanges the refresh token for a short-lived access token at startup and refreshes the access token before it expires.
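As a rough illustration of the refresh-token setup, the parameters from the table above might be combined as follows. This is a sketch, not a verified configuration: the token endpoint URL and the secret key names (api_refresh_token, api_client_id, api_client_secret) are hypothetical, and placing the parameters in the dataset's params block is assumed from how other dataset parameters in this guide are set.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      # Hypothetical token endpoint; must be HTTPS in production.
      auth_token_url: https://auth.example.com/oauth2/token
      # Hypothetical secret key names, resolved from the secret store.
      http_auth_refresh_token: ${secrets:api_refresh_token}
      http_auth_client_id: ${secrets:api_client_id}
      http_auth_client_secret: ${secrets:api_client_secret}
```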

TLS

Use HTTPS endpoints in production. auth_token_url must use HTTPS (loopback addresses are allowed for local testing only). Self-signed certificates require a trusted CA bundle in the container or host OS trust store.

Resilience Controls

Rate Control

The HTTP connector participates in the shared HTTP rate control system. Concurrency and per-second/per-minute request limits can be configured per-dataset (in params) or globally (in runtime.params). Dataset-level settings override the global defaults. Multiple datasets targeting the same upstream origin share a single rate controller.

| Parameter | Description |
| --- | --- |
| max_concurrent_requests | Maximum concurrent HTTP requests to the same origin. Disabled when unset. |
| requests_per_second_limit | Maximum HTTP requests per second to the same origin. Disabled when unset. |
| requests_per_minute_limit | Maximum HTTP requests per minute to the same origin. Disabled when unset. |
| rate_control_jitter_min | Minimum random delay before requests when rate control is active. Defaults to 5ms. |
| rate_control_jitter_max | Maximum random delay before requests when rate control is active. Defaults to 10ms. |

The runtime equivalents (http_max_concurrent_requests, http_requests_per_second_limit, http_requests_per_minute_limit, http_rate_control_jitter_min, http_rate_control_jitter_max) set defaults that apply to every HTTP-based connector unless overridden per dataset.

runtime:
  params:
    http_max_concurrent_requests: 10
    http_requests_per_second_limit: 5

datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      allowed_request_paths: '/data/**'
      max_concurrent_requests: 3 # Override for this dataset
      requests_per_minute_limit: 60

Use rate control when the upstream API enforces request quotas, when many datasets share a single origin, or when running large IN-list refreshes that would otherwise burst hundreds of concurrent requests.

Retry Behavior

HTTP-level retries follow the shared resilient_http policy: 408, 429, and 5xx responses plus transient network errors are retried. The connector respects Retry-After, retry-after-ms, and x-retry-after-ms headers.

| Parameter | Default | Description |
| --- | --- | --- |
| max_retries | 3 | Maximum retry attempts per request. |
| retry_backoff_method | fibonacci | Backoff strategy. Options: fibonacci, linear, exponential. |
| retry_max_duration | unset | Maximum total duration across all retries (e.g. 30s, 5m). When set, retries stop after this elapsed time. |
| retry_jitter | 0.3 | Randomization factor (0.0–1.0) applied to retry delays. Set to 0 to disable jitter. |

Retry settings are configured independently of rate control, but retried requests still count against the rate limits. If a retry would exceed the configured per-second or per-minute rate, it waits for the rate window to open before the request is issued.
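The retry parameters might be set per dataset along these lines. The values are illustrative only, and placing them in the dataset's params block is an assumption based on the rate-control example earlier in this guide.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      max_retries: 5                     # default is 3
      retry_backoff_method: exponential  # fibonacci (default), linear, or exponential
      retry_max_duration: 30s            # stop retrying after 30 seconds in total
      retry_jitter: 0.2                  # default is 0.3; 0 disables jitter
      requests_per_second_limit: 5       # retried requests also wait for this window
```

Setting retry_max_duration alongside max_retries gives an upper bound on how long a single failing request can hold up a refresh, whichever limit is reached first.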

Timeouts and Connection Pool

| Parameter | Default | Description |
| --- | --- | --- |
| client_timeout | 30 | Maximum time (seconds) to wait for the entire request-response cycle. |
| connect_timeout | 10 | Maximum time (seconds) to establish a TCP/TLS connection. |
| pool_max_idle_per_host | 10 | Maximum idle connections held per upstream host. |
| pool_idle_timeout | 90 | Idle connection lifetime (seconds) before the pool closes them. |

Increase client_timeout for endpoints with large response bodies or expensive server-side computation. Reduce pool_max_idle_per_host when running many small datasets against the same host to keep the runtime's open file descriptors bounded.
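A sketch of the tuning described above, for a slow endpoint that is one of many small datasets on the same host. The values are illustrative. Writing the timeouts as bare integers (seconds) and placing them in the dataset's params block are assumptions inferred from the defaults table and the other examples in this guide.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      client_timeout: 120        # seconds; raised from the default of 30
      connect_timeout: 10        # seconds; default, shown for reference
      pool_max_idle_per_host: 2  # lowered from the default of 10
      pool_idle_timeout: 90      # seconds; default, shown for reference
```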

Caching Mode

When using refresh_mode: caching, transient HTTP errors (5xx, 429) are excluded from the cache and propagated to clients. Set caching_stale_if_error: enabled to serve expired cached data on upstream failure. Always set caching_ttl explicitly — the default of 30s is rarely the desired window.
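The caching settings named above might be combined as in the following sketch. The nesting is an assumption: this guide does not state where refresh_mode, caching_ttl, and caching_stale_if_error are placed, so the acceleration block and its params are shown here as one plausible layout and should be checked against the acceleration reference. The 10m value is illustrative and assumes the same duration format as retry_max_duration.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
    # Assumed placement; verify against the acceleration reference.
    acceleration:
      enabled: true
      refresh_mode: caching
      params:
        caching_ttl: 10m                 # illustrative; the default is 30s
        caching_stale_if_error: enabled  # serve expired entries on upstream failure
```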

Capacity & Sizing

  • Throughput: Bounded by the upstream rate limit, then by max_concurrent_requests and connect_timeout. Plan limits to stay within the API quota.
  • Memory: Response bodies are streamed; memory footprint is bounded by max_request_body_bytes (filter inputs) and DataFusion's record-batch size for response rows.
  • Connection setup: TLS handshake adds latency. The connection pool keeps pool_max_idle_per_host warm connections to absorb burst traffic.
  • Partitioned refreshes: When using IN-list filters or cross-product partitioning, the runtime issues one HTTP request per partition. Use max_request_partitions to cap the request count for unbounded filter combinations, and max_concurrent_requests to throttle their fan-out.
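To make the partitioned-refresh sizing concrete, the sketch below pairs a partition cap with concurrency and per-minute limits. The numbers, including the 60 requests/minute upstream quota, are assumptions chosen for illustration, and the params placement follows the rate-control example earlier in this guide.

```yaml
# Illustrative arithmetic: a query with two IN-lists of 20 and 15 values
# multiplies out to 20 x 15 = 300 partitions, i.e. up to 300 HTTP requests.
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      max_request_partitions: 100    # cap on requests generated by filter combinations
      max_concurrent_requests: 4     # at most 4 of those requests in flight at once
      requests_per_minute_limit: 60  # stay inside the assumed 60 requests/minute quota
```

At 60 requests per minute, 100 partitions take at least about 100 seconds to issue, so refresh intervals should be sized with that floor in mind.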

Metrics

When rate control is active, the connector exposes per-origin metrics that can be enabled per-dataset:

| Metric Name | Type | Description |
| --- | --- | --- |
| inflight_operations | Gauge | Current number of HTTP requests holding a rate-control permit. |
| rate_control_max_concurrent_requests | Gauge | Configured maximum concurrent HTTP requests for this upstream origin; 0 means disabled. |
| rate_control_requests_per_second_limit | Gauge | Configured HTTP request-per-second limit for this upstream origin; 0 means disabled. |
| rate_control_requests_per_minute_limit | Gauge | Configured HTTP request-per-minute limit for this upstream origin; 0 means disabled. |
| rate_control_jitter_min_ms | Gauge | Configured minimum rate-control jitter (ms) before HTTP requests. |
| rate_control_jitter_max_ms | Gauge | Configured maximum rate-control jitter (ms) before HTTP requests. |
| rate_control_available_permits | Gauge | Current available permits in the HTTP request concurrency semaphore; 0 when concurrency is disabled. |
| rate_control_acquisitions_total | Counter | Total HTTP request rate-control permits acquired. |
| rate_control_acquire_errors_total | Counter | Total HTTP request rate-control permit acquisition errors. |
| rate_control_wait_duration_ms | Counter | Cumulative time (ms) spent waiting for HTTP rate-control permits, quotas, and jitter. |
| rate_limit_retry_after_updates_total | Counter | Total upstream cooldown hints accepted from Retry-After or RateLimit reset headers. |
| rate_limit_retry_after_waits_total | Counter | Total waits caused by Retry-After or RateLimit reset headers. |
| rate_limit_retry_after_wait_duration_ms | Counter | Cumulative time (ms) spent waiting because of Retry-After or RateLimit reset headers. |
| rate_limit_retry_after_remaining_ms | Gauge | Current remaining Retry-After / RateLimit cooldown (ms) for this upstream origin. |

Enable component metrics in the dataset's metrics section. See Component Metrics for general configuration.
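A dataset that enables a few of these metrics might look like the sketch below. The shape of each entry in the metrics section (a name plus an enabled flag) is an assumption and is not specified in this guide; the Component Metrics page is the authoritative reference. The metric names themselves come from the table above.

```yaml
datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      file_format: json
      max_concurrent_requests: 3  # rate control must be active for these metrics
    metrics:
      # Entry shape (name / enabled) is assumed; see Component Metrics.
      - name: inflight_operations
        enabled: true
      - name: rate_control_wait_duration_ms
        enabled: true
      - name: rate_limit_retry_after_waits_total
        enabled: true
```

Watching rate_control_wait_duration_ms next to inflight_operations is one way to tell whether refreshes are slow because of the configured limits or because of the upstream itself.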

For broader observability, also monitor:

  • Spice query execution metrics (query_duration_ms, query_processed_rows, query_failures_total) from runtime.metrics.
  • HTTP response status distribution via the shared resilient_http instrumentation.

Task History

HTTP requests participate in task history through the HTTP client's span. Each partitioned request and each pagination page is a child of the enclosing sql_query or accelerated_table_refresh task.

Known Limitations

  • Read-only: The connector is read-only. Only GET and POST (via request_body filters) are supported.
  • Filter pushdown is opt-in: request_path, request_query, request_body, and request_headers filters require explicit allowlists or _filters: enabled parameters.
  • OAuth2 scope: Only the refresh-token grant is supported. Client-credentials and authorization-code flows are not exposed.
  • OR across virtual filter columns: WHERE request_path = '/a' OR request_query = 'b=1' is rejected. Use separate datasets or UNION ALL for cross-column alternatives. Single-column OR (and IN-lists) is supported.

Troubleshooting

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| 401 Unauthorized | Wrong/expired token or password. | Rotate the credential in the secret store. |
| 429 Too Many Requests (frequent) | Upstream rate limit hit; concurrency too high. | Set requests_per_second_limit / requests_per_minute_limit; reduce max_concurrent_requests. |
| Refresh blocked / queue building up | max_concurrent_requests set too low for the workload. | Raise the dataset-level limit or move heavy datasets to their own origin. |
| OAuth2 token refresh fails | auth_token_url not HTTPS, or wrong client credentials. | Verify the token endpoint URL; check http_auth_client_id/secret and required scopes. |
| Request rejected: "OR across HTTP filter columns" | WHERE request_path = '...' OR request_query = '...'. | Split into separate refreshes or UNION ALL. |
| Many partitions created from cross-product | Multiple IN-list filters multiplied into many requests. | Set max_request_partitions to cap; tighten filters. |
| Slow first refresh | Cold connection pool + TLS handshake per request. | Raise pool_max_idle_per_host; ensure pool_idle_timeout is long enough to keep connections warm. |