| name | persistent-cache-gap-filling |
| description | Persistent data cache with gap-filling for historical market data. Trigger when: (1) cache re-downloads complete data unnecessarily, (2) time-based cache expiry wastes API calls, (3) historical data needs incremental updates only, (4) gap-fill warnings appear when cache is sufficient. |
| author | Claude Code |
| date | "2026-02-01T00:00:00.000Z" |
Persistent Cache with Gap-Filling (v3.7.0)
Experiment Overview
| Item | Details |
|---|---|
| Date | 2026-01-01 |
| Goal | Eliminate redundant downloads of historical data by removing time-based cache expiry |
| Environment | alpaca_trading/data/ modules |
| Status | Success |
Context
User noticed that re-running the training notebook caused complete re-downloads of historical data even though:
- Data was downloaded earlier the same day
- Historical data is immutable (past candles never change)
- Only new bars since the last download were needed
The root cause was time-based cache expiry:
- SQLite cache (`cache.py`): 12-hour TTL via `PERSISTED_TTL_HOURS`
- Pickle cache (`caching_fetcher.py`): 3-7 day expiry via `cache_expiry_days`
v2.8.0 Solution: Persistent Cache + Gap-Filling
Core Principle
Historical market data is immutable. Once downloaded and validated, it should persist indefinitely. Only fetch new bars to fill the gap between cache end and current time.
Changes Made
1. cache.py - SQLite Cache
Before, the TTL check unconditionally evicted old entries:
```python
def get(self, ..., ttl_hours: int = 24):
    if created_at < ttl_cutoff or expires_at < now_ts:
        self._remove_entry(cache_key)
        return None
```
After, expiry is opt-in:
```python
def get(self, ..., ttl_hours: Optional[int] = None):
    if ttl_hours is not None:
        ...  # TTL check runs only when a caller explicitly asks for it
```
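For context, here is roughly what the opt-in TTL branch could look like in full. This is a sketch: `_lookup()` and the `entry` attributes are hypothetical stand-ins; only the `ttl_hours` / `_remove_entry()` shape comes from the snippet above.

```python
import time
from typing import Any, Optional

def get(self, symbol: str, timeframe: str, start: str, end: str,
        ttl_hours: Optional[int] = None) -> Optional[Any]:
    """Return a cached entry; expire it only when a TTL is explicitly given."""
    entry = self._lookup(symbol, timeframe, start, end)  # hypothetical lookup helper
    if entry is None:
        return None
    if ttl_hours is not None:
        # Opt-in expiry: enforced only when the caller passes a TTL.
        ttl_cutoff = time.time() - ttl_hours * 3600
        if entry.created_at < ttl_cutoff:
            self._remove_entry(entry.cache_key)
            return None
    return entry.data
```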
2. fetcher.py - DataFetcher
Before, a module-level TTL forced periodic re-downloads:
```python
PERSISTED_TTL_HOURS = 12
self._cache_ttl_hours = PERSISTED_TTL_HOURS
```
After, the persisted cache is loaded with no TTL check:
```python
def _load_persisted(self, symbol: str, timeframe: str) -> pd.DataFrame:
    cached = self._cache.get(symbol, timeframe, start="", end="")
    return cached
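```
The save side keeps the same "persist forever" idea; per the Files Modified notes below, a 10-year TTL stands in for infinity. A sketch, with the `self._cache.set()` signature assumed:

```python
import pandas as pd

def _save_persisted(self, symbol: str, timeframe: str, df: pd.DataFrame) -> None:
    """Persist bars with an effectively infinite TTL: historical data never changes."""
    ten_years_hours = 10 * 365 * 24  # ~87,600 hours, i.e. never expires in practice
    self._cache.set(symbol, timeframe, df, ttl_hours=ten_years_hours)  # signature assumed
```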
3. caching_fetcher.py - CachingDataFetcher
```python
class CachingDataFetcher:
    def get_bars(self, symbol, timeframe, lookback_days, **kwargs):
        cached_df = load_from_cache(symbol, timeframe, cache_dir=self._cache_dir)
        if cached_df is not None:
            cache_start = cached_df.index.min()  # added: needed for the coverage check
            cache_end = cached_df.index.max()
            # Cache fully covers the requested window: serve it with no API call.
            if cache_start <= start_dt and cache_end >= end_dt - tolerance:
                return cached_df
            # Otherwise fetch only the missing tail (the "gap").
            fetch_start = cache_end + timedelta(hours=1)
            new_df = self._fetcher.get_bars(symbol, start=fetch_start, ...)
            combined = pd.concat([cached_df, new_df])
            save_to_cache(symbol, combined, ...)
            return combined
```
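One subtlety: if the fetch window overlaps the cached range, `pd.concat` can leave duplicate timestamps in the index. A hedged sketch of a safe merge (the dedup/sort step is an assumption, not shown in the snippet above):

```python
import pandas as pd

def merge_bars(cached_df: pd.DataFrame, new_df: pd.DataFrame) -> pd.DataFrame:
    """Append freshly fetched bars to the cached frame, dropping overlap."""
    combined = pd.concat([cached_df, new_df])
    combined = combined[~combined.index.duplicated(keep="last")]  # keep the fresh copy
    return combined.sort_index()
```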
Behavior Comparison
Before (Time-Based Expiry)
```
Run 1 (10:00 AM): Fetch 4 years of data [API] -> Cache (12h TTL)
Run 2 (10:30 AM): Cache valid -> [CACHE] instant
Run 3 (11:00 PM): Cache expired -> [API] Fetch 4 years AGAIN
```
After (Persistent + Gap-Fill)
```
Run 1 (10:00 AM): Fetch 4 years of data [API] -> Cache (persistent)
Run 2 (10:30 AM): Cache complete -> [CACHE] instant
Run 3 (11:00 PM): Cache + gap-fill -> [GAP-FILL] Fetch 13 new bars only
```
Output Messages
| Message | Meaning |
|---|---|
| `[CACHE] AAPL: 35,040 bars (complete)` | Cache covers full range, no API call |
| `[GAP-FILL] AAPL: Fetching 2026-01-01 to 2026-01-01...` | Fetching only new bars |
| `[UPDATED] AAPL: 35,038 + 2 = 35,040 bars` | Merged new bars with cache |
| `[API] AAPL: Fetching 1460 days...` | No cache, full download |
Cache Statistics
New `gap_fills` counter added:
```python
stats = fetcher.get_cache_stats()
```
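Reading it back might look like this (only the `gap_fills` key is documented; any other keys are assumptions):

```python
stats = fetcher.get_cache_stats()
print(f"Gap fills this session: {stats.get('gap_fills', 0)}")
```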
v3.7.0 Update: gap_fill_threshold_days
Problem: Gap-fill attempts were logging alarming 401 warnings even when the cache had sufficient data for training. A 3-day-old cache with 4,318 bars is perfectly fine for training, but the 2-hour tolerance triggered unnecessary API calls.
Solution: Added `gap_fill_threshold_days` parameter (default: 7 days):
```python
fetcher = CachingDataFetcher(
    cache_dir='/path/to/cache',
    gap_fill_threshold_days=7,
)
```
Behavior (a minimal sketch follows the list):
- Cache < 7 days old: Return cached data immediately, no API call
- Cache >= 7 days old: Attempt gap-fill (may show warnings if API fails)
- Cache with sufficient bars: Always returned regardless of age
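A minimal sketch of that age check, assuming the newest cached bar's timestamp is available as a timezone-aware `cache_end`:

```python
from datetime import datetime, timedelta, timezone

def should_gap_fill(cache_end: datetime, gap_fill_threshold_days: int = 7) -> bool:
    """Gap-fill only when the cache is old enough to justify an API call."""
    age = datetime.now(timezone.utc) - cache_end
    return age >= timedelta(days=gap_fill_threshold_days)
```

When this returns False, the cached frame is served as-is with the `(3d old)` style message.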
Output Messages (v3.7.0):
| Message | Meaning |
|---|---|
| `[CACHE] AAPL: 4,318 bars (3d old)` | Cache is recent enough, no gap-fill attempted |
| `[CACHE] AAPL: 4,318 bars (complete)` | Cache fully covers requested range |
| `[GAP-FILL] AAPL: Fetching...` | Cache is stale, attempting to fetch new bars |
Failed Attempts
| Approach | Result | Why It Failed |
|---|---|---|
| Increase TTL to 30 days | Worked but fragile | Still expires eventually, arbitrary cutoff |
| Check file modification time | Partial | Doesn't verify data completeness |
| 2-hour tolerance for gap-fill | Unnecessary warnings | Cache 3 days old is fine for training |
Troubleshooting: 401 During Gap-Fill
If gap-fill shows `401 Authorization Required` from nginx:
- Check API key parsing first - Key files use `Key:\n<value>\nSecret:\n<value>` format. Naive line reading sends "Key:" as the API key ID. Use `_read_keys_from_file()` from `alpaca_trading.trading.broker` (a parsing sketch follows this list).
- Check key validity - Paper trading keys expire if the account is deactivated.
- The cache still works - Gap-fill failure is graceful; the stale cache is returned with an `(Xd old)` message, and training can proceed with slightly older data.
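For reference, a minimal parser for that key file layout (a sketch of the idea only; the real `_read_keys_from_file()` may differ):

```python
def read_keys(path: str) -> tuple[str, str]:
    """Parse a key file laid out as: Key: / <value> / Secret: / <value>."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    key_id = lines[lines.index("Key:") + 1]     # value is on the line after the label
    secret = lines[lines.index("Secret:") + 1]  # value is on the line after the label
    return key_id, secret
```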
v5.2.1 Update: Gap-Fill Warning Suppression
Problem: When the pickle cache is current but the SQLite cache is empty (common across Colab sessions), gap-fill triggers and the `is_incremental` flag stays False. The API returns empty (data is up-to-date), and `_fetch_remote()` logs "No data returned from Alpaca" — a misleading warning.
Solution: Added is_incremental parameter to DataFetcher.get_bars():
```python
def get_bars(self, ..., is_incremental: bool = False):
```
`CachingDataFetcher.get_bars()` now passes `is_incremental=True` in the gap-fill path:
```python
new_df = self._fetch_bars(symbol, ..., is_incremental=True, **kwargs)
```
Flow: `CachingDataFetcher.get_bars()` → `_fetch_bars(**kwargs)` → `DataFetcher.get_bars(is_incremental=True)` → `_fetch_remote(is_incremental=True)` → warning suppressed.
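Inside `_fetch_remote()`, the suppression can be as simple as gating the warning on the flag. A sketch; the logger setup and the `_client_fetch()` call are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def _fetch_remote(self, symbol, start, end, is_incremental: bool = False):
    df = self._client_fetch(symbol, start, end)  # hypothetical raw API call
    if df is None or df.empty:
        if not is_incremental:
            # Warn only on a full fetch; an empty incremental fetch just means
            # the cache is already current.
            logger.warning("No data returned from Alpaca for %s", symbol)
    return df
```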
Key Insights
- Historical data is immutable - Past candles never change, so there's no reason to re-fetch them
- Only the edge needs updating - New bars appear at the end of the series
- Time-based expiry is the wrong model - For mutable data (news, weather) a TTL makes sense; for historical OHLCV it's wasteful
- Completeness > freshness - Check if cache covers the requested date range, not how old the file is
Files Modified
`alpaca_trading/data/cache.py`:
- `get()`: `ttl_hours` now `Optional[int] = None` (no expiry by default)
`alpaca_trading/data/fetcher.py`:
- Removed `PERSISTED_TTL_HOURS` constant
- `_load_persisted()`: No TTL check
- `_save_persisted()`: Uses 10-year TTL (effectively infinite)
`alpaca_trading/data/caching_fetcher.py`:
- `DEFAULT_*_CACHE_EXPIRY_DAYS = None` (no expiry)
- `is_cache_valid()`: Just checks that the file exists
- `get_bars()`: Gap-filling logic added
- `get_cache_stats()`: Added `gap_fills` counter
Backward Compatibility
- Existing `.pkl` cache files work unchanged
- `cache_expiry_days` parameter still accepted but ignored
- Old caches are automatically upgraded (no migration needed)
References
- Skill: `selection-data-caching` - Original caching implementation (v2.5.1)
- Skill: `data-source-priority` - Data fetching hierarchy
- `alpaca_trading/data/caching_fetcher.py`: Gap-filling implementation