Performance
FrameRight is designed to add type safety and validation with minimal performance overhead. This page documents the performance characteristics measured by the benchmarks in tests/test_performance.py.
TL;DR
Memory overhead: 48 bytes per Schema instance — same whether wrapping 1,000 or 1,000,000 rows
Column access: Adds ~0.2 microseconds per access (negligible for typical workloads)
Construction without validation: Sub-millisecond (0.0003ms)
Construction with validation: 25-51ms for 100,000 rows depending on schema complexity
Scaling: Linear with data size — validation is O(n), column access is O(1)
Note
These measurements are per-instance overhead (the cost of wrapping a DataFrame). There is also a one-time cost to import the FrameRight module and its dependencies (Pandera, etc.), but this is amortized across all Schema instances in your program.
Detailed Benchmarks
All benchmarks were run on 100,000-row DataFrames unless otherwise noted.
Memory Overhead
FrameRight wraps DataFrames without copying data. Each Schema instance adds a constant 48 bytes of overhead regardless of the wrapped DataFrame size:
| DataFrame Size | Raw DataFrame | Schema Wrapper | Overhead |
|---|---|---|---|
| 1,000 rows | 31.5 KB | 31.5 KB | 48 bytes (0.15%) |
| 100,000 rows | 3.05 MB | 3.05 MB | 48 bytes (0.00%) |
| 1,000,000 rows | 30.5 MB | 30.5 MB | 48 bytes (0.00%) |
Conclusion: Per-instance memory overhead is negligible for all practical DataFrame sizes. The 48 bytes is the size of the Python wrapper object itself, measured using sys.getsizeof(). Because sys.getsizeof() is shallow, the figure excludes the wrapped DataFrame, which the wrapper only references. This is separate from the one-time cost of importing the module.
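A minimal sketch of why the figure stays constant, using a hypothetical `Wrapper` class and plain lists as stand-ins for a Schema and a DataFrame. Neither name comes from FrameRight, and the exact byte count depends on the Python version:

```python
import sys


class Wrapper:
    """Hypothetical stand-in for a Schema: keeps a reference, copies nothing."""

    def __init__(self, data):
        self.data = data


small_data = list(range(1_000))
large_data = list(range(1_000_000))
small = Wrapper(small_data)
large = Wrapper(large_data)

# sys.getsizeof reports only the wrapper object, not what it references,
# so both wrappers report the same size whatever they wrap.
print(sys.getsizeof(small), sys.getsizeof(large))

# The wrapped object is shared by reference, not duplicated.
print(large.data is large_data)
```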
Construction Time
Construction time depends on whether validation is enabled:
Without Validation
orders = OrderData(df, validate=False) # 0.0003 ms
Wrapping a DataFrame with validate=False is essentially free — it just stores a reference.
With Validation
Validation uses Pandera under the hood and scales linearly with data size:
| Rows | Simple Schema (3 cols) | Complex Schema (8 cols) |
|---|---|---|
| 1,000 | 2.5 ms | 5.1 ms |
| 10,000 | 2.8 ms | 5.6 ms |
| 100,000 | 13.1 ms | 50.8 ms |
Simple schema: 3 columns with basic type checks (int, float, str)
Complex schema: 8 columns with constraints (unique, ge, isin, nullable, etc.)
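Scaling: Approximately linear at larger sizes. From 1,000 to 10,000 rows the time barely changes (2.5 to 2.8 ms for the simple schema), which suggests a fixed per-call cost dominates there. For 100k rows: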
Simple schema: 13ms (~0.13 microseconds per row)
Complex schema: 51ms (~0.51 microseconds per row)
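The per-row figures are plain arithmetic on the table above. The sketch below reproduces them for every row count; `per_row_us` is a helper introduced here for illustration, not part of FrameRight:

```python
# Timings copied from the validation table: rows -> (simple ms, complex ms)
TIMINGS_MS = {
    1_000: (2.5, 5.1),
    10_000: (2.8, 5.6),
    100_000: (13.1, 50.8),
}


def per_row_us(total_ms: float, rows: int) -> float:
    """Convert a total time in milliseconds to microseconds per row."""
    return total_ms * 1_000 / rows


for rows, (simple_ms, complex_ms) in TIMINGS_MS.items():
    print(
        f"{rows:>7} rows: "
        f"simple {per_row_us(simple_ms, rows):.3f} us/row, "
        f"complex {per_row_us(complex_ms, rows):.3f} us/row"
    )
```

The per-row cost falls as the row count grows, which is what a fixed per-call cost plus a per-row cost would produce.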
Recommendation: For performance-critical code paths that process data in small batches, consider validating once at the entry point and using validate=False for intermediate operations.
Column Access Overhead
Column property access (e.g., orders.revenue) goes through Python’s descriptor protocol. Benchmarks show minimal overhead:
Single Column Access
# Raw DataFrame
df['revenue'] # 9.36 microseconds
# Schema property
orders.revenue # 9.59 microseconds (adds ~0.2 microseconds)
Multiple Column Access
# Raw DataFrame
df['price'], df['qty'], df['revenue'] # 27.86 microseconds
# Schema properties
orders.price, orders.qty, orders.revenue # 28.73 microseconds (adds ~0.9 microseconds)
Conclusion: Property access adds less than 1 microsecond of overhead. For typical data pipelines where operations take milliseconds or more, this is negligible.
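To illustrate where that overhead comes from, here is a minimal sketch of descriptor-based column access. It is not FrameRight's implementation: the `Column` descriptor and `Orders` class are hypothetical, and a dict of lists stands in for a DataFrame. Only the `fr_data` attribute name is borrowed from the examples on this page.

```python
import timeit


class Column:
    """Hypothetical descriptor: turns attribute access into a column lookup."""

    def __set_name__(self, owner, name):
        self.name = name  # the attribute name doubles as the column name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, not on an instance
        return instance.fr_data[self.name]


class Orders:
    revenue = Column()

    def __init__(self, data):
        self.fr_data = data  # reference only, no copy


data = {"revenue": [120.0, 80.0, 310.5]}  # stand-in for a DataFrame
orders = Orders(data)

# Attribute access returns the very same object as a direct lookup.
print(orders.revenue is data["revenue"])

# Rough per-access timings in microseconds; values vary by machine.
n = 200_000
direct = timeit.timeit(lambda: data["revenue"], number=n) / n * 1e6
via_descriptor = timeit.timeit(lambda: orders.revenue, number=n) / n * 1e6
print(f"direct: {direct:.3f} us, descriptor: {via_descriptor:.3f} us")
```

Because the descriptor hands back the underlying column object unchanged, any operation on it afterwards costs the same as on the raw data; the only extra work is the `__get__` call.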
Column Operations
Once you have a Series, operations run at native speed:
# Both take ~55 microseconds (no measurable difference)
df['revenue'].sum()
orders.revenue.sum()
The overhead is in accessing the column, not in operating on it.
Polars Backend Performance
Polars is significantly faster than Pandas for many operations. FrameRight adds the same minimal overhead:
Construction (100k rows)
# Polars construction with validation
orders = OrderData(pl_df) # 5.3 ms
This is ~5x faster than Pandas (25-51ms) for the same dataset, demonstrating that Polars’ performance benefits are preserved.
Column Access
# Raw Polars DataFrame
pl_df['revenue'] # 0.43 microseconds
# Schema property
orders.revenue # 0.64 microseconds (adds ~0.2 microseconds)
Polars column access is ~20x faster than Pandas, and FrameRight adds the same ~0.2 microsecond overhead.
Performance Best Practices
Validate at Boundaries
Validate data once when it enters your system, then use validate=False for internal operations:
# Entry point: validate thoroughly
def load_orders(path: str) -> OrderData:
    df = pd.read_csv(path)
    return OrderData(df, validate=True)  # Full validation
# Internal operations: skip validation
def process_orders(orders: OrderData) -> Revenue:
    filtered = OrderData(orders.fr_data[orders.revenue > 100], validate=False)
    # ... processing ...
    return Revenue(result, validate=False)
This gives you type safety and validation guarantees without paying the validation cost repeatedly.
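The same pattern can be tried without any DataFrame library. The sketch below uses a hypothetical `Checked` wrapper with a validation counter (none of these names exist in FrameRight) to show that a two-step pipeline validates only once:

```python
class Checked:
    """Hypothetical wrapper with an opt-out validation flag."""

    validations = 0  # class-level counter, for illustration only

    def __init__(self, rows, validate=True):
        if validate:
            Checked.validations += 1
            if not all(isinstance(r, float) for r in rows):
                raise TypeError("expected float values")
        self.rows = rows


def load(raw):
    # Boundary: untrusted input, so validate once here.
    return Checked([float(x) for x in raw], validate=True)


def keep_large(checked, threshold):
    # Internal step: the input was already validated, so skip the check.
    return Checked([r for r in checked.rows if r > threshold], validate=False)


orders = load(["120.0", "80", "310.5"])
big = keep_large(orders, 100.0)
print(big.rows, Checked.validations)
```

The trade-off is that internal steps are trusted to preserve the schema; a bug in one of them would go unnoticed until the next validated boundary.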
Use Type Coercion Strategically
Type coercion (coerce=True) adds overhead. Use it only when needed:
# Reading from CSV: types may not match
df = pd.read_csv("data.csv")
orders = OrderData(df, coerce=True) # Convert dtypes as needed
# Internal operations: types already correct
result = Revenue(computed_df, validate=False) # No coercion needed
Choose the Right Backend
For large datasets (100k+ rows), Polars offers significant performance improvements:
# Pandas: ~25ms construction, ~9 microseconds column access
import pandas as pd
from frameright.pandas import Schema
# Polars: ~5ms construction, ~0.4 microseconds column access
import polars as pl
from frameright.polars.eager import Schema
FrameRight makes switching backends trivial — the schema definition stays the same.
Use Lazy Evaluation for Complex Pipelines
For complex data pipelines, Polars’ lazy evaluation can provide significant speedups:
from frameright.polars.lazy import Schema, Col
import polars as pl
class OrderData(Schema):
    order_id: Col[int]
    revenue: Col[float]
# Build query plan (no execution yet)
lazy_orders = OrderData(pl.scan_csv("orders.csv"))
filtered = lazy_orders.fr_data.filter(lazy_orders.revenue > 100)
grouped = filtered.group_by('customer_id').agg(pl.col('revenue').sum())
# Execute optimized plan
result = grouped.collect()
Polars optimizes the entire query plan before execution (for example, by pushing filters down and reading only the columns a query needs), which often reduces the work done at collect() time.
Summary
FrameRight is designed to be a near-zero-cost abstraction:
Memory: Constant 48-byte overhead (negligible)
Column access: Adds ~0.2 microseconds (negligible compared to actual operations)
Validation: O(n) with data size, but can be controlled with validate=False
Operations: Run at native backend speed (pandas/polars/narwhals)
The type safety, IDE support, and validation features come with virtually no runtime cost for typical data pipelines where operations take milliseconds or more.
The performance tests are available in tests/test_performance.py and can be run with:
pytest tests/test_performance.py -v
This will show detailed timing and memory measurements on your system.
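For quick ad hoc checks outside the test suite, per-call timings of this kind can be taken with the standard library's `timeit`. This is a minimal sketch and not the project's test code; `time_call_us` is a helper name introduced here:

```python
import timeit


def time_call_us(func, number=100_000, repeat=5):
    """Best-of-`repeat` mean time per call, in microseconds."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number * 1e6


values = list(range(1_000))
elapsed = time_call_us(lambda: sum(values), number=10_000)
print(f"sum of 1,000 ints: {elapsed:.2f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes; to compare raw and wrapped access, pass `lambda: df['revenue']` and `lambda: orders.revenue` in turn.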