Performance

FrameRight is designed to add type safety and validation with minimal performance overhead. This page documents its measured performance characteristics.

TL;DR

  • Memory overhead: 48 bytes per Schema instance — same whether wrapping 1,000 or 1,000,000 rows

  • Column access: Adds ~0.2 microseconds per access (negligible for typical workloads)

  • Construction without validation: Sub-millisecond (0.0003ms)

  • Construction with validation: 25-51ms for 100,000 rows depending on schema complexity

  • Scaling: Validation is O(n) in the number of rows; column access is O(1)

Note

These measurements are per-instance overhead (the cost of wrapping a DataFrame). There is also a one-time cost to import the FrameRight module and its dependencies (Pandera, etc.), but this is amortized across all Schema instances in your program.

Detailed Benchmarks

All benchmarks were run on 100,000-row DataFrames unless otherwise noted.

Memory Overhead

FrameRight wraps DataFrames without copying data. Each Schema instance adds a constant 48 bytes of overhead regardless of the wrapped DataFrame size:

DataFrame Size    Raw DataFrame    Schema Wrapper    Overhead
1,000 rows        31.5 KB          31.5 KB           48 bytes (0.15%)
100,000 rows      3.05 MB          3.05 MB           48 bytes (0.00%)
1,000,000 rows    30.5 MB          30.5 MB           48 bytes (0.00%)

Conclusion: Per-instance memory overhead is negligible for all practical DataFrame sizes. The 48 bytes is the size of the Python wrapper object itself as reported by sys.getsizeof(). That is a shallow measurement: it does not follow references, so it excludes the wrapped DataFrame and any other objects the wrapper points to. This is separate from the one-time cost of importing the module.
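The constant overhead follows from storing a reference rather than a copy. The sketch below illustrates this with a hypothetical Wrapper class and plain lists standing in for DataFrames; it is not FrameRight's implementation, and the exact byte count printed depends on the Python version.

```python
import sys


class Wrapper:
    """Hypothetical stand-in for a schema wrapper: it only stores a reference."""

    def __init__(self, data):
        self.data = data  # reference only, no copy


small = list(range(1_000))
large = list(range(1_000_000))

# Shallow size of the wrapper object: identical for both payloads.
wrapper_small = sys.getsizeof(Wrapper(small))
wrapper_large = sys.getsizeof(Wrapper(large))

# The payloads themselves differ greatly in size.
payload_small = sys.getsizeof(small)
payload_large = sys.getsizeof(large)

print(wrapper_small, wrapper_large)
print(payload_small, payload_large)
```

Because sys.getsizeof() does not follow the reference held in data, the wrapper's reported size stays the same however large the payload grows.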

Construction Time

Construction time depends on whether validation is enabled:

Without Validation

orders = OrderData(df, validate=False)  # 0.0003 ms

Wrapping a DataFrame with validate=False is essentially free — it just stores a reference.

With Validation

Validation uses Pandera under the hood and scales linearly with data size:

Rows       Simple Schema (3 cols)    Complex Schema (8 cols)
1,000      2.5 ms                    5.1 ms
10,000     2.8 ms                    5.6 ms
100,000    13.1 ms                   50.8 ms

Simple schema: 3 columns with basic type checks (int, float, str)

Complex schema: 8 columns with constraints (unique, ge, isin, nullable, etc.)

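Scaling: Approximately linear for large inputs. Between 1,000 and 10,000 rows the time barely changes (2.5 ms to 2.8 ms for the simple schema), which suggests a fixed per-call cost dominates at small sizes. For 100k rows: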

  • Simple schema: 13ms (~0.13 microseconds per row)

  • Complex schema: 51ms (~0.51 microseconds per row)

Recommendation: For performance-critical code paths that process data in small batches, consider validating once at the entry point and using validate=False for intermediate operations.
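To make the recommendation concrete, the arithmetic below compares validating at every step with validating once. It uses the 100,000-row complex-schema figures from this page; the pipeline length of five steps is a hypothetical value chosen for illustration.

```python
# Figures taken from the benchmarks on this page (100,000 rows, complex schema).
VALIDATED_MS = 50.8      # construction with validation
UNVALIDATED_MS = 0.0003  # construction with validate=False


def validate_every_step(steps: int) -> float:
    """Total wrapping cost in ms when every step validates."""
    return steps * VALIDATED_MS


def validate_once(steps: int) -> float:
    """Total wrapping cost in ms when only the first step validates."""
    return VALIDATED_MS + (steps - 1) * UNVALIDATED_MS


steps = 5  # hypothetical pipeline length
print(f"every step: {validate_every_step(steps):.1f} ms")
print(f"once:       {validate_once(steps):.1f} ms")
```

The gap grows linearly with the number of intermediate wraps, which is why validating only at the entry point pays off most in long pipelines.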

Column Access Overhead

Column property access (e.g., orders.revenue) goes through Python’s descriptor protocol. Benchmarks show minimal overhead:

Single Column Access

# Raw DataFrame
df['revenue']           # 9.36 microseconds

# Schema property
orders.revenue          # 9.59 microseconds (adds ~0.2 microseconds)

Multiple Column Access

# Raw DataFrame
df['price'], df['qty'], df['revenue']    # 27.86 microseconds

# Schema properties
orders.price, orders.qty, orders.revenue # 28.73 microseconds (adds ~0.9 microseconds)

Conclusion: Property access adds less than 1 microsecond of overhead. For typical data pipelines where operations take milliseconds or more, this is negligible.
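The constant per-access cost is what you would expect from a descriptor that does one extra attribute lookup and then delegates to the backend. The following is a minimal sketch of that pattern, not FrameRight's actual code: the Col descriptor here is hypothetical, and a dict of lists stands in for a DataFrame so the example has no dependencies.

```python
class Col:
    """Hypothetical column descriptor; FrameRight's real implementation may differ."""

    def __set_name__(self, owner, name):
        self.name = name  # remember the attribute name as the column key

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, not an instance
        return instance.fr_data[self.name]  # one extra lookup, then delegate


class OrderData:
    revenue = Col()

    def __init__(self, data):
        self.fr_data = data  # store a reference, no copy


# A dict of lists stands in for a DataFrame to keep the sketch dependency-free.
table = {"revenue": [120.0, 80.0, 310.5]}
orders = OrderData(table)

# The property returns the very same object as direct indexing.
assert orders.revenue is table["revenue"]
print(sum(orders.revenue))
```

Since the descriptor hands back the backend's own column object, anything done with the result afterwards costs the same as it would on the raw data.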

Column Operations

Once you have a Series, operations run at native speed:

# Both take ~55 microseconds (no measurable difference)
df['revenue'].sum()
orders.revenue.sum()

The overhead is in accessing the column, not in operating on it.

Polars Backend Performance

Polars is significantly faster than Pandas for many operations. FrameRight adds the same minimal overhead:

Construction (100k rows)

# Polars construction with validation
orders = OrderData(pl_df)  # 5.3 ms

This is ~5x faster than Pandas (25-51ms) for the same dataset, demonstrating that Polars’ performance benefits are preserved.

Column Access

# Raw Polars DataFrame
pl_df['revenue']           # 0.43 microseconds

# Schema property
orders.revenue             # 0.64 microseconds (adds ~0.2 microseconds)

Polars column access is ~20x faster than Pandas, and FrameRight adds the same ~0.2 microsecond overhead.

Performance Best Practices

Validate at Boundaries

Validate data once when it enters your system, then use validate=False for internal operations:

# Entry point: validate thoroughly
def load_orders(path: str) -> OrderData:
    df = pd.read_csv(path)
    return OrderData(df, validate=True)  # Full validation

# Internal operations: skip validation
def process_orders(orders: OrderData) -> Revenue:
    filtered = OrderData(orders.fr_data[orders.revenue > 100], validate=False)
    # ... processing ...
    return Revenue(result, validate=False)

This gives you type safety and validation guarantees without paying the validation cost repeatedly.

Use Type Coercion Strategically

Type coercion (coerce=True) adds overhead. Use it only when needed:

# Reading from CSV: types may not match
df = pd.read_csv("data.csv")
orders = OrderData(df, coerce=True)  # Convert dtypes as needed

# Internal operations: types already correct
result = Revenue(computed_df, validate=False)  # No coercion needed

Choose the Right Backend

For large datasets (100k+ rows), Polars offers significant performance improvements:

# Pandas: ~25ms construction, ~9 microseconds column access
import pandas as pd
from frameright.pandas import Schema

# Polars: ~5ms construction, ~0.4 microseconds column access
import polars as pl
from frameright.polars.eager import Schema

FrameRight makes switching backends trivial — the schema definition stays the same.

Use Lazy Evaluation for Complex Pipelines

For complex data pipelines, Polars’ lazy evaluation can provide significant speedups:

from frameright.polars.lazy import Schema, Col
import polars as pl

class OrderData(Schema):
    order_id: Col[int]
    customer_id: Col[int]  # assumed integer key; used by group_by below
    revenue: Col[float]

# Build query plan (no execution yet)
lazy_orders = OrderData(pl.scan_csv("orders.csv"))
filtered = lazy_orders.fr_data.filter(lazy_orders.revenue > 100)
grouped = filtered.group_by('customer_id').agg(pl.col('revenue').sum())

# Execute optimized plan
result = grouped.collect()

Polars optimizes the entire query plan before execution, for example by pushing filters and column selection down into the CSV scan, so less data is read and processed.

Summary

FrameRight is designed as a near-zero-cost abstraction:

  • Memory: Constant 48-byte overhead (negligible)

  • Column access: Adds ~0.2 microseconds (negligible compared to actual operations)

  • Validation: O(n) in the number of rows; skip it with validate=False where data has already been validated

  • Operations: Run at native backend speed (pandas/polars/narwhals)

The type safety, IDE support, and validation features come with virtually no runtime cost for typical data pipelines where operations take milliseconds or more.

The performance tests are available in tests/test_performance.py and can be run with:

pytest tests/test_performance.py -v

This will show detailed timing and memory measurements on your system.
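For a quick check outside the test suite, a microbenchmark along the following lines can estimate property-access overhead. This is a sketch using only the standard library: the Wrapper class is a hypothetical stand-in for a schema class, and a dict stands in for a DataFrame, so the absolute numbers will not match the figures on this page.

```python
import timeit


class Wrapper:
    """Hypothetical stand-in for a schema class with one column property."""

    def __init__(self, data):
        self.fr_data = data

    @property
    def revenue(self):
        return self.fr_data["revenue"]


table = {"revenue": list(range(100_000))}  # stand-in for a DataFrame
orders = Wrapper(table)

N = 100_000  # calls per timing run


def per_call_us(func) -> float:
    """Best-of-5 time per call, in microseconds."""
    best = min(timeit.repeat(func, number=N, repeat=5))
    return best / N * 1e6


direct = per_call_us(lambda: table["revenue"])
wrapped = per_call_us(lambda: orders.revenue)

print(f"direct:  {direct:.3f} us per access")
print(f"wrapped: {wrapped:.3f} us per access")
print(f"added:   {wrapped - direct:.3f} us per access")
```

Taking the minimum over several repeats reduces noise from other processes; swapping in a real DataFrame and schema class gives numbers for your own setup.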