Performance
===========

FrameRight is designed to add type safety and validation with minimal performance overhead. This page documents the performance characteristics based on comprehensive benchmarks.

TL;DR
-----

* **Memory overhead**: 48 bytes per Schema instance — same whether wrapping 1,000 or 1,000,000 rows
* **Column access**: Adds ~0.2 microseconds per access (negligible for typical workloads)
* **Construction without validation**: Sub-millisecond (0.0003ms)
* **Construction with validation**: 25-51ms for 100,000 rows depending on schema complexity
* **Scaling**: Linear with data size — validation is O(n), column access is O(1)

.. note::
   These measurements are per-instance overhead (the cost of wrapping a DataFrame). There is also a one-time cost to import the FrameRight module and its dependencies (Pandera, etc.), but this is amortized across all Schema instances in your program.

Detailed Benchmarks
-------------------

All benchmarks were run on 100,000-row DataFrames unless otherwise noted.

Memory Overhead
~~~~~~~~~~~~~~~

FrameRight wraps DataFrames without copying data. Each Schema instance adds a constant 48 bytes of overhead regardless of the wrapped DataFrame size:

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 20

   * - DataFrame Size
     - Raw DataFrame
     - Schema Wrapper
     - Overhead
   * - 1,000 rows
     - 31.5 KB
     - 31.5 KB
     - 48 bytes (0.15%)
   * - 100,000 rows
     - 3.05 MB
     - 3.05 MB
     - 48 bytes (0.00%)
   * - 1,000,000 rows
     - 30.5 MB
     - 30.5 MB
     - 48 bytes (0.00%)

**Conclusion**: Per-instance memory overhead is negligible for all practical DataFrame sizes. The 48 bytes is the cost of the Python wrapper object itself (__dict__, internal state, etc.), measured using ``sys.getsizeof()``. This is separate from the one-time cost of importing the module.

Construction Time
~~~~~~~~~~~~~~~~~

Construction time depends on whether validation is enabled:

Without Validation
^^^^^^^^^^^^^^^^^^

.. code-block:: python

    orders = OrderData(df, validate=False)  # 0.0003 ms

Wrapping a DataFrame with ``validate=False`` is essentially free — it just stores a reference.

With Validation
^^^^^^^^^^^^^^^

Validation uses Pandera under the hood and scales linearly with data size:

.. list-table::
   :header-rows: 1
   :widths: 25 25 25

   * - Rows
     - Simple Schema (3 cols)
     - Complex Schema (8 cols)
   * - 1,000
     - 2.5 ms
     - 5.1 ms
   * - 10,000
     - 2.8 ms
     - 5.6 ms
   * - 100,000
     - 13.1 ms
     - 50.8 ms

**Simple schema**: 3 columns with basic type checks (int, float, str)

**Complex schema**: 8 columns with constraints (unique, ge, isin, nullable, etc.)

**Scaling**: Approximately linear with data size. For 100k rows:

* Simple schema: 13ms (~0.13 microseconds per row)
* Complex schema: 51ms (~0.51 microseconds per row)

**Recommendation**: For performance-critical code paths that process data in small batches, consider validating once at the entry point and using ``validate=False`` for intermediate operations.

Column Access Overhead
~~~~~~~~~~~~~~~~~~~~~~

Column property access (e.g., ``orders.revenue``) goes through Python's descriptor protocol. Benchmarks show minimal overhead:

Single Column Access
^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Raw DataFrame
    df['revenue']           # 9.36 microseconds

    # Schema property
    orders.revenue          # 9.59 microseconds (adds ~0.2 microseconds)

Multiple Column Access
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Raw DataFrame
    df['price'], df['qty'], df['revenue']    # 27.86 microseconds

    # Schema properties
    orders.price, orders.qty, orders.revenue # 28.73 microseconds (adds ~0.9 microseconds)

**Conclusion**: Property access adds less than 1 microsecond of overhead. For typical data pipelines where operations take milliseconds or more, this is negligible.

Column Operations
^^^^^^^^^^^^^^^^^

Once you have a Series, operations run at native speed:

.. code-block:: python

    # Both take ~55 microseconds (no measurable difference)
    df['revenue'].sum()
    orders.revenue.sum()

The overhead is in accessing the column, not in operating on it.

Polars Backend Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Polars is significantly faster than Pandas for many operations. FrameRight adds the same minimal overhead:

Construction (100k rows)
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Polars construction with validation
    orders = OrderData(pl_df)  # 5.3 ms

This is ~5x faster than Pandas (25-51ms) for the same dataset, demonstrating that Polars' performance benefits are preserved.

Column Access
^^^^^^^^^^^^^

.. code-block:: python

    # Raw Polars DataFrame
    pl_df['revenue']           # 0.43 microseconds

    # Schema property
    orders.revenue             # 0.64 microseconds (adds ~0.2 microseconds)

Polars column access is ~20x faster than Pandas, and FrameRight adds the same ~0.2 microsecond overhead.

Performance Best Practices
---------------------------

Validate at Boundaries
~~~~~~~~~~~~~~~~~~~~~~

Validate data once when it enters your system, then use ``validate=False`` for internal operations:

.. code-block:: python

    # Entry point: validate thoroughly
    def load_orders(path: str) -> OrderData:
        df = pd.read_csv(path)
        return OrderData(df, validate=True)  # Full validation

    # Internal operations: skip validation
    def process_orders(orders: OrderData) -> Revenue:
        filtered = OrderData(orders.fr_data[orders.revenue > 100], validate=False)
        # ... processing ...
        return Revenue(result, validate=False)

This gives you type safety and validation guarantees without paying the validation cost repeatedly.

Use Type Coercion Strategically
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Type coercion (``coerce=True``) adds overhead. Use it only when needed:

.. code-block:: python

    # Reading from CSV: types may not match
    df = pd.read_csv("data.csv")
    orders = OrderData(df, coerce=True)  # Convert dtypes as needed

    # Internal operations: types already correct
    result = Revenue(computed_df, validate=False)  # No coercion needed

Choose the Right Backend
~~~~~~~~~~~~~~~~~~~~~~~~

For large datasets (100k+ rows), Polars offers significant performance improvements:

.. code-block:: python

    # Pandas: ~25ms construction, ~9 microseconds column access
    import pandas as pd
    from frameright.pandas import Schema

    # Polars: ~5ms construction, ~0.4 microseconds column access
    import polars as pl
    from frameright.polars.eager import Schema

FrameRight makes switching backends trivial — the schema definition stays the same.

Use Lazy Evaluation for Complex Pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For complex data pipelines, Polars' lazy evaluation can provide significant speedups:

.. code-block:: python

    from frameright.polars.lazy import Schema, Col
    import polars as pl

    class OrderData(Schema):
        order_id: Col[int]
        revenue: Col[float]

    # Build query plan (no execution yet)
    lazy_orders = OrderData(pl.scan_csv("orders.csv"))
    filtered = lazy_orders.fr_data.filter(lazy_orders.revenue > 100)
    grouped = filtered.group_by('customer_id').agg(pl.col('revenue').sum())

    # Execute optimized plan
    result = grouped.collect()

Polars optimizes the entire query plan before execution, often resulting in significant speedups.

Summary
-------

FrameRight is designed for **zero-cost abstraction** semantics:

* **Memory**: Constant 48-byte overhead (negligible)
* **Column access**: Adds ~0.2 microseconds (negligible compared to actual operations)
* **Validation**: O(n) with data size, but can be controlled with ``validate=False``
* **Operations**: Run at native backend speed (pandas/polars/narwhals)

The type safety, IDE support, and validation features come with virtually no runtime cost for typical data pipelines where operations take milliseconds or more.

**The performance tests are available in** ``tests/test_performance.py`` **and can be run with:**

.. code-block:: bash

    pytest tests/test_performance.py -v

This will show detailed timing and memory measurements on your system.