Backend Support
===============

FrameRight supports multiple DataFrame backends. You can choose your backend using:

1. **Backend-specific classes** (recommended for type safety)
2. **Base Schema class** (defaults to pandas, or specify ``backend`` parameter)

Supported Backends
------------------

* **Pandas** — Mature ecosystem, extensive third-party library support
* **Polars** — High-performance, Rust-based, with lazy evaluation
* **Narwhals** — Backend-agnostic DataFrame API

Backend Selection
-----------------

**Explicit Module Imports (Required)**

Import ``Schema`` and ``Col`` from backend-specific modules:

.. code-block:: python

    from frameright.pandas import Schema as PandasSchema, Col, Field
    from frameright.polars.eager import Schema as PolarsSchema, Col as PolarsCol, Field
    from frameright.polars.lazy import Schema as LazySchema, Col as LazyCol, Field

    # Pandas backend
    class OrdersPandas(PandasSchema):
        order_id: Col[int] = Field(unique=True)
        revenue: Col[float] = Field(ge=0)

    # Polars eager backend
    class OrdersPolars(PolarsSchema):
        order_id: PolarsCol[int] = Field(unique=True)
        revenue: PolarsCol[float] = Field(ge=0)

    # Polars lazy backend
    class OrdersLazy(LazySchema):
        order_id: LazyCol[int] = Field(unique=True)
        revenue: LazyCol[float] = Field(ge=0)

**Backend Auto-Detection**

Each backend module's ``Schema`` class is tied to its specific backend.
The underlying data type determines validation behavior (e.g., ``pl.DataFrame`` uses ``pandera.polars``):

.. code-block:: python

    from frameright.polars.eager import Schema, Col, Field
    from frameright.typing import Col

    class Orders(Schema):  # Defaults to pandas
        order_id: Col[int] = Field(unique=True)
        revenue: Col[float] = Field(ge=0)

    # Pandas backend (default)
    import pandas as pd
    pandas_df = pd.DataFrame({...})
    orders_pd = Orders(pandas_df)  # Uses pandas backend

    # Polars backend (explicit parameter)
    import polars as pl
    polars_df = pl.DataFrame({...})
    orders_pl = Orders(polars_df, backend="polars")  # Explicitly use polars

**Type Safety:** Explicit module imports like ``frameright.pandas``, ``frameright.polars.eager`` provide stronger
type guarantees and are recommended for production code.

Typing Notes
------------

FrameRight schemas are backend-agnostic, but you can opt into backend-specific typing for a better IDE experience:

* Pandas: ``from frameright.typing.pandas import Col`` (columns can type-check as ``pd.Series[T]`` with pandas stubs)
* Polars eager: ``from frameright.typing.polars_eager import Col`` (columns type-check as ``pl.Series``)
* Polars lazy: ``from frameright.typing.polars_lazy import Col`` (columns type-check as ``pl.Expr`` for expression chaining)

.. note::

    Polars and Narwhals do not currently expose fully generic ``Series[T]`` / ``Expr[T]`` types upstream.
    FrameRight's ``Col[T]`` is still valuable as a schema contract and for IDE autocomplete, but type checkers
    generally treat the runtime values as unparameterized ``pl.Series`` / ``pl.Expr`` / ``nw.Series`` today.

At runtime, the actual values you get depend on the backend:

* Pandas: properties return ``pd.Series``
* Polars eager ``pl.DataFrame``: properties return ``pl.Series``
* Polars lazy ``pl.LazyFrame``: properties return ``pl.Expr`` (lazy expressions)

Python 3.12+ Generic Syntax (PEP 695)
------------------------------------

Python 3.12 adds a new generic class syntax that avoids manual ``TypeVar`` boilerplate.
FrameRight works well with this style and still preserves backend inference:

.. code-block:: python

        from frameright import Schema, Field
        from frameright.typing import Col

        class Sales[T](Schema[T]):
                order_id: Col[int] = Field(unique=True)
                customer: Col[str]
                revenue: Col[float] = Field(ge=0)

Why both ``T``s?

* ``Sales[T]`` declares the generic parameter.
* ``Schema[T]`` forwards it to the base class so type checkers can infer
    the backend type from the constructor argument.

This is the shortest syntax that keeps full static typing without defaulting
to a specific backend or collapsing to ``Any``.

Pandas Backend
--------------

**Installation:**

.. code-block:: bash

    pip install frameright

Pandas comes as a default dependency.

**Features:**

* Full validation with ``pandera.pandas``
* Access to the entire Pandas ecosystem
* Familiar API for existing Pandas users
* Great for exploratory analysis and data science workflows

**Example:**

.. code-block:: python

    import pandas as pd

    df = pd.read_csv("data.csv")
    orders = Orders(df)

    # Use fr_data for backend-specific operations
    customer_totals = orders.fr_data.groupby(orders.customer_id).sum()

    # You can always access the underlying DataFrame directly
    print(orders.fr_data.columns)

Polars Backend
--------------

**Installation:**

.. code-block:: bash

    pip install frameright[polars]

**Why Polars?**

* **10-100x faster** than Pandas on large datasets (1M+ rows)
* **Parallel execution** — uses all CPU cores automatically
* **Lazy evaluation** — build optimized query plans
* **Memory efficient** — better memory layout and columnar processing
* **Modern API** — expressive, consistent, and type-safe

**Example:**

.. code-block:: python

    import polars as pl

    df = pl.read_csv("data.csv")
    orders = Orders(df)

    # Use fr_data for backend-specific operations
    customer_totals = orders.fr_data.group_by(orders.customer_id).sum()

    # You can always access the underlying DataFrame directly
    print(orders.fr_data.columns)

**Lazy Evaluation:**

Polars supports lazy evaluation for complex query optimization:

.. code-block:: python

    # LazyFrame is automatically handled
    lazy_df = pl.scan_csv("data.csv")
    orders = Orders(lazy_df)  # Schema works with LazyFrames too

    # Operations are lazy until you collect()
    filtered_df = orders.fr_data.filter(orders.revenue > 1000)
    # Execute the full query plan
    result = filtered_df.collect()

Backend-Agnostic Schemas
-------------------------

The key benefit: **write your schema once, use it with any backend**.

.. code-block:: python

    class SalesData(Schema):
        """Works with both Pandas and Polars."""
        date: Col[str]
        product: Col[str]
        revenue: Col[float] = Field(ge=0)
        quantity: Col[int] = Field(ge=1)

    # Use with Pandas during development
    dev_df = pd.read_csv("sample.csv")
    dev_data = SalesData(dev_df)

    # Switch to Polars in production for better performance
    prod_df = pl.read_csv("full_dataset.csv")
    prod_data = SalesData(prod_df)

This means you can:

* Prototype with Pandas (familiar, extensive library ecosystem)
* Scale with Polars (performance, parallelism, memory efficiency)
* Never rewrite your schema definitions

Validation with Pandera
------------------------

Both backends use **Pandera** for validation:

* Pandas backend uses ``pandera.pandas``
* Polars backend uses ``pandera.polars``

The validation logic is identical. Pandera automatically handles backend-specific validation:

.. code-block:: python

    class Validated(Schema):
        amount: Col[float] = Field(ge=0, le=1000)
        status: Col[str] = Field(isin=["active", "inactive"])

    # Pandera validates with pandas
    pandas_df = pd.DataFrame({...})
    data_pd = Validated(pandas_df)  # Uses pandera.pandas.DataFrameSchema

    # Pandera validates with polars
    polars_df = pl.DataFrame({...})
    data_pl = Validated(polars_df)  # Uses pandera.polars.DataFrameSchema

Backend-Specific Operations
----------------------------

For backend-specific operations, use ``fr_data`` to access the underlying DataFrame directly:

.. code-block:: python

    orders = Orders(df)  # Works with either backend

    # Pandas-specific
    if isinstance(orders.fr_data, pd.DataFrame):
        result = orders.fr_data.groupby(orders.customer_id).sum()

    # Polars-specific
    elif isinstance(orders.fr_data, pl.DataFrame):
        result = orders.fr_data.group_by(orders.customer_id).sum()

For backend-agnostic access, use ``fr_data`` which returns a `narwhals
<https://narwhals-dev.github.io/narwhals/>`__ wrapper with full IDE autocomplete:

.. code-block:: python

    import narwhals as nw

    # These work regardless of backend
    orders.fr_data              # Returns nw.DataFrame or nw.LazyFrame
    orders.fr_data.columns      # Column names (backend-agnostic)
    orders.fr_data.schema       # Narwhals schema
    orders.fr_data.to_native()  # Escape to native DataFrame (zero-copy)

Performance Comparison
----------------------

Rough performance guidelines (results vary by dataset and operation):

+-------------------------+-------------+-------------+
| Operation               | Pandas      | Polars      |
+=========================+=============+=============+
| Small datasets (<100K)  | Similar     | Similar     |
+-------------------------+-------------+-------------+
| Medium datasets (1M)    | 1x          | 5-20x       |
+-------------------------+-------------+-------------+
| Large datasets (10M+)   | 1x          | 10-100x     |
+-------------------------+-------------+-------------+
| Memory usage            | 1x          | 0.3-0.7x    |
+-------------------------+-------------+-------------+
| Parallel aggregations   | Single core | All cores   |
+-------------------------+-------------+-------------+

**When to use Pandas:**

* Exploratory data analysis with lots of interactivity
* Working with libraries that only support Pandas
* Small to medium datasets where performance isn't critical
* When you need the extensive Pandas ecosystem

**When to use Polars:**

* Large datasets (1M+ rows)
* Performance-critical production pipelines
* Memory-constrained environments
* When you can benefit from parallel execution

Migrating Between Backends
---------------------------

Switching backends requires minimal code changes:

.. code-block:: python

    # Before (Pandas)
    df = pd.read_csv("data.csv")
    orders = Orders(df)
    result = orders.fr_data.groupby(orders.customer_id).sum()

    # After (Polars)
    df = pl.read_csv("data.csv")
    orders = Orders(df)
    result = orders.fr_data.group_by(orders.customer_id).sum()  # Note: group_by vs groupby

The schema definition (``Orders``) stays exactly the same. Only the DataFrame creation and backend-specific method calls change.
For backend-agnostic code, use ``fr_data`` — the narwhals API is the same regardless of backend.


Adding a Backend (Advanced)
---------------------------

FrameRight's backend layer is a simple adapter interface implemented per DataFrame library.
Each backend module (``frameright.pandas``, ``frameright.polars.eager``, etc.) provides its own
``Schema`` class with a hardcoded backend adapter instance.

**No auto-detection or dispatch logic** — importing from a specific module gives you that backend.
This design is intentionally simple and fast:

.. code-block:: python

    from frameright.pandas import Schema       # _fr_backend = PandasBackend()
    from frameright.polars.eager import Schema # _fr_backend = PolarsEagerBackend()
    from frameright.polars.lazy import Schema  # _fr_backend = PolarsLazyBackend()

If you want to integrate another DataFrame implementation:

1. Implement a ``BackendAdapter`` (see ``frameright.backends.base.BackendAdapter``)
2. Create a new module with a ``Schema`` class that sets ``_fr_backend``

.. code-block:: python

    from frameright.backends.base import BackendAdapter
    from frameright.core import BaseSchema

    class MyBackend(BackendAdapter):
        # Implement required methods...
        pass

    class Schema(BaseSchema):
        _fr_backend = MyBackend()

    # Users import your Schema directly
    from mypackage import Schema


Notes on cuDF
-------------

cuDF is a natural candidate because its API is intentionally close to Pandas.
That said, there are two separate concerns:

* **DataFrame operations** (get/set columns, filtering, I/O, etc.): cuDF can often be supported with a
    fairly thin adapter because many method names mirror Pandas.
* **Runtime validation** (Pandera): Schema currently relies on Pandera's Pandas and Polars backends.
    If Pandera doesn't support cuDF validation in your environment, a cuDF adapter would either need to:

    - raise a clear ``NotImplementedError`` for ``fr_validate()``, or
    - validate by materialising to Pandas (acceptable for small/medium data, but defeats GPU benefits), or
    - provide an alternative validation implementation.

If your primary goal is "typed column access + autocomplete" in production analysis code, cuDF can still be
valuable even before full runtime validation is available — but it’s best treated as an *experimental* backend
until the validation story is nailed down.