Efficient and Reusable DataFrame Queries
Querier module for lydata package.
This module provides the Q and C classes for creating and
combining reusable queries to filter pandas.DataFrame objects. These
classes are inspired by Django’s Q objects and allow for a more readable and modular
way to filter and query data.
For example, we may want to keep only patients with tumors of T-category 3 or higher. Then, we can write:
from lydata import C
has_t_stage = C("t_stage") >= 3
Now, through the >= comparison applied to an instance of C,
has_t_stage is an instance of Q that can be combined with other queries
and applied to a table via our custom LyDataAccessor:
is_old = C("age") >= 65
data.ly.query(has_t_stage & is_old)
Internally, this works by calling the Q.execute() method, which returns a
boolean mask that is then used to filter the DataFrame. So, the mask applied in the above example is the one returned by
(has_t_stage & is_old).execute(data)
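As a rough illustration of what such a boolean mask looks like, the sketch below builds a comparable mask in plain pandas. The toy table and its flat column names t_stage and age are assumptions made for this example; real lydata tables may be structured differently, and this is not how lydata itself is implemented.

```python
import pandas as pd

# Hypothetical toy table; the flat column names are an assumption for illustration.
data = pd.DataFrame({
    "t_stage": [1, 3, 4, 2],
    "age": [70, 50, 66, 80],
})

# Each comparison yields a boolean Series (a mask) aligned with the rows.
has_t_stage_mask = data["t_stage"] >= 3
is_old_mask = data["age"] >= 65

# `&` combines the two masks element-wise.
combined_mask = has_t_stage_mask & is_old_mask

# Indexing with the mask keeps only the rows where it is True.
filtered = data[combined_mask]
print(filtered)
```

Only the row with T-category 4 and age 66 satisfies both conditions, so the filtered table has a single row.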
- class lydata.querier.CombineQMixin
Mixin class for combining queries.
Four operators are defined for combining queries:
& for logical AND operations. The returned object is an AndQ instance and, when executed, returns a boolean mask where both queries are satisfied. When the right-hand side is None, the left-hand side query object is returned unchanged.
| for logical OR operations. The returned object is an OrQ instance and, when executed, returns a boolean mask where either query is satisfied. When the right-hand side is None, the left-hand side query object is returned unchanged.
~ for inverting a query. The returned object is a NotQ instance and, when executed, returns a boolean mask where the query is not satisfied.
== for checking if two queries are equal. Two queries are equal if their column names, operators, and values are equal. Note that this does not check if the queries are semantically equal, i.e., if they would return the same result when executed.
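The operator behaviour listed above can be pictured with a small, dependency-free sketch. Everything in it (CombineMixin, Equals, And, Or, Not, and the use of a list of dicts instead of a DataFrame) is a hypothetical stand-in chosen for illustration, not lydata's actual code.

```python
from dataclasses import dataclass
from typing import Any


class CombineMixin:
    """Hypothetical mixin that adds `&`, `|` and `~` to query-like objects."""

    def __and__(self, other):
        # A right-hand side of None leaves the left-hand query unchanged.
        return self if other is None else And(self, other)

    def __or__(self, other):
        return self if other is None else Or(self, other)

    def __invert__(self):
        return Not(self)


@dataclass
class Equals(CombineMixin):
    """Leaf query; the dataclass-generated `==` compares column and value."""
    column: str
    value: Any

    def execute(self, rows):
        return [row[self.column] == self.value for row in rows]


@dataclass
class And(CombineMixin):
    left: Any
    right: Any

    def execute(self, rows):
        pairs = zip(self.left.execute(rows), self.right.execute(rows))
        return [a and b for a, b in pairs]


@dataclass
class Or(CombineMixin):
    left: Any
    right: Any

    def execute(self, rows):
        pairs = zip(self.left.execute(rows), self.right.execute(rows))
        return [a or b for a, b in pairs]


@dataclass
class Not(CombineMixin):
    query: Any

    def execute(self, rows):
        return [not matched for matched in self.query.execute(rows)]


rows = [{"col1": 1}, {"col1": 2}, {"col1": 3}]
is_one = Equals("col1", 1)
is_three = Equals("col1", 3)

print((is_one | is_three).execute(rows))  # [True, False, True]
print((~is_one).execute(rows))            # [False, True, True]
print((is_one & None) is is_one)          # True
print(Equals("col1", 1) == is_one)        # True (structural equality)
```

Note that the structural `==` says nothing about semantic equality: in this sketch, `~Equals("col1", 1)` and a hypothetical "not equal" query would produce the same mask but still compare as unequal.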
- class lydata.querier.Q(column: str, operator: Literal['==', '<', '<=', '>', '>=', '!=', 'in', 'contains'], value: Any)
Combinable query object for filtering a DataFrame.
The syntax for this object is similar to Django’s Q object. It can be used to define queries in a more readable and modular way.
Caution
The column names are not checked upon instantiation. This is only done when the query is executed. In fact, at instantiation the Q object does not even know which DataFrame it will be applied to. On the flip side, this means a query may be reused for different DataFrames.
The operator argument may be one of the following:
'==': Checks if column values are equal to value.
'<': Checks if column values are less than value.
'<=': Checks if column values are less than or equal to value.
'>': Checks if column values are greater than value.
'>=': Checks if column values are greater than or equal to value.
'!=': Checks if column values are not equal to value. This is equivalent to ~Q(column, '==', value).
'in': Checks if column values are in the list value. For this, pandas’ isin() method is used.
'contains': Checks if column values contain the string value. Here, pandas’ str.contains() method is used.
'pass_to': Passes the column values to the callable value. This is useful for custom filtering functions that may not be covered by the other operators.
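To make the operator list more tangible, here is a minimal sketch of how each operator string could map onto a plain pandas call. The OPERATORS table and the build_mask helper are hypothetical names introduced for this example; lydata's internals may look different.

```python
import operator

import pandas as pd

# Hypothetical lookup table from operator string to a function of (Series, value).
OPERATORS = {
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "!=": operator.ne,
    "in": lambda series, value: series.isin(value),
    "contains": lambda series, value: series.str.contains(value),
    "pass_to": lambda series, func: func(series),
}


def build_mask(df: pd.DataFrame, column: str, op: str, value) -> pd.Series:
    """Hypothetical helper: the column is looked up only here, at execution time."""
    return OPERATORS[op](df[column], value)


df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["foo", "bar", "baz"]})

print(build_mask(df, "col1", "in", [1, 3]).tolist())      # [True, False, True]
print(build_mask(df, "col2", "contains", "ba").tolist())  # [False, True, True]
print(build_mask(df, "col1", "pass_to", lambda s: s % 2 == 0).tolist())  # [False, True, False]
```

Because the column lookup happens inside build_mask, a misspelled column name only surfaces (as a KeyError in this sketch) once a table is supplied, which mirrors the caution above.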
- execute(df: DataFrame) → Series
Return a boolean mask where the query is satisfied for df.
>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['foo', 'bar', 'baz']})
>>> Q('col1', '<=', 2).execute(df)
0     True
1     True
2    False
Name: col1, dtype: bool
>>> Q('col2', 'contains', 'ba').execute(df)
0    False
1     True
2     True
Name: col2, dtype: bool
>>> Q('col1', 'pass_to', lambda x: x % 2 == 0).execute(df)
0    False
1     True
2    False
Name: col1, dtype: bool
- class lydata.querier.NoneQ
Query object that always returns the entire DataFrame. Useful as default.
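One way to picture such a default is as a mask that is True for every row. The sketch below uses plain pandas; select_all and filter_rows are hypothetical helpers written for this example and do not reflect how NoneQ is implemented.

```python
import pandas as pd


def select_all(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: a mask that keeps every row of `df`."""
    return pd.Series(True, index=df.index)


def filter_rows(df: pd.DataFrame, mask_fn=None) -> pd.DataFrame:
    """Hypothetical helper: fall back to `select_all` when no filter is given."""
    mask = (mask_fn or select_all)(df)
    return df[mask]


df = pd.DataFrame({"col1": [1, 2, 3]})

# Without a filter, the whole table comes back.
unfiltered = filter_rows(df)

# With a filter, only matching rows are kept.
only_large = filter_rows(df, lambda d: d["col1"] > 1)
```

Having a "match everything" default means calling code does not need a separate branch for the case where no filter was given.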
- class lydata.querier.AndQ(q1: CanExecute, q2: CanExecute)
Query object for combining two queries with a logical AND.
>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['foo', 'bar', 'baz']})
>>> q1 = Q('col1', '!=', 3)
>>> q2 = Q('col2', 'contains', 'ba')
>>> and_q = q1 & q2
>>> print(and_q)
(Q('col1', '!=', 3) & Q('col2', 'contains', 'ba'))
>>> isinstance(and_q, AndQ)
True
>>> and_q.execute(df)
0    False
1     True
2    False
dtype: bool
>>> all((q1 & None).execute(df) == q1.execute(df))
True
- class lydata.querier.OrQ(q1: CanExecute, q2: CanExecute)
Query object for combining two queries with a logical OR.
>>> df = pd.DataFrame({'col1': [1, 2, 3]})
>>> q1 = Q('col1', '==', 1)
>>> q2 = Q('col1', '==', 3)
>>> or_q = q1 | q2
>>> print(or_q)
(Q('col1', '==', 1) | Q('col1', '==', 3))
>>> isinstance(or_q, OrQ)
True
>>> or_q.execute(df)
0     True
1    False
2     True
Name: col1, dtype: bool
>>> all((q1 | None).execute(df) == q1.execute(df))
True
- class lydata.querier.NotQ(q: CanExecute)
Query object for negating a query.
>>> df = pd.DataFrame({'col1': [1, 2, 3]})
>>> q = Q('col1', '==', 2)
>>> not_q = ~q
>>> print(not_q)
~Q('col1', '==', 2)
>>> isinstance(not_q, NotQ)
True
>>> not_q.execute(df)
0     True
1    False
2     True
Name: col1, dtype: bool
>>> print(~(Q('col1', '==', 2) & Q('col1', '!=', 3)))
~(Q('col1', '==', 2) & Q('col1', '!=', 3))
- class lydata.querier.C(*column: str)
Wraps a column name and produces a Q object upon comparison.
This is basically a shorthand for creating a Q object that avoids writing the operator as a quoted string. Thus, it may be more readable and allows IDEs to provide better autocompletion.
Caution
Just like for the Q object, it is not checked upon instantiation whether the column name is valid. This is only done when the query is executed.
- isin(value: list[Any]) → Q
Create a query object for checking if the column values are in a list.
>>> C('foo').isin([1, 2, 3])
Q('foo', 'in', [1, 2, 3])
- contains(value: str) → Q
Create a query object for checking if the column values contain a string.
>>> C('foo').contains('bar')
Q('foo', 'contains', 'bar')
- pass_to(value: Callable[[Series], Series]) → Q
Create a query object that passes the column values to a callable.
This is useful for custom filtering functions that may not be covered by the other operators.
>>> C('foo').pass_to(lambda x: x > 42)
Q('foo', 'pass_to', <function <lambda> at ...>)
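The idea behind the C shorthand, a column wrapper whose comparisons and helper methods return query objects instead of booleans, can be sketched without any dependencies. The Column and Query classes below are hypothetical stand-ins for illustration only and are not lydata's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Query:
    """Hypothetical stand-in for a stored (column, operator, value) triple."""
    column: str
    operator: str
    value: Any


class Column:
    """Hypothetical column wrapper: comparisons build a Query, nothing is evaluated."""

    def __init__(self, name: str) -> None:
        # The name is only stored; no table exists yet to validate it against.
        self.name = name

    def __eq__(self, value: Any) -> Query:
        return Query(self.name, "==", value)

    def __ne__(self, value: Any) -> Query:
        return Query(self.name, "!=", value)

    def __lt__(self, value: Any) -> Query:
        return Query(self.name, "<", value)

    def __le__(self, value: Any) -> Query:
        return Query(self.name, "<=", value)

    def __gt__(self, value: Any) -> Query:
        return Query(self.name, ">", value)

    def __ge__(self, value: Any) -> Query:
        return Query(self.name, ">=", value)

    def isin(self, value: list) -> Query:
        return Query(self.name, "in", value)

    def contains(self, value: str) -> Query:
        return Query(self.name, "contains", value)

    def pass_to(self, value: Callable) -> Query:
        return Query(self.name, "pass_to", value)


is_old = Column("age") >= 65
print(is_old)                            # Query(column='age', operator='>=', value=65)
print(Column("foo").isin([1, 2, 3]))     # Query(column='foo', operator='in', value=[1, 2, 3])
```

Since the wrapper only records the column name, a typo such as `Column("aeg")` would go unnoticed until the resulting query is run against a table.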