Efficient and Reusable DataFrame Queries

Querier module for lydata package.

This module provides the Q and C classes for creating and combining reusable queries to filter pandas.DataFrame objects. These classes are inspired by Django’s Q objects and allow for a more readable and modular way to filter and query data.

For example, we may want to keep only patients with tumors of T-category 3 or higher. Then, we can write

from lydata import C
has_t_stage = C("t_stage") >= 3

Now, through the comparison with an instance of C, has_t_stage is an instance of Q that can be combined with other queries and applied via our custom LyDataAccessor to a table:

is_old = C("age") >= 65
data.ly.query(has_t_stage & is_old)

Internally, this works by calling the Q.execute() method, which returns a boolean mask that is then used to filter the DataFrame. So, the above example is equivalent to

data.loc[(has_t_stage & is_old).execute(data)]
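
As a point of comparison, the mask that such a combined query stands for can be written in plain pandas. This is a minimal sketch on a made-up table with flat t_stage and age columns (an assumption; it does not use lydata, and real tables may be laid out differently):

```python
import pandas as pd

# Hypothetical flat table, used only for illustration.
data = pd.DataFrame({"t_stage": [1, 3, 4, 2], "age": [70, 50, 66, 80]})

# One boolean per row: True where both conditions hold.
mask = (data["t_stage"] >= 3) & (data["age"] >= 65)

# Indexing with the mask keeps only the rows marked True.
filtered = data.loc[mask]
```

The query objects described on this page defer exactly this kind of mask computation until a DataFrame is supplied.
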

class lydata.querier.CombineQMixin

Mixin class for combining queries.

Four operators are defined for combining queries:

  1. & for logical AND operations.

    The returned object is an AndQ instance and - when executed - returns a boolean mask where both queries are satisfied. When the right-hand side is None, the left-hand side query object is returned unchanged.

  2. | for logical OR operations.

    The returned object is an OrQ instance and - when executed - returns a boolean mask where either query is satisfied. When the right-hand side is None, the left-hand side query object is returned unchanged.

  3. ~ for inverting a query.

    The returned object is a NotQ instance and - when executed - returns a boolean mask where the query is not satisfied.

  4. == for checking if two queries are equal.

    Two queries are equal if their column names, operators, and values are equal. Note that this does not check if the queries are semantically equal, i.e., if they would return the same result when executed.
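
The combining behaviour described above could be sketched roughly as follows. All class names here (CombineSketch, EqualsSketch, AndSketch, OrSketch, NotSketch) are hypothetical stand-ins and not lydata's actual implementation; the sketch only shows how &, | and ~ can build a tree of objects that each know how to produce a mask, and how a right-hand side of None can be ignored:

```python
import pandas as pd


class CombineSketch:
    """Hypothetical mixin, not lydata's actual implementation."""

    def __and__(self, other):
        # None on the right-hand side leaves the query unchanged.
        return self if other is None else AndSketch(self, other)

    def __or__(self, other):
        return self if other is None else OrSketch(self, other)

    def __invert__(self):
        return NotSketch(self)


class EqualsSketch(CombineSketch):
    """Leaf query: column == value."""

    def __init__(self, column, value):
        self.column, self.value = column, value

    def execute(self, df):
        return df[self.column] == self.value


class AndSketch(CombineSketch):
    def __init__(self, q1, q2):
        self.q1, self.q2 = q1, q2

    def execute(self, df):
        return self.q1.execute(df) & self.q2.execute(df)


class OrSketch(CombineSketch):
    def __init__(self, q1, q2):
        self.q1, self.q2 = q1, q2

    def execute(self, df):
        return self.q1.execute(df) | self.q2.execute(df)


class NotSketch(CombineSketch):
    def __init__(self, q):
        self.q = q

    def execute(self, df):
        return ~self.q.execute(df)


df = pd.DataFrame({"a": [1, 2, 1], "b": ["x", "x", "y"]})
a_is_one = EqualsSketch("a", 1)
b_is_y = EqualsSketch("b", "y")

combined = a_is_one & ~b_is_y   # an AndSketch wrapping a NotSketch
mask = combined.execute(df)     # True only for the first row
unchanged = a_is_one & None     # the very same object as a_is_one
```

Because every combined object inherits the same mixin, the results can themselves be combined again, which is what makes arbitrarily nested queries possible.
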

class lydata.querier.Q(column: str, operator: Literal['==', '<', '<=', '>', '>=', '!=', 'in', 'contains'], value: Any)

Combinable query object for filtering a DataFrame.

The syntax for this object is similar to Django’s Q object. It can be used to define queries in a more readable and modular way.

Caution

The column names are not checked upon instantiation, only when the query is executed. In fact, a Q object knows nothing about the DataFrame it will be applied to until that point. On the flip side, this means a query may be reused for different DataFrames.
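
The consequence of this deferred check can be illustrated without lydata. In the sketch below, at_least is a hypothetical helper that plays the role of a query: it stores a column name and a threshold, and only looks the column up once a DataFrame is passed in. The cohort tables are made up.

```python
import pandas as pd


def at_least(column, threshold):
    """Hypothetical deferred query: returns a function that builds the mask."""

    def execute(df):
        # The column lookup happens here, not when at_least() is called.
        return df[column] >= threshold

    return execute


is_old = at_least("age", 65)  # no DataFrame involved yet

# The same deferred query is reused on two different tables.
cohort_a = pd.DataFrame({"age": [50, 70]})
cohort_b = pd.DataFrame({"age": [66, 64, 90]})
mask_a = is_old(cohort_a)
mask_b = is_old(cohort_b)

# A table without the column only fails at execution time.
try:
    is_old(pd.DataFrame({"years": [70]}))
    missing_column_detected = False
except KeyError:
    missing_column_detected = True
```

In this sketch the failure is pandas' own KeyError; which exception a Q object raises for an unknown column is not stated here.
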

The operator argument may be one of the following:

  • '==': Checks if column values are equal to the value.

  • '<': Checks if column values are less than the value.

  • '<=': Checks if column values are less than or equal to value.

  • '>': Checks if column values are greater than the value.

  • '>=': Checks if column values are greater than or equal to value.

  • '!=': Checks if column values are not equal to the value. This is equivalent to ~Q(column, '==', value).

  • 'in': Checks if column values are contained in the list passed as value. For this, pandas’ isin() method is used.

  • 'contains': Checks if column values contain the string value. Here, pandas’ str.contains() method is used.

  • 'pass_to': Passes the column values to the callable value. This is useful for custom filtering functions that may not be covered by the other operators.
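
One way to picture the operator list is as a lookup table from operator strings to pandas operations. The following is a sketch under that assumption; OPERATORS and apply_operator are hypothetical names, and lydata may dispatch differently:

```python
import operator

import pandas as pd

# Hypothetical dispatch table: operator string -> function(column, value).
OPERATORS = {
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "!=": operator.ne,
    "in": lambda col, value: col.isin(value),
    "contains": lambda col, value: col.str.contains(value),
    "pass_to": lambda col, func: func(col),
}


def apply_operator(df, column, op, value):
    """Return the boolean mask for one (column, operator, value) triple."""
    return OPERATORS[op](df[column], value)


df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["foo", "bar", "baz"]})

in_mask = apply_operator(df, "col1", "in", [1, 3])
contains_mask = apply_operator(df, "col2", "contains", "ba")
not_equal_mask = apply_operator(df, "col1", "!=", 2)
negated_equal_mask = ~apply_operator(df, "col1", "==", 2)
```

The last two masks are identical, which mirrors the note that '!=' is equivalent to negating '=='.
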

execute(df: DataFrame) → Series

Return a boolean mask where the query is satisfied for df.

>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['foo', 'bar', 'baz']})
>>> Q('col1', '<=', 2).execute(df)
0     True
1     True
2    False
Name: col1, dtype: bool
>>> Q('col2', 'contains', 'ba').execute(df)
0    False
1     True
2     True
Name: col2, dtype: bool
>>> Q('col1', 'pass_to', lambda x: x % 2 == 0).execute(df)
0    False
1     True
2    False
Name: col1, dtype: bool

class lydata.querier.NoneQ

Query object that always returns the entire DataFrame. Useful as default.

execute(df: DataFrame) → Series

Return a boolean mask with all entries set to True.
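
In plain pandas terms, an all-True mask is the neutral element for filtering and for AND combinations. A minimal sketch, not using lydata, with a made-up table:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

# A mask that is True for every row, aligned to the table's index.
all_true = pd.Series(True, index=df.index)

# Filtering with it returns every row.
unfiltered = df.loc[all_true]

# AND-ing it with another mask leaves that other mask unchanged.
narrowed = all_true & (df["col1"] > 1)
```

This is why such a query works well as a default: it can be passed anywhere a query is expected without changing the result.
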

class lydata.querier.AndQ(q1: CanExecute, q2: CanExecute)

Query object for combining two queries with a logical AND.

>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['foo', 'bar', 'baz']})
>>> q1 = Q('col1', '!=', 3)
>>> q2 = Q('col2', 'contains', 'ba')
>>> and_q = q1 & q2
>>> print(and_q)
(Q('col1', '!=', 3) & Q('col2', 'contains', 'ba'))
>>> isinstance(and_q, AndQ)
True
>>> and_q.execute(df)
0    False
1     True
2    False
dtype: bool
>>> all((q1 & None).execute(df) == q1.execute(df))
True

execute(df: DataFrame) → Series

Return a boolean mask where both queries are satisfied.

class lydata.querier.OrQ(q1: CanExecute, q2: CanExecute)

Query object for combining two queries with a logical OR.

>>> df = pd.DataFrame({'col1': [1, 2, 3]})
>>> q1 = Q('col1', '==', 1)
>>> q2 = Q('col1', '==', 3)
>>> or_q = q1 | q2
>>> print(or_q)
(Q('col1', '==', 1) | Q('col1', '==', 3))
>>> isinstance(or_q, OrQ)
True
>>> or_q.execute(df)
0     True
1    False
2     True
Name: col1, dtype: bool
>>> all((q1 | None).execute(df) == q1.execute(df))
True

execute(df: DataFrame) → Series

Return a boolean mask where either query is satisfied.

class lydata.querier.NotQ(q: CanExecute)

Query object for negating a query.

>>> df = pd.DataFrame({'col1': [1, 2, 3]})
>>> q = Q('col1', '==', 2)
>>> not_q = ~q
>>> print(not_q)
~Q('col1', '==', 2)
>>> isinstance(not_q, NotQ)
True
>>> not_q.execute(df)
0     True
1    False
2     True
Name: col1, dtype: bool
>>> print(~(Q('col1', '==', 2) & Q('col1', '!=', 3)))
~(Q('col1', '==', 2) & Q('col1', '!=', 3))

execute(df: DataFrame) → Series

Return a boolean mask where the query is not satisfied.

class lydata.querier.C(*column: str)

Wraps a column name and produces a Q object upon comparison.

This is basically a shorthand for creating a Q object that avoids writing the operator and value in quotes. Thus, it may be more readable and allows IDEs to provide better autocompletion.

Caution

Just like for the Q object, it is not checked upon instantiation whether the column name is valid. This is only done when the query is executed.
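
The mechanism behind "produces a Q object upon comparison" is ordinary Python operator overloading. The sketch below uses hypothetical ColumnSketch and QuerySketch classes to show the idea; it takes a single column name for simplicity, whereas C accepts *column, and it is not lydata's code:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class QuerySketch:
    """Hypothetical stand-in for a (column, operator, value) query."""

    column: str
    operator: str
    value: Any


class ColumnSketch:
    """Hypothetical column wrapper whose comparisons build queries."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __ge__(self, value: Any) -> QuerySketch:
        return QuerySketch(self.name, ">=", value)

    def __eq__(self, value: Any) -> QuerySketch:
        return QuerySketch(self.name, "==", value)

    def isin(self, value: list) -> QuerySketch:
        return QuerySketch(self.name, "in", value)


# The comparison does not return a bool; it returns a query description.
has_t_stage = ColumnSketch("t_stage") >= 3
is_male = ColumnSketch("sex") == "male"
```

Because the operator is written as Python syntax rather than as a string, typos such as '=>' are caught by the interpreter instead of at execution time.
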

isin(value: list[Any]) → Q

Create a query object for checking if the column values are in a list.

>>> C('foo').isin([1, 2, 3])
Q('foo', 'in', [1, 2, 3])

contains(value: str) → Q

Create a query object for checking if the column values contain a string.

>>> C('foo').contains('bar')
Q('foo', 'contains', 'bar')

pass_to(value: Callable[[Series], Series]) → Q

Create a query object that passes the column values to a callable.

This is useful for custom filtering functions that may not be covered by the other operators.

>>> C('foo').pass_to(lambda x: x > 42)
Q('foo', 'pass_to', <function <lambda> at ...>)
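
Per the signature above, the callable receives the whole column as a Series and should return a boolean Series of the same length. A named function is often easier to read and reuse than a lambda; the sketch below defines a hypothetical is_even helper and applies it directly to a column, independent of lydata:

```python
import pandas as pd


def is_even(values: pd.Series) -> pd.Series:
    """Hypothetical filter: True where the value is divisible by two."""
    return values % 2 == 0


df = pd.DataFrame({"col1": [1, 2, 3]})

# This is the mask a pass_to-style query would hand back for col1.
mask = is_even(df["col1"])
```
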