Custom Pandas Accessor

Module containing a custom accessor and helpers for querying lyDATA.

Because of the special three-level header of the lyDATA tables, it is sometimes cumbersome and lengthy to access the columns. While this is certainly necessary to access e.g. the contralateral involvement of LNL II as observed on CT images (df["CT", "contra", "II"]), for simple patient information such as age and HPV status, it is more convenient to use short names, which we implement in this module.

The main class in this module is the LyDataAccessor class, which provides these short names. With it, accessing the age of all patients is as easy as typing df.ly.age.
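
As a rough illustration of the mechanism behind df.ly, the sketch below registers a toy accessor through pandas' register_dataframe_accessor(). The accessor name demo, the class DemoAccessor, and the SHORT_NAMES mapping are hypothetical and not part of lydata; the actual LyDataAccessor may be implemented differently.

```python
import pandas as pd

# Hypothetical mapping from short names to three-level column keys.
SHORT_NAMES = {
    "age": ("patient", "#", "age"),
    "hpv": ("patient", "#", "hpv_status"),
}


@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Toy accessor that resolves short names to three-level columns."""

    def __init__(self, obj: pd.DataFrame) -> None:
        self._obj = obj

    def __getattr__(self, name: str) -> pd.Series:
        # Only called when normal attribute lookup fails.
        try:
            return self._obj[SHORT_NAMES[name]]
        except KeyError as exc:
            raise AttributeError(name) from exc


df = pd.DataFrame({
    ("patient", "#", "age"): [61, 52, 73],
    ("patient", "#", "hpv_status"): [True, False, True],
    ("CT", "contra", "II"): [False, True, False],
})

# Short access; same data as df["patient", "#", "age"].
ages = df.demo.age
```

Columns without a short name, such as ("CT", "contra", "II"), are still reached through the full three-level key.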

Beyond that, the module implements a convenient way to query the DataFrame: the Q object, which was inspired by Django’s Q object. It allows for more readable and modular queries, which can be combined with logical operators and reused across different DataFrames.

The Q objects can be passed to the LyDataAccessor.query() and LyDataAccessor.portion() methods to filter the DataFrame or compute the QueryPortion of rows that satisfy the query. Alternatively, each of these Q objects has a method called execute() that can be called with a DataFrame to get a boolean mask of the rows satisfying the query.

Further, we implement methods like combine(), infer_sublevels(), and infer_superlevels() to compute additional columns from the lyDATA tables. This is sometimes necessary because not every dataset contains all the columns one may need. E.g., some cohorts report detailed sublevel information (i.e., IIa and IIb), while others report only the superlevel (II). In such a case, one can simply call df.ly.infer_sublevels() to get the additional columns.

class lydata.accessor.CombineQMixin

Mixin class for combining queries.

Four operators are defined for combining queries:

& for logical AND operations.
The returned object is an AndQ instance and - when executed - returns a boolean mask where both queries are satisfied. When the right-hand side is None, the left-hand side query object is returned unchanged.
| for logical OR operations.
The returned object is an OrQ instance and - when executed - returns a boolean mask where either query is satisfied. When the right-hand side is None, the left-hand side query object is returned unchanged.
~ for inverting a query.
The returned object is a NotQ instance and - when executed - returns a boolean mask where the query is not satisfied.
== for checking if two queries are equal.
Two queries are equal if their column names, operators, and values are equal. Note that this does not check if the queries are semantically equal, i.e., if they would return the same result when executed.
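
To make the operator semantics concrete, here is a self-contained sketch of such a mixin. It works on lists of row dictionaries rather than DataFrames, and the class names (CombineMixin, Cond, And, Or, Not) are hypothetical stand-ins, not lydata's implementation.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Optional


class CombineMixin:
    """Adds &, | and ~ to any class that defines execute(rows)."""

    def __and__(self, other: Optional["CombineMixin"]) -> "CombineMixin":
        # None on the right-hand side leaves the query unchanged.
        return self if other is None else And(self, other)

    def __or__(self, other: Optional["CombineMixin"]) -> "CombineMixin":
        return self if other is None else Or(self, other)

    def __invert__(self) -> "CombineMixin":
        return Not(self)


@dataclass(frozen=True)
class Cond(CombineMixin):
    column: str
    compare: Callable[[Any, Any], bool]
    value: Any

    def execute(self, rows):
        return [self.compare(row[self.column], self.value) for row in rows]


@dataclass(frozen=True)
class And(CombineMixin):
    left: Any
    right: Any

    def execute(self, rows):
        pairs = zip(self.left.execute(rows), self.right.execute(rows))
        return [a and b for a, b in pairs]


@dataclass(frozen=True)
class Or(CombineMixin):
    left: Any
    right: Any

    def execute(self, rows):
        pairs = zip(self.left.execute(rows), self.right.execute(rows))
        return [a or b for a, b in pairs]


@dataclass(frozen=True)
class Not(CombineMixin):
    inner: Any

    def execute(self, rows):
        return [not flag for flag in self.inner.execute(rows)]


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
query = Cond("x", operator.gt, 1) & ~Cond("x", operator.lt, 3)
mask = query.execute(rows)
```

The dataclass-generated == compares column, comparison function, and value, which mirrors the structural (not semantic) equality described above.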

class lydata.accessor.Q(column: str, operator: Literal['==', '<', '<=', '>', '>=', '!=', 'in', 'contains'], value: Any)

Combinable query object for filtering a DataFrame.

The syntax for this object is similar to Django’s Q object. It can be used to define queries in a more readable and modular way.

Caution

The column names are not checked upon instantiation. This is only done when the query is executed. In fact, the Q object does not even know about the DataFrame it will be applied to in the beginning. On the flip side, this means a query may be reused for different DataFrames.

The operator argument may be one of the following:

'==': Checks if column values are equal to the value.
'<': Checks if column values are less than the value.
'<=': Checks if column values are less than or equal to the value.
'>': Checks if column values are greater than the value.
'>=': Checks if column values are greater than or equal to the value.
'!=': Checks if column values are not equal to the value. This is equivalent to ~Q(column, '==', value).
'in': Checks if column values are in the list given as value. For this, pandas’ isin() method is used.
'contains': Checks if column values contain the string value. Here, pandas’ str.contains() method is used.
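
One way to picture how an operator string selects a check is a dispatch table. The following sketch is an assumption for illustration only: OPERATORS and evaluate() are hypothetical names, and it operates on plain Python lists instead of pandas columns.

```python
import operator
from typing import Any

# Hypothetical dispatch table from operator strings to element-wise checks.
OPERATORS = {
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "!=": operator.ne,
    "in": lambda cell, values: cell in values,
    "contains": lambda cell, text: text in cell,
}


def evaluate(cells: list, op: str, value: Any) -> list:
    """Apply the check named by `op` to every cell and return a boolean list."""
    check = OPERATORS[op]
    return [check(cell, value) for cell in cells]


numbers_mask = evaluate([1, 2, 3], "<=", 2)
text_mask = evaluate(["foo", "bar", "baz"], "contains", "ba")
member_mask = evaluate([1, 2, 3], "in", [1, 3])
```

Note that the 'contains' check in this sketch is a plain substring test, whereas pandas' str.contains() interprets its pattern as a regular expression by default.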

Note

During initialization, a private attribute _column_map is set to the default column map returned by get_default_column_map(). This is used to convert short column names to long ones. If one feels adventurous, one may set this attribute to a custom column map containing additional or different short column names. This could also be achieved by subclassing Q. However, the attribute may change in the future without notice.

execute(df: DataFrame) → Series

Return a boolean mask where the query is satisfied for df.

>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['foo', 'bar', 'baz']})
>>> Q('col1', '<=', 2).execute(df)
0     True
1     True
2    False
Name: col1, dtype: bool
>>> Q('col2', 'contains', 'ba').execute(df)
0    False
1     True
2     True
Name: col2, dtype: bool

class lydata.accessor.AndQ

Query object for combining two queries with a logical AND.

>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['foo', 'bar', 'baz']})
>>> q1 = Q('col1', '!=', 3)
>>> q2 = Q('col2', 'contains', 'ba')
>>> and_q = q1 & q2
>>> print(and_q)
(Q('col1', '!=', 3) & Q('col2', 'contains', 'ba'))
>>> isinstance(and_q, AndQ)
True
>>> and_q.execute(df)
0    False
1     True
2    False
dtype: bool
>>> all((q1 & None).execute(df) == q1.execute(df))
True

execute(df: DataFrame) → Series: Return a boolean mask where both queries are satisfied.

class lydata.accessor.OrQ

Query object for combining two queries with a logical OR.

>>> df = pd.DataFrame({'col1': [1, 2, 3]})
>>> q1 = Q('col1', '==', 1)
>>> q2 = Q('col1', '==', 3)
>>> or_q = q1 | q2
>>> print(or_q)
(Q('col1', '==', 1) | Q('col1', '==', 3))
>>> isinstance(or_q, OrQ)
True
>>> or_q.execute(df)
0     True
1    False
2     True
Name: col1, dtype: bool
>>> all((q1 | None).execute(df) == q1.execute(df))
True

execute(df: DataFrame) → Series: Return a boolean mask where either query is satisfied.

class lydata.accessor.NotQ(q: Q | AndQ | OrQ | NotQ | None)

Query object for negating a query.

>>> df = pd.DataFrame({'col1': [1, 2, 3]})
>>> q = Q('col1', '==', 2)
>>> not_q = ~q
>>> print(not_q)
~Q('col1', '==', 2)
>>> isinstance(not_q, NotQ)
True
>>> not_q.execute(df)
0     True
1    False
2     True
Name: col1, dtype: bool
>>> print(~(Q('col1', '==', 2) & Q('col1', '!=', 3)))
~(Q('col1', '==', 2) & Q('col1', '!=', 3))

execute(df: DataFrame) → Series: Return a boolean mask where the query is not satisfied.

class lydata.accessor.NoneQ

Query object that always returns the entire DataFrame. Useful as default.

execute(df: DataFrame) → Series: Return a boolean mask with all entries set to True.

lydata.accessor.QTypes = lydata.accessor.Q | lydata.accessor.AndQ | lydata.accessor.OrQ | lydata.accessor.NotQ | None: Type for a query object or a combination of query objects.

class lydata.accessor.C(*column: str)

Wraps a column name and produces a Q object upon comparison.

This is basically a shorthand for creating a Q object that avoids writing the operator and value in quotes. Thus, it may be more readable and allows IDEs to provide better autocompletion.

Caution

Just like for the Q object, it is not checked upon instantiation whether the column name is valid. This is only done when the query is executed.

isin(value: list[Any]) → Q

Create a query object for checking if the column values are in a list.

>>> C('foo').isin([1, 2, 3])
Q('foo', 'in', [1, 2, 3])

contains(value: str) → Q

Create a query object for checking if the column values contain a string.

>>> C('foo').contains('bar')
Q('foo', 'contains', 'bar')
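
The idea of producing a query from a comparison can be sketched with Python's rich comparison methods. Query and Col below are hypothetical stand-ins for Q and C; unlike C, this toy version accepts only a single column name.

```python
from typing import Any, NamedTuple


class Query(NamedTuple):
    """Hypothetical stand-in for Q: just stores its three parts."""

    column: str
    operator: str
    value: Any


class Col:
    """Hypothetical stand-in for C: comparisons build Query objects."""

    def __init__(self, column: str) -> None:
        self.column = column

    def __eq__(self, value: Any) -> Query:  # type: ignore[override]
        return Query(self.column, "==", value)

    def __ne__(self, value: Any) -> Query:  # type: ignore[override]
        return Query(self.column, "!=", value)

    def __lt__(self, value: Any) -> Query:
        return Query(self.column, "<", value)

    def __le__(self, value: Any) -> Query:
        return Query(self.column, "<=", value)

    def __gt__(self, value: Any) -> Query:
        return Query(self.column, ">", value)

    def __ge__(self, value: Any) -> Query:
        return Query(self.column, ">=", value)

    def isin(self, value: list) -> Query:
        return Query(self.column, "in", value)

    def contains(self, value: str) -> Query:
        return Query(self.column, "contains", value)


older_than_60 = Col("age") > 60
is_hpv_positive = Col("hpv") == True  # noqa: E712 (comparison is intentional)
```

Because the comparison is ordinary Python syntax, neither the operator nor the column's role needs to be spelled out as a string.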

class lydata.accessor.QueryPortion(match: int, total: int)

Dataclass for storing the portion of a query.

property fail: int

Get the number of failures.

>>> QueryPortion(2, 5).fail
3

property ratio: float

Get the ratio of matches over the total.

>>> QueryPortion(2, 5).ratio
0.4

property percent: float

Get the percentage of matches over the total.

>>> QueryPortion(2, 5).percent
40.0

invert() → QueryPortion

Return the inverted portion.

>>> QueryPortion(2, 5).invert()
QueryPortion(match=3, total=5)
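
The four members documented above fit into a few lines. The following is a minimal sketch of a dataclass that reproduces the documented outputs; the name Portion is hypothetical and the actual QueryPortion may differ, for instance in how it treats total == 0, which this sketch does not handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    """Hypothetical re-implementation of the documented behaviour."""

    match: int
    total: int

    @property
    def fail(self) -> int:
        # Rows that were counted but did not match.
        return self.total - self.match

    @property
    def ratio(self) -> float:
        return self.match / self.total

    @property
    def percent(self) -> float:
        return 100 * self.match / self.total

    def invert(self) -> "Portion":
        # Swap the roles of matches and failures.
        return Portion(match=self.fail, total=self.total)


portion = Portion(2, 5)
inverted = portion.invert()
```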

lydata.accessor.align_diagnoses(dataset: DataFrame, modalities: list[str]) → list[DataFrame]: Stack aligned diagnosis tables in dataset for each of modalities.

class lydata.accessor.LyDataAccessor(obj: DataFrame)

Custom accessor for handling lymphatic involvement data.

This aims to provide an easy and user-friendly interface to the most commonly needed operations on the lymphatic involvement data we publish in the lydata project.

validate(modalities: list[str] | None = None) → DataFrame

Validate the DataFrame against the lydata schema.

The schema is constructed by the construct_schema() function from the provided modalities, or from those returned by get_default_modalities() if None is passed.

get_modalities(_filter: list[str] | None = None) → list[str]: Return the modalities present in this DataFrame.

Warning

This method assumes that all top-level columns are modalities, except for some predefined non-modality columns. For a custom dataset, this may not be correct. In that case, you should pass the list of columns that are not modalities to _filter.
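
The pitfall described in this warning can be illustrated without lydata. In the sketch below, find_modalities() and NON_MODALITY_COLUMNS are hypothetical, the set of predefined non-modality names is an assumption, and it is likewise assumed that a user-supplied filter replaces the predefined names rather than extending them.

```python
from typing import Iterable, Optional

# Assumed set of top-level names that are not diagnostic modalities.
NON_MODALITY_COLUMNS = {"patient", "tumor"}


def find_modalities(
    columns: Iterable[tuple],
    exclude: Optional[Iterable[str]] = None,
) -> list:
    """Collect top-level names that are not excluded, in first-seen order."""
    excluded = set(NON_MODALITY_COLUMNS if exclude is None else exclude)
    found = []
    for top_level, _, _ in columns:
        if top_level not in excluded and top_level not in found:
            found.append(top_level)
    return found


columns = [
    ("patient", "#", "age"),
    ("tumor", "1", "t_stage"),
    ("CT", "ipsi", "II"),
    ("MRI", "ipsi", "II"),
    ("CT", "contra", "II"),
    ("custom", "notes", "text"),  # extra column of a custom dataset
]

# Without a filter, the custom column is mistaken for a modality.
default_result = find_modalities(columns)
# Listing all non-modality columns explicitly avoids that.
filtered_result = find_modalities(columns, exclude=["patient", "tumor", "custom"])
```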

query(query: Q | AndQ | OrQ | NotQ | None = None) → DataFrame

Return a DataFrame with rows that satisfy the query.

A query is a Q object that can be combined with logical operators. See the documentation of the Q class for more information.

As a shorthand for creating these Q objects, you can use the C object as in the example below, where we query all entries where x is greater than 1 and not less than 3:

>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> df.ly.query((C('x') > 1) & ~(C('x') < 3))
   x
2  3

portion(query, given) → QueryPortion

Compute how many rows satisfy a query, given some other conditions.

This returns a QueryPortion object whose match is the number of rows satisfying both the query and the given condition, and whose total is the number of rows satisfying the given condition.

>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> df.ly.portion(query=C('x') ==  2, given=C('x') > 1)
QueryPortion(match=np.int64(1), total=np.int64(2))
>>> df.ly.portion(query=C('x') ==  2, given=C('x') > 3)
QueryPortion(match=np.int64(0), total=np.int64(0))
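
The same counts can be reproduced with plain boolean masks. This is a sketch under the assumption that match counts the rows fulfilling both conditions and total counts the rows fulfilling the given condition, which agrees with the outputs shown above; it uses pandas only and none of lydata's functions.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

query_mask = df["x"] == 2   # rows satisfying the query
given_mask = df["x"] > 1    # rows satisfying the condition

# Count rows where both hold, relative to rows where the condition holds.
match = int((query_mask & given_mask).sum())
total = int(given_mask.sum())
```

Converting the sums with int() yields plain Python integers instead of the np.int64 values visible in the output above.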

stats(agg_funcs: dict[str | tuple[str, str, str], Callable[[Series], Series]] | None = None, use_shortnames: bool = True, out_format: str = 'dict') → Any

Compute statistics.

The agg_funcs argument is a mapping of column names to functions that receive a pd.Series and return a pd.Series. The default is a useful selection of statistics for the most common columns. E.g., for the column ('patient', '#', 'age') (or its short column name age), the default function returns the value counts.

The use_shortnames argument determines whether the output should use the short column names or the long ones. The default is to use the short names.

With out_format one can specify the output format. Available options are those formats for which pandas has a to_<format> method.

>>> df = pd.DataFrame({
...     ('patient', '#', 'age'): [61, 52, 73, 61],
...     ('patient', '#', 'hpv_status'): [True, False, None, True],
...     ('tumor', '1', 't_stage'): [2, 3, 1, 2],
... })
>>> df.ly.stats()   
{'age': {61: 2, 52: 1, 73: 1},
 'hpv': {True: 2, False: 1, None: 1},
 't_stage': {2: 2, 3: 1, 1: 1}}
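
The role of agg_funcs, a mapping from column names to aggregation functions, can be sketched with the standard library alone. Everything below is an illustrative assumption: the table is a plain dictionary of lists, and the functions return dictionaries instead of pd.Series objects.

```python
from collections import Counter

# Toy table: short column name -> list of values (no pandas involved).
table = {
    "age": [61, 52, 73, 61],
    "t_stage": [2, 3, 1, 2],
}


def value_counts(values: list) -> dict:
    """Count how often each value occurs."""
    return dict(Counter(values))


def value_range(values: list) -> dict:
    """A custom aggregation: smallest and largest value."""
    return {"min": min(values), "max": max(values)}


# One aggregation function per column, analogous to agg_funcs.
agg_funcs = {"age": value_range, "t_stage": value_counts}
stats = {column: func(table[column]) for column, func in agg_funcs.items()}
```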

combine(modalities: dict[str, ModalityConfig] | None = None, method: Literal['max_llh', 'rank'] = 'max_llh') → DataFrame

Combine diagnoses of modalities using method.

The order of the provided modalities does not matter, as it is aligned with the order in the DataFrame. With method="max_llh", the most likely true state of involvement is inferred based on all available diagnoses for each patient and level. With method="rank", only the most trustworthy diagnosis is chosen for each patient and level based on the sensitivity and specificity of the given list of modalities.

The result contains only the combined columns. The intended use is to update() the original DataFrame with the result.

>>> df = pd.DataFrame({
...     ('CT'       , 'ipsi', 'I'): [False, True , False,  True, None],
...     ('MRI'      , 'ipsi', 'I'): [False, True , True ,  None, None],
...     ('pathology', 'ipsi', 'I'): [True , None ,  None, False, None],
... })
>>> df.ly.combine()   
     ipsi
        I
0    True
1    True
2   False
3   False
4    None
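
One plausible reading of the "max_llh" method is a comparison of two likelihoods per patient and level: that of the observed diagnoses if the level is truly healthy, and that if it is truly involved. The sketch below follows this reading. The function max_llh(), the MODALITIES table, and its specificity and sensitivity values are illustrative assumptions, and the real method may differ, e.g. in how it resolves ties.

```python
from typing import Optional

# Illustrative (specificity, sensitivity) pairs chosen for this sketch.
MODALITIES = {
    "CT": (0.8, 0.8),
    "MRI": (0.7, 0.8),
    "pathology": (1.0, 1.0),
}


def max_llh(observations: dict) -> Optional[bool]:
    """Return the more likely true state given all non-missing diagnoses."""
    llh_healthy, llh_involved = 1.0, 1.0
    seen_any = False
    for modality, observed in observations.items():
        if observed is None:
            continue  # missing diagnoses carry no information
        seen_any = True
        spec, sens = MODALITIES[modality]
        # P(observation | healthy) and P(observation | involved)
        llh_healthy *= (1 - spec) if observed else spec
        llh_involved *= sens if observed else (1 - sens)
    if not seen_any:
        return None
    return llh_involved > llh_healthy


patients = [
    {"CT": False, "MRI": False, "pathology": True},
    {"CT": True, "MRI": True, "pathology": None},
    {"CT": False, "MRI": True, "pathology": None},
    {"CT": True, "MRI": None, "pathology": False},
    {"CT": None, "MRI": None, "pathology": None},
]
combined = [max_llh(patient) for patient in patients]
```

With these assumed values, a pathology finding overrides imaging because its specificity and sensitivity of 1.0 drive the competing likelihood to zero.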

infer_sublevels(modalities: list[str] | None = None, sides: list[Literal['ipsi', 'contra']] | None = None, subdivisions: dict[str, list[str]] | None = None) → DataFrame

Determine involvement status of an LNL’s sublevels (e.g., IIa and IIb).

Some LNLs have sublevels, e.g., IIa and IIb. The involvement of these sublevels is not always reported, but only the superlevel’s status. This function infers the status of the sublevels from the superlevel.

The sublevel’s status is computed for the specified modalities. Whether a superlevel has sublevels, and which ones, is specified in subdivisions. The default subdivisions argument looks like this:

{
    "I": ["a", "b"],
    "II": ["a", "b"],
    "V": ["a", "b"],
}

The resulting DataFrame will only contain the newly inferred sublevel columns and only for those sublevels that were not already present in the DataFrame. Thus, one can simply join() the original DataFrame with the result.

>>> df = pd.DataFrame({
...     ('MRI', 'ipsi'  , 'I' ): [True , False, False, None],
...     ('MRI', 'contra', 'I' ): [False, True , False, None],
...     ('MRI', 'ipsi'  , 'II'): [False, False, True , None],
...     ('MRI', 'ipsi'  , 'IV'): [False, False, True , None],
...     ('CT' , 'ipsi'  , 'I' ): [True , False, False, None],
... })
>>> df.ly.infer_sublevels(modalities=["MRI"])   
     MRI
    ipsi                      contra
      Ia     Ib    IIa    IIb     Ia     Ib
0   None   None  False  False  False  False
1  False  False  False  False   None   None
2  False  False   None   None  False  False
3   None   None   None   None   None   None
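
The output above is consistent with a simple rule: a healthy superlevel implies healthy sublevels, while an involved or unknown superlevel leaves the sublevels unknown. The sketch below encodes that rule for a single column; sublevel_status() and infer_columns() are hypothetical helpers, not lydata functions.

```python
from typing import Optional


def sublevel_status(superlevel: Optional[bool]) -> Optional[bool]:
    """Only a healthy superlevel determines the state of its sublevels."""
    return False if superlevel is False else None


def infer_columns(name: str, values: list, suffixes: list) -> dict:
    """Build one inferred column per sublevel of the superlevel `name`."""
    inferred = [sublevel_status(value) for value in values]
    return {name + suffix: list(inferred) for suffix in suffixes}


# Ipsilateral level I as in the MRI example above.
new_columns = infer_columns("I", [True, False, False, None], ["a", "b"])
```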

infer_superlevels(modalities: list[str] | None = None, sides: list[Literal['ipsi', 'contra']] | None = None, subdivisions: dict[str, list[str]] | None = None) → DataFrame

Determine involvement status of an LNL’s superlevel (e.g., II).

Some LNLs have sublevels, e.g., IIa and IIb. In real data, sometimes the sublevels are reported, sometimes only the superlevel. This function infers the status of the superlevel from the sublevels.

The superlevel’s status is computed for the specified modalities. Whether a superlevel has sublevels, and which ones, is specified in subdivisions.

The resulting DataFrame will only contain the newly inferred superlevel columns and only for those superlevels that were not already present in the DataFrame. This way, it is straightforward to join() it with the original DataFrame.

>>> df = pd.DataFrame({
...     ('MRI', 'ipsi'  , 'Ia' ): [True , False, False, None],
...     ('MRI', 'ipsi'  , 'Ib' ): [False, True , False, None],
...     ('MRI', 'contra', 'IIa'): [False, False, None , None],
...     ('MRI', 'contra', 'IIb'): [False, True , True , None],
...     ('CT' , 'ipsi'  , 'I'  ): [True , False, False, None],
... })
>>> df.ly.infer_superlevels(modalities=["MRI"]) 
     MRI
    ipsi contra
       I     II
0   True  False
1   True   True
2  False   True
3   None   None
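
The output above matches the following rule: the superlevel is involved if any sublevel is involved, healthy if all sublevels are healthy, and unknown otherwise. The sketch below implements it for one row at a time. The helper superlevel_status() is hypothetical, and treating a mix of healthy and unknown sublevels as unknown is an assumption that the example does not cover.

```python
from typing import Optional


def superlevel_status(sublevels: list) -> Optional[bool]:
    """Infer a superlevel's state from the states of its sublevels."""
    if any(state is True for state in sublevels):
        return True   # one involved sublevel suffices
    if all(state is False for state in sublevels):
        return False  # every sublevel is known to be healthy
    return None       # not enough information


# Contralateral IIa and IIb as in the MRI example above.
level_iia = [False, False, None, None]
level_iib = [False, True, True, None]
level_ii = [superlevel_status([a, b]) for a, b in zip(level_iia, level_iib)]
```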
