Custom Pandas Accessor

Module containing a custom accessor for interacting with lyDATA tables.

Because of the special three-level header of the lyDATA tables, it is sometimes cumbersome and lengthy to access the columns. While this is certainly necessary to access e.g. the contralateral involvement of LNL II as observed on CT images (df["CT", "contra", "II"]), for simple patient information such as age and HPV status, it is more convenient to use short names, which we implement in this module.

The main class in this module is the LyDataAccessor class, which provides the above-mentioned functionality. With it, accessing the age of all patients is as easy as typing df.ly.age.

Beyond that, we implement methods like query() for filtering the DataFrame using reusable query objects (see the lydata.querier module for more information), stats() for computing common statistics that we use in our LyProX web app, and combine() for combining diagnoses from different modalities into a single column.
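
To make the short-name idea concrete, here is a minimal sketch of how a pandas accessor can map short names onto three-level column keys. It relies only on pandas' public register_dataframe_accessor hook. The accessor name demo, the DemoAccessor class, and the SHORT_NAMES table are assumptions made for illustration; the actual LyDataAccessor may be implemented quite differently.

```python
import pandas as pd

# Hypothetical short-name table; the real mapping lives in lydata and may differ.
SHORT_NAMES = {
    "age": ("patient", "core", "age"),
    "hpv": ("patient", "core", "hpv_status"),
}


@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Toy accessor, registered as ``df.demo`` so it cannot clash with ``df.ly``."""

    def __init__(self, obj: pd.DataFrame) -> None:
        self._obj = obj

    def __getattr__(self, name: str) -> pd.Series:
        # Only reached when normal attribute lookup fails.
        try:
            return self._obj[SHORT_NAMES[name]]
        except KeyError as exc:
            raise AttributeError(name) from exc


df = pd.DataFrame({
    ("patient", "core", "age"): [61, 52],
    ("patient", "core", "hpv_status"): [True, False],
    ("CT", "contra", "II"): [False, True],
})

ages = df.demo.age                       # same column as df["patient", "core", "age"]
ct_contra_ii = df["CT", "contra", "II"]  # the full three-level key still works
```

Short names cover the frequent cases, while less common columns remain reachable through the full three-level key.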

class lydata.accessor.QueryPortion(match: int, total: int)

Dataclass for storing the portion of a query.

An instance of this is returned by the LyDataAccessor.portion() method.

property fail: int

Get the number of failures.

>>> QueryPortion(2, 5).fail
3

property ratio: float

Get the ratio of matches over the total.

>>> QueryPortion(2, 5).ratio
0.4

property percent: float

Get the percentage of matches over the total.

>>> QueryPortion(2, 5).percent
40.0

invert() → QueryPortion

Return the inverted portion.

>>> QueryPortion(2, 5).invert()
QueryPortion(match=3, total=5)
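
The arithmetic behind these properties can be summarized in a small stand-in dataclass. This is a sketch written to match the doctest outputs above, not the library's source; the name PortionSketch is made up, and the behaviour for total == 0 is an assumption, since it is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortionSketch:
    """Stand-in for QueryPortion; not the library's source code."""

    match: int
    total: int

    @property
    def fail(self) -> int:
        # Rows that were counted in `total` but did not match.
        return self.total - self.match

    @property
    def ratio(self) -> float:
        # Raises ZeroDivisionError when total is 0 (assumed, undocumented case).
        return self.match / self.total

    @property
    def percent(self) -> float:
        return 100 * self.match / self.total

    def invert(self) -> "PortionSketch":
        # Swap the roles of matches and failures; the total stays the same.
        return PortionSketch(match=self.fail, total=self.total)


portion = PortionSketch(2, 5)
inverted = portion.invert()
```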

class lydata.accessor.LyDataAccessor(obj: DataFrame)

Custom accessor for handling lymphatic involvement data.

This aims to provide an easy and user-friendly interface to the most commonly needed operations on the lymphatic involvement data we publish in the lydata project.

validate(modalities: list[str] | None = None) → DataFrame

Validate the DataFrame against the lydata schema.

get_modalities(ignore_cols: list[str] | None = None) → list[str]

Return the modalities present in this DataFrame.

Warning

This method assumes that all top-level columns are modalities, except for some predefined non-modality columns. For a custom dataset, this may not be correct. In that case, pass the columns that are not modalities as ignore_cols.

get_tnm() → DataFrame

Return the T, N, and M stages with all prefixes and suffixes.

This info will be collected in three separate columns “T”, “N”, and “M”.

>>> df = pd.DataFrame({
...     ('tumor', 'core', 't_stage_prefix'):   ['c', 'p'],
...     ('tumor', 'core', 't_stage'):          [2  ,  3 ],
...     ('tumor', 'core', 't_stage_suffix'):   ['a', 'b'],
...     ('patient', 'core', 'n_stage'):        [1  ,  2 ],
...     ('patient', 'core', 'n_stage_suffix'): ['a', 'b'],
...     ('patient', 'core', 'm_stage'):        [0  ,  1 ],
... })
>>> df.ly.get_tnm()   
   T    N   M
0  c2a  1a  0
1  p3b  2b  1

query(query: CanExecute | None = None) → DataFrame

Return a DataFrame with rows that satisfy the query.

A query is a Q object that can be combined with logical operators. See the documentation of the Q class for more information.

As a shorthand for creating these Q objects, you can use the C object as in the example below, where we query all entries where x is greater than 1 and not less than 3:

>>> from lydata import C
>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> df.ly.query((C('x') > 1) & ~(C('x') < 3))
   x
2  3
>>> df.ly.query(C('x').isin([1, 3]))
   x
0  1
2  3
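
The way such query objects compose through & and ~ can be illustrated without pandas. The class below is a hypothetical stand-in that filters a list of dictionaries; it is not the Q or C class from lydata, and its execute() method is named only by analogy with the CanExecute type in the signature above.

```python
from typing import Callable

Row = dict[str, int]


class Cond:
    """Hypothetical combinable condition; not lydata's Q or C class."""

    def __init__(self, test: Callable[[Row], bool]) -> None:
        self.test = test

    def __and__(self, other: "Cond") -> "Cond":
        return Cond(lambda row: self.test(row) and other.test(row))

    def __or__(self, other: "Cond") -> "Cond":
        return Cond(lambda row: self.test(row) or other.test(row))

    def __invert__(self) -> "Cond":
        return Cond(lambda row: not self.test(row))

    def execute(self, rows: list[Row]) -> list[Row]:
        # Keep only the rows for which the (possibly combined) test holds.
        return [row for row in rows if self.test(row)]


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
greater_than_1 = Cond(lambda row: row["x"] > 1)
less_than_3 = Cond(lambda row: row["x"] < 3)

# Same logic as the doctest: x > 1 and not x < 3.
selected = (greater_than_1 & ~less_than_3).execute(rows)
```

Because each operator returns a new condition object, small conditions can be defined once and reused in many combinations.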

portion(query: CanExecute | None = None, given: CanExecute | None = None) → QueryPortion

Compute how many rows satisfy a query, given some other conditions.

This returns a QueryPortion object holding the number of rows that satisfy both query and given (match) and the number of rows that satisfy given alone (total).

>>> from lydata import C
>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> df.ly.portion(query=C('x') == 2, given=C('x') > 1)
QueryPortion(match=np.int64(1), total=np.int64(2))
>>> df.ly.portion(query=C('x') == 2, given=C('x') > 3)
QueryPortion(match=np.int64(0), total=np.int64(0))
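
In terms of plain boolean masks, the two counts can be reproduced as follows. This is a sketch of the counting logic only, under the assumption that match counts rows satisfying both conditions; it does not use lydata and may differ from how portion() is implemented.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

given_mask = df["x"] > 1    # rows satisfying the `given` condition
query_mask = df["x"] == 2   # rows satisfying the `query` condition

# Summing a boolean mask counts its True entries; int() converts the
# NumPy integer that pandas returns into a plain Python int.
total = int(given_mask.sum())
match = int((query_mask & given_mask).sum())
```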

stats(agg_funcs: dict[str | tuple[str, str, str], Callable[[Series], Series]] | None = None, use_shortnames: bool = True, out_format: str = 'dict') → Any

Compute statistics.

The agg_funcs argument is a mapping of column names to functions that receive a pd.Series and return a pd.Series. The default is a useful selection of statistics for the most common columns. E.g., for the column ('patient', 'core', 'age') (or its short column name age), the default function returns the value counts.

The use_shortnames argument determines whether the output should use the short column names or the long ones. The default is to use the short names.

With out_format one can specify the output format. Available options are those formats for which pandas has a to_<format> method.

>>> df = pd.DataFrame({
...     ('patient', '#', 'age'): [61, 52, 73, 61],
...     ('patient', '#', 'hpv_status'): [True, False, None, True],
...     ('tumor', '1', 't_stage'): [2, 3, 1, 2],
... })
>>> df.ly.stats()   
{'age': {61: 2, 52: 1, 73: 1},
 'hpv': {True: 2, False: 1, None: 1},
 't_stage': {2: 2, 3: 1, 1: 1}}
>>> df = pd.DataFrame({
...     ('patient', 'core', 'age'): [61, 52, 73, 61],
...     ('patient', 'core', 'hpv_status'): [True, False, None, True],
...     ('tumor', 'core', 't_stage'): [2, 3, 1, 2],
... })
>>> df.ly.stats()   
{'age': {61: 2, 52: 1, 73: 1},
 'hpv': {True: 2, False: 1, None: 1},
 't_stage': {2: 2, 3: 1, 1: 1}}
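
The idea of mapping columns to aggregation functions can be sketched with pandas alone. The agg_funcs dictionary below is a hypothetical example using Series.value_counts for both columns; the defaults that stats() actually applies are not listed here, and this sketch keeps the long column names rather than translating them to short ones.

```python
import pandas as pd

df = pd.DataFrame({
    ("patient", "core", "age"): [61, 52, 73, 61],
    ("tumor", "core", "t_stage"): [2, 3, 1, 2],
})

# Hypothetical mapping: column key -> function taking and returning a Series.
agg_funcs = {
    ("patient", "core", "age"): pd.Series.value_counts,
    ("tumor", "core", "t_stage"): pd.Series.value_counts,
}

# Apply each function to its column and convert the result to a dict.
stats = {key: func(df[key]).to_dict() for key, func in agg_funcs.items()}
```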

combine(modalities: dict[str, ModalityConfig] | None = None, method: Literal['max_llh', 'rank'] = 'max_llh', subdivisions: Mapping[str, Sequence[str]] | None = None) → DataFrame

Combine diagnoses of modalities using method.

The order of the provided modalities does not matter, as it is aligned with the order in the DataFrame. With method="max_llh", the most likely true state of involvement is inferred based on all available diagnoses for each patient and level. With method="rank", only the most trustworthy diagnosis is chosen for each patient and level based on the sensitivity and specificity of the given modalities.

The result contains only the combined columns and no top-level header. To add it to the original DataFrame, you could do something like this:

combined = data.ly.combine()
combined_full_header = pd.concat({"foo": combined}, axis="columns")
combined_full_header.index = data.index
data = pd.concat([data, combined_full_header], axis="columns")

The method enhance() is a shorthand for combining, augmenting, and joining the results in a way similar to that example above.

Warning

Here, the default value for subdivisions is set to an empty dictionary. The enhance() method needs to combine and augment in a single step, so combine() retains that capability. Unless subdivisions are chosen explicitly, however, only the originally provided levels are kept.

>>> df = pd.DataFrame({
...     ('CT'       , 'ipsi', 'I'): [False, True , False,  True, None],
...     ('MRI'      , 'ipsi', 'I'): [False, True , True ,  None, None],
...     ('pathology', 'ipsi', 'I'): [True , None ,  None, False, None],
... })
>>> df.ly.combine()   
     ipsi
        I
0    True
1    True
2   False
3   False
4    None
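
A maximum-likelihood combination of this kind can be sketched in pure Python. The sensitivity and specificity values below are assumptions chosen for illustration and are not necessarily lydata's defaults, and the function names are made up. The sketch compares the probability of the observed diagnoses under the two possible true states and picks the more likely one; ties resolve to False here, which is an arbitrary choice.

```python
from math import prod
from typing import Optional

# Hypothetical (specificity, sensitivity) pairs; not necessarily lydata's defaults.
MODALITIES = {
    "CT": (0.76, 0.81),
    "MRI": (0.63, 0.81),
    "pathology": (1.0, 1.0),
}


def likelihood(obs: dict[str, Optional[bool]], involved: bool) -> float:
    """Probability of the observed diagnoses if the true state is `involved`."""
    factors = []
    for name, diagnosis in obs.items():
        if diagnosis is None:  # a missing diagnosis carries no information
            continue
        spec, sens = MODALITIES[name]
        # Chance of a positive finding: sensitivity if truly involved,
        # false-positive rate (1 - specificity) if truly healthy.
        p_positive = sens if involved else 1.0 - spec
        factors.append(p_positive if diagnosis else 1.0 - p_positive)
    return prod(factors)


def max_llh(obs: dict[str, Optional[bool]]) -> Optional[bool]:
    """Most likely true state, or None when nothing was observed."""
    if all(diagnosis is None for diagnosis in obs.values()):
        return None
    return likelihood(obs, True) > likelihood(obs, False)


rows = [
    {"CT": False, "MRI": False, "pathology": True},
    {"CT": True, "MRI": True, "pathology": None},
    {"CT": False, "MRI": True, "pathology": None},
    {"CT": True, "MRI": None, "pathology": False},
    {"CT": None, "MRI": None, "pathology": None},
]
combined = [max_llh(row) for row in rows]
```

With these assumed values, the sketch yields the same column as the doctest above for the same inputs: a perfectly reliable pathology report overrides imaging, and where CT and MRI disagree, the more specific CT prevails.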

augment(modality: str = 'max_llh', subdivisions: dict[str, list[str]] | None = None) → DataFrame

Complete the sub- and superlevel involvement columns.

This is useful if the intention is not to combine multiple modalities, but rather to fill in the missing super- and sub-level involvement columns for a single modality.

Like the combine() method, this returns a DataFrame with only a two-level header. To merge it with the original data, one has to perform additional steps or use the enhance() method.

>>> df = pd.DataFrame({
...     ('MRI', 'ipsi'  , 'I' ): [True , False, False, None],
...     ('MRI', 'contra', 'I' ): [False, True , False, None],
...     ('MRI', 'ipsi'  , 'II'): [False, False, True , None],
...     ('MRI', 'ipsi'  , 'IV'): [False, False, True , None],
...     ('CT' , 'ipsi'  , 'I' ): [True , False, False, None],
... })
>>> df.ly.augment(modality="MRI")   
  contra                 ipsi
       I     Ia     Ib      I     Ia     Ib     II    IIa    IIb     IV
0  False  False  False   True   None   None  False  False  False  False
1   True   None   None  False  False  False  False  False  False  False
2  False  False  False  False  False  False   True   None   None   True
3   None   None   None   None   None   None   None   None   None   None
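
The relationship between super- and sublevels that the output above suggests can be written down as two small rules. This is a sketch inferred from that example, with made-up function names; the rules that augment() actually applies, especially when sub- and superlevel columns conflict, may be more involved.

```python
from typing import Optional

Tri = Optional[bool]  # True = involved, False = healthy, None = unknown


def superlevel_from_sublevels(sublevels: list[Tri]) -> Tri:
    """Infer e.g. level I from its sublevels Ia and Ib."""
    if any(state is True for state in sublevels):
        return True   # one involved sublevel makes the superlevel involved
    if sublevels and all(state is False for state in sublevels):
        return False  # all sublevels healthy means the superlevel is healthy
    return None       # otherwise the state cannot be decided


def sublevel_from_superlevel(superlevel: Tri) -> Tri:
    """Infer e.g. Ia or Ib from level I alone."""
    # A healthy superlevel implies healthy sublevels. An involved superlevel
    # does not reveal which of its sublevels is affected.
    return False if superlevel is False else None


ia_when_i_involved = sublevel_from_superlevel(True)
ia_when_i_healthy = sublevel_from_superlevel(False)
```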

enhance(modalities: dict[str, ModalityConfig] | None = None, method: Literal['max_llh', 'rank'] = 'max_llh', subdivisions: Mapping[str, Sequence[str]] | None = None) → LyDataFrame

Shorthand for first combining modalities and then augmenting them.

This first runs the combine() method and then the augment() method for every modality in modalities as well as for the newly combined column (named after method).

cast(casters: Mapping[type, str] | None = None) → LyDataFrame

Cast the dtypes of the DataFrame to the expected types.

This uses the annotations of the Pydantic schema to cast the individual columns of the DataFrame to the expected types. It uses the casters mapping to determine the type to cast to. By default, it uses the mapping from the _get_default_casters() function.
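
The general pattern of translating Python types into pandas dtype strings might look like the sketch below. Both the casters mapping and the expected dictionary are hypothetical: the first stands in for whatever _get_default_casters() returns, and the second for the annotations of the Pydantic schema. Neither is taken from lydata.

```python
import pandas as pd

# Hypothetical type -> dtype-string mapping using pandas' nullable dtypes.
casters = {int: "Int64", bool: "boolean", str: "string"}

# Hypothetical stand-in for the type annotations of a schema.
expected = {"age": int, "hpv_status": bool}

df = pd.DataFrame({"age": [61, None], "hpv_status": [True, None]})

# Look up the target dtype for each column and cast all of them at once.
cast_df = df.astype({col: casters[typ] for col, typ in expected.items()})
```

Nullable dtypes such as Int64 and boolean keep missing values as pd.NA instead of forcing the column to float or object.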

class lydata.accessor.LyDataFrame(data=None, index: Axes | None = None, columns: Axes | None = None, dtype: Dtype | None = None, copy: bool | None = None)

Subclass of a pandas DataFrame with a custom lydata accessor.