Custom Pandas Accessor
Module containing a custom accessor and helpers for querying lyDATA.
Because of the special three-level header of the lyDATA tables, it is sometimes
cumbersome and lengthy to access the columns. While this is certainly necessary to
access e.g. the contralateral involvement of LNL II as observed on CT images
(df["CT", "contra", "II"]), for simple patient information such as age and HPV
status, it is more convenient to use short names, which we implement in this module.
The main class in this module is the LyDataAccessor class, which provides
the above-mentioned functionality. That way, accessing the age of all patients is now
as easy as typing df.ly.age.
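As a rough illustration of the idea (not this module's actual implementation), a pandas accessor can be registered with pd.api.extensions.register_dataframe_accessor. The accessor name demo, the class DemoAccessor, and the example values below are hypothetical; the real accessor is available as df.ly.

```python
import pandas as pd


# Hypothetical accessor registered as "demo" (the real one is "ly").
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Expose a short name for one column of a three-level header."""

    def __init__(self, obj: pd.DataFrame) -> None:
        self._obj = obj

    @property
    def age(self) -> pd.Series:
        # The short name simply forwards to the full three-level key.
        return self._obj["patient", "#", "age"]


df = pd.DataFrame({
    ("patient", "#", "age"): [61, 52, 73],
    ("CT", "contra", "II"): [False, True, False],
})

ages = df.demo.age                       # short name via the accessor
ct_contra_ii = df["CT", "contra", "II"]  # full three-level key
```

The short name only saves typing for patient and tumor information; involvement columns are still addressed by their full key.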
Beyond that, the module implements a convenient way to query the
pd.DataFrame: the Q object, which was inspired by Django’s
Q object. It allows for more readable and modular queries, which can be combined
with logical operators and reused across different DataFrames.
The Q objects can be passed to the LyDataAccessor.query() and
LyDataAccessor.portion() methods to filter the DataFrame or compute the
QueryPortion of rows that satisfy the query.
Further, we implement methods like LyDataAccessor.combine(),
LyDataAccessor.infer_sublevels(), and
LyDataAccessor.infer_superlevels() to compute additional columns from the
lyDATA tables. This is sometimes necessary because not every dataset contains all the
columns one might need. E.g., in some cohorts we do have detailed sublevel
information (i.e., IIa and IIb), while in others only the superlevel (II) is reported.
- class lydata.accessor.Q(column: str, operator: Literal['==', '<', '<=', '>', '>=', '!=', 'in', 'contains'], value: Any)
Combinable query object for filtering a DataFrame.
The syntax for this object is similar to Django’s Q object. It can be used to define queries in a more readable and modular way.
Caution
The column names are not checked upon instantiation. This is only done when the query is executed. In fact, the Q object does not even know about the DataFrame it will be applied to in the beginning. On the flip side, this means a query may be reused for different DataFrames.
- execute(df: DataFrame) -> Series
Return a boolean mask where the query is satisfied for df.
>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['foo', 'bar', 'baz']})
>>> Q('col1', '<=', 2).execute(df)
0     True
1     True
2    False
Name: col1, dtype: bool
>>> Q('col2', 'contains', 'ba').execute(df)
0    False
1     True
2     True
Name: col2, dtype: bool
- class lydata.accessor.AndQ(q1: Q | AndQ | OrQ | NotQ | None, q2: Q | AndQ | OrQ | NotQ | None)
Query object for combining two queries with a logical AND.
>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['foo', 'bar', 'baz']})
>>> q1 = Q('col1', '!=', 3)
>>> q2 = Q('col2', 'contains', 'ba')
>>> and_q = q1 & q2
>>> print(and_q)
Q('col1', '!=', 3) & Q('col2', 'contains', 'ba')
>>> isinstance(and_q, AndQ)
True
>>> and_q.execute(df)
0    False
1     True
2    False
dtype: bool
- class lydata.accessor.OrQ(q1: Q | AndQ | OrQ | NotQ | None, q2: Q | AndQ | OrQ | NotQ | None)
Query object for combining two queries with a logical OR.
>>> df = pd.DataFrame({'col1': [1, 2, 3]})
>>> q1 = Q('col1', '==', 1)
>>> q2 = Q('col1', '==', 3)
>>> or_q = q1 | q2
>>> print(or_q)
Q('col1', '==', 1) | Q('col1', '==', 3)
>>> isinstance(or_q, OrQ)
True
>>> or_q.execute(df)
0     True
1    False
2     True
Name: col1, dtype: bool
- class lydata.accessor.NotQ(q: Q | AndQ | OrQ | NotQ | None)
Query object for negating a query.
>>> df = pd.DataFrame({'col1': [1, 2, 3]})
>>> q = Q('col1', '==', 2)
>>> not_q = ~q
>>> print(not_q)
~Q('col1', '==', 2)
>>> isinstance(not_q, NotQ)
True
>>> not_q.execute(df)
0     True
1    False
2     True
Name: col1, dtype: bool
- class lydata.accessor.NoneQ
Query object that always returns the entire DataFrame. Useful as default.
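The way such query objects compose can be sketched with a few small classes that overload &, | and ~. This is a simplified stand-in under stated assumptions, not the code of this module: the names MiniQuery, MiniQ, MiniCombined and MiniNot are hypothetical, and only the comparison operators are supported (no 'in' or 'contains').

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable

import pandas as pd

# Assumed operator table for this sketch (comparison operators only).
_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


class MiniQuery:
    """Hypothetical base class that provides the logical operators."""

    def execute(self, df: pd.DataFrame) -> pd.Series:
        raise NotImplementedError

    def __and__(self, other: "MiniQuery") -> "MiniQuery":
        return MiniCombined(self, other, operator.and_)

    def __or__(self, other: "MiniQuery") -> "MiniQuery":
        return MiniCombined(self, other, operator.or_)

    def __invert__(self) -> "MiniQuery":
        return MiniNot(self)


@dataclass
class MiniQ(MiniQuery):
    column: str
    op: str
    value: Any

    def execute(self, df: pd.DataFrame) -> pd.Series:
        # The column is looked up only now, not at instantiation.
        return _OPS[self.op](df[self.column], self.value)


@dataclass
class MiniCombined(MiniQuery):
    left: MiniQuery
    right: MiniQuery
    combine: Callable[[pd.Series, pd.Series], pd.Series]

    def execute(self, df: pd.DataFrame) -> pd.Series:
        return self.combine(self.left.execute(df), self.right.execute(df))


@dataclass
class MiniNot(MiniQuery):
    inner: MiniQuery

    def execute(self, df: pd.DataFrame) -> pd.Series:
        return ~self.inner.execute(df)


query = (MiniQ("col1", ">", 1) & ~MiniQ("col1", "==", 3)) | MiniQ("col1", "==", 1)

# The same query object can be applied to different DataFrames.
mask_a = query.execute(pd.DataFrame({"col1": [1, 2, 3]}))
mask_b = query.execute(pd.DataFrame({"col1": [3, 1]}))
```

Because a query stores only the column name, operator and value, nothing ties it to a particular DataFrame until execute() is called.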
- class lydata.accessor.C(*column: str)
Wraps a column name and produces a Q object upon comparison.
Caution
Just like for the Q object, it is not checked upon instantiation whether the column name is valid. This is only done when the query is executed.
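A column wrapper of this kind can be sketched by overloading the comparison operators so that they return a query instead of a boolean. The classes MiniC and MiniQ below are hypothetical stand-ins, and treating several positional names as one multi-level column key is an assumption based on the *column signature.

```python
import operator
from typing import Any, NamedTuple

import pandas as pd

_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


class MiniQ(NamedTuple):
    """Hypothetical stand-in for a query: column, operator symbol, value."""

    column: Any
    op: str
    value: Any

    def execute(self, df: pd.DataFrame) -> pd.Series:
        return _OPS[self.op](df[self.column], self.value)


class MiniC:
    """Hypothetical column wrapper; comparisons build a MiniQ."""

    def __init__(self, *column: str) -> None:
        # Assumption: one name is a plain key, several names form a tuple key.
        self.column = column[0] if len(column) == 1 else column

    def __eq__(self, value: Any) -> MiniQ:  # type: ignore[override]
        return MiniQ(self.column, "==", value)

    def __ne__(self, value: Any) -> MiniQ:  # type: ignore[override]
        return MiniQ(self.column, "!=", value)

    def __lt__(self, value: Any) -> MiniQ:
        return MiniQ(self.column, "<", value)

    def __le__(self, value: Any) -> MiniQ:
        return MiniQ(self.column, "<=", value)

    def __gt__(self, value: Any) -> MiniQ:
        return MiniQ(self.column, ">", value)

    def __ge__(self, value: Any) -> MiniQ:
        return MiniQ(self.column, ">=", value)


df = pd.DataFrame({("patient", "#", "age"): [61, 52, 73]})
query = MiniC("patient", "#", "age") > 60
mask = query.execute(df)
```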
- class lydata.accessor.QueryPortion(match: int, total: int)
Dataclass for storing the portion of a query.
- property percent: float
Get the percentage of matches over the total.
>>> QueryPortion(2, 5).percent
40.0
- invert() -> QueryPortion
Return the inverted portion.
>>> QueryPortion(2, 5).invert()
QueryPortion(match=3, total=5)
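The two members shown above amount to simple arithmetic on match and total. A minimal sketch, using a hypothetical Portion dataclass rather than the real QueryPortion:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    """Hypothetical stand-in holding a match count and a total count."""

    match: int
    total: int

    @property
    def percent(self) -> float:
        # Note: this sketch raises ZeroDivisionError when total is 0.
        return 100 * self.match / self.total

    def invert(self) -> "Portion":
        # The rows that did not match, out of the same total.
        return Portion(match=self.total - self.match, total=self.total)


portion = Portion(2, 5)
inverted = portion.invert()
```

How the real class treats a total of zero is not shown in this documentation, so the sketch makes no attempt to mirror it.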
- lydata.accessor.align_diagnoses(dataset: DataFrame, modalities: list[str]) -> list[DataFrame]
Stack aligned diagnosis tables in dataset for each of modalities.
- class lydata.accessor.LyDataAccessor(obj: DataFrame)
Custom accessor for handling lymphatic involvement data.
This aims to provide an easy and user-friendly interface to the most commonly needed operations on the lymphatic involvement data we publish in the lydata project.
- validate(modalities: list[str] | None = None) -> DataFrame
Validate the DataFrame against the lydata schema.
The schema is constructed by the construct_schema() function using the modalities provided or, if None is provided, the result of get_default_modalities().
- get_modalities(_filter: list[str] | None = None) -> list[str]
Return the modalities present in this DataFrame.
Warning
This method assumes that all top-level columns are modalities, except for some predefined non-modality columns. For some custom dataset, this may not be correct. In that case, you should provide a list of columns to
_filter, i.e., the columns that are not modalities.
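The assumption described in the warning can be sketched with plain pandas: take the unique top-level column names and drop the known non-modality ones. The function name top_level_modalities, the parameter non_modalities, and the default exclusion list ("patient" and "tumor") are assumptions made for this illustration only.

```python
import pandas as pd

# Assumed non-modality top-level columns for this sketch.
NON_MODALITIES = ["patient", "tumor"]


def top_level_modalities(df, non_modalities=None):
    """Return top-level column names that are not in the exclusion list."""
    excluded = set(NON_MODALITIES if non_modalities is None else non_modalities)
    top_level = df.columns.get_level_values(0).unique()
    return [name for name in top_level if name not in excluded]


df = pd.DataFrame({
    ("patient", "#", "age"): [61, 52],
    ("tumor", "1", "t_stage"): [2, 3],
    ("CT", "ipsi", "I"): [False, True],
    ("MRI", "ipsi", "I"): [False, True],
    ("notes", "#", "comment"): ["a", "b"],
})

default_result = top_level_modalities(df)
custom_result = top_level_modalities(df, ["patient", "tumor", "notes"])
```

With the default exclusion list, the custom "notes" column is mistaken for a modality, which is exactly the situation the warning describes.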
- query(query: Q | AndQ | OrQ | NotQ | None = None) -> DataFrame
Return a DataFrame with rows that satisfy the query.
A query is a Q object that can be combined with logical operators. See this class’ documentation for more information.
As a shorthand for creating these Q objects, you can use the C object as in the example below, where we query all entries where x is greater than 1 and not less than 3:
>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> df.ly.query((C('x') > 1) & ~(C('x') < 3))
   x
2  3
- portion(query: Q | AndQ | OrQ | NotQ | None = None, given: Q | AndQ | OrQ | NotQ | None = None) -> QueryPortion
Compute how many rows satisfy a query, given some other conditions.
This returns a QueryPortion object that holds the number of rows satisfying both the query and the given condition (match), along with the number of rows satisfying the given condition alone (total).
>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> df.ly.portion(query=C('x') == 2, given=C('x') > 1)
QueryPortion(match=np.int64(1), total=np.int64(2))
>>> df.ly.portion(query=C('x') == 2, given=C('x') > 3)
QueryPortion(match=np.int64(0), total=np.int64(0))
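For comparison, the same two counts can be obtained with plain boolean masks. This is only a sketch of the arithmetic, not the accessor's implementation; the variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

given_mask = df["x"] > 1   # the condition we restrict to
query_mask = df["x"] == 2  # the condition we are interested in

# Rows that satisfy both conditions, and rows that satisfy `given` alone.
match = int((query_mask & given_mask).sum())
total = int(given_mask.sum())
```

Here one of the two rows with x > 1 also has x == 2, so the portion is 1 out of 2.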
- stats(agg_funcs: dict[str | tuple[str, str, str], Callable[[Series], Series]] | None = None, use_shortnames: bool = True, out_format: str = 'dict') -> Any
Compute statistics.
The agg_funcs argument is a mapping of column names to functions that receive a pd.Series and return a pd.Series. The default is a useful selection of statistics for the most common columns. E.g., for the column ('patient', '#', 'age') (or its short column name age), the default function returns the value counts.
The use_shortnames argument determines whether the output should use the short column names or the long ones. The default is to use the short names.
With out_format one can specify the output format. Available options are those formats for which pandas has a to_<format> method.
>>> df = pd.DataFrame({
...     ('patient', '#', 'age'): [61, 52, 73, 61],
...     ('patient', '#', 'hpv_status'): [True, False, None, True],
...     ('tumor', '1', 't_stage'): [2, 3, 1, 2],
... })
>>> df.ly.stats()
{'age': {61: 2, 52: 1, 73: 1}, 'hpv': {True: 2, False: 1, None: 1}, 't_stage': {2: 2, 3: 1, 1: 1}}
- combine(modalities: dict[str, ModalityConfig] | None = None, method: Literal['max_llh', 'rank'] = 'max_llh') -> DataFrame
Combine diagnoses of modalities using method.
The order of the provided modalities does not matter, as it is aligned with the order in the DataFrame. With method="max_llh", the most likely true state of involvement is inferred based on all available diagnoses for each patient and level. With method="rank", only the most trustworthy diagnosis is chosen for each patient and level based on the sensitivity and specificity of the given list of modalities.
The result contains only the combined columns. The intended use is to update() the original DataFrame with the result.
>>> df = pd.DataFrame({
...     ('CT'       , 'ipsi', 'I'): [False, True , False, True, None],
...     ('MRI'      , 'ipsi', 'I'): [False, True , True , None, None],
...     ('pathology', 'ipsi', 'I'): [True , None , None, False, None],
... })
>>> df.ly.combine()
    ipsi
       I
0   True
1   True
2  False
3  False
4   None
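One way to read the "max_llh" idea is as a comparison of two likelihoods: how probable the available observations are if the level is healthy versus if it is involved. The sketch below follows that reading for a single patient and level. It is not the library's code: the function name max_llh_state, the (specificity, sensitivity) values, and the tie-breaking rule are all assumptions made for illustration.

```python
# Hypothetical (specificity, sensitivity) pairs, for illustration only.
MODALITIES = {
    "CT": (0.76, 0.81),
    "MRI": (0.63, 0.81),
    "pathology": (1.0, 1.0),
}


def max_llh_state(observations, modalities=MODALITIES):
    """Return the more likely true state, or None without any observation."""
    healthy_llh, involved_llh = 1.0, 1.0
    seen_any = False
    for name, observed in observations.items():
        if observed is None:
            continue  # a missing diagnosis carries no information
        seen_any = True
        spec, sens = modalities[name]
        # P(observation | healthy) and P(observation | involved)
        healthy_llh *= (1.0 - spec) if observed else spec
        involved_llh *= sens if observed else (1.0 - sens)
    if not seen_any:
        return None
    # Assumption: ties are resolved in favour of "healthy".
    return involved_llh > healthy_llh


rows = [
    {"CT": False, "MRI": False, "pathology": True},
    {"CT": True, "MRI": True, "pathology": None},
    {"CT": False, "MRI": True, "pathology": None},
    {"CT": True, "MRI": None, "pathology": False},
    {"CT": None, "MRI": None, "pathology": None},
]
combined = [max_llh_state(row) for row in rows]
```

With these assumed values, a perfectly specific and sensitive pathology report overrides any imaging result, and conflicting CT and MRI findings are resolved towards the more specific modality.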
- infer_sublevels(modalities: list[str] | None = None, sides: list[Literal['ipsi', 'contra']] | None = None, subdivisions: dict[str, list[str]] | None = None) -> DataFrame
Determine involvement status of an LNL’s sublevels (e.g., IIa and IIb).
Some LNLs have sublevels, e.g., IIa and IIb. The involvement of these sublevels is not always reported; sometimes only the superlevel’s status is available. This function infers the status of the sublevels from the superlevel.
The sublevel’s status is computed for the specified modalities. Whether a superlevel has sublevels, and which ones, is specified in subdivisions. The default subdivisions argument looks like this:
{
    "I": ["a", "b"],
    "II": ["a", "b"],
    "V": ["a", "b"],
}
The resulting DataFrame will only contain the newly inferred sublevel columns and only for those sublevels that were not already present in the DataFrame. Thus, one can simply join() the original DataFrame with the result.
>>> df = pd.DataFrame({
...     ('MRI', 'ipsi'  , 'I' ): [True , False, False, None],
...     ('MRI', 'contra', 'I' ): [False, True , False, None],
...     ('MRI', 'ipsi'  , 'II'): [False, False, True , None],
...     ('MRI', 'ipsi'  , 'IV'): [False, False, True , None],
...     ('CT' , 'ipsi'  , 'I' ): [True , False, False, None],
... })
>>> df.ly.infer_sublevels(modalities=["MRI"])
     MRI
    ipsi                      contra
      Ia     Ib    IIa    IIb     Ia     Ib
0   None   None  False  False  False  False
1  False  False  False  False   None   None
2  False  False   None   None  False  False
3   None   None   None   None   None   None
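The example output suggests a simple rule: a healthy superlevel implies healthy sublevels, while an involved or unknown superlevel leaves the sublevels unknown. The sketch below spells out that reading for a single row in plain Python. The function name sublevel_status and the dictionary-based row are hypothetical, and the real method works on DataFrame columns rather than on dictionaries.

```python
from typing import Optional


def sublevel_status(superlevel: Optional[bool]) -> Optional[bool]:
    """Only a healthy superlevel determines the state of its sublevels."""
    return False if superlevel is False else None


subdivisions = {"I": ["a", "b"], "II": ["a", "b"], "V": ["a", "b"]}
row = {"I": True, "II": False, "IV": False}  # one patient, one side

inferred = {
    f"{lnl}{sub}": sublevel_status(row[lnl])
    for lnl, subs in subdivisions.items()
    if lnl in row  # level V is not reported in this row
    for sub in subs
}
```

Level IV has no entry in subdivisions and level V is absent from the row, so neither produces new columns here.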
- infer_superlevels(modalities: list[str] | None = None, sides: list[Literal['ipsi', 'contra']] | None = None, subdivisions: dict[str, list[str]] | None = None) -> DataFrame
Determine involvement status of an LNL’s superlevel (e.g., II).
Some LNLs have sublevels, e.g., IIa and IIb. In real data, sometimes the sublevels are reported, sometimes only the superlevel. This function infers the status of the superlevel from the sublevels.
The superlevel’s status is computed for the specified modalities. Whether a superlevel has sublevels, and which ones, is specified in subdivisions.
The resulting DataFrame will only contain the newly inferred superlevel columns and only for those superlevels that were not already present in the DataFrame. This way, it is straightforward to join() it with the original DataFrame.
>>> df = pd.DataFrame({
...     ('MRI', 'ipsi'  , 'Ia' ): [True , False, False, None],
...     ('MRI', 'ipsi'  , 'Ib' ): [False, True , False, None],
...     ('MRI', 'contra', 'IIa'): [False, False, None , None],
...     ('MRI', 'contra', 'IIb'): [False, True , True , None],
...     ('CT' , 'ipsi'  , 'I'  ): [True , False, False, None],
... })
>>> df.ly.infer_superlevels(modalities=["MRI"])
     MRI
    ipsi contra
       I     II
0   True  False
1   True   True
2  False   True
3   None   None
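Read from the example output, the superlevel is involved as soon as any sublevel is involved, healthy when all sublevels are healthy, and unknown when nothing is known. A minimal sketch of that rule for one patient follows. The function name superlevel_status is hypothetical, and returning None for a mix of healthy and unknown sublevels is an assumption, since the example does not cover that case.

```python
from typing import Optional, Sequence


def superlevel_status(sublevels: Sequence[Optional[bool]]) -> Optional[bool]:
    """Combine sublevel states into the state of their superlevel."""
    if any(state is True for state in sublevels):
        return True  # one involved sublevel is enough
    if sublevels and all(state is False for state in sublevels):
        return False  # every sublevel is known to be healthy
    return None  # assumption: anything else stays unknown


# (IIa, IIb) pairs taken from the contralateral columns of the example.
contra_ii = [(False, False), (False, True), (None, True), (None, None)]
inferred = [superlevel_status(pair) for pair in contra_ii]
```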