Pandera Schemas to Validate Datasets

Module for transforming raw data into, and validating data against, the CSV schema of the lydata datasets.

Here we define the function construct_schema() to dynamically create a pandera.DataFrameSchema that we can use to validate that a given DataFrame conforms to the minimum requirements of the lyDATA datasets.

Currently, we only publish the validate_datasets() function that validates all datasets that are found by the function available_datasets(). In the future, we may want to make this more flexible.

In this module, we also provide the transform_to_lyprox() function that can be used to transform any raw data into the format that can be uploaded to the LyProX platform database.

exception lydata.validator.ParsingError: Error while parsing the CSV file.

lydata.validator.get_modality_columns(modality: str, lnls: list[str] = ['I', 'Ia', 'Ib', 'II', 'IIa', 'IIb', 'III', 'IV', 'V', 'Va', 'Vb', 'VI', 'VII', 'VIII', 'IX', 'X']) → dict[tuple[str, str, str], Column]: Get the validation columns for a given modality.

lydata.validator.construct_schema(modalities: list[str], lnls: list[str] = ['I', 'Ia', 'Ib', 'II', 'IIa', 'IIb', 'III', 'IV', 'V', 'Va', 'Vb', 'VI', 'VII', 'VIII', 'IX', 'X']) → DataFrameSchema: Construct a pandera.DataFrameSchema for the lydata datasets.

lydata.validator.validate_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', use_github: bool = True, repo: str = 'lycosystem/lydata', ref: str = 'main', **kwargs) → None

Validate all lydata datasets.

The arguments of this function are directly passed to the available_datasets() function to determine which datasets to validate.

Keyword arguments beyond the ones that available_datasets() accepts are passed to the load() method of the Dataset instances.

lydata.validator.delete_private_keys(nested: dict) → dict

Delete private keys from a nested dictionary.

A ‘private’ key is a key whose name starts with an underscore. For example:

>>> delete_private_keys({"patient": {"__doc__": "some patient info", "age": 61}})
{'patient': {'age': 61}}
>>> delete_private_keys({"patient": {"age": 61}})
{'patient': {'age': 61}}
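
The behaviour shown in these doctests can be reproduced with a short recursive function. The following is a minimal sketch, not the library's implementation; the name drop_private_keys is hypothetical and chosen only for illustration.

```python
def drop_private_keys(nested: dict) -> dict:
    """Return a copy of `nested` without keys that start with an underscore.

    Hypothetical illustration; not the actual lydata implementation.
    """
    cleaned = {}
    for key, value in nested.items():
        if isinstance(key, str) and key.startswith("_"):
            continue  # skip private keys such as "__doc__"
        # Recurse into sub-dictionaries, keep all other values unchanged.
        cleaned[key] = drop_private_keys(value) if isinstance(value, dict) else value
    return cleaned


example = {"patient": {"__doc__": "some patient info", "age": 61}}
print(drop_private_keys(example))
```

Because the sketch builds a new dictionary instead of deleting keys in place, the input is left unmodified.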

lydata.validator.flatten(nested: dict, prev_key: tuple = (), max_depth: int | None = None) → dict

Flatten nested dict by creating key tuples for each value at max_depth.

>>> nested = {"tumor": {"1": {"t_stage": 1, "size": 12.3}}}
>>> flatten(nested)
{('tumor', '1', 't_stage'): 1, ('tumor', '1', 'size'): 12.3}
>>> mapping = {"patient": {"#": {"age": {"func": int, "columns": ["age"]}}}}
>>> flatten(mapping, max_depth=3)
{('patient', '#', 'age'): {'func': <class 'int'>, 'columns': ['age']}}

Note that flattening an already flat dictionary can yield unexpected results.

lydata.validator.unflatten(flat: dict) → dict

Take a flat dictionary with tuples of keys and create nested dict from it.

>>> flat = {('tumor', '1', 't_stage'): 1, ('tumor', '1', 'size'): 12.3}
>>> unflatten(flat)
{'tumor': {'1': {'t_stage': 1, 'size': 12.3}}}
>>> mapping = {('patient', '#', 'age'): {'func': int, 'columns': ['age']}}
>>> unflatten(mapping)
{'patient': {'#': {'age': {'func': <class 'int'>, 'columns': ['age']}}}}
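
The inverse operation can be sketched in a few lines. This is an illustration under the assumption that every key is a non-empty tuple; the name nest_keys is hypothetical and this is not the library's implementation.

```python
def nest_keys(flat: dict) -> dict:
    """Rebuild a nested dict from a dict keyed by tuples.

    Hypothetical illustration; assumes every key is a non-empty tuple.
    """
    nested = {}
    for path, value in flat.items():
        node = nested
        # Walk (and create, if needed) the intermediate dictionaries.
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested


flat = {("tumor", "1", "t_stage"): 1, ("tumor", "1", "size"): 12.3}
print(nest_keys(flat))
```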

lydata.validator.get_depth(nested_map: dict, leaf_keys: set | None = None) → int

Get the depth at which ‘leaf’ dicts sit in a nested dictionary.

A leaf is a dictionary that contains any of the leaf_keys. If leaf_keys is not given, it defaults to {"func", "default"}.

>>> nested_column_map = {"patient": {"age": {"func": int}}}
>>> get_depth(nested_column_map)
2
>>> flat_column_map = flatten(nested_column_map, max_depth=2)
>>> get_depth(flat_column_map)
1
>>> nested_column_map = {"patient": {"__doc__": "some patient info", "age": 61}}
>>> get_depth(nested_column_map)
Traceback (most recent call last):
    ...
ValueError: Leaf of nested map must be dict with any of ['default', 'func'].

lydata.validator.transform_to_lyprox(raw: DataFrame, column_map: dict[str | tuple, dict | Any]) → DataFrame

Transform raw data into table that can be uploaded directly to LyProX.

To do so, it uses the instructions in the column_map dictionary, which must have a particular structure:

For each column in the final ‘lyproxified’ pd.DataFrame, one entry must exist in the column_map dictionary. E.g., for the column corresponding to a patient’s age, the dictionary should contain a key-value pair of this shape:

column_map = {
    ("patient", "#", "age"): {
        "func": compute_age_from_raw,
        "kwargs": {"randomize": False},
        "columns": ["birthday", "date of diagnosis"]
    },
}

In this example, the function compute_age_from_raw is called with the values of the columns "birthday" and "date of diagnosis" as positional arguments, and the keyword argument "randomize" is set to False. The function then returns the patient’s age, which is subsequently stored in the column ("patient", "#", "age").
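
The mechanism described above can be sketched as follows. To keep the example free of dependencies, plain dictionaries stand in for the rows of the raw DataFrame; the helper apply_column_map and the function age_in_years (a stand-in for compute_age_from_raw) are hypothetical and do not reflect the library's actual implementation.

```python
from datetime import date


def age_in_years(birthday: date, diagnosis: date, randomize: bool = False) -> int:
    """Hypothetical stand-in for compute_age_from_raw; `randomize` is ignored here."""
    years = diagnosis.year - birthday.year
    # Subtract one year if the birthday has not yet occurred in the diagnosis year.
    if (diagnosis.month, diagnosis.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def apply_column_map(raw_rows: list[dict], column_map: dict) -> list[dict]:
    """Build one output row per raw row by evaluating every entry of `column_map`."""
    transformed = []
    for row in raw_rows:
        new_row = {}
        for target, instruction in column_map.items():
            # Values of the listed raw columns become positional arguments.
            args = [row[name] for name in instruction.get("columns", [])]
            kwargs = instruction.get("kwargs", {})
            new_row[target] = instruction["func"](*args, **kwargs)
        transformed.append(new_row)
    return transformed


column_map = {
    ("patient", "#", "age"): {
        "func": age_in_years,
        "kwargs": {"randomize": False},
        "columns": ["birthday", "date of diagnosis"],
    },
}
raw_rows = [{"birthday": date(1960, 5, 17), "date of diagnosis": date(2021, 3, 2)}]
result = apply_column_map(raw_rows, column_map)
print(result[0][("patient", "#", "age")])
```

The tuple key of each entry ends up as the name of the output column, which mirrors the three-level column header of the final table.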

Alternatively, this dictionary can also have a nested, tree-like structure, like this:

column_map = {
    "patient": {
        "#": {
            "age": {
                "func": compute_age_from_raw,
                "kwargs": {"randomize": False},
                "columns": ["birthday", "date of diagnosis"]
            }
        }
    }
}

In this case it is important that all leaf nodes, which are defined by having either a "func" or a "default" key, sit at the same depth, because the nested dictionary is flattened into the form of the first example above.
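
The equivalence of the two forms can be illustrated with a small sketch that descends until it meets a leaf dictionary. This is an assumption about the general idea rather than the library's code; the name flatten_to_leaves is hypothetical, and int is used as a placeholder for a real transformation function.

```python
LEAF_KEYS = {"func", "default"}


def flatten_to_leaves(nested: dict, prefix: tuple = ()) -> dict:
    """Flatten `nested` into tuple keys, stopping at dicts that contain a leaf key.

    Hypothetical illustration; not the actual lydata implementation.
    """
    flat = {}
    for key, value in nested.items():
        path = (*prefix, key)
        if isinstance(value, dict) and not (value.keys() & LEAF_KEYS):
            flat.update(flatten_to_leaves(value, path))  # not a leaf yet
        else:
            flat[path] = value
    return flat


nested_map = {"patient": {"#": {"age": {"func": int, "columns": ["age"]}}}}
flat_map = flatten_to_leaves(nested_map)
print(flat_map)

# If all leaves sit at the same depth, every key tuple has the same length.
same_depth = len({len(key) for key in flat_map}) == 1
```

Leaves at different depths would produce key tuples of different lengths, which cannot all serve as headers of the same three-level column index.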
