Pandera Schemas to Validate Datasets
Module to transform raw data into, and validate it against, the CSV schema of the lydata datasets.
Here we define the function construct_schema() to dynamically create a
pandera.DataFrameSchema that we can use to validate that a given
DataFrame conforms to the minimum requirements of the lyDATA
datasets.
Currently, we only publish the validate_datasets() function that validates all
datasets that are found by the function available_datasets().
In the future, we may want to make this more flexible.
In this module, we also provide the transform_to_lyprox() function that can be
used to transform any raw data into the format that can be uploaded to the LyProX
platform database.
- lydata.validator.get_modality_columns(modality: str, lnls: list[str] = ['I', 'Ia', 'Ib', 'II', 'IIa', 'IIb', 'III', 'IV', 'V', 'Va', 'Vb', 'VI', 'VII', 'VIII', 'IX', 'X']) -> dict[tuple[str, str, str], Column]
Get the validation columns for a given modality.
- lydata.validator.construct_schema(modalities: list[str], lnls: list[str] = ['I', 'Ia', 'Ib', 'II', 'IIa', 'IIb', 'III', 'IV', 'V', 'Va', 'Vb', 'VI', 'VII', 'VIII', 'IX', 'X']) -> DataFrameSchema
Construct a pandera.DataFrameSchema for the lydata datasets.
- lydata.validator.validate_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', use_github: bool = False, repo: str = 'lycosystem/lydata', ref: str = 'main', **kwargs) -> None
Validate all lydata datasets.
The arguments of this function are passed directly to the available_datasets() function to determine which datasets to validate.
Keyword arguments beyond the ones that available_datasets() accepts are passed to the load() method of the Dataset instances.
- lydata.validator.delete_private_keys(nested: dict) -> dict
Delete private keys from a nested dictionary.
A ‘private’ key is a key whose name starts with an underscore. For example:
>>> delete_private_keys({"patient": {"__doc__": "some patient info", "age": 61}})
{'patient': {'age': 61}}
>>> delete_private_keys({"patient": {"age": 61}})
{'patient': {'age': 61}}
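The behaviour in the examples above can be reproduced with a short recursive filter. The following is only a minimal sketch: the name delete_private_keys_sketch is hypothetical, and nothing here claims to mirror how lydata implements the function internally.

```python
def delete_private_keys_sketch(nested: dict) -> dict:
    """Return a copy of `nested` without keys that start with an underscore."""
    cleaned = {}
    for key, value in nested.items():
        # Skip private keys; non-string keys are never treated as private.
        if isinstance(key, str) and key.startswith("_"):
            continue
        # Recurse into sub-dictionaries, keep every other value unchanged.
        if isinstance(value, dict):
            cleaned[key] = delete_private_keys_sketch(value)
        else:
            cleaned[key] = value
    return cleaned


example = {"patient": {"__doc__": "some patient info", "age": 61}}
print(delete_private_keys_sketch(example))  # {'patient': {'age': 61}}
```

The sketch builds a new dictionary instead of deleting keys in place, so the input is left untouched.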
- lydata.validator.flatten(nested: dict, prev_key: tuple = (), max_depth: int | None = None) -> dict
Flatten nested dict by creating key tuples for each value at max_depth.
>>> nested = {"tumor": {"1": {"t_stage": 1, "size": 12.3}}}
>>> flatten(nested)
{('tumor', '1', 't_stage'): 1, ('tumor', '1', 'size'): 12.3}
>>> mapping = {"patient": {"#": {"age": {"func": int, "columns": ["age"]}}}}
>>> flatten(mapping, max_depth=3)
{('patient', '#', 'age'): {'func': <class 'int'>, 'columns': ['age']}}
Note that flattening an already flat dictionary will yield some weird results.
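To illustrate the role of max_depth and the caveat about flat input, here is a minimal sketch of a recursive flattening routine. The name flatten_sketch is hypothetical and the code is an assumption about one possible implementation, so the "weird" output it produces for flat input may differ from what the library returns.

```python
def flatten_sketch(nested: dict, prev_key: tuple = (), max_depth=None) -> dict:
    """Collapse nested dicts into one dict keyed by tuples of the nested keys."""
    result = {}
    for key, value in nested.items():
        new_key = (*prev_key, key)
        # Stop descending once the key tuple has reached the requested length.
        at_max_depth = max_depth is not None and len(new_key) >= max_depth
        if isinstance(value, dict) and not at_max_depth:
            result.update(flatten_sketch(value, new_key, max_depth))
        else:
            result[new_key] = value
    return result


nested = {"tumor": {"1": {"t_stage": 1, "size": 12.3}}}
flat = flatten_sketch(nested)
print(flat)  # {('tumor', '1', 't_stage'): 1, ('tumor', '1', 'size'): 12.3}

# Flattening again wraps every existing key tuple in another one-element tuple.
print(flatten_sketch(flat))
```

In this sketch, the second call turns the key ('tumor', '1', 't_stage') into (('tumor', '1', 't_stage'),), which is one way such surprising results can arise.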
- lydata.validator.unflatten(flat: dict) -> dict
Take a flat dictionary with tuples of keys and create a nested dict from it.
>>> flat = {('tumor', '1', 't_stage'): 1, ('tumor', '1', 'size'): 12.3}
>>> unflatten(flat)
{'tumor': {'1': {'t_stage': 1, 'size': 12.3}}}
>>> mapping = {('patient', '#', 'age'): {'func': int, 'columns': ['age']}}
>>> unflatten(mapping)
{'patient': {'#': {'age': {'func': <class 'int'>, 'columns': ['age']}}}}
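The inverse operation can be sketched by walking each key tuple and creating intermediate dictionaries on demand. The name unflatten_sketch is hypothetical, and the code is an illustration under that assumption, not the library's implementation.

```python
def unflatten_sketch(flat: dict) -> dict:
    """Rebuild a nested dict from a dict whose keys are tuples."""
    nested = {}
    for key_tuple, value in flat.items():
        *parents, last = key_tuple
        current = nested
        for key in parents:
            # Create the intermediate level if it does not exist yet.
            current = current.setdefault(key, {})
        current[last] = value
    return nested


flat = {("tumor", "1", "t_stage"): 1, ("tumor", "1", "size"): 12.3}
print(unflatten_sketch(flat))  # {'tumor': {'1': {'t_stage': 1, 'size': 12.3}}}
```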
- lydata.validator.get_depth(nested_map: dict, leaf_keys: set | None = None) -> int
Get the depth at which ‘leaf’ dicts sit in a nested dictionary.
A leaf is a dictionary that contains any of the leaf_keys. The default is {"func", "default"}.
>>> nested_column_map = {"patient": {"age": {"func": int}}}
>>> get_depth(nested_column_map)
2
>>> flat_column_map = flatten(nested_column_map, max_depth=2)
>>> get_depth(flat_column_map)
1
>>> nested_column_map = {"patient": {"__doc__": "some patient info", "age": 61}}
>>> get_depth(nested_column_map)
Traceback (most recent call last):
    ...
ValueError: Leaf of nested map must be dict with any of ['default', 'func'].
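One way to compute such a depth is sketched here. The name get_depth_sketch is hypothetical. The sketch also checks that every branch reaches its leaf at the same depth, which is an assumption of this example; whether the library's get_depth() inspects all branches is not stated in this documentation.

```python
def get_depth_sketch(nested_map: dict, leaf_keys=None) -> int:
    """Return the number of key levels above the leaf dicts."""
    leaf_keys = {"func", "default"} if leaf_keys is None else set(leaf_keys)
    depths = set()
    for value in nested_map.values():
        if not isinstance(value, dict):
            raise ValueError(
                f"Leaf of nested map must be dict with any of {sorted(leaf_keys)}."
            )
        if leaf_keys.intersection(value):
            depths.add(1)  # `value` is a leaf directly below this level
        else:
            depths.add(1 + get_depth_sketch(value, leaf_keys))
    if len(depths) != 1:
        raise ValueError("Map is empty or its leaves sit at different depths.")
    return depths.pop()


print(get_depth_sketch({"patient": {"age": {"func": int}}}))  # 2
print(get_depth_sketch({("patient", "age"): {"func": int}}))  # 1
```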
- lydata.validator.transform_to_lyprox(raw: DataFrame, column_map: dict[str | tuple, dict | Any]) -> DataFrame
Transform raw data into a table that can be uploaded directly to LyProX.
To do so, it uses the instructions in the column_map dictionary, which needs to have a particular structure:
For each column in the final ‘lyproxified’ pd.DataFrame, one entry must exist in the column_map dictionary. E.g., for the column corresponding to a patient’s age, the dictionary should contain a key-value pair of this shape:
column_map = {
    ("patient", "#", "age"): {
        "func": compute_age_from_raw,
        "kwargs": {"randomize": False},
        "columns": ["birthday", "date of diagnosis"]
    },
}
In this example, the function compute_age_from_raw is called with the values of the columns "birthday" and "date of diagnosis" as positional arguments, and the keyword argument "randomize" is set to False. The function then returns the patient’s age, which is subsequently stored in the column ("patient", "#", "age").
Alternatively, this dictionary can also have a nested, tree-like structure, like this:
column_map = {
    "patient": {
        "#": {
            "age": {
                "func": compute_age_from_raw,
                "kwargs": {"randomize": False},
                "columns": ["birthday", "date of diagnosis"]
            }
        }
    }
}
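To make the mechanics of such a mapping concrete, the following sketch applies a flat, tuple-keyed column_map to plain Python rows instead of a DataFrame. Everything in it is an assumption made for illustration: the names transform_rows_sketch and compute_age_from_raw, the row-by-row application of "func", the reading of "default" as a constant value, and the ("patient", "#", "institution") entry. It is not the library's implementation.

```python
from datetime import date


def compute_age_from_raw(birthday: date, diagnosis: date, randomize: bool = False) -> int:
    """Hypothetical helper: age in full years at the date of diagnosis.

    `randomize` only mirrors the example mapping and is ignored in this sketch.
    """
    had_birthday = (diagnosis.month, diagnosis.day) >= (birthday.month, birthday.day)
    return diagnosis.year - birthday.year - (0 if had_birthday else 1)


def transform_rows_sketch(raw_rows: list, column_map: dict) -> list:
    """Apply a flat column_map to a list of row dicts."""
    transformed = []
    for row in raw_rows:
        new_row = {}
        for target, instruction in column_map.items():
            if "func" in instruction:
                # Raw column values become positional arguments of the function.
                args = [row[name] for name in instruction.get("columns", [])]
                kwargs = instruction.get("kwargs", {})
                new_row[target] = instruction["func"](*args, **kwargs)
            elif "default" in instruction:
                new_row[target] = instruction["default"]
            else:
                raise ValueError(f"No 'func' or 'default' given for {target}.")
        transformed.append(new_row)
    return transformed


column_map = {
    ("patient", "#", "age"): {
        "func": compute_age_from_raw,
        "kwargs": {"randomize": False},
        "columns": ["birthday", "date of diagnosis"],
    },
    ("patient", "#", "institution"): {"default": "Example Hospital"},
}
raw_rows = [
    {"birthday": date(1960, 1, 15), "date of diagnosis": date(2021, 3, 1)},
    {"birthday": date(1955, 11, 30), "date of diagnosis": date(2020, 6, 10)},
]
result = transform_rows_sketch(raw_rows, column_map)
print(result[0][("patient", "#", "age")])  # 61
```

A loop like this only works on the tuple-keyed form, so a nested column_map would first have to be flattened into it.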
In this case it is important that all the leaf nodes, which are defined by having either a "func" or a "default" key, are at the same depth, because this nested dictionary is flattened to look like the first example above.