Type Casting and Validation


Module to cast dtypes and to validate the lyDATA datasets.

The two main functions here are cast_dtypes() and is_valid(). The first one can be used to cast the dtypes of the columns in a LyDataFrame to the expected types according to the schema constructed using create_full_record_model().

Subsequently, is_valid() can be used to validate every row in the table, again using the constructed schema.

lydata.validator.flatten(nested: dict, prev_key: tuple = (), max_depth: int | None = None) -> dict

Flatten nested dict by creating key tuples for each value at max_depth.

>>> nested = {"tumor": {"1": {"t_stage": 1, "size": 12.3}}}
>>> flatten(nested)
{('tumor', '1', 't_stage'): 1, ('tumor', '1', 'size'): 12.3}
>>> mapping = {"patient": {"#": {"age": {"func": int, "columns": ["age"]}}}}
>>> flatten(mapping, max_depth=3)
{('patient', '#', 'age'): {'func': <class 'int'>, 'columns': ['age']}}

Note that flattening an already flat dictionary may yield unexpected results.
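The flattening described above can be sketched in a few lines. This is a minimal illustration, not the implementation in lydata.validator: the name flatten_sketch is hypothetical, and the rule that recursion stops once a key tuple reaches max_depth entries is an assumption inferred from the doctest above.

```python
def flatten_sketch(nested, prev_key=(), max_depth=None):
    """Hypothetical stand-in for flatten(); not the lydata implementation."""
    flat = {}
    for key, value in nested.items():
        full_key = (*prev_key, key)
        # Assumption: stop descending once the key tuple has max_depth entries.
        at_limit = max_depth is not None and len(full_key) >= max_depth
        if isinstance(value, dict) and not at_limit:
            flat.update(flatten_sketch(value, full_key, max_depth))
        else:
            flat[full_key] = value
    return flat


nested = {"tumor": {"1": {"t_stage": 1, "size": 12.3}}}
print(flatten_sketch(nested))
# {('tumor', '1', 't_stage'): 1, ('tumor', '1', 'size'): 12.3}

# In this sketch, an already flat dict gets each tuple key wrapped in
# another one-element tuple, which is rarely what the caller wants.
print(flatten_sketch({("tumor", "1", "size"): 12.3}))
# {(('tumor', '1', 'size'),): 12.3}
```

The second call shows one way an already flat dictionary can go wrong. The real function may behave differently in that case.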

lydata.validator.unflatten(flat: dict) -> dict

Take a flat dictionary with tuples of keys and create nested dict from it.

>>> flat = {('tumor', '1', 't_stage'): 1, ('tumor', '1', 'size'): 12.3}
>>> unflatten(flat)
{'tumor': {'1': {'t_stage': 1, 'size': 12.3}}}
>>> mapping = {('patient', '#', 'age'): {'func': int, 'columns': ['age']}}
>>> unflatten(mapping)
{'patient': {'#': {'age': {'func': <class 'int'>, 'columns': ['age']}}}}
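The inverse operation can be sketched in the same way. The helper name unflatten_sketch is hypothetical, and the code only mirrors the behavior shown in the doctest above. It is not taken from the lydata source.

```python
def unflatten_sketch(flat):
    """Hypothetical stand-in for unflatten(); not the lydata implementation."""
    nested = {}
    for key_tuple, value in flat.items():
        current = nested
        # Walk (and create) the intermediate levels for all but the last key.
        for key in key_tuple[:-1]:
            current = current.setdefault(key, {})
        current[key_tuple[-1]] = value
    return nested


flat = {("tumor", "1", "t_stage"): 1, ("tumor", "1", "size"): 12.3}
print(unflatten_sketch(flat))
# {'tumor': {'1': {'t_stage': 1, 'size': 12.3}}}
```

Using dict.setdefault keeps sibling entries that share a prefix, such as t_stage and size here, in the same inner dictionary.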

lydata.validator.is_valid(dataset: LyDataFrame, fail_on_error: bool = True) -> bool

Validate the given dataset against the lyDATA schema.

Returns True if all records are valid, otherwise it either raises an error (if fail_on_error is True) or returns False.
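The effect of fail_on_error can be illustrated with a small stand-alone sketch. Everything in it is hypothetical: check_record stands in for the per-row schema validation, plain dicts stand in for the rows of a LyDataFrame, and ValueError stands in for whatever exception the real validation raises.

```python
def check_record(record):
    """Hypothetical per-row check; raises ValueError for an invalid record."""
    age = record.get("age")
    if not isinstance(age, int) or age < 0:
        raise ValueError(f"invalid age: {age!r}")


def is_valid_sketch(records, fail_on_error=True):
    """Illustrates the return-or-raise contract only, not lydata's is_valid()."""
    for index, record in enumerate(records):
        try:
            check_record(record)
        except ValueError as error:
            if fail_on_error:
                # Re-raise with the row index so the caller can locate the problem.
                raise ValueError(f"record {index} failed validation") from error
            return False
    return True


print(is_valid_sketch([{"age": 61}, {"age": 47}]))           # True
print(is_valid_sketch([{"age": -1}], fail_on_error=False))   # False
```

With the default fail_on_error=True, the second call would raise instead of returning False. That suits scripts that should stop on the first invalid record.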

lydata.validator.cast_dtypes(dataset: LyDataFrame, casters: Mapping[type, str] | None = None, fail_on_error: bool = True) -> LyDataFrame

Cast the dtypes of the dataset to the expected types.

This function uses the annotations of the Pydantic schema to cast the individual columns of the dataset to the expected types. It uses the casters mapping to determine the type to cast to. By default, it uses the mapping from the _get_default_casters() function.

This way, pandas uses, for example, the nullable integer type Int64 when the Pydantic schema declares that a field can be an integer or None. To use a different mapping, pass it as the casters argument.