Helpers for Loading Datasets
Provides functions to easily load lyDATA CSV tables as pandas.DataFrame.
The loading itself is implemented in the LyDataset class, which
is a pydantic.BaseModel subclass. It validates the unique specification
that identifies a dataset and then allows loading it from the disk (if present) or
from GitHub.
The available_datasets() function can be used to create a generator of such
LyDataset instances, corresponding to all available datasets that
are either found on disk or on GitHub.
The load_datasets() function can be used to load all datasets
matching the given specs/pattern. It takes the same arguments as
available_datasets() but returns a generator of pandas.DataFrame
instead of LyDataset.
Lastly, with the join_datasets() function, one can load and concatenate all
datasets matching the given specs/pattern into a single pandas.DataFrame.
The docstrings of all functions contain some basic doctest examples.
- exception lydata.loader.SkipDiskError
Raised when the user wants to skip loading from disk.
- class lydata.loader.LyDataset(*, year: Annotated[int, Gt(gt=0), Le(le=2025)], institution: Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)], subsite: Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)], repo_name: Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)] = 'rmnldwg/lydata', ref: Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)] = 'main')
Specification of a dataset.
- property name: str
Get the name of the dataset.
>>> conf = LyDataset(year=2023, institution="clb", subsite="multisite")
>>> conf.name
'2023-clb-multisite'
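To illustrate how the validated fields combine into the name, here is a minimal stand-in built on the standard library's dataclasses rather than pydantic. The class DatasetSpec and its checks are hypothetical; they only mirror the constraints visible in the class signature (a year between 1 and 2025, lower-cased non-empty strings), and the real LyDataset may behave differently in detail.

```python
from dataclasses import dataclass


@dataclass
class DatasetSpec:
    """Hypothetical stand-in for LyDataset; not part of lydata."""

    year: int
    institution: str
    subsite: str

    def __post_init__(self) -> None:
        # Coerce the year, since the doctests also pass it as a string.
        self.year = int(self.year)
        if not 0 < self.year <= 2025:
            raise ValueError("year must be in the range 1..2025")
        # Lower-case the string fields and reject empty values.
        for field_name in ("institution", "subsite"):
            value = str(getattr(self, field_name)).lower()
            if not value:
                raise ValueError(f"{field_name} must not be empty")
            setattr(self, field_name, value)

    @property
    def name(self) -> str:
        # Join the three identifying parts with hyphens.
        return f"{self.year}-{self.institution}-{self.subsite}"


spec = DatasetSpec(year="2023", institution="CLB", subsite="multisite")
print(spec.name)  # prints 2023-clb-multisite
```

Doing the normalisation once at construction time means every later use of the spec (name, paths, lookups) sees the same canonical values.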
- property path_on_disk: Path
Get the path to the dataset.
>>> conf = LyDataset(year="2021", institution="usz", subsite="oropharynx")
>>> conf.path_on_disk.exists()
True
- get_repo(token: str | None = None, user: str | None = None, password: str | None = None) github.Repository
Get the GitHub repository object.
With the arguments token or user and password, one can authenticate with GitHub. If no authentication is provided, the function will try to use the environment variables GITHUB_TOKEN or GITHUB_USER and GITHUB_PASSWORD.
>>> conf = LyDataset(
...     year=2021,
...     institution="clb",
...     subsite="oropharynx",
... )
>>> conf.get_repo().full_name == conf.repo_name
True
>>> conf.get_repo().visibility
'public'
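The credential lookup described for get_repo() can be pictured roughly as follows. This is only a sketch: the helper resolve_credentials is hypothetical, and the order of precedence (explicit arguments first, then environment variables, with a token preferred over user and password) is an assumption rather than something the documentation states.

```python
import os


def resolve_credentials(token=None, user=None, password=None, env=None):
    """Hypothetical helper: explicit arguments win over the environment."""
    env = os.environ if env is None else env
    candidates = [
        # 1) whatever the caller passed explicitly
        (token, user, password),
        # 2) the environment variables named in the documentation
        (
            env.get("GITHUB_TOKEN"),
            env.get("GITHUB_USER"),
            env.get("GITHUB_PASSWORD"),
        ),
    ]
    for tok, usr, pwd in candidates:
        if tok:
            return {"token": tok}
        if usr and pwd:
            return {"user": usr, "password": pwd}
    # Nothing found: the caller would fall back to unauthenticated access.
    return {}


# The env argument exists only to make the sketch testable without
# touching the real environment.
print(resolve_credentials(env={"GITHUB_TOKEN": "abc"}))
```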
- get_content_file(token: str | None = None, user: str | None = None, password: str | None = None) ContentFile
Get the GitHub content file of the data CSV.
This method always tries to fetch the most recent version of the file.
>>> conf = LyDataset(
...     year=2023,
...     institution="usz",
...     subsite="hypopharynx-larynx",
...     repo_name="rmnldwg/lydata.private",
...     ref="2023-usz-hypopharynx-larynx",
... )
>>> conf.get_content_file()
ContentFile(path="2023-usz-hypopharynx-larynx/data.csv")
- get_dataframe(use_github: bool = False, token: str | None = None, user: str | None = None, password: str | None = None, **load_kwargs) DataFrame
Load the data.csv file from disk or from GitHub.
Set use_github to True to load the file from GitHub instead of from disk. Any keyword arguments are passed to pandas.read_csv().
The method will store the output of model_dump() in the attrs attribute of the returned DataFrame.
>>> conf = LyDataset(year=2021, institution="clb", subsite="oropharynx")
>>> df_from_disk = conf.get_dataframe()
>>> df_from_disk.shape
(263, 82)
>>> df_from_github = conf.get_dataframe(use_github=True)
>>> np.all(df_from_disk.fillna(0) == df_from_github.fillna(0))
np.True_
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- lydata.loader.available_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', search_paths: list[Path] | None = None, use_github: bool = False, repo_name: str = 'rmnldwg/lydata', ref: str = 'main') Generator[LyDataset, None, None]
Generate LyDataset instances of available datasets.
The arguments year, institution, and subsite represent glob patterns, and all datasets matching these patterns can be iterated over using the returned generator.
By default, the function will look for datasets on disk at the paths specified in the search_paths argument. If no paths are provided, it will look in the parent directory of the directory containing this file. If the library is installed, this will be the site-packages directory.
With use_github set to True, the function will not look for datasets on disk, but will instead look for them on GitHub. The repo_name and ref arguments can be used to specify the repository and the branch/tag/commit to look in.
>>> avail_gen = available_datasets()
>>> sorted([ds.name for ds in avail_gen])
['2021-clb-oropharynx',
 '2021-usz-oropharynx',
 '2023-clb-multisite',
 '2023-isb-multisite']
>>> avail_gen = available_datasets(
...     repo_name="rmnldwg/lydata.private",
...     ref="2024-umcg-hypopharynx-larynx",
...     use_github=True,
... )
>>> sorted([ds.name for ds in avail_gen])
['2021-clb-oropharynx',
 '2021-usz-oropharynx',
 '2023-clb-multisite',
 '2023-isb-multisite',
 '2024-umcg-hypopharynx-larynx']
>>> avail_gen = available_datasets(
...     institution="hvh",
...     ref="6ac98d",
...     use_github=True,
... )
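How glob patterns for year, institution, and subsite select datasets can be sketched with the standard library's fnmatch module. The helper match_names below is hypothetical and works on plain name strings; it assumes names follow the year-institution-subsite form shown in the doctests and says nothing about how lydata actually searches the disk or GitHub.

```python
from fnmatch import fnmatchcase


def match_names(names, year="*", institution="*", subsite="*"):
    """Hypothetical helper: filter 'year-institution-subsite' names by globs."""
    patterns = (str(year), institution, subsite)
    matches = []
    for name in names:
        # Split only twice so that subsites containing hyphens stay intact.
        parts = name.split("-", 2)
        if len(parts) != 3:
            continue
        if all(fnmatchcase(part, pat) for part, pat in zip(parts, patterns)):
            matches.append(name)
    return matches


names = [
    "2021-clb-oropharynx",
    "2021-usz-oropharynx",
    "2023-clb-multisite",
    "2023-isb-multisite",
]
print(match_names(names, year=2021))
# prints ['2021-clb-oropharynx', '2021-usz-oropharynx']
print(match_names(names, institution="clb", subsite="multi*"))
# prints ['2023-clb-multisite']
```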
- lydata.loader.load_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', search_paths: list[Path] | None = None, use_github: bool = False, repo_name: str = 'rmnldwg/lydata', ref: str = 'main', **kwargs) Generator[DataFrame, None, None]
Load matching datasets from the disk.
It loads every dataset from the LyDataset instances generated by the available_datasets() function, which also receives all arguments of this function.
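The relationship between the two generators can be sketched with stand-ins. Everything below is hypothetical: FakeDataset, available, and load are illustrative names, and lists of rows take the place of DataFrames. The point is only that a loading generator built on top of a listing generator stays lazy, so nothing is read until the caller iterates.

```python
class FakeDataset:
    """Hypothetical stand-in for LyDataset that 'loads' a list of rows."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.loaded = False

    def get_dataframe(self):
        # Record that loading happened, then hand back a copy of the rows.
        self.loaded = True
        return list(self.rows)


def available(datasets, institution="*"):
    """Yield the stand-in datasets whose institution matches."""
    for ds in datasets:
        if institution == "*" or ds.name.split("-")[1] == institution:
            yield ds


def load(datasets, **spec):
    """Forward the spec to available() and load each match on demand."""
    for ds in available(datasets, **spec):
        yield ds.get_dataframe()


catalog = [
    FakeDataset("2021-clb-oropharynx", [{"id": 1}, {"id": 2}]),
    FakeDataset("2021-usz-oropharynx", [{"id": 3}]),
    FakeDataset("2023-clb-multisite", [{"id": 4}]),
]

tables = load(catalog, institution="clb")
first = next(tables)  # only now is the first matching dataset loaded
print(len(first), catalog[0].loaded, catalog[2].loaded)
```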