Helpers for Loading Datasets

Provides functions to easily load lyDATA CSV tables as pandas.DataFrame.

The loading itself is implemented in the LyDataset class, which is a pydantic.BaseModel subclass. It validates the unique specification that identifies a dataset and then allows loading it from the disk (if present) or from GitHub.

The available_datasets() function can be used to create a generator of such LyDataset instances, corresponding to all available datasets that are either found on disk or on GitHub.

Building on this, the load_datasets() function loads all datasets matching the given specs/pattern. It takes the same arguments as available_datasets() but returns a generator of pandas.DataFrame objects instead of LyDataset instances.

Lastly, with the join_datasets() function, one can load and concatenate all datasets matching the given specs/pattern into a single pandas.DataFrame.

The docstrings of all functions contain some basic doctest examples.

exception lydata.loader.SkipDiskError

Raised when the user wants to skip loading from disk.

class lydata.loader.LyDataset(*, year: Annotated[int, Gt(gt=0), Le(le=2024)], institution: Annotated[str, StringConstraints(to_lower=True, min_length=1)], subsite: Annotated[str, StringConstraints(to_lower=True, min_length=1)], repo_name: Annotated[str, StringConstraints(to_lower=True, min_length=1)] = 'rmnldwg/lydata', ref: Annotated[str, StringConstraints(to_lower=True, min_length=1)] = 'main')

Specification of a dataset.

property name: str

Get the name of the dataset.

>>> conf = LyDataset(year=2023, institution="clb", subsite="multisite")
>>> conf.name
'2023-clb-multisite'
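
The constraints in the class signature and the name format shown above can be pictured with a small standard-library sketch. DatasetSpec is a hypothetical stand-in written for illustration only; it is not the library's class and, unlike the pydantic model, it does not coerce a string year such as "2021" into an integer.

```python
from dataclasses import dataclass


@dataclass
class DatasetSpec:
    """Hypothetical stand-in for LyDataset, NOT the library's class."""

    year: int
    institution: str
    subsite: str

    def __post_init__(self) -> None:
        # Mirror the documented constraint 0 < year <= 2024.
        if not 0 < self.year <= 2024:
            raise ValueError(f"year must be in (0, 2024], got {self.year}")
        for field_name in ("institution", "subsite"):
            value = getattr(self, field_name)
            # Mirror min_length=1 and to_lower=True.
            if len(value) < 1:
                raise ValueError(f"{field_name} must not be empty")
            setattr(self, field_name, value.lower())

    @property
    def name(self) -> str:
        """Join the parts as in the documented '2023-clb-multisite' example."""
        return f"{self.year}-{self.institution}-{self.subsite}"


spec = DatasetSpec(year=2023, institution="CLB", subsite="multisite")
print(spec.name)  # 2023-clb-multisite
```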

property path_on_disk: Path

Get the path to the dataset.

>>> conf = LyDataset(year="2021", institution="usz", subsite="oropharynx")
>>> conf.path_on_disk.exists()
True

get_repo(token: str | None = None, user: str | None = None, password: str | None = None) → github.Repository

Get the GitHub repository object.

With the arguments token or user and password, one can authenticate with GitHub. If no authentication is provided, the function will try to use the environment variables GITHUB_TOKEN or GITHUB_USER and GITHUB_PASSWORD.

>>> conf = LyDataset(
...     year=2021,
...     institution="clb",
...     subsite="oropharynx",
... )
>>> conf.get_repo().full_name == conf.repo_name
True
>>> conf.get_repo().visibility
'public'
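
The credential lookup described above can be sketched as a small helper. This is only an illustration: resolve_github_auth and its env parameter are hypothetical names, and both the precedence of explicit arguments over environment variables and the anonymous fallback are assumptions rather than confirmed library behavior.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_github_auth(
    token: Optional[str] = None,
    user: Optional[str] = None,
    password: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: choose credentials in the documented order."""
    env = os.environ if env is None else env

    # Explicit arguments are checked first (assumed precedence).
    if token:
        return {"token": token}
    if user and password:
        return {"user": user, "password": password}

    # Otherwise fall back to the documented environment variables.
    env_token = env.get("GITHUB_TOKEN")
    if env_token:
        return {"token": env_token}
    env_user = env.get("GITHUB_USER")
    env_password = env.get("GITHUB_PASSWORD")
    if env_user and env_password:
        return {"user": env_user, "password": env_password}

    # Nothing found: anonymous access (an assumption of this sketch).
    return {}
```

Passing a plain dictionary as env keeps the helper easy to test without touching the real process environment.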

get_content_file(token: str | None = None, user: str | None = None, password: str | None = None) → ContentFile

Get the GitHub content file of the data CSV.

This method always tries to fetch the most recent version of the file.

>>> conf = LyDataset(
...     year=2023,
...     institution="usz",
...     subsite="hypopharynx-larynx",
...     repo_name="rmnldwg/lydata.private",
...     ref="2023-usz-hypopharynx-larynx",
... )
>>> conf.get_content_file()
ContentFile(path="2023-usz-hypopharynx-larynx/data.csv")

get_dataframe(use_github: bool = False, token: str | None = None, user: str | None = None, password: str | None = None, **load_kwargs) → DataFrame

Load the data.csv file from disk or from GitHub.

Set use_github to True to skip the disk and load the file from GitHub instead. Any additional keyword arguments are passed to pandas.read_csv().

The method will store the output of model_dump() in the attrs attribute of the returned DataFrame.

>>> conf = LyDataset(year=2021, institution="clb", subsite="oropharynx")
>>> df_from_disk = conf.get_dataframe()
>>> df_from_disk.shape
(263, 82)
>>> df_from_github = conf.get_dataframe(use_github=True)
>>> np.all(df_from_disk.fillna(0) == df_from_github.fillna(0))
np.True_
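
How keyword arguments reach pandas.read_csv() and how metadata ends up in the attrs attribute can be illustrated without the library itself. The CSV content and the spec dictionary below are made up for this sketch; in the library, the stored dictionary would be the output of model_dump().

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for a data.csv file (made-up content).
csv_text = "patient_id,age\n1,62\n2,57\n"

# Extra keyword arguments, like dtype here, go straight to pandas.read_csv().
df = pd.read_csv(io.StringIO(csv_text), dtype={"age": float})

# Hypothetical metadata, shaped like the documented LyDataset fields.
spec = {"year": 2021, "institution": "clb", "subsite": "oropharynx"}
df.attrs.update(spec)

print(df.attrs["institution"])  # clb
```

Because attrs travels with the DataFrame, downstream code can ask a table which dataset it came from without keeping a separate lookup.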

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]]

Metadata about the fields defined on the model, mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

The model defines the following fields. All string fields are converted to lowercase and must contain at least one character.

year (int, required): Release year of dataset. Must be greater than 0 and at most 2024.
institution (str, required): Institution's short code. E.g., University Hospital Zurich: `usz`.
subsite (str, required): Tumor subsite(s) patients in this dataset were diagnosed with.
repo_name (str, default 'rmnldwg/lydata'): GitHub `owner/repository`.
ref (str, default 'main'): Branch/tag/commit of the repo.

model_post_init(context: Any, /) → None

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Args:
    self: The BaseModel instance.
    context: The context.

lydata.loader.remove_subheadings(tokens: Iterable[Token], min_level: int = 1) → list[Token]

Remove anything under min_level headings.

With this, one can truncate markdown content to, e.g., the top-level heading and the text that immediately follows it. Any subheadings after that will be removed.
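
The effect can be pictured with a line-based sketch. The real function works on parsed markdown tokens, whereas the hypothetical drop_subsections below scans raw text lines, so it is only an approximation: for instance, it would misread a line starting with "#" inside a code fence as a heading. Resuming output at the next heading of level min_level or shallower is also an assumption.

```python
import re
from typing import List

# One to six '#' characters followed by whitespace mark an ATX heading.
HEADING = re.compile(r"^(#{1,6})\s")


def drop_subsections(markdown: str, min_level: int = 1) -> str:
    """Hypothetical line-based analogue of remove_subheadings()."""
    kept: List[str] = []
    skipping = False
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A heading deeper than min_level starts a dropped region;
            # a heading at min_level or above ends it again.
            skipping = len(match.group(1)) > min_level
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


text = "# Title\n\nIntro text.\n\n## Details\n\nMore."
print(drop_subsections(text))
```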

lydata.loader.format_description(readme: TextIOWrapper | str, short: bool = False, max_line_length: int = 60) → str

Get a markdown description from a file.

Truncate the description before the first second-level heading if short is set to True.

lydata.loader.available_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', search_paths: list[Path] | None = None, use_github: bool = False, repo_name: str = 'rmnldwg/lydata', ref: str = 'main') → Generator[LyDataset, None, None]

Generate LyDataset instances of available datasets.

The arguments year, institution, and subsite represent glob patterns and all datasets matching these patterns can be iterated over using the returned generator.

By default, the function will look for datasets on the disk at the paths specified in the search_paths argument. If no paths are provided, it will look in the parent directory of the directory containing this file. If the library is installed, this will be the site-packages directory.

With use_github set to True, the function will not look for datasets on disk, but will instead look for them on GitHub. The repo_name and ref arguments can be used to specify the repository and the branch/tag/commit to look in.

>>> avail_gen = available_datasets()
>>> sorted([ds.name for ds in avail_gen])   
['2021-clb-oropharynx',
 '2021-usz-oropharynx',
 '2023-clb-multisite',
 '2023-isb-multisite']
>>> avail_gen = available_datasets(
...     repo_name="rmnldwg/lydata.private",
...     ref="2024-umcg-hypopharynx-larynx",
...     use_github=True,
... )
>>> sorted([ds.name for ds in avail_gen])   
['2021-clb-oropharynx',
 '2021-usz-oropharynx',
 '2023-clb-multisite',
 '2023-isb-multisite',
 '2024-umcg-hypopharynx-larynx']
>>> avail_gen = available_datasets(
...     institution="hvh",
...     ref="6ac98d",
...     use_github=True,
... )
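
The glob matching on year, institution, and subsite can be pictured with the standard library's fnmatch. The helper match_names is hypothetical and filters plain name strings; how the library matches patterns internally is not shown in this documentation, so treat the combined "year-institution-subsite" pattern as an assumption.

```python
from fnmatch import fnmatch
from typing import Iterable, Iterator, Union


def match_names(
    names: Iterable[str],
    year: Union[int, str] = "*",
    institution: str = "*",
    subsite: str = "*",
) -> Iterator[str]:
    """Hypothetical filter: lazily yield names matching the three patterns."""
    pattern = f"{year}-{institution}-{subsite}"
    for name in names:
        if fnmatch(name, pattern):
            yield name


# Dataset names taken from the doctest output above.
names = [
    "2021-clb-oropharynx",
    "2021-usz-oropharynx",
    "2023-clb-multisite",
    "2023-isb-multisite",
]
print(sorted(match_names(names, year=2023)))
print(sorted(match_names(names, institution="clb")))
```

Like the documented function, the sketch returns a generator, so nothing is matched until the caller iterates over it.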

lydata.loader.load_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', search_paths: list[Path] | None = None, use_github: bool = False, repo_name: str = 'rmnldwg/lydata', ref: str = 'main', **kwargs) → Generator[DataFrame, None, None]

Load matching datasets from the disk or, with use_github set to True, from GitHub.

It loads every dataset from the LyDataset instances generated by the available_datasets() function, which also receives all arguments of this function.

lydata.loader.join_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', search_paths: list[Path] | None = None, use_github: bool = False, repo_name: str = 'rmnldwg/lydata', ref: str = 'main', **kwargs) → DataFrame

Join matching datasets from the disk or, with use_github set to True, from GitHub.

This uses the load_datasets() function to load the datasets and then concatenates them along the index axis. All arguments are also directly passed to the load_datasets() function.

>>> join_datasets(year="2023").shape
(705, 219)
>>> join_datasets(year="2023", use_github=True).shape
(705, 219)
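
Concatenating along the index axis means that rows are stacked while the columns become the union of all inputs. The sketch below shows this with pandas.concat() on two made-up tables; the use of ignore_index=True is an assumption of the sketch and not necessarily what join_datasets() does.

```python
import pandas as pd

# Two made-up tables with partly different columns.
first = pd.DataFrame({"age": [62, 57], "hpv": [True, False]})
second = pd.DataFrame({"age": [70], "smoker": [True]})

# Stack the rows; cells with no source value are filled with NaN.
joined = pd.concat([first, second], axis="index", ignore_index=True)

print(joined.shape)  # (3, 3)
```

This is why a joined table can have more columns than any single input: a column present in only one dataset still appears in the result.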
