Helpers for Loading Datasets

Module for loading the lydata datasets.

exception lydata.loader.SkipDiskError

Raised when the user wants to skip loading from disk.

class lydata.loader.LyDatasetConfig(*, year: Annotated[int, Gt(gt=0), Le(le=2024)], institution: Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)], subsite: Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)], repo: Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)] = 'rmnldwg/lydata', ref: Annotated[str, StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)] = 'main')

Specification of a dataset.

property name: str

Get the name of the dataset.

>>> conf = LyDatasetConfig(year=2023, institution="clb", subsite="multisite")
>>> conf.name
'2023-clb-multisite'

property path: Path

Get the path to the dataset.

>>> conf = LyDatasetConfig(year="2021", institution="usz", subsite="oropharynx")
>>> conf.path.exists()
True

get_url(file: str) → str

Get the URL to the dataset’s directory, CSV file, or README file.

>>> conf = LyDatasetConfig(year=2021, institution="clb", subsite="oropharynx")
>>> conf.get_url("")
'https://raw.githubusercontent.com/rmnldwg/lydata/main/2021-clb-oropharynx/'
>>> conf.get_url("data.csv")
'https://raw.githubusercontent.com/rmnldwg/lydata/main/2021-clb-oropharynx/data.csv'
>>> conf.get_url("README.md")
'https://raw.githubusercontent.com/rmnldwg/lydata/main/2021-clb-oropharynx/README.md'
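
The doctest output above suggests how the URL is assembled from the configuration fields. The following is a minimal sketch, assuming the URL is simply the raw-content host, repo, ref, dataset name, and file joined by slashes. `build_raw_url` is a hypothetical helper for illustration and is not part of lydata.

```python
def build_raw_url(repo: str, ref: str, name: str, file: str = "") -> str:
    """Hypothetical helper mirroring the URLs shown in the doctest above."""
    # Assumption: raw file contents are served from raw.githubusercontent.com,
    # as the doctest output indicates.
    return f"https://raw.githubusercontent.com/{repo}/{ref}/{name}/{file}"


# An empty file name yields the URL of the dataset's directory.
print(build_raw_url("rmnldwg/lydata", "main", "2021-clb-oropharynx"))
print(build_raw_url("rmnldwg/lydata", "main", "2021-clb-oropharynx", "data.csv"))
```

Changing `ref` to a tag or commit hash would point the same pattern at a pinned revision of the repository.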

get_description() → str

Get the description of the dataset.

First, try to load it from the README.md file that should sit right next to the data.csv file. If that fails, fetch the README.md file from the GitHub repository instead.

>>> conf = LyDatasetConfig(year=2021, institution="clb", subsite="oropharynx")
>>> print(conf.get_description())   
# 2021 CLB Oropharynx
...

load(skip_disk: bool = False, **load_kwargs) → DataFrame

Load the data.csv file from disk or from GitHub.

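Set skip_disk to True to bypass the disk and load from GitHub directly. Any additional keyword arguments are passed to pandas.read_csv().

The output of model_dump() is stored in the attrs attribute of the returned DataFrame, so the configuration can be reconstructed from the loaded data.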

>>> conf = LyDatasetConfig(year=2021, institution="clb", subsite="oropharynx")
>>> df_from_disk = conf.load()
>>> df_from_disk.shape
(263, 82)
>>> df_from_github = conf.load(skip_disk=True)
>>> np.all(df_from_disk.fillna(0) == df_from_github.fillna(0))
np.True_
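
How load() switches between the two sources is not spelled out here. One plausible pattern, given that the module defines SkipDiskError, is to raise an exception inside the disk branch and handle it together with a missing file. The sketch below illustrates that idea under this assumption only. `SkipDisk`, `read_text_with_fallback`, and `fetch_remote` are hypothetical names, and the remote fetch is replaced by a plain callable so the example runs offline.

```python
from pathlib import Path


class SkipDisk(Exception):
    """Hypothetical stand-in for lydata.loader.SkipDiskError."""


def read_text_with_fallback(path: Path, fetch_remote, skip_disk: bool = False) -> str:
    """Read from disk first; use `fetch_remote()` if disk is skipped or missing."""
    try:
        if skip_disk:
            # Raising here jumps straight to the remote branch below.
            raise SkipDisk
        return path.read_text()
    except (SkipDisk, FileNotFoundError):
        return fetch_remote()


missing = Path("no-such-dir") / "data.csv"
print(read_text_with_fallback(missing, lambda: "remote copy"))
```

Handling both cases in one `except` clause keeps a single code path for the remote download, whether the user asked for it or the file simply was not found.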

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model. It should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'institution': FieldInfo(annotation=str, required=True, description="Institution's short code. E.g., University Hospital Zurich: `usz`.", metadata=[StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)]), 'ref': FieldInfo(annotation=str, required=False, default='main', description='Branch/tag/commit of the repo.', metadata=[StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)]), 'repo': FieldInfo(annotation=str, required=False, default='rmnldwg/lydata', description='GitHub `repository/owner`.', metadata=[StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)]), 'subsite': FieldInfo(annotation=str, required=True, description='Subsite(s) this dataset covers.', metadata=[StringConstraints(strip_whitespace=None, to_upper=None, to_lower=True, strict=None, min_length=1, max_length=None, pattern=None)]), 'year': FieldInfo(annotation=int, required=True, description='Release year of dataset.', metadata=[Gt(gt=0), Le(le=2024)])}#

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

lydata.loader.remove_subheadings(tokens: Iterable[Token], min_level: int = 1) → list[Token]

Remove anything under min_level headings.

With this, one can truncate markdown content, e.g., to the top-level heading and the text that immediately follows it. Any subheadings after that are removed.
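
To illustrate the idea, here is a minimal sketch that works on a simplified token type rather than the real Token class, whose structure is not shown here. `Block` and `drop_subsections` are hypothetical, the sample content is made up, and the rule (drop every section whose heading is deeper than min_level) is one reading of the description above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical simplified token: heading level (0 for body text) plus content."""
    level: int
    text: str


def drop_subsections(blocks: list[Block], min_level: int = 1) -> list[Block]:
    """Keep headings up to `min_level` and their direct text; drop deeper sections."""
    kept, dropping = [], False
    for block in blocks:
        if block.level > 0:
            # A heading decides whether the text that follows is kept or dropped.
            dropping = block.level > min_level
        if not dropping:
            kept.append(block)
    return kept


doc = [
    Block(1, "2021 CLB Oropharynx"),
    Block(0, "Short summary."),
    Block(2, "Cohort"),
    Block(0, "Details about the cohort."),
]
print([b.text for b in drop_subsections(doc)])
```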

lydata.loader.format_description(readme: TextIOWrapper | str, short: bool = False, max_line_length: int = 60) → str

Get a markdown description from a file.

Truncate the description before the first second-level heading if short is set to True.
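
The truncation and line-length limit can be approximated on plain text with the standard library. This is a rough sketch, not the library's implementation: `shorten_description` is a hypothetical function, it detects second-level headings by the `## ` prefix only, and it wraps each line independently with textwrap.

```python
import textwrap


def shorten_description(markdown: str, short: bool = False, max_line_length: int = 60) -> str:
    """Hypothetical text-only approximation of truncating and wrapping a README."""
    lines = markdown.splitlines()
    if short:
        # Cut everything from the first second-level heading onward.
        for i, line in enumerate(lines):
            if line.startswith("## "):
                lines = lines[:i]
                break
    # Wrap each line separately so headings and paragraphs stay apart.
    wrapped = [textwrap.fill(line, width=max_line_length) for line in lines]
    return "\n".join(wrapped).strip()


readme = "# Title\n\nIntro text.\n\n## Details\n\nMore."
print(shorten_description(readme, short=True))
```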

lydata.loader.available_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', skip_disk: bool = False, repo: str = 'rmnldwg/lydata', ref: str = 'main') → Generator[LyDatasetConfig, None, None]

Generate the configurations of available datasets.

The arguments year, institution, and subsite represent glob patterns and all datasets matching these patterns can be iterated over using the returned generator.

With skip_disk set to True, the function will not look for datasets on disk.

>>> avail_gen = available_datasets()
>>> sorted([ds.name for ds in avail_gen])   
['2021-clb-oropharynx',
 '2021-usz-oropharynx',
 '2023-clb-multisite',
 '2023-isb-multisite']
>>> avail_gen = available_datasets(skip_disk=True)
>>> sorted([ds.name for ds in avail_gen])   
['2021-clb-oropharynx',
 '2021-usz-oropharynx',
 '2023-clb-multisite',
 '2023-isb-multisite']
>>> avail_gen = available_datasets(
...     institution="hvh",
...     ref="6ac98d",
...     skip_disk=True,
... )
>>> sorted([ds.get_url("") for ds in avail_gen])   
['https://raw.githubusercontent.com/rmnldwg/lydata/6ac98d/2024-hvh-oropharynx/']
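
The glob-style matching can be pictured with the standard library's fnmatch module. This is only a sketch of the matching idea, assuming the three patterns are combined into one `year-institution-subsite` pattern. `match_names` is a hypothetical helper, and the list of names is copied from the doctest output above instead of being discovered on disk or on GitHub.

```python
from fnmatch import fnmatchcase

# Dataset names taken from the doctest output above.
KNOWN = (
    "2021-clb-oropharynx",
    "2021-usz-oropharynx",
    "2023-clb-multisite",
    "2023-isb-multisite",
)


def match_names(year="*", institution="*", subsite="*", names=KNOWN):
    """Yield names matching the combined glob pattern (hypothetical helper)."""
    pattern = f"{year}-{institution}-{subsite}"
    for name in names:
        if fnmatchcase(name, pattern):
            yield name


print(list(match_names(year=2023)))
print(list(match_names(institution="usz")))
```

Because the helper is a generator, nothing is matched until the caller iterates over it, which mirrors the Generator return type in the signature above.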

lydata.loader.load_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', skip_disk: bool = False, repo: str = 'rmnldwg/lydata', ref: str = 'main', **kwargs) → Generator[DataFrame, None, None]

Load matching datasets from disk or from GitHub.

The skip_disk argument is passed both to available_datasets(), which determines what can be loaded, and to LyDatasetConfig.load(), which decides whether to load from disk (the default) or from GitHub.

lydata.loader.load_dataset(year: int | str = '*', institution: str = '*', subsite: str = '*', skip_disk: bool = False, repo: str = 'rmnldwg/lydata', ref: str = 'main', **kwargs) → DataFrame

Load the first matching dataset.

skip_disk is passed to the load_datasets() function.

>>> ds = load_dataset(year="2021", institution='clb', subsite='oropharynx')
>>> ds.attrs["year"]
2021
>>> conf_from_ds = LyDatasetConfig(**ds.attrs)
>>> conf_from_ds.name
'2021-clb-oropharynx'
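
Taking only the first item from a generator means the remaining datasets are never loaded. The sketch below demonstrates this laziness with a hypothetical stand-in, `fake_load_datasets`, that records what it "loads". Whether load_dataset() itself uses next() is an assumption.

```python
loaded = []


def fake_load_datasets(names):
    """Hypothetical stand-in for load_datasets(): 'loads' lazily, one name at a time."""
    for name in names:
        loaded.append(name)  # record which datasets were actually loaded
        yield {"name": name}


first = next(fake_load_datasets(["2021-clb-oropharynx", "2021-usz-oropharynx"]))
print(first["name"], loaded)
```

Note that next() raises StopIteration when the generator is empty. How the real function reports a query without matches is not documented here.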

lydata.loader.join_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', skip_disk: bool = False, repo: str = 'rmnldwg/lydata', ref: str = 'main', **kwargs) → DataFrame

Join matching datasets loaded from disk or from GitHub.

This uses the load_datasets() function to load the datasets and then concatenates them along the index axis.

>>> join_datasets(year="2023").shape
(705, 219)
>>> join_datasets(year="2023", skip_disk=True).shape
(705, 219)

lydata.loader.run_doctests() → None

Run the doctests.