Helpers for Loading Datasets

Module for loading the lydata datasets.

class lydata.loader.DatasetSpec(year: int | str, institution: str, subsite: str, path: Path | None = None, description: str = '', repo: str = 'rmnldwg/lydata', revision: str = 'main')

Specification of a dataset.

property name: str

Get the name of the dataset.

>>> spec = DatasetSpec(2023, "clb", "multisite", Path("path"), "description")
>>> spec.name
'2023-clb-multisite'

property url: str

Get the URL to the dataset.

>>> spec = DatasetSpec(2023, "clb", "multisite", Path("path"), "description")
>>> spec.url
'https://raw.githubusercontent.com/rmnldwg/lydata/main/2023-clb-multisite/data.csv'

load(**load_kwargs) → DataFrame

Load the dataset.

fetch(**load_kwargs) → DataFrame

Fetch the dataset from the web.
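
The two examples above suggest how the dataset name and the fetch URL are composed from the specification's fields. The following is a minimal sketch of that pattern, inferred only from the documented outputs. SpecSketch is a hypothetical stand-in, not the library's DatasetSpec, and the real class may build these strings differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SpecSketch:
    """Hypothetical stand-in for DatasetSpec, reduced to the URL-relevant fields."""

    year: int | str
    institution: str
    subsite: str
    repo: str = "rmnldwg/lydata"
    revision: str = "main"

    @property
    def name(self) -> str:
        # Pattern inferred from the documented example '2023-clb-multisite'.
        return f"{self.year}-{self.institution}-{self.subsite}"

    @property
    def url(self) -> str:
        # Pattern inferred from the documented example URL (an assumption).
        return (
            "https://raw.githubusercontent.com/"
            f"{self.repo}/{self.revision}/{self.name}/data.csv"
        )


spec = SpecSketch(2023, "clb", "multisite")
print(spec.name)
print(spec.url)
```

Under this assumption, changing repo or revision would point the URL at a fork or at a different branch or tag of the repository.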

lydata.loader.remove_subheadings(elements: list, min_level: int = 1) → list

Remove anything under min_level headings.

lydata.loader.get_description(readme: TextIOWrapper | str, short: bool = False, max_line_length: int = 60) → str

Get a markdown description from a file.

Truncate the description before the first second-level heading if short is set to True.
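
To illustrate what truncating before the first second-level heading means, here is a rough sketch that works on plain text lines. The helper truncate_before_second_level and the sample README are made up for illustration. The library itself may parse the markdown properly instead of scanning lines, and this sketch would also stop at a line starting with "## " inside a fenced code block.

```python
def truncate_before_second_level(markdown: str) -> str:
    """Keep only the text before the first '## ' heading (hypothetical helper)."""
    kept = []
    for line in markdown.splitlines():
        # A second-level heading starts with exactly two hashes and a space.
        if line.startswith("## "):
            break
        kept.append(line)
    return "\n".join(kept).strip()


# Made-up README content, used only to exercise the helper.
readme = """# Example Dataset

Short summary of the cohort.

## Details

Longer text that a short description would omit.
"""

short_description = truncate_before_second_level(readme)
print(short_description)
```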

lydata.loader.available_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', where: Literal['disk', 'github'] = 'disk') → Generator[DatasetSpec, None, None]

Generate the specifications (DatasetSpec objects) of available datasets, searching either on disk or on GitHub depending on where.

>>> avail_gen = available_datasets(where='disk')
>>> sorted([ds.name for ds in avail_gen])   
['2021-clb-oropharynx',
 '2021-usz-oropharynx',
 '2023-clb-multisite',
 '2023-isb-multisite']
>>> avail_gen = available_datasets(where='github')
>>> sorted([ds.name for ds in avail_gen])   
['2021-clb-oropharynx',
 '2021-usz-oropharynx',
 '2023-clb-multisite',
 '2023-isb-multisite']

lydata.loader.load_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', **load_kwargs) → Generator[DataFrame, None, None]

Load matching datasets from the disk.
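
The '*' defaults for year, institution, and subsite suggest glob-style matching against the dataset names. The sketch below shows that idea with the standard library's fnmatch. It is an assumption about the matching behavior, not the library's implementation, and matching_names is a hypothetical helper. The names are the ones listed in the available_datasets example on this page.

```python
from fnmatch import fnmatch

# Dataset names as listed in the available_datasets example.
NAMES = [
    "2021-clb-oropharynx",
    "2021-usz-oropharynx",
    "2023-clb-multisite",
    "2023-isb-multisite",
]


def matching_names(year="*", institution="*", subsite="*", names=NAMES):
    """Return names matching the glob pattern '<year>-<institution>-<subsite>'."""
    pattern = f"{year}-{institution}-{subsite}"
    return [name for name in names if fnmatch(name, pattern)]


print(matching_names(year=2021))
print(matching_names(institution="clb"))
print(matching_names(subsite="multisite"))
```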

lydata.loader.load_dataset(year: int | str = '*', institution: str = '*', subsite: str = '*', **load_kwargs) → DataFrame

Load the first matching dataset from the disk.

Datasets loaded (or fetched) this way store their dataset specification in the attrs attribute of the returned DataFrame. The following example shows how to access it and rebuild the specification.

>>> ds = load_dataset(year=2021, institution='clb', subsite='oropharynx')
>>> ds.attrs["year"]
'2021'
>>> spec_from_ds = DatasetSpec(**ds.attrs)
>>> spec_from_ds.name
'2021-clb-oropharynx'

lydata.loader.fetch_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', **load_kwargs) → Generator[DataFrame, None, None]

Fetch matching datasets from the web.

lydata.loader.fetch_dataset(year: int | str = '*', institution: str = '*', subsite: str = '*', **load_kwargs) → DataFrame

Fetch the first matching dataset from the web.

lydata.loader.join_datasets(year: int | str = '*', institution: str = '*', subsite: str = '*', method: Literal['fetch', 'load'] = 'load', **load_or_fetch_kwargs) → DataFrame

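Join matching datasets into a single DataFrame. With method='load' (the default) the datasets are read from the disk; with method='fetch' they are fetched from the web.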
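
As a rough illustration of how a method argument could select between loading and fetching before the results are combined, consider the sketch below. It uses only the standard library: load_tables and fetch_tables are hypothetical stand-ins for load_datasets and fetch_datasets, and lists of dicts take the place of DataFrames. With pandas, the combining step would typically be a call to pd.concat instead.

```python
from itertools import chain
from typing import Dict, Iterator, List

Row = Dict[str, str]


def load_tables() -> Iterator[List[Row]]:
    """Hypothetical stand-in for load_datasets(): one table per dataset."""
    yield [{"dataset": "2021-clb-oropharynx", "source": "disk"}]
    yield [{"dataset": "2021-usz-oropharynx", "source": "disk"}]


def fetch_tables() -> Iterator[List[Row]]:
    """Hypothetical stand-in for fetch_datasets(): one table per dataset."""
    yield [{"dataset": "2021-clb-oropharynx", "source": "web"}]
    yield [{"dataset": "2021-usz-oropharynx", "source": "web"}]


def join_tables(method: str = "load") -> List[Row]:
    """Pick the producer named by `method` and concatenate all of its tables."""
    producers = {"load": load_tables, "fetch": fetch_tables}
    if method not in producers:
        raise ValueError(f"method must be 'load' or 'fetch', got {method!r}")
    return list(chain.from_iterable(producers[method]()))


joined = join_tables(method="fetch")
print(len(joined), joined[0]["source"])
```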