roksana.datasets package

Submodules

roksana.datasets.datasets module

class roksana.datasets.datasets.UserDataset(root: str, transform: Callable | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None, data_list: List[Data] | None = None)

Bases: InMemoryDataset

A dataset class for user-provided datasets adhering to PyG’s InMemoryDataset structure.

Users should provide their data in a specific format, typically as a list of torch_geometric.data.Data objects.

__init__(root: str, transform: Callable | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None, data_list: List[Data] | None = None)

Initialize the UserDataset.

Parameters:

root (str) – Root directory where the dataset should be saved.
transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access.
pre_transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk.
pre_filter (Callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset.
data_list (List[Data], optional) – A list of torch_geometric.data.Data objects. If provided, it will be used to initialize the dataset.

download(): Users are expected to provide their own data, so no download is necessary.

process()

Process the user-provided data and save it in the processed file.

Users can modify this method if they have specific processing requirements.

property processed_file_names: List[str]: The name of the processed file.

property raw_file_names: List[str]: Since users provide their own data, this can be left empty or used to list expected raw files.
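
The hooks described above run at different times: pre_filter and pre_transform apply once during processing, while transform applies on every access. The following is a minimal sketch of that ordering, assuming the usual PyG convention of filtering before pre-transforming. TinyInMemoryDataset and the dict-based Record are hypothetical stand-ins for UserDataset and torch_geometric.data.Data, used only so the example runs with the standard library.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in for torch_geometric.data.Data.
Record = Dict[str, Any]


class TinyInMemoryDataset:
    """Toy class that mimics the hook order of an in-memory dataset."""

    def __init__(
        self,
        data_list: List[Record],
        transform: Optional[Callable[[Record], Record]] = None,
        pre_transform: Optional[Callable[[Record], Record]] = None,
        pre_filter: Optional[Callable[[Record], bool]] = None,
    ) -> None:
        self.transform = transform
        # "process" step: runs once, before the data would be saved to disk.
        if pre_filter is not None:
            data_list = [d for d in data_list if pre_filter(d)]
        if pre_transform is not None:
            data_list = [pre_transform(d) for d in data_list]
        self._stored = data_list

    def __len__(self) -> int:
        return len(self._stored)

    def __getitem__(self, idx: int) -> Record:
        # "transform" runs on every access and works on a copy,
        # so the stored record is left unchanged.
        item = dict(self._stored[idx])
        return self.transform(item) if self.transform is not None else item


graphs = [{"num_nodes": 3}, {"num_nodes": 0}, {"num_nodes": 5}]
ds = TinyInMemoryDataset(
    graphs,
    pre_filter=lambda d: d["num_nodes"] > 0,            # drop empty graphs
    pre_transform=lambda d: {**d, "processed": True},   # applied once
    transform=lambda d: {**d, "accessed": True},        # applied per access
)
print(len(ds), ds[0])
```

In this sketch the empty graph is dropped before pre_transform ever sees it, and the "accessed" flag appears only on the returned copy, not on the stored record.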

roksana.datasets.datasets.get_dataset_info(dataset: InMemoryDataset) → Dict[str, Any]

Retrieve basic information about a dataset.

Parameters: dataset (InMemoryDataset) – The dataset instance.
Returns: A dictionary containing dataset information.
Return type: Dict[str, Any]

roksana.datasets.datasets.list_available_standard_datasets() → List[str]

List all available standard datasets supported by ROKSANA.

Returns: A list of supported dataset names.
Return type: List[str]

roksana.datasets.datasets.load_dataset(dataset_name: str | None = None, root: str = 'data', transform: Callable | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None, data_list: List[Data] | None = None) → InMemoryDataset

Load a dataset, either a standard dataset or a user-provided dataset.

Parameters:

dataset_name (str, optional) – Name of the standard dataset to load (e.g., ‘cora’, ‘citeseer’). If None, user data must be provided via data_list.
root (str, optional) – Root directory where the dataset should be saved or loaded from. Defaults to ‘data’.
transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access.
pre_transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk.
pre_filter (Callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset.
data_list (List[Data], optional) – A list of torch_geometric.data.Data objects. Required if dataset_name is None.

Returns:

An instance of the loaded dataset.

Return type:

InMemoryDataset
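
The contract between dataset_name and data_list can be sketched as a small dispatch function. This is only an illustration under stated assumptions: choose_source is a hypothetical helper, the precedence of dataset_name when both arguments are given is assumed, and so is the ValueError raised when neither is given. None of these behaviours is confirmed by the documentation above.

```python
from typing import Any, List, Optional


def choose_source(
    dataset_name: Optional[str] = None,
    data_list: Optional[List[Any]] = None,
) -> str:
    """Hypothetical helper: report which branch a loader with this contract takes."""
    if dataset_name is not None:
        # Assumption: a dataset name takes precedence over data_list.
        return f"standard:{dataset_name}"
    if data_list is None:
        # Documented contract: data_list is required when dataset_name is None.
        # Assumption: the failure is reported as a ValueError.
        raise ValueError("data_list is required when dataset_name is None")
    return f"user:{len(data_list)} graphs"


print(choose_source("cora"))
print(choose_source(data_list=[object(), object()]))
```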

roksana.datasets.datasets.load_standard_dataset(name: str, root: str = 'data') → Planetoid

Load a standard dataset from PyG’s built-in datasets.

Supported datasets: ‘cora’, ‘citeseer’, ‘pubmed’, etc. Refer to PyG’s Planetoid datasets for more.

Parameters:

name (str) – Name of the dataset to load (e.g., ‘Cora’, ‘Citeseer’).
root (str, optional) – Root directory where the dataset should be saved. Defaults to ‘data’.

Returns:

An instance of the Planetoid dataset.

Return type:

Planetoid
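
The examples above spell dataset names both in lowercase (‘cora’) and capitalized (‘Cora’). Whether load_standard_dataset normalizes the casing itself is not stated here, so a caller who wants to be safe could normalize names first. The sketch below uses the canonical names from PyG's Planetoid class; canonical_planetoid_name is a hypothetical helper, not part of ROKSANA.

```python
# Canonical dataset names used by torch_geometric.datasets.Planetoid.
PLANETOID_NAMES = {"cora": "Cora", "citeseer": "CiteSeer", "pubmed": "PubMed"}


def canonical_planetoid_name(name: str) -> str:
    """Hypothetical helper: map any casing to the canonical Planetoid name."""
    key = name.strip().lower()
    if key not in PLANETOID_NAMES:
        raise ValueError(
            f"Unsupported dataset {name!r}; expected one of {sorted(PLANETOID_NAMES)}"
        )
    return PLANETOID_NAMES[key]


print(canonical_planetoid_name("citeseer"))
```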

roksana.datasets.datasets.load_user_dataset_from_files(data_dir: str, file_format: str = 'json', transform: Callable | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None) → UserDataset

Load a user dataset from files in a specified directory.

Supported file formats: ‘json’, ‘csv’, ‘pickle’.

Parameters:

data_dir (str) – Directory containing the dataset files.
file_format (str, optional) – Format of the dataset files. Defaults to ‘json’.
transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access.
pre_transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk.
pre_filter (Callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset.

Returns:

An instance of the UserDataset loaded from the files.

Return type:

UserDataset
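
The expected layout of the files is not specified above, so the following is only a sketch of how a loader might dispatch on file_format using the standard library. read_raw_files is a hypothetical helper, and the assumption that files carry the format name as their extension (for example ‘.pickle’) is mine, not ROKSANA's. Converting the parsed records into torch_geometric.data.Data objects is left out.

```python
import csv
import json
import pickle
from pathlib import Path
from typing import Any, Callable, Dict


def _read_json(path: Path) -> Any:
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def _read_csv(path: Path) -> Any:
    # Each row becomes a dict keyed by the header line.
    with path.open("r", encoding="utf-8", newline="") as fh:
        return list(csv.DictReader(fh))


def _read_pickle(path: Path) -> Any:
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with path.open("rb") as fh:
        return pickle.load(fh)


READERS: Dict[str, Callable[[Path], Any]] = {
    "json": _read_json,
    "csv": _read_csv,
    "pickle": _read_pickle,
}


def read_raw_files(data_dir: str, file_format: str = "json") -> Dict[str, Any]:
    """Hypothetical helper: parse every '*.<file_format>' file in data_dir."""
    if file_format not in READERS:
        raise ValueError(
            f"Unsupported file_format {file_format!r}; expected one of {sorted(READERS)}"
        )
    reader = READERS[file_format]
    # Assumption: the file extension matches the format name.
    paths = sorted(Path(data_dir).glob(f"*.{file_format}"))
    return {p.name: reader(p) for p in paths}
```

Keeping the readers in a dictionary makes the set of supported formats explicit and lets an unsupported format fail early with a clear message.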

roksana.datasets.datasets.prepare_search_set(data: Data, percentage: float = 0.1, seed: int = 42) → Tuple[List[int], List[List[int]]]

Prepare a search set for search evaluation by selecting a percentage of nodes as queries and creating corresponding gold sets based on feature similarity.

Parameters:

data (Data) – The graph dataset.
percentage (float, optional) – Fraction of nodes to select as queries. Must be between 0 and 1. Defaults to 0.1 (10%).
seed (int, optional) – Seed for random number generator to ensure reproducibility. Defaults to 42.

Returns:

A tuple containing:

queries (List[int]): List of node indices selected as queries.
gold_sets (List[List[int]]): List of gold sets, where each gold set is a list of node indices with the same features as the corresponding query.

Return type:

Tuple[List[int], List[List[int]]]

Raises:

ValueError – If percentage is not between 0 and 1.
AttributeError – If data does not contain node features (data.x).
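
The procedure described above (sample a fraction of nodes as queries, then collect the nodes whose features match each query) can be sketched on plain Python lists. This is an illustrative re-implementation, not ROKSANA's code: sketch_search_set is a hypothetical name, features stands in for data.x, and three details are assumptions, namely that 0 is excluded from the valid range, that at least one query is selected, and that a query is left out of its own gold set.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sketch_search_set(
    features: Sequence[Sequence[float]],
    percentage: float = 0.1,
    seed: int = 42,
) -> Tuple[List[int], List[List[int]]]:
    """Illustrative version of the documented behaviour on plain lists."""
    # Assumption: valid range is (0, 1]; the documentation only says "between 0 and 1".
    if not 0 < percentage <= 1:
        raise ValueError("percentage must be between 0 and 1")

    num_nodes = len(features)
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    # Assumption: select at least one query when the graph is non-empty.
    num_queries = min(num_nodes, max(1, int(num_nodes * percentage)))
    queries = rng.sample(range(num_nodes), num_queries)

    # Group node indices by their exact feature row.
    groups: Dict[Tuple[float, ...], List[int]] = {}
    for node, row in enumerate(features):
        groups.setdefault(tuple(row), []).append(node)

    # Assumption: the query itself is excluded from its gold set.
    gold_sets = [
        [n for n in groups[tuple(features[q])] if n != q] for q in queries
    ]
    return queries, gold_sets


toy_features = [[1, 0], [0, 1], [1, 0], [0, 0]]
queries, gold_sets = sketch_search_set(toy_features, percentage=0.5, seed=0)
```

Because the random generator is seeded, calling the function twice with the same arguments returns the same queries and gold sets.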

Module contents

class roksana.datasets.UserDataset(root: str, transform: Callable | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None, data_list: List[Data] | None = None)

Bases: InMemoryDataset

A dataset class for user-provided datasets adhering to PyG’s InMemoryDataset structure.

Users should provide their data in a specific format, typically as a list of torch_geometric.data.Data objects.

__init__(root: str, transform: Callable | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None, data_list: List[Data] | None = None)

Initialize the UserDataset.

Parameters:

root (str) – Root directory where the dataset should be saved.
transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access.
pre_transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk.
pre_filter (Callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset.
data_list (List[Data], optional) – A list of torch_geometric.data.Data objects. If provided, it will be used to initialize the dataset.

download(): Users are expected to provide their own data, so no download is necessary.

process()

Process the user-provided data and save it in the processed file.

Users can modify this method if they have specific processing requirements.

property processed_file_names: List[str]: The name of the processed file.

property raw_file_names: List[str]: Since users provide their own data, this can be left empty or used to list expected raw files.

roksana.datasets.get_dataset_info(dataset: InMemoryDataset) → Dict[str, Any]

Retrieve basic information about a dataset.

Parameters: dataset (InMemoryDataset) – The dataset instance.
Returns: A dictionary containing dataset information.
Return type: Dict[str, Any]

roksana.datasets.list_available_standard_datasets() → List[str]

List all available standard datasets supported by ROKSANA.

Returns: A list of supported dataset names.
Return type: List[str]

roksana.datasets.load_dataset(dataset_name: str | None = None, root: str = 'data', transform: Callable | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None, data_list: List[Data] | None = None) → InMemoryDataset

Load a dataset, either a standard dataset or a user-provided dataset.

Parameters:

dataset_name (str, optional) – Name of the standard dataset to load (e.g., ‘cora’, ‘citeseer’). If None, user data must be provided via data_list.
root (str, optional) – Root directory where the dataset should be saved or loaded from. Defaults to ‘data’.
transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access.
pre_transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk.
pre_filter (Callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset.
data_list (List[Data], optional) – A list of torch_geometric.data.Data objects. Required if dataset_name is None.

Returns:

An instance of the loaded dataset.

Return type:

InMemoryDataset

roksana.datasets.load_standard_dataset(name: str, root: str = 'data') → Planetoid

Load a standard dataset from PyG’s built-in datasets.

Supported datasets: ‘cora’, ‘citeseer’, ‘pubmed’, etc. Refer to PyG’s Planetoid datasets for more.

Parameters:

name (str) – Name of the dataset to load (e.g., ‘Cora’, ‘Citeseer’).
root (str, optional) – Root directory where the dataset should be saved. Defaults to ‘data’.

Returns:

An instance of the Planetoid dataset.

Return type:

Planetoid

roksana.datasets.load_user_dataset_from_files(data_dir: str, file_format: str = 'json', transform: Callable | None = None, pre_transform: Callable | None = None, pre_filter: Callable | None = None) → UserDataset

Load a user dataset from files in a specified directory.

Supported file formats: ‘json’, ‘csv’, ‘pickle’.

Parameters:

data_dir (str) – Directory containing the dataset files.
file_format (str, optional) – Format of the dataset files. Defaults to ‘json’.
transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access.
pre_transform (Callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk.
pre_filter (Callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset.

Returns:

An instance of the UserDataset loaded from the files.

Return type:

UserDataset

roksana.datasets.prepare_search_set(data: Data, percentage: float = 0.1, seed: int = 42) → Tuple[List[int], List[List[int]]]

Prepare a search set for search evaluation by selecting a percentage of nodes as queries and creating corresponding gold sets based on feature similarity.

Parameters:

data (Data) – The graph dataset.
percentage (float, optional) – Fraction of nodes to select as queries. Must be between 0 and 1. Defaults to 0.1 (10%).
seed (int, optional) – Seed for random number generator to ensure reproducibility. Defaults to 42.

Returns:

A tuple containing:

queries (List[int]): List of node indices selected as queries.
gold_sets (List[List[int]]): List of gold sets, where each gold set is a list of node indices with the same features as the corresponding query.

Return type:

Tuple[List[int], List[List[int]]]

Raises:

ValueError – If percentage is not between 0 and 1.
AttributeError – If data does not contain node features (data.x).