Dataset structure
Most of the repositories available in scikit-datasets provide their datasets in a regular format.
In that case, the repository's fetch function in scikit-datasets converts the data to a standardized format, similar to the one used in scikit-learn, but with new optional fields for additional features that some repositories include, such as indices for train, validation and test partitions.
Note
Data in the CRAN repository is unstructured, and thus there is no fetch
function for it. The data is returned in the original format.
The structure is a Bunch object with the following fields:
data
: The matrix of observed data. A 2d NumPy array, ready to be used with scikit-learn tools. Each row corresponds to a different observation, while each column is a particular feature. For datasets with train, validation and test partitions, the whole data is included here. Use train_indices, validation_indices and test_indices to select each partition.
target
: The target of the classification or regression problem. This is a 1d NumPy array except for multioutput problems, in which it is a 2d array where each column corresponds to a different output.
DESCR
: A human-readable description of the dataset.
feature_names
: The list of feature names, if the repository has that information available.
target_names
: For classification problems, this corresponds to the names of the different classes, if available. Note that scikit-learn uses this field in some cases for naming the outputs in multioutput problems. As we will try to maintain compatibility with scikit-learn, the meaning of this field could change in future versions.
train_indices
: Indices of the elements of the train partition, if available in the repository.
validation_indices
: Indices of the elements of the validation partition, if available in the repository.
test_indices
: Indices of the elements of the test partition, if available in the repository.
inner_cv
: A CV splitter object, usable for cross validation and hyperparameter selection, if the repository provides a cross validation strategy (such as using a particular validation partition).
outer_cv
: A Python iterable over different train and test partitions, when they are provided in the repository.
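The fields above can be illustrated with a small hand-built example. The following is a minimal sketch, not output from any scikit-datasets fetch function: the Bunch is constructed manually with synthetic values, and the use of scikit-learn's `PredefinedSplit` to express a fixed validation partition is an assumption about how an `inner_cv` splitter could look, not a statement of what the library returns.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit
from sklearn.utils import Bunch

# Synthetic stand-in for a fetched dataset (illustrative values only).
rng = np.random.default_rng(0)
dataset = Bunch(
    data=rng.normal(size=(10, 3)),
    target=np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]),
    DESCR="Toy dataset built by hand for illustration.",
    feature_names=["f0", "f1", "f2"],
    target_names=["negative", "positive"],
    train_indices=np.arange(0, 6),
    validation_indices=np.arange(6, 8),
    test_indices=np.arange(8, 10),
)

# The whole data matrix is stored once; the index arrays select each partition.
X_train = dataset.data[dataset.train_indices]
y_train = dataset.target[dataset.train_indices]
X_val = dataset.data[dataset.validation_indices]
X_test = dataset.data[dataset.test_indices]

# Assumption: a fixed validation partition expressed as a CV splitter.
# -1 marks samples that always stay in training, 0 marks the single validation fold.
dev_indices = np.concatenate([dataset.train_indices, dataset.validation_indices])
test_fold = np.concatenate([
    np.full(len(dataset.train_indices), -1),
    np.zeros(len(dataset.validation_indices), dtype=int),
])
inner_cv = PredefinedSplit(test_fold)

# Positions yielded by the splitter refer to rows of data[dev_indices].
splits = list(inner_cv.split(dataset.data[dev_indices]))
```

A splitter built this way can be passed as the `cv` argument of scikit-learn tools such as `GridSearchCV`, as long as the estimator is fitted on `dataset.data[dev_indices]` so that the positions line up.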