# Datasets in Python

There are many providers of free datasets for data science. Some of them are summarized [here](https://datascience.stackexchange.com/questions/155/publicly-available-datasets) and [here](https://www.kdnuggets.com/datasets/index.html). These datasets are often provided through an API and are stored in different formats. The effort of getting them into a `pandas` `DataFrame` is often overkill if we just want to quickly try out a machine-learning algorithm or a visualization. In this post, I give an overview of the "built-in" datasets provided by popular Python data science packages, such as [`statsmodels`](http://www.statsmodels.org), [`scikit-learn`](http://scikit-learn.org), and [`seaborn`](https://seaborn.pydata.org/). These datasets can be easily accessed in the form of a `pandas` `DataFrame` and used for quick experimentation.

## Statsmodels

[Statsmodels provides two types of datasets](http://www.statsmodels.org/dev/datasets/index.html): around two dozen built-in datasets that are installed alongside the `statsmodels` package, and a collection of datasets from multiple R packages that can be downloaded on demand. Both types can be accessed through the `statsmodels.api.datasets` module.

### Built-in Datasets

An example of a built-in dataset is the American National Election Studies of 1996 dataset, which is stored in the `anes96` submodule of the `datasets` module. Every dataset submodule has the attributes `DESCRLONG` and `NOTE`, which give a detailed description of the dataset:

In [1]:
%%capture --no-stdout --no-display
import statsmodels.api as sm
anes96 = sm.datasets.anes96

print(anes96.DESCRLONG)

This data is a subset of the American National Election Studies of 1996.


In [2]:
print(anes96.NOTE)

::

    Number of observations - 944
    Number of variables - 10

    Variables name definitions::

            popul - Census place population in 1000s
            TVnews - Number of times per week that respondent watches TV news.
            PID - Party identification of respondent.
                0 - Strong Democrat
                1 - Weak Democrat
                2 - Independent-Democrat
                3 - Independent-Indpendent
                4 - Independent-Republican
                5 - Weak Republican
                6 - Strong Republican
            age : Age of respondent.
            educ - Education level of respondent
                1 - 1-8 grades
                2 - Some high school
                3 - High school graduate
                4 - Some college
                5 - College degree
                6 - Master's degree
                7 - PhD
            income - Income of household
                1  - None or less than $2,999
                2  - $3,000-$4,9

The data itself is represented by a `Dataset` object that is returned by the `load_pandas()` function of the submodule.

In [3]:
dataset_anes96 = anes96.load_pandas()

The `data` attribute of the `Dataset` object contains a `pandas` `DataFrame` with the data.

In [4]:
df_anes96 = dataset_anes96.data
df_anes96.head()

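| | popul | TVnews | selfLR | ClinLR | DoleLR | PID | age | educ | income | vote | logpopul |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0 | 0.0 | 7.0 | 7.0 | 1.0 | 6.0 | 6.0 | 36.0 | 3.0 | 1.0 | 1.0 | -2.302585 |
| 1 | 190.0 | 1.0 | 3.0 | 3.0 | 5.0 | 1.0 | 20.0 | 4.0 | 1.0 | 0.0 | 5.24755 |
| 2 | 31.0 | 7.0 | 2.0 | 2.0 | 6.0 | 1.0 | 24.0 | 6.0 | 1.0 | 0.0 | 3.437208 |
| 3 | 83.0 | 4.0 | 3.0 | 4.0 | 5.0 | 1.0 | 28.0 | 6.0 | 1.0 | 0.0 | 4.420045 |
| 4 | 640.0 | 7.0 | 5.0 | 6.0 | 4.0 | 0.0 | 68.0 | 6.0 | 1.0 | 0.0 | 6.461624 |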


So, if you know the submodule in which the dataset is stored (e.g., `anes96`), you can get the `DataFrame` with the data in just one line:

In [None]:
sm.datasets.anes96.load_pandas().data

The table below lists all built-in datasets provided by Statsmodels and the corresponding submodules.

Dataset Description | Submodule
:-|:-
[American National Election Survey 1996](http://www.statsmodels.org/dev/datasets/generated/anes96.html) | anes96
[Breast Cancer Data](http://www.statsmodels.org/dev/datasets/generated/cancer.html) | cancer
[Bill Greene’s credit scoring data.](http://www.statsmodels.org/dev/datasets/generated/ccard.html) | ccard
[Smoking and lung cancer in eight cities in China.](http://www.statsmodels.org/dev/datasets/generated/china_smoking.html) | china_smoking
[Mauna Loa Weekly Atmospheric CO2 Data](http://www.statsmodels.org/dev/datasets/generated/co2.html) | co2
[First 100 days of the US House of Representatives 1995](http://www.statsmodels.org/dev/datasets/generated/committee.html) | committee
[World Copper Market 1951-1975 Dataset](http://www.statsmodels.org/dev/datasets/generated/copper.html) | copper
[US Capital Punishment dataset.](http://www.statsmodels.org/dev/datasets/generated/cpunish.html) | cpunish
[El Nino - Sea Surface Temperatures](http://www.statsmodels.org/dev/datasets/generated/elnino.html) | elnino
[Engel (1857) food expenditure data](http://www.statsmodels.org/dev/datasets/generated/engel.html) | engel
[Affairs dataset](http://www.statsmodels.org/dev/datasets/generated/fair.html) | fair
[World Bank Fertility Data](http://www.statsmodels.org/dev/datasets/generated/fertility.html) | fertility
[Grunfeld (1950) Investment Data](http://www.statsmodels.org/dev/datasets/generated/grunfeld.html) | grunfeld
[Transplant Survival Data](http://www.statsmodels.org/dev/datasets/generated/heart.html) | heart
[Longley dataset](http://www.statsmodels.org/dev/datasets/generated/longley.html) | longley
[United States Macroeconomic data](http://www.statsmodels.org/dev/datasets/generated/macrodata.html) | macrodata
[Travel Mode Choice](http://www.statsmodels.org/dev/datasets/generated/modechoice.html) | modechoice
[Nile River flows at Ashwan 1871-1970](http://www.statsmodels.org/dev/datasets/generated/nile.html) | nile
[RAND Health Insurance Experiment Data](http://www.statsmodels.org/dev/datasets/generated/randhie.html) | randhie
[Taxation Powers Vote for the Scottish Parliament 1997](http://www.statsmodels.org/dev/datasets/generated/scotland.html) | scotland
[Spector and Mazzeo (1980) - Program Effectiveness Data](http://www.statsmodels.org/dev/datasets/generated/spector.html) | spector
[Stack loss data](http://www.statsmodels.org/dev/datasets/generated/stackloss.html) | stackloss
[Star98 Educational Dataset](http://www.statsmodels.org/dev/datasets/generated/star98.html) | star98
[Statewide Crime Data 2009](http://www.statsmodels.org/dev/datasets/generated/statecrime.html) | statecrime
[U.S. Strike Duration Data](http://www.statsmodels.org/dev/datasets/generated/strikes.html) | strikes
[Yearly sunspots data 1700-2008](http://www.statsmodels.org/dev/datasets/generated/sunspots.html) | sunspots

### Datasets from R

Besides the built-in datasets, Statsmodels provides access to 1173 datasets from the [Rdatasets project](https://github.com/vincentarelbundock/Rdatasets). The Rdatasets project is a collection of datasets that were originally distributed with R and its add-on packages. To access a particular dataset, you need its name and the name of the original R package. For example, the famous iris dataset, which is often used to demonstrate classification algorithms, can be accessed under the name "iris" and the package "datasets". Calling the `get_rdataset()` function with these arguments downloads the corresponding dataset from the Rdatasets project's repository and returns it in a `Dataset` object:

In [6]:
# import statsmodels.api as sm
dataset_iris = sm.datasets.get_rdataset(dataname='iris', package='datasets')

The `__doc__` attribute of the `Dataset` object stores a detailed description of the dataset.

In [7]:
print(dataset_iris.__doc__)

+--------+-------------------+
| iris   | R Documentation   |
+--------+-------------------+

Edgar Anderson's Iris Data
--------------------------

Description
~~~~~~~~~~~

This famous (Fisher's or Anderson's) iris data set gives the
measurements in centimeters of the variables sepal length and width and
petal length and width, respectively, for 50 flowers from each of 3
species of iris. The species are *Iris setosa*, *versicolor*, and
*virginica*.

Usage
~~~~~

::

    iris
    iris3

Format
~~~~~~

``iris`` is a data frame with 150 cases (rows) and 5 variables (columns)
named ``Sepal.Length``, ``Sepal.Width``, ``Petal.Length``,
``Petal.Width``, and ``Species``.

``iris3`` gives the same data arranged as a 3-dimensional array of size
50 by 4 by 3, as represented by S-PLUS. The first dimension gives the
case number within the species subsample, the second the measurements
with names ``Sepal L.``, ``Sepal W.``, ``Petal L.``, and ``Petal W.``,
and the third the species.

Source
~~~~~~



The `data` attribute stores a `pandas` `DataFrame` with the data:

In [8]:
df_iris = dataset_iris.data
df_iris.head()

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|:-|:-|:-|:-|:-|:-|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


So, if you know the dataname and the package of a dataset (e.g., "iris" and "datasets"), you can download the data and get the corresponding `DataFrame` in just one line:

In [None]:
sm.datasets.get_rdataset(dataname='iris', package='datasets').data

[This index](https://vincentarelbundock.github.io/Rdatasets/datasets.html) provides a complete overview of all datasets available in the Rdatasets repository, with the corresponding datanames (the `item` column) and packages (the `package` column). The index is also available in [CSV format](http://vincentarelbundock.github.com/Rdatasets/datasets.csv).
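
Once the index is in a `DataFrame`, finding the dataname and package for `get_rdataset()` becomes a simple filter. The sketch below illustrates the idea on a three-row stand-in for the index that contains only datasets mentioned in this post, so that it runs without a network connection. The helper name `find_rdatasets` is made up for this example, and the column names are lower-cased first because I am not certain how they are capitalized in the real CSV file.

```python
import pandas as pd


def find_rdatasets(index, keyword):
    """Return the package/item pairs whose item name contains `keyword`."""
    # Normalize the header so both 'Item' and 'item' work.
    idx = index.rename(columns=str.lower)
    mask = idx["item"].str.contains(keyword, case=False, regex=False)
    return idx.loc[mask, ["package", "item"]].reset_index(drop=True)


# Stand-in for the real index; to use the full index instead, pass the CSV
# link given above to pd.read_csv().
sample_index = pd.DataFrame(
    {
        "Package": ["datasets", "MASS", "MASS"],
        "Item": ["iris", "Boston", "biopsy"],
    }
)

matches = find_rdatasets(sample_index, "bos")
print(matches)
```

Each row of the result can then be passed on as `sm.datasets.get_rdataset(dataname=row.item, package=row.package)`.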

## Scikit-learn

[Scikit-learn's `datasets` module provides 7 built-in toy datasets](http://scikit-learn.org/stable/datasets/index.html) that are used in Scikit-learn's documentation for quick illustration of the algorithms, but are too small to be representative of real-world data. More interestingly, Scikit-learn also provides a set of random sample generators that can be used to generate artificial datasets of controlled size and complexity for different machine-learning problems.

### Built-in Toy Datasets

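For each of the built-in datasets, there is a load function that returns a `Bunch` object representing the dataset. For example, the Boston House Prices dataset can be loaded with the `load_boston()` function. Note that `load_boston()` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the cells below require an older release: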

In [10]:
from sklearn import datasets
dataset_boston = datasets.load_boston()

The `DESCR` attribute of the `Bunch` object stores a detailed description of the dataset:

In [11]:
print(dataset_boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

The data itself is provided in the form of two `numpy` arrays: one for the independent variables (the `Bunch.data` attribute) and one for the dependent variable (the `Bunch.target` attribute). The names of the features are stored in the `Bunch.feature_names` attribute. A `pandas` `DataFrame` can be easily constructed from a `numpy` array and a list of feature names:

In [12]:
import pandas as pd

# Independent variables (i.e. features)
df_boston_features = pd.DataFrame(data=dataset_boston.data, columns=dataset_boston.feature_names)
df_boston_features.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |


In [13]:
# Dependent variable (i.e. target)
df_boston_target = pd.DataFrame(data=dataset_boston.target, columns=['MEDV'])
df_boston_target.head()

| | MEDV |
|:-|:-|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |
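
For quick experiments it is often handier to have features and target in a single `DataFrame`. The sketch below wraps the construction shown above in a small helper; the name `bunch_to_frame` and the default column name `"target"` are my own choices, not part of Scikit-learn. It is demonstrated on `load_diabetes()`, one of the other toy datasets, because that loader is available in current Scikit-learn releases. The helper assumes a single target column, so it would need adjusting for multi-target datasets such as Linnerud.

```python
import pandas as pd
from sklearn import datasets


def bunch_to_frame(bunch, target_name="target"):
    """Combine the features and the (single) target of a Bunch in one DataFrame."""
    df = pd.DataFrame(data=bunch.data, columns=bunch.feature_names)
    df[target_name] = bunch.target
    return df


df_diabetes = bunch_to_frame(datasets.load_diabetes())
print(df_diabetes.head())
```

Recent Scikit-learn versions also accept `as_frame=True` in most load functions, in which case the returned `Bunch` has a ready-made `frame` attribute; the helper above is mainly useful where that option is not available.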


The table below lists the built-in datasets in Scikit-learn and the corresponding load functions. Some of these datasets are also available in Statsmodels through the Rdatasets project. The corresponding datanames and packages for accessing these datasets from Statsmodels are also listed.

Description | scikit-learn | statsmodels
:--|:--|:--
Boston house-prices dataset (regression) | `load_boston()` | `dataname='Boston', package='MASS'`
The iris dataset (classification) | `load_iris()` | `dataname='iris', package='datasets'`
The diabetes dataset (regression) | `load_diabetes()` | --
The digits dataset (classification) | `load_digits()` | --
The Linnerud dataset (multivariate regression) | `load_linnerud()` | --
The wine dataset (classification) | `load_wine()` | --
The breast cancer Wisconsin dataset (classification) | `load_breast_cancer()` | `dataname='biopsy', package='MASS'` (a related Wisconsin breast cancer dataset with different features, not an identical copy)

### Random Sample Generators

Besides the built-in datasets, Scikit-learn's `datasets` module provides multiple generators that can generate random data for regression, classification, and clustering problems.

**`make_regression()`** generates a random regression problem. To generate one with 5 samples, 4 features (2 of which are informative, that is, influence the target variable), and 1 target variable, run:

In [14]:
X, y = datasets.make_regression(n_samples=5, n_features=4, n_informative=2, n_targets=1)
X, y

(array([[ 1.05346552, -0.34351678,  0.93001029, -0.91155277],
        [-2.36111187,  0.38597133, -0.56347197, -0.69699855],
        [-0.04006776,  0.14933861, -0.38230566,  0.21558756],
        [ 1.95034825,  0.0027949 , -1.55220034, -2.44003012],
        [-0.22771902,  0.48066688, -0.02599171,  0.57654205]]),
 array([ -82.91729951,  -54.04014369,   20.59313399, -208.64138022,
          56.245038  ]))
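
The call above returns different numbers on every run. When an experiment needs to be repeatable, the generators accept a `random_state` argument. The sketch below shows that two calls with the same seed produce identical arrays; the seed value `0` is arbitrary.

```python
import numpy as np
from sklearn import datasets

# Same arguments as above, plus a fixed seed.
params = dict(n_samples=5, n_features=4, n_informative=2, n_targets=1, random_state=0)

X1, y1 = datasets.make_regression(**params)
X2, y2 = datasets.make_regression(**params)

# With a fixed random_state, both calls generate exactly the same problem.
same = np.array_equal(X1, X2) and np.array_equal(y1, y2)
print(X1.shape, y1.shape, same)
```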

**`make_classification()`** generates a random classification problem. To generate one with 5 samples, 3 features (2 of which are informative and 1 of which is redundant), 2 classes, and 1 cluster per class, run:

In [15]:
X, y = datasets.make_classification(n_samples=5, n_features=3, n_informative=2, n_redundant=1, n_classes=2, n_clusters_per_class=1)
X, y

(array([[ 1.66628744, -2.02640651,  1.25452612],
        [ 1.27026677, -0.76741118,  0.47848104],
        [ 0.90954462,  1.11831033, -0.68264732],
        [ 0.18963063,  1.44678466, -0.88838424],
        [ 0.23511216, -0.7861893 ,  0.48454277]]), array([0, 0, 1, 1, 0]))

**`make_blobs()`** generates a random clustering problem. To generate one with 5 samples, 3 centers, and 2 features, run:

In [16]:
X, y = datasets.make_blobs(n_samples=5, centers=3, n_features=2)
X, y

(array([[-1.55890987, -1.81384645],
        [ 4.36009723,  2.53848732],
        [ 4.48446074,  0.77947815],
        [ 4.30121547, -2.13087759],
        [ 3.68318513, -3.04260308]]), array([2, 0, 0, 1, 1]))
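
The generators return plain `numpy` arrays, but they can be wrapped in a `DataFrame` in the same way as the toy datasets. The following is a minimal sketch for the blobs example; the column names `x0`, `x1`, and `cluster` are made up for illustration, and the fixed `random_state` is only there to make the result repeatable.

```python
import pandas as pd
from sklearn import datasets

# Generate the same kind of clustering problem as above, with a fixed seed.
X, y = datasets.make_blobs(n_samples=5, centers=3, n_features=2, random_state=0)

# One column per feature, plus the index of the generating center.
df_blobs = pd.DataFrame(X, columns=["x0", "x1"])
df_blobs["cluster"] = y

print(df_blobs)
```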

## Seaborn

Seaborn provides 13 datasets from its own [collection](https://github.com/mwaskom/seaborn-data/). The available datasets can be listed with the `get_dataset_names()` function:

In [17]:
%%capture --no-stdout --no-display
import seaborn as sns
sns.get_dataset_names()

['anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'iris',
 'planets',
 'tips',
 'titanic']

The data in a dataset can be accessed in the form of a `pandas` `DataFrame` by calling the `load_dataset()` function with the name of the dataset as the argument:

In [18]:
df_planets = sns.load_dataset('planets')
df_planets.head()

| | method | number | orbital_period | mass | distance | year |
|:-|:-|:-|:-|:-|:-|:-|
| 0 | Radial Velocity | 1 | 269.3 | 7.1 | 77.4 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.0 | 2.6 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.03 | 19.4 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.22 | 10.5 | 119.47 | 2009 |


Some datasets, such as anscombe and iris, seem to come from the R collection, while others do not. Seaborn does not provide descriptions of the datasets, which reduces their usefulness.
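
Without a description, a quick per-column summary is often the fastest way to get a feel for an unfamiliar dataset. The sketch below is one possible approach using only `pandas`; the helper name `summarize_columns` is made up for this example. To keep it runnable without a network connection, it uses the five `planets` rows shown above instead of calling `load_dataset()`.

```python
import pandas as pd

# First five rows of the planets dataset, copied from the output above.
df_planets_head = pd.DataFrame(
    {
        "method": ["Radial Velocity"] * 5,
        "number": [1, 1, 1, 1, 1],
        "orbital_period": [269.3, 874.774, 763.0, 326.03, 516.22],
        "mass": [7.1, 2.21, 2.6, 19.4, 10.5],
        "distance": [77.4, 56.95, 19.84, 110.62, 119.47],
        "year": [2006, 2008, 2011, 2007, 2009],
    }
)


def summarize_columns(df):
    """Return the type, missing-value count, and distinct-value count per column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),
        }
    )


summary = summarize_columns(df_planets_head)
print(summary)
```

The same helper can be applied to the full `DataFrame` returned by `sns.load_dataset('planets')`.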

## Summary

[`Statsmodels`](http://www.statsmodels.org), [`scikit-learn`](http://scikit-learn.org), and [`seaborn`](https://seaborn.pydata.org/) provide convenient access to a large number of datasets of different sizes and from different domains. With one or two lines of code, these datasets can be loaded in a Python script in the form of a `pandas` `DataFrame`. This is particularly useful for quick experimentation with machine-learning algorithms and visualizations.