Datasets in Python
There are many providers of free datasets for data science. Some of them are summarized here and here. These datasets are often provided through an API and are stored in different formats. Getting them into a pandas DataFrame is often overkill if we just want to quickly try out some machine-learning algorithm or a visualization. In this post, I give an overview of “built-in” datasets that are provided by popular Python data science packages, such as statsmodels, scikit-learn, and seaborn. These datasets can be easily accessed in the form of a pandas DataFrame and can be used for quick experimentation.
Statsmodels
Statsmodels provides two types of datasets: around two dozen built-in datasets that are installed alongside the statsmodels package, and a collection of datasets from multiple R packages that can be downloaded on demand. Both types of datasets can be easily accessed via the statsmodels.api.datasets module.
Built-in Datasets
An example of a built-in dataset is the American National Election Studies of 1996 dataset, which is stored in the anes96 submodule of the datasets module. Every dataset submodule has the attributes DESCRLONG and NOTE that give a detailed description of the dataset:
import statsmodels.api as sm
anes96 = sm.datasets.anes96
print(anes96.DESCRLONG)
This data is a subset of the American National Election Studies of 1996.
print(anes96.NOTE)
::

    Number of observations - 944
    Number of variables - 10

    Variables name definitions::

        popul - Census place population in 1000s
        TVnews - Number of times per week that respondent watches TV news.
        PID - Party identification of respondent.
            0 - Strong Democrat
            1 - Weak Democrat
            2 - Independent-Democrat
            3 - Independent-Indpendent
            4 - Independent-Republican
            5 - Weak Republican
            6 - Strong Republican
        age : Age of respondent.
        educ - Education level of respondent
            1 - 1-8 grades
            2 - Some high school
            3 - High school graduate
            4 - Some college
            5 - College degree
            6 - Master's degree
            7 - PhD
        income - Income of household
            1 - None or less than $2,999
            2 - $3,000-$4,999
            3 - $5,000-$6,999
            4 - $7,000-$8,999
            5 - $9,000-$9,999
            6 - $10,000-$10,999
            7 - $11,000-$11,999
            8 - $12,000-$12,999
            9 - $13,000-$13,999
            10 - $14,000-$14.999
            11 - $15,000-$16,999
            12 - $17,000-$19,999
            13 - $20,000-$21,999
            14 - $22,000-$24,999
            15 - $25,000-$29,999
            16 - $30,000-$34,999
            17 - $35,000-$39,999
            18 - $40,000-$44,999
            19 - $45,000-$49,999
            20 - $50,000-$59,999
            21 - $60,000-$74,999
            22 - $75,000-89,999
            23 - $90,000-$104,999
            24 - $105,000 and over
        vote - Expected vote
            0 - Clinton
            1 - Dole
        The following 3 variables all take the values:
            1 - Extremely liberal
            2 - Liberal
            3 - Slightly liberal
            4 - Moderate
            5 - Slightly conservative
            6 - Conservative
            7 - Extremely Conservative
        selfLR - Respondent's self-reported political leanings from "Left"
            to "Right".
        ClinLR - Respondents impression of Bill Clinton's political
            leanings from "Left" to "Right".
        DoleLR - Respondents impression of Bob Dole's political leanings
            from "Left" to "Right".
        logpopul - log(popul + .1)
The data itself is represented by a Dataset object that is returned by the load_pandas() function of the submodule.
dataset_anes96 = anes96.load_pandas()
The data property of the Dataset object contains a pandas DataFrame with the data.
df_anes96 = dataset_anes96.data
df_anes96.head()
popul TVnews selfLR ClinLR DoleLR PID age educ income vote logpopul
0 0.0 7.0 7.0 1.0 6.0 6.0 36.0 3.0 1.0 1.0 -2.302585
1 190.0 1.0 3.0 3.0 5.0 1.0 20.0 4.0 1.0 0.0 5.247550
2 31.0 7.0 2.0 2.0 6.0 1.0 24.0 6.0 1.0 0.0 3.437208
3 83.0 4.0 3.0 4.0 5.0 1.0 28.0 6.0 1.0 0.0 4.420045
4 640.0 7.0 5.0 6.0 4.0 0.0 68.0 6.0 1.0 0.0 6.461624
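Besides data, the Dataset object also pre-splits the dataset into its dependent and independent variables, which load_pandas() exposes as pandas objects via the endog and exog attributes (which columns end up where is defined per dataset):
# Dependent variable(s), as defined by the dataset
dataset_anes96.endog.head()
# Independent variables
dataset_anes96.exog.head()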
So, if you know the submodule in which the dataset is stored (e.g., anes96), you can get the DataFrame with the data in just one line:
sm.datasets.anes96.load_pandas().data
Each built-in dataset lives in its own submodule of sm.datasets, so the full list of built-in datasets and the corresponding submodules can be produced programmatically, as shown below.
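The following is a minimal sketch for enumerating the built-in datasets; it assumes that every dataset submodule provides a load_pandas() function and a TITLE attribute holding a one-line description (which the built-in statsmodels datasets do at the time of writing):
import statsmodels.api as sm
# Keep only the members of sm.datasets that look like dataset submodules,
# identified here by the presence of a load_pandas() function
for name in dir(sm.datasets):
    submodule = getattr(sm.datasets, name)
    if hasattr(submodule, 'load_pandas'):
        print(name, '-', getattr(submodule, 'TITLE', '').strip())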
Datasets from R
Besides the built-in datasets, Statsmodels provides access to 1173 datasets from the Rdatasets project. The Rdatasets project is a collection of datasets that were originally distributed with R and its add-on packages. To access a particular dataset, you need its name and the name of the original R package. For example, the famous iris dataset, which is often used to demonstrate classification algorithms, can be accessed under the name “iris” and package “datasets”. Calling the get_rdataset() function with these arguments downloads the corresponding dataset from the Rdatasets project’s repository and returns it in a Dataset object:
import statsmodels.api as sm
dataset_iris = sm.datasets.get_rdataset(dataname='iris', package='datasets')
The __doc__ attribute of the Dataset object stores a detailed description of the dataset:
print(dataset_iris.__doc__)
+--------+-------------------+
| iris | R Documentation |
+--------+-------------------+
Edgar Anderson's Iris Data
--------------------------
Description
~~~~~~~~~~~
This famous (Fisher's or Anderson's) iris data set gives the
measurements in centimeters of the variables sepal length and width and
petal length and width, respectively, for 50 flowers from each of 3
species of iris. The species are *Iris setosa*, *versicolor*, and
*virginica*.
Usage
~~~~~
::
iris
iris3
Format
~~~~~~
``iris`` is a data frame with 150 cases (rows) and 5 variables (columns)
named ``Sepal.Length``, ``Sepal.Width``, ``Petal.Length``,
``Petal.Width``, and ``Species``.
``iris3`` gives the same data arranged as a 3-dimensional array of size
50 by 4 by 3, as represented by S-PLUS. The first dimension gives the
case number within the species subsample, the second the measurements
with names ``Sepal L.``, ``Sepal W.``, ``Petal L.``, and ``Petal W.``,
and the third the species.
Source
~~~~~~
Fisher, R. A. (1936) The use of multiple measurements in taxonomic
problems. *Annals of Eugenics*, **7**, Part II, 179–188.
The data were collected by Anderson, Edgar (1935). The irises of the
Gaspe Peninsula, *Bulletin of the American Iris Society*, **59**, 2–5.
References
~~~~~~~~~~
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) *The New S
Language*. Wadsworth & Brooks/Cole. (has ``iris3`` as ``iris``.)
See Also
~~~~~~~~
``matplot`` some examples of which use ``iris``.
Examples
~~~~~~~~
::
dni3 <- dimnames(iris3)
ii <- data.frame(matrix(aperm(iris3, c(1,3,2)), ncol = 4,
                        dimnames = list(NULL, sub(" L.",".Length",
                                        sub(" W.",".Width", dni3[[2]])))),
                 Species = gl(3, 50, labels = sub("S", "s", sub("V", "v", dni3[[3]]))))
all.equal(ii, iris) # TRUE
The data attribute stores a pandas DataFrame with the data:
df_iris = dataset_iris.data
df_iris.head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
So, if you know the dataname and the package of a dataset (e.g., “iris” and “datasets”), you can download the data and get the corresponding DataFrame in just one line:
sm.datasets.get_rdataset(dataname='iris', package='datasets').data
This index provides a complete overview of all datasets available in the Rdatasets repository, with the corresponding datanames (the item column) and packages (the package column). The index is also available in CSV format.
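The CSV version of the index can itself be read directly into pandas. A minimal sketch, assuming the index is still published at the URL below (the layout of the Rdatasets repository may change):
import pandas as pd
# Assumed location of the Rdatasets CSV index
url = 'https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/datasets.csv'
df_index = pd.read_csv(url)
df_index.head()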
Scikit-learn
Scikit-learn’s datasets module provides 7 built-in toy datasets that are used in Scikit-learn’s documentation to quickly illustrate the algorithms, but are actually too small to be representative of real-world data. More interestingly, Scikit-learn also provides a set of random sample generators that can be used to generate artificial datasets of controlled size and complexity for different machine-learning problems.
Built-in Toy Datasets
For each of the built-in datasets there is a load function that returns a Bunch object representing the dataset. For example, the Boston House Prices dataset can be loaded with the load_boston() function:
from sklearn import datasets
dataset_boston = datasets.load_boston()
The DESCR attribute of the Bunch object stores a detailed description of the dataset:
print(dataset_boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
The data itself is provided in the form of two numpy arrays: one for the independent variables (the Bunch.data attribute) and one for the dependent variables (the Bunch.target attribute). The names of the features are stored in the Bunch.feature_names attribute. A pandas DataFrame can be easily constructed from a numpy array and a list of feature names:
import pandas as pd
# Independent variables (i.e. features)
df_boston_features = pd.DataFrame(data=dataset_boston.data,
columns=dataset_boston.feature_names)
df_boston_features.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
# Dependent variables (i.e. targets)
df_boston_target = pd.DataFrame(data=dataset_boston.target,
columns=['MEDV'])
df_boston_target.head()
MEDV
0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
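If a single table is more convenient, the two frames can be concatenated column-wise (a small pandas convenience step, not part of the Scikit-learn API):
# Combine the features and the target into one DataFrame
df_boston = pd.concat([df_boston_features, df_boston_target], axis=1)
df_boston.head()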
The table below lists the built-in datasets in Scikit-learn and the corresponding load functions. Some of these datasets are also available in Statsmodels through the Rdatasets project; the corresponding datanames and packages for accessing them from Statsmodels are also listed (see the example after the table).
Description | scikit-learn | statsmodels |
---|---|---|
Boston house-prices dataset (regression) | load_boston() | dataname='Boston', package='MASS' |
The iris dataset (classification) | load_iris() | dataname='iris', package='datasets' |
The diabetes dataset (regression) | load_diabetes() | – |
The digits dataset (classification) | load_digits() | – |
The Linnerud dataset (multivariate regression) | load_linnerud() | – |
The wine dataset (classification) | load_wine() | – |
The breast cancer Wisconsin dataset (classification) | load_breast_cancer() | dataname='biopsy', package='MASS' |
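For instance, the Boston dataset from the first row of the table can be fetched through Statsmodels like this (note that the R version from the MASS package may differ slightly from Scikit-learn’s copy, e.g., in column naming):
import statsmodels.api as sm
# Download the Boston house-price data from the Rdatasets project
df_boston_r = sm.datasets.get_rdataset(dataname='Boston', package='MASS').data
df_boston_r.head()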
Random Sample Generators
Besides the built-in datasets, Scikit-learn’s datasets module provides multiple generators that can generate random data for regression, classification, and clustering problems.
make_regression() generates a random regression problem. To generate a random regression problem with 5 samples, 4 features (2 of which are informative, that is, influence the target variable), and 1 target variable, run:
X, y = datasets.make_regression(n_samples=5, n_features=4, n_informative=2, n_targets=1)
X, y
(array([[-0.42590826, -1.13659088, -1.12081439, 0.19618605],
[-0.24469292, 1.49406562, 0.00402701, 0.0853733 ],
[ 0.26415806, -0.89860911, -0.33905577, -0.51633174],
[-1.9304862 , 0.30020868, 0.76146937, -1.05332537],
[ 0.41696969, -1.12000527, 0.70649028, 0.76996861]]),
array([-64.48100936, -15.23036444, 5.17304557, -95.59148543, 49.97033047]))
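make_regression() can also return the coefficients of the underlying linear model that generated the targets, which is useful for checking how well an estimator recovers them; pass coef=True:
# Additionally return the ground-truth coefficients; the non-zero
# entries correspond to the informative features
X, y, coef = datasets.make_regression(n_samples=5, n_features=4,
                                      n_informative=2, n_targets=1,
                                      coef=True)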
make_classification() generates a random classification problem. To generate a random classification problem with 5 samples, 3 features (2 of which are informative and 1 is redundant), 2 classes, and 1 cluster per class, run:
X, y = datasets.make_classification(n_samples=5, n_features=3, n_informative=2, n_redundant=1, n_classes=2, n_clusters_per_class=1)
X, y
(array([[ 1.12990679, -0.77494575, -0.33554705],
[-1.16200073, -1.38196269, 1.07136609],
[-0.28757222, 0.27740255, 0.05867687],
[ 0.18949525, 0.60855967, -0.30244284],
[ 1.65745753, 0.04606751, -0.88648095]]),
array([1, 0, 0, 0, 1]))
make_blobs() generates a random clustering problem. To generate a random clustering problem with 5 samples, 3 centers, and 2 features, run:
X, y = datasets.make_blobs(n_samples=5, centers=3, n_features=2)
X, y
(array([[-0.1460543 , 7.99240431],
[-0.59781982, 5.73052323],
[ 5.0929357 , -3.77332084],
[ 4.56787662, -3.30370353],
[ 7.89484287, -4.47653814]]),
array([1, 1, 0, 0, 2]))
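All three generators draw new random data on every call, so the outputs above will differ from run to run; passing the random_state parameter makes a call reproducible:
# Fix the random seed to get identical blobs on every call
X, y = datasets.make_blobs(n_samples=5, centers=3, n_features=2,
                           random_state=42)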
Seaborn
Seaborn provides 13 datasets from its own collection. The available datasets can be listed with the get_dataset_names() function:
import seaborn as sns
sns.get_dataset_names()
['anscombe',
'attention',
'brain_networks',
'car_crashes',
'dots',
'exercise',
'flights',
'fmri',
'gammas',
'iris',
'planets',
'tips',
'titanic']
The data in a dataset can be accessed in the form of a pandas DataFrame by calling the load_dataset() function with the name of the dataset as the argument:
df_planets = sns.load_dataset('planets')
df_planets.head()
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
Some datasets, such as anscombe and iris, seem to come from the R collection, while others do not. No descriptions of the datasets are available, which reduces their usefulness.
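In the absence of a built-in description, pandas’ own introspection methods are the quickest way to get an overview of a loaded dataset:
# Column types and non-null counts
df_planets.info()
# Summary statistics of the numeric columns
df_planets.describe()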
Summary
Statsmodels, scikit-learn, and seaborn provide convenient access to a large number of datasets of different sizes and from different domains. In one or two lines of code, these datasets can be accessed in a Python script in the form of a pandas DataFrame. This is particularly useful for quick experimentation with machine-learning algorithms and visualizations.