Pandas - Random Sample of a subset of a DataFrame - rows or columns

Pandas - Random Sample of a subset of a DataFrame - rows or columns

This post describes how to DataFrame sampling in Pandas works: basics, conditionals and by group. You can use the following code in order to get random sample of DataFrame by using Pandas and Python:

df.sample()

The rest of the article contains explanation of the functions, advanced examples and interesting use cases. Check it if you want to dive deeper into data sampling with Pandas.

Steps to generate random sample of data with Pandas

Step 1: Random sampling of rows(columns) from DataFrame by sample()

The easiest way to generate random set of rows with Python and Pandas is by: df.sample. By default returns one random row from DataFrame:

# Default behavior of sample()
df.sample()

result:

row3433

If you like to get more than a single row than you can provide a number as parameter:

# return n rows
df.sample(3)

result:

row23
row2256
row4774

It can be applied for rows by using axis=1 as shown below.

# columns
df.sample(3, axis=1).head()

result:

col11, col2, col7

If you like to get a fraction (or percentage of your data) than you can use parameter fraction. In the example below dataset has 5000 rows and 0.001 will return exactly 5 random rows:

# The fraction of rows and columns: frac
df.sample(frac=0.001)

Finally you can provide seed for the better randomization - random_state

# sample with seed
df.sample(n=3, random_state=5)

This is the explanation of this parameter:

random_state : int or numpy.random.RandomState, optional
Seed for the random number generator (if int), or numpy RandomState object.

Step 2: Get random rows with np.random.choice

As alternative or if you want to engineer your own random mechanism you can use np.random.choice - in order to generate sample of index numbers and later to check/amend the selection. Finally you can access by iloc:

import numpy as np
random_idx = np.random.choice(1000, replace=False, size=5)
df.iloc[random_idx]

result:

646	Color ..
550	Color ..

Note: That using: np.random.choice(1000, limit the selection to first 1000 rows!

Step 3: Random sample of rows based on column value

The final step of data sampling with Pandas is the case when you have condition based on the values of a given column. Let say that you have column with several values:

  • color
  • black/white

and you want to get 3 samples for the first type and 3 for the second. The code below will show you how to achieve this:

col = 'color'
for typ in list(df[col].dropna().unique()):
    print(typ, end=' - ')
    display(df[df[col] == typ].sample(3))

output is like:

Color - 

    2867	Color ..
    4345	Color ..
    764 	Color ..

Black and White -     

    286	    Black and White ..
    3904	Black and White ..
    1595	Black and White ..
    ...

In case of more values in the column and if you like to get only one random row per kind from this DataFrame - you can use the code below:

col = 'content_rating'
sample = []

variants = list(df[col].dropna().unique())
print(variants)

for typ in variants:
    sample.append(df[df[col] == typ].sample())
pd.concat(sample)

which result in:

['PG-13', 'PG', 'G', 'R', 'TV-14', 'TV-PG', 'TV-MA', 'TV-G', 'Not Rated', 'Unrated', 'Approved', 'TV-Y', 'NC-17', 'X', 'TV-Y7', 'GP', 'Passed', 'M']
1514 ..
4526 ..
2686 ..
..

One sample row per each variation in column - content_rating - group and displayed as a single DataFrame.

Step 4: Conditional DataFrame sampling with numpy and weights

Lets work again with the same column color and this time to sample rows all except - 'Color' - in this case we can use np.where in order to built weights. This weights will be passed to method sample which is going to generate samples based on our weighted list:

# excluding 'Color' values by applying weights 0 - Color and 1 - rest
df['weights'] = np.where(df['color'] == 'Color', .0, 1)
df.sample(frac=.001, weights='weights')

result in:

4846 NaN .. 
4370 Black and White ..
...

In case that we need to get random rows only for 'Color' then we can do:

# Including only 'Color' values by applying weights 1 - Color and 0 - rest
df['weights'] = np.where(df['color'] == 'Color', 1, 0.0)
df.sample(frac=.001, weights='weights')

result in:

3036 Color ..
4632 Color ..
...

How the conditional sampling with np.where works? You can understand from the code below:

np.where(df['color'] == 'Color', 1, 0)[270:280]
print(list(df['color'][270:280]))

This produce the following weights which are passed to the sample method:

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0])
['Color', 'Color', ' Black and White', 'Color', 'Color', 'Color', 'Color', 'Color', 'Color', nan]

So in case of matched condition the first value is applied - 1, otherwise the second one. This list is passed as probability to the sampling method which generate the data based on it.

This is the official explanation:

weights : str or ndarray-like, optional
Default ‘None’ results in equal probability weighting. If passed a Series, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DataFrame, will accept the name of a column when axis = 0. Unless weights are a Series, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.

Step 5: Pandas sample rows by group

At this step we are going to group the rows by column and then apply a lambda in order to call sample with 3 rows per group:

df.groupby('color').apply(lambda x: x.sample(n=3))

If you need to reset the index you can do it by: reset_index(drop=True)

df.groupby('color').apply(lambda x: x.sample(n=3)).reset_index(drop = True)

The result is visible on this picture:

python_online_editor_2020

Bonus: Pandas DataFrame combine head() and tail()

Sometimes sampling start with the getting the first and the last rows of a DataFrame. For this purpose df.head(2) and df.tail(2) can be used. If you want to combine them in a single line you can use append and get the result as:

# combine head and tail variant 1
rows = 2
df.head(rows).append(df.tail(rows))

result:

0 Color ..
1 Color ..
5041 Color .. 
5042 Color ..
Share Tweet Send
0 Comments
Loading...