Python and Pandas are very useful when you need to generate test, random, or fake data. For example, let's say that three dataframes are needed:
- 5 columns with 100 rows of integer numbers
- 5 columns with 100 rows of random characters
- 3 columns with 10 rows of random decimals
Generate Dataframe with random numbers 5 columns 100 rows
The most common need for me is to generate a DataFrame of random integers from 0 to 99. This can be achieved with the numpy randint function, whose upper bound (100 here) is exclusive:
np.random.randint(0,100,size=(100, 5))
This will be the code:
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 5)), columns=list('ABCDF'))
df2.head()
the result of which is:
| | A | B | C | D | F |
---|---|---|---|---|---|
0 | 19 | 71 | 99 | 21 | 5 |
1 | 85 | 89 | 38 | 40 | 83 |
2 | 95 | 29 | 1 | 11 | 22 |
3 | 39 | 26 | 43 | 43 | 93 |
4 | 6 | 1 | 33 | 14 | 54 |
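When the same "random" values are needed on every run (for example in a unit test), a seeded generator can help. The following is a minimal sketch using NumPy's Generator API rather than the randint call above; the seed 42 and the name `df_int` are arbitrary choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed yields the same numbers on every run.
rng = np.random.default_rng(42)  # 42 is an arbitrary seed chosen for this sketch

# integers() excludes the upper bound by default, so values fall in 0..99.
df_int = pd.DataFrame(rng.integers(0, 100, size=(100, 5)), columns=list('ABCDF'))
print(df_int.shape)
```

Dropping the seed argument gives fresh values on each run, which is closer to the behaviour of the original example.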
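Generate Dataframe with random characters 5 columns 100 rows
Another useful example is generating a dataframe of random characters. This can be achieved with the rands helper, which returns a random string of the given length (note that pd.util.testing has been deprecated since pandas 1.0, so this call may not work in recent releases):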
pd.util.testing.rands(3)
result of which is:
'E0z'
To split the randomly generated string into single characters, we use the built-in list function.
The first part of the code is:
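# Build 5 lists of 100 random characters each (one list per column).
rand_chars = []
for i in range(0, 5):
    rand_chars.append(list(pd.util.testing.rands(100)))
# Transpose so that we get 100 rows of 5 characters.
rand_chars = list(map(list, zip(*rand_chars)))
rand_chars[0:5]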
the result of which is:
[['4', '8', 'v', 'g', 'c'],
['d', '6', 'n', 'b', 'H'],
['D', 'g', 'I', 's', 'O'],
['0', 'h', 'm', 'z', 's'],
['T', 'n', 'c', 'U', 'S']]
Note that the list of lists is transposed (columns become rows) with:
rand_chars = list(map(list, zip(*rand_chars)))
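To see what this transpose idiom does, here is a small self-contained sketch on a 2 x 3 list; the names `rows` and `transposed` are only used for this illustration.

```python
rows = [['a', 'b', 'c'],
        ['d', 'e', 'f']]

# zip(*rows) groups the i-th element of every inner list into a tuple;
# map(list, ...) turns those tuples back into lists.
transposed = list(map(list, zip(*rows)))
print(transposed)  # [['a', 'd'], ['b', 'e'], ['c', 'f']]
```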
Finally, we create the DataFrame:
df2 = pd.DataFrame(rand_chars)
df2.head()
result:
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | H | L | x | s | 3 |
1 | S | Y | l | p | n |
2 | q | d | F | 9 | 6 |
3 | O | k | w | C | L |
4 | D | E | U | C | n |
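Since pd.util.testing may be unavailable in newer pandas versions, a similar dataframe can be built with NumPy and the standard library alone. This is a sketch of an alternative, not the method used above; it assumes the desired alphabet is ASCII letters plus digits, and the names `alphabet` and `df_chars` are made up for this example.

```python
import string

import numpy as np
import pandas as pd

# Assumed alphabet for this sketch: ASCII letters plus digits.
alphabet = list(string.ascii_letters + string.digits)

rng = np.random.default_rng()
# Draw a 100 x 5 grid of single characters directly, so no transpose is needed.
df_chars = pd.DataFrame(rng.choice(alphabet, size=(100, 5)))
print(df_chars.head())
```

Because the characters are drawn straight into the target shape, the loop and the transpose step are not required here.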
Generate Dataframe with random decimal numbers 3 columns 10 rows
The last example is generating dataframe with random floating point numbers.
In this example we are going to use:
np.random.rand(10, 3)
which gives:
array([[0.34322362, 0.58491385, 0.0421841 ],
       [0.72594607, 0.99322651, 0.72207976],
       [0.86410573, 0.92330185, 0.84427074],
       ...])
and this is the full code:
pd.DataFrame(np.random.rand(10, 3), columns=list('XYZ'))
result:
| | X | Y | Z |
---|---|---|---|
0 | 0.769363 | 0.122776 | 0.880724 |
1 | 0.114435 | 0.658999 | 0.193133 |
2 | 0.547094 | 0.037303 | 0.058781 |
3 | 0.335808 | 0.359005 | 0.047081 |
4 | 0.787799 | 0.834477 | 0.594807 |
5 | 0.926310 | 0.653232 | 0.592580 |
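np.random.rand always draws from the interval [0, 1). If decimals in another range are needed, one option is to draw from a uniform distribution with explicit bounds. The sketch below assumes hypothetical bounds of 10 and 20 and an arbitrary seed of 0; the names `low`, `high` and `df_float` are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatable output

low, high = 10.0, 20.0  # hypothetical bounds chosen for this sketch
values = rng.uniform(low, high, size=(10, 3))

# Round to two decimals to keep the fake data easy to read.
df_float = pd.DataFrame(values, columns=list('XYZ')).round(2)
print(df_float.head())
```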