Python read, validate and import CSV/JSON file to MySQL
In this post how to read, parse and load CSV/JSON file to MySQL table:
In summary:
- Python with Pandas and MySQL - read CSV file(pandas), connect to MySQL(sqlalchemy), create a table with CSV data
- Read, parse and load JSON file into MySQL table- Read and parse JSON, validate data, connect and insert to MySQL(PyMySQL)
You can be interested in:
Read CSV file with Pandas and MySQL
The easiest and simplest way to read CSV file in Python and to import its date into MySQL table is by using pandas. Several useful method will automate the important steps while giving you freedom for customization:
This is the example:
import pandas as pd
from sqlalchemy import create_engine
# read CSV file
column_names = ['person','year', 'company']
df = pd.read_csv('/home/user/data/test.csv', header = None, names = column_names)
print(df)
df = pd.read_csv('/home/user/data/test.csv', header = 0)
print(df)
engine = create_engine('mysql://root:@localhost/test')
with engine.connect() as conn, conn.begin():
df.to_sql('csv', conn, if_exists='append', index=False)
Code explanation:
Open CSV file with pandas
First step is to read the CSV file data. This is done by:
df = pd.read_csv('/home/user/data/test.csv', header = None, names = column_names)
print(df)
or
df = pd.read_csv('/home/user/data/test.csv', header = 0)
print(df)
the difference is about headers - in first code the csv files is without headers and we provide column names. The second example the file has headers and we show that line - 0 - should be used as header.
Note also that method pd.read_csv will add index or line number to your rows and the result would be:
person year company
0 Person Year Company
1 John 2018 Google
Connect to MySQL DB with sqlalchemy
next step is to connect to MySQL DB ( or any other - you will need to change your module)
in this code:
engine = create_engine('mysql://root:@localhost/test')
you have:
'mysql://username:password@localhost/dbname'
where
username - is the username which will connect to DB
password - the password of the user. In case of no password than you can leave it empty. So for root without password you can use: root:
localhost - the MySQL server address
dbname - the database that will be used for your connection.
create new MySQL table from the CSV data
Final step is creating new table from the CSV data:
with engine.connect() as conn, conn.begin():
df.to_sql('csv', conn, if_exists='append', index=False)
for this method: df.to_sql
df.to_sql('csv', conn, if_exists='append', index=False)
- csv - is the table name which will be created in database test.
- conn - we are creating new connection with the parameters from the previous step
- if_exists - this parameter has two options
- append - if you want to append data to existing table
- replace - recreate the table and insert the data into newly created table
- index - it can be two values
- False - insert the CSV data as it is
- True - this will add column 'index' in the field list of your CSV. So instead of having:
INSERT INTO csv (`Person`, `Year`, `Company`) VALUES (%s, %s, %s)'] [parameters: (('John', 2018, 'Google')
you will have:
INSERT INTO csv (`index`, `Person`, `Year`, `Company`) VALUES (%s, %s, %s, %s)'] [parameters: ((0, 'John', 2018, 'Google')
Import JSON file into MySQL
If you want to read json file and parse it's data you have several ways to do it with Python. In this tutorial we will see how to do it by using modules: pymysql, os, json. We will validate JSON data before loading it to MySQL.
The JSON file:
[
{
"person": "John",
"year": 2018,
"company": "Google"
},
{
"person": "Sam",
"year": 2017,
"company": "IBM"
},
{
"person": "Jeff",
"year": 2014,
"company": "Yahoo"
}
]
The example:
import pymysql, os, json
# read JSON file which is in the next parent folder
file = os.path.abspath('../../..') + "/test.json"
json_data=open(file).read()
json_obj = json.loads(json_data)
# do validation and checks before insert
def validate_string(val):
if val != None:
if type(val) is int:
#for x in val:
# print(x)
return str(val).encode('utf-8')
else:
return val
# connect to MySQL
con = pymysql.connect(host = 'localhost',user = 'root',passwd = '',db = 'test')
cursor = con.cursor()
# parse json data to SQL insert
for i, item in enumerate(json_obj):
person = validate_string(item.get("person", None))
year = validate_string(item.get("year", None))
company = validate_string(item.get("company", None))
cursor.execute("INSERT INTO testp (person, year, company) VALUES (%s, %s, %s)", (person, year, company))
con.commit()
con.close()
Explanation:
Read and parse JSON with JSON
- the first step is to read and parse the JSON file into python dict. You can open the file by absolute or relative path:
file = '/home/user/data/test.json'
- parse JSON data. Pasring JSON data and iterating over the values is easy with:
for i, item in enumerate(json_obj):
person = validate_string(item.get("person", None))
The enumerate is used in order to get index - i - just in case if you want to insert a unique key to your data. At this step is invoked method validate_string to verify JSON data. You are free to add or change validation to match your needs. Useful data integrity checks are:
* check for duplicated data
* check for incomplete data - partially required fields (fields depending on other fields)
* check for required fields
* check for wrong types
* check for bad formats - incomplete date
* method validate_string - validates if the values are type string. You can write check that meet your requirments. It's better to be used is_instance rather than direct comparisson of types.
Connect and insert to MySQL with pymysql
-
then we are connecting to the DB. The parameters are similar to the CSV example.
-
last step is to prepare insert method and write the data into MySQL
cursor.execute("INSERT INTO test (person, year, company) VALUES (%s, %s, %s)", (person, year, company))
For JSON you can also use pandas in order to load JSON data into DB. Both ways has their advantages and disadvantages. What you are going to use depends on you.
This is a simple example using JSON and pandas.io:
from pandas.io.json import json_normalize
data = [{
"Person": "John",
"Year": 2018,
"Company": "Google"
}, {
"Person": "Sam",
"Year": 2017,
"Company": "IBM"
}, {
"Person": "Jeff",
"Year": 2014,
"Company": "Yahoo"
}]
json_normalize(data)
CSV file example:
This is the example CSV file used in this example.
Person,Year,Company
John,2018,Google
Sam,2017,IBM
Jeff,2014,Yahoo
Allen,2015,Facebook