text/CSV file to dataframe with Python and pandas

In this post you can find information about several topics related to files - text and CSV and pandas dataframes. The post is appropriate for complete beginners and include full code examples and results. The covered topics are:

DataFrame is a two-dimensional labeled data structure in commonly Python and Pandas. Many people refer it to dictionary(of series), excel spreadsheet or SQL table. Often is needed to convert text or CSV files to dataframes and the reverse.

Convert text file to dataframe

Converting simple text file without formatting to dataframe can be done by(which one to chose depends on your data):

  • pandas.read_fwf - Read a table of fixed-width formatted lines into DataFrame

    • pandas.read_fwf(filepath_or_buffer, colspecs='infer', widths=None, **kwds)
  • pandas.read_csv - Read CSV (comma-separated) file into DataFrame. Full list with parameters can be found on the link or at the bottom of the post.

    • pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None,..)

Let's assume that we have text file with content like:

1 Python 35
2 Java 28
3 Javascript 15

Next code examples shows how to convert this text file to pandas dataframe.

Code example for pandas.read_fwf:

import pandas as pd
df = pd.read_fwf('myfile.txt')

Code example for pandas.read_csv:

import pandas as pd
df = pd.read_csv('myfile.txt', sep=" ")

or

import pandas as pd
df = pd.read_csv('myfile.txt', ,delimiter="\t")

or

import pandas as pd
df = pd.read_csv('myfile.txt', sep=" ", header=None, names=["Pos", "Lang", "Perc"])

the resulted dataframe is:

   Pos        Lang  Perc
0    1      Python    35
1    2        Java    28
2    3  Javascript    15

Convert CSV file to dataframe

The same logic can be applied to convert CSV file to dataframe. The example below shows converting file with data:

1, Python, 35
2, Java, 28
3, Javascript, 15

This can be read and converted to dataframe with:

import pandas as pd
df = pd.read_csv('myfile.txt', sep=",", header=None, names=["Pos", "Lang", "Perc"])

result:

   Pos         Lang  Perc
0    1       Python    35
1    2         Java    28
2    3   Javascript    15

Convert dataframe to CSV file

The reverse operation is done again with method of panda:

  • pandas.DataFrame.to_csv - Write DataFrame to a comma-separated values (csv) file
    • DataFrame.to_csv(path_or_buf=None, sep=', ', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')

Example:

df.to_csv('/home/user/myfile.txt', sep=',', encoding='utf-8', header=None)

this would result in:

,Pos,Lang,Perc
0,1, Python,35
1,2, Java,28
2,3, Javascript,15

If you don't want the headers and the indexes you can run:

df.to_csv('/home/user/myfile.txt', sep=',', encoding='utf-8', header=None, index=False)

result:

1, Python,35
2, Java,28
3, Javascript,15

Note: You can use None or False.

Error 'python' engine because the 'c' engine does not support regex separators

You may get error like - ParserWarning:

    ParserWarning: Falling back to the 'python' engine 
    because the 'c' engine does not support regex separators 
    (separators > 1 char and different from '\s+' are interpreted as regex); 
    you can avoid this warning by specifying engine='python'.
      df = pd.read_csv('/home/user/myfile.txt', sep=", ", header=None, names=["Pos", "Lang", "Perc"])

this means that you are using more than one separator for method: pd.read_csv. In order to solve it leave only one of the separators.

pandas.read_csv signature

You can see all the parameters which can be used for method: pandas.read_csv and their default values:

pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None,
                squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None,
                true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None,
                keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False,
                infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False,
                chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"',
                quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None,
                error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True,
                delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True,
                buffer_lines=None, memory_map=False, float_precision=None)