Python Read File Line by Line Evaluate

A lot of data is available in tabular form as CSV files, a very popular format. You can convert them to a pandas DataFrame using the read_csv function, which loads a CSV file as a DataFrame.

In this article, you will learn the different features of the pandas read_csv function beyond simply loading a CSV file, and the parameters that can be customized to get better output from read_csv.

pandas.read_csv

  • Syntax: pandas.read_csv(filepath_or_buffer, sep, header, names, index_col, usecols, prefix, dtype, converters, skiprows, skipfooter, nrows, na_values, parse_dates)
  • Purpose: Read a comma-separated values (CSV) file into a DataFrame. Also supports optionally iterating over the file or breaking it into chunks.
  • Parameters:
    • filepath_or_buffer : str, path object or file-like object Any valid string path is acceptable. The string could be a URL too. Path object refers to os.PathLike. File-like objects are objects with a read() method, such as a file handle (e.g. via the built-in open function) or StringIO.
    • sep : str, (Default ',') Delimiter that separates any two consecutive data items.
    • header : int, list of int, (Default 'infer') Row number(s) to use as the column names, and the start of the data. The default behavior is to infer the column names: if no names are passed, the behavior is identical to header=0 and column names are inferred from the first line of the file.
    • names : array-like List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
    • index_col : int, str, sequence of int/str, or False, (Default None) Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int/str is given, a MultiIndex is used.
    • usecols : list-like or callable Return a subset of the columns. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.
    • prefix : str Prefix to add to column numbers when there is no header, e.g. 'X' for X0, X1. (Deprecated in pandas 1.4 and removed in pandas 2.0.)
    • dtype : Type name or dict of column -> type Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'} Use str or object together with suitable na_values settings to preserve and not interpret dtype.
    • converters : dict Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
    • skiprows : list-like, int or callable Line numbers to skip (0-indexed) or the number of lines to skip (int) at the beginning of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise.
    • skipfooter : int Number of lines at the bottom of the file to skip.
    • nrows : int Number of rows of the file to read. Useful for reading pieces of large files.
    • na_values : scalar, str, list-like, or dict Additional strings to recognize as NA/NaN. If a dict is passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
    • parse_dates : bool or list of int or names or list of lists or dict, (default False) If set to True, pandas will try to parse the index; otherwise it parses the columns passed.
  • Returns: DataFrame or TextParser. A comma-separated values (CSV) file is returned as a two-dimensional data structure with labeled axes.

For the full list of parameters, refer to the official documentation.
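As a small illustration of the parameter list above, the sketch below passes a file-like object instead of a path and combines a few of the listed parameters. The inline data and its column names are made up for the example and are not taken from the article's files.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
csv_text = "Rank;Country;Population\n1;India;1380\n2;China;1400\n3;USA;331\n"

# Any object with a read() method can be passed instead of a path.
buffer = io.StringIO(csv_text)

df = pd.read_csv(
    buffer,
    sep=";",                      # the data uses ';' instead of ','
    index_col="Rank",             # use the Rank column as row labels
    usecols=["Rank", "Country"],  # load only these two columns
    nrows=2,                      # read only the first two data rows
)
print(df)
```

Using StringIO like this is also a convenient way to try out read_csv options without creating files.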

Reading CSV file

The pandas read_csv function can be used in different ways as needed, such as using custom separators or reading only selected columns/rows. All cases are covered below one after another.

Default Separator

To read a CSV file, call the pandas function read_csv() and pass the file path as input.

Step 1: Import pandas

    import pandas as pd

Step 2: Read the CSV

    # Read the csv file
    df = pd.read_csv("data1.csv")

    # First 5 rows
    df.head()

Different, Custom Separators

By default, a CSV is separated by commas, but you can use other separators as well. The pandas.read_csv function is not limited to reading CSV files with the default separator (i.e. comma). It can be used for other separators such as ;, | or :. To load CSV files with such separators, use the sep parameter to pass the separator used in the CSV file.

Let's load a file with the | separator

    # Read the csv file with sep='|'
    df = pd.read_csv("data2.csv", sep='|')
    df

Set any row as column header

Let's see the data frame created using the pandas read_csv function without any header parameter:

    # Read the csv file
    df = pd.read_csv("data1.csv")
    df.head()

Row 0 seems to be a better fit for the header, since it explains the figures in the table better. You can make this row the header while reading the CSV by using the header parameter, which takes a row number as its value.

Note: For the header parameter, row numbering starts from 0 and counts the file's existing header line, so row 0 of the data frame above is line 1 of the file.

    # Read the csv file with header parameter
    df = pd.read_csv("data1.csv", header=1)
    df.head()
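To make the row numbering concrete, here is a minimal sketch with made-up inline data (the column names c0 and c1 are hypothetical). It reads the same text twice, once with the default header and once with header=1, so the difference is easy to see.

```python
import io
import pandas as pd

# Hypothetical file: the first line is a generic header and the
# second line holds the more descriptive column names.
csv_text = "c0,c1\nRank,Country\n1,India\n2,China\n"

# Default: file line 0 becomes the header, everything else is data.
default_df = pd.read_csv(io.StringIO(csv_text))

# header=1: file line 1 becomes the header, and line 0 is discarded.
promoted_df = pd.read_csv(io.StringIO(csv_text), header=1)

print(list(default_df.columns), len(default_df))
print(list(promoted_df.columns), len(promoted_df))
```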

Renaming column headers

While reading the CSV file, you can rename the column headers by using the names parameter. The names parameter takes a list of column header names.

    # Read the csv file with names parameter
    df = pd.read_csv("data.csv", names=['Ranking', 'ST Name', 'Pop', 'NS', 'D'])
    df.head()

To avoid the old header being inferred as a data row, also pass the header parameter; the new names then override the old header names.

    # Read the csv file with header and names parameter
    df = pd.read_csv("data.csv", header=0, names=['Ranking', 'ST Name', 'Pop', 'NS', 'D'])
    df.head()

Loading CSV without column headers in pandas

There is a chance that the CSV file you load doesn't have any column header. By default, pandas will treat the first row as the column header.

    # Read the csv file
    df = pd.read_csv("data3.csv")
    df.head()

To avoid any row being inferred as the column header, you can specify header as None. It will force pandas to create numbered columns starting from 0.

    # Read the csv file with header=None
    df = pd.read_csv("data3.csv", header=None)
    df.head()

Adding Prefixes to numbered columns

You can also give prefixes to the numbered column headers using the prefix parameter of the pandas read_csv function. Note that prefix was deprecated in pandas 1.4 and removed in pandas 2.0, so the call below only works on older versions.

    # Read the csv file with header=None and prefix=column_
    df = pd.read_csv("data3.csv", header=None, prefix='column_')
    df.head()
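Because the prefix argument is no longer available in pandas 2.0 and later, one possible alternative is to read with header=None and rename the numbered columns afterwards. The sketch below uses DataFrame.add_prefix for that; the inline data is made up, and this is one option rather than the only way to do it.

```python
import io
import pandas as pd

# Hypothetical headerless data.
csv_text = "1,India,1380\n2,China,1400\n"

# header=None makes pandas number the columns 0, 1, 2, ...
df = pd.read_csv(io.StringIO(csv_text), header=None)

# add_prefix prepends the given string to every column label.
df = df.add_prefix("column_")
print(list(df.columns))
```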

Set any column(s) as Index

By default, pandas adds an initial index to the data frame loaded from the CSV file. You can control this behavior and make any column of your CSV the index by using the index_col parameter.

It takes the name of the column that should become the index.

Case 1: Making one column as index

    # Read the csv file with 'Rank' as index
    df = pd.read_csv("data.csv", index_col='Rank')
    df.head()

Case 2: Making multiple columns as index

To make two or more columns the index, pass them as a list.

    # Read the csv file with 'Rank' and 'Date' as index
    df = pd.read_csv("data.csv", index_col=['Rank', 'Date'])
    df.head()
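As a runnable illustration of the multi-column case, the sketch below uses made-up inline data with the same Rank and Date column names. Passing a list to index_col produces a MultiIndex, so rows are looked up with a tuple of labels.

```python
import io
import pandas as pd

# Hypothetical data using the column names from the example above.
csv_text = "Rank,Date,Country\n1,2021-02-25,India\n2,2021-04-14,China\n"

df = pd.read_csv(io.StringIO(csv_text), index_col=["Rank", "Date"])

# The index now has two levels, one per column passed to index_col.
print(df.index.names)

# Rows are addressed with a (Rank, Date) tuple; Date is still a plain string here.
print(df.loc[(1, "2021-02-25"), "Country"])
```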

Selecting columns while reading CSV

In practice, not all the columns of a CSV file are important. You can select only the necessary columns after loading the file, but if you know them beforehand, you can save space and time by loading only those.

The usecols parameter takes the list of columns you want to load in your data frame.

Selecting columns using list

    # Read the csv file with 'Rank', 'Date' and 'Population' columns (list)
    df = pd.read_csv("data.csv", usecols=['Rank', 'Date', 'Population'])
    df.head()

Selecting columns using callable functions

The usecols parameter can also take a callable function. The callable is evaluated on each column name, and the columns for which it returns True are selected.

    # Read the csv file with columns where length of column name > 10
    df = pd.read_csv("data.csv", usecols=lambda x: len(x) > 10)
    df.head()
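The callable form can be tried out on a small inline sample. In this sketch the data is made up and the length threshold of 8 is an arbitrary choice for the example; the point is only that the function receives each column name and returns True for the ones to keep.

```python
import io
import pandas as pd

# Hypothetical data with short and long column names.
csv_text = (
    "Rank,Country,Population,National Share (%)\n"
    "1,India,1380,17.5\n"
    "2,China,1400,18.0\n"
)

# The callable is called once per column name; True means "load this column".
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: len(name) > 8)
print(list(df.columns))
```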

Selecting/skipping rows while reading CSV

You can skip or select a specific number of rows from the dataset using the pandas.read_csv function. There are 3 parameters that can do this task: nrows, skiprows and skipfooter.

All of them have different functions. Let's discuss each of them separately.

A. nrows : This parameter allows you to control how many rows you want to load from the CSV file. It takes an integer specifying the row count.

    # Read the csv file with 5 rows
    df = pd.read_csv("data.csv", nrows=5)
    df

B. skiprows : This parameter allows you to skip rows from the beginning of the file.

Skiprows by specifying row indices

    # Read the csv file with first row skipped
    df = pd.read_csv("data.csv", skiprows=1)
    df.head()

Skiprows by using callback function

The skiprows parameter can also take a callable function as input, which is evaluated on row indices. This means the callable function is checked against every row index to decide whether that row should be skipped or not.

    # Read the csv file with odd rows skipped
    df = pd.read_csv("data.csv", skiprows=lambda x: x % 2 != 0)
    df.head()
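One detail worth seeing in action is that the row indices passed to the callable are file line numbers, so the header line counts as index 0. The sketch below uses made-up inline data to show which rows survive when odd line numbers are skipped.

```python
import io
import pandas as pd

# Hypothetical data: line 0 is the header, data rows sit on lines 1 to 4.
csv_text = "Rank,Country\n1,India\n2,China\n3,USA\n4,Indonesia\n"

# Skipping odd line numbers keeps the header (line 0) and the data on lines 2 and 4.
df = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % 2 != 0)
print(df)
```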

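C. skipfooter : This parameter allows you to skip rows from the end of the file.

    # Read the csv file with one row skipped from the end
    # (skipfooter is not supported by the default C parser, so request the Python engine)
    df = pd.read_csv("data.csv", skipfooter=1, engine='python')
    df.tail()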

Changing the data type of columns

You can specify the data types of columns while reading the CSV file. The dtype parameter takes a dictionary mapping columns to their data types. To assign the data types, you can import them from the numpy package and specify them for the relevant columns.

Data Type of Rank before change

    # Read the csv file
    df = pd.read_csv("data.csv")

    # Display datatype of Rank
    df.Rank.dtypes

    dtype('int64')

Data Type of Rank after change

    # import numpy
    import numpy as np

    # Read the csv file with data type specified for Rank.
    df = pd.read_csv("data.csv", dtype={'Rank': np.int8})

    # Display datatype of Rank
    df.Rank.dtypes

    dtype('int8')
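Besides shrinking integer columns, dtype is also useful when a column should not be interpreted as a number at all. The sketch below uses made-up data with a hypothetical Code column whose leading zeros matter, and compares the inferred types with explicitly requested ones.

```python
import io
import pandas as pd

# Hypothetical data: a code column with leading zeros and a small integer column.
csv_text = "Code,Rank\n00123,1\n04567,2\n"

# Without dtype, pandas infers integers and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, Code stays as text and Rank is stored as an 8-bit integer.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"Code": str, "Rank": "int8"})

print(inferred["Code"].tolist())
print(typed["Code"].tolist())
print(typed["Rank"].dtype)
```

Type names can be given as strings such as "int8", so importing numpy is optional for simple cases like this one.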

Parse Dates while reading CSV

Date time values are crucial for data analysis. You can convert a column to a datetime type column while reading the CSV in two ways:

Method 1. Make the desired column the index and pass parse_dates=True

    # Read the csv file with 'Date' as index and parse_dates=True
    df = pd.read_csv("data.csv", index_col='Date', parse_dates=True, nrows=5)

    # Display index
    df.index

    DatetimeIndex(['2021-02-25', '2021-04-14', '2021-02-19', '2021-02-24',
                   '2021-02-13'],
                  dtype='datetime64[ns]', name='Date', freq=None)

Method 2. Pass the desired column in parse_dates as a list

    # Read the csv file with parse_dates=['Date']
    df = pd.read_csv("data.csv", parse_dates=['Date'], nrows=5)

    # Display datatypes of columns
    df.dtypes

    Rank                           int64
    Country                       object
    Population                    object
    National Share (%)            object
    Date                  datetime64[ns]
    dtype: object
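The effect of parse_dates can be checked on a small inline sample. This is a minimal sketch with made-up data; it reads the same text with and without parse_dates and inspects whether the Date column ends up with a datetime type.

```python
import io
import pandas as pd

# Hypothetical data with ISO-formatted dates.
csv_text = "Rank,Date\n1,2021-02-25\n2,2021-04-14\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Without parse_dates the column holds text; with it, datetime values.
print(pd.api.types.is_datetime64_any_dtype(raw["Date"]))
print(pd.api.types.is_datetime64_any_dtype(parsed["Date"]))

# Datetime columns expose the .dt accessor, e.g. to extract the month.
print(parsed["Date"].dt.month.tolist())
```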

Adding more NaN values

The pandas library recognizes many representations of missing values. But there are many cases where the data contains missing values in forms that are not present in the pandas NA values list. It doesn't understand 'missing', 'not found', or 'not available' as missing values.

So, you need to mark them as missing. To do this, use the na_values parameter, which takes a list of such values.

Loading CSV without specifying na_values

    # Read the csv file
    df = pd.read_csv("data.csv", nrows=5)
    df

Loading CSV with specifying na_values

    # Read the csv file with 'missing' as na_values
    df = pd.read_csv("data.csv", na_values=['missing'], nrows=5)
    df
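na_values also accepts a dictionary when different columns use different placeholders for missing data. The following sketch uses made-up inline data and placeholder strings to show the per-column form.

```python
import io
import pandas as pd

# Hypothetical data where each column marks missing values differently.
csv_text = "Rank,Country,Population\n1,India,missing\n2,not available,1400\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    # Keys are column names, values are the extra strings to treat as NaN there.
    na_values={"Population": ["missing"], "Country": ["not available"]},
)

# Count the missing values detected in each column.
print(df.isna().sum())
```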

Convert values of the column while reading CSV

You can transform, modify, or convert the values of the columns of the CSV file while loading the CSV itself. This can be done by using the converters parameter. converters takes a dictionary whose keys are the column names and whose values are the functions to be applied to them.

Let's convert the comma separated values (i.e. 19,98,12,341) of the Population column in the dataset to an integer value (199812341) while reading the CSV.

    # Function which converts comma separated value to integer
    toInt = lambda x: int(x.replace(',', '')) if x != 'missing' else -1

    # Read the csv file
    df = pd.read_csv("data.csv", converters={'Population': toInt})
    df.head()
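For a version that can be run without the article's data file, the sketch below applies the same idea to made-up inline data. The helper name to_int and the sample values are assumptions for illustration; the population figure is quoted in the CSV so its commas are not read as field separators.

```python
import io
import pandas as pd

# Hypothetical data: the number is quoted because it contains commas.
csv_text = 'Country,Population\nIndia,"1,38,00,04,385"\nChina,missing\n'

def to_int(value: str) -> int:
    """Remove digit-group commas; map the placeholder 'missing' to -1."""
    return int(value.replace(",", "")) if value != "missing" else -1

# Converters receive each cell as a string before any type inference.
df = pd.read_csv(io.StringIO(csv_text), converters={"Population": to_int})
print(df["Population"].tolist())
```

Using -1 as a sentinel follows the article's example; depending on the analysis, a real missing value may be a better choice.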

Practical Tips

  • Before loading the CSV file into a pandas data frame, always skim through the file. It will help you judge which columns you should import and determine what data types your columns should have.
  • You should also watch the total row count of the dataset. A system with 4 GB RAM may not be able to load 7-8M rows.
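When a file is too large to load at once, one option is the chunksize parameter of read_csv, which makes it return an iterator of smaller DataFrames. The sketch below is a minimal illustration on made-up inline data; the chunk size of 4 is arbitrary and would be much larger for a real file.

```python
import io
import pandas as pd

# Hypothetical data: ten rows generated inline instead of a large file.
csv_text = "Rank,Population\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 11))

rows_seen = 0
total_population = 0

# With chunksize set, read_csv yields DataFrames of at most 4 rows each,
# so only one chunk needs to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    rows_seen += len(chunk)
    total_population += chunk["Population"].sum()

print(rows_seen, total_population)
```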

Test your knowledge

Q1: You cannot load files with the $ separator using the pandas read_csv function. True or False?

Answer: False, because you can use the sep parameter in the read_csv function.

Q2: What is the use of the converters parameter in the read_csv function?


Answer: converters parameter is used to alter the values of the columns while loading the CSV.

Q3: How will you make pandas recognize that a particular column is of datetime type?

Answer: By using the parse_dates parameter.

Q4: A dataset contains the missing values 'no', 'not available', and '-100'. How will you specify them as missing values for pandas to correctly interpret them? (Assume CSV file name: example1.csv)

Answer: By using the na_values parameter.

    import pandas as pd

    df = pd.read_csv("example1.csv", na_values=['no', 'not available', '-100'])

Q5: How would you read a CSV file where,

  1. The heading of the columns is in the third row (numbered from 1).
  2. The last 5 lines of the file have garbage text and should be avoided.
  3. Only the columns whose names start with a vowel should be included. Assume each name is one word only.

(CSV file name: example2.csv)

Answer:

    import pandas as pd

    # True for column names whose first letter is a vowel
    colnameWithVowels = lambda x: x.lower()[0] in ['a', 'e', 'i', 'o', 'u']

    # header=2 is the third line (0-indexed); skipfooter needs the Python parser engine
    df = pd.read_csv("example2.csv", usecols=colnameWithVowels, header=2, skipfooter=5, engine='python')

The article was contributed by Kaustubh G and Shrivarsheni


Source: https://www.machinelearningplus.com/pandas/pandas-read_csv-completed/
