Open and read Excel file with Python using Pandas (With Examples)

The Pandas Python library is a fantastic tool to add to your Python toolbelt. Whether you need to work with CSV, XLS, XLSX files, or you have other data in another multi-dimensional format. Pandas is absolutely a must-have. I highly recommend you take a look at the Pandas documentation at https://pandas.pydata.org/pandas-docs/stable/. The files for this article can be found here: https://github.com/CoffieldWeb/Python-Pandas-Quick-Start

Today I'll introduce you to the library by opening an Excel file and doing some filtering.

First, let's install Pandas and XLRD. 


pip install pandas xlrd

Let's create a file called solution.py. In that file, we'll import pandas and alias it as pd. Then we'll import the xlrd library that helps us read the Excel files. More on that https://github.com/python-excel/xlrd.


import pandas as pd
import xlrd

Now we'll open an Excel file.  This file has one sheet in it with historical stock prices. You can find the file in the Github link above.


#Open stocks.xlsx and create a DataFrame with the results
df = pd.read_excel("stocks.xlsx")

Let's take a look at the type of object returned.


#Notice the type for df is a DataFrame (pandas.core.frame.DataFrame)
type(df)

There are so many powerful ways to filter and apply transformations to this data, but we'll just get started with a few basic examples.


#Show the first 5 rows
df.head()

#	Date	Open	High	Low	Close	Volume	Adj Close
#0	2014-07-21	83.46	83.53	81.81	81.93	2359300	81.93
#1	2014-07-18	83.30	83.40	82.52	83.35	4020800	83.35
#2	2014-07-17	84.35	84.63	83.33	83.63	1974000	83.63
#3	2014-07-16	83.77	84.91	83.66	84.91	1755600	84.91
#4	2014-07-15	84.30	84.38	83.20	83.58	1874700	83.58

If I wanted to isolate just one column I can do that by simply referencing the column name.


#Let's get the Date column. 
my_series = df["Date"]

But notice that the returned data type is no longer a DataFrame.  We'll talk about the power of a Pandas series in another tutorial.  But you can look them up here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html


#Notice the type for my_series is a Series (pandas.core.series.Series)
type(my_series)

If we wanted to create a new Dataframe to hold specific columns we can do that by providing a list of column names.


#Create a new DataFrame with just the Date and Open columns.
limited_columns = df[['Date','Open']]
limited_columns.head()


#	Date	Open
#0	2014-07-21	83.46
#1	2014-07-18	83.30
#2	2014-07-17	84.35
#3	2014-07-16	83.77
#4	2014-07-15	84.30

Now to get to work by introducing Boolean selection. Basically, we're specifying a condition as selection criteria.


#lets get the days when Open was below 83.5
below_83_Open = df[df.Open < 83.5]
below_83_Open.head()

#	Date	Open	High	Low	Close	Volume	Adj Close
#0	2014-07-21	83.46	83.53	81.81	81.93	2359300	81.93
#1	2014-07-18	83.30	83.40	82.52	83.35	4020800	83.35
#41	2014-05-23	83.20	83.23	82.49	83.08	1596900	83.08
#42	2014-05-22	83.05	83.26	82.61	83.00	1382100	83.00
#43	2014-05-21	82.15	83.09	81.60	82.97	1762600	82.97

We can supply multiple conditions as well.


#lets get the days when Open was below 83.5 and the Close was above 85
filtered_rows = df[(df.Open < 83.5) & (df.Close > 85)]
filtered_rows.head()


#	Date	Open	High	Low	Close	Volume	Adj Close
#1669	2008-01-22	80.25	87.53	79.14	85.5	21774200	33.32


I hope this has whetted your appetite a bit. We'll be doing a bunch of tutorials that help you understand all the finer points of pandas and the numpy library.

Pingbacks are closed.

Comments are closed.