Pandas


Also check the 10 minutes to pandas introduction or pythonexamples.

import pandas as pd
Import the library; all examples on this page assume this has been done.

DataFrame

Object for tabular data (as obtained e.g. by read_html).

Each column is a Pandas#Series

df = pd.DataFrame([{'col1': 1, 'col2': 2}, {'col1': 3, 'col2': 4}])
Create a simple dataframe with a range index, 2 rows and 2 columns
df = pd.DataFrame([{'col1': 1, 'col2': 2}, {'col1': 3, 'col2': 4}], ['row1', 'row2'])
Create a dataframe with a list index. The size of the index must match the number of rows.

Single columns can be addressed by their name, as in df.columnname.unique() to get the unique values in a column.

Single or multiple columns can be addressed as a list, as in df.set_index(['column1','column2']) to set an index over 2 columns.
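A minimal sketch combining the above; the column names and values are made up for this example:

import pandas as pd

df = pd.DataFrame([{'col1': 1, 'col2': 2}, {'col1': 3, 'col2': 4}])
print(df.col1.unique())                # unique values of a single column
df2 = df.set_index(['col1', 'col2'])   # MultiIndex over 2 columns
print(df2.index)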

Information

df.info()
Information on dataframe (Index, size, datatypes per column)
df.size
Return the number of cells in a dataframe
df.shape
Return the number of rows and columns in a dataframe as tuple.
df.describe()
Return various statistics for the numeric columns in the dataframe. This may indicate if there are outliers.
df.head(x)
df.tail(x)
Return first/last x data rows of df (5 is the default value for x).
df.index
Return the table index (class = pandas.core.indexes.base.Index)
df.columns
Return the column headers (class = pandas.core.indexes.base.Index)
df.dtypes
Return the data-type of the columns
df.columnname.dtype
Return the data-type of column columnname
df.copy()
Make a real copy of df (plain assignment with = makes both names point to the same dataframe, like with dictionaries)
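A small sketch of these inspection methods on a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df.info()              # index, size and datatypes per column
print(df.shape)        # (3, 2): rows and columns as a tuple
print(df.size)         # 6: number of cells
print(df.describe())   # statistics for the numeric column 'a'
print(df.dtypes)       # datatype per column
df2 = df.copy()        # real copy; df2 = df would alias the same dataframe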

Modify

df.columns=[list,of,column,names]
Redefine the column headers (modifies the dataframe itself, nothing is returned)
df.set_index(['col1','col2'])
Return the dataframe with a new index composed of 2 columns
df.reset_index()
Return the dataframe with a default index (range)
df.sort_index(ascending=False)
Sort the dataframe on the index. True is the default if ascending is not supplied.
df.index = df.index.map(str)
Force a data-type on the index
df.drop(index)
df.drop([indexes])
df.drop(range(0,3))
Remove row(s) from a dataframe.
df.drop(columns=[listofcolumnstodrop])
Remove columns from a dataframe.
df.dropna(thresh=x)
Return only the rows that have at least x non-NA values. Without thresh, dropna() drops every row that contains any NA value.
df.loc[<index>,['field1','field2']] = [value1,value2]
Change values for index <index> in a dataframe
df.fillna(<value>)
df[[<column1>,<column2>]] = df[[<column1>,<column2>]].fillna(<value>)
Return the dataframe with all NA values (in the selected columns) replaced by <value>
df.agg(newname=('columnname', np.max))
Aggregate columnname with np.max under the new name newname. Named aggregation is only supported from version 0.25.
df.transform(<function>)
Apply a function to an existing column or the entire dataframe.
df.assign(newcolumn = <expression>)
Add column newcolumn to the dataframe
Expression can be a function like "lambda d: d['column'] * <something>"
df.apply(<function>)
Apply a function to the dataframe. By default the function is called once per column.
df.apply(lambda x: <function>(x) if x.name in [columnname] else x)
Apply a function to the specified columns only
df[columname] = df[columname].apply(lambda x: <function>(x))
Change a column in the dataframe by applying <function>
df.replace(<pattern>,<newvalue>,regex=True)
Replace <pattern> with <newvalue> in the entire dataframe
df[column].mask(df[column] == origvalue, newvalue, inplace=True)
Change all cells in column that have origvalue into newvalue (inplace, so in the dataframe itself).
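A combined sketch of some of these modify operations; the dataframe and values are invented:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x1', 'y2', None]})
print(df.dropna(thresh=2))                  # keep rows with at least 2 non-NA values
df['a'] = df['a'].fillna(0)                 # replace NA in one column
df['a'] = df['a'].apply(lambda x: x * 10)   # change a column with a function
df = df.replace('x', 'z', regex=True)       # regex replace in the entire dataframe
df['a'] = df['a'].mask(df['a'] == 30, 0)    # conditional overwrite; assignment avoids chained-indexing pitfalls
print(df)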


df.join(table2)
Like an SQL join, merging on the index
df.merge(table2,on=<column>)
Like an SQL join on <column>; the column must exist in both dataframes
dfall = df.merge(table2, on=<column>, how='left', indicator=True)
Return df joined with table2. Adds a column _merge with value 'left_only' for records that only exist in df.
df = dfall[dfall['_merge'] == 'left_only']
Return a dataframe with all records of df that are not in table2
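A minimal sketch of the merge-with-indicator pattern; table and column names are invented:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
table2 = pd.DataFrame({'id': [1, 2], 'score': [10, 20]})
dfall = df.merge(table2, on='id', how='left', indicator=True)
print(dfall)                                   # _merge is 'both' or 'left_only'
print(dfall[dfall['_merge'] == 'left_only'])   # records of df not in table2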

Select data

Use .loc, otherwise column names will be considered too (if I understand this[1] correctly).

WARNING!!
At least with .loc: if only 1 item is found it is returned as a scalar (string/float?), otherwise as a dataframe/series.
df.iloc[<slice>]
Return the rows indicated by <slice>. A single integer returns a pandas.Series, a slice returns a dataframe.
df.iloc[:,0:3]
df[['col0','col1','col2']]
Return 3 columns from the dataframe (the first form by position, the second by name)
df.filter(regex=<regex>,axis='columns')
Return all columns whose name matches <regex>. (axis=1)
df.loc[<index>]
Select the row(s) with index <index>
df.loc[<indexname>].<columnname>
df.loc[<indexname>][<columnname>]
Select only 1 column from a row as a pandas series. You can convert it to a list or set immediately, e.g. list(df.loc[<indexname>].<columnname>)
df.loc[0][0]
Select a cell in a dataframe that has no explicit index or header (default range index)
df.loc[lambda d: d[column1] == <value> ]
df.loc[df[column1] == <value> ]
df.loc[df[column1].notnull()]
Select all rows where column1 equals <value>, or (third form) where column1 is not null.
df.loc[(df[column1] == <value>) | (df[column1] == <value2>)]
Combine conditions with '|' (or) or '&' (and). Use () around each condition!!
df.loc[datetime.today()-pd.Timedelta(days=7) : datetime.today()]
Select all rows where the (datetime)-index is between 7 days ago and today.
NOTE!! The index must be sorted in ascending order[2].
df.last('7D')
Select the last 7 days in the dataframe. For this the datetime index must also be sorted in ascending order.
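A sketch of boolean and datetime selection; the data is made up:

import pandas as pd
from datetime import datetime

idx = pd.date_range(end=datetime.today(), periods=30, freq='D')
df = pd.DataFrame({'value': range(30)}, index=idx).sort_index()
print(df.loc[df['value'] > 25])                            # boolean selection
print(df.loc[(df['value'] > 25) | (df['value'] == 0)])     # combined conditions, () required
print(df.loc[datetime.today() - pd.Timedelta(days=7):datetime.today()])   # last 7 days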

See also the pandas series pages

df[df['fieldname'].str.contains('regexp',regex=True)]
Select all rows in df where fieldname matches regexp. The field must have a string datatype. If you are looking for a simple string you can leave out regex=True.
df[df['fieldname'].gt(value)]
Select all rows in df where fieldname is greater than value. The field must have a numeric datatype
df.filter(regex=<regex>,axis='index')
df.filter(regex=<regex>,axis='index').<columnname>
df.filter(regex=<regex>,axis='index').index
Return all rows whose index matches <regex> (axis=0), or only <columnname> of the matched rows, or only the matching indexes.
The .index variant returns the matching indexes as an Index (list-like).
df.sort_values(<columnname>)
df.sort_values([<column1>,<column2>],ascending=(True,False))
Return the dataframe sorted on the values in the column(s). The second form sorts on column1 first and then on column2; column1 ascending, column2 descending.
df.groupby([column1,column2])
Return a DataFrameGroupBy object, this is not a dataframe
df.collname.unique()
Return a Numpy array with all distinct values in column 'collname'
Get all rows from df1 that have an index that exists in df2:
df = df1[df1.index.isin(df2.index)]
Or get the intersection of the indexes and select with loc (use intersection(); df1.index & df2.index is deprecated in recent pandas versions):
df = df1.loc[np.intersect1d(df1.index, df2.index)]
df = df1.loc[df1.index.intersection(df2.index)]
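A quick sketch with two made-up dataframes:

import pandas as pd

df1 = pd.DataFrame({'v': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'w': [9, 8]}, index=['b', 'c'])
print(df1[df1.index.isin(df2.index)])               # rows 'b' and 'c'
print(df1.loc[df1.index.intersection(df2.index)])   # same selection via loc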

Use data

Print 2 columns for each row and check for missing values
for index, row in df.iterrows():
    print(row['ColumnA'], row['ColumnB'])
    if pd.isna(row['ColumnA']):   # pd.isna avoids needing a numpy import
        print('Found a NaN value')
Print all values of the first row
for column, row in df.items():   # iteritems() was removed in pandas 2.0
    print(column, row[0])
Loop over columns
for column in df:
    print(f"Columnname {column}")


df.iloc[0]
Return the values of the first row as a pandas series.
df.iloc[0][0]
Return the value of the first column in the first row. Columns can also be identified by name.
df[<column>].quantile(q=0.1)
Return the 10th percentile (10% of the values in <column> are lower than the 10th percentile)
df[<column>].str.extract(r"A regex(p\w+)")
Return a series (column) of the values selected by Regular Expressions (p\w+)
df[<columna>].map(str) + '_' + df[<columnb>].map(str)
Return a series (column) with the columns concatenated. If not mapped to str an error will be thrown.
df[<column>].map(str).apply(lambda x: x[-1])
Return a series (column) with the last character of the <column>
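A short sketch of these series operations on an invented dataframe:

import pandas as pd

df = pd.DataFrame({'code': ['A p1', 'B p23'], 'n': [10, 20]})
print(df['n'].quantile(q=0.1))                        # 10th percentile
print(df['code'].str.extract(r'(p\w+)'))              # regex capture group
print(df['code'].map(str) + '_' + df['n'].map(str))   # concatenate columns as strings
print(df['code'].map(str).apply(lambda x: x[-1]))     # last character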

Series

Pandas Series online documentation.
A pandas series is a 1-dimensional array with named keys (an index).
Pandas Series have all kinds of methods similar to Numpy, like mean, std, min, max, ... In fact pandas uses numpy to do this.

s = pd.Series([])
s = pd.Series([valuelist],[indexlist])
Initialize a series. If indexlist is omitted the keys are integers starting at 0.
s[<key>] = <value>
Assign <value> to the series element with key <key>
The order of the elements in the series is the order in which they were created, NOT the numeric order.
Elements can be addressed as s[<key>], s.<key> or s[<numkey>]. Where <numkey> is defined by the order the element was created.
Once you have used named keys in a series you cannot create new elements with a numeric key.
s.index
All indexes in the series. Can be sliced to find a particular index.
s.describe()
Series statistics

All in 1 example:

import numpy as np
import pandas as pd
s = pd.Series([], dtype=int)   # specify dtype to avoid a warning on an empty series
for i in range(50):
    s[i] = int(np.random.random() * 100)

for i in s.index:
    print(i,s[i])

Funny: s[0] works, but the following does not, because iterating a series yields the values, not the keys:

for i in s:
    print(s[i])

To get all values from the series you do:

for v in s:
    print(v)

To get the indexes too:

for i in s.index:
    print(i,s[i])

TimeSeries

Indexes that contain datetime values are automatically cast to a DatetimeIndex.

df.resample('5D').mean()
Return the dataframe with the average value for each 5-day period.
df.resample('1H').agg({'ColumnA':'first','ColumnB':'max', 'ColumnC':'min', 'ColumnD':'last' })
Resample with aggregation specified per column.
df.resample('5D',origin='end')
Resample anchored on the last timestamp. By default resampling is anchored on the first timestamp, in which case the last bin may be incomplete.
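A resampling sketch on a made-up daily series:

import pandas as pd

idx = pd.date_range('2024-01-01', periods=12, freq='D')
df = pd.DataFrame({'value': range(12)}, index=idx)
print(df.resample('5D').mean())                  # average per 5-day bin
print(df.resample('5D').agg({'value': 'max'}))   # aggregation specified per column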

Datetime functions

df[datefield] = pd.to_datetime(df[datefield],format='%d-%m-%Y')
Convert a string-column with a date into datetime. Format is the input format.
WARNING: without format, to_datetime tries to guess the format. This is pretty intelligent but can easily be wrong with day numbers under 13
df[datefield] = pd.to_datetime(df[datefield],unit='s')
Convert a column with unix time into datetime.
df[<timefield>] = df[<datetime-field>].dt.time
Return a series (column) with the time-value of a datetime-column
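A small sketch of these conversions; column names and values are made up:

import pandas as pd

df = pd.DataFrame({'date': ['31-12-2023', '01-02-2024'], 'stamp': [1700000000, 1700086400]})
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')   # explicit input format
df['stamp'] = pd.to_datetime(df['stamp'], unit='s')          # from unix time
df['time'] = df['stamp'].dt.time                             # time part only
print(df.dtypes)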

Reading Data

read_html(url)
Read html tables into a list of dataframes (no header, no index)

Example code. The first line in the table is a header, the first column the index (e.g. dates), decimal specifies the decimal point character.

tables = pd.read_html(url,header=0,index_col=0,decimal=<char>)
read_sql(query,cnx,index_col=[col1,col2])
Read data from the database opened on cnx (see Python:Databases)
df = pd.read_sql(query,cnx,index_col=['Primarykey'])
read_excel(xlsfile,sheetname)
Read data from a Microsoft Excel file.
read_excel(xlsfile,sheetname,converters={'columna':str,'columnb':str})
Force columns to be read as string
read_csv(csvfile)
Read data from a file with Character Separated Values
read_csv(csvfile, delimiter=',', usecols=[<columnA>, <columnB>], dtype = {<columnA>: str})
Read columns <columnA> and <columnB> into a dataframe and force columnA to string type.
read_csv(csvfile, index_col=False, header=None, names=['columnA', 'columnB'], na_values='None')
Read from a file without an index column or a header row, set the column names and treat the string 'None' as an NA value.
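A read_csv sketch; io.StringIO stands in for a real file here:

import io
import pandas as pd

csvdata = io.StringIO("colA,colB\n1,x\nNone,y\n")
df = pd.read_csv(csvdata, delimiter=',', usecols=['colA', 'colB'],
                 dtype={'colB': str}, na_values='None')
print(df)   # the 'None' in colA became NaN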