Pandas
Latest revision as of 13:43, 29 June 2020

Check the 10 minutes to Pandas guide or pythonexamples too.

import pandas as pd
Import the library; we assume this was done for all examples on this page

DataFrame

Object for tabular data (as obtained e.g. by read_html).

Each column is a pandas Series (see the Series section below)

df = pd.DataFrame([{'col1': 1,'col2': 2},{'col1': 3,'col2': 4}])
Create a simple dataframe having a range index, 2 rows and 2 columns
df = pd.DataFrame([{'col1': 1,'col2': 2},{'col1': 3,'col2': 4}],['row1','row2'])
Create a dataframe having a list index. The size of the index must match the number of rows.
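A minimal runnable sketch of both constructor forms, using made-up column and row names:

```python
import pandas as pd

# Range index is generated automatically
df = pd.DataFrame([{'col1': 1, 'col2': 2}, {'col1': 3, 'col2': 4}])
print(df.index.tolist())   # [0, 1]

# Second positional argument supplies a list index
df2 = pd.DataFrame([{'col1': 1, 'col2': 2}, {'col1': 3, 'col2': 4}],
                   ['row1', 'row2'])
print(df2.index.tolist())  # ['row1', 'row2']
```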

Information

df.info()
Information on dataframe (Index, size, datatypes per column)
df.size
Return the number of cells in a dataframe
df.shape
Return the number of rows and columns in a dataframe as tuple.
df.describe()
Return various statistics for numeric columns in the dataframe. This may indicate whether there are outliers.
df.head(x)
df.tail(x)
Return first/last x data rows of df (5 is the default value for x).
df.index
Return the table index (first column) (class = pandas.core.indexes.base.Index)
df.columns
Return the column headers (class = pandas.core.indexes.base.Index)
df.dtypes
Return the data-type of the columns
df.columnname.dtype
Return the data-type of column columnname
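The inspection calls above, sketched on a tiny made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})  # made-up data
print(df.size)     # 6 -> number of cells (3 rows * 2 columns)
print(df.shape)    # (3, 2) -> (rows, columns) as a tuple
print(df.a.dtype)  # integer dtype, typically int64
```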

Modify

df.<columnname>
Address a column by its name.
df.columns=[list,of,column,names]
Redefine the column headers (modifies the dataframe itself, nothing is returned)
df.set_index(['col1','col2'])
Return the dataframe with a new index
df.reset_index()
Return the dataframe with a default index (range)
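A short sketch of moving a column into the index and back, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2]})
indexed = df.set_index('col1')   # col1 becomes the index
print(indexed.index.tolist())    # ['a', 'b']
back = indexed.reset_index()     # back to a default range index
print(back.index.tolist())       # [0, 1]
```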


df.drop(index)
df.drop([indexes])
df.drop(range(0,3))
Remove row(s) from a dataframe.
df.drop(columns=[listofcolumnstodrop])
Remove columns from a dataframe.
df.dropna(thresh=x)
Return rows that have at least x columns with a non-NA value. If x < 2 the threshold can be dropped.
df.fillna(<value>)
df[[<column1>,<column2>]] = df[[<column1>,<column2>]].fillna(<value>)
Return dataframe with all NA-values (in the selected columns) replaced by <value>
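The drop/dropna/fillna calls above, sketched on made-up data containing NaN values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})
kept = df.dropna(thresh=2)          # only the last row has 2 non-NA values
print(kept.shape[0])                # 1
filled = df.fillna(0)               # every NaN replaced by 0
dropped = df.drop(columns=['b'])    # remove column b
print(dropped.columns.tolist())     # ['a']
```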
table.agg(newname=('columname', np.max))
This sample uses named aggregation, which is only supported from version 0.25
df.transform(<function>)
Apply a function to an existing column or the entire dataframe.
df.join(table2)
Like an SQL join, merging on the index
df.merge(table2,on='column')
Like an SQL join on a column
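A sketch of both join styles, using two made-up tables sharing a key column:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'l': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2], 'r': ['x', 'y']})

merged = left.merge(right, on='key')   # SQL-style join on a column
print(merged.columns.tolist())         # ['key', 'l', 'r']

# join matches on the index instead:
joined = left.set_index('key').join(right.set_index('key'))
print(joined.loc[1].tolist())          # ['a', 'x']
```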
df.assign(newcolumn = <expression>)
Add column newcolumn to the dataframe
Expression can be a function like "lambda d: d['column'] * <something>"
df.apply(<function>)
Apply a function to the dataframe; by default it is applied per column (axis=0).
df.replace(<pattern>,<newvalue>,regex=True)
Replace <pattern> with <newvalue> in the entire dataframe
df[columna].mask(df[columna] == origvalue , newvalue, inplace=True)
Change all cells in columna that have origvalue into newvalue (inplace, so the dataframe itself is modified).
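A sketch of assign and mask on made-up data; the mask is written as an assignment rather than with inplace=True, which behaves the same but also works under copy-on-write semantics in newer pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 20], 'qty': [2, 3]})      # made-up data
df = df.assign(total=lambda d: d['price'] * d['qty'])      # add computed column
print(df['total'].tolist())                                # [20, 60]

# replace qty == 3 with 99, equivalent to mask(..., inplace=True)
df['qty'] = df['qty'].mask(df['qty'] == 3, 99)
print(df['qty'].tolist())                                  # [2, 99]
```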

Select data

Use .loc, else column names will be considered too (if I understand this [https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc] correctly).


df.iloc[:,0:3]
df[['col0','col1','col2']]
Return 3 columns from the dataframe
df.filter(regex=<regex>,axis='columns')
Return all columns whose name matches <regex>. (axis=1)


df.loc[<indexname>]
df.loc[<indexname>].<columnname>
df.loc[0][0]
df.loc[lambda d: d[column1] == <value> ]
df.loc[df[column1] == <value> ]
df.loc[df[column1].notnull()]
Return the content of the index (row) as pandas Series or just the named column. [0][0]-form for tables without header or index.
The last 3 forms select all rows where column1 equals <value> or does not have a null value.
df.iloc[<slice>]
Return the rows indicated by <slice>; a single row is returned as a pandas Series.
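The .loc and .iloc forms above, sketched on made-up data with a list index:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1], 'col2': ['a', 'b', 'c']},
                  index=['r0', 'r1', 'r2'])
print(df.loc['r1'].col2)                        # 'b' -> row as Series, then column
print(df.loc[df['col1'] == 1].index.tolist())   # ['r0', 'r2'] -> boolean selection
print(df.iloc[0:2].shape[0])                    # 2 -> first two rows by position
```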


df.filter(regex=<regex>,axis='index')
df.filter(regex=<regex>,axis='index').<columnname>
df.filter(regex=<regex>,axis='index').index
Return all rows whose index matches <regex> (axis=0), or only the named column of the matched rows, or just the matched index name(s).
The .index returns the matching indexes as a list.
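A sketch of filter on both axes, with made-up index and column names:

```python
import pandas as pd

df = pd.DataFrame({'name': ['x', 'y'], 'value': [1, 2]},
                  index=['apple', 'banana'])
rows = df.filter(regex='^a', axis='index')     # rows whose index starts with 'a'
print(rows.index.tolist())                     # ['apple']
cols = df.filter(regex='^v', axis='columns')   # columns whose name starts with 'v'
print(cols.columns.tolist())                   # ['value']
```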
df.sort_values(<columnname>)
df.sort_values([<column1>,<column2>],ascending=(True,False))
Return the dataframe sorted on the values in the columns. The second form sorts on column1 first and then on column2; column1 ascending, column2 descending.
df.groupby([column1,column2])
Return a DataFrameGroupBy object, this is not a dataframe
df.<columnname>.unique()
Return a NumPy array with all distinct values in column <columnname>
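Sorting, grouping and unique values, sketched on a small made-up table:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'b', 'a'], 'val': [3, 1, 2]})
print(df.sort_values('val')['val'].tolist())    # [1, 2, 3]
print(df.groupby('grp')['val'].sum().tolist())  # [5, 1] -> a: 3+2, b: 1
print(sorted(df.grp.unique()))                  # ['a', 'b']
```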

Use data

Print 2 attributes for each row
for index, row in df.iterrows():
    print(row['ColumnA'],row['ColumnB'])
Print all values for the first row
for column, row in df.iteritems():
    print(column,row[0])

Series

Pandas Series online documentation.
A pandas Series is a 1-dimensional array with named keys.
Pandas Series have all kinds of methods similar to NumPy, like mean, std, min, max, .... In fact Pandas uses NumPy to do this.

s = pd.Series([])
s = pd.Series([valuelist],[indexlist])
Initialize a series. If indexlist is omitted the keys are integers starting at 0.
s[<key>] = <value>
Assign <value> to the series element with key <key>
The order in the series is the order in which the elements are created, NOT the numeric order.
Elements can be addressed as s[<key>], s.<key> or s[<numkey>]. Where <numkey> is defined by the order the element was created.
Once you have used named keys in a series you cannot create new elements with a numeric key.
s.index
All indexes in the series. Can be sliced to find a particular index.
s.describe()
Series statistics

All in 1 example:

import numpy as np
import pandas as pd
s = pd.Series([])
for i in range(50):
    s[i] = int(np.random.random() * 100)

for i in s.index:
    print(i,s[i])

Funny: you can do s[0], but not the following, because iterating over a Series yields its values, not its keys:

for i in s:
    print(s[i])

To get all values from the series you do:

for v in s:
    print(v)

To get the indexes too:

for i in s.index:
    print(i,s[i])

TimeSeries

Indexes that contain datetime values are automatically cast to a DatetimeIndex.

df.resample('5D').mean()
Return the dataframe with the average value for each 5-day period.
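A runnable sketch of resampling, using ten made-up daily values:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=10, freq='D')
df = pd.DataFrame({'v': range(10)}, index=idx)  # made-up daily data
avg = df.resample('5D').mean()                  # average per 5-day period
print(avg['v'].tolist())                        # [2.0, 7.0]
```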

Reading Data

read_html(url)
Read html tables into a list of dataframes (no header, no index)

Example code. The first line in the table is a header, the first column the index (e.g. dates), decimal specifies the decimal point character.

tables = pd.read_html(url,header=0,index_col=0,decimal=<char>)
read_sql(query,cnx,index_col=[col1,col2])
Read data from the database opened on cnx (see Python:Databases)
df = pd.read_sql(query,cnx,index_col=['Primarykey'])
read_excel(xlsfile,sheetname)
Read data from a Microsoft Excel file.
read_excel(xlsfile,sheetname,converters={'columna':str,'columnb':str})
Force columns to be read as string
read_csv(csvfile)
Read data from a file with comma-separated values.
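A self-contained read_csv sketch; it reads from an in-memory string instead of a real file, with made-up data, and uses index_col=0 to make the first column the index:

```python
import io
import pandas as pd

csv_data = io.StringIO("date,value\n2020-01-01,1\n2020-01-02,2\n")
df = pd.read_csv(csv_data, index_col=0)  # first column becomes the index
print(df.shape)                          # (2, 1)
print(df.loc['2020-01-01', 'value'])     # 1
```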