Pandas
Latest revision as of 17:13, 3 May 2024
See also the 10 minutes to Pandas guide or pythonexamples.
- import pandas as pd
- Import the library; all examples on this page assume this has been done
DataFrame
Object for tabular data (obtained e.g. by read_html).
Each column is a Series (see the Series section below).
- df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])
- Create a simple dataframe having a range index, 2 rows and 2 columns
- df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}],[row1,row2])
- Create a dataframe having a list index. The size of the index must match the number of rows.
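The two constructors above can be sketched with concrete (hypothetical) column and row names:

```python
import pandas as pd

# Two rows, two columns, default range index (0, 1)
df = pd.DataFrame([{'col1': 1, 'col2': 2}, {'col1': 3, 'col2': 4}])

# The same data with an explicit list index; its length must match the row count
df2 = pd.DataFrame([{'col1': 1, 'col2': 2}, {'col1': 3, 'col2': 4}],
                   index=['row1', 'row2'])
```

Passing the index positionally as the second argument, as in the definition above, works the same as the index= keyword used here.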
Single columns can be addressed by their name, as in df.columnname.unique()
to get the unique values in a column.
Single or multiple columns can be addressed as a list, as in df.set_index([column1,column2])
to set an index over 2 columns.
Information
- df.info()
- Information on dataframe (Index, size, datatypes per column)
- df.size
- Return the number of cells in a dataframe
- df.shape
- Return the number of rows and columns in a dataframe as tuple.
- df.describe()
- Return various statistics for the numeric columns in the dataframe. This may indicate if there are outliers.
- df.head(x)
- df.tail(x)
- Return first/last x data rows of df (5 is the default value for x).
- df.index
- Return the table index (class = pandas.core.indexes.base.Index)
- df.columns
- Return the column headers (class = pandas.core.indexes.base.Index)
- df.dtypes
- Return the data-type of the columns
- df.columnname.dtype
- Return the data-type of column columnname
- df.copy()
- Make a copy of df (plain = assignment will point to the same dataframe, like with dictionaries)
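A quick sketch of the inspection calls above on a small throwaway dataframe (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

rows_cols = df.shape   # (3, 2): rows and columns as a tuple
cells = df.size        # 6: total number of cells
types = df.dtypes      # int64 for 'a', float64 for 'b'

copy_df = df.copy()    # independent copy
alias = df             # plain '=' aliases the same dataframe
```

Here `alias is df` is True while `copy_df is df` is False, which is the dictionary-like behaviour mentioned above.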
Modify
- df.rename(columns={oldname: newname}, inplace=True)
- Redefine the column headers (modifies the dataframe itself, nothing is returned)
- df.set_index(['col1','col2'])
- Return the dataframe with a new index composed of 2 columns
- df.reset_index()
- Return the dataframe with a default index (range)
- df.sort_index(ascending=False)
- Sort the dataframe on the index. 'True' is the default if 'ascending' is not supplied.
- df.index = df.index.map(str)
- Force a data type on the index (here: convert the index to strings)
- df.drop(index)
- df.drop([indexes])
- df.drop(range(0,3))
- Remove row(s) from a dataframe.
- df.drop(columns=[listofcolumnstodrop])
- Remove columns from a dataframe.
- df.dropna(thresh=x)
- Return only rows that have at least x non-NA values (without thresh, any row containing an NA value is dropped).
- df.loc[<index>,['field1','field2']] = [value1,value2]
- Change values for index <index> in a dataframe
- df.fillna(<value>)
- df[[<column1>,<column2>]] = df[[<column1>,<column2>]].fillna(<value>)
- Return dataframe with all NA-values (in the selected columns) replaced by <value>
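A minimal sketch of fillna on a subset of columns (hypothetical column names); note that column 'c' keeps its NA values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan],
                   'b': [np.nan, 2.0],
                   'c': [np.nan, np.nan]})

# Replace NA values only in the selected columns
df[['a', 'b']] = df[['a', 'b']].fillna(0)
```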
- table.agg(newname=('columname', np.max))
- Add column newname to df with the max of columnname. Named aggregation is only supported from version 0.25.
- df.transform(<function>)
- Apply a function to an existing column or the entire dataframe.
- df.assign(newcolumn = <expression>)
- Add column newcolumn to the dataframe
- Expression can be a function like "lambda d: d['column'] * <something>"
- df.apply(<function>)
- Apply a function to the dataframe. By default the function receives each column as a Series.
- df.apply(lambda x: <function>(x) if x.name in [columnname] else x)
- Apply a function to specified columns
- df[columname] = df[columname].apply(lambda x: <function>(x))
- Change a column in the dataframe by applying <function>
- df.replace(<pattern>,<newvalue>,regex=True)
- Replace <pattern> with <newvalue> in the entire dataframe
- df[columna].mask(df[columna] == origvalue , newvalue, inplace=True)
- Change all cells in columna that have origvalue into newvalue (in place, so in the dataframe itself).
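The column-selective apply pattern above can be sketched like this (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# Double only column 'a'; every other column is passed through unchanged
df = df.apply(lambda x: x * 2 if x.name in ['a'] else x)
```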
- df.join(table2)
- Like an SQL join, merging on the index
- df.merge(table2,on=<column>)
- Like an SQL join on <column>, which must exist in both dataframes
- dfall = df.merge(table2, on = <column>, how='left', indicator=True)
- Return df joined with table2. A column _merge is added with value 'left_only' for records that only exist in df.
- df = dfall[dfall['_merge'] == 'left_only']
- Return a dataframe with all records not in table2 (an anti-join)
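The indicator/anti-join trick above, sketched with two small hypothetical tables:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'v': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3], 'w': ['x', 'y']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
dfall = left.merge(right, on='key', how='left', indicator=True)

# Anti-join: keep only the rows of 'left' that have no match in 'right'
only_left = dfall[dfall['_merge'] == 'left_only']
```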
Select data
Use .loc, otherwise column names will be considered too (if I understand this[1] correctly).
- WARNING!!
- At least with .loc, if only one item is found it is returned as a scalar (string/float?), otherwise as a dataframe/series.
- df.iloc[<slice>]
- Return the rows indicated by <slice> (a slice returns a dataframe, a single position a pandas Series)
- df.iloc[:,0:3]
- df[[col0,col1,col2]]
- Return 3 columns from the dataframe
- df.filter(regex=<regex>,axis='columns')
- Return all columns whose name matches <regex>. (axis=1)
- df.loc[<index>]
- Select the row(s) with index <index>
- df.loc[<indexname>].<columnname>
- df.loc[<indexname>][<columnname>]
- Select only 1 column from a row as a pandas series. You may convert it to a list or set immediately, as in
list(df.loc[<indexname>].<columnname>)
- df.loc[0][0]
- Select a cell in a dataframe with no index or header
- df.loc[lambda d: d[column1] == <value> ]
- df.loc[df[column1] == <value> ]
- df.loc[df[column1].notnull()]
- Select all rows where column1 equals <value> (first two forms) or is not null (third form).
- df.loc[(df[column1] == <value>) | (df[column1] == <value2>)]
- Combine conditions with '|' (or) or '&' (and). Use () on each condition!!
- df.loc[datetime.today()-pd.Timedelta(days=7) : datetime.today()]
- Select all rows where the (datetime)-index is between 7 days ago and today.
- NOTE!! The index must be sorted in ascending order[2].
- df.last('7D')
- Select the last 7 days in the dataframe. For this too, the datetime index must be sorted in ascending order (note that .last is deprecated in recent pandas versions)
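A sketch of the boolean .loc selections above (hypothetical column name 'col1'); note the parentheses around each combined condition:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, None]})

equal = df.loc[df['col1'] == 2]                         # rows where col1 == 2
either = df.loc[(df['col1'] == 1) | (df['col1'] == 3)]  # conditions combined with |
filled = df.loc[df['col1'].notnull()]                   # rows without a null value
```

The None turns the column into float64, which is why the selected values come back as floats.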
See also the pandas series pages
- df[df['fieldname'].str.contains('regexp',regex=True)]
- Select all rows in df where fieldname matches regexp. The field must have a string data type. If you are looking for a simple string you can leave out the regex=True part.
- df[df['fieldname'].gt(value)]
- Select all rows in df where fieldname is greater than value. The field must have a numeric data type
- df.filter(regex=<regex>,axis='index')
- df.filter(regex=<regex>,axis='index').<columnname>
- df.filter(regex=<regex>,axis='index').index
- Return all rows whose index matches <regex> (axis=0), or only the given column of the matched rows, or the matching indexes.
- The .index form returns the matching indexes as a list-like Index.
- df.sort_values(<columnname>)
- df.sort_values([<column1>,<column2>],ascending=(True,False))
- Return the dataframe sorted on the values in the columns. The second form sorts on column1 first and then on column2, column1 ascending, column2 descending
- df.groupby([column1,column2])
- Return a DataFrameGroupBy object, this is not a dataframe
- df.collname.unique()
- Return a Numpy array with all distinct values in column 'collname'
- Get all rows from df1 that have an index that exists in df2
df = df1[df1.index.isin(df2.index)]
- Or get the intersection of the indexes and select with .loc
df = df1.loc[df1.index & df2.index]
df = df1.loc[np.intersect1d(df1.index, df2.index)]
df = df1.loc[df1.index.intersection(df2.index)]
Use data
- Print 2 attributes for each row and check for no value
for index, row in df.iterrows():
    print(row['ColumnA'], row['ColumnB'])
    if np.isnan(row['ColumnA']):
        print('Found a NaN value')
- Print all values for the first row
for column, row in df.items():
    print(column, row[0])
- Loop over columns
for column in df:
    print(f"Columnname {column}")
- df.iloc[0]
- return the values of the first row as pandas series. If there is only 1 column returns the value.
- df.iloc[0][0]
- return the value of the first column in the first row. Columns can also be identified by name
- df[<column>].quantile(q=0.1)
- Return the 10th percentile (10% of the values in <column> are lower than the 10th percentile)
- df[<column>].str.extract(r"A regex(p\w+)")
- Return a series (column) of the values selected by Regular Expressions (p\w+)
- df[<columna>].map(str) + '_' + df[<columnb>].map(str)
- Return a series (column) with the columns concatenated. If not mapped to str an error will be thrown.
- df[<column>].map(str).apply(lambda x: x[-1])
- Return a series (column) with the last character of the <column>
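The string tricks above can be sketched as follows (made-up columns 'a' and 'b'):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 25], 'b': ['x', 'y']})

# Concatenate two columns; the numeric one must be mapped to str first
combined = df['a'].map(str) + '_' + df['b']

# Last character of each (stringified) value
last = df['a'].map(str).apply(lambda x: x[-1])
```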
Series
Pandas Series online documentation.
A pandas series is a 1 dimensional array with named keys.
Pandas Series have all kinds of methods similar to Numpy, like mean, std, min, max, .... In fact Pandas uses numpy to do this.
- s = pd.Series([])
- s = pd.Series([valuelist],[indexlist])
- Initialize a series. If indexlist is omitted the keys are integers starting at 0.
- s[<key>] = <value>
- Assign <value> to the series element with key <key>
- The order in the series is the order in which they are created, NOT the numeric order.
- Elements can be addressed as s[<key>], s.<key> or s[<numkey>], where <numkey> is defined by the order in which the element was created.
- Once you have used named keys in a series you cannot create new elements with a numeric key.
- s.index
- All indexes in the series. Can be sliced to find a particular index.
- s.describe()
- Series statistics
All in 1 example:
import numpy as np
import pandas as pd
s = pd.Series([])
for i in range(50):
    s[i] = int(np.random.random() * 100)
for i in s.index:
    print(i, s[i])
Funny, you can do s[0] but not:
for i in s:
    print(s[i])
To get all values from the series you do:
for v in s:
    print(v)
To get the indexes too:
for i in s.index:
    print(i, s[i])
TimeSeries
Indexes that contain datetime values are automatically cast to a DatetimeIndex.
- df.resample('5D').mean()
- Return the dataframe with the average value of each 5 days.
- df.resample('1H').agg({'ColumnA':'first','ColumnB':'max', 'ColumnC':'min', 'ColumnD':'last' })
- Resample with aggregation specified per column.
- df.resample('5D',origin='end')
- Anchor the resampling bins on the last timestamp; by default they start at the first timestamp, so the last bin may be incomplete.
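A sketch of resampling on a small synthetic daily series; with 10 days of data, '5D' yields two bins:

```python
import pandas as pd

idx = pd.date_range('2024-01-01', periods=10, freq='D')
df = pd.DataFrame({'value': range(10)}, index=idx)

means = df.resample('5D').mean()                   # one mean per 5-day bin
per_col = df.resample('5D').agg({'value': 'max'})  # aggregation chosen per column
```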
Datetime functions
- df[datefield] = pd.to_datetime(df[datefield],format='%d-%m-%Y')
- Convert a string-column with a date into datetime. Format is the input format.
- WARNING: without format, to_datetime tries to guess the format. This is pretty intelligent but can easily be wrong with day numbers under 13
- df[datefield] = pd.to_datetime(df[datefield],unit='s')
- Convert a column with unix time into datetime.
- df[<timefield>] = testdf[<datetime-field>].dt.time
- Return a series (column) with the time-value of a datetime-column
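A sketch of the conversions above (hypothetical column 'date'); the explicit format avoids the day/month guessing mentioned in the warning:

```python
import pandas as pd

df = pd.DataFrame({'date': ['31-12-2023', '01-02-2024']})
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')

# Unix time (seconds since the epoch) converts with unit='s'
stamps = pd.to_datetime(pd.Series([0, 86400]), unit='s')

# .dt.time extracts only the time part of a datetime column
times = df['date'].dt.time
```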
Reading Data
- read_html(url)
- Read html tables into a list of dataframes (no header, no index)
Example code. The first line in the table is a header, the first column the index (e.g. dates), decimal specifies the decimal point character.
tables = pd.read_html(url,header=0,index_col=0,decimal=<char>)
- read_sql(query,cnx,index_col=[col1,col2])
- Read data from the database opened on cnx (see Python:Databases)
df = pd.read_sql(query,cnx,index_col=['Primarykey'])
- read_excel(xlsfile,sheetname)
- Read data from a Microsoft Excel file.
- read_excel(xlsfile,sheetname,converters={'columna':str,'columnb':str})
- Force columns to be read as string
- read_csv(csvfile)
- Read data from a file with Comma (or other character) Separated Values
- read_csv(csvfile, delimiter=',', usecols=[<columnA>, <columnB>], dtype = {<columnA>: str})
- Read columns <columnA> and <columnB> into a dataframe and force columnA to string type.
- read_csv(csvfile, index_col=False, header=None, names=['columnA', 'columnB'], na_values = 'None')
- Read from a file with no index column and no header row; set the column names and treat the string 'None' as an NA value.
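A self-contained sketch of the read_csv options above, using io.StringIO in place of a real file (column names are made up):

```python
import io
import pandas as pd

csv_data = io.StringIO("a,b,c\n1,x,None\n2,y,3\n")

# Read only columns 'a' and 'c', force 'a' to string, treat 'None' as NA
df = pd.read_csv(csv_data, usecols=['a', 'c'], dtype={'a': str},
                 na_values='None')
```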