Difference between revisions of "Pandas"

From wiki
Jump to navigation Jump to search
Line 52: Line 52:
 
Object for tabular data (that is e.g. obtained by read_html).
 
Object for tabular data (that is e.g. obtained by read_html).
  
;table = <nowiki>pd.DataFrame([{col1: 1,col2: 2}])</nowiki>
+
Each column is a [[Pandas#Series]]
:Create a dataframs
+
 
;table.head(x)
+
;df = <nowiki>pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])</nowiki>
;table.tail(x)
+
:Create a simple dataframe having a range index, 2 rows and 2 columns
:Return first/last x data rows of table (5 is the default value for x).
+
;df = <nowiki>pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])</nowiki>
;table.columns
+
:Create a dataframe having a list index. The size of index must match the number of rows.
:The column headers (class = pandas.core.indexes.base.Index)  
+
 
;table.columns=[list,of,column,names]
+
==Showing the dataframe
:Redefine the column headers
+
;df.size
;table.index
+
:Return the number of cells in a dataframe
:The table index (first column) (class = pandas.core.indexes.base.Index)
+
;df.shape
;table.<columname>
+
:Return the number of rows and columns in a dataframe as [[Python:DataTypes#Tuple|tuple]].
:Address a column by its name. Each column is a pandas [[#Series|Series]]
+
;df.describe()
;table.loc[<indexname>]
+
:Return various attributes for numeric columns in the dataframe. This may indicate if there are outlyers.
;table.loc[<indexname>].<columnname>
+
;df.head(x)
;table.loc[0][0]
+
;df.tail(x)
;table.loc[lambda d: d[colum1] == <value> ]
+
:Return first/last x data rows of df (5 is the default value for x).
:The content of the index (row) as pandas [[#Series|Series]] or just the named column. [0][0]-form for tables without header or index.
+
;df.columns
 +
:Return the column headers (class = pandas.core.indexes.base.Index)  
 +
;df.<columname>
 +
:Address a column by its name.
 +
;df.columns=[list,of,column,names]
 +
:Redefine the column headers (modifies the dataframe itself, nothing is returned)
 +
;df.index
 +
:Return the table index (first column) (class = pandas.core.indexes.base.Index)
 +
;df.set_index(['col1','col2'])
 +
:Rerun the dataframe with a new index
 +
;df.reset_index()
 +
:Return the dataframe with a default index (range)
 +
 
 +
;df.loc[<indexname>]
 +
;df.loc[<indexname>].<columnname>
 +
;df.loc[0][0]
 +
;<nowiki>df.loc[lambda d: d[colum1] == <value> ]</nowiki>
 +
:Return the content of the index (row) as pandas [[#Series|Series]] or just the named column. [0][0]-form for tables without header or index.
 
:The last form selects all rows where column1 equals <value>
 
:The last form selects all rows where column1 equals <value>
;table.filter(regex=<regex>,axis='index')
+
;df.filter(regex=<regex>,axis='index')
;table.filter(regex=<regex>,axis='index').<columnname>
+
;df.filter(regex=<regex>,axis='index').<columnname>
;table.filter(regex=<regex>,axis='index').index
+
;df.filter(regex=<regex>,axis='index').index
:Find all rows for which in index matches <regexp> or get only the column of the matched indexes. (axis=0 ) or the indexname(s)
+
:Return all rows for which in index matches <regexp> or get only the column of the matched indexes. (axis=0 ) or the indexname(s).
;table.filter(regex=<regex>,axis='columns')
+
:The .index returns the matching indexes as a [[Python:DataTypes#List|list]].
:Find all column-names which name matches <regexp>. (axis=1)
+
;df.filter(regex=<regex>,axis='columns')
;table.sort_values(<columnname>)
+
:Return all columns which name matches <regexp>. (axis=1)
;table.sort_values([<column1>,<colunm2>],ascending=(True,False))
+
;df.sort_values(<columnname>)
:Sort table on the values in the columns. The second form sorts on column1 first and then on column2, column1 ascending, column2 descending
+
;df.sort_values([<column1>,<colunm2>],ascending=(True,False))
;table.groupby([column1,column2])
+
:Return the dataframe sorted on the values in the columns. The second form sorts on column1 first and then on column2, column1 ascending, column2 descending
:Group the table by the column-values
+
;df.groupby([column1,column2])
 +
:Return the dataframe grouped by the column-values
 
;table.agg(newname=('columname', np.max))
 
;table.agg(newname=('columname', np.max))
 
:This sample uses named aggregations, that is only supported from version v0.25  
 
:This sample uses named aggregations, that is only supported from version v0.25  
;table.reset_index()
+
;df.reset_index()
;table.transform()
+
;df.transform()
  
;table.merge(table2,on='column')
+
;df.merge(table2,on='column')
 
:Like SQL-join
 
:Like SQL-join
  
;table.assign
+
;df.assign
 
:Add a column
 
:Add a column
  
;table.drop(columns=[listofcolumnstodrop]
+
;df.drop(columns=[listofcolumnstodrop]
 
:Remove columns from a dataframe.
 
:Remove columns from a dataframe.
  
;table.apply(<function>)
+
;df.apply(<function>)
:Apply a function to the table. By default every element of the table.
+
:Apply a function to the dataframe. By default every element of the table.
  
 
=Reading Data=
 
=Reading Data=

Revision as of 17:49, 5 October 2019

Check the 10 minutes to Pandas too.

import pandas as pd
Import the library, we assume this was done on this page

Series

Pandas Series online documentation.
A pandas series is a 1 dimensional array with named keys.
Pandas Series have all kind of methods similar to Numpy like main, std, min, max,.... In fact Pandas is using numpy to do this.

s = pd.Series([])
s = pd.Series([valuelist],[indexlist])
Initialize a series. If indexlist is omitted the keys are integers starting at 0.
s[<key>] = <value>
Assign <value> to the series element with key <key>
The order in the series is the order in which they are created, NOT the numeric order.
Elements can be addressed as s[<key>], s.<key> or s[<numkey>]. Where <numkey> is defined by the order the element was created.
Once you have used named keys in a series you cannot create new elements with a numeric key.
s.index
All indexes in the series. Can be sliced to find a particular index.
s.describe()
Series statistics

All in 1 example:

import numpy as np
import pandas as pd
s = pd.Series([])
for i in range(50):
    s[i] = int(np.random.random() * 100)

for i in s.index:
    print(i,s[i])

Funny, you can do s[0] but not

for i in s:
    print(s[i])

To get all values from the series you do:

for v in s:
    print(v)

To get the indexes too:

for i in s.index:
    print(i,s[i])

DataFrame

Object for tabular data (that is e.g. obtained by read_html).

Each column is a Pandas#Series

df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])
Create a simple dataframe having a range index, 2 rows and 2 columns
df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])
Create a dataframe having a list index. The size of index must match the number of rows.

==Showing the dataframe

df.size
Return the number of cells in a dataframe
df.shape
Return the number of rows and columns in a dataframe as tuple.
df.describe()
Return various attributes for numeric columns in the dataframe. This may indicate if there are outlyers.
df.head(x)
df.tail(x)
Return first/last x data rows of df (5 is the default value for x).
df.columns
Return the column headers (class = pandas.core.indexes.base.Index)
df.<columname>
Address a column by its name.
df.columns=[list,of,column,names]
Redefine the column headers (modifies the dataframe itself, nothing is returned)
df.index
Return the table index (first column) (class = pandas.core.indexes.base.Index)
df.set_index(['col1','col2'])
Rerun the dataframe with a new index
df.reset_index()
Return the dataframe with a default index (range)
df.loc[<indexname>]
df.loc[<indexname>].<columnname>
df.loc[0][0]
df.loc[lambda d: d[colum1] == <value> ]
Return the content of the index (row) as pandas Series or just the named column. [0][0]-form for tables without header or index.
The last form selects all rows where column1 equals <value>
df.filter(regex=<regex>,axis='index')
df.filter(regex=<regex>,axis='index').<columnname>
df.filter(regex=<regex>,axis='index').index
Return all rows for which in index matches <regexp> or get only the column of the matched indexes. (axis=0 ) or the indexname(s).
The .index returns the matching indexes as a list.
df.filter(regex=<regex>,axis='columns')
Return all columns which name matches <regexp>. (axis=1)
df.sort_values(<columnname>)
df.sort_values([<column1>,<colunm2>],ascending=(True,False))
Return the dataframe sorted on the values in the columns. The second form sorts on column1 first and then on column2, column1 ascending, column2 descending
df.groupby([column1,column2])
Return the dataframe grouped by the column-values
table.agg(newname=('columname', np.max))
This sample uses named aggregations, that is only supported from version v0.25
df.reset_index()
df.transform()
df.merge(table2,on='column')
Like SQL-join
df.assign
Add a column
df.drop(columns=[listofcolumnstodrop]
Remove columns from a dataframe.
df.apply(<function>)
Apply a function to the dataframe. By default every element of the table.

Reading Data

read_html(url)
Read html tables into a list of dataframes (no header, no index)

Example code. The first line in the table is a header, the first column the index (e.g. dates), decimal specifies the decimal point character.

tables = pd.read_html(url,header=0,index_col=0,decimal=<char>)
read_sql(query,cnx,index_columns=[col1,col2])
Read data from the database opened on cnx (see Python:Databases)
df = pd.read_sql(query,cnx,index_col=['Primarykey'])