Difference between revisions of "Pandas"

From wiki
Jump to navigation Jump to search
Line 3: Line 3:
 
;import pandas as pd
 
;import pandas as pd
 
:Import the library, we assume this was done on this page
 
:Import the library, we assume this was done on this page
 
=Series=
 
Pandas Series [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html online documentation].<br>
 
A pandas series is a 1 dimensional array with named keys.<br>
 
Pandas Series have all kind of methods similar to [[Numpy]] like main, std, min, max,.... In fact Pandas is using numpy to do this.
 
;s = pd.Series([])
 
;s = pd.Series([valuelist],[indexlist])
 
:Initialize a series. If indexlist is omitted the keys are integers starting at 0.
 
;s[<key>] = <value>
 
:Assign <value> to the series element with key <key>
 
:The order in the series is the order in which they are created, NOT the numeric order.
 
:Elements can be addressed as <code>s[<key>]</code>, <code>s.<key></code> or <code>s[<numkey>]</code>. Where <numkey> is defined by the order the element was created.
 
:Once you have used named keys in a series you cannot create new elements with a numeric key.
 
;s.index
 
:All indexes in the series. Can be sliced to find a particular index.
 
;s.describe()
 
:Series statistics
 
 
All in 1 example:
 
<syntaxhighlight lang=python>
 
import numpy as np
 
import pandas as pd
 
s = pd.Series([])
 
for i in range(50):
 
    s[i] = int(np.random.random() * 100)
 
 
for i in s.index:
 
    print(i,s[i])
 
</syntaxhighlight>
 
 
Funny, you can do <code>s[0]</code> but not
 
<syntaxhighlight lang=python>
 
for i in s:
 
    print(s[i])
 
</syntaxhighlight>
 
To get all values from the series you do:
 
<syntaxhighlight lang=python>
 
for v in s:
 
    print(v)
 
</syntaxhighlight>
 
To get the indexes too:
 
<syntaxhighlight lang=python>
 
for i in s.index:
 
    print(i,s[i])
 
</syntaxhighlight>
 
  
 
=DataFrame=
 
=DataFrame=
Line 59: Line 14:
 
:Create a dataframe having a list index. The size of index must match the number of rows.
 
:Create a dataframe having a list index. The size of index must match the number of rows.
  
==Showing the dataframe
+
==Dataframe actions==
 
;df.size
 
;df.size
 
:Return the number of cells in a dataframe
 
:Return the number of cells in a dataframe
Line 69: Line 24:
 
;df.tail(x)
 
;df.tail(x)
 
:Return first/last x data rows of df (5 is the default value for x).
 
:Return first/last x data rows of df (5 is the default value for x).
 +
 
;df.columns
 
;df.columns
 
:Return the column headers (class = pandas.core.indexes.base.Index)  
 
:Return the column headers (class = pandas.core.indexes.base.Index)  
Line 75: Line 31:
 
;df.columns=[list,of,column,names]
 
;df.columns=[list,of,column,names]
 
:Redefine the column headers (modifies the dataframe itself, nothing is returned)
 
:Redefine the column headers (modifies the dataframe itself, nothing is returned)
 +
 
;df.index
 
;df.index
 
:Return the table index (first column) (class = pandas.core.indexes.base.Index)
 
:Return the table index (first column) (class = pandas.core.indexes.base.Index)
Line 88: Line 45:
 
:Return the content of the index (row) as pandas [[#Series|Series]] or just the named column. [0][0]-form for tables without header or index.
 
:Return the content of the index (row) as pandas [[#Series|Series]] or just the named column. [0][0]-form for tables without header or index.
 
:The last form selects all rows where column1 equals <value>
 
:The last form selects all rows where column1 equals <value>
 +
 
;df.filter(regex=<regex>,axis='index')
 
;df.filter(regex=<regex>,axis='index')
 
;df.filter(regex=<regex>,axis='index').<columnname>
 
;df.filter(regex=<regex>,axis='index').<columnname>
Line 95: Line 53:
 
;df.filter(regex=<regex>,axis='columns')
 
;df.filter(regex=<regex>,axis='columns')
 
:Return all columns which name matches <regexp>. (axis=1)
 
:Return all columns which name matches <regexp>. (axis=1)
 +
 
;df.sort_values(<columnname>)
 
;df.sort_values(<columnname>)
 
;df.sort_values([<column1>,<colunm2>],ascending=(True,False))
 
;df.sort_values([<column1>,<colunm2>],ascending=(True,False))
Line 100: Line 59:
 
;df.groupby([column1,column2])
 
;df.groupby([column1,column2])
 
:Return the dataframe grouped by the column-values
 
:Return the dataframe grouped by the column-values
 +
 
;table.agg(newname=('columname', np.max))
 
;table.agg(newname=('columname', np.max))
 
:This sample uses named aggregations, that is only supported from version v0.25  
 
:This sample uses named aggregations, that is only supported from version v0.25  
Line 116: Line 76:
 
;df.apply(<function>)
 
;df.apply(<function>)
 
:Apply a function to the dataframe. By default every element of the table.
 
:Apply a function to the dataframe. By default every element of the table.
 +
 +
=Series=
 +
Pandas Series [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html online documentation].<br>
 +
A pandas series is a 1 dimensional array with named keys.<br>
 +
Pandas Series have all kind of methods similar to [[Numpy]] like main, std, min, max,.... In fact Pandas is using numpy to do this.
 +
;s = pd.Series([])
 +
;s = pd.Series([valuelist],[indexlist])
 +
:Initialize a series. If indexlist is omitted the keys are integers starting at 0.
 +
;s[<key>] = <value>
 +
:Assign <value> to the series element with key <key>
 +
:The order in the series is the order in which they are created, NOT the numeric order.
 +
:Elements can be addressed as <code>s[<key>]</code>, <code>s.<key></code> or <code>s[<numkey>]</code>. Where <numkey> is defined by the order the element was created.
 +
:Once you have used named keys in a series you cannot create new elements with a numeric key.
 +
;s.index
 +
:All indexes in the series. Can be sliced to find a particular index.
 +
;s.describe()
 +
:Series statistics
 +
 +
All in 1 example:
 +
<syntaxhighlight lang=python>
 +
import numpy as np
 +
import pandas as pd
 +
s = pd.Series([])
 +
for i in range(50):
 +
    s[i] = int(np.random.random() * 100)
 +
 +
for i in s.index:
 +
    print(i,s[i])
 +
</syntaxhighlight>
 +
 +
Funny, you can do <code>s[0]</code> but not
 +
<syntaxhighlight lang=python>
 +
for i in s:
 +
    print(s[i])
 +
</syntaxhighlight>
 +
To get all values from the series you do:
 +
<syntaxhighlight lang=python>
 +
for v in s:
 +
    print(v)
 +
</syntaxhighlight>
 +
To get the indexes too:
 +
<syntaxhighlight lang=python>
 +
for i in s.index:
 +
    print(i,s[i])
 +
</syntaxhighlight>
  
 
=Reading Data=
 
=Reading Data=

Revision as of 17:54, 5 October 2019

Check the 10 minutes to Pandas too.

import pandas as pd
Import the library, we assume this was done on this page

DataFrame

Object for tabular data (that is e.g. obtained by read_html).

Each column is a Pandas#Series

df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])
Create a simple dataframe having a range index, 2 rows and 2 columns
df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])
Create a dataframe having a list index. The size of index must match the number of rows.

Dataframe actions

df.size
Return the number of cells in a dataframe
df.shape
Return the number of rows and columns in a dataframe as tuple.
df.describe()
Return various attributes for numeric columns in the dataframe. This may indicate if there are outlyers.
df.head(x)
df.tail(x)
Return first/last x data rows of df (5 is the default value for x).
df.columns
Return the column headers (class = pandas.core.indexes.base.Index)
df.<columname>
Address a column by its name.
df.columns=[list,of,column,names]
Redefine the column headers (modifies the dataframe itself, nothing is returned)
df.index
Return the table index (first column) (class = pandas.core.indexes.base.Index)
df.set_index(['col1','col2'])
Rerun the dataframe with a new index
df.reset_index()
Return the dataframe with a default index (range)
df.loc[<indexname>]
df.loc[<indexname>].<columnname>
df.loc[0][0]
df.loc[lambda d: d[colum1] == <value> ]
Return the content of the index (row) as pandas Series or just the named column. [0][0]-form for tables without header or index.
The last form selects all rows where column1 equals <value>
df.filter(regex=<regex>,axis='index')
df.filter(regex=<regex>,axis='index').<columnname>
df.filter(regex=<regex>,axis='index').index
Return all rows for which in index matches <regexp> or get only the column of the matched indexes. (axis=0 ) or the indexname(s).
The .index returns the matching indexes as a list.
df.filter(regex=<regex>,axis='columns')
Return all columns which name matches <regexp>. (axis=1)
df.sort_values(<columnname>)
df.sort_values([<column1>,<colunm2>],ascending=(True,False))
Return the dataframe sorted on the values in the columns. The second form sorts on column1 first and then on column2, column1 ascending, column2 descending
df.groupby([column1,column2])
Return the dataframe grouped by the column-values
table.agg(newname=('columname', np.max))
This sample uses named aggregations, that is only supported from version v0.25
df.reset_index()
df.transform()
df.merge(table2,on='column')
Like SQL-join
df.assign
Add a column
df.drop(columns=[listofcolumnstodrop]
Remove columns from a dataframe.
df.apply(<function>)
Apply a function to the dataframe. By default every element of the table.

Series

Pandas Series online documentation.
A pandas series is a 1 dimensional array with named keys.
Pandas Series have all kind of methods similar to Numpy like main, std, min, max,.... In fact Pandas is using numpy to do this.

s = pd.Series([])
s = pd.Series([valuelist],[indexlist])
Initialize a series. If indexlist is omitted the keys are integers starting at 0.
s[<key>] = <value>
Assign <value> to the series element with key <key>
The order in the series is the order in which they are created, NOT the numeric order.
Elements can be addressed as s[<key>], s.<key> or s[<numkey>]. Where <numkey> is defined by the order the element was created.
Once you have used named keys in a series you cannot create new elements with a numeric key.
s.index
All indexes in the series. Can be sliced to find a particular index.
s.describe()
Series statistics

All in 1 example:

import numpy as np
import pandas as pd
s = pd.Series([])
for i in range(50):
    s[i] = int(np.random.random() * 100)

for i in s.index:
    print(i,s[i])

Funny, you can do s[0] but not

for i in s:
    print(s[i])

To get all values from the series you do:

for v in s:
    print(v)

To get the indexes too:

for i in s.index:
    print(i,s[i])

Reading Data

read_html(url)
Read html tables into a list of dataframes (no header, no index)

Example code. The first line in the table is a header, the first column the index (e.g. dates), decimal specifies the decimal point character.

tables = pd.read_html(url,header=0,index_col=0,decimal=<char>)
read_sql(query,cnx,index_columns=[col1,col2])
Read data from the database opened on cnx (see Python:Databases)
df = pd.read_sql(query,cnx,index_col=['Primarykey'])