Difference between revisions of "Pandas"
Jump to navigation
Jump to search
(→Other) |
|||
Line 52: | Line 52: | ||
Object for tabular data (that is e.g. obtained by read_html). | Object for tabular data (that is e.g. obtained by read_html). | ||
− | ; | + | Each column is a [[Pandas#Series]] |
− | :Create a | + | |
− | ; | + | ;df = <nowiki>pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])</nowiki> |
− | ; | + | :Create a simple dataframe having a range index, 2 rows and 2 columns |
− | :Return first/last x data rows of | + | ;df = <nowiki>pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])</nowiki> |
− | ; | + | :Create a dataframe having a list index. The size of index must match the number of rows. |
− | : | + | |
− | ; | + | ==Showing the dataframe |
− | :Redefine the column headers | + | ;df.size |
− | ; | + | :Return the number of cells in a dataframe |
− | : | + | ;df.shape |
− | ; | + | :Return the number of rows and columns in a dataframe as [[Python:DataTypes#Tuple|tuple]]. |
− | : | + | ;df.describe() |
− | ; | + | :Return various attributes for numeric columns in the dataframe. This may indicate if there are outlyers. |
− | ; | + | ;df.head(x) |
− | ; | + | ;df.tail(x) |
− | ; | + | :Return first/last x data rows of df (5 is the default value for x). |
− | : | + | ;df.columns |
+ | :Return the column headers (class = pandas.core.indexes.base.Index) | ||
+ | ;df.<columname> | ||
+ | :Address a column by its name. | ||
+ | ;df.columns=[list,of,column,names] | ||
+ | :Redefine the column headers (modifies the dataframe itself, nothing is returned) | ||
+ | ;df.index | ||
+ | :Return the table index (first column) (class = pandas.core.indexes.base.Index) | ||
+ | ;df.set_index(['col1','col2']) | ||
+ | :Rerun the dataframe with a new index | ||
+ | ;df.reset_index() | ||
+ | :Return the dataframe with a default index (range) | ||
+ | |||
+ | ;df.loc[<indexname>] | ||
+ | ;df.loc[<indexname>].<columnname> | ||
+ | ;df.loc[0][0] | ||
+ | ;<nowiki>df.loc[lambda d: d[colum1] == <value> ]</nowiki> | ||
+ | :Return the content of the index (row) as pandas [[#Series|Series]] or just the named column. [0][0]-form for tables without header or index. | ||
:The last form selects all rows where column1 equals <value> | :The last form selects all rows where column1 equals <value> | ||
− | ; | + | ;df.filter(regex=<regex>,axis='index') |
− | ; | + | ;df.filter(regex=<regex>,axis='index').<columnname> |
− | ; | + | ;df.filter(regex=<regex>,axis='index').index |
− | : | + | :Return all rows for which in index matches <regexp> or get only the column of the matched indexes. (axis=0 ) or the indexname(s). |
− | ; | + | :The .index returns the matching indexes as a [[Python:DataTypes#List|list]]. |
− | : | + | ;df.filter(regex=<regex>,axis='columns') |
− | ; | + | :Return all columns which name matches <regexp>. (axis=1) |
− | ; | + | ;df.sort_values(<columnname>) |
− | : | + | ;df.sort_values([<column1>,<colunm2>],ascending=(True,False)) |
− | ; | + | :Return the dataframe sorted on the values in the columns. The second form sorts on column1 first and then on column2, column1 ascending, column2 descending |
− | : | + | ;df.groupby([column1,column2]) |
+ | :Return the dataframe grouped by the column-values | ||
;table.agg(newname=('columname', np.max)) | ;table.agg(newname=('columname', np.max)) | ||
:This sample uses named aggregations, that is only supported from version v0.25 | :This sample uses named aggregations, that is only supported from version v0.25 | ||
− | ; | + | ;df.reset_index() |
− | ; | + | ;df.transform() |
− | ; | + | ;df.merge(table2,on='column') |
:Like SQL-join | :Like SQL-join | ||
− | ; | + | ;df.assign |
:Add a column | :Add a column | ||
− | ; | + | ;df.drop(columns=[listofcolumnstodrop] |
:Remove columns from a dataframe. | :Remove columns from a dataframe. | ||
− | ; | + | ;df.apply(<function>) |
− | :Apply a function to the | + | :Apply a function to the dataframe. By default every element of the table. |
=Reading Data= | =Reading Data= |
Revision as of 16:49, 5 October 2019
Check the 10 minutes to Pandas too.
- import pandas as pd
- Import the library, we assume this was done on this page
Series
Pandas Series online documentation.
A pandas series is a 1 dimensional array with named keys.
Pandas Series have all kind of methods similar to Numpy like main, std, min, max,.... In fact Pandas is using numpy to do this.
- s = pd.Series([])
- s = pd.Series([valuelist],[indexlist])
- Initialize a series. If indexlist is omitted the keys are integers starting at 0.
- s[<key>] = <value>
- Assign <value> to the series element with key <key>
- The order in the series is the order in which they are created, NOT the numeric order.
- Elements can be addressed as
s[<key>]
,s.<key>
ors[<numkey>]
. Where <numkey> is defined by the order the element was created. - Once you have used named keys in a series you cannot create new elements with a numeric key.
- s.index
- All indexes in the series. Can be sliced to find a particular index.
- s.describe()
- Series statistics
All in 1 example:
import numpy as np
import pandas as pd
s = pd.Series([])
for i in range(50):
s[i] = int(np.random.random() * 100)
for i in s.index:
print(i,s[i])
Funny, you can do s[0]
but not
for i in s:
print(s[i])
To get all values from the series you do:
for v in s:
print(v)
To get the indexes too:
for i in s.index:
print(i,s[i])
DataFrame
Object for tabular data (that is e.g. obtained by read_html).
Each column is a Pandas#Series
- df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])
- Create a simple dataframe having a range index, 2 rows and 2 columns
- df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])
- Create a dataframe having a list index. The size of index must match the number of rows.
==Showing the dataframe
- df.size
- Return the number of cells in a dataframe
- df.shape
- Return the number of rows and columns in a dataframe as tuple.
- df.describe()
- Return various attributes for numeric columns in the dataframe. This may indicate if there are outlyers.
- df.head(x)
- df.tail(x)
- Return first/last x data rows of df (5 is the default value for x).
- df.columns
- Return the column headers (class = pandas.core.indexes.base.Index)
- df.<columname>
- Address a column by its name.
- df.columns=[list,of,column,names]
- Redefine the column headers (modifies the dataframe itself, nothing is returned)
- df.index
- Return the table index (first column) (class = pandas.core.indexes.base.Index)
- df.set_index(['col1','col2'])
- Rerun the dataframe with a new index
- df.reset_index()
- Return the dataframe with a default index (range)
- df.loc[<indexname>]
- df.loc[<indexname>].<columnname>
- df.loc[0][0]
- df.loc[lambda d: d[colum1] == <value> ]
- Return the content of the index (row) as pandas Series or just the named column. [0][0]-form for tables without header or index.
- The last form selects all rows where column1 equals <value>
- df.filter(regex=<regex>,axis='index')
- df.filter(regex=<regex>,axis='index').<columnname>
- df.filter(regex=<regex>,axis='index').index
- Return all rows for which in index matches <regexp> or get only the column of the matched indexes. (axis=0 ) or the indexname(s).
- The .index returns the matching indexes as a list.
- df.filter(regex=<regex>,axis='columns')
- Return all columns which name matches <regexp>. (axis=1)
- df.sort_values(<columnname>)
- df.sort_values([<column1>,<colunm2>],ascending=(True,False))
- Return the dataframe sorted on the values in the columns. The second form sorts on column1 first and then on column2, column1 ascending, column2 descending
- df.groupby([column1,column2])
- Return the dataframe grouped by the column-values
- table.agg(newname=('columname', np.max))
- This sample uses named aggregations, that is only supported from version v0.25
- df.reset_index()
- df.transform()
- df.merge(table2,on='column')
- Like SQL-join
- df.assign
- Add a column
- df.drop(columns=[listofcolumnstodrop]
- Remove columns from a dataframe.
- df.apply(<function>)
- Apply a function to the dataframe. By default every element of the table.
Reading Data
- read_html(url)
- Read html tables into a list of dataframes (no header, no index)
Example code. The first line in the table is a header, the first column the index (e.g. dates), decimal specifies the decimal point character.
tables = pd.read_html(url,header=0,index_col=0,decimal=<char>)
- read_sql(query,cnx,index_columns=[col1,col2])
- Read data from the database opened on cnx (see Python:Databases)
df = pd.read_sql(query,cnx,index_col=['Primarykey'])