Pandas
Latest revision as of 17:13, 3 May 2024
See also 10 minutes to Pandas or pythonexamples.
- import pandas as pd
- Import the library, we assume this was done on this page
DataFrame
Object for tabular data (e.g. as obtained by read_html).
Each column is a Pandas#Series
- df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}])
- Create a simple dataframe having a range index, 2 rows and 2 columns
- df = pd.DataFrame([{col1: 1,col2: 2},{col1: 1,col2: 2}],[row1,row2])
- Create a dataframe having a list index. The size of index must match the number of rows.
Single columns can be addressed by their name, as in df.columnname.unique() to get the unique values in a column.
Single or multiple columns can be addressed as a list, as in df.set_index([column1,column2]) to set an index over 2 columns.
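The constructors and column addressing above can be sketched with concrete (made-up) column and row names:

```python
import pandas as pd

# Build a 2x2 dataframe from a list of dicts; the keys become column names
df = pd.DataFrame([{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}])

# The second positional argument supplies a list index;
# its length must match the number of rows
df2 = pd.DataFrame([{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}],
                   ["row1", "row2"])

unique_vals = df.col1.unique()           # address one column by its name
multi = df2.set_index(["col1", "col2"])  # index over two columns
```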
Information
- df.info()
- Information on dataframe (Index, size, datatypes per column)
- df.size
- Return the number of cells in a dataframe
- df.shape
- Return the number of rows and columns in a dataframe as a tuple.
- df.describe()
- Return summary statistics for the numeric columns in the dataframe. This may reveal outliers.
- df.head(x)
- df.tail(x)
- Return first/last x data rows of df (5 is the default value for x).
- df.index
- Return the table index (class = pandas.core.indexes.base.Index)
- df.columns
- Return the column headers (class = pandas.core.indexes.base.Index)
- df.dtypes
- Return the data-type of the columns
- df.columnname.dtype
- Return the data-type of column columnname
- df.copy()
- Return a copy of df. Plain assignment (df2 = df) does not copy; both names then point to the same dataframe, like with dictionaries.
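A quick check of the copy-versus-assignment point, on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Plain assignment only creates a second name for the same object ...
alias = df
# ... while .copy() creates an independent dataframe
clone = df.copy()

alias.loc[0, "a"] = 99  # modifies df too, but not clone
```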
Modify
- df.rename(columns={oldname: newname}, inplace=True)
- Redefine the column headers (modifies the dataframe itself, nothing is returned)
- df.set_index(['col1','col2'])
- Return the dataframe with a new index composed of 2 columns
- df.reset_index()
- Return the dataframe with a default index (range)
- df.sort_index(ascending=False)
- Sort the dataframe on the index. 'True' is the default if 'ascending' is not supplied.
- df.index = df.index.map(str)
- Force a data-type on the index
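A short sketch of the index operations above, using invented column names:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["x", "y"], "col2": [2, 1], "val": [10, 20]})

indexed = df.set_index(["col1", "col2"])  # MultiIndex over two columns
flat = indexed.reset_index()              # back to a default RangeIndex
desc = df.set_index("col2").sort_index(ascending=False)

df.index = df.index.map(str)              # force a string data-type on the index
```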
- df.drop(index)
- df.drop([indexes])
- df.drop(range(0,3))
- Remove row(s) from a dataframe.
- df.drop(columns=[listofcolumnstodrop])
- Remove columns from a dataframe.
- df.dropna(thresh=x)
- Return only rows that have at least x non-NA values. Without thresh, dropna() drops every row that contains any NA value.
- df.loc[<index>,['field1','field2']] = [value1,value2]
- Change values for index <index> in a dataframe
- df.fillna(<value>)
- df[[<column1>,<column2>]] = df[[<column1>,<column2>]].fillna(<value>)
- Return dataframe with all NA-values (in the selected columns) replaced by <value>
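The dropna/fillna behaviour can be demonstrated on a small frame with a few NA cells (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

kept = df.dropna(thresh=2)        # keep rows with at least 2 non-NA cells
filled = df.fillna(0)             # every NA replaced by 0
df[["a"]] = df[["a"]].fillna(-1)  # fill NA only in the selected column(s)
```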
- df.agg(newname=('columnname', np.max))
- Add newname to df with the max of columnname. Named aggregation is only supported from version 0.25.
- df.transform(<function>)
- Apply a function to an existing column or the entire dataframe.
- df.assign(newcolumn = <expression>)
- Add column newcolumn to the dataframe
- Expression can be a function like "lambda d: d['column'] * <something>"
- df.apply(<function>)
- Apply a function to the dataframe; by default the function is called once per column.
- df.apply(lambda x: <function>(x) if x.name in [columnname] else x)
- Apply a function to specified columns
- df[columname] = df[columname].apply(lambda x: <function>(x))
- Change a column in the dataframe by applying <function>
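A minimal sketch of assign, apply and transform, on a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20], "qty": [3, 4]})

# assign: the lambda receives the whole dataframe and returns the new column
df = df.assign(total=lambda d: d["price"] * d["qty"])

# apply on a single column changes just that column
df["price"] = df["price"].apply(lambda x: x * 2)

# transform applies a function to an existing column
df["qty"] = df["qty"].transform(lambda x: x + 1)
```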
- df.replace(<pattern>,<newvalue>,regex=True)
- Replace <pattern> with <newvalue> in the entire dataframe
- df[columna].mask(df[columna] == origvalue , newvalue, inplace=True)
- Change all cells in columna that have origvalue into newvalue (inplace, so in the dataframe itself).
- df.join(table2)
- Like an SQL join, merging on the index
- df.merge(table2,on=<column>)
- Like an SQL join on <column>, which must exist in both dataframes
- dfall.merge(table2, on = <column>, how='left', indicator=True)
- Return df joined with table2. Add column _merge with value 'left_only' to records that only exist in df.
- df = dfall[dfall['_merge'] == 'left_only']
- Return dataframe with all records not in table2
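The left-join/anti-join recipe above, spelled out on two tiny made-up tables:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3], "b": ["p", "q"]})

# indicator=True adds a _merge column telling where each record came from
both = left.merge(right, on="key", how="left", indicator=True)

# anti-join: keep only the rows that have no match in right
only_left = both[both["_merge"] == "left_only"]
```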
Select data
Use .loc, else column-names will be considered too (if I understand this[1] correctly).
- WARNING!!
- At least with .loc, if only 1 item is found it is returned as a scalar (string/float?), else as a dataframe/series.
- df.iloc[<slice>]
- Return the rows indicated by <slice> (a single integer gives a pandas.Series, a slice gives a dataframe)
- df.iloc[:,0:3]
- df[[col0,col1,col2]]
- Return 3 columns from the dataframe
- df.filter(regex=<regex>,axis='columns')
- Return all columns which name matches <regexp>. (axis=1)
- df.loc[<index>]
- Select the row(s) with index <index>
- df.loc[<indexname>].<columnname>
- df.loc[<indexname>][<columnname>]
- Select only 1 column from a row as a pandas series. You may convert that to a list or set immediately by using list(df.loc[<indexname>].<columnname>)
- df.loc[0][0]
- Select a cell in a dataframe with no index or header
- df.loc[lambda d: d[column1] == <value> ]
- df.loc[df[column1] == <value> ]
- df.loc[df[column1].notnull()]
- Select all rows where column1 equals <value> or do not have a null-value.
- df.loc[(df[column1] == <value>) | (df[column1] == <value2>)]
- Combine conditions with '|' (or) or '&' (and). Use () on each condition!!
- df.loc[datetime.today()-pd.Timedelta(days=7) : datetime.today()]
- Select all rows where the (datetime)-index is between 7 days ago and today.
- NOTE!! The index must be sorted in ascending order[2].
- df.last('7D')
- Select the last 7 days in the dataframe. For this too, the datetime index must be sorted in ascending order
See also the pandas series pages
- df[df['fieldname'].str.contains('regexp',regex=True)]
- Select all rows in df where fieldname matches regexp. The field must have a string datatype. If you are looking for a plain string you can leave out the regex=True part.
- df[df['fieldname'].gt(value)]
- Select all rows in df where fieldname is greater than value. The field must have a numeric datatype
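str.contains and gt in action, on invented field names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alpha", "beta", "gamma"], "score": [5, 15, 25]})

matches = df[df["name"].str.contains("^a", regex=True)]  # regex on a str column
big = df[df["score"].gt(10)]                             # numeric comparison
```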
- df.filter(regex=<regex>,axis='index')
- df.filter(regex=<regex>,axis='index').<columnname>
- df.filter(regex=<regex>,axis='index').index
- Return all rows whose index matches <regex> (axis=0), or only one column of the matching rows, or only the matching indexes.
- The .index variant returns the matching indexes as an Index, easily converted to a list.
- df.sort_values(<columnname>)
- df.sort_values([<column1>,<column2>],ascending=(True,False))
- Return the dataframe sorted on the values in the columns. The second form sorts on column1 first and then on column2, column1 ascending, column2 descending
- df.groupby([column1,column2])
- Return a DataFrameGroupBy object, this is not a dataframe
- df.collname.unique()
- Return a Numpy array with all distinct values in column 'collname'
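sort_values, groupby and unique together, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"], "val": [3, 1, 2]})

# grp ascending, val descending within each group
ordered = df.sort_values(["grp", "val"], ascending=(True, False))

g = df.groupby("grp")       # a DataFrameGroupBy object, not a dataframe
sums = g["val"].sum()       # aggregate it to get a Series back

groups = df.grp.unique()    # numpy array of distinct values
```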
- Get all rows from df1 that have an index that exists in df2
df = df1[df1.index.isin(df2.index)]
- Or get the intersection of the indices and select by loc
df = df1.loc[df1.index & df2.index]
df = df1.loc[np.intersect1d(df1.index, df2.index)]
df = df1.loc[df1.index.intersection(df2.index)]
Use data
- Print 2 attributes for each row and check for no value
for index, row in df.iterrows():
    print(row['ColumnA'], row['ColumnB'])
    if np.isnan(row['ColumnA']):
        print('Found a NaN value')
- Print all values for the first row
for column, values in df.items():   # df.iteritems() in pandas < 2.0
    print(column, values[0])
- Loop over columns
for column in df:
    print(f"Columnname {column}")
- df.iloc[0]
- Return the first row as a pandas series (on a Series, .iloc[0] returns the scalar value).
- df.iloc[0][0]
- Return the value of the first column in the first row. Columns can also be identified by name.
- df[<column>].quantile(q=0.1)
- Return the 10th percentile (10% of the values in <column> are lower than the 10th percentile)
- df[<column>].str.extract(r"A regex(p\w+)")
- Return a series (column) of the values selected by Regular Expressions (p\w+)
- df[<columna>].map(str) + '_' + df[<columnb>].map(str)
- Return a series (column) with the columns concatenated. If not mapped to str an error will be thrown.
- df[<column>].map(str).apply(lambda x: x[-1])
- Return a series (column) with the last character of the <column>
Series
Pandas Series online documentation.
A pandas Series is a 1-dimensional array with named keys.
Pandas Series have all kinds of methods similar to NumPy, like mean, std, min, max, ... In fact pandas uses NumPy to do this.
- s = pd.Series([])
- s = pd.Series([valuelist],[indexlist])
- Initialize a series. If indexlist is omitted the keys are integers starting at 0.
- s[<key>] = <value>
- Assign <value> to the series element with key <key>
- The order in the series is the order in which the elements are created, NOT the numeric order.
- Elements can be addressed as s[<key>], s.<key> or s[<numkey>], where <numkey> is determined by the order in which the element was created.
- Once you have used named keys in a series you cannot create new elements with a numeric key.
- s.index
- All indexes in the series. Can be sliced to find a particular index.
- s.describe()
- Series statistics
All in 1 example:
import numpy as np
import pandas as pd

s = pd.Series([])
for i in range(50):
    s[i] = int(np.random.random() * 100)
for i in s.index:
    print(i, s[i])
Funny, you can do s[0] but not:
for i in s:
    print(s[i])
To get all values from the series you do:
for v in s:
    print(v)
To get the indexes too:
for i in s.index:
    print(i, s[i])
TimeSeries
An index that contains datetime values is automatically cast to a DatetimeIndex.
- df.resample('5D').mean()
- Return the dataframe resampled into 5-day bins, with the average value of each bin.
- df.resample('1H').agg({'ColumnA':'first','ColumnB':'max', 'ColumnC':'min', 'ColumnD':'last' })
- Resample with aggregation specified per column.
- df.resample('5D',origin='end')
- Resample with bins anchored at the last timestamp. By default bins start at the first timestamp, in which case the last bin may be incomplete.
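The resample calls above can be tried on a small synthetic daily series:

```python
import pandas as pd

# Ten consecutive days of made-up data with a DatetimeIndex
idx = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"v": range(10)}, index=idx)

five_day_mean = df.resample("5D").mean()            # one row per 5-day bin
five_day_max = df.resample("5D").agg({"v": "max"})  # aggregation per column
```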
Datetime functions
- df[datefield] = pd.to_datetime(df[datefield],format='%d-%m-%Y')
- Convert a string-column with a date into datetime. Format is the input format.
- WARNING: without format, to_datetime tries to infer the format. This is fairly intelligent but can easily be wrong for day numbers under 13
- df[datefield] = pd.to_datetime(df[datefield],unit='s')
- Convert a column with unix time into datetime.
- df[<timefield>] = df[<datetime-field>].dt.time
- Return a series (column) with the time-value of a datetime-column
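A sketch of the to_datetime conversions, with made-up date strings:

```python
import pandas as pd

df = pd.DataFrame({"when": ["31-01-2024", "01-02-2024"]})

# An explicit format avoids wrong day/month guessing for day numbers under 13
df["when"] = pd.to_datetime(df["when"], format="%d-%m-%Y")
df["time"] = df["when"].dt.time          # time component only

# unit='s' interprets numbers as unix time (seconds since the epoch)
epoch = pd.to_datetime(pd.Series([0]), unit="s")
```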
Reading Data
- read_html(url)
- Read html tables into a list of dataframes (no header, no index)
Example code. The first line in the table is a header, the first column the index (e.g. dates), decimal specifies the decimal point character.
tables = pd.read_html(url,header=0,index_col=0,decimal=<char>)
- read_sql(query,cnx,index_col=[col1,col2])
- Read data from the database opened on cnx (see Python:Databases)
df = pd.read_sql(query,cnx,index_col=['Primarykey'])
- read_excel(xlsfile,sheetname)
- Read data from a Microsoft Excel file.
- read_excel(xlsfile,sheetname,converters={'columna':str,'columnb':str})
- Force columns to be read as string
- read_csv(csvfile)
- Read data from a file with character-separated values
- read_csv(csvfile, delimiter=',', usecols=[<columnA>, <columnB>], dtype = {<columnA>: str})
- Read columns <columnA> and <columnB> into a dataframe and force columnA to string type.
- read_csv(csvfile, index_col=False, header=None, names=['columnA', 'columnB'], na_values='None')
- Read from a file with no index column and no header row, set the column names, and treat the string 'None' as an NA value.
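The read_csv options can be tried without a file by reading from an in-memory string (all names here are made up):

```python
import io
import pandas as pd

# No index column and no header row; the string 'None' should become NA
csv_text = "1;alice\n2;None\n"
df = pd.read_csv(io.StringIO(csv_text), delimiter=";",
                 header=None, names=["id", "name"],
                 dtype={"id": str}, na_values="None")
```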