Difference between revisions of "XLS"

Latest revision as of 15:50, 29 June 2020

Python XLS parser using standard modules. xlsx files are basically just a set of compressed (zipped) .XML pages.

A simple workbook with 3 sheets having the same content like this:

Consists of the following .xml files:

Filename	Remarks
./[Content_Types].xml
./docProps/core.xml
./docProps/app.xml
./xl/workbook.xml	Holds per sheet a node like '<sheet name="Test2" sheetId="2" state="visible" r:id="rId3"/>'
./xl/sharedStrings.xml	Strings that appear in more sheets, references from the sheet by indexnumber.
./xl/styles.xml
./xl/worksheets/sheet2.xml	Content of sheet2
./xl/worksheets/sheet1.xml	Content of sheet1
./xl/worksheets/sheet3.xml	Content of sheet3
./xl/_rels/workbook.xml.rels
./_rels/.rels

Found this on stackoverflow where it has been removed [1]. It was announced as very very basic but it works very well and it only uses modules installed by default (at least in Ubuntu 18.04). You can find the original solution at the end of this page.

The routine returns a list of dicts like { <columname> : <cellvalue> [, .. ] }

Our solution supporting workbooks with and without sharedstrings and fetches sheet content by the sheet name. Using 're' to find the column, this may be slower than the original approach but we have not found that in the test-files used.

def xlsx(fname,sheet):
    import zipfile
    from xml.etree.ElementTree import iterparse
    import re
    z = zipfile.ZipFile(fname)
    if 'xl/sharedStrings.xml' in z.namelist():
        # Get shared strings
        strings = [element.text for event, element
                       in iterparse(z.open('xl/sharedStrings.xml')) 
                       if element.tag.endswith('}t')]
    # Get the sheets available
    sheets = { element.attrib['name']:element.attrib['sheetId'] for event,element in iterparse(z.open('xl/workbook.xml'))
                                          if element.tag.endswith('}sheet') }
    
    # Just to see what is in, comment out for real use
    for s in sheets:
        print('SheetID: '+sheets[s]+' Sheetname: '+s)
        
    rows = []
    row = {}
    value = ''

    if sheet in sheets:
        sheetfile = 'xl/worksheets/sheet'+sheets[sheet]+'.xml'
        #print(sheet,sheetfile)
        for event, element in iterparse(z.open(sheetfile)):
            # get value or index to shared strings
            if element.tag.endswith('}v') or element.tag.endswith('}t'):
                value = element.text
            # If value is a shared string, use value as an index
            if element.tag.endswith('}c'):
                if element.attrib.get('t') == 's':
                    value = strings[int(value)]
                # split the row/col information so that the row leter(s) can be separate
                letter = re.sub('\d','',element.attrib['r'])
                row[letter] = value
                value = ''
            if element.tag.endswith('}row'):
                rows.append(row)
                row = {}

    return rows

Pandas

There is a pandas function to read excel files you can use to avoid all this.

Using the routine above you can try to convert it to a pandas dataframe too:

df = pd.DataFrame(xlsx(xlsfile,sheetname)): Put the sheet into a pandas dataframe

df.columns = df.iloc[1]: Set the column names as the first row in the sheet.

@@ Line 1: / Line 1: @@
 [[Category:Python]]
-Python XLS parser using standard modules.
+Python XLS parser using standard modules. xlsx files are basically just a set of compressed (zipped) .[[XML]] pages.
-Found on [https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python stackoverflow]. It is announced as very very basic but it works very well.
+A simple workbook with 3 sheets having the same content like this:
+[[File:Sample_xls.png]]
+Consists of the following .xml files:
+{| class="wikitable"
+!Filename
+!Remarks
+|-
+| style="width:28ch |./[Content_Types].xml
+|
+|-
+| ./docProps/core.xml
+|
+|-
+| ./docProps/app.xml
+|
+|-
+| ./xl/workbook.xml
+| Holds per sheet a node like '<sheet name="Test2" sheetId="2" state="visible" r:id="rId3"/>'
+|-
+| ./xl/sharedStrings.xml
+| Strings that appear in more sheets, references from the sheet by indexnumber.
+|-
+| ./xl/styles.xml
+|
+|-
+| ./xl/worksheets/sheet2.xml
+| Content of sheet2
+|-
+| ./xl/worksheets/sheet1.xml
+| Content of sheet1
+|-
+| ./xl/worksheets/sheet3.xml
+| Content of sheet3
+|-
+| ./xl/_rels/workbook.xml.rels
+|
+|-
+| ./_rels/.rels
+|
+|}
+Found this on [https://stackoverflow.com/ stackoverflow] where it has been removed [https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python]. It was announced as very very basic but it works very well and it only uses modules installed by default (at least in [https://www.ubuntu.com/ Ubuntu] 18.04). You can find the original solution at the end of this page.
+The routine returns a list of dicts like { <columname> : <cellvalue> [, .. ] }
+Our solution supporting workbooks with and without sharedstrings and fetches sheet content by the sheet name. Using 're' to find the column, this may be slower than the original approach but we have not found that in the test-files used.
-This routine returns a list of dicts like { <columname> : <cellvalue> [, .. ] }
 <syntaxhighlight lang=python>
-def xlsx(fname):
+def xlsx(fname,sheet):
      import zipfile
      from xml.etree.ElementTree import iterparse
+    import re
      z = zipfile.ZipFile(fname)
-     strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
+     if 'xl/sharedStrings.xml' in z.namelist():
+        # Get shared strings
+        strings = [element.text for event, element
+                       in iterparse(z.open('xl/sharedStrings.xml'))
+                       if element.tag.endswith('}t')]
+    # Get the sheets available
+    sheets = { element.attrib['name']:element.attrib['sheetId'] for event,element in iterparse(z.open('xl/workbook.xml'))
+                                          if element.tag.endswith('}sheet') }
+    # Just to see what is in, comment out for real use
+    for s in sheets:
+        print('SheetID: '+sheets[s]+' Sheetname: '+s)
      rows = []
      row = {}
      value = ''
-     for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
-        if el.tag.endswith('}v'):                                 # <v>84</v>
+     if sheet in sheets:
-            value = el.text
+        sheetfile = 'xl/worksheets/sheet'+sheets[sheet]+'.xml'
-        if el.tag.endswith('}c'):                                 # <c r="A3" t="s"><v>84</v></c>
+        #print(sheet,sheetfile)
-            if el.attrib.get('t') == 's':
+        for event, element in iterparse(z.open(sheetfile)):
-                value = strings[int(value)]
+            # get value or index to shared strings
-            letter = el.attrib['r']                               # AZ22
+            if element.tag.endswith('}v') or element.tag.endswith('}t'):
-            while letter[-1].isdigit():
+                value = element.text
-                 letter = letter[:-1]
+            # If value is a shared string, use value as an index
-            row[letter] = value
+            if element.tag.endswith('}c'):
-            value = ''
+                if element.attrib.get('t') == 's':
-        if el.tag.endswith('}row'):
+                    value = strings[int(value)]
-            rows.append(row)
+                # split the row/col information so that the row leter(s) can be separate
-            row = {}
+                letter = re.sub('\d','',element.attrib['r'])
+                 row[letter] = value
+                value = ''
+            if element.tag.endswith('}row'):
+                rows.append(row)
+                row = {}
      return rows
 </syntaxhighlight>
+==Pandas==
+There is a [[Pandas#Reading_Data|pandas function to read excel files ]] you can use to avoid all this.
+Using the routine above you can try to convert it to a pandas dataframe too:
+;df = pd.DataFrame(xlsx(xlsfile,sheetname))
+:Put the sheet into a pandas dataframe
+;df.columns = df.iloc[1]
+:Set the column names as the first row in the sheet.

Difference between revisions of "XLS"

Latest revision as of 15:50, 29 June 2020

Pandas

Navigation menu

Search