Extracting Information From A Table Except Header Of The Table Using Bs4
Solution 1:
For your first question:
import bs4
text = """
<table class="codes"><tr><td><b>Code</b></td><td><b>Display</b></td><td><b>Definition</b></td></tr><tr><td>active<a name="active"></a></td><td>Active</td><td>This account is active and may be used.</td></tr><tr><td>inactive<a name="inactive"></a></td><td>Inactive</td><td>This account is inactive
and should not be used to track financial information.</td></tr></table>"""
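# name the parser explicitly so bs4 does not have to pick one
table = bs4.BeautifulSoup(text, "html.parser")
# skip the first row, which holds the header cells
tr_all = table.findAll("tr")[1:]
tds_all = []
for tr in tr_all:
    tds_all.append([td.get_text() for td in tr.findAll("td")])
# if you prefer a double list comprehension instead...
# note: range(len(tds_all)) counts the rows (2), so only the first two cells of each row are kept
table_info = [data[i].encode('utf-8') for data in tds_all
              for i in range(len(tds_all))]
print(table_info)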
yields
['active ', 'Active', 'inactive ', 'Inactive']
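The output above has no Definition entries because the index runs over range(len(tds_all)), which is the number of rows, not the number of cells in a row. A minimal sketch of the difference, using a plain list with shortened stand-in text instead of the parsed table (the names truncated and complete are only illustrative):

```python
# Stand-in for tds_all: two rows with three cells each (texts shortened)
tds_all = [
    ["active", "Active", "This account is active."],
    ["inactive", "Inactive", "This account is inactive."],
]

# range(len(tds_all)) is range(2), so only cells 0 and 1 of each row are read
truncated = [data[i] for data in tds_all for i in range(len(tds_all))]

# Iterating over the cells of each row directly keeps all of them
complete = [cell for data in tds_all for cell in data]

print(truncated)
print(complete)
```

Looping over the cells themselves, rather than over an index range, avoids the mismatch entirely.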
Regarding your second question:
tr_header=table.findAll("tr")[0] i do not get a list
True: [0] is an indexing operation, which selects the first element of the list, so you get a single element. [1:] is a slicing operation, which returns a list (take a look at a tutorial on slicing if you need more information).
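The difference between indexing and slicing can be seen on an ordinary list. This is only a sketch: the list rows below is a stand-in for the result of table.findAll("tr"), not real parsed data.

```python
# Stand-in for the list of rows returned by findAll("tr")
rows = ["header", "row 1", "row 2"]

first = rows[0]   # indexing: one element, not a list
rest = rows[1:]   # slicing: a new list with everything after the first element

print(type(first).__name__, first)
print(type(rest).__name__, rest)
```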
Actually, you get a list twice, once for each call of table.findAll("tr"): one for the header and one for the remaining rows. This is redundant. If you want to separate the header cells from the rest, you probably want something like this:
tr_all = table.findAll("tr")
header = tr_all[0]
tr_rest = tr_all[1:]
tds_rest = []
# findAll("td") picks only the cells, not any text nodes between them
header_data = [td.get_text().encode('utf-8') for td in header.findAll("td")]
for tr in tr_rest:
    tds_rest.append([td.get_text() for td in tr.findAll("td")])
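For readers on Python 3, here is a self-contained sketch of the same header/rest split. It assumes a shortened two-column version of the table, uses find_all (the current spelling of findAll) and the built-in "html.parser", and drops the encode step because Python 3 strings are already Unicode. The names header_row, body_rows and body_data are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question
html = (
    '<table class="codes">'
    "<tr><td><b>Code</b></td><td><b>Display</b></td></tr>"
    "<tr><td>active</td><td>Active</td></tr>"
    "<tr><td>inactive</td><td>Inactive</td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")            # query the rows once
header_row, body_rows = rows[0], rows[1:]

# strip=True removes leading and trailing whitespace from each cell
header_data = [td.get_text(strip=True) for td in header_row.find_all("td")]
body_data = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in body_rows
]

print(header_data)
print(body_data)
```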
And regarding your third question:
Is it possible to edit this code to add table information from the first row to the end of the table?
Given your desired output in the comments below:
rows_all = table.findAll("tr")
header = rows_all[0]
rows = rows_all[1:]
data = []
for row in rows:
    for td in row:
        try:
            data.append(td.get_text())
        except AttributeError:
            # skip children that do not provide get_text()
            continue
print(data)
# or more or less the same as above, as a one-liner
data = [td.get_text() for row in rows for td in row.findAll("td")]
yields
[u'active', u'Active', u'This account is active and may be used.', u'inactive', u'Inactive', u'This account is inactive and should not be used to track financial information.']
Solution 2:
JustMe answered this question correctly. Another equivalent variant would be:
import bs4
text = """
<table class="codes"><tr><td><b>Code</b></td><td><b>Display</b></td><td><b>Definition</b></td></tr><tr><td>active<a name="active"></a></td><td>Active</td><td>This account is active and may be used.</td></tr><tr><td>inactive<a name="inactive"></a></td><td>Inactive</td><td>This account is inactive
and should not be used to track financial information.</td></tr></table>"""
# name the parser explicitly so bs4 does not have to pick one
table = bs4.BeautifulSoup(text, "html.parser")
tr_all = table.findAll("tr")[1:]
# critical line:
tds_all = [td.get_text() for each_tr in tr_all for td in each_tr.findAll("td")]
# and after that unchanged:
table_info = [data.encode('utf-8') for data in tds_all]
# for control:
print(table_info)
The unusual construction in the critical line flattens the list of lists that would otherwise end up in tds_all. In general, lambda z: [x for y in z for x in y] flattens a list of lists z. I replaced x, y and z with names that fit this specific situation.
I arrived at it because my intermediate version of the critical line was tds_all = [[td.get_text() for td in each_tr.findAll("td")] for each_tr in tr_all], which generates a list of lists for tds_all: [[u'active ', u'Active', u'This account is active and may be used.'], [u'inactive ', u'Inactive', u'This account is inactive\n and should not be used to track financial information.']] To flatten this, one needs the [x for y in z for x in y] pattern. But then I thought, why not apply this structure directly in the critical line and flatten the result right there?
Here z is the list of bs4 objects (tr_all). The first 'for ... in ...' clause takes each_tr (a bs4 object) from the list tr_all. The second 'for ... in ...' clause calls each_tr.findAll("td") to get all 'td' matches in that row and iterates over them one by one. The expression at the very beginning of the list comprehension, td.get_text(), states what is collected in the final list: the text of each 'td' object. The resulting list is assigned to tds_all.
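The flattening pattern can be tried without bs4. The sketch below uses a small hand-written list of lists as a stand-in for the per-row cell texts. The name flatten is introduced here only for illustration.

```python
# Stand-in for the per-row cell texts (a list of lists)
z = [["active", "Active"], ["inactive", "Inactive"]]

# The generic pattern: outer loop first, inner loop second
flatten = lambda z: [x for y in z for x in y]
flat = flatten(z)

# The same thing written as explicit nested loops
flat_loops = []
for y in z:          # y is one row
    for x in y:      # x is one cell of that row
        flat_loops.append(x)

print(flat)
```

The for clauses in the comprehension appear in the same order as the nested loops, which is why the row loop comes before the cell loop.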
This code produces the following list:
['active ', 'Active', 'This account is active and may be used.', 'inactive ', 'Inactive', 'This account is inactive\n and should not be used to track financial information.']
The two longer elements were missing from JustMe's first example. I think, Mary, you want them included, don't you?