Extract A Specific Header From Html Using Beautiful Soup

This is the patent example I am using https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry . Below is the code I used. I want the code to display only the cited

Solution 1:

The number of citations shown on the page is generated dynamically via JavaScript, so it is not present in the HTML that requests downloads. However, you can count the tr elements with itemprop="forwardReferencesFamily" to get the same number. For example:

import requests
from bs4 import BeautifulSoup


url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print(len(soup.select('tr[itemprop="forwardReferencesFamily"]')))

Prints:

4

Solution 2:

For this link, https://patents.google.com/patent/WO2012061469A3/en?oq=medicinal+chemistry , I want the code to print the patent citations, giving the publication number and title of each. I then want to use pandas to put the publication numbers in one column and the titles in another. So far I have used Beautiful Soup to parse the HTML file. I have selected the backwardReferences rows, and under those I want to print the publication number and title of each citation. I am showing a single example here, but I have a folder full of HTML files that I will process later.

x = soup.select('tr[itemprop="backwardReferences"]')  # the patent citation rows (not used below)
y = soup.select('td[itemprop="title"]')  # searches the whole document, so this returns every title, not just those under the patent citations
print(y)
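
One possible way to restrict the titles to the patent citations is to search inside each backwardReferences row rather than the whole document. The sketch below is only an illustration: the sample HTML is hypothetical, and it assumes each row holds an element with itemprop="publicationNumber" and one with itemprop="title", which should be checked against the real page. The names extract_citations and SAMPLE_HTML are made up for this example.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample that imitates the assumed row layout; the real
# Google Patents markup may differ and should be inspected first.
SAMPLE_HTML = """
<table>
  <tr itemprop="backwardReferences" itemscope>
    <td><span itemprop="publicationNumber">US1234567A</span></td>
    <td itemprop="title">First cited patent</td>
  </tr>
  <tr itemprop="backwardReferences" itemscope>
    <td><span itemprop="publicationNumber">EP7654321B1</span></td>
    <td itemprop="title"> Second cited patent </td>
  </tr>
  <tr itemprop="forwardReferences" itemscope>
    <td><span itemprop="publicationNumber">WO0000001A1</span></td>
    <td itemprop="title">A citing patent, not a citation</td>
  </tr>
</table>
"""


def extract_citations(html):
    """Return one dict per backwardReferences row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select('tr[itemprop="backwardReferences"]'):
        # Search within the row only, so titles elsewhere are ignored.
        number = row.select_one('[itemprop="publicationNumber"]')
        title = row.select_one('[itemprop="title"]')
        records.append({
            "publication_number": number.get_text(strip=True) if number else None,
            "title": title.get_text(strip=True) if title else None,
        })
    return records


citations = extract_citations(SAMPLE_HTML)
# One column for the publication numbers, one for the titles.
df = pd.DataFrame(citations, columns=["publication_number", "title"])
print(df)
```

For a folder of saved HTML files, the same function could be called on the contents of each file and the resulting frames combined with pd.concat.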
