Web Scraping Yahoo News

Saif Gazali
4 min read · Jul 30, 2021


Photo by ScrapingRobot

Web scraping is the process of collecting data from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping! However, the words "web scraping" usually refer to a process that uses bots to extract content and data from a website. Screen scraping only copies the pixels displayed on the screen, whereas web scraping extracts the underlying HTML code and, with it, the data stored in a database. In this article, we will scrape the Yahoo News website to collect articles on the topic of Brexit.

The URL template for a Yahoo News search is https://news.search.yahoo.com/search?p=query, where query is the keyword to search news articles for. We start by importing the required modules and creating the URL template.

import re
import csv
from time import sleep
from bs4 import BeautifulSoup
import requests
template = 'https://news.search.yahoo.com/search?p={}'

The format() method formats the specified value(s) and inserts them into the string's placeholder, which is defined using curly brackets: {}.

url = template.format('brexit')

Next, we define request headers that imitate a real web browser, so our requests are less likely to be rejected.

headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://www.google.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36 Edg/85.0.564.44'
}

Using the requests module, we fetch the URL and get back a response object called resp. We can get all the information we need from this object, and it is what we pass as input to BeautifulSoup.

resp = requests.get(url,headers=headers)
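
Before handing the response to a parser, it is worth confirming that the request actually succeeded. A minimal sanity check on the response object (this check is our addition, not part of the original walkthrough):

print(resp.status_code)              # 200 means the request succeeded
print(resp.headers['content-type'])  # e.g. 'text/html; charset=UTF-8'
resp.raise_for_status()              # raises requests.HTTPError on a 4xx/5xx response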

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with most parsers to provide idiomatic ways of navigating, searching, and modifying the parse tree.

soup = BeautifulSoup(resp.text,'lxml')

To get all the <div> tags with the class NewsArticle (each one wraps a single search result), rather than just the first match, we use one of Beautiful Soup's tree-searching methods, find_all():

cards = soup.find_all('div','NewsArticle')
print(cards)
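
find_all() returns a list of Tag objects. The field-extraction snippets that follow operate on a single result, so we take the first card from the list (this line is implied by the original walkthrough but worth making explicit):

card = cards[0]  # first search result on the page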

Getting the headline of the news article using the find() method.

headline = card.find('h4','s-title').text
headline  # 'Brussels Backs Down: Beyond Brexit'

Getting the name of the agency that wrote and published the article.

source = card.find('span','s-source').text
source  # 'Bloomberg'

Getting the time when the article was published.

posted = card.find('span','s-time').text.replace('.','').strip()
posted  # '· 8 hours ago'

Getting the description of the article.

description = card.find('p','s-desc').text.strip()
print(description)
print(len(description))
Description of the first article

Getting the raw link of the article from Yahoo News; it needs further processing before it can be used directly in a web browser.

raw_link = card.find('a').get('href')
raw_link
#'https://r.search.yahoo.com/_ylt=AwrC1DGXVJFg6GEAzRvQtDMD;_ylu=Y29sbwNiZjEEcG9zAzEEdnRpZAMEc2VjA3Ny/RV=2/RE=1620165912/RO=10/RU=https%3a%2f%2ffinance.yahoo.com%2fnews%2fscottish-hangs-over-u-k-040000427.html/RK=2/RS=1EqQdPNBWiUlyXHtYKQ4HfaPuBI-

The requests.utils module provides utility functions that are used within requests itself. We use its unquote method to decode the percent-encoded characters in the raw link.

unquoted_link = requests.utils.unquote(raw_link)
unquoted_link
#'https://r.search.yahoo.com/_ylt=AwrC1DGXVJFg6GEAzRvQtDMD;_ylu=Y29sbwNiZjEEcG9zAzEEdnRpZAMEc2VjA3Ny/RV=2/RE=1620165912/RO=10/RU=https://finance.yahoo.com/news/scottish-hangs-over-u-k-040000427.html/RK=2/RS=1EqQdPNBWiUlyXHtYKQ4HfaPuBI-

Using a regular expression to extract the proper URL, which can be opened directly in a web browser. The real destination sits between the RU= and /RK markers of the Yahoo redirect link.

pattern = re.compile(r'RU=(.+)\/RK')
clear_link = re.search(pattern, unquoted_link).group(1)
clear_link
#'https://finance.yahoo.com/news/scottish-hangs-over-u-k-040000427.html'

Putting all of the above into a single function, get_article, that returns every field we need for one card.

def get_article(card):
    headline = card.find('h4', 's-title').text
    source = card.find('span', 's-source').text
    posted = card.find('span', 's-time').text.replace('.', '').strip()
    description = card.find('p', 's-desc').text.strip()
    raw_link = card.find('a').get('href')
    unquoted_link = requests.utils.unquote(raw_link)
    pattern = re.compile(r'RU=(.+)\/RK')
    clear_link = re.search(pattern, unquoted_link).group(1)

    article = (headline, source, posted, description, clear_link)
    return article
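
As a quick sanity check, we can run the function on the first card extracted earlier; the values in the comment are illustrative, taken from the per-field examples above:

print(get_article(cards[0]))
# ('Brussels Backs Down: Beyond Brexit', 'Bloomberg', '· 8 hours ago', ..., 'https://finance.yahoo.com/news/scottish-hangs-over-u-k-040000427.html')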

Getting the headline, source, posting time, description, and cleaned link for every article on the first page of Yahoo News results. A set of links is used to skip duplicate articles.

articles = []
links = set()
for card in cards:
    article = get_article(card)
    link = article[-1]
    if link not in links:
        links.add(link)
        articles.append(article)

Printing the first article, which contains all of its details.

print(articles[0])
First article card

Now that we are done processing the articles on the first page, we will scrape all the remaining pages. The link to the next page of results sits in an <a> tag with the class next.

url = soup.find('a','next').get('href')
url
# https://news.search.yahoo.com/search;_ylt=AwrC1CjuCgNh_1kAg6DQtDMD;_ylu=Y29sbwNiZjEEcG9zAzEEdnRpZAMEc2VjA3BhZ2luYXRpb24-?p=brexit&b=11&pz=10&bct=0&xargs=0

Hence we continue parsing until the last page, with a sleep timer that pauses between requests so that the Yahoo News server does not flag us as a bot. We also save the articles to a CSV file using the csv.writer method, which writes all the scraped articles. The CSV file has five columns: Headline, Source, Posted, Description, and Link.

def get_the_news(search):
    # Run the main program
    template = 'https://news.search.yahoo.com/search?p={}'
    url = template.format(search)

    articles = []
    links = set()

    while True:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        cards = soup.find_all('div', 'NewsArticle')

        # Extract articles from the current page
        for card in cards:
            article = get_article(card)
            link = article[-1]
            if link not in links:
                links.add(link)
                articles.append(article)

        # Find the next page; stop when there is no 'next' link
        try:
            url = soup.find('a', 'next').get('href')
            sleep(2)
        except AttributeError:
            break

    # Save the article data
    with open('newsArticle.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Headline', 'Source', 'Posted', 'Description', 'Link'])
        writer.writerows(articles)

    return articles

Querying the Yahoo News website to get all the articles on the topic of Brexit.

articles = get_the_news('brexit')

Checking the number of articles collected.

print(len(articles)) # 335

The CSV file has been generated and we can check it out!
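
To verify the output, we can read a few rows back with the same csv module. This is a minimal sketch; it assumes the file name newsArticle.csv used inside get_the_news:

with open('newsArticle.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)  # ['Headline', 'Source', 'Posted', 'Description', 'Link']
    for i, row in enumerate(reader):
        print(row[0], '|', row[1])  # headline | source
        if i == 2:  # show only the first three rows
            break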

