I recently completed a freelance task that involved mining articles from the Wall Street Journal. The goal was to scrape articles related to "Verizon Communications Inc." published between July 2021 and March 2023. I thought it would be worth sharing the Python script I developed for this task, since it can serve as a base for similar projects. Let's dive in!
Understanding the Script
The script is designed to perform the following steps:
1. Request the articles' IDs - This part of the script fetches the IDs of all articles matching the specified query.
2. Request the articles' details - With the IDs in hand, the script fetches detailed data for each corresponding article.
3. Compile the articles' details - All the fetched article details are collected into a Python list.
4. Data cleaning - The script then cleans the data, keeps only the necessary fields, and saves the result into a .csv file (a minimal end-to-end sketch follows this list).
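Put together, the flow is quite simple. The two-line sketch below assumes the functions defined in the walkthrough that follows; it is only meant to show how the pieces connect.

articles = main()      # steps 1-3: fetch the IDs, fetch the details, compile them into a list
cleaning(articles)     # step 4: keep the necessary fields and write data.csv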
Alright, let's walk through the specific code snippets and understand them better.
Code Walkthrough
Setup
First, we import the necessary Python libraries - requests for making HTTP requests, and json and pandas for handling and storing the data.
import requests, json
import pandas as pd
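The snippets below also reference a headers dict and a payload variable that are elided for brevity. A minimal stand-in could look like the following; the header names and User-Agent string are illustrative assumptions on my part, not values required by WSJ.

# Illustrative request setup; these values are assumptions, not WSJ requirements.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
}
payload = ""  # the GET requests below send no body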
Fetch Articles IDs
The function gettingArticlesIds(page) sends a GET request to the WSJ search URL and fetches the IDs of all articles matching the specified query (Verizon Communications Inc.) on the given page number.
def gettingArticlesIds(page):
    url = "https://www.wsj.com/search"
    query = 'Verizon Communications Inc.'
    # ... Rest of code to setup the query string and headers ...
    response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
    if response.status_code != 200:
        raise Exception('Something went wrong with the page request!')
    return response
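The query-string setup is elided above. For illustration only, it might be built along these lines; the parameter names are assumptions rather than documented WSJ search fields, while the query and date range come from the task itself.

# Hypothetical query string for the search request; the parameter names are guesses.
querystring = {
    "query": query,        # 'Verizon Communications Inc.'
    "page": page,
    "startDate": "2021-07-01",
    "endDate": "2023-03-31",
}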
Fetch Article Details
The function gettingArticleDetails(id, type) uses the IDs fetched in the previous step to send a GET request to the WSJ search URL for a specific article's details.
def gettingArticleDetails(id, type):
    url = "https://www.wsj.com/search"
    querystring = {"id": id, "type": type}
    # ... Rest of code to setup headers ...
    response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
    if response.status_code != 200:
        raise Exception('Something went wrong with the request!', id)
    return response
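As a quick usage example, a single article's details can be fetched and parsed like this (the id and type values here are hypothetical placeholders):

# Hypothetical example call; main() below does the same thing in a loop.
detail = gettingArticleDetails("SB1234567890", "article").json()["data"]
print(detail.get("headline"), detail.get("url"))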
Compile Article Details
In the main() function, these two functions are called to first fetch the article IDs and then the details for each ID. The article details are compiled into a Python list.
def main():
    # ... Code to fetch article IDs ...
    allIdsData = list()
    for id in ids:
        try:
            data = gettingArticleDetails(id['id'], id['type']).json()
        except Exception:
            continue  # skip articles whose details could not be fetched
        allIdsData.append(data['data'])
    return allIdsData
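The ID-fetching part is elided in main(). One way it could work is to page through gettingArticlesIds until a page comes back empty; note that the key holding the list of IDs in the search response is an assumption here, not something taken from the WSJ payload.

# Hypothetical pagination sketch for the elided ID-fetching step.
ids = list()
page = 0
while True:
    batch = gettingArticlesIds(page).json().get("collection", [])  # "collection" is a guess
    if not batch:
        break
    ids += batch
    page += 1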
Data Cleaning
Finally, the cleaning(data) function takes the Python list of article details, keeps only the fields we need, and saves the result into a .csv file.
def cleaning(data):
    allData = list()
    for row in data:
        cleanData = {
            'articleId': row.get('articleId'),
            'articleSection': row.get('articleSection'),
            'byline': row.get('byline'),
            'url': row.get('url'),
            'contentType': row.get('contentType'),
            'headline': row.get('headline'),
            'printedition_wsj_headline': row.get('printedition_wsj_headline'),
            'printedition_wsj_pubdate': row.get('printedition_wsj_pubdate'),
            'summary': row.get('summary'),
        }
        allData.append(cleanData)
    pd.DataFrame(allData).to_csv('data.csv', index=False)
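Once the script has run, a quick sanity check of the output file confirms that the expected fields made it into the CSV:

# Inspect the generated file; assumes cleaning() has already written data.csv.
df = pd.read_csv('data.csv')
print(df.shape)
print(df[['headline', 'byline', 'printedition_wsj_pubdate']].head())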
So there you have it! This script can be easily adjusted for any query or website with a similar structure.
Happy Scraping!