A Freelancer's Guide to Web Scraping WSJ Articles

I recently completed a freelance task that involved mining articles from the Wall Street Journal: scraping articles related to "Verizon Communications Inc." that were published between July 2021 and March 2023. I'm sharing the Python script I developed for the job, since it can serve as a base for similar tasks. Let's dive in!

Understanding the Script

The script is designed to perform the following steps:

  1. Request the article IDs - This part of the script fetches the IDs of all articles related to the specified query.

  2. Request the article details - Once we have the IDs, the script fetches detailed data for each corresponding article.

  3. Compile the article details - The fetched article details are compiled into a Python list.

  4. Clean the data - Finally, the script cleans the data, keeping only the necessary fields, and saves it to a .csv file.

Alright, let's walk through the specific code snippets and understand them better.

Code Walkthrough

Setup

First, we import the necessary Python libraries: requests for making HTTP requests, and json and pandas for handling and storing the data.

import requests, json
import pandas as pd

Fetch Article IDs

The function gettingArticlesIds(page) sends a GET request to the WSJ search URL and fetches the IDs of the articles that match the specified query ("Verizon Communications Inc.") for a given page of search results.

def gettingArticlesIds(page):
    url = "https://www.wsj.com/search"
    query = 'Verizon Communications Inc.'
    # ... Rest of code to setup the query string and headers ...
    response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
    if response.status_code != 200:
        raise Exception('Something went wrong with the page request!')
    return response
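
The query-string and header setup is elided above. The sketch below shows roughly what it might contain; the parameter names and header values are assumptions on my part, so verify them against the actual requests your browser sends on the WSJ search page.

# Hypothetical setup for the elided part of gettingArticlesIds().
# The real parameter names and headers should be copied from the network
# requests observed in the browser's developer tools.
querystring = {
    "query": query,   # the search term
    "page": page,     # page number of the search results
}
headers = {
    "User-Agent": "Mozilla/5.0",      # a browser-like User-Agent is usually required
    "Accept": "application/json",
}
payload = ""                          # the GET request carries no body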

Fetch Article Details

The function gettingArticleDetails(id, type) uses the IDs fetched in the previous step to send a GET request to the WSJ search URL for the details of a specific article.

def gettingArticleDetails(id, type):
    url = "https://www.wsj.com/search"
    querystring = {"id": id, "type": type}
    # ... Rest of code to setup headers ...
    response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
    if response.status_code != 200:
        raise Exception('Something went wrong with the request!', id)
    return response
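
As a quick usage sketch (the nested 'data' key mirrors how main() uses this function below; the literal ID and type values here are made-up placeholders):

record = {'id': 'PLACEHOLDER_ARTICLE_ID', 'type': 'article'}   # hypothetical record
details = gettingArticleDetails(record['id'], record['type']).json()
print(details['data']['headline'])   # the 'data' payload holds fields like headline, byline, url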

Compile Article Details

In the main() function, these functions are called: first to get the article IDs, and then to fetch their respective details. The article details are collected into a Python list.

def main():
    # ... Code to fetch article IDs ...
    allIdsData = list()
    for id in ids:
        try:
            # Fetch and parse the details for this article
            data = gettingArticleDetails(id['id'], id['type']).json()
            allIdsData.append(data['data'])
        except Exception:
            # Skip articles whose details could not be fetched
            continue
    return allIdsData
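
The code that fetches the article IDs is elided in main() above. Below is a rough sketch of how that part could look, assuming the search response is JSON containing a list of {id, type} records; the key name 'collection' and the empty-page stopping condition are assumptions, not WSJ's documented response format.

# Hypothetical sketch of the elided ID-fetching loop inside main().
# Inspect a real response from gettingArticlesIds() to find the correct JSON path.
ids = list()
page = 0
while True:
    results = gettingArticlesIds(page).json()
    batch = results.get('collection', [])
    if not batch:          # stop when a page returns no more results
        break
    ids.extend(batch)
    page += 1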

Data Cleaning

Finally, the cleaning(data) function takes the Python list of article details, cleans it, and then saves it into a .csv file.

def cleaning(data):
    allData = list()
    for row in data:
        # Keep only the fields we need for the final dataset
        cleanData = {
            'articleId': row.get('articleId'),
            'articleSection': row.get('articleSection'),
            'byline': row.get('byline'),
            'url': row.get('url'),
            'contentType': row.get('contentType'),
            'headline': row.get('headline'),
            'printedition_wsj_headline': row.get('printedition_wsj_headline'),
            'printedition_wsj_pubdate': row.get('printedition_wsj_pubdate'),
            'summary': row.get('summary'),
        }
        allData.append(cleanData)
    # Write the cleaned records to a CSV file
    pd.DataFrame(allData).to_csv('data.csv', index=False)
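
To tie everything together, a minimal entry point could look like the following (this assumes main() also performs the ID-fetching step sketched earlier):

if __name__ == '__main__':
    cleaning(main())   # fetch, compile, clean, and write data.csv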

So there you have it! This script can be easily adjusted for any query or website with a similar structure.

Happy Scraping!