Citizens Advice Data Science team month-notes #4 (January)

No API, no problem — pulling data and working with legacy systems

  1. If there’s no export feature on the website, it’s difficult to download tabular data into a format that’s useful for analysis — especially if the website insists on presenting tabular data across many pages. In our case, we had a legacy system that contained 14,000 volunteer entries spread out across 200 pages.
  2. Some legacy websites impose hard limits on the number of changes you can make at once. If you need to add or change hundreds of thousands of entries but the system only lets you make 100 changes at a time, applying them all by hand takes a lot of effort. Making changes manually is also prone to error.

Using Selenium to extract tabular data from a website

  1. Using Selenium, go to the legacy system’s website. A quick note here: you’ll probably need to provide some form of authentication to access the site. While several clever ways exist for automating authentication in Selenium, for tasks that don’t get repeated very often I’ve found the simplest approach is just to sleep the Selenium script until you’ve logged in manually in the Selenium-driven browser window.
  2. Have your Selenium script loop through each page, using Selenium’s page_source (https://selenium-python.readthedocs.io/api.html) to write the HTML contents of that page to disk.
  3. You should now have each of the website’s pages stored as HTML files, so all that’s left is to extract the data. You could use Beautiful Soup for this, and if the data is formatted in a particularly complex way, a dedicated scraping package like Beautiful Soup is probably best. For most simple HTML tables, though, Python’s pandas has excellent out-of-the-box HTML scraping with pd.read_html, making it easy to dump HTML files straight into a DataFrame with only a few lines of code.
  4. After reading all the HTML files into a single DataFrame, you’ve successfully extracted the data you wanted from your finicky legacy website! You can now use your analytical tools of choice on the data 😁

Using Selenium to automate bulk manual tasks

  1. Using Selenium, we went to the legacy system’s upload page on the website.
  2. Using glob to loop through a folder containing our change logs, we then uploaded each Excel file one at a time.
  3. We made sure to sleep the script for a few seconds after each upload, to give the upload adequate time to finish.
  4. For most legacy websites, supplying a file for upload is a simple case of passing the file path as a string via send_keys to the file input element behind the “Browse for file…” or “Import file” button. After you’ve done that, it’s a simple matter of getting Selenium to click the upload button.
  5. We also took an HTML snapshot after each upload using page_source, and used pd.read_html to scrape each snapshot into a log of whether each upload succeeded or failed, along with any error messages that were printed.
  6. If necessary, refresh the upload page, and rinse and repeat for every Excel file in the folder until all tasks have been executed.
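A rough sketch of that upload loop is below. The element locators (`import_file`, `upload-button`) are hypothetical names for illustration; the real legacy page will have its own, which you’d find by inspecting its HTML.

```python
import glob
import time
from io import StringIO

import pandas as pd


def parse_upload_status(html, source_file):
    """Step 5: scrape the first table from an upload-result page into a
    status log, tagged with the file that produced it."""
    table = pd.read_html(StringIO(html))[0]
    table["source_file"] = source_file
    return table


def upload_changes(driver, upload_url, folder):
    """Steps 1-6: upload every Excel file in `folder` one at a time and
    collect the scraped status tables. `driver` is an already-logged-in
    Selenium WebDriver; element names are assumptions about the page."""
    from selenium.webdriver.common.by import By

    logs = []
    for path in sorted(glob.glob(f"{folder}/*.xlsx")):
        driver.get(upload_url)  # step 6: refresh the upload page each time
        # Step 4: send the file path to the <input type="file"> element
        driver.find_element(By.NAME, "import_file").send_keys(path)
        driver.find_element(By.ID, "upload-button").click()
        time.sleep(5)  # step 3: give the upload adequate time to finish
        # Step 5: snapshot the result page and scrape the status table
        logs.append(parse_upload_status(driver.page_source, path))
    return pd.concat(logs, ignore_index=True)
```

A fixed `time.sleep` is the blunt option; if the result page exposes a reliable element, Selenium’s explicit waits (`WebDriverWait`) are a sturdier alternative.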


Dan Barrett
Head of Data Science at Citizens Advice. These are my personal thoughts on work.
