Citizens Advice Data Science team month-notes #4 (January)
We are blogging about the work we do. We aim to write a post once a month with contributions from various members of the team. Here’s our previous post.
As a team we believe in continuous improvement, building capability, and focusing on meeting data users’ needs. The opportunities we have for improving the way we work with data at Citizens Advice aren’t unique to our organisation. It’d be great if the practical approaches we take are useful to others.
This month is from Jon, looking at how we can save time on manual tasks.
No API, no problem — pulling data and working with legacy systems
Jon (Senior Data Analyst)
At any organization with significant history, you’ll often need to work with legacy systems that lack things we usually take for granted — for example, REST endpoints. Sometimes those systems are managed by third parties, making change impossible — so people end up having to manage the data manually through clunky web interfaces.
There’s a couple of problems when you’re entirely reliant on legacy websites for your data operations though:
- If there’s no export feature on the website, it’s difficult to download tabular data into a format that’s useful for analysis — especially if the website insists on presenting tabular data across many pages. In our case, we had a legacy system that contained 14,000 volunteer entries spread out across 200 pages.
- Some legacy websites run hard limits on the number of changes you can make at once. If you want to add or change hundreds of thousands of entries, but the system limits you to making 100 changes at a time, it’ll take a lot of manual effort to make these changes. Making changes manually is also prone to error.
Using Selenium to extract tabular data from a website
In this case, our problem is that our data is trapped within a bunch of HTML tables spread out across several different pages.
- Using Selenium, go to the legacy system’s website. A quick note here: You’ll probably need to provide some form of authentication to access the website. While several clever ways exist for automating authentication in Selenium, for tasks that don’t get repeated very often, I’ve found the simplest way to do this is just to sleep the selenium script until you’ve logged in manually on the Selenium-driven web browser.
- Have your Selenium script loop through each page, using Selenium’s page_source (https://selenium-python.readthedocs.io/api.html) to write the HTML contents of that page to disk.
- You should now have each of the website’s HTML stored as HTML files so all you need to do now is extract it. You could use Beautiful Soup to do this — and if the data is formatted in a particularly complex way, a scraping package like Beautiful Soup is probably best. However, for most simple HTML tables, Python’s pandas has excellent out-of-the-box HTML scraping features with pd.read_html, making it easy to dump HTML files straight to DataFrame with only a few lines of code.
- After dumping all the HTML files to the same DataFrame, you’ve successfully extracted the data you wanted from your finicky legacy website! You can now use your analytical tools of choice on the data 😁
Using selenium to automate bulk manual tasks
In this case, our issue is that we need to perform the same manual task thousands of times on a legacy website. Instead of doing this by hand, let’s use Selenium to automate these tasks.
In our case, the legacy system we were dealing with allowed us to commit bulk changes by uploading specially formatted Excel files, but only 100 changes at a time! While it was easy for our colleagues to programmatically generate thousands of Excel files, each containing 100 changes at a time, they were still stuck with the bottleneck of uploading each of these files to the server manually. Having to upload dozens of files for even a small change was frustratingly slow, and the larger changes would often take small teams of people days to perform — which meant the legacy system would cause delays in scheduling for large changes we would have rather performed immediately.
Here’s how we automated this process with Selenium:
- Using Selenium, we went to the legacy system’s upload page on the website.
- Using glob to loop through a folder containing our change logs, we then uploaded each Excel file one at a time.
- We made sure to sleep the script for a few seconds after each upload, to give the upload adequate time to finish.
- For most legacy websites, sending a filepath for upload is a simple case of passing a string via send_keys to the “Browse for file…” or “Import file” button. After you’ve done that, it’s a simple matter of getting Selenium to click the upload button.
- We also took a HTML snapshot after each upload using page_source, and used pd.read_html to scrape each HTML snapshot and create a log of whether each upload succeeded or failed, along with any error messages that were printed.
- If necessary, refresh the upload page, and rinse and repeat for every Excel file in the folder until all tasks have been executed.
In our case, while our script still took a few seconds to perform a single task, the important thing is that our colleagues didn’t have to spend days doing boring upload tasks anymore! For example, in a recent small-sized change, there were 8000 entries that needed to be adjusted. Doing this manually would require uploading 800 files by hand — not impossible, but still arduous — a solid hour if you uploaded a file every 4 seconds, and never got tired or distracted. With our new scripts, we simply dumped the change files in a folder, ran the script, and left it to run for an hour. And when larger changes get scheduled for the future (some of our changes require changing hundreds of thousands of entries at a time), now nobody will get stuck performing boring tasks for dozens of hours on end!
Out of necessity, we’ve skipped over a lot of the details of using Selenium — since implementation details will vary vastly depending on the legacy tasks you’re trying to automate, your preferred scripting language, and other factors.
That said, if you’re another charity or public sector entity that’s interested in automating tasks in a similar way, feel free to get in touch with us, and we’ll see what we can share with you that might be helpful.