Data Scraper and the XPath to Sourcing Success

One of my pet peeves around recruiting tool discussions is how shiny new objects get the bulk of the attention. This, of course, is not a problem exclusive to recruiting. Websites like Product Hunt and BetaList help fill our feeds with bright, shiny new tools that can distract us from what we NEED. Spoiler Alert! Not all that glitters is 24-karat. The alternative is to let the test of time prevail and potentially miss out on a great new product. My goal is to offer a more in-depth review of one tool, with at least one real-world example of how it can save you time.

Data Scraper/Data Miner (both names appear on menus and web pages) is similar to a few Chrome Extensions in look and feel. You can right-click on a webpage element, and it will try to “Get Similar” data fields and create an orderly file for export.

The uncommon features that make this free extension part of my regular arsenal require some understanding of scraping, but are well worth exploring. First off, you can Save Recipes with site-specific XPath or jQuery logic. This means that once you have mapped a website perfectly, you will never have to do the hard work again. Saved recipes sync to your private cloud account, or can be shared as a Public Recipe. You can clone public recipes or customize them for private use. This is great for beginners to learn web scraping since you can see the syntax used on popular sites as a way to advance your skills.

Here is a screenshot of the pop-up when on Twitter. My “Private Recipes” appear above, while community-submitted ones are below. These can be rated with a thumbs up or down; some have an “Example” link that shows what the expected source page looks like. You can click any one of these formulas to begin extraction. Links to great help videos and your personal data collections are found at the bottom.

But wait, there’s more! It handles multi-page (pagination) scraping with configurable delays and somewhat intelligently skips empty fields, as seen in gray. This works surprisingly well. Since all the work is performed in your browser, this type of scraping is almost undetectable when done at a proper pace.
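To make the pagination idea concrete, here is a minimal Python sketch of the same pattern: walk the pages in order, pause between page loads, and drop fields that come back empty. It is not how the extension is implemented. The in-memory `PAGES` data, the `scrape_all` function, and the `delay_seconds` parameter are all hypothetical names invented for illustration.

```python
import time

# Hypothetical in-memory "site": each page holds rows plus the key of the next page.
PAGES = {
    "page1": {
        "rows": [
            {"name": "Ada Example", "title": "Engineer"},
            {"name": "Bo Sample", "title": ""},  # empty field on purpose
        ],
        "next": "page2",
    },
    "page2": {
        "rows": [{"name": "Cy Placeholder", "title": "Director"}],
        "next": None,
    },
}


def scrape_all(start, delay_seconds=2.0, sleep=time.sleep):
    """Walk the pages in order, pausing between page loads."""
    results = []
    page_key = start
    while page_key is not None:
        page = PAGES[page_key]
        for row in page["rows"]:
            # Keep only fields that actually contain text.
            cleaned = {k: v.strip() for k, v in row.items() if v.strip()}
            if cleaned:
                results.append(cleaned)
        page_key = page["next"]
        if page_key is not None:
            sleep(delay_seconds)  # wait before loading the next page
    return results
```

Calling `scrape_all("page1", delay_seconds=0)` returns the three rows without waiting; a real run would keep a delay of a few seconds so the requests look like ordinary browsing.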

Now that we have our data, we can export to CSV and convert to standard Excel for cleanup. The new option is to use the Collections feature to perform even more tricks. Collections act as a mini database with all the data you have collected. In this case, I pulled the 90 names and links for speakers from bluetoothworldevent.com and I can use the search box (far left) to search within the text of this private collection.
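If you would rather handle the export step yourself, the two operations described here (write the rows to CSV, then search within them) can be sketched in a few lines of Python. The rows below are placeholders, not the actual speaker data, and `to_csv` and `search` are hypothetical helper names rather than anything the extension provides.

```python
import csv
import io

# Hypothetical rows standing in for a scraped collection.
collection = [
    {"name": "Ada Example", "link": "https://example.com/speakers/ada"},
    {"name": "Bo Sample", "link": "https://example.com/speakers/bo"},
]


def to_csv(rows):
    """Serialize rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def search(rows, term):
    """Return rows where any field contains the term, ignoring case."""
    needle = term.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]
```

The text returned by `to_csv(collection)` can be saved to a `.csv` file and opened in Excel, and `search(collection, "ada")` mimics typing a term into the search box.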

I was lazy though, as I did not grab their company and title, so now is my chance to fix it. In Chrome, you can right-click on most items on a website and select “inspect element” to view where that item appears in the source code. Right-click again on the element to “Copy XPath” (think of it as a shorthand address for where the element sits in the page structure). In the larger screenshot, you can see the copied XPaths in Notepad and how the element changes color as you hover over the source code.
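For readers who have not seen an XPath before, the short Python sketch below shows what a copied path actually points at. The markup is a hypothetical, heavily simplified speaker list (the real page is far messier), and the two XPath strings only illustrate the kind of thing “Copy XPath” can produce. The standard library's ElementTree supports only a subset of XPath and expects paths relative to the root element, so the hypothetical `resolve` helper swaps the leading `/html` for `.` before looking anything up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified speaker list; not the real page source.
PAGE = """
<html><body>
  <ul id="speakers">
    <li><a href="/speakers/ada">Ada Example</a></li>
    <li><a href="/speakers/bo">Bo Sample</a></li>
    <li><a href="/speakers/cy">Cy Placeholder</a></li>
  </ul>
</body></html>
"""

# Illustrative copied XPaths for the first two names.
copied = [
    "/html/body/ul/li[1]/a",
    "/html/body/ul/li[2]/a",
]

root = ET.fromstring(PAGE)  # root is the <html> element


def resolve(xpath):
    """Look up one copied XPath, rewritten to be relative to the root."""
    relative = "." + xpath[len("/html"):]
    return root.find(relative)


names = [resolve(x).text for x in copied]
```

Each path walks down the page one step at a time, and the number in brackets picks which sibling to follow. Notice that the two paths differ only in that number.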

The best idea for clean results is to find the common denominator. If you look closely, you can see how each copied XPath starts with the same base, with tiny changes at the end. Now we add only those slight changes that define each unique element. This lets us test changes in real time from the extraction page. Here is a new Public Recipe for Bluetooth World with all data fields.
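The "common denominator" step can be sketched in Python as follows. The markup, the copied XPaths, and the `split_base` helper are all hypothetical, and this is only one way to express the idea, not the extension's own logic: split the copied paths into their shared leading steps (the base) and the short tails that differ, then drop the row number from the base so it matches every speaker instead of just the first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not the real page source.
PAGE = """
<html><body>
  <div class="speaker">
    <a href="/speakers/ada">Ada Example</a>
    <span class="company">Example Corp</span>
    <span class="title">Engineer</span>
  </div>
  <div class="speaker">
    <a href="/speakers/bo">Bo Sample</a>
    <span class="company">Sample Ltd</span>
    <span class="title">Director</span>
  </div>
</body></html>
"""

# Illustrative copied XPaths for three fields of the first speaker.
copied = {
    "name": "/html/body/div[1]/a",
    "company": "/html/body/div[1]/span[1]",
    "title": "/html/body/div[1]/span[2]",
}


def split_base(paths):
    """Split XPaths into their shared leading steps and the differing tails."""
    steps = [p.strip("/").split("/") for p in paths.values()]
    shared = 0
    for group in zip(*steps):
        if len(set(group)) != 1:
            break
        shared += 1
    base = "/" + "/".join(steps[0][:shared])
    tails = {field: "/".join(s[shared:]) for field, s in zip(paths, steps)}
    return base, tails


base, tails = split_base(copied)

# Drop the trailing [1] so the base matches every row, and make it
# relative to the root element, as ElementTree expects.
row_path = "." + base[len("/html"):].rsplit("[", 1)[0]

root = ET.fromstring(PAGE)
rows = [
    {field: row.find(tail).text for field, tail in tails.items()}
    for row in root.findall(row_path)
]
```

Here `base` comes out as `/html/body/div[1]` and the tails as `a`, `span[1]`, and `span[2]`, so one row path plus three short column paths is enough to pull name, company, and title for every speaker.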

Lastly, there is a Beta version that can be run concurrently with the original (link at the bottom of the tutorial page). This version shows even more promising features in the pipeline that would usually require either dedicated scripting or use of additional scraping tools.

Like any technology, you only learn from playing with it. Data Scraper/Data Miner works out-of-the-box with many popular websites, but you will find the most benefit when you dive deep and create a few formulas of your own. I recommend it for anyone who wants to take their sourcing skills to the next level.

About the Author: Aaron Lintz is a Talent Sourcing Specialist with @Commvault Systems. Over the last decade, he has held corporate sourcing and agency recruiting roles, helped develop applicant tracking solutions, and managed email & social marketing programs. His passion for experimentation and automation and his willingness to share make him a natural sourcer.

Follow Aaron on Twitter @AaronLintz or connect with him on LinkedIn.
