This is our second ttdatavis how-to, focusing on new methods for data capture and collection. We will be exploring one aspect of Importio, a powerful free piece of software that can help to scrape information from websites.

Before continuing, we should have a quick word about copyright. When using other people’s data and when scraping information from the web, there are a number of relevant copyright laws worth keeping in mind. Here are three different scenarios that might apply:

  • Creative Commons: If you’re using an open data set, it’s worth verifying the license by which it is released to the public. Often, open datasets are made available using a Creative Commons license. There may be some restrictions on how information released under this license can be used. For example, it is common to require ‘attribution’, or a note of the source of the data in the resulting product.
  • Government data: Most material published by the US government, for example, is non-copyrightable and available for both personal and commercial use. Other governments may have different applicable laws and regulations.
  • Research use of copyrighted data: Where data remains under copyright, there may be circumstances, such as for private research purposes, where copying and storing that data may still be acceptable. For example, in June 2014, the UK issued a new Text and Data Mining exception to its copyright law that allows download for non-commercial research. It was this exception that allowed a space for the development of Importio. 

Before publishing any information gained through the methods described below, it’s worth checking relevant copyright laws in the appropriate jurisdiction.

Step 1: Install and register for an account.

Luckily, this first step is straightforward. Just visit the Importio website's download page and select the appropriate version for your operating system.

After installation, open the programme and create an account. It is possible to log in using other social media accounts (like Facebook), or you can create a new account using an appropriate email address.

Step 2: Create a new 'API from URL (Extractor)' and navigate to desired website

This option can be found in the 'New' menu at the top-right corner of the screen.

Once prompted, enter the URL for the website you're interested in scraping data from.

The tracer example we're using today comes from the UK's National Health Service. As of last week, they offer a new myNHS service that provides a range of information about each of the hospitals operating in England. There is a lot of potential to better visualise this data and to make it more accessible to the public, which is why we're using it as a tracer example.

The one important thing to note here is that it currently displays a limited number of hospitals per page by default. Therefore, at the end of the URL we've changed the very last bit from "&PageSize=10" to "&PageSize=600". This ensures that we only have to scrape one page instead of setting up a site crawl or more complex extractor.
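The same URL change can be made programmatically if you ever need to repeat it. The following is a minimal sketch using only the Python standard library; the example URL is hypothetical and stands in for the real myNHS address, and `with_page_size` is a helper name invented for this illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_size(url: str, page_size: int) -> str:
    """Return the URL with its PageSize query parameter set to page_size."""
    parts = urlsplit(url)
    # Keep every existing parameter except PageSize, preserving their order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "PageSize"
    ]
    query.append(("PageSize", str(page_size)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL, used only for illustration.
example = "https://example.org/hospitals?Region=England&PageSize=10"
print(with_page_size(example, 600))
# prints https://example.org/hospitals?Region=England&PageSize=600
```

Parsing the query string rather than editing the text by hand avoids mistakes when PageSize is not the last parameter in the URL.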

Once you've navigated to the page, click the bright pink 'I'm there' button at the bottom-right side of the screen.

You will then need to select the button in the same place that says 'Detect optimal settings'. This will disable unnecessary scripting on the page and make the extraction more efficient. In some cases, this might remove the information that needs capturing. If so, simply indicate that the data is no longer in the browser. Otherwise, as is the case in our example, select 'Yes'.

Step 3: Select the 'multiple' import type

The next prompt will ask you to select the type of page from which the data is being extracted. In this case we will select the middle icon which is a 'multiple' page, as we have 501 rows on the page. If each hospital had its own page, we would have selected single (we also would have created a crawler rather than a one-off extractor).

We could possibly also select the Table extractor, but at the time of writing this was still a Beta version and not as accurate as using the 'multiple' select.

Step 4: Train rows

After clicking the multiple button, highlight the first row of data (not the row with headers) and then click the pink 'Train rows' button on the bottom right of the screen. The first row will be highlighted in blue.

In order to create a pattern, Importio needs to know what the second row of data is before continuing. Highlight the second row and once again select the 'Train rows' button.

At this point the page should be full of alternating light and dark blue selections over the 501 individual rows.

Finalise your selection by clicking the 'I've got all X rows!' button, again in the bottom-right corner.

Step 5: Train columns

Training columns is what gives proper structure to your data and is what makes Importio so powerful.

Start by adding a column. Enter a variable name and then select the type of content you'll be putting in that column. Text is the default. In our example we could create a variable for hospital_name, which would be an appropriate text variable.

Select the relevant text and then select the 'Train' button, similar to Step 4 above. You'll note that in most cases the correct information will then be added into the appropriate column at the bottom of the page. In more complicated cases, it may be necessary to select and train using more than one row.

Other types of column data that we can pull from this sample set include:

  • Date: for example, the date when the most recent hospital inspection was carried out. This is set by noting the format in which the extractor should expect the data, e.g. dd MMMM yyyy for a date written as 23 November 2014.
  • Link: pulls the link URL, not just the name of the link.
  • Number: saves the percentages as numbers rather than text, by selecting the numbers only.
  • Images.
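To illustrate what the date and number column types do, here is a minimal Python sketch of the same conversions. It is not how Importio works internally. It assumes an English locale (month names are locale-dependent in Python), and the function names and the sample percentage are invented for this example.

```python
from datetime import date, datetime


def parse_inspection_date(text: str) -> date:
    """Parse a date written like '23 November 2014'."""
    # "%d %B %Y" is Python's counterpart to the dd MMMM yyyy pattern:
    # day of month, full month name, four-digit year.
    return datetime.strptime(text.strip(), "%d %B %Y").date()


def parse_percentage(text: str) -> float:
    """Turn text such as '87%' into the number 87.0."""
    return float(text.strip().rstrip("%").strip())


print(parse_inspection_date("23 November 2014"))  # prints 2014-11-23
print(parse_percentage("87%"))                    # prints 87.0
```

Storing these values as real dates and numbers rather than text is what later lets you sort, filter and chart them correctly.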

Step 6: Save, finish training, and upload to Importio

Once you have selected all the data you want to pull out and put them in the appropriate type of column, select 'I've got what I need'.

As all of our results are on one page, select 'I'm done training'.

Finish the process by uploading to Importio and giving the extractor an appropriate name.

Once it is uploaded you can click through to 'show your data'.

Step 7: Download data or link to live Google Spreadsheet

In the data view of your new extractor, click the button second from the left at the top of the screen to view options for saving the data. Normally this will be done as a CSV (a comma-separated file that can easily be imported into Excel). 
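If you would rather inspect the downloaded CSV in code than in Excel, a minimal Python sketch follows. The sample data is made up so that the example runs on its own; only the hospital_name column comes from the steps above, while the second column name and the `load_rows` helper are hypothetical.

```python
import csv
import io

# Made-up sample standing in for the downloaded file.
SAMPLE_CSV = (
    "hospital_name,recommend_pct\n"
    "Example Hospital A,87\n"
    "Example Hospital B,92\n"
)


def load_rows(handle):
    """Read CSV text from an open file-like object into a list of dicts."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows))                 # prints 2
print(rows[0]["hospital_name"])  # prints Example Hospital A
```

For a real download, open the file with `open(path, newline="", encoding="utf-8")` and pass the handle to `load_rows`. Comparing `len(rows)` with the number of rows you trained (501 in our example) is a quick check that nothing was lost.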

Alternatively, if you'd like a live connection to data that might be updated regularly, it is also possible to link the data to a live Google Spreadsheet. Find out more on how to do this from Importio directly.