API Queries in OpenRefine
Overview
Teaching: 25 min
Exercises: 30 minQuestions
How do I fetch data from an Application Programming Interface (API) to be used in OpenRefine?
Objectives
Understand how to fetch data from online sources in OpenRefine
Explain how to parse JSON data in OpenRefine
Fetching data from the web
OpenRefine has built-in functions for accessing data through APIs. We can use the function “Add columns by fetching URLs” to pass :
- ‘Edit column’ -> ‘Add column by fetching URLs..’
We will be using the data in File1.xlsx. The column named ‘ISBN’ contains ISBNs we can use to search the Voyager API. We can request the catalog data about the resource and add that data to our Excel file.
Let’s fetch data from Voyager
- Create a new project in OpenRefine using File1.xlsx
- Select column ISBN -> ‘Edit column’ -> ‘Add column by fetching URLs…’
- Name new column VoyagerData
- Change ‘Throttle delay’ from 5000 -> 300
- Add the expression:
"https://libapp.library.yale.edu/VoySearch/GetBibItem?isxn=" + value
- Click ‘OK’
New GREL Functions
In addition to the functions we’ve already covered in the OpenRefine Lesson, these GREL functions will help you complete the exercises in this lesson. You can find the link to the documentation for each function below:
Parse JSON results
Our Voyager API query returns results in JSON format as a single cell in our excel spreadsheet. We can use OpenRefine to parse this JSON to pull out catalog data and insert this data in our spreadsheet.
- OpenRefine has a built-in function called
parseJson()
that we can use to extract specfic data from the JSON results. - Because JSON data is often nested, we must navigate to the appropriate “key” to extract the “value”.
- We often encounter data organized in an array or list, signified by square brackets:
[]
- To access an item in an array, we use the index number starting at zero (0)
- In the following screenshot, we have an array a books with the key “items”
- To get the value of “Book1”, we use
item[0].title
where zero is the index for the first item & “title” is the key for the value we want to extract. - To get “Book2” we would use
item[1].title
, “Book3” isitem[2].title
, and so on…
Let’s find the Call Number
Let’s pull out the Publisher form our JSON results and add it to a new column called Publisher:
- Create a new project using the file File1-VoyagerJSON.xlsx
- Select VoyagerData -> ‘Edit column’ -> ‘Add column based on this column…’
- New column name: Publisher
- Add the expression:
value.parseJson().record[0].publisher
- Click ‘OK’. We should have a new column in our data with the publisher for each item.
Create a link to the catalog record
- Create a new column called BibId with the bibid for each item. What steps did you take?
- What steps would you take to create a new column called HandleLink which is a link to the catalog record for that item? (hint:
"http://hdl.handle.net/10079/bibid/" + ?
)Solution
- Select VoyagerData -> ‘Edit column’ -> ‘Add column based on this column…’
- New column name: BibId
- Add the expression:
value.parseJson().record[0].bibid
- Click ‘OK’
- Select BibId -> ‘Edit column’ -> ‘Add column based on this column…’
- New column name: HandleLink
- Add the expression:
"http://hdl.handle.net/10079/bibid/" + value
- Click ‘OK’
Working with arrays
Many times when working with JSON or XML results, we need to work with an array (or list) of data. To pull out each value in an array, we use a forEach( ) loop (like the Shell for loop). The next exercise will help us understand how to use the forEach()
function.
Return a JSON array or list
Let’s work with an array of data in JSON. We will find the location of each item in the record and save those values in a new column.
- Select VoyagerData -> ‘Edit column’ -> ‘Add column based on this column…’
- New column name: Location
- Add the expression:
forEach(value.parseJson().record[0].items,v,v.loccode).join(" | ")
forEach()
takes 3 arguements: an array, a variable, and variable.function()value.parseJson().record[0].items
is our arrayv
is our variablev.loccode
is our function- Click ‘OK’. We should have a new column in our data with an array of values.
Fetch MARCXML records
The voyager API can also return MARCXML from certain queries. We’ll use a different base URL for our next query to retrive XML: https://libapp.library.yale.edu/VoySearch/GetBibMarc?bibid=
Let’s fetch MARCXML from Voyager
- Select column BibId -> ‘Edit column’ -> ‘Add column by fetching URLs…’
- Name new column MARCXML
- Change ‘Throttle delay’ from 5000 -> 300
- Add the expression:
"https://libapp.library.yale.edu/VoySearch/GetBibMarc?bibid=" + value
- Click ‘OK’
Parse XML results
Our Voyager API query returns reults in MARCXML format for each item in a single cell. We can parse the XML like we did previously with JSON.
parseXml()
works likeparseJson()
; it tells OpenRefine we’re working with XML- We use the function
select()
to choose the XML element we want.select()
always returns an array. - We use the function
xmlText()
to remove the XML tags from any results and return just the text inside an element.
Parse MARCXML results
We need to parse the MARCXML to pull out data about each item. The steps below will walk through creating a new column that contains the Call Number for each item in our dataset.
- Select column MARCXML -> ‘Edit column’ -> ‘Add column based on this column…’
- Name new column: CallNo
- Add the expression (don’t click OK yet):
value.parseXml().select("datafield[tag=050]")
- We use
select()
to select the element with the Call number.- The Call number is found in an element called datafield with an attribute called tag that is equal to 050.
- The preview shows us our results. We should see an array of a single XML element that contain a Call Number.
- To get just the CallNo, we need to access the first, and only, item in our array. Add
[0]
to the end of your expression to find the item at the zeroth index position.- To escape the XML tags, we add the function
xmlText()
to our expression. We should be left with just the Call Number.- Your full expressions should now read:
value.parseXml().select("datafield[tag=050]")[0].xmlText()
. Click ‘OK’.
Key Points
You can augment existing data in OpenRefine with ‘Add column by fetching URLs
JSON or XML results can be parsed within OpenRefine