# Accessing the IUPAC Gold Book API in Python

```{dropdown} About this interactive ![icons](../static/img/rocket.png) recipe
- Author: [Stuart Chalk](https://orcid.org/0000-0002-0703-7776)
- Reviewer: [Sam Munday](https://orcid.org/0000-0001-5404-6934)
- Topics: The IUPAC Gold Book, APIs, JSON
- Format: Interactive Jupyter Notebook (Python)
- Scenarios: Retrieve the definition of a chemical concept via code
- Skills: You should be familiar with
    - [Application Programming Interfaces (APIs)](https://www.ibm.com/topics/api)
    - [The JavaScript Object Notation (JSON) file format](https://www.w3schools.com/js/js_json_intro.asp)
    - [Introductory Python](https://www.youtube.com/watch?v=kqtD5dpn9C8)
    - [Regular expressions](https://www.regular-expressions.info/tutorial.html)
- Learning outcomes: After completing this example you should understand:
    - Python functions ('def' code blocks)
    - How to write Python code to request data from a URL (typically an API)
    - How to use a Python variable to call an API and download data
- Citation: 'Accessing the IUPAC Gold Book API in Python', Stuart Chalk, The IUPAC FAIR Chemistry Cookbook, Contributed: 2023-02-28 [https://w3id.org/ifcc/IFCC003](https://w3id.org/ifcc/IFCC003).
- Reuse: This notebook is made available under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
```

## Step 1: Import needed Python packages
Much of Python's functionality lives in packages, which are loaded with the 'import' statement.

In [1]:
import requests                             # package to get data from a URL
import json                                 # package to read/write/display JSON formatted data
import re                                   # package to use regular expression (regex) searching

## Step 2: Add a Python function
This function removes HTML tags from textual data.  It uses [regular expressions](https://www.youtube.com/watch?v=rhzKDrUiJVk) to detect HTML tags (e.g., <b>I am surrounded by HTML tags</b> is really &lt;b&gt;I am surrounded by HTML tags&lt;/b&gt; in the page code).

In [2]:
# Source: https://medium.com/@jorlugaqui/how-to-strip-html-tags-from-a-string-in-python-7cb81a2bbf44
def remove_html_tags(text):                 # a 'def' is a (defined) function that can be called later
    clean = re.compile('<.*?>')             # sets up a regular expression to search with
    return re.sub(clean, '', text)          # removes the matches to the regular expression
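
As a quick check, the function can be tried on a title string like the one that turns up in Step 4. This is only a sketch: the definition is repeated so the cell runs on its own, and the second sample string is made up for illustration.

```python
import re

def remove_html_tags(text):
    """Return text with anything that looks like an HTML tag removed."""
    clean = re.compile('<.*?>')                 # non-greedy: from '<' to the nearest '>'
    return re.sub(clean, '', text)

raw_title = '<i>cis</i>-<i>trans</i> isomers'  # a title stored with HTML markup
print(remove_html_tags(raw_title))              # cis-trans isomers
print(remove_html_tags('no markup here'))       # printed unchanged (illustrative string)
```

The '?' in the pattern matters. A greedy pattern such as '<.*>' would match from the first '<' to the last '>' and leave only ' isomers'.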

## Step 3: Download a JSON file
Download data for all the IUPAC Recommended Terms currently available. Although this download is fairly large (804 kB),
it is better to fetch everything once than to call the API repeatedly inside a loop.  This makes the 'for' loop in Step 4 much faster.

In [3]:
allpath = "https://goldbook.iupac.org/terms/index/all/json"  # URL of the IUPAC Gold Book API list of all terms
reqdata = requests.get(allpath)                              # download file in JSON
terms = json.loads(reqdata.content)                          # convert JSON to a Python dictionary
print(str(len(terms['terms']['list'])) + ' terms')           # print the number of terms in the list

7052 terms
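
The same "download once" idea can be taken a step further by saving the data to a local file, so that rerunning the notebook does not download it again. The sketch below is one possible approach, not part of the original recipe: `load_terms`, `fake_fetch`, and the file name are hypothetical, and the download is replaced by a stand-in so the cell runs offline.

```python
import json
import tempfile
from pathlib import Path

def load_terms(cache_file, fetch):
    """Return the term data, calling fetch() only when no cached copy exists."""
    cache_file = Path(cache_file)
    if cache_file.exists():                        # reuse the copy saved earlier
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = fetch()                                 # a single call to the API
    cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data

# Stand-in for the real download (hypothetical, trimmed-down data)
calls = []
def fake_fetch():
    calls.append(1)                                # record each "download"
    return {"terms": {"list": {"C01093": {"title": "<i>cis</i>-<i>trans</i> isomers"}}}}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "goldbook_terms.json"      # hypothetical file name
    first = load_terms(cache, fake_fetch)          # downloads and saves
    second = load_terms(cache, fake_fetch)         # reads the saved file instead
    print(len(calls), first == second)             # 1 True
```

In the notebook, the stand-in could be swapped for something like `lambda: requests.get(allpath).json()`, assuming the server response is valid JSON.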


## Step 4: Search for a term
Here we search the recommended term list and, if the term is present, get the term's code.  We use the function above to 'normalize' the text of the titles from
the Gold Book entries by removing the HTML markup, so that they match the term we are looking for. (Note: not all term titles have HTML in them.)

In [5]:
searchterm = "cis-trans isomers"                            # the term to be found
searchcode = None                                           # empty variable to contain the searchcode
rawtitle = None                                             # empty variable to contain the raw title string
for code, term in terms['terms']['list'].items():           # iterate over each term in the list (code (str), term (obj))
    cleaned = remove_html_tags(term['title'])               # remove any HTML formatting in the title
    if cleaned == searchterm:                               # check if the term matches the one we want
        searchcode = code                                   # if it does, get the code for the term
        rawtitle = term['title']                            # save the raw title so we can see it below
        break                                               # we have found the term, so we can get out of the for loop
print(rawtitle)                                             # raw title with its HTML markup (if found)
print(searchcode)                                           # IUPAC Gold Book term code (if found)

<i>cis</i>-<i>trans</i> isomers
C01093
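
If several terms are to be looked up, the loop can be run once to build a dictionary from cleaned titles to codes, after which each search is a single lookup. The following is a minimal sketch under stated assumptions: `build_title_index` is a hypothetical helper, titles are lower-cased so the match is case-insensitive (the loop above is case-sensitive), and `sample_list` is a hand-made stand-in for `terms['terms']['list']`, not real API output.

```python
import re

def remove_html_tags(text):
    """Strip anything that looks like an HTML tag (same idea as Step 2)."""
    return re.sub(r'<.*?>', '', text)

def build_title_index(term_list):
    """Map each cleaned, lower-cased title to its term code."""
    index = {}
    for code, term in term_list.items():
        key = remove_html_tags(term['title']).lower()
        index.setdefault(key, code)                # keep the first code if a title repeats
    return index

# Hand-made stand-in for terms['terms']['list']
sample_list = {
    'C01093': {'title': '<i>cis</i>-<i>trans</i> isomers'},
    'X00001': {'title': 'example term'},           # hypothetical entry, not a real code
}

index = build_title_index(sample_list)
print(index.get('cis-trans isomers'.lower()))      # C01093
print(index.get('not in the list'))                # None
```

Using `index.get(...)` returns `None` when a term is absent, which mirrors the way `searchcode` stays `None` in the loop when nothing matches.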


## Step 5: Use the term code to retrieve its definition
Generate a URL to get data about the term, then print out the term, its code, and its definition.

In [8]:
path = "https://goldbook.iupac.org/terms/view/**/json"      # URL path to the IUPAC Gold Book API for a term
reqdata = requests.get(path.replace("**", searchcode))      # request data from the Gold Book server
jsondata = json.loads(reqdata.content)                      # get the downloaded JSON
print(jsondata)                                             # print out all the downloaded data, so we can 'see' its structure and know how to get the definition

{'term': {'id': '01093', 'doi': '10.1351/goldbook.C01093', 'code': 'C01093', 'status': 'current', 'longtitle': 'IUPAC Gold Book - cis-trans isomers', 'title': '<i>cis</i>-<i>trans</i> isomers', 'version': '2.3.3', 'lastupdated': '2014-02-24', 'definitions': [{'id': '1', 'text': 'Stereoisomeric olefins or cycloalkanes (or hetero-analogues) which differ in the positions of atoms (or groups) relative to a reference plane: in the cis-isomer the atoms are on the same side, in the trans-isomer they are on opposite sides. [image: molecular structures showing cis/trans isomerism]', 'chemicals': [{'type': 'chemimage', 'title': 'molecular structures showing cis/trans isomerism', 'file': 'https://goldbook.iupac.org/img/inline/C01093.png'}], 'links': [{'title': 'Stereoisomeric', 'type': 'internal', 'url': 'https://goldbook.iupac.org/terms/view/S05983'}, {'title': 'olefins', 'type': 'goldify', 'url': 'https://goldbook.iupac.org/terms/view/O04281'}, {'title': 'cycloalkanes', 'type': 'goldify', 'url'

In [9]:
print(searchterm + " (" + searchcode + ")")                 # print the title and Gold Book term code
print(jsondata['term']['definitions'][0]['text'])           # extract out and print the definition of the term (compare to above)

cis-trans isomers (C01093)
Stereoisomeric olefins or cycloalkanes (or hetero-analogues) which differ in the positions of atoms (or groups) relative to a reference plane: in the cis-isomer the atoms are on the same side, in the trans-isomer they are on opposite sides. [image: molecular structures showing cis/trans isomerism]
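
The cells above assume that the search in Step 4 succeeded and that the entry has at least one definition. If `searchcode` is still `None`, the call to `path.replace` raises a `TypeError`. The sketch below shows one way to guard both steps. `term_url` and `first_definition` are hypothetical helpers, and `sample` is a shortened stand-in for the JSON printed above.

```python
def term_url(code):
    """Build the URL for one term, refusing a missing code."""
    if code is None:
        raise ValueError("no term code: the search in Step 4 found no match")
    return f"https://goldbook.iupac.org/terms/view/{code}/json"

def first_definition(jsondata):
    """Return the text of the first definition, or None if there is none."""
    definitions = jsondata.get('term', {}).get('definitions', [])
    if not definitions:
        return None
    return definitions[0].get('text')

# Shortened stand-in for the downloaded data (definition text cut for brevity)
sample = {'term': {'code': 'C01093',
                   'definitions': [{'id': '1',
                                    'text': 'Stereoisomeric olefins or cycloalkanes (or hetero-analogues)'}]}}

print(term_url('C01093'))                          # https://goldbook.iupac.org/terms/view/C01093/json
print(first_definition(sample))                    # the shortened definition text
print(first_definition({'term': {}}))              # None
```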


## Step 6: Try other terms
Change the value of the 'searchterm' variable above and rerun Steps 4 and 5.