# Accessing PubChem through PUG-REST: Part III

```{dropdown} About this interactive ![icons](../static/img/rocket.png) recipe
- Author(s): [Sunghwan Kim](https://orcid.org/0000-0001-9828-2074)
- Reviewer: [Samuel Munday](https://orcid.org/0000-0001-5404-6934)
- Topic(s): How to retrieve chemical data using chemical identifiers.
- Format: Interactive Jupyter Notebook (Python)
- Scenario: You need to access and potentially download chemical data.
- Skills: You should be familiar with:
    - [Application Programming Interfaces (APIs)](https://www.ibm.com/topics/api)
    - [Introductory Python](https://www.youtube.com/watch?v=kqtD5dpn9C8)
    - [SMILES](https://chem.libretexts.org/Courses/University_of_Arkansas_Little_Rock/ChemInformatics_(2017)%3A_Chem_4399_5399/2.3%3A_Chemical_Representations_on_Computer%3A_Part_III)
    - [InChI strings](https://www.inchi-trust.org/)
- Learning outcomes:
    - How to access PubChem chemical data using chemical identifiers
    - How to search PubChem using 2-D and 3-D molecular similarity
    - How to search PubChem using substructures and superstructures
- Citation: 'Accessing PubChem through PUG-REST - Part III', Sunghwan Kim, The IUPAC FAIR Chemistry Cookbook, Contributed: 2023-02-28 [https://w3id.org/ifcc/IFCC008](https://w3id.org/ifcc/IFCC008).
- Reuse: This notebook is made available under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
```

In [None]:
import requests
import time
import io
import csv
from IPython.display import Image, display

## 1. Using a SMILES or InChI string as an input query

In [None]:
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/" + smiles + "/cids/txt").text.strip())

Some SMILES strings contain characters that are not compatible with the PUG-REST request URL syntax.  For example, isomeric SMILES uses the "/" character (forward slash) to represent the E/Z or cis/trans stereochemistry of a molecule.  However, because the "/" character is also used in the request URL to separate the segments of the URL path, using such a SMILES string as an input structure in the URL path will result in an error.

In [None]:
smiles = "CC(C)C1=NC(=NC(=C1/C=C/[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)N(C)S(=O)(=O)C"
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/" + smiles + "/cids/txt").text.strip())
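To see why the request above fails, the following minimal sketch splits a request path on "/" the way a URL path is read. It makes no network call, and `path_segments` is a hypothetical helper introduced here only for illustration.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/"

def path_segments(smiles):
    """Return the path segments that follow 'smiles/' when the SMILES is pasted into the URL path."""
    url = BASE + smiles + "/cids/txt"
    return url[len(BASE):].split("/")

plain = "CC(=O)OC1=CC=CC=C1C(=O)O"   # no "/" in the string
isomeric = "CC(C)/C=C/I"             # "/" marks double-bond stereochemistry

print(path_segments(plain))      # the SMILES stays in one segment
print(path_segments(isomeric))   # the SMILES is broken into several segments
```

With the isomeric SMILES, the structure is spread across three segments, so the path no longer has the layout that the request expects.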

To circumvent this issue, the SMILES input should be provided in one of the following two ways:
1. as a URL parameter
2. in the body of the HTTP request (using the HTTP POST method).

In [None]:
smiles = "CC(C)C1=NC(=NC(=C1/C=C/[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)N(C)S(=O)(=O)C"

# As a URL parameter
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/txt" + "?smiles=" + smiles).text.strip())

# In the request body (using HTTP POST)
print(requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/txt", data={'smiles':smiles}).text.strip())
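A hedged aside on the URL-parameter route: SMILES strings can also contain characters such as "#" (triple bond) or "+" (charge) that have their own meaning inside a URL, so percent-encoding the value is a safer habit than plain string concatenation. The sketch below uses only the standard library and makes no network call; `smiles_query_url` is a hypothetical helper name, not part of PUG-REST or `requests`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/txt"

def smiles_query_url(smiles):
    """Build a GET URL with the SMILES percent-encoded as the 'smiles' parameter."""
    return BASE + "?" + urlencode({"smiles": smiles})

smiles = "CC#N"   # acetonitrile; an unencoded "#" would start a URL fragment
url = smiles_query_url(smiles)
print(url)

# Decoding the query string recovers the original SMILES unchanged.
decoded = parse_qs(urlsplit(url).query)["smiles"][0]
print(decoded == smiles)
```

The `requests` library performs the same encoding when the value is passed through its `params` argument, for example `requests.get(BASE, params={"smiles": smiles})`.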

InChI encodes the chemical structure information into multiple layers and sublayers, separated by the "/" character.  For this reason, InChI strings should also be provided as a URL parameter or in the request body (using HTTP POST).

In [None]:
inchi = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"

# With the request URL : WILL NOT WORK
#print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchi/" + inchi + "/cids/txt").text.strip())

# As a URL parameter
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchi/cids/txt" + "?inchi=" + inchi).text.strip())

# In the request body (using HTTP POST)
print(requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchi/cids/txt", data={'inchi':inchi}).text.strip())

## 2. Performing identity search

In [None]:
smiles = "CC(C)/C=C/I"

In [None]:
# Compounds with the same stereochemistry and isotopism (default)
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/cid/14571425/cids/txt").text.strip())
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/cid/14571425/cids/txt?identity_type=same_stereo_isotope").text.strip())

In [None]:
# Compounds with the same isotopism (stereochemistry can be different)
cids1 = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/smiles/cids/txt?identity_type=same_isotope", data={'smiles':smiles}).text.strip().split()
print(cids1)

for mycid in cids1:
    display(Image(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + mycid + "/record/PNG?image_size=200x200").content))
    print("CID " + mycid, ":", requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + mycid + "/property/IsomericSMILES/TXT").text)
    time.sleep(0.2)

In [None]:
# Compounds with the same stereochemistry (isotopism can be different)
cids2 = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/smiles/cids/txt?identity_type=same_stereo", data={'smiles':smiles}).text.strip().split()
print(cids2)

for mycid in cids2:
    display(Image(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + mycid + "/record/PNG?image_size=200x200").content))
    print("CID " + mycid, ":", requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + mycid + "/property/IsomericSMILES/TXT").text)
    time.sleep(0.2)

In [None]:
# Compounds with the same connectivity (stereochemistry and isotopism can be different)
cids3 = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/smiles/cids/txt?identity_type=same_connectivity", data={'smiles':smiles}).text.strip().split()
print(cids3)    # All compounds in cids1 and cids2 are returned.
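The relationship noted in the comment above (the same-connectivity results include everything returned by the two stricter searches) can be checked with set operations. This is a minimal sketch: `missing_from` is a hypothetical helper, and the CID strings are placeholders made up for illustration, not real search results.

```python
def missing_from(superset, *subsets):
    """Return the CIDs that occur in any of the subsets but not in the superset."""
    found = set(superset)
    expected = {cid for subset in subsets for cid in subset}
    return sorted(expected - found, key=int)

# Placeholder CID strings for illustration only (not real PubChem results).
demo_same_isotope = ["101", "102"]
demo_same_stereo = ["101", "103"]
demo_same_connectivity = ["101", "102", "103", "104"]

# An empty list means every stricter hit is also a same-connectivity hit.
print(missing_from(demo_same_connectivity, demo_same_isotope, demo_same_stereo))
```

In the notebook, the same check would be `missing_from(cids3, cids1, cids2)`.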

## 3. Performing 2-D and 3-D similarity search

In [None]:
smiles = "CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/cids/txt", data={'smiles':smiles}).text.strip().split()
print(len(cids))
print(cids)

You can adjust the similarity threshold using the optional parameter "**Threshold**".  The following request performs a 2-D similarity search with a tighter similarity threshold (99) than the default (90).

In [None]:
smiles = "CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/cids/txt?Threshold=99", data={'smiles':smiles}).text.strip().split()
print(len(cids))
print(cids)

Note that using a higher threshold (99) than the default (90) returns fewer structures.
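For intuition about what the threshold means: PubChem's 2-D similarity is scored with the Tanimoto coefficient between binary fingerprints, and the threshold is that score expressed as a percentage. The sketch below is an illustration only. The fingerprints are toy sets of "on" bit positions invented for this example, not PubChem fingerprints, and `tanimoto` and `passes` are hypothetical helper names.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0   # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

def passes(score, threshold):
    """True if a 0-1 similarity score meets a percentage threshold such as 90 or 99."""
    return score * 100 >= threshold

# Toy fingerprints (made up): four shared bits out of six distinct bits.
query = {1, 4, 7, 9, 12}
candidate = {1, 4, 7, 9, 15}

score = tanimoto(query, candidate)
print(round(score, 3), passes(score, 90), passes(score, 99))
```

Raising the threshold shrinks the set of candidates whose score qualifies, which is why the search at 99 returns fewer CIDs than the search at the default of 90.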

It is also possible to get line notations and molecular properties for the compounds returned from a chemical structure search.

In [None]:
data = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/property/HeavyAtomCount,MolecularFormula,IsomericSMILES/csv?Threshold=99", data={'smiles':smiles}).text.strip()
print(data)

In [None]:
smiles = "CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_3d/smiles/cids/txt", data={'smiles':smiles}).text.strip().split()
print(len(cids))
print(cids)

Currently, the similarity threshold used for 3-D similarity search is not adjustable, unlike the threshold for 2-D similarity search.

## 5. Performing substructure/superstructure search

In [None]:
smiles = "C2CN=C(C1=C(C=CC=C1)N2)C3=CC=CC=C3"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/smiles/cids/txt", data={'smiles':smiles}).text.strip().split()
print(len(cids))

In [None]:
smiles = "C2CN=C(C1=C(C=CC=C1)N2)C3=CC=CC=C3"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsuperstructure/smiles/cids/txt", data={'smiles':smiles}).text.strip().split()
print(len(cids))

## 7. Molecular Formula search

In [None]:
formula = "C6H12O6"
cids = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/" + formula +"/cids/txt").text.strip().split()
print(len(cids))

You can download the structural information for the compounds returned from the molecular formula search.

In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/" + formula +"/property/MolecularFormula,IsomericSMILES/CSV").text.strip()

cid_props = {}
reader = csv.reader(io.StringIO(data))
print(next(reader))  # Print the first line (column header)

for row in reader:
    key = row[0]
    cid_props[key] = row[1:]

# For simplicity, print only the first 10 items.
for count, item in enumerate(cid_props, start=1):
    print(item, "\t", cid_props[item][0], "\t", cid_props[item][1])
    if count == 10:
        break
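As an alternative to indexing columns by position, the CSV text can be read with `csv.DictReader` so that each property is looked up by its column name. This is a sketch under stated assumptions: the sample text and its rows are placeholders made up for illustration, the header names are assumed to match the requested property names, and `parse_properties` is a hypothetical helper.

```python
import csv
import io
from itertools import islice

# Placeholder CSV text in a three-column layout; the rows are made up.
sample = '''"CID","MolecularFormula","IsomericSMILES"
1,"C6H12O6","PLACEHOLDER_SMILES_A"
2,"C6H12O6","PLACEHOLDER_SMILES_B"
3,"C6H12O6","PLACEHOLDER_SMILES_C"
'''

def parse_properties(csv_text):
    """Map each CID to a dict of its remaining columns, keyed by column name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row.pop("CID"): row for row in reader}

props = parse_properties(sample)

# Print only the first two records, looking up fields by name.
for cid, fields in islice(props.items(), 2):
    print(cid, fields["MolecularFormula"], fields["IsomericSMILES"], sep="\t")
```

Looking fields up by name keeps the code working if the order of the requested properties changes.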

In [None]:
cids = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/" + formula +"/cids/txt?AllowOtherElements=True").text.strip().split()
print(len(cids))

In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/" + formula +"/property/MolecularFormula,IsomericSMILES/CSV?AllowOtherElements=True").text.strip()

cid_props = {}
reader = csv.reader(io.StringIO(data))
print(next(reader))  # Print the first line (column header)

for row in reader:
    key = row[0]
    cid_props[key] = row[1:]

# For simplicity, print only the first 10 items.
for count, item in enumerate(cid_props, start=1):
    print(item, "\t", cid_props[item][0], "\t", cid_props[item][1])
    if count == 10:
        break