This post is a quick initial overview of PatCID, IBM’s new open-access patent chemical structure database. A more detailed review will follow.
The IBM-PatCID Paper
On August 2, 2024, a team of researchers at IBM, led by Lucas Morin, published the paper “PatCID: an open-access dataset of chemical structures in patent documents” in Nature Communications¹. The paper introduces PatCID, a database and research tool, and it deserves the attention of those of us in the patent search world who routinely deal with chemical structure searching.
What is PatCID?
PatCID (Patent-extracted Chemical-structure Images database for Discovery) is a tool developed by IBM researchers to streamline the retrieval of chemical structures from patent documents. As chemical patent publications and patent databases grow in number, size, and complexity, the need for more efficient, automated search tools has become critical, particularly in fields like intellectual property, drug discovery, and materials science.
Traditional databases like Reaxys and SciFinder, while comprehensive, rely on manual indexing and curation by technical experts, which can be costly. On the other hand, automatically indexed databases like SureChEMBL and Google Patents can be less accurate, leading to incomplete data and sometimes unreliable results. PatCID positions itself as a middle ground between these two options: while it may not be as comprehensive as the commercially available, manually indexed databases, it offers a free and open-access alternative with a solid balance between accuracy and accessibility.
A Sneak Peek at the Data
The table below, extracted from the IBM-PatCID paper, compares the automated chemical structure image identification and indexing performance of several commonly used chemical patent search databases. It also highlights PatCID’s performance in comparison to these alternative databases when applied to two patent test sets.
Search comparison for automatically-created databases
| Database | D2C-RND: Molecules (200) | D2C-RND: Annotated Documents (179) | D2C-UNI: Molecules (164) | D2C-UNI: Annotated Documents (164) |
| --- | --- | --- | --- | --- |
| Automatic, text and image | | | | |
| SureChEMBL | 23.50% | 45.30% | 6.10% | 52.40% |
| Google Patents | 41.50% | 68.20% | 17.70% | 67.10% |
| Reaxys | 41.50% | 59.20% | 36.00% | 58.50% |
| Automatic, image | | | | |
| SureChEMBL | 22.00% | 35.80% | 4.90% | 11.60% |
| Google Patents | 36.50% | 60.00% | 9.80% | 54.30% |
| PatCID | 56.00% | 100% | 47.60% | 98.20% |
Comparison of the molecule and document retrieval performance of state-of-the-art, automatically created patent databases. The recall of molecules and annotated documents is reported for benchmarks based on random (D2C-RND) and uniform (D2C-UNI) distributions of chemical images. The numbers in parentheses are the numbers of samples in each set.
According to the IBM team’s data, PatCID outperforms all assessed databases on automated molecule and document recall for both benchmark datasets. This is an encouraging result, especially the high document recall in both sets. The PatCID development team plans to significantly expand the library of indexed patents in the future. For now, it is important to note that PatCID indexes 1.2 million patents, whereas commercial databases like SciFinder and Reaxys report 18 million and 43 million indexed patents, respectively.
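To make the recall percentages in the table above more concrete, the short Python sketch below converts PatCID’s reported recall values back into approximate item counts. Recall here is simply the number of benchmark items a database retrieves divided by the number of items in the benchmark. The function names `recall` and `implied_hits` are hypothetical helpers written for this illustration, and the counts are back-calculated from the rounded percentages rather than taken from the paper, so treat them as estimates.

```python
def recall(found: int, total: int) -> float:
    """Fraction of benchmark items that a database retrieves."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


def implied_hits(recall_pct: float, total: int) -> int:
    """Approximate number of retrieved items behind a reported recall percentage.

    This is an estimate: the published percentages are rounded, so the
    result is the nearest whole number of items, not a reported figure.
    """
    return round(recall_pct / 100 * total)


# PatCID recall (percent) and benchmark sizes, as listed in the table above.
patcid = {
    "D2C-RND molecules": (56.0, 200),
    "D2C-RND annotated documents": (100.0, 179),
    "D2C-UNI molecules": (47.6, 164),
    "D2C-UNI annotated documents": (98.2, 164),
}

for name, (pct, total) in patcid.items():
    hits = implied_hits(pct, total)
    print(f"{name}: about {hits} of {total} ({recall(hits, total):.1%})")
```

Read this way, a 56% molecule recall on D2C-RND corresponds to roughly 112 of the 200 benchmark molecules, which helps put the gap between the databases in terms of actual structures found or missed.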
Coming Soon – TPR Assessment of PatCID
As part of our ongoing efforts to investigate new tools and stay at the forefront of best patent search practices, we have reached out to the authors of the paper and have begun testing the database. In an upcoming blog post, we will expand on the paper’s contents, share our notes from discussions with the IBM research team, and offer our impressions and takeaways about the database itself.
Have questions about chemical patent searching and want to discuss them with our team? Learn more here or contact us.
References
1. Morin, L., Weber, V., Meijer, G. I., Yu, F., & Staar, P. W. J. (2024, August 2). PatCID: an open-access dataset of chemical structures in patent documents. Nature Communications. https://www.nature.com/articles/s41467-024-50779-y