Percent Identity in Sequence Searching
In this issue of the TPR Biotech Brief, we look at ‘percent identity’ in sequence searching by exploring the following:
- What is meant by ‘percent identity’?
- Why can every alignment have more than one type of ‘percent identity’?
- What are ‘subject percent identity’ and ‘query percent identity’?
- Why is looking at more than ‘alignment percent identity’ helpful?
Understanding percent identity
Let’s start with the basics: what is percent identity?
Percent identity is a key parameter in evaluating the results of any sequence search. It measures how closely a subject sequence matches your own or your client’s sequence. In brief, the higher the percent identity, the better the match.
For a closer look, consider an example. Figure 1 below shows an alignment of a 60 amino acid query and a 645 amino acid subject sequence. The alignment length is 57 amino acids, with one mismatch, which leaves 56 matches. The three percent identities listed in the figure are calculated from these values.
Figure 1: Sequence Alignment
Alignment Length = 57aa
Mismatches = 1
Matches = 56
Query Length = 60
Subject Length = 645
%ID alignment = 56/57 (98%)
%ID subject = 56/645 (9%)
%ID query = 56/60 (93%)
As Figure 1 shows, there is more than one way to calculate percent identity, and each calculation reveals something different about your results. Many resources, such as those of the National Library of Medicine’s National Center for Biotechnology Information (NCBI), assign a percent identity using the following equation:
Alignment %ID = (# Matches)/(Alignment Length)
This calculation, also known as alignment percent identity, tells you how well your sequence matches the subject sequence over the region in which they have been aligned. In Example 1 below, the percent identity calculated in this manner is 100%, but as you can see, the query is actually much shorter than the subject.
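As an illustration, the alignment percent identity described above can be computed directly from a pair of aligned sequences. The Python sketch below is a minimal example, not any particular tool’s implementation. The function name and the short sequences are hypothetical, and it assumes that gap columns (written as "-") count toward the alignment length but never count as matches. Different tools may treat gaps differently.

```python
def alignment_percent_identity(aligned_query: str, aligned_subject: str, gap: str = "-") -> float:
    """Return (# of matches) / (alignment length) as a percentage.

    Assumption: both strings come from the same alignment, so they have
    the same number of columns, and gap columns are never matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")

    # A column is a match when both residues are identical and not a gap.
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != gap
    )
    return 100.0 * matches / len(aligned_query)


# Hypothetical 10-column alignment with a single mismatch (K vs R).
print(alignment_percent_identity("MKTAYIAKQR", "MRTAYIAKQR"))  # 90.0
```

Because the denominator is the alignment length only, this value says nothing about how much of the full query or subject lies outside the aligned region.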
What are ‘subject’ percent identity and ‘query’ percent identity?
Other databases also calculate subject percent identity and query percent identity. These two parameters are calculated as follows:
Subject %ID = (# Matches)/(Subject Length)
Query %ID = (# Matches)/(Query Length)
These two parameters help you build a better picture of how well your query matches the entire length of the subject, or vice versa. In some cases, one or both of these percent identities will equal the alignment percent identity, but not always.
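To make the arithmetic concrete, the following Python sketch applies the three formulas to the values from Figure 1 (56 matches, alignment length 57, query length 60, subject length 645). The function name and the dictionary keys are illustrative choices, not part of any database’s output format.

```python
def percent_identities(matches: int, alignment_length: int,
                       query_length: int, subject_length: int) -> dict:
    """Return alignment, query and subject percent identity for one hit."""
    lengths = {
        "alignment": alignment_length,
        "query": query_length,
        "subject": subject_length,
    }
    for name, length in lengths.items():
        if length <= 0:
            raise ValueError(f"{name} length must be positive")
        if not 0 <= matches <= length:
            raise ValueError(f"matches must be between 0 and the {name} length")

    # Same numerator each time; only the denominator changes.
    return {name: 100.0 * matches / length for name, length in lengths.items()}


# Values taken from Figure 1.
figure_1 = percent_identities(matches=56, alignment_length=57,
                              query_length=60, subject_length=645)
for kind, value in figure_1.items():
    print(f"%ID {kind} = {value:.0f}%")
# %ID alignment = 98%
# %ID query = 93%
# %ID subject = 9%
```

The numerator is the same in all three cases, so any difference between the values comes from how much of each full sequence falls outside the aligned region.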
Why looking at more than alignment percent identity is helpful
In the case of Example 1, the query percent identity will equal the alignment percent identity, because the alignment spans the whole query, but the subject percent identity will be lower. By comparison, in Example 2 below, the subject percent identity and the alignment percent identity are equal, but the query percent identity is lower.
With this in mind, and depending on the type of sequence you are looking for, it may be worth reviewing more than just the alignment percent identity, as the query and subject percent identities give you more context with regard to each sequence in its entirety.
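The Python sketch below shows how two hits with similar alignment percent identities can look very different once the query and subject percent identities are considered. The two hits are invented for illustration and are not the figures from Example 1 or Example 2. The `Hit` class and the 90% threshold are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search hit, described by the counts used in the three formulas."""
    name: str
    matches: int
    alignment_length: int
    query_length: int
    subject_length: int

    @property
    def alignment_id(self) -> float:
        return 100.0 * self.matches / self.alignment_length

    @property
    def query_id(self) -> float:
        return 100.0 * self.matches / self.query_length

    @property
    def subject_id(self) -> float:
        return 100.0 * self.matches / self.subject_length


# Hypothetical hits, invented for illustration only.
hits = [
    Hit("short query inside long subject", 20, 20, 20, 200),
    Hit("short subject inside long query", 19, 20, 200, 20),
]

for hit in hits:
    print(f"{hit.name}: alignment {hit.alignment_id:.1f}%, "
          f"query {hit.query_id:.1f}%, subject {hit.subject_id:.1f}%")

# Keep only hits that also match most of the full query.
# The 90% cut-off is arbitrary and would depend on the search goal.
full_query_matches = [h.name for h in hits if h.query_id >= 90.0]
print(full_query_matches)
```

Both hypothetical hits score 95% or higher on alignment percent identity, yet only the first passes a filter on query percent identity. A filter on subject percent identity would instead keep only the second.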
When it comes to sequence searching, all these parameters should be understood and considered by your search partner. Sequence searching is an art that takes many years of experience to master. At TPR, our Biotech Search Experts have extensive experience, having worked as professional searchers within top R&D corporations at the forefront of biotechnology. A deep grounding in the science, combined with many years of patent search experience, enables our team to perform more robust searches and provide meaningful data insights, leading to better outcomes for our clients.
If you would like to learn more on this topic or other topics relating to sequence searching, please contact the TPR search team at 1.858.592.9084 or searches@TPRinternational.com.
All of the parameters explained in this brief, and more, are included as standard with most TPR sequence search reports. If you have any specific requests with regard to sequence searches in general, or to the information you would like to see in your report, please ask the TPR search team, as we are here to help and share our knowledge.