Proteomic ruler: Annotate proteins

Author

Cox Lab

Published

November 15, 2023

If you arrived here directly from Perseus, it is a good idea to read the Proteomic ruler overview first.

For those who want to apply the proteomic ruler concept and are in a hurry: if you imported the columns Sequence length and Molecular weight from your proteinGroups.txt, you can skip this step and estimate copy numbers directly.

1 Description

Annotate proteins extracts information from the UniProt fasta file that was used to process a dataset:

  • Protein annotations from the UniProt fasta headers, e.g. gene names, protein names, entry names.
  • The sequence length and the molecular mass of the protein.
  • The number of theoretical peptides from in silico digestion of the protein with each of several proteases.
  • The occurrence of user-definable sequence features within the protein sequences.
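
As an illustration of the sequence-derived properties listed above, the following Python sketch reads FASTA records and computes the sequence length and molecular mass. It is not the plugin's code: the function names `read_fasta` and `molecular_mass` are hypothetical, approximate average residue masses are assumed (the plugin may use a different mass table), and residues outside the 20 standard amino acids are assumed to contribute no mass.

```python
from io import StringIO

# Approximate average residue masses in Da (assumption: average, not monoisotopic).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added for the free N- and C-termini


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file or file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def molecular_mass(sequence):
    """Sum the residue masses and add one water; unknown residues count as 0 (assumption)."""
    return sum(AVERAGE_RESIDUE_MASS.get(aa, 0.0) for aa in sequence) + WATER_MASS


# Hypothetical two-line record; the accession and entry name are placeholders.
example = StringIO(">sp|P00000|TEST_HUMAN Hypothetical protein OS=Homo sapiens\nACDK\nG\n")
for header, sequence in read_fasta(example):
    print(header.split("|")[1], len(sequence), round(molecular_mass(sequence), 2))
```

To run this on a real file, pass an open file handle, e.g. `read_fasta(open(path))`, where `path` points to the fasta file used for processing.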

2 Parameters

Most of the parameters should be self-explanatory. Hover the mouse over the descriptions to get detailed help. The plugin pre-selects the most useful parameters and auto-detects the correct input columns (if present in your matrix). These default parameters cover everything needed to estimate copy numbers.

2.1 Input

2.1.1 Protein IDs

Select the column containing your semicolon-separated protein group IDs (UniProt format). When coming from MaxQuant, it is recommended to use the ‘Majority protein IDs’ column.

2.1.2 Fasta file

Specify the UniProt fasta file you used to process your dataset. The plugin will parse this file and extract information from the headers and the amino acid sequences.

2.2 Output

Because a protein group often contains more than one UniProt ID, you can specify whether to extract annotations and calculate sequence properties for the leading ID alone or for all IDs in the protein group. For text annotations, the values for all IDs are reported as a semicolon-separated list. For numeric properties, the plugin summarizes the values across the list of sequences by reporting the median.
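
The behavior described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (`split_ids`, `aggregate_text`, `aggregate_numeric`) and the IDs and values are hypothetical, IDs missing from the fasta file are assumed to be skipped, and text annotations are assumed to be joined without removing duplicates.

```python
from statistics import median


def split_ids(cell):
    """Split a semicolon-separated protein group cell into individual IDs."""
    return [pid.strip() for pid in cell.split(";") if pid.strip()]


def aggregate_text(ids, annotations, leading_only=False):
    """Join text annotations with semicolons (leading ID only, or all IDs)."""
    selected = ids[:1] if leading_only else ids
    return ";".join(annotations[pid] for pid in selected if pid in annotations)


def aggregate_numeric(ids, values, leading_only=False):
    """Report the median of a numeric property over the selected IDs."""
    selected = ids[:1] if leading_only else ids
    found = [values[pid] for pid in selected if pid in values]
    return median(found) if found else float("nan")


# Hypothetical protein group with made-up accessions and values.
ids = split_ids("P11111;P22222;P33333")
gene_names = {"P11111": "GENE1", "P22222": "GENE1", "P33333": "GENE2"}
lengths = {"P11111": 100, "P22222": 120, "P33333": 400}

print(aggregate_text(ids, gene_names))
print(aggregate_numeric(ids, lengths))
print(aggregate_numeric(ids, lengths, leading_only=True))
```

The median keeps a single unusually long or short sequence in the group from dominating the reported value, which a mean would not.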

2.2.1 Fasta header annotations

  • Entry name, e.g. KAL1L_HUMAN
  • Gene name, e.g. KANSL1L
  • Protein name (verbose), e.g. Isoform 2 of KAT8 regulatory NSL complex subunit 1-like protein
  • Protein name (consensus), e.g. KAT8 regulatory NSL complex subunit 1-like protein. The consensus protein names will be stripped of all Isoform xy of prefixes and (Fragment) suffixes.
  • Species, e.g. Homo sapiens
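
A minimal Python sketch of how these fields could be pulled out of a UniProt-style fasta header is shown below. It assumes the common `>db|accession|entry_name description OS=... OX=... GN=...` layout. The function names are hypothetical, the accession in the example header is a placeholder, and the regular expressions are one possible reading of the stripping rule rather than the plugin's own.

```python
import re


def consensus_name(name):
    """Strip an 'Isoform xy of ' prefix and a '(Fragment)' suffix from a protein name."""
    name = re.sub(r"^Isoform \S+ of ", "", name)
    # Assumption: the plural '(Fragments)' is treated the same way.
    name = re.sub(r"\s*\(Fragments?\)$", "", name)
    return name


def parse_header(header):
    """Extract entry name, gene name, protein names and species from one header line."""
    identifier, _, rest = header.lstrip(">").partition(" ")
    parts = identifier.split("|")
    accession = parts[1] if len(parts) >= 3 else identifier
    entry_name = parts[2] if len(parts) >= 3 else ""
    # Split the description from the trailing KEY=value fields.
    fields = re.split(r"\s(?=(?:OS|OX|GN|PE|SV)=)", rest)
    key_values = dict(field.split("=", 1) for field in fields[1:])
    return {
        "accession": accession,
        "entry_name": entry_name,
        "gene_name": key_values.get("GN", ""),
        "protein_name": fields[0],
        "consensus_name": consensus_name(fields[0]),
        "species": key_values.get("OS", ""),
    }


# Illustrative header assembled from the examples above; 'P00000-2' is a placeholder accession.
header = (">sp|P00000-2|KAL1L_HUMAN Isoform 2 of KAT8 regulatory NSL complex "
          "subunit 1-like protein OS=Homo sapiens OX=9606 GN=KANSL1L")
for key, value in parse_header(header).items():
    print(f"{key}: {value}")
```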

2.2.2 Numeric annotations

2.2.3 Calculate theoretical peptides

The plugin will perform an in-silico digestion of the protein sequences with the specified protease and report the number of theoretically expected peptides without miscleavages in the selected size range.

2.2.4 Count sequence features

The plugin will count the number of occurrences of a given regular expression in the protein sequences. It is recommended to normalize this count by the sequence length if you want to average across all IDs.
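
Both sequence-based counts, the theoretical peptides from in-silico digestion and the occurrences of a sequence feature, can be sketched in a few lines of Python. The sketch rests on several assumptions that the text does not specify: the size range is taken as a peptide length range, cleavage is C-terminal to the listed residues (trypsin-like, optionally blocked before proline), and regular expression matches are counted without overlap. All function names and the example sequences are hypothetical.

```python
import re


def digest(sequence, cleave_after="KR", block_before=""):
    """Fully cleave C-terminal to the given residues (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in cleave_after and next_residue and next_residue not in block_before:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def count_theoretical_peptides(sequence, min_len=7, max_len=30, **protease):
    """Count fully cleaved peptides whose length lies in [min_len, max_len] (assumed range)."""
    return sum(1 for p in digest(sequence, **protease) if min_len <= len(p) <= max_len)


def count_feature(sequence, pattern):
    """Count non-overlapping matches of a regular expression in the sequence."""
    return sum(1 for _ in re.finditer(pattern, sequence))


def feature_density(sequence, pattern):
    """Normalize the feature count by the sequence length."""
    return count_feature(sequence, pattern) / len(sequence) if sequence else 0.0


# Made-up sequences for illustration only.
print(digest("MAGKTESRPLLKDE", block_before="P"))
print(count_theoretical_peptides("MAGKTESRPLLKDE", min_len=4, max_len=8, block_before="P"))
# An N-glycosylation-like motif, chosen purely as an example pattern.
print(count_feature("MNGTANPSNKS", r"N[^P][ST]"))
print(feature_density("MNGTANPSNKS", r"N[^P][ST]"))
```

Normalizing by length matters when the median is taken over all IDs of a protein group: raw counts scale with sequence length, so isoforms of different lengths would otherwise not be comparable.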