The MGI report MRK_Sequence.rpt provides associations between MGI genetic markers and GenBank, RefSeq, Ensembl and UniProtKB identifiers.

To read this report using the key "genbank_refseq_ensembl_ids", use the following code:

# To read all records (more than 70,000), use `read_report("genbank_refseq_ensembl_ids")`.
(assoc_to_seq_ids <- read_report(report_key = "genbank_refseq_ensembl_ids", n_max = 30L))
## # A tibble: 30 × 19
##    marker_status marker_type marker_id   marker_symbol marker_name  feature_type
##    <fct>         <fct>       <chr>       <chr>         <chr>        <fct>       
##  1 O             BAC/YAC end MGI:1341858 03B03F        DNA segment… BAC/YAC end 
##  2 O             BAC/YAC end MGI:1341869 03B03R        DNA segment… BAC/YAC end 
##  3 O             Gene        MGI:1918911 0610005C13Rik RIKEN cDNA … lncRNA gene 
##  4 O             Gene        MGI:1923503 0610006L08Rik RIKEN cDNA … lncRNA gene 
##  5 O             Gene        MGI:1925547 0610008J02Rik RIKEN cDNA … lncRNA gene 
##  6 O             Gene        MGI:3698435 0610009E02Rik RIKEN cDNA … lncRNA gene 
##  7 O             Gene        MGI:1918921 0610009F21Rik RIKEN cDNA … lncRNA gene 
##  8 O             Gene        MGI:1918931 0610009K14Rik RIKEN cDNA … lncRNA gene 
##  9 O             Gene        MGI:1914088 0610009L18Rik RIKEN cDNA … lncRNA gene 
## 10 O             Gene        MGI:1915609 0610010K14Rik RIKEN cDNA … protein cod…
## # ℹ 20 more rows
## # ℹ 13 more variables: chromosome <fct>, start <int>, end <int>, strand <fct>,
## #   genetic_map_pos <dbl>, genbank_id <list>, refseq_trp_id <list>,
## #   refseq_prt_id <list>, ensembl_trp_id <list>, ensembl_prt_id <list>,
## #   swiss_prt_id <list>, tr_embl_prt_id <list>, unigene_id <chr>

GenBank, RefSeq, Ensembl and UniProtKB identifiers

These variables hold one or more identifiers associated with each genetic marker:

  • genbank_id: GenBank identifier(s)
  • refseq_trp_id: RefSeq transcript identifier(s)
  • refseq_prt_id: RefSeq protein identifier(s)
  • ensembl_trp_id: Ensembl transcript identifier(s)
  • ensembl_prt_id: Ensembl protein identifier(s)
  • swiss_prt_id: UniProtKB/Swiss-Prot identifier(s)
  • tr_embl_prt_id: UniProtKB/TrEMBL identifier(s)
  • unigene_id: UniGene identifier

Except for unigene_id, all these variables are list-columns, so each can hold multiple values for the same genetic marker.
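To check which columns of a table are list-columns, and how many identifiers each marker carries, base R is enough. The sketch below uses a small hand-made data frame rather than the real report: the first row reuses three Ensembl transcript identifiers reported for MGI:1915609 in this section, while the second row (MGI:0000000 and its identifier) is a made-up placeholder.

```r
# Toy stand-in for the report (only the first row's IDs come from the report).
toy <- data.frame(
  marker_id = c("MGI:1915609", "MGI:0000000"),  # second ID is a placeholder
  unigene_id = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)
toy$ensembl_trp_id <- list(
  c("ENSMUST00000021180", "ENSMUST00000021181", "ENSMUST00000100950"),
  "ENSMUST00000000000"  # hypothetical identifier
)

# Which columns are list-columns?
is_list_col <- vapply(toy, is.list, logical(1))
list_cols <- names(toy)[is_list_col]

# Number of identifiers per marker, without unnesting.
n_ids <- lengths(toy$ensembl_trp_id)

print(list_cols)
print(n_ids)
```

Counting with lengths() keeps the one-row-per-marker layout, which can be handy when you only need a summary and not the identifiers themselves.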

Nesting these data in list-columns keeps the table at one record (row) per genetic marker. The downside is that the individual identifiers are not as readily accessible as they would be in atomic columns. To unnest them you can use tidyr::unnest_longer() or, as in the example below, tidyr::unnest().

Here is an example with marker MGI:1915609, where we unnest Ensembl transcript identifiers so that we have one per row:

assoc_to_seq_ids |>
  dplyr::filter(marker_id == "MGI:1915609") |>
  dplyr::select("marker_id", "marker_symbol", "ensembl_trp_id") |>
  tidyr::unnest("ensembl_trp_id") |>
  print(n = Inf)
## # A tibble: 14 × 3
##    marker_id   marker_symbol ensembl_trp_id    
##    <chr>       <chr>         <chr>             
##  1 MGI:1915609 0610010K14Rik ENSMUST00000021180
##  2 MGI:1915609 0610010K14Rik ENSMUST00000021181
##  3 MGI:1915609 0610010K14Rik ENSMUST00000100950
##  4 MGI:1915609 0610010K14Rik ENSMUST00000102569
##  5 MGI:1915609 0610010K14Rik ENSMUST00000108575
##  6 MGI:1915609 0610010K14Rik ENSMUST00000108576
##  7 MGI:1915609 0610010K14Rik ENSMUST00000108577
##  8 MGI:1915609 0610010K14Rik ENSMUST00000108578
##  9 MGI:1915609 0610010K14Rik ENSMUST00000108579
## 10 MGI:1915609 0610010K14Rik ENSMUST00000134700
## 11 MGI:1915609 0610010K14Rik ENSMUST00000135390
## 12 MGI:1915609 0610010K14Rik ENSMUST00000138658
## 13 MGI:1915609 0610010K14Rik ENSMUST00000148878
## 14 MGI:1915609 0610010K14Rik ENSMUST00000150504

Note, however, that the meaning of each row has changed: each row now represents one marker / Ensembl transcript combination.
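One consequence worth keeping in mind: by default tidyr::unnest() drops rows whose list element is empty, so markers without any identifier can vanish from the result. The sketch below illustrates this on a hand-made tibble. How the real report encodes a marker with no identifiers is not stated here, so the empty character vector and the placeholder marker MGI:0000000 are assumptions made for illustration.

```r
library(tibble)
library(tidyr)

# Toy table: the second marker is a placeholder with no transcripts (assumed encoding).
toy <- tibble(
  marker_id = c("MGI:1915609", "MGI:0000000"),
  ensembl_trp_id = list(
    c("ENSMUST00000021180", "ENSMUST00000021181"),
    character(0)
  )
)

# Default behaviour: the marker with an empty element is dropped.
dropped <- unnest(toy, "ensembl_trp_id")

# keep_empty = TRUE keeps it, with NA in place of the identifier.
kept <- unnest(toy, "ensembl_trp_id", keep_empty = TRUE)

print(dropped)
print(kept)
```

If every marker must remain represented after unnesting, keep_empty = TRUE is the safer choice; otherwise compare the number of distinct marker_id values before and after.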



marker_status: genetic marker status is a factor of two levels: 'O' for official, and 'W' for withdrawn. Official indicates a currently in-use genetic marker, whereas withdrawn means that the symbol or name was once approved but has since been replaced.
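As a small illustration of working with this factor, the sketch below builds a hand-made marker_status vector (the values are invented, not taken from the report) and selects the official markers.

```r
# Hand-made status vector with the two documented levels.
marker_status <- factor(c("O", "W", "O"), levels = c("O", "W"))

# Logical index of official markers; use it to subset rows of the report.
is_official <- marker_status == "O"

# Tally of markers per status.
status_counts <- table(marker_status)
print(status_counts)
```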


marker_type: genetic marker type is a factor of 10 levels: Gene, GeneModel, Pseudogene, DNA Segment, Transgene, QTL, Cytogenetic Marker, BAC/YAC end, Complex/Cluster/Region, Other Genome Feature. See ?marker_type_definitions for the meaning of each type.


marker_id: MGI accession identifier. A unique alphanumeric character string that unambiguously identifies a particular record in the Mouse Genome Informatics database. The format is MGI:nnnnnn, where each n is a digit; the number of digits varies (e.g. MGI:1915609 has seven).
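A quick sanity check on identifiers can be written with a regular expression. The sketch below assumes a valid identifier is the prefix "MGI:" in upper case followed by one or more digits; the pattern is an illustration, not an official MGI specification. The first two identifiers appear in the report output above, the others are deliberately malformed.

```r
ids <- c("MGI:1915609", "MGI:1341858", "mgi:1915609", "1915609", "MGI:19A5609")

# TRUE where the string is "MGI:" followed only by digits.
is_mgi_id <- grepl("^MGI:[0-9]+$", ids)

print(data.frame(id = ids, valid = is_mgi_id))
```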


marker_symbol: marker symbol is a unique abbreviation of the marker name.


marker_name: marker name is a word or phrase that uniquely identifies the genetic marker, e.g. a gene or allele name.


feature_type: an attribute of a portion of a genomic sequence. See the dataset ?feature_type_definitions for details.


chromosome: mouse chromosome name. Possible values are the names of the autosomes, the sex chromosomes and the mitochondrial chromosome.


start: genomic start position (one-based coordinate).


end: genomic end position (one-based coordinate).
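With start and end in hand, the length of a feature can be derived. The sketch below assumes the end position is inclusive, which the text above does not state explicitly, so treat the "+ 1" as an assumption to verify. The coordinates are invented for illustration.

```r
# Hypothetical coordinates, not taken from the report.
coords <- data.frame(
  start = c(1000L, 2500L),
  end   = c(1999L, 2500L)
)

# Length in base pairs, assuming one-based coordinates with an inclusive end.
feature_length <- with(coords, end - start + 1L)
print(feature_length)
```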


strand: DNA strand, ‘+’ for sense, and ‘-’ for antisense.


genetic_map_pos: genetic map position in centimorgans (cM), a unit of length in a genetic map. Two loci are 1 cM apart if recombination is detected between them in 1% of meioses.


genbank_id: NCBI GenBank identifier(s), a list-column.


refseq_trp_id: NCBI RefSeq transcript identifier(s), a list-column.


refseq_prt_id: NCBI RefSeq protein identifier(s), a list-column.


ensembl_trp_id: Ensembl transcript identifier(s), a list-column.


ensembl_prt_id: Ensembl protein identifier(s), a list-column.


swiss_prt_id: UniProtKB/Swiss-Prot identifier(s), a list-column.


tr_embl_prt_id: UniProtKB/TrEMBL identifier(s), a list-column.


unigene_id: NCBI UniGene identifier(s), a character vector.