The MGI report MRK_Sequence.rpt provides associations between MGI genetic markers and GenBank, RefSeq, Ensembl and UniProtKB identifiers.

To read this report using the key "genbank_refseq_ensembl_ids", use the following code:

# To read all records (more than 70,000), use `read_report("genbank_refseq_ensembl_ids")`.
(assoc_to_seq_ids <- read_report(report_key = "genbank_refseq_ensembl_ids", n_max = 30L))
## # A tibble: 30 × 19
##    marker_status marker_type marker_id   marker_symbol marker_name  feature_type
##    <fct>         <fct>       <chr>       <chr>         <chr>        <fct>       
##  1 O             BAC/YAC end MGI:1341858 03B03F        DNA segment… BAC/YAC end 
##  2 O             BAC/YAC end MGI:1341869 03B03R        DNA segment… BAC/YAC end 
##  3 O             Gene        MGI:1918911 0610005C13Rik RIKEN cDNA … lncRNA gene 
##  4 O             Gene        MGI:1923503 0610006L08Rik RIKEN cDNA … lncRNA gene 
##  5 O             Gene        MGI:1925547 0610008J02Rik RIKEN cDNA … lncRNA gene 
##  6 O             Gene        MGI:3698435 0610009E02Rik RIKEN cDNA … lncRNA gene 
##  7 O             Gene        MGI:1918921 0610009F21Rik RIKEN cDNA … lncRNA gene 
##  8 O             Gene        MGI:1918931 0610009K14Rik RIKEN cDNA … lncRNA gene 
##  9 O             Gene        MGI:1914088 0610009L18Rik RIKEN cDNA … lncRNA gene 
## 10 O             Gene        MGI:1915609 0610010K14Rik RIKEN cDNA … protein cod…
## # ℹ 20 more rows
## # ℹ 13 more variables: chromosome <fct>, start <int>, end <int>, strand <fct>,
## #   genetic_map_pos <dbl>, genbank_id <list>, refseq_trp_id <list>,
## #   refseq_prt_id <list>, ensembl_trp_id <list>, ensembl_prt_id <list>,
## #   swiss_prt_id <list>, tr_embl_prt_id <list>, unigene_id <chr>

GenBank, RefSeq, Ensembl and UniProtKB identifiers

These variables hold one or more identifiers associated with each genetic marker:

  • genbank_id: GenBank identifier(s)
  • refseq_trp_id: RefSeq transcript identifier(s)
  • refseq_prt_id: RefSeq protein identifier(s)
  • ensembl_trp_id: Ensembl transcript identifier(s)
  • ensembl_prt_id: Ensembl protein identifier(s)
  • swiss_prt_id: UniProtKB/Swiss-Prot identifier(s)
  • tr_embl_prt_id: UniProtKB/TrEMBL identifier(s)
  • unigene_id: UniGene identifier

Except for unigene_id, all these variables are list-columns, and provide, potentially, multiple values for the same genetic marker.
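As a quick way to see how many identifiers each marker carries, the base R function lengths() can be applied to a list-column. The sketch below uses a small toy table rather than the real report: the Ensembl identifiers are copied from the example output in this article, while the second marker ("MGI:0000000") and its empty entry are hypothetical placeholders, and the column name n_ensembl_trp is made up for illustration.

```r
library(tibble)

# Toy table mimicking the report's structure.
# "MGI:0000000" is a hypothetical marker with no Ensembl transcripts.
toy <- tibble(
  marker_id = c("MGI:1915609", "MGI:0000000"),
  ensembl_trp_id = list(
    c("ENSMUST00000021180", "ENSMUST00000021181", "ENSMUST00000100950"),
    character(0)
  )
)

# lengths() returns the number of elements in each entry of a list.
toy$n_ensembl_trp <- lengths(toy$ensembl_trp_id)
toy
```

The same call should work on the other list-columns of the report, assuming they are stored as lists of character vectors as in this toy table.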

Nesting these data in list-columns keeps the table at one row per genetic marker. The downside is that the individual identifiers are not as readily accessible as they would be in atomic columns. To unnest them, you may use tidyr::unnest() or tidyr::unnest_longer().

Here is an example with marker MGI:1915609, where we unnest Ensembl transcript identifiers so that we have one per row:

assoc_to_seq_ids |>
  dplyr::filter(marker_id == "MGI:1915609") |>
  dplyr::select("marker_id", "marker_symbol", "ensembl_trp_id") |>
  tidyr::unnest("ensembl_trp_id") |>
  print(n = Inf)
## # A tibble: 14 × 3
##    marker_id   marker_symbol ensembl_trp_id    
##    <chr>       <chr>         <chr>             
##  1 MGI:1915609 0610010K14Rik ENSMUST00000021180
##  2 MGI:1915609 0610010K14Rik ENSMUST00000021181
##  3 MGI:1915609 0610010K14Rik ENSMUST00000100950
##  4 MGI:1915609 0610010K14Rik ENSMUST00000102569
##  5 MGI:1915609 0610010K14Rik ENSMUST00000108575
##  6 MGI:1915609 0610010K14Rik ENSMUST00000108576
##  7 MGI:1915609 0610010K14Rik ENSMUST00000108577
##  8 MGI:1915609 0610010K14Rik ENSMUST00000108578
##  9 MGI:1915609 0610010K14Rik ENSMUST00000108579
## 10 MGI:1915609 0610010K14Rik ENSMUST00000134700
## 11 MGI:1915609 0610010K14Rik ENSMUST00000135390
## 12 MGI:1915609 0610010K14Rik ENSMUST00000138658
## 13 MGI:1915609 0610010K14Rik ENSMUST00000148878
## 14 MGI:1915609 0610010K14Rik ENSMUST00000150504

Note, however, that the meaning of each row has now changed: each row represents a marker / Ensembl transcript combination.
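One consequence of this change is that markers without any identifier in the unnested column disappear from the result by default. The sketch below illustrates this with a toy table and the keep_empty argument of tidyr::unnest(). It assumes that a marker without identifiers is stored as a zero-length character vector, which may not match how the report represents missing values; "MGI:0000000" is a hypothetical marker.

```r
library(tibble)
library(tidyr)

# Toy table: one marker with two transcripts (taken from the output above)
# and one hypothetical marker with none.
toy <- tibble(
  marker_id = c("MGI:1915609", "MGI:0000000"),
  ensembl_trp_id = list(
    c("ENSMUST00000021180", "ENSMUST00000021181"),
    character(0)
  )
)

# Default behaviour: the marker with an empty entry is dropped.
dropped <- unnest(toy, "ensembl_trp_id")

# keep_empty = TRUE keeps that marker, with NA as its identifier.
kept <- unnest(toy, "ensembl_trp_id", keep_empty = TRUE)

nrow(dropped)
nrow(kept)
```

Keeping empty entries is useful when the unnested table must still account for every marker, for example before counting markers or joining back to other reports.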

Variables

marker_status

marker_status: genetic marker status is a factor of two levels: 'O' for official, and 'W' for withdrawn. Official indicates a currently in-use genetic marker, whereas withdrawn means that the symbol or name was once approved but has since been replaced.

marker_type

marker_type: genetic marker type is a factor of 10 levels: Gene, GeneModel, Pseudogene, DNA Segment, Transgene, QTL, Cytogenetic Marker, BAC/YAC end, Complex/Cluster/Region, Other Genome Feature. See ?marker_type_definitions for the meaning of each type.

marker_id

marker_id: MGI accession identifier. A unique alphanumeric character string that is used to unambiguously identify a particular record in the Mouse Genome Informatics database. The format is the prefix MGI: followed by a string of digits, e.g. MGI:1915609.
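If you need to check that a vector of strings looks like MGI accession identifiers, a regular expression is one option. This is only a sketch: the pattern assumes the prefix MGI: followed by one or more digits and does not check that the identifier exists in the database. The third value is a deliberately malformed example.

```r
# Two identifiers from the output above and one string missing the prefix.
ids <- c("MGI:1915609", "MGI:1341858", "1915609")

# TRUE where the string is "MGI:" followed only by digits.
is_mgi_id <- grepl("^MGI:[0-9]+$", ids)
is_mgi_id
```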

marker_symbol

marker_symbol: marker symbol is a unique abbreviation of the marker name.

marker_name

marker_name: marker name is a word or phrase that uniquely identifies the genetic marker, e.g. a gene or allele name.

feature_type

feature_type: an attribute of a portion of a genomic sequence. See the dataset ?feature_type_definitions for details.

chromosome

chromosome: mouse chromosome name. Possible values are the names of the autosomes, the sex chromosomes or the mitochondrial chromosome.

start

start: genomic start position (one-offset).

end

end: genomic end position (one-offset).
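With one-offset coordinates, the length of a feature can be derived from start and end. The sketch below assumes that both positions are inclusive, which this article does not state explicitly. The coordinates are hypothetical and not taken from the report.

```r
# Hypothetical one-offset coordinates for two features.
start <- c(100L, 2500L)
end <- c(199L, 2500L)

# Assuming both ends are inclusive, a feature spanning a single base
# (start equal to end) has length 1.
feature_length <- end - start + 1L
feature_length
```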

strand

strand: DNA strand, ‘+’ for sense, and ‘-’ for antisense.

genetic_map_pos

genetic_map_pos: genetic map position in centiMorgan (cM): a unit of length in a genetic map. Two loci are 1 cM apart if recombination is detected between them in 1% of meioses.

genbank_id

genbank_id: NCBI GenBank identifier(s), a list-column.

refseq_trp_id

refseq_trp_id: NCBI RefSeq transcript identifier(s), a list-column.

refseq_prt_id

refseq_prt_id: NCBI RefSeq protein identifier(s), a list-column.

ensembl_trp_id

ensembl_trp_id: Ensembl transcript identifier(s), a list-column.

ensembl_prt_id

ensembl_prt_id: Ensembl protein identifier(s), a list-column.

swiss_prt_id

swiss_prt_id: UniProtKB/Swiss-Prot identifier(s), a list-column.

tr_embl_prt_id

tr_embl_prt_id: UniProtKB/TrEMBL identifier(s), a list-column.

unigene_id

unigene_id: NCBI UniGene identifier(s), a character vector.