GenBank, RefSeq and Ensembl IDs
Source: vignettes/articles/genbank_refseq_ensembl_ids.Rmd
The MGI report MRK_Sequence.rpt provides associations between MGI genetic markers and GenBank, RefSeq, Ensembl and UniProtKB identifiers.
To read this report using the key "genbank_refseq_ensembl_ids", use the following code:
# To read all records (more than 70,000), use `read_report("genbank_refseq_ensembl_ids")`.
(assoc_to_seq_ids <- read_report(report_key = "genbank_refseq_ensembl_ids", n_max = 30L))
## # A tibble: 30 × 19
## marker_status marker_type marker_id marker_symbol marker_name feature_type
## <fct> <fct> <chr> <chr> <chr> <fct>
## 1 O BAC/YAC end MGI:1341858 03B03F DNA segment… BAC/YAC end
## 2 O BAC/YAC end MGI:1341869 03B03R DNA segment… BAC/YAC end
## 3 O Gene MGI:1918911 0610005C13Rik RIKEN cDNA … lncRNA gene
## 4 O Gene MGI:1923503 0610006L08Rik RIKEN cDNA … lncRNA gene
## 5 O Gene MGI:1925547 0610008J02Rik RIKEN cDNA … lncRNA gene
## 6 O Gene MGI:3698435 0610009E02Rik RIKEN cDNA … lncRNA gene
## 7 O Gene MGI:1918921 0610009F21Rik RIKEN cDNA … lncRNA gene
## 8 O Gene MGI:1918931 0610009K14Rik RIKEN cDNA … lncRNA gene
## 9 O Gene MGI:1914088 0610009L18Rik RIKEN cDNA … lncRNA gene
## 10 O Gene MGI:1915609 0610010K14Rik RIKEN cDNA … protein cod…
## # ℹ 20 more rows
## # ℹ 13 more variables: chromosome <fct>, start <int>, end <int>, strand <fct>,
## # genetic_map_pos <dbl>, genbank_id <list>, refseq_trp_id <list>,
## # refseq_prt_id <list>, ensembl_trp_id <list>, ensembl_prt_id <list>,
## # swiss_prt_id <list>, tr_embl_prt_id <list>, unigene_id <chr>
GenBank, RefSeq, Ensembl and UniProtKB identifiers
These variables hold one or more identifiers associated with each genetic marker:
- genbank_id: GenBank identifier(s)
- refseq_trp_id: RefSeq transcript identifier(s)
- refseq_prt_id: RefSeq protein identifier(s)
- ensembl_trp_id: Ensembl transcript identifier(s)
- ensembl_prt_id: Ensembl protein identifier(s)
- swiss_prt_id: UniProtKB/Swiss-Prot identifier(s)
- tr_embl_prt_id: UniProtKB/TrEMBL identifier(s)
- unigene_id: UniGene identifier¹
Except for unigene_id, all these variables are list-columns, so each may hold several values for the same genetic marker.
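To make the idea of a list-column concrete, here is a minimal sketch on a made-up table. The identifiers and the names toy, n_per_marker and first_marker_ids are illustrative assumptions, not values from the report. It shows that each cell holds a whole character vector, which base R can count with lengths() and extract with [[.

```r
# Toy stand-in for the report: two markers with made-up identifiers.
toy <- tibble::tibble(
  marker_id = c("MGI:0000001", "MGI:0000002"),
  ensembl_trp_id = list(c("ENSMUST_A", "ENSMUST_B", "ENSMUST_C"), "ENSMUST_D")
)

# lengths() gives the number of identifiers stored in each list-column cell.
n_per_marker <- lengths(toy$ensembl_trp_id)

# `[[` pulls out one cell as an ordinary character vector.
first_marker_ids <- toy$ensembl_trp_id[[1]]
```

The same calls should work on the real list-columns, for example lengths(assoc_to_seq_ids$ensembl_trp_id).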
Having these data nested in list-columns offers the convenience of a table in which each record (row) corresponds to one genetic marker. The downside is that these multiple identifiers are not as readily accessible as they would be if stored in atomic columns. To unnest them you may use tidyr::unnest_longer()².
Here is an example with marker MGI:1915609, where we unnest Ensembl transcript identifiers so that we have one per row:
assoc_to_seq_ids |>
  dplyr::filter(marker_id == "MGI:1915609") |>
  dplyr::select("marker_id", "marker_symbol", "ensembl_trp_id") |>
  tidyr::unnest("ensembl_trp_id") |>
  print(n = Inf)
## # A tibble: 14 × 3
## marker_id marker_symbol ensembl_trp_id
## <chr> <chr> <chr>
## 1 MGI:1915609 0610010K14Rik ENSMUST00000021180
## 2 MGI:1915609 0610010K14Rik ENSMUST00000021181
## 3 MGI:1915609 0610010K14Rik ENSMUST00000100950
## 4 MGI:1915609 0610010K14Rik ENSMUST00000102569
## 5 MGI:1915609 0610010K14Rik ENSMUST00000108575
## 6 MGI:1915609 0610010K14Rik ENSMUST00000108576
## 7 MGI:1915609 0610010K14Rik ENSMUST00000108577
## 8 MGI:1915609 0610010K14Rik ENSMUST00000108578
## 9 MGI:1915609 0610010K14Rik ENSMUST00000108579
## 10 MGI:1915609 0610010K14Rik ENSMUST00000134700
## 11 MGI:1915609 0610010K14Rik ENSMUST00000135390
## 12 MGI:1915609 0610010K14Rik ENSMUST00000138658
## 13 MGI:1915609 0610010K14Rik ENSMUST00000148878
## 14 MGI:1915609 0610010K14Rik ENSMUST00000150504
Note, however, that the meaning of each row has changed: each row now represents one marker/Ensembl transcript combination rather than one marker.
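Because unnesting changes the unit of observation, it can help to check how many rows each marker now contributes, or to collapse the long table back to one row per marker. The following is a minimal sketch on a made-up table. The identifiers and the names toy, long, per_marker and renested are illustrative assumptions, not part of the report.

```r
# Toy stand-in for the report: one row per marker, made-up identifiers.
toy <- tibble::tibble(
  marker_id = c("MGI:0000001", "MGI:0000002"),
  ensembl_trp_id = list(c("ENSMUST_A", "ENSMUST_B", "ENSMUST_C"), "ENSMUST_D")
)

# After unnesting, each row is one marker/transcript combination.
long <- tidyr::unnest_longer(toy, ensembl_trp_id)

# Count how many rows (transcripts) each marker contributes.
per_marker <- dplyr::count(long, marker_id, name = "n_transcripts")

# Collapse back to one row per marker by rebuilding the list-column.
renested <- long |>
  dplyr::group_by(marker_id) |>
  dplyr::summarise(ensembl_trp_id = list(ensembl_trp_id), .groups = "drop")
```

Counting on the long table and then re-nesting is one way to keep both views available. Whether to work on the nested or the unnested form depends on which unit of observation the analysis needs.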
Variables
marker_status
marker_status: genetic marker status, a factor with two levels: 'O' for official and 'W' for withdrawn. Official indicates a genetic marker currently in use, whereas withdrawn means that the symbol or name was once approved but has since been replaced.
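As a small illustration of working with this factor, the sketch below tallies the two levels and keeps only official markers. The table toy and its symbols are made up for the example. Only the level codes 'O' and 'W' come from the description above.

```r
# Made-up markers; only the level codes "O" and "W" follow the report.
toy <- tibble::tibble(
  marker_symbol = c("GeneA", "GeneB", "GeneC"),
  marker_status = factor(c("O", "W", "O"), levels = c("O", "W"))
)

# Tally how many markers fall under each status.
status_counts <- table(toy$marker_status)

# Keep only markers that are currently in use.
official <- dplyr::filter(toy, marker_status == "O")
```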
marker_type
marker_type: genetic marker type, a factor with 10 levels: Gene, GeneModel, Pseudogene, DNA Segment, Transgene, QTL, Cytogenetic Marker, BAC/YAC end, Complex/Cluster/Region, Other Genome Feature. See ?marker_type_definitions for the meaning of each type.
marker_id
marker_id: MGI accession identifier. A unique alphanumeric character string that unambiguously identifies a particular record in the Mouse Genome Informatics database. The format is MGI:nnnnnn, where each n is a digit; the number of digits is not fixed (the identifiers shown above have seven).
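The identifier format can be checked with a regular expression. The sketch below assumes that the number of digits is not fixed (the identifiers shown earlier have seven), so the pattern accepts one or more digits after the MGI: prefix. The helper name is_mgi_id is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE when a string is "MGI:" followed only by digits.
is_mgi_id <- function(x) {
  grepl("^MGI:[0-9]+$", x)
}

# One well-formed identifier and two malformed ones.
checks <- is_mgi_id(c("MGI:1915609", "MGI:19A5609", "1915609"))
```

A check like this only validates the shape of the string. It does not confirm that the identifier exists in the MGI database.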
marker_name
marker_name: marker name, a word or phrase that uniquely identifies the genetic marker, e.g. a gene or allele name.
feature_type
feature_type: an attribute of a portion of a genomic sequence. See the dataset ?feature_type_definitions for details.
chromosome
chromosome: mouse chromosome name. Possible values are the names of the autosomes, the sex chromosomes and the mitochondrial chromosome.