Wednesday, October 31, 2012

Gene Names Are Broken

With the completion of the Human Genome Project and the advent of next-generation sequencing technologies, the wealth of information about our genes is growing at a rapid pace.  Figuring out the roles of these genes is a complex undertaking in and of itself.  However, we make things much harder for ourselves by giving genes multiple symbols, or symbols that already mean something else.  Try finding all abstracts in PubMed that discuss the gene KIT.  Not so easy (especially since the searches are case-insensitive and "kit" may refer to many other things).  Some of the gene names are as amusing as they are ridiculous (see this blog for some interesting ones, such as pokemon or sonic hedgehog, which is the approved gene name).

Now, I'm not the first to notice this, and there is actually a committee called the HUGO Gene Nomenclature Committee (HGNC) to "assign unique gene symbols and names to over 33,500 human loci".  Too bad some of these symbols are utterly useless.  The symbols may be unique with respect to other gene symbols, but they are far from unique or distinguishing once they appear in ordinary text.

Here are a few of my (least) favorites; there are numerous other examples:
KIT
CAT
MAX
ACE
BAD
BID
Edit (Nov 5, 2012):  Here are a few more wonderful examples:
LARGE
IMPACT
SET
REST
MET
PIGS
SHE
CAMP
PC
NODAL
COIL
CAST
COPE
POLE
CLOCK
ATM
RAN
CAPS

And the worst one ever, drumroll . . . .
T :  Yes, there is a gene with the approved symbol of T (I pity the fool). Good luck finding any information about that gene.

Here is a breakdown of the lengths of the gene symbols:
# of Names    Length of Name
         1                 1
        31                 2
       615                 3
      3560                 4
      6296                 5
      4699                 6
      2468                 7
      1143                 8
       216                 9
        18                10
         1                11
         0                12
         1                13
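
A breakdown like the one above can be produced with a few lines of Python. This is only a minimal sketch: the function name symbol_length_breakdown is made up for illustration, and the sample list is just a handful of symbols from this post rather than the full HGNC set, so the counts it prints will not match the table.

```python
from collections import Counter


def symbol_length_breakdown(symbols):
    """Return {symbol length: number of symbols}, sorted by length."""
    cleaned = (s.strip() for s in symbols)
    counts = Counter(len(s) for s in cleaned if s)  # skip empty entries
    return dict(sorted(counts.items()))


# Hypothetical sample: a few symbols from this post, not the full HGNC list.
sample = ["T", "PC", "KIT", "CAT", "MAX", "ACE", "BAD", "BID", "LARGE", "IMPACT"]

print("# of Names    Length of Name")
for length, count in symbol_length_breakdown(sample).items():
    print(f"{count:>10}    {length:>14}")
```

To reproduce the real table, the sample list would be replaced with the approved symbols read from a file, one per line.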

Why does this matter?  There is too much information out there for a single person, or even an army of people, to sit down and wade through.  We increasingly need automated methods to help cull and process the information.  But when it is a challenging problem just to find the terms we are interested in, we are starting down a difficult road before we even get into the car.

Often an abstract or paper will use the gene symbol rather than the full, laborious gene name, and these "official gene symbols" are too nondescript to be useful in an automated search.  As information about genes accumulates faster than we can manually curate it, useful information might be lost or false conclusions may be drawn due to the ambiguity of our naming conventions.
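
To make the search problem concrete, here is a rough sketch of what an automated matcher runs into. Everything in it is an assumption for illustration: COMMON_WORDS is a tiny hand-picked stand-in for a real English dictionary, the function names are hypothetical, and the example sentence is invented. It is not how PubMed itself searches.

```python
import re

# Hypothetical stand-in for a real English word list.
COMMON_WORDS = {"kit", "cat", "max", "ace", "bad", "bid", "set", "rest",
                "met", "she", "camp", "large", "impact", "clock"}


def ambiguous_symbols(symbols, common_words=COMMON_WORDS):
    """Return the symbols that collide with an ordinary word, ignoring case."""
    return [s for s in symbols if s.lower() in common_words]


def find_mentions(text, symbol, case_sensitive=False):
    """Return every whole-word match of symbol in text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b", flags)
    return [m.group(0) for m in pattern.finditer(text)]


text = "Each kit was used to measure KIT expression."  # invented sentence
print(find_mentions(text, "KIT"))                       # ['kit', 'KIT']
print(find_mentions(text, "KIT", case_sensitive=True))  # ['KIT']
print(ambiguous_symbols(["KIT", "TP53", "CAT"]))        # ['KIT', 'CAT']
```

Matching case-sensitively removes the false hit here, but it would not help with an all-caps title or a sentence that happens to capitalize the ordinary word, which is why the symbols themselves are the real problem.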
