BioInformatics: IBM Resources and Research

IBM is a DRIVING FORCE of

PARADIGM SHIFTs in LIFE SCIENCES

Everybody knows that IBM is a huge multinational corporation with revenue larger than the entire GDP of some countries. Those of us who are old enough can remember when IBM was schooled by Bill Gates and Microsoft, and many wondered if IBM would ever recover. Well, not only have they recovered, they are probably stronger than ever! This is a company that is, and has always been, on the leading edge of science: not just computer technology but everything from cosmology to biology. This post attempts to outline what IBM is doing in the area of bioinformatics and computational biology.

IBM Leads the Way

I've talked about the explosion in bioinformatics here, here and here, but IBM represents an explosion all by itself. They clearly see life sciences as a major consumer of their platforms and technologies into the future. Other similar companies like Oracle, Sun Microsystems (now part of IBM), Hewlett-Packard and Intel are following IBM's lead, but apparently not with the same vigor and commitment. IBM is actually a direct and substantial contributor on the research side, which is why they are uniquely positioned to understand and build the tools needed by researchers. This strategy has worked extremely well for IBM because they can make a major investment in a market for years before it actually becomes profitable - how many companies can do that?

Bioinformatics and Pattern Discovery Group

This group works on both theoretical and applied problems in computational molecular biology. The group is primarily focused on developing algorithms that mine data without understanding the actual nature of the data! Obviously this has huge applicability to the life science data being accumulated every day. The algorithms are stated to be generic, but it's not exactly clear what that means or where the limitations are. The algorithms and data generated by the group are made freely available in what they term biological repositories. These repositories can be accessed through web services, and in some cases can be downloaded for local processing.
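
To make "mining data without understanding its nature" a little more concrete, here is a minimal sketch in Python. It is not IBM's algorithm and does not use any IBM code; the function name discover_patterns, its parameters k and min_support, and the toy sequences are all assumptions made up for illustration. It simply reports every substring of a fixed length that recurs in at least a given number of input strings, without knowing whether those strings are DNA, protein, or plain text.

```python
from collections import defaultdict


def discover_patterns(sequences, k=4, min_support=2):
    """Return every length-k substring seen in at least min_support sequences.

    Hypothetical helper for illustration only: the input is treated as
    opaque strings, with no biological knowledge built in.
    """
    support = defaultdict(set)  # pattern -> indices of sequences containing it
    for index, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            support[seq[start:start + k]].add(index)
    return {
        pattern: sorted(ids)
        for pattern, ids in support.items()
        if len(ids) >= min_support
    }


if __name__ == "__main__":
    # Toy input; real repositories would hold far longer sequences.
    toy = ["ACGTACGT", "TTACGTAA", "GGGGCCCC"]
    for pattern, ids in sorted(discover_patterns(toy, k=4).items()):
        print(pattern, ids)
```

A fixed pattern length keeps the sketch short. A production pattern discovery tool would presumably handle variable-length patterns and wildcards, which is where the real algorithmic difficulty lies.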

PTGS and RNAi

In the late 90s the process of RNA interference (RNAi) was discovered, and this discovery led to what is called post-transcriptional gene silencing (PTGS). This process was originally thought to act as a defensive mechanism against genome marauders, but it is now understood to be a vital part of gene regulation and expression. IBM has contributed, and continues to contribute, enormously to a paradigm shift away from conventional thinking about cell regulation, and its researchers have hypothesized the following:

  1. Researchers believe there are a few tens of thousands of microRNAs in the human genome,
  2. microRNAs target almost 90% of the protein coding genes, and
  3. each microRNA may target several thousand individual genes.

Junk DNA

IBM was one of the first to realize that Junk DNA potentially represented a critical aspect of the (human) genome and has executed large-scale computational analysis to reconcile the protein coding regions of DNA with the non-coding regions, or Junk as it has traditionally been called. It has been found that piRNAs have a gene silencing capability similar to interference RNAs, but there is considerable uncertainty as to how they are generated. IBM's work predicted the existence of piRNAs, which was later reported by three different groups. I guess one man's junk is another man's treasure...

Systems Biology

IBM defines this term as "an effort to study and understand biological systems by bringing together theoretical, experimental and computational approaches." Basically, this means building a holistic view of the organism by studying the hierarchical relationships between genes, proteins, pathways, organelles, etc. By default, this is a cross-disciplinary approach to collecting and organizing information and data. The group focuses on three specific aspects of systems biology:

  1. developing methods and tools for discovering the parts of each level of the hierarchy,
  2. characterizing the behavior of those parts, both static and dynamic, and
  3. discovering relationships between the parts.

The reference to parts in this context means things like pathway components, miRNA precursors, gene permutations, etc. The last area the group works in relates to medical informatics, which is a nebulous term. What this translates to is developing methods and tools to help collect, organize and analyze medical data. To me this means applications, or packages of services that include applications, although it's not clear. As I said, the term medical informatics is nebulous and can change radically based on context.

They (Pattern Discovery Group) have a slew of current activities, almost all of which are on the edge of current knowledge. This must be a great group to be a part of in addition to being a great place to work! Anyway, here are some of their current activities:

CURRENT ACTIVITIES
  • analysis of the "junk" DNA of eukaryotic organisms
  • mouse embryonic stem cell differentiation
  • tools for the analysis of gene expression data
  • cancer from the standpoint of cell process regulation
  • gene discovery in prokaryotic genomes
  • comparative and evolutionary genomics
  • RNA interference in eukaryotes and viruses
  • rational engineering of antimicrobial peptides

This is only a partial list; there are actually closer to 20 activities listed on their site here.

Content and Code

In addition to developing tools and methods, the group continuously produces metadata that is made available through the website. For example, this link makes available automatically generated annotations for the proteomes of over 120 genomes, although this data looks a little stale (I guess the proteomes haven't changed). Additionally, the bio-dictionaries for several archaeal and bacterial genomes are posted here.

Access to the IBM web servers, as well as the associated software code, is freely available provided you are not a commercial entity; if you are, a different arrangement must be made. The pattern matching application is located here, and it provides a simple interface that walks the user through the criteria collection as it cascades through the data sets. If you want to download the actual source code for the pattern matching application, you must first accept an agreement here and then enter some demographic data before you can actually download. I believe it is written in Perl.

Links

Here are some key links (some are repeated) to various IBM resources and content that represent both computational biology and medical informatics. This is by no means an exhaustive list but it does provide a good set of starting points to troll through the categories and see what else is available.

  • Computational Biology and Medical Informatics website,
  • Bioinformatics and Pattern Discovery Group website,
  • Web Server Access and Code Downloads page,
  • Web Services for Bioinformatics, Part 1, Part 2 and Part 3,
  • IBM's Haifa Research Laboratory (Israel) Bioinformatics page,
  • Oxford Journal article (2004) describing the Pattern Matching Group.

Summary

As usual, IBM is leading the way in several new areas of research and development, including bioinformatics and computational biology. They have poured, and continue to pour, resources into a market that is only now starting to blossom into a major consumer of their technologies, which is clearly a good investment. In addition to developing technologies, IBM continues to drive fundamental research in important yet esoteric areas of understanding like Junk DNA content and function.

The entire industry is benefiting from IBM's commitment to developing markets that require massive amounts of long-term investment. This is an approach to business development that IBM has legitimately turned into a massive advantage because it perfectly positions them in the middle of what's happening - brilliant.

I've included a video that provides a six-and-a-half-minute introduction to the Computational Biology Research Initiatives at IBM's research center in New York. Watch the video for more details about what they are doing.


Comments

Edit and Clarification

As Dr. Pellionisz points out, I indicated that Sun Microsystems was purchased by IBM when it was in fact purchased by Oracle. I knew this but somehow typed IBM anyway. I apologize for the error and appreciate it being pointed out.

I don't think IBM views this market to be anything like that of the PC market in the 80s and they are certainly not positioning themselves the same way. The research they are doing is allowing them to play the user, the customer and innovator role all at the same time. Seems to me that approach will lead to a better offering of products and services regardless of the actual science they produce. In my opinion, the science is being done just to help establish the appropriate business models and is secondary to the prime objective - make money.

IBM is also nicely positioned relative to the range of disciplines required for the next round of new science and new thinking; consider that they are a leader in nanotechnology and materials research, computational phenomena and architectures, biology via genetics, etc.

In terms of large scale computing architectures there are very few companies that can stand next to IBM, especially on the enterprise side. Scientific computing has not traditionally been their strong suit in terms of revenue, but I think they believe (as I do) that this can change with the new computing requirements of the life sciences. When you look ten years out at computational resource (storage, CPU, etc.) consumption, it is very likely that research (nano and life science) will far exceed that of the enterprise. This spells huge opportunity for IBM because they are positioned to take advantage of it.

The Government is effectively useless, trying desperately to keep up with the patent and legal issues that these new technologies bring. With the cycle of paradigm shifts compressing, it is likely the situation will get worse instead of better; let's hope they don't hurt progress as they did under the Bush administration.

Insightful post Pellionisz_at_JunkDNA.com, thanks.

It is for IBM to lose ...

I am very excited about Dave Tribbett's devotion to a series of deep analyses of postmodern genome informatics - as a core of critical mass, producing an explosion in the life sciences. I believe Dave’s present article is also great. Formally, there are only minor errors (for instance, to correct that Sun Microsystems is "now part of IBM" - actually on January 27, 2010 Sun was acquired by Oracle Corporation for US$7.4 billion, based on an agreement signed on April 20, 2009).


I more than agree with the "big picture" that "Life Science computing" is for IBM to lose - just as the PC was an "IBM PC", only to be lost to Microsoft, which was a very young and aggressive company compared to the perhaps too big and at times ossified IBM. Years ago I wrote a prediction of the "Big One": an earthquake-like effect whenever the tectonic plates of "Big IT" and private "Big Pharma" pile up upon "Government Genomics". That predicted time is now.

Indeed, with great devotion and expertise, Caroline Kovac (recently retired) built a spectacular "IBM Life Science" program over at least a quarter of a century - and as Dave’s embedded YouTube overview shows, a huge array of activities in the Life Science Computing Program is thriving at IBM. I am particularly familiar with the trailblazing work of Isidore Rigoutsos and his colleagues, aiming at novel pattern recognition algorithms to address DNA structure.


Yet, I daresay that the very enormity of the wide-spectrum Life Science effort of IBM might similarly result in IBM losing this game, the winner of which (just like in the early days of the "PC") just could not be predicted. In retrospect, a Monday Morning Quarterback "wisdom" might come to the crystal-clear "analysis" that the monstrous (and monstrously complex) IBM lost a singular focus. They wanted both to hold the key to the hardware architecture and (belatedly) to gain the upper hand in the PC OS (the famed OS/2). They ended up securing neither - and a long time ago sold the entire IBM PC division for a token price. Likewise, what is the Government's slice in the pie of "PC"? - Just about zero.


Now take Genome Informatics. I was thrilled to hear about the IBM "Blue Gene" from Caroline Kovac at the Monterey 50th Anniversary of the Double Helix (February 2003, exactly 7 years ago). Observing my surge of enthusiasm, she felt it necessary to utter some words of wisdom as a "caveat". She reminded me that the "Life Science Division" she was leading was NOT in control of Blue Gene at all - the World's fastest supercomputer (at that time) belonged to IBM's "Computer Development Division"... Those of us who have ever worked in huge operations like IBM (or the US Government) know that divisions of an entity are compelled AGAINST cooperation, since in fact they pitch their efforts to COMPETE AGAINST EACH OTHER for one single pool of resources of the entity.


Thus, I am less than fully convinced that any single existing "Big Whatever" (including the US Government...) is going to win "Genome Informatics". If history is any lesson, the winning combination is more likely to be a laser-beam focused EMERGING entity (like Microsoft was, at that time) that grabs the singularly most crucial tenet (that was the OS and killer apps software for the PC) and develops global alliances with just about all participating entities (minus perhaps IBM and Apple, which decided NOT to take part in the cooperation). Far East manufacturers wisely opted in – and came out as winners of the hardware production business. Intel also decided to focus on the pure play of CPU serial chip design and likewise became a big winner.


So what is the focus of "Genome Informatics"? Chances are that one answers: "DNA sequencing". Wrong! Affordably revealing the full human genome is a necessary, but not sufficient step. To take the cliche, "Genome Projects" are often compared to the "Moon Shot" (see e.g. Nobelist Sydney Brenner's very recent essay), a comparison he says is quite literally true: "Getting a man on the Moon is relatively easy". "Getting the man safely back from the Moon is the harder part".


The AAAS grand annual meeting is wrapping up today - and Silicon Valley is teeming with not one but two most successful DNA sequencing Centers (Complete Genomics and Pacific Biosciences) rolling affordable full DNA sequences off the assembly line, like Ford Model Ts avalanched the US from Detroit. (For sustainability, that had to be matched with an entire network of gas stations…)


While I was not attending AAAS in San Diego, in the coverage I did not detect any major speech pinpointing that DNA sequencing does not reveal the "Language of Genome" at all. It does bring into plain view all the A,C,T,G “letters”, like when you obtain a copy of "War and Peace" with all Cyrillic letters of the Russian language on display - yet “readers” (except those who mastered the Russian language) will understand absolutely nothing of what the overwhelming number of letters might mean.


There are plenty of "Genome Sequencing Centers". To win "Genome Informatics" we need one more "Sequencing Center" like a hole in the head. Instead, we desperately need a private domain core-company that is totally focused on Genome ANALYSIS, such that a "Center of Genome Interpretation" emerges around it, reaching out to all R&D and business of the land.


Absolutely (just like Microsoft's OS and Killer apps), the core of this crucial fulcrum will not remain isolated, but will spread to virtually all hardware and software companies that are eager to grab a slice of the PostModern Genome Informatics market (as big as global Genome Based Economy is...).


Take the mentioned Intel. They bought into Genomics (now "Informatics") by contributing to a $100 M investment in Genome Sequencing by PacBio (2007), and put together a (small) "Downstream Data Analysis Group for Genomics" - which Intel disbanded when the global financial crisis hit us hard. The rationale might have been that "let's focus on sequencing first - the downstream analysis can wait; besides, as long as anybody does it with boxes with ‘Intel inside’ on them, the IntelVC investment in sequencing is safe". Moreover, technically it would take Intel just a few days, weeks or months (at most) to re-assemble an order of magnitude more potent "Downstream Genome Data Analysis Group". However, de facto, organizations in huge industries hardly ever move that fast, since decision-making is hierarchical and needs elaboration, submission and evaluation by committees of umpteen layers of corporate structure. Thus, it remains an open question if the huge freighter ship of IBM or Intel might turn tighter corners.


Meanwhile, Microsoft, Google, HP, Dell, Oracle could "cut in" - since all of them have put together their Life Science programs, in readiness for the "health-care business" - for the uncertain time when the Government might be ready with reforms accelerating e.g. digital health-data repositories. It might be debatable, though, if any of them singularly focuses on "DNA Functional Analysis" - as it takes a quite unprecedented multiple domain expertise of disparate fields, which need a "psychological welding" (borrowing Nola Masterson's term) to get the two hemispheres of "genome informatics" together (my favorite term refers to the much drier anatomical structure of "corpus callosum"; the massive bundle of cable system connecting the right and left hemispheres of the brain).

So, where is the Government? Is there any Program, at any of the countless branches of the organization, where "Algorithmic (software enabling) approaches to interpretation of genome function" are elevated to the role this issue will ultimately have?


Hardly.


The Government is not geared to take the enormous risk that scientists might THINK (as some, indeed, might not). The Government's role is to program what contractors DO. But it is all right, since actually some "Genome Computing Architectures" (glimpsed e.g. in the first seconds of http://www.youtube.com/watch?v=mSRMCDCVg6Y ), a "Nurture server" (ingredients of what UPC barcoded nutrients contain) and a "Nature server" (knowledge-base of dietary consequences of genomic conditions and environments) are not only eminently doable "on the cloud" - but both USDA and OSHA are actually well on the way to having implemented (part of) it. Since ALL government R&D (in the USA) is paid from our precious tax-dollars, it is just a simple mandate to issue that any/all government-funded research results pertaining to both "Nurture" and "Nature" aspects of genomics must be uploaded to a government-maintained cloud computing repository - and made available for US individuals and industries (sorry, not for Countries that don't pay taxes to support such a repository of strategic value).


The author of this blog-reply, having decades of experience in Academia, Industry and Government (see bio at http://www.usa-siliconvalley.com), realizes full well that the above might be perceived (and even mis-labeled) as a "pipe-dream", just as "the Manhattan Project" could have been (wrongly) labeled the "pipe-dream" of Albert when he signed Leo's letter to President Roosevelt on August 2nd, 1939. Yes, the "Plan" received an initial pittance of $4,000 from the Government to get going - but ultimately changed history.


Pellionisz_at_JunkDNA.com
