Molecular Interaction XML Format Documentation
Version 1.0
Introduction
The Proteomics Standards Initiative (PSI) aims to define
community standards for data representation in proteomics to
facilitate data comparison, exchange and verification.
The Proteomics Standards Initiative was founded at the HUPO
meeting in Washington, April 28-29, 2002 (see
Science 296, 827). As a first step, the PSI is
developing standards for two key areas of proteomics: mass
spectrometry and protein-protein interaction data.
This document describes the molecular interaction data
exchange format. PSI is following a leveled approach to
building this specification. Level 1 will describe protein
interactions at a basic level that covers a large amount of
currently available data. Subsequent levels will add
capability to represent new molecular
interaction information that the community wishes to
exchange.
The scope of PSI MI is currently limited to
protein-protein interactions. Other molecules, such as small
molecules, DNA and RNA may be taken into account in the
future.
PSI MI was designed by a group of people including
representatives from database providers and users in both
academia and industry. PSI MI is supported by
the DIP, MINT, IntAct, BIND and HPRD databases.
Purpose of the PSI MI XML format
The PSI MI format is a data exchange format for
protein-protein interactions. It is not a proposed database
structure. Intended uses are described in the use case
documentation. The use case descriptions also suggest tools that
could be developed in the future.
Purpose of this document
The purpose of this document is to describe the general
structure of the PSI MI XML specification in a more
user-friendly manner than the specification does itself. For
the detailed and most up-to-date description please see the auto-generated documentation. This
documentation also provides additional information, e.g.
sample data files and use case descriptions.
Structure of a PSI MI record
The root element of a PSI MI XML file is the entrySet. An
entrySet contains one or more entries. Each entry is a
self-contained unit. This makes it easy to concatenate the
contents of multiple files into a single file by simply adding
all the entries to one entrySet.
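The concatenation described above can be sketched in a few lines of Python. This is a minimal illustration, not a PSI MI tool: the function name merge_entry_sets and the two tiny stand-in documents are assumptions made for this example, and real files carry a namespace and full entry content.

```python
import xml.etree.ElementTree as ET


def merge_entry_sets(documents):
    """Combine the entries of several entrySet documents into one entrySet."""
    roots = [ET.fromstring(text) for text in documents]
    if not roots:
        raise ValueError("need at least one entrySet document")
    # Reuse the first root's tag and attributes for the merged entrySet.
    merged = ET.Element(roots[0].tag, roots[0].attrib)
    for root in roots:
        # Entries are self-contained, so they can be appended unchanged.
        for child in root:
            merged.append(child)
    return merged


# Hypothetical stand-in documents, heavily simplified for illustration.
FILE_A = "<entrySet><entry><source/></entry></entrySet>"
FILE_B = "<entrySet><entry><source/></entry><entry><source/></entry></entrySet>"

merged = merge_entry_sets([FILE_A, FILE_B])
print(len(merged.findall("entry")))  # prints 3
```

Because no entry refers to anything outside itself, the merge needs no renumbering or reference fixing; it only copies child elements.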
Figure 1: The entry top level element
Each entry describes one or more protein interactions. The
PSI MI format can be used in two forms, a compact and an
expanded form. In the compact form, all interactors
(proteins), experiments, and availability statements are
described once in the respective list elements, and then only
referred to by references from the individual interactions in
the interactionList. The compact form allows a dense,
non-repetitive representation of the data, in particular for
large data sets.
In the expanded form, all proteins, experiments, and
availability statements are described directly in the
interaction element. As a result, each interaction is a
self-contained element providing all necessary information.
The expanded form results in larger files, but is more suitable
for conversion to displayed data, e.g. HTML pages. The PSI MI
consortium provides tools to convert the compact into the
expanded form and back.
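To make the difference between the two forms concrete, the following Python sketch inlines interactor descriptions into each participant. It is not the consortium's conversion tool. The element names interactorList, interactor, interactionList, interaction and participant follow this document; the reference element interactorRef, the id and ref attributes, and the names child are assumptions made for illustration and may differ from the actual schema. Experiments and availability statements would be handled the same way.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical compact-form entry, heavily simplified.
COMPACT = """
<entry>
  <interactorList>
    <interactor id="p1"><names>protein A</names></interactor>
    <interactor id="p2"><names>protein B</names></interactor>
  </interactorList>
  <interactionList>
    <interaction>
      <participant><interactorRef ref="p1"/></participant>
      <participant><interactorRef ref="p2"/></participant>
    </interaction>
  </interactionList>
</entry>
"""


def expand_interactors(entry):
    """Replace every interactorRef with a copy of the interactor it points to."""
    by_id = {el.get("id"): el for el in entry.findall("./interactorList/interactor")}
    for participant in list(entry.iter("participant")):
        for index, child in enumerate(list(participant)):
            if child.tag == "interactorRef":
                # Copy, so that each interaction owns its own description.
                full = copy.deepcopy(by_id[child.get("ref")])
                full.tail = child.tail
                participant[index] = full
    # The shared list is redundant once everything is inlined.
    entry.remove(entry.find("interactorList"))
    return entry


entry = expand_interactors(ET.fromstring(COMPACT))
print(ET.tostring(entry, encoding="unicode"))
```

The reverse direction would collect identical interactor elements into a list and replace them with references, which is why the compact form is denser for large data sets.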
The following paragraphs describe the top level elements shown
in Figure 1 and their functions.
The source element describes the source of the entry, usually
the organisation which provides it. It also contains a release
(number) and a releaseDate.
The availabilityList provides statements on the availability
of the data, usually copyright statements. In the current
version, the availability statements are free text. The PSI MI
format might later be extended to provide predefined
availability statements.
The experimentList contains experimentDescriptions.
Each experimentDescription describes one set of experimental
parameters, usually associated with a single publication. In
large-scale experiments, normally only one parameter is varied
across a series of experiments, usually the bait. The PSI MI
format describes the constant parameters, e.g. experimental
techniques, in an experimentDescription, while the variable
parameters, e.g. the bait, are described in the interaction
element.
The interactorList describes a set of interactors
participating in an interaction. In the current version of the
PSI MI standard, interactors are proteins. It is planned to
extend this to other types, for example small molecules, in
future versions. The interactor element describes the "normal"
form of a protein: "administrative" data such as name and cross
references, together with the organism and amino acid sequence.
Attributes which are relevant for a specific
interaction, in particular sequence features, are described in
the participant element within an interaction.
Figure 2: Interaction element
The interactionList contains one or more interaction
elements. Each interaction contains a description of the data
availability (copyright), and a description of the experimental
conditions under which it has been determined. Both of these
can either be integrated into the interaction element
(expanded form) or refer to the respective lists in the entry
(compact form) as described above.
Figure 3: Participant element
Each interaction contains two or more participants, the
molecules participating in the interaction. Each participant
element contains a description of the molecule in its "normal"
form, either by reference to an element of the interactorList,
or directly in an interactor element.
Additional elements of the participant element describe the
specific form of the molecule in which it participated in the
interaction. The featureList describes sequence features of
the protein, for example binding domains relevant for the
interaction. The role describes the particular role of the
protein in the experiment, usually whether the protein was a
bait or prey.
The attributeLists are placeholders for semi-structured
additional data the data provider might want to transmit. They
contain simple tag-value pairs.
For details, see http://psidev.sourceforge.net/mi/xml/doc/MIF.html
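The tag-value pairs held in an attributeList can be read into a dictionary as sketched below. The child element name attribute, its name attribute, and the sample values are assumptions made for this illustration; check the schema documentation for the actual layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical attributeList with invented values.
SNIPPET = """
<attributeList>
  <attribute name="comment">curated from the literature</attribute>
  <attribute name="kd">1.5e-9</attribute>
</attributeList>
"""


def read_attributes(attribute_list):
    """Return the tag-value pairs of an attributeList as a dict of strings."""
    pairs = {}
    for attribute in attribute_list.findall("attribute"):
        # Values are free text, so they are kept as stripped strings.
        pairs[attribute.get("name")] = (attribute.text or "").strip()
    return pairs


attributes = read_attributes(ET.fromstring(SNIPPET))
print(attributes)
```

Since the pairs are only semi-structured, a consumer should treat unknown tags as optional extras rather than rely on them being present.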
Use of external controlled vocabularies
Where possible, external controlled vocabularies are
referenced from PSI MI. External controlled vocabularies are
used in two forms:
- Open controlled vocabularies: We think that no existing
  controlled vocabulary provides all necessary terms for the
  given attribute in the PSI MI format. In this case, it is
  up to the data provider to choose a controlled vocabulary,
  or to provide a free text string if no appropriate
  controlled vocabulary exists.
- Closed controlled vocabularies: We think that there is a
  controlled vocabulary which appropriately covers all
  necessary terms for the given attribute. In this case,
  only terms from the defined vocabulary should be used.
The following closed controlled vocabularies are referenced
by PSI MI:
- interaction type
- sequence feature type
- feature detection
- participant detection
- interaction detection
These CVs are grouped together in one pair of *.dag
(hierarchy) and *.def (definitions) files in Gene Ontology flat
file format (all files, GO format: psi-mi.dag, psi-mi.def, HTML
version of GO format).
The correctness of references to external controlled
vocabularies is currently not enforced by the PSI MI schema.
It is the responsibility of the data provider to ensure that
only existing terms at an up-to-date data source are
referenced. The World Wide Web Consortium (W3C) has recently
issued a new recommendation for
referencing between XML documents. Once this
recommendation is implemented by standard software, we will
include it in the PSI MI schema.
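Since the schema does not check vocabulary references, a data provider may want a small check of its own before releasing a file. The Python sketch below is one possible approach under stated assumptions: references are taken to appear as primaryRef elements with db and id attributes, the wrapping element names are invented, and the term identifiers are placeholders rather than a statement about the actual vocabulary content. The set of known terms would in practice be loaded from an up-to-date copy of the vocabulary files.

```python
import xml.etree.ElementTree as ET

# Placeholder identifiers standing in for an up-to-date term list.
KNOWN_TERMS = {"MI:0018", "MI:0218"}

# Hypothetical fragment; element and attribute names are assumptions.
SNIPPET = """
<entry>
  <interactionDetection>
    <xref><primaryRef db="psi-mi" id="MI:0018"/></xref>
  </interactionDetection>
  <interactionType>
    <xref><primaryRef db="psi-mi" id="MI:9999"/></xref>
  </interactionType>
</entry>
"""


def unknown_term_ids(root, known_terms, db="psi-mi"):
    """Return the sorted term ids referenced for `db` that are not known."""
    referenced = {
        ref.get("id")
        for ref in root.iter("primaryRef")
        if ref.get("db") == db
    }
    return sorted(referenced - set(known_terms))


print(unknown_term_ids(ET.fromstring(SNIPPET), KNOWN_TERMS))  # ['MI:9999']
```

An empty result means every referenced identifier was found in the supplied term list; it says nothing about whether the term was the right one to use.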
List of planned features
Because we are following a leveled approach, we are
interested in knowing what the community wishes to be included
in the next level.
The following items have been tagged for inclusion in the
next level:
- Intramolecular interactions
- Inclusion of other molecule types, e.g. DNA, RNA, small
  molecules
The latest list of features to discuss/include in the future
can be found here:
http://sourceforge.net/tracker/?atid=511101&group_id=65472&func=browse
How to comment
If you would like to comment on this document or the PSI MI
XML specification, please send an email to:
psidev-mi-dev@lists.sourceforge.net
Available data
- The sample files contain authentic data, though we do not
  take any responsibility for the correctness of the data.
- HPRD provides PSI MI formatted downloads of interaction
  data: append /psimi to a standard molecule view URL.
  Sample URL: http://www.hprd.org/protein/00150/psimi
- Hybrigenics H. pylori dataset from
  Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C,
  Simon S, Lenzen G, Petel F, Wojcik J, Schachter V,
  Chemama Y, Labigne A, Legrain P.: The protein-protein
  interaction map of Helicobacter pylori. Nature. 2001 Jan
  11;409(6817):211-5. PMID: 11196647.
  Available after registration from http://pim.hybrigenics.com.
- IntAct offers PSI MI data files at
  ftp://ftp.ebi.ac.uk:/pub/databases/intact/current/xml
  and dynamically generated PSI MI data: a graphical
  interaction network viewer allows the displayed network to
  be downloaded in PSI MI format, for example for further
  analysis in one of the tools listed below. Sample URL:
  http://www.ebi.ac.uk/intact/graph2mif/getXML?ac=EBI-367&depth=1&strict=false
- MINT offers the complete dataset in PSI MI format.
- DIP offers the complete dataset in PSI MI format.
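The two sample URLs in the list above follow simple patterns that can be built programmatically. The Python sketch below only constructs the URL strings and does not contact either server. The function names are invented for this example, and the meaning of the depth and strict parameters is not documented here, so they are passed through as given in the sample URL.

```python
from urllib.parse import urlencode


def hprd_psimi_url(molecule_view_url):
    """Append /psimi to a standard HPRD molecule view URL."""
    return molecule_view_url.rstrip("/") + "/psimi"


def intact_network_url(accession, depth=1, strict=False):
    """Build a download URL shaped like the IntAct sample URL."""
    base = "http://www.ebi.ac.uk/intact/graph2mif/getXML"
    query = urlencode(
        {"ac": accession, "depth": depth, "strict": str(strict).lower()}
    )
    return base + "?" + query


print(hprd_psimi_url("http://www.hprd.org/protein/00150"))
print(intact_network_url("EBI-367"))
```

With the arguments shown, both calls reproduce the sample URLs quoted in the list.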
Tools
The PSI MI XML format is supported by a growing list of tools.
Currently available are:
Data submission
The following databases currently accept submissions of PSI
MI formatted interaction data:
Further information and relevant links
Databases involved:
Companies involved:
Related Efforts: