HUPO Proteomics Standards Initiative Protein Interaction Specification Documentation

Proteomics Standards Initiative

Molecular Interaction XML Format Documentation

Version 1.0

Introduction
Purpose of the PSI MI XML format
Purpose of this document
Structure of a PSI MI record
Detailed Documentation
Use of external controlled vocabularies
List of planned features
How to comment
Available data
Tools
Data submission
Further information and relevant links

Introduction

The Proteomics Standards Initiative (PSI) aims to define community standards for data representation in proteomics to facilitate data comparison, exchange and verification.

The Proteomics Standards Initiative was founded at the HUPO meeting in Washington, April 28-29, 2002 (see Science 296, 827). As a first step, the PSI is developing standards for two key areas of proteomics: mass spectrometry and protein-protein interaction data.

This document describes the molecular interaction data exchange format. PSI is following a leveled approach to building this specification. Level 1 will describe protein interactions at a basic level that covers a large amount of currently available data. Subsequent levels will add capability to represent new molecular interaction information that the community wishes to exchange.

The scope of PSI MI is currently limited to protein-protein interactions. Other molecules, such as small molecules, DNA and RNA, may be taken into account in the future.

PSI MI was designed by a group of people including representatives from database providers and users in both academia and industry. PSI MI is supported by the DIP, MINT, IntAct, BIND and HPRD databases.

Purpose of the PSI MI XML format

The PSI MI format is a data exchange format for protein-protein interactions. It is not a proposed database structure. Intended uses are described in the use case documentation. These use case descriptions also provide hints for tools to be developed in the future.

Purpose of this document

The purpose of this document is to describe the general structure of the PSI MI XML specification in a more user-friendly manner than the specification does itself. For the detailed and most up-to-date description please see the auto-generated documentation. This documentation also provides additional information, e.g. sample data files and use case descriptions.

Structure of a PSI MI record

The root element of a PSI MI XML file is the entrySet. An entrySet contains one or more entries. Each entry is a self-contained unit, so the contents of multiple files can be concatenated into a single file simply by adding all their entries to one entrySet.
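The concatenation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the PSI MI tool set: it assumes documents without an XML namespace and without attributes on the entrySet, both of which real files typically carry, and the two input strings are made-up placeholders.

```python
import xml.etree.ElementTree as ET


def merge_entry_sets(documents):
    """Collect the entry elements of several entrySet documents into one entrySet."""
    merged = ET.Element("entrySet")
    for text in documents:
        root = ET.fromstring(text)
        # Each entry is self-contained, so it can be moved over unchanged.
        for entry in root.findall("entry"):
            merged.append(entry)
    return merged


# Placeholder inputs (assumption): one file with one entry, one with two.
file_a = "<entrySet><entry><source/></entry></entrySet>"
file_b = "<entrySet><entry><source/></entry><entry><source/></entry></entrySet>"

merged = merge_entry_sets([file_a, file_b])
print(len(merged.findall("entry")))  # prints 3
```

No identifiers need to be rewritten in this sketch because it relies on the statement that every entry carries its own lists; whether identifiers are unique across entries is not checked here.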

Figure 1: The entry top level element

Each entry describes one or more protein interactions. The PSI MI format can be used in two forms, a compact and an expanded form. In the compact form, all interactors (proteins), experiments, and availability statements are described once in the respective list elements, and then only referred to by references from the individual interactions in the interactionList. The compact form allows a dense, non-repetitive representation of the data, in particular for large data sets.
In the expanded form, all proteins, experiments, and availability statements are described directly in the interaction element. As a result, each interaction is a self-contained element providing all necessary information. The expanded form results in larger files, but is more suitable for conversion to displayed data, e.g. HTML pages. The PSI MI consortium provides tools to convert the compact into the expanded form and back.
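The relationship between the two forms can be illustrated with a small Python sketch that resolves interactor references in a compact entry. The conversion tools provided by the consortium are XSLT stylesheets; this sketch is only an illustration of the idea. The element and attribute names used here (`interactor`, `interactorRef`, `id`, `ref`, `names`, `shortLabel`) are assumptions chosen for readability and should be checked against the schema. Experiments and availability statements are left out, and the interactorList is left in place.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical compact entry: interactors are listed once and referenced by id.
COMPACT = """
<entry>
  <interactorList>
    <interactor id="p1"><names><shortLabel>bait protein</shortLabel></names></interactor>
    <interactor id="p2"><names><shortLabel>prey protein</shortLabel></names></interactor>
  </interactorList>
  <interactionList>
    <interaction>
      <participantList>
        <participant><interactorRef ref="p1"/></participant>
        <participant><interactorRef ref="p2"/></participant>
      </participantList>
    </interaction>
  </interactionList>
</entry>
"""


def expand_interactors(entry):
    """Replace each interactorRef with a copy of the interactor it points to."""
    by_id = {i.get("id"): i for i in entry.findall("interactorList/interactor")}
    for participant in list(entry.iter("participant")):
        for index, child in enumerate(list(participant)):
            if child.tag == "interactorRef":
                participant[index] = copy.deepcopy(by_id[child.get("ref")])
    return entry


entry = expand_interactors(ET.fromstring(COMPACT))
labels = [p.findtext("interactor/names/shortLabel") for p in entry.iter("participant")]
print(labels)  # prints ['bait protein', 'prey protein']
```

After the call, each participant carries a full interactor description, which is what makes an expanded interaction self-contained.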

In the next sections, the top level elements shown in Figure 1 and their function will be described.

The source element describes the source of the entry, usually the organisation which provides it. It also contains a release (number) and a releaseDate.

The availabilityList provides statements on the availability of the data, usually copyright statements. In the current version, the availability statements are free text. The PSI MI format might later be extended to provide predefined availability statements.

The experimentList contains experimentDescriptions. Each experimentDescription describes one set of experimental parameters, usually associated with a single publication. In large-scale experiments, normally only one parameter is varied across a series of experiments, usually the bait. The PSI MI format describes the constant parameters, e.g. experimental techniques, in an experimentDescription, while the variable parameters, e.g. the bait, are described in the interaction element.

The interactorList describes a set of interactors participating in an interaction. In the current version of the PSI MI standard, interactors are proteins. It is planned to extend this to other types, for example small molecules, in future versions. The interactor element describes the "normal" form of a protein, consisting of the "administrative" data like name and cross references, and organism and amino acid sequence. Attributes which are relevant for a specific interaction, in particular sequence features, are described in the participant element within an interaction.

Figure 2: Interaction element

The interactionList contains one or more interaction elements. Each interaction contains a description of the data availability (copyright), and a description of the experimental conditions under which it has been determined. Both of these can either be integrated into the interaction element (expanded form) or refer to the respective lists in the entry (compact form) as described above.

Figure 3: Participant element

Each interaction contains two or more participants, the molecules participating in the interaction. Each participant element contains a description of the molecule in its "normal" form, either by reference to an element of the interactorList, or directly in an interactor element.
Additional elements of the participant element describe the specific form of the molecule in which it participated in the interaction. The featureList describes sequence features of the protein, for example binding domains relevant for the interaction. The role describes the particular role of the protein in the experiment, usually whether the protein was a bait or prey.

The attributeLists are placeholders for semi-structured additional data the data provider might want to transmit. They contain simple tag-value pairs.
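Because an attributeList holds simple tag-value pairs, it maps naturally onto a dictionary. The following Python sketch shows one way to read it. The layout assumed here (an `attribute` element with a `name` attribute and the value as element text) and the sample values are assumptions for illustration, not taken from the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical attributeList with made-up tags and values.
SNIPPET = """
<attributeList>
  <attribute name="comment">exploratory screen</attribute>
  <attribute name="curator">example curator</attribute>
</attributeList>
"""


def attributes_to_dict(attribute_list):
    """Return the tag-value pairs of an attributeList as a dictionary."""
    return {
        a.get("name"): (a.text or "").strip()
        for a in attribute_list.findall("attribute")
    }


pairs = attributes_to_dict(ET.fromstring(SNIPPET))
print(pairs)
```

A dictionary keeps only the last value for a repeated tag; if a provider repeats tags, a list of pairs would be the safer container.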

Detailed Documentation

see http://psidev.sourceforge.net/mi/xml/doc/MIF.html

Use of external controlled vocabularies

Where possible, external controlled vocabularies are referenced from PSI MI. External controlled vocabularies are used in two forms:

Open controlled vocabularies: We think that no existing controlled vocabulary provides all necessary terms for the given attribute in the PSI MI format. In this case, it is up to the data provider to choose a controlled vocabulary, or to provide a free text string if no appropriate controlled vocabulary exists.
Closed controlled vocabularies: We think that there is a controlled vocabulary which appropriately covers all necessary terms for the given attribute. In this case, only terms from the defined vocabulary should be used.

The following closed controlled vocabularies are referenced by PSI MI:

interaction type
sequence feature type
feature detection
participant detection
interaction detection

These CVs are grouped together in one pair of *.dag (hierarchy) and *.def (definitions) files in GeneOntology flat file format (all files; GO format: psi-mi.dag, psi-mi.def; HTML version of GO format).
The correctness of references to external controlled vocabularies is currently not enforced by the PSI MI schema. It is the responsibility of the data provider to ensure that only existing terms at an up-to-date data source are referenced. The World Wide Web Consortium (W3C) has recently issued a new recommendation for referencing between XML documents. Once this recommendation is implemented by standard software, we will include it in the PSI MI schema.
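Since the schema does not enforce these references, a data provider may want to run a check of their own before releasing a file. The Python sketch below shows the idea under several assumptions: the set of known terms would in practice be loaded from the vocabulary files (two placeholder identifiers are hard-coded here), and the reference layout (`primaryRef` elements with `db` and `id` attributes) and the wrapper element names are illustrative rather than taken from the schema.

```python
import xml.etree.ElementTree as ET

# Placeholder identifiers (assumption) standing in for terms read from the vocabulary files.
KNOWN_TERMS = {"MI:0001", "MI:0002"}

# Hypothetical fragment with one known and one unknown term reference.
SNIPPET = """
<entry>
  <interactionDetection><xref><primaryRef db="psi-mi" id="MI:0001"/></xref></interactionDetection>
  <interactionType><xref><primaryRef db="psi-mi" id="MI:9999"/></xref></interactionType>
</entry>
"""


def unknown_terms(root, known, db="psi-mi"):
    """Return referenced identifiers for the given database that are not in `known`."""
    referenced = {r.get("id") for r in root.iter("primaryRef") if r.get("db") == db}
    return sorted(referenced - set(known))


missing = unknown_terms(ET.fromstring(SNIPPET), KNOWN_TERMS)
print(missing)  # prints ['MI:9999']
```

An empty result only means that every referenced identifier exists in the supplied set; it says nothing about whether the term is the appropriate one for the attribute.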

List of planned features

Because we are following a leveled approach, we are interested in knowing what the community wishes to be included in the next level.

The following items have been tagged for inclusion in the next level:

Intramolecular interactions
Inclusion of other molecule types, e.g. DNA, RNA, small molecules

The latest list of features to discuss/include in the future can be found here:
http://sourceforge.net/tracker/?atid=511101&group_id=65472&func=browse

How to comment

If you would like to comment on this document or the PSI MI XML specification, please send an email to:
psidev-mi-dev@lists.sourceforge.net

Available data

The sample files contain authentic data, though we don't take any responsibility for the correctness of the data.
HPRD provides PSI MI formatted downloads of interaction data by adding /psimi to a standard molecule view URL. Sample URL:
http://www.hprd.org/protein/00150/psimi
Hybrigenics H.pylori dataset from
Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, Chemama Y, Labigne A, Legrain P.: The protein-protein interaction map of Helicobacter pylori. Nature. 2001 Jan 11;409(6817):211-5. PMID: 11196647.
Available after registration from http://pim.hybrigenics.com.
IntAct offers PSI MI data files at ftp://ftp.ebi.ac.uk:/pub/databases/intact/current/xml as well as dynamically generated PSI MI data: a graphical interaction network viewer allows the displayed network to be downloaded in PSI MI format, for example for further analysis in one of the tools listed below. Sample URL:
http://www.ebi.ac.uk/intact/graph2mif/getXML?ac=EBI-367&depth=1&strict=false
MINT offers the complete dataset in PSI MI format.
DIP offers the complete dataset in PSI MI format.
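The HPRD and IntAct sample URLs above follow simple patterns that can be assembled programmatically. The Python sketch below only builds the URL strings and does not contact either service. The function names are hypothetical, and the interpretation of the IntAct parameters (`ac` as an accession, `depth` as a network depth, `strict` as a boolean flag) is an assumption inferred from the sample URL.

```python
from urllib.parse import urlencode

# Base address taken from the IntAct sample URL above.
INTACT_BASE = "http://www.ebi.ac.uk/intact/graph2mif/getXML"


def hprd_psimi_url(molecule_view_url):
    """Append /psimi to a standard HPRD molecule view URL."""
    return molecule_view_url.rstrip("/") + "/psimi"


def intact_graph2mif_url(accession, depth=1, strict=False):
    """Build a download URL shaped like the IntAct sample URL."""
    query = urlencode({"ac": accession, "depth": depth, "strict": str(strict).lower()})
    return f"{INTACT_BASE}?{query}"


print(hprd_psimi_url("http://www.hprd.org/protein/00150"))
print(intact_graph2mif_url("EBI-367"))
```

With these inputs, the two printed URLs are identical to the sample URLs listed above.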

Tools

PSI MI XML format is supported by a growing list of tools. Currently available are:

MIF_compact.xsl: Transforms a file from the expanded into the compact form.
MIF_expand.xsl: Transforms a file from the compact into the expanded form.
MIF_view.xsl: Creates a simple HTML view of the PSI-MI file. This does not show all details of the PSI-MI record.
Cytoscape: Tool for analysis of interaction networks with plugin for PSI reading.
PIMWalker: Interaction network viewer with PSI reading.
XMLMakerFlattener: Converts PSI MI XML format into tab-delimited ASCII format (flat files) and vice versa.
Atlas Integrated Database System: Imports PSI-MI data into a relational database system.

Data submission

The following databases currently accept submissions of PSI MI formatted interaction data:

BIND: Contact and submission by email: info@bind.ca
DIP: Contact and submission by email: dip@mbi.ucla.edu
HPRD: Web-based submission tool: http://psimi.ibioinformatics.org/
IntAct: Contact and submission by email: datasubs@ebi.ac.uk
MINT: Contact and submission by email: cesareni@uniroma2.it

Further information and relevant links

Databases involved:

Companies involved:

Related Efforts:

BioPAX