Note to the reader: Pyntacle guesses the format of input files by default, unless it is explicitly specified using the -f/--file-format parameter. Similarly, the --output-format parameter can be set when using Pyntacle to write a network to a file.

Network file formats


Adjacency matrix

Commonly used file extensions:

adjm, adjmat, adjacencymatrix

An adjacency matrix is a square n×n matrix whose row index i and column index j refer to nodes in a network. A non-zero value in cell aij indicates the presence of an edge connecting nodes i and j. Pyntacle currently supports unweighted networks only, and self-loops are not allowed. The resulting matrix is therefore symmetric, with 0s on the diagonal: it holds a 1 between two distinct nodes if they are connected by an edge, and a 0 otherwise. A training by EMBL presents how to represent graphs as text files.
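The constraints above (square, symmetric, no self-loops) can be checked with a few lines of plain Python. This is only an illustrative sketch, not Pyntacle's own parser: the function name parse_adjacency_matrix and its arguments are hypothetical, and the small three-node matrix is made up for the example.

```python
def parse_adjacency_matrix(text, sep="\t", header=True):
    """Parse an unweighted, undirected adjacency matrix into (names, edges)."""
    rows = [line.split(sep) for line in text.splitlines() if line.strip()]
    if header:
        # column labels; an empty top-left cell, if present, is ignored
        names = [cell for cell in rows[0] if cell != ""]
        # drop the header row and the row label in the first column
        body = [row[1:] for row in rows[1:]]
    else:
        # no header: nodes are named after their zero-based index
        names = [str(i) for i in range(len(rows))]
        body = rows
    matrix = [[int(cell) for cell in row] for row in body]
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("the matrix is not square")
    edges = set()
    for i in range(n):
        if matrix[i][i] != 0:
            raise ValueError("self-loop on node " + names[i])
        for j in range(i + 1, n):
            if matrix[i][j] != matrix[j][i]:
                raise ValueError("the matrix is not symmetric")
            if matrix[i][j]:
                edges.add((names[i], names[j]))
    return names, edges


example = "\n".join([
    "\tA\tB\tC",
    "A\t0\t1\t1",
    "B\t1\t0\t0",
    "C\t1\t0\t0",
])
names, edges = parse_adjacency_matrix(example)
print(names)          # ['A', 'B', 'C']
print(sorted(edges))  # [('A', 'B'), ('A', 'C')]
```

Each undirected edge is reported once, using the upper triangle of the matrix.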

Adjacency matrices usually have a header line, although it is optional. When the header is present, the row and column headers are identical:

A B C D E
A 0 1 1 0 0
B 1 0 1 1 0
C 1 1 0 0 0
D 0 1 0 0 1
E 0 0 0 1 0

This table can be downloaded here

The values contained in the headers populate the name attribute of the nodes. This matrix can be imported from the command line, by setting the -f/--file parameter, or with the following statements:

In [1]:
from pyntacle.io_stream.importer import PyntacleImporter
# example.adjm is a tab-separated adjacency matrix

gr = PyntacleImporter.AdjacencyMatrix(file="example.adjm", header=True, sep="\t")
# print the "name" attribute of the nodes
print(gr.vs()['name'])
Adjacency matrix from example.adjm imported
['A', 'B', 'C', 'D', 'E']

The resulting network is the following:

In [2]:
%matplotlib inline
import random
from igraph import plot
random.seed(1)
plot(gr, vertex_label=gr.vs()["name"])
Out[2]:

When the headers are not available, the name attribute of the vertices is set automatically to a zero-based index ranging from 0 to n-1, where n is the number of nodes in the network.

0 1 1 0 0
1 0 1 1 0
1 1 0 0 0
0 1 0 0 1
0 0 0 1 0

This matrix can be downloaded from here and imported by these statements:

In [3]:
grn = PyntacleImporter.AdjacencyMatrix(file="example_noheader.adjm", header = False, sep = "\t")
print(grn.vs()['name'])
Adjacency matrix from example_noheader.adjm imported
['0', '1', '2', '3', '4']

Which results in:

[Figure 3: the network imported from example_noheader.adjm]

Pyntacle assumes that a header is always present. If it is not, the --no-header or the --no-output-header flag must be set when importing or saving a network file, respectively. When using Pyntacle as a library, the same is achieved by setting the header argument to False in the methods of the PyntacleImporter, PyntacleExporter and PyntacleConverter classes of the io_stream module.

Pyntacle accepts any file extension. Cells of adjacency matrices are expected to be delimited by a tab character (\t), unless otherwise specified. If the separator is not explicitly specified, Pyntacle tries to infer it and falls back to the default choice if it cannot.
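How Pyntacle infers the separator is not described here. As a rough illustration of the idea only, the sketch below picks the first candidate character that occurs the same non-zero number of times on every line, and falls back to a tab otherwise. The function name infer_separator and the list of candidates are assumptions made for this example.

```python
def infer_separator(lines, candidates=("\t", ",", ";", " "), default="\t"):
    """Return the first candidate that occurs the same (non-zero) number of
    times on every non-blank line; fall back to `default` otherwise."""
    lines = [line for line in lines if line.strip()]
    for sep in candidates:
        counts = {line.count(sep) for line in lines}
        if len(counts) == 1 and counts != {0}:
            return sep
    return default


print(repr(infer_separator(["A,B,C", "0,1,1", "1,0,0"])))  # ','
print(repr(infer_separator(["AB", "CD"])))                  # '\t'
```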

Edge list

Commonly used file extensions:

egl, edgl, edgelist

An edge list file contains a series of pairs of nodes. Each pair represents a link connecting, directionally, the left node (a) to the right node (b). If a network is undirected, each pair a → b must be accompanied by a pair b → a.

V1 V2
A B
B A
B C
C B
A C
C A
B D
D B
D E
E D

This edge list is contained in the "example.egl" file that can be downloaded here, loaded and plotted with the following statements:

In [4]:
%matplotlib inline
import random
from pyntacle.io_stream.importer import PyntacleImporter
from igraph import plot

egl = PyntacleImporter.EdgeList(file="example.egl", header=True, sep="\t")
egl.summary()
Edge list from example.egl imported
Out[4]:
"IGRAPH UN-- 5 5 -- ['example']\n+ attr: __implementation (g), __sif_interaction_name (g), name (g), __parent (v), name (v), __sif_interaction (e), adjacent_nodes (e)"

This graph is identical to the one used in the Adjacency matrix section.

An edge list cannot represent isolated nodes unless these have self-loops, which are not allowed in Pyntacle anyway. Hence, the edge list format is recommended only for networks without isolates. Pyntacle supports undirected, unweighted edge lists whose columns are uniformly separated by a single character. When importing edge lists from the command line, the separator character is inferred; if this is not possible for any reason, Pyntacle assumes a tab character. The separator can also be specified in the appropriate io_stream methods.

Note: We recommend removing any blank lines from the edge list file to avoid errors during parsing.
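Before importing an edge list, it can be useful to verify that every pair has its reverse and that no self-loops are present. The following is a minimal sketch in plain Python, not part of Pyntacle: read_edge_list is a hypothetical helper that skips blank lines, collects the undirected edges, and reports the pairs whose reverse is missing.

```python
def read_edge_list(lines, sep="\t", header=True):
    """Return the undirected edges in `lines` and the pairs lacking a reverse."""
    # blank lines are skipped instead of causing a parsing error
    rows = [line.strip().split(sep) for line in lines if line.strip()]
    if header:
        rows = rows[1:]
    directed = set()
    for source, target in rows:
        if source == target:
            raise ValueError("self-loop on node " + source)
        directed.add((source, target))
    # pairs a -> b for which b -> a was never listed
    unpaired = sorted(pair for pair in directed if pair[::-1] not in directed)
    edges = {frozenset(pair) for pair in directed}
    return edges, unpaired


lines = ["V1\tV2", "A\tB", "B\tA", "", "B\tC"]
edges, unpaired = read_edge_list(lines)
print(len(edges))  # 2
print(unpaired)    # [('B', 'C')]
```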

File extensions are not important. Pyntacle assumes that edge lists have a header. If this is not the case, the --no-header/-N or --no-output-header arguments can be set on the command line, as for adjacency matrices. Similarly, the header argument can be set in the methods of the PyntacleImporter, PyntacleExporter or QuickConvert classes of the io_stream module.

Simple Interaction Format (SIF)

Commonly used file extensions:

sif

The Simple Interaction Format (SIF) is one of the network formats most widely used by software packages devoted to network analysis and visualization, such as Cytoscape. Its syntax is simple. A SIF file has at least 3 columns. The first and third columns represent the source and target nodes, and the second column specifies the type of their interaction. The directionality of edges cannot be specified: the column order is only a convention in Cytoscape, where a user can specify which node is the source and which is the target through the GUI. For a detailed description of the SIF file format, please refer to the official SIF file format documentation. Currently, Pyntacle imports SIF files as unweighted and undirected networks.

SIF permits the specification of multiple edges between nodes. This is achieved by replicating a line and changing the interaction type. Since multigraphs are not currently allowed in Pyntacle, multi-edges are automatically collapsed into a single link, while the information is preserved in an ad-hoc edge attribute, __sif_interaction.

The first two interactions of the following file (download it here):

ProteinA Interaction_Type ProteinB
protein_1 physical protein_2
protein_1 activation protein_2
protein_1 physical protein_3

will then be collapsed into a single edge:

In [5]:
sg = PyntacleImporter.Sif(file="example.sif")
SIF from example.sif imported
In [6]:
#we can print the total edges and check if they have been collapsed
print(sg.ecount())
2
In [7]:
# The collapsed edge has index 0. Let's inspect its '__sif_interaction' attribute
print(sg.es(0)["__sif_interaction"])
[['physical', 'activation']]
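The collapsing of multi-edges can be reproduced without Pyntacle. The sketch below is only an approximation of the behaviour shown above, under the assumption that edges are undirected and keyed by their pair of nodes; collapse_sif is a hypothetical name.

```python
def collapse_sif(lines, sep="\t", header=True):
    """Group the interaction types of a SIF file by undirected node pair."""
    rows = [line.strip().split(sep) for line in lines if line.strip()]
    if header:
        rows = rows[1:]
    interactions = {}
    for source, interaction, *targets in rows:
        # a SIF line may list more than one target after the interaction type
        for target in targets:
            key = frozenset((source, target))
            interactions.setdefault(key, []).append(interaction)
    return interactions


sif_lines = [
    "ProteinA\tInteraction_Type\tProteinB",
    "protein_1\tphysical\tprotein_2",
    "protein_1\tactivation\tprotein_2",
    "protein_1\tphysical\tprotein_3",
]
collapsed = collapse_sif(sif_lines)
print(len(collapsed))                                   # 2
print(collapsed[frozenset(("protein_1", "protein_2"))])  # ['physical', 'activation']
```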

The header in a SIF file is optional, but Pyntacle assumes that it is present. If this is not the case, the --no-header/-N or --no-output-header flags can be set on the command line, as for adjacency matrices and edge lists. Similarly, the boolean argument header can be set in the methods of the PyntacleImporter, PyntacleExporter or QuickConvert classes of the io_stream module. If the header is present, the name of the second column is assigned to the reserved __sif_interaction_name graph attribute:

In [8]:
print(sg["__sif_interaction_name"])
Interaction_Type

Both the __sif_interaction_name and the __sif_interaction attributes are always created when a graph is imported, and are set to None when the input file does not provide them. Although reserved, they can be edited and will be written to a SIF file when exporting a graph using the PyntacleExporter class.

Generally, SIF files are separated by tab characters, so Pyntacle assumes \t as the default separator. This choice is tunable. As for the other file formats, the separator character is inferred by Pyntacle, although it can be specified with the sep argument of the corresponding io_stream methods.

DOT Files

Commonly used file extensions:

dot

DOT is a widely used file format to describe and represent networks, adopted by graph visualization tools such as Graphviz. The power of DOT lies in its detailed syntax, which allows mixing information on the architecture of a network with graphical information (like the edge thickness or node colors with gradients). More information on the DOT file format can be found in the official Graphviz documentation. Due to the complexity of the DOT grammar, not all graph libraries support the import of DOT files (NetworkX, for example). Pyntacle is equipped with an ad-hoc parser of DOT files. The parser is currently designed to import undirected networks.

Binary

Commonly used file extensions:

bin, graph

Networks can be imported and exported as binary files. To be correctly imported, the graph must comply with the Pyntacle minimum requirements. To be correctly serialized, attributes must be built-in Python types, such as lists, dictionaries, sets, etc.

Consider the same network we used in the adjacency matrix section, stored in a binary object available here, with the .graph extension.

We can import it using the Binary method in the PyntacleImporter class of the io_stream module:

In [9]:
from pyntacle.io_stream.importer import PyntacleImporter 

graph = PyntacleImporter.Binary("example.graph")
Binary from  example.graph imported
In [10]:
#we can inspect the graph object to check its properties
graph.summary()
Out[10]:
"IGRAPH UN-- 5 5 -- ['example']\n+ attr: __implementation (g), __sif_interaction_name (g), name (g), __parent (v), name (v), __sif_interaction (e), adjacent_nodes (e), node_names (e)"

Conversely, a binary file that stores a graph not compliant with the Pyntacle minimum requirements will not be imported.

Consider the same network imported above, but with two edges connecting nodes A and B.

In [11]:
graph = PyntacleImporter.Binary("example_wrong.graph")
---------------------------------------------------------------------------
UnsupportedGraphError                     Traceback (most recent call last)
<ipython-input-11-b7e2591b1ccf> in <module>()
----> 1 graph = PyntacleImporter.Binary("example_wrong.graph")

~/miniconda3/lib/python3.6/site-packages/pyntacle/tools/misc/io_utils.py in func_wrapper(file, *args, **kwargs)
     47             raise FileNotFoundError("Input file does not exist")
     48 
---> 49         return func(file, *args, **kwargs)
     50 
     51     return func_wrapper

~/miniconda3/lib/python3.6/site-packages/pyntacle/io_stream/importer.py in Binary(file)
    418                     graph.to_undirected()
    419 
--> 420                 GraphUtils(graph=graph).check_graph()
    421                 sys.stdout.write("Binary from  {} imported\n".format(file))
    422                 return graph

~/miniconda3/lib/python3.6/site-packages/pyntacle/tools/graph_utils.py in check_graph(self)
     80             raise UnsupportedGraphError("Input graph is direct, pyntacle supports only undirected graphs")
     81         elif not Graph.is_simple(self.__graph):
---> 82             raise UnsupportedGraphError("Input Graph contains self loops and multiple edges")
     83         elif "name" not in self.__graph.vs().attributes():
     84             raise KeyError("nodes must have the attribute  \"name\"")

UnsupportedGraphError: Input Graph contains self loops and multiple edges
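The traceback shows that the import fails because the graph is not simple. The same condition can be checked on a plain list of edges before building the binary file. The sketch below is independent of Pyntacle and igraph; check_simple is a hypothetical helper that raises a ValueError rather than Pyntacle's UnsupportedGraphError.

```python
def check_simple(edges):
    """Raise ValueError if an undirected edge list has self-loops or multi-edges."""
    seen = set()
    for source, target in edges:
        if source == target:
            raise ValueError("self-loop on node {}".format(source))
        key = frozenset((source, target))  # A-B and B-A are the same edge
        if key in seen:
            raise ValueError("multiple edges between {} and {}".format(source, target))
        seen.add(key)


check_simple([("A", "B"), ("B", "C")])  # passes silently
try:
    check_simple([("A", "B"), ("B", "A")])
except ValueError as error:
    print(error)  # multiple edges between B and A
```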

Attribute file formats


Attributes enrich graph elements with supplementary information. Attributes can be general, namely related to the whole graph (graph attributes), or local, i.e., related to vertices (node attributes) or to links (edge attributes). Pyntacle relies on the way igraph manages attributes, namely through dictionaries whose keys are strings and whose values can be of any Python type. Attributes can thus be assigned to and retrieved from any igraph.Graph element. We refer to the official igraph Python tutorial for more details. Pyntacle implements some handy methods in the ImportAttributes and ExportAttributes classes of the io_stream module to correctly import and export attributes.

Graph Attribute Files

Graph attributes can be imported by means of the import_graph_attributes method of the ImportAttributes class. The attribute file is assumed to be a generic tab-delimited file, although this can be tuned with the sep parameter. The first line is interpreted as a header and is skipped. Each line contains a distinct attribute: the first column holds the attribute name and the second column holds its value. Consider the following graph attribute file (download it):

Attribute_name Attribute_value
network type pathway
diameter 2

It can be imported with these statements:

In [12]:
from pyntacle.io_stream.import_attributes import ImportAttributes

# sg is a working instance of igraph.Graph
ImportAttributes(graph=sg).import_graph_attributes("graph_attributes.tsv", sep="\t")

sg.attributes()
Graph attributes from graph_attributes.tsv imported.
Out[12]:
['__sif_interaction_name',
 'name',
 '__implementation',
 'diameter',
 'network type']

Note: All the attribute values are imported as strings by default.

In this example, the diameter attribute is imported as a string:

In [13]:
sg["diameter"]
Out[13]:
'2'
In [14]:
type(sg["diameter"])
Out[14]:
str

It therefore needs to be converted to int:

In [15]:
sg["diameter"] = int(sg["diameter"])
In [16]:
sg["diameter"]
Out[16]:
2
In [17]:
print(sg.vs().attributes())
['name', '__parent']
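When many attributes are imported, converting each string by hand becomes tedious. One possible approach, sketched below, is a small helper that tries int first, then float, and leaves the value untouched otherwise. The helper coerce is not part of Pyntacle, and the strategy is only an assumption about what a user may want.

```python
def coerce(value):
    """Try to convert a string to int, then to float; return it unchanged otherwise."""
    for cast in (int, float):
        try:
            return cast(value)
        except (ValueError, TypeError):
            continue
    return value


values = ["2", "0.85", "pathway"]
print([coerce(v) for v in values])  # [2, 0.85, 'pathway']
```

Note that float also accepts strings such as "nan" and "inf", so a stricter check may be preferable when such labels can appear as attribute values.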

Graph attributes can be exported using the export_graph_attributes method of the ExportAttributes class.

For example, if we want to export all the graph attributes of the sg graph used in the example above, we could use the following statements:

In [18]:
from pyntacle.io_stream.export_attributes import ExportAttributes

ExportAttributes(graph=sg).export_graph_attributes("exported_graph_attributes.tsv")
Graph attributes successfully exported at path /home/local/MENDEL/d.capocefalo/Dropbox/Research/BFX_Mendel/BFX Lab/Pyntacle/site_material/file_formats/exported_graph_attributes.tsv.

The file (available for download here) is a tab-separated file that looks slightly different from the one we imported:

Attribute Value
name ['example']
network type pathway
diameter 2

In fact, the graph attribute name corresponds to a list (Pyntacle allows the graph to have several name attributes). The list corresponding to the name attribute is exported without processing, leaving the user free to choose how to parse the file later. The same rule extends to other complex structures, such as dictionaries, sets, etc.
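One way to parse such values back into Python objects is ast.literal_eval from the standard library, which evaluates literals (lists, dictionaries, sets, numbers, strings) without executing arbitrary code. The sketch below assumes that the exported values are written as Python literals, as in the table above; parse_exported_value is a hypothetical helper.

```python
import ast


def parse_exported_value(text):
    """Turn the text of an exported value back into a Python object when it
    is a valid literal; return the original string otherwise."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


exported = [
    "Attribute\tValue",
    "name\t['example']",
    "network type\tpathway",
    "diameter\t2",
]
attributes = {}
for line in exported[1:]:  # skip the header
    key, value = line.split("\t", 1)
    attributes[key] = parse_exported_value(value)
print(attributes)  # {'name': ['example'], 'network type': 'pathway', 'diameter': 2}
```

With this approach numeric strings are also converted, so diameter comes back as an int rather than a string.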

Node attribute files

Node attributes can be stored as tab-separated files, with the node names in the first column and all the other attributes in the following columns. Node names in the first column must be a subset of the actual node names in the graph (the ones stored in the vertex attribute name). Any node attribute file must have a header line, whose values are used as attribute keys. The header of the first column is irrelevant. Nodes can be specified more than once, in which case their attribute values are overwritten. Node names that do not match the ones in the graph are skipped.

Be aware that node attribute names cannot be any of the following:

  • "name"
  • "__parent"

since these are reserved keywords.

We accept NA, None (any case) or question mark (?) strings to denote not available (NA) values in attribute files, while Pyntacle stores None when a value is not available.

Consider, for example, the following node attribute file for the network specified in the SIF section:

Node FoldChange pvalue
protein_1 NA NA
protein_2 3.3 0.00012
protein_1 -2.3 0.00054

(This example is available here)

The protein_1 node appears twice. The first time, its FoldChange and pvalue values are NA. However, the second occurrence replaces these values. If we try to import these attributes in our sg graph:

In [19]:
ImportAttributes(sg).import_node_attributes("node_attributes.tsv", sep="\t")
Node attributes from node_attributes.tsv imported.
In [20]:
sg.vs.attributes()
Out[20]:
['name', '__parent', 'FoldChange', 'pvalue']

If we now select the protein_1 node, we will see that the last line of the table has overwritten the first.

In [21]:
q = sg.vs.select(name="protein_1") #store the protein in a VertexSeq object
len(q) #we see the VertexSeq only has one node
Out[21]:
1
In [22]:
for v in q:
    print (v.attributes())
{'name': 'protein_1', '__parent': 'example', 'FoldChange': '-2.3', 'pvalue': '0.00054'}

Note: the imported values are always cast to strings during import.
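The rules of this section (reserved names, NA markers, overwriting of repeated nodes, skipping of unknown nodes) can be summarized in a short plain-Python sketch. This is not the code of import_node_attributes: the function read_node_attributes, the NA_STRINGS set and the extra protein_9 row are introduced here only for illustration.

```python
NA_STRINGS = {"na", "none", "?"}  # compared case-insensitively


def read_node_attributes(lines, node_names, sep="\t"):
    """Map each known node to a dictionary of its attributes (as strings)."""
    rows = [line.rstrip("\r\n").split(sep) for line in lines if line.strip()]
    keys = rows[0][1:]  # the header of the first column is irrelevant
    reserved = {"name", "__parent"} & set(keys)
    if reserved:
        raise KeyError("reserved attribute names: {}".format(sorted(reserved)))
    attributes = {}
    for row in rows[1:]:
        node, values = row[0], row[1:]
        if node not in node_names:
            continue  # nodes that are not in the graph are skipped
        # a repeated node replaces the values of its previous occurrence
        attributes[node] = {
            key: (None if value.lower() in NA_STRINGS else value)
            for key, value in zip(keys, values)
        }
    return attributes


node_lines = [
    "Node\tFoldChange\tpvalue",
    "protein_1\tNA\tNA",
    "protein_2\t3.3\t0.00012",
    "protein_1\t-2.3\t0.00054",
    "protein_9\t1.0\t0.5",
]
node_attrs = read_node_attributes(node_lines, {"protein_1", "protein_2", "protein_3"})
print(node_attrs["protein_1"])    # {'FoldChange': '-2.3', 'pvalue': '0.00054'}
print("protein_9" in node_attrs)  # False
```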

Edge attribute files

Edge attributes can be imported in two file formats with the import_edge_attributes method of the ImportAttributes class in the io_stream module:

  • the standard format
  • the Cytoscape format

The same formats can be exported using the corresponding ExportAttributes class in the same module.

Standard Format

The standard Pyntacle format is a table separated by tab characters, although the separator character can be tuned using the sep parameter. The first two columns hold the source and target node names, which must match the actual names of the nodes. The other columns hold the attributes that will be added to the link connecting the two nodes. The order of source and target is not important, as Pyntacle currently works with undirected networks only.

The same conditions regarding the header line of node attribute files hold here. This means that the header must be present and the values from the third column onwards will be the attribute keys of each link. These names must be unique or a KeyError will be raised.

Be aware that edge attributes cannot be named adjacent_nodes, as this is a Pyntacle-reserved attribute of the igraph.Graph object (as explained in the minimum requirements page). If the graph does not contain one of the specified links, that link is skipped. If a link is repeated in the file, the attributes of the last occurrence overwrite the previous ones.

Consider, for example, the simple network described in the SIF paragraph. Suppose we computed the correlation of expression among the 3 proteins of the network and we want to assign their values to the links, together with P-values.

Source Target correlation pvalue
protein_1 protein_2 0.85 0.0001
protein_1 protein_3 -0.15 0.6

(The edge attribute file can be downloaded here)

The table can be imported by the following commands:

In [23]:
ImportAttributes(sg).import_edge_attributes("edge_attributes_standard.tsv", sep="\t", mode="standard")
Edge attributes from edge_attributes_standard.tsv imported.

We can see that the attributes have now been added to the EdgeSeq object of the igraph.Graph:

In [24]:
print (sg.es.attributes())
['__sif_interaction', 'adjacent_nodes', 'correlation', 'pvalue']

Note: The values are imported by the Pyntacle library as strings, so they must be cast to the right types by the user.

In [25]:
sg.es["correlation"] # the correlation values are strings
Out[25]:
['0.85', '-0.15']
In [26]:
# we cast string to float
sg.es["correlation"] = list(map(float, sg.es["correlation"]))
print(sg.es["correlation"])
[0.85, -0.15]
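Because the network is undirected, a line listing protein_2 before protein_1 refers to the same link as one listing them in the opposite order. A simple way to model this, shown in the sketch below, is to key each edge by a frozenset of its two node names. The function read_edge_attributes is hypothetical and only mirrors the rules stated in this section.

```python
def read_edge_attributes(lines, graph_edges, sep="\t"):
    """Map each known undirected edge to a dictionary of its attributes."""
    rows = [line.rstrip("\r\n").split(sep) for line in lines if line.strip()]
    keys = rows[0][2:]  # attribute names start at the third column
    if len(set(keys)) != len(keys):
        raise KeyError("attribute names must be unique")
    if "adjacent_nodes" in keys:
        raise KeyError("adjacent_nodes is a reserved attribute name")
    known = {frozenset(edge) for edge in graph_edges}
    attributes = {}
    for source, target, *values in rows[1:]:
        edge = frozenset((source, target))  # the order of the nodes is ignored
        if edge not in known:
            continue  # links that are not in the graph are skipped
        attributes[edge] = dict(zip(keys, values))  # the last occurrence wins
    return attributes


graph_edges = [("protein_1", "protein_2"), ("protein_1", "protein_3")]
edge_lines = [
    "Source\tTarget\tcorrelation\tpvalue",
    "protein_2\tprotein_1\t0.85\t0.0001",
    "protein_1\tprotein_3\t-0.15\t0.6",
    "protein_2\tprotein_3\t0.1\t0.9",
]
edge_attrs = read_edge_attributes(edge_lines, graph_edges)
print(len(edge_attrs))  # 2
print(edge_attrs[frozenset(("protein_1", "protein_2"))])
# {'correlation': '0.85', 'pvalue': '0.0001'}
```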

Cytoscape format

Pyntacle imports and exports edge attribute files in the Cytoscape format (described in the official documentation, paragraph 8.2). This is possible by changing the mode parameter of the import_edge_attributes and export_edge_attributes methods from standard (default) to cytoscape. The separator character can be modified with the sep parameter. Repeated edges are overwritten and edges that do not exist in the graph are ignored.

For example, exporting the edge attributes previously loaded is as simple as:

In [27]:
from pyntacle.io_stream.export_attributes import ExportAttributes

ExportAttributes(sg).export_edge_attributes("edge_attributes_cytoscape.tsv", mode="cytoscape")
Edge attributes successfully exported at path /home/local/MENDEL/d.capocefalo/Dropbox/Research/BFX_Mendel/BFX Lab/Pyntacle/site_material/file_formats/edge_attributes_cytoscape.tsv.

Which will give this Cytoscape edge attribute file:

Edge(Cytoscape Format) correlation pvalue
protein_2 (physical) protein_1 0.85 0.0001
protein_2 (activation) protein_1 0.85 0.0001
protein_1 (physical) protein_3 -0.15 0.6

(this file can be downloaded here)
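In the exported file, the first column encodes each edge as "source (interaction) target". A regular expression can split this key back into its three parts. The sketch below assumes that node names contain no spaces, which holds for the example above but is not guaranteed in general; parse_cytoscape_edge is a hypothetical helper.

```python
import re

# source, interaction type in parentheses, target (node names without spaces)
EDGE_KEY = re.compile(r"^(\S+) \((.+)\) (\S+)$")


def parse_cytoscape_edge(key):
    """Split a Cytoscape edge key into (source, interaction, target)."""
    match = EDGE_KEY.match(key)
    if match is None:
        raise ValueError("not a Cytoscape edge key: {!r}".format(key))
    return match.groups()


print(parse_cytoscape_edge("protein_2 (physical) protein_1"))
# ('protein_2', 'physical', 'protein_1')
```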


This concludes our file formats guide. If you want to leave feedback, please contact us.