Our research group went on a retreat (‘OPIGTREAT’) consisting of presentations by group members on technical skills and the latest research. Given the boom in data across virtually all areas, my colleague and I decided to run a quick hands-on workshop on distributed computing that made use of protein data.

As we were looking around for applications of distributed computing in small-molecule and protein research, we came across the MacroMolecular Transmission Format (MMTF), which was developed with deployment in distributed computing in mind. The current PDB formats are flexible, extensible and contain rich metadata, which makes them ideal for archival purposes. MMTF was developed to address some major concerns with the current PDB data:

  1. Redundant annotations
  2. Large file size
  3. Inefficient I/O

These concerns have intensified in recent years, as the structures deposited in the last few years are among the largest in the entire PDB, thanks to the rise of techniques such as cryo-EM and NMR. We need an efficient way to process and analyse macromolecules, and to visualise large structures.

Through encoding, packing and entropy compression of current PDB structures, MMTF stores each structure as a flat file of binary key-value pairs, which can be used by big data frameworks (e.g. Hadoop and Spark). The MMTF-Spark parser is already mature, but since few people in our research group use Java, we explored the less mature PySpark parser instead. I tried a few of the examples for finding interactions with ligands. Since I ran them on a local machine, the performance was only marginally faster than running NeighborSearch from the BioPython package on a single core (11.0s vs 11.78s).
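To get a feel for why encoding, packing and compression shrink structure data so much, here is a minimal sketch in plain Python. It is not the MMTF codec itself: the function names are made up for illustration, the atom serial numbers are hypothetical data, and `zlib` merely stands in for the entropy-compression step. The idea is that regular columns such as serial numbers become tiny once you store differences and collapse repeated values.

```python
import struct
import zlib


def delta_encode(values):
    """Keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Collapse runs into a flat list [value, count, value, count, ...]."""
    encoded = []
    for v in values:
        if encoded and encoded[-2] == v:
            encoded[-1] += 1          # extend the current run
        else:
            encoded.extend([v, 1])    # start a new run
    return encoded


def run_length_decode(encoded):
    """Invert run_length_encode."""
    out = []
    for value, count in zip(encoded[0::2], encoded[1::2]):
        out.extend([value] * count)
    return out


def pack_int32(values):
    """Pack integers as big-endian signed 32-bit values."""
    return struct.pack(f">{len(values)}i", *values)


# Hypothetical column: atom serial numbers 1..1000.
serials = list(range(1, 1001))

# Delta encoding gives 1000 ones; run-length encoding collapses them.
encoded = run_length_encode(delta_encode(serials))  # [1, 1000]

raw_bytes = pack_int32(serials)      # 4000 bytes
packed = pack_int32(encoded)         # 8 bytes
compressed = zlib.compress(packed)   # stand-in for entropy compression

# The transformation is lossless.
restored = delta_decode(run_length_decode(encoded))
```

Real coordinate columns are far less regular than serial numbers, so the gain there is smaller, but the principle of picking an encoding that suits each column is the same.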

Nevertheless, I had fun playing around with PySpark SQL on its own. I suppose the biggest advantage of Spark is that message passing is managed for you, as opposed to the highly customisable OpenMP and MPI approaches. In addition, fault tolerance (via RDDs) and load balancing are handled by Spark, which makes it an easy entry point for beginners into the world of distributed computing.