Various combinations are used to create case studies that are controlled, yet realistic, to illustrate different ways to run Vina on various infrastructures. Target proteins were selected based on current research interests in the Department of Medical Biochemistry at the AMC. The alpha-ketoglutarate-dependent dioxygenase FTO [19] is a protein implicated in the development of obesity; FTO is the strongest genetic predictor of increased body weight [20]. It uses alpha-ketoglutarate as a cofactor to remove methyl groups from mammalian mRNA.
Substrate binding occurs close to the catalytic site.
It is classified as an orphan receptor, as no activating ligand has been identified to date. In an attempt to identify potential binding sites, NUR77 virtual screens were run on crystal structure 3V3E [23] against two docking box sizes: the complete ligand binding domain (LBD) surface ("Big box") and a small part of the LBD surface ("Small box"). Polar hydrogen atoms were added to the selected structure.
1001 Ways to run AutoDock Vina for virtual screening
Note that ZINC contains compound models: a compound may be represented several times, with its different enantiomers and protonation states in separate models. We selected a number of sub-sets from the ZINC database, varying in size from a few dozen up to almost 90 K compounds. These sub-sets (compound libraries) also represent diverse types of molecules, as explained below. The libraries were downloaded directly in the pdbqt format.
Nutraceuticals (Nutra) [25]: A small library of 78 compounds from DrugBank containing, amongst others, diet supplements, often with a therapeutic indication. Human Metabolite Database (HMDB) [26]: This library, with over 2,000 compounds, contains information about small-molecule metabolites found in the human body. ZINC Natural Products (ZNP): This is the biggest library considered in our experiments, with over 89,000 compounds, comprising nature-inspired drug-like compounds.
These libraries are a logical choice for biomedical scientists in their virtual screening experiments, because they contain known and available drugs (see for example these cases [27-29]). This approach has the advantage of avoiding a long drug-development trajectory, making it possible to move quickly into more advanced phases of drug testing. Note, however, that these libraries differ from datasets like DUD-E [30] or DEKOIS [31], which are synthetic benchmarks suitable for evaluating the quality of docking algorithms.
Benchmarking is not the goal of this study; the experiments here are aimed only at illustrating the characteristics of this docking tool under heavy workload in a high-performance-computing setting. Furthermore, the drug compounds included in the abovementioned libraries display large variation in their characteristics. Specifically, we are interested in the number of active torsions (rotatable bonds) and the number of heavy (i.e., non-hydrogen) atoms.
These are important factors affecting the docking process and are therefore also expected to affect the execution time of Vina. The compounds in these four libraries add up to a total of over 94,000 molecules, considering the duplicates in different libraries. In all four libraries considered, over 7,000 compounds (roughly 7 %) have a very large number of rotatable bonds. In practice, however, compounds with too many rotatable bonds could be excluded from VS, for example using tools like Raccoon2, since such compounds are not expected to produce accurate results with existing docking tools.
We see again that, except for some compounds in FDA, the others are composed of no more than 40 heavy atoms. Distribution of ligands in each library, showing the counts in logarithmic scale, based on the number of active torsions (left) and heavy atoms (right). The simplest way to screen a ligand library is to run Vina on a multi-core machine, but this is suitable only for smaller libraries. For this purpose, many researchers use readily available scripts, such as those provided by the Vina authors, which automatically process all the ligands in a library one after the other, or tools such as Raccoon2 or PyRx.
To scale up to bigger libraries, one may use a cluster or grid. Whenever more processing cores are available, a higher speed-up is expected, but in practice there are many other determining factors, such as the balance between overhead and actual computation. Below we explain how we performed VS on the different infrastructures, focusing on high-level considerations that may affect the execution time.
When processing one ligand, Vina can take advantage of the available cores and perform the calculations in parallel (called internal parallelism). However, every execution includes pre- and post-processing steps which are not run in parallel. Even though very short, these sequential steps cause some cores to be idle, as illustrated in the figure below.
Schematic diagram of core occupancy over time for a single execution of Vina on four cores. Each line represents one core. The preparation steps (marked ×) and post-processing (marked +) are performed on one core. The actual docking process (bold black) is parallelized. The dotted lines show when the cores are idle. On the left, at every moment two ligands are processed in parallel, while on the right, the standard scripts provided on the Vina website are used, where ligands are processed one at a time. The right figure shows a visible drop in CPU load when switching to the next ligand, as predicted by the schematic.
In a virtual screening experiment, multiple ligands are considered, and these can therefore be processed in parallel (called external parallelism). We used a process pool in Python to start a fixed number of N concurrent processes on a multi-core machine, each running one instance of Vina. When a Vina instance finishes, the corresponding process picks up the next outstanding ligand. We carried out several tests on an 8-core machine to see which combinations of internal and external parallelism produce the best speed-up.
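The process-pool strategy above can be sketched as follows. This is a minimal illustration, not the paper's supplemental script: the receptor, config, and directory names are placeholders, and the `vina` binary is assumed to be on the PATH.

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

N_WORKERS = 4       # external parallelism: concurrent Vina instances
INTERNAL_CPUS = 2   # internal parallelism: value passed to Vina's --cpu

def build_cmd(ligand, receptor="receptor.pdbqt", config="box.conf",
              cpus=INTERNAL_CPUS):
    """Assemble one Vina command line (file names here are placeholders)."""
    out = ligand.with_suffix(".out.pdbqt")
    return ["vina", "--receptor", receptor, "--config", config,
            "--ligand", str(ligand), "--out", str(out), "--cpu", str(cpus)]

def dock_one(ligand):
    """Run one Vina instance; return the ligand name and the exit code."""
    return ligand.name, subprocess.call(build_cmd(ligand))

if __name__ == "__main__":
    lig_dir = Path("ligands")
    ligands = sorted(lig_dir.glob("*.pdbqt")) if lig_dir.is_dir() else []
    # Each of the N_WORKERS processes picks up the next outstanding ligand
    # as soon as its current Vina instance finishes.
    with Pool(N_WORKERS) as pool:
        for name, code in pool.imap_unordered(dock_one, ligands):
            print(name, "ok" if code == 0 else f"failed (exit {code})")
```

The product of `N_WORKERS` and `INTERNAL_CPUS` is the knob explored in the internal-vs-external parallelism experiments described below.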
The Python scripts to run the experiments on a multi-core machine are provided as supplemental material.
Hadoop [16] is an open-source implementation of the map-reduce paradigm, originally introduced by Google for parallel processing of many small data items. In this paradigm, parallel instances of a mapper job first process the input items, producing a series of key-value pairs. The system sorts these pairs by their keys and passes them on to the reducer jobs that aggregate the outputs.
In VS, each ligand corresponds to one input, thus creating one mapper job per ligand that runs an instance of Vina. These jobs output the binding affinities as keys, together with the name of the ligand as the value; the binding affinities are therefore automatically sorted by the system. One reducer job is enough to collect all the outputs and the sorted affinities.
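A Hadoop-streaming mapper along these lines could look like the sketch below (receptor and config file names are placeholders). It relies on the `REMARK VINA RESULT:` lines that Vina writes into its output pdbqt file, where the first such line corresponds to the best binding mode.

```python
import subprocess
import sys
from pathlib import Path

def best_affinity(pdbqt_text):
    """Extract the best score from a Vina output file. Vina writes lines
    like 'REMARK VINA RESULT:   -7.2  0.000  0.000', best mode first."""
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            return float(line.split()[3])
    return None

def map_ligand(ligand):
    """Dock one ligand and emit 'affinity<TAB>name' on stdout."""
    out = ligand.with_suffix(".out.pdbqt")
    code = subprocess.call(["vina", "--receptor", "receptor.pdbqt",
                            "--config", "box.conf",
                            "--ligand", str(ligand), "--out", str(out)])
    if code == 0:
        score = best_affinity(out.read_text())
        if score is not None:
            print(f"{score}\t{ligand.name}")  # key-value pair for Hadoop

if __name__ == "__main__":
    # Hadoop streaming feeds one record (here: a ligand path) per line.
    for line in sys.stdin:
        if line.strip():
            map_ligand(Path(line.strip()))
```

With the affinity as the key, Hadoop's shuffle phase orders the results before they reach a single identity reducer; note that strictly numeric ordering of the keys may additionally require Hadoop's key-field-based comparator.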
This approach enables running the virtual screening and collecting its results as explained below. When using a cluster or grid, some overhead is introduced to manage the jobs. This disfavors the execution of many small jobs, compared to Hadoop and multi-core infrastructures. To reduce the overhead, a smaller number of bigger jobs should be created; to do so, more than one ligand should be put into each compute job [9, 10, 33].
Based on this idea, the components of the workflow are as follows. Prepare: splits the input library into several disjoint groups of ligands to be processed in parallel. Collect: merges all the outputs and sorts the ligands based on their binding affinity. For running on the grid, we made a second implementation using the DIRAC pilot-job framework [34], which allows for better exploitation of the resources.
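The Prepare and Collect steps reduce to a few lines of code; this sketch (not the paper's actual implementation) uses round-robin striding so that neighbouring ligands are spread across groups:

```python
def make_jobs(ligands, n_jobs):
    """'Prepare' step: split the library into n_jobs disjoint groups.
    Round-robin striding keeps the groups nearly equal in size."""
    return [ligands[i::n_jobs] for i in range(n_jobs)]

def collect(per_job_results):
    """'Collect' step: merge per-job (affinity, ligand) pairs and sort,
    most negative (strongest predicted binding) first."""
    merged = [pair for job in per_job_results for pair in job]
    return sorted(merged)
```

Each group produced by `make_jobs` becomes one compute job that runs Vina on its ligands sequentially, so the per-job middleware overhead is paid once per group rather than once per ligand.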
Pilot jobs enable us to take advantage of less busy clusters and thereby avoid long queuing times. Additionally, one pilot job can run several Vina jobs without introducing middleware overhead, thus increasing efficiency. This gateway allows for specifying the basic configurations: docking box coordinates, exhaustiveness, number of modes, and energy range. The distribution on the grid and the collection of the results are automated, as is sorting the outcomes by the highest binding affinity. Additionally, it allows the user to download partially completed results [35] and provides provenance of previous experiments.
Some cases were repeated with different configurations, varying the exhaustiveness, seed, or number of threads. This table shows the four ligand libraries, four infrastructures, and the three docking boxes (FTO and the big and small boxes on NUR77). On multi-core, we ran the Nutra library a total of 59 times with different parallelism levels. Note that screening the ZNP library on an 8-core machine is not feasible, as it would need almost one year to complete. On our local cluster, we ran all the libraries with different configurations, but due to its relatively small capacity we did not try all cases.
On the bigger platforms, namely Hadoop and grid, we tried almost all cases. Whenever we needed to compare execution times, we used the experiments executed on the local cluster, which has a homogeneous platform (hardware and software stack). The analyses were based on average execution times of repeated runs. The graphs are therefore shown only for one case in each subsection.
In these measurements, on this specific setup and hardware, we observe that the fastest execution time with the smallest load on the system corresponds to parallelism level 20, i.e., 20 concurrent processes. This is clearly above twice the number of available cores, indicating that system saturation is beneficial in this case.
Execution time (in min) for combinations of internal and external parallelism on multi-core. Color coding: green for the fastest runs, changing to yellow and finally to red for the slowest runs. Based on these specific measurements, we cannot give a golden formula for the best combination of internal and external parallelism, as it may depend on various factors, ranging from hardware characteristics to configuration parameters like exhaustiveness.
Nevertheless, since we also observe that increasing internal parallelism at some point reduces performance, we conclude that the optimal solution is obtained by balancing internal and external parallelism. Compared to the pure internal parallelization offered by Vina (first row in the figure), this strategy yields a clear speed-up.
Vina calculations are based on pseudo-random number generation. Such programs, if provided with the same initialization, called the seed, produce the same behavior. By varying the random seed, different docking results can be generated, allowing the user to select the best results. We observed, however, that different operating systems, e.g., Mac vs. Linux, or even different versions of CentOS (a common Linux-based system), generated different outcomes, even when Vina was run with the same randomization seed and input parameters. Since the docking studies were performed on an isolated monomer, the two docking poses are genuinely different; the calculated energies for these two binding modes differ by less than 1 kcal/mol.
Two binding modes for the same ligand on NUR77, reported by Vina with the same configuration, including the randomization seed, but run on different platforms.
For reproducibility, one needs to record the characteristics of the platform used, together with the random seed and other configuration parameters. A similar phenomenon has been reported for DOCK in [36], where the authors suggest using virtualized cloud resources to overcome this issue. Nevertheless, reproducibility of the results may be completely endangered in the long term, because the exact same platforms may no longer exist; versions of an operating system, for example, are usually discontinued after at most a decade.
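Recording the platform alongside the seed is easy to automate; the sketch below (a hypothetical helper, not part of Vina) bundles both into a record that can be stored next to the docking results:

```python
import json
import platform
import sys

def provenance_record(seed, vina_config):
    """Bundle the seed and docking configuration with platform details,
    since the same seed can yield different poses on different systems."""
    return {
        "seed": seed,
        "vina_config": vina_config,
        "os": platform.platform(),        # OS name, version, kernel
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }

if __name__ == "__main__":
    record = provenance_record(42, {"exhaustiveness": 8, "num_modes": 9})
    print(json.dumps(record, indent=2))
```

Such a record does not make results portable across platforms, but it at least documents the conditions a result can be reproduced under.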
Depending on prior knowledge about the target, the screening can be constrained to a limited area instead of the total protein surface by specifying a docking box. In this case, one may naively expect the computation time to be reduced. The results depicted here show that the execution time is not consistently larger or smaller for either the Big or the Small box, even though the Big box is more than 38 times larger in volume than the Small box. We therefore conclude that the time required for a run does not depend entirely on the size of the docking box.
In other words, enlarging the docking box does not necessarily mean that Vina will spend more time on the calculations. To ensure good quality of the docking results, however, one needs to perform more runs. Comparison of execution times (in s) for different Vina configurations. Every blue dot in the plots represents one ligand.
A recent study [37] proposes an algorithm to determine the optimal box size when a specific pocket is targeted, such that the quality of the Vina calculations is higher. As we have shown, when using such optimizations, one does not need to worry about any consistent change in execution time.
In the various settings used for comparison, we see an almost linear increase in execution time with increasing exhaustiveness. One may alternatively perform an equal number of runs by executing Vina repeatedly, but with a smaller exhaustiveness each time. This takes the same amount of total time (as we have seen here), but since a different randomization seed can be used in each Vina run, there can be more variety in the results, with a better chance of finding the minimum binding energy.
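The restart strategy above can be planned programmatically. This sketch (an assumed strategy built on the linear time-vs-exhaustiveness observation, not a Vina feature) splits one expensive run into several cheaper runs with distinct seeds while keeping the total exhaustiveness budget constant:

```python
import random

def plan_restarts(total_exhaustiveness, n_runs, rng=None):
    """Split one expensive run into n_runs cheaper runs with distinct
    seeds. The summed exhaustiveness (and thus, roughly, the total
    time) stays the same, but more seeds are sampled."""
    rng = rng or random.Random()
    base, extra = divmod(total_exhaustiveness, n_runs)
    runs, seeds = [], set()
    for i in range(n_runs):
        seed = rng.randrange(1, 2**31)
        while seed in seeds:                 # keep seeds distinct
            seed = rng.randrange(1, 2**31)
        seeds.add(seed)
        runs.append({"exhaustiveness": base + (1 if i < extra else 0),
                     "seed": seed})
    return runs
```

Each entry then maps to one Vina invocation via the `--exhaustiveness` and `--seed` options, and the best-scoring pose across all runs is kept.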
The Vina authors already showed a relationship between execution time and ligand properties, based on a small experiment with protein-ligand complexes [3]. Here we repeat the experiment for a much larger number of ligands. We chose FDA because it has a larger variation in both the number of active torsions and heavy atoms (see the distribution figure above).
On average, however, the execution time grows proportionally with the number of active torsions (and, similarly, heavy atoms), with a few outliers. Deriving the exact form of this relation, i.e., the function mapping ligand properties to execution time, is beyond the scope of this study. Average execution time (in s) for ligands grouped by the number of active torsions (left) and heavy atoms (right). Bars represent mean execution time, with standard deviation as error bars.
First consider systems like a grid, where ligands are grouped to create bigger compute jobs (cf. Kreuger et al.). If a group happens to contain many large or flexible ligands, it will have a much longer execution time, which will dominate the execution time of the whole VS experiment. Zhang et al. have proposed ways to address this. Using the number of active torsions and heavy atoms requires much less computational effort, therefore making it easier to adopt for new ligand libraries.
Nevertheless, the effectiveness of this approach for predicting the execution time of each ligand group remains to be studied. In other systems that handle ligands one by one (for example on a multi-core machine or a supercomputer), we recommend starting the VS experiment by processing the large and flexible ligands first. By leaving the smaller ligands, which take less time, for a later stage, automatic load balancing between the processing cores occurs, as illustrated below.
Here we show four ligands processed on two cores. In the scenario on the left, the larger ligands A and B are processed first. Since A takes much longer than B, both C and D are processed on the same core as B. Had we instead started with the smaller ligands C and D in parallel (on the right), we would end up running A and B in parallel as well, which results in one core being idle while A is still being processed. Clearly, the scenario on the left has a faster overall execution time, as it better utilizes the available cores.
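A largest-first schedule only needs the two ligand properties discussed above. The sketch below reads them from AutoDock-style pdbqt files; the `REMARK ... active torsions:` line and the hydrogen atom-type names are assumptions about the usual AutoDockTools output format:

```python
from pathlib import Path

HYDROGENS = {"H", "HD", "HS"}   # AutoDock hydrogen atom types (assumed)

def count_torsions(pdbqt_text):
    """AutoDockTools ligand files typically carry a line such as
    'REMARK  6 active torsions:' (format assumed here)."""
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK") and "active torsions" in line:
            return int(line.split()[1])
    return 0

def count_heavy_atoms(pdbqt_text):
    """Count ATOM/HETATM records whose AutoDock atom type
    (last whitespace-separated field) is not a hydrogen."""
    return sum(1 for line in pdbqt_text.splitlines()
               if line.startswith(("ATOM", "HETATM"))
               and line.split()[-1] not in HYDROGENS)

def largest_first(ligand_paths):
    """Order ligand files so big, flexible ones are submitted first."""
    def size_key(path):
        text = Path(path).read_text()
        return (count_torsions(text), count_heavy_atoms(text))
    return sorted(ligand_paths, key=size_key, reverse=True)
```

Feeding the output of `largest_first` to the process pool or job queue realizes the left-hand scenario of the figure.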
A similar method is used by Ellingson et al. The effect of load balancing by processing larger and more flexible ligands first on the total virtual screening time. In these experiments, Vina is configured to use one processing core only (i.e., internal parallelism of one). Total cores shows the compute capacity of each infrastructure, which is proportional to the size of the virtual screening experiment (number of ligands to screen). On the multi-core machine, this is equal to the external parallelism level.
In the other cases, this is equal to the maximum number of compute jobs that were running at the same time, which would ideally get as close to the total number of cores as possible. The actual level of parallelism that can be achieved is hampered by various factors, as explained below.
A summary of execution time and achieved parallelism on various infrastructures and middlewares. See text for details. The studied infrastructures comprise shared resources (except for the multi-core case), which means we may get only part of their capacity if other users are running their experiments at the same time. This is clearly seen in the experiments run on the AMC cluster, where the maximum parallelism is in most cases much lower than the number of available cores.
In the case of the grid, this is not visible due to its very high capacity; therefore, given a fixed number of compute jobs, one can expect a more or less fixed level of parallelism. On a multi-core computer, the chance of the machine failing is very low. But when several computers are connected (in a cluster or grid), there is a higher chance that at least one of them fails. Additionally, other factors like physical network failure or access to remote data mean that, as the number of connected computers grows, the chance of failure grows much faster.
Fault tolerance in this context can be simplistically defined as the ability to automatically restart a docking job whenever the original run fails. Such failures penalize average parallelism, especially if they are retried manually, as can be seen for example in the experiments on the AMC cluster. Small screening experiments can be handled on a multi-core machine; slightly bigger experiments can be done faster with a larger number of cores.
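The simplistic automatic-restart policy just described can be captured in a small wrapper; this is an illustrative sketch, not a feature of Vina or of any of the middlewares discussed:

```python
import time

def retry(run_job, max_attempts=3, backoff=0.0):
    """Re-run a failing job up to max_attempts times.
    run_job: any callable returning True on success (e.g. a wrapper
    around a Vina subprocess call). Returns the attempt number that
    succeeded, or 0 if every attempt failed."""
    for attempt in range(1, max_attempts + 1):
        if run_job():
            return attempt
        if attempt < max_attempts:
            time.sleep(backoff * attempt)   # simple linear backoff
    return 0
```

For example, a docking job could be wrapped as `retry(lambda: subprocess.call(cmd) == 0, max_attempts=3)`, so that a transient node or network failure does not require a manual resubmission.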
A small cluster can be a reasonable choice for such medium-sized experiments.
Nevertheless, we see that failures and manual retries may gravely increase the perceived execution time (wall clock). The grid is based on resource sharing, with a stable ratio between average parallelism and the number of compute jobs; the low ratio between maximum and average parallelism, however, shows the considerable overhead and competition for resources on this type of infrastructure.
Virtualization. Typical usage of the cloud involves allocating virtual machines (VMs), where it is possible to define, and pay for, the number of cores and the amount of memory and disk.
Most commercial cloud providers nowadays also offer the possibility of deploying a Hadoop cluster on virtual resources, though this is a more expensive option. Given the importance of handling failures when large VS experiments are performed, it would be interesting to investigate whether the fault-tolerance facilities of Hadoop might compensate for its extra cost on the cloud.
Other infrastructures. A supercomputer, if at one's disposal, is a suitable platform for large VS experiments. For experiments that are not so large, a good alternative to a multi-core CPU is to use accelerators. However, this requires rewriting the docking software; therefore, we did not consider it for Vina.
Two options, GPU and Xeon Phi, have been shown to be suitable for parts of the docking process, as described in [43], where around a 60-fold speed-up is obtained with in-house-developed virtual screening software.
Vina has the ability to perform molecular docking calculations in parallel on a multi-core machine. We found, however, that Vina does not exploit the full computing capacity of a multi-core system, because some pre- and post-processing needs to be performed on a single core.
Therefore, external parallelization approaches should be employed to increase the efficiency of computing-resource usage for VS; in our experiments, this led to more than a twofold speed-up. We also found that using the same randomization seed does not always assure reproducibility. In fact, docking results are reproducible only if performed on the exact same platform (operating system, etc.). We observed cases where the same seed and calculation parameters led to diverging results on different platforms: both different binding modes and different energies were reported.
Further study of the execution time confirmed previous knowledge about Vina, but on a much larger dataset: execution time is linearly proportional to exhaustiveness (the number of simulations per run). It is therefore advisable to run Vina several times with a smaller exhaustiveness rather than once with a large one: this takes about the same total time, but multiple seeds are sampled, perhaps elevating the chances of getting closer to the best binding mode. XScore is frequently cited as being used to re-rank AutoDock output, and its scoring approach served as a basis for AutoDock Vina.
DOCK and AutoDock were initially created in an era when computational resources for HTVS were prohibitively expensive and relatively primitive, but these programs have evolved over the years to become more user-friendly, adaptable for HTVS, and useful as teaching and learning tools in a classroom setting. One noteworthy advance for AutoDock is a set of Python scripts and programs called MGLTools that facilitate and automate the workflows required to manage many simultaneous docking calculations. To enhance the usability of DOCK and AutoDock, researchers have also developed graphical user interfaces (GUIs) that automate job management and submission for molecular docking calculations.
The focus of this paper is HTVS GUI applications capable of processing large numbers of molecular interactions at an acceptable speed and cost, with reliable results, on a variety of computer platforms. This gives teachers a rational and inexpensive tool for demonstrating to students how to assess and prioritize ligands as drug candidates (see Figure 1).
Molecular docking experiments involving either DOCK or AutoDock require an inordinate amount of time to set up, submit, compute, and analyze results. HTVS programs solve these problems through process automation. The HTVS programs we review are free or inexpensive, and can run on hardware ranging from a personal computer to a computing cluster.
DockoMatic and Python Prescription (PyRx) can manage jobs independently of computer architecture, using a single workstation or a cluster. DockingServer is a web-based application that runs independently of the user's operating system, while MOLA can operate on networks consisting of heterogeneous computer architectures.
Educators can provide a visual context for the laboratory portion of their courses by selecting software programs described in this manuscript tailored to their computing capabilities. Open-access databases of receptor and ligand structures enable customized systems to be incorporated into the laboratory curriculum.
Programs detailed in this manuscript were selected, in part, based on their use in solving research problems of instructional value and their relative ease of use in an educational environment. These programs can manage millions of docking experiments on large computing clusters, efficiently identifying and ordering the top-scoring ligands. Both programs rank and score results via user-specified criteria. Aptamers bind specific small ligands, such as amino-sugars, flavin, or peptides, and are significant as diagnostic molecules associated with gene regulation.
DOVIS 2.0 automates large-scale virtual screening with AutoDock on Linux clusters. VSDocker is free for non-commercial use but is not open source. WinDock runs on a single Windows workstation and supports receptor homology-model creation: templates for receptors are identified via sequence alignment using ClustalX and T-Coffee, after which WinDock directs Modeller to construct a homology model. WinDock includes a large 3D ligand library, or the user can access compounds of interest from their own ligand pdb database.
Users can select force-field, empirical, or knowledge-based ligand-scoring algorithms to assess results. HIV-1 integrase consists of three domains: the N-terminus, the core, and the C-terminus. WinDock identified the binding preference of baicalein for the middle of the ligand binding domain, the same site identified by co-crystallization with the inhibitor 5-CITEP. A WinDock executable is available free of charge to students, academicians, and researchers by contacting the original author; the source code is not available.
BDT was used to study the binding of volatile anesthetic ligands, like halothane or sevoflurane, to amphiphilic pockets in volatile-anesthetic-binding proteins like serum albumin and apoferritin. BDT predicted that van der Waals forces are the predominant factor in the binding of volatile anesthetic ligands to compatible binding proteins.
BDT is free for academic and non-commercial research purposes, though not open source. DockoMatic is a Linux-based HTVS program that uses a combination of front- and back-end processing tools for file preparation, result parsing, and data analysis. DockoMatic can dock secondary ligands and may be used to perform inverse virtual screening. The DockoMatic GUI facilitates job creation, submission of jobs to AutoDock for docking, and result analysis for beginning and advanced users.
The program can manage jobs on a single CPU or a cluster, and it generates ligand structure files by point mutation of an existing ligand pdb file or by entry of the single-letter amino acid code for the peptide ligand sequence of interest. DockoMatic has been used to study conotoxin binding to acetylcholine binding proteins (AChBPs) for drug design. AChBPs are homologous to neuronal nicotinic acetylcholine receptors (nAChRs), pentameric ion channels responsible for regulating the passage of ions and small-molecule neurotransmitters through biological membranes.
Conotoxin ligands with a public-domain nuclear magnetic resonance (NMR) solution-structure pdb file were analyzed in the bound state in the crystal structure; the peptide was then removed from the ligand binding domain, and DockoMatic was used to redock the peptides. The results demonstrated that DockoMatic can be used for computational prediction of peptide analog binding. PyRx has been used to study aromatase inhibitors (AIs).
In post-menopausal women with breast cancer, increased levels of estrogen produced by the breast cancer cells increase cell proliferation, creating a self-feedback loop. AIs have therapeutic value for patients with breast cancer associated with excessive aromatase activity. The AIs studied using PyRx had known crystal structures; PyRx output was compared to the X-ray structures to validate the computational binding prediction. DockingServer is a comprehensive web service designed to make molecular docking accessible to all levels of users.
The process for job submission is straightforward, and the output report gives the specific bond-type interactions between each ranked result and the target receptor. A drawback is that the docking output structure files are large and DockingServer user storage space is limited; thus, the number of parallel processes that can be run, prior to transferring or deleting files, is restricted. DockingServer has been used to investigate human breast cancer drug resistance, using a homology model of breast cancer resistance protein (BCRP) to characterize the potential interaction modes of the substrates mitoxantrone (MX), prazosin, Hoechst, and 7-ethyl-10-hydroxycamptothecin (SN-38). The results indicated a central cavity in the middle of the lipid bilayer of BCRP capable of containing two substrates, instead of the previously hypothesized single substrate.
This study illustrates a possible mechanism for BCRP function that may lead to inhibitors for future drug development. The DockingServer web-based service is available for a modest annual subscription. MOLA runs off a bootable CD that preempts the local operating system with its own. MOLA is capable of configuring a temporary computer cluster from heterogeneous, networked standalone computers, regardless of operating platform.
This program is intended for research labs without access to a dedicated computer cluster. ADT also generates an analysis spreadsheet ranked by the lowest binding energy and the distance to the active site. MOLA was used to investigate ligand binding to retinol binding protein, HIV-1 protease, and trypsin-benzamidine, each with a library search of ligands and decoys, recreating the approximate binding-potential distribution of these ligand sets for each receptor. MOLA is a free download as an image file for direct burning to disk; the source code is not available.
The role of computational molecular docking in the educational and research community is evolving at a rapid rate.
Access to this field by an ever-increasing number of students, teachers, and scientists has been facilitated by software programs similar to those described here. Each program we describe has been used to address real-world research problems that educators may find instructive for students. Table 1 summarizes the features of each HTVS program reviewed.
Instructors should select a program to use in their courses dependent upon their curriculum, computer hardware access, financial resources, and desired instructional objectives. The HTVS programs described in this manuscript were developed with the common goal of enhancing the ability to perform molecular docking studies using one of two well-established docking engines, DOCK or AutoDock.
The optimal program for use to explain biological principles to students is dependent on the specific goals of the instructor. For a class in a department with limited computer availability interested in occasional docking investigations, we suggest WinDock or PyRx, as both programs are available for a Windows operating system.
For instructors with limited computer resources, DockingServer is an external web service available for a reasonable subscription. WinDock, PyRx, and DockingServer contain fully integrated visualization capabilities for all steps of the process, from docking to result analysis.
In addition to computational requirements, each HTVS program has unique features to assist in docking studies and data analysis. BDT is optimal if the instructor presents students with a project to study a specific receptor that does not have a known binding pocket. If the instructor requires construction of homology models, WinDock contains a Modeller interface.