OSTRFPD: Multifunctional Tool for Genome-Wide Short Tandem Repeat Analysis for DNA, Transcripts, and Amino Acid Sequences with Integrated Primer Designer

Microsatellite mining is a common outcome of the in silico approach to genomic studies. The resulting short tandemly repeated DNA could be used as molecular markers for studying polymorphism, genotyping and forensics. The omni short tandem repeat finder and primer designer (OSTRFPD) is among the few versatile, platform-independent open-source tools written in Python that enable researchers to identify and analyse genome-wide short tandem repeats in both nucleic acid and protein sequences. OSTRFPD is designed to run either in a user-friendly fully featured graphical interface or in a command line interface mode for advanced users. OSTRFPD can detect both perfect and imperfect repeats of low complexity with customisable scores. Moreover, the software has built-in architecture to simultaneously filter selection of flanking regions in DNA and generate microsatellite-targeted primers implementing the Primer3 platform. The software has built-in motif-sequence generator engines and an additional option to use the dictionary mode for custom motif searches. The software generates search results including general statistics containing motif categorisation, repeat frequencies, densities, coverage, guanine–cytosine (GC) content, and simple text-based imperfect alignment visualisation. Thus, OSTRFPD presents users with a quick single-step solution package to assist development of microsatellite markers and categorise tandemly repeated amino acids in proteome databases. Practical implementation of OSTRFPD was demonstrated using publicly available whole-genome sequences of selected Plasmodium species. OSTRFPD is freely available and open-sourced for improvement and user-specific adaptation.


Evolutionary Bioinformatics
The omni short tandem repeat finder and primer designer (OSTRFPD) has been designed to address some of these key issues by providing a simple yet useful tool to rapidly identify and categorise repetitive nucleic or amino acid sequences and to assist in the development of microsatellite-targeted primers with minimum user input and programming knowledge.

Implementation
OSTRFPD has been designed with molecular researchers who have little or no computer programming background in mind and is optimised for small (approximately 5 Kbp) to medium-sized (approximately 50 Mbp) FASTA sequences. The architecture and workflow of OSTRFPD (Figure 1) centre on FASTA sequences (DNA, RNA, or proteins), which are scanned for user-configurable repetitive units. The software supports detection of both perfect and imperfect repeats with low complexity, which widens the range of potential STR analyses. Configuration options for results can vary based on sequence type and the anticipated output format. The format of the output can be tabulated values (default), FASTA sequences, or alignment type. OSTRFPD has the option to display imperfect repeats in plain text alignment, comparing the imperfect sequence with its nearest perfect equivalent for visually identifying indels, gaps, and mismatches. The alignment mode also generates additional information, such as the default local alignment scores, custom scores, and a rudimentary consensus sequence, based on perfectness of the repeat. For DNA sequences, the software uses the well-established Primer3 platform with configurable parameters to design primers as microsatellites are detected. Moreover, if the primer-tag option is selected, OSTRFPD appends a user-defined tag to the 5′-tail of primers, which simplifies the process of ordering tagged primers. The dictionary-based motif search is a unique feature of OSTRFPD. The dictionary is essentially a plain text file with each custom motif listed on a new line.
The dictionary must contain only 1 type of molecule (not a mixture of DNA, RNA, or proteins). During the runtime, motifs are processed automatically to filter out any duplicates or equivalent cyclic motifs. The current version of OSTRFPD only supports fixed-length motifs and single minimum repeat number-based searches, although a single dictionary file may contain collections of variable-length motifs. The dictionary mode exclusively allows searches of motifs of 1 to 30 bp or amino acids, which may enable researchers to identify user-defined simple oligonucleotides, transcription factor binding regions, or signalling peptide sequences. Dictionaries optimised for nucleotide and amino acid motifs commonly observed in Plasmodium species have been bundled with the OSTRFPD distribution.
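The duplicate and cyclic-equivalence filtering described above can be illustrated with a minimal sketch. It is not taken from the OSTRFPD source: the function names `canonical_rotation` and `load_motifs`, and the choice of the lexicographically smallest rotation as the representative key, are assumptions made for illustration only.

```python
def canonical_rotation(motif: str) -> str:
    """Return the lexicographically smallest cyclic rotation of a motif.

    Cyclically equivalent motifs (e.g. ATA, AAT, TAA) share the same key.
    """
    motif = motif.strip().upper()
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def load_motifs(lines, min_len=1, max_len=30):
    """Collapse duplicate and cyclically equivalent motifs.

    `lines` is any iterable of strings, one motif per line, as in a plain
    text dictionary file. The 1 to 30 unit limit follows the text above.
    """
    seen = {}
    for line in lines:
        motif = line.strip().upper()
        if not motif or not (min_len <= len(motif) <= max_len):
            continue  # skip blank lines and motifs outside the length limit
        # Keep the first spelling encountered for each equivalence class.
        seen.setdefault(canonical_rotation(motif), motif)
    return list(seen.values())


if __name__ == "__main__":
    example = ["ATA", "AAT", "taa", "", "GC", "CG", "ATA"]
    print(load_motifs(example))  # ['ATA', 'GC']
```

With a real dictionary file, the same function could be called as `load_motifs(open(path))`, where `path` is a hypothetical file name.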

Selection of databases
The usability of OSTRFPD was demonstrated with freely available standard reference genomic and protein databases of selected Plasmodium species from the PlasmoDB web server (http://plasmodb.org/common/downloads/release-36/).

Figure 1. Schematics of OSTRFPD software architecture and workflow. OSTRFPD can be used either as a command line console tool with arguments or as a fully featured graphical user interface tool. Single or multi-FASTA files (eg, .fasta, .fa, and gzip-compressed .gz FASTA) for nucleic acids or proteins are directly accepted as data sources. All types of sequences can be scanned for short tandem repeats, and primers can be simultaneously designed for DNA-associated microsatellites using the built-in flanking sequence filter and Primer3 plugin. Results can be generated with the option to include a general statistics report. Results generated can be of 3 major types: (1) 'Default' with tab-delimited values and associated headers, (2) 'Alignment' or 'Imperfect Alignment only' format with alignments of both perfect and imperfect repeats, and (3) 'FASTA' as portable multi-FASTA format containing target microsatellites with flanking sequences. MS indicates microsatellites; OSTRFPD, omni short tandem repeat finder and primer designer.

Software prerequisites for running OSTRFPD
OSTRFPD is freely available under the GNU General Public License (GPL) (https://www.gnu.org/licences/gpl-3.0.en.html). The software was tested for proper operation in both Windows (versions 7 and 10) and Linux Ubuntu (version 16.04), provided that at least Python 3.5, PyQt5 5.9.1, and Biopython 1.7 are correctly installed. 18,19 The software uses Python's powerful built-in regular expression engine to identify patterns within DNA, RNA, or amino acid sequences and locate STRs. To generate primers, users can either directly use the standalone primer3 binaries supplied with the software package or compile primer3 themselves from the official source (https://sourceforge.net/projects/primer3/files/primer3/1.1.4/). The details of each parameter for primer design can be obtained from the primer3 documentation (http://primer3.sourceforge.net/primer3_manual.htm). 20
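As an illustration of how a regular expression engine can locate perfect STRs, the following is a minimal sketch using Python's standard `re` module. The pattern, the function name `find_perfect_strs`, and the default thresholds are assumptions for this example and are not taken from the OSTRFPD source; the sketch handles DNA only and ignores imperfect repeats.

```python
import re


def find_perfect_strs(sequence, unit_min=1, unit_max=6, min_repeats=5):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect repeats.

    A backreference (\\1) forces the captured unit motif to be repeated
    at least `min_repeats` times in tandem.
    """
    sequence = sequence.upper()
    hits = []
    for unit_len in range(unit_min, unit_max + 1):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "ATAT" = (AT)2,
            # so the same locus is not reported under two unit lengths.
            if any(
                unit_len % k == 0 and motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
            ):
                continue
            count = len(match.group(0)) // unit_len
            hits.append((match.start(), match.end(), motif, count))
    return sorted(hits)


if __name__ == "__main__":
    demo = "GG" + "AT" * 6 + "CCC" + "A" * 7 + "G"
    for hit in find_perfect_strs(demo, unit_max=2):
        print(hit)
```

For protein or RNA input, the character class `[ACGT]` would need to be replaced by the corresponding alphabet.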

Ease of operation
OSTRFPD can run either as fully featured standalone OS-specific binaries or directly from the source code within a platform-independent Python environment. OSTRFPD supports a fully featured graphical user interface (GUI) or a command line interface (CLI) in a Windows console or Linux terminal. The GUI mode (Figure 2) is equipped with tool tips and a basic level of error handling modules to avoid invalid or unintentional inputs. A typical GUI mode can be initiated using the parameters 'python3 ostrfpd.py -gui true' in the console or terminal. The CLI mode (Figure 3) is suitable for advanced users who choose to conduct batch operations or implement OSTRFPD as a plugin for their own utilities. Command line interface mode is activated by default. The software generates user-configurable detailed output that can be retrieved as a tab-delimited report file (default), FASTA sequences, or in an alignment format. The details of each parameter and the syntax in the CLI mode can be accessed by following the software documentation or using the built-in help '--help' argument.

Figure 2. Simplified graphical user interface (GUI) for data input. OSTRFPD provides a user-friendly graphical interface which can be initialised using the simple argument 'python3 ostrfpd.py -gui true' in console or terminal. The user interface has a decent level of built-in error handling modules to minimise invalid data input. The graphical user interface works alongside the console display. A simple tooltip displayed on the status bar provides a short description of each option under consideration and shows an example of command line interface parameters whenever feasible as '<eg, -command value>'. OSTRFPD indicates omni short tandem repeat finder and primer designer.

Figure 3. OSTRFPD has an advanced option for CLI that can be initialised using no argument 'python3 ostrfpd.py' or 'python3 ostrfpd.py -gui false' in console or terminal. The CLI mode allows OSTRFPD to be used for batch operation as well as a plugin script that can be implemented by other software. Representative images are truncated to save space. OSTRFPD indicates omni short tandem repeat finder and primer designer.

Table 2 (note). Summary of amino acid repeats from a proteome-wide search for 1 to 2 amino acid (aa) unit motif repeats using default settings with minimum repeats of 7 and 5, respectively. Equivalent command line parameters were supplied as 'python3 ostrfpd.py -scan protein -input source_protein_fasta -unitmin 1 -unitmax 2 -misa 7,5'.
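For batch operation in CLI mode, a wrapper script could assemble one OSTRFPD command per input file. The sketch below uses only the arguments quoted in this article (`-scan`, `-input`, `-unitmin`, `-unitmax`, `-misa`); the helper names `build_command` and `run_batch`, the assumption that `ostrfpd.py` sits in the working directory, and the example file names are hypothetical. Where OSTRFPD writes its output is not covered here and should be checked against the software documentation or the '--help' argument.

```python
import subprocess


def build_command(fasta_path, scan="protein", unit_min=1, unit_max=2,
                  misa="7,5", script="ostrfpd.py"):
    """Build the argument list for one OSTRFPD run (flags as quoted in the article)."""
    return [
        "python3", script,
        "-scan", scan,
        "-input", str(fasta_path),
        "-unitmin", str(unit_min),
        "-unitmax", str(unit_max),
        "-misa", misa,
    ]


def run_batch(fasta_paths, **options):
    """Run OSTRFPD once per input file and collect the exit codes."""
    exit_codes = {}
    for path in fasta_paths:
        # check=False: a failed run is recorded rather than aborting the batch.
        result = subprocess.run(build_command(path, **options), check=False)
        exit_codes[str(path)] = result.returncode
    return exit_codes


if __name__ == "__main__":
    # Hypothetical input names; print the commands without executing them.
    for name in ["species_a_proteins.fasta", "species_b_proteins.fasta"]:
        print(" ".join(build_command(name)))
```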

Practical implementation of OSTRFPD
As an example, the microsatellites (Table 1) and amino acid repeats (Table 2) identified during the demonstration reflect characteristic features of the extremely AT-rich Plasmodium genome. 4,21 The P falciparum genome had the highest number of microsatellites (66 146) with an average density of 2835 microsatellites/million base pairs (Mbp), and the total number of tandemly repeated amino acid residues was 3803. In addition, A, AT, and AAT were among the most frequently repeated motifs, comprising more than 50% of the total motifs in each Plasmodium species. OSTRFPD can be configured to automatically generate computationally feasible primers targeting such microsatellite motifs. The process of primer design begins with identification of a microsatellite, followed by analysis of its flanking sequences and selection of a computationally feasible primer pair that can amplify the region containing the tandem repeats (Supplemental Figure 5). Microsatellite-targeted candidate genotyping primers were designed for the relatively less studied P ovale curtisi GH01 (Supplemental Table 1). For amino acid repeats, the highest number was detected in P falciparum (3803) with an average density of 908 repeats per million residues (Table 2). In addition, each motif-sequence and the associated frequency distribution of microsatellites (Figure 4), rRNA repeat motifs (Figure 5), and amino acid sequences (Figure 6) were automatically categorised to clearly elucidate the types of repeats involved.
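The density figures quoted above can be cross-checked with simple arithmetic, assuming density is defined as the repeat count divided by the total sequence length in millions of units. This definition and the function names are assumptions for illustration; the article does not state the exact formula OSTRFPD uses.

```python
def density_per_million(count, total_length):
    """Repeats per million bases (or residues) for a given sequence length."""
    return count / (total_length / 1_000_000)


def implied_length_millions(count, density):
    """Total sequence length (in millions) implied by a count and a density."""
    return count / density


if __name__ == "__main__":
    # Values quoted in the text for P falciparum.
    print(round(implied_length_millions(66146, 2835), 2))  # 23.33 (Mbp of DNA)
    print(round(implied_length_millions(3803, 908), 2))    # 4.19 (million residues)
```

Under this assumed definition, the reported counts and densities imply roughly 23.3 Mbp of genomic sequence and roughly 4.2 million amino acid residues scanned for P falciparum.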

Identification and simple alignment view of imperfect repeats
An in-depth analysis of imperfect microsatellites could be conducted by visualising the simple text-based alignment to identify indels. The example provided illustrates the results displayed for an imperfect alignment of a randomly selected Plasmodium DNA (Figure 7A) and protein (Figure 7B) sequence with their closest corresponding equivalent perfect repeats. In addition, the result displays Biopython's default local alignment scores, non-motif indels, and custom scores along with other minor parameters by default (Figure 7). Similar results can be obtained with user-specified command line parameters for DNA: 'python3 ostrfpd.py -scan dna -input source_dna_fasta -unitmin 1 -unitmax 3 -imperfect 10 -imalign true' and for protein: 'python3 ostrfpd.py -scan protein -input source_protein_fasta -unitmin 1 -unitmax 3 -imperfect 10 -imalign true'.
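The idea of comparing an imperfect repeat with its closest perfect equivalent in plain text can be sketched as follows. This is a deliberately simplified illustration that handles substitutions only; gapped alignment of indels, as reported by OSTRFPD through Biopython's local alignment scores, is not reproduced here. The function names and the '|' and '*' markers are assumptions for this example.

```python
def nearest_perfect_repeat(motif, length):
    """Build a perfect repeat of `motif` truncated to `length` characters."""
    copies = length // len(motif) + 1
    return (motif * copies)[:length]


def align_text(observed, motif):
    """Return a three-line text alignment and the number of mismatches.

    Line 1: observed (possibly imperfect) repeat
    Line 2: '|' for a match, '*' for a mismatch
    Line 3: perfect repeat of the same length
    """
    observed = observed.upper()
    perfect = nearest_perfect_repeat(motif.upper(), len(observed))
    marks = "".join("|" if a == b else "*" for a, b in zip(observed, perfect))
    return "\n".join([observed, marks, perfect]), marks.count("*")


if __name__ == "__main__":
    text, mismatches = align_text("ATATACATAT", "AT")
    print(text)
    print("mismatches:", mismatches)
```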

Processing speed, CPU, and memory usage
On average, a search for perfect repeats of 1 to 6 bp long DNA motifs in 'fast search' mode takes approximately 200 seconds for nearly 30 Mbp of sequence on a system with a 2.4 GHz Core i5 processor, 3 MB cache memory, and 4 GB DDR3 RAM. The search time was reduced to approximately 90 seconds for 1 to 4 bp DNA motifs under similar conditions. In contrast, for amino acid sequences totalling approximately 4 million residues, the search times for 1 to 3 and 1 to 2 amino acid long repeats in 'fast search' mode were approximately 468 and 75 seconds, respectively. However, the estimates were found to vary by 5% to 10% depending on the background computing load of the system. During each scanning process, the overall CPU usage by OSTRFPD remained in the range of 15% to 35%, allowing the computer to remain operable for regular multitasking.

Feature comparison with other microsatellite software
OSTRFPD was compared with other common microsatellite search tools belonging to a similar category. OSTRFPD was the only software with an option to filter out microsatellite-targeted primers based on short repeats found within flanking sequences (Table 3). In addition, OSTRFPD has the unique feature of direct analysis of nucleic acid (DNA and RNA) and amino acid sequences for tandem repeats. Other than Msatcommander, 22 OSTRFPD was the only offline tool that could simultaneously generate microsatellite-targeted primers without the need for any additional PERL scripts or manual steps (Table 3). Moreover, OSTRFPD improved on Msatcommander by identifying and categorising STRs with longer motifs. In contrast with MISA-Web 23 and SciRoKo, 24 OSTRFPD allowed a wider range of motif selection with the provision of filtering STRs based on multiple parameters including perfection threshold, flanking regions, and custom motifs. The dictionary-based search mode was exclusive to OSTRFPD among the tools compared and allowed precise control over the motif sequences being scanned, with longer motif ranges (1-30 bp or amino acids) for both nucleotide and protein sequences. OSTRFPD could selectively generate alignment-formatted output for imperfect repeats with custom scores, a feature minimally available in other software.

Discussion
OSTRFPD provides an integrated solution for identification of perfect or imperfect STRs with low complexity and microsatellite-targeted primer design. The ease of operation and the open-source and cross-platform compatibility of the software make it a useful tool for genome- or proteome-wide surveys of small- to medium-sized sequence databases.
Figure 6. Frequency distribution of unit amino acid repeat motifs in Plasmodium species using OSTRFPD. Entire known protein sequences of (A) Plasmodium falciparum 3D7, (B) Plasmodium vivax SAL-1, and (C) Plasmodium ovale curtisi GH01 were searched for 1 to 2 amino acid unit motifs with minimum repeat numbers of 7 and 5, respectively. Search criteria for the representative graph were limited to a maximum of 2 amino acid unit motifs due to the large number of unique motif types involved. Each letter on the x-axis represents the regular notation for an amino acid residue. Equivalent command line parameters were supplied as 'python3 ostrfpd.py -scan protein -input source_protein_fasta -unitmin 1 -unitmax 2 -misa 7,5'.

Table 3 (footnotes).
a Ability to design and simultaneously produce primers using Primer3 without the need of additional post-processing with PERL scripts or further manual steps.
b The maximum unit motif length of tandemly repeated nucleotide or amino acid residue supported by each software.
c For OSTRFPD using dictionary-based custom motif search, the maximum length for unit motif is 30 base pair (bp) or amino acid (aa).

Plasmodium species were suitable for validation of the STR mining capacity of this software because of their high microsatellite content and diversity. 4 The capabilities and features of OSTRFPD for identification and categorisation of nucleic or amino acid repeats in Plasmodium species suggest ease of operation and a suitable improvement over existing software. 22,23 Other than perfect microsatellites, STRs have various forms and complexities. 29,30 OSTRFPD partly addresses these issues by being able to detect imperfect repeats with low complexity. Specifically, the STRs that satisfy the minimum selection criteria are further examined for interruptions within the bounds of user-supplied imperfection limits. Moreover, these imperfect repeats can be scored and filtered based on percentage of perfectness, type of indels causing the imperfection, or a combination of both. The scoring scheme is essentially a numerical designation for the number of imperfect indels and imperfection-associated penalties that the user assigns for imperfect repetitive sequences. Similarly, perfectness is the percentage of perfect motif units within the imperfect repeat. For example, a perfect repeat containing 10 motifs scores 100% perfectness, whereas an imperfect repeat of the same length and motif but containing only 9 units of perfect repeats scores 90% perfectness. The ability of OSTRFPD to identify, score, and present imperfect STRs, and to provide output in both regular and alignment formats, can foster deeper understanding of repetitive elements in genomes and proteomes.

One important bottleneck in the study of STRs is the categorisation of motifs, which may occur in cyclic, palindromic, or complementary forms. For example, (ATA)n, (AAT)n, and (TAA)n are cyclic equivalents of each other and thus are categorised as the same motif under partial standardisation. Full standardisation incorporates cyclic equivalents and their reverse complements under the same category of repetitive sequence. Thus, (ATA)n, (AAT)n, (TAA)n, (TAT)n, (ATT)n, and (TTA)n will be categorised as the same motif under full standardisation. Options for both full and partial standardisation are available for nucleic acids, whereas the amino acid sequences are restricted to partial standardisation. Thus, OSTRFPD resolves this motif categorisation issue, which benefits the user by allowing the customisation of results based on the motif-sequence and the anticipated output format.

Another common problem faced during microsatellite-based primer design is the occurrence of low-numbered repeats in flanking regions. For example, the occurrence of (A)n or (AT)n within flanking regions, where n is generally less than half the value of the corresponding microsatellite detection threshold, creates problems in primer design. Manual inspection to mitigate these issues in a large data set is often not a feasible solution.
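The flanking-region problem described above can be illustrated with a minimal sketch of a filter that rejects microsatellites whose flanks contain short low-numbered repeats. It is not the OSTRFPD implementation: the function names, the flank length, and the default thresholds (`unit_max=2`, `min_repeats=3`) are hypothetical values chosen for the example.

```python
import re


def flank_has_short_repeat(flank, unit_max=2, min_repeats=3):
    """True if the flank holds any 1..unit_max unit motif repeated >= min_repeats times."""
    flank = flank.upper()
    for unit_len in range(1, unit_max + 1):
        pattern = r"(.{%d})\1{%d,}" % (unit_len, min_repeats - 1)
        if re.search(pattern, flank):
            return True
    return False


def keep_for_primer_design(sequence, start, end, flank_len=50, **thresholds):
    """Keep a microsatellite at sequence[start:end] only if both flanks are clean."""
    left = sequence[max(0, start - flank_len):start]
    right = sequence[end:end + flank_len]
    return not (
        flank_has_short_repeat(left, **thresholds)
        or flank_has_short_repeat(right, **thresholds)
    )


if __name__ == "__main__":
    clean = "GACTGACCTG" + "AT" * 8 + "CAGTCAGGTC"
    noisy = "GACTGAAAAG" + "AT" * 8 + "CAGTCAGGTC"  # left flank contains AAAA
    print(keep_for_primer_design(clean, 10, 26, flank_len=10))
    print(keep_for_primer_design(noisy, 10, 26, flank_len=10))
```

Microsatellites that pass such a filter would then be handed, with their flanks, to the primer design step.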
The presence of a configurable scanner to filter out microsatellites flanked by sequences harbouring low-numbered repeats significantly improves primer design. The implementation of all these filters for amino acid sequences is a novel feature of OSTRFPD and benefits users who wish to investigate STRs in a proteome database. Although there are several tandem repeat identification software tools, such as SciRoKo, Msatcommander, Phobos, 25 TRF, 26 SSRIT, 28 and MISA-Web, many are either closed-source or limited to detection in DNA sequences with no option for simultaneous primer design. 31 Unlike most microsatellite tools, the ability of OSTRFPD to directly implement Primer3 without additional PERL scripts drastically reduces manual post-processing steps for the construction of microsatellite-targeted primers. A typical microsatellite motif for genotyping markers is 2 to 5 bp in length, which can be handled easily by OSTRFPD. In addition, the software provides the option to detect tandemly repeated RNA sequences, which are rarely investigated but might still be useful for specific tasks such as analysis of ribosomal RNA, transcriptomes, and RNA virus genomes. 32 These RNA-associated tandem repeats may influence protein folding, ribosomal constructs, and binding activities of their target proteins or enzymes. 33,34 Implementation of OSTRFPD to directly evaluate tandemly repeated RNA sequences may contribute to the scant information available on repetitive RNA sequences. In addition, lysine-rich STRs have been observed in different protozoal parasites, including Plasmodium falciparum and Leishmania major. These parasites may generate these STRs de novo to modulate host protein targeting efficiency. 8,35 Simple amino acid repeats may provide flexibility for optimal folding of structural or functional domains; thus, OSTRFPD may assist researchers interested in proteome-wide quantification of such repeats.
Furthermore, inclusion of an option to implement a user-specified motif dictionary enables highly customisable searches for organism-specific motif identification as well as estimation of specific oligonucleotide or peptide sequence density. OSTRFPD runs slower than native C-compiled tools (ie, Phobos and SciRoKo) owing to the limitations of Python's architecture; however, the flexibility, unique features, ease of operation, and open-source nature of this software may compensate for its few drawbacks depending on the requirements of the user.

Author Contributions
VBM and MI designed the study. VBM wrote the source code and the manuscript and conducted data analysis. MI and AMD assisted in logistics and theoretical overview. All authors read and approved the final manuscript.