Publications

Proteogenomic characterization of pancreatic ductal adenocarcinoma

Published in Cell, 2021

Pancreatic ductal adenocarcinoma (PDAC) is a highly aggressive cancer with poor patient survival. Toward understanding the underlying molecular alterations that drive PDAC oncogenesis, we conducted comprehensive proteogenomic analysis of 140 pancreatic cancers, 67 normal adjacent tissues, and 9 normal pancreatic ductal tissues. Proteomic, phosphoproteomic, and glycoproteomic analyses were used to characterize proteins and their modifications. In addition, whole-genome sequencing, whole-exome sequencing, methylation, RNA sequencing (RNA-seq), and microRNA sequencing (miRNA-seq) were performed on the same tissues to facilitate an integrated proteogenomic analysis and determine the impact of genomic alterations on protein expression, signaling pathways, and post-translational modifications. To ensure robust downstream analyses, tumor neoplastic cellularity was assessed via multiple orthogonal strategies using molecular features and verified via pathological estimation of tumor cellularity based on histological review. This integrated proteogenomic characterization of PDAC will serve as a valuable resource for the community, paving the way for early detection and identification of novel therapeutic targets.

A Validated Analysis Pipeline for Mass Spectrometry-Based Vitreous Proteomics: Insights Into Proliferative Diabetic Retinopathy

Published in Research Square, 2021

Vitreous is an accessible, information-rich biofluid that has recently been studied as a source of retinal disease-related proteins and pathways. However, the number of samples required to confidently identify perturbed pathways remains unknown. In order to confidently identify these pathways, power analysis must be performed to determine the number of samples required.
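The abstract does not say which power-analysis method was used. As a minimal sketch only, the textbook normal-approximation formula for comparing two group means gives a per-group sample size of n = 2((z_{1-alpha/2} + z_{power}) / d)^2, where d is the standardized effect size. The function name and default values below are illustrative assumptions, not taken from the paper.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    effect_size is the standardized difference between group means (Cohen's d).
    This is a generic textbook formula, not the method used in the paper.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist().inv_cdf          # quantile function of the standard normal
    z_alpha = z(1 - alpha / 2)        # critical value for a two-sided test
    z_power = z(power)                # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.5, 1.0):
        print(f"d = {d}: about {samples_per_group(d)} samples per group")
```

A proteomics study tests many proteins at once, so a real calculation would also need a multiple-testing correction, which this sketch leaves out.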

A proteogenomic portrait of lung squamous cell carcinoma

Published in Cell, 2021

Lung squamous cell carcinoma (LSCC) remains a leading cause of cancer death with few therapeutic options. We characterized the proteogenomic landscape of LSCC, providing a deeper exposition of LSCC biology with potential therapeutic implications. We identify NSD3 as an alternative driver in FGFR1-amplified tumors and low-p63 tumors overexpressing the therapeutic target survivin. SOX2 is considered undruggable, but our analyses provide rationale for exploring chromatin modifiers such as LSD1 and EZH2 to target SOX2-overexpressing tumors. Our data support complex regulation of metabolic pathways by crosstalk between post-translational modifications including ubiquitylation. Numerous immune-related proteogenomic observations suggest directions for further investigation. Proteogenomic dissection of CDKN2A mutations argues for more nuanced assessment of RB1 protein expression and phosphorylation before declaring CDK4/6 inhibition unsuccessful. Finally, triangulation between LSCC, LUAD, and HNSCC identified both unique and common therapeutic vulnerabilities. These observations and proteogenomics data resources may guide research into the biology and treatment of LSCC.

Differences in Extracellular Vesicle Protein Cargo Are Dependent on Head and Neck Squamous Cell Carcinoma Cell of Origin and Human Papillomavirus Status

Published in Cancers, 2021

Many individuals with head and neck cancer do not survive, even with intense treatment. Patients with HPV-positive tumors generally have better survival; however, for yet unknown reasons, a subset are unresponsive to therapy. One strategy to monitor cancers for progression and recurrence is evaluation of extracellular vesicles, released by tumor cells into the blood and other body fluids. We can also understand differences in tumors and their behavior by comparing the molecules packaged into vesicles that are released from tumor cells. Our study examined differences in the proteins contained within extracellular vesicles released from head and neck cancer cells. We found that key extracellular vesicle proteins differed based on HPV status of the originating cell line and tumor, as well as how responsive the originating tumor was to treatment. Our findings suggest that these extracellular vesicle proteins may be important markers for continued investigation.

Proteomic Analyses of Vitreous in Proliferative Diabetic Retinopathy: Prior Studies and Future Outlook

Published in Journal of Clinical Medicine, 2021

Vitreous fluid is becoming an increasingly popular medium for the study of retinal disease. Numerous studies have demonstrated that proteomic analysis of the vitreous from patients with proliferative diabetic retinopathy yields valuable molecular information regarding known and novel proteins and pathways involved in this disease. However, there is no standardized methodology for vitreous proteomic studies. Here, we share a suggested protocol for such studies and outline the various experimental and analytic methods that are currently available. We also review prior mass spectrometry-based proteomic studies of the vitreous from patients with proliferative diabetic retinopathy, discuss common pitfalls of these studies, and propose next steps for moving the field forward.

Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma

Published in Cancer Cell, 2021

We present a proteogenomic study of 108 human papilloma virus (HPV)-negative head and neck squamous cell carcinomas (HNSCCs). Proteomic analysis systematically catalogs HNSCC-associated proteins and phosphosites, prioritizes copy number drivers, and highlights an oncogenic role for RNA processing genes. Proteomic investigation of mutual exclusivity between FAT1 truncating mutations and 11q13.3 amplifications reveals dysregulated actin dynamics as a common functional consequence. Phosphoproteomics characterizes two modes of EGFR activation, suggesting a new strategy to stratify HNSCCs based on EGFR ligand abundance for effective treatment with inhibitory EGFR monoclonal antibodies. Widespread deletion of immune modulatory genes accounts for low immune infiltration in immune-cold tumors, whereas concordant upregulation of multiple immune checkpoint proteins may underlie resistance to anti-programmed cell death protein 1 monotherapy in immune-hot tumors. Multi-omic analysis identifies three molecular subtypes with high potential for treatment with CDK inhibitors, anti-EGFR antibody therapy, and immunotherapy, respectively. Altogether, proteogenomics provides a systematic framework to inform HNSCC biology and treatment.

PTM-Shepherd: analysis and summarization of post-translational and chemical modifications from open search results

Published in Molecular & Cellular Proteomics, 2020

Open searching has proven to be an effective strategy for identifying both known and unknown modifications in shotgun proteomics experiments. Rather than being limited to a small set of user-specified modifications, open searches identify peptides with any mass shift that may correspond to a single modification or a combination of several modifications. Here we present PTM-Shepherd, a bioinformatics tool that automates characterization of PTM profiles detected in open searches based on attributes such as amino acid localization, fragmentation spectra similarity, retention time shifts, and relative modification rates. PTM-Shepherd can also perform multi-experiment comparisons for studying changes in modification profiles, e.g. in data generated in different laboratories or under different conditions. We demonstrate how PTM-Shepherd improves the analysis of data from formalin-fixed paraffin-embedded samples, detects extreme underalkylation of cysteine in some datasets, discovers an artefactual modification introduced during peptide synthesis, and uncovers site-specific biases in sample preparation artifacts in a multi-center proteomics profiling study.
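The mass-shift idea behind open searching can be illustrated with a small sketch: each peptide-spectrum match (PSM) yields a difference between the observed precursor mass and the theoretical peptide mass, and shifts that recur across many PSMs show up as peaks in a histogram. This is not PTM-Shepherd's implementation. The function name, the input layout, and the 0.01 Da bin width are assumptions chosen for illustration.

```python
from collections import Counter


def mass_shift_histogram(psms, bin_width=0.01):
    """Count PSMs per mass-shift bin.

    psms: iterable of (observed_mass, theoretical_mass) pairs in Da
          (hypothetical input layout).
    Returns a Counter keyed by integer bin index; the approximate shift
    for a bin is index * bin_width.
    """
    counts = Counter()
    for observed, theoretical in psms:
        shift = observed - theoretical
        counts[round(shift / bin_width)] += 1
    return counts


if __name__ == "__main__":
    # Toy data: one unmodified match, two matches near +15.995 Da
    # (consistent with oxidation) and one near +79.966 Da
    # (consistent with phosphorylation).
    toy_psms = [
        (1000.0000, 1000.0),
        (1015.9949, 1000.0),
        (1015.9947, 1000.0),
        (1079.9663, 1000.0),
    ]
    for index, n in sorted(mass_shift_histogram(toy_psms).items()):
        print(f"shift ~ {index * 0.01:.2f} Da: {n} PSM(s)")
```

Keying the histogram by integer bin index avoids comparing floating-point bin centers directly.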

Integrated Proteogenomic Characterization across Major Histological Types of Pediatric Brain Cancer

Published in Cell, 2020

We report a comprehensive proteogenomics analysis, including whole-genome sequencing, RNA sequencing, and proteomics and phosphoproteomics profiling, of 218 tumors across 7 histological types of childhood brain cancer: low-grade glioma (n = 93), ependymoma (32), high-grade glioma (25), medulloblastoma (22), ganglioglioma (18), craniopharyngioma (16), and atypical teratoid rhabdoid tumor (12). Proteomics data identify common biological themes that span histological boundaries, suggesting that treatments used for one histological type may be applied effectively to other tumors sharing similar proteomics features. Immune landscape characterization reveals diverse tumor microenvironments across and within diagnoses. Proteomics data further reveal functional effects of somatic mutations and copy number variations (CNVs) not evident in transcriptomics data. Kinase-substrate association and co-expression network analysis identify important biological mechanisms of tumorigenesis. This is the first large-scale proteogenomics analysis across traditional histological boundaries to uncover foundational pediatric brain tumor biology and inform rational treatment selection.

Regulation of ALT-associated homology-directed repair by polyADP-ribosylation

Published in Nature Structural & Molecular Biology, 2020

The synthesis of poly(ADP-ribose) (PAR) reconfigures the local chromatin environment and recruits DNA-repair complexes to damaged chromatin. PAR degradation by poly(ADP-ribose) glycohydrolase (PARG) is essential for progression and completion of DNA repair. Here, we show that inhibition of PARG disrupts homology-directed repair (HDR) mechanisms that underpin alternative lengthening of telomeres (ALT). Proteomic analyses uncover a new role for poly(ADP-ribosyl)ation (PARylation) in regulating the chromatin-assembly factor HIRA in ALT cancer cells. We show that HIRA is enriched at telomeres during the G2 phase and is required for histone H3.3 deposition and telomere DNA synthesis. Depletion of HIRA elicits systemic death of ALT cancer cells that is mitigated by re-expression of ATRX, a protein that is frequently inactivated in ALT tumors. We propose that PARylation enables HIRA to fulfill its essential role in the adaptive response to ATRX deficiency that pervades ALT cancers.

Philosopher: a versatile toolkit for shotgun proteomics data analysis

Published in Nature Methods, 2020

Here we introduce Philosopher (https://philosopher.nesvilab.org), a free, open-source, versatile and robust data analysis toolkit designed to bring easy access to a powerful and comprehensive set of computational tools for shotgun proteomics data analysis.

Proteogenomic Characterization Reveals Therapeutic Vulnerabilities in Lung Adenocarcinoma

Published in Cell, 2020

To explore the biology of lung adenocarcinoma (LUAD) and identify new therapeutic opportunities, we performed comprehensive proteogenomic characterization of 110 tumors and 101 matched normal adjacent tissues (NATs) incorporating genomics, epigenomics, deep-scale proteomics, phosphoproteomics, and acetylproteomics. Multi-omics clustering revealed four subgroups defined by key driver mutations, country, and gender. Proteomic and phosphoproteomic data illuminated biology downstream of copy number aberrations, somatic mutations, and fusions and identified therapeutic vulnerabilities associated with driver events involving KRAS, EGFR, and ALK. Immune subtyping revealed a complex landscape, reinforced the association of STK11 with immune-cold behavior, and underscored a potential immunosuppressive role of neutrophil degranulation. Smoking-associated LUADs showed correlation with other environmental exposure signatures and a field effect in NATs. Matched NATs allowed identification of differentially expressed proteins with potential diagnostic and therapeutic utility. This proteogenomics dataset represents a unique public resource for researchers and clinicians seeking to better understand and treat lung adenocarcinomas.

Quantitative Proteomic Landscape of Metaplastic Breast Carcinoma Pathological Subtypes and Their Relationship to Triple-Negative Tumors

Published in Nature Communications, 2020

Metaplastic breast carcinoma (MBC) is a highly aggressive form of triple-negative breast cancer (TNBC), defined by the presence of metaplastic components of spindle, squamous, or sarcomatoid histology. The protein profiles underpinning the pathological subtypes and metastatic behavior of MBC are unknown. Using multiplex quantitative tandem mass tag-based proteomics, we quantify 5798 proteins in MBC, TNBC, and normal breast from 27 patients. Comparing MBC and TNBC protein profiles, we show MBC-specific increases related to epithelial-to-mesenchymal transition and extracellular matrix, and reduced metabolic pathways. MBC subtypes exhibit distinct upregulated profiles, including translation and ribosomal events in spindle, inflammation- and apical junction-related proteins in squamous, and extracellular matrix proteins in sarcomatoid subtypes. Comparison of the proteomes of human spindle MBC with mouse spindle (CCN6 knockout) MBC tumors reveals a shared spindle-specific signature of 17 upregulated proteins involved in translation and 19 downregulated proteins with roles in cell metabolism. These data identify potential subtype-specific MBC biomarkers and therapeutic targets.

Crystal-C: A Computational Tool for Refinement of Open Search Results

Published in Journal of Proteome Research, 2020

Shotgun proteomics using liquid chromatography coupled to mass spectrometry (LC-MS) is commonly used to identify peptides containing post-translational modifications. With the emergence of fast database search tools such as MSFragger, the approach of enlarging precursor mass tolerances during the search (termed "open search") has been increasingly used for comprehensive characterization of post-translational and chemical modifications of protein samples. However, not all mass shifts detected using the open search strategy represent true modifications, as artifacts exist from sources such as unaccounted missed cleavages or peptide co-fragmentation (chimeric MS/MS spectra). Here, we present Crystal-C, a computational tool that detects and removes such artifacts from open search results. Our analysis using Crystal-C shows that, in a typical shotgun proteomics data set, the number of such observations is relatively small. Nevertheless, removing these artifacts helps to simplify the interpretation of the mass shift histograms, which in turn should improve the ability of open search-based tools to detect potentially interesting mass shifts for follow-up investigation.

Deep Proteomics Using Two Dimensional Data Independent Acquisition Mass Spectrometry

Published in Analytical Chemistry, 2020

Methodologies that facilitate high-throughput proteomic analysis are a key step toward moving proteome investigations into clinical translation. Data independent acquisition (DIA) has potential as a high-throughput analytical method due to the reduced time needed for sample analysis, as well as its high quantitative accuracy. However, a limiting feature of DIA methods is the sensitivity of detection of low-abundance proteins and depth of coverage, which other mass spectrometry approaches address by two-dimensional (2D) fractionation to reduce sample complexity during data acquisition. In this study, we developed a 2D-DIA method intended for rapid and deeper proteome analysis compared to conventional 1D-DIA analysis. First, we characterized 96 individual fractions obtained from the protein standard, NCI-7, using a data-dependent acquisition (DDA) approach, identifying a total of 151,366 unique peptides from 11,273 protein groups. We observed that the majority of the proteins can be identified from just a few selected fractions. By performing an optimization analysis, we identified six fractions with high peptide number and uniqueness that can account for 80% of the proteins identified in the entire experiment. These selected fractions were combined into a single sample, which was then subjected to DIA quantitative analysis (referred to as 2D-DIA). Furthermore, improved DIA quantification was achieved using a hybrid spectral library, obtained by combining peptides identified from DDA data with peptides identified directly from the DIA runs with the help of DIA-Umpire. The optimized 2D-DIA method allowed for improved identification and quantification of low-abundance proteins compared to conventional unfractionated DIA analysis (1D-DIA). We then applied the 2D-DIA method to profile the proteomes of two breast cancer patient-derived xenograft (PDX) models, quantifying 6,217 and 6,167 unique proteins in basal and luminal tumors, respectively. Overall, this study demonstrates the potential of high-throughput quantitative proteomics using a novel 2D-DIA method.
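The abstract does not describe the optimization that selected the six fractions. One common way to approach this kind of problem is a greedy maximum-coverage heuristic, sketched below as a stand-in rather than the authors' procedure. The function name, the input layout (fraction name mapped to a set of protein identifiers), and the toy data are hypothetical.

```python
def select_fractions(fraction_proteins, max_fractions=6):
    """Greedily pick fractions that add the most not-yet-covered proteins.

    fraction_proteins: dict mapping a fraction name to the set of proteins
                       identified in that fraction (hypothetical layout).
    Returns (chosen fraction names in pick order, fraction of all proteins covered).
    """
    all_proteins = set().union(*fraction_proteins.values())
    covered, chosen = set(), []
    for _ in range(max_fractions):
        remaining = [f for f in fraction_proteins if f not in chosen]
        if not remaining:
            break
        # Pick the fraction contributing the most new proteins.
        best = max(remaining, key=lambda f: len(fraction_proteins[f] - covered))
        if not fraction_proteins[best] - covered:
            break  # no remaining fraction adds anything new
        chosen.append(best)
        covered |= fraction_proteins[best]
    coverage = len(covered) / len(all_proteins) if all_proteins else 0.0
    return chosen, coverage


if __name__ == "__main__":
    toy = {
        "F1": {"A", "B", "C"},
        "F2": {"C", "D"},
        "F3": {"A"},
        "F4": {"E"},
    }
    picked, cov = select_fractions(toy, max_fractions=2)
    print(picked, f"{cov:.0%} of proteins covered")
```

The greedy choice is not guaranteed to be optimal, but it is simple and usually close for coverage problems of this shape.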

Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma

Published in Cell, 2019

To elucidate the deregulated functional modules that drive clear cell renal cell carcinoma (ccRCC), we performed comprehensive genomic, epigenomic, transcriptomic, proteomic, and phosphoproteomic characterization of treatment-naive ccRCC and paired normal adjacent tissue samples. Genomic analyses identified a distinct molecular subgroup associated with genomic instability. Integration of proteogenomic measurements uniquely identified protein dysregulation of cellular mechanisms impacted by genomic alterations, including oxidative phosphorylation-related metabolism, protein translation processes, and phospho-signaling modules. To assess the degree of immune infiltration in individual tumors, we identified microenvironment cell signatures that delineated four immune-based ccRCC subtypes characterized by distinct cellular pathways. This study reports a large-scale proteogenomic analysis of ccRCC to discern the functional impact of genomic alterations and provides evidence for rational treatment selection stemming from ccRCC pathobiology.

Unveiling the Partners of the DRBD2-mRNP Complex, an RBP in Trypanosoma cruzi and Ortholog to the Yeast SR-protein Gbp2

Published in BMC Microbiology, 2019

RNA-binding proteins (RBPs) are well known as key factors in gene expression regulation in eukaryotes. These proteins associate with mRNAs and other proteins to form mRNP complexes that ultimately determine the fate of target transcripts in the cell. This association is usually mediated by an RNA-recognition motif (RRM). In the case of trypanosomatids, these proteins play a paramount role, as gene expression regulation is mostly posttranscriptional. Despite their relevance in the life cycle of Trypanosoma cruzi, the causative agent of Chagas' disease, to date, few RBPs have been characterized in this parasite.

Recommendations for the Packaging and Containerizing of Bioinformatics Software

Published in F1000Research, 2018

Software containers are changing the way scientists and researchers develop, deploy and exchange scientific software. They allow labs of all sizes to easily install bioinformatics software, maintain multiple versions of the same software and combine tools into powerful analysis pipelines. However, containers and software packages should be produced under certain rules and standards in order to be reusable, compatible and easy to integrate into pipelines and analysis workflows. Here, we present a set of recommendations developed by the BioContainers Community to produce standardized bioinformatics packages and containers. These recommendations provide practical guidelines to make bioinformatics software more discoverable, reusable and transparent. They aim to guide developers, organisations, journals and funders in increasing the quality and sustainability of research software.

The Ewing Sarcoma Secretome and Its Response to Activation of Wnt/beta-catenin Signaling

Published in Molecular & Cellular Proteomics, 2018

Tumor: tumor microenvironment (TME) interactions are critical for tumor progression, and the composition and structure of the local extracellular matrix (ECM) are key determinants of tumor metastasis. We recently reported that activation of Wnt/beta-catenin signaling in Ewing sarcoma cells induces widespread transcriptional changes that are associated with acquisition of a metastatic tumor phenotype. Significantly, ECM protein-encoding genes were found to be enriched among Wnt/beta-catenin induced transcripts, leading us to hypothesize that activation of canonical Wnt signaling might induce changes in the Ewing sarcoma secretome. To address this hypothesis, conditioned media from Ewing sarcoma cell lines cultured in the presence or absence of Wnt3a were collected for proteomic analysis. Label-free mass spectrometry was used to identify and quantify differentially secreted proteins. We then used in silico databases to identify only proteins annotated as secreted. Comparison of the secretomes of two Ewing sarcoma cell lines revealed numerous shared proteins, as well as a degree of heterogeneity, in both basal and Wnt-stimulated conditions. Gene set enrichment analysis of secreted proteins revealed that Wnt stimulation reproducibly resulted in increased secretion of proteins involved in ECM organization, ECM receptor interactions, and collagen formation. In particular, Wnt-stimulated Ewing sarcoma cells up-regulated secretion of structural collagens, as well as matricellular proteins, such as the metastasis-associated protein tenascin C (TNC). Interrogation of published databases confirmed reproducible correlations between Wnt/beta-catenin activation and TNC and COL1A1 expression in patient tumors. In summary, this first study of the Ewing sarcoma secretome reveals that Wnt/beta-catenin activated tumor cells upregulate secretion of ECM proteins. Such Wnt/beta-catenin mediated changes are likely to impact on tumor: TME interactions that contribute to metastatic progression.

BioContainers: An Open-Source and Community-Driven Framework for Software Standardization

Published in Bioinformatics, 2017

BioContainers (biocontainers.pro) is an open-source and community-driven framework that provides platform-independent executable environments for bioinformatics software. BioContainers allows labs of all sizes to easily install bioinformatics software, maintain multiple versions of the same software and combine tools into powerful analysis pipelines. BioContainers is based on the popular open-source frameworks Docker and rkt, which allow software to be installed and executed in an isolated and controlled environment. It also provides infrastructure and basic guidelines to create, manage and distribute bioinformatics containers, with a special focus on omics technologies. These containers can be integrated into more comprehensive bioinformatics pipelines and different architectures (local desktop, cloud environments or HPC clusters).

A Multi-Protease, Multi-Dissociation, Bottom-Up-To-Top-Down Proteomic View of the Loxosceles intermedia Venom

Published in Scientific Data, 2017

Venoms are a rich source for the discovery of molecules with biotechnological applications, but their analysis is challenging even for state-of-the-art proteomics. Here we report on a large-scale proteomic assessment of the venom of Loxosceles intermedia, the so-called brown spider. Venom was extracted from 200 spiders and fractionated into two aliquots relative to a 10 kDa cutoff mass. Each of these was further fractionated and digested with trypsin (4 h), trypsin (18 h), pepsin (18 h), and chymotrypsin (18 h), then analyzed by MudPIT on an LTQ-Orbitrap XL ETD mass spectrometer fragmenting precursors by CID, HCD, and ETD. Aliquots of undigested samples were also analyzed. Our experimental design allowed us to apply spectral networks, thus enabling us to obtain meta-contig assemblies, and consequently de novo sequencing of practically complete proteins, culminating in a deep proteome assessment of the venom. Data are available via ProteomeXchange, with identifier PXD005523.

MSFragger: Ultrafast and Comprehensive Peptide Identification in Mass Spectrometry-Based Proteomics

Published in Nature Methods, 2017

There is a need to better understand and handle the 'dark matter' of proteomics: the vast diversity of post-translational and chemical modifications that are unaccounted for in a typical mass spectrometry-based analysis and thus remain unidentified. We present a fragment-ion indexing method, and its implementation in the peptide identification tool MSFragger, that enables a more than 100-fold improvement in speed over most existing proteome database search tools. Using several large proteomic data sets, we demonstrate how MSFragger empowers the open database search concept for comprehensive identification of peptides and all their modified forms, uncovering dramatic differences in modification rates across experimental samples and conditions. We further illustrate its utility using protein-RNA cross-linked peptide data and using affinity purification experiments where we observe, on average, a 300% increase in the number of identified spectra for enriched proteins. We also discuss the benefits of open searching for improved false discovery rate estimation in proteomics.

Discovering and Linking Public ‘Omics’ Datasets using the Omics Discovery Index

Published in Nature Biotechnology, 2017

Biomedical data are being produced at an unprecedented rate owing to the falling cost of experiments and wider access to genomics, transcriptomics, proteomics and metabolomics platforms [1,2]. As a result, public deposition of omics data is on the increase. This presents new challenges, including finding ways to store, organize and access different types of biomedical data present on different platforms. We present the Omics Discovery Index (OmicsDI - http://www.omicsdi.org), an open source platform that enables access, discovery and dissemination of omics datasets.

Quantitative Proteomic Analysis of the Saccharomyces cerevisiae Industrial Strains CAT-1 and PE-2

Published in Journal of Proteomics, 2017

The Brazilian ethanol fermentation process commonly uses baker's yeast as inoculum. In recent years, wild-type yeast strains have been widely adopted. The two most successful examples are PE-2 and CAT-1, currently employed in Brazilian distilleries. In the present study, we analyzed how these strains compete for nutrients in the same environment and compared the potential characteristics that affect their performance by applying quantitative proteomics methods. Through the use of isobaric tagging, it was possible to compare protein abundances between both strains during the fermentation process. Our results revealed a better fermentation performance and robustness of the CAT-1 strain. The proteomic results demonstrated many possible features that may be linked to the improved fermentation traits of CAT-1. Proteins involved in response to oxidative stress (Sod1 and Trx1) and trehalose synthesis (Tps3) were more abundant in CAT-1 than in PE-2 after a fermentation batch. Tolerance to oxidative stress and trehalose accumulation were subsequently demonstrated to be enhanced for CAT-1, corroborating the comparative proteomic results. The importance of trehalose and the antioxidant system was confirmed by using mutant strains deleted in Sod1, Trx1 or Tps3. These deletions impaired fermentation performance, strengthening the idea that the abilities of accumulating high levels of trehalose and coping with oxidative stress are crucial for improving fermentation.

Ten Simple Rules for Taking Advantage of Git and GitHub

Published in PLOS Computational Biology, 2016

Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code alongside concomitant collaborative development is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content for open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses as well as free plans including private repositories for research and educational use.

Venomous Extract Protein Profile of Brazilian Tarantula Grammostola iheringi: Searching for Potential Biotechnological Applications

Published in Journal of Proteomics, 2016

Tarantula spiders, of the Theraphosidae family, are spread throughout most tropical regions of the world. Despite their size and reputation, there are few reports of accidents. However, like other spiders, their venom is considered a remarkable source of toxins, which have been selected through millions of years of evolution. The present work provides a proteomic overview of the fascinating complexity of the venomous extract of the Grammostola iheringi tarantula, obtained by electrical stimulation of the chelicerae. For the analysis, a bottom-up proteomic approach, Multidimensional Protein Identification Technology (MudPIT), was used. Based on bioinformatics analyses with PepExplorer, a similarity-driven search tool that identifies proteins based on phylogenetically close organisms, a total of 395 proteins were identified in this venomous extract. Most of the identifications (~70%) were classified as predicted (21%), hypothetical (6%) and putative (37%), while a small group (6%) had no predicted function. Identified molecules matched neurotoxins that act on ion channels; proteases, such as serine proteases, metalloproteinases, cysteine proteinases, aspartic proteinases and carboxypeptidases; cysteine-rich secretory proteins (CRISP); and some molecules with unknown targets. Additionally, non-classical venom proteins were also identified. To date, this study represents the first broad characterization of the composition of G. iheringi venomous extract. Our data provide a tantalizing insight into the diversity of proteins in this venom and their biotechnological potential.

Integrated Analysis of Shotgun Proteomic Data With PatternLab for Proteomics 4.0

Published in Nature Protocols, 2016

PatternLab for proteomics is an integrated computational environment that unifies several previously published modules for the analysis of shotgun proteomic data. The contained modules allow for formatting of sequence databases, peptide spectrum matching, statistical filtering and data organization, extracting quantitative information from label-free and chemically labeled data, and analyzing statistics for differential proteomics. PatternLab also has modules to perform similarity-driven studies with de novo sequencing data, to evaluate time-course experiments and to highlight the biological significance of data with regard to the Gene Ontology database. The PatternLab for proteomics 4.0 package brings together all of these modules in a self-contained software environment, which allows for complete proteomic data analysis and the display of results in a variety of graphical formats. All updates to PatternLab, including new features, have been previously tested on millions of mass spectra. PatternLab is easy to install, and it is freely available from http://patternlabforproteomics.org.

Using PepExplorer to Filter and Organize De Novo Peptide Sequencing Results

Published in Current Protocols in Bioinformatics, 2015

PepExplorer aids in the biological interpretation of de novo sequencing results; this is accomplished by assembling a list of homolog proteins obtained by aligning results from widely adopted de novo sequencing tools against a target-decoy sequence database. Our tool relies on pattern recognition to ensure that the results satisfy a user-given false-discovery rate (FDR). For this, it employs a radial basis function neural network that considers the precursor charge states, de novo sequencing scores, the peptide lengths, and alignment scores. PepExplorer is recommended for studies addressing organisms with no genomic sequence available. PepExplorer is integrated into the PatternLab for proteomics environment, which makes available various tools for downstream data analysis, including the resources for quantitative and differential proteomics.

Reevaluating the Trypanosoma Cruzi Proteomic Map: The Shotgun Description of Bloodstream Trypomastigotes

Published in Journal of Proteomics, 2015

Chagas disease is a neglected disease caused by the protozoan Trypanosoma cruzi. This kinetoplastid has a life cycle involving different forms and hosts, with trypomastigotes being the main infective form. Despite various T. cruzi proteomic studies, the profile of bloodstream trypomastigotes remains unexplored. The aim of this work is a proteomic description of the T. cruzi bloodstream form. Employing a shotgun approach, 17,394 peptides were identified, corresponding to 7514 proteins, of which 5901 belong to T. cruzi. Cytoskeletal proteins, chaperones, bioenergetics-related enzymes, and trans-sialidases are among the top-scoring. GO analysis revealed that all T. cruzi compartments were assessed, and the majority of proteins are involved in metabolic processes and/or presented catalytic activity. The comparative analysis between the bloodstream trypomastigotes and culture-derived or metacyclic trypomastigote proteomic profiles pointed to 2202 proteins exclusively detected in the bloodstream form. These exclusive proteins are related to: (a) surface proteins; (b) non-classical secretion pathway; (c) cytoskeletal dynamics; (d) cell cycle and transcription; (e) proteolysis; (f) redox metabolism; (g) biosynthetic pathways; (h) bioenergetics; (i) protein folding; (j) cell signaling; (k) vesicular traffic; (l) DNA repair; and (m) cell death. This large-scale evaluation of bloodstream trypomastigotes, responsible for the parasite dissemination in the patient, marks a step forward in the comprehension of Chagas disease pathogenesis.

PepExplorer: A Similarity-Driven Tool for Analyzing De Novo Sequencing Results

Published in Molecular & Cellular Proteomics, 2014

Peptide spectrum matching is the current gold standard for protein identification via mass-spectrometry-based proteomics. Peptide spectrum matching compares experimental mass spectra against theoretical spectra generated from a protein sequence database to perform identification, but protein sequences not present in a database cannot be identified unless their sequences are in part conserved. The alternative approach, de novo sequencing, can make it possible to infer a peptide sequence directly from a mass spectrum, but interpreting long lists of peptide sequences resulting from large-scale experiments is not trivial. With this as motivation, PepExplorer was developed to use rigorous pattern recognition to assemble a list of homologue proteins using de novo sequencing data coupled to sequence alignment to allow biological interpretation of the data. PepExplorer can read the output of various widely adopted de novo sequencing tools and converge to a list of proteins with a global false-discovery rate. To this end, it employs a radial basis function neural network that considers precursor charge states, de novo sequencing scores, peptide lengths, and alignment scores to select similar protein candidates from a target-decoy database, usually obtained from phylogenetically related species. Alignments are performed using a modified Smith-Waterman algorithm tailored for the task at hand. We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching Pyrococcus furiosus mass spectra against the corresponding NCBI RefSeq database. We then modified the sequence database by swapping amino acids until ProLuCID was no longer capable of identifying any proteins. By searching the mass spectra using PepExplorer on the modified database, we were able to recover most of the identifications at a 1% false-discovery rate. Finally, we employed PepExplorer to disclose a comprehensive proteomic assessment of the Bothrops jararaca plasma, a known biological source of natural inhibitors of snake toxins. PepExplorer is integrated into the PatternLab for Proteomics environment, which makes available various tools for downstream data analysis, including resources for quantitative and differential proteomics.
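
As a rough illustration of the target-decoy idea mentioned in this abstract, the sketch below picks the lowest score cutoff at which the estimated false-discovery rate (decoy hits divided by target hits) stays at or below a chosen level. It is not PepExplorer's implementation, which scores candidates with a radial basis function neural network; the function name, the (score, is_decoy) layout, and the example scores are all hypothetical.

```python
def score_threshold_at_fdr(hits, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    hits: iterable of (score, is_decoy) pairs; a higher score is a better match.
    FDR is estimated as (#decoys / #targets) among hits at or above the cutoff.
    Returns None when no cutoff satisfies the requested FDR.
    """
    ordered = sorted(hits, key=lambda h: h[0], reverse=True)
    targets = decoys = 0
    best = None
    for i, (score, is_decoy) in enumerate(ordered):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate a cutoff only after the last hit sharing this score (ties).
        if i + 1 < len(ordered) and ordered[i + 1][0] == score:
            continue
        if targets > 0 and decoys / targets <= max_fdr:
            best = score
    return best


# Hypothetical alignment scores: four target hits and one decoy hit.
example_hits = [(10.0, False), (9.0, False), (8.0, False), (7.0, True), (6.0, False)]
strict_cutoff = score_threshold_at_fdr(example_hits, max_fdr=0.0)
```

With a 0% tolerance, the cutoff stops just above the first decoy; relaxing `max_fdr` lets lower-scoring hits through at the cost of more expected false positives.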

On Best Practices in the Development of Bioinformatics Software

Published in Frontiers in Genetics, 2014

Bioinformatics is one of the major areas of study in modern biology. Medium- and large-scale quantitative biology studies have created a demand for professionals with proficiency in multiple disciplines, including computer science and statistical inference besides biology. Bioinformatics has now become a cornerstone in biology, and yet the formal training of new professionals (Perez-Riverol et al., 2013; Via et al., 2013), the availability of good services for data deposition, and the development of new standards and software coding rules (Sandve et al., 2013; Seemann, 2013) are still major concerns. Good programming practices range from documentation and code readability through design patterns and testing (Via et al., 2013; Wilson et al., 2014). Here, we highlight some points for best practices and raise important issues to be discussed by the community.

Bio::DB::NextProt: A Perl Module for neXtProt Database Information Retrieval

Published in PeerJ, 2014

The neXtProt database is a comprehensive knowledge platform recently adopted by the Chromosome-centric Human Proteome Project as the main reference database. The primary goal of the project is to identify and catalog every human protein encoded in the human genome. To this end, computational approaches have an important role, as data analysis and dedicated software are indispensable. Here we describe Bio::DB::NextProt, a Perl module that provides object-oriented access to the neXtProt REST Web services, enabling the programmatic retrieval of structured information. The Bio::DB::NextProt module presents a new way to interact with and download information from the neXtProt database. Every parameter available through the REST API is covered by the module, allowing a fast, dynamic and ready-to-use alternative for those who need to access neXtProt data. Bio::DB::NextProt is an easy-to-use module that provides automatic retrieval of data, ready to be integrated into third-party software or to be used by other programmers on the fly. The module is freely available from CPAN (metacpan.org/release/Bio-DB-NextProt) and GitHub (github.com/Leprevost/Bio-DB-NextProt) and is released under the perl_5 license.

Proteome Analysis of Formalin-Fixed Paraffin-Embedded Tissues from a Primary Gastric Melanoma and its Meningeal Metastasis: A Case Report

Published in Current Topics in Medicinal Chemistry, 2014

Melanoma is the third most common cause of brain metastasis in the United States, as it has a relatively high susceptibility to metastasize to the central nervous system. Among the different origins for brain metastasis, those originating from primary gastric melanomas are extremely rare. Here, we compare protein profiles obtained from formalin-fixed paraffin-embedded (FFPE) tissues of a primary gastric melanoma with its meningeal metastasis. For this, the contents of a microscope slide were scraped and ultimately analyzed by nano-chromatography coupled online with tandem mass spectrometry using an Orbitrap XL. Our results disclose 184 proteins uniquely identified in the primary gastric melanoma, 304 in the meningeal metastasis, and 177 in common. Notably, we identified several enzymes related to changes in the metabolism that are linked to producing energy by elevated rates of glycolysis in a process called the Warburg effect. Moreover, we show that our FFPE proteomic approach allowed identification of key biological markers such as the S100 protein, which we further validated by immunohistochemistry for both the primary and metastatic tumor samples. In this way, we demonstrate a powerful strategy to retrospectively mine data for aiding in the understanding of metastasis, biomarker discovery, and ultimately, diseases. To our knowledge, these results disclose for the first time a comparison of the proteomic profiles of gastric melanoma and its corresponding meningeal metastasis.

Pinpointing differentially expressed domains in complex protein mixtures with the cloud service of PatternLab for Proteomics

Published in Journal of Proteomics, 2013

Mass-spectrometry-based shotgun proteomics has become a widespread technology for analyzing complex protein mixtures. Here we describe a new module integrated into PatternLab for Proteomics that allows the pinpointing of differentially expressed domains. This is accomplished by inferring functional domains through our cloud service, using HMMER3 and Pfam remotely, and then mapping the quantitation values into domains for downstream analysis. In all, spotting which functional domains are changing when comparing biological states serves as a complementary approach to facilitate the understanding of a system's biology. We exemplify the new module's use by reanalyzing a previously published MudPIT dataset of Cryptococcus gattii cultivated under iron-depleted and replete conditions. We show how the differential analysis of functional domains can facilitate the interpretation of proteomic data by providing further valuable insight.
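
The step of mapping quantitation values into domains can be pictured with a toy aggregation. The sketch below simply sums each protein's quantitation value into every domain annotated on that protein. This summing rule, the function name, and the input layout are assumptions made for illustration; the abstract does not specify how PatternLab combines the values, and the domain annotations here stand in for what HMMER3 and Pfam would provide.

```python
from collections import defaultdict


def quant_by_domain(protein_quant, protein_domains):
    """Sum protein-level quantitation values per annotated domain.

    protein_quant: dict mapping protein ID -> quantitation value.
    protein_domains: dict mapping protein ID -> list of domain IDs.
    Proteins without any annotated domain are skipped.
    """
    totals = defaultdict(float)
    for protein, value in protein_quant.items():
        # set() so a domain repeated within one protein is counted once.
        for domain in set(protein_domains.get(protein, [])):
            totals[domain] += value
    return dict(totals)


# Hypothetical protein and domain identifiers.
example_totals = quant_by_domain(
    {"P1": 2.0, "P2": 3.0, "P3": 1.0},
    {"P1": ["PF0001", "PF0002"], "P2": ["PF0001"]},
)
```

Running the same aggregation for each biological state yields per-domain values that can then be compared between conditions.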

Effectively addressing complex proteomic search spaces with peptide spectrum matching

Published in Bioinformatics, 2013

Protein identification by mass spectrometry is commonly accomplished using a peptide sequence matching search algorithm, whose sensitivity varies inversely with the size of the sequence database and the number of post-translational modifications considered. We present the Spectrum Identification Machine, a peptide sequence matching tool that capitalizes on the high-intensity b1-fragment ion of tandem mass spectra of peptides coupled in solution with phenylisothiocyanate to confidently sequence the first amino acid and ultimately reduce the search space. We demonstrate that in complex search spaces, a gain of some 120% in sensitivity can be achieved.
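
The search-space reduction described in this abstract can be pictured with a toy filter: once the N-terminal residue of a peptide is known with confidence, only candidate peptides starting with that residue need to be scored. The sketch below is not the Spectrum Identification Machine's code; the function name and candidate list are hypothetical, and the first residue is simply supplied rather than read from a b1 fragment ion. Because leucine and isoleucine have the same mass, the filter accepts a set of allowed residues instead of a single one.

```python
def filter_candidates(candidates, allowed_first_residues):
    """Keep candidate peptides whose first residue is in the allowed set.

    candidates: iterable of peptide sequences (one-letter amino acid codes).
    allowed_first_residues: string or iterable of acceptable N-terminal residues.
    Empty sequences are dropped.
    """
    allowed = set(allowed_first_residues)
    return [pep for pep in candidates if pep[:1] in allowed]


# Hypothetical candidates sharing a precursor mass window.
example_candidates = ["LKTR", "IKTR", "AKTR", "GLSDK"]
reduced = filter_candidates(example_candidates, "LI")
```

Every candidate removed this way is one fewer sequence that can produce a spurious match, which is the intuition behind the sensitivity gain reported above.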