58,598 results
The replication package for the paper "A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts", including the dataset, results, and tools.
Data Types:
  • Dataset
  • File Set
Abstract: Planning for power systems with high penetrations of variable renewable energy requires higher spatial and temporal granularity. However, most publicly available test systems are of insufficient fidelity for developing methods and tools for high-resolution planning. This paper presents methods to construct open-access test systems of high spatial granularity to more accurately represent current infrastructure and high temporal granularity to represent variability of demand and renewable resources. To demonstrate, a high-resolution test system representing the United States is created using only publicly available data. This test system is validated by running it in a production cost model, with results validated against historical generation to ensure that they are representative. The resulting open source test system can support power system transition planning and aid in development of tools to answer questions around how best to reach decarbonization goals, using the most effective combinations of transmission expansion, renewable generation, and energy storage.
Documentation of dataset development: A paper describing the process of developing the dataset is available at https://arxiv.org/abs/2002.06155. Please cite as: Y. Xu, Nathan Myhrvold, Dhileep Sivam, Kaspar Mueller, Daniel J. Olsen, Bainan Xia, Daniel Livengood, Victoria Hunt, Benjamin Rouillé d'Orfeuil, Daniel Muldrew, Merrielle Ondreicka, Megan Bettilyon, "U.S. Test System with High Spatial and Temporal Resolution for Renewable Integration Studies," 2020 IEEE PES General Meeting, Montreal, Canada, 2020.
Dataset version history:
  • 0.1, January 31, 2020: initial data upload.
  • 0.2, March 10, 2020: addition of Tabular Data Package metadata; modifications to cost curves and transmission capacities aimed at more closely matching optimization results to historical data.
  • 0.2.1, March 25, 2020: corrected a bug in the wind profile generation process which was pulling the wrong locations for wind farms outside the Western Interconnection.
Data Types:
  • Dataset
  • File Set
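Since version 0.2 this dataset ships with Tabular Data Package metadata, so a local copy can be inspected with the frictionless Python library. A minimal sketch follows; the local path to datapackage.json is a placeholder assumption, not a path published with the dataset:

```python
# Sketch: inspecting a Tabular Data Package with the frictionless library.
# Assumes the dataset has been downloaded and its descriptor sits at
# ./usa-test-system/datapackage.json (a hypothetical local path).
from frictionless import Package

pkg = Package("usa-test-system/datapackage.json")

# List the tabular resources described by the metadata.
for name in pkg.resource_names:
    print(name)

# Read the first few rows of the first resource as dicts keyed by column name.
resource = pkg.resources[0]
for row in resource.read_rows()[:5]:
    print(row)
```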
Abstract: Planning for power systems with high penetrations of variable renewable energy requires higher spatial and temporal granularity. However, most publicly available test systems are of insufficient fidelity for developing methods and tools for high-resolution planning. This paper presents methods to construct open-access test systems of high spatial granularity to more accurately represent current infrastructure and high temporal granularity to represent variability of demand and renewable resources. To demonstrate, a high-resolution test system representing the United States is created using only publicly available data. This test system is validated by running it in a production cost model, with results validated against historical generation to ensure that they are representative. The resulting open source test system can support power system transition planning and aid in development of tools to answer questions around how best to reach decarbonization goals, using the most effective combinations of transmission expansion, renewable generation, and energy storage.
Documentation of dataset development: A paper describing the process of developing the dataset is available at https://arxiv.org/abs/2002.06155. Please cite as: Y. Xu, Nathan Myhrvold, Dhileep Sivam, Kaspar Mueller, Daniel J. Olsen, Bainan Xia, Daniel Livengood, Victoria Hunt, Benjamin Rouillé d'Orfeuil, Daniel Muldrew, Merrielle Ondreicka, Megan Bettilyon, "U.S. Test System with High Spatial and Temporal Resolution for Renewable Integration Studies," 2020 IEEE PES General Meeting, Montreal, Canada, 2020.
Dataset version history:
  • 0.1, January 31, 2020: initial data upload.
  • 0.2, March 10, 2020: addition of Tabular Data Package metadata; modifications to cost curves and transmission capacities aimed at more closely matching optimization results to historical data.
  • 0.2.1, March 25, 2020: [erroneous upload]
  • 0.2.2, March 26, 2020: [erroneous upload]
Data Types:
  • Dataset
  • File Set
Planning for power systems with high penetrations of variable renewable energy requires higher spatial and temporal granularity. However, most publicly available test systems are of insufficient fidelity for developing methods and tools for high-resolution planning. This paper presents methods to construct open-access test systems of high spatial granularity to more accurately represent current infrastructure and high temporal granularity to represent variability of demand and renewable resources. To demonstrate, a high-resolution test system representing the United States is created using only publicly available data. This test system is validated by running it in a production cost model, with results validated against historical generation to ensure that they are representative. The resulting open source test system can support power system transition planning and aid in development of tools to answer questions around how best to reach decarbonization goals, using the most effective combinations of transmission expansion, renewable generation, and energy storage.
A paper describing the process of developing the dataset is available at https://arxiv.org/abs/2002.06155.
Version history:
  • 0.1, January 31, 2020: initial data upload.
  • 0.2, March 10, 2020: addition of Tabular Data Package metadata; modifications to cost curves and transmission capacities aimed at more closely matching optimization results to historical data.
Data Types:
  • Dataset
  • File Set
Abstract
Motivation: Antibodies are widely used experimental reagents to test for the expression of proteins. However, they do not always perform as intended, because they may not bind specifically to the target proteins their providers designed them for, leading to unreliable and irreproducible research results. While many proposals have been developed to address the problem of antibody specificity, they may not scale to the millions of antibodies that have ever been designed and used in research. In this study, we investigate the feasibility of automatically extracting statements about antibody specificity reported in the literature by text mining, and of generating reports to alert scientists to problematic antibodies.
Results: We developed a deep neural network system called Antibody Watch and tested its performance on a corpus of more than two thousand articles that report uses of antibodies. We leveraged Research Resource Identifiers (RRIDs) to precisely identify antibodies mentioned in an input article, used the BERT language model to classify whether an antibody is reported as nonspecific, and thus problematic, and inferred coreference to link statements of specificity to the antibodies they refer to. Our evaluation shows that Antibody Watch can perform both classification and linking accurately, with F-scores over 0.8, given only thousands of annotated training examples. The result suggests that with more training, Antibody Watch will provide useful reports about antibody specificity to scientists.
Data Types:
  • Dataset
  • File Set
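As an illustrative sketch of the classification step described above (not the Antibody Watch code itself), a BERT model from the Hugging Face transformers library can be set up for binary specificity classification. The checkpoint name, label scheme, and example sentence are all assumptions:

```python
# Sketch: classifying a specificity statement with a BERT model via the
# Hugging Face transformers library. Not the Antibody Watch implementation;
# "bert-base-uncased" and the two-label scheme are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # assumed: 0 = specific, 1 = nonspecific
)

sentence = "The antibody (RRID:AB_000000) showed nonspecific staining in knockout tissue."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# An untrained classification head yields arbitrary scores; after fine-tuning
# on annotated examples, argmax over the logits gives the predicted label.
print(logits.softmax(dim=-1))
```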
Compressed FASTQ files of raw sequences from clinical Escherichia coli isolates collected in Toronto, Canada, in 2018 (Dataset 2). Sequencing was performed on the Illumina NextSeq platform; details are outlined in the associated publication.
Data Types:
  • Document
  • File Set
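A minimal sketch of reading such compressed FASTQ files with Biopython follows; the filename is a placeholder, not one from this dataset:

```python
# Sketch: streaming records from a gzip-compressed FASTQ file with Biopython.
# "isolate_01.fastq.gz" is a placeholder filename, not one from this dataset.
import gzip
from Bio import SeqIO

count = 0
with gzip.open("isolate_01.fastq.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fastq"):
        count += 1  # each record carries the read sequence and quality scores
print(f"{count} reads")
```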
RDF dump of Wikidata produced with wdumps. Entity count: 0; statement count: 0; triple count: 0.
Data Types:
  • Software/Code
  • Dataset
  • File Set
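A small sketch of loading an RDF dump like this with the rdflib Python library; the filename and N-Triples serialization are assumptions about the export format:

```python
# Sketch: loading an RDF dump into an in-memory graph with rdflib.
# "dump.nt" and the N-Triples format are assumptions about this dump;
# a full-size Wikidata dump would need a streaming parser or a triple store.
from rdflib import Graph

g = Graph()
g.parse("dump.nt", format="nt")

print(len(g), "triples")
for subj, pred, obj in list(g)[:5]:
    print(subj, pred, obj)
```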
Created for Zenodo citation
Data Types:
  • Software/Code
  • File Set
Untargeted metabolomics using liquid chromatography–mass spectrometry (LC–MS) is currently the gold-standard technique to determine the full chemical diversity in biological samples. This approach still has many limitations, however; notably, the difficulty of accurately estimating the number of unique metabolites being profiled among the thousands of MS ion signals arising from chromatograms. Here, we describe a new workflow, MS-CleanR, based on the MS-DIAL–MS-FINDER suite, which tackles this problem of 'feature degeneracy' and improves annotation rates. We show that implementation of MS-CleanR reduces the number of signals by nearly 80% while retaining 95% of unique metabolite features. Moreover, the annotation results from MS-FINDER can be ranked with respect to the database chosen by the user, which improves identification accuracy. Application of MS-CleanR to the analysis of Arabidopsis thaliana grown in three different conditions improved class separation resulting from multivariate data analysis and led to annotation of 75% of the final features. The full workflow was applied to metabolomic profiles from three strains of the leguminous plant Medicago truncatula that have different susceptibilities to the oomycete pathogen Aphanomyces euteiches; a group of glycosylated triterpenoids overrepresented in resistant lines was identified as candidate compounds conferring pathogen resistance. MS-CleanR is implemented through a Shiny interface for intuitive use by end-users (available at: https://github.com/eMetaboHUB/mscleanr).
Data Types:
  • Document
  • File Set
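The 'feature degeneracy' problem arises because a single metabolite yields many redundant MS signals (isotopes, adducts, in-source fragments) that co-elute and correlate across samples. The sketch below illustrates the general idea of collapsing such redundant features by retention-time proximity and intensity correlation, keeping the most intense feature per group. This is a simplified illustration, not MS-CleanR's actual algorithm (which also uses adduct and isotope relationships from MS-DIAL), and the tolerance values are arbitrary assumptions:

```python
# Sketch: collapsing degenerate LC-MS features by retention-time proximity
# and cross-sample intensity correlation. Illustrative only; this is not
# MS-CleanR's algorithm, and the thresholds below are arbitrary assumptions.
import numpy as np

def collapse_features(rt, intensities, rt_tol=0.05, corr_min=0.9):
    """Keep one representative per group of degenerate features.

    rt          : (n_features,) retention times in minutes.
    intensities : (n_features, n_samples) intensity matrix.
    Returns indices of the retained (most intense) features.
    """
    order = np.argsort(-intensities.sum(axis=1))  # most intense first
    kept = []
    for i in order:
        redundant = False
        for k in kept:
            close = abs(rt[i] - rt[k]) <= rt_tol
            corr = np.corrcoef(intensities[i], intensities[k])[0, 1]
            if close and corr >= corr_min:
                redundant = True  # co-elutes and co-varies with a kept feature
                break
        if not redundant:
            kept.append(i)
    return np.array(kept)

# Tiny synthetic demo: two correlated co-eluting features collapse to one.
rng = np.random.default_rng(0)
base = rng.normal(size=10)
X = np.vstack([base * 3, base * 2 + 0.01, rng.normal(size=10)])
print(collapse_features(np.array([5.00, 5.02, 8.40]), X))
```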
The repository stores the full analysis pipeline and results for the bioRxiv preprint at https://doi.org/10.1101/573782
Abstract
Background: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses.
Results: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals, such as pathway activity, are best identified in models trained with more latent dimensionalities.
Conclusions: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
Data Types:
  • Software/Code
  • File Set
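The abstract's central point, that no single latent dimensionality is best, can be illustrated by sweeping the number of components for a compression model and comparing reconstruction quality. The sketch below uses scikit-learn's PCA as a stand-in for the denoising and variational autoencoders studied in the paper; the synthetic data shape and the grid of dimensionalities are assumptions:

```python
# Sketch: sweeping latent dimensionality for a compression model on an
# expression-like matrix. PCA stands in for the paper's autoencoders; the
# random data and the dimensionality grid are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))  # samples x genes (synthetic placeholder)

for k in (2, 8, 32, 128):  # latent space dimensionalities to compare
    pca = PCA(n_components=k).fit(X)
    Z = pca.transform(X)              # compressed features
    X_hat = pca.inverse_transform(Z)  # reconstruction from the latent space
    err = np.mean((X - X_hat) ** 2)
    print(f"k={k:4d}  reconstruction MSE={err:.4f}")
```

On real expression data, downstream analyses (e.g., pathway enrichment against each latent dimension) would then reveal which signals each dimensionality captures best, mirroring the comparison the abstract describes.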