DeepSeek-QueryBench: A Dataset for Evaluating the Performance and Stability of LLM-Generated Boolean Queries

Name: DeepSeek-QueryBench: A Dataset for Evaluating the Performance and Stability of LLM-Generated Boolean Queries
Creator: Weizhi Yang
Published: 2025-11-24T13:09:39.802Z
Keywords: Artificial Intelligence, Information Retrieval, Educational Technology, Digital Library, Natural Language Processing, Interdisciplinary Education, Scholarly Communication, Bibliometrics, Human-Computer Integration, Methods Evaluation, Science, Technology, Engineering and Mathematics, AI-Human Interaction

Yang, Weizhi; Precharattana, Monamorn; Yang, YuHang

doi:10.17632/7rrvctn3pj.1

Published: 24 November 2025 | Version 1 | DOI: 10.17632/7rrvctn3pj.1

Contributors:

Weizhi Yang, Monamorn Precharattana, YuHang Yang

Description

The "DeepSeek-QueryBench" dataset provides the first comprehensive empirical data for evaluating the performance and stability of open-source Large Language Models (LLMs) in Boolean query generation for scholarly search. This dataset captures the complete workflow from query generation to retrieval evaluation, specifically designed to assess LLM capabilities under novice user conditions.

Core Components:

1. Original Model Outputs: Complete interaction records from DeepSeek-V3.1-Terminus across four operational modes (Default, Deep Thinking, Web Search, and their combination), with three independent generations per mode using a fixed simple Chinese prompt.
2. Generated Boolean Queries: Both original and syntactically corrected versions of 12 distinct Boolean queries targeting the interdisciplinary topic "3D printing in STEM education," formatted for Web of Science execution.
3. Retrieval Results: Complete bibliographic records (title, abstract, keywords, publication details) for all documents retrieved by each query execution in Web of Science Core Collection (2022-2024, article type), totaling 1,615 documents before deduplication.
4. Gold Standard Collection: A rigorously constructed benchmark of 172 relevant publications on "3D printing in STEM education," developed through baseline keyword retrieval and exhaustive forward/backward snowballing until saturation.
5. Performance Metrics: Comprehensive evaluation data including standard information retrieval metrics (Precision, Recall, F1-score, F3-score) and novel stability measures (Coefficient of Variation, Jaccard Similarity, Integration Change Rate) for each query and operational mode.
6. Analysis Materials: Supporting data for in-depth analysis including keyword frequency distributions, query structure categorization, semantic error patterns, and complementarity analysis between different query generations.

Unique Value Proposition: This dataset addresses critical gaps in current LLM evaluation by focusing on:

• Stability and reproducibility rather than just peak performance
• Novice user scenarios with simple prompts and default configurations
• Open-source model capabilities beyond the dominant GPT ecosystem
• Real-world applicability through rigorous gold standard validation

The dataset supports research in AI-assisted information retrieval, evidence synthesis automation, LLM reliability assessment, and human-AI collaboration in scholarly search.
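The standard retrieval metrics named under Performance Metrics can be illustrated with a minimal sketch. It assumes each document is represented by a unique identifier string (for example a Web of Science accession number) and uses the textbook definitions of Precision, Recall, and F-beta; the function names and the toy identifiers are hypothetical and are not taken from the dataset files.

```python
from __future__ import annotations


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Textbook F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def evaluate(retrieved: set[str], gold: set[str]) -> dict[str, float]:
    """Score one query's result set against the gold standard set."""
    hits = len(retrieved & gold)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f_beta(precision, recall, beta=1.0),
        "f3": f_beta(precision, recall, beta=3.0),
    }


# Toy identifiers (hypothetical), not records from the dataset.
gold_standard = {"A", "B", "C", "D"}
query_results = {"A", "B", "X"}
scores = evaluate(query_results, gold_standard)
```

With beta = 3 the score leans heavily toward recall, which is the usual reason for reporting an F3-score alongside F1 in evidence-synthesis style searches, where missing a relevant paper costs more than screening an irrelevant one.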

Steps to reproduce

1. Experimental Setup
• Utilized DeepSeek-V3.1-Terminus via official web platform (September 2025)
• Tested four operational modes: Default, Deep Thinking, Web Search, and their combination
• Employed default parameters across all modes to simulate novice usage
• Conducted each prompt execution in new, independent browser sessions

2. Query Generation
• Applied single Chinese prompt requesting Boolean query for "3D printing in STEM education" in Web of Science
• Executed three independent generations per operational mode
• Recorded complete model responses including queries and explanatory text
• Applied minor syntactic corrections to ensure query executability

3. Search Execution
• Executed all corrected queries in Web of Science Core Collection
• Applied consistent filters: articles published 2022-2024
• Downloaded complete bibliographic records for all retrieved documents

4. Gold Standard Construction
• Combined and deduplicated records from all queries (1,616 unique documents)
• Conducted two-stage screening: title/abstract followed by full-text review
• Performed iterative backward and forward snowballing until saturation
• Established final gold standard of 172 relevant publications

5. Performance Evaluation
• Calculated standard metrics: Precision, Recall, F1-score, F3-score
• Assessed stability using Coefficient of Variation and Jaccard Similarity
• Computed Integration Change Rate for combined query results
• Analyzed performance patterns across operational modes and generations

Required Resources: DeepSeek platform access, Web of Science subscription, standard data analysis tools.

All data collection completed within a single day to ensure consistency.
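The pooling and stability steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: records are assumed to be identified by unique ID strings, the Coefficient of Variation is assumed to use the sample standard deviation (n - 1), and stability within a mode is summarized as the mean pairwise Jaccard similarity of its three result sets. The Integration Change Rate is not defined in this record, so it is not sketched here. All function names and toy values are hypothetical.

```python
from __future__ import annotations

from itertools import combinations
from statistics import mean, stdev


def pool_unique(result_sets: list[set[str]]) -> set[str]:
    """Combine the result sets of several queries and drop duplicates."""
    return set().union(*result_sets)


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation divided by the mean (assumed definition)."""
    m = mean(values)
    if m == 0:
        return float("nan")  # CV is undefined when the mean is zero
    return stdev(values) / m


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union of two result sets."""
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(union)


def mean_pairwise_jaccard(result_sets: list[set[str]]) -> float:
    """Average Jaccard similarity over all pairs of generations."""
    pairs = list(combinations(result_sets, 2))
    return mean([jaccard(a, b) for a, b in pairs])


# Toy values (hypothetical): three generations within one operational mode.
recalls = [0.2, 0.4, 0.6]
runs = [{"A", "B"}, {"B", "C"}, {"A", "B"}]

cv_recall = coefficient_of_variation(recalls)
overlap = mean_pairwise_jaccard(runs)
pooled = pool_unique(runs)
```

A low Coefficient of Variation together with a high mean Jaccard similarity would indicate that repeated generations in a mode both score similarly and retrieve largely the same documents; the two measures can diverge, since two queries may reach the same recall through different result sets.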

Institutions

Jiaying University, Mahidol University Institute for Innovative Learning, Guangzhou City Polytechnic

Funders

Mahidol University
Thailand
Ministry of Education of the People's Republic of China
State Council of the People's Republic of China
China
Grant ID: 2024MZ046
Department of Education of Guangdong Province
China
Grant ID: GDJ20240012
Jiaying University
China
Grant ID: JCJY20241004
