
Information Processing and Management

ISSN: 0306-4573


Datasets associated with articles published in Information Processing and Management

13 results
  • Data for: Deriving the Sentiment Polarity of Term Senses using Unsupervised Context-Aware Gloss Expansion
    Sentiment lexicon generated using Unsupervised Context-Aware Gloss Expansion.
    • Dataset
  • Data for: Beyond MeSH: Fine-Grained Semantic Indexing of Biomedical Literature based on Weak Supervision
    Data for training and testing classification models for fine-grained semantic indexing for Alzheimer's Disease (AD) and Duchenne Muscular Dystrophy (DMD). The Data.zip file includes the initial CSV files to develop and assess predictive models for both use cases. The source code of the proposed BeyondMeSH method that uses these CSV files for model development and assessment is available here: https://github.com/tasosnent/BeyondMeSH
    • Dataset
  • Research data supporting "Machine learning in the processing of historical census data"
    This collection contains ground-truth (gold-standard) datasets for the employment-status reconstruction problem in historical census data. Different machine learning methods can be tested and compared with these datasets, as described in the paper "Machine learning in the processing of historical census data" by Montebruno, P., Bennett, R., Smith, H., and van Lieshout, C., an outcome of ESRC project ES/M010953 (Drivers of Entrepreneurship and Small Businesses), led by PI Prof. Robert J. Bennett. The material consists of three raw text files (files 1 and 2 are random samples). No census individual-identification variable (RecID) is given, so the datasets are fully anonymised and it is not possible to track individuals across the files. The variable descriptors are as follows:
    1. "1891 1000 Ent". 1891 Census of England and Wales, economically active individuals: 1,000 labelled entrepreneurs (500 labelled employers and 500 labelled own-account business proprietors) and 1,000 labelled workers. Labelling derives from the employment status reported on the night of the Census for the later 1891–1911 censuses, using the crosses reported in the columns of the 1891 Census Enumerators' Books (CEBs).
    2. "1851 1000 Ent". 1851 Census of England and Wales, economically active individuals: 1,000 labelled entrepreneurs (500 labelled employers and 500 labelled own-account proprietors) and 1,000 labelled workers. Labelling by clerical control of the occupational strings for the extracted groups of business proprietors in the 1851 Census.
    3. "1851 MAX(Extracted)". 1851 Census of England and Wales, economically active individuals: 70,872 labelled entrepreneurs (35,436 labelled employers and 35,436 labelled own-account proprietors) and 70,872 labelled workers. This is the maximum possible balanced dataset, drawn from all employers and own-account proprietors identified by the extracted Groups (1 for employers; 3 and 5 for own account). Labelling by clerical control of the occupation strings for the extracted Groups of the 1851 Census.
    The key variable OccString, containing the full occupation strings, is also included. A detailed explanation of how these datasets were obtained and how to use them for machine learning reconstruction of employment status in historical census data can be found in "Machine learning in the processing of historical census data" by Montebruno, P., Bennett, R., Smith, H., and van Lieshout, C. (2020), Information Processing & Management. This dataset should be cited as: Montebruno, Piero; Bennett, Robert J.; Smith, Harry J.; van Lieshout, Carry (2020), "Research data supporting "Machine learning in the processing of historical census data"", Mendeley Data, http://dx.doi.org/10.17632/p4zptr98dh.1
    • Dataset
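The balanced construction of the third file (equal numbers of employers, own-account proprietors, and workers) can be sketched as follows; the record layout, label names, and the `balanced_sample` helper are hypothetical illustrations, not the authors' actual extraction procedure:

```python
import random

# Hypothetical records: (occupation_string, label) pairs, where label is
# "employer", "own_account", or "worker". These names are illustrative
# and do not correspond to the dataset's actual columns.
def balanced_sample(records, n_entrepreneurs, seed=0):
    """Draw a balanced sample: n_entrepreneurs entrepreneurs (split evenly
    between employers and own-account proprietors) plus an equal number
    of workers."""
    rng = random.Random(seed)
    employers = [r for r in records if r[1] == "employer"]
    own_account = [r for r in records if r[1] == "own_account"]
    workers = [r for r in records if r[1] == "worker"]
    half = n_entrepreneurs // 2
    return (rng.sample(employers, half)
            + rng.sample(own_account, half)
            + rng.sample(workers, n_entrepreneurs))
```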
  • Data for: Towards a Model for Spoken Conversational Search
    Conversation is the natural mode of information exchange in daily life, so spoken conversational interaction for search input and output is a logical format for information seeking. However, the conceptualisation of user–system interactions and information exchange in spoken conversational search (SCS) has not been explored. The first step in conceptualising SCS is to understand the conversational moves used in an audio-only communication channel for search. This paper explores conversational actions for the task of search. We define a qualitative methodology for creating conversational datasets, propose analysis protocols, and develop the SCSdata. Furthermore, we use the SCSdata to create the first annotation schema for SCS, the SCoSAS, enabling us to investigate interactivity in SCS. We further establish that SCS needs to incorporate interactivity and pro-activity to overcome the complexity that the information-seeking process poses in an audio-only channel. In summary, this exploratory study unpacks the breadth of SCS. Our results highlight the need for integrating discourse in future SCS models and contribute to the advancement of the formalisation of SCS models and the design of SCS systems.
    • Dataset
  • FISETIO: A FIne-grained, Structured and Enriched Tourism Dataset for Indoor and Outdoor attractions
    This data-in-brief paper introduces our publicly available datasets in the area of tourism demand prediction for future experiments and comparisons. Most previous work on tourism demand forecasting is based on coarse-grained analysis (at the level of countries or regions), and very few works and datasets are available for fine-grained tourism analysis (at the level of attractions and points of interest). In this article, we present our fine-grained datasets for two types of attractions: (I) indoor attractions (27 museums and galleries in the U.K.) and (II) outdoor attractions (76 U.S. National Parks), each enriched with official numbers of visits, social media reviews, and environmental data. In addition, the complete analysis of prediction results, the methodology and exploited models, the features' performance analysis, anomalies, etc., are available in our original paper.
    • Dataset
  • Data for: An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit
    Topic labelled online social network (OSN) data sets are useful to evaluate topic modelling and document clustering tasks. We provide three data sets with topic labels from two online social networks: Twitter and Reddit. To comply with Twitter’s terms and conditions, we only publish the tweet identifiers along with the topic label. The Reddit data is supplied with the full text and the topic label. The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, used to tag political discussion tweets in Australia. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to 5 subreddit pages, which are used as topic labels.
    • Dataset
  • Data for: HClaimE: A Tool to Identify Health Claims in Health News Headlines
    This data set contains 564 health research news headlines with manual annotations of the health claims in the headlines and metadata such as publication dates and sources. The headlines were selected from news articles published on ScienceDaily.com from January 2016 to June 2017, including 212 headlines on breast cancer and 352 on diabetes. The news articles came from 286 different sources, such as Scripps Research Institute. A health claim is defined as a triple construct (a triplet); it is made up of an independent variable (IV – namely, what is being manipulated), a dependent variable (DV – namely, what is being measured), and the relation between the two. Among the 564 headlines, 416 contain health claims, while the other 148 headlines do not.
    • Dataset
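The triple structure of a health claim described above can be sketched as a small data type; the field names and the example claim below are illustrative assumptions, not the dataset's actual annotation format:

```python
from dataclasses import dataclass

# A health claim as a triple: an independent variable (IV), a dependent
# variable (DV), and the relation between them.
@dataclass
class HealthClaim:
    iv: str        # what is being manipulated
    dv: str        # what is being measured
    relation: str  # e.g. "increases", "reduces", "is associated with"

# Hypothetical example headline claim:
claim = HealthClaim(iv="daily exercise", dv="diabetes risk", relation="reduces")
```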
  • Data for: Studies on a Multidimensional Public Opinion Network Model and Its Topic Detection Algorithm
    Because data on microblogs (http://weibo.com/) are relatively complete and accessible, the case study for the algorithm in the present paper examines public opinion on microblogs concerning the accident. Altogether, 11,600 original posts concerning the "Explosion in Tianjin Port on August 12, 2015" accident were collected. Of these, 7,276 original posts released by individual micro-bloggers, together with the related "following" relationships, are included; posts released by official accounts were excluded. All of the data are in Chinese. The data are available to readers or reviewers on request.
    • Dataset
  • Data for: CLAIRE: A combinatorial visual analytics system for information retrieval evaluation
    We considered the following standard and shared collections, each track using 50 different topics:
      • TREC Adhoc tracks T07 and T08: focus on a news search task and adopt a corpus of about 528K news documents.
      • TREC Web tracks T09 and T10: focus on a Web search task and adopt a corpus of 1.7M Web pages.
      • TREC Terabyte tracks T14 and T15: focus on a Web search task and adopt a corpus of 125M Web pages.
    We considered three main components of an IR system: stop list, stemmer, and IR model. We selected a set of alternative implementations of each component and, using the Terrier v.4.02 open-source system, created a run for each system defined by combining the available components in all possible ways. The selected components are:
      • Stop list: nostop, indri, lucene, snowball, smart, terrier;
      • Stemmer: nolug, weakPorter, porter, snowballPorter, krovetz, lovins;
      • Model: bb2, bm25, dfiz, dfree, dirichletlm, dlh, dph, hiemstralm, ifb2, inb2, inl2, inexpb2, jskls, lemurtfidf, lgd, pl2, tfidf.
    Overall, these components define a 6 × 6 × 17 factorial design with a GoP consisting of 612 system runs. They represent nearly all the state-of-the-art components that constitute the common denominator almost always present in any IR system for English retrieval, and thus they are a good account of what can be found in many different operational settings.
    • Dataset
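The factorial design described above can be reproduced with a quick sketch; the component names are taken verbatim from the description, while the enumeration itself is only an illustration of how the 612-run grid of points (GoP) arises:

```python
from itertools import product

# Component pools as listed in the dataset description.
stop_lists = ["nostop", "indri", "lucene", "snowball", "smart", "terrier"]
stemmers = ["nolug", "weakPorter", "porter", "snowballPorter", "krovetz", "lovins"]
models = ["bb2", "bm25", "dfiz", "dfree", "dirichletlm", "dlh", "dph",
          "hiemstralm", "ifb2", "inb2", "inl2", "inexpb2", "jskls",
          "lemurtfidf", "lgd", "pl2", "tfidf"]

# One system run per (stop list, stemmer, model) combination.
runs = list(product(stop_lists, stemmers, models))
print(len(runs))  # 6 * 6 * 17 = 612
```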