Information Processing and Management

ISSN: 0306-4573
Datasets associated with articles published in Information Processing and Management
  • Conversation is the natural mode for information exchange in daily life, so a spoken conversational interaction for search input and output is a logical format for information seeking. However, the conceptualisation of user–system interactions or information exchange in spoken conversational search (SCS) has not been explored. The first step in conceptualising SCS is to understand the conversational moves used in an audio-only communication channel for search. This paper explores conversational actions for the task of search. We define a qualitative methodology for creating conversational datasets, propose analysis protocols, and develop the SCSdata. Furthermore, we use the SCSdata to create the first annotation schema for SCS: the SCoSAS, enabling us to investigate interactivity in SCS. We further establish that SCS needs to incorporate interactivity and pro-activity to overcome the complexity that the information seeking process in an audio-only channel poses. In summary, this exploratory study unpacks the breadth of SCS. Our results highlight the need for integrating discourse in future SCS models and contribute to the advancement of the formalisation of SCS models and the design of SCS systems.
    Data Types:
    • Tabular Data
    • Dataset
  • This data-in-brief paper introduces our publicly available datasets for tourism demand prediction, intended for future experiments and comparisons. Most previous work on tourism demand forecasting is based on coarse-grained analysis (at the level of countries or regions), and very few works and datasets are available for fine-grained tourism analysis (at the level of attractions and points of interest). In this article, we present our fine-grained datasets for two types of attractions: (I) indoor attractions (27 museums and galleries in the U.K.) and (II) outdoor attractions (76 U.S. national parks), each enriched with the official number of visits, social media reviews, and environmental data. In addition, the complete analysis of prediction results, the methodology and models used, feature performance analysis, anomalies, etc., are available in our original paper.
    Data Types:
    • Dataset
    • Document
    • File Set
  • Topic labelled online social network (OSN) data sets are useful to evaluate topic modelling and document clustering tasks. We provide three data sets with topic labels from two online social networks: Twitter and Reddit. To comply with Twitter’s terms and conditions, we only publish the tweet identifiers along with the topic label. The Reddit data is supplied with the full text and the topic label. The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, used to tag political discussion tweets in Australia. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to 5 subreddit pages, which are used as topic labels.
    Data Types:
    • Tabular Data
    • Dataset
  • This data set contains 564 health research news headlines with manual annotations of the health claims in the headlines and metadata such as publication dates and sources. The headlines were selected from news articles published on ScienceDaily.com from January 2016 to June 2017, including 212 headlines on breast cancer and 352 on diabetes. The news articles came from 286 different sources, such as Scripps Research Institute. A health claim is defined as a triple construct (a triplet); it is made up of an independent variable (IV – namely, what is being manipulated), a dependent variable (DV – namely, what is being measured), and the relation between the two. Among the 564 headlines, 416 contain health claims, while the other 148 headlines do not.
    Data Types:
    • Tabular Data
    • Dataset
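The triple structure of a health claim described above can be sketched as a small data type. This is a minimal illustration only: the class and field names are assumptions rather than the dataset's actual column names, and the example headline is hypothetical, not taken from the data. The counts are the ones stated in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthClaim:
    """A claim triple: IV, DV, and the relation between them (names assumed)."""
    independent_variable: str  # IV: what is being manipulated
    dependent_variable: str    # DV: what is being measured
    relation: str              # how the IV is said to affect the DV


@dataclass(frozen=True)
class Headline:
    """One annotated headline; claim is None when no health claim is present."""
    text: str
    topic: str
    claim: Optional[HealthClaim] = None

    @property
    def has_claim(self) -> bool:
        return self.claim is not None


# Counts stated in the dataset description.
HEADLINES_BY_TOPIC = {"breast cancer": 212, "diabetes": 352}
HEADLINES_BY_CLAIM = {"with claim": 416, "without claim": 148}
TOTAL_HEADLINES = 564

# Both breakdowns should add up to the stated total.
assert sum(HEADLINES_BY_TOPIC.values()) == TOTAL_HEADLINES
assert sum(HEADLINES_BY_CLAIM.values()) == TOTAL_HEADLINES

# Hypothetical headline, invented purely to illustrate the triple.
example = Headline(
    text="Hypothetical: Drug X lowers blood sugar in adults with diabetes",
    topic="diabetes",
    claim=HealthClaim("Drug X", "blood sugar", "lowers"),
)
```

Representing a headline without a claim as `claim=None` keeps the 416/148 split expressible through a single field.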
  • Because microblog data (http://weibo.com/) are complete and accessible, the case study for the algorithm in the present paper examines public opinion about an accident as expressed on microblogs. In total, 11,600 original posts concerning the “Explosion in Tianjin Port on August 12, 2015” accident were collected. After excluding the posts released by various authorities, the data set includes 7,276 original posts released by individual micro-bloggers, together with the related “following” relationships. All of the data are in Chinese. Readers or reviewers who need the data of this paper can request it from us at any time.
    Data Types:
    • Tabular Data
    • Dataset
  • We considered the following standard and shared collections, each track using 50 different topics:
      • TREC Adhoc tracks T07 and T08: focus on a news search task and adopt a corpus of about 528K news documents.
      • TREC Web tracks T09 and T10: focus on a Web search task and adopt a corpus of 1.7M Web pages.
      • TREC Terabyte tracks T14 and T15: focus on a Web search task and adopt a corpus of 125M Web pages.
    We considered three main components of an IR system: stop list, stemmer, and IR model. We selected a set of alternative implementations of each component and, by using the Terrier v.4.02 open source system, we created a run for each system defined by combining the available components in all possible ways. The selected components are:
      • Stop list: nostop, indri, lucene, snowball, smart, terrier;
      • Stemmer: nolug, weakPorter, porter, snowballPorter, krovetz, lovins;
      • Model: bb2, bm25, dfiz, dfree, dirichletlm, dlh, dph, hiemstralm, ifb2, inb2, inl2, inexpb2, jskls, lemurtfidf, lgd, pl2, tfidf.
    Overall, these components define a 6 × 6 × 17 factorial design with a GoP consisting of 612 system runs. They represent nearly all the state-of-the-art components which constitute the common denominator almost always present in any IR system for English retrieval and thus they are a good account of what can be found in many different operational settings.
    Data Types:
    • Tabular Data
    • Dataset
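The 6 × 6 × 17 factorial design described above can be enumerated as a Cartesian product of the three component lists. The sketch below only reproduces the combination count; it does not invoke Terrier, and the function name and dictionary keys are assumptions made for illustration.

```python
from itertools import product

# Component names as listed in the dataset description.
STOP_LISTS = ["nostop", "indri", "lucene", "snowball", "smart", "terrier"]
STEMMERS = ["nolug", "weakPorter", "porter", "snowballPorter", "krovetz", "lovins"]
MODELS = [
    "bb2", "bm25", "dfiz", "dfree", "dirichletlm", "dlh", "dph",
    "hiemstralm", "ifb2", "inb2", "inl2", "inexpb2", "jskls",
    "lemurtfidf", "lgd", "pl2", "tfidf",
]


def grid_of_points():
    """Return every (stop list, stemmer, model) combination as a dict.

    The function name and keys are hypothetical; each dict stands for
    one system configuration, i.e. one run per track.
    """
    return [
        {"stop_list": stop, "stemmer": stem, "model": model}
        for stop, stem, model in product(STOP_LISTS, STEMMERS, MODELS)
    ]


runs = grid_of_points()
print(len(runs))  # 6 * 6 * 17 = 612
```

Using `itertools.product` makes the full-factorial property explicit: every stop list is paired with every stemmer and every model exactly once.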
  • see http://tec.citius.usc.es/ir/code/pooling_bandits_ms.html
    Data Types:
    • Software/Code
    • Dataset
    • Document