Data for: The textual similarity of news content and stock return synchronicity
Description
This dataset accompanies the study “The textual similarity of news content and stock return synchronicity”, which investigates how the homogeneity of news narratives across firms relates to the synchronicity of their stock returns. We hypothesize that higher textual similarity in firm-specific news leads to greater stock return synchronicity, because more uniform information reduces firm-specific variation in investor beliefs and trading behavior.

The data include firm-level measures of news textual similarity and stock return synchronicity for publicly listed firms over the period 2013 to 2022. Textual similarity is computed as the cosine similarity between TF-IDF representations of firm-specific news articles collected from reputable financial news sources. We preprocess the news content by removing stop words and applying standard tokenization and lemmatization. News articles are grouped by firm and time period, and similarity is measured against a rolling market-wide benchmark.

The original news text, stock trading data, and accounting data are sourced from the China Stock Market and Accounting Research (CSMAR) database; the textual tone of the MD&A is sourced from the Chinese Research Data Services Platform (CNRDS). The news sources include both traditional print media and internet media. We exclude (i) financial firms, (ii) firms listed for less than one year, and (iii) firms with missing values for control variables. After filtering, the final sample comprises 82,215 observations covering 4,102 firms.

Stock return synchronicity is quantified using the R² statistic from a market model regression, following established literature: a higher R² indicates stronger co-movement with the market and weaker firm-specific return variation.
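The R² measure described above can be illustrated with a minimal sketch. It assumes a single-factor market model with an intercept, estimated by ordinary least squares over one window of returns. The function name, the window length, and the toy return series are hypothetical and are not taken from the dataset; the study's exact specification (return frequency, additional regressors, any transformation of R²) is not stated here and may differ.

```python
def market_model_r_squared(firm_returns, market_returns):
    """R² from an OLS regression of firm returns on market returns (with intercept)."""
    n = len(firm_returns)
    if n != len(market_returns) or n < 3:
        raise ValueError("need two equal-length series with at least 3 observations")

    mean_y = sum(firm_returns) / n
    mean_x = sum(market_returns) / n

    # Centered sums of squares and cross-products.
    s_xx = sum((x - mean_x) ** 2 for x in market_returns)
    s_yy = sum((y - mean_y) ** 2 for y in firm_returns)
    s_xy = sum((x - mean_x) * (y - mean_y)
               for x, y in zip(market_returns, firm_returns))
    if s_xx == 0.0 or s_yy == 0.0:
        raise ValueError("returns must vary over the window")

    # OLS slope and intercept of the market model.
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x

    # R² = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (alpha + beta * x)) ** 2
                 for x, y in zip(market_returns, firm_returns))
    return 1.0 - ss_res / s_yy


# Toy illustration with made-up returns (not values from the dataset).
market = [0.010, -0.020, 0.015, 0.003, -0.007, 0.020]
tracker = [0.001 + 1.1 * r for r in market]             # moves one-for-one with the market
idiosyncratic = [0.020, 0.010, -0.030, 0.015, -0.005, 0.001]

r2_tracker = market_model_r_squared(tracker, market)
r2_idiosyncratic = market_model_r_squared(idiosyncratic, market)
```

A firm whose returns are an exact linear function of the market return has an R² of essentially one, while a firm driven mostly by its own news has a lower value, which is the sense in which R² captures synchronicity.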
Our data show a robust positive correlation between news similarity and stock return synchronicity, even after controlling for firm fundamentals, media coverage volume, and other confounding factors. This finding suggests that uniform media narratives can reduce the information diversity available to investors, contributing to higher return co-movement.

This dataset includes:

- ASVImonthly.dta
- base_data.dta
- BellWether_Newsprop.dta
- DisAcc.dta
- isAnnoym.dta
- NewsNumlarge8ym.dta
- numAholder_yq.dta
- ReportSim_ym.dta
- Rmkt.dta
- sigma_mkt.dta
- Stkcd_ym_NewsTone.dta
- Topic_wordscomovement.dta
- yearMDATone.dta
- ymChinaNewsBasedEPU.dta
- ymCICSI.dta

The data can be used to explore information diffusion, media effects in financial markets, and the mechanisms behind co-movement in asset prices. Researchers replicating or extending this work can match the firm identifiers and timestamps with other financial databases such as CSMAR or CNRDS.
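Matching on firm identifiers and timestamps amounts to a keyed join. The sketch below shows one way to do this in plain Python. The key names `Stkcd` and `ym` are assumptions inferred from the file names, and the `sync` and `tone` columns and all values are hypothetical placeholders; check the actual variable names in the .dta files before adapting it.

```python
def left_join(base_rows, extra_rows, keys=("Stkcd", "ym")):
    """Left-join extra_rows onto base_rows using the given key columns.

    Rows are plain dicts. Unmatched base rows are kept unchanged, and base
    values take precedence when both sides share a non-key column.
    """
    lookup = {tuple(row[k] for k in keys): row for row in extra_rows}
    merged = []
    for row in base_rows:
        match = lookup.get(tuple(row[k] for k in keys), {})
        merged.append({**match, **row})
    return merged


# Hypothetical firm-month records; stock codes are kept as strings so that
# leading zeros survive the match.
base = [
    {"Stkcd": "000001", "ym": "2015-03", "sync": 0.42},
    {"Stkcd": "000002", "ym": "2015-03", "sync": 0.17},
]
tone = [
    {"Stkcd": "000001", "ym": "2015-03", "tone": 0.05},
]

merged = left_join(base, tone)
```

A left join keeps every firm-month in the base panel, so observations without a match in the second source remain visible as rows with missing columns rather than being dropped silently.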
Files
Steps to reproduce
This dataset was constructed to examine the relationship between the textual similarity of firm-specific news content and stock return synchronicity. The following steps outline how the data were gathered, processed, and analyzed to support the research:

1. Data Collection
We began with a base panel dataset of publicly listed firms, containing firm-level financial variables, stock return data, and firm-specific news articles. The financial and stock return data were sourced from standard databases such as CSMAR or Wind, covering the period from 2010 to 2022. News articles were collected from reputable financial news outlets and matched to individual firms based on company identifiers and publication dates.

2. Textual Similarity Construction
The news articles were pre-processed using natural language processing techniques, including stop-word removal, tokenization, and lemmatization. Each article was converted into a numerical representation using the term frequency–inverse document frequency (TF-IDF) method. For each firm and time window (e.g., monthly or quarterly), we calculated the average cosine similarity between that firm’s news and the aggregate market-level news, producing a measure of textual similarity.

3. Stock Return Synchronicity Measurement
We computed stock return synchronicity using the R² statistic from a regression of firm returns on market and industry returns, following established methods in the literature. A higher R² indicates that a firm’s returns are more synchronized with broader market movements and less driven by firm-specific information.

4. Data Cleaning and Sample Filtering
To ensure data quality, we excluded observations with missing values in key variables. We also applied winsorization at the 1st and 99th percentiles to mitigate the influence of extreme values.

5. Descriptive and Statistical Analysis
We first conducted univariate analyses by dividing firms into groups based on their news similarity scores and comparing the average return synchronicity across these groups. This was followed by regression and correlation analysis to test the robustness of our results and assess the influence of control variables such as firm size, leverage, analyst coverage, and volatility.

6. Reproducibility Environment
All data processing and statistical analysis were performed using Stata, along with several standard and community-contributed packages for data manipulation, descriptive statistics, and regression analysis.

This workflow provides a transparent basis for interpreting the dataset. Researchers can reproduce the results by obtaining similar data sources, applying the described text analysis and financial econometrics methods, and following a comparable cleaning and analysis protocol.
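The textual similarity construction in step 2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: articles are assumed to be already tokenized, the market-level benchmark is assumed to be the pooled text of all articles in the window (the rolling benchmark is not reproduced), and the smoothed IDF formula is one common variant chosen for the example. All function names and the toy articles are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def firm_similarity(articles_by_firm):
    """Average cosine similarity of each firm's articles to the pooled market text."""
    firms, docs = [], []
    for firm, articles in articles_by_firm.items():
        for tokens in articles:
            firms.append(firm)
            docs.append(tokens)
    # Assumption: the market benchmark is the concatenation of all articles.
    market_doc = [token for doc in docs for token in doc]
    vectors = tfidf_vectors(docs + [market_doc])
    market_vector = vectors[-1]
    scores = {}
    for firm, vector in zip(firms, vectors[:-1]):
        scores.setdefault(firm, []).append(cosine_similarity(vector, market_vector))
    return {firm: sum(values) / len(values) for firm, values in scores.items()}


# Toy window with made-up, already tokenized articles.
articles = {
    "000001": [["profit", "growth", "market"]],
    "000002": [["profit", "growth", "market"]],
    "000003": [["lawsuit", "recall"]],
}
similarity = firm_similarity(articles)
```

In this toy window the two firms whose coverage shares the dominant market vocabulary score higher than the firm with distinctive coverage, which is the ordering the similarity measure is meant to capture.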
Institutions
- Shanghai Jiao Tong University