Global Economic, Demographic, and Investment Indicators Dataset (2018-2020)
Description
This dataset represents a normalized relational database designed to analyze global economic, demographic, and investment indicators across multiple countries over a defined period. The primary objective of this dataset is to provide a structured and scalable data model that supports analytical queries and demonstrates best practices in database design and normalization.
The database is composed of four main entities: country, economic_indicators, demographic_indicators, and investment_indicators. The country table stores static information such as country name, ISO code, region, development status, and capital city. The other three tables store time-dependent data linked to each country through foreign key relationships, enabling the representation of longitudinal data across multiple years.
The economic_indicators table includes key financial metrics such as GDP, GDP per capita, inflation rate, unemployment rate, and industrial contribution to the economy. The demographic_indicators table focuses on population-related data, including total population, growth rate, life expectancy, median age, and fertility rate. The investment_indicators table captures government and private sector investment metrics, such as education expenditure, infrastructure investment, foreign direct investment, and construction investment.
This dataset was designed following normalization principles up to Third Normal Form (3NF), ensuring minimal redundancy, improved data consistency, and efficient storage. Each table represents a distinct domain, and all relationships are enforced through primary and foreign key constraints to maintain referential integrity.
Although the dataset uses synthetically generated data, its structure is inspired by real-world sources such as the World Bank and United Nations datasets. It is suitable for academic purposes, including data analysis, database management, and machine learning preprocessing.
The dataset enables users to explore relationships between economic performance, population dynamics, and investment strategies across different countries and years, making it a valuable resource for both educational and analytical applications.
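The four-table structure described above can be sketched in a few lines. This is a minimal illustration, not the dataset's actual DDL: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and every column name, data type, and the UNIQUE (country_id, year) constraint are assumptions, since the description names the attributes but not their identifiers.

```python
import sqlite3

# Hypothetical column names and types: the description lists the attributes
# of each table but not the exact identifiers used in the MySQL schema.
SCHEMA = """
CREATE TABLE country (
    country_id         TEXT PRIMARY KEY,  -- UUID stored as text
    name               TEXT NOT NULL,
    iso_code           TEXT NOT NULL UNIQUE,
    region             TEXT,
    development_status TEXT,
    capital_city       TEXT
);
CREATE TABLE economic_indicators (
    economic_id        TEXT PRIMARY KEY,
    country_id         TEXT NOT NULL REFERENCES country(country_id),
    year               INTEGER NOT NULL,
    gdp                REAL,
    gdp_per_capita     REAL,
    inflation_rate     REAL,
    unemployment_rate  REAL,
    industry_share     REAL,
    UNIQUE (country_id, year)  -- assumed: one row per country and year
);
CREATE TABLE demographic_indicators (
    demographic_id     TEXT PRIMARY KEY,
    country_id         TEXT NOT NULL REFERENCES country(country_id),
    year               INTEGER NOT NULL,
    total_population   INTEGER,
    growth_rate        REAL,
    life_expectancy    REAL,
    median_age         REAL,
    fertility_rate     REAL,
    UNIQUE (country_id, year)
);
CREATE TABLE investment_indicators (
    investment_id             TEXT PRIMARY KEY,
    country_id                TEXT NOT NULL REFERENCES country(country_id),
    year                      INTEGER NOT NULL,
    education_expenditure     REAL,
    infrastructure_investment REAL,
    foreign_direct_investment REAL,
    construction_investment   REAL,
    UNIQUE (country_id, year)
);
"""


def create_schema(conn):
    """Create the country table and the three indicator tables."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
# ['country', 'demographic_indicators', 'economic_indicators', 'investment_indicators']
```

Because each indicator table carries only its own key, the country reference, the year, and its own metrics, country-level attributes such as region are stored once in country and reached through a join, which is the redundancy reduction that 3NF is meant to provide.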
Steps to reproduce
This dataset was created through a structured workflow involving data modeling, normalization, and synthetic data generation to simulate real-world economic and demographic information. The process began with identifying the key domains required for analysis, including country-level metadata, economic indicators, demographic indicators, and investment-related metrics. These domains were selected based on commonly used variables found in global datasets such as those provided by the World Bank and United Nations.
The first step consisted of designing a conceptual data model using Entity-Relationship (ER) principles. The main entities defined were country, economic_indicators, demographic_indicators, and investment_indicators. Each entity was carefully structured to represent a single subject area, ensuring clear separation of concerns and avoiding redundancy. Relationships between entities were defined using one-to-many associations, where each country could have multiple records across different years in the indicator tables.
Next, the conceptual model was translated into a relational schema using MySQL Workbench. Tables were created with appropriate data types, primary keys (using UUID format), and foreign key constraints to enforce referential integrity. The schema was normalized up to Third Normal Form (3NF), ensuring that all non-key attributes depend only on the primary key and that no transitive dependencies exist. This normalization process improved data consistency and reduced redundancy across the database.
After defining the schema, synthetic data was generated to populate the database. A stored procedure was implemented in MySQL to automate data insertion. The procedure created multiple country records and, for each country, generated time-series data for a range of years (2018 to 2022). Built-in functions such as UUID() were used to generate unique identifiers, while RAND() was used to simulate realistic numerical values for economic, demographic, and investment indicators. This approach ensured consistency across related tables through shared foreign keys.
The dataset was then validated using SQL queries executed in MySQL Workbench and DBeaver. Queries were used to verify the number of records, confirm referential integrity, and test joins between tables. Sample queries ensured that each country had corresponding records across all indicator tables and that relationships were correctly enforced.
Tools used in this process include MySQL Workbench for schema design and database creation, DBeaver for database access and validation, and SQL for data generation and manipulation. This workflow ensures that another researcher can reproduce the dataset by recreating the schema, executing the stored procedure, and validating the results using standard SQL queries.
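The generation and validation steps can be approximated as follows. This is a sketch under stated assumptions, not the stored procedure itself: it runs on Python's sqlite3 instead of MySQL, uses uuid.uuid4() and random.uniform() in place of UUID() and RAND(), and reduces each indicator table to a single placeholder value column. The function names, the column names, the count of five countries, and the value range are all hypothetical. The year range is passed as a parameter and set here to the 2018 to 2022 span mentioned in the steps.

```python
import random
import sqlite3
import uuid

INDICATOR_TABLES = (
    "economic_indicators",
    "demographic_indicators",
    "investment_indicators",
)


def build_schema(conn):
    """Create a reduced schema: country plus three one-metric indicator tables."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
    conn.execute("CREATE TABLE country (country_id TEXT PRIMARY KEY, name TEXT NOT NULL)")
    for table in INDICATOR_TABLES:
        # A single placeholder 'value' column stands in for the real metrics.
        conn.execute(
            f"CREATE TABLE {table} ("
            "record_id TEXT PRIMARY KEY, "
            "country_id TEXT NOT NULL REFERENCES country(country_id), "
            "year INTEGER NOT NULL, "
            "value REAL NOT NULL)"
        )


def populate(conn, n_countries, first_year, last_year, seed=0):
    """Insert countries, then one row per country, year, and indicator table."""
    rng = random.Random(seed)  # stand-in for MySQL RAND()
    for i in range(n_countries):
        country_id = str(uuid.uuid4())  # stand-in for MySQL UUID()
        conn.execute(
            "INSERT INTO country VALUES (?, ?)", (country_id, f"Country {i + 1}")
        )
        for year in range(first_year, last_year + 1):
            for table in INDICATOR_TABLES:
                # The same country_id is reused in every indicator table.
                conn.execute(
                    f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
                    (str(uuid.uuid4()), country_id, year, rng.uniform(0.0, 100.0)),
                )
    conn.commit()


def validate(conn, n_years):
    """Count rows, orphaned rows, and countries with missing years per table."""
    report = {}
    for table in INDICATOR_TABLES:
        rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        # Referential integrity: indicator rows without a matching country.
        orphans = conn.execute(
            f"SELECT COUNT(*) FROM {table} t "
            "LEFT JOIN country c ON c.country_id = t.country_id "
            "WHERE c.country_id IS NULL"
        ).fetchone()[0]
        # Coverage: countries whose row count differs from the number of years.
        incomplete = conn.execute(
            f"SELECT c.country_id FROM country c "
            f"LEFT JOIN {table} t ON t.country_id = c.country_id "
            "GROUP BY c.country_id HAVING COUNT(t.record_id) <> ?",
            (n_years,),
        ).fetchall()
        report[table] = {
            "rows": rows,
            "orphans": orphans,
            "incomplete_countries": len(incomplete),
        }
    return report


conn = sqlite3.connect(":memory:")
build_schema(conn)
populate(conn, n_countries=5, first_year=2018, last_year=2022)
report = validate(conn, n_years=5)
print(report["economic_indicators"])
# {'rows': 25, 'orphans': 0, 'incomplete_countries': 0}
```

With five countries and five years, each indicator table should hold 25 rows, with no orphaned rows and no country missing a year. The same three checks translate directly into COUNT and LEFT JOIN queries that can be run in MySQL Workbench or DBeaver against the real schema.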
Institutions
- Wentworth Institute of Technology, Boston, Massachusetts