TelecomX
Description
**Short description.** The task is to analyze subscribers' Internet traffic consumption by examining data on the volume of transmitted and received traffic exported from the communication equipment (switches).

The data for this subject area is provided as files of synthetic data. A full description with diagrams is provided in the attachment. Additional dataset options are available here: https://github.com/d-yacenko/dataset

For entities such as Client, Physical, Company, Plan, Subscriber, and PSXAttrs, the data is provided as a current snapshot: one entity, one file with its current data. Data from the switches is exported every 10 minutes; for example, with 10 switches, 24 × 6 × 10 = 1440 files are exported per day. File names contain the switch name and the export time. Alternatively, the data could be streamed from the equipment via systems such as Kafka.

This work presents three dataset variants, each containing exports from 6 switches over 7 days, for operators of different sizes:

- telecom10k - operator with 10,000 subscribers (51 MB),
- telecom100k - operator with 100,000 subscribers (696 MB),
- telecom1000k - operator with 1,000,000 subscribers (7.2 GB).

During the analysis, each subscriber's retrospective traffic consumption must be compared with the current consumption; if atypical consumption is detected, the subscriber should be flagged as presumably hacked (see the sketches at the end of this section).

A data mart should be built for each hour of switch data, i.e. the number of data marts equals the period, in hours, covered by the exported operational data, e.g. 24 × 7 = 168. Each data mart should contain the following data:

- time,
- client name,
- client contract number,
- contact data for communication with the client,
- presumed hacking status (hacked/clear),
- justification of the presumed hacking status (a brief history of traffic consumption).

The methods required for this task include data cleaning, data loading, data mart calculation, etc. In addition to the analysis itself, it is important to apply data governance practices: control data quality at all stages of the analysis (data quality), determine the origin of the data (data lineage), and describe a glossary of the subject area.

**Expected result.** Based on the available data, a set of data marts with a calculation interval of 1 hour of input data should be constructed, containing information on subscribers' traffic consumption and signs of suspected hacking.
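**Possible approach (sketch).** One way to turn the 10-minute export files into the hourly figures needed for the data marts is to bucket the files by the export time embedded in their names and sum each subscriber's traffic per hour. Below is a minimal Python/pandas sketch; the file-name pattern `psx_<switch>_<YYYYMMDDHHMM>.csv` and the columns `IdSubscriber`, `UpTx`, `DownTx` are assumptions for illustration and should be adjusted to the real export schema from the attachment.

```python
from datetime import datetime
from pathlib import Path
import re

import pandas as pd

# Hypothetical file-name pattern: psx_<switch>_<YYYYMMDDHHMM>.csv
FNAME_RE = re.compile(r"psx_(?P<switch>\w+)_(?P<ts>\d{12})\.csv$")


def export_hour(path: Path) -> str:
    """Derive the hour bucket (YYYY-MM-DD HH:00) from an export file name."""
    m = FNAME_RE.search(path.name)
    if m is None:
        raise ValueError(f"unexpected export file name: {path.name}")
    ts = datetime.strptime(m.group("ts"), "%Y%m%d%H%M")
    return ts.strftime("%Y-%m-%d %H:00")


def hourly_traffic(export_dir: str) -> pd.DataFrame:
    """Sum up- and downlink traffic per subscriber for every hour of exports."""
    frames = []
    for path in sorted(Path(export_dir).glob("*.csv")):
        # Assumed per-record columns; replace with the real switch-export fields.
        df = pd.read_csv(path, usecols=["IdSubscriber", "UpTx", "DownTx"])
        df["hour"] = export_hour(path)
        frames.append(df)
    all_traffic = pd.concat(frames, ignore_index=True)
    return (all_traffic
            .groupby(["hour", "IdSubscriber"], as_index=False)[["UpTx", "DownTx"]]
            .sum())
```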
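The "atypical consumption" check can stay simple for a first iteration: compare the current hour's total traffic with the subscriber's historical hourly mean and flag values that exceed it by more than k standard deviations. The sketch below builds one hourly mart under the assumptions above, plus a hypothetical `clients` snapshot with columns `IdSubscriber`, `Name`, `Contract`, `Phone`; it is an illustration, not the prescribed detection method.

```python
import pandas as pd


def build_mart(hourly: pd.DataFrame, clients: pd.DataFrame,
               current_hour: str, k: float = 3.0) -> pd.DataFrame:
    """Build one hourly data mart: compare the current hour with each subscriber's history."""
    hourly = hourly.assign(total=hourly["UpTx"] + hourly["DownTx"])
    history = hourly[hourly["hour"] < current_hour]
    current = hourly[hourly["hour"] == current_hour]

    # Retrospective per-subscriber statistics over all earlier hours.
    stats = (history.groupby("IdSubscriber")["total"]
             .agg(["mean", "std"]).fillna(0.0).reset_index())
    mart = current.merge(stats, on="IdSubscriber", how="left").fillna(0.0)

    # Flag consumption far above the subscriber's own history.
    suspicious = mart["total"] > mart["mean"] + k * mart["std"]
    mart["status"] = suspicious.map({True: "hacked", False: "clear"})
    mart["justification"] = (
        "current=" + mart["total"].astype(int).astype(str)
        + ", historical mean=" + mart["mean"].round().astype(int).astype(str)
        + " ± " + mart["std"].round().astype(int).astype(str)
    )

    # Attach client details (column names here are placeholders for the real snapshot files).
    mart = mart.merge(clients[["IdSubscriber", "Name", "Contract", "Phone"]],
                      on="IdSubscriber", how="left")
    return mart[["hour", "Name", "Contract", "Phone", "status", "justification"]]
```

Looping `build_mart` over the 168 hour buckets yields the full set of marts; the threshold `k` and the justification text are deliberately simple placeholders to be refined during the analysis.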
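On the data-governance side, the data-quality checks can start as a handful of assertions run before each mart calculation and grow together with the glossary. The sketch below reuses the same hypothetical column names as above.

```python
import pandas as pd


def quality_report(traffic: pd.DataFrame, clients: pd.DataFrame) -> dict:
    """A few illustrative data-quality checks on the switch exports and client snapshot."""
    return {
        # Traffic volumes in the switch exports must be non-negative.
        "negative_traffic_rows": int((traffic[["UpTx", "DownTx"]] < 0).any(axis=1).sum()),
        # Every subscriber seen on a switch should exist in the snapshot files.
        "unknown_subscribers": int(
            (~traffic["IdSubscriber"].isin(clients["IdSubscriber"])).sum()),
        # Client records must carry a contract number and contact data.
        "clients_missing_contract": int(clients["Contract"].isna().sum()),
        "clients_missing_contacts": int(clients["Phone"].isna().sum()),
        # Exact duplicate rows in the exports hint at a repeated file load.
        "duplicate_traffic_rows": int(traffic.duplicated().sum()),
    }
```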