Legitimate domains and DGA categorized morphologically and by families.
Description
Dataset Composition
- Examples: 4,090,661
- Legitimate: 998,313
- DGA: 3,092,348
- DGA Families: 160
- Morphological Types: 5

The dataset combines a collection from the DGArchive feed gathered between 09/28/2025 and 10/28/2025 with the Majestic Million domain set collected on 09/28/2025. Examples found in both lists were discarded to improve the quality and reliability of the labels.

This dataset was conceived during research on the detection and classification of DGA domains using deep learning and natural language processing techniques. This research resulted in the publication of two articles and a master's thesis:
- Class Incremental Deep Learning: A Computational Scheme to Avoid Catastrophic Forgetting in Domain Generation Algorithm Multiclass Classification. https://doi.org/10.3390/app14167244
- Deep Convolutional Neural Network and Character Level Embedding for DGA Detection. http://dx.doi.org/10.5220/0012605700003690
- Detecção de domínios gerados por algoritmos com aprendizado profundo incremental e DNS passivo [Detection of algorithm-generated domains with incremental deep learning and passive DNS]. https://hdl.handle.net/11449/313556

We organized the DGA examples by family; the family labels come from the original DGArchive data. For the morphological type, we identified five major groups, described below:
- Random: DGA families that generate domain names forming unintelligible character sequences. Although these sequences are not truly random, they give the impression of being random, hence the label chosen for this morphological type. Examples: 15zkgsh1n100ax15m265x1cnkdk7.org, uhovosxkjkcg.ru
- Pseudo-words: This morphological format emulates the formation of real words by alternating vowels and consonants. The technique aims to evade security layers that rely on entropy to detect DGA domains. Examples: stajaq.com, nonafudazage.name
- Seed-domain: Some DGAs start from an existing, often legitimate, domain name and add characters at the beginning or end, permute characters, or change the TLD of the initial domain. Examples: tlzeitudeconscientiousavl.com, agirtvolveras.com
- Wordlist: Another very effective way some threats evade automated DGA detection is to build their domains from dictionaries of real words, creating domains from permutations of those words. These DGAs are particularly difficult to detect because their morphology is very close to that of legitimate domains. Examples: kneemanualgirl.com, leadunabledeal.art, christinashaquila.ru
- Subdomain: Another strategy recently used by attackers to evade automated DGA analysis is to place the generated name in a subdomain instead of the initial domain. Some even use subdomains of legitimate Dynamic DNS providers. We found DGA families with examples of both morphologies (domain and subdomain), but there are already threats that rely exclusively on the subdomain morphology. Examples: odtzcjajsrxh.dyndns.org, eboxmj56grwjs2afs6i.ddns.net
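
The Pseudo-words description mentions security layers that rely on entropy. As an illustration only, the sketch below computes the character-level Shannon entropy of the leftmost label for one Random and one Pseudo-words example listed above. The function names shannon_entropy and first_label are hypothetical helpers introduced here; they are not part of the dataset or of the cited publications, and this single feature is not the detection method used in that research.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def first_label(domain: str) -> str:
    """Return the leftmost label of a domain name (hypothetical helper)."""
    return domain.lower().split(".")[0]


# Two examples taken from the list above; both labels have 12 characters,
# so the comparison is not skewed by label length.
examples = {
    "Random": "uhovosxkjkcg.ru",
    "Pseudo-words": "nonafudazage.name",
}

for morph_type, domain in examples.items():
    label = first_label(domain)
    print(f"{morph_type:13s} {domain:20s} {shannon_entropy(label):.3f}")
```

For these two labels the pseudo-word scores lower (roughly 3.02 bits against 3.25 bits), which is consistent with the evasion goal described above. Entropy grows with label length and alphabet size, so values are only comparable between labels of similar length, and a difference this small suggests why entropy alone is a weak signal.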