Legal W2V: Legal Vocabulary Embedding

Published: 28 May 2021 | Version 1 | DOI: 10.17632/97tytghd47.1


Legal W2V is a legal domain-specific word embedding trained on more than 48,000 judgments of the Supreme Court of India decided between January 1950 and December 2016 (67 years). It contains more than 34,000 legal vocabulary words and their respective 100-dimensional vectors, trained with the Continuous Bag of Words (CBOW) variant of Word2Vec using the Gensim library. Potential applications of this legal word embedding include active research areas such as legal information retrieval and legal informatics. Parameter settings: vector size: 100, window size: 10, number of iterations (iter): 10, minimum count (min_count): 10.
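The parameter settings above could map onto a Gensim call roughly as follows. This is a minimal sketch, not the authors' training script: the names PARAMS_GENSIM3, to_gensim4 and train_cbow are hypothetical, and it assumes that the "iter" label in the description refers to the Gensim 3.x keyword of that name, which Gensim 4.x renamed to "epochs" (likewise "size" became "vector_size"). Setting sg=0 selects the CBOW variant.

```python
# Hyperparameters from the description, written with Gensim 3.x keyword names.
# Assumption: "iter" in the description is the Gensim 3.x argument of that name.
PARAMS_GENSIM3 = {
    "size": 100,      # vector size
    "window": 10,     # context window size
    "iter": 10,       # number of training iterations
    "min_count": 10,  # ignore words seen fewer than 10 times
    "sg": 0,          # 0 = Continuous Bag of Words (CBOW)
}


def to_gensim4(params):
    """Rename Gensim 3.x keyword arguments to their Gensim 4.x equivalents."""
    renames = {"size": "vector_size", "iter": "epochs"}
    return {renames.get(key, key): value for key, value in params.items()}


def train_cbow(sentences):
    """Train a CBOW model on `sentences`, an iterable of token lists.

    Gensim is imported lazily so the parameter mapping can be inspected
    without the library installed; training requires Gensim 4.0 or later.
    """
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **to_gensim4(PARAMS_GENSIM3))
```

Calling train_cbow on a tokenized judgment corpus would return a model whose word vectors are available through its wv attribute; the corpus itself has to be prepared separately.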


Steps to reproduce

The judgment corpus was prepared from more than 48,000 Indian judgments, each consisting of metadata and the actual judgment text with noisy elements. The pre-processed judgment corpus was produced using the proposed dictionary-based approach and domain-specific pre-processing, discussed in the following paper: Jenish Dhanani, Rupa Mehta and Dipti Rana, “Legal Document Recommendation System: A Dictionary based Approach”, International Journal of Web Information Systems. When using this embedding, please cite that paper together with: Jenish Dhanani, Rupa Mehta and Dipti Rana, “Legal document recommendation system: A cluster based pairwise similarity computation”, Journal of Intelligent & Fuzzy Systems, DOI: 10.3233/JIFS-189871.
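For readers who want to use the embedding, the sketch below shows one way to read word vectors and compare two words by cosine similarity using only the Python standard library. It assumes the vectors are stored in the plain-text word2vec format (a header line with the word count and dimension, then one word and its values per line); the actual file format of this dataset is not stated here, so check it before relying on this. The function names and the three-dimensional sample values are invented purely for illustration and are not taken from Legal W2V.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec text format: 'count dim' header, then 'word v1 ... vN'."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a vector file; real vectors would have 100 dimensions.
sample = io.StringIO(
    "3 3\n"
    "court 1.0 0.0 0.0\n"
    "judge 0.8 0.6 0.0\n"
    "appeal 0.0 1.0 0.0\n"
)
vectors = load_text_vectors(sample)
similarity = cosine(vectors["court"], vectors["judge"])
```

With a real file, the io.StringIO object would be replaced by an opened file handle, for example open(path, encoding="utf-8"), where path points to the downloaded vectors.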


Sardar Vallabhbhai National Institute of Technology


Information Retrieval, Natural Language Processing, Machine Learning, Informatics, Knowledge Representation, Word Embedding
