Type: Bachelor’s thesis
Used technologies: Big data, Python
The aim of this thesis was to find methods and procedures for categorizing users based on their internet browsing history.
The work applies analytical and statistical methods to identify categories of websites that are characteristic of specific groups of users. Clustering algorithms proved insufficiently descriptive for finding the required categories, so a topic-model algorithm, pLSA (probabilistic latent semantic analysis), was used instead. Each topic found by this algorithm is a distribution over websites, and each user is described by a distribution over the found topics. The topic descriptions were supplemented with categories from the DMOZ database and later with the most important words appearing on the web pages that characterize each topic. The anonymized data was provided by an unnamed antivirus company.
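To illustrate the kind of model described above, here is a minimal sketch of pLSA fitted with the standard EM updates on a small user-by-website visit-count matrix. It is not the thesis implementation: the function name `plsa`, the variable names, the toy matrix, and the parameter values are all assumptions chosen for illustration, and the dense three-dimensional array used in the E-step would not scale to real browsing data.

```python
import numpy as np


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a user-by-website visit-count matrix (hypothetical helper).

    Returns (p_topic_given_user, p_site_given_topic).
    """
    rng = np.random.default_rng(seed)
    n_users, n_sites = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialisation; each row is normalised to a probability distribution.
    p_z_u = rng.random((n_users, n_topics))
    p_z_u /= p_z_u.sum(axis=1, keepdims=True)
    p_s_z = rng.random((n_topics, n_sites))
    p_s_z /= p_s_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: responsibilities P(topic | user, site), shape (users, topics, sites).
        joint = p_z_u[:, :, None] * p_s_z[None, :, :]
        resp = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # Expected counts n(user, site) * P(topic | user, site).
        weighted = counts[:, None, :] * resp

        # M-step: re-estimate P(site | topic) and P(topic | user).
        p_s_z = weighted.sum(axis=0)
        p_s_z /= p_s_z.sum(axis=1, keepdims=True) + eps
        p_z_u = weighted.sum(axis=2)
        p_z_u /= p_z_u.sum(axis=1, keepdims=True) + eps

    return p_z_u, p_s_z


# Toy data (assumed): 4 users, 4 websites, two groups with disjoint browsing habits.
counts = np.array(
    [
        [5, 4, 0, 0],
        [6, 3, 0, 0],
        [0, 0, 4, 5],
        [0, 0, 3, 6],
    ],
    dtype=float,
)

user_topics, topic_sites = plsa(counts, n_topics=2)
print(user_topics)   # one row per user: distribution over topics
print(topic_sites)   # one row per topic: distribution over websites
```

Each row of `topic_sites` is a topic expressed as a distribution over websites, and each row of `user_topics` describes one user as a distribution over topics, which mirrors the two outputs named in the abstract.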