------------------------------------------------------------
Automated analysis of malware-related Internet domain names
============================================================

To fight e-crime effectively and take down malware distribution sites, a semi-automated tool is needed that helps detect malicious domain names based on their (hidden) linguistic properties. The platform for the tool should probably be Unix/Linux.

A normal entity (person or company) never has to invent domain names. When registering a domain, the probable name is rather evident to the applicant: it is derived from a trademark, a location, etc. Moreover, a normal entity never has to invent tens, hundreds or thousands of domain names per day (a mass registration of "similar" domains could indicate bad intent).

The conditions are much more stringent for cybercrime players. They _have_ to invent hundreds of domain names per day, which can be well beyond the limits of the human brain. Using random identifiers is not the best strategy because these are easily identified as such. When dreaming up new names for the purpose, e-crime players usually juxtapose lexical items that are within reach of their mind on that particular day. Quite naturally, they think in their mother tongue.

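
The claim that random identifiers are easily identified as such can be illustrated with a very small heuristic. The sketch below (Python) is only an assumption about how such a check might look: the function names, the two thresholds and the decision to treat "y" as a consonant are all hypothetical, and the heuristic will misfire on some genuine words with long consonant clusters.

```python
VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant


def vowel_ratio(label: str) -> float:
    """Fraction of the letters in a label that are vowels."""
    letters = [c for c in label.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def longest_consonant_run(label: str) -> int:
    """Length of the longest unbroken run of consonant letters."""
    longest = run = 0
    for c in label.lower():
        if c.isalpha() and c not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def looks_random(label: str,
                 min_vowel_ratio: float = 0.25,
                 max_consonant_run: int = 4) -> bool:
    """Flag a label as machine-generated; both thresholds are arbitrary guesses."""
    return (vowel_ratio(label) < min_vowel_ratio
            or longest_consonant_run(label) > max_consonant_run)


if __name__ == "__main__":
    # Labels taken from the examples in this document, without the TLD.
    for label in ["nvrvyylnhmmyjij", "hgtvccrtyqcmnuq",
                  "namebuyline", "tubemoviesonline"]:
        print(label, looks_random(label))
```

A heuristic like this only covers the "randomness" wall of the corridor; names built from juxtaposed words pass it untouched and need the linguistic analysis discussed in the rest of this text.
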
A real example of artificially generated domain names (for a virus):

nvrvyylnhmmyjij.biz
oopomplzqosjktxl.org
oopomplzqosjktxl.com
vqooooqopqgqpff.info
vqooooqopqgqpff.org
mfrvsxqqsyolomiz.biz
mfrvsxqqsyolomiz.com
owjsvyjhglphhxyg.info
owjsvyjhglphhxyg.com
hgtvccrtyqcmnuq.net
hgtvccrtyqcmnuq.biz
tlodrqupvrgvblpp.info
tlodrqupvrgvblpp.com
utnehncplngpj.org
utnehncplngpj.com
tykphkulivveuq.biz
tykphkulivveuq.org
tusqrhgwuphwkfwk.net
tusqrhgwuphwkfwk.com
frcqoqmvantopwmu.info
frcqoqmvantopwmu.biz
ensqiitdjuonxzi.net
ensqiitdjuonxzi.org
spxrdrhyhtlpqirz.info
spxrdrhyhtlpqirz.com
tdccpmhjesluwtkr.biz
tdccpmhjesluwtkr.com
siwkfsxnmnjybqv.info
siwkfsxnmnjybqv.org

This way, a cybercrime player has to act in a narrow corridor: one wall is a certain level of randomness, the other is the natural limit of the human brain. The hypothesis is that most malware domains could be discovered as such via "proper" linguistic analysis.

An example of domain names that actually carry malware in the real world (two days, one particular source):

slidingsystems.in
load-crude.com
tubemoviesonline.info
doubletest4411.com
navolak.com
www.photoshock.com.pt
neusarqa.com
topmovieshd.info
chdphilippines.org
davailavedavai.com
zioaskc.com
www.iijt.ru
vertierogovq.net
blackfuril.ru
greatqare.com
miraxgroupmirax.com
freemysmartrend.net
sa13.a2j9x.com
virginwebgirls.co.cc
av-scann.tk
scanner.virusmustdie.info
kcqmevsxf.co.cc
scaner-rbv.tk

Let's have a look at the output of a mass registration (one particular day, taken from the wild):

ns2 . nameashop . cn
ns2 . namebrandmart . cn
ns2 . namebuyline . cn
ns2 . namebuypicture . cn
ns2 . namestorefilmlife . cn
ns2 . namesupermart . cn
ns2 . nanoautofinest . cn
ns2 . nanotopdiscover . cn
ns2 . nanotopfind . cn

Now let's do a manual linguistic analysis of this list. It is easy to see that the invented domain names follow a certain pattern (pattern, then the number of names matching it):

ns2 . na* . cn       - 9
ns2 . name* . cn     - 6
ns2 . nameb* . cn    - 3
ns2 . nano* . cn     - 3
ns2 . namebuy* . cn  - 2
ns2 . names* . cn    - 2
ns2 . nanotop* . cn  - 2

In practice, knowledge of Russian (or some other language) turns out to be a big advantage: for one reason or another, the correlation between certain mother tongues and the probability of being a cybercrime player is high. It is also possible that using an offensive-word dictionary for certain languages will sometimes help.

One extremely important background technology is Bayesian filtering, which is widely used in spam filtering. Last but not least, some of the necessary knowledge is available from the field of brute-force password cracking: the permutation formulas for dictionary words.

(NB - very revealing information of another kind is available about domain names, such as the registration date vs. the current date, but checking this is rather trivial and does not need any specific linguistic experience.)

--------------------------------------------------------------

Summary - what is really needed is a tool on the boundary of linguistics and mathematics. That tool should be aware of:

1. words (English, Russian etc.), like "name", "buy", "line"
2. FQDN rules, specifically dots (".") and some other conventions such as "ns2." (a probable nameserver) and ".cn" (a country code)

The tool should also be able to "look" over word delimiters and _inside_ the words, contrary to what most linguistic tools are able to do today. It is possible that a simple manual markup tool could help improve the Bayesian detection.

To the best of our knowledge, no tool with the listed capabilities exists so far. Log correlators (e.g. the one by Risto Vaarandi) are usually unable to work with shorter "sentences" or to check for permutations/combinations in the middle of a line. Normal linguistic tools, in turn, are unaware of DNS conventions and cannot look at the sentence level and the word level at the same time, due to the limitations normal languages have...

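
The manual prefix count on the mass-registration sample, and the requirement to look _inside_ the words, can both be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a proposed design: the function names (labels, prefix_counts, segment), the min_len/min_count parameters and the toy WORDS dictionary are all hypothetical, and a real tool would load proper per-language word lists.

```python
from collections import Counter

# Hypothetical toy dictionary, just large enough for the sample below.
WORDS = {"name", "nano", "buy", "line", "picture", "brand", "mart", "shop",
         "store", "film", "life", "super", "auto", "finest", "top",
         "discover", "find", "a"}

# The mass-registration sample quoted in this document.
SAMPLE = [
    "ns2 . nameashop . cn", "ns2 . namebrandmart . cn",
    "ns2 . namebuyline . cn", "ns2 . namebuypicture . cn",
    "ns2 . namestorefilmlife . cn", "ns2 . namesupermart . cn",
    "ns2 . nanoautofinest . cn", "ns2 . nanotopdiscover . cn",
    "ns2 . nanotopfind . cn",
]


def labels(fqdn):
    """Split an FQDN into lowercase labels, tolerating 'ns2 . x . cn' spacing."""
    return [part.strip().lower() for part in fqdn.split(".") if part.strip()]


def prefix_counts(names, min_len=2, min_count=2):
    """Count how many names share each leading substring of at least min_len."""
    counts = Counter()
    for name in names:
        for end in range(min_len, len(name) + 1):
            counts[name[:end]] += 1
    return {prefix: n for prefix, n in counts.items() if n >= min_count}


def segment(label, words=WORDS):
    """Split a label into dictionary words; return None if that is impossible.

    best[i] holds a segmentation of label[:i]; segmentations with fewer
    (hence longer) words are preferred.
    """
    best = [None] * (len(label) + 1)
    best[0] = []
    for end in range(1, len(label) + 1):
        for start in range(end):
            piece = label[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(label)]


# Assumption: the registered name is the second label ("ns2" is the host part).
registered = [labels(fqdn)[1] for fqdn in SAMPLE]
shared = prefix_counts(registered)

if __name__ == "__main__":
    for prefix, n in sorted(shared.items(), key=lambda item: (-item[1], item[0])):
        print(f"{prefix}* - {n}")
    for name in registered:
        print(name, segment(name))
```

The counts for "na", "name", "nameb", "nano", "namebuy", "names" and "nanotop" match the manual analysis; the script also reports intermediate prefixes such as "nam", which a real tool would probably collapse. The segmentation step is where dictionary awareness enters: a generated label such as "nvrvyylnhmmyjij" yields None, while every name in the sample splits cleanly into dictionary words.
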
The solution sought should have the following capabilities:

* define a grammar and notation suitable for the particular task
* be dictionary-aware for at least some critical languages
* be blacklist- and whitelist-aware
* be aware of obscenity lists
* apply Bayesian probability markers
* propose the best methodology for working with DNS names
* analyze the randomness factor in domain names
* uncover hidden relations and likelihoods among "lexical" items
* enable easy interfacing for input, output and knowledge markup
* have an easy "markup" GUI to train Bayesian filters

2010-10-25 anto@cert.ee