Sunday, 28 October 2012

Machine Learning library: training process for ID3 completed.

TODO:

  1) Tree serialization/saving/restoring
  2) C4.5 implementation
  3) Tree usage https://github.com/vk4arm/DTree

 Right now it is just for fun, but the tree generated from the training set looks logical...
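
For readers who have not met ID3, here is a minimal sketch of the training step. It is not the DTree code: the function names, the nested-dict tree layout and the toy weather data are all assumptions made for illustration. The idea is to pick the attribute with the highest information gain, split on it, and recurse.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the rows on attr."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    """Build a tree: a leaf is a label, a node is {"attr": ..., "children": {...}}."""
    if len(set(labels)) == 1:          # pure node: stop
        return labels[0]
    if not attrs:                      # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"attr": best, "children": {}}
    for value in set(row[best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["children"][value] = id3([r for r, _ in pairs],
                                      [l for _, l in pairs],
                                      remaining)
    return node


def classify(tree, row):
    """Walk the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree


# Toy training set (hypothetical data, for illustration only).
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

On this toy set the root splits on "outlook", and only the "rainy" branch needs a second split on "windy".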

Monday, 22 October 2012

Machine Learning library - in progress.

I want to store decision trees in Redis, reuse them, and build them with the ID3 and C4.5 tree-building algorithms.
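
One way the storage part could work, as a rough sketch only: serialize the tree to a JSON string and keep that string under a key. Nothing here comes from DTree. The nested-dict tree layout, the key name and the function names are assumptions, and a plain dict stands in for Redis so the sketch runs without a server.

```python
import json

# Hypothetical tree layout: a leaf is a label, a node is a dict.
tree = {
    "attr": "outlook",
    "children": {
        "sunny": "play",
        "rainy": {"attr": "windy", "children": {"no": "play", "yes": "stay"}},
    },
}


def dump_tree(tree):
    """Serialize a tree to a JSON string (branch values must be strings)."""
    return json.dumps(tree, sort_keys=True)


def load_tree(payload):
    """Restore a tree from the string produced by dump_tree."""
    return json.loads(payload)


# A dict standing in for the key-value store; with a real Redis client
# the same string would be written and read back under the same key.
store = {}
store["dtree:weather"] = dump_tree(tree)   # "dtree:weather" is a made-up key
restored = load_tree(store["dtree:weather"])
print(restored == tree)
```

JSON turns every object key into a string, so this only round-trips cleanly when attribute values are strings, as they are here.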

https://github.com/vk4arm/DTree

Not yet ready!!!!!!! :-)

Sunday, 21 October 2012

How to get a free list of active email-to-SMS providers. Or a little reverse engineering of Oracle software.

Yes, you could google it, but then you would have to test every gateway, and how do you do that without a phone number at each provider you found?
It is better if somebody does it for you.

Small hack :-)
OraTweet, an Oracle APEX (Oracle Application Express) application, includes a free SMS feature, so its provider list can be pulled out of it.

A few words about APEX apps: one is just a .zip archive containing .sql files and some static resources, so the list of email-to-SMS providers can be found inside it.



1) Download oratweet:  http://oratweet.com/  -> download

2) Unzip it.

3) Open SMS_GATEWAY.sql in your text editor.

4) ???

5) Profit !!! :-)))
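
If you would rather not read the file by eye, the mysterious step 4 could be scripted. This is only a sketch: I am assuming the gateways appear in the SQL text as addresses of the form something@domain, which may not match the real file, and the sample text below is made up with placeholder domains.

```python
import re

# Matches the domain part that follows an "@" sign.
DOMAIN_AFTER_AT = re.compile(r"@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")


def extract_gateway_domains(sql_text):
    """Return the sorted, de-duplicated, lowercased domains found after '@'."""
    return sorted({m.lower() for m in DOMAIN_AFTER_AT.findall(sql_text)})


# Hypothetical sample, NOT the real contents of SMS_GATEWAY.sql.
sample = """
insert into gateways values ('Carrier A', 'number@sms.example.com');
insert into gateways values ('Carrier B', 'number@TXT.example.net');
insert into gateways values ('Carrier A again', 'number@sms.example.com');
"""

print(extract_gateway_domains(sample))
```

For the real file, read it with open("SMS_GATEWAY.sql").read() and pass the text to the same function.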

Tuesday, 16 October 2012

Anything you want to do, NP-hard.

Anything you want to do, NP-hard. Sorry, this is an AI class. Everything is hard.

Topic Detector

We built a topic detector overnight. The input is a text or a URL, the output is a classification.
The classification (not clustering) is general-purpose and works remarkably well; even a "manual zoom" is possible.


Wednesday, 10 October 2012

Multilanguage soundex algorithm and library

Wikipedia says:

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as PostgreSQL, MySQL, MS SQL Server and Oracle) and is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.

But the world is bigger than just English.

My solution includes an implementation for Russian and a simple way to configure other languages. I will talk to Leonid; maybe he will be kind enough to add some European languages to this tool (Ukrainian, for example :-) )

Usage:

>>> import pysoundex
>>> print pysoundex.soundex("Приветище", lang='ru_RU')
п613
>>> print pysoundex.soundex("Hello")
h400   <--- the same as in MySQL!
>>> print pysoundex.soundex("Halo")
h400


Configuration for Russian language is:


  "ru_RU": {
    "vowels": ['у','е','ы','а','о','э','я','и','ю','ь','ъ','й'],
    "consonants": {
      1: ['б','п','ф','в'],
      2: ['с','ц','з','к','г','х'],
      3: ['т','д'],
      4: ['л','й'],
      5: ['м','н'],
      6: ['р'],
    }
  },


All language rules are configurable in the soundconfig variable in the lang_soundconfig.py file.
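
To show how a table like this can drive the encoding, here is a small sketch written for Python 3. It is not the pysoundex source: the names SOUND_CONFIG and soundex_sketch, the "en_US" key and the exact rules (keep the first letter, skip unknown letters, collapse repeated codes unless a vowel separates them, pad with zeros) are my assumptions. The English digit groups are the standard Soundex ones; the Russian entry copies the config above.

```python
# Hypothetical config in the same shape as the "ru_RU" entry above.
SOUND_CONFIG = {
    "en_US": {
        "vowels": list("aeiouy"),
        "consonants": {
            1: list("bfpv"),
            2: list("cgjkqsxz"),
            3: list("dt"),
            4: list("l"),
            5: list("mn"),
            6: list("r"),
        },
    },
    "ru_RU": {
        "vowels": list("уеыаоэяиюьъй"),
        "consonants": {
            1: list("бпфв"),
            2: list("сцзкгх"),
            3: list("тд"),
            4: list("лй"),
            5: list("мн"),
            6: list("р"),
        },
    },
}


def soundex_sketch(word, lang="en_US", length=4):
    """Encode a word as its first letter plus consonant-group digits."""
    rules = SOUND_CONFIG[lang]
    # Reverse lookup: letter -> digit. A letter listed in both tables
    # (like 'й') is treated as a consonant because codes are checked first.
    codes = {ch: str(digit)
             for digit, letters in rules["consonants"].items()
             for ch in letters}
    vowels = set(rules["vowels"])
    word = word.lower()
    if not word:
        return ""
    result = word[0]
    prev = codes.get(word[0])
    for ch in word[1:]:
        code = codes.get(ch)
        if code is None:
            if ch in vowels:
                prev = None        # a vowel lets the same code repeat
            continue               # vowels and unknown letters are not encoded
        if code != prev:
            result += code
        prev = code
    return (result + "0" * length)[:length]


print(soundex_sketch("Приветище", lang="ru_RU"))
print(soundex_sketch("Hello"))
print(soundex_sketch("Halo"))
```

With these assumptions the sketch gives the same codes as the usage session above for the three sample words; adding a language means adding one more entry with its own "vowels" and "consonants" tables.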

Feel free to modify and add new languages.

You can download it here:


https://github.com/vk4arm/pysoundex

Initial post

This blog mostly contains links to, and descriptions of, the useful solutions I publish on GitHub, plus a few other things.