Friday, May 25, 2012

Count regexp matches on a table: Sorting result sets by relevance in PostgreSQL

    The idea of counting the number of matches in a field that a regexp has, is a very intesting method to sort sets, for example, a web search result set, when you need the most relevant sources first (for example, containing the searched word more times).

    I researched a bit and unfortunatelly, MySQL is far behind PostgreSQL, despite the fact that Oracle bought it (yes, at first I though it could improve a lot with Oracle's experience, but reality is that Oracle is not going to put another free competitor against its own), so I will write this howto for PostgreSQL with an example.

    PostgreSQL has a very clever function regexp_matches that will show all substrings which matches a pattern and we will use that to count matches and sort result.

    For example, suppose we have this table in database:
 
DATABASE TABLE AND DATA
TABLE NAME: CONTENT
idINTEGER
htmltextTEXT
TABLE DATA
idhtml
1key
2key key key
3key key
4not matching pattern
    If we try to search by keyword key, we could want to appear in relevance order:


Expected resultActual result
id21
id32
id13
    To achieve this sorting, we need to use regexp_matches correctly along with our result set, like this way:
SQL Example Sorting Sentence
SELECT id, COUNT(*) AS patternmatch
   FROM (
      SELECT id,regexp_matches(htmltext, 'key', 'g')
         FROM CONTENT
   )
   AS alias
   GROUP BY id ORDER BY patternmatch DESC;

    That will result in a set like Expected result in previous table.

    But, that was only an example!! How about a real usage?

    With a little modification of previous sentence, it could be applied to sort a real search query, just substitute bold text with your pattern and your subquery/data set.

    The only requeriment to do that is: Your subquery must have id and htmltext between its results! (and of course, replace 'key' with

    So assuming we have a function called our_expensive_lookup (with the lookup parameter) that do a very expensive lookup by some keyword, we could apply this sorting to our returned results by:
SQL Real Usage Sorting Sentence
SELECT id, COUNT(*) AS patternmatch
   FROM (
      SELECT id,regexp_matches(htmltext, 'key', 'g')
         FROM (
            SELECT id, htmltext
               FROM our_expensive_lookup('key')
         ) AS lookup
   ) AS alias
   GROUP BY id ORDER BY patternmatch DESC;
    Of course, it can be combined with prepared statements, stored procedures, and so on! The hardest part is done, so just use your imagination!

Wednesday, May 23, 2012

What can you do against SGAE, MPAA, RIAA, GEMA and those ones?

    It is known for everybody what actions are those organizations doing world wide: declaring you are a criminal because you watch/download material, applying censorship to internet, adding fees to buy media in some countries just because you could use it for criminal purposes and those things.

    Specially, I am referring to megaupload case: They closed megaupload with no proofs, and they are suing as much people as they can, just by giving their IPs, with the only argument that they are causing lots of revenue loss to multimedia market.

    Well, the first thing I will say publicly is: A people wich download some material is not a buy loss as they argue. Like me, for example, I don't, and will not buy everything I hear, see, watch online. So just the fact I download does not mean that I would buy it. It is just false: I will not buy it anyway!

    So what's the real loss for them? Their fail is that time had change, now we are in 2012 year, not in 1950, and they simply refuse to change their business model. It is like if dinosaurs try to sue meteors that ended them..just a nonsense.

    But, the real matter is: What can YOU do against it? It is pretty simple, and this is what I do. I declare it publicly and if you are Sony, Emi, SGAE, GEMA, RIAA, MPAA, Hollywood have some problem with my declarations, then go and come with me, I will tell you this again at the face!

    So, what can be done to force those agencies/companies to realize we are just fed up to be treated like criminals?

    Very simple: Do not go to theaters, do not buy original music CDs, do not buy films. Not just a month, not do it never until they change their paths.

    I usually have this conversation with people, and they often say things like: a person can't do anything against them, and so on, this is simply a false argument.

    If you cause a real loss of revenue to them, they will realize they are alive thanks to us, which are their customers, and they will learn how to treat their potential customers, rather than mark everybody as a criminal, and the stuff they do.

    So, in my case, I am not going to theaters, not buying anything with copyrights since the closure of megaupload until it comes back again and they face reality (which the later I doubt it would happen so...)...

    And publicly I say: I encourage everybody to pirate copyrighted materials always. Let's show them what a real loss of revenue can mean!