
While deleting my unused Google data, I stumbled on a post I wrote on Blogger in early 2010. It still interests me, so I'm copying it here.

Code refactoring is like living in your own house: you need to clean out the garbage.

Porting code is like living in someone else's house: you know it's messy, but you can't decide which things are garbage to throw away and which are useful things to keep.

After a while, you have your own stuff, which is easy for you to keep or throw away, and a few old legacy things have been thrown out because nobody uses them.

But suddenly you find that you can't tell which stuff is yours and which is legacy.

You may say: well, forget the mess, it's not mine, and I'm not sure whether it's useless. But the truth is, it's yours.

Some day you can't stand the messy house anymore, and you buy a new one.

I have been analyzing Apache access logs recently. Sick of writing MapReduce programs the Java way, I chose Python, which is far better than Java at processing text, and took advantage of the [stream](http://en.wikipedia.org/wiki/Stream_(computing)), which is the nature of MapReduce programming.

Word Count Example

The Mapper reads its input from standard input, splits each line into words, and emits every word as a tab-separated (word, 1) pair.

    #!/usr/bin/env python
    import sys

    # Emit a tab-separated (word, 1) pair for every word read from stdin.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))

All the pairs are sorted by word and sent to the Reducer. Don't worry, Hadoop Streaming does that for us.
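Since the excerpt stops before the Reducer, here is a minimal sketch of what one could look like. It is not the original post's code: it assumes the input arrives on standard input as `word<TAB>count` lines already sorted by word, and the helper names `parse` and `reduce_counts` are made up for illustration.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(lines):
    """Yield (word, count) from tab-separated 'word<TAB>count' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, _, count = line.partition("\t")
        yield word, int(count)


def reduce_counts(lines):
    """Sum the counts of each run of identical words.

    Relies on the input being sorted by word (an assumption here),
    so all pairs for one word are adjacent.
    """
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    for word, total in reduce_counts(sys.stdin):
        print("%s\t%d" % (word, total))
```

Because both scripts only talk to stdin and stdout, they can be tried locally by piping the mapper's output through `sort` into the reducer before running anything on a cluster.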


org-mode is an amazing tool: you can use it to write documents, make spreadsheets, track to-do lists, generate agendas, and more. If you use Emacs a lot, you don't want to miss this unicorn.

I mainly use it as my notebook and GTD tool.

This post is about how I configured it to be my single GTD tool for daily use.

Customized Task

I use 5 different states for my task items. WAITING and CANCELED are abnormal cases: when a task switches to either state, a note giving the reason should be added, along with a time-stamp. When a task is finished, it switches to the DONE state with a time-stamp, but no note is needed.


I have an application that runs perfectly on a single machine. It was designed to be used this way, but things have changed: at least three copies are now deployed, and the number keeps growing.

The main job of the application is to collect data and then analyze it. Being used by three products means deploying it to three machines.

It always runs in the background as three processes, so I need to keep an eye on every process, trace the logs, and apply nearly the same configuration to each copy on every machine, plus some product-specific settings. It's chaos.

I plan to make it distributed, with one server doing the product-specific analysis and several clients collecting data. The server dispatches the clients, gives them jobs to do, sets the configuration per client, and collects the running status of all clients.

Hope this brings me some relief. :)
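To make the plan a bit more concrete, here is a rough in-memory sketch of the server's bookkeeping. Everything in it is hypothetical (the `Coordinator` class, its method names, the setting keys), and it leaves out networking and job dispatch entirely; it only shows shared settings being merged with per-client overrides and status being tracked in one place.

```python
class Coordinator(object):
    """In-memory stand-in for the server side; no networking involved."""

    def __init__(self, shared_config):
        self.shared_config = dict(shared_config)  # settings common to all clients
        self.overrides = {}  # client id -> product-specific settings
        self.status = {}     # client id -> last reported status

    def register(self, client_id, **product_settings):
        """Record a client together with its product-specific settings."""
        self.overrides[client_id] = product_settings
        self.status[client_id] = "registered"

    def config_for(self, client_id):
        """Return the shared settings with this client's overrides applied."""
        config = dict(self.shared_config)
        config.update(self.overrides[client_id])
        return config

    def report(self, client_id, status):
        """Store the latest status a client reports about itself."""
        self.status[client_id] = status


# Usage with made-up setting names.
server = Coordinator({"interval": 60, "log_level": "info"})
server.register("client-a", product="alpha", log_level="debug")
server.report("client-a", "running")
```

The point of the design is that the nearly identical configuration lives in one place, and each machine only carries what is different about it.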


Recently, I needed to analyze lots of short messages, such as tweets. To be accurate, I had to remove the duplicate messages.

String comparison is not enough; what I wanted was to remove all the similar messages.

Clustering works, but I didn't want to write a whole single-pass algorithm, so a colleague suggested that I consider Lucene.

I had never used Lucene before, so I searched for its documentation and read some of it. At first, I thought I could put all the short messages into a Lucene index, then search by each message's content, set a hit-score threshold, collect all the hit docs, and remove every doc whose score is beyond the threshold.

But when it came to implementation, it was not easy to extract the important words from a message's content to use as the keywords for "search by each message content".
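For comparison, the single-pass idea mentioned earlier can be sketched in a few lines. This is not the Lucene approach and not the code from the original post: it assumes whitespace tokenization and Jaccard similarity over word sets, and the `threshold` value of 0.8 is an arbitrary choice for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0  # treat two empty messages as identical
    return len(a & b) / float(len(a | b))


def dedupe(messages, threshold=0.8):
    """Keep only the first message of each group of similar messages.

    Single pass: every message is compared against the messages kept
    so far and dropped if it is similar enough to any of them.
    """
    kept = []  # list of (token set, original message)
    for message in messages:
        tokens = set(message.lower().split())
        if any(jaccard(tokens, seen) >= threshold for seen, _ in kept):
            continue
        kept.append((tokens, message))
    return [message for _, message in kept]
```

Each message is compared with everything kept so far, so the cost grows quadratically with the number of distinct messages, which is one reason an index such as Lucene is attractive for large inputs.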


Emacs is one of the best text editors even without any packages, yet nearly every Emacs user has collected a bunch of packages that fit their needs.

With those packages, Emacs becomes the best text editor, no more "one of".

OK, Vim is a good editor, too. I tried it before I got addicted to Emacs, but couldn't get used to it. I felt uncomfortable with its philosophy, the separation of editing and operation, and the ESC key used to switch between them, which is really hard to reach.

Some may also say that Emacs is an operating system: it can be used to write programs in any language, to read or send email, to play games, even as a Twitter client.

As always, when I see an interesting tool, I Google whether Emacs can do the same, and it turns out it can, with just a few packages.


A normal Hive table

A normal Hive table can be created by executing this script:

    CREATE TABLE user (
        userId BIGINT,
        type INT,
        level TINYINT,
        date STRING
    )
    COMMENT 'User Information';

The table is empty, and useless until we load some data into it.

    LOAD DATA INPATH '/user/chris/data/testdata' OVERWRITE INTO TABLE user;