Friday, December 23, 2005

Things to be changed in SC

1. Replace CVS with Subversion
2. Deploy CAS
3. Use Maven

Wednesday, December 21, 2005

Antivirus solutions on the mail server side...

Just blog the stuff I got about antivirus solutions on the mail server side:
1. clamav + clamav-milter (sendmail)
2. AMaViS - A Mail Virus Scanner
3. MailScanner (http://www.mailscanner.info/) - probably the easiest to install

Saturday, December 03, 2005

ajax framework

http://rialto.application-servers.com/demoRialto.jsp

Well, it uses the Apache license, so it could probably be employed in our application.

Tuesday, November 15, 2005

MD5 cracker source released

http://www.stachliu.com.nyud.net:8090/collisions.html

Think twice if you are going to use MD5.
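For anything where collision resistance matters, a stronger digest is the safer choice. A minimal Java sketch using java.security.MessageDigest with SHA-256 (available since J2SE 1.4.2, as far as I know); the input string is just an example:

import java.security.MessageDigest;

public class DigestDemo {
    public static void main(String[] args) throws Exception {
        byte[] data = "hello world".getBytes("UTF-8");

        // Use SHA-256 instead of MD5 where collision resistance matters
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(data);

        // Print the digest as a hex string
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
            String h = Integer.toHexString(digest[i] & 0xff);
            if (h.length() == 1) {
                hex.append('0');
            }
            hex.append(h);
        }
        System.out.println(hex);
    }
}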

Friday, October 28, 2005

A day of bug fixing...

It might be good to tell more about my job...

Now I received a list of bugs:
[hmm site]/forum/main/list/cn.page

PRC forum
• Bug: IE browser title text should be in S.Chinese. [Yes, it should be; I tried to fix it, but time did not allow.]
• Bug: Click 免費入會 will go to HK site.
• Bug: Click 忘記密碼 will go to HK site.
• Bug: Click 登出 will go to HK site.
• Bug: Click 更改個人資料 will go to HK site.
• Bug: Click HMM logo will go to HK site.
• Bug: Add ALT for images on left panel
• Bug: The “編輯” image is broken.

Both HK and PRC
• Bug: Search: if you key in 2 words (1 word is OK) and there are several results, it returns an error.

After receiving the list from Sam, I started checking the issues. Some of them were already known to me.

DCCP protocol...

Datagram Congestion Control Protocol

http://www.icir.org/kohler/dcp/

Wednesday, October 26, 2005

Allan mail server is down...

Argh, I have no idea why it is down... It is a client from Corpmart, and we are providing support to them...

I need to explain why it hung.

Monday, July 04, 2005

Using Spamassassin and procmail to filter spam

See also: Mail Server FAQ
Table of contents
1 Procmail and Spamassassin

1.1 SpamAssassin: Automagic teaching

1.1.1 One spamassassin autolearning setup
1.1.2 An Alternative Autolearning Setup
1.1.3 Performance issues with training the bayesian filter
Procmail and Spamassassin

We'll use the combination of procmail (http://www.procmail.org) and Spamassassin (http://www.spamassassin.org) to filter spam from our incoming mail. Procmail allows us to feed our incoming mail through different programs, then place it in different mailboxes based on the output of those programs.

You can set up procmail and Spamassassin system-wide, but I prefer to do it per user, for the sake of flexibility. If you do set it up system-wide (http://www.spamassassin.org/sitewide.html), the users can always customize their own rules. Refer to this IBM Developerworks article (http://www-106.ibm.com/developerworks/linux/library/l-spam/?t=gr,lnxw03=StampSpam) for more detail on Spamassassin. Once you have installed Spamassassin and procmail, we want to set up a .procmailrc in our home directory. This will take the mail from Postfix, filter it through spamassassin, then deliver it to the Maildirs. For the sake of definition, procmail becomes our MDA (Mail Delivery Agent). Since procmail, not Postfix, will now be delivering to the Maildirs, double-check this part of your setup.

[user@mail ~]$ cat .procmailrc
#.procmailrc - by jorge - Updated by glenn 6/8/05

# Set our default shell, which may not be necessary, as it
# uses the user's default
SHELL=/usr/bin/bash

# Set our default Maildir. You can also set the HOME variable, which
# may be set automatically by procmail
MAILDIR=/home/jorge/Maildir

# Set our default mailbox, which is where mail that doesn't have
# a sort rule applied goes. The trailing / is important.
DEFAULT=/home/jorge/Maildir/

# Set up logging. It's a good idea to let procmail log all its
# actions until you're sure it's working right.
LOGFILE=${MAILDIR}/procmail.log
LOG="--- Logging ${LOGFILE} for ${LOGNAME}, "

# Whatever recipes you'll use
# The order of the recipes is significant

# First, run everything through spamassassin. The "-P -a" tags are
# not needed with 3.0+, as they are now the defaults.
:0fw
| spamassassin

# Now that we've tagged spam, put it in its own folder

:0:
* ^X-Spam-Status: Yes
/home/jorge/Maildir/.spam/

Notice the DEFAULT directory... that's where my mail will go, into the new directory under my Maildir. That's how it shows up as new mail. If I put it in the cur directory in my Maildir, the mail would never show up as new; it would show up as already read, and we don't want that. Once you read a mail in the new directory, it moves over to the cur directory. That's how the system keeps track of which mails are new and which ones are the 'cur'rent ones in your inbox. That's also why each subfolder has its own new, cur, and tmp directories.

Procmail is smart enough to put new mail into the "new" subdirectory for whatever Maildir folder you want it to put mail in. If you are using an IMAP server like dovecot that builds indexes, you have to omit "new/" and just leave the trailing slash (as in the script), as otherwise procmail doesn't use standard Maildir filenames, and dovecot will constantly be rebuilding a corrupted index file.

The big spamkiller is the ":0fw | spamassassin" recipe. Users of versions of spamassassin prior to 3.0 may want to add the -a switch, which enables autowhitelisting (enabled by default in 3.0+). When you reply to people, after a while they get put in the whitelist, and then they won't be judged so harshly the next time Spamassassin checks mail from them. If you're like me, you have some friends who, even with an autowhitelist, will score 20 or higher on the spamassassin scale.

The next recipe puts spam in the new subdirectory of my spam directory. It will then show up as new spam in our mail reader, already filed for inspection. As you guessed, once you read the spam it gets moved to the cur folder. I could send it directly to the .Trash folder, but I prefer to keep it separate, and this way you can scan it every once in a while and catch the nifty HTML email your friends from hotmail send you. Confused? Just use the example (http://www.spamassassin.org/dist/procmailrc.example) from the Spamassassin website. You can direct the possible spam to /dev/null, but this is not recommended: if you get a false positive you'll lose that mail, which is why we created the .spam folder.

Next, we need to create a .forward file to get the email to go to procmail.

[user@mail ~]$ cat .forward
|/usr/bin/procmail

Notice that all the file does is send your incoming email to procmail. Procmail then accesses your .procmailrc file and, through it, delivers your email to the Maildir/ after running spamassassin. Without the .forward file your email will be delivered directly to your Maildir/ and will avoid spamassassin entirely.

Spamassassin does a great job by default, but let's say you want to tailor it more. Create a .spamassassin directory in your home directory (mkdir .spamassassin). In there, create a user_prefs file using your favorite text editor. In time, the autowhitelister will put your whitelist in this directory too.

[user@mail ~/.spamassassin]$ cat user_prefs
# custom rules for spamassassin
score RAZOR_CHECK 4.0
score REMOVE_SUBJ 4.0
score SUBJ_REMOVE 4.0

score REPLY_REMOVE_SUBJECT 4.0
score REMOVE_IN_QUOTES 4.0
score HTML_WITH_BGCOLOR 4.0
score REALLY_UNSAFE_JAVASCRIPT 4.0
score CHARSET_FARAWAY_BODY 4.0
score NO_MX_FOR_FROM 4.0

score CTYPE_JUST_HTML 4.0
score WEB_BUGS 4.0
score SUBJ_ALL_CAPS 4.0
score LINES_OF_YELLING 4.0
score FOR_FREE 4.0

In this file we override the default scores of some Spamassassin rules with our own. I did this because the default values for these rules seemed too low; who gets valid email with web bugs? So we raised the scores for these rules. Spamassassin still keeps the default threshold of 5 as the definition of spam, so none of these rules will declare a mail spam by itself, but a message using one of these techniques is probably using others as well, so it will score higher and be tagged as spam. Refer to the Spamassassin rules list (http://www.spamassassin.org/tests.html) for the complete list of rules.

Remember that these custom rules are in the user accounts, so if you don't want a certain account to use them, don't create a user_prefs for them. If they don't want any spam filtering whatsoever, then don't create a .procmailrc for them, Postfix will work just fine, because it was working before you even got to this step, right?
SpamAssassin: Automagic teaching

SpamAssassin works well as a spam filter, but it works much better when it uses Bayesian filtering rather than depending on its ruleset alone, simply because of the adaptable nature of Bayesian filtering. You may be familiar with mail clients such as Thunderbird that use Bayesian filtering that you have to 'teach' your email habits to. This requires marking lots of spam as such, and reclassifying lots of email wrongly labeled as spam back to ham. The same is required with SpamAssassin, so that it has enough data to use its Bayesian component. Of course, who wants to sit at the command line telling SpamAssassin when it was right and when it was wrong? So we use a simple script, a crontab entry and some client-side manipulation to automate the process.
One spamassassin autolearning setup

In my (glenn) setup, I have spamassassin learn from my Inbox and Junk mailboxes, as I keep them nicely cleaned up and properly sorted. I don't want to deal with an overflowing Junk mailbox, but I also don't want spam cleared so quickly that I miss false positives. So, I run sa-learn every Thursday and Sunday night, and clean out spam older than 31 days at that point. I use the following learning script:

glenn@vasp:~$ cat learn.sh
#!/bin/bash
/usr/bin/sa-learn --spam ~/Maildir/.Junk/cur
/usr/bin/sa-learn --ham ~/Maildir/cur

I omit /new as that's email I haven't had a chance to properly file in case of false positives/negatives. Then, I use the following script to clean up old spam:

glenn@vasp:~$ cat cleanup-junk.sh
#!/bin/bash

# Removes all files from ~/Maildir/.Junk/cur that are older than
# 31 days ago

find ~/Maildir/.Junk/cur -mtime +30 -exec rm -f {} \;

I simply placed these in my crontab:

glenn@vasp:~$ crontab -l
# min hr dom month dow cmd
# Teach spamassassin about spam
0 4 * * 4,7 /home/glenn/learn.sh

# Clean up spam directory
0 5 * * 4,7 /home/glenn/cleanup-junk.sh

This works nicely for me, keeping spamassassin smart and preventing me from missing important emails.
An Alternative Autolearning Setup

In my .procmailrc, I tell procmail to move mail marked as spam by Spamassassin to my $Maildir/.Trash/. This is important, because the way I am setting this up, anything that is sent to the Junk folder is deleted every 4 hours. For this reason, you should probably disable automatic moving of spam by your email client. Or, set up two junk folders: junk_manual, which is stuff you've personally approved as spam, and junk_auto, which is mail moved by your mail client's junk mail filter and which you should probably check over before moving it to junk_manual for reading by SpamAssassin and removal from your system.

It is important to tell SpamAssassin not only what is spam, but what is ham, or legitimate email. I am running this script in my root's crontab, so I specify which home directory to go to. I don't really see why you couldn't run this as your normal user, but since SpamAssassin is run as root when called by procmail (at least, I think it is), I figure it is best to do the same here. My learning script first tells spamassassin that any read or unread junk mail is spam. I then have it hit my inbox and a number of other oft used folders on my system to tell it what is ham. It may be wise to leave out your inbox, as you may not have a chance to move spam that gets through, leading SA to believe that spam is ham, which is, of course, a no-no. This is the same reason I do not have SA read my Trash folder, which sometimes has legit email in it (though I do keep 99% of my email, so it's not a huge concern.)

[user@mail ~]$ cat ~/learn.sh
#!/bin/bash
/usr/bin/sa-learn --spam /home/user/Maildir/.Junk/new
/usr/bin/sa-learn --spam /home/user/Maildir/.Junk/cur
/usr/bin/sa-learn --ham /home/user/Maildir/new
/usr/bin/sa-learn --ham /home/user/Maildir/cur
rm /home/user/Maildir/.Junk/new/*
rm /home/user/Maildir/.Junk/cur/*

We then want to run that script every 4 hours, or however often you wish to have it run.

[user@mail ~]$ crontab -l
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=""
# m h dom mon dow command
0 */4 * * * /home/user/bin/learn.sh

And now, SpamAssassin can learn from its mistakes as you simply 'delete' (move to the Junk folder) your spam clientside.
Performance issues with training the bayesian filter

I (tm) used to use a script like those above on my linode (http://www.linode.com), but for whatever reason the sa-learn process would suck up a lot of memory and essentially bring the linode to its knees without ever completing the learning process. So, instead of throwing whole directories at sa-learn, I give it each mail one at a time.

So now, my learnspam script looks like:

#!/bin/sh
## Process high scoring spam in the .Spam folder and anything leftover in .Junk
#set -x
# feed to the bayesian learner
echo "Processing Junk maildir..."
spams=`find ~/Maildir/.Junk/cur ~/Maildir/.Junk/new/ -type f -mtime +7`
for spam in $spams
do sa-learn --spam --showdots --no-sync $spam
done
rm -f $spams
echo "Processing Spam maildir..."
spams=`find ~/Maildir/.Spam/cur ~/Maildir/.Spam/new/ -type f -mtime +3`
for spam in $spams
do sa-learn --spam --showdots --no-sync $spam
done
sleep 1
rm -f $spams

And my learnham script is:

#!/bin/sh
## Process ham
if [ $# -lt 1 ]; then
echo "Usage: $0 "
exit 3
fi
#set -x
# feed to the bayesian learner
for allmail in `find ~/Maildir -name cur -type d -ctime -$1 | egrep -v '(Trash|Junk|Spam)' | xargs -n 50 -idir find dir -type f -ctime -$1`
do for mail in $allmail
do echo $mail; sa-learn --ham --showdots --no-sync $mail
done
done
sleep 1
for allmail in `find ~/Maildir -name new -type d -ctime -$1 | egrep -v '(Trash|Junk|Spam)' | xargs -n 50 -idir find dir -type f -ctime -$1`
do for mail in $allmail
do echo $mail; sa-learn --ham --showdots --no-sync $mail
done
done

This may be helpful to anyone else who wants to run SA with bayesian learning on something with little memory like a linode.

Sunday, June 19, 2005

Start installing other firmware on my Linksys router...

I am starting to use "Firmware Version: v3.03.6 - HyperWRT 2.1b1" for my wireless router.

Now I can telnet to it.

========== dmesg output===============
CPU revision is: 00029007
Primary instruction cache 8kb, linesize 16 bytes (2 ways)
Primary data cache 4kb, linesize 16 bytes (2 ways)
Linux version 2.4.20 (root@localhost) (gcc version 3.2.3 with Broadcom modifications) #1 Fri Feb 18 18:25:54 CET 2005
Determined physical RAM map:
memory: 01000000 @ 00000000 (usable)
On node 0 totalpages: 4096
zone(0): 4096 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: root=/dev/mtdblock2 noinitrd console=ttyS0,115200
CPU: BCM4712 rev 1 at 200 MHz
Calibrating delay loop... 199.47 BogoMIPS
Memory: 14444k/16384k available (1333k kernel code, 1940k reserved, 108k data, 64k init, 0k highmem)
Dentry cache hash table entries: 2048 (order: 2, 16384 bytes)
Inode cache hash table entries: 1024 (order: 1, 8192 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer-cache hash table entries: 1024 (order: 0, 4096 bytes)
Page-cache hash table entries: 4096 (order: 2, 16384 bytes)
Checking for 'wait' instruction... unavailable.
POSIX conformance testing by UNIFIX
PCI: Disabled
PCI: Fixing up bus 0
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Starting kswapd
devfs: v1.12c (20020818) Richard Gooch (rgooch@atnf.csiro.au)
devfs: boot_options: 0x1
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
ttyS00 at 0xb8000300 (irq = 3) is a 16550A
ttyS01 at 0xb8000400 (irq = 0) is a 16550A
HDLC line discipline: version $Revision: 1.1.1.4 $, maxframe=4096
N_HDLC line discipline registered.
PPP generic driver version 2.4.2
Physically mapped flash: Found an alias at 0x400000 for the chip at 0x0
Physically mapped flash: Found an alias at 0x800000 for the chip at 0x0
Physically mapped flash: Found an alias at 0xc00000 for the chip at 0x0
Physically mapped flash: Found an alias at 0x1000000 for the chip at 0x0
Physically mapped flash: Found an alias at 0x1400000 for the chip at 0x0
Physically mapped flash: Found an alias at 0x1800000 for the chip at 0x0
Physically mapped flash: Found an alias at 0x1c00000 for the chip at 0x0
number of CFI chips: 1
0: offset=0x0,size=0x2000,blocks=8
1: offset=0x10000,size=0x10000,blocks=63
Flash device: 0x400000 at 0x1c000000
Physically mapped flash: cramfs filesystem found at block 910
Creating 4 MTD partitions on "Physically mapped flash":
0x00000000-0x00040000 : "pmon"
0x00040000-0x003f0000 : "linux"
0x000e3918-0x003f0000 : "rootfs"
mtd: partition "rootfs" doesn't start on an erase block boundary -- force read-only
0x003f0000-0x00400000 : "nvram"
sflash: found no supported devices
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 1024 bind 2048)
Linux IP multicast router 0.06 plus PIM-SM
ip_conntrack version 2.1 (128 buckets, 1024 max) - 344 bytes per conntrack
ip_conntrack_pptp version 1.9 loaded
ip_nat_pptp version 1.5 loaded
ip_tables: (C) 2000-2002 Netfilter core team
ipt_time loading
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
NET4: Ethernet Bridge 008 for NET4.0
802.1Q VLAN Support v1.7 Ben Greear
All bugs added by David S. Miller
VFS: Mounted root (cramfs filesystem) readonly.
Mounted devfs on /dev
Freeing unused kernel memory: 64k freed
5325E phy=0
5325E VLAN programming for BCM5325E-MDIO I/F switch
1:(0x00) value=0x8000
2:(0x00) value=0x8000
1:(0x13) value=0x0000
2:(0x13) value=0x0002
1:(0x00) value=0x8000
2:(0x00) value=0x8000
1:(0x13) value=0x0002
2:(0x13) value=0x0006
1:(0x00) value=0x8000
2:(0x00) value=0x8000
1:(0x13) value=0x0006
2:(0x13) value=0x000e
1:(0x00) value=0x8000
2:(0x00) value=0x8000
1:(0x13) value=0x000e
2:(0x13) value=0x001e
1:(0x00) value=0x8000
2:(0x00) value=0x0000
eth0: Broadcom BCM47xx 10/100 Mbps Ethernet Controller 3.60.13.0
eth1: Broadcom BCM4320 802.11 Wireless Controller 3.60.13.0
flag=[get_flash] offset=[0] string=[]
Intel 28F320C3 2Mx16 BotB
Set flash_type=Intel 28F320C3 2Mx16 BotB
exit
Algorithmics/MIPS FPU Emulator v1.5
vlan0: add 01:00:5e:00:00:01 mcast address to master interface
vlan0: dev_set_promiscuity(master, 1)
device eth0 entered promiscuous mode
device vlan0 entered promiscuous mode
device eth1 entered promiscuous mode
br0: port 2(eth1) entering learning state
br0: port 1(vlan0) entering learning state
br0: port 2(eth1) entering forwarding state
br0: topology change detected, propagating
br0: port 1(vlan0) entering forwarding state
br0: topology change detected, propagating
flag=[get_eou_key_index] offset=[0] string=[]
eou_key_init()
location = [1]
Available eou key index is 1
exit
vlan1: Setting MAC address to xxxxxxxxxxxxxxx.
nvram_commit(): init
nvram_commit(): end
nvram_commit(): init
nvram_commit(): end
br0: port 2(eth1) entering disabled state
br0: port 1(vlan0) entering disabled state
vlan0: del 01:00:5e:00:00:01 mcast address from master interface
vlan0: del 01:00:5e:00:00:01 mcast address from vlan interface
br0: port 1(vlan0) entering disabled state
device vlan0 left promiscuous mode
br0: port 2(eth1) entering disabled state
device eth1 left promiscuous mode
vlan0: dev_set_promiscuity(master, -1)
device eth0 left promiscuous mode
vlan0: add 01:00:5e:00:00:01 mcast address to master interface
vlan0: dev_set_promiscuity(master, 1)
device eth0 entered promiscuous mode
device vlan0 entered promiscuous mode
device eth1 entered promiscuous mode
br0: port 2(eth1) entering learning state
br0: port 1(vlan0) entering learning state
br0: port 2(eth1) entering forwarding state
br0: topology change detected, propagating
br0: port 1(vlan0) entering forwarding state
br0: topology change detected, propagating
flag=[get_eou_key_index] offset=[0] string=[]
eou_key_init()
location = [1]
Available eou key index is 1
exit

Sunday, June 12, 2005

Flavio’s TechnoTalk » Blog Archive » Performance analysis on Linux

Flavio’s Performance analysis on Linux: "Performance analysis on Linux"

This article gives an overview of performance analysis on Linux.

Wireless USB adapter...

After doing some research, the following USB adapters seem to work in Linux:

Buffalo WLI-U2-KG54-AI HK$ 270
Surecom EP-9001-G HK$ 210 [RT2500 chipset]
Dlink DWL-G122 HK$ 220 [RT2500 chipset]
Zyxel G220 HK$ 239

EP-9001-GP $328
DWL-G132 $349

I have forgotten the formula for converting mW to dBm, which I learned in a wireless communications course.
I found a website that provides quite good tools for the calculation:
Wireless Javascript Toolkit

Need to compare the output power and antenna gain...
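For the record, the formula is P(dBm) = 10 · log10(P in mW), and going back, P(mW) = 10^(dBm/10). A small Java sketch of both directions (the sample values are only illustrations):

public class DbmConverter {
    // dBm is power referenced to 1 milliwatt: P(dBm) = 10 * log10(P(mW) / 1 mW)
    static double mwToDbm(double milliwatts) {
        return 10.0 * Math.log10(milliwatts);
    }

    static double dbmToMw(double dbm) {
        return Math.pow(10.0, dbm / 10.0);
    }

    public static void main(String[] args) {
        // A typical 802.11b/g output power of 32 mW is roughly 15 dBm
        System.out.println(mwToDbm(32));   // ~15.05
        System.out.println(dbmToMw(20));   // 100 mW
    }
}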

Tuesday, June 07, 2005

slower

Common database connection strings and download locations (reposted)

While debugging Spring + Hibernate I found that the SQL Server driver seemed to have problems. I had heard that jTDS is an excellent replacement, so while searching the web for the jTDS download I came across this article, reposted below (http://dev2dev.bea.com.cn/bbs/thread.jspa?forumID=123&threadID=20222&messageID=116981):

1. MySQL(http://www.mysql.com)mm.mysql-2.0.2-bin.jar
Class.forName( "org.gjt.mm.mysql.Driver" );
cn = DriverManager.getConnection( "jdbc:mysql://MyDbComputerNameOrIP:3306/myDatabaseName", sUsr, sPwd );

2. PostgreSQL(http://www.de.postgresql.org)pgjdbc2.jar
Class.forName( "org.postgresql.Driver" );
cn = DriverManager.getConnection( "jdbc:postgresql://MyDbComputerNameOrIP/myDatabaseName", sUsr, sPwd );

3. Oracle(http://www.oracle.com/ip/deploy/database/oracle9i/)classes12.zip
Class.forName( "oracle.jdbc.driver.OracleDriver" );
cn = DriverManager.getConnection( "jdbc:oracle:thin:@MyDbComputerNameOrIP:1521:ORCL", sUsr, sPwd );

4. Sybase(http://jtds.sourceforge.net)jconn2.jar
Class.forName( "com.sybase.jdbc2.jdbc.SybDriver" );
cn = DriverManager.getConnection( "jdbc:sybase:Tds:MyDbComputerNameOrIP:2638", sUsr, sPwd );
//(Default-Username/Password: "dba"/"sql")

5. Microsoft SQLServer(http://jtds.sourceforge.net)
Class.forName( "net.sourceforge.jtds.jdbc.Driver" );
cn = DriverManager.getConnection( "jdbc:jtds:sqlserver://MyDbComputerNameOrIP:1433/master", sUsr, sPwd );

6. Microsoft SQLServer(http://www.microsoft.com)
Class.forName( "com.microsoft.jdbc.sqlserver.SQLServerDriver" );
cn = DriverManager.getConnection( "jdbc:microsoft:sqlserver://MyDbComputerNameOrIP:1433;databaseName=master", sUsr, sPwd );

7. ODBC
Class.forName( "sun.jdbc.odbc.JdbcOdbcDriver" );
Connection cn = DriverManager.getConnection( "jdbc:odbc:" + sDsn, sUsr, sPwd );

8. DB2 (newly added)
Class.forName("com.ibm.db2.jdbc.net.DB2Driver");
String url="jdbc:db2://192.9.200.108:6789/SAMPLE";
cn = DriverManager.getConnection( url, sUsr, sPwd );

9.Microsoft SQL Server series (6.5, 7.x and 2000) and Sybase 10

JDBC Name: jTDS
URL: http://jtds.sourceforge.net/
Version: 0.5.1
Download URL: http://sourceforge.net/project/showfiles.php?group_id=33291

Syntax:
Class.forName("net.sourceforge.jtds.jdbc.Driver");
Connection con = DriverManager.getConnection("jdbc:jtds:sqlserver://host:port/database","user","password");
or
Connection con = DriverManager.getConnection("jdbc:jtds:sybase://host:port/database","user","password");

10.Postgresql
JDBC Name: PostgreSQL JDBC
URL: http://jdbc.postgresql.org/
Version: 7.3.3 build 110
Download URL: http://jdbc.postgresql.org/download.html
Syntax:
Class.forName("org.postgresql.Driver");
Connection con=DriverManager.getConnection("jdbc:postgresql://host:port/database","user","password");

11. JDBC syntax for IBM AS/400 hosts
If Client Access Express V4R4 or later is installed,
the driver file jt400.zip can be found under C:\Program Files\IBM\Client Access\jt400\lib;
rename its extension to make it jt400.jar.
Syntax:
java.sql.DriverManager.registerDriver (new com.ibm.as400.access.AS400JDBCDriver ());
Class.forName("com.ibm.as400.access.AS400JDBCConnection");
con = DriverManager.getConnection("jdbc:as400://IP","user","password");

12.informix
Class.forName("com.informix.jdbc.IfxDriver").newInstance();
String url = "jdbc:informix-sqli://123.45.67.89:1533/testDB:INFORMIXSERVER=myserver;"
    + "user=testuser;password=testpassword";
Lib:jdbcdrv.zip

Class.forName( "com.sybase.jdbc.SybDriver" )
url="jdbc:sybase:Tds:127.0.0.1:2638/asademo";
SybConnection connection= (SybConnection)DriverManager.getConnection(url,"dba","sql");

13.SAP DB
Class.forName ("com.sap.dbtech.jdbc.DriverSapDB");
java.sql.Connection connection = java.sql.DriverManager.getConnection("jdbc:sapdb://" + host + "/" + database_name, user_name, password);

14.InterBase
String url = "jdbc:interbase://localhost/e:/testbed/database/employee.gdb";
Class.forName("interbase.interclient.Driver");
//Driver d = new interbase.interclient.Driver (); /* this will also work if you do not want the line above */
Connection conn = DriverManager.getConnection( url, "sysdba", "masterkey" );

15.HSqlDB
url: http://hsqldb.sourceforge.net/
driver: org.hsqldb.jdbcDriver
There are four connection styles:
connection string (in-memory): jdbc:hsqldb:.
connection string (local file): jdbc:hsqldb:/path/to/the/db/dir
connection string (http): jdbc:hsqldb:http://dbsrv
connection string (hsql): jdbc:hsqldb:hsql://dbsrv
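To put one of these snippets in context, here is a minimal sketch of a full connection lifecycle with the jTDS driver (items 5 and 9 above); the host name, database, credentials and query are placeholders of my own, not values from the original post:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JtdsExample {
    public static void main(String[] args) throws Exception {
        // Register the jTDS driver
        Class.forName("net.sourceforge.jtds.jdbc.Driver");

        Connection con = DriverManager.getConnection(
                "jdbc:jtds:sqlserver://dbhost:1433/master", "sa", "secret");
        try {
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT GETDATE()");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            rs.close();
            stmt.close();
        } finally {
            con.close(); // always release the connection
        }
    }
}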

Wednesday, May 04, 2005

Quality Open Source for Windows

If for some reason you cannot live without Windows, you can still try out some excellent open source tools.

Take a look at TheOpenCD.org

Monday, April 25, 2005

uClinux for Linux Programmers

uClinux for Linux Programmers

Java memory leak reason...

Memory leak in Java:
An object is reachable but no longer live, meaning the object has reached the end of its lifecycle and the GC should be able to reclaim it, but some erroneous reference prevents it from being garbage collected. Most of the time, a single lingering reference can have a massive memory impact.

Leak
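A minimal sketch of the kind of lingering reference just described: the objects below stay reachable through a static collection even though the caller is done with them, so the GC can never reclaim them. The class and method names are made up for illustration.

import java.util.ArrayList;
import java.util.List;

public class LeakExample {
    // Objects added here stay strongly reachable for the lifetime of the
    // class, even after the code that created them no longer needs them.
    private static final List CACHE = new ArrayList();

    static void handleRequest(byte[] payload) {
        CACHE.add(payload);      // lingering reference: never removed
        // ... process payload ...
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100000; i++) {
            handleRequest(new byte[1024]); // eventually exhausts the heap
        }
    }
}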

Wednesday, April 06, 2005

"Should I buy ECC or non-ECC RAM?"

Q: Should I buy ECC or non-ECC RAM?
A:
ubiquityman wrote:
TO ECC OR NOT TO ECC, THAT IS THE QUESTION

It was published in 1998 (EE Times) that approximately 1 bit error occurs in 256MB of RAM every month. http://www.corsairdirect.com/ecc.html

RAM has increased in speed significantly since then, but manufacturing processes have also improved. It's hard to say what the current error rate is but let's assume that it's about the same.

So for those with 1G of memory, that's approximately 1 bit error every week.

It doesn't matter if you reboot every hour, as long as you run your computer 24h/day, the rate at which a bit error will occur is approximately once a week (if you have 1G of RAM). (If you run your computer less frequently, then obviously, the rate at which bit errors occurs will also be less.)

Also, the nature of memory is that although bit errors occur randomly to some extent, there is a higher probability that errors will reoccur at the same location, in weaker bits. We consider a memory bit as digital, but the actual silicon landscape is such that not all bits are created identical or equal. Some are "more susceptible" to soft errors.

So, if you have a "weak bit" in a frequently used place, you just have bad luck. Your machine might crash unexpectedly more often than others. Any programmer can tell you that an incorrect bit is unpredictable. It could be benign, but it could also be catastrophic in terms of data.

Now, if you do reboot frequently, that reduces the probability that the bit error will cause a negative effect because the bit error lives in your system for less time. It also reduces the compounding of errors which again would have increasing potential to cause data loss.

Like I've said before: if you play games and that's all you do, I probably would not pay the 12%-25% premium for ECC.

However, if you leave your machine on most of the time, overclock, stress your machine in other ways, or use it for something "serious" other than gaming, I would recommend ECC memory. For approximately the cost of eating out one night, I can have the peace of mind of ECC.

I put ECC in all my home machines except for my laptop and PDA. At today's prices, ECC is a bargain.

The luckier you feel, the less you need ECC. In the words of Dirty Harry "Do you feel, Lucky!?"

ECC PERFORMANCE

I checked the Intel website and they do say that there is some performance loss with ECC enabled.

I've also read elsewhere that unless a memory error is detected, there is no performance loss with modern chipsets. (I would suggest that the intel information is the correct one.)

However, I have personally run benchmarks on my PII-450 w/ i440BX chipset to test the performance of ECC memory.

My benchmarks show that my machine benchmarks faster in memory with ECC enabled.

This person here had similar results:

http://www.personal.psu.edu/faculty/l/a/lae2/fx83dinteg/fx83dinteg.htm

Ajax using XMLHttpRequest and Struts

Ajax using XMLHttpRequest and Struts: "Ajax using XMLHttpRequest and Struts"

Sunday, March 20, 2005

Windows NT 2000 ME 9x and XP Registry Tweaks

Hide computer in network neighbourhood

(Windows 98/Me/2000/XP)

This tweak will hide your computer from network neighbourhood.

Open your registry and find the key: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanServer\Parameters

Create a new DWORD value, name it Hidden and set the value to 1.

Tuesday, March 15, 2005

Turning off AutoComplete for TextBoxes in IE and FireFox

Turning off AutoComplete for TextBoxes in IE and FireFox

This is one a lot of people know, but it's worth covering again because it's easily forgotten as it's a small detail. Since we do eFinance sites, we often don't want folks' UserNames collected and stored in AutoComplete, especially when the site is browsed on a public machine.

<form id="formSignOn" autocomplete="off" method="post" runat="server">

Note that autocomplete="false" doesn't work. However, autocomplete="off" works in both IE and FireFox.

Saturday, March 12, 2005

Oracle Database 10g release 1 (10.1.0.2) installation on Fedora core 3 (FC3)

Oracle Database 10g release 1 (10.1.0.2) installation on Fedora core 3 (FC3): "Oracle Database 10g release 1 (10.1.0.2) installation on Fedora core 3 (FC3)"

Today, I installed Oracle 10.1.0.3 on my Linux box running Fedora Core 3. I am trying to build a prototype on it, and since working on a prototype is free, I can try it out.

Tuesday, March 08, 2005

摩托學園討論區 :: View post - java sdk 5.0 font installation

摩托學園討論區 :: View post - java sdk 5.0 font installation: "java sdk 5.0 font installation"

Java Simplified Chinese information...

source: http://www.cn-java.com/target/news.php?news_id=941

Chinese character encoding problems in JSP/Servlet

Source: IBM DW, by 大砍刀

1. The origin of the problem

Every country (or region) has defined a character encoding set for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China and JIS in Japan. As the basis for information processing within that country or region, they play the important role of providing a unified encoding. By length, character sets fall into two classes: SBCS (single-byte character sets) and DBCS (double-byte character sets). Early software (operating systems in particular) handled local text by shipping various localized versions (L10N), and concepts such as LANG and codepage were introduced to tell them apart. But because the code ranges of the various local character sets overlap, exchanging information between them is difficult, and maintaining each localized version of a piece of software separately is expensive. It therefore became necessary to pull out the parts common to all localization work, handle them uniformly, and reduce the specifically local processing to a minimum. This is what is known as internationalization (I18N). Language information was further standardized as Locale information, and the underlying character set became Unicode, which contains nearly every glyph.

Most internationalized software now does its core character handling in Unicode. At run time it determines the local character encoding from the current Locale/Lang/Codepage settings and processes local characters accordingly. During processing it has to convert between Unicode and the local character set, or even between two different local character sets with Unicode in the middle. In a networked environment this goes one step further: character information between any two ends of the network must also be converted, according to the charset settings, into something the other side can accept.

Internally, the Java language represents characters in Unicode, following Unicode 2.0. Whether a Java program reads or writes files as character streams, writes HTML to a URL connection, or reads parameter values from one, character encoding conversions take place. Although this adds programming complexity and can easily cause confusion, it is in line with the idea of internationalization.

In theory, these charset-driven conversions should not cause many problems. In practice, however, because actual runtime environments differ, because Unicode and the local character sets keep being supplemented and refined, and because some system or application implementations do not follow the rules, conversion problems constantly trouble programmers and users.

2. The GB2312-80, GBK and GB18030-2000 Chinese character sets and encodings

The fix for a Chinese encoding problem in a Java program is often very simple, but to understand the reason behind it and to locate the problem, you need to know the existing Chinese encodings and the conversions between them.

GB2312-80 was defined in the earliest stage of Chinese computer text processing in China. It contains most of the commonly used level-1 and level-2 Chinese characters plus 9 rows of symbols. It is the Chinese character set that almost every Chinese system and every internationalized piece of software supports, and it is the most basic Chinese character set. Its encoding range is 0xa1-0xfe for the high byte and 0xa1-0xfe for the low byte; the Chinese characters start at 0xb0a1 and end at 0xf7fe.

GBK is an upward-compatible extension of GB2312-80. It contains 20,902 Chinese characters, with an encoding range of 0x8140-0xfefe, excluding the code positions with 0x80 in the high byte. All of its characters can be mapped one-to-one to Unicode 2.0, which is to say that Java in fact provides support for the GBK character set. It is currently the default character set of Windows and some other Chinese operating systems, but not every internationalized program supports it; one gets the feeling they do not quite know what GBK is. Note that GBK is not a national standard, only a specification. With the release of the GB18030-2000 national standard, it will complete its historical mission before long.

GB18030-2000 (GBK2K) further extends the Chinese characters on top of GBK and adds the glyphs of minority scripts such as Tibetan and Mongolian. GBK2K fundamentally solves the problem of not having enough code positions and glyphs. Its main characteristics are:

It does not define all the glyphs; it only specifies the encoding range, which is left for later extension.
The encoding is variable-length: the two-byte part is compatible with GBK, while the four-byte part carries the added glyphs and code positions, with the first byte in 0x81-0xfe, the second byte in 0x30-0x39, the third byte in 0x81-0xfe and the fourth byte in 0x30-0x39.
Its rollout is phased; the first requirement is to support all glyphs that can be fully mapped to the Unicode 3.0 standard.
It is a national standard, and it is mandatory.
No operating system or application has implemented GBK2K support yet; that is the localization work of today and of the future.
As for an introduction to Unicode... let's skip that.

Among the encodings Java supports, those relevant to Chinese programming include (a few are not listed in the JDK documentation): ASCII 7-bit, same as ascii7
ISO8859-1 8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, latin1...
GB2312-80, same as gb2312, gb2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB...
GBK (note the capitalization), same as MS936
UTF8, same as UTF-8
GB18030 (currently supported only by IBM JDK 1.3.?), same as Cp1392, 1392


The Java language handles characters in Unicode. Looked at another way, though, a Java program can also work with non-Unicode transcoding; what matters is that the Chinese information at the program's entry and exit points is not corrupted. For example, handling Chinese entirely as ISO-8859-1 can also yield correct results, and many of the solutions circulating on the net are of this kind. To avoid confusion, this article does not discuss that approach.

3. Where the '?' and mojibake come from when transcoding Chinese

Conversion in either direction can produce a wrong result:

Unicode --> bytes: if the target character set has no corresponding code, the result is 0x3f.
For example:
"\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") gives "?ìéF?ù", whose hex value is 3fa8aca8a6463fa8b4.
Look closely at the result above and you will notice that \u00ec is converted to 0xa8ac and \u00e9 to 0xa8a6... their effective width has grown! This happens because some symbols in the GB2312 symbol area are mapped to common symbol codes. Since these symbols also appear in ISO-8859-1 and some other SBCS character sets, they sit near the front of Unicode, some of them with only 8 effective bits, and they overlap with the Chinese character encodings. (The mapping is only a mapping of code points; the rendering is not actually the same: in Unicode these symbols are single-width, while the Chinese versions are double-width.) There are 20 such symbols between \u00a0 and \u00ff. Knowing this is very important! It explains why the garbage produced by Chinese encoding mistakes in Java programming so often contains odd symbols (which is what they really are) rather than only '?' characters, as in the example above.

Bytes --> Unicode: if the character denoted by the bytes does not exist in the source character set, the result is 0xfffd.
For example:
byte ba[] = {(byte)0x81,(byte)0x40,(byte)0xb0,(byte)0xa1}; new String(ba,"gb2312");
The result is "?啊", with hex value "\ufffd\u554a". 0x8140 is a GBK character; the GB2312 conversion table has no value for it, so \ufffd is used. (Note that when this Unicode character is displayed, there is no corresponding local character either, so the previous case applies as well and it shows up as a "?".)

In real programming, when a JSP/Servlet program ends up with wrong Chinese text, it is usually these two processes stacked on top of each other, sometimes even applied over and over again.
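As a small illustration (my own, not from the article), the following sketch reproduces the two failure modes just described; the exact replacement behaviour can vary a little between JDK versions, but the idea is the same:

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        // Unicode -> bytes: characters missing from the target charset
        // are replaced with 0x3f ('?')
        byte[] b = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK");
        printHex(b);

        // bytes -> Unicode: byte sequences that are invalid in the source
        // charset become \ufffd (usually displayed as '?')
        byte[] ba = { (byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1 };
        System.out.println(new String(ba, "GB2312"));
    }

    static void printHex(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            System.out.print(Integer.toHexString(bytes[i] & 0xff));
        }
        System.out.println();
    }
}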

4. JSP/Servlet Chinese encoding problems and how to solve them in WAS

4.1 Symptoms of common encoding problems
The JSP/Servlet encoding problems that come up on the net usually show themselves on the browser or application side, for example:
Why have all the Chinese characters in my JSP/Servlet page turned into '?' in the browser?
Why have all the Chinese characters in my Servlet page turned into mojibake in the browser?
Why have all the Chinese characters in my Java application's UI turned into squares?
The JSP/Servlet page cannot display GBK characters.
Chinese inside the Java code embedded in <%...%> and <%=...%> tags of a JSP page becomes mojibake, while the other Chinese on the page is fine.
The JSP/Servlet cannot receive Chinese submitted from a form.
Database reads and writes from JSP/Servlet do not return the correct content.
Behind these problems lie various incorrect character conversions and handling (except for the third one, which is caused by an incorrect Java font setting). Solving this kind of encoding problem requires understanding how a JSP/Servlet runs and checking every point where things can go wrong.

4.2 Encoding problems in JSP/Servlet web programming
A JSP/Servlet running in a Java application server delivers HTML content to the browser; the original article illustrates the process with a figure (not reproduced here).
Character encoding conversions happen at the following points:

(a) JSP compilation. The Java application server reads the JSP source file according to the JVM's file.encoding value, compiles it into a Java source file, and writes that back to the file system, again according to file.encoding. If the current system language supports GBK, no encoding problem arises here. On an English system, such as Linux, AIX or Solaris with LANG set to en_US, the JVM's file.encoding must be set to GBK. If the system language is GB2312, decide as needed whether to set file.encoding; setting it to GBK avoids potential mojibake for GBK characters.


(b) Java has to be compiled to .class before it can run in the JVM, and this step has the same file.encoding issue as (a). From here on a servlet and a JSP run in much the same way, except that a Servlet is not compiled automatically. For a JSP, the generated intermediate Java file is compiled automatically (the sun.tools.javac.Main class is invoked directly), so if a problem appears at this step you should likewise check the encoding and the OS language environment, or convert the static Chinese embedded in the JSP's Java code to Unicode escapes, or keep static text output out of the Java code altogether. For a Servlet, passing the -encoding parameter to javac by hand when compiling is enough.


(c) The Servlet has to convert the HTML page content into an encoding the browser can accept before sending it out. Depending on the Java app server's implementation, some query the browser's accept-charset and accept-language parameters, or guess in some other way, to determine the encoding, and some do not bother at all. A fixed encoding is therefore probably the best solution. For Chinese pages, set contentType="text/html; charset=GB2312" in the JSP or Servlet; if the page contains GBK characters, set contentType="text/html; charset=GBK". Since IE and Netscape support GBK to different degrees, test this setting.
Because the high 8 bits of a 16-bit Java char are dropped during network transmission, and to make sure the Chinese characters in the Servlet page (both the embedded ones and those obtained while the servlet runs) come out in the expected encoding, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts according to the charset specified in contentType (the contentType must be set before this call!). You can also wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String). A sketch of the PrintWriter approach follows this item.
For JSPs, the Java application server should be able to guarantee that the embedded Chinese is sent out correctly at this stage.
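Here is the sketch mentioned in (c): a minimal servlet (my own illustration, not from the article) that sets the charset before obtaining the PrintWriter, so the writer converts the Unicode strings into GBK bytes.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ChineseOutputServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        // Set the charset BEFORE asking for the writer, so the
        // PrintWriter converts our Unicode strings to GBK bytes.
        res.setContentType("text/html; charset=GBK");
        PrintWriter out = res.getWriter();
        out.println("<html><body>\u4e2d\u6587\u6d4b\u8bd5</body></html>");
        out.close();
    }
}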


(d) This is about URL character encoding. If the parameter values coming back from the browser via GET or POST contain Chinese, the servlet will not obtain the correct values. In Sun's J2SDK, HttpUtils.parseName does not consider the browser's language setting at all when parsing parameters; it simply interprets the received value byte by byte. This is the most widely discussed encoding problem on the net. Because it is a design flaw, the only options are to re-parse the received string at the byte level, or to hack the HttpUtils class. Both are covered in reference article 2, though you had better change the Chinese encodings GB2312 and CP1381 used there to GBK, otherwise GBK characters will still cause problems.
Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, for specifying the encoding the application expects before request.getParameter("param_name") is called; this helps solve the problem once and for all.
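A hedged sketch of how that setCharacterEncoding call could be applied centrally with a Servlet 2.3 filter, so it runs before the first getParameter() call; the filter class name and its init parameter are my own illustration, not part of the article.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class EncodingFilter implements Filter {
    private String encoding = "GBK";

    public void init(FilterConfig config) {
        String configured = config.getInitParameter("encoding");
        if (configured != null) {
            encoding = configured;
        }
    }

    public void doFilter(ServletRequest req, ServletResponse res,
                         FilterChain chain) throws IOException, ServletException {
        // Must run before the first getParameter() call on the request.
        req.setCharacterEncoding(encoding);
        chain.doFilter(req, res);
    }

    public void destroy() {
    }
}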
4.3 The solution in IBM WebSphere Application Server

WebSphere Application Server extends the standard Servlet API 2.x and provides fairly good multi-language support. Running on a Chinese operating system, it handles Chinese well without any configuration. The notes below apply only when WAS runs on an English system, or when GBK support is needed.

In cases (c) and (d) above, WAS queries the browser's language setting; by default, zh, zh-cn and so on are all mapped to the Java encoding CP1381 (note: CP1381 is merely a codepage equivalent to GB2312, with no GBK support). I suppose this is because there is no way to tell whether the operating system the browser runs on supports GB2312 or GBK, so the smaller one is chosen. Real applications, however, still need GBK characters in their pages, the most famous being the character 镕 (rong2, 0xe946, \u9555) in Premier Zhu's name, so sometimes the encoding/charset still has to be set to GBK. Changing the default encoding in WAS is not as troublesome as described above: for (a) and (b), as in reference article 5, just add -Dfile.encoding=GBK to the Application Server's command-line arguments; for (d), add -Ddefault.client.encoding=GBK. If -Ddefault.client.encoding=GBK is set, then in case (c) the charset no longer needs to be specified.

Among the problems listed above there is also the one where static text inside the Java code in <%...%> and <%=...%> tags is not displayed correctly. The fix in WAS is, besides setting file.encoding correctly, to set -Duser.language=zh -Duser.region=CN in the same way. This is related to the Java locale settings.

4.4 Encoding problems when reading and writing the database

Another place where encoding problems frequently appear in JSP/Servlet programming is reading and writing data in the database.

Popular relational database systems all support a database encoding; that is, you can specify the database's own character set when creating it, and the data is stored in that encoding. When an application accesses the data, encoding conversions take place at both the entry and the exit. For Chinese data, the database character encoding should be chosen to keep the data intact. GB2312, GBK and UTF-8 are all suitable database encodings. You could also choose ISO8859-1 (8-bit), but then the application has to split each 16-bit Chinese or Unicode character into two 8-bit characters before writing, merge the two bytes again after reading, and also watch out for any SBCS characters among them. That fails to take advantage of the database encoding and only adds programming complexity, so ISO8859-1 is not a recommended database encoding. When programming JSP/Servlet, you can first use the database management tools to check whether the Chinese data stored there is correct.

Then pay attention to the encoding of the data you read back; in a Java program you generally get Unicode. Writing data is the reverse.

4.5 Useful tricks for locating the problem

Locating a Chinese encoding problem usually comes down to the dumbest and most effective method: printing the internal codes of the string after each processing step you suspect. By printing the string's internal codes you can see when Chinese characters were converted to Unicode, when Unicode was converted back to a Chinese encoding, when one Chinese character became two Unicode characters, when a Chinese string turned into a string of question marks, and when the high bits of a Chinese string were chopped off...

Choosing a good sample string also helps distinguish the type of problem, e.g. "aa啊aa丂aa", a string that mixes English and Chinese and contains characters characteristic of both GB and GBK. Generally speaking, English characters survive any conversion or processing without corruption (if they do get corrupted, try increasing the length of the run of consecutive English letters).

5. Closing words

JSP/Servlet Chinese encoding is really not as complicated as it seems. Although there is no fixed recipe for locating and solving problems, and runtime environments all differ, the principle behind them is the same. Knowing the character sets is the foundation for solving character problems. Still, as the Chinese character sets evolve, problems in Chinese text processing, and not only in Java programming, will be with us for a while yet.

6. References

Character Problem Review
Analysis and resolution of Chinese character problems in Java programming
GB18030
Setting language encoding in web applications: Websphere applications Server

Sunday, March 06, 2005

Zones, Message Queues and preliminary setup

fintanr's weblog: "Zones, Message Queues and preliminary setup on a benchmark "

This page describes Solaris Zones and SonicMQ.

Get in the Zone

Get in the Zone: "Get in the Zone with Solaris 10"

I think it is something like User Mode Linux, which lets you boot multiple Linux instances (Solaris instances, in the case of Zones) on a single machine.

Sunday, February 27, 2005

Internet Access via Bluetooth on Linux

About a month ago, I went to Sham Shui Po and found a Bluetooth adaptor that cost $1xx. So what I came up with was:
can I use it to build a wireless network?

Most people use 802.11[abg], but it seems you need to buy an adaptor for each computer in order to have wireless access, plus one router for the network. So the total cost is much higher than going the Bluetooth way.

Bluetooth seems to be one common connectivity option for home appliances. Imagine being able to connect to a TV set and control it remotely while you are working in your word processor. :)

P.S.
The drawbacks are:
- low link speed (only 721kbps for USB 1.1 Bluetooth 1.1)
- might support fewer than 10 Bluetooth devices
- [is the transmit power at the antenna high?]

This link inspires me to write this blog. :)
http://www.osnews.com/story.php?news_id=9834

for windows
http://www.whizoo.com/bt_setup/
Palm Quick Answers -- Internet Access via Bluetooth on Linux: "Internet Access via Bluetooth on Linux"

Wednesday, February 23, 2005

Tomcat 5 Chinese Encoding problem...



Working with Tomcat 5.0.19 in practice, we learned that, without modifying the Tomcat source code, data submitted by users through a form is always handled as ISO8859-1, and the programmer has to convert the strings to Big5 (Traditional Chinese) or GB2312/GBK (Simplified Chinese) himself. In our application we ran every request.getParameter("xx") through a toBig5String() conversion, so in theory no Chinese problem should appear at all; yet we still found that in certain situations the Chinese turned into mojibake!
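A sketch of what a toBig5String() helper like the one mentioned above might look like, re-interpreting the ISO8859-1-decoded parameter bytes as Big5; this is my own reconstruction, not the original project's code.

import java.io.UnsupportedEncodingException;

public class Big5Util {
    // Tomcat hands us parameters decoded as ISO-8859-1; re-interpret
    // the raw bytes as Big5 to recover the Traditional Chinese text.
    public static String toBig5String(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            return new String(raw.getBytes("ISO-8859-1"), "Big5");
        } catch (UnsupportedEncodingException e) {
            return raw;
        }
    }
}

Typical use would be something like String name = Big5Util.toBig5String(request.getParameter("name"));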

After some analysis we found that the problem lies in how the QueryString is parsed. Back in the Tomcat 4.x days, whether the submit used GET or POST, the Tomcat server decoded the parameters with the same encoding. In Tomcat 5.x, for whatever reason, QueryString parsing was split out on its own. We have confirmed that when a form uses the GET method, or when Chinese is written directly into the URL as parameters, the Chinese arriving at Tomcat turns into mojibake no matter how you transcode it, even if you URL-encoded it beforehand.

On the web, some people suggest working around this by base64-encoding all Chinese, with the server-side code base64-decoding it back, so that the Chinese survives. That certainly solves the problem, but then every page is restricted to POST, and the programmer has to keep track of which parameter came in via GET and which via POST and parse each accordingly. Such code has no portability at all, never mind working across platforms and languages.

After studying the Tomcat documentation and source code, we found where the problem is and how to solve it. Only by following the steps below will form-submitted data be handled entirely as ISO8859-1. Of course, if you follow the Tomcat documentation to the letter it still will not work; you still have to add this parameter to server.xml.

Solution

First study the documentation file $TOMCAT_HOME/webapps/tomcat-docs/config/http.html; the key points are excerpted below:
URIEncoding:This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.

useBodyEncodingForURI:This specifies if the encoding specified in contentType should be used for URI query parameters, instead of using the URIEncoding. This setting is present for compatibility with Tomcat 4.1.x, where the encoding specified in the contentType, or explicitely set using Request.setCharacterEncoding method was also used for the parameters from the URL. The default value is false.

These two Tomcat parameters are set in the HTTP connector block of server.xml. To solve the problem of QueryString Chinese turning into mojibake, you must set at least one of them.
Set URIEncoding to URIEncoding="ISO-8859-1", so that the QueryString's character encoding is the same as the POST body's.
useBodyEncodingForURI exists for compatibility with Tomcat 4.x; it takes "true" or "false" and means "should the QueryString use the same character encoding as the POST body?". Setting it to true also satisfies the ISO-8859-1 requirement.
We recommend the URIEncoding setting, since useBodyEncodingForURI exists only for Tomcat 4.x compatibility. But according to the documentation, even with neither parameter set, Tomcat should fall back to ISO-8859-1, so why is there still a problem? A look at the Tomcat source code makes it clear.

// This is the code Tomcat uses to decode the QueryString,
// in the org.apache.tomcat.util.http.Parameters class.
private String urlDecode(ByteChunk bc, String enc)
        throws IOException {
    if (urlDec == null) {
        urlDec = new UDecoder();
    }
    urlDec.convert(bc);
    String result = null;
    if (enc != null) {
        bc.setEncoding(enc);
        result = bc.toString();
    }
    else {
        CharChunk cc = tmpNameC;
        cc.allocate(bc.getLength(), -1);
        // Default encoding: fast conversion
        byte[] bbuf = bc.getBuffer();
        char[] cbuf = cc.getBuffer();
        int start = bc.getStart();
        for (int i = 0; i < bc.getLength(); i++) {
            cbuf[i] = (char) (bbuf[i + start] & 0xff);
        }
        cc.setChars(cbuf, 0, bc.getLength());
        result = cc.toString();
        cc.recycle();
    }
    return result;
}


Pay special attention to the else branch (highlighted in red in the original post): when Tomcat finds that no encoding has been set for the QueryString, it does not default to ISO-8859-1 as the documentation says; instead it runs that "fast conversion", which is what breaks the Chinese. So you really do have to add the URIEncoding parameter to server.xml.

Example Connector configuration:

debug="0"
acceptCount="100"
connectionTimeout="20000"
disableUploadTimeout="true"
port="80"
redirectPort="8443"
enableLookups="false"
minSpareThreads="25"
maxSpareThreads="75"
maxThreads="150"
maxPostSize="0"
URIEncoding="ISO-8859-1"
>


Monday, February 21, 2005

Solaris Ethernet Drivers for ADMtek & Macronix based chips

Solaris Ethernet Drivers - Main: "Solaris Ethernet Drivers"

I found that Solaris 10 x86 is not detecting my NIC, so this URL might be helpful.

Thursday, February 17, 2005

SCO Group is delisted from NASDAQ!

"You have the day!" I would say to SCO.


LWN: Welcome to LWN.net: "SCO Group to be delisted
[Commerce] Posted Feb 17, 2005 14:19 UTC (Thu) by corbet

The SCO Group has put out a press release informing the world that it is being kicked out of the NASDAQ market for failure to comply with the reporting requirements. SCO is appealing the decision. 'The Company has been unable to file its Form 10-K for the fiscal year ended October 31, 2004 because it continues to examine certain matters related to the issuance of shares of the Company's common stock pursuant to its equity compensation plans. The Company is working to resolve these matters as soon as possible and expects to file its Form 10-K upon completion of its analysis.'

Tuesday, February 15, 2005

Java Forums - Can I serialize Message objects?

Java Forums - Can I serialize Message objects?: "Re: Can I serialize Message objects?
Author: agnesjuhasz Apr 8, 2002 2:30 AM (reply 4 of 4)
Hi Avanish,

I could resolve the serialization by this way

// on the client side
MimeMessage mimemessage = new MimeMessage((javax.mail.Session)null);
// do what you need
...
// put the content of mimemessage into encoded String what is Serializable
ByteArrayOutputStream baos = new ByteArrayOutputStream();
mimemessage.writeTo(baos);
byte[] bytearray = baos.toByteArray();
Base64Encoder encoder = new Base64Encoder(); // e.g. sun.misc.BASE64Encoder or an equivalent Base64 utility
String base64encodedmessage = encoder.encode(bytearray);

// On the server side
// decode the received string
Base64Decoder decoder = new Base64Decoder();
byte[] bytearray = decoder.decodeBuffer(base64encodedmessage );
ByteArrayInputStream bais = new ByteArrayInputStream(bytearray);

Properties mailprops = new Properties();
mailprops.setProperty("mail.from", sender);
Session session = Session.getInstance(mailprops,null);
session.setDebug(debug);

MimeMessage mimemessage = new MimeMessage(session,bais);

Hope this helps.
Agnes"

Monday, February 07, 2005

Inserting some text to TextArea in Internet Explorer...

// input is the TextArea object, insText is the string to insert
input.focus();
var oSel=document.selection;
if (oSel && oSel.createRange){
oSel.createRange().duplicate().text = insText;
}


For Gecko-based browsers:
var len = input.selectionEnd;
input.value = input.value.substr( 0, len ) + insText + input.value.substr(len);
input.setSelectionRange(len+insText.length,len+insText.length);

Saturday, February 05, 2005

howto use "update-alternatives"...

update-alternatives --verbose --install /usr/bin/java java /usr/local/jdk/bin/java 500 --slave /usr/share/man/man1/java.1 java.1 /usr/local/jdk/man/man1/java.1


This adds java (and its man page) as an alternative.

Spam Laws

Here is some information about the Spam laws in the world. So far, Hong Kong has no such regulations.

Spam Laws: "Spam Laws"

Friday, February 04, 2005

Java Forums - Why no font.properties.zh_HK?

Here is a follow up about the HK font stuff in jdk 1.4
Java Forums - Why no font.properties.zh_HK?: "Java Forums - Why no font.properties.zh_HK?"

Java Forums - 1999 or 2001 version of HKSCS

I am looking for information on the conversion of the Hong Kong character set in Java. I tried to import a CSV file with strings in Big5 encoding, and I found that some of the characters cannot be converted.

Java Forums - 1999 or 2001 version of HKSCS: "Java Forums - 1999 or 2001 version of HKSCS"

Just a simple idea....

- I am thinking of writing a classifier to classify the gender of a given Chinese name.

Thursday, February 03, 2005

Linux: TCP Random Initial Sequence Numbers

Linux: TCP Random Initial Sequence Numbers: "Linux: TCP Random Initial Sequence Numbers"


The following is copied from kerneltrap (the source above). The article is quite well written and explains the implementation issues of TCP sequence numbers in Linux.

From: linux AT horizon.com
Subject: Re: [PATCH] OpenBSD Networking-related randomization port
Date: 29 Jan 2005 07:24:29 -0000

> It adds support for advanced networking-related randomization, in
> concrete it adds support for TCP ISNs randomization

Er... did you read the existing Linux TCP ISN generation code?
Which is quite thoroughly randomized already?

I'm not sure how the OpenBSD code is better in any way. (Notice that it
uses the same "half_md4_transform" as Linux; you just added another copy.)
Is there a design note on how the design was chosen?

I don't wish to be *too* discouraging to someone who's *trying* to help,
but could you *please* check a little more carefully in future to
make sure it's actually an improvement?

I fear there's some ignorance of what the TCP ISN does, why it's chosen
the way it is, and what the current Linux algorithm is designed to do.
So here's a summary of what's going on. But even as a summary, it's
pretty long...

First, a little background on the selection of the TCP ISN...

TCP is designed to work in an environment where packets are delayed.
If a packet is delayed enough, TCP will retransmit it. If one of
the copies floats around the Internet for long enough and then arrives
long after it is expected, this is a "delayed duplicate".

TCP connections are between (host, port, host, port) quadruples, and
packets that don't match some "current connection" in all four fields
will have no effect on the current connection. This is why systems try
to avoid re-using source port numbers when making connections to
well-known destination ports.

However, sometimes the source port number is explicitly specified and
must be reused. The problem then arises, how do we avoid having any
possible delayed packets from the previous use of this address pair show
up during the current connection and confuse the heck out of things by
acknowledging data that was never received, or shutting down a connection
that's supposed to stay open, or something like that?

First of all, protocols assume a maximum packet lifetime in the Internet.
The "Maximum Segment Lifetime" was originally specified as 120 seconds,
but many implementations optimize this to 60 or 30 seconds. The longest
time that a response can be delayed is 2*MSL - one delay for the packet
eliciting the response, and another for the response.

In truth, there are few really-hard guarantees on how long a packet can
be delayed. IP does have a TTL field, and a requirement that a packet's
TTL field be decremented for each hop between routers *or each second of
delay within a router*, but that latter portion isn't widely implemented.
Still, it is an identified design goal, and is pretty reliable in
practice.

The solution is twofold: First, refuse to accept packets whose
acks aren't in the current transmission window. That is, if the
last ack I got was for byte 1000, and I have sent 1100 bytes
(numbers 0 through 1099), then if the incoming packet's ack isn't
somewhere between 1000 and 1100, it's not relevant. If it's
950, it might be an old ack from the current connection (which
doesn't include anything interesting), but in any case it can be
safely ignored, and should be.

The only remaining issue is, how to choose the first sequence number
to use in a connection, the Initial Sequence Number (ISN)?

If you start every connection at zero, then you have the risk that
packets from an old connection between the same endpoints will
show up at a bad time, with in-range sequence numbers, and confuse
the current connection.

So what you do is, start at a sequence number higher than the
last one used in the old connection. Then there can't be any
confusion. But this requires remembering the last sequence number
used on every connection ever. And there are at least 2^48 addresses
allowed to connect to each port on the local machine. At 4 bytes
per sequence number, that's a Petabyte of storage...

Well, first of all, after 2*MSL, you can forget about it and use
whatever sequence number you like, because you know that there won't
be any old packets floating around to crash the party.

But still, it can be quite a burden on a busy web server. And you might
crash and lose all your notes. Do you want to have to wait 2*MSL before
rebooting?

So the TCP designers (I'm now on page 27 of RFC 793, if you want to follow
along) specified a time of day based ISN. If you use a clock to generate
an ISN which counts up faster than your network connection can send
data (and thus crank up its sequence numbers), you can be sure that your
ISN is always higher than the last one used by an old connection without
having to remember it explicitly.

RFC 793 specifies a 250,000 bytes/second counting rate. Most
implementations since Ethernet used a 1,000,000 byte/second counting
rate, which matches the capabilities of 10base5 and 10base2 quite well,
and is easy to get from the gettimeofday() call.

Note that there are two risks with this. First, if the connection actually
manages to go faster than the ISN clock, the next connection's ISN will
be in the middle of the space the earlier connection used. Fortunately,
the kind of links where significant routing delay appear are generally
slower ones where 1 Mbyte/sec is a not too unreasonable limit. Your
gigabit LAN isn't going to be delaying packets by seconds.

The second is that a connection will be made and do nothing for 4294
seconds until the ISN clock is about to wrap around and then start
doing packets "ahead of" the ISN clock. If it then closes the connection
and a new one opens, once again you have sequence number overlap.

If you read old networking papers, there were a bunch of proposals for
occasional sequence number renegotiation to solve this problem, but in the
end people decided to not worry about it, and it hasn't been a problem
in practice.

Anyway... fast forward out of the peace and love decade and welcome to
the modern Internet, with people *trying* to mess up TCP connections.
This kind of attack from within was, unfortunately, not one of the
scenarios that the initial Internet designers considered, and it's
been a bit of a problem since.

All this careful worry about packets left over from an old connection
randomly showing up and messing things up, when we have people *creating*
packets deliberately crafted to mess things up! A whole separate problem.
In particular, using the simple timer-based algorithm, I can connect to
a server, look at the ISN it offers me, and know that thats the same
ISN it's offering to other people connecting at the same time. So I
can create packets with a forged source address and approximately-valid
sequence numbers and bombard the connection with them, cutting off that
server's connection to some third party. Even if I can't see any of
the traffic on the connection.

So people sat down and did some thinking. How to deal with this?
Harder yet, how to deal with this without redesigning TCP from scratch?

Well, if a person wants to mess up their *own* connections, we can't
stop them. The fundamental problem is that an attacker A can figure
out the sequence numbers that machines B and C are using to talk to
each other. So we'd like to make the sequence numbers for every
connection unique and not related to the sequence numbers used on any
other connections. So A can talk to B and A can talk to C and still not
be able to figure out the sequence numbers that B and C are using between
themselves.

Fortunately, it is entirely possible to combine this with the clock-based
algorithm and get the best of both worlds! All we need is a random offset,
unique for every (address, port, address, port) quadruple, to add to
the clock value, and we have all of the clock-based guarantees preserved
within every pair of endpoints, but unrelated endpoints have their ISNs
at some unpredictable offset relative to each other.

And for generating such a random offset, we can use cryptography.
Keep a single secret key, and hash together the endpoint addresses,
and you can generate a random 32-bit ISN offset. Add that to the
current time, and everything is golden. A can connect to B and
see an ISN, but would need to do some serious cryptanalysis to
figure out what ISN B is using to talk to C.

Linux actually adds one refinement. For speed, it uses a very
stripped-down cryptographic hash function. To guard against that
being broken, it generates a new secret every 5 minutes. So an
attacker only has 5 minutes to break it.

The cryptographic offset is divided into 2 parts. The high 8 bits are
a sequence number, incremented every time the secret is regenerated.
The low 24 bits are produced by the hash. So 5 minutes after booting,
the secret offset changes from 0x00yyyyyy to 0x01zzzzzz. This is at
least +1, and at most +0x1ffffff. On average, the count is going up
by 2^24 = 16 million every 300 seconds. Which just adds a bit to the
basic "1 million per second" ISN timer.

The cost is that the per-connection part of the ISN offset is limited
to 24 and not 32 bits, but a cryptanalytic attack is pretty much
precluded by the every-5-minutes thing. The rekey time and the number of
really-unpredictable bits have to add up to not wrapping the ISN space
too fast. (Although the 24 bits could be increased to 28 bits without
quite doubling the ISN counting speed. Or 27 bits if you want plenty
of margin. Could I suggest that as an improvement?)

--- drivers/char/random.c 2004-12-04 09:24:19.000000000 +0000
+++ drivers/char/random.c 2005-01-29 07:20:37.000000000 +0000
@@ -2183,26 +2183,26 @@
#define REKEY_INTERVAL (300*HZ)
/*
* Bit layout of the tcp sequence numbers (before adding current time):
- * bit 24-31: increased after every key exchange
- * bit 0-23: hash(source,dest)
+ * bit 27-31: increased after every key exchange
+ * bit 0-26: hash(source,dest)
*
* The implementation is similar to the algorithm described
* in the Appendix of RFC 1185, except that
* - it uses a 1 MHz clock instead of a 250 kHz clock
* - it performs a rekey every 5 minutes, which is equivalent
- * to a (source,dest) tulple dependent forward jump of the
+ * to a (source,dest) tuple dependent forward jump of the
* clock by 0..2^(HASH_BITS+1)
*
- * Thus the average ISN wraparound time is 68 minutes instead of
- * 4.55 hours.
+ * Thus the average ISN wraparound time is 49 minutes instead of
+ * 4.77 hours.
*
* SMP cleanup and lock avoidance with poor man's RCU.
* Manfred Spraul [email blocked]
*
*/
-#define COUNT_BITS 8
+#define COUNT_BITS 5
#define COUNT_MASK ( (1 << COUNT_BITS) - 1)
-#define HASH_BITS 24
+#define HASH_BITS 27
#define HASH_MASK ( (1 << HASH_BITS) - 1)
static struct keydata {

Anyway, in comparison, the algorithm in your patch (and presumably
OpenBSD, although I haven't personally compared it) uses a clock
offset generated fresh for each connection. There's a global counter
(tcp_rndiss_cnt; I notice you don't have any SMP locking on it) which
is incremented every time an ISN is needed. It's rekeyed periodically,
and the high bit (tcp_rndiss_msb) of the delta is used like the COUNT_BITS
in the Linux algorithm.

The ISN is generated as the high sequence bit, then 15 bits of encrypted
count (with some homebrew cipher I don't recognize), then a constant
zero bit (am I reading that right), then the 15 low-order bits are
purely random.

It's a slightly different algorithm, but it does a very similar function.
The main downsides are that the sequence number can easily go backwards
(there's no guarantee that consecutive calls will return increasing
numbers since tcp_rndiss_encrypt scrambles the high 15 bits), and
that it's not SMP-safe. Two processors could read and use the same
tcp_rndiss_cnt value at the same time, or (more likely) both call
tcp_rndiss_init at the same time and end up toggling tcp_rndiss_msb twice,
thereby destroying the no-rollback property it's trying to achieve.

Oh, and the single sequence bit in the offsets means that the
TCP ISN will wrap around very fast. Every 10 minutes, or every
60000 TCP connections, whichever comes first.

Regarding the first issue, it's possible that the OpenBSD network stack
takes care to remember all connections for 2*MSL and continues the
sequence number if the endpoints are reused, thereby avoiding a call to
ip_randomisn entirely.

But the second deserves some attention. The Linux code takes some care
to avoid having to lock anything in the common case. The global count
makes that difficult.

Wednesday, February 02, 2005

[Debian] Linux Journal: compiling javahl for subclipse

[Debian] Linux Journal: compiling javahl for subclipse:

$ apt-get install libtool autoconf g libapr0-dev libneon24-dev
$ tar zxvf subversion-1.1.1.tar.gz
$ cd subversion-1.1.1
$ ./autogen.sh
$ export JAVA_HOME=/usr/lib/j2se/1.4/
$ ./configure --enable-javahl --with-jdk=$JAVA_HOME --with-jikes=$JAVA_HOME/bin/javac
$ make
$ mkdir subversion/bindings/java/javahl/classes # have problems finding DirEntry.class otherwise
$ make javahl
$ make install-javahl

Web: Page Containing Non-Secure Item?

Page Containing Non-Secure Item?: "Page Containing Non-Secure Item"

If you find some warning of secure / non-secure stuff, please see this.

Tuesday, February 01, 2005

"Difference between Workflow and Rule engines"

: "Difference between Workflow and Rule engines

There are some blurry lines here. My quick answer is:

Workflow - typically a flow of information and actions associated with people. This term became popular 15-20 years ago with things like expense report approvals, or imaging systems that would route things like credit card receipts and problems to different people to take action on rather than sending paper around with a sign-off sheet on top.

BPM - Business Process Management. This term became popular over the past 5-10 years and typically refers to a combination of people and system oriented processes. So say someone approves an expense report, but then it kicks off a series of actions in several systems like payroll and general ledger.

BPM Engine - keeps track of the state of various items as they pass through a graph of tasks and actions. Think of drawing a typical flow chart where the BPM engine keeps track of each item flowing through the flow chart. Go to http://www.jbpm.org for some good overview docs and pointers.

Rules Engine - allows for complex set of rules to be applied to a complex set of data to make decisions. From the http://www.drools.org web site: 'Rule Engines and expert systems are a great way to collect complex decision-making logic and work with data sets too large for humans to effectively use. Rule engines can make decisions based on hundreds of thousands of facts, quickly, reliably and repeatedly.'"

Saturday, January 22, 2005

dv1004ap XF86Config-4

This is the first article about making Linux work on my notebook. Reference


/etc/X11/XF86Config-4 - These are the changes from the Ubuntu-generated default. Generally I've just added a few lines, but a few were changed. You can also view the full XF86Config-4.

Section "Device"
Option "XaaNoOffscreenPixmaps"
EndSection

Section "Monitor"
Modeline "1280x768" 80.14 1280 1344 1480 1680 768 769 772 795
EndSection

Section "Screen"
DefaultDepth 16
SubSection "Display"
Depth 16
Modes "1280x768" "1024x768" "800x600" "640x480"
EndSubSection
SubSection "Display"
Depth 24
Modes "1280x768" "1024x768" "800x600" "640x480"
EndSubSection
EndSection