Revision as of 05:29, 8 August 2012

The PCA Utilities package provides small software routines for plotting PCA/OPLS scores and building dendrograms based on those scores. This page outlines how to install and use the pca-utils software.

Introduction

Obtaining pca-utils

You can obtain the source code to pca-utils by clicking here.

Installing pca-utils

The PCA utilities are a set of open-source command-line programs for UNIX/Linux. The software is highly portable: provided your distribution has glibc, it should compile without incident. Once you have the source code, run these commands to install it:

cd /path/to/source/tarball
tar xf pca-utils-YYYYMMDD.tar.gz
cd pca-utils-YYYYMMDD/
make
sudo make install

By default, the programs install to /usr/bin, but you can easily change this by modifying the Makefile if you need to.

Plotting scores with ellipses

Example scores plot generated by pca-ellipses

For an input list file called list.txt, you can quickly generate a PostScript plot file (in this case called plot.ps) that shows your PCA scores with 95% confidence ellipses around each group:

pca-ellipses -1 44.4 -2 22.2 -i list.txt -k -o plot.ps

In the command above, the optional arguments -1 and -2 set the contributions of PC1 and PC2 to 44.4% and 22.2%, respectively. You can then edit plot.ps to your liking. If you need a bit more control over your output, you can generate gnuplot-readable ellipses instead, like so:

pca-ellipses -i list.txt > ellipses.txt
awk -F '\t' '/^[0-9]/{print$3,$4}' list.txt > points.txt
gnuplot> plot 'points.txt' w p, 'ellipses.txt' w l

Of course, in the second case, you're free to style everything any way you like. Happy hacking!
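To see what the awk filter in the second approach actually keeps, here is a minimal sketch run against a tiny made-up list file. The column layout (observation number, group label, PC1, PC2, separated by tabs, with a header row) is an assumption inferred from the filter itself, which keeps only rows starting with a digit and prints columns 3 and 4. Check it against your own list.txt before relying on it.

```shell
# Hypothetical list file: the layout below is an assumption, not the
# documented pca-utils format. Data rows start with a digit; the header does not.
workdir=$(mktemp -d)
printf 'Obs\tGroup\tPC1\tPC2\n'  > "$workdir/list.txt"
printf '1\tJohn\t-2.10\t1.95\n' >> "$workdir/list.txt"
printf '2\tJohn\t-1.80\t2.20\n' >> "$workdir/list.txt"
printf '3\tPaul\t-2.30\t2.05\n' >> "$workdir/list.txt"

# Same filter as above: skip the header, keep the two score columns.
awk -F '\t' '/^[0-9]/{print $3, $4}' "$workdir/list.txt" > "$workdir/points.txt"
cat "$workdir/points.txt"
```

The header row is dropped because it does not start with a digit, so points.txt holds one space-separated x/y pair per observation, which is the form gnuplot expects.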

Generating dendrograms

Example postscript tree generated by pca-bootstrap

Two complementary methods exist for generating trees. The first uses Euclidean distances and bootstrapping statistics, while the second uses Mahalanobis distances and p-values. For datasets containing well-separated groups in scores space, the bootstrapping method will do fine. However, separation between highly overlapped groups may be better quantified with p-values in many cases.
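For reference, these are the textbook definitions of the two distance measures. The symbols are generic (a and b are points in scores space, and μ and Σ are a group's mean and covariance matrix). How pca-utils estimates these quantities internally is not covered on this page.

```latex
d_E(\mathbf{a}, \mathbf{b}) = \sqrt{(\mathbf{a}-\mathbf{b})^{\mathsf{T}} (\mathbf{a}-\mathbf{b})}
\qquad
d_M(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
  \sqrt{(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}
```

The Mahalanobis distance rescales by the group's covariance, so it reduces to the Euclidean distance when Σ is the identity matrix.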

Using bootstrapping

To build a simple tree that displays to the console for quick checks, just run something like this:

pca-bootstrap -i list.txt

The default number of bootstrap iterations is 100, but pca-bootstrap can easily handle more. You can set the number of iterations to, say, 1000 like so:

pca-bootstrap -i list.txt -n 1000

Everything looks good? You can save a postscript file using the -o flag:

pca-bootstrap -i list.txt -n 1000 -k -o tree.ps
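If the term is unfamiliar: bootstrapping means resampling your data with replacement many times and seeing how often a result (here, a branch of the tree) shows up again. The sketch below illustrates a single resampling step only. It is not taken from pca-bootstrap, the resample function is a hypothetical helper, and it assumes data rows in the list file start with a digit, as the awk filter in the ellipses section implies.

```shell
# resample LISTFILE SEED
# Hypothetical helper: draws as many data rows as the file contains,
# with replacement, so some rows repeat and others are left out.
resample() {
  awk -v seed="$2" '
    /^[0-9]/ { row[++n] = $0 }        # remember data rows, skip any header
    END {
      srand(seed)                     # fixed seed makes the draw repeatable
      for (i = 1; i <= n; i++) {
        k = int(rand() * n) + 1       # uniform index in 1..n
        print row[k]
      }
    }' "$1"
}
```

Running something like resample list.txt 42 prints one resampled copy of the data rows. A bootstrap analysis repeats such draws once per iteration, which is what the -n option controls.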

Using p-values

To build a simple tree that displays to the console for quick checks, just run something like this:

pca-dendrogram -i list.txt

Everything looks good? You can save a postscript file using the -o flag:

pca-dendrogram -i list.txt -k -o tree.ps

If everything works, the plots generated from these methods should look something like this:

+-----------------------------------------------------------George
|3.8e-11
|                      +-------------------------------------Ringo
+----------------------|1.3e-08
                       |                             +-----John
                       +-----------------------------|0.45
                                                     +------Paul

pca-bootstrap labels the nodes with bootstrap values between 0 and 100, while pca-dendrogram labels them with p-values between 0 and 1.

Calculating p-values

If you just need the p-values themselves, for example to decide whether the null hypothesis can be rejected, you can use this command:

pca-overlap -i list.txt

Calculating basic statistics

If you're interested in basic information about each group, such as its mean and covariance, you can use pca-stats:

pca-stats -i list.txt

Goodness, that was easy, wasn't it!?

Generating random datasets

Mainly provided for entertainment value and development/debugging, pca-rand lets you generate list files that contain bivariate normally distributed point sets. Here's an example command (in the bash scripting language) to build a faux list file:

(pca-rand -H -L John -u '(-2,2)' -v '(2,0.6)' -r 45 -n 10;
 pca-rand -L Paul -u '(-2,2)' -v '(2,0.6)' -r 135 -n 9;
 pca-rand -L George -u '(3,-2)' -v '(4,3)' -r 120 -n 13;
 pca-rand -L Ringo -u '(-1,-1)' -v '(2,2)' -r 0 -n 15) > list.txt

No, you read that right. You can download the list file generated by a single run of this command here: File:Beatles.txt.
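If you're curious what "bivariate normally distributed" means in practice, here is a rough awk sketch of the idea. It is not the pca-rand implementation. The gen_points function is a hypothetical helper whose arguments loosely mirror -L, -u, -v, -r and -n. It treats the -v pair as variances along the two axes before rotation, which is an assumption, and its four-column output layout is assumed as well.

```shell
# gen_points LABEL MEAN_X MEAN_Y VAR_X VAR_Y ROT_DEG COUNT SEED
# Hypothetical helper, for illustration only.
gen_points() {
  awk -v label="$1" -v ux="$2" -v uy="$3" -v vx="$4" -v vy="$5" \
      -v rot="$6" -v n="$7" -v seed="$8" 'BEGIN {
    srand(seed)
    pi = atan2(0, -1)
    theta = rot * pi / 180                 # rotation angle in radians
    for (i = 1; i <= n; i++) {
      # Box-Muller transform: two uniform draws -> two standard normals
      u1 = 1 - rand()                      # in (0, 1], keeps log() finite
      u2 = rand()
      r  = sqrt(-2 * log(u1))
      z1 = r * cos(2 * pi * u2)
      z2 = r * sin(2 * pi * u2)
      # scale each axis by its standard deviation
      a = sqrt(vx) * z1
      b = sqrt(vy) * z2
      # rotate by theta, then shift to the group mean
      x = ux + a * cos(theta) - b * sin(theta)
      y = uy + a * sin(theta) + b * cos(theta)
      printf "%d\t%s\t%.4f\t%.4f\n", i, label, x, y
    }
  }'
}

gen_points John -2 2 2 0.6 45 10 1
```

Each call prints one tab-separated row per point. The fixed seed only makes a run repeatable; use a different seed for each group, or every group gets the same underlying draws.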

"Wait, I'm still confused"

Remember that every command in the pca-utils package has a help message. Just run the command that you need information on with the --help flag to get a nice message on how to use that command.

