Category: Statistics


Through an email from xiao Gen, I learned about a nice IDE for R, RStudio. IDE is short for integrated development environment, which in my eyes is just a cooler way to say editor.

I just downloaded it and started trying it out. It works well and looks nice so far, so I will join xiao Gen in recommending it.

http://www.rstudio.org


Tutorial for Propensity Score

These days I am learning about propensity scores, a method widely used in observational studies to reduce the estimation bias that arises when the comparison is not a controlled (randomized) experiment. I find the paper
"PROPENSITY SCORE METHODS FOR BIAS REDUCTION IN THE COMPARISON OF A TREATMENT TO A NON-RANDOMIZED CONTROL GROUP"

very useful as a tutorial on propensity scores. Basically, it describes the motivation for using propensity scores in observational studies and discusses some common propensity score methods, such as matching, stratification, and regression adjustment. It also provides real examples for each part.
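
To make the stratification idea concrete for myself, here is a minimal sketch in Python (it is not taken from the paper). It assumes the propensity scores have already been estimated, for example by a logistic regression of treatment on the covariates. The function name, the record layout and the toy numbers are all hypothetical and only serve as an illustration.

```python
from statistics import mean


def stratified_effect(records, n_strata=5):
    """Estimate a treatment effect by stratifying on the propensity score.

    records: list of (propensity_score, treated, outcome) tuples, where
    treated is 1 for the treatment group and 0 for the control group.
    Returns the size-weighted average of the within-stratum differences
    in mean outcome (treated minus control).
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    total_weight = 0
    weighted_sum = 0.0
    for k in range(n_strata):
        # Cut the sorted records into n_strata nearly equal-sized groups.
        stratum = ordered[k * n // n_strata:(k + 1) * n // n_strata]
        treated = [y for _, t, y in stratum if t == 1]
        control = [y for _, t, y in stratum if t == 0]
        if not treated or not control:
            # No comparison is possible in this stratum, so skip it.
            continue
        weighted_sum += len(stratum) * (mean(treated) - mean(control))
        total_weight += len(stratum)
    if total_weight == 0:
        raise ValueError("no stratum contains both treated and control units")
    return weighted_sum / total_weight


# Made-up data: treated units mostly have high scores AND high outcomes,
# so a naive comparison of group means overstates the effect.
toy = [
    (0.10, 0, 1.0), (0.15, 0, 1.0), (0.20, 0, 1.0), (0.25, 1, 3.0),
    (0.75, 0, 5.0), (0.80, 1, 7.0), (0.85, 1, 7.0), (0.90, 1, 7.0),
]
naive = (mean(y for _, t, y in toy if t == 1)
         - mean(y for _, t, y in toy if t == 0))
print(naive)                               # 4.0
print(stratified_effect(toy, n_strata=2))  # 2.0
```

In this toy example the naive difference in means is 4.0, while comparing treated and control units only within strata of similar propensity score gives 2.0. That is the kind of bias reduction the paper is about; matching and regression adjustment pursue the same goal by other means.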

Studying this reminds me of the days when I did consulting under John. I really want to take that course again and ask him many questions… why did I ask so few questions back then?

I thought R should be enough for statistical consulting before I met SAS.
I thought R + SAS should be enough for real practice before I met real real real clients ….
I never thought I would use VBA in Excel, but the thing is that many people in industry do not know R or SAS, or even how to open them. So here comes Excel, a program that is popular in industry.

One thing I like about Excel is that it runs very fast on Windows and makes it easy to see the result of each step. I often see people build nice buttons or easy-to-use worksheets in Excel that automatically generate the result a client wants. I want to be able to make one myself, so I'd like to learn a bit of VBA. I hope it is not very hard ~~

Here is a website I found that looks useful for VBA in Excel: http://www.excel-vba.com/

A small test of the doSNOW package

Thanks to encouragement from Sky, I went on to run a simple test of the doSNOW package, which is said to be useful for parallel computing (although the meaning of "parallel" is still not clear to me…).

My PC has two cores. Following the simple code from the article:

############################
library(doSNOW)             # also loads foreach and snow
cl <- makeCluster(2)        # I have two cores
registerDoSNOW(cl)

# create a function to run in each iteration of the loop
check <- function(n) {
  for (i in 1:100) {
    # matrix() keeps only the first 100 of the 1000 draws
    sme <- matrix(rnorm(1000), 10, 10)
    a <- solve(sme)
  }
}

times <- 100                # times to run the loop

# parallel version
system.time(x <- foreach(j = 1:times) %dopar% check(j))
#   user  system elapsed
#   0.07    0.01    1.97

# sequential version
system.time(for (j in 1:times) x <- check(j))
#   user  system elapsed
#   3.62    0.00    3.63

stopCluster(cl)             # shut down the worker processes
############################

I ran this code several times and found that the time saved by "foreach" depends on how many cores you have. With n cores, the elapsed time is roughly 1/n of what a plain "for" loop takes (a little more in practice, as the timings show).

By the way, it was weird to find that doSNOW did not work in my Linux environment. I am not sure what the problem is, so I used a similar package called "doMC" to see how much faster things get with 4 cores.

#####################################
library(doMC)
registerDoMC()              # no cluster object needed, unlike doSNOW
times <- 1000               # a larger number of iterations

# parallel version; check() is the same as before
system.time(x <- foreach(j = 1:times) %dopar% check(j))
#    user  system elapsed
# 345.391 344.743 124.229

# sequential version
system.time(for (j in 1:times) x <- check(j))
#    user  system elapsed
# 459.271   1.165 460.698
#####################################

So, although I have not yet found much that doSNOW or doMC can do for large datasets, "foreach" could at least be one way to make my code run faster when I do large simulations in the future.

PS: If I remember right, our department has an 8-core server, which means we could save a lot of time using foreach.
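
To check the "1/n of the time" rule of thumb against the elapsed times reported above, here is a small arithmetic sketch in Python. The helper names are my own, and it assumes the doMC run really used 4 workers (registerDoMC() was called without an explicit core count).

```python
def speedup(serial_elapsed, parallel_elapsed):
    """How many times faster the parallel run finished."""
    return serial_elapsed / parallel_elapsed


def efficiency(serial_elapsed, parallel_elapsed, cores):
    """Fraction of the ideal n-fold speedup that was achieved."""
    return speedup(serial_elapsed, parallel_elapsed) / cores


# (label, elapsed seconds with "for", elapsed seconds with "foreach", cores)
runs = [
    ("doSNOW, 2 cores", 3.63, 1.97, 2),
    ("doMC, 4 cores", 460.698, 124.229, 4),
]
for label, serial, parallel, cores in runs:
    print(f"{label}: speedup {speedup(serial, parallel):.2f}x, "
          f"efficiency {efficiency(serial, parallel, cores):.0%}")
# doSNOW, 2 cores: speedup 1.84x, efficiency 92%
# doMC, 4 cores: speedup 3.71x, efficiency 93%
```

So both runs come close to the ideal n-fold speedup without quite reaching it, which is what one would expect once the cost of starting workers and passing results around is included.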

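A few days ago I tried an R package for handling large loops, and it did not seem to work very well… Later I learned that the earlier package apparently only works well under Linux, not under Windows. Today I happened to come across this article, which also covers parallel computing in R, and the author even gives a small example. I tried it and it really works ~~ so I cannot help sharing it. Next I will see whether this package works well for big data. If it really does… SAS can take a rest in many situations…

Article link: http://decisionstats.wordpress.com/2010/09/24/parallel-programming-using-r-in-windows/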

How Google and Facebook are using R

Good PROC SQL, BAD PROC SQL

Just leaving a note here. PROC SQL is really powerful and can make data manipulation much easier and faster. However, sometimes the traditional DATA step can do things that PROC SQL cannot.

When I get some free time, I am thinking of writing something on GOOD PROC SQL and BAD PROC SQL with examples.

Anyway, today I appreciated how well PROC SQL handles many-to-many matches.

A good article to read: http://www.lexjansen.com/pharmasug/2008/cc/cc07.pdf
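
As a reminder of what a many-to-many match means in SQL, here is a small sketch. It is not SAS code: it uses Python's built-in sqlite3 module, on the assumption that the join semantics carry over to PROC SQL since both follow standard SQL. The table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (patient_id INTEGER, visit_date TEXT);
    CREATE TABLE drugs  (patient_id INTEGER, drug TEXT);
    INSERT INTO visits VALUES (1, '2010-01-05'), (1, '2010-02-09'), (2, '2010-01-20');
    INSERT INTO drugs  VALUES (1, 'A'), (1, 'B'), (2, 'A');
""")

# patient_id repeats in BOTH tables, so the join pairs every visit
# of a patient with every drug of the same patient.
rows = conn.execute("""
    SELECT v.patient_id, v.visit_date, d.drug
    FROM visits AS v
    INNER JOIN drugs AS d ON v.patient_id = d.patient_id
    ORDER BY v.patient_id, v.visit_date, d.drug
""").fetchall()
conn.close()

for row in rows:
    print(row)
# (1, '2010-01-05', 'A')
# (1, '2010-01-05', 'B')
# (1, '2010-02-09', 'A')
# (1, '2010-02-09', 'B')
# (2, '2010-01-20', 'A')
```

Patient 1 has two visits and two drugs, so the join returns all 2 x 2 = 4 combinations. A DATA step MERGE with a BY statement does not build these combinations when the key repeats in both datasets, which is why the SQL join is so convenient here.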

NOW, USE PROC SQL in SAS

Now use PROC SQL instead of DATA and PROC!
Actually, my idea of learning more about SQL was motivated by a conversation with Lippo last night when we were going to the supermarket. Today, with the help of several references, I know a little more about SQL and why we need it.

"PROC SQL can not only retrieve information without having to learn SAS syntax, but it can often do this with fewer and shorter statements than traditional SAS code. Additionally, SQL often uses fewer resources than conventional DATA and PROC steps. Further, the knowledge learned is transferable to other SQL packages." – AN INTRODUCTION TO PROC SQL by Katie Minten Ronk, Steve First, David Beam

Here is a nice presentation that shows some basic tips for PROC SQL learners: Basic of SAS PROC SQL

Tips on how to input data in SAS

As a SAS learner, I was always confused by the different ways of inputting data in SAS. Well, the logic of SAS (actually, I cannot even call it logic) is so weird…
However, I guess that for some large projects the logic of SAS would be helpful, because the program format is standard and it is easy to check steps written by different programmers.
A few days ago I found a nice article called The Input Statement: Where It’s @, which talks about how to input data in SAS. I want to share it, for my own future use and for other SAS beginners.

Getting Started with SAS and Oracle

SAS is powerful in dealing with large databases. For statisticians and programmers, it seems the most important first step is to obtain a dataset. In my school days, the practical skill of using software like SAS to manipulate data from a database, say an Oracle database, was emphasized. At work, pulling data from a database is something we have to do now and then. I came across an article that introduces the use of SAS with an Oracle database through examples. I am not sure whether this article is a classic, but to me it is very helpful.
See Getting Started with SAS/Access for Oracle by F. Joseph Kelley.

Hmm, I guess the next topic I need to learn is how to deal with dynamic datasets. I am not even quite sure what the problem really is yet; I just heard that people at Amazon are doing lots of things with dynamic datasets. I hope an idea will pop up in a few days.
