# How remove batch effects of RNA sequencing without control genes in R

#### Introduction

RUVseq can conduct a differential expression (DE) analysis that controls for “unwanted variation”, e.g., batch, library preparation, and other nuisance effects, using the between-sample normalization methods proposed. They call this approach RUVSeq for remove unwanted variation from RNA-Seq data

If no genes are known a priori not to be influenced by the covariates of interest, one can obtain a set of “in-silico empirical” negative controls, e.g., least significantly DE genes based on a first-pass DE analysis performed prior to RUVg normalization.

#### Code exmaple

Installation of RUVseq and edgeR from bioconductor

```
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("RUVSeq")
BiocManager::install("edgeR")
```

Load needed packge and data

```
library(RUVSeq)
library(zebrafishRNASeq)
data(zfGenes)
```

zfGenes data: one gene in a row, one sample in a column

Filter out non-expressed genes, by requiring more than 5 reads in at least two samples for each gene.

```
filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]
```

Defined group and saved data in SeqExpressionSet

```
x <- as.factor(rep(c("Ctl", "Trt"), each=3))
set <- newSeqExpressionSet(as.matrix(filtered),
phenoData = data.frame(x, row.names=colnames(filtered)))
```

Firstly we run DEG, then we get empirical control gene by considering all but the top 5000 genes as ranked by edgeR p-values.

```
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
top <- topTags(lrt, n=nrow(set))$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]
```

Now we can estimate the factors of unwanted variation

```
set2 <- RUVg(set, empirical, k=1)
pData(set2)
## x W_1
## Ctl1 Ctl -0.10879677
## Ctl3 Ctl 0.23066424
## Ctl5 Ctl 0.19926266
## Trt9 Trt 0.07672121
## Trt11 Trt -0.83540924
## Trt13 Trt 0.43755790
```

Now , We can put W_1 as the factors of unwanted variation combining with group factor into the negative binomial GLM model of edgeR. Then the DE genes without batch effect came out.

```
design <- model.matrix(~x + W_1, data=pData(set1))
y <- DGEList(counts=counts(set1), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt) # DE gene order by pvalue
```

If someone wants to use DESeq2 instead of edgeR, the codes are as follows.

```
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts(set1),
colData = pData(set1),
design = ~ W_1 + x)
dds <- DESeq(dds)
res <- results(dds)
res
```

#### References

http://www.bioconductor.org/packages/devel/bioc/vignettes/RUVSeq/inst/doc/RUVSeq.pdf

Hello, I enjoy reading all of your post. I like to write a little comment to support you.