I. Introduction
II. Prerequisites
III. Download and Install
IV. Tutorial
4.1 Common Rules
4.2 Example
V. Get Help
C & S's Toolkit (CSTK) is a package composed of perl, R, python and shell scripts. Using the specific script in the package or combination of scripts in the package as pipeline, may meet the general requirement of bioinformatics analysis and complete miscellaneous tasks. The function modules of CSTK include but not limited in:
A. File format converting
B. Processing of standard format file
C. Table manipulation
D. Statistical testing
E. Survival analysis
F. Data visualization
G. Parsing of third-party tool result
H. Gene/region expression quantification
I. Alternative splicing identification and quantification
CSTK mainly requires the following software/packages:
Software/Package | Version | Description |
---|---|---|
Perl | >=5.010 | The Perl language program. |
BioPerl | Not limited | Needed only by fqConverter.pl and mafLiftOver.pl in current version. |
R | >=3.3.2 | The R language program, mainly used for statistical analysis and data visualization. |
ggplot2 | 2.x | A R package for data visualization. |
Python | =2.7 | The Python language program. |
SAMTools | >=1.5 | The toolkit to manipulate BAM files. |
The given version is just to suggest you to use this version, but not to prohibit you from using older version, although we havenβt tested the older ones. More required software/packages are script-specific. If specific script in CSTK requires specific software/packages, please install it/them according to the error information.
Please download the CSTK from the release page or clone it with git.
It's extremely easy to install CSTK, because all the source codes are written with script languages and no compilation needed. After installing the software/packages required in the "Prerequisites" section, CSTK is ready to be used.
Before using CSTK, firstly add the path of CSTK into the environment variable $PATH, in order to use scripts in CSTK at the command line directly. Assuming CSTK is decompressed into /home/<yourUserName/bin>/CSTK
, run the following command:
PATH=/home/<yourUserName>/bin/CSTK:$PATH
chmod u+x /home/<yourUserName>/bin/CSTK/*.{pl,R,py,sh}
You can add the first command line into your configuration file of environment (in general, it's /home/<yourUserName>/.bashrc
), so that the CSTK is available to be used after you login every time:
cat <EOF >>/home/<yourUserName>/.bashrc
PATH=/home/<yourUserName>/CSTK:$PATH
EOF
Rather than downloading and installing manually, it's recommended to setup up by Docker (https://hub.docker.com/r/zhangsjsky/cstk) or Singularity (https://cloud.sylabs.io/library/_container/5ed705dfd0ff9c878fea7e50). In this way you don't need to install the dependencies, which had been installed in the container.
# For Docker
docker pull zhangsjsky/cstk:latest
# For Singularity
singularity pull library://zhangsjsky/default/cstk
The following rules are commonly applied in CSTK:
A) For script with only one input file, the input is passed in from STDIN (Standard Input), except that the input file is specified with argument; For script with only one output file, the output is printed to STDOUT (Standard Output), meanwhile the STDERR (Standard Error) may be used for log output; For script with 2 output files, the output may be printed to STDOUT and STDERR, respectively, or may be printed to STDOUT and the file specified with option parameter, respectively. Using STDOUT and STDERR preferentially to option is convenient to pipe the analysis procedures into a pipeline. Pipeline can also speed up the analysis, meanwhile avoid the immediate files reading and writing hard disk.
B) Help information of almost all the scripts can be viewed with -h, -help or --help options.
C) For option with value, option and value of Perl and Shell scripts are separated by a space character (e.g. -option value
), while those of R are separated by an equal mark (e.g. -option=value
).
Only one or two scripts in each function module of CSTK are illustrated in following examples, illustration for more scripts will be appended according to users' feedbacks.
- fqConverter.pl
fqConverter.pl -h
Usage: perl fqConverter.pl input.fq >output.fq
ββββIf input.fq isn't specified, input from STDIN
ββββIf output.fq isn't specified, output to STDOUT
Option:
ββββ-s --sourceββSTRββThe quality format of your source fastq file ([fastq-illumina], fastq-solexa, fastq)
ββββ-t --targetβββSTRββThe quality format of your target fastq file (fastq-illumina, fastq-solexa, [fastq])
ββββ-h --helpββββββββββThis help information screen
Since Fastq is the common format for bioinformatics, the content of input file is not illustrated here. For details of Fastq format, please visit wiki.
Running example:
zcat myReads.illumina.fq.gz | fqConverter.pl >myReads.sanger.fq
The command converts the fastq in illumina format (default of -s option) to the famous sanger format (default of -t option).
Pipeline is applied in the command to illustrate the advantage of piping the CSTK. You can also further improve it as:
zcat myReads.illumina.fq.gz | fqConverter.pl | gzip -c >myReads.sanger.fq.gz
In this way, the output fastq is stored as compressed gz file. In these analysis procedures, the CSTK script act only as an adapter in the pipeline, that is to pass in the input from the output of the previous procedure and print the output to the next procedure as its input.
- gpe2bed.pl
gpe2bed.pl -h
Usage: perl gpe2bed.pl INPUT.gpe >OUTPUT.bed
ββββIf INPUT.gpe isn't specified, input from STDIN
ββββOutput to STDOUT
Option:
ββββ-b --binββββββββββββWith bin column
ββββ-t --bedTypeββINTββBed type. It can be 3, 6, 9 or 12[12]
ββββ-i --itemRgbββSTRββRGB color[0,0,0]
ββββ-g --geneββββββββββOutput 'gene name' in INPUT.gpe as bed plus column
ββββ-p --plusβββββββββββOutput bed plus when there are additional columns in gpe
ββββ-h --helpβββββββββββPrint this help information
Assuming there is an input file (example.gpe) with the following contents:
76 | NM_015113 | chr17 | - | 3907738 | 4046253 | 3910183 | 4046189 | 55 | 3907738,3912176,3912897,3916742,3917383,3917640,3919616,3920664,3921129,3922962,3924422,3926002,3928212,3935419,3936121,3937308,3945722,3947517,3953001,3954074,3955264,3957350,3959509,3961287,3962464,3966046,3967654,3969740,3970456,3973977,3975901,3977443,3978390,3978556,3979930,3980161,3981176,3984669,3985730,3988963,3989779,3990727,3991971,3994012,3999124,3999902,4005610,4007926,4008986,4012946,4015902,4017592,4020265,4027200,4045835, | 3910264,3912248,3913051,3916908,3917482,3917809,3919760,3921024,3921265,3923063,3924614,3926122,3928412,3935552,3936296,3937586,3945862,3947668,3953153,3954337,3955430,3957489,3959639,3961449,3962584,3966211,3968123,3969834,3970536,3974218,3976050,3977645,3978472,3978723,3980053,3980283,3981336,3984784,3985798,3989097,3989949,3990828,3992187,3994124,3999273,3999994,4005709,4008105,4009103,4013157,4016102,4017764,4020460,4027345,4046253, | 0 | ZZEF1 | cmpl | cmpl | 0,0,2,1,1,0,0,0,2,0,0,0,1,0,2,0,1,0,1,2,1,0,2,2,2,2,1,0,1,0,1,0,2,0,0,1,0,2,0,1,2,0,0,2,0,1,1,2,2,1,2,1,1,0,0, |
147 | NM_001308237 | chr1 | - | 78028100 | 78149112 | 78031324 | 78105156 | 14 | 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78105133,78107068,78107206,78148946, | 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78105287,78107131,78107340,78149112, | 0 | ZZZ3 | cmpl | cmpl | 2,0,0,0,0,0,0,1,0,2,0,-1,-1,-1, |
147 | NM_015534 | chr1 | - | 78028100 | 78148343 | 78031324 | 78099039 | 15 | 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78097534,78105133,78107068,78107206,78148269, | 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78099090,78105287,78107131,78107340,78148343, | 0 | ZZZ3 | cmpl | cmpl | 2,0,0,0,0,0,0,1,0,2,0,-1,-1,-1,-1, |
As shown, there is a bin column (the first column) in the gpe file, so the -b option should be specified. For the description of gpe file, please visit UCSC: http://genome.ucsc.edu/FAQ/FAQformat.html#format9.
Run the script:
gpe2bed.pl -b example.gpe >example.bed
The content of the output bed file:
chr17 | 3907738 | 4046253 | NM_015113 | 0 | - | 3910183 | 4046189 | 0,0,0 | 55 | 2526,72,154,166,99,169,144,360,136,101,192,120,200,133,175,278,140,151,152,263,166,139,130,162,120,165,469,94,80,241,149,202,82,167,123,122,160,115,68,134,170,101,216,112,149,92,99,179,117,211,200,172,195,145,418 | 0,4438,5159,9004,9645,9902,11878,12926,13391,15224,16684,18264,20474,27681,28383,29570,37984,39779,45263,46336,47526,49612,51771,53549,54726,58308,59916,62002,62718,66239,68163,69705,70652,70818,72192,72423,73438,76931,77992,81225,82041,82989,84233,86274,91386,92164,97872,100188,101248,105208,108164,109854,112527,119462,138097 |
chr1 | 78028100 | 78149112 | NM_001308237 | 0 | - | 78031324 | 78105156 | 0,0,0 | 14 | 3369,101,135,153,96,102,72,116,148,139,154,63,134,166 | 0,3665,5916,13652,16358,17111,18582,19360,19563,22101,77033,78968,79106,120846 |
chr1 | 78028100 | 78148343 | NM_015534 | 0 | - | 78031324 | 78099039 | 0,0,0 | 15 | 3369,101,135,153,96,102,72,116,148,139,1556,154,63,134,74 | 0,3665,5916,13652,16358,17111,18582,19360,19563,22101,69434,77033,78968,79106,120169 |
The -t is 12 in default, so the output bed file is bed12 format in default. You can also try the -g or -p option to modify or add columns in output.
- gpeMerge.pl
gpeMerge.pl -h
Usage: perl gpeMerge.pl input.gpe >output.gpe
ββββIf input.gpe not specified, input from STDIN
ββββOutput to STDOUT
ββββ-b --binββββββββββββββββββWith bin column
ββββ-l --locusββββββββββββββββββMerge with locus (default: merge gene)
ββββ-t --longTranscriptβββββββββOverlap is against long transcript (default against short transcript)
ββββ-n --nameββββββββββββββββSet the "gene name" column as "transcript name(s)" when the corresponding gene name unavailable
ββββ-p --percentββββββββDOUββMinimal overlap percent to merge transcript (default: 0)
ββββ-h --helpββββββββββββββββββPrint this help information
This script can merge different transcripts of the same gene (or the same locus if the -l option specified). The merging criterion is: for each site of a gene, if in any transcript the site is located in exon, the site is treated as exonic site in the merged result, otherwise treated as intronic site.
A diagram to intuitively illustrate the merging:
ββββββββββββββββββββββββββββββββββFrom C.
In the figure, the top and middle lines are the two transcripts of the same gene, the bottom line is the result after merging.
Use the example.gpe file in the previous section as input to run this script:
gpeMerge.pl -b example.gpe >merged.gpe
The content of the output:
6 | NM_015113 | chr17 | - | 3907738 | 4046253 | 3910183 | 4046189 | 55 | 3907738,3912176,3912897,3916742,3917383,3917640,3919616,3920664,3921129,3922962,3924422,3926002,3928212,3935419,3936121,3937308,3945722,3947517,3953001,3954074,3955264,3957350,3959509,3961287,3962464,3966046,3967654,3969740,3970456,3973977,3975901,3977443,3978390,3978556,3979930,3980161,3981176,3984669,3985730,3988963,3989779,3990727,3991971,3994012,3999124,3999902,4005610,4007926,4008986,4012946,4015902,4017592,4020265,4027200,4045835, | 3910264,3912248,3913051,3916908,3917482,3917809,3919760,3921024,3921265,3923063,3924614,3926122,3928412,3935552,3936296,3937586,3945862,3947668,3953153,3954337,3955430,3957489,3959639,3961449,3962584,3966211,3968123,3969834,3970536,3974218,3976050,3977645,3978472,3978723,3980053,3980283,3981336,3984784,3985798,3989097,3989949,3990828,3992187,3994124,3999273,3999994,4005709,4008105,4009103,4013157,4016102,4017764,4020460,4027345,4046253, | 0 | ZZEF1 | cmpl | cmpl | 0,0,2,1,1,0,0,0,2,0,0,0,1,0,2,0,1,0,1,2,1,0,2,2,2,2,1,0,1,0,1,0,2,0,0,1,0,2,0,1,2,0,0,2,0,1,1,2,2,1,2,1,1,0,0, |
147 | NM_001308237,NM_015534 | chr1 | - | 78028100 | 78149112 | 78028100 | 78149112 | 16 | 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78097534,78105133,78107068,78107206,78148269,78148946 | 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78099090,78105287,78107131,78107340,78148343,78149112 | 0 | ZZZ3 | unk | unk | -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, |
As you can see, the two records of the transcripts of the ZZZ3 gene have been merged into one record. The start and end of exons are updated as the coordinates after merging and the other information associated with coordinate is also updated according to the coordinates after merging. The column of transcript name is also updated as comma-separated transcript list.
- gpeFeature.pl
gpeFeature.pl -h
Usage: perl gpeFeature.pl OPTION INPUT.gpe >OUTPUT.bed
ββββIf INPUT.gpe isn't specified, input from STDIN
ββββExample: perl gpeFeature.pl -b -g hg19.size --upstream 1000 hg19.refGene.gpe >hg19.refGene.bed
Option:
ββββ-b --binββββββββββββββββWith bin column
ββββ-i --intronβββββββββββββββFetch introns in each transcript
ββββ-e --exonβββββββββββββββFetch exons in each transcript
ββββ-c --cdsβββββββββββββββββFetch CDS in each transcript
ββββ-u --utrβββββββββββββββββFetch UTRs in each transcript, 5'UTR then 3'UTR (or 3' first)
ββββ-p --primeββββββββINTβββ5 for 5'UTR, 3 for 3'UTR(force -u)
βββββββ--completeββββββββββOnly fetch UTR for completed transcripts
βββββββ--upstreamβββββINTββFetch upstream INT intergenic regions(force -g)
βββββββ--downstreamββINTββFetch downstream INT intergenic regions(force -g)
ββββ-g --chrSizeβββββββFILEββTab-separated file with two columns: chr name and its length
ββββ-s --singleββββββββββββββBundle all features into single line for each transcript
βββββββ--addIndexββββββββββAdd exon/intron/CDS/UTR index as suffix of name in the 4th column
ββββ-h --helpβββββββββββββββPrint this help information
This script is used to extract specific feature from the gpe file and output in bed format.
For example, to extract exons:
gpeFeature.pl -b example.gpe -e >exon.bed6+
chr17 | 3907738 | 3910264 | NM_015113 | 0 | - | NM_015113 | 2526 | ZZEF1 |
chr17 | 3912176 | 3912248 | NM_015113 | 0 | - | NM_015113 | 72 | ZZEF1 |
chr17 | 3912897 | 3913051 | NM_015113 | 0 | - | NM_015113 | 154 | ZZEF1 |
More records of the output are omitted. The result is in bed6+ format with each line representing an exon. The last two columns present the exon length and gene name.
- tsvFilter.pl
tsvFilter.pl -h
Usage: perl tsvFilter.pl -o originFile.tsv -1 1,4 -m i|include targetFile.tsv >filtered.tsv
ββββIf targetFile.tsv isn't specified, input is from STDIN
ββββOutput to STDOUT
Option:
ββββ-o --originFileββββTSVββThe original file containing fields (specified by --originFields) used to include or exclude lines in targetFile.tab
ββββ-1 --originFieldsββSTRββComma-separated field list specifying which fileds in the originFile.tab to be used to include or exclude, 1-based start [1]
ββββββββββββββββββββββββββThe element of the list can be a single column number or a range with nonnumeric char as separator
ββββββββββββββββββββββββββTo specify the last column left the range right margin blank
ββββββββββββββββββββββββββIf continuous range specified like '1-3-6', the first range '1-3' will be output
βββββββββββββββββββββββββββe.g.:
ββββββββββββββββββββββββββββ-1 1,4 ββββoutput columns 1,4
ββββββββββββββββββββββββββββ-1 1-4,6..8βoutput columns 1,2,3,4,6,7,8
ββββββββββββββββββββββββββββ-1 1,4,6- ββoutput columns 1,4,6,7,... last column
ββββββββββββββββββββββββββββ-1 1-3-6 ββoutput columns 1,2,3
ββββ-2 --targetFieldsββSTRββComma-separated field list specifying which fileds in the targetFile.tab are used to include or exclude lines, 1-based start [1]
ββββββββββββββββββββββββββMore description about --targetFields, see --originFields
ββββ-m --modeβββββββSTRββTo include or exclude lines in targetFile.tab, it can be i|include or e|exclude[e]
ββββ-s --separatorββββSTRββ(Optional)A separator to join the fields specified, if necessary[Empty string]
ββββ-h --helpββββββββββββββPrint this help information
This script is often used in table manipulation. It's function is to filter a table (i.g. target table) according to one column or some columns (i.g. the target columns) with one or some columns (i.g. the source column) of another table (i.g. source table). The mode on it's based is whether target columns include (or exclude) the source columns.
Assuming the first column of the source table (source.tsv) is gene name, the second column is up- or down- regulation mark:
Gene1 | Up |
Gene2 | Down |
Gene3 | Up |
The first column of the target table (target.tsv) is gene name, the second column is up- or down- regulation fold-change:
Gene1 | 5 |
Gene2 | 10 |
Gene3 | 2 |
If you want to know the fold-change of the up-regulated genes, run the following command:
tsvFilter.pl -o <(awk '$2=="Up"' source.tsv) -m i target.tsv
The output is:
Gene1 | 5 |
Gene3 | 2 |
The command fetches up-regulated genes with awk firstly, and then feeds it to the -o option as input to filter the target.tsv file. -m option specifies the filtering mode as "include". Because both the source column and target column are the first column, -1 and -2 options are specified as 1 in default.
- tsvJoin.sh
tsvJoin.sh
Usage: tsvJoin.sh OPTIONS [-1 field1][-2 fields2] input1.tsv [input2.tsv]
Note: input2.tsv can be omitted or be a "-" to input the data from STDIN.
ββββββwhen input1.tsv is a "-" (i.e. from STDIN), input2.tsv must be specified.
Options:
-1|field1βββββββββββSTRββThe field in input1.tsv used when joining[1]
βββββββββββββββββββββββββRefer to the -1 option of linux join command
-2 field2βββββββββββSTRββThe field in input2.tsv used when joining[1]
-i|inputDelimiterββββSTRββThe delimiter of your input file[\t]
-j|joinDelimiterβββββSTRββThe delimiter used by linux join commond (i.e. the -t option of join command)[|]
-o|outputDelimiterββSTRββThe delimiter of the output[\t]
-a|unpairedFileββββββINTββAlso print unpairable lines from file INT (1, 2)
Please specify at least one file
This script is also often used in table manipulation. It's function is to join two tables according to whether the values of one or some columns are the same. Different from the join command built in Linux, this script can use the tab character as separated character.
Take the previous source.tsv and target.tsv as examples, join them according to gene name:
tsvJoin.sh source.tsv target.tsv >join.tsv 2>/dev/null
The output join.tsv is:
Gene1 | Up | 5 |
Gene2 | Down | 10 |
Gene3 | Up | 2 |
In the command, '2>/dev/null' means discarding the STDERR (for output running information). Because the columns used to join in the two input table are both the first column, -1 and -2 options are specified as 1 in default.
In the above output, the column used to do joining (i.g. the gene name column) is set as the first column and is followed by the other columns in the first and the second files, except for the column used to do joining.
- testT.R
testT.R -h
Usage: testT.R -option=value <input.lst|<input1.lst input2.lst|input1.lst input2.lst >pValue
Option:
ββββ-p|pairββββββPair
ββββ-a|altββSTRββThe alternative hypothesis: [two.sided], greatr or less
ββββ-hββββββββββShow help
This script is used to conduct T-test testing. '<input.lst|<input1.lst input2.lst|input1.lst input2.lst' means there are a few input manners:
A) Input one file from STDIN;
B) Input from STDIN and file (specified by an argument), respectively;
C) Input from two files (specified by two arguments), respectively.
The commonly used manner is the last one. Assuming the content of the first file (value1.lst) is:
1
2
3
4
5
And that of the second file is:
3
4
5
6
7
8
testT.R value1.lst value2.lst
0.0398030820241363
The output p-value indicates there is significant difference between the two list of values.
- testWilcoxon.R
testWilcoxon.R -h
Usage: testWilcoxon.R -option=value <input.lst|input.lst|input1.lst input2.lst >pValue
Options
ββββ-a|altββββSTRββThe alternative hypothesis ([two.sided], greater, less)
βββββββββββββββββββYou can specify just the initial letter
ββββ-m|muββDOUββA parameter used to form the null hypothesis for one-sample test[0]
ββββ-h|helpββββββββShow help
testWilcoxon.R value1.lst value2.lst 2>/dev/null
0.06601543
The output p-value indicates there isn't significant difference between the two list of values. The conclusion is different from the T-test.
- survival.R
survival.R -h
Usage: survival.R -option=value <input.tsv
Option:
ββββ-p|pdfβββββPDFββThe KM figure[KM.pdf]
ββββ-w|widthβββINTββThe figure width
βββββββheightββINTββThe figure height
βββββββheaderβββββββWith header
ββββ-m|mainβββSTRββThe main title
ββββ-x|xlabβββββSTRββThe xlab[Time]
ββββ-y|ylabβββββSTRββThe ylab[Survival Probability]
ββββ-hββββββββββββββShow help
Input (header isn't necessary):
ββExample1:
ββββ#timeβevent
ββββ1βββββTRUE
ββββ2βββββTRUE
ββββ2βββββTRUE
ββββ8βββββFALSE
ββββ5βββββTRUE
ββββ10ββββFALSE
ββExample2:
ββββ#timeβevent βgroup
ββββ1βββββTRUEββmale
ββββ2βββββTRUEββmale
ββββ2βββββTRUEββmale
ββββ8βββββFALSEβmale
ββββ5βββββTRUEββfemale
ββββ10ββββFALSEβfemale
As the help information presents, creating an input file as the Example2 (survival.tsv). Pass it to the script as input and specify the output pdf file:
survival.R -p=survival.pdf <survival.tsv >survival.log 2>survival.err
Survival curve (i.g. Kaplan-Meier curve) and the log-rank p-value are stored in the pdf. Statistical measurements are stored in survival.log.
- bar.R
bar.R -h
Usage: bar.R -p=outputName.pdf <input.tsv
Option:
Common:
ββββ-p|pdfββββFILEββThe output figure in pdf[figure.pdf]
ββββ-w|widthββINTββThe figure width
ββββ-heightβββINTββThe figure height
ββββ-m|mainββSTRββThe main title
Contents omittedβ¦
ββββ-annoTxtβββSTRsββThe comma-separated texts to be annotated
ββββ-annoTxtXββINTsββThe comma-separated X positions of text
ββββ-annoTxtYββINTsββThe comma-separated Y positions of text
ββββ-annoTxtSββDOUββThe annotated text size[5]
Skill:
ββββLegend title of alpha, color, etc can be set as the same to merge their guides
This script is used to draw bar chart. Take the survival.tsv in the previous section as input to draw:
awk 'BEGIN{OFS="\t"}{print NR,$1,$3}' survival.tsv | bar.R -p=bar.pdf -fillV=V3 -x='Patient ID' -y=Time -fillT=Gender
In the command, awk is used to process the input before passing the data to bar.R. The processed results are:
1 | 1 | male |
2 | 2 | male |
3 | 4 | male |
4 | 8 | male |
5 | 5 | female |
6 | 10 | female |
The extra patient ID information is added as the first column and will be presented at X axis. The second column is survival time and will be presented at Y axis. The third column is gender and will be used to fill bar with different color. If the third column isn't presented, all the bars are filled in black in default.
In the parameters of bar.R, -fillV=V3
tells it to use the third column (column name: V3) to color bars. Don't specify this option if no third column. -x and -y specify the label of X and Y axis, respectively. -fillT specify the title of the fill-legend.
- hist.R
hist.R -h
Usage: hist.R -p=outputName.pdf <input.tsv
Option:
ββββCommon:
ββββ-p|pdfββββFILEβββThe output figure in pdf[figure.pdf]
ββββ-w|widthββINTβββThe figure width
ββββ-m|mainββSTRβββThe main title
ββββ-mainSβββDOUββThe size of main title[22 for ggplot]
ββββ-x|xlabββββSTRβββThe xlab[Binned Values]
ββββ-y|ylabββββSTRβββThe ylab
Contents omittedβ¦
ββββ-annoTxtβββSTRsββThe comma-separated texts to be annotated
ββββ-annoTxtXββINTsββThe comma-separated X positions of text
ββββ-annoTxtYββINTsββThe comma-separated Y positions of text
Skill:
ββββLegend title of alpha, color, etc can be set as the same to merge their guides
This script is used to draw histogram.
hist.R -p=hist.pdf -x=Time -y='Patient Count' <survival.tsv
Only one column of values is needed to draw a histogram, so pass in survival.tsv as input then the first column will be used default to draw histogram.
- coverageBedParser.pl
coverageBedParser.pl -h
Usage: perl coverageBedParser.pl coverageBedOutput.tsv >OUTPUT.tsv
ββββIf coverageBedOutput.tsv isn't specified, input from STDIN
Option:
ββββ-b -bedFormatββINTββBed format ([3], 6)
ββββ-h --helpββββββββββββPrint this help information
This script is used to parse the output result of 'bedtools coverage
', calculate the mean depth and covered fraction of each region.
coverageBedParser.pl coverage.bed >coverageParsed.bed
- fastqcParser.pl
fastqcParser.pl -h
Usage: perl fastqcParser.pl OPTION fastqc_data.txt >OUTPUT.tsv
ββββIf fastqc_data.txt isn't specified, input from STDIN
Options:
ββββ-h --helpββPrint this help information
This script is used to parse the result (in general, the fastqc_data.txt) of FastQC, extract read-count, GC content and mean quality etc. and output as a table.
fastqcParser.pl fastqc_data.txt >fastqcDataParsed.tsv
- geneRPKM.pl
geneRPKM.pl -h
Usage: perl geneRPKM.pl -g gene_structure.gpe -s 4 INPUT.BAM >RPKM.bed6+ 2>running.log
ββββIf INPUT.BAM isn't specified, input is from STDIN
ββββOutput to STDOUT in bed6 (gene in name column, RPKM in score column) plus longest transcript, readNO and transcript length
ββββThis script chooses the LONGEST transcript of each gene as reference transcript to measure RPKM
Option:
ββββ-g --gpeβββββFILEββA gpe file with comment or track line allowed
ββββ-b --binβββββββββββWith bin column
ββββ-l --libTypeββSTRββThe library type, it can be
ββββββββββββββββββββββfr-unstranded: for Standard Illumina (default)
ββββββββββββββββββββββfr-firststrand: for dUTP, NSR, NNSR
ββββββββββββββββββββββfr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol
ββββ-s --slopβββββINTββSpecify the slopping length from the exon-intron joint to intron[0]
ββββ-u --uniqββββββββββOnly use uniquely-mapped reads (NH=1)to compute RPKM
βββββββ--logβββββFILEββRecord running log into FILE
ββββ-h --helpββββββββββPrint this help information
As the help information describes, this script chooses the longest transcript of each gene as the reference to quantify the expression of each gene in RPKM. The output result is stored in bed6+. A running example:
geneRPKM.pl -g refGene.gpe -b -s 4 final.bam >RPKM.bed6+ 2>geneRPKM.log
In the command, -s option specifies the length in which reads can extend from exon into intron. This option is devised to handle the situation that reads may be incapable of spanning an intron to be aligned to adjacent exon with a few base-pair.
This script may consume too much memory (depend on the size of the input bam file). If the memory isn't enough on the machine, the script geneRPKM_mem.pl, which consumes little memory but runs slower, may be an alternative choice.
- regionRPKM.pl
regionRPKM.pl -h
Usage: perl regionRPKM.pl -b region.bed INPUT.bam >RPKM.bed 2>running.log
ββββIf INPUT.bam isn't specified, input from STDIN
ββββOutput to STDOUT with bed columns plus reads count in region and its RPKM
Note: INPUT.bam should be indexed with samtools index
ββββThis script is for handling bam file in normal size that can be entirely cached into memory.
ββββIt's MEMORY-CONSUMED but low TIME-CONSUMED compared to its equivalent regionRPKM_mem.pl.
ββββSpliced reads are handled now. Those that include the whole region within intron aren't counted.
Option:
ββββ-b|bedFileββFILEββRegion file in bed4 or bed6 format. bed plus is allowed.
ββββ-l|libTypeβββSTRββThe library type, it can be
βββββββββββββββββββββfr-unstranded: for Standard Illumina (default)
βββββββββββββββββββββfr-firststrand: for dUTP, NSR, NNSR
βββββββββββββββββββββfr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol
ββββ-h --helpβββββββββPrint this help information
This script calculates the RPKM of specific regions (specified by --bedFile option). A running example:
regionRPKM.pl --bedFile exon.bed final.bam >RPKM.bed 2>regionRPKM.log
Similarly, if the memory isn't enough, use regionRPKM_mem.pl as alternative.
Only identification and quantification of SE (Skipping Exon) event are implemented in current version.
- psiSE.pl
psiSE.pl -h
Usage: perl psiSE.pl INPUT.bam >OUTPUT.bed6+
ββββIf INPUT.bam isn't specified, input from STDIN
Option:
ββββ-b --bedββββββββFILEββGene models in bed12 format
ββββ-l --libraryTypeββSTRββThe library type, it can be
βββββββββββββββββββββββββfr-unstranded: for Standard Illumina (default)
βββββββββββββββββββββββββfr-firststrand: for dUTP, NSR, NNSR
βββββββββββββββββββββββββfr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol
ββββ-s --slopββββββββINTββMaximal slope length for a read to be considered as exonic read[4]
ββββ-r --minReadββββINTββMinimal supporting reads count for an exclusion junction[2]
ββββ-h --helpβββββββββββββPrint this help information
Output:
ββββThe 4th column is the transcript name and the exon rank (in transcriptional direction) separated by a dot.
ββββThe 5th column in OUTPUT.bed6+ is the PSI normalized into 0-1000.
ββββAdditional columns are as follow:
ββββββββinclusion read count
ββββββββinclusion region length
ββββββββinclusion read density
ββββββββexclusion read counts separated by comma
ββββββββexclusion region lengths separated by comma
ββββββββexclusion read density
This script is used to identify SE events and quantify the PSI of them from the alignments. An using example:
psiSE.pl final.bam >SE.bed6+
Refer to the help information for the format of the output result.
You can send the author Sky any information about this toolkit, like bug reporting, typos and performance improvement suggestion.