Snakemake workflow manager: learning notes, miscellany 02


Snakemake learning notes 007: submitting jobs to a Slurm cluster

The Snakefile (written following my main reference):

```python
input_folder = "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/samples/"
output_folder = "/home/myan/scratch/private/practice_data/RNAseq/20220511/"

# Collect sample IDs (SRR) and read numbers (FRR, 1/2) from the input file names
SRR, FRR = glob_wildcards(input_folder + "{srr}_chrX_{frr}.fastq.gz")

rule all:
    input:
        expand(output_folder + "outputfastq/{srr}_chrX_{frr}.fastq", srr=SRR, frr=FRR)

rule first:
    input:
        read01 = input_folder + "{srr}_chrX_1.fastq.gz",
        read02 = input_folder + "{srr}_chrX_2.fastq.gz"
    output:
        read01 = output_folder + "outputfastq/{srr}_chrX_1.fastq",
        read02 = output_folder + "outputfastq/{srr}_chrX_2.fastq",
        json = output_folder + "fastpreport/{srr}.json",
        html = output_folder + "fastpreport/{srr}.html"
    threads: 8
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
        --thread {threads} --html {output.html} --json {output.json}
        """
```

Command to run it:

```bash
snakemake --cluster 'sbatch --cpus-per-task={threads}' --jobs 12 -s snakemake_hpc.py
```

It finished in a flash.

I also tried a longer command:

```bash
snakemake --cluster 'sbatch --cpus-per-task={threads} -o slurm_outputs/{rule}_{wildcards}_%j.out -e logs_errors/{rule}/{rule}_{wildcards}_%j.err --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
```

This command never succeeded.
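The notes don't say why it failed; my guess (an assumption, not confirmed in the original) is that sbatch does not create the directories passed to `-o`/`-e`, so each job dies as soon as Slurm tries to open its log file. Creating the folders beforehand should let the same idea work; in the sketch below I also flatten the per-rule error subfolders into a single directory so only two `mkdir` targets are needed:

```bash
# Assumption: the -o/-e directories must already exist; sbatch will not create them.
mkdir -p slurm_outputs logs_errors
snakemake --cluster 'sbatch --cpus-per-task={threads} -o slurm_outputs/{rule}_{wildcards}_%j.out -e logs_errors/{rule}_{wildcards}_%j.err --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
```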

The command below does work, with email notification added:

```bash
snakemake --cluster 'sbatch --cpus-per-task={threads} --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
```

I did not run into out-of-memory problems here.

But when running real data I do hit out-of-memory errors.

(screenshot: out-of-memory error when running the real dataset)


(screenshot: raw fastq directory layout, with per-project subfolders each holding the SRR fastq.gz files)

My file storage hierarchy is as shown above. With the earlier wildcard approach, it would generate combinations such as `PRJNA001/SRR0002_1.fastq.gz`, pairing sample IDs with project folders they don't actually belong to.

The question here is how to control which combinations the `expand()` function generates.
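For context (this toy snippet is mine, not from the original notes): by default `expand()` takes the Cartesian product of all wildcard values, while passing `zip` as the second argument pairs them element-wise, which is exactly what a per-project sample list needs.

```python
from snakemake.io import expand  # inside a Snakefile, expand() is available without this import

exper = ["PRJNA001", "PRJNA002"]
srr = ["SRR0001", "SRR0002"]

# Default combinator: Cartesian product -> 4 paths, including pairs that don't exist on disk
print(expand("{exper}/{srr}_1.fastq.gz", exper=exper, srr=srr))

# zip combinator: element-wise pairing -> only PRJNA001/SRR0001_... and PRJNA002/SRR0002_...
print(expand("{exper}/{srr}_1.fastq.gz", zip, exper=exper, srr=srr))
```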

The task in the workflow is still the same: filtering raw sequencing reads with fastp.

```python
import os
import glob

raw_fastq_folder = "/mnt/sdc/xiaoming/MingYan/snakemake_20220513/00.raw.fastq/"
output_folder = "/mnt/sdc/xiaoming/MingYan/snakemake_20220513/"

# Build a dict: {project folder: [sample IDs found inside it]}
fq_list = {}
print(os.listdir(raw_fastq_folder))
experiment = os.listdir(raw_fastq_folder)
for i in experiment:
    fq_list[i] = [fq.split("_")[0] for fq in os.listdir(os.path.join(raw_fastq_folder, i))]
print(fq_list)

# Flatten into (project, sample) pairs so that expand(..., zip, ...) only combines
# each sample with the project folder it actually lives in
inputs = [(dir, file) for dir, files in fq_list.items() for file in files]
# glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq")

rule all:
    input:
        expand(output_folder + "01.fastp.report/" + "{exper}/{srr}.html",
               zip,
               exper=[row[0] for row in inputs],
               srr=[row[1] for row in inputs])

rule firstrule:
    input:
        read01 = raw_fastq_folder + "{exper}/{srr}_1.fastq.gz",
        read02 = raw_fastq_folder + "{exper}/{srr}_2.fastq.gz"
    output:
        read01 = output_folder + "01.fastp.filter/" + "{exper}/{srr}_clean_1.fastq.gz",
        read02 = output_folder + "01.fastp.filter/" + "{exper}/{srr}_clean_2.fastq.gz",
        html = output_folder + "01.fastp.report/" + "{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/" + "{exper}/{srr}.json"
    threads: 2
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
        --json {output.json} --html {output.html} -w {threads}
        """
```
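The notes don't show how this Snakefile was launched; a dry run is a cheap way to confirm that `zip` only targets the project/sample pairs that really exist (the file name `snakemake_zip.py` below is just a placeholder for whatever the Snakefile is saved as):

```bash
# -n: dry run (build nothing), -p: print the shell commands that would be executed
snakemake -np -s snakemake_zip.py
```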

This part mainly follows a reference link; the version for the cluster ended up as:

```python
import os

raw_fastq_folder = "/mnt/shared/scratch/myan/private/pomeRTD/00.raw.fastq/"
output_folder = "/home/myan/scratch/private/pomeRTD/"

# Folder, SRR, FRR = glob_wildcards(raw_fastq_folder + "{folder}/{srr}_{frr}.fq.gz")
# print(Folder)
# experiment = os.listdir(raw_fastq_folder)

# {project folder: [sample IDs inside it]}
list_fastq = {}
for experiment in os.listdir(raw_fastq_folder):
    list_fastq[experiment] = [x.split("_")[0] for x in os.listdir(raw_fastq_folder + experiment)]
print(list_fastq)

# Flatten to (project, sample) pairs for the zip-style expand below
inputs = [(dir, file) for dir, files in list_fastq.items() for file in files]
# glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq")

rule all:
    input:
        expand(output_folder + "01.fastp.report/" + "{exper}/{srr}.html",
               zip,
               exper=[row[0] for row in inputs],
               srr=[row[1] for row in inputs])

rule runfastp:
    input:
        read01 = os.path.join(raw_fastq_folder, "{exper}", "{srr}_1.fq.gz"),
        read02 = os.path.join(raw_fastq_folder, "{exper}", "{srr}_2.fq.gz")
    output:
        read01 = output_folder + "01.fastp.filtered.reads/{exper}/{srr}_clean_1.fq.gz",
        read02 = output_folder + "01.fastp.filtered.reads/{exper}/{srr}_clean_2.fq.gz",
        html = output_folder + "01.fastp.report/{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/{exper}/{srr}.json"
    threads: 8
    resources:
        mem = 8000   # requested memory, passed to sbatch via {resources.mem}
    params:
        "-q 20 --cut_front --cut_tail -l 30"
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
        -w {threads} -h {output.html} -j {output.json} {params}
        """
```

The unit of the 8000 here is MB; for now I don't know how to write it in GB.
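For what it's worth (Slurm behaviour, not from the original notes): sbatch reads a bare `--mem` value as MB and also accepts unit suffixes, so an 8 GB request can be written either way. `myjob.sh` below is only a placeholder script name.

```bash
sbatch --mem=8000 myjob.sh   # bare number -> 8000 MB
sbatch --mem=8G myjob.sh     # the same request, written with a GB suffix
```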

The command to run this Snakefile:

```bash
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
```

Written this way, a pile of job-submission log files ends up in the current working directory. How can these files be sent to a specified folder instead?

(screenshot: Slurm job log files accumulating in the current directory)
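One likely fix, assuming the directory is created first (the folder name `cluster_logs` is my own choice): point sbatch's `-o`/`-e` at an existing directory so the `slurm-<jobid>.out` files stop piling up in the working directory. In sbatch filename patterns, `%x` is the job name and `%j` the job ID.

```bash
mkdir -p cluster_logs
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} -o cluster_logs/%x_%j.out -e cluster_logs/%x_%j.err --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
```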

Another question: on a Slurm-managed HPC, jobs are normally submitted with `sbatch scripts.sh`. Can the command

```bash
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
```

itself be written into a `.sh` file and then submitted with sbatch? Worth trying.
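Untested here, but the usual pattern looks roughly like the wrapper below: the snakemake "master" process runs as a small Slurm job of its own and submits the per-rule jobs itself through `--cluster` (the file name `run_snakemake.sh` is a placeholder):

```bash
#!/bin/bash
#SBATCH --job-name=smk_master
#SBATCH --cpus-per-task=1

# The master job needs almost no resources itself; it only waits and submits child jobs.
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
```

It would be submitted with `sbatch run_snakemake.sh`; note that the walltime of this master job has to cover the whole workflow.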

And if you're not on a compute cluster, is there still a way to set `--jobs`?
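As far as I know, yes: without `--cluster`, the `-j`/`--cores` value simply caps how many CPU cores snakemake may use at once on the local machine, for example:

```bash
# Local (single-machine) run: no cluster submission, at most 8 cores in use at a time
snakemake --cores 8 -s snakemake_hpc.py
```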

There are still a lot of basics to work through.

小明的数据分析笔记本

