Workflow Management Fundamentals¶
If you're coming from a background of writing shell scripts, Python scripts, or other sequential data processing scripts, Nextflow might seem like unnecessary complexity. Why learn a whole new framework when your bash script does the job?
This side quest takes a different approach: you'll build a bash script yourself, experience its limitations firsthand, and then discover how Nextflow solves each problem as it emerges. By the end, you'll understand not just what workflow management does, but why it exists.
This tutorial uses a single exemplar throughout: a quality control and assembly pipeline for bacterial genome sequencing data.
What you'll learn:
- Why parallelization matters: Experience sequential processing delays, then see automatic parallelization
- Why software isolation matters: Hit version conflicts, then see container-based solutions
- Why resume capability matters: Lose hours of work to a failure, then see instant recovery
- Why portability matters: See environment-dependent code, then see write-once-run-anywhere
- Why resource management matters: Experience resource bottlenecks, then see declarative solutions
0. Warmup¶
0.1. Prerequisites¶
Before taking on this side quest you should:
- Be comfortable with command-line basics (file navigation, running commands, writing simple bash scripts)
- Have basic familiarity with bioinformatics concepts (FASTQ files, quality control, assembly)
- Understand what containers are at a high level (isolated software environments)
You do NOT need:
- Prior Nextflow experience (we'll teach you as we go)
- Deep knowledge of Groovy or any programming language
- Experience with workflow management systems
0.2. The Scenario¶
You're analyzing bacterial genome sequences. Your workflow involves:
- Quality control of raw sequencing reads (FastQC)
- Adapter trimming and filtering (fastp)
- Genome assembly (SPAdes)
- Assembly quality assessment (QUAST)
You have 3 samples to process initially, but your PI just said there are 20 more coming next week...
0.3. Starting Point¶
Navigate to the project directory:
The directory contains sample data:
> tree
.
├── data
│   ├── samples.csv
│   └── reads
│       ├── sample_01_R1.fastq.gz
│       ├── sample_01_R2.fastq.gz
│       ├── sample_02_R1.fastq.gz
│       ├── sample_02_R2.fastq.gz
│       ├── sample_03_R1.fastq.gz
│       └── sample_03_R2.fastq.gz
└── README.md
3 directories, 8 files
Our sample CSV contains metadata:
sample_id,organism,read1,read2
sample_01,E.coli,data/reads/sample_01_R1.fastq.gz,data/reads/sample_01_R2.fastq.gz
sample_02,S.aureus,data/reads/sample_02_R1.fastq.gz,data/reads/sample_02_R2.fastq.gz
sample_03,P.aeruginosa,data/reads/sample_03_R1.fastq.gz,data/reads/sample_03_R2.fastq.gz
About running the commands
This tutorial is conceptual - we won't actually execute these computationally intensive commands. Focus on understanding the patterns and problems. The code is realistic and representative of what you'd write in practice.
1. Building the Bash Script¶
1.1. Start Simple: Process One Sample¶
Let's begin by processing just one sample. Create a file called process_one.sh:
Write a bash script that:
- Creates an output directory called results/
- Runs FastQC on data/reads/sample_01_R1.fastq.gz and data/reads/sample_01_R2.fastq.gz
- Outputs results to results/fastqc/
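One possible solution - a minimal sketch (the exact layout is a matter of taste):

#!/bin/bash
set -euo pipefail

# Create the output directory for the FastQC reports
mkdir -p results/fastqc

# Run FastQC on both read files for sample_01
fastqc --outdir results/fastqc \
    data/reads/sample_01_R1.fastq.gz \
    data/reads/sample_01_R2.fastq.gz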
This works! For a single sample, this is perfectly fine. But you have 3 samples...
1.2. Scale to Multiple Samples: Add a Loop¶
Now modify your script to process all samples. Read from data/samples.csv and loop through them.
Modify process_one.sh to:
- Read sample information from data/samples.csv
- Loop through each sample
- Run FastQC on each sample's reads
Hint: Use tail -n +2 to skip the CSV header and while IFS=',' read to parse CSV lines
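One way to write it - a sketch built around the hinted tail / read pattern, with echo messages chosen to match the expected output below:

#!/bin/bash
set -euo pipefail

mkdir -p results/fastqc

echo "Processing all samples..."
echo "========================"

# Skip the CSV header, then split each remaining line on commas
tail -n +2 data/samples.csv | while IFS=',' read -r sample_id organism read1 read2; do
    echo "Processing ${sample_id} (${organism})..."
    fastqc --outdir results/fastqc "${read1}" "${read2}"
    echo "Completed ${sample_id}"
done

echo "All samples processed!"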
Run this script:
Expected output:
Processing all samples...
========================
Processing sample_01 (E.coli)...
Completed sample_01
Processing sample_02 (S.aureus)...
Completed sample_02
Processing sample_03 (P.aeruginosa)...
Completed sample_03
All samples processed!
1.3. Problem #1: It's Slow¶
Time your script:
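Using the shell's built-in time (assuming the script from the previous step):

time ./process_one.sh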
Notice something? The samples process one at a time. If each FastQC takes 5 minutes:
- Sample 1: 5 minutes
- Sample 2: 5 minutes (waits for sample 1)
- Sample 3: 5 minutes (waits for sample 2)
- Total: 15 minutes
But these samples are completely independent! They could run simultaneously.
Think about it: With 20 samples, you're looking at 100 minutes of sequential waiting. On a machine with 16 CPUs, you're using only 2 CPUs at a time. The other 14 sit idle.
You could parallelize this with & backgrounding or GNU parallel, but that adds complexity. Let's see this problem in action first, then we'll return to it.
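For comparison, a sketch of what that manual parallelization might look like - it works, but you now own CPU oversubscription, interleaved logs, and per-sample error handling yourself:

# Launch FastQC for every sample in the background, then wait for all of them.
# Process substitution keeps the loop in the current shell so wait can see the jobs.
while IFS=',' read -r sample_id organism read1 read2; do
    fastqc --outdir results/fastqc "${read1}" "${read2}" &
done < <(tail -n +2 data/samples.csv)
wait  # block until every background FastQC job has finished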
1.4. Add More Steps: The Full Pipeline¶
Now let's add the remaining steps. Modify your script to:
- Run FastQC
- Trim adapters with fastp
- Assemble with SPAdes
- Assess quality with QUAST
Extend your script to include all four pipeline steps for each sample.
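As a sketch, the body of the per-sample loop might grow to something like this (output directory names are illustrative; note the hardcoded thread counts, which become Problem #2 below):

# Inside the per-sample loop:
mkdir -p results/fastqc results/trimmed results/assemblies results/quast

# 1. Quality control of the raw reads
fastqc --outdir results/fastqc "${read1}" "${read2}"

# 2. Adapter trimming and filtering
fastp \
    -i "${read1}" -I "${read2}" \
    -o "results/trimmed/${sample_id}_R1.fastq.gz" \
    -O "results/trimmed/${sample_id}_R2.fastq.gz" \
    --json "results/trimmed/${sample_id}.json" \
    --html "results/trimmed/${sample_id}.html" \
    --thread 4

# 3. Genome assembly
spades.py \
    -1 "results/trimmed/${sample_id}_R1.fastq.gz" \
    -2 "results/trimmed/${sample_id}_R2.fastq.gz" \
    -o "results/assemblies/${sample_id}" \
    --threads 8

# 4. Assembly quality assessment
quast.py \
    -o "results/quast/${sample_id}" \
    --threads 4 \
    "results/assemblies/${sample_id}/contigs.fasta"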
1.5. Problem #2: Resource Conflicts¶
Look at the thread allocations:
- fastp --thread 4
- spades.py --threads 8
- quast.py --threads 4
These are hardcoded. What if:
- You run this on your laptop with only 4 cores? SPAdes will fail or thrash
- You run this on an HPC node with 64 cores? You're wasting 56 cores
- You run this on two different machines simultaneously? They'll compete for resources
The problem: Resource requirements are baked into your code, not adapted to the environment.
1.6. Problem #3: The Crash¶
You start processing your samples. Everything is going well...
Processing sample_01 (E.coli)...
-----------------------------------
Running FastQC...
Running fastp...
Running SPAdes assembly...
Running QUAST...
Completed sample_01
Processing sample_02 (S.aureus)...
-----------------------------------
Running FastQC...
Running fastp...
Running SPAdes assembly...
Then disaster strikes at 2 AM:
You've lost all the work for sample_02 and any samples that would have come after it. Your options:
- Re-run everything - Waste all completed work (sample_01 runs again unnecessarily)
- Manually edit the CSV - Remove completed samples, re-run script (error-prone)
- Build checkpoint logic - Track what completed, skip finished work (hours of additional coding)
None of these are good options.
1.7. Problem #4: Environment Dependencies¶
Your script assumes that fastqc, fastp, spades.py, and quast.py are:
- Installed on the system
- In your PATH
- The correct versions
- Compatible with each other
Try to run your script on a colleague's machine:
$ ./process_complete.sh
Processing sample_01 (E.coli)...
Running FastQC...
bash: fastqc: command not found
Or worse - it runs with different versions:
# Your machine: fastp version 0.23.4
# Colleague's machine: fastp version 0.20.0
# Different results, different outputs, not reproducible!
The problem: Your script is environment-dependent. It won't run the same way everywhere.
1.8. Problem #5: No Visibility¶
Your script is running. Questions you can't easily answer:
- How much memory is SPAdes actually using?
- Which step is the bottleneck?
- How long did sample_02 take compared to sample_01?
- What were the exact parameters used for sample_03?
- If I need to optimize, where should I focus?
You'd need to manually add:
- Logging at every step
- Timestamp tracking
- Resource monitoring
- Parameter recording
That's a lot of boilerplate code that has nothing to do with your science.
1.9. Taking Stock¶
Let's summarize what we've experienced:
| Problem | Impact | Your Options |
|---|---|---|
| Sequential processing | 3× slower than necessary | Add complex parallelization logic |
| Hardcoded resources | Fails on some machines, wastes resources on others | Add environment detection, make configurable |
| No resume capability | Re-run everything after failures | Build checkpoint tracking system |
| Environment dependencies | Not reproducible, won't run elsewhere | Create conda environments, document meticulously |
| No provenance | Can't optimize, hard to debug | Add extensive logging code |
You could solve all of these. And after weeks of work, you'd have built... a workflow management system.
This is why Nextflow exists.
2. Enter Nextflow: Solving the Problems¶
2.1. The Nextflow Equivalent¶
Let's see how the same pipeline looks in Nextflow. We'll start with the main workflow file:
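Here is a sketch of what main.nf might look like - it mirrors the modules you will build step by step in section 4, with a QUAST module assumed to follow the same pattern:

#!/usr/bin/env nextflow

include { FASTQC } from './modules/fastqc'
include { FASTP  } from './modules/fastp'
include { SPADES } from './modules/spades'
include { QUAST  } from './modules/quast'   // assumed to exist alongside the others

params.samples = 'data/samples.csv'
params.outdir  = 'results'

workflow {
    // Build a channel of [meta, reads] tuples from the sample sheet
    ch_samples = Channel
        .fromPath(params.samples)
        .splitCsv(header: true)
        .map { row ->
            def meta  = [id: row.sample_id, organism: row.organism]
            def reads = [file(row.read1), file(row.read2)]
            return [meta, reads]
        }

    FASTQC(ch_samples)            // QC on the raw reads
    FASTP(ch_samples)             // adapter trimming
    SPADES(FASTP.out.reads)       // assembly of the trimmed reads
    QUAST(SPADES.out.assembly)    // assembly quality assessment
}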
And one of the process definitions:
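Here is FASTP, for example - the same definition reappears in the exercises later on:

process FASTP {
    tag "$meta.id"
    container 'biocontainers/fastp:0.23.4'
    publishDir "${params.outdir}/trimmed", mode: 'copy'
    cpus 4
    memory '8.GB'

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("*_R{1,2}.fastq.gz"), emit: reads
    path "*.{json,html}", emit: reports

    script:
    """
    fastp \\
        -i ${reads[0]} -I ${reads[1]} \\
        -o ${meta.id}_R1.fastq.gz -O ${meta.id}_R2.fastq.gz \\
        --json ${meta.id}.json --html ${meta.id}.html \\
        --thread $task.cpus
    """
}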
Let's run it:
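nextflow run main.nf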
2.2. ✅ Solution #1: Automatic Parallelization¶
Watch the execution:
N E X T F L O W ~ version 24.10.0
Launching `main.nf` [friendly_darwin] DSL2 - revision: a1b2c3d4e5
executor > local (12)
[3a/4b5c6d] FASTQC (sample_01) [100%] 3 of 3 ✔
[8e/9f0a1b] FASTP (sample_01)  [100%] 3 of 3 ✔
[2c/3d4e5f] SPADES (sample_01) [100%] 3 of 3 ✔
[7g/8h9i0j] QUAST (sample_01)  [100%] 3 of 3 ✔
Completed at: 11-Oct-2025 14:23:45
Duration : 28m 15s
CPU hours : 2.5
Succeeded : 12
Notice the difference:
- Bash script: 90 minutes (sequential: 30 min × 3 samples)
- Nextflow: 28 minutes (parallel where possible)
What happened? Nextflow analyzed the dependencies:
- All 3 FASTQC tasks are independent → Run simultaneously
- All 3 FASTP tasks depend only on input data → Run simultaneously
- Each SPADES depends on its FASTP output → Run when ready
- Each QUAST depends on its SPADES output → Run when ready
You wrote zero parallelization code. Nextflow figured it out from the data dependencies.
2.3. ✅ Solution #2: Declarative Resource Requirements¶
Look at the process definition again:
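The relevant lines are the resource directives - for FASTP, for example:

process FASTP {
    cpus 4
    memory '8.GB'
    // ...
}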
These aren't hardcoded values - they're requirements. Nextflow uses them differently based on where you run:
On your laptop (4 cores, 16 GB RAM):
- Schedules tasks so they don't exceed available resources
- Might run 1 FASTP task at a time (needs 4 CPUs)
- Queues remaining tasks until resources free up
On HPC cluster (64 cores, 256 GB RAM):
- Generates appropriate SLURM job submission script
- Requests 4 CPUs and 8 GB per task
- Can run 16 FASTP tasks in parallel
On cloud (elastic scaling):
- Provisions appropriately-sized instances
- Scales up when needed, down when done
- You pay only for what you use
Same code. Different environments. Automatic adaptation.
2.4. ✅ Solution #3: Built-in Resume Capability¶
Remember the crash scenario? Let's see it with Nextflow:
N E X T F L O W ~ version 24.10.0
executor > local (9)
[3a/4b5c6d] FASTQC (sample_01) [100%] 3 of 3 ✔
[8e/9f0a1b] FASTP (sample_01)  [100%] 3 of 3 ✔
[2c/3d4e5f] SPADES (sample_01) [100%] 2 of 3 ✔
[2c/a1b2c3] SPADES (sample_02) [  0%] 0 of 1, failed: 1 ✘
[7g/8h9i0j] QUAST (sample_01)  [100%] 2 of 2 ✔
Error executing process > 'SPADES (sample_02)'
Key observations:
- Samples 01 and 03 completed successfully
- Only sample 02 failed
- Downstream work (QUAST) continued for successful samples
Fix the problem (increase memory):
process SPADES {
tag "$meta.id"
memory '32.GB' // Increased from 16.GB
// ... rest of process
}
Resume from where it failed:
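nextflow run main.nf -resume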
N E X T F L O W ~ version 24.10.0
executor > local (2)
[3a/4b5c6d] FASTQC (sample_01) [100%] 3 of 3, cached: 3 ✔
[8e/9f0a1b] FASTP (sample_01)  [100%] 3 of 3, cached: 3 ✔
[2c/3d4e5f] SPADES (sample_01) [100%] 3 of 3, cached: 2 ✔
[7g/8h9i0j] QUAST (sample_01)  [100%] 3 of 3, cached: 2 ✔
Completed at: 11-Oct-2025 14:45:12
Duration : 32m 5s
Only SPADES(sample_02) and QUAST(sample_02) re-ran. Everything else used cached results.
Zero checkpoint logic required. Nextflow tracks this automatically using content-based hashing.
2.5. ✅ Solution #4: Container-Based Software Isolation¶
Look at the process definition:
process FASTP {
container 'biocontainers/fastp:0.23.4'
// ... rest of process
}
What Nextflow does automatically:
- Downloads the container image (first run only)
- Executes the task inside the container
- Ensures exact version 0.23.4 is used
- Prevents conflicts with other tools
Running on a colleague's machine:
Pulling biocontainers/fastp:0.23.4 ... done
Pulling biocontainers/spades:3.15.5 ... done
Pulling biocontainers/quast:5.2.0 ... done
[... pipeline runs identically ...]
Zero installation steps. No PATH configuration. No version conflicts. Perfect reproducibility.
2.6. ✅ Solution #5: Comprehensive Provenance¶
Generate execution reports:
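nextflow run main.nf -with-report -with-timeline -with-trace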
Execution Report (report.html):
- Task-level resource usage (CPU, memory, time)
- Success/failure status for each task
- Resource efficiency metrics
- Total cost estimates (for cloud)
Timeline (timeline.html):
- Visual timeline of when each task ran
- Shows parallelization patterns
- Identifies bottlenecks
- Reveals idle time
Trace file (trace.txt):
task_id,hash,process,status,exit,duration,realtime,%cpu,%mem,peak_rss,peak_vmem
1,3a/4b5c6d,FASTQC (sample_01),COMPLETED,0,1m 23s,1m 18s,98.5%,12.3%,1.8 GB,2.1 GB
2,8e/9f0a1b,FASTP (sample_01),COMPLETED,0,3m 45s,3m 40s,390.2%,8.7%,2.4 GB,3.1 GB
3,2c/3d4e5f,SPADES (sample_01),COMPLETED,0,28m 12s,27m 51s,785.3%,45.2%,12.1 GB,15.8 GB
Perfect for:
- Identifying which samples need more resources
- Optimizing resource allocations
- Cost analysis
- Publication methods sections
Zero logging code written. It's built in.
3. Side-by-Side Comparison¶
3.1. The Same Task, Two Approaches¶
Let's directly compare what we built:
Bash Script Approach:
- ❌ Sequential execution (90 minutes for 3 samples)
- ❌ Hardcoded resource requirements
- ❌ No resume capability
- ❌ Environment-dependent (requires manual software installation)
- ❌ No execution tracking
- ✅ Simple to understand initially
- ✅ No new tools to learn
Nextflow Approach:
- ✅ Automatic parallelization (28 minutes for 3 samples)
- ✅ Adaptive resource management
- ✅ Built-in resume capability
- ✅ Container-based software isolation
- ✅ Comprehensive provenance tracking
- ⚠️ Initial learning curve
- ✅ Same code works everywhere
3.2. Real-World Impact: The Numbers¶
Let's quantify the differences:
Time Investment:
| Task | Bash Script | Nextflow |
|---|---|---|
| Initial implementation | 30 minutes | 60 minutes |
| Add parallelization | +2 hours | Already included |
| Add resume logic | +3 hours | Already included |
| Add logging/tracking | +2 hours | Already included |
| Fix environment issues | +1 hour per machine | Works everywhere |
| Total development | 8.5 hours | 1 hour |
Execution Time (3 samples):
| Approach | Time | Speedup |
|---|---|---|
| Bash (sequential) | 90 minutes | 1× |
| Nextflow (parallel) | 28 minutes | 3.2× |
Execution Time (20 samples):
| Approach | Time | Speedup |
|---|---|---|
| Bash (sequential) | 10 hours | 1× |
| Nextflow (parallel) | 2 hours | 5× |
After a failure at sample 15:
| Approach | Time to Resume |
|---|---|
| Bash | 7.5 hours (re-run samples 1-15) |
| Nextflow | 30 minutes (only sample 15) |
3.3. The Breaking Point¶
When should you use workflow management instead of bash scripts?
Consider Nextflow when:
✅ You have more than 3-5 samples to process
✅ Your pipeline has multiple steps with dependencies
✅ You need to share your work with collaborators
✅ You run analyses on different computers/clusters
✅ Your analyses take more than 30 minutes
✅ You need to re-run analyses with different parameters
✅ You care about reproducibility
Bash scripts are fine when:
✅ One-off exploratory analysis you'll never repeat
✅ Single sample, single step
✅ Interactive work where you adjust based on output
✅ Extremely simple operations (< 5 minutes total)
The transition point: When you find yourself thinking "I wish I could..." about your bash script - parallelization, resuming, tracking - that's when to switch to Nextflow.
4. Hands-On: Converting Your Script¶
Now it's your turn. Let's convert part of the bash script to Nextflow.
Exercise 1: Create the FastQC Process¶
Take the FastQC step from the bash script and convert it to a Nextflow process.
Create modules/fastqc.nf with a process that:
- Takes sample metadata and read files as input
- Runs FastQC on the reads
- Outputs the HTML and ZIP reports
- Uses the biocontainers/fastqc:0.12.1 container
- Uses 2 CPUs and 2 GB memory
Use this template:
process FASTQC {
tag "$meta.id"
container 'biocontainers/fastqc:0.12.1'
publishDir "${params.outdir}/fastqc", mode: 'copy'
cpus 2
memory '2.GB'
input:
tuple val(meta), path(reads)
output:
tuple val(meta), path("*.html"), emit: html
tuple val(meta), path("*.zip"), emit: zip
script:
"""
fastqc -q -t $task.cpus ${reads}
"""
}
Key elements:
- tag "$meta.id": Labels output with sample ID
- container: Specifies exact software version
- publishDir: Where to save final results
- cpus / memory: Resource requirements (not hardcoded thread counts!)
- input: Takes metadata + read files as a tuple
- output: Declares what files are produced, with named outputs (emit)
- $task.cpus: Uses the declared CPU count (adapts to environment)
Exercise 2: Create the Main Workflow¶
Now create a workflow that reads the CSV and runs FastQC on all samples.
Create main.nf that:
- Includes the FASTQC process
- Reads data/samples.csv
- Creates a channel with [meta, reads] tuples
- Runs FASTQC on all samples
Use this template:
#!/usr/bin/env nextflow
include { FASTQC } from './modules/fastqc'
params.samples = 'data/samples.csv'
params.outdir = 'results'
workflow {
ch_samples = Channel
.fromPath(params.samples)
.splitCsv(header: true)
.map { row ->
// Create metadata map
// Create reads list
// Return tuple
}
FASTQC(ch_samples)
}
Solution:
#!/usr/bin/env nextflow
include { FASTQC } from './modules/fastqc'
params.samples = 'data/samples.csv'
params.outdir = 'results'
workflow {
ch_samples = Channel
.fromPath(params.samples)
.splitCsv(header: true)
.map { row ->
def meta = [id: row.sample_id, organism: row.organism]
def reads = [file(row.read1), file(row.read2)]
return [meta, reads]
}
FASTQC(ch_samples)
}
How it works:
- Channel.fromPath(): Creates channel from file
- .splitCsv(header: true): Parses CSV, skips header
- .map { row -> ... }: Transforms each row
- meta: Metadata map (sample ID, organism, etc.)
- reads: List of read file paths
- [meta, reads]: Tuple that matches process input
- FASTQC(ch_samples): Runs process on all items in channel
Exercise 3: Add the FASTP Process¶
Now add the trimming step.
- Create modules/fastp.nf (similar to FASTQC but with different outputs)
- Add it to the workflow after FASTQC
- Make sure it receives the input reads (hint: use the original ch_samples)
Solution - the new module, modules/fastp.nf:
process FASTP {
tag "$meta.id"
container 'biocontainers/fastp:0.23.4'
publishDir "${params.outdir}/trimmed", mode: 'copy'
cpus 4
memory '8.GB'
input:
tuple val(meta), path(reads)
output:
tuple val(meta), path("*_R{1,2}.fastq.gz"), emit: reads
path "*.{json,html}", emit: reports
script:
def prefix = meta.id
"""
fastp \\
-i ${reads[0]} \\
-I ${reads[1]} \\
-o ${prefix}_R1.fastq.gz \\
-O ${prefix}_R2.fastq.gz \\
--json ${prefix}.json \\
--html ${prefix}.html \\
--thread $task.cpus
"""
}
And the updated main.nf:
#!/usr/bin/env nextflow
include { FASTQC } from './modules/fastqc'
include { FASTP } from './modules/fastp'
params.samples = 'data/samples.csv'
params.outdir = 'results'
workflow {
ch_samples = Channel
.fromPath(params.samples)
.splitCsv(header: true)
.map { row ->
def meta = [id: row.sample_id, organism: row.organism]
def reads = [file(row.read1), file(row.read2)]
return [meta, reads]
}
FASTQC(ch_samples)
FASTP(ch_samples)
}
Note: Both FASTQC and FASTP receive ch_samples. They'll run in parallel since neither depends on the other.
Exercise 4: Chain Processes with Dependencies¶
Now add SPADES, which depends on FASTP output.
Create the SPADES process and connect it to receive trimmed reads from FASTP.
Hint: FASTP outputs emit: reads - you can reference this as FASTP.out.reads
Solution - the new module, modules/spades.nf:
process SPADES {
tag "$meta.id"
container 'biocontainers/spades:3.15.5'
publishDir "${params.outdir}/assemblies", mode: 'copy'
cpus 8
memory '16.GB'
time '6.h'
input:
tuple val(meta), path(reads)
output:
tuple val(meta), path("${meta.id}/contigs.fasta"), emit: assembly
script:
"""
spades.py \\
-1 ${reads[0]} \\
-2 ${reads[1]} \\
-o ${meta.id} \\
--threads $task.cpus \\
--memory ${task.memory.toGiga()}
"""
}
And the updated main.nf:
#!/usr/bin/env nextflow
include { FASTQC } from './modules/fastqc'
include { FASTP } from './modules/fastp'
include { SPADES } from './modules/spades'
params.samples = 'data/samples.csv'
params.outdir = 'results'
workflow {
ch_samples = Channel
.fromPath(params.samples)
.splitCsv(header: true)
.map { row ->
def meta = [id: row.sample_id, organism: row.organism]
def reads = [file(row.read1), file(row.read2)]
return [meta, reads]
}
FASTQC(ch_samples)
FASTP(ch_samples)
SPADES(FASTP.out.reads)
}
The dependency chain:
- ch_samples flows into FASTQC and FASTP (parallel)
- FASTP.out.reads flows into SPADES (sequential - SPADES waits for FASTP)
- Nextflow automatically handles the scheduling
5. Configuration: Flexibility Without Code Changes¶
5.1. Separating Configuration from Logic¶
One of Nextflow's powerful features: configuration lives in separate files.
Your workflow (main.nf) describes what to do.
Config files describe how to do it.
Create nextflow.config:
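A sketch of what it might contain - default parameters, a container engine, and per-process resources (the withName selectors are one reasonable way to organize this):

// nextflow.config
params {
    samples = 'data/samples.csv'
    outdir  = 'results'
}

docker.enabled = true

process {
    // Defaults for every process
    cpus   = 2
    memory = '4.GB'

    // Per-tool overrides
    withName: 'FASTP' {
        cpus   = 4
        memory = '8.GB'
    }
    withName: 'SPADES' {
        cpus   = 8
        memory = '16.GB'
        time   = '6.h'
    }
}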
Now you can run with just:
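nextflow run main.nf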
All settings come from the config file.
5.2. Environment-Specific Profiles¶
Add execution profiles for different environments:
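A sketch of a profiles block matching the environments shown below (the executor, queue, and region values are placeholders you would adapt):

// nextflow.config (continued)
profiles {
    standard {
        process.executor = 'local'
    }
    cluster {
        process.executor = 'slurm'
        process.queue    = 'general'          // placeholder queue name
    }
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'   // placeholder AWS Batch queue
        aws.region       = 'eu-west-1'        // placeholder region
    }
    test {
        params.samples = 'data/samples.csv'
        process.cpus   = 2
        process.memory = '4.GB'
    }
}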
Switch environments with one flag:
# Local execution
nextflow run main.nf -profile standard
# HPC cluster
nextflow run main.nf -profile cluster
# AWS cloud
nextflow run main.nf -profile cloud
# Quick test
nextflow run main.nf -profile test
The workflow code never changes.
6. Conclusion: The Paradigm Shift¶
6.1. What We've Learned¶
We started by building a bash script and experienced its limitations:
- ⏱️ Sequential processing - Samples waited unnecessarily
- 🖥️ Hardcoded resources - Failed on different machines
- 💥 No resume capability - Lost hours of work to failures
- 📦 Environment dependencies - "Works on my machine" problems
- 📊 No provenance - Couldn't optimize or debug effectively
Then we saw how Nextflow solves each problem:
- ✅ Automatic parallelization - 3× faster with zero code changes
- ✅ Adaptive resources - Same code works everywhere
- ✅ Built-in caching - Resume from failures instantly
- ✅ Container isolation - Perfect reproducibility
- ✅ Comprehensive tracking - Complete provenance built-in
6.2. When to Make the Switch¶
You should use Nextflow when:
- Processing 3+ samples in a multi-step pipeline
- Sharing analyses with collaborators
- Running on different computers/clusters
- Analyses take > 30 minutes
- You need reproducibility
Bash scripts are fine for:
- Quick one-off explorations
- Single-sample, single-step operations
- Interactive work with immediate feedback
- Very simple tasks (< 5 minutes)
6.3. The Investment¶
Learning Nextflow:
- 1-2 days to become productive
- 1 week to feel comfortable
- After 2-3 workflows, you'll prefer it by default
Return on investment:
- Hours saved per analysis (scaling with sample count)
- Near-perfect reproducibility
- Seamless collaboration
- Infinite code reuse
6.4. Next Steps¶
Continue learning:
- Groovy Essentials - Data manipulation and advanced patterns
- nf-core - Community best practices and ready-made pipelines
- nf-test - Testing your workflows
- Nextflow Patterns - Common workflow patterns
Join the community:
- Nextflow Slack - Active community support
- nf-core Slack - Pipeline-specific help
- Nextflow Training - Official workshops
6.5. The Mindset Shift¶
Moving from scripts to workflows isn't just about using a new tool - it's a new way of thinking:
Script thinking:
"How do I run these commands in sequence on my data?"
Workflow thinking:
"What are the data dependencies? How will this scale? How can others reproduce this?"
Once you start thinking in terms of processes, channels, and data flow, you'll wonder how you ever managed with bash scripts.
Welcome to workflow management. Your future self will thank you.
Quick Reference¶
Nextflow vs Bash: Command Comparison¶
| Task | Bash | Nextflow |
|---|---|---|
| Process one sample | fastqc file.fq | FASTQC(ch_sample) |
| Process multiple samples | for f in *.fq; do fastqc $f; done | FASTQC(ch_samples) (automatic) |
| Chain commands | cmd1 \| cmd2 \| cmd3 | CMD1(input); CMD2(CMD1.out); CMD3(CMD2.out) |
| Parallel execution | cmd & or GNU parallel | Automatic based on dependencies |
| Resume after failure | Manual checkpoints | -resume flag |
| Specify resources | Hardcoded threads | cpus = X, memory = 'Y.GB' |
| Software environment | module load or PATH | container = 'image:version' |
Common Nextflow Commands¶
# Run workflow
nextflow run main.nf
# Resume from cached results
nextflow run main.nf -resume
# Override parameters
nextflow run main.nf --samples my_data.csv --outdir my_results
# Use specific profile
nextflow run main.nf -profile cluster
# Generate reports
nextflow run main.nf -with-report -with-timeline -with-trace
# Clean up work directory
nextflow clean -f
# List previous runs
nextflow log
# View run details
nextflow log <run_name> -f script,workdir,hash,duration
Process Template¶
process PROCESS_NAME {
tag "$meta.id" // Label for logs
container 'image:version' // Software environment
publishDir "${params.outdir}/dir" // Where to save outputs
cpus 4 // CPU requirement
memory '8.GB' // Memory requirement
time '2.h' // Time limit
input:
tuple val(meta), path(files) // Input declaration
output:
tuple val(meta), path("*.out"), emit: results
script:
"""
command --input $files --output output.out --threads $task.cpus
"""
}