Gene Download
10 minute read
You’re absolutely on the right track with BioPython! It’s the go-to library for interacting with biological databases like NCBI programmatically in Python. Specifically, you’ll be using the Bio.Entrez module.
Here’s a breakdown of how to download a gene sequence from NCBI using BioPython, along with explanations and best practices:
Key BioPython Bio.Entrez Functions
-
Entrez.email: Crucial! You must set your email address before making any requests to NCBI. This helps NCBI understand who is using their E-utilities and allows them to contact you if there are issues. Without it, your requests might be blocked.
-
Entrez.api_key (Optional but Recommended): You can obtain an API key from your NCBI account. This allows you to make more requests per second (10 requests/second with a key vs. 3 requests/second without). For serious or frequent programmatic access, an API key is highly recommended to avoid hitting rate limits.
-
Entrez.esearch(): This function performs a search on an NCBI database (e.g., “nucleotide,” “gene,” “protein,” “pubmed”) and returns a list of matching IDs. Entrez.efetch(): Once you have the IDs, this function retrieves the full record(s) from the specified database in a desired format (e.g., FASTA, GenBank).
-
Entrez.read(): This function parses the XML output returned by esearch or efetch (when retmode=“xml”) into Python dictionaries and lists.
-
SeqIO.read() / SeqIO.parse(): From Bio.SeqIO, these functions are used to parse sequence files (or “handles” from efetch) into SeqRecord objects, which are very convenient for working with sequence data.
Step-by-Step Example: Downloading a Gene
Let’s say you want to download the sequence for the E. coli lacZ gene.
from Bio import Entrez
from Bio import SeqIO
import time # To respect NCBI's rate limits
# --- 1. Set your email and API key (if you have one) ---
# IMPORTANT: Replace with your actual email address
Entrez.email = "wjxa20152015@gmail.com"
# Optional: If you have an NCBI API key, uncomment and replace
# Entrez.api_key = "YOUR_NCBI_API_KEY"
# --- 2. Define your search query ---
# We'll search the 'gene' database first to get the gene ID,
# then use that to search the 'nucleotide' database for the sequence.
organism = "Escherichia coli"
gene_name = "lacZ"
search_term = f"{organism}[Organism] AND {gene_name}[Gene]"
print(f"Searching for gene: {search_term}")
# --- 3. Search for the gene ID in the 'gene' database ---
try:
handle = Entrez.esearch(db="gene", term=search_term, retmax="1") # retmax=1 to get the most relevant one
record = Entrez.read(handle)
handle.close()
gene_ids = record["IdList"]
if not gene_ids:
print(f"No gene ID found for '{search_term}'. Please check your search term.")
else:
ncbi_gene_id = gene_ids[0]
print(f"Found NCBI Gene ID: {ncbi_gene_id}")
# --- 4. Fetch the nucleotide sequence using the gene ID ---
# We need to specify the 'nucleotide' database for sequence data.
# rettype='fasta' for FASTA format, 'gb' for GenBank format.
# retmode='text' for plain text output.
print(f"Fetching nucleotide sequence for Gene ID: {ncbi_gene_id}")
# Pause to respect NCBI's rate limits (especially if not using API key)
time.sleep(0.5)
# For gene records, fetching the nucleotide sequence often involves searching
# the nucleotide database with the gene's locus tag or gene name in that context.
# A more robust way is to use the gene ID to find associated nucleotide records.
# Let's search the nucleotide database for the gene ID directly, or its accession.
# For simplicity, we'll try searching by gene name and organism again in 'nucleotide'
# or use the gene ID to link to a nucleotide entry if possible.
# Often, gene entries link to specific RefSeq accessions.
# A common approach: Use the Gene ID to find associated nucleotide records
# This can be tricky because a gene entry might link to a whole genome, not just the gene sequence.
# Let's try searching for the gene name and organism directly in the nucleotide database.
# This is often more straightforward for getting the actual gene sequence.
nucleotide_search_term = f"{organism}[Organism] AND {gene_name}[Gene] AND refseq[Filter]"
handle_nuc = Entrez.esearch(db="nucleotide", term=nucleotide_search_term, retmax="5")
nuc_record = Entrez.read(handle_nuc)
handle_nuc.close()
nucleotide_ids = nuc_record["IdList"]
if not nucleotide_ids:
print(f"No nucleotide sequence found for '{nucleotide_search_term}'. Trying broader search.")
# Fallback: if RefSeq filter didn't work, try without it
handle_nuc_broad = Entrez.esearch(db="nucleotide", term=f"{organism}[Organism] AND {gene_name}[Gene]", retmax="5")
nuc_record_broad = Entrez.read(handle_nuc_broad)
handle_nuc_broad.close()
nucleotide_ids = nuc_record_broad["IdList"]
if nucleotide_ids:
# Let's just take the first result for demonstration
target_nucleotide_id = nucleotide_ids[0]
print(f"Found Nucleotide ID: {target_nucleotide_id}")
# Fetch the sequence in FASTA format
time.sleep(0.5) # Respect rate limits
handle_fetch = Entrez.efetch(db="nucleotide", id=target_nucleotide_id, rettype="fasta", retmode="text")
fasta_sequence = handle_fetch.read()
handle_fetch.close()
print("\n--- FASTA Sequence ---")
print(fasta_sequence[:100] + '...')
# Optionally, parse it with SeqIO to get a SeqRecord object
from io import StringIO
seq_record = SeqIO.read(StringIO(fasta_sequence), "fasta")
print("\nSequence ID:", seq_record.id)
print("Sequence Length:", len(seq_record.seq))
print("First 50 bases:", seq_record.seq[:50])
# You can also save it to a file
output_filename = f"{seq_record.id}_{gene_name}.fasta"
with open(output_filename, "w") as out_handle:
out_handle.write(fasta_sequence)
print(f"\nSequence saved to {output_filename}")
# If you want GenBank format:
time.sleep(0.5) # Respect rate limits
handle_fetch_gb = Entrez.efetch(db="nucleotide", id=target_nucleotide_id, rettype="gb", retmode="text")
genbank_record = handle_fetch_gb.read()
handle_fetch_gb.close()
print("\n--- GenBank Record (first 500 chars) ---")
print(genbank_record[:500]) # Print first 500 characters
# You can parse the GenBank record with SeqIO as well:
# gb_seq_record = SeqIO.read(StringIO(genbank_record), "genbank")
# print("GenBank features:", gb_seq_record.features)
output_genbank_filename = f"{seq_record.id}_{gene_name}.gb"
with open(output_genbank_filename, "w") as out_handle_gb:
out_handle_gb.write(genbank_record)
print(f"GenBank record saved to {output_genbank_filename}")
else:
print("Could not find any nucleotide sequences for the specified gene.")
except Exception as e:
print(f"An error occurred: {e}")
Searching for gene: Escherichia coli[Organism] AND lacZ[Gene]
Found NCBI Gene ID: 945006
Fetching nucleotide sequence for Gene ID: 945006
Found Nucleotide ID: 2195797308
--- FASTA Sequence ---
>NZ_CAKNDS010000021.1 Escherichia coli strain cpe078, whole genome shotgun sequence
TGGCGGCTAATGGCAA...
Sequence ID: NZ_CAKNDS010000021.1
Sequence Length: 20027
First 50 bases: TGGCGGCTAATGGCAAGAAAGGAAAGGTGATTCTCGGCGCGATGATGCGC
Sequence saved to NZ_CAKNDS010000021.1_lacZ.fasta
--- GenBank Record (first 500 chars) ---
LOCUS NZ_CAKNDS010000021 20027 bp DNA linear CON 21-MAY-2025
DEFINITION Escherichia coli strain cpe078, whole genome shotgun sequence.
ACCESSION NZ_CAKNDS010000021 NZ_CAKNDS010000000
VERSION NZ_CAKNDS010000021.1
DBLINK BioProject: PRJNA224116
BioSample: SAMEA6451093
Assembly: GCF_929618365.1
KEYWORDS WGS; RefSeq.
SOURCE Escherichia coli
ORGANISM Escherichia coli
Bacteria; Pseudomonadati; Pseudomonadota; Gammaproteobact
GenBank record saved to NZ_CAKNDS010000021.1_lacZ.gb
from pathlib import Path
file = list(Path('.').glob("*.fasta"))[0]
content = file.read_text(encoding='utf8')
print(content[:100] + '...')
>NZ_CAKNDS010000021.1 Escherichia coli strain cpe078, whole genome shotgun sequence
TGGCGGCTAATGGCAA...
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import SeqIO
from io import StringIO
import time
# --- 1. Get your query sequence ---
# For demonstration, let's assume you have a FASTA string of your downloaded gene.
# In a real scenario, you would have loaded this from your downloaded .fasta file.
# Let's use the lacZ gene from the previous example for consistency.
# Replace this with your actual gene sequence (e.g., from reading the downloaded .fasta file)
lacZ_fasta_string = content
# For a real file:
# with open("your_gene.fasta", "r") as f:
# query_sequence_fasta = f.read()
# --- 2. Perform the BLAST search ---
print("Performing BLAST search... This may take a moment.")
# Example for nucleotide sequence (blastn)
# For protein sequences, you'd use 'blastp' or 'blastx' (nucleotide to protein)
# db: The database to search against (e.g., 'nr' for non-redundant, 'nt' for nucleotide, 'refseq_rna', 'refseq_genomic')
# You can find available databases on the NCBI BLAST website.
# Use a common database like 'nr' (nucleotide non-redundant) or 'nt' (nucleotide) for general searches.
# filter: 'L' for low complexity filter (recommended for most searches to avoid spurious hits)
try:
result_handle = NCBIWWW.qblast(
program="blastn", # or 'blastp', 'blastx', 'tblastn', 'tblastx'
database="nr", # or 'nt', 'refseq_rna', 'refseq_genomic', etc.
sequence=lacZ_fasta_string,
# Set a few more parameters if needed
# gapopen=11,
# gapextend=1,
# expect=10.0, # Expectation value (E-value) cutoff
# hitlist_size=10, # Number of descriptions and alignments to show
# format_type="XML" # The default format for NCBIXML parsing
)
# Pause to respect NCBI's rate limits (especially after a long query)
time.sleep(5) # A longer pause is often good after submitting a query
except Exception as e:
print(f"An error occurred during BLAST query: {e}")
print("Consider adding Entrez.email and Entrez.api_key if you haven't already.")
exit()
# --- 3. Parse the BLAST results ---
# The results are returned in XML format, which BioPython's NCBIXML can parse.
print("Parsing BLAST results...")
blast_records = NCBIXML.parse(result_handle) # parse() for multiple query sequences
# For a single query, you can use read():
# blast_record = NCBIXML.read(result_handle)
# Iterate through the results (there might be multiple records if you queried multiple sequences)
n = 0
for blast_record in blast_records:
print(f"\nQuery: {blast_record.query}")
print(f"Database: {blast_record.database}")
print(f"Number of hits found: {len(blast_record.alignments)}")
if not blast_record.alignments:
print("No significant alignments found.")
continue
# Iterate through the alignments (hits)
for alignment in blast_record.alignments:
for hsp in alignment.hsps: # HSP = High-scoring Segment Pair
if hsp.expect < 1e-10: # Filter by E-value (lower is better, closer to 0)
n +=1
if n >= 10: # first 10 results
break
print(f"\n Alignment: {alignment.title}")
print(f" Length: {alignment.length}")
print(f" E-value: {hsp.expect}")
print(f" Score: {hsp.score}")
print(f" Identities: {hsp.identities}/{hsp.align_length} ({hsp.positives} positives)")
print(f" Query start: {hsp.query_start}, end: {hsp.query_end}")
print(f" Subject start: {hsp.sbjct_start}, end: {hsp.sbjct_end}")
# You can print the actual alignment if needed
# print(f" Query: {hsp.query[0:75]}...")
# print(f" Match: {hsp.match[0:75]}...")
# print(f" Sbjct: {hsp.sbjct[0:75]}...")
break # Only process the first query for this example
result_handle.close()
print("\nBLAST search complete.")
Performing BLAST search... This may take a moment.
Parsing BLAST results...
Query: NZ_CAKNDS010000021.1 Escherichia coli strain cpe078, whole genome shotgun sequence
Database: core_nt
Number of hits found: 50
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 0.0
Score: 40054.0
Identities: 20027/20027 (20027 positives)
Query start: 1, end: 20027
Subject start: 3547756, end: 3567782
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 1.66533e-119
Score: 497.0
Identities: 280/298 (280 positives)
Query start: 1, end: 292
Subject start: 1600979, end: 1600682
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 7.08116e-118
Score: 490.0
Identities: 274/292 (274 positives)
Query start: 1, end: 292
Subject start: 3567840, end: 3567555
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 1.05094e-115
Score: 482.0
Identities: 241/241 (241 positives)
Query start: 1, end: 241
Subject start: 3919623, end: 3919863
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 3.21768e-103
Score: 437.0
Identities: 261/289 (261 positives)
Query start: 19758, end: 20027
Subject start: 1600633, end: 1600921
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 2.16997e-86
Score: 374.0
Identities: 216/234 (216 positives)
Query start: 19800, end: 20027
Subject start: 3548047, end: 3547814
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 1.3694e-82
Score: 360.0
Identities: 192/200 (192 positives)
Query start: 16847, end: 17046
Subject start: 3564402, end: 3564601
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 1.3694e-82
Score: 360.0
Identities: 192/200 (192 positives)
Query start: 16647, end: 16846
Subject start: 3564602, end: 3564801
Alignment: gi|1915194479|dbj|AP022409.1| Escherichia coli STW0522-31 DNA, complete genome
Length: 4783805
E-value: 4.77968e-82
Score: 358.0
Identities: 182/184 (182 positives)
Query start: 19844, end: 20027
Subject start: 3919864, end: 3919681
BLAST search complete.
from Bio import SeqIO
from dna_features_viewer import BiopythonTranslator
import matplotlib.pyplot as plt
# --- 1. Load the GenBank file into a Biopython SeqRecord object ---
# Replace 'your_gene.gb' with the actual path to your downloaded GenBank file
try:
record = SeqIO.read("NZ_CAKNDS010000021.1_lacZ.gb", "genbank")
print(f"Successfully loaded GenBank file: {record.id} ({len(record.seq)} bp)")
except FileNotFoundError:
print("Error: 'NZ_CAKNDS010000021.1_lacZ.gb' not found. Please make sure the file exists.")
# You might want to add code here to re-download the file if it's missing.
exit()
except Exception as e:
print(f"Error loading GenBank file: {e}")
exit()
# --- 2. Translate the Biopython record into a DNA Features Viewer GraphicRecord ---
# BiopythonTranslator automatically converts GenBank features into GraphicFeature objects.
graphic_record = BiopythonTranslator().translate_record(record)
# --- 3. Plot the graphic record ---
# This will generate a linear plot of your sequence with its features.
# 'figure_width' controls the width of the plot in inches.
ax, _ = graphic_record.plot(figure_width=10)
# --- 4. Add a title to your plot (optional) ---
ax.set_title(f"Features of {record.id}")
# --- 5. Display or save the plot ---
plt.tight_layout() # Adjusts plot to prevent labels from overlapping
plt.show() # Displays the plot
# To save the plot to a file instead of displaying:
# plt.savefig("gene_features_linear.png", dpi=300)
# plt.close() # Close the figure to free up memory
Successfully loaded GenBank file: NZ_CAKNDS010000021.1 (20027 bp)