The steps involved in reproducing data can be unclear.
Purpose
To elaborate on the objective stated at the top of this document, I seek to supplement DNA Zoo's report with an accessible data analysis (DA) pipeline. To accomplish this, I independently reproduce the original article's table while documenting every data-processing step. Although there is nothing wrong with the original works, things can always be taken further (DNA Zoo; Little 17-Year Cicada, 2023).
This might be of interest to the original authors of the article. More generally, the spirit of this work could transfer to other domains of data-intensive research and analytics.
Source
All data used within this report were freely available from a public database hosted by DNA Zoo (Dnazoo.s3).
This constitutes the data extraction phase of the DA pipeline.
1.1 Create Project Directory
The following function, used further below, establishes the project's DA directory structure.
```python
# Automates creation of the DA pipeline directory needed for this project,
# to avoid confusion for those replicating this project, if desired.
import os

def directorize(base_path, structure):
    # Nested directories
    for dir_name, subdirs in structure.items():
        dir_path = os.path.join(base_path, dir_name)
        os.makedirs(dir_path, exist_ok=True)
        for subdir in subdirs:
            subdir_path = os.path.join(dir_path, subdir)
            os.makedirs(subdir_path, exist_ok=True)
```
If the directory folders already exist, re-running this code during render causes no issues, since os.makedirs is called with exist_ok=True.
```python
# Define the directory structure
structure = {
    "01_Extract/": ["data/", "scripts/"],
    "02_Transform/": ["data/", "scripts/"],
    "03_Load/": ["data/", "scripts/"],
    "04_Present/": ["data/", "scripts/"]
}

# Create the analysis folder structure in a preferred base directory
# "" = project's working directory
directorize("", structure)
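As a quick sanity check (not part of the original pipeline), the freshly created tree can be walked and printed:

```python
import os

# List the four pipeline stages and their subfolders
for stage in ("01_Extract", "02_Transform", "03_Load", "04_Present"):
    for root, dirs, files in os.walk(stage):
        print(root)
```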
1.2 Download to Local Machine
This step imports all scripts and data needed for the project from the web, to make replicating this project easier. If you want the exact verbatim output, delete the cloned folder structure (01_Extract through 04_Present) and run this code. Otherwise, it is fine to re-run it with the scripts already in place.
```python
# ---- Import data from the web with wget
import os
import wget

def importer(fileMap):
    # Download from URL to path and notify when complete
    for url, file_path in fileMap.items():
        # Check file existence first
        if not os.path.exists(file_path):
            wget.download(url, file_path)
            print(f"{file_path} written")
        else:
            print(f"{file_path} already exists.")
```
Run importer(), passing in a dictionary that maps URL keys to their local file path destination values.
```python
# Set the url
fasta_URL = "https://dnazoo.s3.wasabisys.com/Magicicada_septendecula/magicicada_hifiasm.asm.bp.p_ctg_HiC.fasta.gz"

# Set the desired local file path
fasta_PATH = "01_Extract/data/magicicada.fasta.gz"

#decompress_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/02_Transform/scripts/decompress.py"
#decompress_PATH = "02_Transform/scripts/decompress.py"

#asmblydict_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/02_Transform/scripts/assemblyDictionary.py"
#asmblydict_PATH = "02_Transform/scripts/assemblyDictionary.py"

#asmblyframe_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/02_Transform/scripts/assemblyFramer.py"
#asmblyframe_PATH = "02_Transform/scripts/assemblyFramer.py"

#strint_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/03_Load/scripts/strint.py"
#strint_PATH = "03_Load/scripts/strint.py"

#formaframe_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/04_Present/scripts/formaFrame.py"
#formaframe_PATH = "04_Present/scripts/formaFrame.py"

#extractIgnore_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/01_Extract/data/.gitignore"
#extractIgnore_PATH = "01_Extract/data/.gitignore"

#transformIgnore_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/02_Transform/data/.gitignore"
#transformIgnore_PATH = "02_Transform/data/.gitignore"

# Map the url to the file path
fileMap = {
    fasta_URL: fasta_PATH,
#    decompress_URL: decompress_PATH,
#    asmblydict_URL: asmblydict_PATH,
#    asmblyframe_URL: asmblyframe_PATH,
#    strint_URL: strint_PATH,
#    formaframe_URL: formaframe_PATH,
#    extractIgnore_URL: extractIgnore_PATH,
#    transformIgnore_URL: transformIgnore_PATH
}

importer(fileMap)
```
Transform the compressed file to its decompressed form.
```python
# ---- Decompress the gz file with gzip
import os
import gzip
import shutil

def decompress(gzFasta, fasta):
    # If not decompressed, then decompress and redirect to a new file path
    if not os.path.exists(fasta):
        # File doesn't exist, so decompress
        with gzip.open(gzFasta, 'rb') as f_in:
            with open(fasta, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        print(f"{fasta} has been decompressed and written.")
    else:
        print(f"The file {fasta} already exists. Skipping unzip.")
```
Run decompress(), using the compressed path and the desired path to the decompressed file.
```python
# Set the compressed fasta.gz file variable
gzFasta = "01_Extract/data/magicicada.fasta.gz"
# Set the decompressed fasta file variable
fasta = "02_Transform/data/magicicada.fasta"
# Pass file paths to the function
decompress(gzFasta, fasta)
```
The file 02_Transform/data/magicicada.fasta already exists. Skipping unzip.
2.2 FASTA to Text, to DataFrame
The dataframe-formatting script is sourced here because of an issue between the knitr engine and Python in these parts; it appears to be a conflict when switching between dataframe and dictionary data types.
This chunk should be run locally rather than with quarto render. When working with the source file, change the code-chunk language specifier from {.bash} back to {bash}; you might have to add the {bash} tag back to the div entirely. I am not sure how else to accomplish this within my current Quarto project setup. (Trizna, 2020)
```bash
# BASH SCRIPT
# The uncompressed fasta file variable
fasta=02_Transform/data/magicicada.fasta
# The text file path variable generated by the script
summary_stats=02_Transform/data/summary_stats.txt

assembly_stats $fasta > $summary_stats
```
I need to do this for rendering to display the code-cell outputs as intended. However, you can re-enable the Bash code chunk and comment this workaround out if you want. Deactivating the Bash chunk for render lets everything else work without a hitch when going straight from an empty directory to a rendered document.
Instead, the render uses the file produced beforehand by running the assembly_stats Python program as a Bash command.
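If you would rather avoid the Bash chunk entirely, the same pre-render step could be wrapped in Python's subprocess module; this is a sketch of an alternative, not what the document itself does:

```python
import subprocess

# Equivalent to the Bash chunk above: run the assembly_stats CLI
# (Trizna, 2020) on the FASTA and redirect stdout to the stats file
fasta = "02_Transform/data/magicicada.fasta"
summary_stats = "02_Transform/data/summary_stats.txt"

with open(summary_stats, "w") as out:
    subprocess.run(["assembly_stats", fasta], stdout=out, check=True)
```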
Now, transform the text file into a Python dataframe. I am opting not to blanket-change data types, as the preferred output format could vary by user.
Source the assemblyFramer.py script.
```r
# Import external python script to local library environment
reticulate::source_python("02_Transform/scripts/assemblyFramer.py")
```
Transform the text file data into a multi-indexed dataframe. Multi-indexing simplifies query syntax later on.
"""assemblyFramer.pyThis script processes a JSON-structured file containing assembly statistics (e.g., contig and scaffold stats) and converts it into a multi-indexed Pandas DataFrame for easier analysis and manipulation.The script is designed to handle file outputs similar to the `summary_stats.json` or `summary_stats.txt` in this project, as long as the content is valid JSON. It extracts key-value pairs from the JSON structure and organizes them into a DataFrame with a hierarchical index (Category and Label).Author: Eric MossottiDate: 2025-03-12Version: 1.0CC-BY SA"""import pandas as pdimport jsondef assemblyFramer(statsPath =None):""" Reads a JSON-structured assembly stats file and returns a multi-index DataFrame. Parameters: ----------- statsPath : str, optional The file path to the JSON-structured assembly stats file. Default is None. The file can have any extension (e.g., `.json`, `.txt`), but the content must be valid JSON. Returns: -------- pd.DataFrame A Pandas DataFrame with a multi-index (Category, Label) and a single column "Value". The DataFrame contains all key-value pairs extracted from the input JSON file. Example: -------- >>> df = assemblyFramer("summary_stats.json") >>> print(df) Value Category Label Contigs L10 41 L20 99 L30 174 L40 267 L50 385 N10 12643769 N20 9681846 N30 7895799 N40 6288966 N50 4902968 gc_content 35.248104 longest 43529772 mean 1552486.99 median 331935.0 sequence_count 4200 shortest 1000 total_bps 6520445364 Scaffolds L10 0 L20 0 L30 1 L40 2 L50 3 N10 1438277616 N20 1438277616 N30 915491830 N40 607508155 N50 518932092 gc_content 35.248104 longest 1438277616 mean 3212576.534 median 62362.5 sequence_count 2030 shortest 1000 total_bps 6521530364 """# Read the JSON filewithopen(statsPath, 'r') asfile: data = json.load(file)# Prepare rows for the DataFrame rows = []for category_key, stats in data.items():# Convert "Contig Stats" to "Contigs", "Scaffold Stats" to "Scaffolds" category = category_key.split()[0] +'s'for label, value in stats.items(): rows.append({'Category': category,'Label': label,'Value': value })# Create a DataFrame and set multi-index df = pd.DataFrame(rows) df.set_index(['Category', 'Label'], inplace=True)return df
Run assemblyFramer(), passing in the path to the text file generated by running the Bash script from earlier.
```python
# Set the local text file path
statsPath = "02_Transform/data/summary_stats.txt"
# Run to yield a multi-indexed dataframe
df = assemblyFramer(statsPath)
```
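A quick check of the simplified query syntax the multi-index enables:

```python
# Query a single statistic via the (Category, Label) multi-index
contig_n50 = df.loc[("Contigs", "N50"), "Value"]

# Or slice out an entire category at once
scaffold_stats = df.loc["Scaffolds"]
```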
"""Parsing genomic data in a memory-efficent way by not loading the entirefile into memory at once. The file is processed one line at a time, groupingrelated lines together. Statistics such as N50 are crucial in assessing the contiguity of a genome assembly, with higher N50 values generally indicating a more contiguous assembly.The distinction between contigs and scaffolds is important in genome assembly, as it provides information about the continuity and completeness of the assembly.#----read_genome()____________Differentiates between scaffolds (which may contain gaps) and contigs (continuous sequences). Calculate the GC content, which is an important genomic characteristic.Prepare lists of contig and scaffold lengths.#----fasta_iter()____________Groups the .fasta file data into alternating groups of headers and sequences.It is a generator function that will pause until the next item is requested after yielding a tuple.The iterator groups two aspects: a. Single header lines starting with '>' b. Subsequent lines until the next '>'#----calculate_stats()_________________Where the stats dictionary values are calculated and keys assigned.#----def assemblyDictionary()________________________Maps the previously created dictionaries of contig and scaffold stats to the category they belong, thereby distinguishing the contigs and scaffold stats. Returning the desired values is quite intuitive as a result."""import numpy as npfrom itertools import groupby#---- fasta_iter()def fasta_iter(fasta_file): fh =open(fasta_file)# Only need the second part, or code sequences part, of the grouped by items fa_iter = (x[1] for x in groupby(fh, lambda line: line[0] ==">"))for header in fa_iter:# Get first line of group; drop the ">" and starting/trailing whitespace header =next(header)[1:].strip()# Join all sequence lines to one string; conv to uppercase; remv whitespace seq ="".join(s.upper().strip() for s innext(fa_iter))yield header, seq#---- read_genomedef read_genome(fasta_file): gc =0 total_len =0 contig_lens = [] scaffold_lens = []# Ignore header information (the '_' part) and process sequence datafor _, seq in fasta_iter(fasta_file):# Add sequence (scaffold) length scaffold_lens.append(len(seq))# NN reprs gaps in scaffold, which are contigsif"NN"in seq:# Add split sequences to contig list if gap contig_list = seq.split("NN")else:# Add sequence to contig list contig_list = [seq]for contig in contig_list:# An initial check for 0-length contigsiflen(contig): gc += contig.count('G') + contig.count('C')# Update the total length total_len +=len(contig)# Add length to list of contig lengths contig_lens.append(len(contig))# GC content as the percentage of total genome gc_cont = (gc / total_len) *100return contig_lens, scaffold_lens, gc_cont#---- calculate_stats()def calculate_stats(seq_lens, gc_cont):# Empty dictionary stats = {}# The set of sequence lengths are converted to a NumPy array seq_array = np.array(seq_lens)# Count the individual sequences stats['sequence_count'] = seq_array.size# GC proportion stats['gc_content'] = gc_cont# Sort lengths by descending order sorted_lens = seq_array[np.argsort(-seq_array)]# The first length is the longest due to sorting stats['longest'] =int(sorted_lens[0])# Likewise, shortest length is at the end stats['shortest'] =int(sorted_lens[-1]) stats['median'] = np.median(sorted_lens) stats['mean'] = np.mean(sorted_lens)# Sums the total length of all sequences stats['total_bps'] =int(np.sum(sorted_lens))# An array of cumulative sums to calculate N50 efficiently csum = 
np.cumsum(sorted_lens)for level in [10, 20, 30, 40, 50]:# Calculate target base pair count for the level nx =int(stats['total_bps'] * (level /100))# Find smallest bp value in array, >= to the target % csumn =min(csum[csum >= nx])""" --- The original code in the next line: l_level = int(np.where(csum == csumn)[0]) --- Has been changed to: l_level = int(np.where(csum == csumn)[0][0]) This finds the index where the cumulative sum equals csumn, which represents the number of sequences needed to reach the target percentage of base pairs. I've added an extra [0] to fix the NumPy deprecation warning. This ensures return of a scalar value from the array, as the extra '[0]' ensures the first element of the first array is being accessed. The console warning (in Rstudio) that's now been resolved: 'Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)' """# Determine the index where the sequences required to reach target % of bps is met l_level =int(np.where(csum == csumn)[0][0])# Get bp length of seq at index, l_level, for the N-statistic value n_level =int(sorted_lens[l_level]) stats['L'+str(level)] = l_level# Store the statistic in a dictionary, mapped to its new name key stats['N'+str(level)] = n_levelreturn stats#---- assemblyDictionarydef assemblyDictionary(infilename):# Return two np arrays of lengths contig_lens, scaffold_lens, gc_cont = read_genome(infilename)# Return contig stats from contig lengths contig_stats = calculate_stats(contig_lens, gc_cont)# Return scaffold stats from scaffold lengths scaffold_stats = calculate_stats(scaffold_lens, gc_cont)# A dictionary of outputs is easily queried stat_output = {'Contig Stats': contig_stats,'Scaffold Stats': scaffold_stats}return stat_output
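To make the N50 logic concrete, here is a toy run of calculate_stats() on five made-up lengths (the gc_cont argument is just a placeholder value here):

```python
# Five toy sequence lengths totaling 30 bp
toy = calculate_stats([10, 8, 6, 4, 2], gc_cont=40.0)

# Half of 30 bp is 15 bp; the cumulative sums are 10, 18, 24, 28, 30,
# so the target is first reached at array index 1 (a length of 8 bp)
print(toy["N50"])  # 8
print(toy["L50"])  # 1 -- a 0-based index, matching the L-values in Table 1
```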
This is the data loading phase. Following completion of this stage, querying the data should be more intuitive than before.
3.1 Assign Variables with strint.py
This code transforms the strings in the value column of the dataframe to a more visually appealing, thousands-separated form.
""" Formats the string representation of an integer value as a comma separated string """import pandas as pdimport redef strint(data, category, label):ifisinstance(data, pd.DataFrame):# Existing DataFrame handling code stat = data.loc[(category, label), "Value"]# Set boolean match value isFloat = re.search(r"\.", str(stat))# Convert to float if there is a decimalif isFloat: stat = pd.to_numeric(stat, downcast="float")else:# Else convert to an integer stat = pd.to_numeric(stat, downcast="integer")elifisinstance(data, dict):# New dictionary handling code#stat = data.get(f"{category} {label}") stat = data[category][label]if stat isNone:raiseKeyError(f"Key '{category}{label}' not found in the dictionary")# No need to convert to numeric as values are already integers or floatselse:raiseTypeError("Input must be a pandas DataFrame or a dictionary")# Add a thousands separator and convert back to a string stat =f'{stat:,}'return stat
3.2 DataFrame Input
Call strint() with the simplified query syntax noted earlier: simply specify the literal 'Category' and 'Label' keys to return the desired value.
An example of the syntax used to query the dictionary. This code is not evaluated; it is for illustrative purposes only.
"""strint(dataframe, category, label)category options---------------- Contig Stats Scaffold Statslabel options---------------- L10 L20 L30 L40 L50 N10 N20 N30 N40 N50 gc_content longest mean median sequence_count shortest total_bps"""# A look inside assemblyDictionary.py where stat_output is returnedstat_output = {'Contig Stats': contig_stats,'Scaffold Stats': scaffold_stats}ctig_len = statsDict["Contig Stats"]["total_bps"]
4 Present
4.1 The Pandas Table
This is a slightly formatted view of the Pandas table, designed to be more easily queried for the desired statistic. If, however, you'd like to treat the Styler object as the unchanged dataframe object, use the forma_df.data syntax.
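For instance, once forma_df is created below, a raw value can be pulled back out of the Styler's underlying frame like so (a small usage sketch):

```python
# .data returns the unstyled DataFrame wrapped by the Styler
raw_n50 = forma_df.data.loc[("Contigs", "N50"), "Value"]
```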
The Pandas dataframe as a Styler object for better viewing.
```python
# Display with Style
forma_df = df.loc[:].style.pipe(formaFrame)
forma_df
```
Table 1: Assembly Stats

| Category  | Label          | Value             |
|-----------|----------------|-------------------|
| Contigs   | L10            | 41.000000         |
|           | L20            | 99.000000         |
|           | L30            | 174.000000        |
|           | L40            | 267.000000        |
|           | L50            | 385.000000        |
|           | N10            | 12643769.000000   |
|           | N20            | 9681846.000000    |
|           | N30            | 7895799.000000    |
|           | N40            | 6288966.000000    |
|           | N50            | 4902968.000000    |
|           | gc_content     | 35.248104         |
|           | longest        | 43529772.000000   |
|           | mean           | 1552486.991429    |
|           | median         | 331935.000000     |
|           | sequence_count | 4200.000000       |
|           | shortest       | 1000.000000       |
|           | total_bps      | 6520445364.000000 |
| Scaffolds | L10            | 0.000000          |
|           | L20            | 0.000000          |
|           | L30            | 1.000000          |
|           | L40            | 2.000000          |
|           | L50            | 3.000000          |
|           | N10            | 1438277616.000000 |
|           | N20            | 1438277616.000000 |
|           | N30            | 915491830.000000  |
|           | N40            | 607508155.000000  |
|           | N50            | 518932092.000000  |
|           | gc_content     | 35.248104         |
|           | longest        | 1438277616.000000 |
|           | mean           | 3212576.533990    |
|           | median         | 62362.500000      |
|           | sequence_count | 2030.000000       |
|           | shortest       | 1000.000000       |
|           | total_bps      | 6521530364.000000 |
4.2 DNA Zoo’s Table, Reproduced
Loading reticulate makes the variable calls simpler in the table below.
```r
library(reticulate)
```
Importing the library makes it simpler to insert the values into the table below. For example, instead of typing the literal value 4,902,968.0 into individual cells, I only need to reference the variable that holds it inline. I needed to convert the Python objects into R objects, as the knitr engine used to render this document does not seem to display output from inline Python code directly.
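A hypothetical sketch of that conversion (variable names are illustrative, not the document's actual ones), assuming the Python session above is active:

```r
library(reticulate)

# Pull formatted strings out of the Python session into R objects so
# that inline R code can display them during render
ctig_n50 <- py$strint(py$df, "Contigs", "N50")
ctig_len <- py$strint(py$df, "Contigs", "total_bps")
```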
| Contig length (bp) | Number of contigs | Contig N50 (bp) | Longest contig (bp) |
|--------------------|-------------------|-----------------|---------------------|
| 6,520,445,364.0    | 4,200.0           | 4,902,968.0     | 43,529,772.0        |

| Scaffold length (bp) | Number of scaffolds | Scaffold N50 (bp) | Longest scaffold (bp) |
|----------------------|---------------------|-------------------|-----------------------|
| 6,521,530,364.0      | 2,030.0             | 518,932,092.0     | 1,438,277,616.0       |
The dataframe object on display, in all its glory.