The steps involved in reproducing data can be unclear.
Purpose
To elaborate on the objective stated at the top of this document, I seek to supplement DNA Zoo's report with an accessible data analysis (DA) pipeline. To accomplish this, I independently reproduce the original article's table while documenting every data-processing step. Although there is nothing wrong with the original works, things can always be taken further (DNA Zoo; Little 17-Year Cicada, 2023).
This might be of interest to the original authors of the article. More generally, the spirit of this work could transfer to other domains of data-intensive research and analytics.
Source
All data used within this report were freely available from a public database hosted by DNA Zoo (Dnazoo.s3).
This constitutes the data extraction phase of the DA pipeline.
1.1 Create Project Directory
The following function, used further below, establishes the project's DA directory structure.
```python
# Automates creation of the DA pipeline directory needed for this project,
# to avoid confusion for those replicating this project, if desired.
import os

def directorize(base_path, structure):
    # Nested directories
    for dir_name, subdirs in structure.items():
        dir_path = os.path.join(base_path, dir_name)
        os.makedirs(dir_path, exist_ok=True)
        for subdir in subdirs:
            subdir_path = os.path.join(dir_path, subdir)
            os.makedirs(subdir_path, exist_ok=True)
```
If the directory folders already exist, re-running this code during render causes no issues, since os.makedirs is called with exist_ok=True.
```python
# Define the directory structure
structure = {
    "01_Extract/": ["data/", "scripts/"],
    "02_Transform/": ["data/", "scripts/"],
    "03_Load/": ["data/", "scripts/"],
    "04_Present/": ["data/", "scripts/"]
}

# Create the analysis folder structure in a preferred base directory
# "" = project's working directory
directorize("", structure)
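As a quick sanity check (not part of the original pipeline), the freshly created tree can be walked and printed:

```python
import os

# List the four pipeline stages and their subfolders
for stage in ("01_Extract", "02_Transform", "03_Load", "04_Present"):
    for root, dirs, files in os.walk(stage):
        print(root)
```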
1.2 Download to Local Machine
This step imports all scripts and data needed for the project from the web, to make replicating this project easier. If you want the exact verbatim output, delete the cloned folder structure (01_Extract through 04_Present) and run this code. Otherwise, it is fine to re-run it with the scripts already in place.
```python
# ---- Import data from the web with wget
import os
import wget

def importer(fileMap):
    # Download from URL to path and notify when complete
    for url, file_path in fileMap.items():
        # Check file existence first
        if not os.path.exists(file_path):
            wget.download(url, file_path)
            print(f"{file_path} written")
        else:
            print(f"{file_path} already exists.")
```
Run importer(), passing in a dictionary that maps URL keys to their local file path destination values.
```python
# Set the url
fasta_URL = "https://dnazoo.s3.wasabisys.com/Magicicada_septendecula/magicicada_hifiasm.asm.bp.p_ctg_HiC.fasta.gz"

# Set the desired local file path
fasta_PATH = "01_Extract/data/magicicada.fasta.gz"

#decompress_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/02_Transform/scripts/decompress.py"
#decompress_PATH = "02_Transform/scripts/decompress.py"

#asmblydict_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/02_Transform/scripts/assemblyDictionary.py"
#asmblydict_PATH = "02_Transform/scripts/assemblyDictionary.py"

#asmblyframe_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/02_Transform/scripts/assemblyFramer.py"
#asmblyframe_PATH = "02_Transform/scripts/assemblyFramer.py"

#strint_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/03_Load/scripts/strint.py"
#strint_PATH = "03_Load/scripts/strint.py"

#formaframe_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/04_Present/scripts/formaFrame.py"
#formaframe_PATH = "04_Present/scripts/formaFrame.py"

#extractIgnore_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/01_Extract/data/.gitignore"
#extractIgnore_PATH = "01_Extract/data/.gitignore"

#transformIgnore_URL = "https://raw.githubusercontent.com/ericMossotti/Cicada_Data/main/02_Transform/data/.gitignore"
#transformIgnore_PATH = "02_Transform/data/.gitignore"

# Map the url to the file path
fileMap = {
    fasta_URL: fasta_PATH,
#    decompress_URL: decompress_PATH,
#    asmblydict_URL: asmblydict_PATH,
#    asmblyframe_URL: asmblyframe_PATH,
#    strint_URL: strint_PATH,
#    formaframe_URL: formaframe_PATH,
#    extractIgnore_URL: extractIgnore_PATH,
#    transformIgnore_URL: transformIgnore_PATH
}

importer(fileMap)
```
Transform the compressed file to its decompressed form.
```python
# ---- Decompress the gz file with gzip
import os
import gzip
import shutil

def decompress(gzFasta, fasta):
    # If not decompressed, then decompress and redirect to a new file path
    if not os.path.exists(fasta):
        # File doesn't exist, so decompress
        with gzip.open(gzFasta, 'rb') as f_in:
            with open(fasta, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        print(f"{fasta} has been decompressed and written.")
    else:
        print(f"The file {fasta} already exists. Skipping unzip.")
```
Run decompress(), using the compressed path and the desired path to the decompressed file.
```python
# Set the compressed fasta.gz file variable
gzFasta = "01_Extract/data/magicicada.fasta.gz"
# Set the decompressed fasta file variable
fasta = "02_Transform/data/magicicada.fasta"
# Pass file paths to the function
decompress(gzFasta, fasta)
```
The file 02_Transform/data/magicicada.fasta already exists. Skipping unzip.
2.2 FASTA to Text, to DataFrame
The dataframe-formatting script is sourced here because of an issue between the knitr engine and Python in these parts; it appears to be a conflict when switching between dataframe and dictionary data types.
This chunk should be run locally rather than with quarto render. When working with the source file, change the code-chunk language specifier from {.bash} back to {bash}; you might have to add the {bash} tag back to the div entirely. I am not sure how else to accomplish this within my current Quarto project setup. (Trizna, 2020)
```bash
# BASH SCRIPT
# The uncompressed fasta file variable
fasta=02_Transform/data/magicicada.fasta
# The text file path variable generated by the script
summary_stats=02_Transform/data/summary_stats.txt

assembly_stats $fasta > $summary_stats
```
I need to do this for rendering to display the code-cell outputs as intended. However, you can re-enable the Bash code chunk and comment this workaround out if you want. Deactivating the Bash chunk for render lets everything else work without a hitch when going straight from an empty directory to a rendered document.
Instead, the render uses the file produced beforehand by running the assembly_stats Python program as a Bash command.
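If you would rather avoid the Bash chunk entirely, the same pre-render step could be wrapped in Python's subprocess module; this is a sketch of an alternative, not what the document itself does:

```python
import subprocess

# Equivalent to the Bash chunk above: run the assembly_stats CLI
# (Trizna, 2020) on the FASTA and redirect stdout to the stats file
fasta = "02_Transform/data/magicicada.fasta"
summary_stats = "02_Transform/data/summary_stats.txt"

with open(summary_stats, "w") as out:
    subprocess.run(["assembly_stats", fasta], stdout=out, check=True)
```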
Now, transform the text file into a Python dataframe. I am opting not to blanket-change data types, as the preferred output format could vary by user.
Source the assemblyFramer.py script.
```r
# Import external python script to local library environment
reticulate::source_python("02_Transform/scripts/assemblyFramer.py")
```
Transform the text file data into a multi-indexed dataframe. Multi-indexing simplifies query syntax later on.
"""assemblyFramer.pyThis script processes a JSON-structured file containing assembly statistics (e.g., contig and scaffold stats) and converts it into a multi-indexed Pandas DataFrame for easier analysis and manipulation.The script is designed to handle file outputs similar to the `summary_stats.json` or `summary_stats.txt` in this project, as long as the content is valid JSON. It extracts key-value pairs from the JSON structure and organizes them into a DataFrame with a hierarchical index (Category and Label).Author: Eric MossottiDate: 2025-03-12Version: 1.0CC-BY SA"""import pandas as pdimport jsondef assemblyFramer(statsPath =None):""" Reads a JSON-structured assembly stats file and returns a multi-index DataFrame. Parameters: ----------- statsPath : str, optional The file path to the JSON-structured assembly stats file. Default is None. The file can have any extension (e.g., `.json`, `.txt`), but the content must be valid JSON. Returns: -------- pd.DataFrame A Pandas DataFrame with a multi-index (Category, Label) and a single column "Value". The DataFrame contains all key-value pairs extracted from the input JSON file. Example: -------- >>> df = assemblyFramer("summary_stats.json") >>> print(df) Value Category Label Contigs L10 41 L20 99 L30 174 L40 267 L50 385 N10 12643769 N20 9681846 N30 7895799 N40 6288966 N50 4902968 gc_content 35.248104 longest 43529772 mean 1552486.99 median 331935.0 sequence_count 4200 shortest 1000 total_bps 6520445364 Scaffolds L10 0 L20 0 L30 1 L40 2 L50 3 N10 1438277616 N20 1438277616 N30 915491830 N40 607508155 N50 518932092 gc_content 35.248104 longest 1438277616 mean 3212576.534 median 62362.5 sequence_count 2030 shortest 1000 total_bps 6521530364 """# Read the JSON filewithopen(statsPath, 'r') asfile: data = json.load(file)# Prepare rows for the DataFrame rows = []for category_key, stats in data.items():# Convert "Contig Stats" to "Contigs", "Scaffold Stats" to "Scaffolds" category = category_key.split()[0] +'s'for label, value in stats.items(): rows.append({'Category': category,'Label': label,'Value': value })# Create a DataFrame and set multi-index df = pd.DataFrame(rows) df.set_index(['Category', 'Label'], inplace=True)return df
Run assemblyFramer(), passing in the path to the text file generated by running the Bash script from earlier.
```python
# Set the local text file path
statsPath = "02_Transform/data/summary_stats.txt"
# Run to yield a multi-indexed dataframe
df = assemblyFramer(statsPath)
```
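A quick check of the simplified query syntax the multi-index enables:

```python
# Query a single statistic via the (Category, Label) multi-index
contig_n50 = df.loc[("Contigs", "N50"), "Value"]

# Or slice out an entire category at once
scaffold_stats = df.loc["Scaffolds"]
```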
"""Parsing genomic data in a memory-efficent way by not loading the entirefile into memory at once. The file is processed one line at a time, groupingrelated lines together. Statistics such as N50 are crucial in assessing the contiguity of a genome assembly, with higher N50 values generally indicating a more contiguous assembly.The distinction between contigs and scaffolds is important in genome assembly, as it provides information about the continuity and completeness of the assembly.#----read_genome()____________Differentiates between scaffolds (which may contain gaps) and contigs (continuous sequences). Calculate the GC content, which is an important genomic characteristic.Prepare lists of contig and scaffold lengths.#----fasta_iter()____________Groups the .fasta file data into alternating groups of headers and sequences.It is a generator function that will pause until the next item is requested after yielding a tuple.The iterator groups two aspects: a. Single header lines starting with '>' b. Subsequent lines until the next '>'#----calculate_stats()_________________Where the stats dictionary values are calculated and keys assigned.#----def assemblyDictionary()________________________Maps the previously created dictionaries of contig and scaffold stats to the category they belong, thereby distinguishing the contigs and scaffold stats. Returning the desired values is quite intuitive as a result."""import numpy as npfrom itertools import groupby#---- fasta_iter()def fasta_iter(fasta_file): fh =open(fasta_file)# Only need the second part, or code sequences part, of the grouped by items fa_iter = (x[1] for x in groupby(fh, lambda line: line[0] ==">"))for header in fa_iter:# Get first line of group; drop the ">" and starting/trailing whitespace header =next(header)[1:].strip()# Join all sequence lines to one string; conv to uppercase; remv whitespace seq ="".join(s.upper().strip() for s innext(fa_iter))yield header, seq#---- read_genomedef read_genome(fasta_file): gc =0 total_len =0 contig_lens = [] scaffold_lens = []# Ignore header information (the '_' part) and process sequence datafor _, seq in fasta_iter(fasta_file):# Add sequence (scaffold) length scaffold_lens.append(len(seq))# NN reprs gaps in scaffold, which are contigsif"NN"in seq:# Add split sequences to contig list if gap contig_list = seq.split("NN")else:# Add sequence to contig list contig_list = [seq]for contig in contig_list:# An initial check for 0-length contigsiflen(contig): gc += contig.count('G') + contig.count('C')# Update the total length total_len +=len(contig)# Add length to list of contig lengths contig_lens.append(len(contig))# GC content as the percentage of total genome gc_cont = (gc / total_len) *100return contig_lens, scaffold_lens, gc_cont#---- calculate_stats()def calculate_stats(seq_lens, gc_cont):# Empty dictionary stats = {}# The set of sequence lengths are converted to a NumPy array seq_array = np.array(seq_lens)# Count the individual sequences stats['sequence_count'] = seq_array.size# GC proportion stats['gc_content'] = gc_cont# Sort lengths by descending order sorted_lens = seq_array[np.argsort(-seq_array)]# The first length is the longest due to sorting stats['longest'] =int(sorted_lens[0])# Likewise, shortest length is at the end stats['shortest'] =int(sorted_lens[-1]) stats['median'] = np.median(sorted_lens) stats['mean'] = np.mean(sorted_lens)# Sums the total length of all sequences stats['total_bps'] =int(np.sum(sorted_lens))# An array of cumulative sums to calculate N50 efficiently csum = 
np.cumsum(sorted_lens)for level in [10, 20, 30, 40, 50]:# Calculate target base pair count for the level nx =int(stats['total_bps'] * (level /100))# Find smallest bp value in array, >= to the target % csumn =min(csum[csum >= nx])""" --- The original code in the next line: l_level = int(np.where(csum == csumn)[0]) --- Has been changed to: l_level = int(np.where(csum == csumn)[0][0]) This finds the index where the cumulative sum equals csumn, which represents the number of sequences needed to reach the target percentage of base pairs. I've added an extra [0] to fix the NumPy deprecation warning. This ensures return of a scalar value from the array, as the extra '[0]' ensures the first element of the first array is being accessed. The console warning (in Rstudio) that's now been resolved: 'Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)' """# Determine the index where the sequences required to reach target % of bps is met l_level =int(np.where(csum == csumn)[0][0])# Get bp length of seq at index, l_level, for the N-statistic value n_level =int(sorted_lens[l_level]) stats['L'+str(level)] = l_level# Store the statistic in a dictionary, mapped to its new name key stats['N'+str(level)] = n_levelreturn stats#---- assemblyDictionarydef assemblyDictionary(infilename):# Return two np arrays of lengths contig_lens, scaffold_lens, gc_cont = read_genome(infilename)# Return contig stats from contig lengths contig_stats = calculate_stats(contig_lens, gc_cont)# Return scaffold stats from scaffold lengths scaffold_stats = calculate_stats(scaffold_lens, gc_cont)# A dictionary of outputs is easily queried stat_output = {'Contig Stats': contig_stats,'Scaffold Stats': scaffold_stats}return stat_output
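To make the N50 logic concrete, here is a toy run of calculate_stats() on five made-up lengths (the gc_cont argument is just a placeholder value here):

```python
# Five toy sequence lengths totaling 30 bp
toy = calculate_stats([10, 8, 6, 4, 2], gc_cont=40.0)

# Half of 30 bp is 15 bp; the cumulative sums are 10, 18, 24, 28, 30,
# so the target is first reached at array index 1 (a length of 8 bp)
print(toy["N50"])  # 8
print(toy["L50"])  # 1 -- a 0-based index, matching the L-values in Table 1
```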
This is the data loading phase. Following completion of this stage, querying the data should be more intuitive than before.
3.1 Assign Variables with strint.py
This code transforms the strings in the value column of the dataframe to a more visually appealing, thousands-separated form.
""" Formats the string representation of an integer value as a comma separated string """import pandas as pdimport redef strint(data, category, label):ifisinstance(data, pd.DataFrame):# Existing DataFrame handling code stat = data.loc[(category, label), "Value"]# Set boolean match value isFloat = re.search(r"\.", str(stat))# Convert to float if there is a decimalif isFloat: stat = pd.to_numeric(stat, downcast="float")else:# Else convert to an integer stat = pd.to_numeric(stat, downcast="integer")elifisinstance(data, dict):# New dictionary handling code#stat = data.get(f"{category} {label}") stat = data[category][label]if stat isNone:raiseKeyError(f"Key '{category}{label}' not found in the dictionary")# No need to convert to numeric as values are already integers or floatselse:raiseTypeError("Input must be a pandas DataFrame or a dictionary")# Add a thousands separator and convert back to a string stat =f'{stat:,}'return stat
3.2 DataFrame Input
Call strint() with the simplified query syntax noted earlier: simply specify the literal 'Category' and 'Label' keys to return the desired value.
An example of the syntax used to query the dictionary. This code is not evaluated; it is for illustrative purposes only.
"""strint(dataframe, category, label)category options---------------- Contig Stats Scaffold Statslabel options---------------- L10 L20 L30 L40 L50 N10 N20 N30 N40 N50 gc_content longest mean median sequence_count shortest total_bps"""# A look inside assemblyDictionary.py where stat_output is returnedstat_output = {'Contig Stats': contig_stats,'Scaffold Stats': scaffold_stats}ctig_len = statsDict["Contig Stats"]["total_bps"]
4 Present
4.1 The Pandas Table
This is a slightly formatted view of the Pandas table, designed to be more easily queried for the desired statistic. If, however, you'd like to treat the Styler object as the unchanged dataframe object, use the forma_df.data syntax.
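For instance, once forma_df is created below, a raw value can be pulled back out of the Styler's underlying frame like so (a small usage sketch):

```python
# .data returns the unstyled DataFrame wrapped by the Styler
raw_n50 = forma_df.data.loc[("Contigs", "N50"), "Value"]
```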
The Pandas dataframe as a Styler object for better viewing.
```python
# Display with Style
forma_df = df.loc[:].style.pipe(formaFrame)
forma_df
```
Table 1: Assembly Stats

| Category  | Label          | Value             |
|-----------|----------------|-------------------|
| Contigs   | L10            | 41.000000         |
|           | L20            | 99.000000         |
|           | L30            | 174.000000        |
|           | L40            | 267.000000        |
|           | L50            | 385.000000        |
|           | N10            | 12643769.000000   |
|           | N20            | 9681846.000000    |
|           | N30            | 7895799.000000    |
|           | N40            | 6288966.000000    |
|           | N50            | 4902968.000000    |
|           | gc_content     | 35.248104         |
|           | longest        | 43529772.000000   |
|           | mean           | 1552486.991429    |
|           | median         | 331935.000000     |
|           | sequence_count | 4200.000000       |
|           | shortest       | 1000.000000       |
|           | total_bps      | 6520445364.000000 |
| Scaffolds | L10            | 0.000000          |
|           | L20            | 0.000000          |
|           | L30            | 1.000000          |
|           | L40            | 2.000000          |
|           | L50            | 3.000000          |
|           | N10            | 1438277616.000000 |
|           | N20            | 1438277616.000000 |
|           | N30            | 915491830.000000  |
|           | N40            | 607508155.000000  |
|           | N50            | 518932092.000000  |
|           | gc_content     | 35.248104         |
|           | longest        | 1438277616.000000 |
|           | mean           | 3212576.533990    |
|           | median         | 62362.500000      |
|           | sequence_count | 2030.000000       |
|           | shortest       | 1000.000000       |
|           | total_bps      | 6521530364.000000 |
4.2 DNA Zoo’s Table, Reproduced
Loading reticulate makes the variable calls simpler in the table below.
```r
library(reticulate)
```
Importing the library makes it simpler to insert the values into the table below. For example, instead of typing the literal value 4,902,968.0 into individual cells, I only need to reference the variable that holds it inline. I needed to convert the Python objects into R objects, as the knitr engine used to render this document does not seem to display output from inline Python code directly.
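A hypothetical sketch of that conversion (variable names are illustrative, not the document's actual ones), assuming the Python session above is active:

```r
library(reticulate)

# Pull formatted strings out of the Python session into R objects so
# that inline R code can display them during render
ctig_n50 <- py$strint(py$df, "Contigs", "N50")
ctig_len <- py$strint(py$df, "Contigs", "total_bps")
```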
| Contig length (bp) | Number of contigs | Contig N50 (bp) | Longest contig (bp) |
|--------------------|-------------------|-----------------|---------------------|
| 6,520,445,364.0    | 4,200.0           | 4,902,968.0     | 43,529,772.0        |

| Scaffold length (bp) | Number of scaffolds | Scaffold N50 (bp) | Longest scaffold (bp) |
|----------------------|---------------------|-------------------|-----------------------|
| 6,521,530,364.0      | 2,030.0             | 518,932,092.0     | 1,438,277,616.0       |
The dataframe object on display, in all its glory.