Category Archives: surfactant proteins A and D

The original and the processed: SP-D

Figure below shows an original plot of surfactant protein D (Arroyo et al, 2020) presumably made by the program that comes with the Atomic Force Microscope that was used to visualize the protein preparation. Brightness on the y axis, distance on the x.

I used this plot as a “published” documentation of peak count of a trimer of SP-D. While looking at the images in this and other articles by Arroyo et al, I became convinced there were patterns in the brightness (peaks), symmetrical and mirrored, in the hexamers, dodecamers and multimers (though determining patterns of brightness peaks in the latter is more problematic than in the hexamers and dodecamers).

My own counts of peak number, and peak width were consistent enough that I sought out ways to verify the number and properties of these peaks in an “unbiased” way.  (Of course there is no way to be totally unbiased in any research, but I did try to select image filters and signal processing functions that seemed appropriate, easy to use, and produced consistent data which I then applied in the same manner to 12 dodecamers of SP-D).

After more than 1000 plots of SP-D hexamers, this figure (above) shows that not only are the three peaks defined by Arroyo et al correct, but the tiniest variations in her plot, which would not have been considered significant or relevant, are verified as actual peaks (see the summary plot with which her plot is compared).  This seems to me to reveal several things.

The reason that the N term peak is so large in the dodecamer compared to what Arroyo et al measured in their trimer is that there are four N terminals in a single (maybe not always single, perhaps sometimes side by side) peak that forms the common intersection.  All other measured peaks are domains of a single trimer.

1. There is more to be learned about molecular structure from AFM images than is generally perceived, but this requires different types of image and signal processing.

2. There are many programs useful for defining or confirming peak number, width, height, and valley in plots generated by ImageJ that are free and easy to use and include signal as well as image processing apps (and both).

3. ImageJ has a great free plotting app for plotting grayscale values which is easily exported to excel and can be saved as metafiles and manipulated in draw programs such as CorelDRAW.

4. Resulting data here provides details which will assist those who are working to “finish” constructing the molecular model of human surfactant protein D.

 

Bias: the good, the bad, the learned – peak finding functions and image filtering

Purpose: To contribute to predictions about the current structural model of surfactant protein D, in particular, the collagen-like domain.

Aim: To suggest there are recognizable patterns in the number and shape of peaks in grayscale plots of SP-D, obtained when traced from CRD to CRD as a hexamer, that can inform molecular models. These grayscale plots of recombinant human surfactant protein D (SP-D) were made with ImageJ from published AFM (atomic force microscope) images (Arroyo et al, 2018) of each of the two hexamers which comprise one dodecamer. Images were plotted as unfiltered images, and as images subjected to a variety of image processing filters and/or signal processing peak finding functions.

Introduction: A single molecule of SP-D has four domains: an N terminal domain, a collagen-like domain, a coiled coil neck domain, and a carbohydrate recognition domain (CRD). Monomers of SP-D assemble into coiled homotrimers, which readily form multimers joined at a communal peak at the N terminal domains. N terminal junction peak height and width appear to be a function of how many trimers are bound as a multimer. Hexamers and dodecamers are common multimeric forms, though multimers can be found with 30+ trimeric arms (Arroyo et al, 2018).

The RCSB Protein Data Bank (as of this writing) has many molecular models for the CRD and neck domains of SP-D, but none of the full trimer (all four domains), nor of hexamers, dodecamers or other multimers. Various electron microscopic techniques confirm that the collagen-like domain is reasonably straight or slightly bent, but this information has not yet become part of the molecular model of SP-D.

Arroyo et al (2020) published a grayscale peak count for a trimer of three: 1) the N terminal peak, 2) the glycosylation peak, and 3) the CRD peak. Plots of a hexamer would then show 5 peaks from CRD to CRD:  CRD, glycosylation peak, N-termini junction peak, glycosylation peak, CRD.  (Diagram modified from Arroyo et al, 2020 check). Certainly other peaks exist, and 3 was a conservative estimate.
However, a visual count of grayscale peaks from the original 80+ AFM images, and peak counts obtained from the grayscale plots, with and without image and signal processing functions, demonstrated that there are many more than 3 peaks per trimer (5 peaks per hexamer) that are found consistently.  Examination of the raw images, images subjected to image filtering, and peak finding functions showed the peak count for a hexamer was 15.  Two additional peaks that occur less than 50 percent of the time were not included in the 15 and were considered “possible peaks”.  The number of peaks per hexamer found without the use of image filters and peak finding functions was not significantly different from the number of peaks found by just observing the image.

Twelve image processing programs and 4 signal processing programs, each with numerous settings for filtering (e.g. sharpen, median, mode, mean, blur, limit range, noise reduction, etc.) and peak detection (smooth, lag, influence, distance, height, threshold, etc.), were applied to a representative dodecamer to determine which image processing and peak finding apps to use. Considerations included availability, ease of use, cost, filtering options, output format, and consistency with the visual data from the original images. The resulting number of peaks detected (15) using all results was used as a benchmark.

From those initial plots (n=633) a set of 7 imaging programs and their filters, and 4 peak finding programs and their criteria, were used to assess peaks in 13 additional dodecamers. These data were analyzed individually and together, and demonstrate that 15 peaks are present per hexamer, of which 9 peaks (5 peaks per trimer) were present 95-100% of the time, two peaks per hexamer (1 peak per trimer) were present 71% of the time, while a peak alleged to be at the neck domain was detected 51% of the time. One additional tiny peak was often visible, lying near the valley on the down-slope on either side of the N term junction peak. It was consistent in width, height and location, and was detected 42% of the time (in this discussion it will be referred to as the “tiny peak”).

The linear aspect of the collagen-like-domain, and the presence of 5 easily detected peaks per trimer, along with their peak heights, widths and valleys, should provide useful information for predicting the molecular model of the collagen-like-domain of SP-D. In addition, the data show that visual identification of peak characteristics and number per trimer (and hexamer) of SP-D was not significantly different from data from images subjected to filters.

Red arrow (figure below) shows the trajectory of the trimer (beginning at the bottom, CRD, moving upward to the N terminal domain, which is linked to three other trimers at the center of the dodecamer).

Methods:

Peak count per hexamer:  Peak count was obtained initially from hundreds of plots, using both manual peak counting and counts found using an inclusive number of programs for image filtering (12 image programs) and signal processing (7 programs). Results from all initial peak finding functions, from all software initially tested on one dodecamer image (my number 41_aka_45), came to a total of 633 plots.  These 633 plots were used to define which programs, which filters, and which functions produced peak counts most comparable to the peak plots created in ImageJ.  This included the lowest counts from a variety of (unbiased?) citizen-scientists, counts including poor resolution (highly pixelated) images, and images filtered and subjected to peak finding functions. This set of plots determined the number of peaks per hexamer to be 15, a number which was also verified stepwise with 4, 6, 8, 12 and with the final dataset (14 dodecamers and selected functions).

15 peaks per hexamer was used as a baseline for assessing the peaks found by plots analyzed in several different signal processing programs and settings. Counting peaks by hand and counting by function both certainly carry some bias. In many cases it becomes a matter of choosing how to apply parameters, even when the inclusion or exclusion of some peaks by signal processing functions appears visually illogical (see post “to peak or not to peak”).  Images were selected as they appeared within original images; every image that was able to be cropped from a figure was saved, thus limiting any selection bias. The number of bright grayscale peaks in a single hexamer of SP-D was taken as 15 (8.1 +/- 2.4 peaks per trimer, where the N termini junction is measured as ONE whole peak).

Image processing programs and filters:  Programs used for the initial peak counts are listed in the left column, and the programs used for image filtering are listed in the right column (which contains free software, as well as two prominent paid programs). Typically the free-ware provided fewer options for subtle filtering than paid programs. ImageJ was used for plotting grayscale values (peaks) exclusively. The only other program tried for plots was Gwyddion, which showed some discrepancies in tracing grayscale when plots were made in vertical as opposed to horizontal directions; Gwyddion was, however, used for image filtering.  The final choice of software for image filtering is listed on the left.

 

Initial signal and image filtering apps are seen in the top portion of this figure, and final choices for all 14 dodecamers lies below.

Further analysis and fine tuning images with gaussian, median, mean, sharpening, and range limiting filters, as well as optimizing peak finding options such as smoothing, distance, height,  lag, threshold, width, influence, etc  in signal processing shows the peak number to be more than 15 peaks per hexamer.
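For readers who want to try the filter step outside a GUI program, here is a minimal sketch of a few of the filter types named above (gaussian, median, mean, sharpening, range limiting) using scipy.ndimage. It is not the set of programs or settings used for the plots in this post: the function name `filter_variants`, the sigma, the window size, the clamp limits and the synthetic test image are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import ndimage

def filter_variants(image, sigma=1.0, size=3, low=30, high=220):
    """Return filtered copies of a 2-D grayscale image (gray values 0-255).

    All parameter defaults are illustrative assumptions, not the post's settings.
    """
    img = image.astype(float)
    blurred = ndimage.gaussian_filter(img, sigma=sigma)
    return {
        "raw": img,
        "gaussian": blurred,
        "median": ndimage.median_filter(img, size=size),
        "mean": ndimage.uniform_filter(img, size=size),
        # Unsharp mask: add back the detail that the blur removed.
        "sharpen": np.clip(img + (img - blurred), 0, 255),
        # Range limiting: clamp gray values to a chosen window.
        "limit_range": np.clip(img, low, high),
    }

# Synthetic stand-in for a cropped AFM image: one bright blob on a noisy background.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:64, 0:64]
blob = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / 50.0)
demo = np.clip(40 + 160 * blob + rng.normal(0, 5, (64, 64)), 0, 255)
variants = filter_variants(demo)
```

Each entry in `variants` could then be traced and plotted in the same way as the raw image, so that peak counts can be compared filter by filter.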

Line tracings of SP-D to produce grayscale plots:  End to end through the center of the hexamer, segmented line, plotted as grayscale in ImageJ.  Some details here.
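As an illustration of what a segmented-line grayscale plot does, the sketch below samples gray values along a polyline through a 2-D image. It is an approximation written for this post, not ImageJ's own code: the function name `segmented_line_profile`, the one-pixel step and the bilinear interpolation are assumptions, and ImageJ may sample or interpolate differently.

```python
import numpy as np
from scipy import ndimage

def segmented_line_profile(image, nodes, step=1.0):
    """Sample gray values along a polyline given as (x, y) nodes in pixels.

    Returns (distance_along_line_in_pixels, gray_values).
    """
    nodes = np.asarray(nodes, dtype=float)
    xs, ys = [], []
    for (x0, y0), (x1, y1) in zip(nodes[:-1], nodes[1:]):
        length = np.hypot(x1 - x0, y1 - y0)
        n = max(int(round(length / step)), 1)
        # Leave out each segment's end point; it is the next segment's start.
        t = np.linspace(0.0, 1.0, n, endpoint=False)
        xs.append(x0 + t * (x1 - x0))
        ys.append(y0 + t * (y1 - y0))
    xs = np.append(np.concatenate(xs), nodes[-1, 0])
    ys = np.append(np.concatenate(ys), nodes[-1, 1])
    # map_coordinates expects (row, column) = (y, x); order=1 is bilinear.
    values = ndimage.map_coordinates(image.astype(float), [ys, xs], order=1)
    distance = np.concatenate([[0.0], np.cumsum(np.hypot(np.diff(xs), np.diff(ys)))])
    return distance, values

# Toy image whose gray value equals the x coordinate, traced with one bend.
image = np.tile(np.arange(32.0), (32, 1))
distance, values = segmented_line_profile(image, [(0, 5), (10, 5), (10, 15)])
```

Distances come out in pixels; multiplying by the nm-per-pixel value taken from the image's bar marker would give the x axis in nm.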

Three peaks per SP-D trimer (5 peaks per hexamer, 9 peaks per dodecamer) have been identified. The tallest peak is central in each hexamer/dodecamer and comprises the N terminal junction: 1 N-terminal domain for each of the trimers in the dodecamer. Grayscale plots through the center of a hexamer will have N=4 N terminal domains if it is plotted in a dodecamer.   The glycosylation peak(s) (when the SP-D is glycosylated) and the carbohydrate recognition domain (CRD) peak(s) lie on either side of the N-term peak, and each of these was recognized by Arroyo et al (2018). However, in their plot (Figure 2, C) 15 peaks can easily be counted, not just the 5 peaks per hexamer that were listed (a very conservative count) in their legend. After extensive analysis, 15 peaks is a reasonable number and, in addition, corroborates peaks on their initial plot.

Peaks were first counted visually from unfiltered images before plots were made (group 1: a visual count of bright peaks; arm a in figure 1). Grayscale plots were then made (using ImageJ) to determine peak number, height, width and valley of those same unfiltered images by drawing a segmented line lengthwise through individual hexamers (2 per dodecamer), beginning at one CRD, passing through the center width of the N termini junction peak and continuing to the second CRD (figure 1, yellow plot line, arm b) (group 2: a count of peaks from the ImageJ plot).  Each of the plots was then subjected to signal processing functions (group 3) to compare (confirm?) visual assessments (eliminate bias?).

LEGEND: Surfactant protein D dodecamer. Two hexamers, with each hexamer of the dodecamer labeled as arm a or arm b. The trace begins at the CRD at bottom center (bright spot labeled START) and moves in the direction of the white arrow to the bright spot at the top of the image (the CRD at the other end of the hexamer).  The red arrow shows the extent of a trimer plot, from the CRD (at bottom, labeled START) through the entirety of the brightest peak (N termini junction).

Total number of plots examined for 14 SP-D dodecamers (figure 2) came to over 1500 trimers (that is 385 dodecamers).  The 14 dodecamers were thus plotted about 100 times each (see number for each different dodecamer below).  The largest numbers were for those several dodecamers that were used to establish the mean number of peaks per hexamer. Clearly dodecamer labelled 41_aka_45 was used to determine which image filtering programs, and which settings for signal processing filters, would be used for the other dodecamers. The list also shows the image of each of the 14 dodecamers (labeled in white on AFM images).

Molecules numbered 127 and 4A are the same but derived from different figures in the same publication. Bar markers in the figures varied from 20 and 30 nm to 200 nm. Each image was manipulated “along with” its bar marker to ensure that dimensions were consistent.

 

A list of image filters and signal processing functions (41_aka_45): 292 plots with different image filters plus signal processing functions (trimers, so n=73 dodecamers); 332 plots with all different image filters in x different image processing programs (trimers, so n=83 dodecamers).

Image processing programs and filters

 

 

 

 

Training: the dictionary defines training as  “the action of teaching a person or animal a particular skill or type of behavior”. That definition now includes computers and each comes both with great potential, and great limitations.

Learning: the dictionary defines learning as “modification of a behavioral tendency by experience”, and in the case of artificial intelligence, to learn without explicit programming.

Bias: the definition relevant to research is “systematic error introduced into sampling or testing by selecting or encouraging one outcome or answer over others” or “a disproportionate weight in favor of or against an idea or thing”. Zvereva and Kozlov (Sci Rep 11, 226 (2021), https://doi.org/10.1038/s41598-020-80677-4) take a rather negative view of bias in research, but suggest two important approaches to limit bias – 1) understand the measures available to avoid bias and 2) report measures used to avoid bias. They also state “Cognitive biases are unconscious, which means that simply being aware of the existence and importance of biases is not sufficient to avoid them”.

Machine learning bias: “Machine learning bias, also known as algorithm bias or AI bias, is a phenomenon that occurs when an algorithm produces results that are systemically prejudiced due to erroneous assumptions in the machine learning (supervised and reinforced machine learning) process.”

(It seems like unsupervised, supervised and reinforced machine learning should be great backup for limiting bias in interpretation, aka mistakes or selection bias, in a relatively simple assessment of peaks in a given plot.)

Bias present:

1) non-response bias (missing value): “As a rule of thumb, the lower the response rate, the greater the likelihood of nonresponse bias. Nonresponse bias becomes an issue when the response rate falls below 70%.” (says who)
2) automation bias: “Automation bias is an over-reliance on automated aids and decision support systems”. (method bias)?
3) in-group bias: “the tendency for us to give preferential emphasis to one group, while ignoring outgroups”.
4) implicit (unconscious) bias: “automatic and unintentional, yet impacting outcome (judgement)”.
5) reporting bias: “the decision about what to report depends on the direction or magnitude of the findings”. (that’s what peer review is for)
6) false impression bias: “also known as the frequency illusion or recency illusion”
7) sampling bias: “a type of selection bias” –( e.g. test molecules being systematically more likely to be selected in a sample than others).
8) selection bias: “selecting an item (or various items), not using randomization of those items. therefore the data is not representative of the given population”.
9) confirmation bias: “the tendency to search for, interpret, favor, and recall information in a way that confirms or supports one’s prior views”
10) measurement (data collection) bias (errors): refers to the tendency of algorithms to reflect human biases (supervised and reinforced machine learning). As one personal communication put it, “you choose the settings”, which is true for the python-scipy peak finder (prominence 0.2, distance 30, width 5, threshold 0, height 0); for PHP Zscore (Lag 5, Threshold 1, Influence 0.05); for Octave’s AutoFindPeaksPlot.m (xy) and ipeaks.m (M80); and also for PeakValleyDetection.xlsx (smooth 11).
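To make the “you choose the settings” point concrete, here is a minimal sketch of scipy.signal.find_peaks called with the scipy settings listed in item 10. The trace is synthetic (five Gaussian bumps standing in for a CRD-to-CRD plot), and the bump positions, heights and widths are assumptions made up for the example, not measured SP-D data.

```python
import numpy as np
from scipy.signal import find_peaks

x = np.arange(600)

def bump(center, height, sigma):
    """One Gaussian-shaped peak on the trace (illustrative shape only)."""
    return height * np.exp(-0.5 * ((x - center) / sigma) ** 2)

# Synthetic trace: CRD, glycosylation, N-term junction, glycosylation, CRD.
profile = (40 + bump(60, 120, 15) + bump(180, 70, 15) + bump(300, 200, 25)
           + bump(420, 70, 15) + bump(540, 120, 15))

# Settings as quoted above for the scipy peak finder.
peaks, properties = find_peaks(
    profile,
    prominence=0.2,  # minimum height of a peak above its surrounding baseline
    distance=30,     # minimum spacing between neighbouring peaks, in samples
    width=5,         # minimum peak width, in samples
    threshold=0,     # minimum vertical step to the two adjacent samples
    height=0,        # minimum absolute peak height
)
```

Every one of those five arguments is a human decision. Changing any of them can add or remove peaks on a noisy trace, which is the measurement bias described in item 10.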

Bias relevant to outcome,

Selection Bias (yes, just dodecamers, from one researcher)
Spectrum Bias
Cognitive Bias
Data-Snooping Bias
Omitted-Variable Bias (missing data)
Exclusion Bias (out of focus molecules)
Analytical Bias
Reporting Bias (this would appear to be an ethical issue)

The definition of all of the above words has changed: in society, in science, in philosophy.
In the context of this post, the aim is to create an “unbiased” count of the number of peaks.

“people should assume right now that the models only perform to about 95% of human accuracy.” (https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained).

Results and Discussion:
Peaks, subpeaks. Figure below shows the analysis at four different tiers in analysing the number of peaks and subpeaks in dodecamers: as an N= 6, 8, 12 and 14 individual molecules, each processed in many ways, and each included  in subsequent analyses.
Peak number per trimer is shown in the graphic below. An analysis of 14 trimers shows there is no statistically significant difference between the number of peaks in either of the hexamer’s arms (a and b, i.e. left and right sides of a hexamer, respectively), and therefore of any trimer in a dodecamer. None was expected either, since SP-D should appear as a bilaterally symmetrical molecule.

References and links:

1. Arroyo et al, 2018, https://doi.org/10.1016/j.jmb.2018.03.027

https://imagej.nih.gov/ij/
https://terpconnect.umd.edu/~toh/spectrum/PeakFindingandMeasurement.htm#ipeak
https://terpconnect.umd.edu/~toh/spectrum/PeakFindingandMeasurement.htm#findpeaksx
https://terpconnect.umd.edu/~toh/spectrum/PeakFindingandMeasurement.htm#Spreadsheet (s)
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html
https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 (version: 2020-11-08)

Purpose: methods for unbiased peak finding in AFM images of SP-D

Surfactant protein D (SP-D) is necessary for lung surfactant structure, but is also critical for immune protection of the lung and other tissues (doi: 10.3389/fmed.2018.00018). It is a high molecular weight hydrophilic protein found as trimers and also as multimers of 30 or more trimers. Dodecamers, suggested to be important for innate immune defense, are the multimer analyzed in this study.

The purpose of this study was 1) to find appropriate unbiased methods to determine how many peaks (grayscale) appeared along a segmented centered line drawn from CRD to CRD in a hexamer of SP-D, in order 2) to help define the molecular structure of the SP-D hexamer and dodecamer, 3) to identify which peak (brightness, grayscale 0-255) along that trace might be important for binding with….., and 4) to begin a discussion on whether signal processing peak finding functions are helpful in unbiased assessment of topographical variation in AFM images while identifying peaks along a plot of any molecule, and SP-D in particular.

Abbreviations: SP-D, surfactant protein D; N-term, N terminal domain and collective junction in the center of the radially organized multimers of SP-D; GLY, glycosylation site of the SP-D trimer; CRD, carbohydrate recognition domain; C-d, collagen-like domain; NECK, coiled coil domain; AFM, atomic force microscopy.

Inspiration for finding out what SP-D dodecamers really “look” like came from comparing the vast array of diagrams of SP-D in the literature to actual microscopic images. Some of these diagrams were just erroneous, and when queried, one author excused his diagram as “artistic license”.  There is no room for artistic license in factual representation of data. In addition, others misguidedly use images in their publications showing just two of the four domains of SP-D without expressly mentioning that this is NOT the entire molecule, but just the coiled coil neck and carbohydrate recognition domains. This matters because consensus suggests there are 4 domains in a monomer of SP-D: a short N terminal domain, a long collagen-like-domain, a coiled coil neck domain, and a C-type lectin domain (carbohydrate recognition domain, CRD), configured as homotrimers with the neck domain thought to be responsible for the coiling of the CRD and collagen-like-domains into trimers (Ping Li, 2009; doi:10.1074/jbc.m600651200; doi:10.1016/j.molimm.2009.06.005). Neck and CRD domains have been modeled often, and flexibility between neck and CRD has been suggested, and is clearly visible in AFM images.

The collagen-like-domain of SP-D, which is not required for multimer assembly nor for some of the innate immune functions related to viral pathogens, has not yet been modeled. It was noted by Kingma et al that it is, however, required for some aspects of macrophage activation and surfactant function (doi:10.1074/jbc.m600651200). Others have investigated edited molecules and examined function, but these images are shadowed TEMs, and are not as well suited for peak counting as AFM images (refs).

A realistic diagram of SP-D should easily be possible since hundreds of actual images of the molecule are in the literature, and a large assortment exists of SP-D trimers, hexamers and multimers. The latter are mirrored, symmetrical structures with attachments at the N terminal domains (center) and CRD domains (ends). Negative staining, rotary shadowing and atomic force microscopy confirm that arrangement, and have provided evidence that dodecamers are a “common” form (PMID: 36330647, DOI: 10.2174/1389203724666221102111145; doi:10.1074/jbc.271.31.18912). The proportions of various oligomers differ with methods of preparation, disease, and species.

Negatively stained molecules were less than informative, and shadowed images, even wonderfully shadowed images, had a background that was almost too textured to make out tiny detail.  AFM in particular, in a really wonderful presentation of SP-D images (Arroyo et al, 2018), was the method that showed quite a bit of detail.  In the latter publication, three peaks along the SP-D trimer were described: the N terminal domain peak, a glycosylation peak (present if the molecule was glycosylated) that occurs somewhere along the collagen-like domain, and the CRD peak at the C term.

Known peaks: The N term peak is a union of 6 N terminal domains and creates the tallest peak of the hexamer (12 in the dodecamer).  It consistently has the greatest height and width of all the peaks in multimers.  The collagen-like-domain is a relatively straight portion of the molecule, but when it is glycosylated, a prominent peak occurs relatively close to the N termini peak, and often has sub-peaks within the overall framework of a single peak. The coiled coil neck region is not only the area with the lowest grayscale values (small peaks) but it is often visibly “covered” due to the floppy nature of the adjacent CRD peaks.  The latter (CRD peaks) appear as “balls” tethered in a floppy manner to the rest of the molecule at the neck domain region.

It was easy to see in the AFM images of SP-D that three peaks per trimer (5 per hexamer) (Arroyo et al 2018, doi:10.1016/j.jmb.2018.03.027) was a conservative count, and that other peaks were present in a consistent and mirrored pattern.

NB, unbiased and biased: these are relative terms.  Since the researcher chooses the parameters of the image filter and signal processing functions, human bias is present. In addition, the specifics of the functions (lag, threshold, height, influence, moving average, smoothing) can be manipulated so extensively that signal processing functions may offer greater opportunity for bias than just counting the peaks from the image, or from the grayscale plot of the molecule.  Furthermore, bias may not be a bad thing. We learn from practice, from watching indices change, from repetition, from seeing in and out of focus images and high and low resolution images; it’s called “learning”, or “training”.  The “learned” bias is present under circumstances which involve both image filtering and signal processing.  Peak finding functions do not seem to provide much benefit over a careful, educated visual assessment.
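To show where lag, threshold and influence enter a peak finder, here is a minimal Python sketch of the smoothed z-score idea from the stackoverflow answer listed in the references above. It is my own transcription of that general approach, not the PHP code used for the plots; the function name `smoothed_zscore` and the toy trace are assumptions, and the defaults are simply the Lag 5, Threshold 1, Influence 0.05 settings quoted earlier.

```python
import numpy as np

def smoothed_zscore(y, lag=5, threshold=1.0, influence=0.05):
    """Flag samples that deviate from a moving baseline.

    Returns an int array: 1 (above baseline), -1 (below), 0 (no signal).
    """
    y = np.asarray(y, dtype=float)
    signals = np.zeros(len(y), dtype=int)
    filtered = y.copy()
    avg = np.zeros(len(y))
    std = np.zeros(len(y))
    avg[lag - 1] = np.mean(y[:lag])
    std[lag - 1] = np.std(y[:lag])
    for i in range(lag, len(y)):
        if abs(y[i] - avg[i - 1]) > threshold * std[i - 1]:
            signals[i] = 1 if y[i] > avg[i - 1] else -1
            # A flagged sample only nudges the baseline, by `influence`.
            filtered[i] = influence * y[i] + (1 - influence) * filtered[i - 1]
        else:
            filtered[i] = y[i]
        # Moving mean and standard deviation over the last `lag` filtered samples.
        avg[i] = np.mean(filtered[i - lag + 1:i + 1])
        std[i] = np.std(filtered[i - lag + 1:i + 1])
    return signals

# Toy trace: flat baseline, then a bright step, then back to baseline.
trace = [10, 10.5, 9.5, 10, 10.2, 9.8, 10.1, 30, 32, 31, 10, 9.9]
signals = smoothed_zscore(trace)
```

With a threshold as low as 1 standard deviation, small wiggles on a real trace are likely to be flagged as well, so the choice of these three numbers largely decides the peak count.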

COMMENTS: “filtering out false positives? It sounds like the parameters used to determine what’s signal and what’s noise need to be tweaked to match what you already “know” to be the right count. I don’t know how you’d differentiate between you training the algorithm or you introducing bias, but at the very least you could determine what variable values in the algorithm correspond to what your brain is doing.”

COMMENT: agree, it’s filtering. filtering out noise. i believe there are sliding window methods for filtering or smoothing, but do not recommend arbitrarily creating N bins to put your data into

you could pre-filter or pre-smooth the higher noise frequencies out by some rule- but that should also have some justification. if you filter too much of the high frequencies out you’ll get too few peaks. there is going to be some sweet spot for pre-smoothing out noise but minimizing data loss

the noise floor in this data is artifacts from the AFM itself such as scan lines, and compression artifacts (rectangular jpeg artifacts) in the image causing variations in the gray values that are not in the actual molecule.

i would take a look at what assumptions you can safely make about the SPD molecule- what is the minimum size of an atom (amino acid groups, domains) you expect to see in that molecule, for starters- you could assume there shouldn’t be any frequencies in your noise below 2x the size of the smallest atom (amino acid groups, domains) in the SPD molecule, for example, and maybe based on the molecules as well, if you explore how the AFM probe detects electric fields. there are assumptions you could make about the noise frequencies.
the noise floor is relatively high, but if you’re certain you can see something, then hopefully you can apply some consistent pre-filters to produce the desired outcome

if you filter the noise down iteratively until you get the most consistent peak *positions* then you should be able to optimize it without going too far and filtering out real peaks (yes i have tried this…. there is a way to eliminate peaks altogether, but also a broad area where peak number doesn’t change that much, and then on to unrealistic peak numbers. There is that “sweet spot”).

however the idea that there are always going to be N peaks, or that the molecule is always going to be bilaterally symmetrical may simply be wrong. i would just go where the data leads you and stay open to that- you may discover something unexpected by remaining unbiased. if you’re convinced that your methods are producing the wrong result, then go back and change them
i would definitely leave out anything subjective like your visual acuity.

COMMENT: interestingly in almost 1000 plots, some my counts, some scipy, some octave, some LTI, some excel template — there was no significant difference in the number of peaks i counted from the image, and the number of peaks counted in that batch of algorithms.  There was a difference in how i counted the peaks from the plots, and the number of peaks counted by the algorithms.

Little changes in peak counts in the sweet spot “scipy-find_peaks_p0_d30_w5_t-null_h-null”

scipy-find_peaks_p0.7_d30_w10_t-null_h-null
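One way to look for that “sweet spot” is to sweep a single setting and watch where the peak count stops changing. The sketch below sweeps the prominence argument of scipy.signal.find_peaks while holding distance 30 and width 5 fixed, as in the file names above. The helper name `peak_count_sweep`, the list of prominence values and the synthetic noisy trace are assumptions for illustration. The trace here is in 0-255 gray values, so the prominence steps are larger than the 0 to 0.7 range in the file names, which may refer to traces scaled differently.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_count_sweep(profile, prominences, distance=30, width=5):
    """Count the peaks found at each prominence setting (other settings fixed)."""
    counts = []
    for p in prominences:
        peaks, _ = find_peaks(profile, prominence=p, distance=distance, width=width)
        counts.append(len(peaks))
    return counts

# Synthetic noisy trace: five Gaussian bumps plus random noise (not SP-D data).
rng = np.random.default_rng(1)
x = np.arange(600)
centers_heights = [(60, 120), (180, 70), (300, 200), (420, 70), (540, 120)]
clean = 40 + sum(h * np.exp(-0.5 * ((x - c) / 15.0) ** 2) for c, h in centers_heights)
noisy = clean + rng.normal(0, 3, x.size)

prominences = [0, 1, 2, 5, 10, 20, 50, 1000]
counts = peak_count_sweep(noisy, prominences)
```

A run of identical values in `counts` marks the range where the result is insensitive to the setting. Counts that keep changing with every step suggest the setting, rather than the molecule, is deciding the answer.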

 

14 dodecamers of SP-D: peak count per trimer

Peak numbers per trimer (two trimers per hexamer, which i labeled a and b) were compared using all image filters and signal processing peak finding apps, for 448 trimers each for arm a and arm b. There was NO significant difference in the number of peaks found from one side (a) to the other side (b). The t-value is 1.2919 and the p-value is .196725; the result is not significant at p < .05.

I used the socscistatistics.com website’s 2-tailed t-test calculator (see graphic below).  There are just over 8 peaks per trimer (please note that the N term is counted as one whole peak, not divided into “half” as might be expected due to the shared peak of arm a and arm b in each hexamer, or multimer with many trimers). The [ALL trimers, ALL measures] image below shows the average number of peaks.  Arm a is “always” on the left side of the image, and arm b is “always” on the right side of the image. Arm a is also the highest trimer on the left, and arm b is the lower trimer on the image (see diagram at bottom).
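The reported p-value can be checked from the reported t-value alone. The sketch below assumes an ordinary two-sample t-test with equal variances and 448 trimers in each arm, so 448 + 448 - 2 = 894 degrees of freedom. That degrees-of-freedom figure is my assumption about how the online calculator was used, not something stated on the calculator output.

```python
from scipy import stats

t_value = 1.2919          # t-value reported for arm a versus arm b
n_a = n_b = 448           # trimers per arm
df = n_a + n_b - 2        # degrees of freedom, assuming a pooled two-sample test

# Two-tailed p-value: twice the upper-tail area beyond |t|.
p_two_tailed = 2 * stats.t.sf(abs(t_value), df)
significant = p_two_tailed < 0.05

print(f"df = {df}, p = {p_two_tailed:.4f}, significant at .05: {significant}")
```

With the raw per-trimer peak counts in two arrays, `scipy.stats.ttest_ind(arm_a, arm_b)` would return the t-value and p-value directly.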

The dodecamers expectedly fall randomly on the mica surface during preparation and are scanned in a fixed manner by the probe, and are assumed to represent unbiased events.  The lines measuring grayscale in each molecule are drawn the length of each hexamer (CRD to CRD) through the center-width, and can be at any angle from left to right, as well as curve or bend.  This is assumed to be random. A segmented line was used in ImageJ to accommodate this variability.
I tested whether the same dimensions were recorded when tracing through a hexamer with ImageJ by rotating the image at various angles, and plot measurements were consistent regardless of how the image was oriented. That said, I also tested plots in Gwyddion and was not successful at creating unbiased plots when tracings were vertical or lengthy. ImageJ was reliable and easy to use, so all plots were created with that program.

Example below is a tracing for a hexamer, the direction of the trace, start-point, and portion of the trace ascribed to the trimer, and hexamer are highlighted. Light line with nodes show the original trace in ImageJ,  white arrow shows direction and inclusion area for a hexamer trace (CRD to CRD) and the red line shows the area included in the calculation of peaks per trimer (which is different than peaks per hexamer, since in that case the N term peak is only counted once per hexamer).

14 total dodecamers (896 trimers plotted)

14 total dodecamers (896 trimers plotted, incremental addition of plots).

– peak widths-nm, peak height and valley-grayscale –  Little changed with the signal processing and image processing filters. Plots were generated in excel (the silly shoulders that excel creates, which I don’t know how to get rid of in excel, were removed in CorelDRAW by deleting those nodes on either side of the peaks).
The plots are virtually identical, 8 peaks, N term peak here is NOT divided in half for each trimer but is measured as a whole peak.

Individual plots from analyzing 4, 6, 8, 12 and 14 dodecamers are shown at the same width (@145nm) and grayscale (0-255) (below).  The very infrequently detected very tiny blip present in the N term peak is not counted as one of the 15 total (8 per trimer) peaks.

Using the original excel plot (which has the lumpy corners) cut and pasted into the PeakValleyDetectionTemplate (using “smooth 3”) one can compare the peak detection.  The tiny peak (shown in purple, and detected about 30% of the time overall) is still visible using the PeakValleyDetectionTemplate (bottom graph).  The excel plot of 14 dodecamers (top graph) shows it clearly (tiny purple peak, on the downslope of the N term peak). Gray spikes on the baseline of the PVDT plot show the valleys (of the peaks) detected using PVDTxlsx smooth 3.  The “tiny peak” is still present, as a very tiny change in the downslope of the plot.

Legend: peach = N terminal peak, 100% occurrence (the center N peak is not shown); purple = as yet undefined tiny peak, 31% detection; medium green = glycosylation peak, 100% detection; dark green = as yet undefined peak 4, 98.88% detection; pink = narrow small as yet undefined peak 5, 67% detection; white = broad but low as yet undefined peak 6, 95% detection; yellow = neck peak, 44.5% detection; dark orange = CRD peak, 100% detection.

It seems very likely that the addition of more plots will make little difference in the number of peaks found per hexamer (15) and the relative width and height of those peaks.  These images were all obtained using rhSP-D with known glycosylation.

The number of glycosylations per trimer is not defined (to my knowledge), thus the height and width of the glycosylation peak could vary.

The N and CRD peaks are very consistent in relative height and width. The neck peak is often not detected, because the three CRD in each trimer vary in position and apparently can completely obscure the neck peak by falling over it during preparation.

Over 1000 different plots of trimers comprise these figures.

Comparisons with other SP-D images (those without glycosylation, those from other species) would be valuable in helping to create a full length model of the structure of SP-D hexamers, dodecamers, and multimers.

14 dodecamers of SP-D: peak widths, heights, valleys (working)

14 dodecamers of SP-D: Width in nm of all peaks. The image below is a thumbnail of each of the SP-D dodecamers used in this analysis (number designations are my own, but bar markers for calculating magnification come from the original publication(s) (Arroyo et al)). All AFM images used (for the data below) are rhSP-D. Dodecamers labeled 127 and 4A are the same molecule obtained from different figures within the publication (deliberately used for comparison measures); all other SP-D molecules are unique.

The total number of peaks (traced using the segmented line option in ImageJ) has been shown many times to be just over 15 peaks per hexamer of SP-D. And the segregation of the various peaks plotted in ImageJ into a 15 peak-category has been largely influenced by my own 1) general assessment of the general shape of the plots, peaks and sub-peaks, and 2) the obvious mirror symmetry of the hexamers (and trimers) of SP-D.
Total number of trimers measured is 896 (14 dodecamers, with many processing apps and image filters). Trimer measurements include the entire N term domain peak in each, and hexamer measurements include each peak appearing from CRD to CRD.  Data could be adjusted somewhat if the arm of each trimer in a hexamer were normalized to a known distance in nm from a center point in the N term domain peak. This has not been done yet in these data.
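The normalization mentioned above has not been applied to these data, but the idea can be sketched briefly. This is only an illustration: the function name `normalize_arm` and the target arm length of 72.5 nm are placeholders (not measured values), and the center and arm-end positions would have to be read from each trace.

```python
import numpy as np

def normalize_arm(distance_nm, center_nm, arm_end_nm, target_arm_nm):
    """Rescale positions along one trimer arm.

    The center of the N term peak is moved to 0 and the arm end (CRD) is
    placed at +/- target_arm_nm, so arms of different traced lengths can
    be compared on a common axis.
    """
    distance_nm = np.asarray(distance_nm, dtype=float)
    scale = target_arm_nm / abs(arm_end_nm - center_nm)
    return (distance_nm - center_nm) * scale

# Hypothetical arm traced from the N term center (0 nm) to the CRD (70 nm),
# rescaled to a placeholder arm length of 72.5 nm.
positions = normalize_arm([0.0, 35.0, 70.0], center_nm=0.0,
                          arm_end_nm=70.0, target_arm_nm=72.5)
print(positions)
```

Peak widths measured on the original axis would be multiplied by the same scale factor.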

Mean peak width is based on plots made in ImageJ, of images subjected to various image processing and peak counting apps. The summary of progressive analyses (6, 8, 12 and now 14 dodecamers) shows that little has changed since the first measurements. At this point most of the data is derived from signal processing functions (5 signal processing functions vs my 1 set of counts from the images), and while these are purported by some to be unbiased, it must be recognized that I choose the function settings that I think best fit the image. Largely, the signal processing is “similarly biased” to image processing filters and personal observations. Graphics below show the mean peak width in nm +/- SD for each progressive analysis. Percent detection rates for all peaks are found here or mentioned below.

I have added this “iffy” peak data (mint green below) because I really do think it exists sometimes. It is detected by the signal processing functions as a depression or division within the center of the N term junction in just 3 of the 14 dodecamers (a mere 15 times out of 896 plots), but I see it more often than that. I find a similar low detection rate by signal processing functions for the tiny peak on the downslope of the N term junction peak (which I call the “tiny” peak). Sorting the data by my assessment and all other assessments should point this out (future project).

While I also have observed what looks to be side to side attachment of N term domains in hexamers, most frequently the N term peak looks to be an end to end attachment, where sometimes there is a decrease in grayscale values (peak height), thus forming two peaks. How often this is detected by the ImageJ plots is very much dependent upon how I trace the line within the center (lengthwise) of the hexamer. In multimers of SP-D, the center N term peak depression is very often pronounced. Data below show that in dodecamers it is infrequent and narrow in width, a very shallow depression at the top of the N term peak.

How much the presence of this peak in the center of the N term peak influences the total peak number is probably minimal.

The heights and valleys of peaks in the plots of SP-D dodecamers are a measure of grayscale (0-255). These values are determined by ImageJ for each of the plot lines (two lines plotted per dodecamer, from one CRD of one hexamer to the CRD at the other end) of the images, unprocessed, or processed by a range of filters in a range of programs (link to the exhaustive list of filters and functions tried). Five signal processing functions and two filters were used most commonly, and were selected for the output which most resembled what I saw in the images.

Peak height of the N term junction peak (central in the hexamer): data from four different summary datasets. The bottom row is an update and inclusive of the top three rows (as is true for the data above). The yellow columns are the means and SD ONLY for detected peaks; the white columns are for all data for that group. It is nice that peaks from 14 molecules plotted with many variations show similar outcomes.
The mid N peak width is so tiny as to not maybe be worth making a graph for. I will decide.
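The difference between the "detected only" (yellow) and "all data" (white) summaries for peak height can be illustrated with a short sketch. It assumes that a plot in which a peak was not detected is recorded as 0 for that peak; that convention, the function name `summarize_peak`, and the example numbers are assumptions for illustration, not the actual spreadsheet layout or data.

```python
import numpy as np

def summarize_peak(values):
    """Mean and SD for all plots vs. only plots where the peak was detected.

    Assumes an undetected peak is stored as 0 (an illustrative convention).
    """
    values = np.asarray(values, dtype=float)
    detected = values[values > 0]
    return {
        "n_all": values.size,
        "n_detected": detected.size,
        "detection_pct": 100.0 * detected.size / values.size,
        "mean_all": values.mean(),
        "sd_all": values.std(ddof=1),  # sample SD
        "mean_detected": detected.mean() if detected.size else float("nan"),
        "sd_detected": detected.std(ddof=1) if detected.size > 1 else float("nan"),
    }

# Made-up grayscale heights for one peak across four plots (two not detected).
summary = summarize_peak([0, 0, 100, 120])
for key, value in summary.items():
    print(key, value)
```

For a rarely detected peak the two summaries diverge sharply, which is why it helps to report both alongside the detection rate.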

Peak valleys for each of the 15 peaks per hexamer.

The image below is just of the very rare center blip in the N term peak, which I have often mentioned as being prominent in the multimers greater than the dodecamer. This peak is detected in only 3 of the 14 images, and peak width (in nm) and peak height and valley (grayscale 0-255) are shown here.

Peaks per hexamer of SP-D

Peaks per hexamer were counted three ways –
1. IMAGE = my counts of bright spots in the AFM image (aka peaks). This was recorded for each trimer,  hexamer, and collated for each dodecamer (N=14), and for each image processing filter and for each signal processing function.

2. PLOTS = my counts from the “image” of each of the plots created by ImageJ from my trace through the center of each hexamer in the direction of the CRD peak to the opposite CRD (as in, end to end). Directions of the segmented line through each hexamer were ALWAYS traced in the same direction (left to right) for all the peak finding and peak counting apps.

3. SIGNAL = peak counts were generated from 5 approaches (a Python/Scipy app, a Stack Overflow app, Octave (two functions: ipeakM80, AFPPxy), and a PeakValleyDetectionTemplate.xlsx), each using the same grayscale .csv files created from traces in ImageJ.
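Since all of the signal processing approaches start from the same grayscale .csv files, a small loader is the common first step. The sketch below assumes a two-column file (distance, gray value) with one header row; the column names shown and the function name `load_profile` are assumptions, since the exact header depends on how the trace was saved.

```python
import io
import numpy as np

def load_profile(file_like):
    """Read a two-column profile (distance, gray value), skipping one header row."""
    data = np.genfromtxt(file_like, delimiter=",", skip_header=1)
    return data[:, 0], data[:, 1]

# Stand-in for a saved trace; a real file path can be passed instead.
example_csv = io.StringIO(
    "Distance_(nm),Gray_Value\n"
    "0.0,12\n"
    "0.5,30\n"
    "1.0,85\n"
    "1.5,40\n"
)
distance_nm, gray = load_profile(example_csv)
print(distance_nm, gray)
```

Feeding every peak finder the arrays returned by one loader keeps the comparison between functions about the functions, not about how the file was read.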

SUMMARY
My peak counts from the actual image consistently fell between the counts from the plots themselves and the peak counts generated by signal processing functions. Mean peak counts from the three methods continue to identify 15 peaks per hexamer.

The summary table below shows both the individual values (896 trimer counts and all processing types) and the individual dodecamer counts (N=14, mean +/- SD). Image (= my counts from each image) vs plot (= my counts from each plot from each image recorded by ImageJ): these two counts are not significantly different at p < .05. However, there is a significant difference between my peak counts from the ImageJ plots and the peak counts tallied from the signal processing functions (p-value is .0119). There is no significant difference between the number of peaks found when I count peaks directly from the image and the number of peaks found with signal processing. Results are given with an N of individual trimer counts (N=896), and as the mean and SD of counts from each dodecamer (N=14).
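The post does not say which statistical test produced these p-values, so the following is only a sketch of one reasonable option: a paired t-test on per-dodecamer mean counts, pairing the two counting methods for the same 14 molecules. The arrays are made-up placeholders, NOT the real counts, and the function name `compare_methods` is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_methods(counts_a, counts_b, alpha=0.05):
    """Paired t-test between two counting methods on the same molecules."""
    result = ttest_rel(counts_a, counts_b)
    return result.pvalue, bool(result.pvalue < alpha)

# Placeholder per-dodecamer mean peak counts (N=14); not the measured data.
plot_counts = np.array([14.5, 15.0, 14.0, 15.5, 14.5, 15.0, 14.0,
                        15.0, 14.5, 15.5, 14.0, 15.0, 14.5, 15.0])
signal_counts = np.array([15.5, 15.0, 15.5, 16.0, 15.0, 16.5, 14.5,
                          15.5, 15.0, 16.0, 15.5, 15.0, 15.5, 16.0])

p_value, significant = compare_methods(plot_counts, signal_counts)
print(f"p = {p_value:.4f}, significant at .05: {significant}")
```

An unpaired test would treat the two sets of counts as independent samples; pairing seems the closer fit here because each method is applied to the same molecules.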

data for 12 dodecamers is here.

and comments from a previous post here.

The graphic above separates the peak finding into separate categories (highlighting that the vast majority of the counts were from signal processing functions). It shows total peaks counted from the image itself (image ONLY), my counts of peak number from the plots from those images (plot ONLY), and the peak counts after all signal processing functions (none of my counts) (signal ONLY). The bottom row is all counts, all methods, all the time (EVERYTHING). Little variation, basically the same number as found a year or two ago: 15 peaks per hexamer.

Comparing 4 sets of peak finding for SP-D

Four sets of data (gathered incrementally, from 6 to 14 dodecamers) are shown below and were examined for number of peaks and sub-peaks per trimer. Each dataset includes the molecules from the prior set, i.e. the same initial 6 are part of the new 14 dodecamer data. An image of one of those 14 dodecamers is shown below, with color-matched circles showing where the 8 peaks per trimer align on the molecule. You will count 9 dots.

The initial number of peaks per hexamer in a dodecamer was found using signal and image processing on many occasions and using over 1000 plots. That number influenced the division of each plot of a hexamer, but ultimately the plot from the image and the peak detection plots were used as the resource for that division. The sub-peak of the N term peak in dodecamers was detected less than 1% of the time (very pale green), but may be more prominent in multimers, and the peak called “tiny peak” (purple) on the downslope of each side of the N term center peak was detected about 33% of the time. These data were included when they appeared.

At opposite ends of the hexamer the CRD peak (dark orange) and neck peak (yellow) occur. The neck peak is sometimes concealed by the overlap of the CRD peak(s), which seem to be a flexible part of a largely rigid molecule and can lie, during preparation, in a floppy cluster that obscures a nearby neck peak. The neck peak is detected as a unique peak about 44% of the time.

Of the “not yet reported peaks” there is the tiny peak (purple) between the N term peak and the glycosylation peak, and the three peaks just lateral to the glycosylation peak. The latter three peaks are as follows: one large peak (detected almost 100% of the time) which is about the same size as the glycosylation peak, and two smaller peaks (pink and white – matching the color of the rows of data). Circles are approximate representation of relative peak widths.

This leaves three additional, as yet NOT reported peaks, bringing the total number of peaks not yet reported to 5.  The percent detection is given below in progressive sets of data.

The number of peaks in each dataset is shown below (top row: number of peaks and number of trimers; columns: sub-peaks, from 1 peak to 8 sub-peaks). Color markers for peaks remain consistent throughout (and also on previous and newer posts). The glycosylation peak (light green row of data) and the adjacent as yet unreported peak (darker green row of data) show consistent, multiple sub-peaks. These sub-peaks are found mainly by the signal processing functions. Addition of dodecamers to the initial dataset shows little change.

LeftRight plot, reverse plot, inverse plot: various peak finding functions – no variation in valley and peak detection

An LR plot of an AFM image of SP-D (taken from a publication of Arroyo et al, 2018), the reverse plot, and the inverse plot were compared using a PeakandValleyDetectionTemplate.xlsx. Using the smooth 11 setting, the left to right plot, the reversed direction plot (made in Excel), and the inverse plot (also made in Excel) showed no differences in the number of peaks and valleys detected. I am doing this for two Octave functions, as well as LTI peak finding and a scipy function. I am just trying to see whether the tallest peaks (and the small preceding and following peaks) are detected without bias, as well as I think they are in this program (found online). Plot images and valley markers were mirrored, and peaks found in the inverted plot matched peaks. My thought that peak height affected peak detection in some programs appears NOT to be the case here. In addition, the image was rotated before plotting (noted in image) to see whether a top to bottom trace was plotted differently (as was found with Gwyddion). A greater difference is found with a “new” tracing (plot) than with reversing the direction of the xlsx peak and valley detection (pink line).
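The inverse-plot check can also be reproduced outside Excel. The sketch below assumes that "inverse" means flipping the grayscale axis (255 minus each value), so that valleys of the original become peaks of the inverse; the tiny profile is made up for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

# Made-up grayscale profile with three peaks and two valleys between them.
profile = np.array([10, 40, 90, 40, 20, 60, 130, 60, 30, 80, 30], dtype=float)

# Assumed meaning of "inverse plot": flip the 0-255 grayscale axis.
inverse = 255.0 - profile

peaks, _ = find_peaks(profile)             # local maxima of the original
valleys, _ = find_peaks(inverse)           # local maxima of the inverse = valleys
peaks_back, _ = find_peaks(255.0 - inverse)  # inverting twice restores the plot

print("peaks at samples:", peaks)
print("valleys at samples:", valleys)
```

If a peak finder is unbiased with respect to peak height, the valleys found on the inverse plot should sit between the peaks found on the original, one valley per pair of neighboring peaks.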

ImageJ seems to trace segmented lines top to bottom, and rotated images similarly.

Using Python scipy peak finder, prominence 0.2, distance 30, width 5, threshold 0, height 0, in the same LR plot and reversed plot, the same peak points and peak number were found.
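A direction check of this kind can be sketched with `scipy.signal.find_peaks` and the settings listed above. The profile here is synthetic (a sum of Gaussians with made-up centers and heights along a 145 nm trace), standing in for a real ImageJ .csv; only the parameter values come from the text.

```python
import numpy as np
from scipy.signal import find_peaks

def synthetic_profile(n_samples=600, length_nm=145.0):
    """Made-up grayscale trace: seven Gaussian peaks of unequal height."""
    x = np.linspace(0.0, length_nm, n_samples)
    centers = [8, 27, 49, 71, 96, 118, 137]       # nm, illustrative only
    heights = [200, 60, 120, 230, 110, 70, 190]   # grayscale, illustrative only
    y = np.zeros_like(x)
    for center, height in zip(centers, heights):
        y += height * np.exp(-0.5 * ((x - center) / 3.0) ** 2)
    return x, y

# Settings quoted in the text (distance and width are in samples).
SETTINGS = dict(prominence=0.2, distance=30, width=5, threshold=0, height=0)

x, y = synthetic_profile()
peaks_fwd, _ = find_peaks(y, **SETTINGS)
peaks_rev, _ = find_peaks(y[::-1], **SETTINGS)

# Map reversed-trace indices back onto the forward trace for comparison.
peaks_rev_mirrored = np.sort(len(y) - 1 - peaks_rev)

print("forward :", peaks_fwd)
print("reversed:", peaks_rev_mirrored)
print("identical:", np.array_equal(peaks_fwd, peaks_rev_mirrored))
```

The criteria used here (prominence, width, height, neighbor threshold) are evaluated on both sides of each peak, which is consistent with the same peaks being found in either direction.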

Using a peak detection script from Stack Overflow, there was a small difference between the forward and reversed plots when using the parameters Lag, Threshold, and Influence.
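The parameter names Lag, Threshold and Influence match the widely circulated "smoothed z-score" thresholding approach, so the sketch below assumes that is the kind of script used (the post does not identify it). It is a minimal version written for illustration, and the short profile is made up. Because each sample is judged against a moving mean and SD of the samples before it, the result can depend on the direction in which the plot is read.

```python
import numpy as np

def smoothed_zscore(y, lag, threshold, influence):
    """Flag samples that deviate from a trailing moving average.

    Returns +1 (above), -1 (below) or 0 for each sample. The first `lag`
    samples only seed the moving window and are never flagged.
    """
    y = np.asarray(y, dtype=float)
    signals = np.zeros(len(y), dtype=int)
    filtered = y.copy()
    avg = np.zeros(len(y))
    std = np.zeros(len(y))
    avg[lag - 1] = y[:lag].mean()
    std[lag - 1] = y[:lag].std()
    for i in range(lag, len(y)):
        if abs(y[i] - avg[i - 1]) > threshold * std[i - 1]:
            signals[i] = 1 if y[i] > avg[i - 1] else -1
            # Damp the influence of flagged samples on the moving window.
            filtered[i] = influence * y[i] + (1 - influence) * filtered[i - 1]
        else:
            filtered[i] = y[i]
        window = filtered[i - lag + 1 : i + 1]
        avg[i] = window.mean()
        std[i] = window.std()
    return signals

# Made-up profile with one spike near the right-hand end.
profile = np.array([1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0,
                    1.1, 0.9, 1.0, 5.0, 1.0, 1.0])
forward = smoothed_zscore(profile, lag=5, threshold=3, influence=0)
# Run on the reversed trace, then flip the result back for comparison.
backward = smoothed_zscore(profile[::-1], lag=5, threshold=3, influence=0)[::-1]
print("forward :", forward)
print("backward:", backward)
```

In the reversed run the spike falls inside the seeding window, so it cannot be flagged; that start-of-trace effect is one plausible source of small forward/reverse differences with this kind of detector.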

Using autofindpeaksplot(x,y) in Octave, different values for LR and LR-reversed arm 1 of the SP-D dodecamer (shown below) appear to be generated automatically(?). Plot number and peak ID are identical whether for LR or LR-reversed. Please note that I have mirrored the reversed plot image over top of the left-right line plot that was derived in ImageJ, just to show they match. Peak number is calculated LR (red), and LR reversed is blue.

Using ipeakM80 for Octave, the initial L-R tracing had 18 peaks, but the 90 degree rotation tracings (both LR and reversed) had 20 peaks. So again the biggest difference is the actual segmented line drawn in ImageJ. The bottom graph (ipeakM80) is the same graph as the one just below this text, with the text removed. So it appears that each of the programs used here detects the same number of peaks when the plot is reversed, and no significant bias occurs due to the direction in which the plot is read.

12 Dodecamers of SP-D: peak number, width, height, and valley

A summary of peaks and valleys of grayscale plots of 12 dodecamers of SP-D is found below. They represent peak heights, valleys, and widths from plots made in ImageJ from published AFM images of the molecule. These three data points were found for each trimer of the dodecamer using peak finding functions from Octave (Autofindpeaks-xy, ipeakM60), Python/Scipy, Stack Exchange, ImageJ (local maxima), and an Excel template (PeakValleyDetectionTemplate.xlsx), and were analyzed from plots from dozens of image processing filters in at least 6 image processing programs (CorelDRAW, Photoshop, GIMP, Photopaint, ImageJ and others).

These charts include “all” data, not sorted by functions or filters.

Currently there are THREE (only) reported peaks in a trimer of SP-D; this would seem to be a significant underrepresentation of the number of peaks actually contributing to the structure. The N terminal peak (light orange), the glycosylation peak (light green), the as yet un-named peak lateral to the glycosylation peak (dark green), and the CRD (carbohydrate recognition domain) peak (orange) are present 100% of the time. The peak proximate to the CRD peak (coiled coil neck domain) (yellow) is present infrequently, but is detected in sufficient numbers to add it to the list. The tiny peak (purple) on the downslope of the N termini peak is present infrequently, but is detected in many dodecamer images. Two additional peaks have specific character as well: a small thin peak (pink) and a broad low peak before the neck peak (white) are consistent, but not detected 100% of the time.

The mean number of peaks using all the counting apps and functions including those counted by me from the original plots,  is around 15.  Whether the hexamer has an odd number of peaks (with a possible two portions to the N termini peaks) or an even number of peaks (with the N termini peak being center, and also occurring once) is not determined. There are images which show both occurring.

This is an image of SP-D retrieved from a published article (see ref on image), showing how each of the hexamer-arms were traced, and how a diameter of the molecule was traced (touching three of the four trimeric CRD domains). This particular molecule has been shown countless times on this blog.  Hexamers were “always” traced from left to right, and labeled separately as 1a, 1b, 2a, 2b (each trimer recorded separately) and replicate tracings for signal processing function plots and image filter plots were traced in an identical manner. In this particular image the original figure had an identifying letter which was patched and that line is visible in the image below, but did NOT impact the tracings of the image of the dodecamer itself. Green bar=100nm which was derived from the original figure.

Images of the 12 dodecamers used in this analysis are shown here.