
Do image processing blurs and filters have a significant impact on peak counts of SP-D hexamers?

5 and 10 px gaussian blur filters have little impact on peak counts of SP-D hexamers, judging from a summary of peak counts (see image below).
Much of the peak count data collected for a single dodecamer of surfactant protein D (as a grayscale plot with peaks along a line drawn through the center of each of the two hexamers, CRD to CRD) (my molecule number 41_aka_47) came from images that had been subjected to a 5px or 10px gaussian blur. The blur tools in the programs listed in previous posts did not specify the pixel radius of the blur, except one (a filter in Photoshop 2021) that called its blur a 10px radius; that plot was included with all the other gaussian blur filters. Only one image was processed with what I presume to be a gaussian blur (Octave blur 101-10).
Of the total 159 sets of plots, 7 images received NO processing, 62 received a 5px gaussian blur, and 50 received a 10px gaussian blur. Almost always the lowest pixel blur that would barely smooth the image was used. Gaussian blurs were applied before any other image and signal processing, to remove the low-resolution pixelation in the original images (which were saved from PDF files). High levels of blur are not in the best interest of preserving detail, so the amount of blur always depended on the quality of the original (access to the original, presumably higher-resolution, digital files was not possible).
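For anyone who wants to reproduce this step outside the GUI programs, the sketch below shows roughly what a 5px gaussian blur amounts to in Python using Pillow. The file name is a placeholder, and note that different programs define the "radius" of a gaussian blur differently (radius vs. sigma), which may account for some of the variation between filters.

```python
# Rough sketch of a 5 px radius gaussian blur; this is an illustration,
# not the exact GUI filters used above, and the file name is hypothetical.
from PIL import Image, ImageFilter

img = Image.open("spd_dodecamer.png").convert("L")        # load as grayscale
blurred = img.filter(ImageFilter.GaussianBlur(radius=5))  # try radius=10 for the heavier blur
blurred.save("spd_dodecamer_blur5.png")
```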

Below is a summary of the impact of gaussian blur on peak counts: gaussian blur (either 5px or 10px) alone, gaussian blur with some additional image processing, and the whole set together. The mean number of peaks counted in each of the hexamers was 15.0 +/- 1.24, nothing really different from what the entire set of plots predicted (see previous post).

It would appear that removing pixelation with minimal processing (in this case just a modest gaussian blur, or a median filter) does reduce the number of grayscale peaks in each hexamer. The highest number of peaks per hexamer is in the "no image processing" group. The effects of processing were easy to see directly from the plots, but required a more unbiased verification. Please don't confuse the titles in the summary data, e.g. "mean filter", "median filter", "maximum", "minimum", "box blur" (which apply to the filters applied), with the vertical data, which labels the calculated statistics with similar names: "mean number of peaks", median, mode, max (as in the maximum number of peaks counted in a dataset of a plot), min, sum, and var (variance). Totally different things, same names.


The limit range filter (Gwyddion) was the filter I liked best, especially when used with a gaussian blur. There is only one image on the graph below, but there are dozens using this filter in the signal processing group. There is a pretty obvious increase in peaks with this filter.

Maximum, minimum, mean and box filters were applied to this image, sometimes with a gaussian blur as well. Perhaps the minimum and box filters increased the number of peaks found, but I would not personally use these filters to enhance peak detection. It was reasonably evident from the image, after application of the minimum, mean and box filters, that the result was not what I was looking for.

Lowpass, unsharp mask, and smart blur (all counts from image processing).

Just using the bitmap filters and masks of CorelDRAW and CorelPhotoPaint, Photoshop, Gwyddion, ImageJ, Paint.net, Inkscape, Octave (used here only for image processing, no signal processing), and GIMP gives the following summaries. (All peak counts from each of the image processing programs, each analyzed separately to see whether there was variation in the algorithms used.)



The value I see here puts image processing into the category of "nice" but not very specific. There is so little variation between programs that opinion, ease of use, and type of output would seem to be the best criteria for choosing which to use in microscopy. I have a preference for the proprietary programs, just for ease of use (except ImageJ, which is really a great program), and Gwyddion, though the only use I found for it was image processing, and I also found its plotting function produced lots of errors (in my hands). Gwyddion does have a great function for limiting range, and I used that often. It seems that with image processing, 15 peaks per hexamer is going to be the very best result: consistent and easy to verify. Abbreviations are listed in a different blog post (here).

Summary of peak counts for ONE (1) surfactant protein D molecule, after the application of 18 image and signal processing apps, with variations in settings

1 (ONE) molecule (AFM image of surfactant protein D), which I call 41_aka_45 (published by Arroyo et al, 2018)

SUMMARY: 159 different grayscale (LUT) plots of one surfactant protein D dodecamer, plotted as 2 hexamers (arm 1, arm 2) and as 4 trimers (arms 1a, 1b, 2a and 2b), show that personal judgement is still critical for determining the number of brightness peaks along this molecule.

METHODS: 12 image processing programs (listed below) were used to filter, mask, limit range, adjust contrast and HSL, etc., to enhance the appearance of peaks in this image.
2 programs (listed below) were used for plotting grayscale data: ImageJ (used on almost all images) and Octave/Matlab (occasionally).

5 signal processing programs were used to count the number of peaks in grayscale plots made with a 1px segmented line in ImageJ, with dozens of variations in the input statements for those signal processing programs.

12 peak counts were obtained from volunteers for a single set of plots of this dodecamer (ages 8 to 74; volunteer citizen scientist impressions of the number of peaks).

DATA: All data were saved in an Excel file, with all image and signal processing parameters, to allow assessment of which combinations of processing programs and types produce the most convincing peak counts, widths, and heights.

PURPOSE: To identify a method (or methods) for assessing the number of peaks and the relative peak widths and heights in AFM images of surfactant protein D (a method that could be applied to countless other AFM applications and other molecules).
RESULTS: Nothing produced the results I had hoped for, but there is a clear trend in peak number, width and height. (See next post sometime in the future).
CONCLUSION: Variations in the number of peaks detected in each hexamer produce both even and odd numbers of peaks; the mean and median are similar for hexamers (mean = 15 peaks, median = 15 peaks, mode closer to 13 peaks). The likely peak number for any given surfactant protein D hexamer is still open, since this is an analysis of methods, not molecules. The use of this molecule was based on observations of about 100 other images, and it represents a reasonably good choice for selecting a methodology. There are clear options for the most efficient peak detection in image and signal processing, and there are just as clear deficiencies.

A summary of the number of programs, plots and peaks applied to this one molecule is shown below. It represents the sum total of the data from image processing, signal processing, and quick peak counts from citizen scientists, as well as the 100 or so counts of my own from each image.

Abbreviations:
Image processing programs: psd=Photoshop (proprietary; Photoshop 6 and Photoshop 2021); cpp=CorelPhotoPaint (proprietary raster graphics program; CorelPhotoPaint X5 and 2019); cdr=CorelDRAW (proprietary vector graphics program; CorelDRAW X5 and 2019, where the image adjustment menu was used); gw=Gwyddion, a multiplatform, modular, free software for visualization and analysis of data from scanning probe microscopy techniques, used here ONLY for image processing; paint=Paint.net (free, open source raster graphics program); gimp=GIMP, the GNU Image Manipulation Program (free, open source); inkscape=Inkscape.org (free and open source vector graphics program); Octave/Matlab (Octave is free and open source), used briefly for image processing (separate from signal processing, limited to 3 plots total, as this was a super cumbersome way to process images); ImageJ, used for both image processing and Excel plots (ImageJ is a Java-based image processing program developed at the National Institutes of Health and the Laboratory for Optical and Computational Instrumentation) (free). I added a column of counts of my own, made from each processed image as well (my peak counts).
Signal processing apps:
batchprocess (Aaron Miller's app for batch processing Excel files using the Lag, Threshold, Influence open source library); Octave/Matlab (various settings for FindPeaksPlot, AutoFindPeaksPlot, Ipeaks; check out Thomas O'Haver's website); scipy (Daniel Miller's app for peak finding using prominence, distance, width, threshold and height; SciPy/Python open source library); and two Excel peak finding templates, 1) PeakDetectionTemplate.xlsx and 2) PeakValleyDetectionTemplate.xlsx (Thomas O'Haver). Many variations (in amplitude threshold, slope threshold, lag, distance, width, smoothing and many others) were used in signal processing. (A minimal sketch of the kind of peak-finding call the scipy app makes appears just after this abbreviations list.)
Citizen science: peak counts in a single set of plots of this dodecamer were obtained from a group of friends and family, ages 8 to 78. I did not include my own counts in this category, as they numbered in the hundreds, not just one set of plots as above.
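Below is a minimal sketch of the kind of call the scipy-based peak finding makes, just to show how prominence, distance, width and height enter. The CSV file, column name, and parameter values here are placeholders, not the actual settings or data used.

```python
# Minimal sketch of scipy peak finding on an exported grayscale plot.
# File name, column name, and parameter values are hypothetical.
import pandas as pd
from scipy.signal import find_peaks

profile = pd.read_csv("hexamer_plot.csv")["Gray_Value"].to_numpy()

peaks, properties = find_peaks(
    profile,
    height=10,       # minimum grayscale value a peak must reach
    prominence=5,    # how far a peak must stand out from its surroundings
    distance=10,     # minimum number of samples between accepted peaks
    width=3,         # minimum peak width, in samples
)
print(len(peaks), "peaks found at sample positions", peaks)
```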

Below is just a summary of the number of trimers plotted in each of the above image and signal processing programs.


Summary of all counts of 2 hexamers, one dodecamer

My counts from the image, as processed dozens of times with dozens of filters in the 12 vector and raster imaging programs, produced the most consistent results, and they were very similar to my manual counts of the peaks in the actual Excel plots generated in ImageJ by a trace through the center of the hexamers (2 trimers). Both image processing filters and signal processing algorithms have a huge impact on the variance (var), min, and max of the number of peaks counted (judgement is required).
It is worth noting that my counts of peaks from the images are the lowest, which tells me that processing might be a good backup for confirming what is seen by eye. In fact, the reason for this study was that I saw a pattern in the peaks (more detailed and specific than the pattern reported in the literature), and I wondered whether it was provable.

I doubt adding more peak counts to this data is going to change much (LOL). So now the approach is to separate out the filters, masks, and algorithms that best fit the mean and median.

 

It is not sufficient just to count the number of peaks in an AFM image of surfactant protein D

It is not sufficient just to count the number of peaks in an AFM image of surfactant protein D. That sounds like a ridiculous comment, but the signal processing programs do just that: provide a number of peaks. This just doesn't help when there are subtle peaks, very prominent peaks, differences in peak width and height (some relative to the height of other peaks), and gross variations in the way the trimer lies (three loosely or tightly bound monomers, depending upon which domain you are looking at), with the CRD even bending over the neck domain in some images, which changes the plot lines dramatically.

Similarly, the N termini junction has been noted as having two different ways for the trimers to assemble into multimers, and this too shows up in the plots: sometimes two (or three) peaks at the top of the N term grayscale peak, sometimes just one peak. This will likely sort out in two ways: 1) whether the line plotting the grayscale goes through a side of the N term junction or through the "center" (the latter likely not always being "up" in the image).

Sorting the peaks into at least four different categories is necessary when counting up the number of peaks. In this study, the N termini peak is counted from its valley most distant from the CRD through the entire N term peak, UNLESS there are TWO distinct peaks (as counted by the signal processing algorithms), in which case each half of the whole N term is counted separately.

That way an entire N term length is counted for each trimer, with a grayscale plot that is always recorded from N to CRD (regardless of the direction the protein lies in the image). The actual line is drawn CRD to CRD, but the widths and peak heights are recorded in the database so that all calculations can be performed on trimers, with each beginning at the N term and ending at the CRD.

Point anomalies, pattern anomalies ??

I am finding that there are programmers with interests in biology who understand the need for signal processing of repeating, symmetrical, pattern-containing signals. It is an important issue: there are times when I look at the peaks defined by algorithms (such as Lag, Threshold, Influence) that just don't do what I would like them to do, and there are examples of peak finding (and peak ignoring) that just don't make visual sense to me. See the plot below (an end-to-end grayscale tracing of a surfactant protein D hexamer) with peaks detected using the LTI values, from an app made expressly for me by Aaron Miller. It detects most of the obvious peaks, but in the middle and lower images I am pointing out peaks that, because of the preceding values, are simply ignored, while other peaks that are just tiny bumps on a larger peak are tagged as separate peaks.

Top image: the blue line is the grayscale plot; the boxes are the peak widths (marked as valleys on either side of the algorithm's detected peaks, purple lines); the grayscale axis (y) is normalized to 100. This set of peaks in this particular plot (representing one CRD-to-CRD segmented line drawn through the center of a single dodecamer of surfactant protein D; the image is from Arroyo et al) is, overall, not very different in terms of peak number from that ascribed to the plot by "citizen scientists" (friends and family), "my counts" (about 500 of them), and various signal processing programs: Octave/Matlab peak detection functions, the scipy app (from Daniel Miller), Excel peak and valley detection templates (Thomas O'Haver), and an LTI app (from Aaron Miller).

Here are two instances where I don't like the peaks that are flagged. This sample is from the LTI app (a lag of 5 will use the last 5 observations to smooth the data; a threshold of 1 will signal if a datapoint is 1 standard deviation away from the moving mean; and an influence of 0.5 gives signals half the influence that normal datapoints have). I have put the LTI values for this particular plot into the link. Two specific instances where I disagree are shown in the plots below, each an excerpt from the complete plots above. The plot excerpt on the left shows one peak NOT detected (fat red line above the undetected peak), and the one on the right shows a nonsense (in my opinion) peak (tiny thin red bar above the peak).
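For readers who want to see what those three settings actually do, here is a minimal Python sketch of the published smoothed z-score (Lag, Threshold, Influence) algorithm. I am assuming the app follows the standard reference implementation; this is an illustration, not Aaron Miller's actual code.

```python
# Sketch of the smoothed z-score (Lag/Threshold/Influence) peak flagging
# algorithm; an illustration only, not the app used in this post.
import numpy as np

def lti_signals(y, lag=5, threshold=1.0, influence=0.5):
    """Return +1 where a point lies `threshold` standard deviations above the
    moving mean of the previous `lag` (influence-damped) points, -1 below, else 0."""
    y = np.asarray(y, dtype=float)
    signals = np.zeros(len(y))
    filtered = y.copy()              # flagged points get damped by `influence`
    avg = np.mean(y[:lag])
    std = np.std(y[:lag])
    for i in range(lag, len(y)):
        if abs(y[i] - avg) > threshold * std:
            signals[i] = 1 if y[i] > avg else -1
            filtered[i] = influence * y[i] + (1 - influence) * filtered[i - 1]
        else:
            filtered[i] = y[i]
        # moving statistics over the last `lag` filtered points
        avg = np.mean(filtered[i - lag + 1 : i + 1])
        std = np.std(filtered[i - lag + 1 : i + 1])
    return signals

# e.g. flags = lti_signals(grayscale_plot, lag=5, threshold=1, influence=0.5)
```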

RED BARS: the red bar on the left is over the peak I would LIKE to have detected, and the red bar on the right is over a peak that seems like it should not have been detected. The challenge is to find a model plot and compare the "real plots" back to the model, allowing the extraordinary discrepancies in peak height and width to be tagged rather than removed by moving averages. The same issues exist in image processing: one ends up using judgement, but with judgement comes bias.

It is these irregularities that are making me approach signal processing for biology with much disappointment.

Woefully lacking in subtle peak detection.

This has been a long and frustrating journey — and I thank the two kids who have helped me and a retired professor, and other friends and neighbors who have listened to me complain.  The bottom line….

peak finding programs are great at getting rid of noise, but really poor at detecting subtle bilateral peak symmetry

It's almost as if I find them unable to get out of their "ruts": slopes, thresholds, sliding averages for this, two peaks before that, and what about rounded peaks? It just doesn't work.

A case in point is the enormous number of times I have plotted the same molecule of SP-D and failed to pick up some really tiny peaks beside the N termini junction of a dodecamer, yet managed to find 4 peaks on the downslope of the presumed glycosylation peak. I don't doubt for an instant that adding individual molecules to a site on one, two or three strands of a trimer can produce a bumpy elevation, but if the peak finding algorithms find peaks there, then why not the tiny peak buried right beside the very tall, very wide peak for the N termini? It is like the curse of position. I could, and should at some point, determine whether the tiny peak at the bottom of that valley is ignored or found depending on its direction (before or after) relative to the N term peak. But it sits in the valley between the two largest grayscale peaks in the molecule, right between the glycosylation peak and the N termini peak, so it is doomed. It is also not picked up by novice peak finders. I know this peak is meaningful, but how to get it into the peak detection programs is another story.

Symmetry and subtle peaks are just lost in the numbers.

One image: All plots to date

Surfactant protein D is depicted on many websites, and even Wikipedia, as just the carbohydrate recognition domain and neck domain; little else of the molecule (which includes the N terminal and collagen-like domains) has been modeled. This blog has many posts dedicated to understanding why the other two domains have not been modeled.

The wonderful images found in Arroyo et al offer a great opportunity to look critically at the structure, and it is clear that there is much information available from a deeper look at those AFM images. That was the initial purpose of this blog; however, it became clear that before just plotting the grayscale along a line drawn through the images of the dodecamer arms (hexamers) of SP-D, some serious processing of the images was required. So I set out to find the filters that best enhanced the images without changing their data. It also became clear that an unbiased count of the grayscale peaks along the plots (of hexamers and trimers) was required, so numerous signal processing programs were used to find the "best" algorithms for counting peaks. This, along with image processing, is still subject to the bias of the investigator.
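For what it is worth, the grayscale trace itself does not have to be done in ImageJ. A rough equivalent of a straight-line trace can be sketched in Python with scikit-image (ImageJ's segmented lines would have to be built from several straight segments); the file name and endpoint coordinates below are made up for illustration.

```python
# Rough equivalent of an ImageJ straight-line grayscale plot; file name and
# endpoint coordinates (row, col) are made up for illustration.
from skimage.io import imread
from skimage.measure import profile_line

img = imread("spd_dodecamer.png", as_gray=True)
profile = profile_line(img, src=(120, 40), dst=(130, 460), linewidth=1)
print(len(profile), "grayscale samples along the CRD-to-CRD trace")
```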

The parameters for both image and signal processing are driven by the opinion of the investigator, so I thought perhaps some citizen scientists (friends and family) could be asked to count peaks in the grayscale plots, to compare with the counts from image and signal processing. The bar graph shows over 500 individual attempts to find the number of peaks in a trimer from a single image of SP-D. The images below it show the actual image, an example of one such plot, and the summary bar graph.

Mean peaks per trimer = 8.14 +/- 2.48, but the mode is 7, as shown below. The mode is likely 7 because, when counting the peaks, the N term gets counted for each trimer but is usually seen as one central, brightest, highest peak.

The white arrow on the image of SP-D shows that the entire N term plus the trimer arm is plotted toward the CRD domain. Known peaks are N, gly, and CRD; the other four peaks are consistent and will be meaningful at some point. The peak at the neck is sometimes seen, often depending on whether one of the CRDs of the trimer is lying over top of it.

Just from my own observations, there will emerge an additional, very low and narrow peak at the bottom valley of the N termini peak. It is not shown here, but it is barely detectable on the line plot (though not marked with a color in the lower bar graph) between N and gly.



One surfactant protein D image, lots of peak-number measurements

One surfactant protein D image, lots of peak-number measurements using various signal processing algorithms.
Just the beginning of the assessment of which peak-finding programs work well with AFM images, are easy to use, and generate more insight than just the opinions of the observers.
There really isn't much new gained by using these signal processing programs, for reasons I think might be related to the fact that "noise" is a big issue for signal processing, while symmetry and variation are not that well handled. Just a microscopist's opinion here, not a programmer's.

Clearly there is still a judgement call to be made on whether to use the mode or the mean in deciding the best number of peaks. Differences between the number of peaks on the right and left sides of a dodecamer (differences in the way the molecule has fallen on the mica, and other processing issues) are clearly a stumbling block to determining symmetry in peaks, and the slope, threshold and a number of other signal processing options allow for great variability in peak numbers. I am certainly leaning toward the simplest comparison, that of the human eye, and then a plot with modest peak processing to identify peaks and valleys.

To view the actual image (sadly called 41 aka 45), just roam back through the SP-D posts; it appears a jillion times.

Grayscale plots of N termini peak(s) in SP-D hexamers (as dodecamers): one peak or two?

There is a section on high and low molecular weight surfactant protein D in a publication by Grith Sorensen in Frontiers in Medicine, 2018, which has the following excerpt. "High-molecular weight SP-D multimers are only partly dependent on disulfide crosslinking of the N-termini, and a proportion of SP-D subunits are non-covalently associated. This allows interconversion between HMW SP-D and LMW SP-D trimers, as demonstrated using size permeation chromatography (36) (Figure 1B). The HMW/LMW ratio depends on the concentration of the protein in solution, with low-protein concentrations favoring the decomposition of multimers into trimers. In addition, the HMW/LMW ratio increases with affinity purification of SP-D, suggesting that ligand-binding facilitates assembly of SP-D trimers into multimers (Reference to an earlier article by the same author)."

There is specific reference to the ratio of high to low molecular weight multimers of surfactant protein D in relation to protein concentration (in the laboratory setting), and to the effect of the methionine 11 to threonine 11 allelic variants on the ratio of high to low molecular weight multimers of SP-D in humans.

It seems almost legitimate, then, to consider the two different peak patterns found in the N termini peaks traced from actual images of SP-D dodecamers (traced as two arms, i.e. hexamers: arm 1 and arm 2). The valley seen about half the time in the center of the grayscale N termini peaks (LUT tables traced in ImageJ) from AFM images (Arroyo et al, 2018) might suggest that even among dodecamers there can be both close ties between N termini (covalent links between two trimers) and loose associations, appearing as a single peak or two peaks respectively. In addition, the trace also depends on where the segmented line is drawn during the trace, and on the brightness saturation of the image.

 

Batch process peak detection using Lag, Threshold, Influence

This program was organized by Aaron Miller from online references to using Lag, Threshold and Influence to detect peaks in signals. The signal here is the Excel output from a plot of a surfactant protein D dodecamer (a plot of a single hexamer, CRD to CRD, is shown here), which was subjected to L=5, T=1, I=0.01. The peaks are identified (black line series) while the actual plot is shown as the blue line. Using this csv export, I added the peak widths and heights using CorelDRAW. I will convert height into grayscale, and width into nm. It did take several minutes to create the bar graph, which has been colored in accordance with known, as well as yet unidentified, peaks that I have consistently observed over many plots of nearly a hundred dodecamers of surfactant protein D.
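The height and width conversion mentioned above is straightforward once the image scale is known. The sketch below assumes a hypothetical nm-per-pixel value and made-up file and column names, so it only shows the arithmetic, not the real numbers for this image.

```python
# Sketch of converting exported peak widths (px) to nm and heights to a
# grayscale scale normalized to 100; the scale factor, file name, and column
# names are placeholders, not values from the actual dataset.
import pandas as pd

NM_PER_PX = 0.5                                   # hypothetical scale from the AFM image

peaks = pd.read_csv("hexamer_peaks.csv")          # hypothetical csv export of peak widths/heights
peaks["width_nm"] = peaks["width_px"] * NM_PER_PX
peaks["height_gray"] = 100 * peaks["height"] / peaks["height"].max()
peaks.to_csv("hexamer_peaks_scaled.csv", index=False)
```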

PURPOSE:

1) pie in the sky purpose = adding this peak finding option to ImageJ (which someone else will have to do (LOL)).

2) select just a few of the image processing programs, filters and masks that are free, optimal, easy, and produce images that can be analyzed, and likewise find signal processing programs that are free and easy to use, and identify which settings produce the most useful data for statistical analysis of images obtained from microscopy.

CRD=carbohydrate recognition domain (orange); neck domain (yellow); unknown wide peak (white); unknown low and narrow peak (pink); unknown large, relatively tall peak adjacent to the glycosylation peak(s) (dark green); glycosylation peak(s) (light green); unknown tiny peak between the N termini peak and the glycosylation peaks (purple); N termini peak(s) (peach). The two halves of a hexamer should actually be identical; however, the artifacts that arise from processing (true of all microscopy) mean that not all elements are present in all tracings. E.g., the neck domain is sometimes covered up by the CRD domain, as the former is largely nested under each of the three globby CRDs in each trimer. How I trace the segmented 1px line over the image is hugely important, and I aim for the brightest places along the length of the hexamer. (The image used for this plot has been shown on this site so many times that posting it again just wastes space (LOL).)

 

plot of grayscale peaks found along a hexamer of surfactant protein D

COMING SOON: Are there instances where people can more accurately identify peaks than image and signal processing algorithms?

I don't like this kind of variation in peak finding

I am so frustrated with image and signal processing. I don't care what settings (threshold, smoothing, slope?) are applied. When I see that peak finding tags a tiny peak (see the red arrow on the left; not to say this isn't an important peak, because I think it is, see the orange vertical line under that peak) but ignores a huge, easily seen, not-to-be-overlooked massive peak (see the red arrow on the right and the peak with NO ORANGE line under it), I just don't trust any of it. I understand that slope and amplitude can be adjusted in these programs, but upcoming and trailing values mess with "reality" (LOL).

COMMENT: this is a plot of a hexamer of surfactant protein D (CRD peaks are on each end, N termini junction is the center peak)

COMMENT: Just think… climate scientists and financial advisors are using similar algorithms to predict doom/prosperity.