PHOTO: Diemersdal.

With technology at the fingertips of researchers and producers alike, it is now possible not only to generate and study vast datasets, but also to find relationships between them through machine learning and statistical analysis. Such relationships could provide valuable information on yield estimation, véraison and harvest date.

Machine learning (ML) is commonly described as a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. But why is this important? An incredible amount of data has been captured within the wine grape industry in the Western Cape over the years and put to good use, but all of this information could be even more valuable if relationships between datasets ultimately allowed the producer to estimate yield more accurately.

In addition, data capturing can be done not only by traditional means, but also through remote sensing (RS) by satellite through initiatives like FruitLook. This article discusses research funded by Winetech which investigates the use of available “big datasets” for wine grape crop estimation. The research was led by Dr Caren Jarmain and other researchers from Stellenbosch University, with support from WineMS and Vinpro.



What were the aims of this study?

Two specific aspects were investigated:

• Whether relationships existed between the spatial datasets available through FruitLook (FL) and vineyard block yield information when statistical analysis and machine learning (ML) approaches were applied.
• Whether these relationships could be used to build a wine grape yield model and possibly a harvest date model.


What are the current wine grape yield estimation challenges?

Simply put, the yearly under- or overestimation of vine yield by producers leads to under- or overestimated wine volumes, which in turn could lead to a loss of money. It is estimated that the expected error could be up to 20% for forecasts based on bunch counts in spring, 10 to 15% for forecasts based on berry counts at fruit set and 5% for forecasts based on harvesting segments close to harvest. The estimated yield has a direct effect on cellar capacity, chemical usage and marketing activities. For growers, yield directly affects the planning of harvesting processes and the scheduling of labour and machinery. Current yield prediction relies largely on historical yield data and weather indices combined with manual vineyard measurements and sampling, but this approach is inaccurate and time-consuming.
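To make the error levels above concrete, here is a minimal sketch of what they imply for a single block. It assumes the error is expressed relative to the forecast, and the 100-tonne forecast and the `forecast_bounds` helper are hypothetical, chosen only for illustration.

```python
def forecast_bounds(forecast_tonnes, max_error):
    """Return the (low, high) yield range implied by a relative forecast error."""
    low = forecast_tonnes * (1 - max_error)
    high = forecast_tonnes * (1 + max_error)
    return low, high

# Upper error levels quoted in the text, by estimation stage.
STAGE_ERRORS = {
    "bunch counts in spring": 0.20,
    "berry counts at fruit set": 0.15,
    "close to harvest": 0.05,
}

forecast = 100.0  # hypothetical block forecast in tonnes
for stage, err in STAGE_ERRORS.items():
    low, high = forecast_bounds(forecast, err)
    print(f"{stage}: {low:.0f} to {high:.0f} t")
```

Under these assumptions, a spring forecast of 100 t is consistent with anything between 80 t and 120 t, which is the kind of spread that complicates cellar and labour planning.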


What measuring parameters and sources were used in this study?

Selected production areas with the required “big datasets” (block specific and remote sensing derived) were included in the study and extended over the Coastal, Breede and Olifants Wine of Origin regions of the Western Cape. Three main data sources were considered:

• FruitLook data – weekly FL spatial datasets on crop growth, water use and nitrogen content for five production seasons (October until April for 2011/12 to 2015/16) at a spatial resolution of 20 m.
• Crop production data – actual vineyard block production information (quality and quantity) for the above-mentioned five production seasons obtained through the industry.
• Block boundaries – vineyard block field boundary information obtained through the industry, indicating the geographical locations and extent of the fields considered.

Additional datasets were derived and all datasets were joined at block level to study the relationship between crop yield and RS derived FL data, using a combination of statistical modelling and ML. Initially the ML involved manually changing target variables, but subsequently a brute force approach was applied. This involved a near-exhaustive set of experiments, each using a different permutation of target variables, input variables and geographical areas.
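The brute force approach can be pictured as enumerating a grid of experiments. The sketch below is an assumption about how such a grid might be built, not the study's actual code: the variable names are hypothetical labels based on the datasets and regions mentioned in this article, and unordered subsets of input variables are used where the text speaks of permutations.

```python
from itertools import combinations, product

# Hypothetical labels for illustration; the study's real variable lists are not given here.
TARGETS = ["yield", "harvest_date"]
INPUTS = ["crop_growth", "water_use", "nitrogen_content"]
REGIONS = ["Coastal", "Breede", "Olifants"]

def input_subsets(variables):
    """Yield every non-empty subset of the input variables."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)

def experiment_grid(targets, inputs, regions):
    """Enumerate one experiment per (target, input subset, region) combination."""
    return [
        {"target": t, "inputs": subset, "region": r}
        for t, subset, r in product(targets, input_subsets(inputs), regions)
    ]

experiments = experiment_grid(TARGETS, INPUTS, REGIONS)
print(len(experiments))  # 2 targets x 7 subsets x 3 regions = 42
```

Each entry would then be handed to a model-fitting routine and scored, so that the best-performing combinations can be compared across cultivars, regions and seasons.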



What conclusions can be made from this study?

Although crop yield per block is frequently recorded, the associated hectares and other related information are not regularly updated. Block boundary information is likewise not readily available in electronic format.

This study was a first attempt to model grape yield using RS variables. The initial regression analyses, which looked for the strongest relationships between individual FL variables and yield, generally gave poor results (R² < 0.3), but the models improved when individual cultivars, specific regions and seasons were considered. For example, the 2014/15 season’s weekly (non-aggregated) FL variables produced a strong (R² = 0.83) model for Pinotage in the Coastal region. Other ML findings, especially from the Olifants River region, were very encouraging. When Random Forest (RF), a machine learning classifier, was applied to the 2015/16 seasonal (aggregated) FL variables in this region, an overall accuracy of 85% was achieved when all the cultivars were considered. Similar results were observed for the 2011/12, 2013/14 and 2014/15 seasons, confirming the consistently strong relationship between all the FL variables and yield for this region.
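For readers unfamiliar with the coefficient of determination (R²) quoted above, the following sketch shows how it is computed for a single-variable linear regression. The data points are invented for illustration and are not from the study; the helper names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def r_squared(x, y):
    """Share of the variance in y explained by the fitted line (0 to 1)."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Invented values: one input variable against yield for four blocks.
variable = [1, 2, 3, 4]
block_yield = [1, 3, 2, 4]
print(f"R2 = {r_squared(variable, block_yield):.2f}")  # R2 = 0.64
```

An R² below 0.3 means the fitted line explains less than 30% of the variation in yield, whereas 0.83 means it explains most of it.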

More data and work are needed regarding a harvest date model. The data used came from fewer than 30 blocks, but yielded promising results.

Ultimately, based on the regression and ML experiments the following was concluded:

• No individual variable can be used to model wine grape yield.
• Using RS data for wine grape yield modelling is very complex as other (non-remote sensing) factors often have a (more) substantial impact on yield.
• The accuracy of the models is strongly driven by cultivar (with Chenin blanc and Colombar being the most successful) and by region (with the Olifants River region being the most successful).
• Weekly FL variables generally produced stronger models than (aggregated) monthly and seasonal variables.
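The difference between weekly and aggregated variables can be illustrated with a small sketch. The weekly values below are invented, the `aggregate` helper is hypothetical, and treating four consecutive weeks as a month is a simplifying assumption; the study's own aggregation method is not described in this article.

```python
from statistics import mean

def aggregate(weekly_values, weeks_per_period):
    """Average consecutive weekly values into coarser periods."""
    return [
        mean(weekly_values[i:i + weeks_per_period])
        for i in range(0, len(weekly_values), weeks_per_period)
    ]

# Invented weekly values of one FL variable for a single block.
weekly = [1.0, 1.2, 1.6, 2.0, 2.6, 3.0, 2.8, 2.2]

monthly = aggregate(weekly, 4)             # two values, one per four-week period
seasonal = aggregate(weekly, len(weekly))  # a single value for the whole series

print(len(weekly), len(monthly), len(seasonal))  # 8 2 1
```

Aggregation reduces eight inputs to two or one, so week-to-week variation is averaged away before a model ever sees it.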


What future research is recommended?

The research team recommends that the wine industry develop and maintain a standardised geographical database of vineyards and their related attributes, since data inaccuracy and incompleteness were barriers to the research. Sawis records, as well as the “Fly-over” database from the Western Cape Provincial Department of Agriculture, would serve as a good starting point.

More research on wine grape modelling is strongly recommended and to that end other raw satellite data (e.g. from Sentinel-2 and Landsat-8), in addition to the FL data, should be considered. The area considered in the research should also be expanded.

Finally, more research into modelling harvest date should be done, with a specific focus on the use of ML and a factor classification approach. This would identify which FL datasets drive the positive results obtained. The current dataset served the exploratory work well, although it was too limited to be split into a training and a test set for ML while still retaining the seasonal variability, as well as the differences between sites and cultivars that also affect phenology modelling.


– For more information, contact Bernard Mocke at
