Meet the vendor: Commvault (Data Protection)

Our community is pleased to announce that Commvault has been successfully evaluated under the WhatMatrix Data Protection criteria. “Complete Backup and Recovery” is now available for (free) comparison and enquiry alongside other industry-leading solutions in this space. Wondering how they rank? Check them out HERE.

Commvault has been a recognized leader in the data management landscape for over 20 years. Commvault’s motto: “The right backup and recovery solution can help you solve hard problems and reimagine your potential – no matter what clouds you’re using, no matter what your stack looks like.”


The portfolio – complete backup & recovery

Commvault’s main product is called Complete Backup & Recovery, and the name says it all: it is intended to be a complete backup and recovery solution supporting all common deployment environments.
There is broad support for clouds, hypervisors, applications, databases and SaaS applications. There are multiple deployment options: on physical hardware, virtual servers, appliances, reference architectures or simply in the public cloud. The suite integrates with almost any storage hardware for snapshots and replication, and can use almost any platform to store your backups. Support for cross-hypervisor restore and replication, plus built-in test/dev sandboxing (even in the public cloud), further adds to the value.


Flexibility and automation

Commvault uses a global deduplication and compression mechanism that makes very efficient use of any backup storage. To top it all off, there is a ‘lean and mean’ web-based dashboard for daily operations! If needed, the solution can be extended using Commvault Activate or Orchestrate.

Activate

Ready to flip the switch on business outcomes, with maximized data value and minimized risk? Commvault Activate enforces data policy and governance, collecting and protecting data from across your organization. It extends Commvault Complete Backup and Recovery with a layer of analytics, workflows, and pre-built solution accelerators that let you know what data you have, contextualize it, apply rules to it, protect it, and use it.

Orchestrate

Get all your data on the same page. Commvault Orchestrate™ provides end-to-end data syncing for faster disaster recovery, dev/test operations and workload migration. Your data, where and when you need it.

Looking ahead

Commvault is evolving with the demands of the industry. In the future we might look closer at its latest entry into the SaaS backup and recovery market, a new solution called Metallic (which might deserve its own category). It is worth mentioning that Commvault clearly aren’t resting on their laurels: they have acquired the “Hedvig” software-defined storage solution, with a plan to integrate it into their backup offering as a new option – but it is too early to evaluate any resulting capabilities. See https://www.commvault.com/hedvig for details.
As always, feel free to provide feedback using the built-in change request mechanism in the matrix or in the comment section below and let us know what you think!

Martijn Moret – Community Consultant (Data Recovery)

PS: Keep an eye out for the on-boarding / evaluation of additional vendors in preparation for the upcoming Landscape Report for Data Protection in early 2020.

Data Protection: Challengers highlighted in latest update

We have an end-of-year update from our Data Protection consultant Yannick Arens. The data protection market continues to evolve at a rapid pace. Most noteworthy are the challengers that constantly innovate to put pressure on the established enterprise backup vendors, blurring the traditional market segmentation lines. However, we still see data protection products largely serving two distinct areas: large enterprise and small/medium business.

Competition heats up in the enterprise space


Products originally targeted at the enterprise market continue to have the broadest range of capabilities. They provide support for evolving on-premises virtualized DC environments and private cloud technologies, and are rapidly adding support for the hybrid-cloud and cloud-first strategies of larger clients. Veeam have a very strong presence in this market, but we see newer entrants such as Rubrik and Cohesity really challenge the incumbents in terms of their feature set. The detailed WhatMatrix feature evaluation reflects this, with Cohesity and Rubrik in top positions.


Mid Market solutions developing rapidly

Vendors targeting the mid market have been more focused on delivering the key features that provide the most value. Here too we see a rapid expansion of features that embrace cloud technologies, including SaaS products like Office 365. These solutions enable businesses to easily adopt cloud for remote backup and DR at much lower cost points. Our latest updates also show an expansion into features that would normally have been the domain of the enterprise products – expanding support for physical systems backup and P2V capabilities provides strong support for private environments but also helps facilitate moves to the public cloud.

Comprehensive evaluation update for data protection category

After on-boarding Altaro, Yannick & team have also updated all vendor products across the comparison to evaluate their latest releases. Head over to the Data Protection comparison for the detailed evaluation. 

Strategies for Planning an SQL Database Recovery

Every SQL database needs a solid recovery plan. As the database administrator, it’s up to you to know the real-time status of the SQL Server service and to have a strategy in place for when you need to restore from a backup. You can never predict exactly what will happen, and things can go wrong quickly.

If your backup plan isn’t up to par, you risk losing all of your data. That’s not something you should ever gamble with, so here are the best strategies for planning an SQL database recovery.

Why Do You Need a Backup Plan?

First, let’s talk about why you need a recovery plan for your SQL database. Whenever you work with SQL Server, you can always face some kind of data loss: low disk space, disk failure, a network disruption, and so on. Basically, there are a number of things that can go wrong with either your computer or the database itself.

Having a recovery plan means that if something does go wrong, you have a copy of your database to restore from. It’s a way to minimize loss and maximize how much data you have available. When you’re creating your database recovery strategy, keep in mind that you’re planning for the worst. While you never want to actually need to put your plan into action, it’s worth the peace of mind.

Backup Options for Your SQL Database

In Microsoft SQL Server, there are three primary options for backups:

  • Full backup – A full backup, as the name implies, is a complete backup of the entire database. It includes enough of the transaction log to restore the database to a consistent state. This is the largest backup to store.
  • Differential backup – This type of backup only captures the changes made since the last full backup, effectively saving space.
  • Transaction log backup – Finally, these help minimize work-loss exposure and truncate the transaction log. They can be used under the Full or Bulk-Logged recovery models.

So how do you choose which is best for you? Most likely, you’ll decide on a healthy mix. You’ll need to consider when your application or database is most actively used, so you can schedule backups during downtimes, as well as the frequency of updates.
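As a minimal sketch of how such a mix looks in practice – assuming a hypothetical database named SalesDB, placeholder backup paths, and the Full recovery model – the T-SQL below combines all three backup types:

-- Weekly: full backup of the entire database
BACKUP DATABASE SalesDB
TO DISK = N'D:\Backups\SalesDB_full.bak';

-- Daily: differential backup (changes since the last full backup)
BACKUP DATABASE SalesDB
TO DISK = N'D:\Backups\SalesDB_diff.bak'
WITH DIFFERENTIAL;

-- Every 15 minutes: transaction log backup
-- (requires the Full or Bulk-Logged recovery model)
BACKUP LOG SalesDB
TO DISK = N'D:\Backups\SalesDB_log.trn';

Scheduling these statements at different frequencies (for example, as SQL Server Agent jobs) balances storage use against recovery granularity.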


Recovery Plans

Now that we understand the types of backups, it’s time to talk about recovery strategies. Taking the time to understand these in advance will help you make a smarter decision. Remember, it’s about being prepared for anything. 

  • Restore from another SQL Server – The first option is to restore your database from another SQL Server instance where you’ve stored a backup copy. To do this, you’ll start with a T-SQL command such as RESTORE HEADERONLY, which checks what’s in the backup file, before performing the full restore (see the sketch below).
  • Same-server restore – Another option is to restore a database on the same server. To do this, you’ll use Enterprise Manager: right-click on the database you need to restore and select All Tasks. From here, you’ll have the option to both Backup and Restore.
  • Mirroring – Mirroring is a popular option in which you create a mirror image of your database on another server. Changes are then automatically transferred from your primary server to your secondary server.
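As a hedged sketch of the first option – the database name and file paths below are hypothetical placeholders – you would first inspect the backup file, then restore the full backup without recovery so a differential can be applied on top:

-- Check what's in the backup file before restoring
RESTORE HEADERONLY
FROM DISK = N'D:\Backups\SalesDB_full.bak';

-- Restore the full backup, leaving the database able to accept further restores
RESTORE DATABASE SalesDB
FROM DISK = N'D:\Backups\SalesDB_full.bak'
WITH NORECOVERY, REPLACE;

-- Apply the latest differential and bring the database online
RESTORE DATABASE SalesDB
FROM DISK = N'D:\Backups\SalesDB_diff.bak'
WITH RECOVERY;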

Conclusion

As you can see, you have options for both simple and complex backup and restore with your SQL database. While you’ll hopefully never need any of the strategies above, it’s important to keep them in mind as you move forward with your application.

No matter how strong a programmer you are, things still go wrong, and sometimes they are out of your control. That’s why it’s important to have these plans in place in case you need to restore data. Your data is valuable, so don’t risk it.

Data Protection embraces the cloud

Our category consultant Yannick Arens has just published a major update to the data protection category. This is the result of months of work evaluating the current landscape of data protection products and their associated feature sets. From this assessment, Yannick has created a new set of scoring criteria, ensuring that it captures the latest features available in the market. All of the products in the category have then been evaluated against these new criteria, including their integration with cloud capabilities.

Yannick has posted his own blog covering the updates, which you can read here.

New to the category – Altaro

In addition to the new evaluation criteria, we are delighted to announce that Altaro is now included in the comparison. Altaro are recognized for their ease of use and cost-effective backup solutions, which have particular appeal in the enterprise mid market (1,000 users and under). Altaro focus on providing all of the key enterprise-class capabilities required to back up virtualized environments (Hyper-V and VMware being the most prominent). Head over to the category to check them out.

Data Protection – Cloud capabilities evaluated

Historically, backup features have focused on the ability to support virtualized environments. This continues to be a key area, and we continue to add more evaluation features for storage integration, physical servers, and database support. However, with the steady increase in cloud adoption, Yannick felt it appropriate to evaluate each product’s ability to support cloud platforms. Support for Azure, Amazon, Google, and VMware on AWS as a backup source has been scored. The ability to utilize cloud repositories like Azure Blob and Amazon S3 has also been included. Yannick has even included Software-as-a-Service features!

Head over to the comparison to check out how products stack up in this category.

Welcoming new contributors – Martijn Moret


Finally, we would like to take the opportunity to welcome a new contributor to the Data Protection team: Martijn Moret. With a strong background in the storage and data protection space, Martijn is working to expand the range of evaluated products even further. First up will be Commvault, so stay tuned for future updates.

Evaluating the “Data Protection” software market in 2019

In celebration of “World Backup Day”, we are pleased to announce that a major new update is coming to the WhatMatrix “Data Protection” comparison. This comparison includes a diverse range of backup tools with one shared goal: protecting your data.

Our new contributing consultants, Yannick Arens and Martijn Moret, have been working to evaluate the latest offerings in the market. This update for 2019 will include recent enhancements to products from well-established companies like Veeam and Cohesity, as well as those from newer companies like Rubrik and Vembu.

Along with looking at the newest range of products in the market, the evaluation criteria for ranking these offerings have been updated to reflect the current capabilities in the industry. For example, with more and more workloads moving to the cloud, data protection solutions are increasingly focusing on cloud integration.

We hope to publish this new release in the coming weeks. We will also release a DP Landscape Analysis 2019 report later this year, covering a detailed analysis of the leading solutions in the industry by use case.

Are you a backup vendor not yet listed on WhatMatrix? Get in touch to get included in the report.

(Listings are free, but on-boarding onto our platform is subject to availability.)

Keep up to date with all WhatMatrix releases by following us on Twitter: @what_matrix.

What Backup solution? Vembu BDR Suite added to WhatMatrix Backup comparison


Today we announce the addition of Vembu to the ‘What Backup’ comparison, allowing you to explore and compare Vembu’s BDR Suite with other backup solutions.
For those wondering whether Vembu is a new player in the Backup & DR market – they’ve actually been in the business for over 12 years (although operating under a different business model).


So what sets Vembu apart? Vembu’s declared strategy and goal is to provide ‘enterprise-like’ features to small and medium businesses at an attractive price point – a simple but powerful proposition.

Vembu’s flagship offering, Vembu BDR Suite, can provide Backup & DR to customers operating sophisticated data centres as well as businesses that cannot afford one, catering for diverse technical environments including physical, virtual, applications and endpoints.


Backup & DR for Virtualized environments:

Vembu VMBackup (part of Vembu BDR Suite) provides agentless backups for VMware vSphere and Microsoft Hyper-V environments with an RTO and RPO of less than 15 minutes. With its focus on increasing business availability in various environments, VMBackup provides multiple useful recovery options – Quick VM recovery, Entire VM recovery and Instant File-level recovery – alongside its VMware vSphere and Microsoft Hyper-V backup mechanisms.
Additionally, VembuHive, a ‘file system of file systems’, offers efficient backup storage.

At its core, VMBackup has been designed with a simple UI, making it an easy-to-use and affordable product. For customers with off-site backup requirements, VMBackup also provides the option of sending the backup data to a secondary datacenter or the Vembu Cloud for data redundancy and DR.

Backup & DR for Physical Servers:

Vembu ImageBackup (part of Vembu BDR Suite) backs up the entire disk image of Windows servers, desktops and laptops, including operating system, applications and files – providing an effective backup and DR approach for Windows IT environments. Bare-metal recovery enables the backed-up Windows machines to be recovered onto the same or even different hardware. It can also initiate instant recovery of a complete Windows IT environment – servers, desktops and laptops – with built-in P2V support (interestingly, Vembu includes full image backup of Windows desktops and laptops at no additional cost in the BDR Suite!).

File, Application, Endpoints, SaaS applications backup, cloud storage & much more:

  • Vembu NetworkBackup (part of Vembu BDR Suite), also designed for small and medium businesses, protects critical data across file servers, application servers, workstations and other endpoints in Windows and Linux environments. Vembu offers NetworkBackup at no additional cost for endpoints, including Windows desktops, laptops and Macs.
  • Vembu OnlineBackup (part of Vembu BDR Suite) provides File Server, Exchange, SQL, SharePoint and Outlook backups directly to Vembu’s secure cloud, using enterprise-grade AES 256-bit encryption, with granular restores.
  • Vembu SaaSBackup is designed to back up the mails, drives, calendars and contacts of your Office 365 and Google Apps accounts, and provides immediate recovery with a simple and effective UI.

‘All in one’ Backup & DR solution

With its focus on providing enterprise features to the masses at an affordable price point, we consider Vembu “one to watch” in the backup and DR space. In our view, Vembu’s BDR Suite provides flexible backup and DR options to customers with diverse environments and – critically – varying budgets. Especially if you are operating a Hyper-V or VMware environment, we would encourage you to evaluate Vembu against your requirements… but why don’t you have a look yourself in our draft comparison HERE.


Enjoy the comparison!
Your WhatMatrix Community

 

Clustering Algorithms: From Start To State Of The Art

It’s not a bad time to be a Data Scientist. Serious people may find interest in you if you turn the conversation towards “Big Data”, and the rest of the party crowd will be intrigued when you mention “Artificial Intelligence” and “Machine Learning”. Even Google thinks you’re not bad, and that you’re getting even better. There are a lot of ‘smart’ algorithms that help data scientists do their wizardry. It may all seem complicated, but if we understand and organize algorithms a bit, it’s not even that hard to find and apply the one that we need.

Courses on data mining or machine learning will usually start with clustering, because it is both simple and useful. It is an important part of the somewhat wider area of Unsupervised Learning, where the data we want to describe is not labeled – that is, the user has not told us much about the expected output. The algorithm only has the data, and it should do the best it can. In our case, it should perform clustering: separating data into groups (clusters) that contain similar data points, while the dissimilarity between groups is as high as possible. Data points can represent anything, such as our clients. Clustering can be useful if, for example, we want to group similar users and then run a different marketing campaign on each cluster.

K-Means Clustering

After the necessary introduction, Data Mining courses always continue with K-Means: an effective, widely used, all-around clustering algorithm. Before actually running it, we have to define a distance function between data points (for example, Euclidean distance if we want to cluster points in space), and we have to set the number of clusters we want (k). The algorithm begins by selecting k points as starting centroids (‘centers’ of clusters). We can just select any k random points, or we can use some other approach, but picking random points is a good start. Then, we iteratively repeat two steps:

  1. Assignment step: each of m points from our dataset is assigned to a cluster that is represented by the closest of the k centroids. For each point, we calculate distances to each centroid, and simply pick the least distant one.
  2. Update step: for each cluster, a new centroid is calculated as the mean of all points in the cluster. From the previous step, we have a set of points which are assigned to a cluster. Now, for each such set, we calculate a mean that we declare a new centroid of the cluster.

After each iteration, the centroids slowly move, and the total distance from each point to its assigned centroid gets lower and lower. The two steps are alternated until convergence, meaning until there are no more changes in cluster assignment. After a number of iterations, the same set of points will be assigned to each centroid, therefore leading to the same centroids again. K-Means is guaranteed to converge to a local optimum. However, that does not necessarily have to be the best overall solution (global optimum).

The final clustering result can depend on the selection of initial centroids, so a lot of thought has been given to this problem. One simple solution is just to run K-Means a couple of times with random initial assignments. We can then select the best result by taking the one with the minimal sum of distances from each point to its cluster – the error value that we are trying to minimize in the first place. Other approaches to selecting initial points can rely on selecting distant points. This can lead to better results, but we may have a problem with outliers – those rare lone points that are just “off”, and may simply be errors. Since they are far from any meaningful cluster, each such point may end up being its own ‘cluster’.

A good balance is the K-Means++ variant [Arthur and Vassilvitskii, 2007], whose initialization will still pick random points, but with probability proportional to the squared distance from the previously assigned centroids. Points that are further away will have a higher probability of being selected as starting centroids. Consequently, if there’s a group of points, the probability that a point from the group will be selected also gets higher as their probabilities add up, resolving the outlier problem we mentioned. K-Means++ is also the default initialization for Python’s Scikit-learn K-Means implementation. If you’re using Python, this may be your library of choice. For Java, the Weka library may be a good start:

Java (Weka)

import weka.clusterers.SimpleKMeans;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Load some data
Instances data = DataSource.read("data.arff");

// Create the model
SimpleKMeans kMeans = new SimpleKMeans();

// We want three clusters
kMeans.setNumClusters(3);

// Run K-Means
kMeans.buildClusterer(data);

// Print the centroids
Instances centroids = kMeans.getClusterCentroids();
for (Instance centroid: centroids) {
  System.out.println(centroid);
}

// Print cluster membership for each instance
for (Instance point: data) {
  System.out.println(point + " is in cluster " + kMeans.clusterInstance(point));
}

Python (Scikit-learn)

>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target

>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(copy_x=True, init='k-means++', ...
>>> print(k_means.labels_[::10])
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print(y_iris[::10])
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

In the Python example above, we used the standard example dataset ‘Iris’, which contains flower petal and sepal dimensions for three different species of iris. We clustered these into three clusters, and compared the obtained clusters to the actual species (target), to see that they match perfectly.

In this case, we knew that there were three different clusters (species), and K-Means recognized correctly which ones go together. But how do we choose a good number of clusters k in general? These kinds of questions are quite common in Machine Learning. If we request more clusters, they will be smaller, and therefore the total error (total of distances from points to their assigned clusters) will be smaller. So, is it a good idea just to set a bigger k? We may end up with k = m, that is, each point being its own centroid, with each cluster having only one point. Yes, the total error is 0, but we didn’t get a simpler description of our data, nor is it general enough to cover new points that may appear. This is called overfitting, and we don’t want that.

A way to deal with this problem is to include a penalty for a larger number of clusters. So, we are now trying to minimize not only the error, but error + penalty. The error will just converge towards zero as we increase the number of clusters, but the penalty will grow. At some point, the gain from adding another cluster will be less than the introduced penalty, and we’ll have the optimal result. A solution that uses the Bayesian Information Criterion (BIC) for this purpose is called X-Means [Pelleg and Moore, 2000].

Another thing we have to define properly is the distance function. Sometimes that’s a straightforward task, a logical one given the nature of the data. For points in space, Euclidean distance is an obvious solution, but it may be tricky for features of different ‘units’, for discrete variables, etc. This may require a lot of domain knowledge – or we can call Machine Learning for help. We can actually try to learn the distance function: if we have a training set of points that we know how to group (i.e. points labeled with their clusters), we can use supervised learning techniques to find a good function, and then apply it to the target set that is not yet clustered.

The method used in K-Means, with its two alternating steps, resembles an Expectation–Maximization (EM) method. Actually, it can be considered a very simple version of EM. However, it should not be confused with the more elaborate EM clustering algorithm, even though it shares some of the same principles.
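To make the error-versus-k trade-off concrete, here is a minimal sketch (assuming scikit-learn is installed; the parameter values are illustrative) that prints the total error for increasing k on the Iris data. It decreases monotonically, which is exactly why a penalty such as BIC is needed to choose k:

from sklearn import cluster, datasets

X_iris = datasets.load_iris().data

# The total error (inertia: sum of squared distances from points to their
# assigned centroids) keeps shrinking as k grows, so error alone can't pick k.
for k in range(1, 7):
    k_means = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_iris)
    print(k, k_means.inertia_)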

EM Clustering

So, with K-Means clustering each point is assigned to just a single cluster, and a cluster is described only by its centroid. This is not too flexible, as we may have problems with clusters that are overlapping, or ones that are not of circular shape. With EM Clustering, we can go a step further and describe each cluster by its centroid (mean), covariance (so that we can have elliptical clusters), and weight (the size of the cluster). The probability that a point belongs to a cluster is now given by a multivariate Gaussian probability distribution (multivariate – depending on multiple variables). That also means that we can calculate the probability of a point being under a Gaussian ‘bell’, i.e. the probability of a point belonging to a cluster.

We start the EM procedure by calculating, for each point, the probabilities of it belonging to each of the current clusters (which, again, may be randomly created at the beginning). This is the E-step. If one cluster is a really good candidate for a point, it will have a probability close to one. However, two or more clusters can be acceptable candidates, so the point has a distribution of probabilities over the clusters. This property of the algorithm – points not being restricted to one cluster – is called “soft clustering”.

The M-step then recalculates the parameters of each cluster, using the assignments of points to the previous set of clusters. To calculate the new mean, covariance and weight of a cluster, each point’s data is weighted by its probability of belonging to the cluster, as calculated in the previous step.

Alternating these two steps will increase the total log-likelihood until it converges. Again, the maximum may be local, so we can run the algorithm several times to get better clusters. If we then want to determine a single cluster for each point, we may simply choose the most probable one. Having a probability model, we can also use it to generate similar data – that is, to sample more points that are similar to the data that we observed.
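As a brief illustration of soft clustering, here is a sketch using scikit-learn’s GaussianMixture (one common EM implementation; the parameter values are illustrative), which describes each cluster by a mean, a full covariance matrix and a weight:

from sklearn import datasets
from sklearn.mixture import GaussianMixture

X_iris = datasets.load_iris().data

# Three elliptical clusters, each with its own mean, covariance and weight
gmm = GaussianMixture(n_components=3, covariance_type='full',
                      n_init=5, random_state=0).fit(X_iris)

print(gmm.weights_)                    # cluster weights (sizes)
print(gmm.predict_proba(X_iris)[:5])   # soft assignments: probabilities per cluster
print(gmm.predict(X_iris)[:10])        # hard assignment: most probable cluster

# Since this is a generative model, we can also sample new, similar points
new_points, _ = gmm.sample(5)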

Affinity Propagation

Affinity Propagation (AP) was published by Frey and Dueck in 2007, and is only getting more and more popular due to its simplicity, general applicability, and performance. It is changing its status from state of the art to de facto standard. The main drawbacks of K-Means and similar algorithms are having to select the number of clusters, and choosing the initial set of points. Affinity Propagation, instead, takes as input measures of similarity between pairs of data points, and simultaneously considers all data points as potential exemplars. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. As an input, the algorithm requires us to provide two sets of data:

  1. Similarities between data points, representing how well-suited a point is to be another one’s exemplar. If there’s no similarity between two points – meaning they cannot belong to the same cluster – this similarity can be omitted or set to -Infinity, depending on the implementation.
  2. Preferences, representing each data point’s suitability to be an exemplar. We may have some a priori information which points could be favored for this role, and so we can represent it through preferences.

Both similarities and preferences are often represented through a single matrix, where the values on the main diagonal represent the preferences. Matrix representation is good for dense datasets; where connections between points are sparse, it is more practical not to store the whole n x n matrix in memory, but instead to keep a list of similarities to connected points. Behind the scenes, ‘exchanging messages between points’ is the same thing as manipulating matrices – it’s only a matter of perspective and implementation. The algorithm then runs through a number of iterations until it converges. Each iteration has two message-passing steps:

  1. Calculating responsibilities: Responsibility r(i, k) reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i. Responsibility is sent from data point i to candidate exemplar point k.
  2. Calculating availabilities: Availability a(i, k) reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar. Availability is sent from candidate exemplar point k to point i.

In order to calculate responsibilities, the algorithm uses the original similarities and the availabilities calculated in the previous iteration (initially, all availabilities are set to zero). Responsibilities are set to the input similarity between point i and point k as its exemplar, minus the largest of the similarity-plus-availability sums between point i and the other candidate exemplars. The logic behind calculating how suitable a point is as an exemplar is that it is favored more if the initial a priori preference was higher, but the responsibility gets lower when there is a similar point that considers itself a good candidate, so there is a ‘competition’ between the two until one is decided in some iteration.

Calculating availabilities, then, uses the calculated responsibilities as evidence of whether each candidate would make a good exemplar. Availability a(i, k) is set to the self-responsibility r(k, k) plus the sum of the positive responsibilities that candidate exemplar k receives from other points.

Finally, we can use different stopping criteria to terminate the procedure, such as when changes in values fall below some threshold, or when the maximum number of iterations is reached. At any point during the Affinity Propagation procedure, summing the Responsibility (r) and Availability (a) matrices gives us the clustering information we need: for point i, the k with maximum r(i, k) + a(i, k) represents point i’s exemplar. Or, if we just need the set of exemplars, we can scan the main diagonal: if r(i, i) + a(i, i) > 0, point i is an exemplar.

We’ve seen that with K-Means and similar algorithms, deciding the number of clusters can be tricky. With AP, we don’t have to specify it explicitly, but it may still need some tuning if we obtain either more or fewer clusters than we find optimal. Luckily, just by adjusting the preferences we can lower or raise the number of clusters. Setting preferences to a higher value will lead to more clusters, as each point is ‘more certain’ of its suitability to be an exemplar and is therefore harder to ‘beat’ and include under some other point’s ‘domination’. Conversely, setting lower preferences will result in fewer clusters, as if points are saying “no, no, please, you’re a better exemplar, I’ll join your cluster”. As a general rule, we may set all preferences to the median similarity for a medium to large number of clusters, or to the lowest similarity for a moderate number of clusters; however, a couple of runs with adjusted preferences may be needed to get the result that exactly suits our needs.

Hierarchical Affinity Propagation is also worth mentioning, as a variant of the algorithm that deals with the quadratic complexity by splitting the dataset into a couple of subsets, clustering them separately, and then performing a second level of clustering.
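For completeness, here is a minimal sketch with scikit-learn’s AffinityPropagation (assuming a recent scikit-learn; the preference value is an illustrative guess that would need tuning per dataset). By default the implementation uses negative squared Euclidean distance as the similarity, so preferences are negative numbers, and lowering them yields fewer clusters:

from sklearn import cluster, datasets

X_iris = datasets.load_iris().data

# Lower (more negative) preference -> fewer exemplars/clusters
ap = cluster.AffinityPropagation(preference=-50, random_state=0).fit(X_iris)

print(len(ap.cluster_centers_indices_))  # number of exemplars found
print(ap.labels_[:10])                   # cluster membership per point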

In The End…

There’s a whole range of clustering algorithms, each with its pros and cons regarding the type of data it works with, time complexity, weaknesses, and so on. To mention a few more: there’s Hierarchical Agglomerative Clustering (or Linkage Clustering), which is good when we don’t necessarily have circular (or hyper-spherical) clusters, and don’t know the number of clusters in advance. It starts with each point being a separate cluster, and works by joining the two closest clusters in each step until everything is in one big cluster. With Hierarchical Agglomerative Clustering, we can easily decide the number of clusters afterwards by cutting the dendrogram (tree diagram) horizontally where we find suitable. It is also repeatable (it always gives the same answer for the same dataset), but is of a higher (quadratic) complexity.

Then, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is also an algorithm worth mentioning. It groups points that are closely packed together, expanding clusters in any direction where there are nearby points, and thus dealing with different shapes of clusters. These algorithms deserve an article of their own, and we can explore them in a separate post later on.

It takes experience, with some trial and error, to know when to use one algorithm or another. Luckily, we have a range of implementations in different programming languages, so trying them out only requires a little willingness to play.

New – ‘Print Report’ now available! Export your Comparison Results

Many of you asked for the capability to print or export a report for the performed comparison. The report should also include details on any improvements that have been applied using our “Stack Builder”.

Good news – you can now create and print reports directly from any of our comparison pages!
Use them as a deliverable for your consultancy engagement, an RFI attachment or a research document.

And did we mention all of this is free …? 

The top section of the report allows you to customize header information like company, contact name and project references (not shown in picture).
The summary shows scores for ‘base products’ as well as scores achieved by the complementary Add-Ons the user applied using our ‘Stack Builder’.


Detailed matrix scores ‘by category’ allow you to quickly identify areas of strength or weakness (i.e. a product might have the highest overall score but a particular weakness in, e.g., its ‘management’ capabilities).


The report also contains insight into the improvements provided by our complementary “Add-On” products.

What are Add-Ons?
Our consultants constantly try to identify products that can address typical limitations in their solution stacks. They then gather detailed technical information and integrate those products into interactive Add-On widgets.
The site visitor can quickly identify limitations using the color coding in the matrix and apply suitable Add-On products using our Stack Builder to limit or mitigate those limitations.

The report helps you visualize the level of improvement provided by the various Add-Ons. It presents you with details on which features have been improved by which Add-On.


Additional charts even show you details on HOW individual features have been improved, e.g. from not supported (red) to fully supported (green).


The report also allows you to append the entire matrix table for reference. See a report sample here: WhatMatrix Comparison Report Sample

We hope you enjoy the new capability!
Your WhatMatrix Community

 

Our comparison pages allow users to evaluate Enterprise IT products in various (expanding) categories – ranging from Virtualization technologies to Cloud Storage Gateways, Backup Products and Hyper-converged / Software Defined Storage.
