Wednesday, February 19, 2014

Data quality @home

I recently began going through my own photo library. To give you an idea, it contains 16,500 photos, from the year 2000 to today. There were several ways to sort it out, but the one that made sense to me was to:

  1. Ensure proper dating of photos

  2. Associate keywords with them

  3. Localize (or “geo-tag”) them

  4. Identify the people who appear in them

In other words, treating each photo as a piece of "data", I defined several properties (or "metadata") to index it and retrieve it afterward. And this is where echoes of my current work on data quality arise.

I propose to look at each of these properties through a data quality categorization.

Completeness

I will consider a picture as totally indexed if:

  • It is properly dated (with the date corrected if needed)

  • It has at least one associated keyword

  • It is geo-tagged

  • All the people in it who belong to my first circle of family and friends are identified

A first level of completeness is reached when the first three criteria are met. Let’s call it C1. If I add the fourth criterion, I obtain full completeness, which I call C2.
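The C1 and C2 levels can be written down as simple checks. What follows is only a minimal sketch in Python: the `Photo` record and its field names are hypothetical and do not come from any particular photo tool, and the count of untagged first-circle faces is assumed to be available.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Photo:
    # Hypothetical record: the field names are illustrative assumptions.
    taken_at: Optional[datetime] = None        # corrected shooting date, if known
    keywords: list = field(default_factory=list)
    location: Optional[str] = None             # geo-tag, e.g. a city name
    unidentified_first_circle: int = 0         # first-circle faces still untagged


def is_c1(photo: Photo) -> bool:
    """C1: dated, at least one keyword, geo-tagged."""
    return (
        photo.taken_at is not None
        and len(photo.keywords) > 0
        and photo.location is not None
    )


def is_c2(photo: Photo) -> bool:
    """C2: C1 plus every first-circle person identified."""
    return is_c1(photo) and photo.unidentified_first_circle == 0


def completeness_rate(photos, predicate) -> float:
    """Share of photos satisfying the predicate (0.0 for an empty library)."""
    photos = list(photos)
    if not photos:
        return 0.0
    return sum(1 for p in photos if predicate(p)) / len(photos)
```

Calling `completeness_rate(library, is_c1)` and `completeness_rate(library, is_c2)` on the same list would give the two rates side by side.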

Accuracy

Dating comes for free when you shoot with a digital camera. Unfortunately, the camera clock is sometimes wrong (in particular after a battery change, or because of jet lag). You may then enter a real nightmare... Believe me. So the typical question is: what level of dating accuracy makes sense to me? To retrieve photos, I would say that an accuracy of a day or so should be sufficient. Hours can also be useful, but mostly to display the photos of a given day in order.
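To illustrate the jet lag case, here is a small sketch in Python. It assumes the camera clock was simply off by a constant amount during a trip; the 8-hour offset (a camera left on Paris winter time while shooting in Japan) is an illustrative assumption, and the function names are made up for the example.

```python
from datetime import datetime, timedelta


def correct_timestamp(recorded: datetime, clock_offset: timedelta) -> datetime:
    """Shift a recorded shooting time by a known, constant camera clock error."""
    return recorded + clock_offset


def same_day(a: datetime, b: datetime) -> bool:
    """Day-level accuracy: two timestamps match if they fall on the same date."""
    return a.date() == b.date()


# Assumed example: the camera says 20:30 on January 10, local time was 8 hours ahead.
recorded = datetime(2014, 1, 10, 20, 30)
corrected = correct_timestamp(recorded, timedelta(hours=8))
```

Here the correction moves the photo to the next calendar day, which is exactly the kind of error that matters when the target accuracy is "a day or so".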

Keywords have to be accurate enough to help me differentiate photos. If I use something like "family", the risk is that too many photos match the criterion and I cannot find a specific one. At the other end of the spectrum, overly precise keywords would be useless, as I would not think of them when searching for particular photos. Given that, I opted for a finite set of keywords to which I attach my own semantics. As an example: "vacations" is used for any holiday period of more than 2-3 days or so. "Travel" is added when the trip takes me abroad from France. This is clearly my own semantics, and it only makes sense to me.

Geo-tagging, for its part, is a huge matter. I recently took a trip to Japan. I could geo-tag all my photos as "Japan", but would that be sufficient? In particular, it would not help me retrieve pictures from (and only from) Tokyo. So I would choose city-level accuracy. The only problem with this choice is that for Paris, the city where I live, it would clearly not be enough, as I have so many pictures from Paris. I have not found the perfect level of accuracy for geo-tagging so far, or more exactly, I do not know how to rationalize it. Since I use geo-tagging to retrieve pictures afterwards, I just rely on my own memory to "geo-tag" them, and that should be enough. The only limit is that my wife has a much better memory than I have, and is therefore far more accurate in her picture searches than I am... The good news is that more and more cameras now provide automatic geo-tagging. This is already the case for most smartphones, and they are becoming a common source of pictures. So technology will certainly come to the aid of my poor human memory in the future.

People appearing in photos also call for some decisions about accuracy criteria. Do I want to be able to search for anyone I met once or twice and who appears in only a couple of pictures (yeah... I am sure you have such cases, just look at the pictures from the last wedding you attended)? I certainly wouldn’t. So I decided to restrict my tags to people in my first circle of family and friends. And that already represents a big piece of work.

Uniqueness

It may seem a bit odd to address such a topic in this context, but if you consider all the operations (copy-paste, edits and so on…) that you may perform on your pictures on a daily basis, you will inevitably encounter such issues. Using an automatic tool to detect these duplicates, I found about 150 of them, and even after removing them I still found a few pictures present several times (not counting pictures that exist in both colour and B&W, or with and without a frame around them).
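For readers who want to try this at home, one simple way to spot exact copies is to group files by a hash of their content. This is only a sketch in Python and not the tool I used; the function names are made up for the example. It catches byte-identical files only, so an edited, framed or B&W variant of a picture would not be reported.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's content, read in chunks to keep memory use low."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> list:
    """Return groups of files under root whose content is byte-identical."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only the digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists the paths of one picture stored several times; deciding which copy to keep remains a manual choice.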

Rating

After a first manual data cleansing operation, I obtained estimates of the data quality rates of my photo library.

Regarding completeness, 85% of my pictures are C1-complete so far. Adding the fourth criterion would of course drastically decrease this rate, considering that 11,300 people are detected in my picture library without having been properly identified (9,500 for the C2-completed pictures).

For accuracy, I am still looking for a proper way to measure it, whether for keywords or geo-tags. The next step will be to run some data profiling to help define measurement rules.

Last but not least, regarding uniqueness, I now consider the cleansing done, reaching a rate of 99.99% or so.

Looking at the whole picture, and keeping Pareto's law in mind, it seems that a long way still lies ahead of me. But this is quite typical of a data quality topic, isn’t it?
