Data organisation is a broad term covering how data relates, its technical organisation, and the process of creating insights and valuable products from raw data. For the broader discipline, see [[Data Management]].


==The Nature of Data==
The term “data” is used here in the broadest possible sense. Data are mere basic observations, which are real, not assumed or hypothesized. Data should neither be underrated nor overrated in their importance. They should certainly never be confused with the concept of absolute truth. Data can be thought of as the foundation of an epistemological system that includes information, knowledge, truth, and wisdom. This section discusses some of the common classifications of types of data and their treatment; for a more thorough treatment the reader should consult textbooks and other scholarly works on epistemology and statistics.

The word “data” is plural; the singular form is “datum,” but the terms “point data,” “datapoint,” and “observation” are often used with the same meaning, e.g., a single number or observation. A “dataset” is a grouping of related data. “Raw data” are the initial form of the data, as collected, before any statistical analysis, enumeration, etc. is done. An “observation period” is the time during which data are collected. Collected data are considered to be “samples” of a larger underlying “population” of similar data. “Sample size” is the number of samples in a particular dataset.

* See [[Data Patterns]]
===Quantitative Data===
Data that are expressed on an ordered or numeric scale. (Also called numeric data.)

* Rank Data
**The simplest type of quantitative data, needing only to be capable of being sequentially ordered, i.e., each datum is equal to, greater than, better than, or more effective than the others in the set; for convenience, such data may be numerically ranked starting with a value of 1 at one end of the scale. The area of mathematics known as non-parametric statistics is used to describe, summarize, compare, and analyze such data.
* Measurement Data
**Data collected using a measuring instrument (e.g., a thermometer, force meter, stopwatch). Such data are often immediately transduced and digitized for computer analysis.
* Frequency Data
**Data obtained by counting the number of times an event occurs. If the total number of observed events is known, frequency data may be expressed as a proportion or percentage. If the duration of the observation period is known, such data may be expressed as a rate (e.g., events per hour).
* Duration Data
**Data obtained by measuring the amount of time during which a process occurs. If the duration of the observation period is known, such data may be expressed as a proportion or percentage.
* Latency Data
**The time from the designated start of an observation period to the occurrence of a particular event.
* Interval Data
**The time between two events within an observation period.
* Dose Response Data
**Data that correlate “dose” to “response.” The dose may be any appropriate measure of the amount of non-lethal technology applied (e.g., joules of energy, watts of electricity, density of a chemical), and the response may be any conceivable, quantifiable effect of that technology, including intended and unintended effects. Such data are often illustrated in “dose-response curves,” in which dose (low to high) is plotted on the x-axis and response (low to high) is plotted on the y-axis. As the points in such curves usually represent measures of central tendency (e.g., means) of a set of like data, dose-response curves often display an indication of variability or confidence as well. The median of such a curve, at which 50% of the datapoints fall above and 50% fall below, is often used as a shorthand summary of an effective dose (called “Effective Dose 50,” or ED50); a worked sketch follows this list. Such summary measures may then be compared for intended versus unintended effects to compute a margin of safety or safety factor. While well-established dose-response curves for multiple effects of the same technology are considered ideal, such data currently exist in only a few cases.
* Threshold Data
**Threshold data are a measure of the minimum dose required to produce some level of an effect in an individual or a population. There are several mathematical approaches to estimating a threshold, but the most complete is the use of dose-response data, as described above. Threshold data are usually accompanied by a measure of variability or uncertainty with respect to the population response that is being estimated. The concept of threshold is usually tied to the percentage of a population that displays a particular effect at the threshold dose. For occupational health and safety standards, the threshold dose for an undesired effect may be based on a very small percentage of the population being affected; for such standards, the permissible limit is often set much lower than the threshold dose, in order to provide an additional safety factor.
* Binary Data (or Digital Data)
**Binary data are quantitative data represented by a series of the digits “1” and “0,” called bits. Ultimately, most quantitative data are converted to binary when processed by a computer. The number of bits used to represent a quantitative datum determines the possible resolution (bit depth) of the measurements; for example, if 8 bits are used, the maximum resolution of the measurement is 2<sup>8</sup>, or 256 different levels.
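As a worked illustration of the dose-response and threshold concepts above, the following minimal Python sketch estimates an ED50 by linear interpolation between the two observed doses that bracket the 50% response level. The dose and response figures are invented purely for illustration.

<syntaxhighlight lang="python">
# Hypothetical dose-response observations (not real data).
doses = [1, 2, 4, 8, 16]            # dose levels, e.g. joules of energy
responses = [5, 20, 45, 70, 95]     # percent of the sample responding at each dose

def ed50(doses, responses, target=50.0):
    """Estimate the dose at which the response crosses `target` percent,
    by linear interpolation between the two bracketing observations."""
    points = list(zip(doses, responses))
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if r0 <= target <= r1:
            return d0 + (target - r0) * (d1 - d0) / (r1 - r0)
    raise ValueError("response never crosses the target level")

print(ed50(doses, responses))       # ~4.8 for these made-up numbers
</syntaxhighlight>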
===Qualitative Data===
Qualitative data include descriptions and classifications of events without reference to any calibrated scale. They are often subjective and/or opinionated. Numbers are arbitrary identifiers of qualitative data and might as well be substituted with words or other symbols. For example, the observation by a battle participant that a particular non-lethal weapon was effective in a particular military operation is an example of qualitative data. There are procedures for maximizing information from qualitative data. Qualitative data are often valued for their usefulness in developing hypotheses that can be tested using experimental, quantitative methods.
===Data Quality===
All data are not collected equal. Particular data may be described with various qualifying terms, such as good, bad, reliable, biased, etc. The principles of data quality may be derived from common usage. Good data are not data that agree with the experimenter’s cherished theory. They are merely data that have been collected in a reliable, reproducible manner, with opportunities for error and bias minimized as much as possible. Any data can be biased or outright falsified. However, as the quality of the data increases, the potential for falsification and bias becomes easier to detect, control, and avoid. Hearsay and other verbal reports, given long after an incident by an observer with a clear vested interest in the use of the data, have the highest potential for bias and inaccuracy. Standardized procedures for data collection (e.g., entry forms, time limits, structured questionnaires, written classification criteria, multiple observers, etc.) can increase the quality of data. Poor measurement devices (e.g., biased, unreliable, or uncalibrated) can result in both increased variability and systematic error. The gold standard for high quality data is that obtained from a well-designed, controlled experiment using a calibrated measuring instrument and skilled observers who have little knowledge of the hypothesis being tested or of which treatments are administered to which groups (often called a “blind” observer). The insistence on such procedures does not impugn the honesty or trustworthiness of the observer; research has shown that experimental bias can occur unconsciously in even the most honourable and well-intentioned observers. Even qualitative data may be judged according to the known accuracy and expertise of the observer. Thus a qualitative observation from an experienced and trusted individual may be more relevant and valuable than any number of datapoints or amount of statistical analysis. However, data quality should not be equated with data relevance, as rarefied experimental data may or may not provide valid predictions of effects in the real world.
==Transactional Data==
Frequently implemented as [[OLTP]] (Online Transaction Processing) systems. [[OLTP]] systems differ from [[OLAP]] systems.

In older systems and databases, transaction tables record each change of state; logs serve a similar purpose.

In Web 2.0 systems, transactional data is not treated as an overhead; it has informational value in its own right.
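As a minimal sketch of how an OLTP-style system records change transactionally, the example below uses Python's built-in sqlite3 module; the account schema and figures are illustrative only.

<syntaxhighlight lang="python">
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE transfers (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "account_id INTEGER, amount REAL, ts TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# A transaction: the change either commits as a whole or not at all.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    # The transactional record itself carries informational value.
    conn.execute("INSERT INTO transfers (account_id, amount, ts) "
                 "VALUES (1, -30, datetime('now')), (2, 30, datetime('now'))")

print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# [(1, 70.0), (2, 80.0)]
</syntaxhighlight>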
==Log Data==
Log data typically originates from [[OLTP]]-based transactions. In today's context, log data within a system is no longer an overhead, but is actively used as analytical data. Log data is, in essence, similar to time series data. Consider [[Log Data Architecture]].
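A minimal sketch of treating log records as analytical, time-series-like data: bucket log events per minute. The log format and field names are illustrative.

<syntaxhighlight lang="python">
from collections import Counter
from datetime import datetime

log_lines = [
    "2014-08-17 19:01:12 INFO user=42 action=login",
    "2014-08-17 19:01:45 WARN user=42 action=retry",
    "2014-08-17 19:02:03 INFO user=7 action=login",
]

events_per_minute = Counter()
for line in log_lines:
    ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    events_per_minute[ts.replace(second=0)] += 1   # bucket by minute

for minute, count in sorted(events_per_minute.items()):
    print(minute.isoformat(), count)               # a simple event-rate time series
</syntaxhighlight>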
  
 
==Time Series Data==
Data that change frequently. Such data include:
* Financial Data
* Media Data
* Hydrological Data
* Scientific Data

When we start thinking in time series, we are usually doing analytics, so [[OLAP]] concepts apply to time series management. [[Time Series Data Architecture | Design factors]] for time series data tables should be considered.
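As a minimal sketch of one such design factor, the example below stores time series as (timestamp, series, value) rows and rolls them up to a coarser, OLAP-style daily grain. Series names and values are invented for illustration.

<syntaxhighlight lang="python">
from collections import defaultdict
from datetime import datetime
from statistics import mean

rows = [  # (timestamp, series, value), e.g. hydrological gauge readings
    (datetime(2014, 8, 17, 9),  "river_level_m", 2.10),
    (datetime(2014, 8, 17, 10), "river_level_m", 2.15),
    (datetime(2014, 8, 18, 9),  "river_level_m", 2.40),
]

daily = defaultdict(list)
for ts, series, value in rows:
    daily[(ts.date(), series)].append(value)       # roll up to daily grain

for (day, series), values in sorted(daily.items()):
    print(day, series, round(mean(values), 3))
</syntaxhighlight>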
  
==Static Data==
 
Very slowly changing data. Typically organised as [[Data Dictionaries]].
 
  
==Master Data==
In older systems, this is the [[OLTP]] dataset that depicts the current state of the system.
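A minimal sketch contrasting master data with transactional data: the master record keeps only the current state of each entity, with later updates overwriting earlier values. Entity and field names are illustrative.

<syntaxhighlight lang="python">
customers = {}   # master data: one current-state record per entity id

def apply_update(customer_id, **fields):
    """Upsert: merge new field values into the entity's current state."""
    customers.setdefault(customer_id, {}).update(fields)

apply_update(42, name="Acme Pty Ltd", city="Sydney")
apply_update(42, city="Melbourne")   # a later change overwrites the old value

print(customers)   # {42: {'name': 'Acme Pty Ltd', 'city': 'Melbourne'}}
</syntaxhighlight>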
  
==Reference Data==
Typically very slowly changing data, organised as [[Data Dictionary | Data Dictionaries]], [[Code Tables]], or [[Master Tables]]. Sometimes known as [[Static Data]], although "static" implies that the data never changes at all.

In Business Intelligence systems, these may be called [[Dimensional Data]].
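A minimal sketch of a reference (code) table: transactional rows store only a stable code, which is resolved against the slowly changing reference data. The codes below are illustrative.

<syntaxhighlight lang="python">
# Reference (code) table: slowly changing, maintained separately.
country_codes = {"AU": "Australia", "NZ": "New Zealand", "SG": "Singapore"}

# Transactional rows carry only the code.
orders = [("order-1", "AU"), ("order-2", "SG")]

for order_id, code in orders:
    print(order_id, country_codes.get(code, "UNKNOWN"))   # resolve via the reference table
</syntaxhighlight>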
  
==Data Ontology==
Data Ontology is the science of identifying the context of data and its relationships to other entities. It includes:
* [[Taxonomy]]
* [[Semantics]]
* [[Metadata]]
  
==Data Ontology & Provenance==
+
===Metadata===
Also consider metadata. Each of these topics are different, and should be looked at independently.
+
A common term used in describing data. [[Metadata]] is part of [[Data Ontology]].
 +
 
Consider:
* http://www.slideshare.net/davidlamas/metadata-and-ontologies
* http://arxiv4.library.cornell.edu/pdf/1005.2643v1
* http://subs.emis.de/LNI/Proceedings/Proceedings103/gi-proc-103-014.pdf
  
==Data Provenance==
[[Data Provenance]] records when and how a particular piece of data was derived.
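A minimal sketch of recording provenance alongside a derived value: when it was produced, how, and from which inputs. The field names are illustrative and do not follow any particular provenance standard.

<syntaxhighlight lang="python">
import hashlib
import json
from datetime import datetime, timezone

raw_readings = [2.10, 2.15, 2.40]                       # hypothetical raw data
derived = {"daily_mean": sum(raw_readings) / len(raw_readings)}

provenance = {
    "derived_at": datetime.now(timezone.utc).isoformat(),
    "derivation": "daily_mean = arithmetic mean of raw_readings",
    "input_sha256": hashlib.sha256(json.dumps(raw_readings).encode()).hexdigest(),
}

print(json.dumps({"data": derived, "provenance": provenance}, indent=2))
</syntaxhighlight>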
==Data Architecture==
Formal data design methods are discussed in [[Data Architecture]]. The concepts of data warehousing and data marts are fast becoming outdated, because data analytics is increasingly built into applications by default.
  
There are some relevant standards:
* WFS - ISO 19142 (see the request sketch after this list)
* WMS - ISO 19128
* GML - ISO 19136
* Observations and Measurements - ISO 19156
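As a minimal sketch of how a WFS (ISO 19142) service is queried, the example below builds a standard GetCapabilities request; the endpoint URL is a placeholder, not a real service.

<syntaxhighlight lang="python">
from urllib.parse import urlencode
from urllib.request import urlopen

endpoint = "https://example.org/geoserver/wfs"   # placeholder WFS endpoint
params = {"service": "WFS", "version": "2.0.0", "request": "GetCapabilities"}

url = endpoint + "?" + urlencode(params)
print(url)
# Against a real endpoint, the capabilities document (XML) could be fetched with:
# with urlopen(url) as resp:
#     capabilities_xml = resp.read()
</syntaxhighlight>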
  
==Data Versioning==
Data is not that different from code. There is no reason why datasets should not be versioned.
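A minimal sketch of versioning a dataset the way code is versioned, by deriving an immutable identifier from its content; the naming scheme is illustrative.

<syntaxhighlight lang="python">
import hashlib
import json

dataset = {
    "name": "river_levels",
    "rows": [["2014-08-17", 2.10], ["2014-08-18", 2.40]],
}

payload = json.dumps(dataset, sort_keys=True).encode()
version_id = hashlib.sha256(payload).hexdigest()[:12]   # content-addressed version

with open("river_levels.%s.json" % version_id, "wb") as f:
    f.write(payload)                                     # immutable, versioned artifact

print("dataset version:", version_id)
</syntaxhighlight>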
  
 
[[Category:Data]]
[[Category:Data Organisation]]
