Data Wrangling with OpenRefine on Linux

    OpenRefine is an open source desktop application for data wrangling.

    Source:

    OpenRefine on Windows

    • Download the file
    • Unzip and run the executable
    • To stop the web server, on the command line do Ctrl C.

    OpenRefine on Linux

    • Download the tar file. Size is about 100 MB
    • Tar the file. For example: tar xzf openrefine-linux-3.2.tar.gz
    • Open the directory: cd openrefine-3.2
    • Start: ./refine

    (Shut down the webserver with Ctrl C)

    Upon start the terminal shows:

    WARNING: OpenRefine is not tested and not recommended for use with Java versions greater than 12.
    You have 7653M of free memory.
    Your current configuration is set to use 1400M of memory.
    OpenRefine can run better when given more memory. Read our FAQ on how to allocate more memory here:
    https://github.com/OpenRefine/OpenRefine/wiki/FAQ:-Allocate-More-Memory
    Starting OpenRefine at 'http://127.0.0.1:3333/'
    
    10:40:59.487 [refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
    10:40:59.488 [refine_server] refine.memory size: 1400M JVM Max heap: 1468006400 (1ms)
    10:40:59.499 [refine_server] Initializing context: '/' from '/home/tom/Documents/projects/openrefine-3.2/webapp' (11ms)
    10:41:01.007 [refine_server] Creating new workspace directory /home/tom/.local/share/openrefine (1508ms)
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/tom/Documents/projects/openrefine-3.2/server/target/lib/slf4j-log4j12-1.7.18.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/tom/Documents/projects/openrefine-3.2/webapp/WEB-INF/lib/slf4j-log4j12-1.7.18.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    10:41:01.390 [refine] Starting OpenRefine 3.2 [55c921b]... (383ms)
    10:41:01.390 [refine] initializing FileProjectManager with dir (0ms)
    10:41:01.390 [refine] /home/tom/.local/share/openrefine (0ms)
    10:41:01.417 [FileProjectManager] Failed to load workspace from any attempted alternatives. (27ms)
    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by org.python.core.PySystemState (file:/home/tom/Documents/projects/openrefine-3.2/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.1.jar) to method java.io.Console.encoding()
    WARNING: Please consider reporting this to the maintainers of org.python.core.PySystemState
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release
    Fontconfig warning: "/usr/share/fontconfig/conf.avail/05-reset-dirs-sample.conf", line 6: unknown element "reset-dirs"
    Opening in existing browser session.
    10:41:19.486 [refine] POST /command/core/load-language (18069ms)
    10:41:19.578 [refine] GET /command/core/get-preference (92ms)
    10:41:19.624 [refine] POST /command/core/load-language (46ms)
    10:41:19.641 [refine] POST /command/core/load-language (17ms)
    10:41:19.833 [refine] POST /command/core/get-importing-configuration (192ms)
    10:41:19.885 [refine] GET /command/core/get-all-project-tags (52ms)
    10:41:19.975 [refine] GET /command/core/get-all-project-metadata (90ms)
    10:41:20.077 [refine] GET /command/core/get-languages (102ms)
    10:41:20.342 [refine] GET /command/core/get-version (265ms)
    10:41:20.566 [refine] GET /command/database/saved-connection (224ms)
    10:41:20.632 [refine] POST /command/core/set-preference (66ms)
    10:41:20.664 [refine] POST /command/core/set-preference (32ms)
    

    On the browser OpenRefine opens on:

    http://127.0.0.1:3333/
    

    Uploading data to OpenRefine

    Get data from:

    • This computer
    • Web Addresses
    • Clipboard
    • Database
    • Google data

    Getting data from your computer:

    • Use upload
    • This creates a preview with some options.
    • On the top right corner Create Project

    Text Facet to Group data

    Go to a column drop down.

    • Go to Facet
    • Then Text Facet

    On the left side, it groups together identical cells for that column, including the number of cells for each group.

    As seen on this Youtube video. Selecting Text Facet for one of the columns to review all the data related FFP.

    For example, selecting one of the groups that has FFP and a count of 512, filters the data on the right view.

    The next group is also called FFP with a count of 1. Click on edit for that group to see why. It has an empty trailing whitespace at the end of the word FFP_. Remove the blank space and click apply. The groups then will merge. Now the FFP group has 513 rows and a total of 814 groups.

    Remove leading and trailing whitespace

    Go to a column drop down.

    • Go to Edit cells
    • Then Common transforms
    • Then Trim leading and trailing whitespace

    Following the same video. After removing leading and trailing whitespace. The group count changed from 814 to 785.

    The column is renamed from FFP to Firm Fixed Price and then apply.

    Sort the Facets

    On the left view, sort the facets by name or by count to see the largest groups.

    The video shows that there are all sorts of combinations of the word Firm Fixed Price which can be changed to group all the data better.

    Cluster Facets

    After sorting by count, next to it there is a Cluster button. This feature tries to group the groups based on Keying Functions.

    The Keying Functions show a drop down with:

    • fingerprint
    • ngram-fingerprint
    • double-metaphone

    Select the groups that you want to merge. There is a checkbox for each group. Next to the checkbox there is a field New Cell Value if you want to change the name of the cell.

    If you feel wild, you can Select All on the bottom left and click on Merge Selected & Close on the bottom right.

    Correcting Mistakes

    On the left view, there is a tab Undo/Redo.

    Then select one of the rows that made a previous change. Then click Apply.

    Create a Numeric Facet

    On the video, one of the columns says Total value.

    • Drop down
    • Facet
    • Numeric Facet

    On the left view, it shows a range of the values, going from 0.0 to 20,000,000.00.

    It also shows a mini visualization of the distribution. In this case the distribution only shows a rectangle on the leftmost side.

    On the top right of this visualization, click on change.

    A popup window says Edit Facet's Expression based on Column Total value.

    There is a field called Expression

    Type value.log().

    It shows a preview of the change…then click OK.

    It now shows a normal distribution. It has some options below:

    • Numeric 5173 (checked)
    • Non-numeric 0 (not checked)
    • Blank 0 (not checked)
    • Error 27 (checked)

    Select only the Error rows. Which shows all the values as zero.

    When calculating value.log() of 0, the result is negative infinity. In the video she asks, why is there a zero value for total value?

    You can drag the margins of the distribution visualization to preview data on the right view.

    Getting data from a Wikipedia list

    This video blew my mind. I didn’t know you could do this:

    Go to this page that shows awards about movies.

    Scroll down to the section that says Winners and nominees.

    Click on the edit to the right of this title.

    It now shows some code as seen here:

    ==Winners and nominees==
    
    {| class="wikitable"
    |+ Table key
    |-
    !scope="row" style="text-align:center;" style="background:#FAEB86; height:20px; width:20px"
    | Indicates the winner
    |}
    ===1950s===
    
    {| class="wikitable sortable" rowspan=2 style="text-align: left;" border="2" cellpadding="5"
    |-
    !scope="col" style="width:3%; text-align:center;"| Year
    !scope="col" style="width:3%;text-align:center;"| Photos of winners
    !scope="col" style="width:15%;text-align:center;"| Actor
    !scope="col" style="width:15%;text-align:center;"| Role(s)
    !scope="col" style="width:15%;text-align:center;"| Film
    |-
    ! scope="row" rowspan=2 style="text-align:center" | 1954 <br /><small>[[1st Filmfare Awards|(1st)]] </small>
    |rowspan=2 style="text-align:center"|
    | style="background:#FAEB86;" |
    | style="background:#FAEB86;" |
    | style="background:#FAEB86;" | '''''[[Baiju Bawra (film)|Baiju Bawra]]'''''
    

    Copy all the code that appears on that window and save it into a txt file.

    Import this file into OpenRefine.

    For the option When parsing text files uncheck Split into columns.

    For Header lines change to 0.

    Then Create Project.

    Cleaning Wikipedia data code

    Remove the rows that show years such as this:

    ===1950s===
    

    On the column drop down, select Text filter.

    On the left view that has a field titled Column. Type ===, which filters the table on the right to all rows that contain these characters.

    On this filtered view go to drop down:

    • Edit rows
    • Remove all matching rows
    • Then clear the text filter on the left view (close the Column window)

    On the video, she explains the strategy for using OpenRefine as:

    • Isolating all the rows you want to change using filters and facets
    • Change them all at the same time.

    Remove all the characters that describe a bolded text. In the code, bold text is enclosed within three single quotes such as this:

    '''content<article class="post h-entry" itemscope itemtype="http://schema.org/BlogPosting">
    

    Python Jupyter Notebook in Linux

    Update conda

    $ conda update conda
    

    Create a yml file, use a name and dependencies. Example:

    File: env_py_3.7.6.yml

    Contents of this file:

    name: awesome_name
    dependencies:
    - python=3.7.6
    - jupyter
    

    Create the environment using this file:

    $ conda env create --file env_py_3.7.6.yml
    $ conda activate awesome_name
    

    Check Python version

    $ which python
    $ python -V
    

    Run Jupyter Notebook:

    $ jupyter notebook
    

    If this doesn’t work. Try it like this:

    $ jupyter notebook --ip=127.0.0.1
    

    Cell timer for Jupyter notebook

    As seen here

    $ pip install jupyter_contrib_nbextensions
    $ jupyter contrib nbextension install --user
    $ jupyter nbextension enable execute_time/ExecuteTime
    

    Related Posts

    </article> highlighter_prefix highlighter_suffix jekyllJekyll::Drops::JekyllDroplayout{“layout”=>”default”}pageOpenRefine is an open source desktop application for data wrangling.

    Source:

    OpenRefine on Windows

    • Download the file
    • Unzip and run the executable
    • To stop the web server, on the command line do Ctrl C.

    OpenRefine on Linux

    • Download the tar file. Size is about 100 MB
    • Tar the file. For example: tar xzf openrefine-linux-3.2.tar.gz
    • Open the directory: cd openrefine-3.2
    • Start: ./refine

    (Shut down the webserver with Ctrl C)

    Upon start the terminal shows:

    WARNING: OpenRefine is not tested and not recommended for use with Java versions greater than 12.
    You have 7653M of free memory.
    Your current configuration is set to use 1400M of memory.
    OpenRefine can run better when given more memory. Read our FAQ on how to allocate more memory here:
    https://github.com/OpenRefine/OpenRefine/wiki/FAQ:-Allocate-More-Memory
    Starting OpenRefine at 'http://127.0.0.1:3333/'
    
    10:40:59.487 [refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
    10:40:59.488 [refine_server] refine.memory size: 1400M JVM Max heap: 1468006400 (1ms)
    10:40:59.499 [refine_server] Initializing context: '/' from '/home/tom/Documents/projects/openrefine-3.2/webapp' (11ms)
    10:41:01.007 [refine_server] Creating new workspace directory /home/tom/.local/share/openrefine (1508ms)
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/tom/Documents/projects/openrefine-3.2/server/target/lib/slf4j-log4j12-1.7.18.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/tom/Documents/projects/openrefine-3.2/webapp/WEB-INF/lib/slf4j-log4j12-1.7.18.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    10:41:01.390 [refine] Starting OpenRefine 3.2 [55c921b]... (383ms)
    10:41:01.390 [refine] initializing FileProjectManager with dir (0ms)
    10:41:01.390 [refine] /home/tom/.local/share/openrefine (0ms)
    10:41:01.417 [FileProjectManager] Failed to load workspace from any attempted alternatives. (27ms)
    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by org.python.core.PySystemState (file:/home/tom/Documents/projects/openrefine-3.2/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.1.jar) to method java.io.Console.encoding()
    WARNING: Please consider reporting this to the maintainers of org.python.core.PySystemState
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release
    Fontconfig warning: "/usr/share/fontconfig/conf.avail/05-reset-dirs-sample.conf", line 6: unknown element "reset-dirs"
    Opening in existing browser session.
    10:41:19.486 [refine] POST /command/core/load-language (18069ms)
    10:41:19.578 [refine] GET /command/core/get-preference (92ms)
    10:41:19.624 [refine] POST /command/core/load-language (46ms)
    10:41:19.641 [refine] POST /command/core/load-language (17ms)
    10:41:19.833 [refine] POST /command/core/get-importing-configuration (192ms)
    10:41:19.885 [refine] GET /command/core/get-all-project-tags (52ms)
    10:41:19.975 [refine] GET /command/core/get-all-project-metadata (90ms)
    10:41:20.077 [refine] GET /command/core/get-languages (102ms)
    10:41:20.342 [refine] GET /command/core/get-version (265ms)
    10:41:20.566 [refine] GET /command/database/saved-connection (224ms)
    10:41:20.632 [refine] POST /command/core/set-preference (66ms)
    10:41:20.664 [refine] POST /command/core/set-preference (32ms)
    

    On the browser OpenRefine opens on:

    http://127.0.0.1:3333/
    

    Uploading data to OpenRefine

    Get data from:

    • This computer
    • Web Addresses
    • Clipboard
    • Database
    • Google data

    Getting data from your computer:

    • Use upload
    • This creates a preview with some options.
    • On the top right corner Create Project

    Text Facet to Group data

    Go to a column drop down.

    • Go to Facet
    • Then Text Facet

    On the left side, it groups together identical cells for that column, including the number of cells for each group.

    As seen on this Youtube video. Selecting Text Facet for one of the columns to review all the data related FFP.

    For example, selecting one of the groups that has FFP and a count of 512, filters the data on the right view.

    The next group is also called FFP with a count of 1. Click on edit for that group to see why. It has an empty trailing whitespace at the end of the word FFP_. Remove the blank space and click apply. The groups then will merge. Now the FFP group has 513 rows and a total of 814 groups.

    Remove leading and trailing whitespace

    Go to a column drop down.

    • Go to Edit cells
    • Then Common transforms
    • Then Trim leading and trailing whitespace

    Following the same video. After removing leading and trailing whitespace. The group count changed from 814 to 785.

    The column is renamed from FFP to Firm Fixed Price and then apply.

    Sort the Facets

    On the left view, sort the facets by name or by count to see the largest groups.

    The video shows that there are all sorts of combinations of the word Firm Fixed Price which can be changed to group all the data better.

    Cluster Facets

    After sorting by count, next to it there is a Cluster button. This feature tries to group the groups based on Keying Functions.

    The Keying Functions show a drop down with:

    • fingerprint
    • ngram-fingerprint
    • double-metaphone

    Select the groups that you want to merge. There is a checkbox for each group. Next to the checkbox there is a field New Cell Value if you want to change the name of the cell.

    If you feel wild, you can Select All on the bottom left and click on Merge Selected & Close on the bottom right.

    Correcting Mistakes

    On the left view, there is a tab Undo/Redo.

    Then select one of the rows that made a previous change. Then click Apply.

    Create a Numeric Facet

    On the video, one of the columns says Total value.

    • Drop down
    • Facet
    • Numeric Facet

    On the left view, it shows a range of the values, going from 0.0 to 20,000,000.00.

    It also shows a mini visualization of the distribution. In this case the distribution only shows a rectangle on the leftmost side.

    On the top right of this visualization, click on change.

    A popup window says Edit Facet's Expression based on Column Total value.

    There is a field called Expression

    Type value.log().

    It shows a preview of the change…then click OK.

    It now shows a normal distribution. It has some options below:

    • Numeric 5173 (checked)
    • Non-numeric 0 (not checked)
    • Blank 0 (not checked)
    • Error 27 (checked)

    Select only the Error rows. Which shows all the values as zero.

    When calculating value.log() of 0, the result is negative infinity. In the video she asks, why is there a zero value for total value?

    You can drag the margins of the distribution visualization to preview data on the right view.

    Getting data from a Wikipedia list

    This video blew my mind. I didn’t know you could do this:

    Go to this page that shows awards about movies.

    Scroll down to the section that says Winners and nominees.

    Click on the edit to the right of this title.

    It now shows some code as seen here:

    ==Winners and nominees==
    
    {| class="wikitable"
    |+ Table key
    |-
    !scope="row" style="text-align:center;" style="background:#FAEB86; height:20px; width:20px"
    | Indicates the winner
    |}
    ===1950s===
    
    {| class="wikitable sortable" rowspan=2 style="text-align: left;" border="2" cellpadding="5"
    |-
    !scope="col" style="width:3%; text-align:center;"| Year
    !scope="col" style="width:3%;text-align:center;"| Photos of winners
    !scope="col" style="width:15%;text-align:center;"| Actor
    !scope="col" style="width:15%;text-align:center;"| Role(s)
    !scope="col" style="width:15%;text-align:center;"| Film
    |-
    ! scope="row" rowspan=2 style="text-align:center" | 1954 <br /><small>[[1st Filmfare Awards|(1st)]] </small>
    |rowspan=2 style="text-align:center"|
    | style="background:#FAEB86;" |
    | style="background:#FAEB86;" |
    | style="background:#FAEB86;" | '''''[[Baiju Bawra (film)|Baiju Bawra]]'''''
    

    Copy all the code that appears on that window and save it into a txt file.

    Import this file into OpenRefine.

    For the option When parsing text files uncheck Split into columns.

    For Header lines change to 0.

    Then Create Project.

    Cleaning Wikipedia data code

    Remove the rows that show years such as this:

    ===1950s===
    

    On the column drop down, select Text filter.

    On the left view that has a field titled Column. Type ===, which filters the table on the right to all rows that contain these characters.

    On this filtered view go to drop down:

    • Edit rows
    • Remove all matching rows
    • Then clear the text filter on the left view (close the Column window)

    On the video, she explains the strategy for using OpenRefine as:

    • Isolating all the rows you want to change using filters and facets
    • Change them all at the same time.

    Remove all the characters that describe a bolded text. In the code, bold text is enclosed within three single quotes such as this:

    '''{{sort|Gauri|Gauri}}'''
    

    Go to:

    • Drop down
    • Edit cells
    • Transform

    It opens a popup window Custom text transform on column Column.

    Inside Expression field, type value.replace("'''", ""), which means to replace three single quotes "'''" with empty "". Click OK to apply.

    Create another column from existing column

    The current code is a bit different than the one shown on the video.

    On the video, it shows that each winner is on a bullet point, and the nominees on an indented sub bullet point.

    On the code, the idented rows start with two stars **.

    Create a column that describes if the row corresponds to a winner or a nominee.

    • Drop down
    • Edit column
    • Add column based on this column
    • New column name: Is Winner
    • Expression: not(value.startsWith("**"))
    • Click OK

    On the new column Is Winner

    • Drop down
    • Facet
    • Text facet
    • On the left view, select True to view the winner rows.

    Extract the years

    • Column drop down
    • Edit column
    • Add column based on this column
    • New column name: Year
    • Expression: value[1,5]

    Remove the years from the original column

    • Column drop down
    • Edit cells
    • Transform
    • Expression: value.substring(6)
    • This means send to value the substring starting from character 6.

    Remove the Is Winner facet to view the new columns created with respective rows.

    The nominees are missing the year, however they appear right after a winner of the same year.

    • Year (column) drop down
    • Edit cells
    • Fill down
    • This copies each year into the blank cells below

    Separate the first column into the columns Actress, Film, Character.

    • Column drop down
    • Edit column
    • Split into several columns
    • By separator: ` - ` (space dash space)
    • Split into 2 columns at most
    • A Column 2 is created.

    Separate the data from Column 2

    • Column2 drop down
    • Edit column
    • Split into several columns
    • Separator ` as ` (space as space)
    • Split into 2 columns at most

    Then rename the columns. Drop down, edit column, Rename this column

    • Column 1: Actress
    • Column 2: Film
    • Column 3: Character

    Export the data into Wiki syntax

    • Top right
    • Export
    • The default is JSON
    • Follow the video to see how to setup the Wiki syntax

    Extract operations

    The Undo/Redo allows the extract the operations done to a dataset.

    • Click on Extract
    • Popup window Extract Operation History
    • Operations selected on the left are encoded as JSON (preview on the right)
    • Copy the JSON code
    • Switch back to another project you want to clean
    • Go to Undo/Redo
    • Click Apply
    • Paste the JSON code on that window
    • Click on Perform Operations

    Get latitude and longitude for address rows from an external source

    This video shows a dataset with a column Headquarter that has the addresses for each Company column.

    She wants to get the latitude and longitude for each address using the API from nominatim.openstreetmap.org

    • Headquarters column
    • Drop down
    • Edit column
    • Add column by fetching URLs
    • Expression: 'http://nominatim.openstreetmap.org/search?format=json&email=somemail@gmail.com%app=google-refine&q=' + escape(value, 'url')
    • Throttle delay: 1500
    • New column name: json
    • Then it takes some time to load the data.

    A new column json appears on the right view with values for each row.

    • json column
    • Drop down
    • Edit column
    • Add column based on this column
    • Expression: with(value.parseJson()[0], pair, pair.lat + ',' + pair.lon)
    • New column name: lat/lon

    Translate rows from an external source

    If the dataset has rows in different languages in a column called text. Follow the API requirements to construct the URL.

    • Drop down
    • Edit column
    • Add column by fetching URLs
    • Expression: "http://ajax.googleapis.com/ajax/services/language/detect?..." + escape(value.substring(0, 128), "url")
    • New column name: json
    • Throttle delay: 100

    A new column json is loaded

    • json column
    • Drop down
    • Edit column
    • Add column based on this column
    • Expression: value.parseJson().responseData.language
    • New column name: language

    Reconcile data with external sources

    The video also shows a way to reconcile your data with external sources.

    This dataset has two columns film and rating

    • film column
    • Drop down
    • Reconcile
    • Start reconciling
    • Freebase reconciliation service
    • It loads a list of types
    • Select the type that matches this film column
    • Select the type Film
    • Click on Start Reconciling

    Fetch more data from Freebase

    • film column
    • drop down
    • edit column
    • Add columns from Freebase
    • Options appear for Suggested Properties
    • Select Directed by and Netflix ID
    • Click OK

    OpenRefine Docs

    Please follow these resources:

    Go to:

    • Drop down
    • Edit cells
    • Transform

    It opens a popup window Custom text transform on column Column.

    Inside Expression field, type value.replace("'''", ""), which means to replace three single quotes "'''" with empty "". Click OK to apply.

    Create another column from existing column

    The current code is a bit different than the one shown on the video.

    On the video, it shows that each winner is on a bullet point, and the nominees on an indented sub bullet point.

    On the code, the idented rows start with two stars **.

    Create a column that describes if the row corresponds to a winner or a nominee.

    • Drop down
    • Edit column
    • Add column based on this column
    • New column name: Is Winner
    • Expression: not(value.startsWith("**"))
    • Click OK

    On the new column Is Winner

    • Drop down
    • Facet
    • Text facet
    • On the left view, select True to view the winner rows.

    Extract the years

    • Column drop down
    • Edit column
    • Add column based on this column
    • New column name: Year
    • Expression: value[1,5]

    Remove the years from the original column

    • Column drop down
    • Edit cells
    • Transform
    • Expression: value.substring(6)
    • This means send to value the substring starting from character 6.

    Remove the Is Winner facet to view the new columns created with respective rows.

    The nominees are missing the year, however they appear right after a winner of the same year.

    • Year (column) drop down
    • Edit cells
    • Fill down
    • This copies each year into the blank cells below

    Separate the first column into the columns Actress, Film, Character.

    • Column drop down
    • Edit column
    • Split into several columns
    • By separator: ` - ` (space dash space)
    • Split into 2 columns at most
    • A Column 2 is created.

    Separate the data from Column 2

    • Column2 drop down
    • Edit column
    • Split into several columns
    • Separator ` as ` (space as space)
    • Split into 2 columns at most

    Then rename the columns. Drop down, edit column, Rename this column

    • Column 1: Actress
    • Column 2: Film
    • Column 3: Character

    Export the data into Wiki syntax

    • Top right
    • Export
    • The default is JSON
    • Follow the video to see how to setup the Wiki syntax

    Extract operations

    The Undo/Redo allows the extract the operations done to a dataset.

    • Click on Extract
    • Popup window Extract Operation History
    • Operations selected on the left are encoded as JSON (preview on the right)
    • Copy the JSON code
    • Switch back to another project you want to clean
    • Go to Undo/Redo
    • Click Apply
    • Paste the JSON code on that window
    • Click on Perform Operations

    Get latitude and longitude for address rows from an external source

    This video shows a dataset with a column Headquarter that has the addresses for each Company column.

    She wants to get the latitude and longitude for each address using the API from nominatim.openstreetmap.org

    • Headquarters column
    • Drop down
    • Edit column
    • Add column by fetching URLs
    • Expression: 'http://nominatim.openstreetmap.org/search?format=json&email=somemail@gmail.com%app=google-refine&q=' + escape(value, 'url')
    • Throttle delay: 1500
    • New column name: json
    • Then it takes some time to load the data.

    A new column json appears on the right view with values for each row.

    • json column
    • Drop down
    • Edit column
    • Add column based on this column
    • Expression: with(value.parseJson()[0], pair, pair.lat + ',' + pair.lon)
    • New column name: lat/lon

    Translate rows from an external source

    If the dataset has rows in different languages in a column called text. Follow the API requirements to construct the URL.

    • Drop down
    • Edit column
    • Add column by fetching URLs
    • Expression: "http://ajax.googleapis.com/ajax/services/language/detect?..." + escape(value.substring(0, 128), "url")
    • New column name: json
    • Throttle delay: 100

    A new column json is loaded

    • json column
    • Drop down
    • Edit column
    • Add column based on this column
    • Expression: value.parseJson().responseData.language
    • New column name: language

    Reconcile data with external sources

    The video also shows a way to reconcile your data with external sources.

    This dataset has two columns film and rating

    • film column
    • Drop down
    • Reconcile
    • Start reconciling
    • Freebase reconciliation service
    • It loads a list of types
    • Select the type that matches this film column
    • Select the type Film
    • Click on Start Reconciling

    Fetch more data from Freebase

    • film column
    • drop down
    • edit column
    • Add columns from Freebase
    • Options appear for Suggested Properties
    • Select Directed by and Netflix ID
    • Click OK

    OpenRefine Docs

    Please follow these resources:

    Related Posts