PATH Tips/Best Practices – Work with data distributed in more than one line in a dataset

Posted on Posted in dicas alteryx

Let´s suppose you have a data file, where the information is not gathered in one line, but broken into two or more. This can happen with log files, for example. How can you select only the odd lines in this set and relate them to the next one?

Let’s assume that the file in question has the following structure:

 

83.149.9.216  [17/May/2015:10:05:43 +0000]    "GET /kibana-dashboard3.png"
83.149.9.216  [17/May/2015:10:05:43 +0000]    "GET REQUEST OK"
83.149.9.216  [17/May/2015:10:05:07 +0000]    "GET /notes.js"
83.149.9.216  [17/May/2015:10:05:09 +0000]    "GET REQUEST OK"
83.149.9.216  [17/May/2015:10:05:50 +0000]    "GET /kibana-dashboard.png"
83.149.9.216  [17/May/2015:10:05:52 +0000]    "GET REQUEST OK"
83.149.9.216  [17/May/2015:10:05:46 +0000]    "GET /Dreamhost_logo.svg"
83.149.9.216  [17/May/2015:10:05:59 +0000]    "GET REQUEST FAILED"
83.149.9.216  [17/May/2015:10:05:00 +0000]    "GET /redis.png"
83.149.9.216  [17/May/2015:10:05:02 +0000]    "GET REQUEST OK"

 

We can see that the information is distributed in 2 lines. In the first one, we have the request information of a file, and in the next line the response from the server to that request. This pattern repeats throughout the file.

Step 1: Use the Record ID tool to include a sequence in the dataset.

 

Step 2: Now separate the even lines from the odd ones using the Filter tool with the following formula: Mod ([RecordID], 2) = 1. In this case, the Mod formula returns the remainder of the Record ID division by 2. If the result is 1, the number is odd, otherwise it is even.

 

Step 3: In the FALSE output, where numbers are even, use the Formula tool to change the value of the Record ID field, which will be reduced by 1.

 

Step 4: Finally, use the Join tool to match filter outputs TRUE and FALSE, using the Record ID field as the key. Take the time to rename the fields and select only the relevant information, as shown in the figure below.

 

Step 5: Add a Browse tool in the end of the workflow and run it. The result will show half the rows of the original dataset, with more columns, listing requests and responses in the same record.

That´s it! Now you know how to work with data distributed in more than one line in a dataset!

Deixe uma resposta

O seu endereço de e-mail não será publicado. Campos obrigatórios são marcados com *

Moderação de comentário está ativada. Seu comentário pode demorar algum tempo para aparecer.